Building a Natural Language Interface for OpenTargets Data with AI on AWS
With the advent of AI, knowledge bases, and MCP servers, I'm less inclined to write code by hand than I ever have been. This blog post details how I built a complete natural language interface for querying OpenTargets data using Amazon Q Developer, AWS infrastructure, and the Model Context Protocol (MCP).
As a quick disclaimer, the Amazon Q Developer service can deploy AWS resources on your behalf. You should absolutely check its work and make sure the CLI isn't going wild.
If you'd like to see the complete codebase, it's available on GitHub.
Background: The Challenge of Genomic Data Access
OpenTargets is a comprehensive platform that integrates genetic, genomic, and chemical data to support systematic drug target identification and prioritization. With over 3.8 million gene-disease associations, 78,000 genes, 39,000 diseases, and 18,000 drug molecules, it's a lot of data.
However, accessing this data traditionally requires:
- Deep knowledge of the OpenTargets data model
- SQL expertise to write complex queries
- Understanding of table relationships and schema structures
- Manual exploration to discover relevant associations
- SDKs or APIs to programmatically access data
Before the advent of LLMs and AI agents, I would have had to write hundreds of lines of code to build a web interface or API for researchers to query the data. Of course, I'd never have quite the correct use cases or queries, and we'd have needed many, many go-arounds before getting something useful deployed.
The Vision: What if researchers could simply ask questions in natural language like:
- "What genes are associated with breast cancer?"
- "Show me drugs targeting BRCA1 with high confidence scores"
- "Find rare diseases with few known genetic associations"
More importantly, what if I could use AI tools to write the code that then calls the AI services?
The answer to this question is: kind of. You can't just throw a problem at an LLM and expect a full solution. The AI is more like a really smart intern; you need a trust-but-verify mentality.
Amazon Q Developer Process Overview
I used Amazon Q Developer throughout the entire project.
- Map out the steps you'll take. Ask the Q developer to write these to a TODO.md file, and ask if it's unclear on anything.
- Ask the Q chat to complete one step at a time. For example, I asked it to write the infrastructure as code in Terraform first.
- Once Q generates the code, ask it to review and summarize it for you.
- Think of tests for your use case. If you're using Q chat to create S3 buckets, ask it if the S3 buckets exist. Ask the AI to generate tests for your use case (see the sketch after this list).
- Leverage existing software where you can. I used the PyAthena library as a client to Athena, and the MCP library to create the MCP server. I told Q chat to use these libraries as context.
- Describe, describe, describe. The more context you give, the better the results. If you're working with data, ask the Q chat to describe the data and the schema. Then write tests against that schema. Something, somewhere will fail. Describe the error to the Q chat and ask for a solution.
- Keep your end goal in mind, and be flexible on how you get there. I wanted to interact with the OpenTargets data without having to write SQL. I ended up with a CLI tool that does exactly that, and an MCP server that can be used with any MCP compatible AI tool.
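To make the "trust but verify" tests concrete, here's a minimal sketch of the kind of existence checks I mean; the bucket and database names are placeholders for illustration, not the project's real ones.

```python
import boto3

# Placeholder names for illustration; substitute the real bucket and database names.
DATA_BUCKET = "opentargets-data"
RESULTS_BUCKET = "opentargets-athena-results"
GLUE_DATABASE = "opentargets"


def test_s3_buckets_exist():
    s3 = boto3.client("s3")
    for bucket in (DATA_BUCKET, RESULTS_BUCKET):
        # head_bucket raises ClientError if the bucket is missing or inaccessible
        s3.head_bucket(Bucket=bucket)


def test_glue_tables_registered():
    glue = boto3.client("glue")
    tables = glue.get_tables(DatabaseName=GLUE_DATABASE)["TableList"]
    names = {t["Name"] for t in tables}
    # Spot-check a few of the core tables used in the queries later in this post
    assert {"disease", "target", "association_overall_direct"} <= names
```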
Here is my TODO file. I used the Q chat to generate this, and then I asked it to complete one task at a time. When it completed a task, I asked it to review the code, summarize what it did, and then write tests.
## Completed ✅
- [x] Upload complete OpenTargets dataset to S3
- [x] Set up AWS Glue database and tables for OpenTargets data
- [x] Create Athena workgroup and named queries
- [x] Build SQL agent for natural language to SQL conversion
- [x] Implement MCP server for integration with AI tools
- [x] Create comprehensive schema validation tests
- [x] Document actual table schemas and query patterns
## Current Architecture
- **Data Storage**: S3 buckets with OpenTargets parquet files
- **Query Engine**: Amazon Athena with Glue Data Catalog
- **AI Integration**: Bedrock Claude for natural language processing
- **Interface**: MCP server + CLI agent
AWS Infrastructure Architecture
As always, there are a million ways to architect a solution. I went with this because I wanted to understand the process that goes into building an MCP server, and I wanted to keep costs low. We're using S3 with Athena and Glue because it's serverless. No complicated databases to manage, and we only pay for what we use.
Here's a description of each of the pieces of the AWS architecture:
Data Storage and Cataloging
- Amazon S3: Stores the OpenTargets datasets as parquet files.
- Data bucket: Raw OpenTargets datasets organized by table type
- Results bucket: Athena query results with 30-day lifecycle management
- Server-side encryption and private access controls
- AWS Glue Data Catalog: Provides schema discovery and metadata management
- 12 tables representing different aspects of the OpenTargets data model
- Automatic schema inference from parquet files
- Centralized metadata repository for consistent querying
Query Processing
- Amazon Athena: Serverless SQL query engine for interactive analytics
- Direct queries against parquet files in S3
- Pay-per-query pricing model (~$5 per TB scanned)
- Workgroup configuration for query management and cost control
- Named queries for common use cases
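Here's roughly what a query against this setup looks like using the PyAthena client mentioned earlier; the staging bucket, workgroup, and database names below are placeholders.

```python
from pyathena import connect

# Placeholder connection settings for illustration.
conn = connect(
    s3_staging_dir="s3://opentargets-athena-results/",
    region_name="us-east-1",
    work_group="opentargets",
    schema_name="opentargets",
)

cursor = conn.cursor()
cursor.execute("SELECT approvedSymbol, approvedName FROM target LIMIT 5")
for row in cursor.fetchall():
    print(row)
```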
AI Integration
- Amazon Bedrock with Claude 3.5 Sonnet: Natural language to SQL conversion
- Advanced reasoning capabilities for complex query generation
- Context-aware SQL generation based on schema understanding
- Cost-effective token-based pricing (~$0.003 per 1K input tokens)
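Concretely, the natural-language-to-SQL step looks roughly like the sketch below, using the Bedrock Converse API. The schema summary is abridged, and the prompt wording is an assumption for illustration rather than the project's actual agent code.

```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Abridged schema summary; the real prompt carries the full Glue table definitions.
SCHEMA_SUMMARY = (
    "disease(id, name, synonyms), target(id, approvedSymbol, approvedName), "
    "association_overall_direct(diseaseId, targetId, score, evidenceCount)"
)


def natural_language_to_sql(question: str) -> str:
    """Ask Claude to translate a question into Athena (Presto) SQL."""
    response = bedrock.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
        system=[{
            "text": "Translate questions about OpenTargets data into Athena SQL. "
                    "Use only these tables: " + SCHEMA_SUMMARY
        }],
        messages=[{"role": "user", "content": [{"text": question}]}],
        inferenceConfig={"maxTokens": 1024, "temperature": 0},
    )
    return response["output"]["message"]["content"][0]["text"]


print(natural_language_to_sql("What genes are associated with breast cancer?"))
```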
Cost Analysis
2025 is expensive and we like serverless.
- Per Query: $0.01–$0.11 (primarily Athena data scanning)
- Monthly Usage (100 queries): ~$1–$11
- No infrastructure overhead: Serverless components scale to zero
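As a sanity check on those numbers, here's the back-of-the-envelope math; the data-scanned and token figures are illustrative assumptions, not measurements from the project.

```python
ATHENA_PER_TB = 5.00          # USD per TB scanned
BEDROCK_PER_1K_INPUT = 0.003  # USD per 1K input tokens


def per_query_cost(gb_scanned: float, input_tokens: int) -> float:
    athena = (gb_scanned / 1024) * ATHENA_PER_TB
    bedrock = (input_tokens / 1000) * BEDROCK_PER_1K_INPUT
    return athena + bedrock


# A narrow query scanning ~2 GB vs. a broad one scanning ~20 GB,
# each with a ~2,000-token schema-heavy prompt:
print(round(per_query_cost(2, 2000), 3))    # ~0.016
print(round(per_query_cost(20, 2000), 3))   # ~0.104
```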
Schema Testing and Validation
Biological data is messy. Always. It's just how it works. Some of the columns were JSON or arrays, and it took the AI several tries to get the correct structure into Athena, but we got there.
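A regression test for those nested columns can be as simple as exercising the struct path Athena has to resolve. This sketch reuses the synonyms field that appears in the generated queries later in the post, with placeholder connection settings.

```python
from pyathena import connect


def test_disease_synonyms_are_queryable():
    # Placeholder connection settings for illustration.
    cursor = connect(
        s3_staging_dir="s3://opentargets-athena-results/",
        region_name="us-east-1",
        schema_name="opentargets",
    ).cursor()
    # If the struct/array mapping for synonyms is wrong, Athena fails at
    # parse or execution time instead of returning a row.
    cursor.execute(
        "SELECT array_join(synonyms.hasExactSynonym, ' ') "
        "FROM disease WHERE synonyms.hasExactSynonym IS NOT NULL LIMIT 1"
    )
    assert cursor.fetchone() is not None
```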
Results
Our MCP server/AI agent exposes three main tools for interacting with the OpenTargets data:
Available Tools:
- `query_opentargets`: Full natural language query processing
- `execute_sql`: Direct SQL execution for advanced users
- `explain_query`: SQL generation without execution for debugging
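For reference, here's a minimal sketch of how three such tools map onto the MCP Python SDK's FastMCP server; the two helper functions are stand-in placeholders for the project's actual Bedrock and Athena code.

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("opentargets")


def natural_language_to_sql(question: str) -> str:
    # Stand-in for the Bedrock call sketched earlier.
    return f"-- SQL for: {question}"


def run_athena_query(sql: str) -> str:
    # Stand-in for the PyAthena execution path.
    return f"(would execute) {sql}"


@mcp.tool()
def query_opentargets(question: str) -> str:
    """Translate a natural language question to SQL, run it, and return rows."""
    return run_athena_query(natural_language_to_sql(question))


@mcp.tool()
def execute_sql(sql: str) -> str:
    """Run caller-supplied SQL directly against Athena."""
    return run_athena_query(sql)


@mcp.tool()
def explain_query(question: str) -> str:
    """Generate SQL for a question without executing it."""
    return natural_language_to_sql(question)


if __name__ == "__main__":
    mcp.run()
```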
In order to test the system, I ran a series of natural language queries through the CLI tool. Here are some example queries, along with the generated SQL, and their results:
============================================================
Natural Query: Find high-confidence associations for Alzheimer's disease
============================================================
Generated SQL:
SELECT
d.name AS disease_name,
t.approvedSymbol AS gene_symbol,
t.approvedName AS gene_name,
a.score AS association_score,
a.evidenceCount AS evidence_count
FROM
association_overall_direct a
JOIN
disease d ON a.diseaseId = d.id
JOIN
target t ON a.targetId = t.id
WHERE
(d.name LIKE '%Alzheimer%' OR array_join(d.synonyms.hasExactSynonym, ' ') LIKE '%Alzheimer%')
AND a.score > 0.7
ORDER BY
a.score DESC
LIMIT 100;
============================================================
✅ Results (5 rows):
Row 1:
disease_name: Alzheimer disease 3
gene_symbol: PSEN1
gene_name: presenilin 1
association_score: 0.8370820664829206
evidence_count: 566
Row 2:
disease_name: early-onset autosomal dominant Alzheimer disease
gene_symbol: PSEN2
gene_name: presenilin 2
association_score: 0.7960670314352667
evidence_count: 271
Row 3:
disease_name: Alzheimer disease type 1
gene_symbol: APP
gene_name: amyloid beta precursor protein
association_score: 0.7855977451858343
evidence_count: 62
Row 4:
disease_name: Alzheimer disease
gene_symbol: APP
gene_name: amyloid beta precursor protein
association_score: 0.7817288493909762
evidence_count: 22877
Row 5:
disease_name: Alzheimer disease
gene_symbol: ABCA7
gene_name: ATP binding cassette subfamily A member 7
association_score: 0.706722491576517
evidence_count: 477
============================================================
Natural Query: Which diseases have the most genetic associations?
============================================================
Generated SQL:
SELECT
d.name AS disease_name,
COUNT(DISTINCT a.targetId) AS genetic_association_count
FROM
association_overall_direct a
JOIN
disease d ON a.diseaseId = d.id
WHERE
a.score > 0.5
GROUP BY
d.name
ORDER BY
genetic_association_count DESC
LIMIT 100;
============================================================
✅ Results (100 rows):
Row 1:
disease_name: neurodegenerative disease
genetic_association_count: 1382
Row 2:
disease_name: genetic disorder
genetic_association_count: 404
Row 3:
disease_name: platelet count
genetic_association_count: 401
Row 4:
disease_name: glomerular filtration rate
genetic_association_count: 379
============================================================
Natural Query: What genes are associated with breast cancer?
============================================================
Generated SQL:
SELECT DISTINCT t.approvedSymbol AS gene_symbol, t.approvedName AS gene_name, d.name AS disease_name, a.score AS association_score
FROM association_overall_direct a
JOIN target t ON a.targetId = t.id
JOIN disease d ON a.diseaseId = d.id
WHERE d.name LIKE '%breast cancer%'
OR array_join(d.synonyms.hasExactSynonym, ' ') LIKE '%breast cancer%'
OR array_join(d.synonyms.hasRelatedSynonym, ' ') LIKE '%breast cancer%'
ORDER BY a.score DESC
LIMIT 100;
============================================================
✅ Results (100 rows):
Row 1:
gene_symbol: BRCA2
gene_name: BRCA2 DNA repair associated
disease_name: breast cancer
association_score: 0.8548846942723206
Row 2:
gene_symbol: BRCA1
gene_name: BRCA1 DNA repair associated
disease_name: breast cancer
association_score: 0.838391491656388
Row 3:
gene_symbol: AR
gene_name: androgen receptor
disease_name: partial androgen insensitivity syndrome
association_score: 0.8229769855485846
Row 4:
gene_symbol: CHEK2
gene_name: checkpoint kinase 2
disease_name: breast cancer
association_score: 0.8155025936777466
Row 5:
gene_symbol: CDH1
gene_name: cadherin 1
disease_name: CDH1-related diffuse gastric and lobular breast cancer syndrome
association_score: 0.7950273350532806
Example Results
High-confidence associations discovered:
- CFTR ↔ cystic fibrosis: 91.3% confidence
- F8 ↔ hemophilia A: 90.8% confidence
- MYH7 ↔ hypertrophic cardiomyopathy: 89.8% confidence
Data Source Analysis
Evidence diversity across different sources:
- UniProt variants: Highest average confidence (75.8%)
- Reactome pathways: Strong pathway evidence (73.6%)
- Genomics England: Clinical evidence (71.7%)
Lessons Learned
Break the problem down into small pieces. Don't try to do everything at once. Especially don't let the AI try to do everything at once!
Conclusion
Building a natural language interface for genomic data demonstrates the power of combining modern cloud infrastructure with AI capabilities. Our solution provides:
- Accessibility: Researchers can query complex genomic data without SQL expertise. No code required!
- Cost Efficiency: Serverless architecture scales with usage
- Reliability: Comprehensive testing ensures data integrity
- Extensibility: MCP protocol enables broad integration possibilities
The project showcases how thoughtful architecture decisions, thorough testing, and modern AI capabilities can democratize access to complex scientific datasets. By removing technical barriers, we enable researchers to focus on discovery rather than data wrangling.
Key Takeaways:
- Test Early and Thoroughly: Schema validation prevented major production issues
- Embrace Serverless: Pay-per-use models align perfectly with research workloads
- Standardize Interfaces: MCP protocol provides future-proof integration capabilities
- Document Everything: Comprehensive testing and documentation enable team collaboration
The complete codebase, including infrastructure definitions and test suites, demonstrates an approach to building AI-powered data interfaces for scientific research.