Building a Natural Language Interface for OpenTargets Data with AI on AWS
With the advent of AI, knowledge bases, and MCP servers, I'm less inclined to write code by hand than I ever have been. This blog post details how I built a complete natural language interface for querying OpenTargets data using Amazon Q Developer, AWS infrastructure, and the Model Context Protocol (MCP).
As a quick disclaimer, the Amazon Q Developer service can deploy AWS resources on your behalf. You should absolutely check its work and make sure the CLI isn't going wild.
If you'd like to see the complete codebase, it's available on GitHub.
Background: The Challenge of Genomic Data Access
OpenTargets is a comprehensive platform that integrates genetic, genomic, and chemical data to support systematic drug target identification and prioritization. With over 3.8 million gene-disease associations, 78,000 genes, 39,000 diseases, and 18,000 drug molecules, it's a lot of data.
However, accessing this data traditionally requires:
- Deep knowledge of the OpenTargets data model
- SQL expertise to write complex queries
- Understanding of table relationships and schema structures
- Manual exploration to discover relevant associations
- SDKs or APIs to programmatically access data
Before the advent of LLMs and AI agents, I would have had to write hundreds of lines of code to build a web interface or API for researchers to query the data. Of course, I'd never have quite the correct use cases or queries, and we'd have needed many, many go-arounds before getting something useful deployed.
The Vision: What if researchers could simply ask questions in natural language like:
- "What genes are associated with breast cancer?"
- "Show me drugs targeting BRCA1 with high confidence scores"
- "Find rare diseases with few known genetic associations"
More importantly, what if I could use AI tools to write the code that then calls the AI services?
The answer to this question is: kind of. You can't just throw a problem at an LLM and expect a full solution. The AI is more like a really smart intern; you need a trust-but-verify mentality.
Amazon Q Developer Process Overview
I used Amazon Q Developer throughout the entire project.
- Map out the steps you'll take. Ask the Q developer to write these to a TODO.md file, and ask if it's unclear on anything.
- Ask the Q chat to complete one step at a time. For example, I asked it to write the infrastructure as code in Terraform first.
- Once Q generates the code, ask it to review and summarize it for you.
- Think of tests for your use case. If you're using Q chat to create S3 buckets, ask it if the S3 buckets exist. Ask the AI to generate tests for your use case (see the sketch after this list).
- Leverage existing software where you can. I used the PyAthena library as a client to Athena, and the MCP library to create the MCP server. I told Q chat to use these libraries as context.
- Describe, describe, describe. The more context you give, the better the results. If you're working with data, ask the Q chat to describe the data and the schema. Then write tests against that schema. Something, somewhere will fail. Describe the error to the Q chat and ask for a solution.
- Keep your end goal in mind, and be flexible on how you get there. I wanted to interact with the OpenTargets data without having to write SQL. I ended up with a CLI tool that does exactly that, and an MCP server that can be used with any MCP compatible AI tool.
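To make the "trust but verify" tests concrete, here's a minimal sketch of the kind of existence checks I mean; the bucket and database names are placeholders for illustration, not the project's real ones.

```python
import boto3

# Placeholder names for illustration; substitute the real bucket and database names.
DATA_BUCKET = "opentargets-data"
RESULTS_BUCKET = "opentargets-athena-results"
GLUE_DATABASE = "opentargets"


def test_s3_buckets_exist():
    s3 = boto3.client("s3")
    for bucket in (DATA_BUCKET, RESULTS_BUCKET):
        # head_bucket raises ClientError if the bucket is missing or inaccessible
        s3.head_bucket(Bucket=bucket)


def test_glue_tables_registered():
    glue = boto3.client("glue")
    tables = glue.get_tables(DatabaseName=GLUE_DATABASE)["TableList"]
    names = {t["Name"] for t in tables}
    # Spot-check a few of the core tables used in the queries later in this post
    assert {"disease", "target", "association_overall_direct"} <= names
```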
Here is my TODO file. I used the Q chat to generate this, and then I asked it to complete one task at a time. When it completed a task, I asked it to review the code, summarize what it did, and then write tests.
## Completed ✅
- [x] Upload complete OpenTargets dataset to S3
- [x] Set up AWS Glue database and tables for OpenTargets data
- [x] Create Athena workgroup and named queries
- [x] Build SQL agent for natural language to SQL conversion
- [x] Implement MCP server for integration with AI tools
- [x] Create comprehensive schema validation tests
- [x] Document actual table schemas and query patterns
## Current Architecture
- **Data Storage**: S3 buckets with OpenTargets parquet files
- **Query Engine**: Amazon Athena with Glue Data Catalog
- **AI Integration**: Bedrock Claude for natural language processing
- **Interface**: MCP server + CLI agent
AWS Infrastructure Architecture
As always, there are a million ways to architect a solution. I went with this because I wanted to understand the process that goes into building an MCP server, and I wanted to keep costs low. We're using S3 with Athena and Glue because it's serverless. No complicated databases to manage, and we only pay for what we use.
Here's a description of each of the pieces of the AWS architecture:
Data Storage and Cataloging
- Amazon S3: Stores the OpenTargets datasets as parquet files.
- Data bucket: Raw OpenTargets datasets organized by table type
- Results bucket: Athena query results with 30-day lifecycle management
- Server-side encryption and private access controls
- AWS Glue Data Catalog: Provides schema discovery and metadata management
- 12 tables representing different aspects of the OpenTargets data model
- Automatic schema inference from parquet files
- Centralized metadata repository for consistent querying
Query Processing
- Amazon Athena: Serverless SQL query engine for interactive analytics
- Direct queries against parquet files in S3
- Pay-per-query pricing model (~$5 per TB scanned)
- Workgroup configuration for query management and cost control
- Named queries for common use cases
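Here's roughly what a query against this setup looks like using the PyAthena client mentioned earlier; the staging bucket, workgroup, and database names below are placeholders.

```python
from pyathena import connect

# Placeholder connection settings for illustration.
conn = connect(
    s3_staging_dir="s3://opentargets-athena-results/",
    region_name="us-east-1",
    work_group="opentargets",
    schema_name="opentargets",
)

cursor = conn.cursor()
cursor.execute("SELECT approvedSymbol, approvedName FROM target LIMIT 5")
for row in cursor.fetchall():
    print(row)
```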
AI Integration
- Amazon Bedrock with Claude 3.5 Sonnet: Natural language to SQL conversion
- Advanced reasoning capabilities for complex query generation
- Context-aware SQL generation based on schema understanding
- Cost-effective token-based pricing (~$0.003 per 1K input tokens)
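Concretely, the natural-language-to-SQL step looks roughly like the sketch below, using the Bedrock Converse API. The schema summary is abridged, and the prompt wording is an assumption for illustration rather than the project's actual agent code.

```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Abridged schema summary; the real prompt carries the full Glue table definitions.
SCHEMA_SUMMARY = (
    "disease(id, name, synonyms), target(id, approvedSymbol, approvedName), "
    "association_overall_direct(diseaseId, targetId, score, evidenceCount)"
)


def natural_language_to_sql(question: str) -> str:
    """Ask Claude to translate a question into Athena (Presto) SQL."""
    response = bedrock.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
        system=[{
            "text": "Translate questions about OpenTargets data into Athena SQL. "
                    "Use only these tables: " + SCHEMA_SUMMARY
        }],
        messages=[{"role": "user", "content": [{"text": question}]}],
        inferenceConfig={"maxTokens": 1024, "temperature": 0},
    )
    return response["output"]["message"]["content"][0]["text"]


print(natural_language_to_sql("What genes are associated with breast cancer?"))
```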
Cost Analysis
2025 is expensive and we like serverless.
- Per Query: $0.01–$0.11 (primarily Athena data scanning)
- Monthly Usage (100 queries): ~$1–$11
- No infrastructure overhead: Serverless components scale to zero
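As a sanity check on those numbers, here's the back-of-the-envelope math; the data-scanned and token figures are illustrative assumptions, not measurements from the project.

```python
ATHENA_PER_TB = 5.00          # USD per TB scanned
BEDROCK_PER_1K_INPUT = 0.003  # USD per 1K input tokens


def per_query_cost(gb_scanned: float, input_tokens: int) -> float:
    athena = (gb_scanned / 1024) * ATHENA_PER_TB
    bedrock = (input_tokens / 1000) * BEDROCK_PER_1K_INPUT
    return athena + bedrock


# A narrow query scanning ~2 GB vs. a broad one scanning ~20 GB,
# each with a ~2,000-token schema-heavy prompt:
print(round(per_query_cost(2, 2000), 3))    # ~0.016
print(round(per_query_cost(20, 2000), 3))   # ~0.104
```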
Schema Testing and Validation
Biological data is messy. Always. It's just how it works. Some of the columns were JSON or arrays, and it took the AI several tries to get the correct structure into Athena, but we got there.
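A regression test for those nested columns can be as simple as exercising the struct path Athena has to resolve. This sketch reuses the synonyms field that appears in the generated queries later in the post, with placeholder connection settings.

```python
from pyathena import connect


def test_disease_synonyms_are_queryable():
    # Placeholder connection settings for illustration.
    cursor = connect(
        s3_staging_dir="s3://opentargets-athena-results/",
        region_name="us-east-1",
        schema_name="opentargets",
    ).cursor()
    # If the struct/array mapping for synonyms is wrong, Athena fails at
    # parse or execution time instead of returning a row.
    cursor.execute(
        "SELECT array_join(synonyms.hasExactSynonym, ' ') "
        "FROM disease WHERE synonyms.hasExactSynonym IS NOT NULL LIMIT 1"
    )
    assert cursor.fetchone() is not None
```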
Results
Our MCP server/AI agent exposes three main tools for interacting with the OpenTargets data:
Available Tools:
- `query_opentargets`: Full natural language query processing
- `execute_sql`: Direct SQL execution for advanced users
- `explain_query`: SQL generation without execution for debugging
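For reference, here's a minimal sketch of how three such tools map onto the MCP Python SDK's FastMCP server; the two helper functions are stand-in placeholders for the project's actual Bedrock and Athena code.

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("opentargets")


def natural_language_to_sql(question: str) -> str:
    # Stand-in for the Bedrock call sketched earlier.
    return f"-- SQL for: {question}"


def run_athena_query(sql: str) -> str:
    # Stand-in for the PyAthena execution path.
    return f"(would execute) {sql}"


@mcp.tool()
def query_opentargets(question: str) -> str:
    """Translate a natural language question to SQL, run it, and return rows."""
    return run_athena_query(natural_language_to_sql(question))


@mcp.tool()
def execute_sql(sql: str) -> str:
    """Run caller-supplied SQL directly against Athena."""
    return run_athena_query(sql)


@mcp.tool()
def explain_query(question: str) -> str:
    """Generate SQL for a question without executing it."""
    return natural_language_to_sql(question)


if __name__ == "__main__":
    mcp.run()
```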
In order to test the system, I ran a series of natural language queries through the CLI tool. Here are some example queries, along with the generated SQL, and their results:
============================================================
Natural Query: Find high-confidence associations for Alzheimer's disease
============================================================
Generated SQL:
SELECT
d.name AS disease_name,
t.approvedSymbol AS gene_symbol,
t.approvedName AS gene_name,
a.score AS association_score,
a.evidenceCount AS evidence_count
FROM
association_overall_direct a
JOIN
disease d ON a.diseaseId = d.id
JOIN
target t ON a.targetId = t.id
WHERE
(d.name LIKE '%Alzheimer%' OR array_join(d.synonyms.hasExactSynonym, ' ') LIKE '%Alzheimer%')
AND a.score > 0.7
ORDER BY
a.score DESC
LIMIT 100;
============================================================
✅ Results (5 rows):
Row 1:
disease_name: Alzheimer disease 3
gene_symbol: PSEN1
gene_name: presenilin 1
association_score: 0.8370820664829206
evidence_count: 566
Row 2:
disease_name: early-onset autosomal dominant Alzheimer disease
gene_symbol: PSEN2
gene_name: presenilin 2
association_score: 0.7960670314352667
evidence_count: 271
Row 3:
disease_name: Alzheimer disease type 1
gene_symbol: APP
gene_name: amyloid beta precursor protein
association_score: 0.7855977451858343
evidence_count: 62
Row 4:
disease_name: Alzheimer disease
gene_symbol: APP
gene_name: amyloid beta precursor protein
association_score: 0.7817288493909762
evidence_count: 22877
Row 5:
disease_name: Alzheimer disease
gene_symbol: ABCA7
gene_name: ATP binding cassette subfamily A member 7
association_score: 0.706722491576517
evidence_count: 477
============================================================
Natural Query: Which diseases have the most genetic associations?
============================================================
Generated SQL:
SELECT
d.name AS disease_name,
COUNT(DISTINCT a.targetId) AS genetic_association_count
FROM
association_overall_direct a
JOIN
disease d ON a.diseaseId = d.id
WHERE
a.score > 0.5
GROUP BY
d.name
ORDER BY
genetic_association_count DESC
LIMIT 100;
============================================================
✅ Results (100 rows):
Row 1:
disease_name: neurodegenerative disease
genetic_association_count: 1382
Row 2:
disease_name: genetic disorder
genetic_association_count: 404
Row 3:
disease_name: platelet count
genetic_association_count: 401
Row 4:
disease_name: glomerular filtration rate
genetic_association_count: 379
============================================================
Natural Query: What genes are associated with breast cancer?
============================================================
Generated SQL:
SELECT DISTINCT t.approvedSymbol AS gene_symbol, t.approvedName AS gene_name, d.name AS disease_name, a.score AS association_score
FROM association_overall_direct a
JOIN target t ON a.targetId = t.id
JOIN disease d ON a.diseaseId = d.id
WHERE d.name LIKE '%breast cancer%'
OR array_join(d.synonyms.hasExactSynonym, ' ') LIKE '%breast cancer%'
OR array_join(d.synonyms.hasRelatedSynonym, ' ') LIKE '%breast cancer%'
ORDER BY a.score DESC
LIMIT 100;
============================================================
✅ Results (100 rows):
Row 1:
gene_symbol: BRCA2
gene_name: BRCA2 DNA repair associated
disease_name: breast cancer
association_score: 0.8548846942723206
Row 2:
gene_symbol: BRCA1
gene_name: BRCA1 DNA repair associated
disease_name: breast cancer
association_score: 0.838391491656388
Row 3:
gene_symbol: AR
gene_name: androgen receptor
disease_name: partial androgen insensitivity syndrome
association_score: 0.8229769855485846
Row 4:
gene_symbol: CHEK2
gene_name: checkpoint kinase 2
disease_name: breast cancer
association_score: 0.8155025936777466
Row 5:
gene_symbol: CDH1
gene_name: cadherin 1
disease_name: CDH1-related diffuse gastric and lobular breast cancer syndrome
association_score: 0.7950273350532806
Example Results
High-confidence associations discovered:
- CFTR ↔ cystic fibrosis: 91.3% confidence
- F8 ↔ hemophilia A: 90.8% confidence
- MYH7 ↔ hypertrophic cardiomyopathy: 89.8% confidence
Data Source Analysis
Evidence diversity across different sources:
- UniProt variants: Highest average confidence (75.8%)
- Reactome pathways: Strong pathway evidence (73.6%)
- Genomics England: Clinical evidence (71.7%)
Lessons Learned
Break the problem down into small pieces. Don't try to do everything at once. Especially don't let the AI try to do everything at once!
Conclusion
Building a natural language interface for genomic data demonstrates the power of combining modern cloud infrastructure with AI capabilities. Our solution provides:
- Accessibility: Researchers can query complex genomic data without SQL expertise. No code required!
- Cost Efficiency: Serverless architecture scales with usage
- Reliability: Comprehensive testing ensures data integrity
- Extensibility: MCP protocol enables broad integration possibilities
The project showcases how thoughtful architecture decisions, thorough testing, and modern AI capabilities can democratize access to complex scientific datasets. By removing technical barriers, we enable researchers to focus on discovery rather than data wrangling.
Key Takeaways:
- Test Early and Thoroughly: Schema validation prevented major production issues
- Embrace Serverless: Pay-per-use models align perfectly with research workloads
- Standardize Interfaces: MCP protocol provides future-proof integration capabilities
- Document Everything: Comprehensive testing and documentation enable team collaboration
The complete codebase, including infrastructure definitions and test suites, demonstrates an approach to building AI-powered data interfaces for scientific research.