# ClinicalTrials.gov Tool: Current State & Future Improvements
**Status**: Currently Implemented
**Priority**: High (Core Data Source for Drug Repurposing)
---
## Current Implementation
### What We Have (`src/tools/clinicaltrials.py`)
- V2 API search via `clinicaltrials.gov/api/v2/studies`
- Filters: `INTERVENTIONAL` study type, `RECRUITING` status
- Returns: NCT ID, title, conditions, interventions, phase, status
- Query preprocessing via shared `query_utils.py`
### Current Strengths
1. **Good Filtering**: Already filtering for interventional + recruiting
2. **V2 API**: Using the modern API (v1 deprecated)
3. **Phase Info**: Extracting trial phases for drug development context
### Current Limitations
1. **No Outcome Data**: Missing primary/secondary outcomes
2. **No Eligibility Criteria**: Missing inclusion/exclusion details
3. **No Sponsor Info**: Missing who's running the trial
4. **No Result Data**: For completed trials, no efficacy data
5. **Limited Drug Mapping**: No integration with drug databases
---
## API Capabilities We're Not Using
### Fields We Could Request
```python
# Current fields
fields = ["NCTId", "BriefTitle", "Condition", "InterventionName", "Phase", "OverallStatus"]

# Additional valuable fields
additional_fields = [
    "PrimaryOutcomeMeasure",    # What are they measuring?
    "SecondaryOutcomeMeasure",  # Secondary endpoints
    "EligibilityCriteria",      # Who can participate?
    "LeadSponsorName",          # Who's running the trial?
    "ResultsFirstPostDate",     # Has results?
    "StudyFirstPostDate",       # When was the record first posted? (not the trial start date)
    "CompletionDate",           # When finished?
    "EnrollmentCount",          # Sample size
    "InterventionDescription",  # Drug details
    "ArmGroupLabel",            # Treatment arms
    "InterventionOtherName",    # Drug aliases
]
```
### Filter Enhancements
```python
# Current
aggFilters = "studyType:INTERVENTIONAL,status:RECRUITING"
# Could add (verify option keys and separators against the API docs before use)
"status:RECRUITING,ACTIVE_NOT_RECRUITING,COMPLETED" # Include completed for results
"phase:PHASE2,PHASE3" # Only later-stage trials
"resultsFirstPostDateRange:2020-01-01_" # Trials with posted results
```
---
## Recommended Improvements
### Phase 1: Richer Metadata
```python
EXTENDED_FIELDS = [
    "NCTId",
    "BriefTitle",
    "OfficialTitle",
    "Condition",
    "InterventionName",
    "InterventionDescription",
    "InterventionOtherName",  # Drug synonyms!
    "Phase",
    "OverallStatus",
    "PrimaryOutcomeMeasure",
    "EnrollmentCount",
    "LeadSponsorName",
    "StudyFirstPostDate",
]
```
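Requesting more fields only helps if the parser pulls them out of the nested response. A minimal sketch of that step follows. The function name `flatten_study` and the output keys are hypothetical, and the nested key paths (`protocolSection`, `identificationModule`, and so on) are assumptions about the v2 JSON layout that should be checked against the field definitions linked under Sources.

```python
from typing import Any


def flatten_study(study: dict[str, Any]) -> dict[str, Any]:
    """Flatten one v2 study record into the extended metadata fields.

    Key paths are assumptions about the v2 JSON layout; any module that
    is missing from the record yields None or an empty list.
    """
    protocol = study.get("protocolSection", {})
    ident = protocol.get("identificationModule", {})
    status = protocol.get("statusModule", {})
    design = protocol.get("designModule", {})
    interventions = protocol.get("armsInterventionsModule", {}).get("interventions", [])
    outcomes = protocol.get("outcomesModule", {}).get("primaryOutcomes", [])
    sponsor = protocol.get("sponsorCollaboratorsModule", {}).get("leadSponsor", {})

    return {
        "nct_id": ident.get("nctId"),
        "title": ident.get("briefTitle"),
        "official_title": ident.get("officialTitle"),
        "conditions": protocol.get("conditionsModule", {}).get("conditions", []),
        "interventions": [i.get("name") for i in interventions],
        # Aliases from every intervention, flattened into one list
        "intervention_aliases": [
            alias for i in interventions for alias in i.get("otherNames", [])
        ],
        "phases": design.get("phases", []),
        "status": status.get("overallStatus"),
        "primary_outcomes": [o.get("measure") for o in outcomes],
        "enrollment": design.get("enrollmentInfo", {}).get("count"),
        "lead_sponsor": sponsor.get("name"),
        "first_posted": status.get("studyFirstPostDateStruct", {}).get("date"),
    }
```

Using `.get()` with defaults at every level keeps the parser tolerant of fields that come back empty even when requested.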
### Phase 2: Results Retrieval
For completed trials, we can get actual efficacy data:
```python
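import httpx  # Assumption: httpx as the async HTTP client


async def get_trial_results(nct_id: str) -> dict | None:
    """Fetch the results section for a completed trial, if one is posted."""
    url = f"https://clinicaltrials.gov/api/v2/studies/{nct_id}"
    params = {"fields": "ResultsSection"}
    async with httpx.AsyncClient(timeout=30.0) as client:
        response = await client.get(url, params=params)
        response.raise_for_status()
    # Outcome measures and statistics are expected under "resultsSection"
    return response.json().get("resultsSection")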
```
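The raw results section is large, so a compact summary is probably more useful as evidence. The sketch below reduces it to one entry per outcome measure. The helper name `summarize_outcomes` is hypothetical, and the key names (`outcomeMeasuresModule`, `outcomeMeasures`, `timeFrame`, `unitOfMeasure`) are assumptions about the results JSON that need checking against a real response.

```python
from __future__ import annotations

from typing import Any


def summarize_outcomes(results_section: dict[str, Any] | None) -> list[dict[str, Any]]:
    """Reduce a results section to one small dict per outcome measure.

    Key names are assumptions about the v2 results layout. Returns an
    empty list when no results are posted.
    """
    if not results_section:
        return []
    measures = results_section.get("outcomeMeasuresModule", {}).get("outcomeMeasures", [])
    return [
        {
            "type": m.get("type"),  # e.g. primary vs. secondary
            "title": m.get("title"),
            "time_frame": m.get("timeFrame"),
            "unit": m.get("unitOfMeasure"),
        }
        for m in measures
    ]
```

Accepting `None` lets the output of a results lookup be passed straight in, whether or not the trial has posted results.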
### Phase 3: Drug Name Normalization
Map intervention names to standard identifiers:
```python
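import httpx  # Assumption: httpx as the async HTTP client

# Problem: "Metformin", "Metformin HCl", "Glucophage" are the same drug
# Solution: Use RxNorm or DrugBank for normalization
async def normalize_drug_name(intervention: str) -> str | None:
    """Look up the RxCUI for a drug name via the RxNorm API."""
    url = "https://rxnav.nlm.nih.gov/REST/rxcui.json"
    async with httpx.AsyncClient(timeout=30.0) as client:
        # Passing the name through params URL-encodes spaces and punctuation
        response = await client.get(url, params={"name": intervention})
        response.raise_for_status()
    rxcuis = response.json().get("idGroup", {}).get("rxnormId", [])
    return rxcuis[0] if rxcuis else None  # None when RxNorm has no match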
```
---
## Integration Opportunities
### With PubMed
Cross-reference trials with publications:
```python
# ClinicalTrials.gov provides PMID links
# Can correlate trial results with published papers
```
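A rough sketch of the extraction step, operating on an already-fetched study record. The function name `extract_pmids` is hypothetical, and the path `protocolSection.referencesModule.references` along with the `pmid` and `type` keys are assumptions about the v2 JSON layout.

```python
from __future__ import annotations

from typing import Any


def extract_pmids(study: dict[str, Any], ref_type: str | None = None) -> list[str]:
    """Collect unique PMIDs from a study's reference list, in order.

    ref_type optionally restricts the output to one reference type
    (the type values themselves are an assumption about the API).
    """
    references = (
        study.get("protocolSection", {})
        .get("referencesModule", {})
        .get("references", [])
    )
    pmids: list[str] = []
    for ref in references:
        pmid = ref.get("pmid")
        if not pmid:
            continue  # some references carry only a citation string
        if ref_type is not None and ref.get("type") != ref_type:
            continue
        if pmid not in pmids:
            pmids.append(pmid)
    return pmids
```

The resulting PMIDs can then be handed to the existing PubMed tool to pull the matching papers.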
### With DrugBank/ChEMBL
Map interventions to:
- Mechanism of action
- Known targets
- Adverse effects
- Drug-drug interactions
---
## Python Libraries to Consider
| Library | Purpose | Notes |
|---------|---------|-------|
| [pytrials](https://pypi.org/project/pytrials/) | CT.gov wrapper | V2 API support unclear |
| [clinicaltrials](https://github.com/ebmdatalab/clinicaltrials-act-tracker) | Data tracking | More for analysis |
| [drugbank-downloader](https://pypi.org/project/drugbank-downloader/) | Drug mapping | Requires license |
---
## API Quirks & Gotchas
1. **Rate Limiting**: Undocumented, be conservative
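2. **Pagination**: `pageSize` caps at 1000 results per request; later pages are fetched by passing the previous response's `nextPageToken` as `pageToken`
3. **Field Names**: Case-sensitive. The `fields` parameter takes PascalCase names (e.g. `NCTId`), while keys in the JSON response are camelCase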
4. **Empty Results**: Some fields may be null even if requested
5. **Status Changes**: Trials change status frequently
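
The rate-limiting and pagination points above can be handled together. The sketch below walks pages by token and pauses between requests. The name `iter_studies`, the injected `fetch_page` callable, and the default delay and page cap are all hypothetical choices, and the `studies` / `nextPageToken` response keys are assumptions to verify against the API documentation.

```python
import asyncio
from collections.abc import AsyncIterator, Awaitable, Callable
from typing import Any

# A callable that takes query params and returns one decoded JSON page
FetchPage = Callable[[dict[str, Any]], Awaitable[dict[str, Any]]]


async def iter_studies(
    fetch_page: FetchPage,
    params: dict[str, Any],
    delay_seconds: float = 0.5,
    max_pages: int = 20,
) -> AsyncIterator[dict[str, Any]]:
    """Yield studies page by page, pausing between requests.

    Stops when the response carries no nextPageToken or after max_pages.
    """
    page_params = dict(params)  # copy so the caller's dict is not mutated
    for _ in range(max_pages):
        payload = await fetch_page(page_params)
        for study in payload.get("studies", []):
            yield study
        token = payload.get("nextPageToken")
        if not token:
            return
        page_params["pageToken"] = token
        # The rate limit is undocumented, so stay conservative
        await asyncio.sleep(delay_seconds)
```

Injecting `fetch_page` keeps the paging logic independent of the HTTP client, so it can be exercised with a fake in tests and wired to the real request function in the tool.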
---
## Example Enhanced Query
```python
async def search_drug_repurposing_trials(
    drug_name: str,
    condition: str,
    include_completed: bool = True,
) -> list[Evidence]:
    """Search for trials repurposing a drug for a new condition."""
    statuses = ["RECRUITING", "ACTIVE_NOT_RECRUITING"]
    if include_completed:
        statuses.append("COMPLETED")
    params = {
        "query.intr": drug_name,
        "query.cond": condition,
        "filter.overallStatus": ",".join(statuses),
        # Assumption: v2 has no `filter.studyType` parameter, so study type
        # is filtered with an advanced expression; confirm against the API docs
        "filter.advanced": "AREA[StudyType]INTERVENTIONAL",
        "fields": ",".join(EXTENDED_FIELDS),
        "pageSize": 50,
    }
    # Not shown: send `params` to the studies endpoint and map each study to Evidence
```
---
## Sources
- [ClinicalTrials.gov API Documentation](https://clinicaltrials.gov/data-api/api)
- [CT.gov Field Definitions](https://clinicaltrials.gov/data-api/about-api/study-data-structure)
- [RxNorm API](https://lhncbc.nlm.nih.gov/RxNav/APIs/api-RxNorm.findRxcuiByString.html)