# DeepBoner Data Sources: Roadmap Summary

**Created**: 2024-11-27
**Purpose**: Future maintainability and hackathon continuation

---

## Current State

### Working Tools

| Tool | Status | Data Quality |
|------|--------|--------------|
| PubMed | ✅ Works | Good (abstracts only) |
| ClinicalTrials.gov | ✅ Works | Good (filtered for interventional) |
| Europe PMC | ✅ Works | Good (includes preprints) |

### Removed Tools

| Tool | Status | Reason |
|------|--------|--------|
| bioRxiv | ❌ Removed | No search API - only date/DOI lookup |

---

## Priority Improvements

### P0: Critical (Do First)

1. **Add Rate Limiting to PubMed**
   - NCBI may throttle or block us if we exceed its request limits
   - Use `limits` library (see reference repo)
   - Limits: 3 requests/sec without an API key, 10 requests/sec with one
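
A minimal sketch of the rate-limiting idea above. It deliberately uses only the standard library rather than the `limits` package, so it is a stand-in for illustration and not the reference repo's implementation. The names `SlidingWindowLimiter` and `pubmed_limiter` are hypothetical.

```python
import threading
import time
from collections import deque
from typing import Optional


class SlidingWindowLimiter:
    """Allow at most `max_calls` calls in any `period`-second window."""

    def __init__(self, max_calls, period=1.0, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self._clock = clock      # injectable so the limiter can be tested without waiting
        self._sleep = sleep
        self._calls = deque()    # timestamps of calls still inside the window
        self._lock = threading.Lock()

    def acquire(self) -> None:
        """Block until a slot is free, then record the call."""
        while True:
            with self._lock:
                now = self._clock()
                # Drop timestamps that have left the window.
                while self._calls and now - self._calls[0] >= self.period:
                    self._calls.popleft()
                if len(self._calls) < self.max_calls:
                    self._calls.append(now)
                    return
                # Time until the oldest recorded call leaves the window.
                wait = self.period - (now - self._calls[0])
            self._sleep(wait)


def pubmed_limiter(api_key: Optional[str]) -> SlidingWindowLimiter:
    # 3 requests/sec without a key, 10 requests/sec with one (figures from this roadmap).
    return SlidingWindowLimiter(max_calls=10 if api_key else 3, period=1.0)
```

A PubMed tool would call `limiter.acquire()` immediately before each HTTP request. Swapping this for the `limits` package later should only change the body of `acquire()`.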

### P1: High Value, Medium Effort

2. **Add OpenAlex as 4th Source**
   - Citation network (huge for drug repurposing)
   - Concept tagging (semantic discovery)
   - Already implemented in reference repo
   - Free, no API key

3. **PubMed Full-Text via BioC**
   - Get full paper text for PMC papers
   - Already in reference repo
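
To give a feel for the OpenAlex item above, here is a hedged sketch of the two request shapes it would need (keyword search and "who cites this work"), plus a helper for abstracts, which OpenAlex returns as an inverted index rather than plain text. The function names are hypothetical and this is not the reference repo's code. The query parameters (`search`, `per-page`, `mailto`, `filter=cites:...`) follow my understanding of the public OpenAlex API and should be checked against its documentation.

```python
from typing import Dict, List, Optional
from urllib.parse import urlencode

OPENALEX_WORKS_URL = "https://api.openalex.org/works"


def openalex_search_url(query: str, per_page: int = 10, mailto: Optional[str] = None) -> str:
    """Build a keyword-search URL for OpenAlex works."""
    params = {"search": query, "per-page": per_page}
    if mailto:
        # Optional contact address for OpenAlex's "polite pool".
        params["mailto"] = mailto
    return f"{OPENALEX_WORKS_URL}?{urlencode(params)}"


def openalex_citing_url(work_id: str, per_page: int = 10) -> str:
    """Build a URL listing works that cite `work_id` (an OpenAlex ID like "W123")."""
    params = {"filter": f"cites:{work_id}", "per-page": per_page}
    return f"{OPENALEX_WORKS_URL}?{urlencode(params)}"


def abstract_from_inverted_index(index: Optional[Dict[str, List[int]]]) -> str:
    """Rebuild plain text from an `abstract_inverted_index` mapping of word -> positions."""
    if not index:
        return ""
    positions = sorted((pos, word) for word, places in index.items() for pos in places)
    return " ".join(word for _, word in positions)
```

Keeping URL construction separate from the HTTP call makes the tool easy to unit-test and lets it sit behind whatever rate limiter and HTTP client the project already uses.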

### P2: Nice to Have

4. **ClinicalTrials.gov Results**
   - Get efficacy data from completed trials
   - Requires more complex API calls

5. **Europe PMC Annotations**
   - Text-mined entities (genes, drugs, diseases)
   - Automatic entity extraction
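
As a rough illustration of the ClinicalTrials.gov results item, the sketch below pulls outcome titles out of a single study record. The field names (`hasResults`, `resultsSection`, `outcomeMeasuresModule`, and so on) reflect my reading of the ClinicalTrials.gov v2 study JSON and are an assumption to verify against a live response. `summarize_results` is a hypothetical helper.

```python
from typing import Any, Dict


def summarize_results(study: Dict[str, Any]) -> Dict[str, Any]:
    """Extract the trial ID and reported outcome measures from one study record."""
    ident = study.get("protocolSection", {}).get("identificationModule", {})
    outcomes = (
        study.get("resultsSection", {})
        .get("outcomeMeasuresModule", {})
        .get("outcomeMeasures", [])
    )
    return {
        "nct_id": ident.get("nctId"),
        # Trials without posted results are expected to lack `resultsSection` entirely.
        "has_results": bool(study.get("hasResults")),
        "outcomes": [{"type": o.get("type"), "title": o.get("title")} for o in outcomes],
    }
```

Every lookup uses `.get()` with a default so that trials with no posted results produce an empty summary instead of raising.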

---

## Effort Estimates

| Improvement | Effort | Impact | Priority |
|-------------|--------|--------|----------|
| PubMed rate limiting | 1 hour | Stability | P0 |
| OpenAlex basic search | 2 hours | High | P1 |
| OpenAlex citations | 2 hours | Very High | P1 |
| PubMed full-text | 3 hours | Medium | P1 |
| CT.gov results | 4 hours | Medium | P2 |
| Europe PMC annotations | 3 hours | Medium | P2 |

---

## Architecture Decision

### Option A: Keep Current + Add OpenAlex

```
                    User Query
                        ↓
    ┌───────────────────┼───────────────────┐
    ↓                   ↓                   ↓
 PubMed          ClinicalTrials        Europe PMC
 (abstracts)     (trials only)         (preprints)
    ↓                   ↓                   ↓
    └───────────────────┼───────────────────┘
                        ↓
                   OpenAlex              ← NEW
               (citations, concepts)
                        ↓
                  Orchestrator
                        ↓
                     Report
```

**Pros**: Low risk, additive
**Cons**: More complexity, some overlap

### Option B: OpenAlex as Primary

```
                    User Query
                        ↓
    ┌───────────────────┼───────────────────┐
    ↓                   ↓                   ↓
 OpenAlex          ClinicalTrials      Europe PMC
 (primary          (trials only)       (full-text
  search)                               fallback)
    ↓                   ↓                   ↓
    └───────────────────┼───────────────────┘
                        ↓
                  Orchestrator
                        ↓
                     Report
```

**Pros**: Simpler, citation network built-in
**Cons**: Lose some PubMed-specific features

### Recommendation: Option A

Keep current architecture working, add OpenAlex incrementally.
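
The Option A flow can be sketched as follows: the three existing sources are queried concurrently, and their combined results then pass through an OpenAlex enrichment step before reaching the orchestrator. All names here (`gather_evidence`, `run_pipeline`, the `fake_*` tools) are hypothetical stand-ins and do not use the project's actual `SearchTool` interface.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

Evidence = Dict[str, str]
SearchFn = Callable[[str], Awaitable[List[Evidence]]]
EnrichFn = Callable[[List[Evidence]], Awaitable[List[Evidence]]]


async def gather_evidence(query: str, sources: Dict[str, SearchFn]) -> List[Evidence]:
    """Query every source concurrently; a failing source is skipped, not fatal."""
    names = list(sources)
    results = await asyncio.gather(
        *(sources[name](query) for name in names), return_exceptions=True
    )
    evidence: List[Evidence] = []
    for name, result in zip(names, results):
        if isinstance(result, BaseException):
            continue  # e.g. one API is down; keep results from the others
        evidence.extend({**item, "source": name} for item in result)
    return evidence


async def run_pipeline(query: str, sources: Dict[str, SearchFn], enrich: EnrichFn) -> List[Evidence]:
    """Search all sources, then pass the combined evidence through the enrichment step."""
    evidence = await gather_evidence(query, sources)
    return await enrich(evidence)


# --- Stand-in tools for illustration only; real tools would call the APIs. ---
async def fake_pubmed(query: str) -> List[Evidence]:
    return [{"title": f"abstract about {query}"}]


async def fake_trials(query: str) -> List[Evidence]:
    raise RuntimeError("simulated outage")


async def fake_europe_pmc(query: str) -> List[Evidence]:
    return [{"title": f"preprint about {query}"}]


async def fake_openalex_enrich(evidence: List[Evidence]) -> List[Evidence]:
    # A real step would attach citation counts and concepts from OpenAlex.
    return [{**item, "cited_by": "unknown"} for item in evidence]
```

Because enrichment is a separate step after the fan-out, OpenAlex can be added or removed without touching the three existing tools, which is the "low risk, additive" property that motivates Option A.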

---

## Quick Wins (Can Do Today)

1. **Add `limits` to `pyproject.toml`**
   ```toml
   dependencies = [
       "limits>=3.0",
   ]
   ```

2. **Copy OpenAlex tool from reference repo**
   - File: `reference_repos/DeepBoner/DeepResearch/src/tools/openalex_tools.py`
   - Adapt to our `SearchTool` base class

3. **Enable NCBI API Key**
   - Add to `.env`: `NCBI_API_KEY=your_key`
   - Raises the rate limit from 3 to 10 requests/sec (roughly 3x)
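
For the API-key quick win, a small sketch of how a PubMed tool might pick up `NCBI_API_KEY` from the environment and pass it to E-utilities through the `api_key` query parameter. The helper names are hypothetical, and the code assumes the `.env` file has already been loaded into the process environment by whatever mechanism the project uses.

```python
import os
from typing import Mapping
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_search_url(term: str, env: Mapping[str, str] = os.environ, retmax: int = 10) -> str:
    """Build an esearch URL, adding `api_key` only when NCBI_API_KEY is set."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    api_key = env.get("NCBI_API_KEY")
    if api_key:
        params["api_key"] = api_key
    return f"{ESEARCH_URL}?{urlencode(params)}"


def requests_per_second(env: Mapping[str, str] = os.environ) -> int:
    """Pick the request budget that matches whether a key is configured."""
    return 10 if env.get("NCBI_API_KEY") else 3
```

Passing `env` explicitly keeps the functions testable without modifying the real environment.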

---

## External Resources Worth Exploring

### Python Libraries

| Library | For | Notes |
|---------|-----|-------|
| `limits` | Rate limiting | Used by reference repo |
| `pyalex` | OpenAlex wrapper | [GitHub](https://github.com/J535D165/pyalex) |
| `metapub` | PubMed | Full-featured |
| `sentence-transformers` | Semantic search | For embeddings |

### APIs Not Yet Used

| API | Provides | Effort |
|-----|----------|--------|
| RxNorm | Drug name normalization | Low |
| DrugBank | Drug targets/mechanisms | Medium (license) |
| UniProt | Protein data | Medium |
| ChEMBL | Bioactivity data | Medium |

### RAG Tools (Future)

| Tool | Purpose |
|------|---------|
| [PaperQA](https://github.com/Future-House/paper-qa) | RAG for scientific papers |
| [txtai](https://github.com/neuml/txtai) | Embeddings + search |
| [PubMedBERT](https://huggingface.co/NeuML/pubmedbert-base-embeddings) | Biomedical embeddings |

---

## Files in This Directory

| File | Contents |
|------|----------|
| `00_ROADMAP_SUMMARY.md` | This file |
| `01_PUBMED_IMPROVEMENTS.md` | PubMed enhancement details |
| `02_CLINICALTRIALS_IMPROVEMENTS.md` | ClinicalTrials.gov details |
| `03_EUROPEPMC_IMPROVEMENTS.md` | Europe PMC details |
| `04_OPENALEX_INTEGRATION.md` | OpenAlex integration plan |

---

## For Future Maintainers

If you're picking this up after the hackathon:

1. **Add rate limiting first** - prevents API blocks (this is the P0 item)
2. **Then add OpenAlex** - biggest bang for the buck
3. **Don't bother with bioRxiv** - use Europe PMC instead
4. **Reference repo is gold** - `reference_repos/DeepBoner/` has working implementations

Good luck! 🚀