OntologerMed-ClinicalTrials-Instruct
A domain-adapted generative language model for clinical trial intelligence.
OntologerMed-ClinicalTrials-Instruct is a small, efficient language model trained end-to-end on the complete ClinicalTrials.gov corpus. It understands the structure, language, and clinical reasoning patterns of over 550,000 registered trials, and can answer specific questions about them.
This is v0.2 of the model, which extends the structured extraction capabilities of v0.1 with broad conversational Q&A covering 27 question categories drawn from real clinical trial intelligence workflows.
Part of the OntologerMed suite: a family of purpose-built models for clinical trial analysis, each trained to understand a different dimension of trial intelligence.
Inspiration and Prior Work
This model was directly inspired by BioGPT (Luo et al., 2022, Microsoft Research), which demonstrated that generative domain-specific pre-training on biomedical literature produces strong results on tasks where BERT-style discriminative models fall short.
BioGPT's key insight (that a GPT-style model pre-trained on domain text outperforms general-purpose models on biomedical generation and QA) translates directly to the clinical trial domain. Where BioGPT drew from PubMed abstracts and biomedical journal text, OntologerMed-ClinicalTrials draws from the structured, regulatory-grade text of ClinicalTrials.gov: study protocols, eligibility criteria, intervention descriptions, outcome definitions, and adverse event summaries.
| Dimension | BioGPT (Microsoft) | OntologerMed-ClinicalTrials-Instruct |
|---|---|---|
| Domain | Biomedical literature (PubMed) | Clinical trials (ClinicalTrials.gov) |
| Architecture | GPT-2 style Transformer | Qwen3.5 hybrid (Transformer + SSM) |
| Pre-training corpus | PubMed abstracts & full-text | 551,717 registered clinical studies |
| Fine-tuning | PubMedQA (274k pairs) | 877,386 instruction examples, 35 task types |
| Core capability | Biomedical QA and text generation | Clinical trial reasoning, extraction, and Q&A |
| Scale | Large (1.5B params) | Compact (0.8B params) |
Where BioGPT achieves 78.2% accuracy on PubMedQA, OntologerMed-ClinicalTrials targets analogous capabilities within the narrower, more structured domain of clinical trial documents, trading breadth for depth and scale for efficiency.
Model Overview
| Property | Value |
|---|---|
| Base model | Qwen/Qwen3.5-0.8B-Base (Apache 2.0) |
| Architecture | Hybrid Transformer + SSM (Gated DeltaNet, 3:1 linear:full attention) |
| Parameters | ~0.8B |
| Training stages | 2 (Continued Pre-Training → LoRA SFT v0.2) |
| Pre-training corpus | 551,717 ClinicalTrials.gov studies |
| SFT corpus | 877,386 instruction examples |
| Task types | 35 (8 structured extraction + 27 conversational Q&A) |
| License | CC BY-NC-ND 4.0 |
Training Pipeline
Stage 1 – Continued Pre-Training (CPT)
The base Qwen3.5-0.8B model underwent continued pre-training on a full-corpus rendering of ClinicalTrials.gov. Each study was converted into a structured plain-text document covering: study title, brief and detailed summaries, interventions, eligibility criteria, primary and secondary outcomes, results, and adverse events. Studies with insufficient text were filtered out.
- Corpus: 551,717 training documents, 11,165 held out for evaluation
- Objective: Causal language modeling (LoRA, rank 32)
- Hardware: NVIDIA H100 SXM 80GB
- Wall-clock time: ~10.5 hours
Stage 2 – LoRA Supervised Fine-Tuning v0.2 (SFT)
The CPT-adapted model was instruction-tuned using LoRA on 877,386 ChatML-formatted examples across 35 task types, in two classes:
Class A – Structured Extraction (8 tasks):

| Task | Description |
|---|---|
| `pico_extraction` | Extract Population, Intervention, Comparison, Outcome from a study |
| `eligibility_summary` | Summarise inclusion/exclusion criteria in plain language |
| `trial_summarization` | Generate a structured summary of a trial |
| `condition_matching` | Determine if a trial is relevant to a given condition |
| `outcome_success` | Assess reported outcome results against stated endpoints |
| `adverse_event_extraction` | Extract and structure adverse event data |
| `intervention_comparison` | Compare trial arms and interventions |
| `phase_classification` | Classify trial phase with contextual explanation |
Class B – Conversational Q&A (27 categories):
Drug mechanism, dosing and administration, eligibility interpretation, treatment comparison, safety questions, trial design rationale, outcome interpretation, recruitment questions, sponsor and phase context, statistical literacy, protocol navigation, patient-facing plain-language explanations, and 15 additional clinical intelligence categories, covering the full range of questions asked by pharma analysts, CROs, investigators, and patients.
- Training examples: 833,517 (train) + 43,869 (eval)
- Format: ChatML (`<|im_start|>system/user/assistant<|im_end|>`)
- Adapter method: LoRA rank 96, alpha 192, applied to all projection modules
- Hardware: NVIDIA H100 SXM 80GB
- Wall-clock time: ~11.25 hours (15,522 steps, 2 epochs)
- Final eval loss: 0.694
Evaluation Results
Methodology
All evaluations use the merged model at temperature 0 (greedy decoding), on 50 samples drawn from the held-out sft_eval.jsonl split (random seed 42). Evaluation uses identical prompts and parsers across all models β structured natural-language instructions with no SFT-format scaffolding. This is the same pipeline used to evaluate frontier models.
What each metric measures:
Outcome Accuracy – 3-class classification: did the trial meet its primary endpoint (positive), fail it (negative), or was the result ambiguous (inconclusive)? Score is exact-match accuracy against reference labels.
Adverse Event F1 – Entity-level F1 on extracted (severity, event_name) pairs. Requires structured line output (SERIOUS → event_name). Frontier models fail this metric not due to lack of capability but due to format non-compliance.
Intervention F1 – Entity-level F1 on extracted (arm_name, drug) pairs across trial arms. Strict string match; a lower bound on true comprehension.
PICO Macro F1 – Token-level F1 averaged across all four PICO elements (Population, Intervention, Comparison, Outcome). Rewards recall of key terms.
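The entity-level metrics above (Adverse Event F1 and Intervention F1) reduce to set intersection over exact-match pairs. A minimal sketch, for readers who want to reproduce the scoring; the actual evaluation scripts may apply additional string normalisation:

```python
def entity_f1(predicted: set, gold: set) -> float:
    """Entity-level F1 on exact-match tuples, e.g. (severity, event_name)
    or (arm_name, drug). A minimal sketch: the real scorer may lowercase
    or otherwise normalise strings before matching."""
    if not predicted and not gold:
        return 1.0  # both empty: perfect agreement
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```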
v0.2 Benchmark Results (natural-language prompts, n=50, seed=42)
Same evaluation pipeline used across all models: identical human-style prompts, temperature 0, no SFT format scaffolding. Note: frontier model comparisons are being reworked (see Frontier Model Comparison table note below).
| BioGPT Benchmark | Clinical Trial Equivalent | Metric | OntologerMed v0.2 | OntologerMed v0.1 ‡ | BioGPT-Large |
|---|---|---|---|---|---|
| PubMedQA (yes/no/maybe) | `outcome_success` | Accuracy | 0.3333 | 0.9200 ‡ | 0.7820 |
| BC5CDR Relation Extraction | `adverse_event_extraction` | Entity F1 | 0.6707 | 0.9923 ‡ | 0.4498 |
| DDI Drug Interaction | `intervention_comparison` | Entity F1 | 0.9476 | 0.5903 ‡ | 0.4076 |
| KD-DTI Drug-Target | `pico_extraction` | Macro F1 | 0.8164 | 0.7530 ‡ | 0.3842 |
‡ v0.1 scores were measured with SFT-formatted prompts, the exact format the model was trained on. v0.2 uses natural-language prompts and is therefore directly comparable to the frontier models below.
PICO per-element (v0.2): Population 1.00 · Intervention 1.00 · Comparison 0.928 · Outcome 0.337
The outcome accuracy drop reflects prompt-distribution shift (SFT prompts → natural language), not capability loss. Intervention F1 improved from 0.59 → 0.95 and PICO from 0.75 → 0.82: the v0.2 training with 27 Q&A categories substantially improved structured extraction precision.
Frontier Model Comparison
| Model | Params | Outcome Acc | AE F1 | Interv F1 | PICO F1 |
|---|---|---|---|---|---|
| OntologerMed v0.2 (0.8B) | 0.8B | 0.3333 | 0.6707 | 0.9476 | 0.8164 |
| OntologerMed v0.1 (0.8B) ‡ | 0.8B | 0.9200 ‡ | 0.9923 ‡ | 0.5903 ‡ | 0.7530 ‡ |
| BioGPT-Large | 1.5B | 0.7820 | 0.4498 | 0.4076 | 0.3842 |
| GPT-5.4 | ~1T+ | TBD | TBD | TBD | TBD |
| Claude Sonnet 4.6 | undisclosed | TBD | TBD | TBD | TBD |
| Claude Opus 4.6 | undisclosed | TBD | TBD | TBD | TBD |
| Gemini 3.1 Flash | undisclosed | TBD | TBD | TBD | TBD |
| Gemini 3.1 Pro | undisclosed | TBD | TBD | TBD | TBD |
Frontier model evaluations are being reworked. The current eval uses format-specific parsing (e.g. SERIOUS → event_name for AE extraction), which unfairly penalises models not trained on this output format. v0.3 will use an LLM-as-judge approach for fair comparison.
Intervention F1 is where the v0.2 model demonstrates the clearest advance over BioGPT-Large: 0.9476 vs 0.4076. The model reliably formats multi-arm trial outputs as `Arm N: drug name` lines. Full methodology in TEST_MODELS.md.
Business Use & Applications
OntologerMed-ClinicalTrials is purpose-built for teams that work with clinical trial data at scale: from biotech R&D desks to investment analysts to clinical operations platforms.
Pharma & Biotech R&D
- Competitive intelligence: automatically extract, compare, and summarise competitor trial designs, endpoints, and outcomes across hundreds of studies
- Pipeline analysis: assess the probability-weighted outcomes of a disease area's entire trial landscape
- Protocol development support: query historical trials for precedent eligibility criteria, endpoint selection, and dosing rationale
- Adverse event signal detection: surface recurring safety patterns across a drug class or indication
Clinical Research Organisations (CROs)
- Protocol feasibility: instantly parse eligibility criteria and identify likely recruitment bottlenecks
- Regulatory document drafting: generate structured summaries of trial outcomes for FDA submissions and IND/NDA support packages
- Site selection: match trials to patient populations by automatically extracting demographic and disease criteria
Digital Health & Trial Matching Platforms
- Patient-trial matching: determine in natural language whether a patient's profile satisfies a trial's inclusion/exclusion criteria
- Eligibility screening chatbots: power plain-language Q&A interfaces on top of ClinicalTrials.gov data
- Trial discovery: given a diagnosis, surface and summarise the most relevant active or completed trials
Investment & Due Diligence
- Biotech investment analysis: summarise a company's entire trial portfolio, including phase, outcomes, and adverse event profile
- M&A diligence: rapidly characterise an acquisition target's clinical evidence base across hundreds of registered studies
- Pipeline valuation inputs: extract structured outcome data to feed probability-of-technical-success (POTS) models
Regulatory & Medical Affairs
- Safety surveillance: structured adverse event extraction across a compound's complete trial history
- MSL briefings: auto-generate scientific summaries for field medical teams
- Evidence synthesis: aggregate outcome data across trials for systematic review and meta-analysis support
Example Questions & Answers
Outcome Assessment
Q: Based on NCT02847000 (a Phase 3 RCT of pembrolizumab vs chemotherapy in 305 patients with PD-L1-positive NSCLC), did the trial meet its primary endpoint?
A: This trial met its primary endpoint. The study demonstrated that pembrolizumab produced a statistically significant improvement in progression-free survival compared to platinum-based chemotherapy in patients with PD-L1 tumour proportion score ≥50% (median PFS 10.3 months vs 6.0 months; HR 0.50, 95% CI 0.37–0.68; p<0.001). Overall survival also favoured the pembrolizumab arm. The trial results supported regulatory approval of pembrolizumab as first-line monotherapy in this population.
VERDICT: positive
Conversational Q&A (Class B, v0.2)
Q: What is the typical Phase 2 success rate for oncology trials, and what drives the variance?
A: Phase 2 oncology trials historically succeed (advance to Phase 3) at rates of 30–45%, with significant variance driven by: indication specificity (rare cancers with validated biomarkers achieve higher rates), endpoint selection (OS vs PFS vs ORR), patient population homogeneity, and whether Phase 1 established a clear dose-response signal. Targeted therapies in biomarker-selected populations consistently outperform broad histology trials. Immunotherapy combinations show wide variance due to unpredictable synergy and toxicity profiles. Basket trials and adaptive designs modestly improve success rates by enabling early arm termination.
PICO Extraction
Q: Extract the PICO elements from this trial: Phase 2 RCT enrolling adults aged 40–75 with moderate-to-severe COPD (FEV1 30–70% predicted), randomising 240 participants to once-daily tiotropium 18mcg inhaler vs placebo for 52 weeks. Primary endpoint: change from baseline in trough FEV1 at Week 52.
A: Population: Adults aged 40–75 with moderate-to-severe COPD (FEV1 30–70% predicted); 240 participants enrolled
Intervention: Tiotropium 18mcg once-daily via HandiHaler inhaler for 52 weeks
Comparison: Matched placebo inhaler once daily for 52 weeks
Outcome: Change from baseline in trough FEV1 (L) at Week 52, measured by spirometry
Adverse Event Extraction
Q: Extract the structured adverse event profile from NCT01234567 (Drug X vs placebo, n=420).
A:
SERIOUS → Pneumonia: 8/210
SERIOUS → Pulmonary embolism: 3/210
SERIOUS → Elevated liver enzymes (Grade 3+): 6/210
OTHER → Nausea: 47/210
OTHER → Fatigue: 39/210
OTHER → Headache: 28/210
OTHER → Diarrhoea: 22/210
OTHER → Injection site reaction: 18/210

Serious adverse event rate in the treatment arm was 8.1%, compared to 3.8% in the placebo arm. No treatment-related deaths were reported.
Eligibility Summary
Q: Summarise the eligibility criteria for NCT03215823 in plain language for a patient with Type 2 diabetes, aged 58, eGFR 42, and no history of cardiovascular events.
A: You may be eligible. This trial is looking for adults with Type 2 diabetes who have not achieved adequate glucose control on current medication. Key requirements:
- Age 18 or older ✓ (you are 58)
- Type 2 diabetes diagnosis ✓
- Kidney function (eGFR) ≥ 30 mL/min/1.73m² ✓ (yours is 42)
- No prior heart attack, stroke, or hospitalisation for heart failure in the last 3 months ✓
Possible concern: The trial excludes patients currently on insulin therapy. If you are on insulin, you would not qualify. Please confirm your current medications with the trial coordinator.
Intervention Comparison
Q: Compare the treatment arms in a Phase 3 trial of baricitinib vs adalimumab vs placebo in 1,305 patients with moderate-to-severe rheumatoid arthritis.
A: Arm 1: DRUG: Baricitinib 4mg once daily (oral) + methotrexate background therapy
Arm 2: DRUG: Adalimumab 40mg subcutaneous injection every 2 weeks + methotrexate background therapy
Arm 3: DRUG: Placebo (oral + subcutaneous) + methotrexate background therapy
Baricitinib demonstrated superior ACR20 response rates at Week 12 compared to adalimumab (70% vs 61%) and placebo (40%), with faster onset of action. All arms continued background methotrexate throughout.
Prompt Format & Best Practices
What the model needs
The model is trained on ChatML format with a system prompt, trial text pasted into the user turn, and a task-specific instruction. Always include:
- A system prompt identifying the assistant as OntologerMed
- The trial text (copy-paste from ClinicalTrials.gov or your data source) in the user turn
- A specific task instruction: extraction, summarisation, or a question
Use temperature 0 (greedy decoding) for structured extraction tasks. For conversational Q&A, temperatures up to 0.3 are fine.
Class A – Structured Extraction (example prompts)
Extract the PICO elements from the following trial description: [trial text]
Extract structured adverse event data from the following trial. Format as: SERIOUS → event_name or OTHER → event_name: [trial text]
Compare the treatment arms in the following trial: [trial text]
Summarise the eligibility criteria for the following trial in plain language: [trial text]
Classify the phase of the following trial and explain why: [trial text]
Did this trial meet its primary endpoint? [trial text]
Extract the interventions used in each arm of the following trial: [trial text]
Generate a structured summary of the following trial: [trial text]
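Downstream code can parse the adverse-event output format requested above back into structured records. A minimal sketch, assuming the `SEVERITY → event: n/N` line shape shown in the example outputs on this card (the optional count suffix is an assumption about how results-bearing trials are rendered):

```python
import re

# Matches lines like "SERIOUS → Pneumonia: 8/210" or "OTHER → Nausea: 47/210".
# The count suffix is optional; both the unicode arrow and ASCII "->" are
# accepted. Line shape is assumed from the examples in this card.
AE_LINE = re.compile(r"^(SERIOUS|OTHER)\s*(?:→|->)\s*(.+?)(?::\s*(\d+)/(\d+))?$")

def parse_adverse_events(text: str) -> list[dict]:
    """Parse model output lines into structured adverse-event records."""
    events = []
    for line in text.splitlines():
        m = AE_LINE.match(line.strip())
        if m:
            severity, name, n, total = m.groups()
            events.append({
                "severity": severity,
                "event": name.strip(),
                "count": int(n) if n else None,
                "denominator": int(total) if total else None,
            })
    return events
```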
Class B – Conversational Q&A (example prompts)
What is the typical Phase 2 success rate for oncology trials?
What does the mechanism of action of [drug] suggest about its likely side effect profile?
What eligibility criteria would typically exclude patients with moderate renal impairment from this type of trial?
How does a basket trial design differ from a platform trial?
What does a hazard ratio of 0.65 mean in a survival analysis?
What are the main reasons Phase 3 oncology trials fail?
How long does a typical Phase 3 oncology trial take from first enrolment to data lock?
What is the difference between overall survival and progression-free survival as primary endpoints?
What does FDA breakthrough therapy designation mean for trial timelines?
Why might a trial use an adaptive design instead of a fixed two-arm design?
What questions should I ask when reviewing a competitor's Phase 2 results?
How do I interpret a statistically significant result with a small effect size?
Real outputs from v0.2 (tested 2026-03-12)
Q: What is an adaptive trial design?
An adaptive trial design is a type of trial design that allows the study team to modify the study design based on the results of the study. This is done to improve the study's ability to detect treatment effects and to reduce the number of participants needed to achieve the study's primary outcome.
Q: What is a surrogate endpoint?
A surrogate endpoint is a clinical outcome that is not the primary endpoint of a clinical trial. Instead, it is used as a surrogate marker β considered to be a good predictor of the primary outcome. The surrogate marker is not the primary outcome itself. The surrogate marker is used to measure the effect of the intervention.
Q: What is the difference between Phase 1 and Phase 3?
Phase 1: To evaluate the safety and tolerability of a single dose of a new drug, and to determine the maximum tolerated dose (MTD). Phase 3: To evaluate the efficacy and safety of a drug in a large, randomised, controlled trial.
Q: What drugs are being tested for obesity?
This Trial is a Phase 2 Study of the Safety and Efficacy of the Combination of Semaglutide and Tirzepatide in Participants With Obesity. The drugs involved in this study are: Semaglutide, Tirzepatide. Participants will be randomly assigned to: Semaglutide 2.4 mg + Tirzepatide 5 mg, or Semaglutide 2.4 mg + Tirzepatide 10 mg. Once weekly, 52 weeks, ~100 participants.
What the model is not good at (v0.2)
Corpus-level aggregate questions: the model was trained on individual trial Q&A, not corpus synthesis. These questions currently fail:
- "What predicts success in Phase 3 cardiovascular outcome trials?" → echoes the question back
- "What is the success rate for cancer drugs in Phase 3?" → hallucinated 80%
- "Which companies run the most clinical trials?" → describes a single unrelated trial
- "Are there any trials for long COVID?" → "Yes, this trial has relevant data on long COVID." (one sentence)
- "How many cancer trials use immunotherapy?" → invents one trial
What to use instead (v0.2): Provide trial text in the prompt. The model answers questions about a specific trial, not across the corpus. Aggregate landscape questions are on the v0.3 roadmap.
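The recommended pattern (ground every question in a specific trial's text rather than asking across the corpus) can be wrapped in a small helper. The prompt wording below is illustrative, not a prescribed template; any clear instruction that pairs the trial text with the question works:

```python
# Helper for the recommended usage pattern: paste one trial's text into the
# prompt and ask about that trial. Wording is an illustrative assumption.
def grounded_prompt(trial_text: str, question: str) -> str:
    """Build a single-trial, grounded user prompt for the model."""
    return (
        "Here is a clinical trial record from ClinicalTrials.gov:\n\n"
        f"{trial_text}\n\n"
        f"Based only on this trial, answer: {question}"
    )
```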
Other current limitations:
- Bare prompts without trial context: outputs may be off-distribution without trial text
- Non-ClinicalTrials.gov data: trained exclusively on ClinicalTrials.gov formats
- Live data: no knowledge of trials after training cutoff
- General medical questions: stay within the clinical trial domain
Full test results (59 prompts with outputs) are in TODO_FOR_0.3.md in the repo. v0.3 will add corpus-level aggregate training data to fix the aggregate Q&A failure mode.
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Ontologer/OntologerMed-ClinicalTrials-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

SYSTEM = (
    "You are OntologerMed, a clinical trials intelligence assistant trained on 550,000+ "
    "ClinicalTrials.gov studies. Answer questions accurately and concisely based on clinical "
    "trial data. If you don't have specific information, say so clearly."
)

def ask(user_prompt, max_new_tokens=512):
    text = (
        f"<|im_start|>system\n{SYSTEM}<|im_end|>\n"
        f"<|im_start|>user\n{user_prompt}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# Structured extraction
print(ask("Extract the PICO elements from NCT02847000, a Phase 3 pembrolizumab vs chemotherapy trial in NSCLC."))

# Conversational Q&A
print(ask("What is the typical timeline from Phase 2 completion to Phase 3 start for oncology programmes?"))
```
Hardware requirements: 16GB VRAM minimum (inference). The model uses `trust_remote_code=True` due to the Gated DeltaNet hybrid architecture. Requires `flash-linear-attention` for fast inference; falls back to PyTorch (~15× slower) without it.
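A quick way to check at runtime which inference path you will get. The import name `fla` is an assumption based on the flash-linear-attention project's package layout; verify it against the version you install:

```python
import importlib.util

def has_fast_kernels() -> bool:
    """Return True if flash-linear-attention appears importable.
    The `fla` import name is an assumption; check your installed version."""
    return importlib.util.find_spec("fla") is not None

if not has_fast_kernels():
    print("flash-linear-attention not found; inference will fall back to "
          "the much slower pure-PyTorch path.")
```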
Limitations
- Template-generated SFT responses: Class A instruction data was derived programmatically from ClinicalTrials.gov structured fields. Class B conversational data was expanded from category seeds. Response quality reflects data completeness in the underlying registry.
- English only: Trained exclusively on English-language trial records.
- Not a clinical decision tool: This model is not validated for clinical, regulatory, or patient-facing use. Do not use for medical decisions.
- Distribution shift: Performance degrades on trials with sparse documentation, non-standard formatting, or from registries outside ClinicalTrials.gov.
- Small scale: At 0.8B parameters, complex multi-step reasoning tasks may require prompt engineering or retrieval augmentation.
- Outcome accuracy under natural prompts: Currently 33% on the 50-sample eval set, a known weak point from prompt-distribution shift between v0.1 and v0.2. Top priority for v0.3.
- Frontier model comparison methodology: The current eval uses format-specific parsing which penalises models not trained on our output format. Frontier model results are marked TBD pending a reworked LLM-as-judge evaluation in v0.3.
Part of the OntologerMed Suite
| Model | Role |
|---|---|
| OntologerMed-ClinicalTrials-Instruct | Domain LM: generative reasoning, extraction, and summarisation over trial text |
| FATE-ClinicalTrials-Outcome-256 | Outcome-shaped embedding: similarity by historical success/failure pattern |
| MOAt-ClinicalTrials-MoA-256 | Mechanism-of-action embedding: similarity by biological pathway |
| PACT-ClinicalTrials-Pop-256 | Population embedding: similarity by patient demographics and disease |
| ORACLE-ClinicalTrials-SuccessProb-v1 | Classifier: calibrated probability estimate combining all three embedding dimensions |
Citation
```bibtex
@misc{ontologermed-clinicaltrials-2026,
  title = {OntologerMed-ClinicalTrials-Instruct: A Domain-Adapted Generative Language Model for Clinical Trial Intelligence},
  author = {Mishra, Sid and Ontologer},
  year = {2026},
  note = {Two-stage training: continued pre-training + LoRA SFT v0.2 on ClinicalTrials.gov. Inspired by BioGPT (Luo et al., 2022).},
  howpublished = {\url{https://huggingface.co/Ontologer/OntologerMed-ClinicalTrials-Instruct}}
}
```
BioGPT (inspiration):
```bibtex
@article{luo2022biogpt,
  author = {Luo, Renqian and Sun, Liai and Xia, Yingce and Qin, Tao and Zhang, Sheng and Poon, Hoifung and Liu, Tie-Yan},
  title = {BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining},
  journal = {Briefings in Bioinformatics},
  volume = {23},
  number = {6},
  year = {2022},
  doi = {10.1093/bib/bbac409}
}
```
Guardrails
- Not medical, clinical, or regulatory advice
- Not validated for patient-facing or clinical decision support use
- SFT responses are template-derived or seed-expanded; quality depends on source data completeness
- Always pair model outputs with domain expertise and independent verification
About
Sid Mishra β Founder, Ontologer Β· Convixion AI
Sid is the founder of several AI-native and AI-powered startups and initiatives, based in Singapore. He founded Ontologer as the dedicated AI research arm of Convixion AI. Ontologer performs every step of model development in-house: data pipelines, dataset curation, training infrastructure, evaluation frameworks, and production deployment.
| Contact | |
|---|---|
| Site | ontologer.com |
| Email | sid@ontologer.com |
| LinkedIn | linkedin.com/in/sid-m-427b9865 |