OntologerMed-ClinicalTrials-Instruct
A domain-adapted generative language model for clinical trial intelligence.
OntologerMed-ClinicalTrials-Instruct is a small, efficient language model trained end-to-end on the complete ClinicalTrials.gov corpus. It understands the structure, language, and clinical reasoning patterns of over 550,000 registered trials, and can answer specific questions about them.
This is v0.2 of the model, which extends the structured extraction capabilities of v0.1 with broad conversational Q&A covering 27 question categories drawn from real clinical trial intelligence workflows.
Part of the OntologerMed suite: a family of purpose-built models for clinical trial analysis, each trained to understand a different dimension of trial intelligence.
Inspiration and Prior Work
This model was directly inspired by BioGPT (Luo et al., 2022, Microsoft Research), which demonstrated that generative domain-specific pre-training on biomedical literature produces strong results on tasks where BERT-style discriminative models fall short.
BioGPT's key insight (that a GPT-style model pre-trained on domain text outperforms general-purpose models on biomedical generation and QA) translates directly to the clinical trial domain. Where BioGPT drew from PubMed abstracts and biomedical journal text, OntologerMed-ClinicalTrials draws from the structured, regulatory-grade text of ClinicalTrials.gov: study protocols, eligibility criteria, intervention descriptions, outcome definitions, and adverse event summaries.
| Dimension | BioGPT (Microsoft) | OntologerMed-ClinicalTrials-Instruct |
|---|---|---|
| Domain | Biomedical literature (PubMed) | Clinical trials (ClinicalTrials.gov) |
| Architecture | GPT-2 style Transformer | Qwen3.5 hybrid (Transformer + SSM) |
| Pre-training corpus | PubMed abstracts & full-text | 551,717 registered clinical studies |
| Fine-tuning | PubMedQA (274k pairs) | 877,386 instruction examples, 35 task types |
| Core capability | Biomedical QA and text generation | Clinical trial reasoning, extraction, and Q&A |
| Scale | Large (1.5B params) | Compact (0.8B params) |
Where BioGPT achieves 78.2% accuracy on PubMedQA, OntologerMed-ClinicalTrials targets analogous capabilities within the narrower, more structured domain of clinical trial documents, trading breadth for depth and scale for efficiency.
Model Overview
| Property | Value |
|---|---|
| Base model | Qwen/Qwen3.5-0.8B-Base (Apache 2.0) |
| Architecture | Hybrid Transformer + SSM (Gated DeltaNet, 3:1 linear:full attention) |
| Parameters | ~0.8B |
| Training stages | 2 (Continued Pre-Training → LoRA SFT v0.2) |
| Pre-training corpus | 551,717 ClinicalTrials.gov studies |
| SFT corpus | 877,386 instruction examples |
| Task types | 35 (8 structured extraction + 27 conversational Q&A) |
| License | CC BY-NC-ND 4.0 |
Training Pipeline
Stage 1 – Continued Pre-Training (CPT)
The base Qwen3.5-0.8B model underwent continued pre-training on a full-corpus rendering of ClinicalTrials.gov. Each study was converted into a structured plain-text document covering: study title, brief and detailed summaries, interventions, eligibility criteria, primary and secondary outcomes, results, and adverse events. Studies with insufficient text were filtered out.
- Corpus: 551,717 training documents, 11,165 held out for evaluation
- Objective: Causal language modeling (LoRA, rank 32)
- Hardware: NVIDIA H100 SXM 80GB
- Wall-clock time: ~10.5 hours
Stage 2 – LoRA Supervised Fine-Tuning v0.2 (SFT)
The CPT-adapted model was instruction-tuned using LoRA on 877,386 ChatML-formatted examples across 35 task types, in two classes:
Class A – Structured Extraction (8 tasks):

| Task | Description |
|---|---|
| `pico_extraction` | Extract Population, Intervention, Comparison, Outcome from a study |
| `eligibility_summary` | Summarise inclusion/exclusion criteria in plain language |
| `trial_summarization` | Generate a structured summary of a trial |
| `condition_matching` | Determine if a trial is relevant to a given condition |
| `outcome_success` | Assess reported outcome results against stated endpoints |
| `adverse_event_extraction` | Extract and structure adverse event data |
| `intervention_comparison` | Compare trial arms and interventions |
| `phase_classification` | Classify trial phase with contextual explanation |
Class B – Conversational Q&A (27 categories):
Drug mechanism, dosing and administration, eligibility interpretation, treatment comparison, safety questions, trial design rationale, outcome interpretation, recruitment questions, sponsor and phase context, statistical literacy, protocol navigation, patient-facing plain-language explanations, and 15 additional clinical intelligence categories, covering the full range of questions asked by pharma analysts, CROs, investigators, and patients.
- Training examples: 833,517 (train) + 43,869 (eval)
- Format: ChatML (`<|im_start|>system/user/assistant<|im_end|>`)
- Adapter method: LoRA rank 96, alpha 192, applied to all projection modules
- Hardware: NVIDIA H100 SXM 80GB
- Wall-clock time: ~11.25 hours (15,522 steps, 2 epochs)
- Final eval loss: 0.694
Evaluation Results
Methodology
All evaluations use the merged model at temperature 0 (greedy decoding), on 50 samples drawn from the held-out sft_eval.jsonl split (random seed 42). Evaluation uses identical prompts and parsers across all models β structured natural-language instructions with no SFT-format scaffolding. This is the same pipeline used to evaluate frontier models.
What each metric measures:
Outcome Accuracy – 3-class classification: did the trial meet its primary endpoint (positive), fail it (negative), or was the result ambiguous (inconclusive)? Score is exact-match accuracy against reference labels.
Adverse Event F1 – Entity-level F1 on extracted (severity, event_name) pairs. Requires structured line output (SERIOUS → event_name). Frontier models fail this metric not due to lack of capability but due to format non-compliance.
Intervention F1 – Entity-level F1 on extracted (arm_name, drug) pairs across trial arms. Strict string match; a lower bound on true comprehension.
PICO Macro F1 – Token-level F1 averaged across all four PICO elements (Population, Intervention, Comparison, Outcome). Rewards recall of key terms.
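The entity-level metrics above (Adverse Event F1 and Intervention F1) reduce to set intersection over exact-match pairs. A minimal sketch, for readers who want to reproduce the scoring; the actual evaluation scripts may apply additional string normalisation:

```python
def entity_f1(predicted: set, gold: set) -> float:
    """Entity-level F1 on exact-match tuples, e.g. (severity, event_name)
    or (arm_name, drug). A minimal sketch: the real scorer may lowercase
    or otherwise normalise strings before matching."""
    if not predicted and not gold:
        return 1.0  # both empty: perfect agreement
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```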
v0.2 Benchmark Results (natural-language prompts, n=50, seed=42)
Same evaluation pipeline used across all models: identical human-style prompts, temperature 0, no SFT format scaffolding. Note: frontier model comparisons are being reworked (see Frontier Model Comparison table note below).
| BioGPT Benchmark | Clinical Trial Equivalent | Metric | OntologerMed v0.2 | OntologerMed v0.1 ‡ | BioGPT-Large |
|---|---|---|---|---|---|
| PubMedQA (yes/no/maybe) | `outcome_success` | Accuracy | 0.3333 | 0.9200 ‡ | 0.7820 |
| BC5CDR Relation Extraction | `adverse_event_extraction` | Entity F1 | 0.6707 | 0.9923 ‡ | 0.4498 |
| DDI Drug Interaction | `intervention_comparison` | Entity F1 | 0.9476 | 0.5903 ‡ | 0.4076 |
| KD-DTI Drug-Target | `pico_extraction` | Macro F1 | 0.8164 | 0.7530 ‡ | 0.3842 |
‡ v0.1 scores were measured with SFT-formatted prompts, the exact format the model was trained on. v0.2 uses natural-language prompts and is therefore directly comparable to the frontier models below.
PICO per-element (v0.2): Population 1.00 · Intervention 1.00 · Comparison 0.928 · Outcome 0.337
The outcome accuracy drop reflects prompt-distribution shift (SFT prompts → natural language), not capability loss. Intervention F1 improved from 0.59 → 0.95 and PICO from 0.75 → 0.82: the v0.2 training with 27 Q&A categories substantially improved structured extraction precision.
Frontier Model Comparison
| Model | Params | Outcome Acc | AE F1 | Interv F1 | PICO F1 |
|---|---|---|---|---|---|
| OntologerMed v0.2 (0.8B) | 0.8B | 0.3333 | 0.6707 | 0.9476 | 0.8164 |
| OntologerMed v0.1 (0.8B) ‡ | 0.8B | 0.9200 ‡ | 0.9923 ‡ | 0.5903 ‡ | 0.7530 ‡ |
| BioGPT-Large | 1.5B | 0.7820 | 0.4498 | 0.4076 | 0.3842 |
| GPT-5.4 | ~1T+ | TBD | TBD | TBD | TBD |
| Claude Sonnet 4.6 | undisclosed | TBD | TBD | TBD | TBD |
| Claude Opus 4.6 | undisclosed | TBD | TBD | TBD | TBD |
| Gemini 3.1 Flash | undisclosed | TBD | TBD | TBD | TBD |
| Gemini 3.1 Pro | undisclosed | TBD | TBD | TBD | TBD |
Frontier model evaluations are being reworked. The current eval uses format-specific parsing (e.g. SERIOUS → event_name for AE extraction), which unfairly penalises models not trained on this output format. v0.3 will use an LLM-as-judge approach for fair comparison.
Intervention F1 is where the v0.2 model demonstrates the clearest advance over BioGPT-Large: 0.9476 vs 0.4076. The model reliably formats multi-arm trial outputs as `Arm N: drug name` lines. Full methodology in TEST_MODELS.md.
Business Use & Applications
OntologerMed-ClinicalTrials is purpose-built for teams that work with clinical trial data at scale: from biotech R&D desks to investment analysts to clinical operations platforms.
Pharma & Biotech R&D
- Competitive intelligence: automatically extract, compare, and summarise competitor trial designs, endpoints, and outcomes across hundreds of studies
- Pipeline analysis: assess the probability-weighted outcomes of a disease area's entire trial landscape
- Protocol development support: query historical trials for precedent eligibility criteria, endpoint selection, and dosing rationale
- Adverse event signal detection: surface recurring safety patterns across a drug class or indication
Clinical Research Organisations (CROs)
- Protocol feasibility: instantly parse eligibility criteria and identify likely recruitment bottlenecks
- Regulatory document drafting: generate structured summaries of trial outcomes for FDA submissions and IND/NDA support packages
- Site selection: match trials to patient populations by automatically extracting demographic and disease criteria
Digital Health & Trial Matching Platforms
- Patient-trial matching: determine in natural language whether a patient's profile satisfies a trial's inclusion/exclusion criteria
- Eligibility screening chatbots: power plain-language Q&A interfaces on top of ClinicalTrials.gov data
- Trial discovery: given a diagnosis, surface and summarise the most relevant active or completed trials
Investment & Due Diligence
- Biotech investment analysis: summarise a company's entire trial portfolio, including phase, outcomes, and adverse event profile
- M&A diligence: rapidly characterise an acquisition target's clinical evidence base across hundreds of registered studies
- Pipeline valuation inputs: extract structured outcome data to feed probability-of-technical-success (POTS) models
Regulatory & Medical Affairs
- Safety surveillance: structured adverse event extraction across a compound's complete trial history
- MSL briefings: auto-generate scientific summaries for field medical teams
- Evidence synthesis: aggregate outcome data across trials for systematic review and meta-analysis support
Example Questions & Answers
Outcome Assessment
Q: Based on NCT02847000 (a Phase 3 RCT of pembrolizumab vs chemotherapy in 305 patients with PD-L1-positive NSCLC), did the trial meet its primary endpoint?
A: This trial met its primary endpoint. The study demonstrated that pembrolizumab produced a statistically significant improvement in progression-free survival compared to platinum-based chemotherapy in patients with PD-L1 tumour proportion score ≥50% (median PFS 10.3 months vs 6.0 months; HR 0.50, 95% CI 0.37–0.68; p<0.001). Overall survival also favoured the pembrolizumab arm. The trial results supported regulatory approval of pembrolizumab as first-line monotherapy in this population.
VERDICT: positive
Conversational Q&A (Class B, v0.2)
Q: What is the typical Phase 2 success rate for oncology trials, and what drives the variance?
A: Phase 2 oncology trials historically succeed (advance to Phase 3) at rates of 30–45%, with significant variance driven by: indication specificity (rare cancers with validated biomarkers achieve higher rates), endpoint selection (OS vs PFS vs ORR), patient population homogeneity, and whether Phase 1 established a clear dose-response signal. Targeted therapies in biomarker-selected populations consistently outperform broad histology trials. Immunotherapy combinations show wide variance due to unpredictable synergy and toxicity profiles. Basket trials and adaptive designs modestly improve success rates by enabling early arm termination.
PICO Extraction
Q: Extract the PICO elements from this trial: Phase 2 RCT enrolling adults aged 40–75 with moderate-to-severe COPD (FEV1 30–70% predicted), randomising 240 participants to once-daily tiotropium 18mcg inhaler vs placebo for 52 weeks. Primary endpoint: change from baseline in trough FEV1 at Week 52.
A: Population: Adults aged 40–75 with moderate-to-severe COPD (FEV1 30–70% predicted); 240 participants enrolled
Intervention: Tiotropium 18mcg once-daily via HandiHaler inhaler for 52 weeks
Comparison: Matched placebo inhaler once daily for 52 weeks
Outcome: Change from baseline in trough FEV1 (L) at Week 52, measured by spirometry
Adverse Event Extraction
Q: Extract the structured adverse event profile from NCT01234567 (Drug X vs placebo, n=420).
A:
SERIOUS → Pneumonia: 8/210
SERIOUS → Pulmonary embolism: 3/210
SERIOUS → Elevated liver enzymes (Grade 3+): 6/210
OTHER → Nausea: 47/210
OTHER → Fatigue: 39/210
OTHER → Headache: 28/210
OTHER → Diarrhoea: 22/210
OTHER → Injection site reaction: 18/210

Serious adverse event rate in the treatment arm was 8.1%, compared to 3.8% in the placebo arm. No treatment-related deaths were reported.
Eligibility Summary
Q: Summarise the eligibility criteria for NCT03215823 in plain language for a patient with Type 2 diabetes, aged 58, eGFR 42, and no history of cardiovascular events.
A: You may be eligible. This trial is looking for adults with Type 2 diabetes who have not achieved adequate glucose control on current medication. Key requirements:
- Age 18 or older ✓ (you are 58)
- Type 2 diabetes diagnosis ✓
- Kidney function (eGFR) ≥ 30 mL/min/1.73m² ✓ (yours is 42)
- No prior heart attack, stroke, or hospitalisation for heart failure in the last 3 months ✓
Possible concern: The trial excludes patients currently on insulin therapy. If you are on insulin, you would not qualify. Please confirm your current medications with the trial coordinator.
Intervention Comparison
Q: Compare the treatment arms in a Phase 3 trial of baricitinib vs adalimumab vs placebo in 1,305 patients with moderate-to-severe rheumatoid arthritis.
A: Arm 1: DRUG: Baricitinib 4mg once daily (oral) + methotrexate background therapy
Arm 2: DRUG: Adalimumab 40mg subcutaneous injection every 2 weeks + methotrexate background therapy
Arm 3: DRUG: Placebo (oral + subcutaneous) + methotrexate background therapy
Baricitinib demonstrated superior ACR20 response rates at Week 12 compared to adalimumab (70% vs 61%) and placebo (40%), with faster onset of action. All arms continued background methotrexate throughout.
Prompt Format & Best Practices
What the model needs
The model is trained on ChatML format with a system prompt, trial text pasted into the user turn, and a task-specific instruction. Always include:
- A system prompt identifying the assistant as OntologerMed
- The trial text (copy-paste from ClinicalTrials.gov or your data source) in the user turn
- A specific task instruction: extraction, summarisation, or a question
Use temperature 0 (greedy decoding) for structured extraction tasks. For conversational Q&A, temperatures up to 0.3 are fine.
Class A – Structured Extraction (example prompts)
Extract the PICO elements from the following trial description: [trial text]
Extract structured adverse event data from the following trial. Format as: SERIOUS → event_name or OTHER → event_name: [trial text]
Compare the treatment arms in the following trial: [trial text]
Summarise the eligibility criteria for the following trial in plain language: [trial text]
Classify the phase of the following trial and explain why: [trial text]
Did this trial meet its primary endpoint? [trial text]
Extract the interventions used in each arm of the following trial: [trial text]
Generate a structured summary of the following trial: [trial text]
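Downstream code can parse the adverse-event output format requested above back into structured records. A minimal sketch, assuming the `SEVERITY → event: n/N` line shape shown in the example outputs on this card (the optional count suffix is an assumption about how results-bearing trials are rendered):

```python
import re

# Matches lines like "SERIOUS → Pneumonia: 8/210" or "OTHER → Nausea: 47/210".
# The count suffix is optional; both the unicode arrow and ASCII "->" are
# accepted. Line shape is assumed from the examples in this card.
AE_LINE = re.compile(r"^(SERIOUS|OTHER)\s*(?:→|->)\s*(.+?)(?::\s*(\d+)/(\d+))?$")

def parse_adverse_events(text: str) -> list[dict]:
    """Parse model output lines into structured adverse-event records."""
    events = []
    for line in text.splitlines():
        m = AE_LINE.match(line.strip())
        if m:
            severity, name, n, total = m.groups()
            events.append({
                "severity": severity,
                "event": name.strip(),
                "count": int(n) if n else None,
                "denominator": int(total) if total else None,
            })
    return events
```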
Class B – Conversational Q&A (example prompts)
What is the typical Phase 2 success rate for oncology trials?
What does the mechanism of action of [drug] suggest about its likely side effect profile?
What eligibility criteria would typically exclude patients with moderate renal impairment from this type of trial?
How does a basket trial design differ from a platform trial?
What does a hazard ratio of 0.65 mean in a survival analysis?
What are the main reasons Phase 3 oncology trials fail?
How long does a typical Phase 3 oncology trial take from first enrolment to data lock?
What is the difference between overall survival and progression-free survival as primary endpoints?
What does FDA breakthrough therapy designation mean for trial timelines?
Why might a trial use an adaptive design instead of a fixed two-arm design?
What questions should I ask when reviewing a competitor's Phase 2 results?
How do I interpret a statistically significant result with a small effect size?
Real outputs from v0.2 (tested 2026-03-12)
Q: What is an adaptive trial design?
An adaptive trial design is a type of trial design that allows the study team to modify the study design based on the results of the study. This is done to improve the study's ability to detect treatment effects and to reduce the number of participants needed to achieve the study's primary outcome.
Q: What is a surrogate endpoint?
A surrogate endpoint is a clinical outcome that is not the primary endpoint of a clinical trial. Instead, it is used as a surrogate marker β considered to be a good predictor of the primary outcome. The surrogate marker is not the primary outcome itself. The surrogate marker is used to measure the effect of the intervention.
Q: What is the difference between Phase 1 and Phase 3?
Phase 1: To evaluate the safety and tolerability of a single dose of a new drug, and to determine the maximum tolerated dose (MTD). Phase 3: To evaluate the efficacy and safety of a drug in a large, randomised, controlled trial.
Q: What drugs are being tested for obesity?
This Trial is a Phase 2 Study of the Safety and Efficacy of the Combination of Semaglutide and Tirzepatide in Participants With Obesity. The drugs involved in this study are: Semaglutide, Tirzepatide. Participants will be randomly assigned to: Semaglutide 2.4 mg + Tirzepatide 5 mg, or Semaglutide 2.4 mg + Tirzepatide 10 mg. Once weekly, 52 weeks, ~100 participants.
What the model is not good at (v0.2)
Corpus-level aggregate questions: the model was trained on individual trial Q&A, not corpus synthesis. These questions currently fail:
- "What predicts success in Phase 3 cardiovascular outcome trials?" → echoes the question back
- "What is the success rate for cancer drugs in Phase 3?" → hallucinated 80%
- "Which companies run the most clinical trials?" → describes a single unrelated trial
- "Are there any trials for long COVID?" → "Yes, this trial has relevant data on long COVID." (one sentence)
- "How many cancer trials use immunotherapy?" → invents one trial
What to use instead (v0.2): Provide trial text in the prompt. The model answers questions about a specific trial, not across the corpus. Aggregate landscape questions are on the v0.3 roadmap.
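The recommended pattern (ground every question in a specific trial's text rather than asking across the corpus) can be wrapped in a small helper. The prompt wording below is illustrative, not a prescribed template; any clear instruction that pairs the trial text with the question works:

```python
# Helper for the recommended usage pattern: paste one trial's text into the
# prompt and ask about that trial. Wording is an illustrative assumption.
def grounded_prompt(trial_text: str, question: str) -> str:
    """Build a single-trial, grounded user prompt for the model."""
    return (
        "Here is a clinical trial record from ClinicalTrials.gov:\n\n"
        f"{trial_text}\n\n"
        f"Based only on this trial, answer: {question}"
    )
```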
Other current limitations:
- Bare prompts without trial context: outputs may be off-distribution without trial text
- Non-ClinicalTrials.gov data: trained exclusively on ClinicalTrials.gov formats
- Live data: no knowledge of trials after training cutoff
- General medical questions: stay within the clinical trial domain
Full test results (59 prompts with outputs) are in TODO_FOR_0.3.md in the repo. v0.3 will add corpus-level aggregate training data to fix the aggregate Q&A failure mode.
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Ontologer/OntologerMed-ClinicalTrials-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

SYSTEM = (
    "You are OntologerMed, a clinical trials intelligence assistant trained on 550,000+ "
    "ClinicalTrials.gov studies. Answer questions accurately and concisely based on clinical "
    "trial data. If you don't have specific information, say so clearly."
)

def ask(user_prompt, max_new_tokens=512):
    text = (
        f"<|im_start|>system\n{SYSTEM}<|im_end|>\n"
        f"<|im_start|>user\n{user_prompt}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# Structured extraction
print(ask("Extract the PICO elements from NCT02847000, a Phase 3 pembrolizumab vs chemotherapy trial in NSCLC."))

# Conversational Q&A
print(ask("What is the typical timeline from Phase 2 completion to Phase 3 start for oncology programmes?"))
```
Hardware requirements: 16GB VRAM minimum (inference). The model uses `trust_remote_code=True` due to the Gated DeltaNet hybrid architecture. Requires `flash-linear-attention` for fast inference; falls back to PyTorch (~15× slower) without it.
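A quick way to check at runtime which inference path you will get. The import name `fla` is an assumption based on the flash-linear-attention project's package layout; verify it against the version you install:

```python
import importlib.util

def has_fast_kernels() -> bool:
    """Return True if flash-linear-attention appears importable.
    The `fla` import name is an assumption; check your installed version."""
    return importlib.util.find_spec("fla") is not None

if not has_fast_kernels():
    print("flash-linear-attention not found; inference will fall back to "
          "the much slower pure-PyTorch path.")
```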
Limitations
- Template-generated SFT responses: Class A instruction data was derived programmatically from ClinicalTrials.gov structured fields. Class B conversational data was expanded from category seeds. Response quality reflects data completeness in the underlying registry.
- English only: Trained exclusively on English-language trial records.
- Not a clinical decision tool: This model is not validated for clinical, regulatory, or patient-facing use. Do not use for medical decisions.
- Distribution shift: Performance degrades on trials with sparse documentation, non-standard formatting, or from registries outside ClinicalTrials.gov.
- Small scale: At 0.8B parameters, complex multi-step reasoning tasks may require prompt engineering or retrieval augmentation.
- Outcome accuracy under natural prompts: Currently 33% on the 50-sample eval set, a known weak point from prompt-distribution shift between v0.1 and v0.2. Top priority for v0.3.
- Frontier model comparison methodology: The current eval uses format-specific parsing which penalises models not trained on our output format. Frontier model results are marked TBD pending a reworked LLM-as-judge evaluation in v0.3.
Part of the OntologerMed Suite
| Model | Role |
|---|---|
| OntologerMed-ClinicalTrials-Instruct | Domain LM: generative reasoning, extraction, and summarisation over trial text |
| FATE-ClinicalTrials-Outcome-256 | Outcome-shaped embedding: similarity by historical success/failure pattern |
| MOAt-ClinicalTrials-MoA-256 | Mechanism-of-action embedding: similarity by biological pathway |
| PACT-ClinicalTrials-Pop-256 | Population embedding: similarity by patient demographics and disease |
| ORACLE-ClinicalTrials-SuccessProb-v1 | Classifier: calibrated probability estimate combining all three embedding dimensions |
Citation
```bibtex
@misc{ontologermed-clinicaltrials-2026,
  title = {OntologerMed-ClinicalTrials-Instruct: A Domain-Adapted Generative Language Model for Clinical Trial Intelligence},
  author = {Mishra, Sid and Ontologer},
  year = {2026},
  note = {Two-stage training: continued pre-training + LoRA SFT v0.2 on ClinicalTrials.gov. Inspired by BioGPT (Luo et al., 2022).},
  howpublished = {\url{https://huggingface.co/Ontologer/OntologerMed-ClinicalTrials-Instruct}}
}
```
BioGPT (inspiration):
```bibtex
@article{luo2022biogpt,
  author = {Luo, Renqian and Sun, Liai and Xia, Yingce and Qin, Tao and Zhang, Sheng and Poon, Hoifung and Liu, Tie-Yan},
  title = {BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining},
  journal = {Briefings in Bioinformatics},
  volume = {23},
  number = {6},
  year = {2022},
  doi = {10.1093/bib/bbac409}
}
```
Guardrails
- Not medical, clinical, or regulatory advice
- Not validated for patient-facing or clinical decision support use
- SFT responses are template-derived or seed-expanded; quality depends on source data completeness
- Always pair model outputs with domain expertise and independent verification
About
Sid Mishra β Founder, Ontologer Β· Convixion AI
Sid is the founder of several AI-native and AI-powered startups and initiatives, based in Singapore. He founded Ontologer as the dedicated AI research arm of Convixion AI. Ontologer performs every step of model development in-house: data pipelines, dataset curation, training infrastructure, evaluation frameworks, and production deployment.
| Contact | |
|---|---|
| Site | ontologer.com |
| Email | sid@ontologer.com |
| LinkedIn | linkedin.com/in/sid-m-427b9865 |