
OntologerMed-ClinicalTrials-Instruct

A domain-adapted generative language model for clinical trial intelligence.

OntologerMed-ClinicalTrials-Instruct is a small, efficient language model trained end-to-end on the complete ClinicalTrials.gov corpus. It understands the structure, language, and clinical reasoning patterns of over 550,000 registered trials — and can answer specific questions about them.

This is v0.2 of the model, which extends the structured extraction capabilities of v0.1 with broad conversational Q&A covering 27 question categories drawn from real clinical trial intelligence workflows.

Part of the OntologerMed suite — a family of purpose-built models for clinical trial analysis, each trained to understand a different dimension of trial intelligence.


Inspiration and Prior Work

This model was directly inspired by BioGPT (Luo et al., 2022, Microsoft Research), which demonstrated that generative domain-specific pre-training on biomedical literature produces strong results on tasks where BERT-style discriminative models fall short.

BioGPT's key insight — that a GPT-style model pre-trained on domain text outperforms general-purpose models on biomedical generation and QA — translates directly to the clinical trial domain. Where BioGPT drew from PubMed abstracts and biomedical journal text, OntologerMed-ClinicalTrials draws from the structured, regulatory-grade text of ClinicalTrials.gov: study protocols, eligibility criteria, intervention descriptions, outcome definitions, and adverse event summaries.

| Dimension | BioGPT (Microsoft) | OntologerMed-ClinicalTrials-Instruct |
|---|---|---|
| Domain | Biomedical literature (PubMed) | Clinical trials (ClinicalTrials.gov) |
| Architecture | GPT-2 style Transformer | Qwen3.5 hybrid (Transformer + SSM) |
| Pre-training corpus | PubMed abstracts & full-text | 551,717 registered clinical studies |
| Fine-tuning | PubMedQA (274k pairs) | 877,386 instruction examples, 35 task types |
| Core capability | Biomedical QA and text generation | Clinical trial reasoning, extraction, and Q&A |
| Scale | Large (1.5B params) | Compact (0.8B params) |

Where BioGPT achieves 78.2% accuracy on PubMedQA, OntologerMed-ClinicalTrials targets analogous capabilities within the narrower, more structured domain of clinical trial documents — trading breadth for depth, and scale for efficiency.


Model Overview

| Property | Value |
|---|---|
| Base model | Qwen/Qwen3.5-0.8B-Base (Apache 2.0) |
| Architecture | Hybrid Transformer + SSM (Gated DeltaNet, 3:1 linear:full attention) |
| Parameters | ~0.8B |
| Training stages | 2 (Continued Pre-Training → LoRA SFT v0.2) |
| Pre-training corpus | 551,717 ClinicalTrials.gov studies |
| SFT corpus | 877,386 instruction examples |
| Task types | 35 (8 structured extraction + 27 conversational Q&A) |
| License | CC BY-NC-ND 4.0 |

Training Pipeline

Stage 1 — Continued Pre-Training (CPT)

The base Qwen3.5-0.8B model underwent continued pre-training on a full-corpus rendering of ClinicalTrials.gov. Each study was converted into a structured plain-text document covering: study title, brief and detailed summaries, interventions, eligibility criteria, primary and secondary outcomes, results, and adverse events. Studies with insufficient text were filtered out.

  • Corpus: 551,717 training documents, 11,165 held out for evaluation
  • Objective: Causal language modeling (LoRA, rank 32)
  • Hardware: NVIDIA H100 SXM 80GB
  • Wall-clock time: ~10.5 hours
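The per-study document rendering described above can be sketched roughly as follows. The dictionary keys are illustrative placeholders for this sketch, not the actual ClinicalTrials.gov schema:

```python
def render_study(study: dict) -> str:
    """Render one study record into a plain-text CPT document.

    Section order mirrors the fields listed above; the key names are
    hypothetical, not the registry's real field names.
    """
    sections = [
        ("Title", study.get("title")),
        ("Brief Summary", study.get("brief_summary")),
        ("Detailed Description", study.get("detailed_description")),
        ("Interventions", study.get("interventions")),
        ("Eligibility Criteria", study.get("eligibility_criteria")),
        ("Primary Outcomes", study.get("primary_outcomes")),
        ("Secondary Outcomes", study.get("secondary_outcomes")),
        ("Results", study.get("results")),
        ("Adverse Events", study.get("adverse_events")),
    ]
    # Skip empty sections so sparsely documented studies still render cleanly.
    return "\n\n".join(f"{name}:\n{text}" for name, text in sections if text)
```

Studies whose rendered text fell below a length threshold would then be filtered before training.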

Stage 2 — LoRA Supervised Fine-Tuning v0.2 (SFT)

The CPT-adapted model was instruction-tuned using LoRA on 877,386 ChatML-formatted examples across 35 task types, in two classes:

Class A — Structured Extraction (8 tasks):

| Task | Description |
|---|---|
| pico_extraction | Extract Population, Intervention, Comparison, Outcome from a study |
| eligibility_summary | Summarise inclusion/exclusion criteria in plain language |
| trial_summarization | Generate a structured summary of a trial |
| condition_matching | Determine if a trial is relevant to a given condition |
| outcome_success | Assess reported outcome results against stated endpoints |
| adverse_event_extraction | Extract and structure adverse event data |
| intervention_comparison | Compare trial arms and interventions |
| phase_classification | Classify trial phase with contextual explanation |

Class B — Conversational Q&A (27 categories):

Drug mechanism, dosing and administration, eligibility interpretation, treatment comparison, safety questions, trial design rationale, outcome interpretation, recruitment questions, sponsor and phase context, statistical literacy, protocol navigation, patient-facing plain-language explanations, and 15 additional clinical intelligence categories — covering the full range of questions asked by pharma analysts, CROs, investigators, and patients.

  • Training examples: 833,517 (train) + 43,869 (eval)
  • Format: ChatML (<|im_start|>system/user/assistant<|im_end|>)
  • Adapter method: LoRA rank 96, alpha 192, applied to all projection modules
  • Hardware: NVIDIA H100 SXM 80GB
  • Wall-clock time: ~11.25 hours (15,522 steps, 2 epochs)
  • Final eval loss: 0.694

Evaluation Results

Methodology

All evaluations use the merged model at temperature 0 (greedy decoding), on 50 samples drawn from the held-out sft_eval.jsonl split (random seed 42). Evaluation uses identical prompts and parsers across all models β€” structured natural-language instructions with no SFT-format scaffolding. This is the same pipeline used to evaluate frontier models.

What each metric measures:

Outcome Accuracy — 3-class classification: did the trial meet its primary endpoint (positive), fail it (negative), or was the result ambiguous (inconclusive)? Score is exact-match accuracy against reference labels.

Adverse Event F1 — Entity-level F1 on extracted (severity, event_name) pairs. Requires structured line output (SERIOUS — event_name). Frontier models fail this metric not due to lack of capability but due to format non-compliance.

Intervention F1 — Entity-level F1 on extracted (arm_name, drug) pairs across trial arms. Strict string match; a lower bound on true comprehension.

PICO Macro F1 — Token-level F1 averaged across all four PICO elements (Population, Intervention, Comparison, Outcome). Rewards recall of key terms.
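For concreteness, the entity-level F1 used for the adverse event and intervention metrics can be sketched as a set-based exact-match score. This is a simplified reading of the metric definitions above, not the project's actual harness (which lives in TEST_MODELS.md):

```python
def entity_f1(predicted, reference):
    """Set-based entity-level F1 over extracted pairs, e.g. (severity, event_name)."""
    pred, ref = set(predicted), set(reference)
    if not pred or not ref:
        return 0.0
    tp = len(pred & ref)                          # exact-match true positives
    precision, recall = tp / len(pred), tp / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

pred = [("SERIOUS", "pneumonia"), ("OTHER", "nausea"), ("OTHER", "rash")]
ref  = [("SERIOUS", "pneumonia"), ("OTHER", "nausea"), ("OTHER", "fatigue")]
print(entity_f1(pred, ref))  # 2 TP, precision 2/3, recall 2/3 -> 0.666...
```

Because matching is strict string equality, a semantically correct but differently worded extraction scores zero, which is why these figures are lower bounds on comprehension.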

v0.2 Benchmark Results (natural-language prompts, n=50, seed=42)

Same evaluation pipeline used across all models: identical human-style prompts, temperature 0, no SFT format scaffolding. Note: frontier model comparisons are being reworked (see Frontier Model Comparison table note below).

| BioGPT Benchmark | Clinical Trial Equivalent | Metric | OntologerMed v0.2 | OntologerMed v0.1 * | BioGPT-Large |
|---|---|---|---|---|---|
| PubMedQA (yes/no/maybe) | outcome_success | Accuracy | 0.3333 | 0.9200 | 0.7820 |
| BC5CDR Relation Extraction | adverse_event_extraction | Entity F1 | 0.6707 | 0.9923 | 0.4498 |
| DDI Drug Interaction | intervention_comparison | Entity F1 | 0.9476 | 0.5903 | 0.4076 |
| KD-DTI Drug-Target | pico_extraction | Macro F1 | 0.8164 | 0.7530 | 0.3842 |

* v0.1 scores were measured with SFT-formatted prompts — the exact format the model was trained on. v0.2 uses natural-language prompts, making its scores directly comparable to the frontier models below.

PICO per-element (v0.2): Population 1.00 · Intervention 1.00 · Comparison 0.928 · Outcome 0.337

The outcome accuracy drop reflects prompt-distribution shift (SFT prompts → natural language), not capability loss. Intervention F1 improved from 0.59 → 0.95 and PICO from 0.75 → 0.82 — the v0.2 training with 27 Q&A categories substantially improved structured extraction precision.

Frontier Model Comparison

| Model | Params | Outcome Acc | AE F1 | Interv F1 | PICO F1 |
|---|---|---|---|---|---|
| OntologerMed v0.2 | 0.8B | 0.3333 | 0.6707 | 0.9476 | 0.8164 |
| OntologerMed v0.1 * | 0.8B | 0.9200 | 0.9923 | 0.5903 | 0.7530 |
| BioGPT-Large | 1.5B | 0.7820 | 0.4498 | 0.4076 | 0.3842 |
| GPT-5.4 | ~1T+ | TBD | TBD | TBD | TBD |
| Claude Sonnet 4.6 | — | TBD | TBD | TBD | TBD |
| Claude Opus 4.6 | — | TBD | TBD | TBD | TBD |
| Gemini 3.1 Flash | — | TBD | TBD | TBD | TBD |
| Gemini 3.1 Pro | — | TBD | TBD | TBD | TBD |

* Measured with SFT-formatted prompts, the exact format v0.1 was trained on.

Frontier model evaluations are being reworked. The current eval uses format-specific parsing (e.g. SERIOUS — event_name for AE extraction), which unfairly penalises models not trained on this output format. v0.3 will use an LLM-as-judge approach for fair comparison.

Intervention F1 is where the v0.2 model demonstrates the clearest advance over BioGPT-Large: 0.9476 vs 0.4076. The model reliably formats multi-arm trial outputs as Arm N: drug name lines. Full methodology in TEST_MODELS.md.


Business Use & Applications

OntologerMed-ClinicalTrials is purpose-built for teams that work with clinical trial data at scale — from biotech R&D desks to investment analysts to clinical operations platforms.

Pharma & Biotech R&D

  • Competitive intelligence — automatically extract, compare, and summarise competitor trial designs, endpoints, and outcomes across hundreds of studies
  • Pipeline analysis — assess the probability-weighted outcomes of a disease area's entire trial landscape
  • Protocol development support — query historical trials for precedent eligibility criteria, endpoint selection, and dosing rationale
  • Adverse event signal detection — surface recurring safety patterns across a drug class or indication

Clinical Research Organisations (CROs)

  • Protocol feasibility — instantly parse eligibility criteria and identify likely recruitment bottlenecks
  • Regulatory document drafting — generate structured summaries of trial outcomes for FDA submissions and IND/NDA support packages
  • Site selection — match trials to patient populations by automatically extracting demographic and disease criteria

Digital Health & Trial Matching Platforms

  • Patient-trial matching — determine in natural language whether a patient's profile satisfies a trial's inclusion/exclusion criteria
  • Eligibility screening chatbots — power plain-language Q&A interfaces on top of ClinicalTrials.gov data
  • Trial discovery — given a diagnosis, surface and summarise the most relevant active or completed trials

Investment & Due Diligence

  • Biotech investment analysis — summarise a company's entire trial portfolio, including phase, outcomes, and adverse event profile
  • M&A diligence — rapidly characterise an acquisition target's clinical evidence base across hundreds of registered studies
  • Pipeline valuation inputs — extract structured outcome data to feed probability-of-technical-success (POTS) models

Regulatory & Medical Affairs

  • Safety surveillance — structured adverse event extraction across a compound's complete trial history
  • MSL briefings — auto-generate scientific summaries for field medical teams
  • Evidence synthesis — aggregate outcome data across trials for systematic review and meta-analysis support

Example Questions & Answers

Outcome Assessment

Q: Based on NCT02847000 — a Phase 3 RCT of pembrolizumab vs chemotherapy in 305 patients with PD-L1-positive NSCLC — did the trial meet its primary endpoint?

A: This trial met its primary endpoint. The study demonstrated that pembrolizumab produced a statistically significant improvement in progression-free survival compared to platinum-based chemotherapy in patients with PD-L1 tumour proportion score ≥50% (median PFS 10.3 months vs 6.0 months; HR 0.50, 95% CI 0.37–0.68; p<0.001). Overall survival also favoured the pembrolizumab arm. The trial results supported regulatory approval of pembrolizumab as first-line monotherapy in this population.

VERDICT: positive


Conversational Q&A (Class B — v0.2)

Q: What is the typical Phase 2 success rate for oncology trials, and what drives the variance?

A: Phase 2 oncology trials historically succeed (advance to Phase 3) at rates between 30–45%, with significant variance driven by: indication specificity (rare cancers with validated biomarkers achieve higher rates), endpoint selection (OS vs PFS vs ORR), patient population homogeneity, and whether Phase 1 established a clear dose-response signal. Targeted therapies in biomarker-selected populations consistently outperform broad histology trials. Immunotherapy combinations show wide variance due to unpredictable synergy and toxicity profiles. Basket trials and adaptive designs modestly improve success rates by enabling early arm termination.


PICO Extraction

Q: Extract the PICO elements from this trial: Phase 2 RCT enrolling adults aged 40–75 with moderate-to-severe COPD (FEV1 30–70% predicted), randomising 240 participants to once-daily tiotropium 18mcg inhaler vs placebo for 52 weeks. Primary endpoint: change from baseline in trough FEV1 at Week 52.

A: Population: Adults aged 40–75 with moderate-to-severe COPD (FEV1 30–70% predicted); 240 participants enrolled

Intervention: Tiotropium 18mcg once-daily via HandiHaler inhaler for 52 weeks

Comparison: Matched placebo inhaler once daily for 52 weeks

Outcome: Change from baseline in trough FEV1 (L) at Week 52, measured by spirometry
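Downstream code can turn this Key: value answer format into a dict with a small parser. The line-per-element layout assumed here matches the example answer above:

```python
import re

# One PICO element per line, formatted "Key: value" as in the answer above.
_PICO_RE = re.compile(r"^(Population|Intervention|Comparison|Outcome):\s*(.+)$")

def parse_pico(answer: str) -> dict:
    """Collect PICO elements from a model answer into a dict."""
    elements = {}
    for line in answer.splitlines():
        m = _PICO_RE.match(line.strip())
        if m:
            elements[m.group(1)] = m.group(2).strip()
    return elements

answer = (
    "Population: Adults aged 40-75 with moderate-to-severe COPD\n\n"
    "Intervention: Tiotropium 18mcg once daily for 52 weeks\n\n"
    "Comparison: Matched placebo inhaler\n\n"
    "Outcome: Change from baseline in trough FEV1 at Week 52\n"
)
parsed = parse_pico(answer)
```

Lines that do not start with one of the four PICO keys are ignored, so extra commentary in the answer does not break parsing.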


Adverse Event Extraction

Q: Extract the structured adverse event profile from NCT01234567 (Drug X vs placebo, n=420).

A:

SERIOUS — Pneumonia: 8/210
SERIOUS — Pulmonary embolism: 3/210
SERIOUS — Elevated liver enzymes (Grade 3+): 6/210
OTHER — Nausea: 47/210
OTHER — Fatigue: 39/210
OTHER — Headache: 28/210
OTHER — Diarrhoea: 22/210
OTHER — Injection site reaction: 18/210

Serious adverse event rate in the treatment arm was 8.1%, compared to 3.8% in the placebo arm. No treatment-related deaths were reported.
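When consuming these structured lines programmatically, a small parser for the severity/event/count format can look like this sketch (the n/N count suffix is treated as optional, since not every trial reports denominators):

```python
import re

# Matches lines like "SERIOUS — Pneumonia: 8/210"; the ": n/N" suffix is optional.
_AE_RE = re.compile(r"^(SERIOUS|OTHER)\s*[—-]\s*(.+?)(?::\s*(\d+)/(\d+))?$")

def parse_adverse_events(text: str):
    """Parse severity-tagged AE lines into (severity, event, count, denom) tuples."""
    events = []
    for line in text.splitlines():
        m = _AE_RE.match(line.strip())
        if m:
            severity, event, n, d = m.groups()
            events.append((severity, event.strip(),
                           int(n) if n else None,
                           int(d) if d else None))
    return events

sample = "SERIOUS — Pneumonia: 8/210\nOTHER — Nausea: 47/210\nOTHER — Rash"
events = parse_adverse_events(sample)
```

Lines that do not start with a severity tag (such as the summary sentence above) are simply skipped.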


Eligibility Summary

Q: Summarise the eligibility criteria for NCT03215823 in plain language for a patient with Type 2 diabetes, aged 58, eGFR 42, and no history of cardiovascular events.

A: You may be eligible. This trial is looking for adults with Type 2 diabetes who have not achieved adequate glucose control on current medication. Key requirements:

  • Age 18 or older ✓ (you are 58)
  • Type 2 diabetes diagnosis ✓
  • Kidney function (eGFR) ≥ 30 mL/min/1.73m² ✓ (yours is 42)
  • No prior heart attack, stroke, or hospitalisation for heart failure in the last 3 months ✓

Possible concern: The trial excludes patients currently on insulin therapy. If you are on insulin, you would not qualify. Please confirm your current medications with the trial coordinator.


Intervention Comparison

Q: Compare the treatment arms in a Phase 3 trial of baricitinib vs adalimumab vs placebo in 1,305 patients with moderate-to-severe rheumatoid arthritis.

A: Arm 1: DRUG: Baricitinib 4mg once daily (oral) + methotrexate background therapy

Arm 2: DRUG: Adalimumab 40mg subcutaneous injection every 2 weeks + methotrexate background therapy

Arm 3: DRUG: Placebo (oral + subcutaneous) + methotrexate background therapy

Baricitinib demonstrated superior ACR20 response rates at Week 12 compared to adalimumab (70% vs 61%) and placebo (40%), with faster onset of action. All arms continued background methotrexate throughout.


Prompt Format & Best Practices

What the model needs

The model is trained on ChatML format with a system prompt, trial text pasted into the user turn, and a task-specific instruction. Always include:

  1. A system prompt identifying the assistant as OntologerMed
  2. The trial text (copy-paste from ClinicalTrials.gov or your data source) in the user turn
  3. A specific task instruction β€” extraction, summarisation, or a question

Use temperature 0 (greedy decoding) for structured extraction tasks. For conversational Q&A, temperatures up to 0.3 are fine.
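The three required components can be assembled into a ChatML string like this sketch (the short system line here is a trimmed stand-in; the fuller system prompt in the Usage section below works the same way):

```python
def build_prompt(instruction: str, trial_text: str = "") -> str:
    """Assemble the ChatML prompt the model was trained on:
    system identity, then trial text plus a task instruction in the user turn."""
    system = "You are OntologerMed, a clinical trials intelligence assistant."
    user = f"{instruction}\n\n{trial_text}".strip()
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

prompt = build_prompt(
    "Extract the PICO elements from the following trial description:",
    "Phase 2 RCT of tiotropium vs placebo in moderate-to-severe COPD...",
)
```

Pair a prompt like this with greedy decoding (do_sample=False) for extraction; for conversational Q&A, sampling at temperature up to 0.3 is reasonable.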

Class A β€” Structured Extraction (example prompts)

Extract the PICO elements from the following trial description: [trial text]

Extract structured adverse event data from the following trial. Format as: SERIOUS — event_name or OTHER — event_name: [trial text]

Compare the treatment arms in the following trial: [trial text]

Summarise the eligibility criteria for the following trial in plain language: [trial text]

Classify the phase of the following trial and explain why: [trial text]

Did this trial meet its primary endpoint? [trial text]

Extract the interventions used in each arm of the following trial: [trial text]

Generate a structured summary of the following trial: [trial text]

Class B β€” Conversational Q&A (example prompts)

What is the typical Phase 2 success rate for oncology trials?

What does the mechanism of action of [drug] suggest about its likely side effect profile?

What eligibility criteria would typically exclude patients with moderate renal impairment from this type of trial?

How does a basket trial design differ from a platform trial?

What does a hazard ratio of 0.65 mean in a survival analysis?

What are the main reasons Phase 3 oncology trials fail?

How long does a typical Phase 3 oncology trial take from first enrolment to data lock?

What is the difference between overall survival and progression-free survival as primary endpoints?

What does FDA breakthrough therapy designation mean for trial timelines?

Why might a trial use an adaptive design instead of a fixed two-arm design?

What questions should I ask when reviewing a competitor's Phase 2 results?

How do I interpret a statistically significant result with a small effect size?

Real outputs from v0.2 (tested 2026-03-12)

Q: What is an adaptive trial design?

An adaptive trial design is a type of trial design that allows the study team to modify the study design based on the results of the study. This is done to improve the study's ability to detect treatment effects and to reduce the number of participants needed to achieve the study's primary outcome.

Q: What is a surrogate endpoint?

A surrogate endpoint is a clinical outcome that is not the primary endpoint of a clinical trial. Instead, it is used as a surrogate marker β€” considered to be a good predictor of the primary outcome. The surrogate marker is not the primary outcome itself. The surrogate marker is used to measure the effect of the intervention.

Q: What is the difference between Phase 1 and Phase 3?

Phase 1: To evaluate the safety and tolerability of a single dose of a new drug, and to determine the maximum tolerated dose (MTD). Phase 3: To evaluate the efficacy and safety of a drug in a large, randomised, controlled trial.

Q: What drugs are being tested for obesity?

This Trial is a Phase 2 Study of the Safety and Efficacy of the Combination of Semaglutide and Tirzepatide in Participants With Obesity. The drugs involved in this study are: Semaglutide, Tirzepatide. Participants will be randomly assigned to: Semaglutide 2.4 mg + Tirzepatide 5 mg, or Semaglutide 2.4 mg + Tirzepatide 10 mg. Once weekly, 52 weeks, ~100 participants.

What the model is not good at (v0.2)

Corpus-level aggregate questions — the model was trained on individual trial Q&A, not corpus synthesis. These questions currently fail:

  • “What predicts success in Phase 3 cardiovascular outcome trials?” → echoes the question back
  • “What is the success rate for cancer drugs in Phase 3?” → hallucinated 80%
  • “Which companies run the most clinical trials?” → describes a single unrelated trial
  • “Are there any trials for long COVID?” → “Yes, this trial has relevant data on long COVID.” (one sentence)
  • “How many cancer trials use immunotherapy?” → invents one trial

What to use instead (v0.2): Provide trial text in the prompt. The model answers questions about a specific trial, not across the corpus. Aggregate landscape questions are on the v0.3 roadmap.

Other current limitations:

  • Bare prompts without trial context — outputs may be off-distribution without trial text
  • Non-ClinicalTrials.gov data — trained exclusively on ClinicalTrials.gov formats
  • Live data — no knowledge of trials after training cutoff
  • General medical questions — stay within the clinical trial domain

Full test results (59 prompts with outputs) are in TODO_FOR_0.3.md in the repo. v0.3 will add corpus-level aggregate training data to fix the aggregate Q&A failure mode.


Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Ontologer/OntologerMed-ClinicalTrials-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

SYSTEM = (
    "You are OntologerMed, a clinical trials intelligence assistant trained on 550,000+ "
    "ClinicalTrials.gov studies. Answer questions accurately and concisely based on clinical "
    "trial data. If you don't have specific information, say so clearly."
)

def ask(user_prompt, max_new_tokens=512):
    text = (
        f"<|im_start|>system\n{SYSTEM}<|im_end|>\n"
        f"<|im_start|>user\n{user_prompt}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# Structured extraction
print(ask("Extract the PICO elements from NCT02847000 — Phase 3 pembrolizumab vs chemotherapy in NSCLC."))

# Conversational Q&A
print(ask("What is the typical timeline from Phase 2 completion to Phase 3 start for oncology programmes?"))

Hardware requirements: 16GB VRAM minimum (inference). The model uses trust_remote_code=True due to the Gated DeltaNet hybrid architecture. Requires flash-linear-attention for fast inference; falls back to PyTorch (~15× slower) without it.
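A quick way to check for the fast path before loading the model is to probe for the package. The import name fla is an assumption here; verify it against the flash-linear-attention project's own documentation:

```python
import importlib.util

def has_flash_linear_attention() -> bool:
    """True if the flash-linear-attention package appears importable.

    Assumes the package's import name is "fla"; without it, generation
    falls back to the much slower pure-PyTorch path.
    """
    return importlib.util.find_spec("fla") is not None

if not has_flash_linear_attention():
    print("flash-linear-attention not found: expect the slow PyTorch fallback.")
```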


Limitations

  • Template-generated SFT responses: Class A instruction data was derived programmatically from ClinicalTrials.gov structured fields. Class B conversational data was expanded from category seeds. Response quality reflects data completeness in the underlying registry.
  • English only: Trained exclusively on English-language trial records.
  • Not a clinical decision tool: This model is not validated for clinical, regulatory, or patient-facing use. Do not use for medical decisions.
  • Distribution shift: Performance degrades on trials with sparse documentation, non-standard formatting, or from registries outside ClinicalTrials.gov.
  • Small scale: At 0.8B parameters, complex multi-step reasoning tasks may require prompt engineering or retrieval augmentation.
  • Outcome accuracy under natural prompts: Currently 33% on the 50-sample eval set — a known weak point from prompt-distribution shift between v0.1 and v0.2. Top priority for v0.3.
  • Frontier model comparison methodology: The current eval uses format-specific parsing which penalises models not trained on our output format. Frontier model results are marked TBD pending a reworked LLM-as-judge evaluation in v0.3.

Part of the OntologerMed Suite

| Model | Role |
|---|---|
| OntologerMed-ClinicalTrials-Instruct | Domain LM — generative reasoning, extraction, and summarisation over trial text |
| FATE-ClinicalTrials-Outcome-256 | Outcome-shaped embedding — similarity by historical success/failure pattern |
| MOAt-ClinicalTrials-MoA-256 | Mechanism-of-action embedding — similarity by biological pathway |
| PACT-ClinicalTrials-Pop-256 | Population embedding — similarity by patient demographics and disease |
| ORACLE-ClinicalTrials-SuccessProb-v1 | Classifier — calibrated probability estimate combining all three embedding dimensions |

Citation

@misc{ontologermed-clinicaltrials-2026,
  title        = {OntologerMed-ClinicalTrials-Instruct: A Domain-Adapted Generative Language Model for Clinical Trial Intelligence},
  author       = {Mishra, Sid and Ontologer},
  year         = {2026},
  note         = {Two-stage training: continued pre-training + LoRA SFT v2 on ClinicalTrials.gov. Inspired by BioGPT (Luo et al., 2022).},
  howpublished = {\url{https://huggingface.co/Ontologer/OntologerMed-ClinicalTrials-Instruct}}
}

BioGPT (inspiration):

@article{luo2022biogpt,
  author  = {Luo, Renqian and Sun, Liai and Xia, Yingce and Qin, Tao and Zhang, Sheng and Poon, Hoifung and Liu, Tie-Yan},
  title   = {BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining},
  journal = {Briefings in Bioinformatics},
  volume  = {23},
  number  = {6},
  year    = {2022},
  doi     = {10.1093/bib/bbac409}
}

Guardrails

  • Not medical, clinical, or regulatory advice
  • Not validated for patient-facing or clinical decision support use
  • SFT responses are template-derived or seed-expanded — quality depends on source data completeness
  • Always pair model outputs with domain expertise and independent verification

About

Sid Mishra — Founder, Ontologer · Convixion AI

Sid is the founder of several AI-native and AI-powered startups and initiatives, based in Singapore. He founded Ontologer as the dedicated AI research arm of Convixion AI, focused on building domain-specific language models from the ground up. Every step of model development — dataset curation, training infrastructure, evaluation frameworks, and production deployment — is performed in-house.
