🏥 Medical Query Router

Few-shot classifier that routes patient queries into 3 safety tiers.

Built with SetFit and trained on just 90 hand-crafted examples (30 per class) using contrastive learning.

Classes

| Tier | Label | Action | Example |
|------|-------|--------|---------|
| 🟢 | low_stakes | Chatbot answers directly | "How much paracetamol for a headache? I'm 30 and healthy" |
| 🟡 | high_stakes | Doctor reviews before responding | "Can I take ibuprofen while on blood thinners?" |
| 🔴 | urgent | Tell patient to call 911/999 NOW | "Crushing chest pain going down my left arm" |

Performance

Evaluated on 45 held-out examples (15 per class) including deliberate edge cases:

| Metric | Score |
|--------|-------|
| Weighted F1 | 0.888 |
| Accuracy | 88.9% |
| Urgent Recall | 93.3% |
| Urgent Precision | 82.4% |
| Low Stakes F1 | 0.933 |
| High Stakes F1 | 0.857 |

Confusion Matrix

              Predicted →  low   high  urgent
low_stakes                  14     0      1
high_stakes                  1    12      2
urgent                       0     1     14
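
The reported scores can be re-derived from the confusion matrix. The following is a minimal sketch in plain Python, assuming rows are true classes and columns are predicted classes; the function and variable names are illustrative and not part of the released model.

```python
# Rows are true classes, columns are predicted classes, in this order.
labels = ["low_stakes", "high_stakes", "urgent"]
confusion = [
    [14, 0, 1],   # true low_stakes
    [1, 12, 2],   # true high_stakes
    [0, 1, 14],   # true urgent
]


def per_class_metrics(matrix, index):
    """Return (precision, recall, f1) for the class at `index`."""
    true_positive = matrix[index][index]
    predicted_total = sum(row[index] for row in matrix)  # column sum
    actual_total = sum(matrix[index])                    # row sum
    precision = true_positive / predicted_total
    recall = true_positive / actual_total
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


total = sum(sum(row) for row in confusion)
accuracy = sum(confusion[i][i] for i in range(len(labels))) / total

metrics = {name: per_class_metrics(confusion, i) for i, name in enumerate(labels)}

# Weighted F1 weights each class F1 by its support (the row sum).
weighted_f1 = sum(
    metrics[name][2] * sum(confusion[i]) for i, name in enumerate(labels)
) / total

print(f"accuracy={accuracy:.3f} weighted_f1={weighted_f1:.3f}")
for name, (p, r, f1) in metrics.items():
    print(f"{name}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Run as written, this reproduces the figures in the Performance table and also gives the urgent F1 (0.875), which the table does not list.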

Backbone Comparison

We trained three backbones and selected all-mpnet-base-v2, which scored highest on every metric:

| Backbone | F1 | Urgent Recall | Safety Score |
|----------|----|---------------|--------------|
| all-mpnet-base-v2 ★ | 0.888 | 0.933 | 0.859 |
| all-MiniLM-L6-v2 | 0.846 | 0.867 | 0.789 |
| MedEmbed-base-v0.1 | 0.801 | 0.867 | 0.748 |

Usage

```python
from setfit import SetFitModel

model = SetFitModel.from_pretrained("boredpanda9/medical-query-router")

queries = [
    "What are some healthy ways to lose weight?",
    "Can I take naproxen with my blood pressure medication?",
    "I have crushing chest pain spreading to my left arm",
]

predictions = model.predict(queries)
print(predictions)
# ['low_stakes', 'high_stakes', 'urgent']

# With confidence scores
probabilities = model.predict_proba(queries)
print(probabilities)
```
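
Once a label is predicted, it still has to be turned into one of the actions from the Classes table. A minimal sketch of that dispatch step follows. The `ACTIONS` dictionary and `route` function are hypothetical names, and sending unrecognised labels to doctor review is an assumption made here for safety, not documented behaviour.

```python
# Actions copied from the Classes table; the dictionary and function names
# are hypothetical and not part of the released model.
ACTIONS = {
    "low_stakes": "Chatbot answers directly",
    "high_stakes": "Doctor reviews before responding",
    "urgent": "Tell patient to call 911/999 NOW",
}


def route(label):
    """Map a predicted label to its action; unknown labels go to doctor review."""
    return ACTIONS.get(str(label), ACTIONS["high_stakes"])


# Stand-in for the output of model.predict(queries) so the sketch runs offline.
example_predictions = ["low_stakes", "high_stakes", "urgent"]
for label in example_predictions:
    print(f"{label} -> {route(label)}")
```

In a real integration, `example_predictions` would be replaced by the model's predictions for incoming queries.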

Training Details

  • Method: SetFit (Sentence-Transformer fine-tuning + Logistic Regression head)
  • Paper: Efficient Few-Shot Learning Without Prompts
  • Base model: sentence-transformers/all-mpnet-base-v2 (109.5M params)
  • Training data: 90 hand-crafted examples (30 per class)
  • Contrastive pairs: 3,600 (generated via R=20 pair sampling)
  • Epochs: 1 (contrastive phase) + 1 (head phase)
  • Body learning rate: 2e-5
  • Head learning rate: 1e-2
  • Batch size: 16 (contrastive), 2 (head)
  • Loss: CosineSimilarityLoss
  • Head: Logistic Regression with balanced class weights
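
The 3,600 figure is consistent with drawing one positive and one negative pair per training example in each of the R=20 sampling iterations (2 × 90 × 20). The sketch below illustrates that reading in plain Python with placeholder texts. It is a simplified illustration, not the SetFit library's sampler, and `sample_pairs` is a hypothetical name.

```python
import random


def sample_pairs(texts, labels, iterations, seed=0):
    """Build (text_a, text_b, target) pairs: 1.0 for same class, 0.0 otherwise."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(iterations):
        for i, (text, label) in enumerate(zip(texts, labels)):
            # Candidates from the same class, excluding the anchor itself.
            same = [t for j, (t, l) in enumerate(zip(texts, labels))
                    if l == label and j != i]
            # Candidates from any other class.
            other = [t for t, l in zip(texts, labels) if l != label]
            pairs.append((text, rng.choice(same), 1.0))   # positive pair
            pairs.append((text, rng.choice(other), 0.0))  # negative pair
    return pairs


# Placeholder data with the same shape as the training set: 30 texts per class.
classes = ["low_stakes", "high_stakes", "urgent"]
texts = [f"{c} example {n}" for c in classes for n in range(30)]
labels = [c for c in classes for _ in range(30)]

pairs = sample_pairs(texts, labels, iterations=20)
print(len(pairs))  # 2 pairs x 90 texts x 20 iterations = 3600
```

The targets 1.0 and 0.0 are the similarity values that a cosine-similarity loss would push each pair's embeddings toward.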

Class Design Rationale

🟒 Low Stakes

Queries where a chatbot can safely provide general information:

  • OTC medication dosing for otherwise healthy adults (paracetamol, ibuprofen, antihistamines)
  • General wellness (weight loss, sleep, hydration, exercise)
  • Mild, self-limiting symptoms with no red flags (common cold, mild fever in children who are otherwise well, minor cuts/grazes)
  • Lifestyle and prevention advice

🟑 High Stakes

Queries requiring clinical judgement β€” a doctor must review before responding:

  • Prescription medication dosing where errors cause harm (insulin, warfarin, metformin, chemotherapy)
  • Drug interactions (especially with narrow therapeutic index drugs)
  • Comorbidities that change management (diabetes + wound, COPD + ankle swelling)
  • Pregnancy/breastfeeding medication safety
  • Chronic disease management and flare-ups
  • Red flags in symptoms (unexplained weight loss, persistent cough >3 weeks, changing moles)
  • Children's prescription medications
  • Mental health (non-crisis)

🔴 Urgent

Life-threatening emergencies where the patient must call 911/999/112 immediately:

  • Signs of heart attack (chest pain + arm/jaw, sweating, collapse)
  • Signs of stroke (FAST: Face drooping, Arm weakness, Speech difficulty, Time to call)
  • Breathing emergencies (anaphylaxis, severe asthma, choking, blue lips)
  • Overdose or poisoning (especially in children)
  • Suicidal crisis (active plan, immediate danger)
  • Severe bleeding or major trauma
  • Meningitis signs (non-blanching rash + fever + neck stiffness)
  • Seizures lasting >5 minutes
  • Unconscious/unresponsive person

Limitations

⚠️ This is a routing tool, not a diagnostic tool. It decides who should answer a query, not what the answer is.

  • Trained on only 90 examples, so it may misclassify unusual or ambiguous queries
  • Designed for English-language queries in UK/US healthcare contexts
  • Should be used as a first-pass filter with human oversight, never as the sole decision-maker
  • By design, the model errs toward safety (high_stakes/urgent) when uncertain
  • Not validated on real clinical data, so performance on actual patient messages may differ from the eval set
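
One way to add a further safety margin on top of the model is to floor low-confidence predictions at doctor review. The sketch below is an assumption layered on the model's probabilities, not part of the released model. The 0.6 threshold is an arbitrary placeholder, and the probability columns are assumed to follow the order of `LABELS`, which should be verified against the model before use.

```python
# Ordered from least to most severe; assumed to match the probability columns.
LABELS = ["low_stakes", "high_stakes", "urgent"]


def route_with_floor(probabilities, threshold=0.6):
    """Pick the top label, but never go below high_stakes when confidence is low."""
    top = max(range(len(LABELS)), key=lambda i: probabilities[i])
    if probabilities[top] < threshold:
        # Uncertain prediction: require at least a doctor review.
        top = max(top, LABELS.index("high_stakes"))
    return LABELS[top]


print(route_with_floor([0.90, 0.07, 0.03]))  # confident low_stakes is kept
print(route_with_floor([0.45, 0.40, 0.15]))  # uncertain low_stakes is raised
print(route_with_floor([0.10, 0.35, 0.55]))  # urgent is never lowered
```

A suitable threshold would need to be tuned on held-out data, trading extra doctor reviews against missed high-stakes queries.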

License

Apache 2.0
