# Chaperone-Thinking-LQ-1.0
Chaperone-Thinking-LQ-1.0 is a domain-optimized reasoning model built on DeepSeek-R1-Distill-Qwen-32B and refined through a multi-stage pipeline of GPTQ quantization-aware training and QLoRA fine-tuning. It achieves 84% on MedQA (within 4 points of GPT-4o) in a ~20GB package that fits on a single NVIDIA L40/L40S GPU.
Fully open-source under CC-BY-4.0.
## Highlights
- Base model: DeepSeek-R1-Distill-Qwen-32B (32B parameters)
- Size reduction: ~60GB → ~20GB (4-bit GPTQ)
- MedQA accuracy: 84% (GPT-4o: ~88%)
- Hardware target: runs on a single NVIDIA L40, L40S, or A100 GPU
- License: CC-BY-4.0
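As a back-of-envelope check (the overhead factor below is an assumption for illustration, not a figure from this card), the stated ~20GB footprint is consistent with packing 32B parameters at 4 bits each:

```python
# Rough memory estimate for a 4-bit 32B-parameter model.
params = 32e9
bits_per_weight = 4.0        # GPTQ 4-bit
overhead = 1.25              # assumed factor for scales/zero-points, embeddings, buffers

weights_gb = params * bits_per_weight / 8 / 1e9   # packed weights: ~16 GB
total_gb = weights_gb * overhead                  # ~20 GB, matching the stated size

print(f"packed weights: {weights_gb:.1f} GB, est. footprint: {total_gb:.1f} GB")
```

At ~20GB the model leaves headroom for the KV cache on a 48GB L40/L40S, which is what makes single-GPU deployment practical.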
## How We Built It
This model is not a simple post-training quantization; it was produced through a four-stage pipeline:
| Stage | Method | What it does |
|---|---|---|
| 1. Quantization | 4-bit GPTQ | Compresses weights from ~60GB to ~20GB for efficient inference |
| 2. Quantization-Aware Training | GPTQ-based QAT with calibration | Minimizes accuracy loss during quantization by optimizing scale/zero-point parameters against a calibration dataset |
| 3. Domain Fine-Tuning | QLoRA | Adapts the quantized model on medical and scientific corpora, recovering and improving accuracy for domain-specific reasoning |
| 4. Transparency | Adaptive layer removal | Removes the identity adaptive layer so the model correctly attributes its foundational architecture to its original creators |
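Stage 2 tunes the per-group scale and zero-point parameters against calibration data. As a toy sketch of the underlying asymmetric 4-bit quantize/dequantize round trip (this is an illustration of the concept, not the production pipeline):

```python
import numpy as np

# Toy asymmetric 4-bit quantization: map floats into [0, 15] with a
# scale and zero-point, then reconstruct. In real GPTQ these parameters
# are computed per weight group and optimized against calibration data.
def quantize_4bit(w):
    qmin, qmax = 0, 15                                # unsigned 4-bit range
    scale = (w.max() - w.min()) / (qmax - qmin)
    zero_point = int(round(-w.min() / scale))
    q = np.clip(np.round(w / scale) + zero_point, qmin, qmax)
    return q.astype(np.uint8), scale, zero_point

def dequantize_4bit(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=256).astype(np.float32)
q, scale, zp = quantize_4bit(w)
err = np.abs(w - dequantize_4bit(q, scale, zp)).max()
print(f"max reconstruction error: {err:.4f} (bound scale/2 = {scale / 2:.4f})")
```

The reconstruction error is bounded by half the quantization step; QAT exists to keep that rounding error from compounding into accuracy loss on downstream tasks.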
## Benchmark Results

### MedQA
| Model | Accuracy |
|---|---|
| Chaperone-Thinking-LQ-1.0 | 84% |
| GPT-4o | 88% |
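MedQA is a multiple-choice benchmark, so the accuracy above is exact match over predicted option letters. A minimal scoring sketch (the data and helper name here are made up for illustration):

```python
# Hypothetical exact-match scorer for a multiple-choice benchmark like MedQA.
def exact_match_accuracy(predictions, gold):
    assert len(predictions) == len(gold)
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

preds = ["A", "C", "B", "D", "A"]   # toy model outputs
gold = ["A", "C", "D", "D", "A"]    # toy reference answers
print(f"accuracy: {exact_match_accuracy(preds, gold):.0%}")  # 4/5 = 80%
```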
### Multi-Model Comparison
| Benchmark | DeepSeek-R1 | OpenAI-o1-1217 | DeepSeek-R1-32B | OpenAI-o1-mini | Chaperone-Thinking-LQ-1.0 |
|---|---|---|---|---|---|
| AIME 2024 | 79.8 | 79.2 | 72.6 | 63.6 | 66.7 |
| GPQA Diamond | 71.5 | 75.7 | 62.1 | 60.0 | 56.7 |
| MATH-500 | 97.3 | 96.4 | 94.3 | 90.0 | 91.9 |
| MMLU | 90.8 | 91.8 | 87.4 | 85.2 | 85.9 |
Chaperone-Thinking-LQ-1.0 delivers competitive performance against full-precision models at roughly one-third the size of its full-precision base model.
## Speed & Latency
| Metric | Chaperone-Thinking-LQ-1.0 | DeepSeek-R1-Distill-Qwen-32B |
|---|---|---|
| Throughput | 36.86 tok/s | 22.84 tok/s |
| Latency p50 | 11.49s | 20.10s |
| Latency p95 | 13.06s | 20.11s |
The quantized model delivers ~1.6x higher throughput and ~43% lower median latency than the full-precision base. Figures are averages over 10 trials with concurrency=1, max_tokens=512, temperature=0.
## Model Details

| Field | Value |
|---|---|
| Base model | DeepSeek-R1-Distill-Qwen-32B |
| Parameters | 32 billion |
| Quantization | 4-bit GPTQ |
| Fine-tuning | QLoRA on medical/scientific corpora |
| Model size | ~20GB |
| Precision | torch.float16 |
| Evaluation hardware | NVIDIA A100 80GB PCIe |
| CUDA | 12.4 |
| PyTorch | 2.6.0+cu124 |
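A GPTQ checkpoint in this format can be served with standard quantization-aware runtimes. A hypothetical single-GPU launch with vLLM (the repo id matches the model tree on this page; the flags are illustrative and may vary by vLLM version):

```shell
# Hypothetical vLLM launch on a single L40/L40S or A100.
vllm serve empirischtech/DeepSeek-R1-Distill-Qwen-32B-gptq-4bit \
  --quantization gptq \
  --dtype float16 \
  --max-model-len 8192
```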
## Intended Use
- Medical and clinical reasoning tasks
- Scientific Q&A and research workflows
- Enterprise deployments requiring data sovereignty (on-premises, private cloud)
- Domain-specific text analysis and insight extraction
## Limitations
- 4-bit quantization introduces some accuracy trade-off on general benchmarks vs. the full-precision base model
- Domain fine-tuning is optimized for medical/scientific reasoning; general-purpose performance may differ
- Not intended as a replacement for professional medical judgment
## Citation

If you use this model, please cite:

@misc{chaperone-thinking-lq,
  title={Chaperone-Thinking-LQ-1.0: Domain-Optimized Reasoning via GPTQ-QAT and QLoRA},
  author={Empirisch Technologies},
  year={2025},
  url={https://huggingface.co/empirischtech}
}
## Links
- Website: chaperoneai.net
- Hugging Face: empirischtech
## Additional Evaluation Results

Self-reported scores on further benchmarks (MedQA, MATH-500, AIME 2024, GPQA Diamond, and MMLU appear in the tables above):

- GSM8K-Platinum: 84.04
- IFEval: 83.34
- MMLU-PRO: 65.76