dknguyen2304/model-router

+# Python
+__pycache__/
+*.py[cod]
+*$py.class
+*.so
+*.egg-info/
+dist/
+build/
+*.egg
+# Environment
+.env
+.venv/
+venv/
+env/
+# Training artifacts (unignored for commit)
+!checkpoints/
+!artifacts/
+!logs/
+# Data
+data/
+# IDE
+.vscode/
+.idea/
+*.swp
+*.swo
+*~
+# Claude / AI agent
+.claude/
+# OS
+.DS_Store
+Thumbs.db

README.md ADDED

	@@ -0,0 +1,211 @@

+---
+license: apache-2.0
+language:
+  - en
+library_name: peft
+base_model: unsloth/Qwen2.5-0.5B-Instruct
+tags:
+  - router
+  - model-routing
+  - lora
+  - classification
+  - ai-gateway
+  - qwen2.5
+  - peft
+  - deepspeed
+datasets:
+  - synthetic
+pipeline_tag: text-classification
+metrics:
+  - accuracy
+  - f1
+model-index:
+  - name: model-router
+    results:
+      - task:
+          type: text-classification
+          name: AI Model Routing
+        metrics:
+          - name: Routing Accuracy
+            type: accuracy
+            value: 1.0
+          - name: Macro F1
+            type: f1
+            value: 1.0
+          - name: Avg Latency (ms)
+            type: latency
+            value: 1.44
+---
+# 🚀 Model Router — Intelligent AI Gateway Router
+An autonomous AI gateway router that sends each incoming API request to the most appropriate backend model. It pairs a **LoRA-adapted Qwen2.5-0.5B-Instruct** backbone with a linear classification head and reaches **100% routing accuracy** at **1.44 ms average latency** on a synthetic test set of 1,001 samples.
+## ✨ Highlights
+| Metric | Value |
+|--------|-------|
+| **Routing Accuracy** | 100% |
+| **Macro F1** | 1.0 |
+| **Avg Latency** | 1.44ms |
+| **P50 Latency** | 0.62ms |
+| **Base Model** | Qwen2.5-0.5B-Instruct |
+| **Training** | 8x NVIDIA H200 GPUs (DDP) |
+## 🏗️ Architecture
+```
+Input: "Analyze this research paper..."
+         │
+         ▼
+┌─────────────────────────────────────────┐
+│  Qwen2.5-0.5B-Instruct (LoRA-adapted)  │
+│  Target modules: q/k/v/o/gate/up/down   │
+│  LoRA rank: 64, alpha: 64               │
+│  Output: Last token hidden state [896]   │
+└─────────────────────────────────────────┘
+         │
+         ▼
+┌─────────────────────────────────────────┐
+│  Classification Head                     │
+│  Dropout(0.1) → Linear(896 → 6)         │
+└─────────────────────────────────────────┘
+         │
+         ▼
+Output: "gpt-4-turbo" (probability: 0.92)
+```
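The diagram above can be sketched in PyTorch as follows. This is a minimal illustration, not the repository's training code: `last_token_pool` and `RouterHead` are hypothetical names, the batch is assumed to be right-padded, and the random tensor stands in for the backbone's final hidden states. The parameter names of this module are not expected to match the keys stored in `classifier.pt`.

```python
import torch
from torch import nn


def last_token_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Pick the hidden state of the last non-padding token of each sequence.

    hidden_states: [batch, seq, dim]; attention_mask: [batch, seq] with 1 for real tokens.
    Assumes right padding (an assumption of this sketch).
    """
    last_index = attention_mask.sum(dim=1) - 1  # position of the last real token
    batch_index = torch.arange(hidden_states.size(0), device=hidden_states.device)
    return hidden_states[batch_index, last_index]


class RouterHead(nn.Module):
    """Dropout(0.1) followed by Linear(896 -> 6), as in the diagram."""

    def __init__(self, hidden_size: int = 896, num_routes: int = 6, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.linear = nn.Linear(hidden_size, num_routes)

    def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        pooled = last_token_pool(hidden_states, attention_mask)
        return self.linear(self.dropout(pooled))


# Stand-in for backbone output: 2 sequences, 5 positions, hidden size 896.
head = RouterHead().eval()  # eval() disables dropout
hidden_states = torch.randn(2, 5, 896)
attention_mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
with torch.no_grad():
    logits = head(hidden_states, attention_mask)  # one score per route
```

Pooling by attention mask matters once requests are batched: indexing position `-1` directly would read a padding token for the shorter sequence.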
+## 🎯 Supported Routes
+| Route | Use Case |
+|-------|----------|
+| `gpt-4-turbo` | Complex reasoning, advanced coding, creative writing, long context analysis |
+| `gpt-3.5-turbo` | Simple QA, basic summarization, casual conversation, quick translation |
+| `claude-3-opus` | Deep research synthesis, long document analysis, nuanced analysis |
+| `claude-3-sonnet` | Balanced analysis, code assistance, general writing, data interpretation |
+| `gemini-pro` | Multimodal content, factual QA, web-grounded generation, visual reasoning |
+| `mixtral-8x7b` | Fast inference, code generation, roleplay, instruction following |
+## 📊 Evaluation Results
+### Per-Class Performance (Test Set: 1,001 samples)
+| Backend Model | Precision | Recall | F1 | Support |
+|--------------|----------|--------|-----|---------|
+| gpt-4-turbo | 1.00 | 1.00 | 1.00 | 149 |
+| gpt-3.5-turbo | 1.00 | 1.00 | 1.00 | 711 |
+| claude-3-opus | 1.00 | 1.00 | 1.00 | 49 |
+| claude-3-sonnet | 1.00 | 1.00 | 1.00 | 56 |
+| gemini-pro | 1.00 | 1.00 | 1.00 | 13 |
+| mixtral-8x7b | 1.00 | 1.00 | 1.00 | 23 |
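Because the supports in the table above are heavily skewed (711 samples for `gpt-3.5-turbo`, 13 for `gemini-pro`), macro F1 is the more sensitive number: it averages per-class F1 without weighting by support. The sketch below shows the computation on toy labels. The helper names and the toy predictions are illustrative assumptions, not the evaluation script or results of this model.

```python
from collections import Counter


def per_class_f1(y_true, y_pred, labels):
    """Return {label: F1} computed from true/predicted label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = {}
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        scores[label] = 2 * precision * recall / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1."""
    scores = per_class_f1(y_true, y_pred, labels)
    return sum(scores.values()) / len(labels)


# Toy data (hypothetical): one of two minority-class samples is misrouted.
toy_labels = ["gpt-3.5-turbo", "gemini-pro"]
y_true = ["gpt-3.5-turbo"] * 8 + ["gemini-pro"] * 2
y_pred = ["gpt-3.5-turbo"] * 8 + ["gpt-3.5-turbo", "gemini-pro"]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
toy_macro_f1 = macro_f1(y_true, y_pred, toy_labels)
```

In this toy case accuracy is 0.9 while macro F1 is about 0.80, which is why a single misrouted minority-class request shows up more clearly in macro F1 than in accuracy.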
+### Training Convergence
+| Epoch | Train Loss | Eval Accuracy |
+|-------|-----------|---------------|
+| 1 | 1.0108 | 76.8% |
+| 2 | 0.2813 | 100.0% |
+| 3 | 0.0602 | 100.0% |
+| 10 | ~0.0 | 100.0% |
+## 🚀 Quick Start
+```python
+import torch
+from huggingface_hub import hf_hub_download
+from peft import PeftModel
+from transformers import AutoModelForCausalLM, AutoTokenizer
+# Load the base model and attach the LoRA adapter
+base_model = AutoModelForCausalLM.from_pretrained("unsloth/Qwen2.5-0.5B-Instruct")
+model = PeftModel.from_pretrained(base_model, "dknguyen2304/model-router")
+model.eval()
+tokenizer = AutoTokenizer.from_pretrained("unsloth/Qwen2.5-0.5B-Instruct")
+# Load classifier head (classifier.pt is listed under Model Files)
+classifier = torch.nn.Sequential(
+    torch.nn.Dropout(0.1),
+    torch.nn.Linear(896, 6)
+)
+classifier_path = hf_hub_download(repo_id="dknguyen2304/model-router", filename="classifier.pt")
+classifier.load_state_dict(torch.load(classifier_path, map_location="cpu"))
+classifier.eval()  # disable dropout at inference time
+# Label mapping (index order assumed to match label_mapping.json)
+labels = ["gpt-4-turbo", "gpt-3.5-turbo", "claude-3-opus",
+          "claude-3-sonnet", "gemini-pro", "mixtral-8x7b"]
+# Inference on a single, unpadded prompt
+prompt = "Write a complex recursive algorithm to solve the Tower of Hanoi"
+inputs = tokenizer(prompt, return_tensors="pt", max_length=512, truncation=True)
+with torch.no_grad():
+    outputs = model(**inputs, output_hidden_states=True)
+    # Last-token hidden state, cast to float32 to match the classifier head
+    hidden = outputs.hidden_states[-1][:, -1, :].float()
+    logits = classifier(hidden)
+    prediction = labels[logits.argmax(dim=-1).item()]
+print(f"Route to: {prediction}")
+```
+## 📁 Model Files
+```
+├── adapter_model.safetensors   # LoRA adapter weights
+├── adapter_config.json         # PEFT/LoRA configuration
+├── classifier.pt               # Classification head weights
+├── router_config.json          # Router configuration
+├── label_mapping.json          # Label ↔ ID mappings
+└── config/
+    ├── training_config.yaml    # Training hyperparameters
+    └── deepspeed_config.json   # DeepSpeed config
+```
+## ⚙️ Training Details
+| Parameter | Value |
+|-----------|-------|
+| Base Model | `unsloth/Qwen2.5-0.5B-Instruct` |
+| LoRA Rank (r) | 64 |
+| LoRA Alpha | 64 |
+| LoRA Dropout | 0.1 |
+| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
+| Learning Rate | 1e-3 |
+| Batch Size | 8 per GPU × 8 GPUs × 4 grad accum = **256 effective** |
+| Epochs | 10 |
+| Max Seq Length | 512 |
+| Optimizer | AdamW |
+| Scheduler | Cosine with warmup (5%) |
+| Precision | BF16 |
+| Hardware | 8x NVIDIA H200 (143 GB each) |
+| Training Data | 10,000 synthetic samples (80/10/10 split) |
+| Total Steps | 350 |
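To make the schedule row concrete, the sketch below evaluates a linear-warmup, cosine-decay learning rate using the table's peak rate (1e-3), total steps (350), and 5% warmup. The exact scheduler implementation and how the warmup step count is rounded are not stated above, so decaying to zero and using `int(0.05 * 350) = 17` warmup steps are assumptions; `lr_at` is a hypothetical helper.

```python
import math


def lr_at(step: int, total_steps: int = 350, peak_lr: float = 1e-3,
          warmup_ratio: float = 0.05) -> float:
    """Learning rate at a given optimizer step (linear warmup, then cosine decay to 0)."""
    warmup_steps = int(total_steps * warmup_ratio)  # assumed rounding: 17 steps
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate
        return peak_lr * step / max(1, warmup_steps)
    # Fraction of the decay phase completed, in [0, 1]
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Sample the schedule at a few points of the 350-step run.
for step in (0, 17, 175, 350):
    print(step, f"{lr_at(step):.6f}")
```

Under these assumptions the rate peaks at step 17 and falls smoothly toward zero by step 350.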
+## 🔄 Pipeline
+The model was trained via a fully autonomous 5-stage pipeline:
+1. **Data Generation** — 10,000 synthetic requests with controlled class balance
+2. **LLM-as-Judge Labeling** — Keyword matching (60%) + semantic scoring (40%)
+3. **Distributed Fine-tuning** — DDP training on 8x H200 GPUs
+4. **Evaluation** — Batch inference with latency measurement
+5. **Export** — Production-ready artifacts
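Step 2 blends keyword matching (60%) with semantic scoring (40%). The sketch below shows one way such a weighted blend could select a label. Only the two weights come from the list above: the keyword lists, the `keyword_score` and `pick_label` helpers, and the semantic scores passed in are all hypothetical, since the pipeline's actual scoring functions are not described here.

```python
KEYWORD_WEIGHT = 0.6   # from the pipeline description
SEMANTIC_WEIGHT = 0.4  # from the pipeline description

# Hypothetical keyword lists, for illustration only
ROUTE_KEYWORDS = {
    "gpt-4-turbo": ["algorithm", "prove", "refactor"],
    "gpt-3.5-turbo": ["translate", "summarize", "hello"],
}


def keyword_score(text: str, keywords: list) -> float:
    """Fraction of a route's keywords that appear as words in the text."""
    words = text.lower().split()
    hits = sum(1 for keyword in keywords if keyword in words)
    return hits / len(keywords)


def pick_label(text: str, semantic_scores: dict):
    """Blend both scores per route and return (best_route, all_scores)."""
    combined = {
        route: KEYWORD_WEIGHT * keyword_score(text, keywords)
        + SEMANTIC_WEIGHT * semantic_scores.get(route, 0.0)
        for route, keywords in ROUTE_KEYWORDS.items()
    }
    return max(combined, key=combined.get), combined


# The semantic scores would come from a judge model; these values are made up.
label, scores = pick_label(
    "Translate hello to French",
    {"gpt-4-turbo": 0.3, "gpt-3.5-turbo": 0.7},
)
print(label, scores)
```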
+## ⚠️ Limitations
+- Trained on **synthetic data** — real-world distribution may differ
+- **Fixed label set** — only routes to 6 predefined models
+- **No confidence calibration** — consider adding uncertainty thresholds for production
+- Validate on real production traffic before deployment
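The uncertainty threshold suggested above could look like the following sketch, which turns the six logits into softmax probabilities and falls back to a default route when the top probability is low. The threshold of 0.8, the fallback route, and the `route_with_fallback` helper are assumptions for illustration. Softmax outputs are not calibrated, so any threshold would need tuning on real traffic.

```python
import math

LABELS = ["gpt-4-turbo", "gpt-3.5-turbo", "claude-3-opus",
          "claude-3-sonnet", "gemini-pro", "mixtral-8x7b"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def route_with_fallback(logits, threshold=0.8, fallback="gpt-4-turbo"):
    """Return (route, top_probability); use the fallback when confidence is low."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return fallback, probs[best]  # hypothetical policy: defer to a default route
    return LABELS[best], probs[best]


# Made-up logits: one confident prediction, one nearly uniform.
confident_route, confident_prob = route_with_fallback([0.1, 4.0, 0.2, 0.0, -1.0, 0.3])
uncertain_route, uncertain_prob = route_with_fallback([1.0, 1.1, 0.9, 1.0, 1.0, 1.0])
print(confident_route, uncertain_route)
```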
+## 📜 License
+Apache 2.0
+## 📖 Citation
+```bibtex
+@misc{model-router-2026,
+  title={Model Router: Intelligent AI Gateway Request Routing via LoRA Fine-tuning},
+  author={dknguyen2304},
+  year={2026},
+  url={https://huggingface.co/dknguyen2304/model-router}
+}
+```