dknguyen2304 committed · Commit d709296 · verified
1 Parent(s): 8bc62e2

Upload folder using huggingface_hub

Files changed (2)
  1. .gitignore +37 -0
  2. README.md +211 -0
.gitignore ADDED
@@ -0,0 +1,37 @@
+ # Python
+ __pycache__/
+ *.py[cod]
+ *$py.class
+ *.so
+ *.egg-info/
+ dist/
+ build/
+ *.egg
+
+ # Environment
+ .env
+ .venv/
+ venv/
+ env/
+
+ # Training artifacts (unignored for commit)
+ !checkpoints/
+ !artifacts/
+ !logs/
+
+ # Data
+ data/
+
+ # IDE
+ .vscode/
+ .idea/
+ *.swp
+ *.swo
+ *~
+
+ # Claude / AI agent
+ .claude/
+
+ # OS
+ .DS_Store
+ Thumbs.db
README.md ADDED
@@ -0,0 +1,211 @@
+ ---
+ license: apache-2.0
+ language:
+ - en
+ library_name: peft
+ base_model: unsloth/Qwen2.5-0.5B-Instruct
+ tags:
+ - router
+ - model-routing
+ - lora
+ - classification
+ - ai-gateway
+ - qwen2.5
+ - peft
+ - deepspeed
+ datasets:
+ - synthetic
+ pipeline_tag: text-classification
+ metrics:
+ - accuracy
+ - f1
+ model-index:
+ - name: model-router
+   results:
+   - task:
+       type: text-classification
+       name: AI Model Routing
+     metrics:
+     - name: Routing Accuracy
+       type: accuracy
+       value: 1.0
+     - name: Macro F1
+       type: f1
+       value: 1.0
+     - name: Avg Latency (ms)
+       type: latency
+       value: 1.44
+ ---
+
+ # 🚀 Model Router: Intelligent AI Gateway Router
+
+ A gateway router that classifies each incoming API request and routes it to the most appropriate backend model. It is built by **LoRA fine-tuning** **Qwen2.5-0.5B-Instruct** with an added classification head, and reaches **100% routing accuracy** on the 1,001-sample test set with **1.44 ms average latency**.
+
+ ## ✨ Highlights
+
+ | Metric | Value |
+ |--------|-------|
+ | **Routing Accuracy** | 100% |
+ | **Macro F1** | 1.0 |
+ | **Avg Latency** | 1.44 ms |
+ | **P50 Latency** | 0.62 ms |
+ | **Base Model** | Qwen2.5-0.5B-Instruct |
+ | **Training** | 8x NVIDIA H200 GPUs (DDP) |
+
+ ## 🏗️ Architecture
+
+ ```
+ Input: "Analyze this research paper..."
+          │
+          ▼
+ ┌─────────────────────────────────────────┐
+ │  Qwen2.5-0.5B-Instruct (LoRA-adapted)   │
+ │  Target modules: q/k/v/o/gate/up/down   │
+ │  LoRA rank: 64, alpha: 64               │
+ │  Output: Last token hidden state [896]  │
+ └─────────────────────────────────────────┘
+          │
+          ▼
+ ┌─────────────────────────────────────────┐
+ │  Classification Head                    │
+ │  Dropout(0.1) → Linear(896 → 6)         │
+ └─────────────────────────────────────────┘
+          │
+          ▼
+ Output: "gpt-4-turbo" (probability: 0.92)
+ ```
+
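The last step of the diagram, turning the six head outputs into a route and a probability, can be sketched in plain Python. This is a minimal illustration rather than the repository's code: `route_from_logits` is a hypothetical helper, the example logits are made up, and the label order is assumed to match the list in the Quick Start section (the authoritative order should come from `label_mapping.json`).

```python
import math

# Label order copied from the Quick Start section; assumed to match label_mapping.json.
LABELS = ["gpt-4-turbo", "gpt-3.5-turbo", "claude-3-opus",
          "claude-3-sonnet", "gemini-pro", "mixtral-8x7b"]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_from_logits(logits):
    """Return (route, probability) for one request's six head outputs."""
    if len(logits) != len(LABELS):
        raise ValueError(f"expected {len(LABELS)} logits, got {len(logits)}")
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]

# Made-up logits, for illustration only.
route, prob = route_from_logits([4.0, 1.0, 0.5, 0.2, -1.0, 0.0])
print(f"{route} (probability: {prob:.2f})")
```

The Quick Start example only takes the `argmax`; applying a softmax first, as here, is one way to obtain the probability shown in the diagram.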
+ ## 🎯 Supported Routes
+
+ | Route | Use Case |
+ |-------|----------|
+ | `gpt-4-turbo` | Complex reasoning, advanced coding, creative writing, long context analysis |
+ | `gpt-3.5-turbo` | Simple QA, basic summarization, casual conversation, quick translation |
+ | `claude-3-opus` | Deep research synthesis, long document analysis, nuanced analysis |
+ | `claude-3-sonnet` | Balanced analysis, code assistance, general writing, data interpretation |
+ | `gemini-pro` | Multimodal content, factual QA, web-grounded generation, visual reasoning |
+ | `mixtral-8x7b` | Fast inference, code generation, roleplay, instruction following |
+
+ ## 📊 Evaluation Results
+
+ ### Per-Class Performance (Test Set: 1,001 samples)
+
+ | Backend Model | Precision | Recall | F1 | Support |
+ |---------------|-----------|--------|------|---------|
+ | gpt-4-turbo | 1.00 | 1.00 | 1.00 | 149 |
+ | gpt-3.5-turbo | 1.00 | 1.00 | 1.00 | 711 |
+ | claude-3-opus | 1.00 | 1.00 | 1.00 | 49 |
+ | claude-3-sonnet | 1.00 | 1.00 | 1.00 | 56 |
+ | gemini-pro | 1.00 | 1.00 | 1.00 | 13 |
+ | mixtral-8x7b | 1.00 | 1.00 | 1.00 | 23 |
+
+ ### Training Convergence
+
+ | Epoch | Train Loss | Eval Accuracy |
+ |-------|------------|---------------|
+ | 1 | 1.0108 | 76.8% |
+ | 2 | 0.2813 | 100.0% |
+ | 3 | 0.0602 | 100.0% |
+ | 10 | ~0.0 | 100.0% |
+
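Because the test set is dominated by one class (711 of 1,001 samples are `gpt-3.5-turbo`), it may help to see how macro F1 differs from a support-weighted average. The sketch below uses the F1 and support values from the per-class table; the `hypothetical` scenario, in which `gemini-pro` scores an F1 of 0, is invented purely for illustration and is not a reported result.

```python
# (F1, support) per class, copied from the per-class table.
per_class = {
    "gpt-4-turbo": (1.00, 149),
    "gpt-3.5-turbo": (1.00, 711),
    "claude-3-opus": (1.00, 49),
    "claude-3-sonnet": (1.00, 56),
    "gemini-pro": (1.00, 13),
    "mixtral-8x7b": (1.00, 23),
}

def macro_f1(rows):
    """Unweighted mean of per-class F1: every class counts equally."""
    return sum(f1 for f1, _ in rows.values()) / len(rows)

def weighted_f1(rows):
    """Mean of per-class F1 weighted by each class's support."""
    total = sum(support for _, support in rows.values())
    return sum(f1 * support for f1, support in rows.values()) / total

total_support = sum(support for _, support in per_class.values())
print(total_support, macro_f1(per_class), weighted_f1(per_class))

# Hypothetical scenario (NOT a reported result): the smallest class fails completely.
hypothetical = dict(per_class, **{"gemini-pro": (0.0, 13)})
print(round(macro_f1(hypothetical), 3), round(weighted_f1(hypothetical), 3))
```

In the hypothetical case macro F1 drops much further than the weighted figure, which is why macro F1 is the more informative number for the small classes here.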
+ ## 🚀 Quick Start
+
+ ```python
+ import torch
+ from huggingface_hub import hf_hub_download
+ from peft import PeftModel
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ # Load the base model and attach the LoRA adapter
+ base_model = AutoModelForCausalLM.from_pretrained("unsloth/Qwen2.5-0.5B-Instruct")
+ model = PeftModel.from_pretrained(base_model, "dknguyen2304/model-router")
+ tokenizer = AutoTokenizer.from_pretrained("unsloth/Qwen2.5-0.5B-Instruct")
+ model.eval()
+
+ # Load the classifier head (classifier.pt is listed under Model Files)
+ classifier = torch.nn.Sequential(
+     torch.nn.Dropout(0.1),
+     torch.nn.Linear(896, 6),
+ )
+ classifier_path = hf_hub_download(repo_id="dknguyen2304/model-router", filename="classifier.pt")
+ classifier.load_state_dict(torch.load(classifier_path, map_location="cpu"))
+ classifier.eval()  # disable dropout at inference time
+
+ # Label mapping
+ labels = ["gpt-4-turbo", "gpt-3.5-turbo", "claude-3-opus",
+           "claude-3-sonnet", "gemini-pro", "mixtral-8x7b"]
+
+ # Inference
+ prompt = "Write a complex recursive algorithm to solve the Tower of Hanoi"
+ inputs = tokenizer(prompt, return_tensors="pt", max_length=512, truncation=True)
+
+ with torch.no_grad():
+     outputs = model(**inputs, output_hidden_states=True)
+     hidden = outputs.hidden_states[-1][:, -1, :]  # last-token hidden state, shape [1, 896]
+     logits = classifier(hidden)
+     prediction = labels[logits.argmax(dim=-1).item()]
+
+ print(f"Route to: {prediction}")
+ ```
+
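The Quick Start example reads position `-1` of the sequence, which is the last real token only when the input is a single unpadded prompt (or the batch is left-padded). For a padded batch, one option is to derive each row's last real position from the attention mask. The helper below is a hypothetical, framework-free sketch of that index computation; it is not part of this repository.

```python
def last_token_indices(attention_mask):
    """Return the index of the last real (mask == 1) token in each row.

    Works for left- or right-padded rows; raises on an all-padding row.
    """
    indices = []
    for row in attention_mask:
        real = [i for i, m in enumerate(row) if m == 1]
        if not real:
            raise ValueError("row contains no real tokens")
        indices.append(real[-1])
    return indices

mask = [
    [1, 1, 1, 1, 1],  # full-length prompt
    [1, 1, 1, 0, 0],  # shorter prompt, right-padded
    [0, 0, 1, 1, 1],  # shorter prompt, left-padded
]
print(last_token_indices(mask))  # → [4, 2, 4]
```

With PyTorch tensors, the resulting indices could then be used to pick one hidden vector per row, for example `hidden_states[torch.arange(batch_size), indices]`, before passing the result to the classifier head.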
+ ## 📁 Model Files
+
+ ```
+ ├── adapter_model.safetensors    # LoRA adapter weights
+ ├── adapter_config.json          # PEFT/LoRA configuration
+ ├── classifier.pt                # Classification head weights
+ ├── router_config.json           # Router configuration
+ ├── label_mapping.json           # Label ↔ ID mappings
+ └── config/
+     ├── training_config.yaml     # Training hyperparameters
+     └── deepspeed_config.json    # DeepSpeed config
+ ```
+
+ ## ⚙️ Training Details
+
+ | Parameter | Value |
+ |-----------|-------|
+ | Base Model | `unsloth/Qwen2.5-0.5B-Instruct` |
+ | LoRA Rank (r) | 64 |
+ | LoRA Alpha | 64 |
+ | LoRA Dropout | 0.1 |
+ | Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
+ | Learning Rate | 1e-3 |
+ | Batch Size | 8 per GPU × 8 GPUs × 4 grad accum = **256 effective** |
+ | Epochs | 10 |
+ | Max Seq Length | 512 |
+ | Optimizer | AdamW |
+ | Scheduler | Cosine with warmup (5%) |
+ | Precision | BF16 |
+ | Hardware | 8x NVIDIA H200 (143 GB each) |
+ | Training Data | 10,000 synthetic samples (80/10/10 split) |
+ | Total Steps | 350 |
+
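The batch-size arithmetic in the table can be checked with a few lines of Python. The sketch assumes that exactly 80% of the 10,000 samples are used for training and that a final partial batch still counts as one optimizer step; neither detail is stated in the table.

```python
import math

per_gpu_batch = 8
num_gpus = 8
grad_accum = 4
epochs = 10
train_samples = 10_000 * 80 // 100  # assumed: the 80% share of the 80/10/10 split

# Samples consumed per optimizer step across all GPUs.
effective_batch = per_gpu_batch * num_gpus * grad_accum

# Assumes the last partial batch of each epoch still triggers a step.
steps_per_epoch = math.ceil(train_samples / effective_batch)
estimated_steps = steps_per_epoch * epochs

print(effective_batch, steps_per_epoch, estimated_steps)
```

Under these assumptions the estimate is 320 optimizer steps, not the 350 reported in the table. The difference presumably comes from details not given here, such as the exact split sizes or how samples are padded across GPUs.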
+ ## 🔄 Pipeline
+
+ The model was trained via a fully autonomous 5-stage pipeline:
+
+ 1. **Data Generation**: 10,000 synthetic requests with controlled class balance
+ 2. **LLM-as-Judge Labeling**: keyword matching (60%) + semantic scoring (40%)
+ 3. **Distributed Fine-tuning**: DDP training on 8x H200 GPUs
+ 4. **Evaluation**: batch inference with latency measurement
+ 5. **Export**: production-ready artifacts
+
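The 60/40 weighting in the labeling stage could be combined along the following lines. This is only a sketch of the idea: the keyword lists, the `keyword_score` definition, and the source of the semantic scores are all hypothetical, since the README does not describe them.

```python
KEYWORD_WEIGHT = 0.6   # from the pipeline description
SEMANTIC_WEIGHT = 0.4  # from the pipeline description

# Hypothetical keyword lists; the real ones are not given in this README.
ROUTE_KEYWORDS = {
    "gpt-4-turbo": ["prove", "algorithm", "architecture"],
    "gpt-3.5-turbo": ["translate", "summarize", "what is"],
}

def keyword_score(text, keywords):
    """Fraction of a route's keywords that appear in the text (0.0 to 1.0)."""
    lowered = text.lower()
    hits = sum(1 for kw in keywords if kw in lowered)
    return hits / len(keywords) if keywords else 0.0

def combined_score(keyword, semantic):
    """Weighted sum of the two signals, using the 60/40 split."""
    return KEYWORD_WEIGHT * keyword + SEMANTIC_WEIGHT * semantic

def label_request(text, semantic_scores):
    """Pick the route with the highest combined score.

    semantic_scores maps route -> score in [0, 1] from some external
    scorer (assumed to exist; not defined here).
    """
    scores = {
        route: combined_score(keyword_score(text, kws), semantic_scores.get(route, 0.0))
        for route, kws in ROUTE_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best, scores

label, scores = label_request(
    "Summarize this paragraph and translate it to French",
    {"gpt-4-turbo": 0.3, "gpt-3.5-turbo": 0.8},  # made-up semantic scores
)
print(label, scores)
```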
+ ## ⚠️ Limitations
+
+ - Trained on **synthetic data**: real-world distribution may differ
+ - **Fixed label set**: only routes to 6 predefined models
+ - **No confidence calibration**: consider adding uncertainty thresholds for production
+ - Validate on real production traffic before deployment
+
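One way to act on the uncertainty-threshold suggestion is to fall back to a default route when the top probability is low. The sketch below is an assumption-laden illustration: `route_with_fallback`, the 0.7 threshold, and the choice of `gpt-4-turbo` as the fallback are all hypothetical, and because the model is not calibrated, any threshold would need tuning on real traffic.

```python
FALLBACK_ROUTE = "gpt-4-turbo"  # hypothetical choice of a safe default
MIN_CONFIDENCE = 0.7            # hypothetical threshold; tune on real traffic

def route_with_fallback(probabilities, threshold=MIN_CONFIDENCE, fallback=FALLBACK_ROUTE):
    """Return (route, used_fallback) given a route -> probability mapping."""
    if not probabilities:
        return fallback, True
    best = max(probabilities, key=probabilities.get)
    if probabilities[best] < threshold:
        return fallback, True
    return best, False

# Made-up probabilities, for illustration only.
print(route_with_fallback({"gpt-3.5-turbo": 0.92, "mixtral-8x7b": 0.08}))
print(route_with_fallback({"claude-3-sonnet": 0.41, "claude-3-opus": 0.38, "gemini-pro": 0.21}))
```

The first call keeps the predicted route; the second falls back because no class reaches the threshold.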
+ ## 📜 License
+
+ Apache 2.0
+
+ ## 📖 Citation
+
+ ```bibtex
+ @misc{model-router-2026,
+   title={Model Router: Intelligent AI Gateway Request Routing via LoRA Fine-tuning},
+   author={dknguyen2304},
+   year={2026},
+   url={https://huggingface.co/dknguyen2304/model-router}
+ }
+ ```