# Qwen2.5-1.5B-R1-SLERP
A SLERP merge (t=0.5) of:
- Qwen/Qwen2.5-1.5B-Instruct – strong general instruction following
- deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B – RL-distilled chain-of-thought reasoning
Part of a systematic merge study on the Qwen2.5-1.5B family. See also:
- Mohaaxa/Qwen2.5-1.5B-R1-SLERP-AWQ – AWQ 4-bit quantized version
## Benchmarks
Evaluated against both parent models on PPL (Wikitext-2) and GSM8K (100 samples):
| Model | PPL (Wikitext-2) ↓ | GSM8K (100 samples) ↑ |
|---|---|---|
| Qwen2.5-1.5B-Instruct (parent) | 16.141 | 38.0% |
| DeepSeek-R1-Distill-Qwen-1.5B (parent) | 107.467 | 3.0% |
| Qwen2.5-1.5B-R1-SLERP (this model) | 1205.427 | 2.0% |
- PPL delta vs Instruct parent: +1189.286
- GSM8K delta vs Instruct parent: -36.0%
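The exact evaluation harness isn't included in this card. As a minimal sketch, a Wikitext-2 perplexity number along the lines of the table above can be reproduced with a simple sliding-window loop like the one below; the dataset split, window size, and stride are assumptions, not values taken from the original evaluation.

```python
import math
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Mohaaxa/Qwen2.5-1.5B-R1-SLERP"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Concatenate the Wikitext-2 test split into one token stream.
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)

# Score non-overlapping windows of max_len tokens (window/stride are assumed here).
max_len, stride = 1024, 1024
nlls, n_tokens = [], 0
for start in range(0, input_ids.size(1) - 1, stride):
    window = input_ids[:, start : start + max_len]
    if window.size(1) < 2:
        break
    with torch.no_grad():
        # Labels = inputs; the model shifts them internally for next-token loss.
        loss = model(window, labels=window).loss
    nlls.append(loss * (window.size(1) - 1))
    n_tokens += window.size(1) - 1

ppl = math.exp(torch.stack(nlls).sum().item() / n_tokens)
print(f"Wikitext-2 PPL: {ppl:.3f}")
```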
## Merge Config
```yaml
merge_method: slerp
base_model:
  model: Qwen/Qwen2.5-1.5B-Instruct
slices:
  - sources:
      - model: Qwen/Qwen2.5-1.5B-Instruct
        layer_range: [0, 28]
      - model: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
        layer_range: [0, 28]
parameters:
  t: 0.5
```
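To reproduce the merge locally, mergekit can apply a config like the one above. A minimal sketch, assuming mergekit is installed and the YAML is saved as `slerp.yaml` (the file name, output path, and options are illustrative, and the Python API may differ across mergekit versions):

```python
import yaml
import torch
from mergekit.config import MergeConfiguration
from mergekit.merge import MergeOptions, run_merge

# Load the SLERP config shown above (saved locally as slerp.yaml).
with open("slerp.yaml", "r", encoding="utf-8") as fp:
    merge_config = MergeConfiguration.model_validate(yaml.safe_load(fp))

# Write the merged checkpoint to ./Qwen2.5-1.5B-R1-SLERP.
run_merge(
    merge_config,
    "./Qwen2.5-1.5B-R1-SLERP",
    options=MergeOptions(
        cuda=torch.cuda.is_available(),
        copy_tokenizer=True,
    ),
)
```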
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Mohaaxa/Qwen2.5-1.5B-R1-SLERP",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Mohaaxa/Qwen2.5-1.5B-R1-SLERP")
```
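Continuing from the snippet above, a quick generation example (the prompt and decoding settings are illustrative):

```python
# Build a chat prompt with the model's chat template and generate a reply.
messages = [{"role": "user", "content": "What is 17 * 24? Think step by step."}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)
# Strip the prompt tokens and decode only the newly generated text.
reply = tokenizer.decode(
    output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True
)
print(reply)
```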
## Notes
- t=0.5 gives equal weight to both parents
- SLERP preserves weight magnitude better than linear interpolation (see the sketch after this list)
- Both parents share identical Qwen2.5 architecture (28 layers, hidden_dim=1536)
- For a quantized version with ~67% VRAM reduction, use the AWQ variant
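For reference, the interpolation SLERP performs on a pair of weight tensors can be sketched in a few lines of PyTorch. This is an illustration of the formula only, not mergekit's exact implementation (which handles edge cases and per-tensor details differently); the function name and the example shapes are made up for the demo.

```python
import torch

def slerp(t: float, w0: torch.Tensor, w1: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Spherical linear interpolation between two weight tensors, treated as flat vectors."""
    v0, v1 = w0.flatten().float(), w1.flatten().float()
    # Angle between the two weight vectors.
    cos_omega = torch.dot(v0, v1) / (v0.norm() * v1.norm() + eps)
    omega = torch.arccos(cos_omega.clamp(-1.0, 1.0))
    so = torch.sin(omega)
    if so.abs() < eps:
        # Nearly parallel vectors: fall back to linear interpolation.
        merged = (1.0 - t) * v0 + t * v1
    else:
        merged = (torch.sin((1.0 - t) * omega) / so) * v0 + (torch.sin(t * omega) / so) * v1
    return merged.reshape(w0.shape).to(w0.dtype)

# t=0.5 blends the two parents equally while roughly preserving vector norm.
merged_weight = slerp(0.5, torch.randn(1536, 1536), torch.randn(1536, 1536))
```

Plain linear averaging shrinks the norm whenever the parent vectors point in different directions; the sine weights above keep the interpolated vector on the arc between them, which is what "preserves weight magnitude" refers to.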