Qwen2.5-7B-ODA-Mixture-100k

[Figure: Leaderboard Performance]

Qwen2.5-7B-ODA-Mixture-100k is a supervised fine-tuned (SFT) model built on Qwen2.5-7B-Base and trained on ODA-Mixture-100k. The training set mixes top-performing open corpora selected via the OpenDataArena leaderboard and is refined through deduplication and benchmark decontamination. The goal is to improve the model's capabilities across the General, Math, Code, and Reasoning domains under a compact ~100K-sample data budget.


🧠 Model Summary

  • Base Model: Qwen/Qwen2.5-7B-Base
  • Training Data: OpenDataArena/ODA-Mixture-100k
  • Domain Coverage: General, Math, Code, Reasoning
  • Scale (selected training set): ~100K samples
  • Goal: Achieve significant general-purpose gains with a compact curated dataset, improving multi-domain reasoning and problem-solving ability.

⚙️ Training Data Curation Pipeline

ODA-Mixture-100k is built by following a single rule: trust the OpenDataArena leaderboard.

1️⃣ Data Collection

We choose LIMO as the foundation because it ranks highly on the ODA overall leaderboard with very few samples (817), which establishes a strong reasoning baseline at minimal cost. We then augment this core with AM-Thinking-v1-Distilled-math and AM-Thinking-v1-Distilled-code, the top-performing, sample-efficient datasets on the ODA Math and Code leaderboards, to strengthen specialized domain capabilities.

2️⃣ Deduplication & Decontamination

We first perform exact deduplication over all questions to remove identical items, and then run benchmark decontamination to reduce evaluation leakage by removing overlaps with standard and competition benchmarks.
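The two steps above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the whitespace and case normalization, the word-level 8-gram overlap criterion, and the function names are all assumptions, since the card does not specify how matches are detected.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization, not from the card)."""
    return re.sub(r"\s+", " ", text.strip().lower())


def exact_dedup(samples: list[dict]) -> list[dict]:
    """Keep the first sample seen for each distinct normalized question."""
    seen: set[str] = set()
    kept = []
    for sample in samples:
        key = normalize(sample["question"])
        if key not in seen:
            seen.add(key)
            kept.append(sample)
    return kept


def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of word n-grams; texts shorter than n words yield an empty set."""
    tokens = normalize(text).split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def decontaminate(samples: list[dict], benchmark_questions: list[str], n: int = 8) -> list[dict]:
    """Drop samples that share any word n-gram with a benchmark question.

    The n-gram criterion and n=8 are hypothetical choices for illustration.
    """
    benchmark_grams: set[tuple[str, ...]] = set()
    for question in benchmark_questions:
        benchmark_grams |= word_ngrams(question, n)
    return [s for s in samples if not (word_ngrams(s["question"], n) & benchmark_grams)]
```

Running `decontaminate(exact_dedup(samples), benchmark_questions)` applies the steps in the order described: duplicates are removed first, so the overlap check runs on fewer items.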

3️⃣ Data Selection

To adhere to our ~100K data budget while maximizing the impact of each sample, we employ semantic clustering to map the overall data distribution. Within each cluster, we preferentially sample the most challenging instances, using sequence length as a practical proxy for reasoning complexity and problem difficulty.
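A rough sketch of the selection step, assuming cluster labels have already been produced by some embedding-plus-clustering stage that the card does not describe. The proportional per-cluster quota, the character count used as a stand-in for sequence length, and the name `select_by_cluster` are illustrative assumptions.

```python
from collections import defaultdict


def sample_length(sample: dict) -> int:
    """Character count of question plus response, an assumed proxy for token length."""
    return len(sample["question"]) + len(sample["response"])


def select_by_cluster(samples: list[dict], cluster_ids: list[int], budget: int) -> list[dict]:
    """Pick the longest samples from each cluster, with quotas proportional to cluster size.

    Because quotas are rounded, the result can contain slightly fewer than
    `budget` items; it never contains more.
    """
    clusters: dict[int, list[dict]] = defaultdict(list)
    for sample, cluster_id in zip(samples, cluster_ids):
        clusters[cluster_id].append(sample)

    total = len(samples)
    selected: list[dict] = []
    for cluster_id in sorted(clusters):
        members = sorted(clusters[cluster_id], key=sample_length, reverse=True)
        quota = max(1, round(budget * len(members) / total))
        selected.extend(members[:quota])

    # Rounding can overshoot the budget; trim the shortest picks if so.
    selected.sort(key=sample_length, reverse=True)
    return selected[:budget]
```

Allocating quotas per cluster keeps the selected subset spread over the whole data distribution, while the within-cluster sort biases it toward longer, presumably harder, instances.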


📚 Training Data Source Composition

| Source | Count | Percentage |
|---|---|---|
| LIMO | 817 | 0.81% |
| AM-Thinking-v1-Distilled-math | 50,244 | 49.59% |
| AM-Thinking-v1-Distilled-code | 50,245 | 49.60% |

🧩 Data Format

The training data sample format is as follows (aligned with the dataset schema):

{
  "id": "unique_identifier",
  "source": "data source",
  "question": "textual question or instruction",
  "response": "textual response"
}
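As an illustration of how a record in this format might be fed to an SFT pipeline, the sketch below checks the four fields and maps a record to a two-turn chat. The user/assistant mapping and the helper names are assumptions; the card does not state how records were templated during training.

```python
REQUIRED_FIELDS = ("id", "source", "question", "response")


def validate_record(record: dict) -> None:
    """Raise ValueError if a record lacks a schema field or has empty text."""
    missing = [field for field in REQUIRED_FIELDS if field not in record]
    if missing:
        raise ValueError(f"record is missing fields: {missing}")
    for field in ("question", "response"):
        if not isinstance(record[field], str) or not record[field].strip():
            raise ValueError(f"field {field!r} must be a non-empty string")


def to_messages(record: dict) -> list[dict]:
    """Map question -> user turn and response -> assistant turn (assumed convention)."""
    validate_record(record)
    return [
        {"role": "user", "content": record["question"]},
        {"role": "assistant", "content": record["response"]},
    ]
```

The resulting list has the same shape as the `messages` argument accepted by a tokenizer's `apply_chat_template`.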

📈 Performance

Qwen2.5-7B-ODA-Mixture-100k is evaluated as an SFT model built on Qwen2.5-7B-Base across the full ODA benchmark suite spanning four domains:

  • General (DROP, IFEVAL, AGIEVAL, MMLU-Pro)
  • Math (GSM8K, MATH500, Omni-Math, OlympiadBench, AIME2024)
  • Code (HumanEval, MBPP, LCB (V5), HumanEval+)
  • Reasoning (ARC-C, BBH, CALM, KOR-BENCH)

Domain-level scores improve over the base checkpoint in all four domains, with the largest gains in Math (39.8 to 71.2) and Code (50.1 to 64.4). The overall average rises from 46.0 to 61.0.

Leaderboard Performance Comparison. Best scores in bold, second-best underlined. Eff. denotes Data Efficiency. All rows use Qwen2.5-7B-Base as the base model; the first row is the base checkpoint itself.

| Model / Training Data | Size | Eff. | General | Math | Code | Reasoning | AVG |
|---|---|---|---|---|---|---|---|
| Qwen2.5-7B-Base | - | - | 51.4 | 39.8 | 50.1 | 42.7 | 46.0 |
| OpenThoughts3-1.2M | 1.2M | +0.011 | 45.5 | 71.8 | 67.0 | 54.3 | 59.6 |
| OmniThought-0528 | 365k | +0.027 | 47.1 | 71.2 | 47.6 | 57.2 | 55.8 |
| SYNTHETIC-2-SFT-verified | 105k | +0.086 | 51.3 | 69.8 | 40.1 | 58.9 | 55.0 |
| AM-Thinking-v1-Distilled-math | 558k | +0.016 | 57.7 | 77.4 | 39.5 | 44.8 | 54.8 |
| LIMO | 817 | +9.920 | 60.7 | 44.0 | 57.9 | 53.8 | 54.1 |
| MiroMind-M1-SFT-719K | 719k | +0.006 | 52.0 | 71.0 | 26.3 | 51.5 | 50.2 |
| AM-Thinking-v1-Distilled-code | 324k | +0.024 | 49.9 | 52.3 | 68.7 | 44.4 | 53.8 |
| Light-R1-SFTData | 79k | +0.084 | 55.5 | 64.4 | 38.8 | 51.9 | 52.7 |
| ODA-Mixture-500k | 500k | +0.039 | 63.4 | 72.8 | 66.7 | 59.6 | 65.6 |
| ODA-Mixture-100k | 100k | +0.149 | 56.8 | 71.2 | 64.4 | 51.5 | 61.0 |
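The card does not define how AVG and Eff. are computed. The tabulated values are consistent with AVG being the plain mean of the four domain scores and Eff. being the AVG gain over the base checkpoint per thousand training samples. The sketch below encodes that reading; it is inferred from the numbers, not an official formula, and the function names are hypothetical.

```python
def domain_average(general: float, math_score: float, code: float, reasoning: float) -> float:
    """Unweighted mean of the four domain scores (inferred definition of AVG)."""
    return (general + math_score + code + reasoning) / 4


def data_efficiency(avg: float, base_avg: float, num_samples: int) -> float:
    """AVG gain over the base model per 1,000 training samples (inferred definition of Eff.)."""
    return (avg - base_avg) / (num_samples / 1000)


# Domain scores copied from the Qwen2.5-7B-Base and LIMO rows of the table.
base_avg = domain_average(51.4, 39.8, 50.1, 42.7)
limo_avg = domain_average(60.7, 44.0, 57.9, 53.8)
print(round(base_avg, 1), round(limo_avg, 1))
print(round(data_efficiency(limo_avg, base_avg, 817), 2))
```

For LIMO this yields roughly 9.91, close to the tabulated +9.920; the small gap presumably comes from the published scores being rounded to one decimal.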

🌐 About OpenDataArena

OpenDataArena is an open research platform dedicated to discovering, evaluating, and advancing high-quality datasets for AI post-training. It provides a transparent, data-centric ecosystem to support reproducible dataset evaluation and sharing.

Key Features:

  • 🏆 Dataset Leaderboard: helps researchers identify the most valuable and high-quality datasets across different domains
  • 📊 Detailed Evaluation Scores: provides comprehensive metrics to assess data quality, complexity, difficulty, etc.
  • 🧰 Data Processing Toolkit: OpenDataArena-Tool offers an open-source pipeline for dataset curation and scoring.

🚀 Usage

Model repo: OpenDataArena/Qwen2.5-7B-ODA-Mixture-100k. Below is a minimal runnable example for loading and inference:

from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "OpenDataArena/Qwen2.5-7B-ODA-Mixture-100k"

# Load the tokenizer and model; device_map="auto" places weights on the available devices.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", trust_remote_code=True)

# Build a single-turn chat prompt using the model's chat template.
messages = [
    {"role": "user", "content": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Sample a response with nucleus sampling.
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)

# Decode only the newly generated tokens so the prompt is not echoed back.
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))

📚 Citation

If you use this model or its training data (ODA-Mixture-100k), please cite:

@article{cai2025opendataarena,
  title={OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value},
  author={Cai, Mengzhang and Gao, Xin and Li, Yu and Lin, Honglin and Liu, Zheng and Pan, Zhuoshi and Pei, Qizhi and Shang, Xiaoran and Sun, Mengyuan and Tang, Zinan and others},
  journal={arXiv preprint arXiv:2512.14051},
  year={2025}
}