
Frugal-Math-4B: Easy Samples as Length Regularizers in Math RLVR

Paper: Shorter but not Worse: Frugal Reasoning via Easy Samples as Length Regularizers in Math RLVR

Code is publicly available on GitHub.

Base Model: Qwen/Qwen3-4B-Thinking-2507

Authors: Abdelaziz Bounhar et al.

License: Apache 2.0

Figure: Success-rate distribution.

Overview

Frugal-Math-4B is a reasoning-optimized variant of Qwen3-4B-Thinking-2507 trained via Reinforcement Learning with Verifiable Rewards (RLVR) on the FrugalMath dataset.

The model exhibits emergent brevity: it learns to reason efficiently and to generate concise, verifiable mathematical solutions without any explicit length penalty. By retaining moderately easy problems during training, Frugal-Math implicitly regularizes reasoning length, reducing verbosity while preserving accuracy.
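
Example usage with 🤗 Transformers (a minimal sketch; the prompt, dtype, and generation budget below are illustrative choices, not the exact evaluation configuration):

```python
# Minimal inference sketch (assumed setup); adjust dtype, device, and
# generation budget to your environment.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MBZUAI-Paris/Frugal-Math-4B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "user", "content": "Solve x^2 - 5x + 6 = 0 and put the final answer in \\boxed{}."}
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# The model was trained with a 16k-token context; 4096 new tokens is an arbitrary budget.
output_ids = model.generate(input_ids, max_new_tokens=4096)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```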

Training Setup

| Parameter | Value |
|---|---|
| Algorithm | Group Relative Policy Optimization (GRPO) |
| Reward function | Verifiable binary reward (exact match of boxed answer) |
| Context length | 16k tokens |
| Batch size | 128 |
| Group size (G) | 16 |
| Learning rate | 1e-6 |
| Compute | 250 H200 GPU-days |
| Framework | verl |
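
For illustration, the sketch below shows one way the boxed-answer reward and GRPO's group-relative advantages can be computed; the answer extraction and normalization are simplified stand-ins, not the exact pipeline used with verl.

```python
import re
from statistics import mean, pstdev

def extract_boxed(text: str) -> str | None:
    """Return the content of the last \\boxed{...} in a completion
    (simplified: assumes no nested braces inside the box)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def verifiable_reward(completion: str, reference: str) -> float:
    """Binary reward: 1.0 iff the boxed answer exactly matches the reference."""
    answer = extract_boxed(completion)
    return 1.0 if answer is not None and answer == reference.strip() else 0.0

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: standardize rewards within the group of rollouts
    sampled for the same prompt (G = 16 in training; 4 below for brevity)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example group of rollouts for one problem whose reference answer is "42".
completions = [
    "... so the result is \\boxed{42}.",
    "... therefore \\boxed{41}.",
    "... hence the answer is \\boxed{42}.",
    "no boxed answer here",
]
rewards = [verifiable_reward(c, "42") for c in completions]
print(rewards)                             # [1.0, 0.0, 1.0, 0.0]
print(group_relative_advantages(rewards))  # positive for correct rollouts, negative otherwise
```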

Training Stages

| Stage | Objective | Source | #Samples | Description |
|---|---|---|---|---|
| Stage 1 – Emergent Brevity | Implicit length regularization | Internal curated mix of math datasets | 14.2k | Moderately easy verifiable math problems encourage concise reasoning. |
| Stage 2 – Curriculum RLVR | Progressive learning on harder problems | Filtered subset of DeepMath-103k | 14.5k | Gradually harder math problems to improve reasoning depth and coverage. |

Performance Across Benchmarks

Evaluation metrics: Pass@1 (%) and Efficiency-Adjusted Accuracy

Max generation length: 42k tokens

Definition: Efficiency-Adjusted Accuracy (EAA)

To compare models jointly on accuracy and brevity, we introduce Efficiency-Adjusted Accuracy (EAA), a metric that penalizes unnecessarily long reasoning chains:

$\text{EAA}_\gamma = a \times \exp\left[-\gamma \cdot \frac{L - L_{\min}}{L_{\max} - L_{\min}}\right]$

where $a$ is accuracy, $L$ is the average output length, $L_{\min}$ and $L_{\max}$ normalize lengths to $[0, 1]$, and $\gamma$ controls how strongly long outputs are penalized ($\gamma = 3$ in our experiments). Higher EAA means the model solves tasks efficiently, using fewer tokens for similar accuracy.
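
For concreteness, the metric can be computed as follows; the sketch assumes $L_{\min}$ and $L_{\max}$ are the shortest and longest average output lengths among the compared models, and the example numbers are illustrative only, not taken from the paper.

```python
import math

def efficiency_adjusted_accuracy(accuracy: float, avg_len: float,
                                 min_len: float, max_len: float,
                                 gamma: float = 3.0) -> float:
    """EAA_gamma = accuracy * exp(-gamma * (L - L_min) / (L_max - L_min)).

    Assumes min_len / max_len are the shortest and longest average output
    lengths among the models being compared, so the most concise model keeps
    its full accuracy and longer ones are discounted exponentially.
    """
    normalized = (avg_len - min_len) / (max_len - min_len)
    return accuracy * math.exp(-gamma * normalized)

# Illustrative values only:
print(efficiency_adjusted_accuracy(accuracy=90.0, avg_len=5000,
                                   min_len=5000, max_len=12000))  # 90.0 (shortest model)
print(efficiency_adjusted_accuracy(accuracy=90.0, avg_len=6000,
                                   min_len=5000, max_len=12000))  # ~58.6 (penalized)
```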

Results

Each cell reports Pass@1 / EAA.

| Model | Size | GPQA Diamond | AIME25 | Omni-Hard | GSM Plus | IFEval | MATH-500 | Average |
|---|---|---|---|---|---|---|---|---|
| Qwen3-30B-A3B-Thinking-2507 | 30B | 70.71 / 25.26 | 86.67 / 9.79 | 8.09 / 0.63 | 90.29 / 90.29 | 41.35 / 41.35 | 97.80 / 8.15 | 65.82 / 29.25 |
| SmolLM3-3B | 3B | 27.78 / 1.38 | 30.00 / 11.44 | 35.26 / 14.20 | 83.48 / 29.39 | 71.21 / 3.55 | 90.80 / 45.35 | 56.42 / 17.55 |
| Phi-4-mini-reasoning | 4B | 30.30 / 3.05 | 40.00 / 12.83 | 32.37 / 18.39 | 87.10 / 61.12 | 51.58 / 22.05 | 90.80 / 44.21 | 55.36 / 26.94 |
| Qwen3-4B-Thinking-2507 | 4B | 67.17 / 3.68 | 73.33 / 3.65 | 4.62 / 0.23 | 89.05 / 16.71 | 38.57 / 20.79 | 97.60 / 4.86 | 61.72 / 8.32 |
| Frugal-Math-4B-Stage-1 (ours) | 4B | 63.64 / 31.22 | 60.00 / 43.73 | 35.84 / 31.54 | 89.24 / 4.44 | 39.91 / 22.43 | 95.00 / 55.51 | 63.94 / 31.48 |
| Frugal-Math-4B-Stage-2 (ours) | 4B | 70.20 / 70.20 | 70.00 / 70.00 | 47.40 / 47.40 | 89.00 / 11.15 | 39.49 / 23.20 | 95.20 / 95.20 | 68.55 / 52.86 |

Average Reasoning Length

| Model | Size | Avg Output Length (tokens) |
|---|---|---|
| Qwen3-30B-A3B-Thinking-2507 | 30B | 9,946 |
| SmolLM3-3B | 3B | 8,338 |
| Phi-4-mini-reasoning | 4B | 7,458 |
| Qwen3-4B-Thinking-2507 | 4B | 11,491 |
| Frugal-Math-4B-Stage-1 (ours) | 4B | 6,270 |
| Frugal-Math-4B-Stage-2 (ours) | 4B | 5,712 |

Conclusions

➡️ Frugal-Math-4B-Stage-2 outperforms all 4B-class baselines in both accuracy and efficiency, performing comparably to the 30B MoE model on individual benchmarks and exceeding it on average (68.55 vs. 65.82 Pass@1).

➡️ ≈50–60% reduction in reasoning length while preserving or improving performance.

Intended Use

  • Verifiable mathematical reasoning and competition-style tasks
  • Efficiency–accuracy trade-off studies in RLHF/RLVR

🚫 Limitations

  • Optimized for math reasoning only.
  • Generalization to other domains is part of ongoing research.

Citation

If you use this model, please cite:

@misc{bounhar2025frugalmath,
  title={Shorter but not Worse: Frugal Reasoning via Easy Samples as Length Regularizers in Math RLVR},
  author={Bounhar, Abdelaziz and others},
  year={2025},
  eprint={2511.01937},
  archivePrefix={arXiv}
}