VibeStudio/MiniMax-M2-THRIFT-55-v1
Targeted Reduction for Inference and Fine-Tuning — ~55% Expert Pruned
A lean, efficiency-first variant of MiniMax-M2 designed to cut latency and VRAM use while raising throughput for local, on-prem, and edge deployments.
TL;DR
- What: ~55% expert-pruned MoE with staged pruning + knowledge distillation.
- Why: Push the efficiency frontier for compact, responsive deployments.
- Now: Ready for experimentation with solid coverage across core evals and more on the way.
Why it’s useful
- Lower latency: Fast, responsive interactions for interactive apps and tools.
- Smaller memory footprint: Fits tighter VRAM budgets and increases node density.
- Higher throughput: Serve more concurrent users on the same hardware.
- Deployment-friendly: Drops in smoothly via SGLang with an OpenAI-compatible API (see the example after this list).
- Adaptable: Plays well with light fine-tuning to match domain and style.
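As a concrete starting point for the SGLang route above, here is a minimal sketch. It assumes a locally launched server on port 30000; the model path mirrors this repo, and the sampling settings are placeholders, not a recommended configuration.

```python
# Launch the server first (shell), e.g.:
#   python -m sglang.launch_server --model-path VibeStudio/MiniMax-M2-THRIFT-55-v1 --port 30000
# Exact flags (tensor parallelism, quantization, etc.) depend on your hardware.

from openai import OpenAI

# SGLang exposes an OpenAI-compatible endpoint; the API key is unused locally.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="VibeStudio/MiniMax-M2-THRIFT-55-v1",  # served model name; match your launch command
    messages=[{"role": "user", "content": "Summarize what expert pruning does."}],
    temperature=0.7,  # placeholder sampling settings
)
print(resp.choices[0].message.content)
```

Because the endpoint is OpenAI-compatible, existing clients only need their base_url pointed at the local server.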
Intended use
- Local/air-gapped assistants and dev tools
- Cost-sensitive batch and real-time services
- Edge and on-prem deployments prioritizing efficiency
How Our Approach Works
Active research in progress — we continue to iterate and expand ablations.
1. Teacher–student setup: Start with MiniMax-M2 as the teacher and a copy of it as the student.
2. Gradual expert pruning: Remove ≈5% of experts per stage over ~11 stages (≈55% total), guided by importance scores with a lightweight Leave-One-Expert-Out (LOEO) check to retain rare-but-important experts (a pruning sketch follows this list).
3. Distill after each prune: Retrain the student to imitate the teacher (a loss sketch also follows) on:
   - Outputs (token probability distributions),
   - Hidden states, and
   - Router behavior over the surviving experts.
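The exact importance metric and LOEO criterion are not published here; the following is a minimal sketch of one pruning stage, assuming importance is measured as mean router probability mass per expert. `heldout_loss`, `LOEO_TOLERANCE`, and `prune_frac` are illustrative stand-ins, not names from the released recipe.

```python
import torch

LOEO_TOLERANCE = 0.01  # hypothetical: max tolerated held-out loss increase per removed expert

def expert_importance(router_probs: torch.Tensor) -> torch.Tensor:
    # Mean routing probability mass each expert receives over a calibration set.
    # router_probs: [num_tokens, num_experts], rows sum to 1.
    return router_probs.mean(dim=0)

def prune_one_stage(router_probs, heldout_loss, prune_frac=0.05):
    """One of the ~11 stages: shortlist the lowest-importance experts, then drop
    only those whose removal barely moves a held-out loss (leave-one-expert-out)."""
    importance = expert_importance(router_probs)
    num_experts = importance.numel()
    k = max(1, int(round(prune_frac * num_experts)))
    # Shortlist with a 2x margin so LOEO can reject rare-but-important experts.
    shortlist = torch.argsort(importance)[: 2 * k].tolist()

    base = heldout_loss(masked=frozenset())
    pruned = []
    for e in shortlist:
        delta = heldout_loss(masked=frozenset([e])) - base  # LOEO check
        if delta < LOEO_TOLERANCE:   # cheap to lose: prune it
            pruned.append(e)
        if len(pruned) == k:         # stop once this stage's ~5% quota is met
            break
    return pruned

# Toy usage with random routing stats and a stand-in held-out loss.
probs = torch.softmax(torch.randn(4096, 64), dim=-1)
def heldout_loss(masked):            # stand-in for a real calibration-set forward pass
    return 2.0 + 0.001 * len(masked)
print(prune_one_stage(probs, heldout_loss))
```

Likewise, a minimal sketch of the three distillation targets, assuming PyTorch; the loss weights, temperature `T`, and tensor layout (`logits`, `hidden`, `router_logits`) are illustrative assumptions, not the released training setup.

```python
import torch
import torch.nn.functional as F

def distill_loss(student, teacher, surviving, T=2.0, w_out=1.0, w_hid=0.5, w_route=0.5):
    """Combine the three imitation targets listed above.
    student/teacher: dicts with 'logits' [B, V], 'hidden' [B, D], 'router_logits' [B, E];
    surviving: LongTensor of expert indices kept after pruning."""
    # 1) Outputs: KL between temperature-softened token distributions.
    out_kl = F.kl_div(
        F.log_softmax(student["logits"] / T, dim=-1),
        F.softmax(teacher["logits"] / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # 2) Hidden states: MSE between matched-layer representations.
    hid_mse = F.mse_loss(student["hidden"], teacher["hidden"])
    # 3) Router behavior: KL over the surviving experts only, renormalized.
    t_route = F.softmax(teacher["router_logits"][:, surviving], dim=-1)
    s_route = F.log_softmax(student["router_logits"][:, surviving], dim=-1)
    route_kl = F.kl_div(s_route, t_route, reduction="batchmean")
    return w_out * out_kl + w_hid * hid_mse + w_route * route_kl

# Toy usage with random tensors standing in for model outputs.
B, V, D, E = 8, 512, 256, 64
mk = lambda: {"logits": torch.randn(B, V), "hidden": torch.randn(B, D),
              "router_logits": torch.randn(B, E)}
keep = torch.arange(0, E, 2)  # e.g. the experts surviving a stage
print(distill_loss(mk(), mk(), keep).item())
```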
See also: Run AI Coding Agents Fully Locally (Mac Studio, DGX Spark, AMD AI Max): https://github.com/latent-variable/minimax-agent-guide
Base model: MiniMaxAI/MiniMax-M2