reasoning-datasets-competition (Reasoning datasets competition )

codelion

posted an update 1 day ago

Post

580

Recently, Essential AI released a new 8B base model EssentialAI/rnj-1 they highlighted the importance of data mix for pretraning -

"In the long run, we expect our methods to automatically represent, transform, and blend data to optimize measurable abilities in pre-training. Our work on modeling data taxonomies led to new approaches for jointly clustering and mixing data distributions under data repetition penalties. Many improvements in our STEM abilities can be traced back to this. "

This resonates with the recent work we did around optimal dataset mixing for pretraining where we saw have the right mix can increase the efficiency of training -
https://huggingface.co/blog/codelion/optimal-dataset-mixing

codelion

posted an update 3 days ago

Post

2474

NotebookLM's infographics feature is amazing, it generates poster-type images from any text. Here is one I tried for my new HF article on ellora - https://huggingface.co/blog/codelion/ellora-lora-recipes

ZennyKenny

posted an update 3 days ago

Post

151

What a trip. Just walked through @burtenshaw and @evalstate tutorial on adding Hugging Face Skills to your Claude Code agent so you can fine tune LLMs by chatting with AI.

These are the kinds of innovations that are going to help everyone benefit from the power of Artificial Intelligence. Well done gentlemen and thank you for sharing.

1 reply

·

codelion

posted an update 5 days ago

Post

2244

Perplexity released a dataset (BrowseSafe) and benchmark to catch and prevent malicious prompt-injection instructions in real-time.

We trained a prompt injection classifier on BrowseSafe using adaptive-classifier with ModernBERT-base embeddings.

74.9% F1 on detecting prompt injection in web content.

Model -> adaptive-classifier/browsesafe
Dataset -> perplexity-ai/browsesafe-bench
Repo -> https://github.com/codelion/adaptive-classifier

1 reply

·

codelion

posted an update 6 days ago

Post

1568

I just published Ellora - 6 production-ready LoRA recipes for enhancing LLMs with specific capabilities. Each recipe costs under $100 to run and includes complete training code, data generation, and evaluation.

The 6 Recipes:
Recipe 1: Accuracy Recovery - Recover 75% of quantization losses with self-distillation
Recipe 2: Reasoning LoRA - Add structured thinking with GRPO (0% to 60% adoption, 75% quality boost)
Recipe 3: Tool Calling - Real execution on actual codebases
Recipe 4: Context Extension - Scale from 32K to 2M tokens (61x increase)
Recipe 5: Secure Code Generation - 97% vulnerability reduction using automated Semgrep analysis
Recipe 6: Execution-Aware World Models - Teaching models runtime behavior

Why Recipes?
Ellora provides methodologies, not frameworks. Use them with your existing tools (PEFT, LoRAX, vLLM, Unsloth, HuggingFace). Each recipe uses self-supervised data generation (Magpie approach) - no expensive human labeling required.

All recipes include Jupyter notebooks you can run immediately with clear success metrics.

GitHub: https://github.com/codelion/ellora
Full Article: https://huggingface.co/blog/codelion/ellora-lora-recipes

Built something with these recipes? I'd love to see what you create!

ZennyKenny

posted an update 8 days ago

Post

239

😐 I keep seeing takes on LinkedIn from American business influencers melting down about Silicon Valley startup "dependence" on open-source Chinese models.

🤔 Can anyone describe a credible scenario where these models can be leveraged by the Chinese government to endanger American security interests or am I right to believe that this is just Red Scare nonsense?

2 replies

·

ZennyKenny

posted an update 18 days ago

Post

419

The #feedback channel of app early access Slack Workspaces is some of the best unintentional comedy material I have ever come across tbh.

codelion

posted an update 18 days ago

Post

1961

Introducing OpenEvolve Prompt Optimizer - a Space that automatically evolves and optimizes your prompts using OpenEvolve!

This tool uses OpenEvolve to iteratively improve prompts by testing them on real datasets and evolving better versions. No more manual prompt engineering guesswork - let OpenEvolve find the optimal prompts for you.

How it works:
- Enter your initial prompt using {input} as a placeholder for dataset inputs
- Input any HuggingFace dataset name you want to use for optimization
- Specify the dataset split and field names for your use case
- Click Optimize Prompt and the system will validate everything first
- Compare your initial prompt vs the evolved best prompt side-by-side

Try it here: algorithmicsuperintelligence/prompt-optimizer

OpenEvolve GitHub: https://github.com/algorithmicsuperintelligence/openevolve

ZennyKenny

posted an update 21 days ago

Post

3142

🎉 Wow. Congratulations @bfirsh and the Replicate team on the CloudFlare acquisition!

✌️ You've really built an incredible ecosystem and product offering and should be super proud.

codelion

posted an update 24 days ago

Post

2457

🎯 Introducing Chayan: A Calibrated 4-Model LLM Router Achieving 69% Accuracy on RouterArena

We're excited to share Chayan, a cost-efficient LLM router that intelligently routes queries between 4 models to maximize accuracy while minimizing cost. Chayan just submitted to the RouterArena leaderboard and achieved 69.05% accuracy on the benchmark!

🔗 Model: adaptive-classifier/chayan
🔗 Dataset: RouteWorks/RouterArena

📊 Performance Highlights

Chayan achieves impressive results on the RouterArena benchmark:
• 69.05% accuracy (would rank #1 on current leaderboard)
• $0.333 per 1K queries
• +12.07pp improvement over all-mini baseline (56.98%)
• 99% of perfect 2-model oracle performance at 57% lower cost

Compared to our previous 2-model router (61.43% accuracy), Chayan delivers +7.62pp improvement through smarter 4-model routing.

🧠 How It Works

Chayan uses an Adaptive K-NN classifier with prototype memory to route between 4 models:
• openai/gpt-4o-mini (fast & cheap)
• google/gemini-2.5-flash-lite (balanced)
• google/gemini-2.5-flash (capable)
• openai/gpt-4o (most powerful)

🚀 Getting Started

You can use Chayan directly from HuggingFace:

from adaptive_classifier import AdaptiveClassifier

Load Chayan
router = AdaptiveClassifier.load("adaptive-classifier/chayan")

Route a query
query = "What is the capital of France?"
predictions = router.predict(query, k=4)

Get top model recommendation
best_model = predictions[0][0]
print(f"Recommended model: {best_model}")

Built with the adaptive-classifier library: https://github.com/codelion/adaptive-classifier

codelion

posted an update 29 days ago

Post

3977

Want to experiment with pre-training dataset mixtures but don't want to process terabytes of data? We've got you covered.

We're releasing a collection of several carefully curated 1B token dataset samples specifically designed for rapid prototyping and pretraining experiments: https://huggingface.co/collections/codelion/pre-training-dataset-samples

These samples were created using reservoir sampling - an algorithm that guarantees statistically unbiased random samples from massive source datasets. This means results you get at the 1B token scale are representative of how these datasets behave at 100B+ token scales, letting you iterate quickly without the computational overhead.

The collection includes:
- finePDFs-1B: High-quality textbook-style educational content
- DCLM-baseline-1B: Filtered, diverse web content
- FineWeb-Edu-1B: Curated educational web resources

We used these exact samples to run 50+ systematic experiments on dataset mixing strategies, ultimately discovering that a 50-30-20 mixture of finePDFs + DCLM-baseline + FineWeb-Edu achieves 90%+ of GPT-2's performance with just 1/10th the training data.

Whether you're researching optimal data mixtures, testing curriculum learning strategies, or just want to quickly prototype a pretraining run, these samples give you a solid foundation to start experimenting immediately.

Read the full story of how we used these datasets to find the optimal pretraining recipe: https://huggingface.co/blog/codelion/optimal-dataset-mixing

1 reply

·

ZennyKenny

posted an update 30 days ago

Post

329

🎉 Novoyaz is live.

A few months ago, I built a quick POC in Hugging Face that used a fine-tuned variant of OpenAI's OSS-20B model that I trained to convert the text from pre-reform Russian-language documents into modern Russian orthography.

⚡️ This morning, I launched novoyaz.io.

This is a production app, the frontend for which I built in like two hours with Lovable, that uses that same fine-tuned model for transliteration, but now has a bunch of extra features that make using it even easier (like taking and uploading pictures with your on-device camera for example 😅).

👉 If you're a researcher, or know a researcher, for whom this app will improve their day-to-day workflows, please get in touch with me.

codelion

posted an update about 1 month ago

Post

271

MARS Achieves Strong Results on Google DeepMind's IMO-Bench

We evaluated OptiLLM's MARS (Multi-Agent Reasoning System) approach on IMO-Bench, Google DeepMind's challenging mathematical reasoning benchmark with International Mathematical Olympiad-level problems.

What is MARS?

MARS is a multi-agent reasoning technique that works with any LLM. It uses 3 parallel reasoning agents that independently solve problems, then verifies their solutions through consensus and iterative refinement. The key advantage: it's model-agnostic and can be applied to any base model through OptiLLM's inference proxy.

Results on IMO-Bench:

AnswerBench (400 short-answer problems):
MARS: 36.0% (144/400 correct)
Baseline: 24.5% (98/400 correct)
Improvement: +11.5pp across all domains

Category breakdown:
- Algebra: 33% (vs 21% baseline)
- Combinatorics: 26% (vs 19% baseline)
- Geometry: 43% (vs 28% baseline)
- Number Theory: 42% (vs 30% baseline)

ProofBench (60 proof construction problems):
MARS: 26.7% (16/60 correct)
Baseline: 18.3% (11/60 correct)
Improvement: +8.4pp

Category breakdown:
- Number Theory: 42.9% (vs 14.3% baseline)
- Combinatorics: 37.5% (vs 31.2% baseline)
- Algebra: 18.8% (vs 25.0% baseline)
- Geometry: 7.1% (vs 0.0% baseline)

All results achieved using google/gemini-2.5-flash-lite-preview-09-2025 as the base model. The same MARS approach can enhance reasoning for any model through OptiLLM's OpenAI-compatible API.

Datasets available at:
AnswerBench: huggingface.co/datasets/Hwilner/imo-answerbench
ProofBench: huggingface.co/datasets/Hwilner/imo-proofbench

Try it yourself:

python optillm.py --approach mars --model google/gemini-2.5-flash-lite-preview-09-2025

Or via API with approach prefix:

model: "mars-google/gemini-2.5-flash-lite-preview-09-2025"

Full evaluation code and results available at: github.com/algorithmicsuperintelligence/optillm

codelion

posted an update about 1 month ago

Post

3602

On this day in 2019, OpenAI released the final GPT-2 model as part of their staged release. I still remember that November well - so much was happening, but GPT-2's release felt like a watershed moment for the field. It showed us what was possible with carefully trained language models.

To recreate some of that GPT-2 magic, I recently tackled an interesting challenge: can you pretrain a language model with just 1 billion tokens - roughly 1/10th of what GPT-2 used - and still get comparable performance? After 50+ systematic experiments testing different dataset mixtures, the answer is yes.

The result is codelion/gpt-2-70m, which achieves over 90% of GPT-2's benchmark performance despite being trained on 10x less data. The key was finding the optimal dataset composition: 50% high-quality textbook PDFs, 30% filtered web content, and 20% educational resources. It even beats GPT-2 on TruthfulQA (47.31% vs 40.69%).

If you're interested in the full story of how we discovered this optimal mixture and why curriculum learning catastrophically failed, check out the complete article: https://huggingface.co/blog/codelion/optimal-dataset-mixing

Sometimes less really is more - when you mix it right.

1 reply

·

codelion

posted an update about 1 month ago

Post

387

The 1 Billion Token Challenge: Finding the Perfect Pre-training Mix

We trained a GPT-2 model to 90%+ performance using just 1/10th the training data through 50+ systematic experiments on dataset mixing strategies.

Key Finding:

A static mix of 50% finePDFs + 30% DCLM-baseline + 20% FineWeb-Edu consistently outperforms complex curriculum learning approaches. Static mixing is simpler, faster, and avoids catastrophic failures from hard distribution shifts.

Results:

Our GPT-2-70M model (70M parameters, 1B tokens) scores 38.15% on benchmarks vs GPT-2's 39.13% - only 0.98 points behind despite 10x less data and 44% fewer parameters. It even beats GPT-2 on TruthfulQA (47.31% vs 40.69%).

The takeaway: careful dataset curation matters more than total data volume.

Model: codelion/gpt-2-70m

Datasets: https://huggingface.co/collections/codelion/pre-training-dataset-samples

Full blog: https://huggingface.co/blog/codelion/optimal-dataset-mixing

ZennyKenny

posted an update about 1 month ago

Post

342

Anyone got the scoop on a good OCR model that's available on inference?

Keen to make use of an endpoint (gated or not -- happy to pay for usage) for a personal project, but not so keen to pay for the GPU hosting myself.

🙈🙈🙈

4 replies

·

ZennyKenny

posted an update about 2 months ago

Post

326

Has anyone tried Strawberry Browser? https://strawberrybrowser.com/?ref_id=8D41NQCY7

😇 Shamelessly sharing my referral link here to move up in the waitlist line. Help me out, give it a click.

2 replies

·

codelion

posted an update about 2 months ago

Post

3252

🧠 Introducing Ellora Recipe #6: Execution-Aware World Model for Qwen3-4B-Thinking

Teaching LLMs to understand not just what code does, but HOW it executes at runtime!

Inspired by Meta's CWM (Code World Model) research, this LoRA adapter adds execution awareness to Qwen3-4B-Thinking-2507. The model learns to predict variable states, trace program execution step-by-step, and debug code by understanding runtime behavior.

🔍 Key Innovation:
We combine Qwen3's native thinking capabilities with real Python execution traces captured via sys.settrace(). The model is trained using GRPO with a custom reward function that scores execution prediction accuracy.

📊 Training Approach:
- Hybrid Magpie-style code generation
- Real execution tracing for ground truth
- Self-supervised learning (no manual annotations!)
- 298 training samples with execution traces

✨ What it does:
- Predicts variable states at each line of code
- Explains execution flow with thinking tags
- Helps debug by understanding runtime behavior
- Works as a "neural debugger"

🎯 Results:
- 20% overall accuracy on execution prediction
- 33.3% mean state accuracy
- Trained on Qwen3-4B-Thinking (262K context, 4B params)

🔗 Links:
Model: codelion/Qwen3-4B-execution-world-model-lora
Dataset: codelion/execution-world-model-dataset
GitHub Recipe: https://github.com/codelion/ellora
Notebook: https://github.com/codelion/ellora/blob/main/Ellora_Recipe_6_Execution_World_Model_Thinking_LoRA.ipynb

Part of the Ellora project - standardized LoRA recipes for enhancing LLM capabilities. All recipes use self-supervised data generation and work with existing infrastructure (PEFT, LoRAX, vLLM).

#LLM #LoRA #CodeGeneration #WorldModel #Qwen #AI #MachineLearning

ZennyKenny

posted an update about 2 months ago

Post

2172

Did Hugging Face just ban hammer a bunch of bot accounts or am I just so uninteresting that 30% of my subs dropped me overnight?

😬 Wait, don't answer that.

2 replies

·

ZennyKenny

authored a paper about 2 months ago

BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution

Paper • 2510.08697 • Published Oct 9 • 36

Reasoning datasets competition

AI & ML interests

Recent Activity

BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution

AI & ML interests

Recent Activity

Team members 41

reasoning-datasets-competition's activity