---
title: OpenThoughts Model Benchmark Explorer
emoji: 📊
colorFrom: blue
colorTo: red
sdk: streamlit
sdk_version: 1.28.0
app_file: benchmark_explorer_app.py
pinned: false
license: mit
---

# 🔬 OpenThoughts Evalchemy Benchmark Explorer

Explore correlations and relationships between LLM performance across different reasoning benchmarks.
This explorer is built on top of the [OpenThoughts](https://github.com/open-thoughts/open-thoughts) project and covers the models that we have trained and evaluated, as well as external models that we have evaluated.
All evaluation results were produced and logged using [Evalchemy](https://github.com/mlfoundations/evalchemy).

## Features

### 📊 Overview Dashboard
- Key metrics and dataset statistics
- Benchmark coverage visualization
- Quick correlation insights
- Category-based analysis

### 🔥 Interactive Heatmap
- Multiple correlation methods (Pearson, Spearman, Kendall)
- Interactive hover tooltips
- Real-time correlation statistics
- Distribution analysis
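
The three correlation methods listed above can be computed with pandas' `DataFrame.corr`. The following is a minimal sketch, not the app's actual code: the score table, the model names, and the helper `correlation_matrices` are made up for illustration, and it assumes scores are stored with one row per model and one column per benchmark.

```python
import pandas as pd

# Hypothetical scores: rows are models, columns are benchmarks (accuracy in %).
scores = pd.DataFrame(
    {
        "AIME24": [10.0, 20.0, 30.0, 40.0, 50.0],
        "MATH500": [55.0, 60.0, 70.0, 72.0, 90.0],
        "GPQADiamond": [30.0, 28.0, 35.0, 33.0, 45.0],
    },
    index=["model_a", "model_b", "model_c", "model_d", "model_e"],
)


def correlation_matrices(df):
    """Return one benchmark-by-benchmark correlation matrix per method."""
    methods = ("pearson", "spearman", "kendall")
    # DataFrame.corr ignores missing values pairwise, so models that lack
    # a score on one benchmark still contribute to the other pairs.
    return {method: df.corr(method=method) for method in methods}


matrices = correlation_matrices(scores)
print(matrices["spearman"].round(2))
```

Pearson measures linear association, while Spearman and Kendall depend only on how the models are ranked, so the rank-based methods are less sensitive to a few outlying scores.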

### 📈 Scatter Plot Explorer
- Dynamic benchmark selection
- Interactive scatter plots with regression lines
- Multiple correlation coefficients
- Data point exploration
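
As a rough illustration of what a regression line plus several correlation coefficients involves for one pair of benchmarks, here is a sketch using SciPy. The function `pairwise_summary` and the sample values are hypothetical, and the assumption that models missing either score are dropped before fitting is ours, not something this README confirms.

```python
import numpy as np
from scipy import stats


def pairwise_summary(x, y):
    """Fit y ~ x and report correlation coefficients for one benchmark pair."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Assumption: only models with a score on both benchmarks are used.
    keep = ~(np.isnan(x) | np.isnan(y))
    x, y = x[keep], y[keep]
    fit = stats.linregress(x, y)
    return {
        "n": int(keep.sum()),
        "slope": float(fit.slope),
        "intercept": float(fit.intercept),
        "pearson": float(stats.pearsonr(x, y)[0]),
        "spearman": float(stats.spearmanr(x, y)[0]),
        "kendall": float(stats.kendalltau(x, y)[0]),
    }


# Hypothetical scores for five models; the last model has no score on x.
summary = pairwise_summary(
    [10.0, 20.0, 30.0, 40.0, float("nan")],
    [25.0, 45.0, 65.0, 85.0, 50.0],
)
print(summary)
```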

### 🎯 Model Performance Analysis
- Model search and filtering
- Performance rankings
- Radar chart comparisons
- Side-by-side model analysis

### 📋 Statistical Summary
- Comprehensive dataset statistics
- Benchmark-wise analysis
- Export capabilities
- Correlation summaries

### 🔬 Uncertainty Analysis
- Measurement precision analysis
- Error bar visualizations with 95% CI
- Signal-to-noise ratios
- Uncertainty-aware correlations
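
To make the 95% CI and signal-to-noise ideas concrete, the sketch below uses a simple binomial model: each score is treated as an accuracy over a fixed number of independent questions. This is an assumption for illustration only. The app may instead derive standard errors from repeated evaluation runs, and both function names and the signal-to-noise definition here (spread across models divided by average standard error) are hypothetical.

```python
import math

Z_95 = 1.96  # two-sided 95% quantile of the standard normal distribution


def binomial_ci(accuracy, n_questions):
    """Return (lower, upper, standard_error) for an accuracy in [0, 1]."""
    se = math.sqrt(accuracy * (1.0 - accuracy) / n_questions)
    return accuracy - Z_95 * se, accuracy + Z_95 * se, se


def signal_to_noise(accuracies, n_questions):
    """Spread of scores across models divided by the mean standard error."""
    mean = sum(accuracies) / len(accuracies)
    # Sample standard deviation across models (the "signal").
    spread = math.sqrt(
        sum((a - mean) ** 2 for a in accuracies) / (len(accuracies) - 1)
    )
    # Average per-model measurement error (the "noise").
    noise = sum(binomial_ci(a, n_questions)[2] for a in accuracies) / len(accuracies)
    return spread / noise


# Hypothetical benchmark with 100 questions.
lower, upper, se = binomial_ci(0.5, 100)
snr = signal_to_noise([0.2, 0.5, 0.8], 100)
print(f"95% CI: [{lower:.3f}, {upper:.3f}], SNR: {snr:.2f}")
```

Under this model, a benchmark with few questions has wide intervals, so small score differences between models on it carry little information.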

## Benchmark Categories

- **Math** (red): AIME24, AIME25, AMC23, MATH500
- **Code** (blue): CodeElo, CodeForces, LiveCodeBench v2 & v5
- **Science** (green): GPQADiamond, JEEBench
- **General** (orange): MMLUPro, HLE
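
One way to represent this grouping in code is a plain dictionary keyed by category. The structure, the name `BENCHMARK_CATEGORIES`, the helper `category_of`, and the exact benchmark identifier strings below are assumptions for illustration; the app's internal names may differ.

```python
# Hypothetical mapping of categories to display colors and benchmarks,
# mirroring the list above.
BENCHMARK_CATEGORIES = {
    "Math": {
        "color": "red",
        "benchmarks": ["AIME24", "AIME25", "AMC23", "MATH500"],
    },
    "Code": {
        "color": "blue",
        "benchmarks": ["CodeElo", "CodeForces", "LiveCodeBench v2", "LiveCodeBench v5"],
    },
    "Science": {
        "color": "green",
        "benchmarks": ["GPQADiamond", "JEEBench"],
    },
    "General": {
        "color": "orange",
        "benchmarks": ["MMLUPro", "HLE"],
    },
}


def category_of(benchmark):
    """Return the category name for a benchmark, or None if it is unknown."""
    for name, info in BENCHMARK_CATEGORIES.items():
        if benchmark in info["benchmarks"]:
            return name
    return None


print(category_of("GPQADiamond"))
```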

## Data Filtering Options

- Category-based filtering
- Zero-value filtering with threshold
- Minimum coverage requirements
- Dynamic slider ranges based on actual data
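
The sketch below shows one plausible reading of these filters in pandas: scores at or below a threshold are treated as missing, and models are kept only if they have enough remaining scores. The function `filter_scores`, its parameters, and the sample data are hypothetical, and the app's actual filtering semantics may differ.

```python
import pandas as pd


def filter_scores(df, benchmarks, zero_threshold=0.0, min_coverage=2):
    """Select benchmarks, mask near-zero scores, and drop sparse models."""
    subset = df[benchmarks]
    # Zero-value filtering: scores at or below the threshold become NaN.
    subset = subset.where(subset > zero_threshold)
    # Minimum coverage: keep models with enough non-missing scores.
    covered = subset.notna().sum(axis=1) >= min_coverage
    return subset[covered]


# Hypothetical scores: rows are models, columns are benchmarks.
scores = pd.DataFrame(
    {
        "AIME24": [0.0, 20.0, 30.0],
        "MATH500": [50.0, float("nan"), 70.0],
        "CodeElo": [5.0, 0.0, 12.0],
    },
    index=["model_a", "model_b", "model_c"],
)

strict = filter_scores(scores, ["AIME24", "MATH500"], min_coverage=2)
loose = filter_scores(scores, ["AIME24", "MATH500"], min_coverage=1)
print(strict)
```

With this layout, the upper bound for a coverage slider can be taken from the data itself, for example the number of selected benchmarks.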

## Architecture

- **Frontend**: Streamlit with Plotly interactive visualizations
- **Backend**: Pandas/NumPy for data processing, SciPy for statistics
- **Caching**: Caching to avoid repeated data loading and recomputation
- **Real-time**: On-the-fly correlation computation for dynamic filtering

## Usage

The application automatically loads benchmark data and provides six specialized analysis modules. Use the sidebar controls to filter data and customize the analysis based on your needs.

The explorer is intended for researchers, practitioners, and anyone interested in understanding how different AI evaluation benchmarks relate to one another.