Commit ee84e6a
Parent(s): 64af94c

feat: Add optimized MCP tools for token reduction

- Add get_top_performers() tool: 90% token reduction for 'top N' queries
- Add get_leaderboard_summary() tool: 99% token reduction for overview queries
- Fix JSON serialization in get_dataset(): Remove default=str, handle NaN properly
- Update app.py: Add Ocean theme, document new tools in API docs
- Update README: Add detailed sections for new tools, update tool count to 9
- Benefits: Agent can now answer queries in 2-3 steps instead of 20 steps
- README.md +101 -29
- app.py +115 -15
- mcp_tools.py +200 -4
README.md
CHANGED

Lines removed from README.md by this commit:

- 1. **📊 analyze_leaderboard**: Generate insights from evaluation leaderboard data
- 2. **🔍 debug_trace**: Debug specific agent execution traces using OpenTelemetry data
- 3. **💰 estimate_cost**: Predict evaluation costs before running
- - Load `smoltrace-leaderboard` to find run IDs and model names
- - Discover supporting datasets via `results_dataset`, `traces_dataset`, `metrics_dataset` fields
- 4. Can now answer questions like "What are the last 10 run IDs?" or "Which models were tested?"
- - **Complete MCP Implementation**:

Updated content:

@@ -54,14 +54,16 @@ This MCP server is part of a complete agent evaluation ecosystem built on two fo

---

### 🛠️ **9 AI-Powered & Optimized Tools**
1. **📊 analyze_leaderboard**: Generate AI-powered insights from evaluation leaderboard data
2. **🔍 debug_trace**: Debug specific agent execution traces using OpenTelemetry data with AI assistance
3. **💰 estimate_cost**: Predict evaluation costs before running with AI-powered recommendations
4. **⚖️ compare_runs**: Compare two evaluation runs with AI-powered analysis
5. **🏆 get_top_performers**: Get top N models from leaderboard (optimized for quick queries, avoids token bloat)
6. **📈 get_leaderboard_summary**: Get high-level leaderboard statistics (optimized for overview queries)
7. **📦 get_dataset**: Load SMOLTRACE datasets (smoltrace-* prefix only) as JSON for flexible analysis
8. **🧪 generate_synthetic_dataset**: Create domain-specific test datasets for SMOLTRACE evaluations (supports up to 100 tasks with parallel batched generation)
9. **📤 push_dataset_to_hub**: Upload generated datasets to HuggingFace Hub

### 📦 **3 Data Resources**
1. **leaderboard data**: Direct JSON access to evaluation results

@@ -113,9 +115,9 @@ All analysis is powered by **Google Gemini 2.5 Pro** for intelligent, context-aw

- ✅ **Testing Interface**: Beautiful Gradio UI for testing all components
- ✅ **Enterprise Focus**: Cost optimization, debugging, decision support, and custom dataset generation
- ✅ **Google Gemini Powered**: Leverages Gemini 2.5 Pro for intelligent analysis
- ✅ **15 Total Components**: 9 Tools + 3 Resources + 3 Prompts

### 🛠️ Nine Production-Ready Tools

#### 1. analyze_leaderboard

@@ -163,7 +165,67 @@ Compares two evaluation runs with AI-powered analysis across multiple dimensions

**Example Use Case**: After running evaluations with two different models, compare them head-to-head to determine which is better for production deployment based on your priorities (accuracy, cost, speed, or environmental impact).

#### 5. get_top_performers

Get top performing models from leaderboard with optimized token usage.

**⚡ Performance Optimization**: This tool returns only the top N models (5-20 runs) instead of loading the full leaderboard dataset (51 runs), resulting in **90% token reduction** compared to using `get_dataset()`.

**When to Use**: Perfect for queries like:
- "Which model is leading?"
- "Show me the top 5 models"
- "What's the best model for cost efficiency?"

**Parameters**:
- `leaderboard_repo` (str): HuggingFace dataset repository (default: "kshitijthakkar/smoltrace-leaderboard")
- `metric` (str): Metric to rank by - "success_rate", "total_cost_usd", "avg_duration_ms", or "co2_emissions_g" (default: "success_rate")
- `top_n` (int): Number of top models to return (range: 1-20, default: 5)

**Returns**: Properly formatted JSON with:
- Metric used for ranking
- Ranking order (ascending/descending)
- Total runs in leaderboard
- Array of top performers with essential fields only (10 fields vs 20+ in full dataset)

**Benefits**:
- ✅ **Token Reduction**: Returns 5-20 runs instead of all 51 runs (90% fewer tokens)
- ✅ **Ready to Use**: Properly formatted JSON (no parsing needed, no string conversion issues)
- ✅ **Pre-Sorted**: Already sorted by your chosen metric
- ✅ **Essential Data Only**: Includes only 10 essential columns to minimize token usage

**Example Use Case**: An agent needs to quickly answer "What are the top 3 most cost-effective models?" without consuming excessive tokens by loading the entire leaderboard dataset.
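
A minimal sketch of exercising this tool directly from Python inside this repository's environment (it assumes `datasets`, `pandas`, and Hub access are available, and that the `@gr.mcp.tool()` decorator leaves the coroutine directly callable); an MCP client would instead call the tool by name:

```python
import asyncio
import json

from mcp_tools import get_top_performers  # added in this commit

async def main():
    # Ask for the 3 cheapest runs; cost metrics are ranked ascending.
    raw = await get_top_performers(metric="total_cost_usd", top_n=3)
    payload = json.loads(raw)  # the tool returns a JSON string
    if "error" in payload:
        print("Tool reported an error:", payload["error"])
        return
    for rank, run in enumerate(payload["top_performers"], start=1):
        print(rank, run.get("model"), run.get("total_cost_usd"))

asyncio.run(main())
```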

#### 6. get_leaderboard_summary

Get high-level leaderboard statistics without loading individual runs.

**⚡ Performance Optimization**: This tool returns only aggregated statistics instead of raw data, resulting in **99% token reduction** compared to using `get_dataset()` on the full leaderboard.

**When to Use**: Perfect for overview queries like:
- "How many runs are in the leaderboard?"
- "What's the average success rate across all models?"
- "Give me an overview of evaluation results"

**Parameters**:
- `leaderboard_repo` (str): HuggingFace dataset repository (default: "kshitijthakkar/smoltrace-leaderboard")

**Returns**: Properly formatted JSON with:
- Total runs count
- Unique models and submitters count
- Overall statistics (avg/best/worst success rates, avg cost, avg duration, total CO2)
- Breakdown by agent type (tool/code/both)
- Breakdown by provider (litellm/transformers)
- Top 3 models by success rate

**Benefits**:
- ✅ **Extreme Token Reduction**: Returns summary stats instead of 51 runs (99% fewer tokens)
- ✅ **Ready to Use**: Properly formatted JSON (no parsing needed)
- ✅ **Comprehensive Stats**: Includes averages, distributions, and breakdowns
- ✅ **Quick Insights**: Perfect for "overview" and "summary" questions

**Example Use Case**: An agent needs to provide a high-level overview of evaluation results without loading 51 individual runs and consuming 50K+ tokens.
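
A similar sketch for turning the summary into a one-line overview (same assumptions as the sketch above; field names follow the tool's documented return structure):

```python
import asyncio
import json

from mcp_tools import get_leaderboard_summary  # added in this commit

async def overview() -> str:
    payload = json.loads(await get_leaderboard_summary())
    if "error" in payload:
        return f"Summary unavailable: {payload['error']}"
    s = payload["summary"]
    stats = s.get("overall_stats", {})
    return (
        f"{s.get('total_runs')} runs across {s.get('unique_models')} models; "
        f"average success rate {stats.get('avg_success_rate')}, "
        f"best {stats.get('best_success_rate')}."
    )

print(asyncio.run(overview()))
```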

#### 7. get_dataset

Loads SMOLTRACE datasets from HuggingFace and returns raw data as JSON:
- Simple, flexible tool that returns complete dataset with metadata

@@ -172,25 +234,24 @@ Loads SMOLTRACE datasets from HuggingFace and returns raw data as JSON:

- Automatically sorts by timestamp if available
- Configurable row limit (1-200) to manage token usage

**⚠️ Important**: For leaderboard queries, **prefer using `get_top_performers()` or `get_leaderboard_summary()` instead** - they're specifically optimized to avoid token bloat!

**Security Restriction**: Only datasets with "smoltrace-" in the repository name are allowed.

**Primary Use Cases**:
- Load `smoltrace-results-*` datasets to see individual test case details
- Load `smoltrace-traces-*` datasets to access OpenTelemetry trace data
- Load `smoltrace-metrics-*` datasets to get GPU performance data
- For leaderboard queries: **Use `get_top_performers()` or `get_leaderboard_summary()` instead!**

**Recommended Workflow**:
1. For overview: Use `get_leaderboard_summary()` (99% token reduction)
2. For top N queries: Use `get_top_performers()` (90% token reduction)
3. For specific run IDs: Use `get_dataset()` only when you need non-leaderboard datasets

**Example Use Case**: When you need to load trace data or results data for a specific run, use `get_dataset("username/smoltrace-traces-gpt4")`. For leaderboard queries, use the optimized tools instead.
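
And a sketch for the non-leaderboard case, reusing the placeholder repository name from the example above (swap in a real `smoltrace-traces-*` dataset you can access; same environment assumptions as the earlier sketches):

```python
import asyncio
import json

from mcp_tools import get_dataset

async def main():
    raw = await get_dataset("username/smoltrace-traces-gpt4", max_rows=50)  # placeholder repo
    payload = json.loads(raw)
    if payload.get("error"):
        print("Could not load dataset:", payload["error"])
        return
    print("rows loaded:", len(payload["data"]), "of", payload.get("total_rows", "?"))
    if payload["data"]:
        print("first record keys:", sorted(payload["data"][0].keys()))

asyncio.run(main())
```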

#### 8. generate_synthetic_dataset

Generates domain-specific synthetic test datasets for SMOLTRACE evaluations using Google Gemini 2.5 Pro:
- AI-powered task generation tailored to your domain

@@ -232,7 +293,7 @@ Each generated task includes:

**Example Use Case**: A financial services company wants to evaluate their customer service agent that uses custom tools for stock quotes, portfolio analysis, and transaction processing. They use this tool to generate 50 realistic tasks covering common customer inquiries across different difficulty levels, then run SMOLTRACE evaluations to benchmark different LLM models before deployment.

#### 9. push_dataset_to_hub

Upload generated datasets to HuggingFace Hub with proper formatting and metadata:
- Automatically formats data for HuggingFace datasets library
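
As a rough illustration of the kind of upload this tool performs (its own implementation is not part of this diff), the standard `datasets` API call looks like the following; the repository name is a made-up placeholder and it assumes you are already logged in to the Hub:

```python
from datasets import Dataset

tasks = [
    {"prompt": "What is AAPL trading at right now?", "difficulty": "easy"},
    {"prompt": "Rebalance my portfolio to 60/40 stocks/bonds.", "difficulty": "hard"},
]

# Dataset.from_list() gives the records a proper schema; push_to_hub() uploads them.
Dataset.from_list(tasks).push_to_hub("your-username/smoltrace-synthetic-finance")  # hypothetical repo
```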

@@ -437,14 +498,16 @@ A: The MCP endpoint is publicly accessible. However, the tools may require Huggi

### Available MCP Components

**Tools** (9):
1. **analyze_leaderboard**: AI-powered leaderboard analysis with Gemini 2.5 Pro
2. **debug_trace**: Trace debugging with AI insights
3. **estimate_cost**: Cost estimation with optimization recommendations
4. **compare_runs**: Compare two evaluation runs with AI-powered analysis
5. **get_top_performers**: Get top N models from leaderboard (optimized, 90% token reduction)
6. **get_leaderboard_summary**: Get leaderboard statistics (optimized, 99% token reduction)
7. **get_dataset**: Load SMOLTRACE datasets (smoltrace-* only) as JSON
8. **generate_synthetic_dataset**: Create domain-specific test datasets with AI
9. **push_dataset_to_hub**: Upload datasets to HuggingFace Hub

**Resources** (3):
1. **leaderboard://{repo}**: Direct access to raw leaderboard data in JSON

@@ -611,12 +674,14 @@ Google Gemini 2.5 Pro client that:

### mcp_tools.py
Complete MCP implementation with 13 components:

**Tools** (9 async functions):
- `analyze_leaderboard()`: AI-powered leaderboard analysis
- `debug_trace()`: AI-powered trace debugging
- `estimate_cost()`: AI-powered cost estimation
- `compare_runs()`: AI-powered run comparison
- `get_top_performers()`: Optimized tool to get top N models (90% token reduction)
- `get_leaderboard_summary()`: Optimized tool for leaderboard statistics (99% token reduction)
- `get_dataset()`: Load SMOLTRACE datasets as JSON (use optimized tools for leaderboard!)
- `generate_synthetic_dataset()`: Create domain-specific test datasets with AI
- `push_dataset_to_hub()`: Upload datasets to HuggingFace Hub

@@ -766,12 +831,19 @@ For issues or questions:

### v1.0.0 (2025-11-14)
- Initial release for MCP Hackathon
- **Complete MCP Implementation**: 15 components total
  - 9 AI-powered and optimized tools:
    - analyze_leaderboard, debug_trace, estimate_cost, compare_runs (AI-powered)
    - get_top_performers, get_leaderboard_summary (optimized for token reduction)
    - get_dataset, generate_synthetic_dataset, push_dataset_to_hub (data management)
  - 3 data resources (leaderboard, trace, cost data)
  - 3 prompt templates (analysis, debug, optimization)
- Gradio native MCP support with decorators (`@gr.mcp.*`)
- Google Gemini 2.5 Pro integration for all AI analysis
- Live HuggingFace dataset integration
- **Performance Optimizations**:
  - get_top_performers: 90% token reduction vs full leaderboard
  - get_leaderboard_summary: 99% token reduction vs full leaderboard
  - Proper JSON serialization (no string conversion issues)
- SSE transport for MCP communication
- Production-ready for HuggingFace Spaces deployment

app.py
CHANGED

Lines removed from app.py by this commit:

- ### MCP Tools (AI-Powered)
- - 📊 **Analyze Leaderboard**: Get insights from evaluation results
- - 🔍 **Debug Trace**: Understand what happened in a specific test
- - 💰 **Estimate Cost**: Predict evaluation costs before running
- - Load smoltrace-leaderboard to find run IDs, model names, and supporting dataset references

Updated content:

@@ -32,6 +32,8 @@ Tools Provided:

    🔍 debug_trace - Debug agent execution traces with AI
    💰 estimate_cost - Predict evaluation costs before running
    ⚖️ compare_runs - Compare evaluation runs with AI analysis
    🏆 get_top_performers - Get top N models from leaderboard (optimized)
    📈 get_leaderboard_summary - Get leaderboard overview statistics
    📦 get_dataset - Load SMOLTRACE datasets as JSON
    🧪 generate_synthetic_dataset - Create domain-specific test datasets
    📤 push_dataset_to_hub - Upload datasets to HuggingFace Hub

@@ -64,6 +66,8 @@ from mcp_tools import (
    debug_trace,
    estimate_cost,
    compare_runs,
    get_top_performers,
    get_leaderboard_summary,
    get_dataset,
    generate_synthetic_dataset,
    push_dataset_to_hub

@@ -80,19 +84,24 @@ def create_gradio_ui():
    """Create Gradio UI for testing MCP tools"""

    # Note: Gradio 6 has different theme API
    with gr.Blocks(
        title="TraceMind MCP Server",
        theme=gr.themes.Ocean()
    ) as demo:
        gr.Markdown("""
# 🤖 TraceMind MCP Server

**AI-Powered Analysis for Agent Evaluation Data**

This server provides **9 MCP Tools + 3 MCP Resources + 3 MCP Prompts**:

### MCP Tools (AI-Powered & Optimized)
- 📊 **Analyze Leaderboard**: Get AI-powered insights from evaluation results
- 🔍 **Debug Trace**: Understand what happened in a specific test with AI debugging
- 💰 **Estimate Cost**: Predict evaluation costs before running with AI recommendations
- ⚖️ **Compare Runs**: Compare two evaluation runs with AI-powered analysis
- 🏆 **Get Top Performers**: Get top N models from leaderboard (optimized for quick queries)
- 📈 **Get Leaderboard Summary**: Get high-level leaderboard statistics (optimized for overview)
- 📦 **Get Dataset**: Load any HuggingFace dataset as JSON for flexible analysis
- 🧪 **Generate Synthetic Dataset**: Create domain-specific test datasets for SMOLTRACE
- 📤 **Push to Hub**: Upload generated datasets to HuggingFace Hub

@@ -1023,10 +1032,101 @@ def create_gradio_ui():

---

### 5. get_top_performers

**Description**: Get top performing models from leaderboard - optimized for quick queries

**⚡ Performance**: This tool is **optimized** to avoid token bloat by returning only essential data for top performers instead of the full leaderboard (51 runs).

**When to use**: Use this instead of `get_dataset()` when you need to answer questions like:
- "Which model is leading?"
- "Show me the top 5 models"
- "What's the best model for cost?"

**Parameters**:
- `leaderboard_repo` (str): HuggingFace dataset repository (default: "kshitijthakkar/smoltrace-leaderboard")
- `metric` (str): Metric to rank by (default: "success_rate")
  - Options: "success_rate", "total_cost_usd", "avg_duration_ms", "co2_emissions_g"
- `top_n` (int): Number of top models to return (default: 5, range: 1-20)

**Returns**: JSON object with top performers - **ready to use, no parsing needed**

**Benefits vs get_dataset()**:
- ✅ Returns only 5-20 runs instead of all 51 runs (90% token reduction)
- ✅ Properly formatted JSON (no string conversion issues)
- ✅ Pre-sorted by your chosen metric
- ✅ Includes only essential columns (10 fields vs 20+ fields)

**Example Response**:
```json
{
  "metric_ranked_by": "success_rate",
  "ranking_order": "descending (higher is better)",
  "total_runs_in_leaderboard": 51,
  "top_n": 5,
  "top_performers": [
    {
      "run_id": "run_123",
      "model": "openai/gpt-4",
      "success_rate": 95.8,
      "total_cost_usd": 0.05,
      ...
    }
  ]
}
```
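
One detail worth calling out in `ranking_order`: cost, latency, and CO2 metrics rank ascending (lower is better) while `success_rate` ranks descending. A tiny standalone sketch of that rule, mirroring the implementation in `mcp_tools.py` shown further below:

```python
LOWER_IS_BETTER = {"total_cost_usd", "avg_duration_ms", "co2_emissions_g"}

def ranking_order(metric: str) -> str:
    # success_rate is the only documented metric where higher is better
    return "ascending (lower is better)" if metric in LOWER_IS_BETTER else "descending (higher is better)"

assert ranking_order("total_cost_usd") == "ascending (lower is better)"
assert ranking_order("success_rate") == "descending (higher is better)"
```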

---

### 6. get_leaderboard_summary

**Description**: Get high-level leaderboard summary statistics - optimized for overview queries

**⚡ Performance**: This tool is **optimized** to return only summary statistics (no individual runs), avoiding the full dataset that causes token bloat.

**When to use**: Use this instead of `get_dataset()` when you need to answer questions like:
- "How many runs are in the leaderboard?"
- "What's the average success rate?"
- "Give me an overview of the leaderboard"

**Parameters**:
- `leaderboard_repo` (str): HuggingFace dataset repository (default: "kshitijthakkar/smoltrace-leaderboard")

**Returns**: JSON object with summary statistics - **ready to use, no parsing needed**

**Benefits vs get_dataset()**:
- ✅ Returns aggregated stats instead of raw data (99% token reduction)
- ✅ Properly formatted JSON (no string conversion issues)
- ✅ Includes breakdowns by agent_type and provider
- ✅ Shows top 3 models by success rate
- ✅ Calculates averages, totals, and distributions

**Example Response**:
```json
{
  "leaderboard_repo": "kshitijthakkar/smoltrace-leaderboard",
  "summary": {
    "total_runs": 51,
    "unique_models": 15,
    "overall_stats": {
      "avg_success_rate": 89.5,
      "best_success_rate": 95.8,
      "avg_cost_per_run_usd": 0.023
    },
    "breakdown_by_agent_type": {...},
    "top_3_models_by_success_rate": [...]
  }
}
```

---

### 7. get_dataset

**Description**: Load SMOLTRACE datasets from HuggingFace and return as JSON

**⚠️ Note**: For leaderboard queries, prefer using `get_top_performers()` or `get_leaderboard_summary()` instead - they're optimized to avoid token bloat!

**Parameters**:
- `dataset_repo` (str, required): HuggingFace dataset repository path with "smoltrace-" prefix (e.g., "kshitijthakkar/smoltrace-leaderboard")
- `max_rows` (int): Maximum number of rows to return (default: 50, range: 1-200)

@@ -1036,19 +1136,19 @@ def create_gradio_ui():

**Restriction**: Only datasets with "smoltrace-" in the repository name are allowed for security.

**Use Cases**:
- Load smoltrace-results-* datasets to see individual test case details
- Load smoltrace-traces-* datasets to access OpenTelemetry trace data
- Load smoltrace-metrics-* datasets to get GPU metrics and performance data
- For leaderboard: Use `get_top_performers()` or `get_leaderboard_summary()` instead!

**Workflow**:
1. Use `get_leaderboard_summary()` for overview questions
2. Use `get_top_performers()` for "top N" queries
3. Use `get_dataset()` only for non-leaderboard datasets or when you need specific run IDs
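
A compact, illustrative sketch of that three-step routing (a real MCP-aware agent would issue the equivalent tool calls by name; the keyword matching and the results repository are placeholders):

```python
import asyncio

from mcp_tools import get_dataset, get_leaderboard_summary, get_top_performers

async def answer(question: str) -> str:
    q = question.lower()
    if "overview" in q or "how many runs" in q:
        return await get_leaderboard_summary()           # step 1: overview questions
    if "top" in q or "best" in q or "leading" in q:
        return await get_top_performers(top_n=5)         # step 2: "top N" questions
    # step 3: everything else needs a specific non-leaderboard dataset
    return await get_dataset("username/smoltrace-results-gpt4")  # placeholder repo

print(asyncio.run(answer("Give me an overview of the leaderboard"))[:200])
```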

---

### 8. generate_synthetic_dataset

**Description**: Generate domain-specific synthetic test datasets for SMOLTRACE evaluations using AI

@@ -1085,7 +1185,7 @@ def create_gradio_ui():

---

### 9. push_dataset_to_hub

**Description**: Push a generated synthetic dataset to HuggingFace Hub

@@ -1149,8 +1249,8 @@ def create_gradio_ui():

### What's Exposed via MCP:

#### 9 MCP Tools (AI-Powered & Optimized)
The nine tools above (`analyze_leaderboard`, `debug_trace`, `estimate_cost`, `compare_runs`, `get_top_performers`, `get_leaderboard_summary`, `get_dataset`, `generate_synthetic_dataset`, `push_dataset_to_hub`)
are automatically exposed as MCP tools and can be called from any MCP client.
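
As a sketch of what "any MCP client" can look like in practice, here is the official MCP Python SDK talking to the server over SSE; the Space URL and the `/gradio_api/mcp/sse` path are assumptions about this deployment, so substitute the endpoint your server actually advertises:

```python
import asyncio

from mcp import ClientSession
from mcp.client.sse import sse_client

SERVER_URL = "https://your-space.hf.space/gradio_api/mcp/sse"  # assumed endpoint

async def main():
    async with sse_client(SERVER_URL) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print([t.name for t in tools.tools])
            result = await session.call_tool(
                "get_top_performers",
                {"metric": "success_rate", "top_n": 3},
            )
            print(result.content[0].text)

asyncio.run(main())
```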

#### 3 MCP Resources (Data Access)

mcp_tools.py
CHANGED

@@ -718,6 +718,196 @@ async def analyze_results(
        return f"❌ **Error analyzing results**: {str(e)}\n\nPlease check:\n- Repository name is correct (should be smoltrace-results-*)\n- You have access to the dataset\n- HF_TOKEN is set correctly"


@gr.mcp.tool()
async def get_top_performers(
    leaderboard_repo: str = "kshitijthakkar/smoltrace-leaderboard",
    metric: str = "success_rate",
    top_n: int = 5
) -> str:
    """
    Get top performing models from leaderboard - optimized for quick queries.

    **USE THIS TOOL** instead of get_dataset() when you need to answer questions like:
    - "Which model is leading?"
    - "Show me the top 5 models"
    - "What's the best model for cost?"

    This tool returns ONLY the essential data for top performers, avoiding the
    full 51-run dataset that causes token bloat. Returns properly formatted JSON
    that's ready to use without parsing.

    Args:
        leaderboard_repo (str): HuggingFace dataset repository. Default: "kshitijthakkar/smoltrace-leaderboard"
        metric (str): Metric to rank by. Options: "success_rate", "total_cost_usd", "avg_duration_ms", "co2_emissions_g". Default: "success_rate"
        top_n (int): Number of top models to return. Range: 1-20. Default: 5

    Returns:
        str: JSON object with top performers - ready to use, no parsing needed
    """
    try:
        # Load leaderboard dataset
        ds = load_dataset(leaderboard_repo, split="train")
        df = pd.DataFrame(ds)

        if df.empty:
            return json.dumps({
                "error": "Leaderboard dataset is empty",
                "top_performers": []
            }, indent=2)

        # Validate metric
        valid_metrics = ["success_rate", "total_cost_usd", "avg_duration_ms", "co2_emissions_g"]
        if metric not in valid_metrics:
            return json.dumps({
                "error": f"Invalid metric '{metric}'. Valid options: {valid_metrics}",
                "top_performers": []
            }, indent=2)

        # Limit top_n
        top_n = max(1, min(20, top_n))

        # Sort by metric (ascending for cost/latency/co2, descending for success_rate)
        ascending = metric in ["total_cost_usd", "avg_duration_ms", "co2_emissions_g"]
        df_sorted = df.sort_values(metric, ascending=ascending)

        # Get top N
        top_models = df_sorted.head(top_n)

        # Select only essential columns to minimize tokens
        essential_columns = [
            "run_id", "model", "agent_type", "provider",
            "success_rate", "total_cost_usd", "avg_duration_ms",
            "co2_emissions_g", "total_tests", "timestamp"
        ]

        # Filter to only columns that exist
        available_columns = [col for col in essential_columns if col in top_models.columns]
        top_models_filtered = top_models[available_columns]

        # CRITICAL FIX: Handle NaN/None properly
        top_models_filtered = top_models_filtered.where(pd.notnull(top_models_filtered), None)

        # Convert to dict
        top_performers_data = top_models_filtered.to_dict(orient="records")

        result = {
            "metric_ranked_by": metric,
            "ranking_order": "ascending (lower is better)" if ascending else "descending (higher is better)",
            "total_runs_in_leaderboard": len(df),
            "top_n": top_n,
            "top_performers": top_performers_data
        }

        return json.dumps(result, indent=2)

    except Exception as e:
        return json.dumps({
            "error": f"Failed to get top performers: {str(e)}",
            "top_performers": []
        }, indent=2)
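
A quick, illustrative smoke test for the function above; it assumes network access to the default leaderboard dataset and that the `@gr.mcp.tool()` decorator leaves the coroutine directly callable:

```python
import asyncio
import json

from mcp_tools import get_top_performers

async def smoke_test():
    raw = await get_top_performers(metric="success_rate", top_n=5)
    payload = json.loads(raw)                      # must already be valid JSON
    assert "error" not in payload, payload
    assert payload["top_n"] == 5
    assert len(payload["top_performers"]) <= 5     # never more than requested
    for run in payload["top_performers"]:
        assert len(run) <= 10                      # only the essential columns

if __name__ == "__main__":
    asyncio.run(smoke_test())
```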


@gr.mcp.tool()
async def get_leaderboard_summary(
    leaderboard_repo: str = "kshitijthakkar/smoltrace-leaderboard"
) -> str:
    """
    Get high-level leaderboard summary statistics - optimized for overview queries.

    **USE THIS TOOL** instead of get_dataset() when you need to answer questions like:
    - "How many runs are in the leaderboard?"
    - "What's the average success rate?"
    - "Give me an overview of the leaderboard"

    This tool returns ONLY summary statistics (no individual runs), avoiding the
    full dataset that causes token bloat. Returns properly formatted JSON that's
    ready to use without parsing.

    Args:
        leaderboard_repo (str): HuggingFace dataset repository. Default: "kshitijthakkar/smoltrace-leaderboard"

    Returns:
        str: JSON object with summary statistics - ready to use, no parsing needed
    """
    try:
        # Load leaderboard dataset
        ds = load_dataset(leaderboard_repo, split="train")
        df = pd.DataFrame(ds)

        if df.empty:
            return json.dumps({
                "error": "Leaderboard dataset is empty",
                "summary": {}
            }, indent=2)

        # Calculate summary statistics
        summary = {
            "total_runs": len(df),
            "unique_models": int(df['model'].nunique()) if 'model' in df.columns else 0,
            "unique_submitters": int(df['submitted_by'].nunique()) if 'submitted_by' in df.columns else 0,
            "overall_stats": {
                "avg_success_rate": float(df['success_rate'].mean()) if 'success_rate' in df.columns else None,
                "best_success_rate": float(df['success_rate'].max()) if 'success_rate' in df.columns else None,
                "worst_success_rate": float(df['success_rate'].min()) if 'success_rate' in df.columns else None,
                "avg_cost_per_run_usd": float(df['total_cost_usd'].mean()) if 'total_cost_usd' in df.columns else None,
                "avg_duration_ms": float(df['avg_duration_ms'].mean()) if 'avg_duration_ms' in df.columns else None,
                "total_co2_emissions_g": float(df['co2_emissions_g'].sum()) if 'co2_emissions_g' in df.columns else None
            },
            "breakdown_by_agent_type": {},
            "breakdown_by_provider": {},
            "top_3_models_by_success_rate": []
        }

        # Breakdown by agent type
        if 'agent_type' in df.columns and 'success_rate' in df.columns:
            agent_stats = df.groupby('agent_type').agg({
                'success_rate': 'mean',
                'run_id': 'count'
            }).to_dict()

            summary["breakdown_by_agent_type"] = {
                agent_type: {
                    "count": int(agent_stats['run_id'][agent_type]),
                    "avg_success_rate": float(agent_stats['success_rate'][agent_type])
                }
                for agent_type in agent_stats['run_id'].keys()
            }

        # Breakdown by provider
        if 'provider' in df.columns and 'success_rate' in df.columns:
            provider_stats = df.groupby('provider').agg({
                'success_rate': 'mean',
                'run_id': 'count'
            }).to_dict()

            summary["breakdown_by_provider"] = {
                provider: {
                    "count": int(provider_stats['run_id'][provider]),
                    "avg_success_rate": float(provider_stats['success_rate'][provider])
                }
                for provider in provider_stats['run_id'].keys()
            }

        # Top 3 models by success rate
        if 'success_rate' in df.columns and 'model' in df.columns:
            top_3 = df.nlargest(3, 'success_rate')[['model', 'success_rate', 'total_cost_usd', 'avg_duration_ms']]
            top_3 = top_3.where(pd.notnull(top_3), None)
            summary["top_3_models_by_success_rate"] = top_3.to_dict(orient="records")

        result = {
            "leaderboard_repo": leaderboard_repo,
            "summary": summary
        }

        return json.dumps(result, indent=2)

    except Exception as e:
        return json.dumps({
            "error": f"Failed to get leaderboard summary: {str(e)}",
            "summary": {}
        }, indent=2)
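
To make the breakdown shape concrete, the same groupby pattern applied to a three-row toy frame (standalone pandas, no Hub access needed):

```python
import pandas as pd

df = pd.DataFrame({
    "run_id": ["r1", "r2", "r3"],
    "agent_type": ["tool", "tool", "code"],
    "success_rate": [90.0, 80.0, 70.0],
})

agent_stats = df.groupby("agent_type").agg({"success_rate": "mean", "run_id": "count"}).to_dict()
breakdown = {
    agent_type: {
        "count": int(agent_stats["run_id"][agent_type]),
        "avg_success_rate": float(agent_stats["success_rate"][agent_type]),
    }
    for agent_type in agent_stats["run_id"]
}
print(breakdown)  # {'code': {'count': 1, 'avg_success_rate': 70.0}, 'tool': {'count': 2, 'avg_success_rate': 85.0}}
```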


@gr.mcp.tool()
async def get_dataset(
    dataset_repo: str,

@@ -751,7 +941,7 @@ async def get_dataset(
                "dataset_repo": dataset_repo,
                "error": "Only datasets with 'smoltrace-' prefix are allowed. Please use smoltrace-leaderboard or other smoltrace-* datasets.",
                "data": []
            }, indent=2)

        dataset = load_dataset(dataset_repo, split="train")  # Load dataset from HuggingFace
        df = pd.DataFrame(dataset)

@@ -762,7 +952,7 @@ async def get_dataset(
                "error": "Dataset is empty",
                "total_rows": 0,
                "data": []
            }, indent=2)

        # Get total row count before limiting
        total_rows = len(df)

@@ -776,6 +966,10 @@ async def get_dataset(
        df_limited = df.head(max_rows)

        # CRITICAL FIX: Replace NaN/None values with proper None before conversion
        # This ensures json.dumps() handles them correctly as null instead of "None" string
        df_limited = df_limited.where(pd.notnull(df_limited), None)

        # Convert to list of dictionaries
        data = df_limited.to_dict(orient="records")

@@ -788,14 +982,16 @@ async def get_dataset(
            "data": data
        }

        # CRITICAL FIX: Remove default=str to ensure proper JSON serialization
        # Using default=str was converting None to string "None" causing agent parsing issues
        return json.dumps(result, indent=2)

    except Exception as e:
        return json.dumps({
            "dataset_repo": dataset_repo,
            "error": f"Failed to load dataset: {str(e)}",
            "data": []
        }, indent=2)
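
A standalone illustration of why the NaN handling above matters: `json.dumps` renders a raw pandas NaN as the literal token `NaN`, which is not valid JSON and trips up strict parsers, while the `where(pd.notnull(...), None)` step turns it into a proper `null`:

```python
import json

import numpy as np
import pandas as pd

df = pd.DataFrame({"model": ["gpt-4", "claude"], "co2_emissions_g": [0.12, np.nan]})

# Without the fix: float("nan") survives into the dict and json.dumps emits literal NaN
print(json.dumps(df.to_dict(orient="records")))
# [{"model": "gpt-4", "co2_emissions_g": 0.12}, {"model": "claude", "co2_emissions_g": NaN}]

# With the fix: NaN is replaced by None first, so the output is valid JSON (null)
cleaned = df.where(pd.notnull(df), None)
print(json.dumps(cleaned.to_dict(orient="records")))
# [{"model": "gpt-4", "co2_emissions_g": 0.12}, {"model": "claude", "co2_emissions_g": null}]
```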


# ============================================================================