Commit 34f1a7a · Parent: 880ef7f

docs: Deploy final documentation package

- ARCHITECTURE.md +1035 -0
- MCP_INTEGRATION.md +706 -0
- README.md +318 -343
- USER_GUIDE.md +1026 -0
# TraceMind-AI - Technical Architecture

This document provides a deep technical dive into the TraceMind-AI architecture, implementation details, and system design.

## Table of Contents

- [System Overview](#system-overview)
- [Project Structure](#project-structure)
- [Core Components](#core-components)
- [MCP Client Architecture](#mcp-client-architecture)
- [Agent Framework Integration](#agent-framework-integration)
- [Data Flow](#data-flow)
- [Authentication & Authorization](#authentication--authorization)
- [Screen Navigation](#screen-navigation)
- [Job Submission Architecture](#job-submission-architecture)
- [Deployment](#deployment)
- [Performance Optimization](#performance-optimization)

---

## System Overview

TraceMind-AI is a comprehensive Gradio-based web application for evaluating AI agent performance. It serves as the user-facing platform in the TraceMind ecosystem, demonstrating enterprise MCP client usage (Track 2: MCP in Action).

### Technology Stack

| Component | Technology | Version | Purpose |
|-----------|-----------|---------|---------|
| **UI Framework** | Gradio | 5.49.1 | Web interface with components |
| **MCP Client** | MCP Python SDK | Latest | Connect to MCP servers |
| **Agent Framework** | smolagents | 1.22.0+ | Autonomous agent with MCP tools |
| **Data Source** | HuggingFace Datasets | Latest | Load evaluation results |
| **Authentication** | HuggingFace OAuth | - | User authentication |
| **Job Platforms** | HF Jobs + Modal | - | Evaluation job submission |
| **Language** | Python | 3.10+ | Core implementation |

### High-Level Architecture

```
┌───────────────────────────────────────────────────────────────┐
│                         User Browser                          │
│   - Gradio Interface (React-based)                            │
│   - OAuth Flow (HuggingFace)                                  │
└──────────────┬────────────────────────────────────────────────┘
               │
               │ HTTP/WebSocket
               ▼
┌───────────────────────────────────────────────────────────────┐
│              TraceMind-AI (Gradio App) - Track 2              │
│                                                               │
│   ┌───────────────────────────────────────────────────────┐   │
│   │              Screen Layer (screens/)                  │   │
│   │   - Leaderboard                                       │   │
│   │   - Agent Chat                                        │   │
│   │   - New Evaluation                                    │   │
│   │   - Job Monitoring                                    │   │
│   │   - Trace Detail                                      │   │
│   │   - Settings                                          │   │
│   └─────────────┬─────────────────────────────────────────┘   │
│                 │                                             │
│   ┌─────────────┴─────────────────────────────────────────┐   │
│   │           Component Layer (components/)               │   │
│   │   - Leaderboard Table (Custom HTML)                   │   │
│   │   - Analytics Charts                                  │   │
│   │   - Metric Displays                                   │   │
│   │   - Report Cards                                      │   │
│   └─────────────┬─────────────────────────────────────────┘   │
│                 │                                             │
│   ┌─────────────┴─────────────────────────────────────────┐   │
│   │                   Service Layer                       │   │
│   │   ┌──────────────────┐   ┌──────────────────┐         │   │
│   │   │    MCP Client    │   │   Data Loader    │         │   │
│   │   │  (mcp_client/)   │   │ (data_loader.py) │         │   │
│   │   └──────────────────┘   └──────────────────┘         │   │
│   │   ┌──────────────────┐   ┌──────────────────┐         │   │
│   │   │ Agent (smolagents│   │  Job Submission  │         │   │
│   │   │ screens/chat.py) │   │     (utils/)     │         │   │
│   │   └──────────────────┘   └──────────────────┘         │   │
│   └───────────────────────────────────────────────────────┘   │
│                                                               │
└───────────┬────────────────────────────────────┬──────────────┘
            │                                    │
            ▼                                    ▼
┌─────────────────────────┐         ┌─────────────────────────┐
│  TraceMind MCP Server   │         │   External Services     │
│       (Track 1)         │         │   - HF Datasets         │
│   - 11 AI Tools         │         │   - HF Jobs             │
│   - 3 Resources         │         │   - Modal               │
│   - 3 Prompts           │         │   - LLM APIs            │
└─────────────────────────┘         └─────────────────────────┘
```

---

## Project Structure

```
TraceMind-AI/
├── app.py                        # Main entry point, Gradio app
│
├── screens/                      # UI screens (6 tabs)
│   ├── __init__.py
│   ├── leaderboard.py            # Screen 1: Leaderboard with AI insights
│   ├── chat.py                   # Screen 2: Agent Chat (smolagents)
│   ├── dashboard.py              # Screen 3: New Evaluation
│   ├── job_monitoring.py         # Screen 4: Job Status Tracking
│   ├── trace_detail.py           # Screen 5: Trace Visualization
│   ├── settings.py               # Screen 6: API Key Configuration
│   ├── compare.py                # Screen 7: Run Comparison (optional)
│   ├── documentation.py          # Screen 8: API Documentation
│   └── mcp_helpers.py            # Shared MCP client helpers
│
├── components/                   # Reusable UI components
│   ├── __init__.py
│   ├── leaderboard_table.py      # Custom HTML table component
│   ├── analytics_charts.py       # Performance charts (Plotly)
│   ├── metric_displays.py        # Metric cards and badges
│   ├── report_cards.py           # Summary report cards
│   └── thought_graph.py          # Agent reasoning visualization
│
├── mcp_client/                   # MCP client implementation
│   ├── __init__.py
│   ├── client.py                 # Async MCP client
│   └── sync_wrapper.py           # Synchronous wrapper for Gradio
│
├── utils/                        # Utility modules
│   ├── __init__.py
│   ├── auth.py                   # HuggingFace OAuth
│   ├── navigation.py             # Screen navigation state
│   ├── hf_jobs_submission.py     # HuggingFace Jobs integration
│   └── modal_job_submission.py   # Modal integration
│
├── styles/                       # Custom styling
│   ├── __init__.py
│   └── tracemind_theme.py        # Gradio theme customization
│
├── data_loader.py                # Dataset loading and caching
├── requirements.txt              # Python dependencies
├── .env.example                  # Environment variable template
├── .gitignore
├── README.md                     # Project documentation
└── USER_GUIDE.md                 # Complete user guide

Total: ~35 files, ~8,000 lines of code
```

### File Breakdown

| Directory | Files | Lines | Purpose |
|-----------|-------|-------|---------|
| `screens/` | 9 | ~3,500 | UI screen implementations |
| `components/` | 5 | ~1,200 | Reusable UI components |
| `mcp_client/` | 3 | ~800 | MCP client integration |
| `utils/` | 4 | ~1,500 | Authentication, jobs, navigation |
| `styles/` | 2 | ~300 | Custom theme and CSS |
| Root | 3 | ~700 | Main app, data loader, config |

---

## Core Components

### 1. app.py - Main Application

**Purpose**: Entry point, orchestrates all screens and manages global state.

**Architecture**:

```python
# app.py structure
import gradio as gr
from screens import *
from mcp_client.sync_wrapper import get_sync_mcp_client
from utils.auth import auth_ui
from data_loader import DataLoader

# 1. Initialize services
mcp_client = get_sync_mcp_client()
mcp_client.initialize()
data_loader = DataLoader()

# 2. Create Gradio app
with gr.Blocks(theme=tracemind_theme) as app:
    # Global state
    gr.State(...)  # User session, navigation, etc.

    # Authentication (if not disabled)
    if not DISABLE_OAUTH:
        auth_ui()

    # Main tabs
    with gr.Tabs():
        with gr.Tab("📊 Leaderboard"):
            leaderboard_screen()

        with gr.Tab("🤖 Agent Chat"):
            chat_screen()

        with gr.Tab("🚀 New Evaluation"):
            dashboard_screen()

        with gr.Tab("📈 Job Monitoring"):
            job_monitoring_screen()

        with gr.Tab("⚙️ Settings"):
            settings_screen()

# 3. Launch
if __name__ == "__main__":
    app.launch(
        server_name="0.0.0.0",
        server_port=7860,
        share=False
    )
```

**Key Responsibilities**:
- Initialize MCP client and data loader (global instances)
- Create tabbed interface with all screens
- Manage authentication flow
- Handle global state (user session, API keys)

---

### 2. Screen Layer (screens/)

Each screen is a self-contained module that returns a Gradio component tree.

#### screens/leaderboard.py

**Purpose**: Display evaluation results with AI-powered insights.

**Components**:
- Load button
- AI insights panel (Markdown) - powered by MCP server
- Leaderboard table (custom HTML component)
- Filter controls (agent type, provider)

**MCP Integration**:
```python
def load_leaderboard(mcp_client):
    # 1. Load dataset (select the split so we get a Dataset, not a DatasetDict)
    ds = load_dataset("kshitijthakkar/smoltrace-leaderboard", split="train")
    df = pd.DataFrame(ds)

    # 2. Get AI insights from MCP server
    insights = mcp_client.analyze_leaderboard(
        metric_focus="overall",
        time_range="last_week",
        top_n=5
    )

    # 3. Render table with custom component
    table_html = render_leaderboard_table(df)

    return insights, table_html
```

#### screens/chat.py

**Purpose**: Autonomous agent interface with MCP tool access.

**Agent Setup**:
```python
import os

from smolagents import ToolCallingAgent, MCPClient, HfApiModel

# Initialize agent with MCP client
def create_agent():
    mcp_client = MCPClient(MCP_SERVER_URL)

    model = HfApiModel(
        model_id="Qwen/Qwen2.5-Coder-32B-Instruct",
        token=os.getenv("HF_TOKEN")
    )

    agent = ToolCallingAgent(
        tools=[],  # MCP tools loaded automatically
        model=model,
        mcp_client=mcp_client,
        max_steps=10
    )

    return agent

# Chat interaction
def agent_chat(message, history, show_reasoning):
    if show_reasoning:
        agent.verbosity_level = 2  # Show tool execution
    else:
        agent.verbosity_level = 0  # Only final answer

    response = agent.run(message)
    history.append((message, response))

    return history, ""
```

**MCP Tool Access**:
The agent automatically discovers and uses all 11 MCP tools from the TraceMind MCP Server.

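The "global instances" idea above (construct the MCP client and data loader once, then share them across every screen) can be sketched as a lazy module-level singleton. This is a minimal stand-in, not the actual `get_sync_mcp_client` implementation; `ExampleClient` is a hypothetical placeholder for any expensive-to-construct service.

```python
from functools import lru_cache

class ExampleClient:
    """Stand-in for a service that is expensive to construct (e.g. an MCP client)."""
    def __init__(self):
        self.initialized = False

    def initialize(self):
        # Real code would open connections, discover tools, etc.
        self.initialized = True

@lru_cache(maxsize=1)
def get_client() -> ExampleClient:
    # First call constructs and initializes; every later call
    # returns the same cached instance.
    client = ExampleClient()
    client.initialize()
    return client
```

Because `lru_cache(maxsize=1)` memoizes the zero-argument factory, every screen that calls `get_client()` shares one instance, which is the behavior the app relies on.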
#### screens/dashboard.py

**Purpose**: Submit evaluation jobs to HuggingFace Jobs or Modal.

**Key Functions**:
- Model selection (text input)
- Infrastructure choice (HF Jobs / Modal)
- Hardware selection (auto / manual)
- Cost estimation (MCP-powered)
- Job submission

**Cost Estimation Flow**:
```python
def estimate_cost_click(model, agent_type, num_tests, hardware, mcp_client):
    # Call MCP server for cost estimate
    estimate = mcp_client.estimate_cost(
        model=model,
        agent_type=agent_type,
        num_tests=num_tests,
        hardware=hardware
    )

    return estimate  # Display in dialog
```

**Job Submission Flow**:
```python
def submit_job(model, agent_type, hardware, infrastructure, api_keys):
    if infrastructure == "HuggingFace Jobs":
        job_id = submit_hf_job(model, agent_type, hardware, api_keys)
    elif infrastructure == "Modal":
        job_id = submit_modal_job(model, agent_type, hardware, api_keys)

    return f"✅ Job submitted: {job_id}"
```

#### screens/job_monitoring.py

**Purpose**: Track status of submitted jobs.

**Data Source**: HuggingFace Jobs API or Modal API

**Refresh Strategy**:
- Manual refresh button
- Auto-refresh every 30 seconds (optional)

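The refresh strategy boils down to a polling loop over the job-status API. A minimal, framework-free sketch of that loop is below; `fetch_status` stands in for whatever HF Jobs or Modal call the app actually makes, and the terminal state names are illustrative.

```python
import itertools
import time

def poll_job_status(fetch_status, interval_s=30.0, max_polls=120, sleep=time.sleep):
    """Poll fetch_status() until a terminal state or the poll budget runs out."""
    terminal = {"COMPLETED", "ERROR", "CANCELLED"}  # illustrative state names
    status = "UNKNOWN"
    for _ in range(max_polls):
        status = fetch_status()
        if status in terminal:
            return status
        sleep(interval_s)  # injectable for testing; real code waits 30s
    return status

# Fake API for demonstration: RUNNING twice, then COMPLETED.
states = itertools.chain(["RUNNING", "RUNNING"], itertools.repeat("COMPLETED"))
result = poll_job_status(lambda: next(states), interval_s=0, sleep=lambda s: None)
```

Injecting `sleep` and `fetch_status` keeps the loop testable; in the UI the same logic would be driven by the manual refresh button or a timer rather than a blocking loop.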
#### screens/trace_detail.py

**Purpose**: Visualize OpenTelemetry traces with GPU metrics.

**Components**:
- Waterfall diagram (spans timeline)
- Span details panel
- GPU metrics overlay (for GPU jobs)
- MCP-powered Q&A

**Trace Loading**:
```python
def load_trace(trace_id, traces_repo):
    # Load trace dataset (select the split so filtering/indexing works)
    ds = load_dataset(traces_repo, split="train")
    trace_data = ds.filter(lambda x: x["trace_id"] == trace_id)[0]

    # Render waterfall
    waterfall_html = render_waterfall(trace_data["spans"])

    return waterfall_html
```

**MCP Q&A**:
```python
def ask_trace_question(trace_id, traces_repo, question, mcp_client):
    # Call MCP server to debug the trace
    answer = mcp_client.debug_trace(
        trace_id=trace_id,
        traces_repo=traces_repo,
        question=question
    )

    return answer
```

#### screens/settings.py

**Purpose**: Configure API keys and preferences.

**Security**:
- Keys stored in Gradio State (session-only, not server-side)
- All forms use `api_name=False` (not exposed via API)
- HTTPS encryption for all API calls

**Configuration Options**:
- Gemini API Key
- HuggingFace Token
- Modal Token ID + Secret
- LLM Provider Keys (OpenAI, Anthropic, etc.)

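Since keys live only in session state, the settings UI should also avoid echoing a stored secret back in full. A small masking helper like the hypothetical one below (not part of the actual codebase) is a common complement to this design:

```python
def mask_key(key: str, visible: int = 4) -> str:
    """Show only the last few characters of a secret for display in the UI."""
    if not key:
        return ""
    if len(key) <= visible:
        # Too short to partially reveal; mask everything.
        return "*" * len(key)
    return "*" * (len(key) - visible) + key[-visible:]
```

For example, a stored token would render as `*******efgh` in a status field, confirming that a key is set without disclosing it.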
---

### 3. Component Layer (components/)

Reusable UI components that can be used across multiple screens.

#### components/leaderboard_table.py

**Purpose**: Custom HTML table with sorting, filtering, and styling.

**Why Custom Component?**:
- Gradio's default Dataframe component lacks advanced styling
- Need clickable rows for navigation
- Custom sorting and filtering logic
- Badge rendering for metrics

**Implementation**:
```python
def render_leaderboard_table(df: pd.DataFrame) -> str:
    """Render leaderboard as interactive HTML table"""

    html = """
    <style>
        .leaderboard-table { ... }
        .metric-badge { ... }
    </style>
    <table class="leaderboard-table">
        <thead>
            <tr>
                <th onclick="sortTable(0)">Model</th>
                <th onclick="sortTable(1)">Success Rate</th>
                <th onclick="sortTable(2)">Cost</th>
                ...
            </tr>
        </thead>
        <tbody>
    """

    for idx, row in df.iterrows():
        html += f"""
        <tr onclick="selectRun('{row['run_id']}')">
            <td>{row['model']}</td>
            <td><span class="badge success">{row['success_rate']}%</span></td>
            <td>${row['total_cost_usd']:.4f}</td>
            ...
        </tr>
        """

    html += """
        </tbody>
    </table>
    <script>
        function sortTable(col) { ... }
        function selectRun(runId) {
            // Trigger Gradio event to navigate to run detail
            document.dispatchEvent(new CustomEvent('runSelected', {detail: runId}));
        }
    </script>
    """

    return html
```

**Integration with Gradio**:
```python
# In leaderboard screen
table_html = gr.HTML()

load_btn.click(
    fn=lambda: render_leaderboard_table(df),
    outputs=table_html
)
```

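One caveat with hand-built HTML tables: any dataset value interpolated into the markup should be escaped first, otherwise a model name containing `<` or quotes will break the table (or inject script). A sketch of a cell-rendering helper using the standard library (`render_cell` is a hypothetical name, not a function from the codebase):

```python
import html

def render_cell(value) -> str:
    """Escape a dataset value before interpolating it into table HTML."""
    return f"<td>{html.escape(str(value))}</td>"
```

`html.escape` also escapes quotes by default, which matters for values placed inside attributes such as the `onclick` handler.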
#### components/analytics_charts.py

**Purpose**: Performance charts using Plotly.

**Charts Provided**:
- Success rate over time (line chart)
- Cost comparison (bar chart)
- Duration distribution (histogram)
- CO2 emissions by model (pie chart)

**Example**:
```python
import plotly.graph_objects as go

def create_cost_comparison_chart(df):
    fig = go.Figure(data=[
        go.Bar(
            x=df['model'],
            y=df['total_cost_usd'],
            marker_color='indianred'
        )
    ])

    fig.update_layout(
        title="Cost Comparison by Model",
        xaxis_title="Model",
        yaxis_title="Total Cost (USD)"
    )

    return fig
```

#### components/thought_graph.py

**Purpose**: Visualize agent reasoning steps (for Agent Chat).

**Visualization**:
- Graph nodes: Reasoning steps, tool calls
- Edges: Flow between steps
- Annotations: Tool results, errors

---

### 4. MCP Client Layer (mcp_client/)

#### mcp_client/client.py - Async MCP Client

**Purpose**: Connect to TraceMind MCP Server via the MCP protocol.

**Implementation**: (See [MCP_INTEGRATION.md](MCP_INTEGRATION_TRACEMIND_AI.md) for full code)

**Key Methods**:
- `connect()`: Establish SSE connection to MCP server
- `call_tool(tool_name, arguments)`: Call an MCP tool
- `analyze_leaderboard(**kwargs)`: Wrapper for analyze_leaderboard tool
- `estimate_cost(**kwargs)`: Wrapper for estimate_cost tool
- `debug_trace(**kwargs)`: Wrapper for debug_trace tool

#### mcp_client/sync_wrapper.py - Synchronous Wrapper

**Purpose**: Provide a synchronous API for Gradio event handlers.

**Why Needed?**: Gradio event handlers are synchronous, but the MCP client is async.

**Pattern**:
```python
class SyncMCPClient:
    def __init__(self, mcp_server_url):
        self.async_client = AsyncMCPClient(mcp_server_url)

    def _run_async(self, coro):
        """Run async coroutine in sync context"""
        loop = asyncio.get_event_loop()
        return loop.run_until_complete(coro)

    def analyze_leaderboard(self, **kwargs):
        """Synchronous wrapper"""
        return self._run_async(self.async_client.analyze_leaderboard(**kwargs))
```

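The wrapper pattern in miniature, with a toy async client standing in for the real MCP client: when no event loop is already running in the calling thread, `asyncio.run` is a simpler alternative to `get_event_loop().run_until_complete(...)`, since it creates and tears down a fresh loop per call.

```python
import asyncio

class ToyAsyncClient:
    """Stand-in for the async MCP client; not the real implementation."""
    async def analyze(self, metric: str) -> str:
        await asyncio.sleep(0)  # stands in for network I/O
        return f"analysis of {metric}"

class ToySyncWrapper:
    def __init__(self):
        self._client = ToyAsyncClient()

    def analyze(self, metric: str) -> str:
        # asyncio.run creates a fresh event loop, runs the coroutine to
        # completion, and closes the loop -- safe from plain sync code.
        return asyncio.run(self._client.analyze(metric))
```

Each sync method simply delegates to its async counterpart, which is exactly the shape of `SyncMCPClient` above.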
---

### 5. Data Loader (data_loader.py)

**Purpose**: Load and cache HuggingFace datasets.

**Features**:
- In-memory caching (5-minute TTL)
- Error handling for missing datasets
- Automatic retry logic
- Dataset validation

**Implementation**:
```python
import time

import pandas as pd
from datasets import load_dataset

class DataLoader:
    def __init__(self):
        self.cache = {}
        self.cache_ttl = 300  # 5 minutes

    def load_leaderboard(self, repo="kshitijthakkar/smoltrace-leaderboard"):
        """Load leaderboard with caching"""
        cache_key = f"leaderboard:{repo}"

        # Check cache
        if cache_key in self.cache:
            cached_time, cached_data = self.cache[cache_key]
            if time.time() - cached_time < self.cache_ttl:
                return cached_data

        # Load fresh data
        ds = load_dataset(repo, split="train")
        df = pd.DataFrame(ds)

        # Cache
        self.cache[cache_key] = (time.time(), df)

        return df

    def load_results(self, repo):
        """Load results dataset for specific run"""
        ds = load_dataset(repo, split="train")
        return pd.DataFrame(ds)

    def load_traces(self, repo):
        """Load traces dataset for specific run"""
        ds = load_dataset(repo, split="train")
        return ds  # Keep as Dataset for filtering
```

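The "automatic retry logic" listed among the features is not shown in the snippet above; one plausible shape for it is exponential backoff around the load call. This is a hedged sketch, not the actual implementation; `load` is any flaky callable (e.g. a `load_dataset` closure).

```python
import time

def load_with_retry(load, attempts=3, base_delay_s=1.0, sleep=time.sleep):
    """Call load(), retrying with exponential backoff on failure."""
    last_error = None
    for attempt in range(attempts):
        try:
            return load()
        except Exception as exc:  # e.g. transient network / Hub errors
            last_error = exc
            if attempt < attempts - 1:
                # 1s, 2s, 4s, ... between attempts
                sleep(base_delay_s * (2 ** attempt))
    raise last_error
```

The injectable `sleep` parameter keeps the helper unit-testable without real delays.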
---

## MCP Client Architecture

**Full details in**: [MCP_INTEGRATION.md](MCP_INTEGRATION_TRACEMIND_AI.md)

**Summary**:
- **Async Client**: `mcp_client/client.py` - async MCP protocol implementation
- **Sync Wrapper**: `mcp_client/sync_wrapper.py` - synchronous API for Gradio
- **Global Instance**: Initialized once in `app.py`, shared across all screens

**Usage Pattern**:
```python
# In app.py (initialization)
from mcp_client.sync_wrapper import get_sync_mcp_client
mcp_client = get_sync_mcp_client()
mcp_client.initialize()

# In screen (usage)
def some_event_handler(mcp_client):
    result = mcp_client.analyze_leaderboard(metric_focus="cost")
    return result
```

---

## Agent Framework Integration

**Full details in**: [MCP_INTEGRATION.md](MCP_INTEGRATION_TRACEMIND_AI.md)

**Framework**: smolagents (HuggingFace's agent framework)

**Key Features**:
- Autonomous tool discovery from MCP server
- Multi-step reasoning with tool chaining
- Context-aware responses
- Reasoning visualization (optional)

**Agent Setup**:
```python
from smolagents import ToolCallingAgent, MCPClient, HfApiModel

agent = ToolCallingAgent(
    tools=[],  # Empty - tools loaded from MCP server
    model=HfApiModel("Qwen/Qwen2.5-Coder-32B-Instruct"),
    mcp_client=MCPClient(MCP_SERVER_URL),
    max_steps=10
)
```

---

## Data Flow

### Leaderboard Loading Flow

```
1. User clicks "Load Leaderboard"
   ↓
2. Gradio Event Handler (leaderboard.py)
   load_leaderboard()
   ↓
3. Data Loader (data_loader.py)
   ├── Check cache (5-min TTL)
   │   └── If cached: return cached data
   └── If not cached: load from HF Datasets
       └── load_dataset("kshitijthakkar/smoltrace-leaderboard")
   ↓
4. MCP Client (sync_wrapper.py)
   mcp_client.analyze_leaderboard(metric_focus="overall")
   ↓
5. MCP Server (TraceMind-mcp-server)
   ├── Load data
   ├── Call Gemini API
   └── Return AI analysis
   ↓
6. Render Components
   ├── AI Insights (Markdown)
   └── Leaderboard Table (Custom HTML)
   ↓
7. Display to User
```

| 687 |
+
### Agent Chat Flow
|
| 688 |
+
|
| 689 |
+
```
|
| 690 |
+
1. User types message: "What are the top 3 models?"
|
| 691 |
+
β
|
| 692 |
+
2. Gradio Event Handler (chat.py)
|
| 693 |
+
agent_chat(message, history, show_reasoning)
|
| 694 |
+
β
|
| 695 |
+
3. smolagents Agent
|
| 696 |
+
agent.run(message)
|
| 697 |
+
βββ Step 1: Plan approach
|
| 698 |
+
β βββ "Need to get top models from leaderboard"
|
| 699 |
+
βββ Step 2: Discover MCP tools
|
| 700 |
+
β βββ Found: get_top_performers, analyze_leaderboard
|
| 701 |
+
βββ Step 3: Call MCP tool
|
| 702 |
+
β βββ get_top_performers(metric="success_rate", top_n=3)
|
| 703 |
+
βββ Step 4: Parse result
|
| 704 |
+
β βββ Extract model names, success rates, costs
|
| 705 |
+
βββ Step 5: Format response
|
| 706 |
+
βββ Generate markdown table with insights
|
| 707 |
+
β
|
| 708 |
+
4. Return to user with full reasoning trace (if enabled)
|
| 709 |
+
```
|

### Job Submission Flow

```
1. User fills form → Clicks "Submit Evaluation"
   ↓
2. Gradio Event Handler (dashboard.py)
   submit_job(model, agent_type, hardware, infrastructure)
   ↓
3. Job Submission Module (utils/)
   if infrastructure == "HuggingFace Jobs":
   └── hf_jobs_submission.py
       ├── Build job config (YAML)
       ├── Submit via HF Jobs API
       └── Return job_id
   elif infrastructure == "Modal":
   └── modal_job_submission.py
       ├── Build Modal app config
       ├── Submit via Modal SDK
       └── Return job_id
   ↓
4. Store job_id in session state
   ↓
5. Redirect to Job Monitoring screen
   ↓
6. Auto-refresh status every 30s
```
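
The 30-second auto-refresh in step 6 can be sketched as a simple polling helper (an illustrative sketch only; `get_job_status` is a hypothetical callback standing in for the HF Jobs or Modal status API):

```python
import time

# Terminal states after which polling stops (assumed set, for illustration)
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED"}

def poll_job_status(job_id, get_job_status, interval_s=30, max_polls=120):
    """Poll a job until it reaches a terminal state, checking every interval_s seconds."""
    for _ in range(max_polls):
        status = get_job_status(job_id)  # hypothetical status lookup
        if status in TERMINAL_STATES:
            return status
        time.sleep(interval_s)
    return "TIMEOUT"
```

In the real app the refresh is driven by the UI rather than a blocking loop, but the terminal-state check is the same.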

---

## Authentication & Authorization

### HuggingFace OAuth

**Implementation**: `utils/auth.py`

**Flow**:
```
1. User visits TraceMind-AI
   ↓
2. Check OAuth token in session
   ├── If valid: proceed to app
   └── If invalid: show login screen
   ↓
3. User clicks "Sign in with HuggingFace"
   ↓
4. Redirect to HuggingFace OAuth page
   ├── User authorizes TraceMind-AI
   └── HF redirects back with token
   ↓
5. Store token in Gradio State (session)
   ↓
6. Use token for:
   ├── HF Datasets access
   ├── HF Jobs submission
   └── User identification
```

**Code**:
```python
# utils/auth.py
import gradio as gr

def auth_ui():
    """Create OAuth login UI"""
    gr.LoginButton(
        value="Sign in with HuggingFace",
        auth_provider="huggingface"
    )

# In app.py
with gr.Blocks() as app:
    if not DISABLE_OAUTH:
        auth_ui()
```

### API Key Storage

**Strategy**: Session-only storage (not server-side persistence)

**Implementation**:
```python
# In settings screen
import os
import gradio as gr

def save_api_keys(gemini_key, hf_token):
    """Store keys in session state"""
    session_state = gr.State({
        "gemini_key": gemini_key,
        "hf_token": hf_token
    })

    # Override default clients with user keys
    if gemini_key:
        os.environ["GEMINI_API_KEY"] = gemini_key
    if hf_token:
        os.environ["HF_TOKEN"] = hf_token

    return "✅ API keys saved for this session"
```

**Security**:
- ✅ Keys stored only in browser memory
- ✅ Not saved to disk or database
- ✅ Forms use `api_name=False` (not exposed via API)
- ✅ HTTPS encryption

---

## Screen Navigation

### State Management

**Pattern**: Gradio State components for session data

```python
# In app.py
with gr.Blocks() as app:
    # Global state
    session_state = gr.State({
        "user": None,
        "current_run_id": None,
        "current_trace_id": None,
        "api_keys": {}
    })

    # Pass to all screens
    leaderboard_screen(session_state)
    chat_screen(session_state)
```

### Navigation Between Screens

**Pattern**: Click event triggers tab switch + state update

```python
# In leaderboard screen
def row_click(run_id, session_state):
    """Navigate to run detail when row clicked"""
    session_state["current_run_id"] = run_id

    # Switch to trace detail tab (Tab index 4)
    return gr.Tabs(selected=4), session_state

table_component.select(
    fn=row_click,
    inputs=[gr.State(), session_state],
    outputs=[main_tabs, session_state]
)
```

---

## Job Submission Architecture

### HuggingFace Jobs Integration

**File**: `utils/hf_jobs_submission.py`

**Key Functions**:
```python
import requests

def submit_hf_job(model, agent_type, hardware, api_keys):
    """Submit evaluation job to HuggingFace Jobs"""

    # 1. Build job config (YAML)
    job_config = {
        "name": f"SMOLTRACE Eval - {model}",
        "hardware": hardware,  # cpu-basic, t4-small, a10g-small, a100-large, h200
        "environment": {
            "MODEL": model,
            "AGENT_TYPE": agent_type,
            "HF_TOKEN": api_keys["hf_token"],
            # ... other env vars
        },
        "command": [
            "pip install smoltrace[otel,gpu]",
            f"smoltrace-eval --model {model} --agent-type {agent_type} ..."
        ]
    }

    # 2. Submit via HF Jobs API
    response = requests.post(
        "https://huggingface.co/api/jobs",
        headers={"Authorization": f"Bearer {api_keys['hf_token']}"},
        json=job_config
    )

    # 3. Return job ID
    job_id = response.json()["id"]
    return job_id
```

### Modal Integration

**File**: `utils/modal_job_submission.py`

**Key Functions**:
```python
import modal

def submit_modal_job(model, agent_type, hardware, api_keys):
    """Submit evaluation job to Modal"""

    # 1. Create Modal app
    app = modal.App("smoltrace-eval")

    # 2. Define function with GPU
    @app.function(
        image=modal.Image.debian_slim().pip_install("smoltrace[otel,gpu]"),
        gpu=hardware,  # A10, A100-80GB, H200
        secrets=[
            modal.Secret.from_dict({
                "HF_TOKEN": api_keys["hf_token"],
                # ... other secrets
            })
        ]
    )
    def run_evaluation():
        import smoltrace
        # Run evaluation
        results = smoltrace.evaluate(model=model, agent_type=agent_type)
        return results

    # 3. Deploy and run
    with app.run():
        result = run_evaluation.remote()

    return result.job_id
```

---

## Deployment

### HuggingFace Spaces

**Platform**: HuggingFace Spaces
**SDK**: Gradio 5.49.1
**Hardware**: CPU Basic (upgradeable)
**URL**: https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind

### Configuration

**Space Metadata** (README.md header):
```yaml
---
title: TraceMind AI
emoji: 🧠
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
short_description: AI agent evaluation with MCP-powered intelligence
license: agpl-3.0
pinned: true
tags:
  - mcp-in-action-track-enterprise
  - agent-evaluation
  - mcp-client
  - leaderboard
  - gradio
---
```

### Environment Variables

**Set in HF Spaces Secrets**:
```bash
# Required
GEMINI_API_KEY=your_gemini_key
HF_TOKEN=your_hf_token

# Optional
MCP_SERVER_URL=https://mcp-1st-birthday-tracemind-mcp-server.hf.space/gradio_api/mcp/sse
LEADERBOARD_REPO=kshitijthakkar/smoltrace-leaderboard
DISABLE_OAUTH=false  # Set to true for local development
```

---

## Performance Optimization

### 1. Data Caching

**Implementation**: `data_loader.py`
- In-memory cache with 5-minute TTL
- Reduces HF Datasets API calls
- Faster page loads

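The cache behavior described above can be sketched as a small helper (a minimal illustration of the pattern, not the actual `data_loader.py` code; `load_fn` stands in for the HF Datasets call):

```python
import time

class TTLCache:
    """Minimal in-memory cache with a fixed TTL (sketch of the data_loader pattern)."""

    def __init__(self, ttl_seconds: float = 300):  # 5-minute TTL
        self.ttl = ttl_seconds
        self._store = {}  # key -> (timestamp, value)

    def get_or_load(self, key, load_fn):
        now = time.time()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]  # Cache hit: skip the expensive reload
        value = load_fn()    # Cache miss or expired entry: reload
        self._store[key] = (now, value)
        return value

cache = TTLCache(ttl_seconds=300)
# Example: df = cache.get_or_load("leaderboard", lambda: load_dataset("..."))
```
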
### 2. Async MCP Calls

**Pattern**: Use async for non-blocking I/O
```python
import asyncio

# Could be optimized to run in parallel
async def load_data_with_insights():
    leaderboard_task = load_dataset_async(...)
    insights_task = mcp_client.analyze_leaderboard_async(...)

    leaderboard, insights = await asyncio.gather(leaderboard_task, insights_task)
    return leaderboard, insights
```
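
As a runnable illustration of the same gather pattern, with stand-in coroutines replacing the real dataset and MCP calls:

```python
import asyncio

async def fetch_leaderboard():
    await asyncio.sleep(0.01)  # stand-in for the HF Datasets load
    return ["run-1", "run-2"]

async def fetch_insights():
    await asyncio.sleep(0.01)  # stand-in for the MCP analyze_leaderboard call
    return "insights"

async def load_data_with_insights():
    # Both I/O-bound calls run concurrently instead of back-to-back
    return await asyncio.gather(fetch_leaderboard(), fetch_insights())

leaderboard, insights = asyncio.run(load_data_with_insights())
```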

### 3. Component Lazy Loading

**Strategy**: Load components only when tabs are activated
```python
with gr.Tab("Trace Detail", visible=False) as trace_tab:
    # Components created only when tab first shown
    @trace_tab.select
    def load_trace_components():
        return build_trace_visualization()
```

---

## Related Documentation

- [README.md](PROPOSED_README_TRACEMIND_AI.md) - Overview and quick start
- [USER_GUIDE.md](USER_GUIDE_TRACEMIND_AI.md) - Complete screen-by-screen guide
- [MCP_INTEGRATION.md](MCP_INTEGRATION_TRACEMIND_AI.md) - MCP client implementation
- [TraceMind MCP Server Architecture](ARCHITECTURE_MCP_SERVER.md) - Server-side architecture

---

**Last Updated**: November 21, 2025
**Version**: 1.0.0
**Track**: MCP in Action (Enterprise)
MCP_INTEGRATION.md
ADDED
@@ -0,0 +1,706 @@
# TraceMind-AI - MCP Integration Guide

This document explains how TraceMind-AI integrates with MCP servers to provide AI-powered agent evaluation.

## Table of Contents

- [Overview](#overview)
- [Dual MCP Integration](#dual-mcp-integration)
- [Architecture](#architecture)
- [MCP Client Implementation](#mcp-client-implementation)
- [Agent Framework Integration](#agent-framework-integration)
- [MCP Tools Usage](#mcp-tools-usage)
- [Development Guide](#development-guide)

---

## Overview

TraceMind-AI demonstrates **enterprise MCP client usage** as part of the **Track 2: MCP in Action** submission. It showcases two distinct patterns of MCP integration:

1. **Direct MCP Client**: Python-based client connecting to a remote MCP server via SSE transport
2. **Autonomous Agent**: `smolagents`-based agent with access to MCP tools for multi-step reasoning

Both patterns consume the same MCP server ([TraceMind-mcp-server](https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind-mcp-server)) to provide AI-powered analysis of agent evaluation data.

---

## Dual MCP Integration

### Pattern 1: Direct MCP Client Integration

**Where**: Leaderboard insights, cost estimation dialogs, trace debugging

**How it works**:
```python
# TraceMind-AI calls MCP server directly
mcp_client = get_sync_mcp_client()
insights = mcp_client.analyze_leaderboard(
    metric_focus="overall",
    time_range="last_week",
    top_n=5
)
# Display insights in UI
```

**Use cases**:
- Generate leaderboard insights when user clicks "Load Leaderboard"
- Estimate costs when user clicks "Estimate Cost" in New Evaluation form
- Debug traces when user asks questions in trace visualization

**Advantages**:
- Direct, fast execution
- Synchronous API (easy to integrate with Gradio)
- Predictable, structured responses

---

### Pattern 2: Autonomous Agent with MCP Tools

**Where**: Agent Chat tab

**How it works**:
```python
# smolagents agent discovers and uses MCP tools autonomously
from smolagents import ToolCallingAgent, MCPClient

# Agent initialized with MCP client
agent = ToolCallingAgent(
    tools=[],  # Tools loaded from MCP server
    model=model_client,
    mcp_client=MCPClient(mcp_server_url)
)

# User asks question
result = agent.run("What are the top 3 models and their costs?")

# Agent plans:
# 1. Call get_top_performers MCP tool
# 2. Extract costs from results
# 3. Format and present to user
```

**Use cases**:
- Answer complex questions requiring multi-step analysis
- Compare models across multiple dimensions
- Plan evaluation strategies with cost estimates
- Provide recommendations based on leaderboard data

**Advantages**:
- Natural language interface
- Multi-step reasoning
- Autonomous tool selection
- Context-aware responses

---

## Architecture

### System Overview

```
┌─────────────────────────────────────────────────────────────┐
│           TraceMind-AI (Gradio App) - Track 2               │
│                                                             │
│  ┌───────────────────────────────────────────────────────┐  │
│  │ UI Layer (Gradio)                                     │  │
│  │ - Leaderboard tab                                     │  │
│  │ - Agent Chat tab                                      │  │
│  │ - New Evaluation tab                                  │  │
│  │ - Trace Visualization tab                             │  │
│  └────────────┬─────────────────────────────┬────────────┘  │
│               │                             │               │
│  ┌────────────▼────────────┐  ┌─────────────▼────────────┐  │
│  │ Direct MCP Client       │  │ Autonomous Agent         │  │
│  │ (sync_wrapper.py)       │  │ (smolagents)             │  │
│  │                         │  │                          │  │
│  │ - Synchronous API       │  │ - Multi-step reasoning   │  │
│  │ - Tool calling          │  │ - Tool discovery         │  │
│  │ - Error handling        │  │ - Context management     │  │
│  └────────────┬────────────┘  └─────────────┬────────────┘  │
│               └──────────────┬──────────────┘               │
│                              │                              │
│                        MCP Protocol                         │
│                       (SSE Transport)                       │
└──────────────────────────────┬──────────────────────────────┘
                               │
┌──────────────────────────────▼──────────────────────────────┐
│              TraceMind MCP Server - Track 1                 │
│  https://huggingface.co/spaces/MCP-1st-Birthday/            │
│  TraceMind-mcp-server                                       │
│                                                             │
│  11 AI-Powered Tools:                                       │
│  - analyze_leaderboard                                      │
│  - debug_trace                                              │
│  - estimate_cost                                            │
│  - compare_runs                                             │
│  - analyze_results                                          │
│  - get_top_performers                                       │
│  - get_leaderboard_summary                                  │
│  - get_dataset                                              │
│  - generate_synthetic_dataset                               │
│  - push_dataset_to_hub                                      │
│  - generate_prompt_template                                 │
└─────────────────────────────────────────────────────────────┘
```

---

## MCP Client Implementation

### File Structure

```
TraceMind-AI/
├── mcp_client/
│   ├── __init__.py
│   ├── client.py            # Async MCP client
│   └── sync_wrapper.py      # Synchronous wrapper for Gradio
├── agent/
│   ├── __init__.py
│   └── smolagents_setup.py  # Agent with MCP integration
└── app.py                   # Main Gradio app
```

### Async MCP Client (`client.py`)

```python
from mcp import ClientSession
from mcp.client.sse import sse_client

class TraceMindMCPClient:
    """Async MCP client for TraceMind MCP Server"""

    def __init__(self, mcp_server_url: str):
        self.mcp_server_url = mcp_server_url
        self.session = None

    async def connect(self):
        """Establish connection to MCP server via SSE"""
        # For HTTP-based MCP servers (HuggingFace Spaces)
        self._streams_ctx = sse_client(self.mcp_server_url)
        read_stream, write_stream = await self._streams_ctx.__aenter__()
        self._session_ctx = ClientSession(read_stream, write_stream)
        self.session = await self._session_ctx.__aenter__()
        await self.session.initialize()

        # List available tools
        tools_result = await self.session.list_tools()
        self.available_tools = {tool.name: tool for tool in tools_result.tools}

        print(f"Connected to MCP server. Available tools: {list(self.available_tools.keys())}")

    async def call_tool(self, tool_name: str, arguments: dict) -> str:
        """Call an MCP tool with given arguments"""
        if not self.session:
            raise RuntimeError("MCP client not connected. Call connect() first.")

        if tool_name not in self.available_tools:
            raise ValueError(f"Tool '{tool_name}' not available. Available: {list(self.available_tools.keys())}")

        # Call the tool
        result = await self.session.call_tool(tool_name, arguments=arguments)

        # Extract text response
        if result.content and len(result.content) > 0:
            return result.content[0].text
        return ""

    async def analyze_leaderboard(self, **kwargs) -> str:
        """Wrapper for analyze_leaderboard tool"""
        return await self.call_tool("analyze_leaderboard", kwargs)

    async def estimate_cost(self, **kwargs) -> str:
        """Wrapper for estimate_cost tool"""
        return await self.call_tool("estimate_cost", kwargs)

    async def debug_trace(self, **kwargs) -> str:
        """Wrapper for debug_trace tool"""
        return await self.call_tool("debug_trace", kwargs)

    async def compare_runs(self, **kwargs) -> str:
        """Wrapper for compare_runs tool"""
        return await self.call_tool("compare_runs", kwargs)

    async def get_top_performers(self, **kwargs) -> str:
        """Wrapper for get_top_performers tool"""
        return await self.call_tool("get_top_performers", kwargs)

    async def disconnect(self):
        """Close MCP connection"""
        if self.session:
            await self._session_ctx.__aexit__(None, None, None)
            await self._streams_ctx.__aexit__(None, None, None)
            self.session = None
```

### Synchronous Wrapper (`sync_wrapper.py`)

```python
import asyncio
import os
from typing import Optional
from .client import TraceMindMCPClient

class SyncMCPClient:
    """Synchronous wrapper for async MCP client (Gradio-compatible)"""

    def __init__(self, mcp_server_url: str):
        self.mcp_server_url = mcp_server_url
        self.async_client = TraceMindMCPClient(mcp_server_url)
        self._connected = False

    def _run_async(self, coro):
        """Run async coroutine in sync context"""
        try:
            loop = asyncio.get_event_loop()
        except RuntimeError:
            loop = asyncio.new_event_loop()
            asyncio.set_event_loop(loop)

        return loop.run_until_complete(coro)

    def initialize(self):
        """Connect to MCP server"""
        if not self._connected:
            self._run_async(self.async_client.connect())
            self._connected = True

    def analyze_leaderboard(self, **kwargs) -> str:
        """Synchronous wrapper for analyze_leaderboard"""
        if not self._connected:
            self.initialize()
        return self._run_async(self.async_client.analyze_leaderboard(**kwargs))

    def estimate_cost(self, **kwargs) -> str:
        """Synchronous wrapper for estimate_cost"""
        if not self._connected:
            self.initialize()
        return self._run_async(self.async_client.estimate_cost(**kwargs))

    def debug_trace(self, **kwargs) -> str:
        """Synchronous wrapper for debug_trace"""
        if not self._connected:
            self.initialize()
        return self._run_async(self.async_client.debug_trace(**kwargs))

    # ... (similar wrappers for other tools)

# Global instance for use in Gradio app
_mcp_client: Optional[SyncMCPClient] = None

def get_sync_mcp_client() -> SyncMCPClient:
    """Get or create global sync MCP client instance"""
    global _mcp_client
    if _mcp_client is None:
        mcp_server_url = os.getenv(
            "MCP_SERVER_URL",
            "https://mcp-1st-birthday-tracemind-mcp-server.hf.space/gradio_api/mcp/sse"
        )
        _mcp_client = SyncMCPClient(mcp_server_url)
    return _mcp_client
```
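
The loop-handling logic in `_run_async` can be exercised on its own; this standalone sketch substitutes a trivial coroutine for a real MCP call:

```python
import asyncio

def run_async(coro):
    """Run an async coroutine from synchronous code, reusing or creating an event loop."""
    try:
        loop = asyncio.get_event_loop()
    except RuntimeError:
        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)
    return loop.run_until_complete(coro)

async def sample_tool_call():
    await asyncio.sleep(0)  # stand-in for an awaited MCP tool call
    return "tool result"

result = run_async(sample_tool_call())
```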

### Usage in Gradio App

```python
# app.py
import gradio as gr
from datasets import load_dataset

from mcp_client.sync_wrapper import get_sync_mcp_client

# Initialize MCP client
mcp_client = get_sync_mcp_client()
mcp_client.initialize()

# Use in Gradio event handlers
def load_leaderboard():
    """Load leaderboard and generate AI insights"""
    # Load dataset
    ds = load_dataset("kshitijthakkar/smoltrace-leaderboard")
    df = ds["train"].to_pandas()

    # Get AI insights from MCP server
    try:
        insights = mcp_client.analyze_leaderboard(
            metric_focus="overall",
            time_range="last_week",
            top_n=5
        )
    except Exception as e:
        insights = f"❌ Error generating insights: {str(e)}"

    return df, insights

# Gradio UI
with gr.Blocks() as app:
    with gr.Tab("📊 Leaderboard"):
        load_btn = gr.Button("Load Leaderboard")
        insights_md = gr.Markdown(label="AI Insights")
        leaderboard_table = gr.Dataframe()

        load_btn.click(
            fn=load_leaderboard,
            outputs=[leaderboard_table, insights_md]
        )
```

---

## Agent Framework Integration

### smolagents Setup

```python
# agent/smolagents_setup.py
from smolagents import ToolCallingAgent, MCPClient, HfApiModel
import os

def create_agent():
    """Create smolagents agent with MCP tool access"""

    # 1. Configure MCP client
    mcp_server_url = os.getenv(
        "MCP_SERVER_URL",
        "https://mcp-1st-birthday-tracemind-mcp-server.hf.space/gradio_api/mcp/sse"
    )

    mcp_client = MCPClient(mcp_server_url)

    # 2. Configure LLM
    model = HfApiModel(
        model_id="Qwen/Qwen2.5-Coder-32B-Instruct",
        token=os.getenv("HF_TOKEN")
    )

    # 3. Create agent with MCP tools
    agent = ToolCallingAgent(
        tools=[],  # MCP tools loaded automatically
        model=model,
        mcp_client=mcp_client,
        max_steps=10,
        verbosity_level=1
    )

    return agent

def run_agent_query(agent: ToolCallingAgent, query: str, show_reasoning: bool = False):
    """Run agent query and return response"""
    try:
        # Set verbosity based on show_reasoning flag
        if show_reasoning:
            agent.verbosity_level = 2  # Show tool execution logs
        else:
            agent.verbosity_level = 0  # Only show final answer

        # Run agent
        result = agent.run(query)

        return result
    except Exception as e:
        return f"❌ Agent error: {str(e)}"
```
| 400 |
+
|
| 401 |
+
### Agent Chat UI
|
| 402 |
+
|
| 403 |
+
```python
|
| 404 |
+
# app.py
|
| 405 |
+
from agent.smolagents_setup import create_agent, run_agent_query
|
| 406 |
+
|
| 407 |
+
# Initialize agent (once at startup)
|
| 408 |
+
agent = create_agent()
|
| 409 |
+
|
| 410 |
+
def agent_chat(message: str, history: list, show_reasoning: bool):
|
| 411 |
+
"""Handle agent chat interaction"""
|
| 412 |
+
# Run agent query
|
| 413 |
+
response = run_agent_query(agent, message, show_reasoning)
|
| 414 |
+
|
| 415 |
+
# Update chat history
|
| 416 |
+
history.append((message, response))
|
| 417 |
+
|
| 418 |
+
return history, ""
|
| 419 |
+
|
| 420 |
+
# Gradio UI
|
| 421 |
+
with gr.Blocks() as app:
|
| 422 |
+
with gr.Tab("π€ Agent Chat"):
|
| 423 |
+
gr.Markdown("## Autonomous Agent with MCP Tools")
|
| 424 |
+
gr.Markdown("Ask questions about agent evaluations. The agent has access to all MCP tools.")
|
| 425 |
+
|
| 426 |
+
chatbot = gr.Chatbot(label="Agent Chat")
|
| 427 |
+
msg = gr.Textbox(label="Your Question", placeholder="What are the top 3 models and their costs?")
|
| 428 |
+
show_reasoning = gr.Checkbox(label="Show Agent Reasoning", value=False)
|
| 429 |
+
|
| 430 |
+
# Quick action buttons
|
| 431 |
+
with gr.Row():
|
| 432 |
+
quick_top = gr.Button("Quick: Top Models")
|
| 433 |
+
quick_cost = gr.Button("Quick: Cost Estimate")
|
| 434 |
+
quick_load = gr.Button("Quick: Load Leaderboard")
|
| 435 |
+
|
| 436 |
+
# Event handlers
|
| 437 |
+
msg.submit(agent_chat, [msg, chatbot, show_reasoning], [chatbot, msg])
|
| 438 |
+
|
| 439 |
+
quick_top.click(
|
| 440 |
+
lambda h, sr: agent_chat(
|
| 441 |
+
"What are the top 5 models by success rate with their costs?",
|
| 442 |
+
h,
|
| 443 |
+
sr
|
| 444 |
+
),
|
| 445 |
+
[chatbot, show_reasoning],
|
| 446 |
+
[chatbot, msg]
|
| 447 |
+
)
|
| 448 |
+
```

---

## MCP Tools Usage

### Tools Used in TraceMind-AI

| Tool | Where Used | Purpose |
|------|-----------|---------|
| `analyze_leaderboard` | Leaderboard tab | Generate AI insights when user loads leaderboard |
| `estimate_cost` | New Evaluation tab | Predict costs before submitting evaluation |
| `debug_trace` | Trace Visualization | Answer questions about execution traces |
| `compare_runs` | Agent Chat | Compare two evaluation runs side-by-side |
| `analyze_results` | Agent Chat | Analyze detailed test results with optimization recommendations |
| `get_top_performers` | Agent Chat | Efficiently fetch top N models (90% token reduction) |
| `get_leaderboard_summary` | Agent Chat | Get high-level statistics (99% token reduction) |
| `get_dataset` | Agent Chat | Load SMOLTRACE datasets for detailed analysis |

### Example Tool Calls

**Example 1: Leaderboard Insights**
```python
# User clicks "Load Leaderboard" button
insights = mcp_client.analyze_leaderboard(
    leaderboard_repo="kshitijthakkar/smoltrace-leaderboard",
    metric_focus="overall",
    time_range="last_week",
    top_n=5
)

# Display in Gradio Markdown component
insights_md.value = insights
```

**Example 2: Cost Estimation**
```python
# User fills New Evaluation form and clicks "Estimate Cost"
estimate = mcp_client.estimate_cost(
    model="meta-llama/Llama-3.1-8B",
    agent_type="both",
    num_tests=100,
    hardware="auto"
)

# Display in dialog
gr.Info(estimate)
```

**Example 3: Agent Multi-Step Query**
```python
# User asks: "What are the top 3 models and how much do they cost?"

# Agent reasoning (internal):
# Step 1: Need to get top models by success rate
#   → Call get_top_performers(metric="success_rate", top_n=3)
#
# Step 2: Extract cost information from results
#   → Parse JSON response, get "total_cost_usd" field
#
# Step 3: Format response for user
#   → Create markdown table with model names, success rates, costs

# Agent response:
"""
Here are the top 3 models by success rate:

1. **GPT-4**: 95.8% success rate, $0.05 per run
2. **Claude-3**: 94.1% success rate, $0.04 per run
3. **Llama-3.1-8B**: 93.4% success rate, $0.002 per run

GPT-4 leads in accuracy but is 25x more expensive than Llama-3.1.
For cost-sensitive workloads, Llama-3.1 offers the best value.
"""
```
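Step 3 of the reasoning above (formatting the parsed results) can be sketched as a small helper. The field names (`model_name`, `success_rate`, `total_cost_usd`) mirror the leaderboard fields used elsewhere in this document, but treat the exact schema as an assumption:

```python
def format_top_models(runs: list) -> str:
    """Render parsed leaderboard entries as the markdown list the agent returns."""
    lines = []
    for rank, run in enumerate(runs, start=1):
        lines.append(
            f"{rank}. **{run['model_name']}**: "
            f"{run['success_rate']:.1f}% success rate, "
            f"${run['total_cost_usd']} per run"
        )
    return "\n".join(lines)

# Example with the values from the response above
runs = [
    {"model_name": "GPT-4", "success_rate": 95.8, "total_cost_usd": 0.05},
    {"model_name": "Claude-3", "success_rate": 94.1, "total_cost_usd": 0.04},
    {"model_name": "Llama-3.1-8B", "success_rate": 93.4, "total_cost_usd": 0.002},
]
print(format_top_models(runs))
```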

---

## Development Guide

### Adding New MCP Tool Integration

1. **Add method to async client** (`client.py`):
```python
async def new_tool_name(self, **kwargs) -> str:
    """Wrapper for new_tool_name MCP tool"""
    return await self.call_tool("new_tool_name", kwargs)
```

2. **Add synchronous wrapper** (`sync_wrapper.py`):
```python
def new_tool_name(self, **kwargs) -> str:
    """Synchronous wrapper for new_tool_name"""
    if not self._connected:
        self.initialize()
    return self._run_async(self.async_client.new_tool_name(**kwargs))
```

3. **Use in Gradio app** (`app.py`):
```python
def handle_new_tool():
    result = mcp_client.new_tool_name(param1="value1", param2="value2")
    return result
```

**Note**: The agent automatically discovers new tools from the MCP server, so no agent-side code changes are needed!
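The sync wrapper in step 2 relies on a `_run_async` helper. The real implementation lives in `sync_wrapper.py`; the following is a minimal self-contained sketch of the sync-over-async pattern, with a fake coroutine standing in for the MCP round-trip (the class and method bodies here are illustrative, not the actual TraceMind code):

```python
import asyncio

class SyncWrapperSketch:
    """Illustrative sync facade over an async client, mirroring step 2 above."""

    def __init__(self):
        # A dedicated event loop lets sync callers reuse one loop across calls
        self._loop = asyncio.new_event_loop()

    def _run_async(self, coro):
        """Block until the coroutine finishes and return its result."""
        return self._loop.run_until_complete(coro)

    def new_tool_name(self, **kwargs) -> str:
        async def fake_tool_call():
            await asyncio.sleep(0)  # stands in for the real MCP round-trip
            return f"called new_tool_name with {sorted(kwargs)}"
        return self._run_async(fake_tool_call())

client = SyncWrapperSketch()
print(client.new_tool_name(param1="a", param2="b"))
```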

### Testing MCP Integration

**Test 1: Connection**
```bash
python -c "from mcp_client.sync_wrapper import get_sync_mcp_client; client = get_sync_mcp_client(); client.initialize(); print('✅ MCP client connected')"
```

**Test 2: Tool Call**
```python
from mcp_client.sync_wrapper import get_sync_mcp_client

client = get_sync_mcp_client()
client.initialize()

result = client.analyze_leaderboard(
    metric_focus="cost",
    time_range="last_week",
    top_n=3
)

print(result)
```

**Test 3: Agent**
```python
from agent.smolagents_setup import create_agent, run_agent_query

agent = create_agent()
response = run_agent_query(agent, "What are the top 3 models?", show_reasoning=True)
print(response)
```

### Debugging MCP Issues

**Issue**: Connection timeout
- **Check**: MCP server is running at specified URL
- **Check**: Network connectivity to HuggingFace Spaces
- **Check**: SSE transport is enabled on server

**Issue**: Tool not found
- **Check**: MCP server has the tool implemented
- **Check**: Tool name matches exactly (case-sensitive)
- **Check**: Client initialized successfully (call `initialize()` first)

**Issue**: Agent not using MCP tools
- **Check**: MCPClient is properly configured in agent setup
- **Check**: Agent has `max_steps > 0` to allow tool usage
- **Check**: Query requires tool usage (not answerable from agent's knowledge alone)
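For transient connection timeouts, wrapping `initialize()` in a retry-with-backoff helper is a common remedy. The helper below is a generic sketch (not part of the TraceMind client); the commented usage shows how it could wrap this project's `get_sync_mcp_client`:

```python
import time

def connect_with_retry(connect_fn, attempts: int = 3, base_delay: float = 1.0):
    """Call connect_fn, retrying with exponential backoff on failure."""
    for attempt in range(attempts):
        try:
            return connect_fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the original error
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...

# Usage sketch:
# client = get_sync_mcp_client()
# connect_with_retry(client.initialize)
```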
---

## Performance Considerations

### Token Optimization

**Problem**: Loading full leaderboard dataset consumes excessive tokens
**Solution**: Use token-optimized MCP tools

```python
# ❌ BAD: Loads all 51 runs (50K+ tokens)
leaderboard = mcp_client.get_dataset("kshitijthakkar/smoltrace-leaderboard")

# ✅ GOOD: Returns only top 5 (5K tokens, 90% reduction)
top_performers = mcp_client.get_top_performers(top_n=5)

# ✅ BETTER: Returns summary stats (500 tokens, 99% reduction)
summary = mcp_client.get_leaderboard_summary()
```
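The quoted reductions come from shrinking the payload before it enters the LLM context. A toy illustration of the effect, using a rough 4-characters-per-token heuristic and made-up records (both the record shape and the heuristic are assumptions for demonstration only):

```python
import json

def approx_tokens(payload) -> int:
    """Rough token count: ~4 characters per token."""
    return len(json.dumps(payload)) // 4

# 51 fake runs, each dragging along a large trace field
full_dataset = [{"model": f"model-{i}", "success_rate": 90.0, "cost": 0.01,
                 "trace": "x" * 400} for i in range(51)]

# Trimmed view: only the fields the LLM actually needs, only 5 records
top_5 = [{"model": r["model"], "success_rate": r["success_rate"], "cost": r["cost"]}
         for r in full_dataset[:5]]

reduction = 1 - approx_tokens(top_5) / approx_tokens(full_dataset)
print(f"~{reduction:.0%} fewer tokens")
```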

### Caching

**Problem**: Repeated identical MCP calls waste time and credits
**Solution**: Implement client-side caching

```python
from functools import lru_cache
import time

@lru_cache(maxsize=32)
def cached_analyze_leaderboard(metric_focus: str, time_range: str, top_n: int, cache_key: int):
    """Cached MCP call with TTL via cache_key"""
    return mcp_client.analyze_leaderboard(
        metric_focus=metric_focus,
        time_range=time_range,
        top_n=top_n
    )

# Use with 5-minute cache TTL
cache_key = int(time.time() // 300)  # Changes every 5 minutes
insights = cached_analyze_leaderboard("overall", "last_week", 5, cache_key)
```
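The `cache_key` argument is what gives `lru_cache` an effective TTL: the key changes every 300 seconds, so older entries simply stop being hit. A self-contained demonstration with a stub in place of the MCP call:

```python
import time
from functools import lru_cache

CALLS = {"count": 0}

@lru_cache(maxsize=32)
def cached_call(metric_focus: str, cache_key: int) -> str:
    CALLS["count"] += 1  # stands in for the expensive MCP round-trip
    return f"insights for {metric_focus} (bucket {cache_key})"

bucket = int(time.time() // 300)
cached_call("overall", bucket)      # miss: performs the call
cached_call("overall", bucket)      # hit: served from cache, no new call
cached_call("overall", bucket + 1)  # new time bucket: miss again
print(CALLS["count"])  # only 2 underlying calls despite 3 invocations
```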

### Async Optimization

**Problem**: Sequential MCP calls block UI
**Solution**: Use async for parallel calls

```python
import asyncio

async def load_leaderboard_with_insights():
    """Load leaderboard and insights in parallel"""
    # Start both operations concurrently
    leaderboard_task = asyncio.create_task(load_dataset_async("kshitijthakkar/smoltrace-leaderboard"))
    insights_task = asyncio.create_task(mcp_client.analyze_leaderboard(metric_focus="overall"))

    # Wait for both to complete
    leaderboard, insights = await asyncio.gather(leaderboard_task, insights_task)

    return leaderboard, insights
```
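The same `asyncio.gather` pattern is runnable with stub coroutines standing in for the dataset load and the MCP call:

```python
import asyncio

async def load_dataset_stub():
    await asyncio.sleep(0.1)  # simulated dataset download
    return ["run-1", "run-2"]

async def analyze_stub():
    await asyncio.sleep(0.1)  # simulated MCP analysis call
    return "insights"

async def main():
    # Both awaits overlap, so total wall time is ~0.1s, not ~0.2s
    leaderboard, insights = await asyncio.gather(load_dataset_stub(), analyze_stub())
    return leaderboard, insights

print(asyncio.run(main()))
```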
---

## Security Considerations

### API Key Management

**DO**:
- Store API keys in environment variables or HF Spaces secrets
- Use session-only storage in Gradio (not server-side persistence)
- Rotate keys regularly

**DON'T**:
- Hardcode API keys in source code
- Expose keys in client-side JavaScript
- Log API keys in console or files
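For the environment-variable route, a fail-fast reader avoids silent misconfiguration. The variable name matches this project's `.env`; the helper itself is a generic sketch, not TraceMind code:

```python
import os

def require_env(name: str) -> str:
    """Read a secret from the environment, failing loudly if missing."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to .env or your Space secrets")
    return value

os.environ["GEMINI_API_KEY"] = "dummy-for-demo"  # normally set outside the code
print(require_env("GEMINI_API_KEY"))
```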

### MCP Server Trust

**Verify MCP server authenticity**:
- Use HTTPS URLs only
- Verify domain ownership (huggingface.co spaces)
- Review MCP server code before connecting (open source)

**Limit tool access**:
- Only connect to trusted MCP servers
- Review tool permissions before use
- Implement rate limiting for tool calls

---

## Related Documentation

- [USER_GUIDE.md](USER_GUIDE_TRACEMIND_AI.md) - Complete UI walkthrough
- [JOB_SUBMISSION.md](JOB_SUBMISSION_TRACEMIND_AI.md) - Evaluation job guide
- [ARCHITECTURE.md](ARCHITECTURE_TRACEMIND_AI.md) - Technical architecture
- [TraceMind MCP Server Documentation](https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind-mcp-server)

---

**Last Updated**: November 21, 2025

README.md CHANGED

@@ -20,474 +20,449 @@ tags:
 # 🧠 TraceMind-AI

 <p align="center">
-  <img src="https://raw.githubusercontent.com/Mandark-droid/TraceMind-AI/assets/TraceVerse_Logo.png" alt="TraceVerse Ecosystem" width="400"/>
-  <br/>
-  <br/>
   <img src="https://raw.githubusercontent.com/Mandark-droid/TraceMind-AI/assets/Logo.png" alt="TraceMind-AI Logo" width="200"/>
 </p>

 **Agent Evaluation Platform with MCP-Powered Intelligence**

 [](https://github.com/modelcontextprotocol)
-[-purple)](https://github.com/modelcontextprotocol/hackathon)
 [](https://gradio.app/)

 > **🎯 Track 2 Submission**: MCP in Action (Enterprise)
 > **📅 MCP's 1st Birthday Hackathon**: November 14-30, 2025

-## Overview
-
-TraceMind-AI is a comprehensive platform for evaluating AI agent performance across different models, providers, and configurations. It provides real-time insights, cost analysis, and detailed trace visualization powered by the Model Context Protocol (MCP).
-
-### 🏗️ **Built on Open Source Foundation**
-
-This platform is part of a complete agent evaluation ecosystem built on two foundational open-source projects:
-
-**🌿 TraceVerde (genai_otel_instrument)** - Automatic OpenTelemetry Instrumentation
-- **What**: Zero-code OTEL instrumentation for LLM frameworks (LiteLLM, Transformers, LangChain, etc.)
-- **Why**: Captures every LLM call, tool usage, and agent step automatically
-- **Links**: [GitHub](https://github.com/Mandark-droid/genai_otel_instrument) | [PyPI](https://pypi.org/project/genai-otel-instrument)
-
-**🔍 SMOLTRACE** - Agent Evaluation Engine
-- **What**: Lightweight, production-ready evaluation framework with OTEL tracing built-in
-- **Why**: Generates structured datasets (leaderboard, results, traces, metrics) displayed in this UI
-- **Links**: [GitHub](https://github.com/Mandark-droid/SMOLTRACE) | [PyPI](https://pypi.org/project/smoltrace/)
-
-**The Flow**: `TraceVerde` instruments your agents → `SMOLTRACE` evaluates them → `TraceMind-AI` visualizes results with MCP-powered intelligence

 ---

-- **🤖 Autonomous Agent Chat**: Interactive agent powered by smolagents with MCP tools (Track 2)
-- **💬 MCP Integration**: AI-powered analysis using remote MCP servers
-- **☁️ Multi-Cloud Evaluation**: Submit jobs to HuggingFace Jobs or Modal (H200, A100, A10 GPUs)
-- **💰 Smart Cost Estimation**: Auto-select hardware and predict costs before running evaluations
-- **🔍 Trace Visualization**: Detailed OpenTelemetry trace analysis with GPU metrics
-- **📊 Performance Metrics**: GPU utilization, CO2 emissions, token usage tracking
-- **🧠 Agent Reasoning**: View step-by-step agent planning and tool execution

-- `analyze_leaderboard` - AI-generated insights about evaluation trends
-- `estimate_cost` - Cost estimation with hardware recommendations
-- `debug_trace` - Interactive trace analysis and debugging
-- `compare_runs` - Side-by-side run comparison
-- `analyze_results` - Test case analysis with optimization recommendations

-- Python 3.10+
-- HuggingFace account (for authentication)

-- ⚠️ **HuggingFace Pro account** ($9/month) with credit card
-- HuggingFace token with **Read + Write + Run Jobs** permissions
-- API keys for model providers (OpenAI, Anthropic, etc.)

-1. Clone the repository:
-```bash
-git clone https://github.com/Mandark-droid/TraceMind-AI.git
-cd TraceMind-AI
 ```

-```bash
-cp .env.example .env
-# Edit .env with your configuration
-```

-Visit http://localhost:7860

-## 🎯 For Hackathon Judges & Visitors

-2. Go to **⚙️ Settings** tab
-3. Enter your **Gemini API Key** and **HuggingFace Token**
-4. Click **"Save & Override Keys"**

-**Step 2: Configure TraceMind-AI** (Optional, for additional features)

-2. Go to **⚙️ Settings** tab
-3. Enter your **Gemini API Key** and **HuggingFace Token**
-4. Click **"Save API Keys"**

-- **TraceMind-AI**: Main UI that calls the MCP server for intelligent analysis
-- They run in **separate sessions** → need separate configuration
-- Configuring both ensures your keys are used for the complete evaluation flow

-- Visit: https://ai.google.dev/
-- Click "Get API Key" → Create project → Generate key
-- **Free tier**: 1,500 requests/day (sufficient for evaluation)

-- **Free tier**: No rate limits for public dataset access

-✅ **Session-only storage**: Keys stored only in browser memory
-✅ **No server persistence**: Keys never saved to disk
-✅ **Not exposed via API**: Settings forms use `api_name=False`
-✅ **HTTPS encryption**: All API calls over secure connections

-- **HuggingFace Jobs**: Managed compute with H200, A100, A10, T4 GPUs
-- **Modal**: Serverless GPU compute with pay-per-second pricing

-- ✅ **Read** (view datasets)
-- ✅ **Write** (upload results)
-- ✅ **Run Jobs** (submit evaluation jobs)
-- ⚠️ Read-only tokens will NOT work

-- Sign up at: https://modal.com
-- Generate API token at: https://modal.com/settings/tokens
-- Pay-per-second billing (no monthly subscription)

-- MODAL_TOKEN_ID (starts with `ak-`)
-- MODAL_TOKEN_SECRET (starts with `as-`)

-- OpenAI, Anthropic, Google, etc.
-- Configure in Settings → LLM Provider API Keys
-- Passed securely as job secrets

-- **a100-large**: Large models (70B+) - ~$3.00/hr
-- Pricing: https://huggingface.co/pricing#spaces-pricing

-- **A100-80GB**: Large models (70B+) - ~$0.0030/sec
-- **H200**: Fastest inference - ~$0.0050/sec
-- Pricing: https://modal.com/pricing

-- Add HF Token (with Run Jobs permission) - **required for both platforms**
-- Add Modal credentials (MODAL_TOKEN_ID + MODAL_TOKEN_SECRET) - **for Modal only**
-- Add LLM provider keys (OpenAI, Anthropic, etc.)

-2. **Create Evaluation** (New Evaluation tab):
-   - **Select infrastructure**: HuggingFace Jobs or Modal
-   - Choose model and agent type
-   - Configure hardware (or use **"auto"** for smart selection)
-   - Set timeout (default: 1h)
-   - Click "💰 Estimate Cost" to preview cost/duration
-   - Click "Submit Evaluation"

-3. **Monitor Job**:
-   - View job ID and status in confirmation screen
-   - **HF Jobs**: Track at https://huggingface.co/jobs or use Job Monitoring tab
-   - **Modal**: Track at https://modal.com/apps
-   - Results automatically appear in leaderboard when complete

-### What Happens During a Job

-1. Job starts on selected infrastructure (HF Jobs or Modal)
-2. Docker container built with required dependencies
-3. SMOLTRACE evaluates your model with OpenTelemetry tracing
-4. Results uploaded to 4 HuggingFace datasets:
-   - Leaderboard entry (summary stats)
-   - Results dataset (test case details)
-   - Traces dataset (OTEL spans)
-   - Metrics dataset (GPU metrics, CO2 emissions)
-5. Results appear in TraceMind leaderboard automatically

-**Expected Duration:**
-- CPU jobs (API models): 2-5 minutes
-- GPU jobs (local models): 15-30 minutes (includes model download)

-# Options: "hfapi" (default), "inference_client", "litellm"
-AGENT_MODEL_TYPE=hfapi

-# Required if AGENT_MODEL_TYPE=litellm
-GEMINI_API_KEY=your_gemini_api_key_here

-MCP_SERVER_URL=https://mcp-1st-birthday-tracemind-mcp-server.hf.space/gradio_api/mcp/sse

-DISABLE_OAUTH=true
-```

-- Requires: `HF_TOKEN`
-- Best for: General use, free tier available

-- Model: `deepseek-ai/DeepSeek-V3-0324`
-- Requires: `HF_TOKEN`
-- Best for: Advanced reasoning, faster inference

-- **Results**: Individual test case results
-- **Traces**: OpenTelemetry trace data
-- **Metrics**: GPU metrics and performance data

-├── mcp_client/          # MCP client implementation
-│   ├── client.py        # Async MCP client
-│   └── sync_wrapper.py  # Synchronous wrapper
-├── utils/               # Utilities
-│   ├── auth.py          # HuggingFace OAuth
-│   └── navigation.py    # Screen navigation
-├── screens/             # UI screens
-├── components/          # Reusable components
-└── styles/              # Custom CSS
-```

-from mcp_client.sync_wrapper import get_sync_mcp_client

-insights = mcp_client.analyze_leaderboard(
-    metric_focus="overall",
-    time_range="last_week",
-    top_n=5
-)
-```

-2. Navigate to the "Leaderboard" tab
-3. Click "Load Leaderboard" to fetch the latest data
-4. View AI-powered insights generated by the MCP server

-2. Enter the model name (e.g., `openai/gpt-4`)
-3. Select agent type and number of tests
-4. Click "Estimate Cost" for AI-powered analysis

-- Analysis: "Analyze the current leaderboard and show me the top performing models with their costs"
-- Cost Comparison: "Compare the costs of the top 3 models - which one offers the best value?"
-- Recommendations: "Based on the leaderboard data, which model would you recommend for a production system?"

-- **Agent Framework**: smolagents 1.22.0+ (Track 2)
-- **MCP Protocol**: MCP integration via Gradio & smolagents MCPClient
-- **Data**: HuggingFace Datasets API
-- **Authentication**: HuggingFace OAuth
-- **AI Models**:
-  - Default: Qwen/Qwen2.5-Coder-32B-Instruct (HF Inference API)
-  - Optional: DeepSeek-V3 (Nebius), Gemini 2.5 Flash
-  - MCP Server: Google Gemini 2.5 Pro

-# Install dependencies
-pip install -r requirements.txt

-python app.py
-```

-- [Data Loader API](data_loader.py) - Dataset loading and caching
-- [MCP Client API](mcp_client/client.py) - MCP protocol integration
-- [Authentication](utils/auth.py) - HuggingFace OAuth integration

 **Track**: MCP in Action (Enterprise)
 **Author**: Kshitij Thakkar
-**Powered by**: MCP
 **Built with**: Gradio 5.49.1 (MCP client integration)

 ---

-- **Gradio Team** - For Gradio 6 with MCP integration
-- **HuggingFace** - For Spaces hosting and dataset infrastructure
-- **Google** - For Gemini API access
-- **[Eliseu Silva](https://huggingface.co/elismasilva)** - For the [gradio_htmlplus](https://huggingface.co/spaces/elismasilva/gradio_htmlplus) custom component that powers our interactive leaderboard table. Eliseu's timely help and collaboration during the hackathon was invaluable!

 ---
| 20 |
# π§ TraceMind-AI
|
| 21 |
|
| 22 |
<p align="center">
|
|
|
|
|
|
|
|
|
|
| 23 |
<img src="https://raw.githubusercontent.com/Mandark-droid/TraceMind-AI/assets/Logo.png" alt="TraceMind-AI Logo" width="200"/>
|
| 24 |
</p>
|
| 25 |
|
| 26 |
**Agent Evaluation Platform with MCP-Powered Intelligence**
|
| 27 |
|
| 28 |
[](https://github.com/modelcontextprotocol)
|
| 29 |
+
[-purple)](https://github.com/modelcontextprotocol/hackathon)
|
| 30 |
[](https://gradio.app/)
|
| 31 |
|
| 32 |
> **π― Track 2 Submission**: MCP in Action (Enterprise)
|
| 33 |
> **π
MCP's 1st Birthday Hackathon**: November 14-30, 2025
|
| 34 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 35 |
---
|
| 36 |
|
| 37 |
+
## Why TraceMind-AI?
|
| 38 |
|
| 39 |
+
**The Challenge**: Evaluating AI agents generates complex data across models, providers, and configurations. Making sense of it all is overwhelming.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 40 |
|
| 41 |
+
**The Solution**: TraceMind-AI is your **intelligent agent evaluation command center**:
|
| 42 |
+
- π **Live leaderboard** with real-time performance data
|
| 43 |
+
- π€ **Autonomous agent chat** powered by MCP tools
|
| 44 |
+
- π° **Smart cost estimation** before you run evaluations
|
| 45 |
+
- π **Deep trace analysis** to debug agent behavior
|
| 46 |
+
- βοΈ **Multi-cloud job submission** (HuggingFace Jobs + Modal)
|
| 47 |
|
| 48 |
+
All powered by the **Model Context Protocol** for AI-driven insights at every step.
|
| 49 |
|
| 50 |
+
---
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 51 |
|
| 52 |
+
## π Try It Now
|
| 53 |
|
| 54 |
+
- **π Live Demo**: [TraceMind-AI Space](https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind)
|
| 55 |
+
- **π οΈ MCP Server**: [TraceMind-mcp-server](https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind-mcp-server) (Track 1)
|
| 56 |
+
- **π Full Docs**: See [USER_GUIDE.md](USER_GUIDE.md) for complete walkthrough
|
| 57 |
+
- **π¬ MCP Server Quick Demo (5 min)**: [Watch on Loom](https://www.loom.com/share/d4d0003f06fa4327b46ba5c081bdf835)
|
| 58 |
+
- **πΊ MCP Server Full Demo (20 min)**: [Watch on Loom](https://www.loom.com/share/de559bb0aef749559c79117b7f951250)
|
| 59 |
|
| 60 |
+
---
|
|
|
|
|
|
|
| 61 |
|
| 62 |
+
## The TraceMind Ecosystem
|
|
|
|
|
|
|
|
|
|
| 63 |
|
| 64 |
+
TraceMind-AI is the **user-facing platform** in a complete 4-project agent evaluation ecosystem:
|
| 65 |
|
| 66 |
+
<p align="center">
|
| 67 |
+
<img src="https://raw.githubusercontent.com/Mandark-droid/TraceMind-AI/assets/TraceVerse_Logo.png" alt="TraceVerse Ecosystem" width="400"/>
|
| 68 |
+
<br/><br/>
|
| 69 |
+
</p>
|
| 70 |
|
|
|
|
|
|
|
|
|
|
|
|
|
| 71 |
```
|
| 72 |
+
π TraceVerde π SMOLTRACE
|
| 73 |
+
(genai_otel_instrument) (Evaluation Engine)
|
| 74 |
+
β β
|
| 75 |
+
Instruments Evaluates
|
| 76 |
+
LLM calls agents
|
| 77 |
+
β β
|
| 78 |
+
βββββββββββββ¬ββββββββββββββββββββ
|
| 79 |
+
β
|
| 80 |
+
Generates Datasets
|
| 81 |
+
(leaderboard, traces, metrics)
|
| 82 |
+
β
|
| 83 |
+
βββββββββββββ΄ββββββββββββββββββββ
|
| 84 |
+
β β
|
| 85 |
+
π οΈ TraceMind MCP Server π§ TraceMind-AI
|
| 86 |
+
(Track 1 - Building MCP) (This Project - Track 2)
|
| 87 |
+
Provides AI Tools Consumes MCP Tools
|
| 88 |
+
ββββββββββ MCP Protocol βββββββββ
|
| 89 |
```
|
| 90 |
|
| 91 |
+
### The Foundation
|
|
|
|
|
|
|
|
|
|
|
|
|
| 92 |
|
| 93 |
+
**π TraceVerde** - Automatic OpenTelemetry instrumentation for LLM frameworks
|
| 94 |
+
β Captures every LLM call, tool usage, and agent step
|
| 95 |
+
β [GitHub](https://github.com/Mandark-droid/genai_otel_instrument) | [PyPI](https://pypi.org/project/genai-otel-instrument)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 96 |
|
| 97 |
+
**π SMOLTRACE** - Lightweight evaluation engine with built-in tracing
|
| 98 |
+
β Generates structured datasets (leaderboard, results, traces, metrics)
|
| 99 |
+
β [GitHub](https://github.com/Mandark-droid/SMOLTRACE) | [PyPI](https://pypi.org/project/smoltrace/)
|
| 100 |
|
| 101 |
+
### The Platform
|
| 102 |
|
| 103 |
+
**π οΈ TraceMind MCP Server** - AI-powered analysis tools via MCP
|
| 104 |
+
β [Live Demo](https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind-mcp-server) | [GitHub](https://github.com/Mandark-droid/TraceMind-mcp-server)
|
| 105 |
+
β **Track 1**: Building MCP (Enterprise)
|
| 106 |
|
| 107 |
+
**π§ TraceMind-AI** (This Project) - Interactive UI that consumes MCP tools
|
| 108 |
+
β **Track 2**: MCP in Action (Enterprise)
|
| 109 |
|
| 110 |
+
---

## Key Features

### 🎯 MCP Integration (Track 2)

TraceMind-AI demonstrates **enterprise MCP client usage** in two ways:

**1. Direct MCP Client Integration**
- Connects to TraceMind MCP Server via SSE transport
- Uses 5 AI-powered tools: `analyze_leaderboard`, `estimate_cost`, `debug_trace`, `compare_runs`, `analyze_results`
- Real-time insights powered by Google Gemini 2.5 Flash

**2. Autonomous Agent with MCP Tools**
- Built with `smolagents` framework
- Agent has access to all MCP server tools
- Natural language queries → autonomous tool execution
- Example: *"What are the top 3 models and how much do they cost?"*
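
In outline, the client flow looks roughly like this (pseudocode — the exact class and method names in `smolagents` may differ; see [MCP_INTEGRATION.md](MCP_INTEGRATION.md) for the real wiring):

```
# pseudocode sketch, not the actual implementation
client = MCPClient(server_url=".../sse", transport="sse")
tools  = client.get_tools()      # analyze_leaderboard, estimate_cost, ...
agent  = ToolCallingAgent(tools=tools, model=hf_inference_model)
answer = agent.run("What are the top 3 models and how much do they cost?")
```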

### 📊 Agent Evaluation Features

- **Live Leaderboard**: View all evaluation runs with sortable metrics
- **Cost Estimation**: Auto-select hardware and predict costs before running
- **Trace Visualization**: Deep-dive into OpenTelemetry traces with GPU metrics
- **Multi-Cloud Jobs**: Submit evaluations to HuggingFace Jobs or Modal
- **Performance Analytics**: GPU utilization, CO2 emissions, token tracking

### 💡 Smart Features

- **Auto Hardware Selection**: Based on model size and provider
- **Real-time Job Monitoring**: Track HuggingFace Jobs status
- **Agent Reasoning Visibility**: See step-by-step tool execution
- **Quick Action Buttons**: One-click common queries
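
The auto hardware selection can be pictured as a simple size-based heuristic. This is an illustrative sketch, not the app's actual logic — the `pick_hardware` name, the thresholds, and the tier names are assumptions for exposition:

```python
def pick_hardware(param_count_b: float, is_api_model: bool) -> str:
    """Illustrative heuristic: map model size (billions of params) to a hardware tier.

    API-hosted models need no local GPU; self-hosted models scale with size.
    Thresholds and tier names are assumptions, not TraceMind's actual table.
    """
    if is_api_model:
        return "cpu-basic"   # inference happens on the provider's side
    if param_count_b <= 8:
        return "gpu-t4"      # small models fit on a 16 GB GPU
    if param_count_b <= 34:
        return "gpu-a10g"    # mid-size models need ~24 GB
    return "gpu-a100"        # 70B+ models need a large-memory GPU


# Example: an API model vs. a large self-hosted model
print(pick_hardware(8, is_api_model=True))    # cpu-basic
print(pick_hardware(70, is_api_model=False))  # gpu-a100
```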

---

## Quick Start

### Option 1: Use the Live Demo (Recommended)

1. **Visit**: https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind
2. **Login**: Sign in with your HuggingFace account
3. **Explore**: Browse the leaderboard, chat with the agent, visualize traces

### Option 2: Run Locally

```bash
# Clone and setup
git clone https://github.com/Mandark-droid/TraceMind-AI.git
cd TraceMind-AI
pip install -r requirements.txt

# Configure environment
cp .env.example .env
# Edit .env with your API keys (see Configuration section)

# Run the app
python app.py
```
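
After `cp .env.example .env`, the file holds your credentials. `MODAL_TOKEN_ID`/`MODAL_TOKEN_SECRET` appear in the Configuration section below; the other variable names here are illustrative placeholders — check `.env.example` for the authoritative names:

```bash
# .env (illustrative — see .env.example for the exact variable names)
HF_TOKEN=hf_xxxxxxxxxxxx                    # HuggingFace token (Read, or Read + Write + Run Jobs)
GEMINI_API_KEY=your-gemini-key              # Google Gemini key for AI insights
MODAL_TOKEN_ID=your-modal-token-id          # optional, for Modal jobs
MODAL_TOKEN_SECRET=your-modal-token-secret
```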

Visit http://localhost:7860

---

## Configuration

### For Viewing (Free)

**Required**:
- HuggingFace account (free)
- HuggingFace token with **Read** permissions

### For Submitting Jobs (Paid)

**Required**:
- ⚠️ **HuggingFace Pro** ($9/month) with credit card
- HuggingFace token with **Read + Write + Run Jobs** permissions
- LLM provider API keys (OpenAI, Anthropic, etc.)

**Optional (Modal Alternative)**:
- Modal account (pay-per-second, no subscription)
- Modal API token (MODAL_TOKEN_ID + MODAL_TOKEN_SECRET)

### Using Your Own API Keys (Recommended for Judges)

To prevent rate limits during evaluation:

**Step 1: Configure MCP Server** (Required for AI tools)
1. Visit: https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind-mcp-server
2. Go to **⚙️ Settings** tab
3. Enter: **Gemini API Key** + **HuggingFace Token**
4. Click **"Save & Override Keys"**

**Step 2: Configure TraceMind-AI** (Optional)
1. Visit: https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind
2. Go to **⚙️ Settings** tab
3. Enter: **Gemini API Key** + **HuggingFace Token**
4. Click **"Save API Keys"**

**Get Free API Keys**:
- **Gemini**: https://ai.google.dev/ (1,500 requests/day)
- **HuggingFace**: https://huggingface.co/settings/tokens (unlimited for public datasets)

---

## For Hackathon Judges

### ✅ Track 2 Compliance

- **MCP Client Integration**: Connects to remote MCP server via SSE transport
- **Autonomous Agent**: `smolagents` agent with MCP tool access
- **Enterprise Focus**: Cost optimization, job submission, performance analytics
- **Production-Ready**: Deployed to HuggingFace Spaces with OAuth authentication
- **Real Data**: Live HuggingFace datasets from SMOLTRACE evaluations

### 🎯 Key Innovations

1. **Dual MCP Integration**: Both direct MCP client + autonomous agent with MCP tools
2. **Multi-Cloud Support**: HuggingFace Jobs + Modal for serverless compute
3. **Auto Hardware Selection**: Smart hardware recommendations based on model size
4. **Complete Ecosystem**: Part of 4-project platform demonstrating full evaluation workflow
5. **Agent Reasoning Visibility**: See step-by-step MCP tool execution

### 📹 Demo Materials

- **🎥 Demo Video**: [Coming Soon - Link to walkthrough]
- **📢 Social Post**: [Coming Soon - Link to announcement]

### 🧪 Testing Suggestions

**1. Try the Agent Chat** (🤖 Agent Chat tab):
- "Analyze the current leaderboard and show me the top 5 models"
- "Compare the costs of the top 3 models"
- "Estimate the cost of running 100 tests with GPT-4"

**2. Explore the Leaderboard** (📊 Leaderboard tab):
- Click "Load Leaderboard" to see live data
- Read the AI-generated insights (powered by MCP server)
- Click on a run to see detailed test results

**3. Visualize Traces** (Select a run → View traces):
- See OpenTelemetry waterfall diagrams
- View GPU metrics overlay (for GPU jobs)
- Ask questions about the trace (MCP-powered debugging)

---

## What Can You Do?

### 📊 View & Analyze

- **Browse leaderboard** with AI-powered insights
- **Compare models** side-by-side across metrics
- **Analyze traces** with interactive visualization
- **Ask questions** via autonomous agent

### 💰 Estimate & Plan

- **Get cost estimates** before running evaluations
- **Compare hardware options** (CPU vs GPU tiers)
- **Preview duration** and CO2 emissions
- **See recommendations** from AI analysis

### 🚀 Submit & Monitor

- **Submit evaluation jobs** to HuggingFace or Modal
- **Track job status** in real-time
- **View results** automatically when complete
- **Download datasets** for further analysis

### 🧪 Generate & Customize

- **Generate synthetic datasets** for custom domains and tools
- **Create prompt templates** optimized for your use case
- **Push to HuggingFace Hub** with one click
- **Test evaluations** without writing code

---

## Documentation

**For quick evaluation**:
- Read this README for an overview
- Visit the [Live Demo](https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind) to try it
- Check out the **🤖 Agent Chat** tab for autonomous MCP usage

**For deep dives**:
- [USER_GUIDE.md](USER_GUIDE.md) - Complete screen-by-screen walkthrough
  - Leaderboard tab usage
  - Agent chat interactions
  - Synthetic data generator
  - Job submission workflow
  - Trace visualization guide
- [MCP_INTEGRATION.md](MCP_INTEGRATION.md) - MCP client architecture
  - How TraceMind-AI connects to MCP server
  - Agent framework integration (smolagents)
  - MCP tool usage examples
- [JOB_SUBMISSION.md](JOB_SUBMISSION.md) - Evaluation job guide
  - HuggingFace Jobs setup
  - Modal integration
  - Hardware selection guide
  - Cost optimization tips
- [ARCHITECTURE.md](ARCHITECTURE.md) - Technical architecture
  - Project structure
  - Data flow
  - Authentication
  - Deployment

---

## Technology Stack

- **UI Framework**: Gradio 5.49.1
- **Agent Framework**: smolagents 1.22.0+
- **MCP Integration**: MCP Python SDK + smolagents MCPClient
- **Data Source**: HuggingFace Datasets API
- **Authentication**: HuggingFace OAuth
- **AI Models**:
  - Agent: Qwen/Qwen2.5-Coder-32B-Instruct (HF API)
  - MCP Server: Google Gemini 2.5 Flash
- **Cloud Platforms**: HuggingFace Jobs + Modal

---

## Example Workflows

### Workflow 1: Quick Analysis
1. Open TraceMind-AI
2. Go to **🤖 Agent Chat**
3. Click **"Quick: Top Models"**
4. See agent fetch leaderboard and analyze top performers
5. Ask follow-up: *"Which one is most cost-effective?"*

### Workflow 2: Submit Evaluation Job
1. Go to **⚙️ Settings** → Configure API keys
2. Go to **🚀 New Evaluation**
3. Select model (e.g., `meta-llama/Llama-3.1-8B`)
4. Choose infrastructure (HuggingFace Jobs or Modal)
5. Click **"💰 Estimate Cost"** to preview
6. Click **"Submit Evaluation"**
7. Monitor job in **📈 Job Monitoring** tab
8. View results in leaderboard when complete

### Workflow 3: Debug Agent Behavior
1. Browse **📊 Leaderboard**
2. Click on a run with failures
3. View **detailed test results**
4. Click on a failed test to see trace
5. Use MCP-powered Q&A: *"Why did this test fail?"*
6. Get AI analysis of the execution trace

### Workflow 4: Generate Custom Test Dataset
1. Go to **🔬 Synthetic Data Generator**
2. Configure:
   - Domain: `finance`
   - Tools: `get_stock_price,calculate_profit,send_alert`
   - Number of tasks: `20`
   - Difficulty: `balanced`
3. Click **"Generate Dataset"**
4. Review generated tasks and prompt template
5. Enter repository name: `yourname/smoltrace-finance-tasks`
6. Click **"Push to HuggingFace Hub"**
7. Use your custom dataset in evaluations

---

## Screenshots

*See [SCREENSHOTS.md](SCREENSHOTS.md) for annotated screenshots of all screens*

---

## 🔗 Quick Links

### 📦 Component Links

| Component | Description | Links |
|-----------|-------------|-------|
| **TraceVerde** | OTEL Instrumentation | [GitHub](https://github.com/Mandark-droid/genai_otel_instrument) • [PyPI](https://pypi.org/project/genai-otel-instrument) |
| **SMOLTRACE** | Evaluation Engine | [GitHub](https://github.com/Mandark-droid/SMOLTRACE) • [PyPI](https://pypi.org/project/smoltrace/) |
| **MCP Server** | Building MCP (Track 1) | [HF Space](https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind-mcp-server) • [GitHub](https://github.com/Mandark-droid/TraceMind-mcp-server) |
| **TraceMind-AI** | MCP in Action (Track 2) | [HF Space](https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind) • [GitHub](https://github.com/Mandark-droid/TraceMind-AI) |

### 📢 Community Posts

- 🔗 [**TraceMind Teaser**](https://www.linkedin.com/posts/kshitij-thakkar-2061b924_mcpsfirstbirthdayhackathon-mcpsfirstbirthdayhackathon-activity-7395686529270013952-g_id) - MCP's 1st Birthday Hackathon announcement
- 🔗 [**SMOLTRACE Launch**](https://www.linkedin.com/posts/kshitij-thakkar-2061b924_ai-machinelearning-llm-activity-7394350375908126720-im_T) - Lightweight agent evaluation engine
- 🔗 [**TraceVerde Launch**](https://www.linkedin.com/posts/kshitij-thakkar-2061b924_genai-opentelemetry-observability-activity-7390339855135813632-wqEg) - Zero-code OTEL instrumentation for LLMs
- 🔗 [**TraceVerde 3K Downloads**](https://www.linkedin.com/posts/kshitij-thakkar-2061b924_thank-you-open-source-community-a-week-activity-7392205780592132096-nu6U) - Thank you to the community!

---

## 🗺️ Future Roadmap

We're committed to making TraceMind the most comprehensive agent evaluation platform. Here's what's coming next:

### 1. 🏗️ Dynamic MCP Server Generator
Generate domain-specific MCP servers on-the-fly with custom tools via AI code generation.
**Use case**: Rapidly prototype MCP servers without writing boilerplate code.

### 2. 🎯 Intelligent Model Router
Automatically select optimal models based on real-time leaderboard data, budget constraints, and accuracy requirements.
**Use case**: Optimize evaluation costs while maintaining quality for large-scale continuous evaluation.

### 3. 🔬 Automated A/B Testing Framework
Compare multiple agent configurations with statistical significance testing and automatic winner selection.
**Use case**: Find the optimal agent configuration scientifically before production deployment.

### 4. 👥 Collaborative Evaluation Workspace
Real-time collaboration with shared runs, team comments, cost budgets, and stakeholder reports.
**Use case**: Streamline team workflows and coordinate evaluation efforts across distributed teams.

### 5. 🔄 CI/CD Pipeline Integration
Automated agent evaluation on every PR with GitHub Actions, result comments, and merge blocking on quality drops.
**Use case**: Catch agent performance regressions before production and maintain quality standards automatically.

### 6. 🧰 Integrated SMOLTRACE CLI Features
Bring all SMOLTRACE CLI tools into the UI: clean, copy, distill, merge, export, validate, anonymize datasets.
**Use case**: Manage evaluation datasets efficiently without the command line, with visual preview and undo capabilities.

---

**Implementation Timeline**: Q1-Q4 2026 | **Want to contribute?** Join our community and help shape the future of agent evaluation!

---

## Credits

**Built for**: MCP's 1st Birthday Hackathon (Nov 14-30, 2025)
**Track**: MCP in Action (Enterprise)
**Author**: Kshitij Thakkar
**Powered by**: TraceMind MCP Server + Gradio + smolagents
**Built with**: Gradio 5.49.1 (MCP client integration)

**Special Thanks**:
- **[Eliseu Silva](https://huggingface.co/elismasilva)** - For the [gradio_htmlplus](https://huggingface.co/spaces/elismasilva/gradio_htmlplus) custom component that powers our interactive leaderboard table. Eliseu's timely help and collaboration during the hackathon were invaluable!

**Sponsors**: HuggingFace • Google Gemini • Modal • Anthropic • Gradio • ElevenLabs • SambaNova • Blaxel

---

## License

AGPL-3.0 - See [LICENSE](LICENSE) for details

---

## Support

- 📧 GitHub Issues: [TraceMind-AI/issues](https://github.com/Mandark-droid/TraceMind-AI/issues)
- 💬 HF Discord: `#mcp-1st-birthday-official🎂`
- 🏷️ Tag: `mcp-in-action-track-enterprise`
- 🐦 Twitter: [@TraceMindAI](https://twitter.com/TraceMindAI) (placeholder)

---

**Ready to evaluate your agents with AI-powered intelligence?**

🚀 **Try the live demo**: https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind

---

**USER_GUIDE.md** (added, +1,026 lines)

# TraceMind-AI - Complete User Guide
|
| 2 |
+
|
| 3 |
+
This guide provides a comprehensive walkthrough of all features and screens in TraceMind-AI.
|
| 4 |
+
|
| 5 |
+
## Table of Contents
|
| 6 |
+
|
| 7 |
+
- [Getting Started](#getting-started)
|
| 8 |
+
- [Screen-by-Screen Guide](#screen-by-screen-guide)
|
| 9 |
+
- [π Leaderboard](#-leaderboard)
|
| 10 |
+
- [π€ Agent Chat](#-agent-chat)
|
| 11 |
+
- [π New Evaluation](#-new-evaluation)
|
| 12 |
+
- [π Job Monitoring](#-job-monitoring)
|
| 13 |
+
- [π Trace Visualization](#-trace-visualization)
|
| 14 |
+
- [π¬ Synthetic Data Generator](#-synthetic-data-generator)
|
| 15 |
+
- [βοΈ Settings](#οΈ-settings)
|
| 16 |
+
- [Common Workflows](#common-workflows)
|
| 17 |
+
- [Troubleshooting](#troubleshooting)
|
| 18 |
+
|
| 19 |
+
---
|
| 20 |
+
|
| 21 |
+
## Getting Started
|
| 22 |
+
|
| 23 |
+
### First-Time Setup
|
| 24 |
+
|
| 25 |
+
1. **Visit** https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind
|
| 26 |
+
2. **Sign in** with your HuggingFace account (required for viewing)
|
| 27 |
+
3. **Configure API keys** (optional but recommended):
|
| 28 |
+
- Go to **βοΈ Settings** tab
|
| 29 |
+
- Enter Gemini API Key and HuggingFace Token
|
| 30 |
+
- Click **"Save API Keys"**
|
| 31 |
+
|
| 32 |
+
### Navigation
|
| 33 |
+
|
| 34 |
+
TraceMind-AI is organized into tabs:
|
| 35 |
+
- **π Leaderboard**: View evaluation results with AI insights
|
| 36 |
+
- **π€ Agent Chat**: Interactive autonomous agent powered by MCP tools
|
| 37 |
+
- **π New Evaluation**: Submit evaluation jobs to HF Jobs or Modal
|
| 38 |
+
- **π Job Monitoring**: Track status of submitted jobs
|
| 39 |
+
- **π Trace Visualization**: Deep-dive into agent execution traces
|
| 40 |
+
- **π¬ Synthetic Data Generator**: Create custom test datasets with AI
|
| 41 |
+
- **βοΈ Settings**: Configure API keys and preferences
|
| 42 |
+
|
| 43 |
+
---
|
| 44 |
+
|
| 45 |
+
## Screen-by-Screen Guide
|
| 46 |
+
|
| 47 |
+
### π Leaderboard
|
| 48 |
+
|
| 49 |
+
**Purpose**: Browse all evaluation runs with AI-powered insights and detailed analysis.
|
| 50 |
+
|
| 51 |
+
#### Features
|
| 52 |
+
|
| 53 |
+
**Main Table**:
|
| 54 |
+
- View all evaluation runs from the SMOLTRACE leaderboard
|
| 55 |
+
- Sortable columns: Model, Success Rate, Cost, Duration, CO2 emissions
|
| 56 |
+
- Click any row to see detailed test results
|
| 57 |
+
|
| 58 |
+
**AI Insights Panel** (Top of screen):
|
| 59 |
+
- Automatically generated insights from MCP server
|
| 60 |
+
- Powered by Google Gemini 2.5 Flash
|
| 61 |
+
- Updates when you click "Load Leaderboard"
|
| 62 |
+
- Shows top performers, trends, and recommendations
|
| 63 |
+
|
| 64 |
+
**Filter & Sort Options**:
|
| 65 |
+
- Filter by agent type (tool, code, both)
|
| 66 |
+
- Filter by provider (litellm, transformers)
|
| 67 |
+
- Sort by any metric (success rate, cost, duration)
|
| 68 |
+
|
| 69 |
+
#### How to Use
|
| 70 |
+
|
| 71 |
+
1. **Load Data**:
|
| 72 |
+
```
|
| 73 |
+
Click "Load Leaderboard" button
|
| 74 |
+
β Fetches latest evaluation runs from HuggingFace
|
| 75 |
+
β AI generates insights automatically
|
| 76 |
+
```
|
| 77 |
+
|
| 78 |
+
2. **Read AI Insights**:
|
| 79 |
+
- Located at top of screen
|
| 80 |
+
- Summary of evaluation trends
|
| 81 |
+
- Top performing models
|
| 82 |
+
- Cost/accuracy trade-offs
|
| 83 |
+
- Actionable recommendations
|
| 84 |
+
|
| 85 |
+
3. **Explore Runs**:
|
| 86 |
+
- Scroll through table
|
| 87 |
+
- Sort by clicking column headers
|
| 88 |
+
- Click on any run to see details
|
| 89 |
+
|
| 90 |
+
4. **View Details**:
|
| 91 |
+
```
|
| 92 |
+
Click a row in the table
|
| 93 |
+
β Opens detail view with:
|
| 94 |
+
- All test cases (success/failure)
|
| 95 |
+
- Execution times
|
| 96 |
+
- Cost breakdown
|
| 97 |
+
- Link to trace visualization
|
| 98 |
+
```
|
| 99 |
+
|
| 100 |
+
#### Example Workflow
|
| 101 |
+
|
| 102 |
+
```
|
| 103 |
+
Scenario: Find the most cost-effective model for production
|
| 104 |
+
|
| 105 |
+
1. Click "Load Leaderboard"
|
| 106 |
+
2. Read AI insights: "Llama-3.1-8B offers best cost/performance at $0.002/run"
|
| 107 |
+
3. Sort table by "Cost" (ascending)
|
| 108 |
+
4. Compare top 3 cheapest models
|
| 109 |
+
5. Click on Llama-3.1-8B run to see detailed results
|
| 110 |
+
6. Review success rate (93.4%) and test case breakdowns
|
| 111 |
+
7. Decision: Use Llama-3.1-8B for cost-sensitive workloads
|
| 112 |
+
```
|
| 113 |
+
|
| 114 |
+
#### Tips
|
| 115 |
+
|
| 116 |
+
- **Refresh regularly**: Click "Load Leaderboard" to see new evaluation results
|
| 117 |
+
- **Compare models**: Use the sort function to compare across different metrics
|
| 118 |
+
- **Trust the AI**: The insights panel provides strategic recommendations based on all data
|
| 119 |
+
|
| 120 |
+
---
|
| 121 |
+
|
| 122 |
+
### π€ Agent Chat
|
| 123 |
+
|
| 124 |
+
**Purpose**: Interactive autonomous agent that can answer questions about evaluations using MCP tools.
|
| 125 |
+
|
| 126 |
+
**π― Track 2 Feature**: This demonstrates MCP client usage with smolagents framework.
|
| 127 |
+
|
| 128 |
+
#### Features
|
| 129 |
+
|
| 130 |
+
**Autonomous Agent**:
|
| 131 |
+
- Built with `smolagents` framework
|
| 132 |
+
- Has access to all TraceMind MCP Server tools
|
| 133 |
+
- Plans and executes multi-step actions
|
| 134 |
+
- Provides detailed, data-driven answers
|
| 135 |
+
|
| 136 |
+
**MCP Tools Available to Agent**:
|
| 137 |
+
- `analyze_leaderboard` - Get AI insights about top performers
|
| 138 |
+
- `estimate_cost` - Calculate evaluation costs before running
|
| 139 |
+
- `debug_trace` - Analyze execution traces
|
| 140 |
+
- `compare_runs` - Compare two evaluation runs
|
| 141 |
+
- `get_top_performers` - Fetch top N models efficiently
|
| 142 |
+
- `get_leaderboard_summary` - Get high-level statistics
|
| 143 |
+
- `get_dataset` - Load SMOLTRACE datasets
|
| 144 |
+
- `analyze_results` - Analyze detailed test results
|
| 145 |
+
|
| 146 |
+
**Agent Reasoning Visibility**:
|
| 147 |
+
- Toggle **"Show Agent Reasoning"** to see:
|
| 148 |
+
- Planning steps
|
| 149 |
+
- Tool execution logs
|
| 150 |
+
- Intermediate results
|
| 151 |
+
- Final synthesis
|
| 152 |
+
|
| 153 |
+
**Quick Action Buttons**:
|
| 154 |
+
- **"Quick: Top Models"**: Get top 5 models with costs
|
| 155 |
+
- **"Quick: Cost Estimate"**: Estimate cost for a model
|
| 156 |
+
- **"Quick: Load Leaderboard"**: Fetch leaderboard summary
|
| 157 |
+
|
| 158 |
+
#### How to Use
|
| 159 |
+
|
| 160 |
+
1. **Start a Conversation**:
|
| 161 |
+
```
|
| 162 |
+
Type your question in the chat box
|
| 163 |
+
Example: "What are the top 3 performing models and how much do they cost?"
|
| 164 |
+
|
| 165 |
+
Click "Send"
|
| 166 |
+
β Agent plans approach
|
| 167 |
+
   → Executes MCP tools
   → Returns synthesized answer
   ```

2. **Watch Agent Work** (optional):
   ```
   Enable "Show Agent Reasoning" checkbox
   → See step-by-step execution:
     Step 1: Calling get_top_performers with metric=success_rate, top_n=3
     Step 2: Analyzing costs from results
     Step 3: Formatting response with recommendations
   ```

3. **Ask Follow-Up Questions**:
   ```
   You: "Which one is most cost-effective?"
   Agent: Uses previous context + MCP tools to answer
   ```

4. **Use Quick Actions**:
   ```
   Click "Quick: Top Models" button
   → Agent automatically executes query
   → Returns top 5 models with metrics
   ```

#### Example Conversations

**Example 1: Cost Comparison**
```
User: "Compare the costs of the top 3 models. Which one offers the best value?"

Agent:
Let me analyze the top 3 models by success rate:

[Calls: get_top_performers(metric="success_rate", top_n=3)]

Top 3 Models:
1. GPT-4: 95.8% success, $0.05/run
2. Claude-3: 94.1% success, $0.04/run
3. Llama-3.1-8B: 93.4% success, $0.002/run

[Calls: analyze_leaderboard(metric_focus="cost")]

Cost Analysis:
- GPT-4: Highest accuracy but 25x more expensive than Llama
- Claude-3: Good middle ground at $0.04/run
- Llama-3.1: Best value at $0.002/run with only 2.4% accuracy drop

Recommendation for "best value":
→ Llama-3.1-8B offers 93.4% accuracy at $0.002/run
→ For 1M runs: Saves $48,000/month vs GPT-4
→ Only 2.4% accuracy trade-off
```
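
The savings claim in this example is simple arithmetic, and it is worth re-checking figures like this yourself before acting on an agent recommendation. A minimal sketch using the per-run prices quoted above (illustrative figures, not live leaderboard data):

```python
# Back-of-the-envelope check of the agent's cost claims above.
# Per-run prices are the illustrative figures from the example conversation.
MODELS = {
    "GPT-4": {"success_rate": 95.8, "cost_per_run": 0.05},
    "Claude-3": {"success_rate": 94.1, "cost_per_run": 0.04},
    "Llama-3.1-8B": {"success_rate": 93.4, "cost_per_run": 0.002},
}

def monthly_savings(baseline: str, candidate: str, runs_per_month: int) -> float:
    """Dollar savings per month from switching baseline -> candidate."""
    delta = MODELS[baseline]["cost_per_run"] - MODELS[candidate]["cost_per_run"]
    return delta * runs_per_month

def accuracy_drop(baseline: str, candidate: str) -> float:
    """Success-rate difference in percentage points."""
    return MODELS[baseline]["success_rate"] - MODELS[candidate]["success_rate"]

savings = monthly_savings("GPT-4", "Llama-3.1-8B", 1_000_000)
drop = accuracy_drop("GPT-4", "Llama-3.1-8B")
print(f"Switching saves ${savings:,.0f}/month for a {drop:.1f}% accuracy drop")
```

This reproduces the $48,000/month and 2.4% figures in the conversation above.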

**Example 2: Evaluation Planning**
```
User: "I want to evaluate DeepSeek-V3 with 100 tests. How much will it cost and which hardware should I use?"

Agent:
Let me estimate costs for DeepSeek-V3:

[Calls: estimate_cost(model="deepseek-ai/DeepSeek-V3", num_tests=100, hardware="auto")]

Cost Estimate:
- Model: deepseek-ai/DeepSeek-V3
- Tests: 100
- Recommended hardware: GPU A100 (70B+ parameters)

Breakdown:
- LLM inference cost: ~$0.01/test (self-hosted, no API fees)
- HF Jobs GPU cost: ~$3.00/hr for A100
- Estimated duration: 25 minutes
- Total compute cost: $1.25
- Total cost: $1.25 (100 tests)

Recommendation:
Use HuggingFace Jobs with A100 GPU. Total cost: $1.25 for 100 tests.
Alternative: Modal H200 for 2x faster inference at $2.50.
```
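
The compute figure follows directly from the hourly rate and duration. A small sketch reproducing the estimate above (the rate and duration are the illustrative figures from the conversation, not current HF Jobs pricing):

```python
# Reproducing the compute-cost estimate from the example above.
HOURLY_RATE_A100 = 3.00  # $/hr for an A100 on HF Jobs (example figure)
DURATION_MIN = 25        # estimated evaluation duration in minutes
NUM_TESTS = 100

def compute_cost(hourly_rate: float, duration_minutes: float) -> float:
    """GPU cost = hourly rate x fraction of an hour actually used."""
    return hourly_rate * duration_minutes / 60

total = compute_cost(HOURLY_RATE_A100, DURATION_MIN)
print(f"Total compute cost: ${total:.2f} (${total / NUM_TESTS:.4f}/test)")
```

This yields the $1.25 total quoted above, i.e. roughly a cent per test.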

#### Tips

- **Be specific**: Ask clear, focused questions for better answers
- **Use context**: The agent remembers conversation history
- **Watch reasoning**: Enable it to understand how the agent uses MCP tools
- **Try quick actions**: Fast way to get common information

---

### 🚀 New Evaluation

**Purpose**: Submit evaluation jobs to HuggingFace Jobs or Modal for running SMOLTRACE evaluations.

**⚠️ Requires**: HuggingFace Pro account ($9/month) with credit card, or Modal account.

#### Features

**Model Selection**:
- Enter any model name (format: `provider/model-name`)
- Examples: `openai/gpt-4`, `meta-llama/Llama-3.1-8B`, `deepseek-ai/DeepSeek-V3`
- Auto-detects whether the model is an API model or a local model

**Infrastructure Choice**:
- **HuggingFace Jobs**: Managed compute (H200, A100, A10, T4, CPU)
- **Modal**: Serverless GPU compute (pay-per-second)

**Hardware Selection**:
- **Auto** (recommended): Automatically selects optimal hardware based on model size
- **Manual**: Choose a specific GPU tier (A10, A100, H200) or CPU

**Cost Estimation**:
- Click **"💰 Estimate Cost"** before submitting
- Shows predicted:
  - LLM API costs (for API models)
  - Compute costs (for local models)
  - Duration estimate
  - CO2 emissions

**Agent Type**:
- **tool**: Test tool-calling capabilities
- **code**: Test code generation capabilities
- **both**: Test both (recommended)

#### How to Use

**Step 1: Configure Prerequisites** (one-time setup)

For **HuggingFace Jobs**:
```
1. Sign up for HF Pro: https://huggingface.co/pricing ($9/month)
2. Add a credit card for compute charges
3. Create an HF token with "Read + Write + Run Jobs" permissions
4. Go to Settings tab → Enter HF token → Save
```

For **Modal** (alternative):
```
1. Sign up: https://modal.com (free tier available)
2. Generate an API token: https://modal.com/settings/tokens
3. Go to Settings tab → Enter MODAL_TOKEN_ID + MODAL_TOKEN_SECRET → Save
```

For **API Models** (OpenAI, Anthropic, etc.):
```
1. Get an API key from the provider (e.g., https://platform.openai.com/api-keys)
2. Go to Settings tab → Enter provider API key → Save
```

**Step 2: Create Evaluation**

```
1. Enter model name:
   Example: "meta-llama/Llama-3.1-8B"

2. Select infrastructure:
   - HuggingFace Jobs (default)
   - Modal (alternative)

3. Choose agent type:
   - "both" (recommended)

4. Select hardware:
   - "auto" (recommended - smart selection)
   - Or choose manually: cpu-basic, t4-small, a10g-small, a100-large, h200

5. Set timeout (optional):
   - Default: 3600s (1 hour)
   - Range: 300s - 7200s

6. Click "💰 Estimate Cost":
   → Shows predicted cost and duration
   → Example: "$2.00, 20 minutes, 0.5g CO2"

7. Review the estimate, then click "Submit Evaluation"
```

**Step 3: Monitor Job**

```
After submission:
→ Job ID displayed
→ Go to "📊 Job Monitoring" tab to track progress
→ Or visit the HuggingFace Jobs dashboard: https://huggingface.co/jobs
```

**Step 4: View Results**

```
When the job completes:
→ Results automatically uploaded to HuggingFace datasets
→ Appears in the Leaderboard within 1-2 minutes
→ Click on your run to see detailed results
```

#### Hardware Selection Guide

**For API Models** (OpenAI, Anthropic, Google):
- Use: `cpu-basic` (HF Jobs) or CPU (Modal)
- Cost: ~$0.05/hr (HF), ~$0.0001/sec (Modal)
- Why: No GPU needed for API calls

**For Small Models** (4B-8B parameters):
- Use: `t4-small` (HF) or A10G (Modal)
- Cost: ~$0.60/hr (HF), ~$0.0006/sec (Modal)
- Examples: Llama-3.1-8B, Mistral-7B

**For Medium Models** (7B-13B parameters):
- Use: `a10g-small` (HF) or A10G (Modal)
- Cost: ~$1.10/hr (HF), ~$0.0006/sec (Modal)
- Examples: Qwen2.5-14B, Mixtral-8x7B

**For Large Models** (70B+ parameters):
- Use: `a100-large` (HF) or A100-80GB (Modal)
- Cost: ~$3.00/hr (HF), ~$0.0030/sec (Modal)
- Examples: Llama-3.1-70B, DeepSeek-V3

**For Fastest Inference**:
- Use: `h200` (HF or Modal)
- Cost: ~$5.00/hr (HF), ~$0.0050/sec (Modal)
- Best for: Time-sensitive evaluations, large batches
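
The guide above implies a simple size-to-tier mapping, which is roughly what the "auto" option does. A minimal sketch of that selection logic (the thresholds mirror the table above; the actual selector in TraceMind may differ):

```python
from typing import Optional

def select_hardware(model_params_b: Optional[float], is_api_model: bool) -> str:
    """Pick an HF Jobs hardware flavor from model size in billions of params.

    Illustrative only: thresholds follow the Hardware Selection Guide,
    not TraceMind's real auto-selector.
    """
    if is_api_model:
        return "cpu-basic"       # API calls need no GPU
    if model_params_b is None:
        return "a10g-small"      # unknown size: pick a safe middle tier
    if model_params_b <= 8:
        return "t4-small"        # small models (4B-8B)
    if model_params_b <= 13:
        return "a10g-small"      # medium models (7B-13B)
    return "a100-large"          # large models (70B+)

print(select_hardware(8, is_api_model=False))    # t4-small
print(select_hardware(None, is_api_model=True))  # cpu-basic
print(select_hardware(70, is_api_model=False))   # a100-large
```

`h200` is left out of the sketch since it is a speed choice rather than a size requirement.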

#### Example Workflows

**Workflow 1: Evaluate API Model (OpenAI GPT-4)**
```
1. Model: "openai/gpt-4"
2. Infrastructure: HuggingFace Jobs
3. Agent type: both
4. Hardware: auto (selects cpu-basic)
5. Estimate: $50.00 (mostly API costs), 45 min
6. Submit → Monitor → View in leaderboard
```

**Workflow 2: Evaluate Local Model (Llama-3.1-8B)**
```
1. Model: "meta-llama/Llama-3.1-8B"
2. Infrastructure: Modal (for pay-per-second billing)
3. Agent type: both
4. Hardware: auto (selects A10G)
5. Estimate: $0.20, 15 min
6. Submit → Monitor → View in leaderboard
```

#### Tips

- **Always estimate first**: Prevents surprise costs
- **Use "auto" hardware**: Smart selection based on model size
- **Start small**: Test with 10-20 tests before scaling to 100+
- **Monitor jobs**: Check the Job Monitoring tab for status
- **Modal for experimentation**: Pay-per-second billing is cost-effective for testing

---

### 📊 Job Monitoring

**Purpose**: Track the status of submitted evaluation jobs.

#### Features

**Job Status Display**:
- Job ID
- Current status (pending, running, completed, failed)
- Start time
- Duration
- Infrastructure (HF Jobs or Modal)

**Real-time Updates**:
- Auto-refreshes every 30 seconds
- Manual refresh button

**Job Actions**:
- View logs
- Cancel job (if still running)
- View results (if completed)

#### How to Use

```
1. Go to "📊 Job Monitoring" tab
2. See the list of your submitted jobs
3. Click "Refresh" for the latest status
4. When status = "completed":
   → Click "View Results"
   → Opens the leaderboard filtered to your run
```
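
If you prefer to script this instead of watching the tab, the same poll-until-terminal loop is easy to write. The sketch below assumes a hypothetical `get_job_status(job_id)` callable standing in for whichever client call your infrastructure exposes; it is not a real TraceMind or HF Jobs function:

```python
import time
from typing import Callable

TERMINAL_STATES = {"completed", "failed"}

def wait_for_job(job_id: str,
                 get_job_status: Callable[[str], str],
                 poll_interval: float = 30.0,
                 timeout: float = 7200.0) -> str:
    """Poll a job until it reaches a terminal state or the timeout expires.

    `get_job_status` is a placeholder; it must return one of:
    pending | running | completed | failed.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_job_status(job_id)
        if status in TERMINAL_STATES:
            return status
        time.sleep(poll_interval)  # the monitoring tab refreshes on the same cadence
    raise TimeoutError(f"Job {job_id} not finished after {timeout}s")

# Simulated backend for illustration: pending -> running -> completed
_states = iter(["pending", "running", "completed"])
print(wait_for_job("job-123", lambda _: next(_states), poll_interval=0))
```

The 30-second default matches the tab's auto-refresh interval; the 7200s cap matches the maximum job timeout above.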

#### Job Statuses

- **Pending**: Job queued, waiting for resources
- **Running**: Evaluation in progress
- **Completed**: Evaluation finished successfully
- **Failed**: Evaluation encountered an error

#### Tips

- **Check logs** if a job fails: Helps diagnose issues
- **Expected duration**:
  - API models: 2-5 minutes
  - Local models: 15-30 minutes (includes model download)

---

### 🔍 Trace Visualization

**Purpose**: Deep-dive into OpenTelemetry traces to understand agent execution.

**Access**: Click on any test case in a run's detail view

#### Features

**Waterfall Diagram**:
- Visual timeline of execution
- Spans show: LLM calls, tool executions, reasoning steps
- Duration bars (wider = slower)
- Parent-child relationships

**Span Details**:
- Span name (e.g., "LLM Call - Reasoning", "Tool Call - get_weather")
- Start/end times
- Duration
- Attributes (model, tokens, cost, tool inputs/outputs)
- Status (OK, ERROR)

**GPU Metrics Overlay** (for GPU jobs only):
- GPU utilization %
- Memory usage
- Temperature
- CO2 emissions

**MCP-Powered Q&A**:
- Ask questions about the trace
- Example: "Why was tool X called twice?"
- The agent uses the `debug_trace` MCP tool to analyze

#### How to Use

```
1. From leaderboard → Click a run → Click a test case
2. View the waterfall diagram:
   → Spans arranged chronologically
   → Parent spans (e.g., "Agent Execution")
   → Child spans (e.g., "LLM Call", "Tool Call")

3. Click any span:
   → See detailed attributes
   → Token counts, costs, inputs/outputs

4. Ask questions (MCP-powered):
   User: "Why did this test fail?"
   → Agent analyzes the trace with the debug_trace tool
   → Returns an explanation with span references

5. Check GPU metrics (if available):
   → Graph shows utilization over time
   → Overlaid on the execution timeline
```

#### Example Analysis

**Scenario: Understanding a slow execution**

```
1. Open trace for test_045 (duration: 8.5s)
2. Waterfall shows:
   - Span 1: LLM Call - Reasoning (1.2s) ✅
   - Span 2: Tool Call - search_web (6.5s) ⚠️ SLOW
   - Span 3: LLM Call - Final Response (0.8s) ✅

3. Click Span 2 (search_web):
   - Input: {"query": "weather in Tokyo"}
   - Output: 5 results
   - Duration: 6.5s (6x slower than typical)

4. Ask agent: "Why was the search_web call so slow?"
   → Agent analysis:
   "The search_web call took 6.5s due to network latency.
    Span attributes show API response time: 6.2s.
    This is an external dependency issue, not agent code.
    Recommendation: Implement a timeout (5s) and fallback strategy."
```
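
The same bottleneck hunt can be done mechanically by ranking spans by duration. A small sketch over a list of span dicts (the `name`/`duration_s` field names are illustrative, not exact OpenTelemetry attribute keys):

```python
def slowest_spans(spans: list, top_n: int = 3) -> list:
    """Return the top_n (name, duration_s) pairs, longest first."""
    ranked = sorted(spans, key=lambda s: s["duration_s"], reverse=True)
    return [(s["name"], s["duration_s"]) for s in ranked[:top_n]]

# The three spans from the example waterfall above
trace = [
    {"name": "LLM Call - Reasoning", "duration_s": 1.2},
    {"name": "Tool Call - search_web", "duration_s": 6.5},
    {"name": "LLM Call - Final Response", "duration_s": 0.8},
]
print(slowest_spans(trace, top_n=1))  # [('Tool Call - search_web', 6.5)]
```

Here the ranking immediately surfaces the `search_web` span that the agent flagged.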

#### Tips

- **Look for patterns**: Similar failures often share common spans
- **Use MCP Q&A**: Faster than manual trace analysis
- **Check GPU metrics**: Identify resource bottlenecks
- **Compare successful vs failed traces**: Spot the differences

---

### 🔬 Synthetic Data Generator

**Purpose**: Generate custom synthetic test datasets for agent evaluation using AI, complete with domain-specific tasks and prompt templates.

#### Features

**AI-Powered Dataset Generation**:
- Generate 5-100 synthetic tasks using Google Gemini 2.5 Flash
- Customizable domain, tools, difficulty, and agent type
- Automatic batching for large datasets (parallel generation)
- SMOLTRACE-format output ready for evaluation

**Prompt Template Generation**:
- Customized YAML templates based on the smolagents format
- Optimized for your specific domain and tools
- Included automatically in the dataset card

**Push to HuggingFace Hub**:
- One-click upload to HuggingFace Hub
- Public or private repositories
- Auto-generated README with usage instructions
- Ready to use with SMOLTRACE evaluations

#### How to Use

**Step 1: Configure & Generate Dataset**

1. Navigate to the **🔬 Synthetic Data Generator** tab

2. Configure generation parameters:
   - **Domain**: Topic/industry (e.g., "travel", "finance", "healthcare", "customer_support")
   - **Tools**: Comma-separated list of tool names (e.g., "get_weather,search_flights,book_hotel")
   - **Number of Tasks**: 5-100 tasks (slider)
   - **Difficulty Level**:
     - `balanced` (40% easy, 40% medium, 20% hard)
     - `easy_only` (100% easy tasks)
     - `medium_only` (100% medium tasks)
     - `hard_only` (100% hard tasks)
     - `progressive` (50% easy, 30% medium, 20% hard)
   - **Agent Type**:
     - `tool` (ToolCallingAgent only)
     - `code` (CodeAgent only)
     - `both` (50/50 mix)

3. Click **"🎲 Generate Synthetic Dataset"**

4. Wait for generation (30-120s depending on size):
   - Shows a progress message
   - Automatic batching for >20 tasks
   - Parallel API calls for faster generation

**Step 2: Review Generated Content**

1. **Dataset Preview Tab**:
   - View all generated tasks in JSON format
   - Check task IDs, prompts, expected tools, difficulty
   - See dataset statistics:
     - Total tasks
     - Difficulty distribution
     - Agent type distribution
     - Tools coverage

2. **Prompt Template Tab**:
   - View the customized YAML prompt template
   - Based on smolagents templates
   - Adapted for your domain and tools
   - Ready to use with ToolCallingAgent or CodeAgent

**Step 3: Push to HuggingFace Hub** (Optional)

1. Enter the **Repository Name**:
   - Format: `username/smoltrace-{domain}-tasks`
   - Example: `alice/smoltrace-finance-tasks`
   - Auto-filled with your HF username after generation

2. Set **Visibility**:
   - ☐ Private Repository (unchecked = public)
   - ☑ Private Repository (checked = private)

3. Provide a **HuggingFace Token** (optional):
   - Leave empty to use the environment token (HF_TOKEN from Settings)
   - Or paste a token from https://huggingface.co/settings/tokens
   - Requires write permissions

4. Click **"📤 Push to HuggingFace Hub"**

5. Wait for upload (5-30s):
   - Creates the dataset repository
   - Uploads tasks
   - Generates a README with:
     - Usage instructions
     - Prompt template
     - SMOLTRACE integration code
   - Returns the dataset URL

#### Example Workflow

```
Scenario: Create a finance evaluation dataset with 20 tasks

1. Configure:
   Domain: "finance"
   Tools: "get_stock_price,calculate_roi,get_market_news,send_alert"
   Number of Tasks: 20
   Difficulty: "balanced"
   Agent Type: "both"

2. Click "Generate"
   → AI generates 20 tasks:
     - 8 easy (single tool, straightforward)
     - 8 medium (multiple tools or complex logic)
     - 4 hard (complex reasoning, edge cases)
     - 10 for ToolCallingAgent
     - 10 for CodeAgent
   → Also generates a customized prompt template

3. Review Dataset Preview:
   Task 1:
   {
     "id": "finance_stock_price_1",
     "prompt": "What is the current price of AAPL stock?",
     "expected_tool": "get_stock_price",
     "difficulty": "easy",
     "agent_type": "tool",
     "expected_keywords": ["AAPL", "price", "$"]
   }

   Task 15:
   {
     "id": "finance_complex_analysis_15",
     "prompt": "Calculate the ROI for investing $10,000 in AAPL last year and send an alert if ROI > 15%",
     "expected_tool": "calculate_roi",
     "expected_tool_calls": 2,
     "difficulty": "hard",
     "agent_type": "code",
     "expected_keywords": ["ROI", "15%", "alert"]
   }

4. Review Prompt Template:
   See customized YAML with:
   - Finance-specific system prompt
   - Tool descriptions for get_stock_price, calculate_roi, etc.
   - Response format guidelines

5. Push to Hub:
   Repository: "yourname/smoltrace-finance-tasks"
   Private: No (public)
   Token: (empty, using environment token)

   → Uploads to https://huggingface.co/datasets/yourname/smoltrace-finance-tasks
   → README includes usage instructions and the prompt template

6. Use in evaluation:
   # Load your custom dataset
   dataset = load_dataset("yourname/smoltrace-finance-tasks")

   # Run SMOLTRACE evaluation
   smoltrace-eval --model openai/gpt-4 \
     --dataset-name yourname/smoltrace-finance-tasks \
     --agent-type both
```
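
Generated tasks are worth validating programmatically before pushing. The sketch below checks the fields shown in the preview above and tallies the difficulty split (the required-field list follows the example tasks; your generated schema may carry extra keys):

```python
from collections import Counter

# Fields present in the example tasks above; extra keys are allowed.
REQUIRED_FIELDS = {"id", "prompt", "expected_tool", "difficulty", "agent_type"}
DIFFICULTIES = {"easy", "medium", "hard"}

def validate_tasks(tasks: list) -> Counter:
    """Raise on malformed tasks; return the difficulty distribution."""
    for task in tasks:
        missing = REQUIRED_FIELDS - task.keys()
        if missing:
            raise ValueError(f"Task {task.get('id', '?')} missing fields: {missing}")
        if task["difficulty"] not in DIFFICULTIES:
            raise ValueError(f"Task {task['id']}: bad difficulty {task['difficulty']!r}")
    return Counter(t["difficulty"] for t in tasks)

tasks = [
    {"id": "finance_stock_price_1",
     "prompt": "What is the current price of AAPL stock?",
     "expected_tool": "get_stock_price", "difficulty": "easy", "agent_type": "tool"},
    {"id": "finance_complex_analysis_15",
     "prompt": "Calculate the ROI for $10,000 in AAPL and alert if ROI > 15%",
     "expected_tool": "calculate_roi", "difficulty": "hard", "agent_type": "code"},
]
print(validate_tasks(tasks))
```

Comparing the returned distribution against your chosen difficulty level (e.g. `balanced` = 40/40/20) is a quick sanity check before uploading.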

#### Configuration Reference

**Difficulty Levels Explained**:

| Level | Characteristics | Example |
|-------|----------------|---------|
| **Easy** | Single tool call, straightforward input, clear expected output | "What's the weather in Tokyo?" → get_weather("Tokyo") |
| **Medium** | Multiple tool calls OR complex input parsing OR conditional logic | "Compare weather in Tokyo and London" → get_weather("Tokyo"), get_weather("London"), compare |
| **Hard** | Multiple tools, complex reasoning, edge cases, error handling | "Plan a trip with best weather, book flights if under $500, alert if unavailable" |

**Agent Types Explained**:

| Type | Description | Use Case |
|------|-------------|----------|
| **tool** | ToolCallingAgent - declarative tool calling with structured outputs | API-based models that support function calling (GPT-4, Claude) |
| **code** | CodeAgent - writes Python code to use tools programmatically | Models that excel at code generation (Qwen-Coder, DeepSeek-Coder) |
| **both** | 50/50 mix of tool and code agent tasks | Comprehensive evaluation across agent types |

#### Best Practices

**Domain Selection**:
- Be specific: "customer_support_saas" > "support"
- Match your use case: Use your actual business domain
- Consider the tools available: The domain should align with the tools

**Tool Names**:
- Use descriptive names: "get_stock_price" > "fetch"
- Match your actual tool implementations
- 3-8 tools is ideal (enough variety without being overwhelming)
- Include a mix of data-retrieval and action tools

**Number of Tasks**:
- 5-10 tasks: Quick testing, proof of concept
- 20-30 tasks: Solid evaluation dataset
- 50-100 tasks: Comprehensive benchmark

**Difficulty Distribution**:
- `balanced`: Best for general evaluation
- `progressive`: Good for learning/debugging
- `easy_only`: Quick sanity checks
- `hard_only`: Stress testing advanced capabilities

**Quality Assurance**:
- Always review generated tasks before pushing
- Check for domain relevance and variety
- Verify expected tools match your actual tools
- Ensure prompts are clear and executable

#### Troubleshooting

**Generation fails with "Invalid API key"**:
- Go to **⚙️ Settings**
- Configure the Gemini API Key
- Get a key from https://aistudio.google.com/apikey

**Generated tasks don't match the domain**:
- Be more specific in the domain description
- Try regenerating with adjusted parameters
- Review the prompt template for domain alignment

**Push to Hub fails with "Authentication error"**:
- Verify the HuggingFace token has write permissions
- Get a token from https://huggingface.co/settings/tokens
- Check the token in **⚙️ Settings** or provide it directly

**Dataset generation is slow (>60s)**:
- Large requests (>20 tasks) are automatically batched
- Each batch takes 30-120s
- Example: 100 tasks = 5 batches × 60s = ~5 minutes
- This is normal for large datasets

**Tasks are too easy/hard**:
- Adjust the difficulty distribution
- Regenerate with different settings
- Mix difficulty levels with `balanced` or `progressive`

#### Advanced Tips

**Iterative Refinement**:
1. Generate 10 tasks with `balanced` difficulty
2. Review quality and variety
3. If satisfied, generate 50-100 tasks with the same settings
4. If not, adjust the domain/tools and regenerate

**Dataset Versioning**:
- Use version suffixes: `username/smoltrace-finance-tasks-v2`
- Iterate on datasets as your tools evolve
- Keep track of which version was used for each evaluation

**Combining Datasets**:
- Generate multiple small datasets for different domains
- Use the SMOLTRACE CLI to merge datasets
- Create comprehensive multi-domain benchmarks

**Custom Prompt Templates**:
- Generate the prompt template separately
- Customize it further based on your needs
- Use it in agent initialization before evaluation
- Include it in the dataset card for reproducibility

---

### ⚙️ Settings

**Purpose**: Configure API keys, preferences, and authentication.

#### Features

**API Key Configuration**:
- Gemini API Key (for MCP server AI analysis)
- HuggingFace Token (for dataset access + job submission)
- Modal Token ID + Secret (for Modal job submission)
- LLM Provider Keys (OpenAI, Anthropic, etc.)

**Preferences**:
- Default infrastructure (HF Jobs vs Modal)
- Default hardware tier
- Auto-refresh intervals

**Security**:
- Keys stored in the browser session only (not on the server)
- HTTPS encryption for all API calls
- Keys never logged or exposed

#### How to Use

**Configure Essential Keys**:
```
1. Go to "⚙️ Settings" tab

2. Enter Gemini API Key:
   - Get from: https://ai.google.dev/
   - Click "Get API Key" → Create project → Generate
   - Paste into the field
   - Free tier: 1,500 requests/day

3. Enter HuggingFace Token:
   - Get from: https://huggingface.co/settings/tokens
   - Click "New token" → Name: "TraceMind"
   - Permissions:
     - Read (for viewing datasets)
     - Write (for uploading results)
     - Run Jobs (for evaluation submission)
   - Paste into the field

4. Click "Save API Keys"
   → Keys stored in browser session
   → MCP server will use your keys
```

**Configure for Job Submission** (Optional):

For **HuggingFace Jobs**:
```
Already configured if you entered an HF token above with "Run Jobs" permission.
```

For **Modal** (alternative):
```
1. Sign up: https://modal.com
2. Get a token: https://modal.com/settings/tokens
3. Copy MODAL_TOKEN_ID (starts with 'ak-')
4. Copy MODAL_TOKEN_SECRET (starts with 'as-')
5. Paste both into Settings → Save
```

For **API Model Providers**:
```
1. Get an API key from the provider:
   - OpenAI: https://platform.openai.com/api-keys
   - Anthropic: https://console.anthropic.com/settings/keys
   - Google: https://ai.google.dev/

2. Paste it into the corresponding field in Settings
3. Click "Save LLM Provider Keys"
```

#### Security Best Practices

- **Use environment variables**: For production, set keys via HF Spaces secrets
- **Rotate keys regularly**: Generate new tokens every 3-6 months
- **Minimal permissions**: Only grant "Run Jobs" if you need to submit evaluations
- **Monitor usage**: Check API provider dashboards for unexpected charges

---

## Common Workflows

### Workflow 1: Quick Model Comparison

```
Goal: Compare GPT-4 vs Llama-3.1-8B for production use

Steps:
1. Go to Leaderboard → Load Leaderboard
2. Read AI insights: "GPT-4 leads accuracy, Llama-3.1 best cost"
3. Sort by Success Rate → Note: GPT-4 (95.8%), Llama (93.4%)
4. Sort by Cost → Note: GPT-4 ($0.05), Llama ($0.002)
5. Go to Agent Chat → Ask: "Compare GPT-4 and Llama-3.1. Which should I use for 1M runs/month?"
   → Agent analyzes with MCP tools
   → Returns: "Llama saves $48K/month, only 2.4% accuracy drop"
6. Decision: Use Llama-3.1-8B for production
```

### Workflow 2: Evaluate Custom Model

```
Goal: Evaluate your fine-tuned model on the SMOLTRACE benchmark

Steps:
1. Ensure the model is on HuggingFace: username/my-finetuned-model
2. Go to Settings → Configure HF token (with Run Jobs permission)
3. Go to New Evaluation:
   - Model: "username/my-finetuned-model"
   - Infrastructure: HuggingFace Jobs
   - Agent type: both
   - Hardware: auto
4. Click "Estimate Cost" → Review: $1.50, 20 min
5. Click "Submit Evaluation"
6. Go to Job Monitoring → Wait for "Completed" (15-25 min)
7. Go to Leaderboard → Refresh → See your model in the table
8. Click your run → Review detailed results
9. Compare vs other models using Agent Chat
```

### Workflow 3: Debug Failed Test

```
Goal: Understand why test_045 failed in your evaluation

Steps:
1. Go to Leaderboard → Find your run → Click to open details
2. Filter to failed tests only
3. Click test_045 → Opens trace visualization
4. Examine the waterfall:
   - Span 1: LLM Call (OK)
   - Span 2: Tool Call - "unknown_tool" (ERROR)
   - No Span 3 (execution stopped)
5. Ask Agent: "Why did test_045 fail?"
   → Agent uses the debug_trace MCP tool
   → Returns: "Tool 'unknown_tool' not found. Add it to the agent's tool list."
6. Fix: Update the agent config to include the missing tool
7. Re-run the evaluation with the fixed config
```
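
Once results are downloaded, step 2's "filter to failed tests" takes only a couple of lines. A sketch over a list of result records (the `test_id`/`passed`/`error` field names are illustrative, not the exact SMOLTRACE output schema):

```python
def failed_tests(results: list) -> list:
    """Return only failed test records, e.g. to feed into trace debugging."""
    return [r for r in results if not r["passed"]]

results = [
    {"test_id": "test_044", "passed": True, "error": None},
    {"test_id": "test_045", "passed": False,
     "error": "Tool 'unknown_tool' not found"},
]
for r in failed_tests(results):
    print(r["test_id"], "-", r["error"])  # test_045 - Tool 'unknown_tool' not found
```

Grouping the failures by error message is a quick way to spot the shared root cause before opening individual traces.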

---

## Troubleshooting

### Leaderboard Issues

**Problem**: "Load Leaderboard" button doesn't work
- **Solution**: Check the HuggingFace token in Settings (needs Read permission)
- **Solution**: Verify the leaderboard dataset exists: https://huggingface.co/datasets/kshitijthakkar/smoltrace-leaderboard

**Problem**: AI insights not showing
- **Solution**: Check the Gemini API key in Settings
- **Solution**: Wait 5-10 seconds for AI generation to complete

### Agent Chat Issues

**Problem**: Agent responds with "MCP server connection failed"
- **Solution**: Check the MCP server status: https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind-mcp-server
- **Solution**: Configure the Gemini API key in both TraceMind-AI and MCP server Settings

**Problem**: Agent gives incorrect information
- **Solution**: The agent may be using stale data. Ask: "Load the latest leaderboard data"
- **Solution**: Verify the question is clear and specific

### Evaluation Submission Issues

**Problem**: "Submit Evaluation" fails with an auth error
- **Solution**: The HF token needs "Run Jobs" permission
- **Solution**: Ensure the HF Pro account is active ($9/month)
- **Solution**: Verify a credit card is on file for compute charges

**Problem**: Job stuck in "Pending" status
- **Solution**: HuggingFace Jobs may have a queue. Wait 5-10 minutes.
- **Solution**: Try Modal as alternative infrastructure

**Problem**: Job fails with "Out of Memory"
- **Solution**: The model is too large for the selected hardware
- **Solution**: Increase the hardware tier (e.g., t4-small → a10g-small)
- **Solution**: Use auto hardware selection

### Trace Visualization Issues

**Problem**: Traces not loading
- **Solution**: Ensure the evaluation completed successfully
- **Solution**: Check that the traces dataset exists on HuggingFace
- **Solution**: Verify the HF token has Read permission

**Problem**: GPU metrics missing
- **Solution**: Only available for GPU jobs (not API models)
- **Solution**: Ensure the evaluation was run with SMOLTRACE's GPU metrics enabled

---

## Getting Help

- **🔧 GitHub Issues**: [TraceMind-AI/issues](https://github.com/Mandark-droid/TraceMind-AI/issues)
- **💬 HF Discord**: `#agents-mcp-hackathon-winter25`
- **📚 Documentation**: See [MCP_INTEGRATION.md](MCP_INTEGRATION.md) and [ARCHITECTURE.md](ARCHITECTURE.md)

---

**Last Updated**: November 21, 2025