kshitijthakkar committed on
Commit
34f1a7a
·
1 Parent(s): 880ef7f

docs: Deploy final documentation package

Files changed (4)
  1. ARCHITECTURE.md +1035 -0
  2. MCP_INTEGRATION.md +706 -0
  3. README.md +318 -343
  4. USER_GUIDE.md +1026 -0
ARCHITECTURE.md ADDED
@@ -0,0 +1,1035 @@
+ # TraceMind-AI - Technical Architecture
+
+ This document provides a deep technical dive into the TraceMind-AI architecture, implementation details, and system design.
+
+ ## Table of Contents
+
+ - [System Overview](#system-overview)
+ - [Project Structure](#project-structure)
+ - [Core Components](#core-components)
+ - [MCP Client Architecture](#mcp-client-architecture)
+ - [Agent Framework Integration](#agent-framework-integration)
+ - [Data Flow](#data-flow)
+ - [Authentication & Authorization](#authentication--authorization)
+ - [Screen Navigation](#screen-navigation)
+ - [Job Submission Architecture](#job-submission-architecture)
+ - [Deployment](#deployment)
+ - [Performance Optimization](#performance-optimization)
+
+ ---
+
+ ## System Overview
+
+ TraceMind-AI is a comprehensive Gradio-based web application for evaluating AI agent performance. It serves as the user-facing platform in the TraceMind ecosystem, demonstrating enterprise MCP client usage (Track 2: MCP in Action).
+
+ ### Technology Stack
+
+ | Component | Technology | Version | Purpose |
+ |-----------|-----------|---------|---------|
+ | **UI Framework** | Gradio | 5.49.1 | Web interface with components |
+ | **MCP Client** | MCP Python SDK | Latest | Connect to MCP servers |
+ | **Agent Framework** | smolagents | 1.22.0+ | Autonomous agent with MCP tools |
+ | **Data Source** | HuggingFace Datasets | Latest | Load evaluation results |
+ | **Authentication** | HuggingFace OAuth | - | User authentication |
+ | **Job Platforms** | HF Jobs + Modal | - | Evaluation job submission |
+ | **Language** | Python | 3.10+ | Core implementation |
+
+ ### High-Level Architecture
+
+ ```
+ ┌──────────────────────────────────────────────────────────┐
+ │                      User Browser                        │
+ │  - Gradio Interface (React-based)                        │
+ │  - OAuth Flow (HuggingFace)                              │
+ └──────────────────────────────┬───────────────────────────┘
+                                │ HTTP/WebSocket
+                                ↓
+ ┌──────────────────────────────────────────────────────────┐
+ │           TraceMind-AI (Gradio App) - Track 2            │
+ │                                                          │
+ │  ┌────────────────────────────────────────────────────┐  │
+ │  │ Screen Layer (screens/)                            │  │
+ │  │  - Leaderboard                                     │  │
+ │  │  - Agent Chat                                      │  │
+ │  │  - New Evaluation                                  │  │
+ │  │  - Job Monitoring                                  │  │
+ │  │  - Trace Detail                                    │  │
+ │  │  - Settings                                        │  │
+ │  └─────────────────────────┬──────────────────────────┘  │
+ │                            │                             │
+ │  ┌─────────────────────────┴──────────────────────────┐  │
+ │  │ Component Layer (components/)                      │  │
+ │  │  - Leaderboard Table (Custom HTML)                 │  │
+ │  │  - Analytics Charts                                │  │
+ │  │  - Metric Displays                                 │  │
+ │  │  - Report Cards                                    │  │
+ │  └─────────────────────────┬──────────────────────────┘  │
+ │                            │                             │
+ │  ┌─────────────────────────┴──────────────────────────┐  │
+ │  │ Service Layer                                      │  │
+ │  │  ┌──────────────────┐  ┌──────────────────┐        │  │
+ │  │  │ MCP Client       │  │ Data Loader      │        │  │
+ │  │  │ (mcp_client/)    │  │ (data_loader.py) │        │  │
+ │  │  └──────────────────┘  └──────────────────┘        │  │
+ │  │  ┌──────────────────┐  ┌──────────────────┐        │  │
+ │  │  │ Agent (smolagents│  │ Job Submission   │        │  │
+ │  │  │ screens/chat.py) │  │ (utils/)         │        │  │
+ │  │  └──────────────────┘  └──────────────────┘        │  │
+ │  └────────────────────────────────────────────────────┘  │
+ │                                                          │
+ └───────────┬───────────────────────────────┬──────────────┘
+             │                               │
+             ↓                               ↓
+ ┌───────────────────────┐       ┌───────────────────────┐
+ │ TraceMind MCP Server  │       │ External Services     │
+ │ (Track 1)             │       │ - HF Datasets         │
+ │ - 11 AI Tools         │       │ - HF Jobs             │
+ │ - 3 Resources         │       │ - Modal               │
+ │ - 3 Prompts           │       │ - LLM APIs            │
+ └───────────────────────┘       └───────────────────────┘
+ ```
+
+ ---
+
+ ## Project Structure
+
+ ```
+ TraceMind-AI/
+ ├── app.py                       # Main entry point, Gradio app
+ │
+ ├── screens/                     # UI screens (6 primary tabs)
+ │   ├── __init__.py
+ │   ├── leaderboard.py           # Screen 1: Leaderboard with AI insights
+ │   ├── chat.py                  # Screen 2: Agent Chat (smolagents)
+ │   ├── dashboard.py             # Screen 3: New Evaluation
+ │   ├── job_monitoring.py        # Screen 4: Job Status Tracking
+ │   ├── trace_detail.py          # Screen 5: Trace Visualization
+ │   ├── settings.py              # Screen 6: API Key Configuration
+ │   ├── compare.py               # Screen 7: Run Comparison (optional)
+ │   ├── documentation.py         # Screen 8: API Documentation
+ │   └── mcp_helpers.py           # Shared MCP client helpers
+ │
+ ├── components/                  # Reusable UI components
+ │   ├── __init__.py
+ │   ├── leaderboard_table.py     # Custom HTML table component
+ │   ├── analytics_charts.py      # Performance charts (Plotly)
+ │   ├── metric_displays.py       # Metric cards and badges
+ │   ├── report_cards.py          # Summary report cards
+ │   └── thought_graph.py         # Agent reasoning visualization
+ │
+ ├── mcp_client/                  # MCP client implementation
+ │   ├── __init__.py
+ │   ├── client.py                # Async MCP client
+ │   └── sync_wrapper.py          # Synchronous wrapper for Gradio
+ │
+ ├── utils/                       # Utility modules
+ │   ├── __init__.py
+ │   ├── auth.py                  # HuggingFace OAuth
+ │   ├── navigation.py            # Screen navigation state
+ │   ├── hf_jobs_submission.py    # HuggingFace Jobs integration
+ │   └── modal_job_submission.py  # Modal integration
+ │
+ ├── styles/                      # Custom styling
+ │   ├── __init__.py
+ │   └── tracemind_theme.py       # Gradio theme customization
+ │
+ ├── data_loader.py               # Dataset loading and caching
+ ├── requirements.txt             # Python dependencies
+ ├── .env.example                 # Environment variable template
+ ├── .gitignore
+ ├── README.md                    # Project documentation
+ └── USER_GUIDE.md                # Complete user guide
+
+ Total: ~35 files, ~8,000 lines of code
+ ```
+
+ ### File Breakdown
+
+ | Directory | Files | Lines | Purpose |
+ |-----------|-------|-------|---------|
+ | `screens/` | 9 | ~3,500 | UI screen implementations |
+ | `components/` | 5 | ~1,200 | Reusable UI components |
+ | `mcp_client/` | 3 | ~800 | MCP client integration |
+ | `utils/` | 4 | ~1,500 | Authentication, jobs, navigation |
+ | `styles/` | 2 | ~300 | Custom theme and CSS |
+ | Root | 3 | ~700 | Main app, data loader, config |
+
+ ---
+
+ ## Core Components
+
+ ### 1. app.py - Main Application
+
+ **Purpose**: Entry point, orchestrates all screens and manages global state.
+
+ **Architecture**:
+
+ ```python
+ # app.py structure
+ import os
+
+ import gradio as gr
+
+ from screens import *
+ from mcp_client.sync_wrapper import get_sync_mcp_client
+ from utils.auth import auth_ui
+ from data_loader import DataLoader
+ from styles.tracemind_theme import tracemind_theme
+
+ DISABLE_OAUTH = os.getenv("DISABLE_OAUTH", "false").lower() == "true"
+
+ # 1. Initialize services
+ mcp_client = get_sync_mcp_client()
+ mcp_client.initialize()
+ data_loader = DataLoader()
+
+ # 2. Create Gradio app
+ with gr.Blocks(theme=tracemind_theme) as app:
+     # Global state
+     gr.State(...)  # User session, navigation, etc.
+
+     # Authentication (if not disabled)
+     if not DISABLE_OAUTH:
+         auth_ui()
+
+     # Main tabs
+     with gr.Tabs():
+         with gr.Tab("📊 Leaderboard"):
+             leaderboard_screen()
+
+         with gr.Tab("🤖 Agent Chat"):
+             chat_screen()
+
+         with gr.Tab("🚀 New Evaluation"):
+             dashboard_screen()
+
+         with gr.Tab("📈 Job Monitoring"):
+             job_monitoring_screen()
+
+         with gr.Tab("🔍 Trace Detail"):
+             trace_detail_screen()
+
+         with gr.Tab("⚙️ Settings"):
+             settings_screen()
+
+ # 3. Launch
+ if __name__ == "__main__":
+     app.launch(
+         server_name="0.0.0.0",
+         server_port=7860,
+         share=False
+     )
+ ```
+
+ **Key Responsibilities**:
+ - Initialize MCP client and data loader (global instances)
+ - Create tabbed interface with all screens
+ - Manage authentication flow
+ - Handle global state (user session, API keys)
+
+ ---
+
+ ### 2. Screen Layer (screens/)
+
+ Each screen is a self-contained module that returns a Gradio component tree; a minimal skeleton is sketched below.
+
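+ As a point of reference, a minimal screen module following this convention might look like the sketch below (the function name mirrors the project structure above; the component tree is illustrative, not the actual implementation):
+
+ ```python
+ # Skeleton of the screen-module convention (illustrative sketch)
+ import gradio as gr
+
+ def leaderboard_screen():
+     """Build the Leaderboard tab; called once from app.py inside gr.Blocks()."""
+     with gr.Column():
+         load_btn = gr.Button("Load Leaderboard")
+         insights_md = gr.Markdown()
+         table_html = gr.HTML()
+
+     # Event wiring lives inside the module, keeping screens self-contained
+     load_btn.click(fn=lambda: ("Loading...", ""), outputs=[insights_md, table_html])
+ ```
+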
+ #### screens/leaderboard.py
+
+ **Purpose**: Display evaluation results with AI-powered insights.
+
+ **Components**:
+ - Load button
+ - AI insights panel (Markdown) - powered by MCP server
+ - Leaderboard table (custom HTML component)
+ - Filter controls (agent type, provider)
+
+ **MCP Integration**:
+ ```python
+ import pandas as pd
+ from datasets import load_dataset
+
+ def load_leaderboard(mcp_client):
+     # 1. Load dataset
+     ds = load_dataset("kshitijthakkar/smoltrace-leaderboard", split="train")
+     df = pd.DataFrame(ds)
+
+     # 2. Get AI insights from MCP server
+     insights = mcp_client.analyze_leaderboard(
+         metric_focus="overall",
+         time_range="last_week",
+         top_n=5
+     )
+
+     # 3. Render table with custom component
+     table_html = render_leaderboard_table(df)
+
+     return insights, table_html
+ ```
+
+ #### screens/chat.py
+
+ **Purpose**: Autonomous agent interface with MCP tool access.
+
+ **Agent Setup**:
+ ```python
+ import os
+
+ from smolagents import ToolCallingAgent, MCPClient, HfApiModel
+
+ # Initialize agent with MCP client
+ # (MCP_SERVER_URL: module-level constant; see Deployment -> Environment Variables)
+ def create_agent():
+     mcp_client = MCPClient(MCP_SERVER_URL)
+
+     model = HfApiModel(
+         model_id="Qwen/Qwen2.5-Coder-32B-Instruct",
+         token=os.getenv("HF_TOKEN")
+     )
+
+     agent = ToolCallingAgent(
+         tools=[],  # MCP tools loaded automatically
+         model=model,
+         mcp_client=mcp_client,
+         max_steps=10
+     )
+
+     return agent
+
+ # Chat interaction (agent is a module-level instance created at startup)
+ def agent_chat(message, history, show_reasoning):
+     if show_reasoning:
+         agent.verbosity_level = 2  # Show tool execution
+     else:
+         agent.verbosity_level = 0  # Only final answer
+
+     response = agent.run(message)
+     history.append((message, response))
+
+     return history, ""
+ ```
+
+ **MCP Tool Access**:
+ The agent automatically discovers and uses all 11 MCP tools from the TraceMind MCP Server; a quick way to inspect them is sketched below.
+
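+ The snippet below lists the discovered tools via the shared sync client. It leans on the `available_tools` dict that `connect()` populates in `mcp_client/client.py` (an internal detail, used here purely for illustration):
+
+ ```python
+ # Inspect which MCP tools the client discovered at startup (illustrative)
+ from mcp_client.sync_wrapper import get_sync_mcp_client
+
+ client = get_sync_mcp_client()
+ client.initialize()
+ print(sorted(client.async_client.available_tools))
+ # e.g. ['analyze_leaderboard', 'compare_runs', 'debug_trace', 'estimate_cost', ...]
+ ```
+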
+ #### screens/dashboard.py
+
+ **Purpose**: Submit evaluation jobs to HuggingFace Jobs or Modal.
+
+ **Key Functions**:
+ - Model selection (text input)
+ - Infrastructure choice (HF Jobs / Modal)
+ - Hardware selection (auto / manual)
+ - Cost estimation (MCP-powered)
+ - Job submission
+
+ **Cost Estimation Flow**:
+ ```python
+ def estimate_cost_click(model, agent_type, num_tests, hardware, mcp_client):
+     # Call MCP server for cost estimate
+     estimate = mcp_client.estimate_cost(
+         model=model,
+         agent_type=agent_type,
+         num_tests=num_tests,
+         hardware=hardware
+     )
+
+     return estimate  # Display in dialog
+ ```
+
+ **Job Submission Flow**:
+ ```python
+ def submit_job(model, agent_type, hardware, infrastructure, api_keys):
+     if infrastructure == "HuggingFace Jobs":
+         job_id = submit_hf_job(model, agent_type, hardware, api_keys)
+     elif infrastructure == "Modal":
+         job_id = submit_modal_job(model, agent_type, hardware, api_keys)
+     else:
+         raise ValueError(f"Unknown infrastructure: {infrastructure}")
+
+     return f"✅ Job submitted: {job_id}"
+ ```
+
+ #### screens/job_monitoring.py
+
+ **Purpose**: Track status of submitted jobs.
+
+ **Data Source**: HuggingFace Jobs API or Modal API
+
+ **Refresh Strategy**:
+ - Manual refresh button
+ - Auto-refresh every 30 seconds (optional; see the sketch below)
+
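+ A minimal auto-refresh wiring, assuming Gradio's `gr.Timer` component (available in Gradio 5) and a hypothetical `fetch_job_statuses()` helper that polls the job APIs:
+
+ ```python
+ import gradio as gr
+
+ def fetch_job_statuses():
+     """Hypothetical helper: poll the HF Jobs / Modal APIs and return table rows."""
+     return [["job-123", "running"], ["job-456", "completed"]]
+
+ with gr.Blocks() as demo:
+     status_table = gr.Dataframe(headers=["job_id", "status"])
+     refresh_btn = gr.Button("Refresh")
+     timer = gr.Timer(30)  # fires every 30 seconds
+
+     refresh_btn.click(fn=fetch_job_statuses, outputs=status_table)
+     timer.tick(fn=fetch_job_statuses, outputs=status_table)
+ ```
+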
+ #### screens/trace_detail.py
+
+ **Purpose**: Visualize OpenTelemetry traces with GPU metrics.
+
+ **Components**:
+ - Waterfall diagram (spans timeline)
+ - Span details panel
+ - GPU metrics overlay (for GPU jobs)
+ - MCP-powered Q&A
+
+ **Trace Loading**:
+ ```python
+ from datasets import load_dataset
+
+ def load_trace(trace_id, traces_repo):
+     # Load trace dataset
+     ds = load_dataset(traces_repo, split="train")
+     trace_data = ds.filter(lambda x: x["trace_id"] == trace_id)[0]
+
+     # Render waterfall
+     waterfall_html = render_waterfall(trace_data["spans"])
+
+     return waterfall_html
+ ```
+
+ **MCP Q&A**:
+ ```python
+ def ask_trace_question(trace_id, traces_repo, question, mcp_client):
+     # Call MCP server to debug trace
+     answer = mcp_client.debug_trace(
+         trace_id=trace_id,
+         traces_repo=traces_repo,
+         question=question
+     )
+
+     return answer
+ ```
+
+ #### screens/settings.py
+
+ **Purpose**: Configure API keys and preferences.
+
+ **Security**:
+ - Keys stored in Gradio State (session-only, not server-side)
+ - All forms use `api_name=False` (not exposed via API)
+ - HTTPS encryption for all API calls
+
+ **Configuration Options**:
+ - Gemini API Key
+ - HuggingFace Token
+ - Modal Token ID + Secret
+ - LLM Provider Keys (OpenAI, Anthropic, etc.)
+
+ ---
+
+ ### 3. Component Layer (components/)
+
+ Reusable UI components that can be used across multiple screens.
+
+ #### components/leaderboard_table.py
+
+ **Purpose**: Custom HTML table with sorting, filtering, and styling.
+
+ **Why a Custom Component?**
+ - Gradio's default Dataframe component lacks advanced styling
+ - Need clickable rows for navigation
+ - Custom sorting and filtering logic
+ - Badge rendering for metrics
+
+ **Implementation**:
+ ```python
+ import pandas as pd
+
+ def render_leaderboard_table(df: pd.DataFrame) -> str:
+     """Render leaderboard as interactive HTML table"""
+
+     html = """
+     <style>
+         .leaderboard-table { ... }
+         .metric-badge { ... }
+     </style>
+     <table class="leaderboard-table">
+         <thead>
+             <tr>
+                 <th onclick="sortTable(0)">Model</th>
+                 <th onclick="sortTable(1)">Success Rate</th>
+                 <th onclick="sortTable(2)">Cost</th>
+                 ...
+             </tr>
+         </thead>
+         <tbody>
+     """
+
+     for idx, row in df.iterrows():
+         html += f"""
+         <tr onclick="selectRun('{row['run_id']}')">
+             <td>{row['model']}</td>
+             <td><span class="badge success">{row['success_rate']}%</span></td>
+             <td>${row['total_cost_usd']:.4f}</td>
+             ...
+         </tr>
+         """
+
+     html += """
+         </tbody>
+     </table>
+     <script>
+         function sortTable(col) { ... }
+         function selectRun(runId) {
+             // Trigger Gradio event to navigate to run detail
+             document.dispatchEvent(new CustomEvent('runSelected', {detail: runId}));
+         }
+     </script>
+     """
+
+     return html
+ ```
+
+ **Integration with Gradio**:
+ ```python
+ # In leaderboard screen (df comes from the DataLoader)
+ table_html = gr.HTML()
+
+ load_btn.click(
+     fn=lambda: render_leaderboard_table(df),
+     outputs=table_html
+ )
+ ```
+
+ #### components/analytics_charts.py
+
+ **Purpose**: Performance charts using Plotly.
+
+ **Charts Provided**:
+ - Success rate over time (line chart)
+ - Cost comparison (bar chart)
+ - Duration distribution (histogram)
+ - CO2 emissions by model (pie chart)
+
+ **Example**:
+ ```python
+ import plotly.graph_objects as go
+
+ def create_cost_comparison_chart(df):
+     fig = go.Figure(data=[
+         go.Bar(
+             x=df['model'],
+             y=df['total_cost_usd'],
+             marker_color='indianred'
+         )
+     ])
+
+     fig.update_layout(
+         title="Cost Comparison by Model",
+         xaxis_title="Model",
+         yaxis_title="Total Cost (USD)"
+     )
+
+     return fig
+ ```
+
+ #### components/thought_graph.py
+
+ **Purpose**: Visualize agent reasoning steps (for Agent Chat).
+
+ **Visualization**:
+ - Graph nodes: Reasoning steps, tool calls
+ - Edges: Flow between steps
+ - Annotations: Tool results, errors
+
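+ As a sketch of the idea (not the actual implementation), the steps could be serialized to Mermaid flowchart syntax and rendered through a `gr.Markdown` component; the `steps` structure here is an assumption:
+
+ ```python
+ def render_thought_graph(steps: list) -> str:
+     """Build Mermaid flowchart text from agent steps (illustrative sketch).
+
+     Assumes each step is a dict like {"label": "...", "kind": "reasoning" | "tool"}.
+     """
+     lines = ["flowchart TD"]
+     for i, step in enumerate(steps):
+         # Rounded nodes for tool calls, rectangles for reasoning steps
+         node = f"n{i}([{step['label']}])" if step["kind"] == "tool" else f"n{i}[{step['label']}]"
+         lines.append(f"    {node}")
+         if i > 0:
+             lines.append(f"    n{i - 1} --> n{i}")
+     return "\n".join(lines)  # wrap in a mermaid code fence when displaying
+ ```
+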
+ ---
+
+ ### 4. MCP Client Layer (mcp_client/)
+
+ #### mcp_client/client.py - Async MCP Client
+
+ **Purpose**: Connect to TraceMind MCP Server via MCP protocol.
+
+ **Implementation**: (See [MCP_INTEGRATION.md](MCP_INTEGRATION.md) for full code)
+
+ **Key Methods**:
+ - `connect()`: Establish SSE connection to MCP server
+ - `call_tool(tool_name, arguments)`: Call an MCP tool
+ - `analyze_leaderboard(**kwargs)`: Wrapper for analyze_leaderboard tool
+ - `estimate_cost(**kwargs)`: Wrapper for estimate_cost tool
+ - `debug_trace(**kwargs)`: Wrapper for debug_trace tool
+
+ #### mcp_client/sync_wrapper.py - Synchronous Wrapper
+
+ **Purpose**: Provide a synchronous API for Gradio event handlers.
+
+ **Why Needed?** Gradio event handlers are synchronous, but the MCP client is async.
+
+ **Pattern**:
+ ```python
+ import asyncio
+
+ class SyncMCPClient:
+     def __init__(self, mcp_server_url):
+         self.async_client = AsyncMCPClient(mcp_server_url)
+
+     def _run_async(self, coro):
+         """Run async coroutine in sync context"""
+         loop = asyncio.get_event_loop()
+         return loop.run_until_complete(coro)
+
+     def analyze_leaderboard(self, **kwargs):
+         """Synchronous wrapper"""
+         return self._run_async(self.async_client.analyze_leaderboard(**kwargs))
+ ```
+
+ ---
+
+ ### 5. Data Loader (data_loader.py)
+
+ **Purpose**: Load and cache HuggingFace datasets.
+
+ **Features**:
+ - In-memory caching (5-minute TTL)
+ - Error handling for missing datasets
+ - Automatic retry logic
+ - Dataset validation
+
+ **Implementation**:
+ ```python
+ import time
+
+ import pandas as pd
+ from datasets import load_dataset
+
+ class DataLoader:
+     def __init__(self):
+         self.cache = {}
+         self.cache_ttl = 300  # 5 minutes
+
+     def load_leaderboard(self, repo="kshitijthakkar/smoltrace-leaderboard"):
+         """Load leaderboard with caching"""
+         cache_key = f"leaderboard:{repo}"
+
+         # Check cache
+         if cache_key in self.cache:
+             cached_time, cached_data = self.cache[cache_key]
+             if time.time() - cached_time < self.cache_ttl:
+                 return cached_data
+
+         # Load fresh data
+         ds = load_dataset(repo, split="train")
+         df = pd.DataFrame(ds)
+
+         # Cache
+         self.cache[cache_key] = (time.time(), df)
+
+         return df
+
+     def load_results(self, repo):
+         """Load results dataset for specific run"""
+         ds = load_dataset(repo, split="train")
+         return pd.DataFrame(ds)
+
+     def load_traces(self, repo):
+         """Load traces dataset for specific run"""
+         ds = load_dataset(repo, split="train")
+         return ds  # Keep as Dataset for filtering
+ ```
+
+ ---
+
+ ## MCP Client Architecture
+
+ **Full details in**: [MCP_INTEGRATION.md](MCP_INTEGRATION.md)
+
+ **Summary**:
+ - **Async Client**: `mcp_client/client.py` - async MCP protocol implementation
+ - **Sync Wrapper**: `mcp_client/sync_wrapper.py` - synchronous API for Gradio
+ - **Global Instance**: Initialized once in `app.py`, shared across all screens
+
+ **Usage Pattern**:
+ ```python
+ # In app.py (initialization)
+ from mcp_client.sync_wrapper import get_sync_mcp_client
+ mcp_client = get_sync_mcp_client()
+ mcp_client.initialize()
+
+ # In a screen (usage)
+ def some_event_handler(mcp_client):
+     result = mcp_client.analyze_leaderboard(metric_focus="cost")
+     return result
+ ```
+
+ ---
+
+ ## Agent Framework Integration
+
+ **Full details in**: [MCP_INTEGRATION.md](MCP_INTEGRATION.md)
+
+ **Framework**: smolagents (HuggingFace's agent framework)
+
+ **Key Features**:
+ - Autonomous tool discovery from MCP server
+ - Multi-step reasoning with tool chaining
+ - Context-aware responses
+ - Reasoning visualization (optional)
+
+ **Agent Setup**:
+ ```python
+ from smolagents import ToolCallingAgent, MCPClient
+
+ agent = ToolCallingAgent(
+     tools=[],  # Empty - tools loaded from MCP server
+     model=HfApiModel("Qwen/Qwen2.5-Coder-32B-Instruct"),
+     mcp_client=MCPClient(MCP_SERVER_URL),
+     max_steps=10
+ )
+ ```
+
+ ---
+
+ ## Data Flow
+
+ ### Leaderboard Loading Flow
+
+ ```
+ 1. User clicks "Load Leaderboard"
+    ↓
+ 2. Gradio Event Handler (leaderboard.py)
+    load_leaderboard()
+    ↓
+ 3. Data Loader (data_loader.py)
+    ├─→ Check cache (5-min TTL)
+    │   └─→ If cached: return cached data
+    └─→ If not cached: load from HF Datasets
+        └─→ load_dataset("kshitijthakkar/smoltrace-leaderboard")
+    ↓
+ 4. MCP Client (sync_wrapper.py)
+    mcp_client.analyze_leaderboard(metric_focus="overall")
+    ↓
+ 5. MCP Server (TraceMind-mcp-server)
+    ├─→ Load data
+    ├─→ Call Gemini API
+    └─→ Return AI analysis
+    ↓
+ 6. Render Components
+    ├─→ AI Insights (Markdown)
+    └─→ Leaderboard Table (Custom HTML)
+    ↓
+ 7. Display to User
+ ```
+
+ ### Agent Chat Flow
+
+ ```
+ 1. User types message: "What are the top 3 models?"
+    ↓
+ 2. Gradio Event Handler (chat.py)
+    agent_chat(message, history, show_reasoning)
+    ↓
+ 3. smolagents Agent
+    agent.run(message)
+    ├─→ Step 1: Plan approach
+    │   └─→ "Need to get top models from leaderboard"
+    ├─→ Step 2: Discover MCP tools
+    │   └─→ Found: get_top_performers, analyze_leaderboard
+    ├─→ Step 3: Call MCP tool
+    │   └─→ get_top_performers(metric="success_rate", top_n=3)
+    ├─→ Step 4: Parse result
+    │   └─→ Extract model names, success rates, costs
+    └─→ Step 5: Format response
+        └─→ Generate markdown table with insights
+    ↓
+ 4. Return to user with full reasoning trace (if enabled)
+ ```
+
+ ### Job Submission Flow
+
+ ```
+ 1. User fills form → Clicks "Submit Evaluation"
+    ↓
+ 2. Gradio Event Handler (dashboard.py)
+    submit_job(model, agent_type, hardware, infrastructure)
+    ↓
+ 3. Job Submission Module (utils/)
+    if infrastructure == "HuggingFace Jobs":
+        ├─→ hf_jobs_submission.py
+        ├─→ Build job config (YAML)
+        ├─→ Submit via HF Jobs API
+        └─→ Return job_id
+    elif infrastructure == "Modal":
+        ├─→ modal_job_submission.py
+        ├─→ Build Modal app config
+        ├─→ Submit via Modal SDK
+        └─→ Return job_id
+    ↓
+ 4. Store job_id in session state
+    ↓
+ 5. Redirect to Job Monitoring screen
+    ↓
+ 6. Auto-refresh status every 30s
+ ```
+
+ ---
+
+ ## Authentication & Authorization
+
+ ### HuggingFace OAuth
+
+ **Implementation**: `utils/auth.py`
+
+ **Flow**:
+ ```
+ 1. User visits TraceMind-AI
+    ↓
+ 2. Check OAuth token in session
+    ├─→ If valid: proceed to app
+    └─→ If invalid: show login screen
+    ↓
+ 3. User clicks "Sign in with HuggingFace"
+    ↓
+ 4. Redirect to HuggingFace OAuth page
+    ├─→ User authorizes TraceMind-AI
+    └─→ HF redirects back with token
+    ↓
+ 5. Store token in Gradio State (session)
+    ↓
+ 6. Use token for:
+    ├─→ HF Datasets access
+    ├─→ HF Jobs submission
+    └─→ User identification
+ ```
+
+ **Code**:
+ ```python
+ # utils/auth.py
+ import gradio as gr
+
+ def auth_ui():
+     """Create OAuth login UI (Gradio's built-in HuggingFace OAuth button)"""
+     gr.LoginButton(value="Sign in with HuggingFace")
+
+ # In app.py
+ with gr.Blocks() as app:
+     if not DISABLE_OAUTH:
+         auth_ui()
+ ```
+
+ ### API Key Storage
+
+ **Strategy**: Session-only storage (not server-side persistence)
+
+ **Implementation**:
+ ```python
+ # In settings screen
+ import os
+
+ def save_api_keys(gemini_key, hf_token, session_state):
+     """Store keys in the session dict (wired to a gr.State output)"""
+     session_state["api_keys"] = {
+         "gemini_key": gemini_key,
+         "hf_token": hf_token
+     }
+
+     # Override default clients with user keys
+     if gemini_key:
+         os.environ["GEMINI_API_KEY"] = gemini_key
+     if hf_token:
+         os.environ["HF_TOKEN"] = hf_token
+
+     return "✅ API keys saved for this session", session_state
+ ```
+
+ **Security**:
+ - ✅ Keys stored only in browser memory
+ - ✅ Not saved to disk or database
+ - ✅ Forms use `api_name=False` (not exposed via API)
+ - ✅ HTTPS encryption
+
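+ When echoing a save confirmation back to the UI, a small masking helper keeps the secret itself out of logs and screenshots (illustrative sketch):
+
+ ```python
+ def mask_key(key: str) -> str:
+     """Show only the last 4 characters of a secret, e.g. '************abcd'."""
+     if not key:
+         return "(not set)"
+     return "*" * max(len(key) - 4, 0) + key[-4:]
+
+ # e.g. return f"✅ HF token saved: {mask_key(hf_token)}"
+ ```
+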
+ ---
+
+ ## Screen Navigation
+
+ ### State Management
+
+ **Pattern**: Gradio State components for session data
+
+ ```python
+ # In app.py
+ with gr.Blocks() as app:
+     # Global state
+     session_state = gr.State({
+         "user": None,
+         "current_run_id": None,
+         "current_trace_id": None,
+         "api_keys": {}
+     })
+
+     # Pass to all screens
+     leaderboard_screen(session_state)
+     chat_screen(session_state)
+ ```
+
+ ### Navigation Between Screens
+
+ **Pattern**: Click event triggers tab switch + state update
+
+ ```python
+ # In leaderboard screen
+ def row_click(run_id, session_state):
+     """Navigate to run detail when row clicked"""
+     session_state["current_run_id"] = run_id
+
+     # Switch to trace detail tab (tab index 4); Gradio 4+ returns an
+     # updated component instance instead of calling gr.Tabs.update()
+     return gr.Tabs(selected=4), session_state
+
+ table_component.select(
+     fn=row_click,
+     inputs=[gr.State(), session_state],
+     outputs=[main_tabs, session_state]
+ )
+ ```
+
+ ---
+
+ ## Job Submission Architecture
+
+ ### HuggingFace Jobs Integration
+
+ **File**: `utils/hf_jobs_submission.py`
+
+ **Key Functions**:
+ ```python
+ import requests
+
+ def submit_hf_job(model, agent_type, hardware, api_keys):
+     """Submit evaluation job to HuggingFace Jobs"""
+
+     # 1. Build job config (YAML)
+     job_config = {
+         "name": f"SMOLTRACE Eval - {model}",
+         "hardware": hardware,  # cpu-basic, t4-small, a10g-small, a100-large, h200
+         "environment": {
+             "MODEL": model,
+             "AGENT_TYPE": agent_type,
+             "HF_TOKEN": api_keys["hf_token"],
+             # ... other env vars
+         },
+         "command": [
+             "pip install smoltrace[otel,gpu]",
+             f"smoltrace-eval --model {model} --agent-type {agent_type} ..."
+         ]
+     }
+
+     # 2. Submit via HF Jobs API
+     response = requests.post(
+         "https://huggingface.co/api/jobs",
+         headers={"Authorization": f"Bearer {api_keys['hf_token']}"},
+         json=job_config
+     )
+
+     # 3. Return job ID
+     job_id = response.json()["id"]
+     return job_id
+ ```
+
+ ### Modal Integration
+
+ **File**: `utils/modal_job_submission.py`
+
+ **Key Functions**:
+ ```python
+ import modal
+
+ def submit_modal_job(model, agent_type, hardware, api_keys):
+     """Submit evaluation job to Modal"""
+
+     # 1. Create Modal app
+     app = modal.App("smoltrace-eval")
+
+     # 2. Define function with GPU
+     @app.function(
+         image=modal.Image.debian_slim().pip_install("smoltrace[otel,gpu]"),
+         gpu=hardware,  # A10, A100-80GB, H200
+         secrets=[
+             modal.Secret.from_dict({
+                 "HF_TOKEN": api_keys["hf_token"],
+                 # ... other secrets
+             })
+         ]
+     )
+     def run_evaluation():
+         import smoltrace
+         # Run evaluation
+         return smoltrace.evaluate(model=model, agent_type=agent_type)
+
+     # 3. Deploy and run; spawn() starts the job without blocking and
+     # returns a handle whose object_id serves as the job identifier
+     with app.run():
+         call = run_evaluation.spawn()
+
+     return call.object_id
+ ```
+
+ ---
+
+ ## Deployment
+
+ ### HuggingFace Spaces
+
+ **Platform**: HuggingFace Spaces
+ **SDK**: Gradio 5.49.1
+ **Hardware**: CPU Basic (upgradeable)
+ **URL**: https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind
+
+ ### Configuration
+
+ **Space Metadata** (README.md header):
+ ```yaml
+ ---
+ title: TraceMind AI
+ emoji: 🧠
+ colorFrom: indigo
+ colorTo: purple
+ sdk: gradio
+ sdk_version: 5.49.1
+ app_file: app.py
+ short_description: AI agent evaluation with MCP-powered intelligence
+ license: agpl-3.0
+ pinned: true
+ tags:
+ - mcp-in-action-track-enterprise
+ - agent-evaluation
+ - mcp-client
+ - leaderboard
+ - gradio
+ ---
+ ```
+
+ ### Environment Variables
+
+ **Set in HF Spaces Secrets**:
+ ```bash
+ # Required
+ GEMINI_API_KEY=your_gemini_key
+ HF_TOKEN=your_hf_token
+
+ # Optional
+ MCP_SERVER_URL=https://mcp-1st-birthday-tracemind-mcp-server.hf.space/gradio_api/mcp/sse
+ LEADERBOARD_REPO=kshitijthakkar/smoltrace-leaderboard
+ DISABLE_OAUTH=false  # Set to true for local development
+ ```
+
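+ For local development, the same variables can be read at startup with `os.getenv`, falling back to the documented defaults. This sketch assumes `python-dotenv` is used together with the `.env.example` template in the project root:
+
+ ```python
+ import os
+
+ from dotenv import load_dotenv  # assumption: python-dotenv, paired with .env.example
+
+ load_dotenv()  # no-op on HF Spaces, where secrets arrive as env vars
+
+ GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")  # required
+ HF_TOKEN = os.getenv("HF_TOKEN")              # required
+ MCP_SERVER_URL = os.getenv(
+     "MCP_SERVER_URL",
+     "https://mcp-1st-birthday-tracemind-mcp-server.hf.space/gradio_api/mcp/sse",
+ )
+ LEADERBOARD_REPO = os.getenv("LEADERBOARD_REPO", "kshitijthakkar/smoltrace-leaderboard")
+ DISABLE_OAUTH = os.getenv("DISABLE_OAUTH", "false").lower() == "true"
+ ```
+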
+ ---
+
+ ## Performance Optimization
+
+ ### 1. Data Caching
+
+ **Implementation**: `data_loader.py`
+ - In-memory cache with 5-minute TTL
+ - Reduces HF Datasets API calls
+ - Faster page loads (see the usage sketch below)
+
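+ Usage is transparent to callers; within the TTL window a second call is served from memory:
+
+ ```python
+ from data_loader import DataLoader
+
+ loader = DataLoader()
+ df = loader.load_leaderboard()        # first call: fetched from HF Datasets
+ df_again = loader.load_leaderboard()  # within 5 minutes: returned from cache
+ ```
+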
+ ### 2. Async MCP Calls
+
+ **Pattern**: Use async for non-blocking I/O
+ ```python
+ # Could be optimized to run in parallel (the *_async helpers are illustrative)
+ async def load_data_with_insights():
+     leaderboard_task = load_dataset_async(...)
+     insights_task = mcp_client.analyze_leaderboard_async(...)
+
+     leaderboard, insights = await asyncio.gather(leaderboard_task, insights_task)
+     return leaderboard, insights
+ ```
+
+ ### 3. Component Lazy Loading
+
+ **Strategy**: Load components only when tabs are activated
+ ```python
+ with gr.Tab("Trace Detail", visible=False) as trace_tab:
+     # Components created only when tab first shown
+     @trace_tab.select
+     def load_trace_components():
+         return build_trace_visualization()
+ ```
+
+ ---
+
+ ## Related Documentation
+
+ - [README.md](README.md) - Overview and quick start
+ - [USER_GUIDE.md](USER_GUIDE.md) - Complete screen-by-screen guide
+ - [MCP_INTEGRATION.md](MCP_INTEGRATION.md) - MCP client implementation
+ - [TraceMind MCP Server Architecture](ARCHITECTURE_MCP_SERVER.md) - Server-side architecture
+
+ ---
+
+ **Last Updated**: November 21, 2025
+ **Version**: 1.0.0
+ **Track**: MCP in Action (Enterprise)
MCP_INTEGRATION.md ADDED
@@ -0,0 +1,706 @@
1
+ # TraceMind-AI - MCP Integration Guide
2
+
3
+ This document explains how TraceMind-AI integrates with MCP servers to provide AI-powered agent evaluation.
4
+
5
+ ## Table of Contents
6
+
7
+ - [Overview](#overview)
8
+ - [Dual MCP Integration](#dual-mcp-integration)
9
+ - [Architecture](#architecture)
10
+ - [MCP Client Implementation](#mcp-client-implementation)
11
+ - [Agent Framework Integration](#agent-framework-integration)
12
+ - [MCP Tools Usage](#mcp-tools-usage)
13
+ - [Development Guide](#development-guide)
14
+
15
+ ---
16
+
17
+ ## Overview
18
+
19
+ TraceMind-AI demonstrates **enterprise MCP client usage** as part of the **Track 2: MCP in Action** submission. It showcases two distinct patterns of MCP integration:
20
+
21
+ 1. **Direct MCP Client**: Python-based client connecting to remote MCP server via SSE transport
22
+ 2. **Autonomous Agent**: `smolagents`-based agent with access to MCP tools for multi-step reasoning
23
+
24
+ Both patterns consume the same MCP server ([TraceMind-mcp-server](https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind-mcp-server)) to provide AI-powered analysis of agent evaluation data.
25
+
26
+ ---
27
+
28
+ ## Dual MCP Integration
29
+
30
+ ### Pattern 1: Direct MCP Client Integration
31
+
32
+ **Where**: Leaderboard insights, cost estimation dialogs, trace debugging
33
+
34
+ **How it works**:
35
+ ```python
36
+ # TraceMind-AI calls MCP server directly
37
+ mcp_client = get_sync_mcp_client()
38
+ insights = mcp_client.analyze_leaderboard(
39
+ metric_focus="overall",
40
+ time_range="last_week",
41
+ top_n=5
42
+ )
43
+ # Display insights in UI
44
+ ```
45
+
46
+ **Use cases**:
47
+ - Generate leaderboard insights when user clicks "Load Leaderboard"
48
+ - Estimate costs when user clicks "Estimate Cost" in New Evaluation form
49
+ - Debug traces when user asks questions in trace visualization
50
+
51
+ **Advantages**:
52
+ - Direct, fast execution
53
+ - Synchronous API (easy to integrate with Gradio)
54
+ - Predictable, structured responses
55
+
56
+ ---
57
+
58
+ ### Pattern 2: Autonomous Agent with MCP Tools
59
+
60
+ **Where**: Agent Chat tab
61
+
62
+ **How it works**:
63
+ ```python
64
+ # smolagents agent discovers and uses MCP tools autonomously
65
+ from smolagents import ToolCallingAgent, MCPClient
66
+
67
+ # Agent initialized with MCP client
68
+ agent = ToolCallingAgent(
69
+ tools=[], # Tools loaded from MCP server
70
+ model=model_client,
71
+ mcp_client=MCPClient(mcp_server_url)
72
+ )
73
+
74
+ # User asks question
75
+ result = agent.run("What are the top 3 models and their costs?")
76
+
77
+ # Agent plans:
78
+ # 1. Call get_top_performers MCP tool
79
+ # 2. Extract costs from results
80
+ # 3. Format and present to user
81
+ ```
82
+
83
+ **Use cases**:
84
+ - Answer complex questions requiring multi-step analysis
85
+ - Compare models across multiple dimensions
86
+ - Plan evaluation strategies with cost estimates
87
+ - Provide recommendations based on leaderboard data
88
+
89
+ **Advantages**:
90
+ - Natural language interface
91
+ - Multi-step reasoning
92
+ - Autonomous tool selection
93
+ - Context-aware responses
94
+
95
+ ---
96
+
97
+ ## Architecture
98
+
99
+ ### System Overview
100
+
101
+ ```
102
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
103
+ β”‚ TraceMind-AI (Gradio App) - Track 2 β”‚
104
+ β”‚ β”‚
105
+ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
106
+ β”‚ β”‚ UI Layer (Gradio) β”‚ β”‚
107
+ β”‚ β”‚ - Leaderboard tab β”‚ β”‚
108
+ β”‚ β”‚ - Agent Chat tab β”‚ β”‚
109
+ β”‚ β”‚ - New Evaluation tab β”‚ β”‚
110
+ β”‚ β”‚ - Trace Visualization tab β”‚ β”‚
111
+ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
112
+ β”‚ ↓ ↓ β”‚
113
+ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
114
+ β”‚ β”‚ Direct MCP Client β”‚ β”‚ Autonomous Agent β”‚ β”‚
115
+ β”‚ β”‚ (sync_wrapper.py) β”‚ β”‚ (smolagents) β”‚ β”‚
116
+ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚
117
+ β”‚ β”‚ - Synchronous API β”‚ β”‚ - Multi-step reasoning β”‚ β”‚
118
+ β”‚ β”‚ - Tool calling β”‚ β”‚ - Tool discovery β”‚ β”‚
119
+ β”‚ β”‚ - Error handling β”‚ β”‚ - Context management β”‚ β”‚
120
+ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
121
+ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
122
+ β”‚ ↓ β”‚
123
+ β”‚ MCP Protocol β”‚
124
+ β”‚ (SSE Transport) β”‚
125
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
126
+ ↓
127
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
128
+ β”‚ TraceMind MCP Server - Track 1 β”‚
129
+ β”‚ https://huggingface.co/spaces/MCP-1st-Birthday/ β”‚
130
+ β”‚ TraceMind-mcp-server β”‚
131
+ β”‚ β”‚
132
+ β”‚ 11 AI-Powered Tools: β”‚
133
+ β”‚ - analyze_leaderboard β”‚
134
+ β”‚ - debug_trace β”‚
135
+ β”‚ - estimate_cost β”‚
136
+ β”‚ - compare_runs β”‚
137
+ β”‚ - analyze_results β”‚
138
+ β”‚ - get_top_performers β”‚
139
+ β”‚ - get_leaderboard_summary β”‚
140
+ β”‚ - get_dataset β”‚
141
+ β”‚ - generate_synthetic_dataset β”‚
142
+ β”‚ - push_dataset_to_hub β”‚
143
+ β”‚ - generate_prompt_template β”‚
144
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
145
+ ```
146
+
147
+ ---
148
+
149
+ ## MCP Client Implementation
150
+
151
+ ### File Structure
152
+
153
+ ```
154
+ TraceMind-AI/
155
+ β”œβ”€β”€ mcp_client/
156
+ β”‚ β”œβ”€β”€ __init__.py
157
+ β”‚ β”œβ”€β”€ client.py # Async MCP client
158
+ β”‚ └── sync_wrapper.py # Synchronous wrapper for Gradio
159
+ β”œβ”€β”€ agent/
160
+ β”‚ β”œβ”€β”€ __init__.py
161
+ β”‚ └── smolagents_setup.py # Agent with MCP integration
162
+ └── app.py # Main Gradio app
163
+ ```
164
+
165
+ ### Async MCP Client (`client.py`)
166
+
167
+ ```python
168
+ from mcp import ClientSession, StdioServerParameters
169
+ import mcp.types as types
170
+
171
+ class TraceMindMCPClient:
172
+ """Async MCP client for TraceMind MCP Server"""
173
+
174
+ def __init__(self, mcp_server_url: str):
175
+ self.mcp_server_url = mcp_server_url
176
+ self.session = None
177
+
178
+ async def connect(self):
179
+ """Establish connection to MCP server via SSE"""
180
+ # For HTTP-based MCP servers (HuggingFace Spaces)
181
+ self.session = ClientSession(
182
+ ServerParameters(
183
+ url=self.mcp_server_url,
184
+ transport="sse"
185
+ )
186
+ )
187
+ await self.session.__aenter__()
188
+
189
+ # List available tools
190
+ tools_result = await self.session.list_tools()
191
+ self.available_tools = {tool.name: tool for tool in tools_result.tools}
192
+
193
+ print(f"Connected to MCP server. Available tools: {list(self.available_tools.keys())}")
194
+
195
+ async def call_tool(self, tool_name: str, arguments: dict) -> str:
196
+ """Call an MCP tool with given arguments"""
197
+ if not self.session:
198
+ raise RuntimeError("MCP client not connected. Call connect() first.")
199
+
200
+ if tool_name not in self.available_tools:
201
+ raise ValueError(f"Tool '{tool_name}' not available. Available: {list(self.available_tools.keys())}")
202
+
203
+ # Call the tool
204
+ result = await self.session.call_tool(tool_name, arguments=arguments)
205
+
206
+ # Extract text response
207
+ if result.content and len(result.content) > 0:
208
+ return result.content[0].text
209
+ return ""
210
+
211
+ async def analyze_leaderboard(self, **kwargs) -> str:
212
+ """Wrapper for analyze_leaderboard tool"""
213
+ return await self.call_tool("analyze_leaderboard", kwargs)
214
+
215
+ async def estimate_cost(self, **kwargs) -> str:
216
+ """Wrapper for estimate_cost tool"""
217
+ return await self.call_tool("estimate_cost", kwargs)
218
+
219
+ async def debug_trace(self, **kwargs) -> str:
220
+ """Wrapper for debug_trace tool"""
221
+ return await self.call_tool("debug_trace", kwargs)
222
+
223
+ async def compare_runs(self, **kwargs) -> str:
224
+ """Wrapper for compare_runs tool"""
225
+ return await self.call_tool("compare_runs", kwargs)
226
+
227
+ async def get_top_performers(self, **kwargs) -> str:
228
+ """Wrapper for get_top_performers tool"""
229
+ return await self.call_tool("get_top_performers", kwargs)
230
+
231
+ async def disconnect(self):
232
+ """Close MCP connection"""
233
+ if self.session:
234
+ await self.session.__aexit__(None, None, None)
235
+ ```
236
+
237
+ ### Synchronous Wrapper (`sync_wrapper.py`)
238
+
239
+ ```python
240
+ import asyncio
241
+ from typing import Optional
242
+ from .client import TraceMindMCPClient
243
+
244
+ class SyncMCPClient:
245
+ """Synchronous wrapper for async MCP client (Gradio-compatible)"""
246
+
247
+ def __init__(self, mcp_server_url: str):
248
+ self.mcp_server_url = mcp_server_url
249
+ self.async_client = TraceMindMCPClient(mcp_server_url)
250
+ self._connected = False
251
+
252
+ def _run_async(self, coro):
253
+ """Run async coroutine in sync context"""
254
+ try:
255
+ loop = asyncio.get_event_loop()
256
+ except RuntimeError:
257
+ loop = asyncio.new_event_loop()
258
+ asyncio.set_event_loop(loop)
259
+
260
+ return loop.run_until_complete(coro)
261
+
262
+ def initialize(self):
263
+ """Connect to MCP server"""
264
+ if not self._connected:
265
+ self._run_async(self.async_client.connect())
266
+ self._connected = True
267
+
268
+ def analyze_leaderboard(self, **kwargs) -> str:
269
+ """Synchronous wrapper for analyze_leaderboard"""
270
+ if not self._connected:
271
+ self.initialize()
272
+ return self._run_async(self.async_client.analyze_leaderboard(**kwargs))
273
+
274
+ def estimate_cost(self, **kwargs) -> str:
275
+ """Synchronous wrapper for estimate_cost"""
276
+ if not self._connected:
277
+ self.initialize()
278
+ return self._run_async(self.async_client.estimate_cost(**kwargs))
279
+
280
+ def debug_trace(self, **kwargs) -> str:
281
+ """Synchronous wrapper for debug_trace"""
282
+ if not self._connected:
283
+ self.initialize()
284
+ return self._run_async(self.async_client.debug_trace(**kwargs))
285
+
286
+ # ... (similar wrappers for other tools)
287
+
288
+ # Global instance for use in Gradio app
289
+ _mcp_client: Optional[SyncMCPClient] = None
290
+
291
+ def get_sync_mcp_client() -> SyncMCPClient:
292
+ """Get or create global sync MCP client instance"""
293
+ global _mcp_client
294
+ if _mcp_client is None:
295
+ mcp_server_url = os.getenv(
296
+ "MCP_SERVER_URL",
297
+ "https://mcp-1st-birthday-tracemind-mcp-server.hf.space/gradio_api/mcp/sse"
298
+ )
299
+ _mcp_client = SyncMCPClient(mcp_server_url)
300
+ return _mcp_client
301
+ ```
302
+
303
+ ### Usage in Gradio App
304
+
305
+ ```python
306
+ # app.py
307
+ from mcp_client.sync_wrapper import get_sync_mcp_client
308
+
309
+ # Initialize MCP client
310
+ mcp_client = get_sync_mcp_client()
311
+ mcp_client.initialize()
312
+
313
+ # Use in Gradio event handlers
314
+ def load_leaderboard():
315
+ """Load leaderboard and generate AI insights"""
316
+ # Load dataset
317
+ ds = load_dataset("kshitijthakkar/smoltrace-leaderboard")
318
+ df = pd.DataFrame(ds)
319
+
320
+ # Get AI insights from MCP server
321
+ try:
322
+ insights = mcp_client.analyze_leaderboard(
323
+ metric_focus="overall",
324
+ time_range="last_week",
325
+ top_n=5
326
+ )
327
+ except Exception as e:
328
+ insights = f"❌ Error generating insights: {str(e)}"
329
+
330
+ return df, insights
331
+
332
+ # Gradio UI
333
+ with gr.Blocks() as app:
334
+ with gr.Tab("πŸ“Š Leaderboard"):
335
+ load_btn = gr.Button("Load Leaderboard")
336
+ insights_md = gr.Markdown(label="AI Insights")
337
+ leaderboard_table = gr.Dataframe()
338
+
339
+ load_btn.click(
340
+ fn=load_leaderboard,
341
+ outputs=[leaderboard_table, insights_md]
342
+ )
343
+ ```
344
+
345
+ ---
346
+
347
+ ## Agent Framework Integration
348
+
349
+ ### smolagents Setup
350
+
351
+ ```python
352
+ # agent/smolagents_setup.py
353
+ from smolagents import ToolCallingAgent, MCPClient, HfApiModel
354
+ import os
355
+
356
+ def create_agent():
357
+ """Create smolagents agent with MCP tool access"""
358
+
359
+ # 1. Configure MCP client
360
+ mcp_server_url = os.getenv(
361
+ "MCP_SERVER_URL",
362
+ "https://mcp-1st-birthday-tracemind-mcp-server.hf.space/gradio_api/mcp/sse"
363
+ )
364
+
365
+ mcp_client = MCPClient(mcp_server_url)
366
+
367
+ # 2. Configure LLM
368
+ model = HfApiModel(
369
+ model_id="Qwen/Qwen2.5-Coder-32B-Instruct",
370
+ token=os.getenv("HF_TOKEN")
371
+ )
372
+
373
+ # 3. Create agent with MCP tools
374
+ agent = ToolCallingAgent(
375
+ tools=[], # MCP tools loaded automatically
376
+ model=model,
377
+ mcp_client=mcp_client,
378
+ max_steps=10,
379
+ verbosity_level=1
380
+ )
381
+
382
+ return agent
383
+
384
+ def run_agent_query(agent: ToolCallingAgent, query: str, show_reasoning: bool = False):
385
+ """Run agent query and return response"""
386
+ try:
387
+ # Set verbosity based on show_reasoning flag
388
+ if show_reasoning:
389
+ agent.verbosity_level = 2 # Show tool execution logs
390
+ else:
391
+ agent.verbosity_level = 0 # Only show final answer
392
+
393
+ # Run agent
394
+ result = agent.run(query)
395
+
396
+ return result
397
+ except Exception as e:
398
+ return f"❌ Agent error: {str(e)}"
399
+ ```
400
+
401
+ ### Agent Chat UI
402
+
403
+ ```python
404
+ # app.py
405
+ from agent.smolagents_setup import create_agent, run_agent_query
406
+
407
+ # Initialize agent (once at startup)
408
+ agent = create_agent()
409
+
410
+ def agent_chat(message: str, history: list, show_reasoning: bool):
411
+ """Handle agent chat interaction"""
412
+ # Run agent query
413
+ response = run_agent_query(agent, message, show_reasoning)
414
+
415
+ # Update chat history
416
+ history.append((message, response))
417
+
418
+ return history, ""
419
+
420
+ # Gradio UI
421
+ with gr.Blocks() as app:
422
+ with gr.Tab("πŸ€– Agent Chat"):
423
+ gr.Markdown("## Autonomous Agent with MCP Tools")
424
+ gr.Markdown("Ask questions about agent evaluations. The agent has access to all MCP tools.")
425
+
426
+ chatbot = gr.Chatbot(label="Agent Chat")
427
+ msg = gr.Textbox(label="Your Question", placeholder="What are the top 3 models and their costs?")
428
+ show_reasoning = gr.Checkbox(label="Show Agent Reasoning", value=False)
429
+
430
+ # Quick action buttons
431
+ with gr.Row():
432
+ quick_top = gr.Button("Quick: Top Models")
433
+ quick_cost = gr.Button("Quick: Cost Estimate")
434
+ quick_load = gr.Button("Quick: Load Leaderboard")
435
+
436
+ # Event handlers
437
+ msg.submit(agent_chat, [msg, chatbot, show_reasoning], [chatbot, msg])
438
+
439
+ quick_top.click(
440
+ lambda h, sr: agent_chat(
441
+ "What are the top 5 models by success rate with their costs?",
442
+ h,
443
+ sr
444
+ ),
445
+ [chatbot, show_reasoning],
446
+ [chatbot, msg]
447
+ )
448
+ ```
449
+
450
+ ---
451
+
452
+ ## MCP Tools Usage
453
+
454
+ ### Tools Used in TraceMind-AI
455
+
456
+ | Tool | Where Used | Purpose |
457
+ |------|-----------|---------|
458
+ | `analyze_leaderboard` | Leaderboard tab | Generate AI insights when user loads leaderboard |
459
+ | `estimate_cost` | New Evaluation tab | Predict costs before submitting evaluation |
460
+ | `debug_trace` | Trace Visualization | Answer questions about execution traces |
461
+ | `compare_runs` | Agent Chat | Compare two evaluation runs side-by-side |
462
+ | `analyze_results` | Agent Chat | Analyze detailed test results with optimization recommendations |
463
+ | `get_top_performers` | Agent Chat | Efficiently fetch top N models (90% token reduction) |
464
+ | `get_leaderboard_summary` | Agent Chat | Get high-level statistics (99% token reduction) |
465
+ | `get_dataset` | Agent Chat | Load SMOLTRACE datasets for detailed analysis |
466
+
467
+ ### Example Tool Calls
468
+
469
+ **Example 1: Leaderboard Insights**
470
+ ```python
471
+ # User clicks "Load Leaderboard" button
472
+ insights = mcp_client.analyze_leaderboard(
473
+ leaderboard_repo="kshitijthakkar/smoltrace-leaderboard",
474
+ metric_focus="overall",
475
+ time_range="last_week",
476
+ top_n=5
477
+ )
478
+
479
+ # Display in Gradio Markdown component
480
+ insights_md.value = insights
481
+ ```
482
+
483
+ **Example 2: Cost Estimation**
484
+ ```python
485
+ # User fills New Evaluation form and clicks "Estimate Cost"
486
+ estimate = mcp_client.estimate_cost(
487
+ model="meta-llama/Llama-3.1-8B",
488
+ agent_type="both",
489
+ num_tests=100,
490
+ hardware="auto"
491
+ )
492
+
493
+ # Display in dialog
494
+ gr.Info(estimate)
495
+ ```
496
+
497
+ **Example 3: Agent Multi-Step Query**
498
+ ```python
499
+ # User asks: "What are the top 3 models and how much do they cost?"
500
+
501
+ # Agent reasoning (internal):
502
+ # Step 1: Need to get top models by success rate
503
+ # β†’ Call get_top_performers(metric="success_rate", top_n=3)
504
+ #
505
+ # Step 2: Extract cost information from results
506
+ # β†’ Parse JSON response, get "total_cost_usd" field
507
+ #
508
+ # Step 3: Format response for user
509
+ # β†’ Create markdown table with model names, success rates, costs
510
+
511
+ # Agent response:
512
+ """
513
+ Here are the top 3 models by success rate:
514
+
515
+ 1. **GPT-4**: 95.8% success rate, $0.05 per run
516
+ 2. **Claude-3**: 94.1% success rate, $0.04 per run
517
+ 3. **Llama-3.1-8B**: 93.4% success rate, $0.002 per run
518
+
519
+ GPT-4 leads in accuracy but is 25x more expensive than Llama-3.1.
520
+ For cost-sensitive workloads, Llama-3.1 offers the best value.
521
+ """
522
+ ```
523
+
524
+ ---
525
+
526
+ ## Development Guide
527
+
528
+ ### Adding New MCP Tool Integration
529
+
530
+ 1. **Add method to async client** (`client.py`):
531
+ ```python
532
+ async def new_tool_name(self, **kwargs) -> str:
533
+ """Wrapper for new_tool_name MCP tool"""
534
+ return await self.call_tool("new_tool_name", kwargs)
535
+ ```
536
+
537
+ 2. **Add synchronous wrapper** (`sync_wrapper.py`):
538
+ ```python
539
+ def new_tool_name(self, **kwargs) -> str:
540
+ """Synchronous wrapper for new_tool_name"""
541
+ if not self._connected:
542
+ self.initialize()
543
+ return self._run_async(self.async_client.new_tool_name(**kwargs))
544
+ ```
545
+
546
+ 3. **Use in Gradio app** (`app.py`):
547
+ ```python
548
+ def handle_new_tool():
549
+ result = mcp_client.new_tool_name(param1="value1", param2="value2")
550
+ return result
551
+ ```
552
+
553
+ **Note**: Agent automatically discovers new tools from MCP server, no code changes needed!
554
+
555
+ ### Testing MCP Integration
556
+
557
+ **Test 1: Connection**
558
+ ```python
559
+ python -c "from mcp_client.sync_wrapper import get_sync_mcp_client; client = get_sync_mcp_client(); client.initialize(); print('βœ… MCP client connected')"
560
+ ```
561
+
562
+ **Test 2: Tool Call**
563
+ ```python
564
+ from mcp_client.sync_wrapper import get_sync_mcp_client
565
+
566
+ client = get_sync_mcp_client()
567
+ client.initialize()
568
+
569
+ result = client.analyze_leaderboard(
570
+ metric_focus="cost",
571
+ time_range="last_week",
572
+ top_n=3
573
+ )
574
+
575
+ print(result)
576
+ ```
577
+
578
+ **Test 3: Agent**
579
+ ```python
580
+ from agent.smolagents_setup import create_agent, run_agent_query
581
+
582
+ agent = create_agent()
583
+ response = run_agent_query(agent, "What are the top 3 models?", show_reasoning=True)
584
+ print(response)
585
+ ```
586
+
587
+ ### Debugging MCP Issues
588
+
589
+ **Issue**: Connection timeout
590
+ - **Check**: MCP server is running at specified URL
591
+ - **Check**: Network connectivity to HuggingFace Spaces
592
+ - **Check**: SSE transport is enabled on server
593
+
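+ A quick way to rule out network problems is to probe the SSE endpoint directly (an illustrative check; any HTTP client works):
+
+ ```python
+ import requests
+
+ # Request the MCP server's SSE endpoint without consuming the stream;
+ # a 200 status means the endpoint is reachable from this machine.
+ url = "https://mcp-1st-birthday-tracemind-mcp-server.hf.space/gradio_api/mcp/sse"
+ resp = requests.get(url, stream=True, timeout=10)
+ print(resp.status_code)
+ resp.close()
+ ```
+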
594
+ **Issue**: Tool not found
595
+ - **Check**: MCP server has the tool implemented
596
+ - **Check**: Tool name matches exactly (case-sensitive)
597
+ - **Check**: Client initialized successfully (call `initialize()` first)
598
+
599
+ **Issue**: Agent not using MCP tools
600
+ - **Check**: MCPClient is properly configured in agent setup
601
+ - **Check**: Agent has `max_steps > 0` to allow tool usage
602
+ - **Check**: Query requires tool usage (not answerable from agent's knowledge alone)
603
+
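+ To confirm the MCP tools actually reached the agent, listing them is usually enough (a sketch; the exact attribute may vary across smolagents versions):
+
+ ```python
+ from agent.smolagents_setup import create_agent
+
+ agent = create_agent()
+ print(list(agent.tools))  # MCP tool names should appear in this list
+ ```
+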
604
+ ---
605
+
606
+ ## Performance Considerations
607
+
608
+ ### Token Optimization
609
+
610
+ **Problem**: Loading full leaderboard dataset consumes excessive tokens
611
+ **Solution**: Use token-optimized MCP tools
612
+
613
+ ```python
614
+ # ❌ BAD: Loads all 51 runs (50K+ tokens)
615
+ leaderboard = mcp_client.get_dataset("kshitijthakkar/smoltrace-leaderboard")
616
+
617
+ # ✅ GOOD: Returns only top 5 (5K tokens, 90% reduction)
618
+ top_performers = mcp_client.get_top_performers(top_n=5)
619
+
620
+ # ✅ BETTER: Returns summary stats (500 tokens, 99% reduction)
621
+ summary = mcp_client.get_leaderboard_summary()
622
+ ```
623
+
624
+ ### Caching
625
+
626
+ **Problem**: Repeated identical MCP calls waste time and credits
627
+ **Solution**: Implement client-side caching
628
+
629
+ ```python
630
+ from functools import lru_cache
631
+ import time
632
+
633
+ @lru_cache(maxsize=32)
634
+ def cached_analyze_leaderboard(metric_focus: str, time_range: str, top_n: int, cache_key: int):
635
+ """Cached MCP call with TTL via cache_key"""
636
+ return mcp_client.analyze_leaderboard(
637
+ metric_focus=metric_focus,
638
+ time_range=time_range,
639
+ top_n=top_n
640
+ )
641
+
642
+ # Use with 5-minute cache TTL
643
+ cache_key = int(time.time() // 300) # Changes every 5 minutes
644
+ insights = cached_analyze_leaderboard("overall", "last_week", 5, cache_key)
645
+ ```
646
+
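+ The trick: `lru_cache` has no native TTL, but because `cache_key` changes every 300 seconds, calls in a new window are cache misses and refresh the result automatically.
+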
647
+ ### Async Optimization
648
+
649
+ **Problem**: Sequential MCP calls block UI
650
+ **Solution**: Use async for parallel calls
651
+
652
+ ```python
653
+ import asyncio
654
+
655
+ async def load_leaderboard_with_insights():
656
+ """Load leaderboard and insights in parallel"""
657
+ # Start both operations concurrently. Note: this requires the *async*
+ # MCP client (mcp_client/client.py) -- asyncio.create_task() needs
+ # coroutines, so the sync wrapper won't work here. load_dataset_async
+ # is an illustrative helper for non-blocking dataset loading.
+ leaderboard_task = asyncio.create_task(load_dataset_async("kshitijthakkar/smoltrace-leaderboard"))
+ insights_task = asyncio.create_task(mcp_client.analyze_leaderboard(metric_focus="overall"))
660
+
661
+ # Wait for both to complete
662
+ leaderboard, insights = await asyncio.gather(leaderboard_task, insights_task)
663
+
664
+ return leaderboard, insights
665
+ ```
666
+
667
+ ---
668
+
669
+ ## Security Considerations
670
+
671
+ ### API Key Management
672
+
673
+ **DO**:
674
+ - Store API keys in environment variables or HF Spaces secrets
675
+ - Use session-only storage in Gradio (not server-side persistence)
676
+ - Rotate keys regularly
677
+
678
+ **DON'T**:
679
+ - Hardcode API keys in source code
680
+ - Expose keys in client-side JavaScript
681
+ - Log API keys in console or files
682
+
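+ A minimal sketch of the recommended pattern (names are illustrative; HF Spaces secrets are exposed to the app as environment variables):
+
+ ```python
+ import os
+
+ # Read keys from the environment at startup -- never hardcode them.
+ GEMINI_API_KEY = os.environ.get("GEMINI_API_KEY", "")
+
+ def mask(key: str) -> str:
+     """Show only the last 4 characters so logs stay safe."""
+     return f"***{key[-4:]}" if key else "(unset)"
+
+ print(f"Gemini key loaded: {mask(GEMINI_API_KEY)}")
+ ```
+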
683
+ ### MCP Server Trust
684
+
685
+ **Verify MCP server authenticity**:
686
+ - Use HTTPS URLs only
687
+ - Verify domain ownership (huggingface.co spaces)
688
+ - Review MCP server code before connecting (open source)
689
+
690
+ **Limit tool access**:
691
+ - Only connect to trusted MCP servers
692
+ - Review tool permissions before use
693
+ - Implement rate limiting for tool calls
694
+
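+ Rate limiting can live entirely on the client side; a small sketch (thresholds are illustrative):
+
+ ```python
+ import time
+
+ class RateLimiter:
+     """Allow at most `max_calls` tool calls per `window` seconds."""
+
+     def __init__(self, max_calls: int = 10, window: float = 60.0):
+         self.max_calls, self.window = max_calls, window
+         self.calls: list[float] = []
+
+     def acquire(self) -> None:
+         now = time.time()
+         self.calls = [t for t in self.calls if now - t < self.window]
+         if len(self.calls) >= self.max_calls:
+             time.sleep(self.window - (now - self.calls[0]))
+         self.calls.append(time.time())
+
+ limiter = RateLimiter()
+ limiter.acquire()  # call before each MCP tool invocation
+ ```
+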
695
+ ---
696
+
697
+ ## Related Documentation
698
+
699
+ - [USER_GUIDE.md](USER_GUIDE.md) - Complete UI walkthrough
+ - [JOB_SUBMISSION.md](JOB_SUBMISSION.md) - Evaluation job guide
+ - [ARCHITECTURE.md](ARCHITECTURE.md) - Technical architecture
702
+ - [TraceMind MCP Server Documentation](https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind-mcp-server)
703
+
704
+ ---
705
+
706
+ **Last Updated**: November 21, 2025
README.md CHANGED
@@ -20,474 +20,449 @@ tags:
20
  # 🧠 TraceMind-AI
21
 
22
  <p align="center">
23
- <img src="https://raw.githubusercontent.com/Mandark-droid/TraceMind-AI/assets/TraceVerse_Logo.png" alt="TraceVerse Ecosystem" width="400"/>
24
- <br/>
25
- <br/>
26
  <img src="https://raw.githubusercontent.com/Mandark-droid/TraceMind-AI/assets/Logo.png" alt="TraceMind-AI Logo" width="200"/>
27
  </p>
28
 
29
  **Agent Evaluation Platform with MCP-Powered Intelligence**
30
 
31
  [![MCP's 1st Birthday Hackathon](https://img.shields.io/badge/MCP%27s%201st%20Birthday-Hackathon-blue)](https://github.com/modelcontextprotocol)
32
- [![Track](https://img.shields.io/badge/Track-MCP%20in%20Action%20(Enterprise)-purple)](https://github.com/modelcontextprotocol/hackathon)
33
  [![Powered by Gradio](https://img.shields.io/badge/Powered%20by-Gradio-orange)](https://gradio.app/)
34
 
35
  > **🎯 Track 2 Submission**: MCP in Action (Enterprise)
36
  > **📅 MCP's 1st Birthday Hackathon**: November 14-30, 2025
37
 
38
- ## Overview
39
-
40
- TraceMind-AI is a comprehensive platform for evaluating AI agent performance across different models, providers, and configurations. It provides real-time insights, cost analysis, and detailed trace visualization powered by the Model Context Protocol (MCP).
41
-
42
- ### πŸ—οΈ **Built on Open Source Foundation**
43
-
44
- This platform is part of a complete agent evaluation ecosystem built on two foundational open-source projects:
45
-
46
- **πŸ”­ TraceVerde (genai_otel_instrument)** - Automatic OpenTelemetry Instrumentation
47
- - **What**: Zero-code OTEL instrumentation for LLM frameworks (LiteLLM, Transformers, LangChain, etc.)
48
- - **Why**: Captures every LLM call, tool usage, and agent step automatically
49
- - **Links**: [GitHub](https://github.com/Mandark-droid/genai_otel_instrument) | [PyPI](https://pypi.org/project/genai-otel-instrument)
50
-
51
- **πŸ“Š SMOLTRACE** - Agent Evaluation Engine
52
- - **What**: Lightweight, production-ready evaluation framework with OTEL tracing built-in
53
- - **Why**: Generates structured datasets (leaderboard, results, traces, metrics) displayed in this UI
54
- - **Links**: [GitHub](https://github.com/Mandark-droid/SMOLTRACE) | [PyPI](https://pypi.org/project/smoltrace/)
55
-
56
- **The Flow**: `TraceVerde` instruments your agents β†’ `SMOLTRACE` evaluates them β†’ `TraceMind-AI` visualizes results with MCP-powered intelligence
57
-
58
  ---
59
 
60
- ## Features
61
 
62
- - **πŸ“Š Real-time Leaderboard**: Live evaluation data from HuggingFace datasets
63
- - **πŸ€– Autonomous Agent Chat**: Interactive agent powered by smolagents with MCP tools (Track 2)
64
- - **πŸ’¬ MCP Integration**: AI-powered analysis using remote MCP servers
65
- - **☁️ Multi-Cloud Evaluation**: Submit jobs to HuggingFace Jobs or Modal (H200, A100, A10 GPUs)
66
- - **πŸ’° Smart Cost Estimation**: Auto-select hardware and predict costs before running evaluations
67
- - **πŸ” Trace Visualization**: Detailed OpenTelemetry trace analysis with GPU metrics
68
- - **πŸ“ˆ Performance Metrics**: GPU utilization, CO2 emissions, token usage tracking
69
- - **🧠 Agent Reasoning**: View step-by-step agent planning and tool execution
70
 
71
- ## MCP Integration
 
 
 
 
 
72
 
73
- TraceMind demonstrates enterprise MCP client usage by connecting to [TraceMind-mcp-server](https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind-mcp-server) via the Model Context Protocol.
74
 
75
- **MCP Tools Used:**
76
- - `analyze_leaderboard` - AI-generated insights about evaluation trends
77
- - `estimate_cost` - Cost estimation with hardware recommendations
78
- - `debug_trace` - Interactive trace analysis and debugging
79
- - `compare_runs` - Side-by-side run comparison
80
- - `analyze_results` - Test case analysis with optimization recommendations
81
 
82
- ## Quick Start
83
 
84
- ### Prerequisites
 
 
 
 
85
 
86
- **For Viewing Leaderboard & Analysis:**
87
- - Python 3.10+
88
- - HuggingFace account (for authentication)
89
 
90
- **For Submitting Evaluation Jobs:**
91
- - ⚠️ **HuggingFace Pro account** ($9/month) with credit card
92
- - HuggingFace token with **Read + Write + Run Jobs** permissions
93
- - API keys for model providers (OpenAI, Anthropic, etc.)
94
 
95
- > **Note**: Job submission requires a paid HuggingFace Pro account to access compute infrastructure. Viewing existing results is free.
96
 
97
- ### Installation
 
 
 
98
 
99
- 1. Clone the repository:
100
- ```bash
101
- git clone https://github.com/Mandark-droid/TraceMind-AI.git
102
- cd TraceMind-AI
103
  ```
104
-
105
- 2. Install dependencies:
106
- ```bash
107
- pip install -r requirements.txt
 
 
 
 
 
 
 
 
 
 
 
 
 
108
  ```
109
 
110
- 3. Configure environment:
111
- ```bash
112
- cp .env.example .env
113
- # Edit .env with your configuration
114
- ```
115
 
116
- 4. Run the application:
117
- ```bash
118
- python app.py
119
- ```
120
-
121
- Visit http://localhost:7860
122
-
123
- ## 🎯 For Hackathon Judges & Visitors
124
 
125
- ### Using Your Own API Keys (Recommended)
 
 
126
 
127
- TraceMind-AI integrates with the TraceMind MCP Server to provide AI-powered analysis. To **prevent credit issues during evaluation**, we recommend configuring your own API keys:
128
 
129
- #### Step-by-Step Configuration
 
 
130
 
131
- **Step 1: Configure MCP Server** (Required for MCP tool features)
 
132
 
133
- 1. **Open MCP Server**: https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind-mcp-server
134
- 2. Go to **βš™οΈ Settings** tab
135
- 3. Enter your **Gemini API Key** and **HuggingFace Token**
136
- 4. Click **"Save & Override Keys"**
137
-
138
- **Step 2: Configure TraceMind-AI** (Optional, for additional features)
139
 
140
- 1. **Open TraceMind-AI**: https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind
141
- 2. Go to **βš™οΈ Settings** tab
142
- 3. Enter your **Gemini API Key** and **HuggingFace Token**
143
- 4. Click **"Save API Keys"**
144
 
145
- ### Why Configure Both?
146
 
147
- - **MCP Server**: Provides AI-powered tools (leaderboard analysis, trace debugging, cost estimation)
148
- - **TraceMind-AI**: Main UI that calls the MCP server for intelligent analysis
149
- - They run in **separate sessions** β†’ need separate configuration
150
- - Configuring both ensures your keys are used for the complete evaluation flow
151
 
152
- ### Getting Free API Keys
 
 
 
153
 
154
- Both APIs have generous free tiers:
 
 
 
 
155
 
156
- **Google Gemini API Key**:
157
- - Visit: https://ai.google.dev/
158
- - Click "Get API Key" β†’ Create project β†’ Generate key
159
- - **Free tier**: 1,500 requests/day (sufficient for evaluation)
160
 
161
- **HuggingFace Token** (for viewing):
162
- - Visit: https://huggingface.co/settings/tokens
163
- - Click "New token" β†’ Name it (e.g., "TraceMind Viewer")
164
- - **Permissions**:
165
- - Select "Read" for viewing datasets (sufficient for browsing leaderboard)
166
- - **Free tier**: No rate limits for public dataset access
167
 
168
- ### Default Configuration (Without Your Keys)
169
 
170
- If you don't configure your own keys:
171
- - Apps will use our pre-configured keys from HuggingFace Spaces Secrets
172
- - Fine for brief testing, but may hit rate limits during high traffic
173
- - Recommended to configure your keys for full evaluation
174
 
175
- ### Security Notes
176
-
177
- βœ… **Session-only storage**: Keys stored only in browser memory
178
- βœ… **No server persistence**: Keys never saved to disk
179
- βœ… **Not exposed via API**: Settings forms use `api_name=False`
180
- βœ… **HTTPS encryption**: All API calls over secure connections
181
 
182
- ## πŸš€ Submitting Evaluation Jobs
183
 
184
- TraceMind-AI allows you to submit evaluation jobs to **two cloud platforms**:
185
- - **HuggingFace Jobs**: Managed compute with H200, A100, A10, T4 GPUs
186
- - **Modal**: Serverless GPU compute with pay-per-second pricing
187
 
188
- ### ⚠️ Requirements for Job Submission
 
 
189
 
190
- **For HuggingFace Jobs:**
191
 
192
- 1. **HuggingFace Pro Account** ($9/month)
193
- - Sign up at: https://huggingface.co/pricing
194
- - **Credit card required** to pay for compute usage
195
- - Free accounts cannot submit jobs
 
196
 
197
- 2. **HuggingFace Token with Enhanced Permissions**
198
- - Visit: https://huggingface.co/settings/tokens
199
- - Create token with these permissions:
200
- - βœ… **Read** (view datasets)
201
- - βœ… **Write** (upload results)
202
- - βœ… **Run Jobs** (submit evaluation jobs)
203
- - ⚠️ Read-only tokens will NOT work
204
 
205
- **For Modal (Optional Alternative):**
 
 
206
 
207
- 1. **Modal Account** (Free tier available)
208
- - Sign up at: https://modal.com
209
- - Generate API token at: https://modal.com/settings/tokens
210
- - Pay-per-second billing (no monthly subscription)
211
 
212
- 2. **Configure Modal Credentials in Settings**
213
- - MODAL_TOKEN_ID (starts with `ak-`)
214
- - MODAL_TOKEN_SECRET (starts with `as-`)
215
 
216
- **Both Platforms Require:**
217
 
218
- 3. **Model Provider API Keys**
219
- - OpenAI, Anthropic, Google, etc.
220
- - Configure in Settings β†’ LLM Provider API Keys
221
- - Passed securely as job secrets
222
 
223
- ### Hardware Options & Pricing
 
 
224
 
225
- TraceMind **auto-selects optimal hardware** based on your model size and provider:
226
 
227
- **HuggingFace Jobs:**
228
- - **cpu-basic**: API models (OpenAI, Anthropic) - ~$0.05/hr
229
- - **t4-small**: Small models (4B-8B parameters) - ~$0.60/hr
230
- - **a10g-small**: Medium models (7B-13B) - ~$1.10/hr
231
- - **a100-large**: Large models (70B+) - ~$3.00/hr
232
- - Pricing: https://huggingface.co/pricing#spaces-pricing
233
 
234
- **Modal:**
235
- - **CPU**: API models - ~$0.0001/sec
236
- - **A10G**: Small-medium models (7B-13B) - ~$0.0006/sec
237
- - **A100-80GB**: Large models (70B+) - ~$0.0030/sec
238
- - **H200**: Fastest inference - ~$0.0050/sec
239
- - Pricing: https://modal.com/pricing
240
 
241
- ### How to Submit a Job
242
 
243
- 1. **Configure API Keys** (Settings tab):
244
- - Add HF Token (with Run Jobs permission) - **required for both platforms**
245
- - Add Modal credentials (MODAL_TOKEN_ID + MODAL_TOKEN_SECRET) - **for Modal only**
246
- - Add LLM provider keys (OpenAI, Anthropic, etc.)
247
-
248
- 2. **Create Evaluation** (New Evaluation tab):
249
- - **Select infrastructure**: HuggingFace Jobs or Modal
250
- - Choose model and agent type
251
- - Configure hardware (or use **"auto"** for smart selection)
252
- - Set timeout (default: 1h)
253
- - Click "πŸ’° Estimate Cost" to preview cost/duration
254
- - Click "Submit Evaluation"
255
-
256
- 3. **Monitor Job**:
257
- - View job ID and status in confirmation screen
258
- - **HF Jobs**: Track at https://huggingface.co/jobs or use Job Monitoring tab
259
- - **Modal**: Track at https://modal.com/apps
260
- - Results automatically appear in leaderboard when complete
261
-
262
- ### What Happens During a Job
263
-
264
- 1. Job starts on selected infrastructure (HF Jobs or Modal)
265
- 2. Docker container built with required dependencies
266
- 3. SMOLTRACE evaluates your model with OpenTelemetry tracing
267
- 4. Results uploaded to 4 HuggingFace datasets:
268
- - Leaderboard entry (summary stats)
269
- - Results dataset (test case details)
270
- - Traces dataset (OTEL spans)
271
- - Metrics dataset (GPU metrics, CO2 emissions)
272
- 5. Results appear in TraceMind leaderboard automatically
273
-
274
- **Expected Duration:**
275
- - CPU jobs (API models): 2-5 minutes
276
- - GPU jobs (local models): 15-30 minutes (includes model download)
277
 
278
- ## Configuration
 
 
 
 
279
 
280
- Create a `.env` file with the following variables:
 
 
 
 
281
 
282
- ```env
283
- # HuggingFace Configuration
284
- HF_TOKEN=your_token_here
285
 
286
- # Agent Model Configuration (for Chat Screen - Track 2)
287
- # Options: "hfapi" (default), "inference_client", "litellm"
288
- AGENT_MODEL_TYPE=hfapi
289
 
290
- # API Keys for different model types
291
- # Required if AGENT_MODEL_TYPE=litellm
292
- GEMINI_API_KEY=your_gemini_api_key_here
293
 
294
- # MCP Server URL (note: /sse endpoint for smolagents integration)
295
- MCP_SERVER_URL=https://mcp-1st-birthday-tracemind-mcp-server.hf.space/gradio_api/mcp/sse
296
 
297
- # Dataset Configuration
298
- LEADERBOARD_REPO=kshitijthakkar/smoltrace-leaderboard
 
 
 
299
 
300
- # Development Mode (optional - disables OAuth for local testing)
301
- DISABLE_OAUTH=true
302
- ```
303
 
304
- ### Agent Model Options
 
 
 
 
305
 
306
- The Agent Chat screen supports three model configurations:
307
 
308
- 1. **`hfapi` (Default)**: Uses HuggingFace Inference API
309
- - Model: `Qwen/Qwen2.5-Coder-32B-Instruct`
310
- - Requires: `HF_TOKEN`
311
- - Best for: General use, free tier available
312
 
313
- 2. **`inference_client`**: Uses Nebius provider
314
- - Model: `deepseek-ai/DeepSeek-V3-0324`
315
- - Requires: `HF_TOKEN`
316
- - Best for: Advanced reasoning, faster inference
317
 
318
- 3. **`litellm`**: Uses Google Gemini
319
- - Model: `gemini/gemini-2.5-flash`
320
- - Requires: `GEMINI_API_KEY`
321
- - Best for: Gemini-specific features
322
 
323
- ## Data Sources
 
 
 
324
 
325
- TraceMind-AI loads evaluation data from HuggingFace datasets:
 
 
 
326
 
327
- - **Leaderboard**: Aggregate statistics for all evaluation runs
328
- - **Results**: Individual test case results
329
- - **Traces**: OpenTelemetry trace data
330
- - **Metrics**: GPU metrics and performance data
331
 
332
- ## Architecture
333
 
334
- ### Project Structure
335
 
336
- ```
337
- TraceMind-AI/
338
- β”œβ”€β”€ app.py # Main Gradio application
339
- β”œβ”€β”€ data_loader.py # HuggingFace dataset integration
340
- β”œβ”€β”€ mcp_client/ # MCP client implementation
341
- β”‚ β”œβ”€β”€ client.py # Async MCP client
342
- β”‚ └── sync_wrapper.py # Synchronous wrapper
343
- β”œβ”€β”€ utils/ # Utilities
344
- β”‚ β”œβ”€β”€ auth.py # HuggingFace OAuth
345
- β”‚ └── navigation.py # Screen navigation
346
- β”œβ”€β”€ screens/ # UI screens
347
- β”œβ”€β”€ components/ # Reusable components
348
- └── styles/ # Custom CSS
349
- ```
350
 
351
- ### MCP Client Integration
352
 
353
- TraceMind-AI uses the MCP Python SDK to connect to remote MCP servers:
 
 
 
354
 
355
- ```python
356
- from mcp_client.sync_wrapper import get_sync_mcp_client
357
 
358
- # Initialize MCP client
359
- mcp_client = get_sync_mcp_client()
360
- mcp_client.initialize()
 
361
 
362
- # Call MCP tools
363
- insights = mcp_client.analyze_leaderboard(
364
- metric_focus="overall",
365
- time_range="last_week",
366
- top_n=5
367
- )
368
- ```
369
 
370
- ## Usage
 
 
 
371
 
372
- ### Viewing the Leaderboard
373
 
374
- 1. Log in with your HuggingFace account
375
- 2. Navigate to the "Leaderboard" tab
376
- 3. Click "Load Leaderboard" to fetch the latest data
377
- 4. View AI-powered insights generated by the MCP server
378
 
379
- ### Estimating Costs
 
 
 
 
 
 
 
380
 
381
- 1. Navigate to the "Cost Estimator" tab
382
- 2. Enter the model name (e.g., `openai/gpt-4`)
383
- 3. Select agent type and number of tests
384
- 4. Click "Estimate Cost" for AI-powered analysis
385
 
386
- ### Viewing Trace Details
387
 
388
- 1. Select an evaluation run from the leaderboard
389
- 2. Click on a specific test case
390
- 3. View detailed OpenTelemetry trace visualization
391
- 4. Ask questions about the trace using MCP-powered analysis
 
 
 
 
 
392
 
393
- ### Using the Agent Chat (Track 2)
394
 
395
- 1. Navigate to the "πŸ€– Agent Chat" tab
396
- 2. The autonomous agent will initialize with MCP tools from TraceMind MCP Server
397
- 3. Ask questions about agent evaluations:
398
- - "What are the top 3 performing models and their costs?"
399
- - "Estimate the cost of running 500 tests with DeepSeek-V3 on H200"
400
- - "Load the leaderboard and show me the last 5 run IDs"
401
- 4. Watch the agent plan, execute tools, and provide detailed answers
402
- 5. Enable "Show Agent Reasoning" to see step-by-step tool execution
403
- 6. Use Quick Action buttons for common queries
 
 
 
 
 
 
404
 
405
- **Example Questions:**
406
- - Analysis: "Analyze the current leaderboard and show me the top performing models with their costs"
407
- - Cost Comparison: "Compare the costs of the top 3 models - which one offers the best value?"
408
- - Recommendations: "Based on the leaderboard data, which model would you recommend for a production system?"
409
 
410
- ## Technology Stack
411
 
412
- - **UI Framework**: Gradio 5.49.1
413
- - **Agent Framework**: smolagents 1.22.0+ (Track 2)
414
- - **MCP Protocol**: MCP integration via Gradio & smolagents MCPClient
415
- - **Data**: HuggingFace Datasets API
416
- - **Authentication**: HuggingFace OAuth
417
- - **AI Models**:
418
- - Default: Qwen/Qwen2.5-Coder-32B-Instruct (HF Inference API)
419
- - Optional: DeepSeek-V3 (Nebius), Gemini 2.5 Flash
420
- - MCP Server: Google Gemini 2.5 Pro
421
 
422
- ## Development
423
 
424
- ### Running Locally
425
 
426
- ```bash
427
- # Install dependencies
428
- pip install -r requirements.txt
429
 
430
- # Set development mode (optional - disables OAuth)
431
- export DISABLE_OAUTH=true
 
 
 
 
432
 
433
- # Run the app
434
- python app.py
435
- ```
436
 
437
- ### Running on HuggingFace Spaces
 
 
 
438
 
439
- This application is configured for deployment on HuggingFace Spaces using the Gradio SDK. The `app.py` file serves as the entry point.
440
 
441
- ## Documentation
442
 
443
- For detailed implementation documentation, see:
444
- - [Data Loader API](data_loader.py) - Dataset loading and caching
445
- - [MCP Client API](mcp_client/client.py) - MCP protocol integration
446
- - [Authentication](utils/auth.py) - HuggingFace OAuth integration
447
 
448
- ## Demo Video
 
 
449
 
450
- [Link to demo video showing the application in action]
 
 
451
 
452
- ## Social Media
 
 
453
 
454
- [Link to social media post about this project]
 
 
455
 
456
- ## License
 
 
457
 
458
- AGPL-3.0 License
 
 
459
 
460
- This project is licensed under the GNU Affero General Public License v3.0. See the LICENSE file for details.
461
 
462
- ## Contributing
463
 
464
- Contributions are welcome! Please open an issue or submit a pull request.
465
 
466
- ## Built By
467
 
 
468
  **Track**: MCP in Action (Enterprise)
469
  **Author**: Kshitij Thakkar
470
- **Powered by**: MCP Servers (TraceMind-mcp-server) + Gradio
471
  **Built with**: Gradio 5.49.1 (MCP client integration)
472
 
 
 
 
 
 
473
  ---
474
 
475
- ## Acknowledgments
476
 
477
- - **MCP Team** - For the Model Context Protocol specification
478
- - **Gradio Team** - For Gradio 6 with MCP integration
479
- - **HuggingFace** - For Spaces hosting and dataset infrastructure
480
- - **Google** - For Gemini API access
481
- - **[Eliseu Silva](https://huggingface.co/elismasilva)** - For the [gradio_htmlplus](https://huggingface.co/spaces/elismasilva/gradio_htmlplus) custom component that powers our interactive leaderboard table. Eliseu's timely help and collaboration during the hackathon was invaluable!
482
 
483
- ## Links
484
 
485
- - **Live Demo**: https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind
486
- - **MCP Server**: https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind-mcp-server
487
- - **GitHub**: https://github.com/Mandark-droid/TraceMind-AI
488
- - **MCP Specification**: https://modelcontextprotocol.io
 
 
489
 
490
  ---
491
 
492
- **MCP's 1st Birthday Hackathon Submission**
493
- *Track: MCP in Action - Enterprise*
 
 
20
  # 🧠 TraceMind-AI
21
 
22
  <p align="center">
 
 
 
23
  <img src="https://raw.githubusercontent.com/Mandark-droid/TraceMind-AI/assets/Logo.png" alt="TraceMind-AI Logo" width="200"/>
24
  </p>
25
 
26
  **Agent Evaluation Platform with MCP-Powered Intelligence**
27
 
28
  [![MCP's 1st Birthday Hackathon](https://img.shields.io/badge/MCP%27s%201st%20Birthday-Hackathon-blue)](https://github.com/modelcontextprotocol)
29
+ [![Track 2: MCP in Action](https://img.shields.io/badge/Track-MCP%20in%20Action%20(Enterprise)-purple)](https://github.com/modelcontextprotocol/hackathon)
30
  [![Powered by Gradio](https://img.shields.io/badge/Powered%20by-Gradio-orange)](https://gradio.app/)
31
 
32
  > **🎯 Track 2 Submission**: MCP in Action (Enterprise)
33
  > **📅 MCP's 1st Birthday Hackathon**: November 14-30, 2025
34
 
 
 
 
 
 
35
  ---
36
 
37
+ ## Why TraceMind-AI?
38
 
39
+ **The Challenge**: Evaluating AI agents generates complex data across models, providers, and configurations. Making sense of it all is overwhelming.
 
 
 
 
 
 
 
40
 
41
+ **The Solution**: TraceMind-AI is your **intelligent agent evaluation command center**:
42
+ - 📊 **Live leaderboard** with real-time performance data
+ - 🤖 **Autonomous agent chat** powered by MCP tools
+ - 💰 **Smart cost estimation** before you run evaluations
+ - 🔍 **Deep trace analysis** to debug agent behavior
+ - ☁️ **Multi-cloud job submission** (HuggingFace Jobs + Modal)
47
 
48
+ All powered by the **Model Context Protocol** for AI-driven insights at every step.
49
 
50
+ ---
 
 
 
 
 
51
 
52
+ ## 🚀 Try It Now
53
 
54
+ - **🌐 Live Demo**: [TraceMind-AI Space](https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind)
55
+ - **πŸ› οΈ MCP Server**: [TraceMind-mcp-server](https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind-mcp-server) (Track 1)
56
+ - **πŸ“– Full Docs**: See [USER_GUIDE.md](USER_GUIDE.md) for complete walkthrough
57
+ - **🎬 MCP Server Quick Demo (5 min)**: [Watch on Loom](https://www.loom.com/share/d4d0003f06fa4327b46ba5c081bdf835)
58
+ - **πŸ“Ί MCP Server Full Demo (20 min)**: [Watch on Loom](https://www.loom.com/share/de559bb0aef749559c79117b7f951250)
59
 
60
+ ---
 
 
61
 
62
+ ## The TraceMind Ecosystem
 
 
 
63
 
64
+ TraceMind-AI is the **user-facing platform** in a complete 4-project agent evaluation ecosystem:
65
 
66
+ <p align="center">
67
+ <img src="https://raw.githubusercontent.com/Mandark-droid/TraceMind-AI/assets/TraceVerse_Logo.png" alt="TraceVerse Ecosystem" width="400"/>
68
+ <br/><br/>
69
+ </p>
70
 
 
 
 
 
71
  ```
72
+ 🔭 TraceVerde                     📊 SMOLTRACE
+ (genai_otel_instrument)           (Evaluation Engine)
+          ↓                                ↓
+     Instruments                       Evaluates
+      LLM calls                          agents
+          ↓                                ↓
+          └───────────────┬────────────────┘
+                          ↓
+                 Generates Datasets
+          (leaderboard, traces, metrics)
+                          ↓
+          ┌───────────────┴────────────────┐
+          ↓                                ↓
+ 🛠️ TraceMind MCP Server           🧠 TraceMind-AI
+ (Track 1 - Building MCP)          (This Project - Track 2)
+     Provides AI Tools               Consumes MCP Tools
+          └───────── MCP Protocol ─────────┘
  ```
90
 
91
+ ### The Foundation
 
 
 
 
92
 
93
+ **🔭 TraceVerde** - Automatic OpenTelemetry instrumentation for LLM frameworks
+ → Captures every LLM call, tool usage, and agent step
+ → [GitHub](https://github.com/Mandark-droid/genai_otel_instrument) | [PyPI](https://pypi.org/project/genai-otel-instrument)
 
 
 
 
 
96
 
97
+ **📊 SMOLTRACE** - Lightweight evaluation engine with built-in tracing
+ → Generates structured datasets (leaderboard, results, traces, metrics)
+ → [GitHub](https://github.com/Mandark-droid/SMOLTRACE) | [PyPI](https://pypi.org/project/smoltrace/)
100
 
101
+ ### The Platform
102
 
103
+ **πŸ› οΈ TraceMind MCP Server** - AI-powered analysis tools via MCP
104
+ β†’ [Live Demo](https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind-mcp-server) | [GitHub](https://github.com/Mandark-droid/TraceMind-mcp-server)
105
+ β†’ **Track 1**: Building MCP (Enterprise)
106
 
107
+ **🧠 TraceMind-AI** (This Project) - Interactive UI that consumes MCP tools
108
+ β†’ **Track 2**: MCP in Action (Enterprise)
109
 
110
+ ---
 
 
 
 
 
111
 
112
+ ## Key Features
 
 
 
113
 
114
+ ### 🎯 MCP Integration (Track 2)
115
 
116
+ TraceMind-AI demonstrates **enterprise MCP client usage** in two ways:
 
 
 
117
 
118
+ **1. Direct MCP Client Integration**
119
+ - Connects to TraceMind MCP Server via SSE transport
120
+ - Uses 5 AI-powered tools: `analyze_leaderboard`, `estimate_cost`, `debug_trace`, `compare_runs`, `analyze_results`
121
+ - Real-time insights powered by Google Gemini 2.5 Flash
122
 
123
+ **2. Autonomous Agent with MCP Tools**
124
+ - Built with `smolagents` framework
125
+ - Agent has access to all MCP server tools
126
+ - Natural language queries → autonomous tool execution
127
+ - Example: *"What are the top 3 models and how much do they cost?"*
128
 
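+ A minimal sketch of the agent wiring (assumes the smolagents `MCPClient` API; the project's actual setup lives in `agent/smolagents_setup.py`):
+
+ ```python
+ from smolagents import CodeAgent, InferenceClientModel, MCPClient
+
+ # Connect to the remote MCP server over SSE and hand its tools to the agent
+ mcp = MCPClient({"url": "https://mcp-1st-birthday-tracemind-mcp-server.hf.space/gradio_api/mcp/sse"})
+ agent = CodeAgent(tools=mcp.get_tools(), model=InferenceClientModel())
+
+ print(agent.run("What are the top 3 models and how much do they cost?"))
+ ```
+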
129
+ ### 📊 Agent Evaluation Features
 
 
 
130
 
131
+ - **Live Leaderboard**: View all evaluation runs with sortable metrics
132
+ - **Cost Estimation**: Auto-select hardware and predict costs before running
133
+ - **Trace Visualization**: Deep-dive into OpenTelemetry traces with GPU metrics
134
+ - **Multi-Cloud Jobs**: Submit evaluations to HuggingFace Jobs or Modal
135
+ - **Performance Analytics**: GPU utilization, CO2 emissions, token tracking
 
136
 
137
+ ### 💡 Smart Features
138
 
139
+ - **Auto Hardware Selection**: Based on model size and provider
140
+ - **Real-time Job Monitoring**: Track HuggingFace Jobs status
141
+ - **Agent Reasoning Visibility**: See step-by-step tool execution
142
+ - **Quick Action Buttons**: One-click common queries
143
 
144
+ ---
 
 
 
 
 
145
 
146
+ ## Quick Start
147
 
148
+ ### Option 1: Use the Live Demo (Recommended)
 
 
149
 
150
+ 1. **Visit**: https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind
151
+ 2. **Login**: Sign in with your HuggingFace account
152
+ 3. **Explore**: Browse the leaderboard, chat with the agent, visualize traces
153
 
154
+ ### Option 2: Run Locally
155
 
156
+ ```bash
157
+ # Clone and setup
158
+ git clone https://github.com/Mandark-droid/TraceMind-AI.git
159
+ cd TraceMind-AI
160
+ pip install -r requirements.txt
161
 
162
+ # Configure environment
163
+ cp .env.example .env
164
+ # Edit .env with your API keys (see Configuration section)
 
 
 
 
165
 
166
+ # Run the app
167
+ python app.py
168
+ ```
169
 
170
+ Visit http://localhost:7860
 
 
 
171
 
172
+ ---
 
 
173
 
174
+ ## Configuration
175
 
176
+ ### For Viewing (Free)
 
 
 
177
 
178
+ **Required**:
179
+ - HuggingFace account (free)
180
+ - HuggingFace token with **Read** permissions
181
 
182
+ ### For Submitting Jobs (Paid)
183
 
184
+ **Required**:
185
+ - ⚠️ **HuggingFace Pro** ($9/month) with credit card
186
+ - HuggingFace token with **Read + Write + Run Jobs** permissions
187
+ - LLM provider API keys (OpenAI, Anthropic, etc.)
 
 
188
 
189
+ **Optional (Modal Alternative)**:
190
+ - Modal account (pay-per-second, no subscription)
191
+ - Modal API token (MODAL_TOKEN_ID + MODAL_TOKEN_SECRET)
 
 
 
192
 
193
+ ### Using Your Own API Keys (Recommended for Judges)
194
 
195
+ To prevent rate limits during evaluation:
 
 
 
 
 
196
 
197
+ **Step 1: Configure MCP Server** (Required for AI tools)
198
+ 1. Visit: https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind-mcp-server
199
+ 2. Go to **⚙️ Settings** tab
200
+ 3. Enter: **Gemini API Key** + **HuggingFace Token**
201
+ 4. Click **"Save & Override Keys"**
202
 
203
+ **Step 2: Configure TraceMind-AI** (Optional)
204
+ 1. Visit: https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind
205
+ 2. Go to **⚙️ Settings** tab
206
+ 3. Enter: **Gemini API Key** + **HuggingFace Token**
207
+ 4. Click **"Save API Keys"**
208
 
209
+ **Get Free API Keys**:
210
+ - **Gemini**: https://ai.google.dev/ (1,500 requests/day)
211
+ - **HuggingFace**: https://huggingface.co/settings/tokens (unlimited for public datasets)
212
 
213
+ ---
 
 
214
 
215
+ ## For Hackathon Judges
 
 
216
 
217
+ ### ✅ Track 2 Compliance
 
218
 
219
+ - **MCP Client Integration**: Connects to remote MCP server via SSE transport
220
+ - **Autonomous Agent**: `smolagents` agent with MCP tool access
221
+ - **Enterprise Focus**: Cost optimization, job submission, performance analytics
222
+ - **Production-Ready**: Deployed to HuggingFace Spaces with OAuth authentication
223
+ - **Real Data**: Live HuggingFace datasets from SMOLTRACE evaluations
224
 
225
+ ### 🎯 Key Innovations
 
 
226
 
227
+ 1. **Dual MCP Integration**: Both direct MCP client + autonomous agent with MCP tools
228
+ 2. **Multi-Cloud Support**: HuggingFace Jobs + Modal for serverless compute
229
+ 3. **Auto Hardware Selection**: Smart hardware recommendations based on model size
230
+ 4. **Complete Ecosystem**: Part of 4-project platform demonstrating full evaluation workflow
231
+ 5. **Agent Reasoning Visibility**: See step-by-step MCP tool execution
232
 
233
+ ### 📹 Demo Materials
234
 
235
+ - **🎥 Demo Video**: [Coming Soon - Link to walkthrough]
+ - **📢 Social Post**: [Coming Soon - Link to announcement]
 
 
237
 
238
+ ### 🧪 Testing Suggestions
 
 
 
239
 
240
+ **1. Try the Agent Chat** (🤖 Agent Chat tab):
241
+ - "Analyze the current leaderboard and show me the top 5 models"
242
+ - "Compare the costs of the top 3 models"
243
+ - "Estimate the cost of running 100 tests with GPT-4"
244
 
245
+ **2. Explore the Leaderboard** (📊 Leaderboard tab):
246
+ - Click "Load Leaderboard" to see live data
247
+ - Read the AI-generated insights (powered by MCP server)
248
+ - Click on a run to see detailed test results
249
 
250
+ **3. Visualize Traces** (Select a run → View traces):
251
+ - See OpenTelemetry waterfall diagrams
252
+ - View GPU metrics overlay (for GPU jobs)
253
+ - Ask questions about the trace (MCP-powered debugging)
254
 
255
+ ---
 
 
 
256
 
257
+ ## What Can You Do?
258
 
259
+ ### 📊 View & Analyze
260
 
261
+ - **Browse leaderboard** with AI-powered insights
262
+ - **Compare models** side-by-side across metrics
263
+ - **Analyze traces** with interactive visualization
264
+ - **Ask questions** via autonomous agent
 
 
 
 
 
 
 
 
 
 
265
 
266
+ ### 💰 Estimate & Plan
267
 
268
+ - **Get cost estimates** before running evaluations
269
+ - **Compare hardware options** (CPU vs GPU tiers)
270
+ - **Preview duration** and CO2 emissions
271
+ - **See recommendations** from AI analysis
272
 
273
+ ### 🚀 Submit & Monitor
 
274
 
275
+ - **Submit evaluation jobs** to HuggingFace or Modal
276
+ - **Track job status** in real-time
277
+ - **View results** automatically when complete
278
+ - **Download datasets** for further analysis
279
 
280
+ ### 🧪 Generate & Customize
 
 
 
 
 
 
281
 
282
+ - **Generate synthetic datasets** for custom domains and tools
283
+ - **Create prompt templates** optimized for your use case
284
+ - **Push to HuggingFace Hub** with one click
285
+ - **Test evaluations** without writing code
286
 
287
+ ---
288
 
289
+ ## Documentation
 
 
 
290
 
291
+ **For quick evaluation**:
292
+ - Read this README for overview
293
+ - Visit the [Live Demo](https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind) to try it
294
+ - Check out the **🤖 Agent Chat** tab for autonomous MCP usage
295
+
296
+ **For deep dives**:
297
+ - [USER_GUIDE.md](USER_GUIDE.md) - Complete screen-by-screen walkthrough
298
+ - Leaderboard tab usage
299
+ - Agent chat interactions
300
+ - Synthetic data generator
301
+ - Job submission workflow
302
+ - Trace visualization guide
303
+ - [MCP_INTEGRATION.md](MCP_INTEGRATION.md) - MCP client architecture
304
+ - How TraceMind-AI connects to MCP server
305
+ - Agent framework integration (smolagents)
306
+ - MCP tool usage examples
307
+ - [JOB_SUBMISSION.md](JOB_SUBMISSION.md) - Evaluation job guide
308
+ - HuggingFace Jobs setup
309
+ - Modal integration
310
+ - Hardware selection guide
311
+ - Cost optimization tips
312
+ - [ARCHITECTURE.md](ARCHITECTURE.md) - Technical architecture
313
+ - Project structure
314
+ - Data flow
315
+ - Authentication
316
+ - Deployment
317
 
318
+ ---
 
 
 
319
 
320
+ ## Technology Stack
321
 
322
+ - **UI Framework**: Gradio 5.49.1
323
+ - **Agent Framework**: smolagents 1.22.0+
324
+ - **MCP Integration**: MCP Python SDK + smolagents MCPClient
325
+ - **Data Source**: HuggingFace Datasets API
326
+ - **Authentication**: HuggingFace OAuth
327
+ - **AI Models**:
328
+ - Agent: Qwen/Qwen2.5-Coder-32B-Instruct (HF API)
329
+ - MCP Server: Google Gemini 2.5 Flash
330
+ - **Cloud Platforms**: HuggingFace Jobs + Modal
331
 
332
+ ---
333
 
334
+ ## Example Workflows
335
+
336
+ ### Workflow 1: Quick Analysis
337
+ 1. Open TraceMind-AI
338
+ 2. Go to **🤖 Agent Chat**
339
+ 3. Click **"Quick: Top Models"**
340
+ 4. See agent fetch leaderboard and analyze top performers
341
+ 5. Ask follow-up: *"Which one is most cost-effective?"*
342
+
343
+ ### Workflow 2: Submit Evaluation Job
344
+ 1. Go to **⚙️ Settings** → Configure API keys
+ 2. Go to **🚀 New Evaluation**
346
+ 3. Select model (e.g., `meta-llama/Llama-3.1-8B`)
347
+ 4. Choose infrastructure (HuggingFace Jobs or Modal)
348
+ 5. Click **"πŸ’° Estimate Cost"** to preview
349
+ 6. Click **"Submit Evaluation"**
350
+ 7. Monitor job in **📈 Job Monitoring** tab
351
+ 8. View results in leaderboard when complete
352
+
353
+ ### Workflow 3: Debug Agent Behavior
354
+ 1. Browse **📊 Leaderboard**
355
+ 2. Click on a run with failures
356
+ 3. View **detailed test results**
357
+ 4. Click on a failed test to see trace
358
+ 5. Use MCP-powered Q&A: *"Why did this test fail?"*
359
+ 6. Get AI analysis of the execution trace
360
+
361
+ ### Workflow 4: Generate Custom Test Dataset
362
+ 1. Go to **🔬 Synthetic Data Generator**
363
+ 2. Configure:
364
+ - Domain: `finance`
365
+ - Tools: `get_stock_price,calculate_profit,send_alert`
366
+ - Number of tasks: `20`
367
+ - Difficulty: `balanced`
368
+ 3. Click **"Generate Dataset"**
369
+ 4. Review generated tasks and prompt template
370
+ 5. Enter repository name: `yourname/smoltrace-finance-tasks`
371
+ 6. Click **"Push to HuggingFace Hub"**
372
+ 7. Use your custom dataset in evaluations
373
 
374
+ ---
 
 
 
375
 
376
+ ## Screenshots
377
 
378
+ *See [SCREENSHOTS.md](SCREENSHOTS.md) for annotated screenshots of all screens*
 
 
 
 
 
 
 
 
379
 
380
+ ---
381
 
382
+ ## 🔗 Quick Links
383
 
384
+ ### 📦 Component Links
 
 
385
 
386
+ | Component | Description | Links |
387
+ |-----------|-------------|-------|
388
+ | **TraceVerde** | OTEL Instrumentation | [GitHub](https://github.com/Mandark-droid/genai_otel_instrument) • [PyPI](https://pypi.org/project/genai-otel-instrument) |
+ | **SMOLTRACE** | Evaluation Engine | [GitHub](https://github.com/Mandark-droid/SMOLTRACE) • [PyPI](https://pypi.org/project/smoltrace/) |
+ | **MCP Server** | Building MCP (Track 1) | [HF Space](https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind-mcp-server) • [GitHub](https://github.com/Mandark-droid/TraceMind-mcp-server) |
+ | **TraceMind-AI** | MCP in Action (Track 2) | [HF Space](https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind) • [GitHub](https://github.com/Mandark-droid/TraceMind-AI) |
392
 
393
+ ### 📢 Community Posts
 
 
394
 
395
+ - 🎉 [**TraceMind Teaser**](https://www.linkedin.com/posts/kshitij-thakkar-2061b924_mcpsfirstbirthdayhackathon-mcpsfirstbirthdayhackathon-activity-7395686529270013952-g_id) - MCP's 1st Birthday Hackathon announcement
+ - 📊 [**SMOLTRACE Launch**](https://www.linkedin.com/posts/kshitij-thakkar-2061b924_ai-machinelearning-llm-activity-7394350375908126720-im_T) - Lightweight agent evaluation engine
+ - 🔭 [**TraceVerde Launch**](https://www.linkedin.com/posts/kshitij-thakkar-2061b924_genai-opentelemetry-observability-activity-7390339855135813632-wqEg) - Zero-code OTEL instrumentation for LLMs
+ - 🙏 [**TraceVerde 3K Downloads**](https://www.linkedin.com/posts/kshitij-thakkar-2061b924_thank-you-open-source-community-a-week-activity-7392205780592132096-nu6U) - Thank you to the community!
399
 
400
+ ---
401
 
402
+ ## πŸ—ΊοΈ Future Roadmap
403
 
404
+ We're committed to making TraceMind the most comprehensive agent evaluation platform. Here's what's coming next:
 
 
 
405
 
406
+ ### 1. πŸ—οΈ Dynamic MCP Server Generator
407
+ Generate domain-specific MCP servers on-the-fly with custom tools via AI code generation.
408
+ **Use case**: Rapidly prototype MCP servers without writing boilerplate code.
409
 
410
+ ### 2. 🎯 Intelligent Model Router
411
+ Automatically select optimal models based on real-time leaderboard data, budget constraints, and accuracy requirements.
412
+ **Use case**: Optimize evaluation costs while maintaining quality for large-scale continuous evaluation.
413
 
414
+ ### 3. 🔬 Automated A/B Testing Framework
415
+ Compare multiple agent configurations with statistical significance testing and automatic winner selection.
416
+ **Use case**: Find optimal agent configuration scientifically before production deployment.
417
 
418
+ ### 4. 👥 Collaborative Evaluation Workspace
419
+ Real-time collaboration with shared runs, team comments, cost budgets, and stakeholder reports.
420
+ **Use case**: Streamline team workflows and coordinate evaluation efforts across distributed teams.
421
 
422
+ ### 5. 🔄 CI/CD Pipeline Integration
423
+ Automated agent evaluation on every PR with GitHub Actions, result comments, and merge blocking on quality drops.
424
+ **Use case**: Catch agent performance regressions before production and maintain quality standards automatically.
425
 
426
+ ### 6. 🧰 Integrated SMOLTRACE CLI Features
427
+ Bring all SMOLTRACE CLI tools into the UI: clean, copy, distill, merge, export, validate, anonymize datasets.
428
+ **Use case**: Manage evaluation datasets efficiently without command-line, with visual preview and undo capabilities.
429
 
430
+ ---
431
 
432
+ **Implementation Timeline**: Q1-Q4 2026 | **Want to contribute?** Join our community and help shape the future of agent evaluation!
433
 
434
+ ---
435
 
436
+ ## Credits
437
 
438
+ **Built for**: MCP's 1st Birthday Hackathon (Nov 14-30, 2025)
439
  **Track**: MCP in Action (Enterprise)
440
  **Author**: Kshitij Thakkar
441
+ **Powered by**: TraceMind MCP Server + Gradio + smolagents
442
  **Built with**: Gradio 5.49.1 (MCP client integration)
443
 
444
+ **Special Thanks**:
445
+ - **[Eliseu Silva](https://huggingface.co/elismasilva)** - For the [gradio_htmlplus](https://huggingface.co/spaces/elismasilva/gradio_htmlplus) custom component that powers our interactive leaderboard table. Eliseu's timely help and collaboration during the hackathon were invaluable!
446
+
447
+ **Sponsors**: HuggingFace • Google Gemini • Modal • Anthropic • Gradio • ElevenLabs • SambaNova • Blaxel
448
+
449
  ---
450
 
451
+ ## License
452
 
453
+ AGPL-3.0 - See [LICENSE](LICENSE) for details
 
 
 
 
454
 
455
+ ---
456
 
457
+ ## Support
458
+
459
+ - 📧 GitHub Issues: [TraceMind-AI/issues](https://github.com/Mandark-droid/TraceMind-AI/issues)
+ - 💬 HF Discord: `#mcp-1st-birthday-official🏆`
461
+ - 🏷️ Tag: `mcp-in-action-track-enterprise`
462
+ - 🐦 Twitter: [@TraceMindAI](https://twitter.com/TraceMindAI) (placeholder)
463
 
464
  ---
465
 
466
+ **Ready to evaluate your agents with AI-powered intelligence?**
467
+
468
+ 🌐 **Try the live demo**: https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind
USER_GUIDE.md ADDED
@@ -0,0 +1,1026 @@
 
 
1
+ # TraceMind-AI - Complete User Guide
2
+
3
+ This guide provides a comprehensive walkthrough of all features and screens in TraceMind-AI.
4
+
5
+ ## Table of Contents
6
+
7
+ - [Getting Started](#getting-started)
8
+ - [Screen-by-Screen Guide](#screen-by-screen-guide)
9
+   - [📊 Leaderboard](#-leaderboard)
+   - [🤖 Agent Chat](#-agent-chat)
+   - [🚀 New Evaluation](#-new-evaluation)
+   - [📈 Job Monitoring](#-job-monitoring)
+   - [🔍 Trace Visualization](#-trace-visualization)
+   - [🔬 Synthetic Data Generator](#-synthetic-data-generator)
+   - [⚙️ Settings](#️-settings)
16
+ - [Common Workflows](#common-workflows)
17
+ - [Troubleshooting](#troubleshooting)
18
+
19
+ ---
20
+
21
+ ## Getting Started
22
+
23
+ ### First-Time Setup
24
+
25
+ 1. **Visit** https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind
26
+ 2. **Sign in** with your HuggingFace account (required for viewing)
27
+ 3. **Configure API keys** (optional but recommended):
28
+    - Go to **⚙️ Settings** tab
+    - Enter Gemini API Key and HuggingFace Token
+    - Click **"Save API Keys"**
31
+
32
+ ### Navigation
33
+
34
+ TraceMind-AI is organized into tabs:
35
+ - **📊 Leaderboard**: View evaluation results with AI insights
+ - **🤖 Agent Chat**: Interactive autonomous agent powered by MCP tools
+ - **🚀 New Evaluation**: Submit evaluation jobs to HF Jobs or Modal
+ - **📈 Job Monitoring**: Track status of submitted jobs
+ - **🔍 Trace Visualization**: Deep-dive into agent execution traces
+ - **🔬 Synthetic Data Generator**: Create custom test datasets with AI
+ - **⚙️ Settings**: Configure API keys and preferences
42
+
43
+ ---
44
+
45
+ ## Screen-by-Screen Guide
46
+
47
+ ### 📊 Leaderboard
48
+
49
+ **Purpose**: Browse all evaluation runs with AI-powered insights and detailed analysis.
50
+
51
+ #### Features
52
+
53
+ **Main Table**:
54
+ - View all evaluation runs from the SMOLTRACE leaderboard
55
+ - Sortable columns: Model, Success Rate, Cost, Duration, CO2 emissions
56
+ - Click any row to see detailed test results
57
+
58
+ **AI Insights Panel** (Top of screen):
59
+ - Automatically generated insights from MCP server
60
+ - Powered by Google Gemini 2.5 Flash
61
+ - Updates when you click "Load Leaderboard"
62
+ - Shows top performers, trends, and recommendations
63
+
64
+ **Filter & Sort Options**:
65
+ - Filter by agent type (tool, code, both)
66
+ - Filter by provider (litellm, transformers)
67
+ - Sort by any metric (success rate, cost, duration)
68
+
69
+ #### How to Use
70
+
71
+ 1. **Load Data**:
72
+ ```
73
+ Click "Load Leaderboard" button
74
+ → Fetches latest evaluation runs from HuggingFace
+ → AI generates insights automatically
76
+ ```
77
+
78
+ 2. **Read AI Insights**:
79
+ - Located at top of screen
80
+ - Summary of evaluation trends
81
+ - Top performing models
82
+ - Cost/accuracy trade-offs
83
+ - Actionable recommendations
84
+
85
+ 3. **Explore Runs**:
86
+ - Scroll through table
87
+ - Sort by clicking column headers
88
+ - Click on any run to see details
89
+
90
+ 4. **View Details**:
91
+ ```
92
+ Click a row in the table
93
+ → Opens detail view with:
94
+ - All test cases (success/failure)
95
+ - Execution times
96
+ - Cost breakdown
97
+ - Link to trace visualization
98
+ ```
99
+
100
+ #### Example Workflow
101
+
102
+ ```
103
+ Scenario: Find the most cost-effective model for production
104
+
105
+ 1. Click "Load Leaderboard"
106
+ 2. Read AI insights: "Llama-3.1-8B offers best cost/performance at $0.002/run"
107
+ 3. Sort table by "Cost" (ascending)
108
+ 4. Compare top 3 cheapest models
109
+ 5. Click on Llama-3.1-8B run to see detailed results
110
+ 6. Review success rate (93.4%) and test case breakdowns
111
+ 7. Decision: Use Llama-3.1-8B for cost-sensitive workloads
112
+ ```
113
+
114
+ #### Tips
115
+
116
+ - **Refresh regularly**: Click "Load Leaderboard" to see new evaluation results
117
+ - **Compare models**: Use the sort function to compare across different metrics
118
+ - **Trust the AI**: The insights panel provides strategic recommendations based on all data
119
+
120
+ ---
121
+
122
+ ### 🤖 Agent Chat
123
+
124
+ **Purpose**: Interactive autonomous agent that can answer questions about evaluations using MCP tools.
125
+
126
+ **🎯 Track 2 Feature**: This demonstrates MCP client usage with smolagents framework.
127
+
128
+ #### Features
129
+
130
+ **Autonomous Agent**:
131
+ - Built with `smolagents` framework
132
+ - Has access to all TraceMind MCP Server tools
133
+ - Plans and executes multi-step actions
134
+ - Provides detailed, data-driven answers
135
+
136
+ **MCP Tools Available to Agent**:
137
+ - `analyze_leaderboard` - Get AI insights about top performers
138
+ - `estimate_cost` - Calculate evaluation costs before running
139
+ - `debug_trace` - Analyze execution traces
140
+ - `compare_runs` - Compare two evaluation runs
141
+ - `get_top_performers` - Fetch top N models efficiently
142
+ - `get_leaderboard_summary` - Get high-level statistics
143
+ - `get_dataset` - Load SMOLTRACE datasets
144
+ - `analyze_results` - Analyze detailed test results
145
+
146
+ **Agent Reasoning Visibility**:
147
+ - Toggle **"Show Agent Reasoning"** to see:
148
+ - Planning steps
149
+ - Tool execution logs
150
+ - Intermediate results
151
+ - Final synthesis
152
+
153
+ **Quick Action Buttons**:
154
+ - **"Quick: Top Models"**: Get top 5 models with costs
155
+ - **"Quick: Cost Estimate"**: Estimate cost for a model
156
+ - **"Quick: Load Leaderboard"**: Fetch leaderboard summary
157
+
158
+ #### How to Use
159
+
160
+ 1. **Start a Conversation**:
161
+ ```
162
+ Type your question in the chat box
163
+ Example: "What are the top 3 performing models and how much do they cost?"
164
+
165
+ Click "Send"
166
+ → Agent plans approach
+ → Executes MCP tools
+ → Returns synthesized answer
169
+ ```
170
+
171
+ 2. **Watch Agent Work** (optional):
172
+ ```
173
+ Enable "Show Agent Reasoning" checkbox
174
+ → See step-by-step execution:
175
+ Step 1: Calling get_top_performers with metric=success_rate, top_n=3
176
+ Step 2: Analyzing costs from results
177
+ Step 3: Formatting response with recommendations
178
+ ```
179
+
180
+ 3. **Ask Follow-Up Questions**:
181
+ ```
182
+ You: "Which one is most cost-effective?"
183
+ Agent: Uses previous context + MCP tools to answer
184
+ ```
185
+
186
+ 4. **Use Quick Actions**:
187
+ ```
188
+ Click "Quick: Top Models" button
189
+ → Agent automatically executes query
+ → Returns top 5 models with metrics
191
+ ```
192
+
193
+ #### Example Conversations
194
+
195
+ **Example 1: Cost Comparison**
196
+ ```
197
+ User: "Compare the costs of the top 3 models. Which one offers the best value?"
198
+
199
+ Agent:
200
+ Let me analyze the top 3 models by success rate:
201
+
202
+ [Calls: get_top_performers(metric="success_rate", top_n=3)]
203
+
204
+ Top 3 Models:
205
+ 1. GPT-4: 95.8% success, $0.05/run
206
+ 2. Claude-3: 94.1% success, $0.04/run
207
+ 3. Llama-3.1-8B: 93.4% success, $0.002/run
208
+
209
+ [Calls: analyze_leaderboard(metric_focus="cost")]
210
+
211
+ Cost Analysis:
212
+ - GPT-4: Highest accuracy but 25x more expensive than Llama
213
+ - Claude-3: Good middle ground at $0.04/run
214
+ - Llama-3.1: Best value at $0.002/run with only 2.4% accuracy drop
215
+
216
+ Recommendation for "best value":
217
+ → Llama-3.1-8B offers 93.4% accuracy at $0.002/run
+ → For 1M runs: Saves $48,000/month vs GPT-4
+ → Only 2.4% accuracy trade-off
220
+ ```
221
+
222
+ **Example 2: Evaluation Planning**
223
+ ```
224
+ User: "I want to evaluate DeepSeek-V3 with 100 tests. How much will it cost and which hardware should I use?"
225
+
226
+ Agent:
227
+ Let me estimate costs for DeepSeek-V3:
228
+
229
+ [Calls: estimate_cost(model="deepseek-ai/DeepSeek-V3", num_tests=100, hardware="auto")]
230
+
231
+ Cost Estimate:
232
+ - Model: deepseek-ai/DeepSeek-V3
233
+ - Tests: 100
234
+ - Recommended hardware: GPU A100 (70B+ parameters)
235
+
236
+ Breakdown:
237
+ - LLM inference cost: ~$0.01/test (self-hosted, no API fees)
238
+ - HF Jobs GPU cost: ~$3.00/hr for A100
239
+ - Estimated duration: 25 minutes
240
+ - Total compute cost: $1.25
241
+ - Total cost: $1.25 (100 tests)
242
+
243
+ Recommendation:
244
+ Use HuggingFace Jobs with A100 GPU. Total cost: $1.25 for 100 tests.
245
+ Alternative: Modal H200 for 2x faster inference at $2.50.
246
+ ```
247
+
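+ (Sanity check on the arithmetic above: 25 minutes on an A100 at ~$3.00/hr is 25/60 × $3.00 ≈ $1.25, which matches the quoted total.)
+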
248
+ #### Tips
249
+
250
+ - **Be specific**: Ask clear, focused questions for better answers
251
+ - **Use context**: Agent remembers conversation history
252
+ - **Watch reasoning**: Enable to understand how agent uses MCP tools
253
+ - **Try quick actions**: Fast way to get common information
254
+
255
+ ---
256
+
257
+ ### 🚀 New Evaluation
258
+
259
+ **Purpose**: Submit evaluation jobs to HuggingFace Jobs or Modal for running SMOLTRACE evaluations.
260
+
261
+ **⚠️ Requires**: HuggingFace Pro account ($9/month) with credit card, or Modal account.
262
+
263
+ #### Features
264
+
265
+ **Model Selection**:
266
+ - Enter any model name (format: `provider/model-name`)
267
+ - Examples: `openai/gpt-4`, `meta-llama/Llama-3.1-8B`, `deepseek-ai/DeepSeek-V3`
268
+ - Auto-detects if API model or local model
269
+
270
+ **Infrastructure Choice**:
271
+ - **HuggingFace Jobs**: Managed compute (H200, A100, A10, T4, CPU)
272
+ - **Modal**: Serverless GPU compute (pay-per-second)
273
+
274
+ **Hardware Selection**:
275
+ - **Auto** (recommended): Automatically selects optimal hardware based on model size
276
+ - **Manual**: Choose specific GPU tier (A10, A100, H200) or CPU
277
+
278
+ **Cost Estimation**:
279
+ - Click **"πŸ’° Estimate Cost"** before submitting
280
+ - Shows predicted:
281
+ - LLM API costs (for API models)
282
+ - Compute costs (for local models)
283
+ - Duration estimate
284
+ - CO2 emissions
285
+
286
+ **Agent Type**:
287
+ - **tool**: Test tool-calling capabilities
288
+ - **code**: Test code generation capabilities
289
+ - **both**: Test both (recommended)
290
+
291
+ #### How to Use
292
+
293
+ **Step 1: Configure Prerequisites** (One-time setup)
294
+
295
+ For **HuggingFace Jobs**:
296
+ ```
297
+ 1. Sign up for HF Pro: https://huggingface.co/pricing ($9/month)
298
+ 2. Add credit card for compute charges
299
+ 3. Create HF token with "Read + Write + Run Jobs" permissions
300
+ 4. Go to Settings tab → Enter HF token → Save
301
+ ```
302
+
303
+ For **Modal** (Alternative):
304
+ ```
305
+ 1. Sign up: https://modal.com (free tier available)
306
+ 2. Generate API token: https://modal.com/settings/tokens
307
+ 3. Go to Settings tab → Enter MODAL_TOKEN_ID + MODAL_TOKEN_SECRET → Save
308
+ ```
309
+
310
+ For **API Models** (OpenAI, Anthropic, etc.):
311
+ ```
312
+ 1. Get API key from provider (e.g., https://platform.openai.com/api-keys)
313
+ 2. Go to Settings tab → Enter provider API key → Save
314
+ ```
315
+
316
+ **Step 2: Create Evaluation**
317
+
318
+ ```
319
+ 1. Enter model name:
320
+ Example: "meta-llama/Llama-3.1-8B"
321
+
322
+ 2. Select infrastructure:
323
+ - HuggingFace Jobs (default)
324
+ - Modal (alternative)
325
+
326
+ 3. Choose agent type:
327
+ - "both" (recommended)
328
+
329
+ 4. Select hardware:
330
+ - "auto" (recommended - smart selection)
331
+ - Or choose manually: cpu-basic, t4-small, a10g-small, a100-large, h200
332
+
333
+ 5. Set timeout (optional):
334
+ - Default: 3600s (1 hour)
335
+ - Range: 300s - 7200s
336
+
337
+ 6. Click "πŸ’° Estimate Cost":
338
+ β†’ Shows predicted cost and duration
339
+ β†’ Example: "$2.00, 20 minutes, 0.5g CO2"
340
+
341
+ 7. Review estimate, then click "Submit Evaluation"
342
+ ```
343
+
344
+ **Step 3: Monitor Job**
345
+
346
+ ```
347
+ After submission:
348
+ β†’ Job ID displayed
349
+ β†’ Go to "πŸ“ˆ Job Monitoring" tab to track progress
350
+ β†’ Or visit HuggingFace Jobs dashboard: https://huggingface.co/jobs
351
+ ```
352
+
353
+ **Step 4: View Results**
354
+
355
+ ```
356
+ When job completes:
357
+ β†’ Results automatically uploaded to HuggingFace datasets
358
+ β†’ Appears in Leaderboard within 1-2 minutes
359
+ β†’ Click on your run to see detailed results
360
+ ```
361
+
362
+ #### Hardware Selection Guide
363
+
364
+ **For API Models** (OpenAI, Anthropic, Google):
365
+ - Use: `cpu-basic` (HF Jobs) or CPU (Modal)
366
+ - Cost: ~$0.05/hr (HF), ~$0.0001/sec (Modal)
367
+ - Why: No GPU needed for API calls
368
+
369
+ **For Small Models** (4B-8B parameters):
370
+ - Use: `t4-small` (HF) or A10G (Modal)
371
+ - Cost: ~$0.60/hr (HF), ~$0.0006/sec (Modal)
372
+ - Examples: Llama-3.1-8B, Mistral-7B
373
+
374
+ **For Medium Models** (9B-15B parameters):
375
+ - Use: `a10g-small` (HF) or A10G (Modal)
376
+ - Cost: ~$1.10/hr (HF), ~$0.0006/sec (Modal)
377
+ - Examples: Qwen2.5-14B, Mixtral-8x7B (β‰ˆ13B active parameters)
378
+
379
+ **For Large Models** (70B+ parameters):
380
+ - Use: `a100-large` (HF) or A100-80GB (Modal)
381
+ - Cost: ~$3.00/hr (HF), ~$0.0030/sec (Modal)
382
+ - Examples: Llama-3.1-70B, DeepSeek-V3
383
+
384
+ **For Fastest Inference**:
385
+ - Use: `h200` (HF or Modal)
386
+ - Cost: ~$5.00/hr (HF), ~$0.0050/sec (Modal)
387
+ - Best for: Time-sensitive evaluations, large batches
388
+
389
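+ The tiers above boil down to a size-based lookup. Here is a minimal sketch of what "auto" selection might look like (illustrative only; the real selector may weigh additional factors):
+
+ ```python
+ from typing import Optional
+
+ def auto_hardware(param_count_b: Optional[float]) -> str:
+     """Map model size (billions of parameters) to an HF Jobs flavor.
+     None means an API model, which needs no GPU."""
+     if param_count_b is None:
+         return "cpu-basic"    # API models (OpenAI, Anthropic, Google)
+     if param_count_b <= 8:
+         return "t4-small"     # e.g. Llama-3.1-8B, Mistral-7B
+     if param_count_b <= 15:
+         return "a10g-small"   # e.g. Qwen2.5-14B
+     return "a100-large"       # 70B+ models, e.g. Llama-3.1-70B
+
+ print(auto_hardware(None))  # cpu-basic
+ print(auto_hardware(70))    # a100-large
+ ```
+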
+ #### Example Workflows
390
+
391
+ **Workflow 1: Evaluate API Model (OpenAI GPT-4)**
392
+ ```
393
+ 1. Model: "openai/gpt-4"
394
+ 2. Infrastructure: HuggingFace Jobs
395
+ 3. Agent type: both
396
+ 4. Hardware: auto (selects cpu-basic)
397
+ 5. Estimate: $50.00 (mostly API costs), 45 min
398
+ 6. Submit β†’ Monitor β†’ View in leaderboard
399
+ ```
400
+
401
+ **Workflow 2: Evaluate Local Model (Llama-3.1-8B)**
402
+ ```
403
+ 1. Model: "meta-llama/Llama-3.1-8B"
404
+ 2. Infrastructure: Modal (for pay-per-second billing)
405
+ 3. Agent type: both
406
+ 4. Hardware: auto (selects A10G)
407
+ 5. Estimate: $0.20, 15 min
408
+ 6. Submit β†’ Monitor β†’ View in leaderboard
409
+ ```
410
+
411
+ #### Tips
412
+
413
+ - **Always estimate first**: Prevents surprise costs
414
+ - **Use "auto" hardware**: Smart selection based on model size
415
+ - **Start small**: Test with 10-20 tests before scaling to 100+
416
+ - **Monitor jobs**: Check Job Monitoring tab for status
417
+ - **Modal for experimentation**: Pay-per-second is cost-effective for testing
418
+
419
+ ---
420
+
421
+ ### πŸ“ˆ Job Monitoring
422
+
423
+ **Purpose**: Track status of submitted evaluation jobs.
424
+
425
+ #### Features
426
+
427
+ **Job Status Display**:
428
+ - Job ID
429
+ - Current status (pending, running, completed, failed)
430
+ - Start time
431
+ - Duration
432
+ - Infrastructure (HF Jobs or Modal)
433
+
434
+ **Real-time Updates**:
435
+ - Auto-refreshes every 30 seconds
436
+ - Manual refresh button
437
+
438
+ **Job Actions**:
439
+ - View logs
440
+ - Cancel job (if still running)
441
+ - View results (if completed)
442
+
443
+ #### How to Use
444
+
445
+ ```
446
+ 1. Go to "πŸ“ˆ Job Monitoring" tab
447
+ 2. See list of your submitted jobs
448
+ 3. Click "Refresh" for latest status
449
+ 4. When status = "completed":
450
+ β†’ Click "View Results"
451
+ β†’ Opens leaderboard filtered to your run
452
+ ```
453
+
454
+ #### Job Statuses
455
+
456
+ - **Pending**: Job queued, waiting for resources
457
+ - **Running**: Evaluation in progress
458
+ - **Completed**: Evaluation finished successfully
459
+ - **Failed**: Evaluation encountered an error
460
+
461
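+ If you would rather script the wait than watch the tab, a polling loop mirrors the UI's 30-second auto-refresh. A sketch that takes whatever status lookup you have (HF Jobs or Modal) as a callable:
+
+ ```python
+ import time
+
+ TERMINAL_STATUSES = {"completed", "failed"}
+
+ def wait_for_job(job_id: str, get_status, poll_seconds: int = 30,
+                  timeout: int = 7200) -> str:
+     """Poll `get_status(job_id)` until it returns a terminal status.
+     `get_status` stands in for your actual status source."""
+     deadline = time.time() + timeout
+     while time.time() < deadline:
+         status = get_status(job_id)
+         print(f"{job_id}: {status}")
+         if status in TERMINAL_STATUSES:
+             return status
+         time.sleep(poll_seconds)  # matches the tab's 30s auto-refresh
+     raise TimeoutError(f"Job {job_id} still running after {timeout}s")
+ ```
+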
+ #### Tips
462
+
463
+ - **Check logs** if job fails: Helps diagnose issues
464
+ - **Expected duration**:
465
+ - API models: 2-5 minutes
466
+ - Local models: 15-30 minutes (includes model download)
467
+
468
+ ---
469
+
470
+ ### πŸ” Trace Visualization
471
+
472
+ **Purpose**: Deep-dive into OpenTelemetry traces to understand agent execution.
473
+
474
+ **Access**: Click on any test case in a run's detail view
475
+
476
+ #### Features
477
+
478
+ **Waterfall Diagram**:
479
+ - Visual timeline of execution
480
+ - Spans show: LLM calls, tool executions, reasoning steps
481
+ - Duration bars (wider = slower)
482
+ - Parent-child relationships
483
+
484
+ **Span Details**:
485
+ - Span name (e.g., "LLM Call - Reasoning", "Tool Call - get_weather")
486
+ - Start/end times
487
+ - Duration
488
+ - Attributes (model, tokens, cost, tool inputs/outputs)
489
+ - Status (OK, ERROR)
490
+
491
+ **GPU Metrics Overlay** (for GPU jobs only):
492
+ - GPU utilization %
493
+ - Memory usage
494
+ - Temperature
495
+ - CO2 emissions
496
+
497
+ **MCP-Powered Q&A**:
498
+ - Ask questions about the trace
499
+ - Example: "Why was tool X called twice?"
500
+ - Agent uses `debug_trace` MCP tool to analyze
501
+
502
+ #### How to Use
503
+
504
+ ```
505
+ 1. From leaderboard β†’ Click a run β†’ Click a test case
506
+ 2. View waterfall diagram:
507
+ β†’ Spans arranged chronologically
508
+ β†’ Parent spans (e.g., "Agent Execution")
509
+ β†’ Child spans (e.g., "LLM Call", "Tool Call")
510
+
511
+ 3. Click any span:
512
+ β†’ See detailed attributes
513
+ β†’ Token counts, costs, inputs/outputs
514
+
515
+ 4. Ask questions (MCP-powered):
516
+ User: "Why did this test fail?"
517
+ β†’ Agent analyzes trace with debug_trace tool
518
+ β†’ Returns explanation with span references
519
+
520
+ 5. Check GPU metrics (if available):
521
+ β†’ Graph shows utilization over time
522
+    β†’ Overlaid on the execution timeline
523
+ ```
524
+
525
+ #### Example Analysis
526
+
527
+ **Scenario: Understanding a slow execution**
528
+
529
+ ```
530
+ 1. Open trace for test_045 (duration: 8.5s)
531
+ 2. Waterfall shows:
532
+ - Span 1: LLM Call - Reasoning (1.2s) βœ“
533
+ - Span 2: Tool Call - search_web (6.5s) ⚠️ SLOW
534
+ - Span 3: LLM Call - Final Response (0.8s) βœ“
535
+
536
+ 3. Click Span 2 (search_web):
537
+ - Input: {"query": "weather in Tokyo"}
538
+ - Output: 5 results
539
+ - Duration: 6.5s (6x slower than typical)
540
+
541
+ 4. Ask agent: "Why was the search_web call so slow?"
542
+ β†’ Agent analysis:
543
+ "The search_web call took 6.5s due to network latency.
544
+ Span attributes show API response time: 6.2s.
545
+ This is an external dependency issue, not agent code.
546
+ Recommendation: Implement timeout (5s) and fallback strategy."
547
+ ```
548
+
549
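+ Once a trace is exported as JSON, the same "find the slow span" check is easy to automate. A sketch over an assumed span structure (real OpenTelemetry attribute names may differ):
+
+ ```python
+ # Flag spans that dominate a trace, assuming each span dict carries
+ # "name" and "duration_s" keys (actual OTel field names may differ).
+ spans = [
+     {"name": "LLM Call - Reasoning", "duration_s": 1.2},
+     {"name": "Tool Call - search_web", "duration_s": 6.5},
+     {"name": "LLM Call - Final Response", "duration_s": 0.8},
+ ]
+
+ total = sum(s["duration_s"] for s in spans)
+ for span in sorted(spans, key=lambda s: s["duration_s"], reverse=True):
+     share = span["duration_s"] / total
+     flag = " ⚠️ SLOW" if share > 0.5 else ""
+     print(f"{span['name']}: {span['duration_s']}s ({share:.0%}){flag}")
+ ```
+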
+ #### Tips
550
+
551
+ - **Look for patterns**: Similar failures often have common spans
552
+ - **Use MCP Q&A**: Faster than manual trace analysis
553
+ - **Check GPU metrics**: Identify resource bottlenecks
554
+ - **Compare successful vs failed traces**: Spot differences
555
+
556
+ ---
557
+
558
+ ### πŸ”¬ Synthetic Data Generator
559
+
560
+ **Purpose**: Generate custom synthetic test datasets for agent evaluation using AI, complete with domain-specific tasks and prompt templates.
561
+
562
+ #### Features
563
+
564
+ **AI-Powered Dataset Generation**:
565
+ - Generate 5-100 synthetic tasks using Google Gemini 2.5 Flash
566
+ - Customizable domain, tools, difficulty, and agent type
567
+ - Automatic batching for large datasets (parallel generation)
568
+ - SMOLTRACE-format output ready for evaluation
569
+
570
+ **Prompt Template Generation**:
571
+ - Customized YAML templates based on smolagents format
572
+ - Optimized for your specific domain and tools
573
+ - Included automatically in dataset card
574
+
575
+ **Push to HuggingFace Hub**:
576
+ - One-click upload to HuggingFace Hub
577
+ - Public or private repositories
578
+ - Auto-generated README with usage instructions
579
+ - Ready to use with SMOLTRACE evaluations
580
+
581
+ #### How to Use
582
+
583
+ **Step 1: Configure & Generate Dataset**
584
+
585
+ 1. Navigate to **πŸ”¬ Synthetic Data Generator** tab
586
+
587
+ 2. Configure generation parameters:
588
+ - **Domain**: Topic/industry (e.g., "travel", "finance", "healthcare", "customer_support")
589
+ - **Tools**: Comma-separated list of tool names (e.g., "get_weather,search_flights,book_hotel")
590
+ - **Number of Tasks**: 5-100 tasks (slider)
591
+ - **Difficulty Level**:
592
+ - `balanced` (40% easy, 40% medium, 20% hard)
593
+ - `easy_only` (100% easy tasks)
594
+ - `medium_only` (100% medium tasks)
595
+ - `hard_only` (100% hard tasks)
596
+ - `progressive` (50% easy, 30% medium, 20% hard)
597
+ - **Agent Type**:
598
+ - `tool` (ToolCallingAgent only)
599
+ - `code` (CodeAgent only)
600
+ - `both` (50/50 mix)
601
+
602
+ 3. Click **"🎲 Generate Synthetic Dataset"**
603
+
604
+ 4. Wait for generation (30-120s depending on size):
605
+ - Shows progress message
606
+ - Automatic batching for >20 tasks
607
+ - Parallel API calls for faster generation
608
+
609
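+ The batching mentioned in step 4 amounts to chunking the request and generating chunks concurrently. A rough sketch, with `generate_batch(n)` standing in for one Gemini generation call:
+
+ ```python
+ from concurrent.futures import ThreadPoolExecutor
+
+ BATCH_SIZE = 20  # requests above this size are split automatically
+
+ def generate_in_batches(total: int, generate_batch) -> list:
+     """Split `total` tasks into batches and run `generate_batch(n)` concurrently.
+     `generate_batch` is a placeholder for the actual per-batch API call."""
+     sizes = [min(BATCH_SIZE, total - i) for i in range(0, total, BATCH_SIZE)]
+     with ThreadPoolExecutor() as pool:
+         batches = pool.map(generate_batch, sizes)
+     return [task for batch in batches for task in batch]
+
+ # e.g. generate_in_batches(100, my_gemini_call) -> five parallel batches of 20
+ ```
+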
+ **Step 2: Review Generated Content**
610
+
611
+ 1. **Dataset Preview Tab**:
612
+ - View all generated tasks in JSON format
613
+ - Check task IDs, prompts, expected tools, difficulty
614
+ - See dataset statistics:
615
+ - Total tasks
616
+ - Difficulty distribution
617
+ - Agent type distribution
618
+ - Tools coverage
619
+
620
+ 2. **Prompt Template Tab**:
621
+ - View customized YAML prompt template
622
+ - Based on smolagents templates
623
+ - Adapted for your domain and tools
624
+ - Ready to use with ToolCallingAgent or CodeAgent
625
+
626
+ **Step 3: Push to HuggingFace Hub** (Optional)
627
+
628
+ 1. Enter **Repository Name**:
629
+ - Format: `username/smoltrace-{domain}-tasks`
630
+ - Example: `alice/smoltrace-finance-tasks`
631
+ - Auto-filled with your HF username after generation
632
+
633
+ 2. Set **Visibility**:
634
+ - ☐ Private Repository (unchecked = public)
635
+ - β˜‘ Private Repository (checked = private)
636
+
637
+ 3. Provide **HuggingFace Token** (optional):
638
+ - Leave empty to use environment token (HF_TOKEN from Settings)
639
+ - Or paste token from https://huggingface.co/settings/tokens
640
+ - Requires write permissions
641
+
642
+ 4. Click **"πŸ“€ Push to HuggingFace Hub"**
643
+
644
+ 5. Wait for upload (5-30s):
645
+ - Creates dataset repository
646
+ - Uploads tasks
647
+ - Generates README with:
648
+ - Usage instructions
649
+ - Prompt template
650
+ - SMOLTRACE integration code
651
+ - Returns dataset URL
652
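+ For reference, step 4 is roughly equivalent to this `datasets` call (a sketch; the app additionally generates the README and prompt template for you):
+
+ ```python
+ from datasets import Dataset
+
+ tasks = [
+     {"id": "finance_stock_price_1",
+      "prompt": "What is the current price of AAPL stock?",
+      "expected_tool": "get_stock_price",
+      "difficulty": "easy",
+      "agent_type": "tool"},
+     # ... remaining generated tasks from the preview tab
+ ]
+
+ Dataset.from_list(tasks).push_to_hub(
+     "alice/smoltrace-finance-tasks",  # repository name from step 1
+     private=False,                    # matches the visibility checkbox
+     token=None,                       # None -> use HF_TOKEN from the environment
+ )
+ ```
+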
+
653
+ #### Example Workflow
654
+
655
+ ```
656
+ Scenario: Create finance evaluation dataset with 20 tasks
657
+
658
+ 1. Configure:
659
+ Domain: "finance"
660
+ Tools: "get_stock_price,calculate_roi,get_market_news,send_alert"
661
+ Number of Tasks: 20
662
+ Difficulty: "balanced"
663
+ Agent Type: "both"
664
+
665
+ 2. Click "Generate"
666
+ β†’ AI generates 20 tasks:
667
+ - 8 easy (single tool, straightforward)
668
+ - 8 medium (multiple tools or complex logic)
669
+ - 4 hard (complex reasoning, edge cases)
670
+ - 10 for ToolCallingAgent
671
+ - 10 for CodeAgent
672
+ β†’ Also generates customized prompt template
673
+
674
+ 3. Review Dataset Preview:
675
+ Task 1:
676
+ {
677
+ "id": "finance_stock_price_1",
678
+ "prompt": "What is the current price of AAPL stock?",
679
+ "expected_tool": "get_stock_price",
680
+ "difficulty": "easy",
681
+ "agent_type": "tool",
682
+ "expected_keywords": ["AAPL", "price", "$"]
683
+ }
684
+
685
+ Task 15:
686
+ {
687
+ "id": "finance_complex_analysis_15",
688
+ "prompt": "Calculate the ROI for investing $10,000 in AAPL last year and send an alert if ROI > 15%",
689
+ "expected_tool": "calculate_roi",
690
+ "expected_tool_calls": 2,
691
+ "difficulty": "hard",
692
+ "agent_type": "code",
693
+ "expected_keywords": ["ROI", "15%", "alert"]
694
+ }
695
+
696
+ 4. Review Prompt Template:
697
+ See customized YAML with:
698
+ - Finance-specific system prompt
699
+ - Tool descriptions for get_stock_price, calculate_roi, etc.
700
+ - Response format guidelines
701
+
702
+ 5. Push to Hub:
703
+ Repository: "yourname/smoltrace-finance-tasks"
704
+ Private: No (public)
705
+ Token: (empty, using environment token)
706
+
707
+ β†’ Uploads to https://huggingface.co/datasets/yourname/smoltrace-finance-tasks
708
+ β†’ README includes usage instructions and prompt template
709
+
710
+ 6. Use in evaluation:
711
+    # Load your custom dataset (Python)
+    from datasets import load_dataset
+    dataset = load_dataset("yourname/smoltrace-finance-tasks")
713
+
714
+ # Run SMOLTRACE evaluation
715
+ smoltrace-eval --model openai/gpt-4 \
716
+ --dataset-name yourname/smoltrace-finance-tasks \
717
+ --agent-type both
718
+ ```
719
+
720
+ #### Configuration Reference
721
+
722
+ **Difficulty Levels Explained**:
723
+
724
+ | Level | Characteristics | Example |
725
+ |-------|----------------|---------|
726
+ | **Easy** | Single tool call, straightforward input, clear expected output | "What's the weather in Tokyo?" β†’ get_weather("Tokyo") |
727
+ | **Medium** | Multiple tool calls OR complex input parsing OR conditional logic | "Compare weather in Tokyo and London" β†’ get_weather("Tokyo"), get_weather("London"), compare |
728
+ | **Hard** | Multiple tools, complex reasoning, edge cases, error handling | "Plan a trip with best weather, book flights if under $500, alert if unavailable" |
729
+
730
+ **Agent Types Explained**:
731
+
732
+ | Type | Description | Use Case |
733
+ |------|-------------|----------|
734
+ | **tool** | ToolCallingAgent - Declarative tool calling with structured outputs | API-based models that support function calling (GPT-4, Claude) |
735
+ | **code** | CodeAgent - Writes Python code to use tools programmatically | Models that excel at code generation (Qwen-Coder, DeepSeek-Coder) |
736
+ | **both** | 50/50 mix of tool and code agent tasks | Comprehensive evaluation across agent types |
737
+
738
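+ The difficulty presets correspond to fixed sampling weights. A small sketch of how a preset might translate into per-level task counts (weights taken from the preset list above):
+
+ ```python
+ # Difficulty presets as (easy, medium, hard) weights, per the options above.
+ PRESETS = {
+     "balanced":    (0.4, 0.4, 0.2),
+     "progressive": (0.5, 0.3, 0.2),
+     "easy_only":   (1.0, 0.0, 0.0),
+     "medium_only": (0.0, 1.0, 0.0),
+     "hard_only":   (0.0, 0.0, 1.0),
+ }
+
+ def difficulty_mix(preset: str, num_tasks: int) -> dict:
+     easy, medium, _hard = PRESETS[preset]
+     counts = {"easy": round(num_tasks * easy), "medium": round(num_tasks * medium)}
+     counts["hard"] = num_tasks - sum(counts.values())  # remainder keeps totals exact
+     return counts
+
+ print(difficulty_mix("balanced", 20))  # {'easy': 8, 'medium': 8, 'hard': 4}
+ ```
+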
+ #### Best Practices
739
+
740
+ **Domain Selection**:
741
+ - Be specific: "customer_support_saas" > "support"
742
+ - Match your use case: Use actual business domain
743
+ - Consider tools available: Domain should align with tools
744
+
745
+ **Tool Names**:
746
+ - Use descriptive names: "get_stock_price" > "fetch"
747
+ - Match actual tool implementations
748
+ - 3-8 tools is ideal (enough variety, not overwhelming)
749
+ - Include mix of data retrieval and action tools
750
+
751
+ **Number of Tasks**:
752
+ - 5-10 tasks: Quick testing, proof of concept
753
+ - 20-30 tasks: Solid evaluation dataset
754
+ - 50-100 tasks: Comprehensive benchmark
755
+
756
+ **Difficulty Distribution**:
757
+ - `balanced`: Best for general evaluation
758
+ - `progressive`: Good for learning/debugging
759
+ - `easy_only`: Quick sanity checks
760
+ - `hard_only`: Stress testing advanced capabilities
761
+
762
+ **Quality Assurance**:
763
+ - Always review generated tasks before pushing
764
+ - Check for domain relevance and variety
765
+ - Verify expected tools match your actual tools
766
+ - Ensure prompts are clear and executable
767
+
768
+ #### Troubleshooting
769
+
770
+ **Generation fails with "Invalid API key"**:
771
+ - Go to **βš™οΈ Settings**
772
+ - Configure Gemini API Key
773
+ - Get key from https://aistudio.google.com/apikey
774
+
775
+ **Generated tasks don't match domain**:
776
+ - Be more specific in domain description
777
+ - Try regenerating with adjusted parameters
778
+ - Review prompt template for domain alignment
779
+
780
+ **Push to Hub fails with "Authentication error"**:
781
+ - Verify HuggingFace token has write permissions
782
+ - Get token from https://huggingface.co/settings/tokens
783
+ - Check token in **βš™οΈ Settings** or provide directly
784
+
785
+ **Dataset generation is slow (>60s)**:
786
+ - Large requests (>20 tasks) are automatically batched
787
+ - Each batch takes 30-120s
788
+ - Example: 100 tasks = 5 batches at ~60s each, so allow up to ~5 minutes
789
+ - This is normal for large datasets
790
+
791
+ **Tasks are too easy/hard**:
792
+ - Adjust difficulty distribution
793
+ - Regenerate with different settings
794
+ - Mix difficulty levels with `balanced` or `progressive`
795
+
796
+ #### Advanced Tips
797
+
798
+ **Iterative Refinement**:
799
+ 1. Generate 10 tasks with `balanced` difficulty
800
+ 2. Review quality and variety
801
+ 3. If satisfied, generate 50-100 tasks with same settings
802
+ 4. If not, adjust domain/tools and regenerate
803
+
804
+ **Dataset Versioning**:
805
+ - Use version suffixes: `username/smoltrace-finance-tasks-v2`
806
+ - Iterate on datasets as tools evolve
807
+ - Keep track of which version was used for evaluations
808
+
809
+ **Combining Datasets**:
810
+ - Generate multiple small datasets for different domains
811
+ - Use SMOLTRACE CLI to merge datasets
812
+ - Create comprehensive multi-domain benchmarks
813
+
814
+ **Custom Prompt Templates**:
815
+ - Generate prompt template separately
816
+ - Customize further based on your needs
817
+ - Use in agent initialization before evaluation
818
+ - Include in dataset card for reproducibility
819
+
820
+ ---
821
+
822
+ ### βš™οΈ Settings
823
+
824
+ **Purpose**: Configure API keys, preferences, and authentication.
825
+
826
+ #### Features
827
+
828
+ **API Key Configuration**:
829
+ - Gemini API Key (for MCP server AI analysis)
830
+ - HuggingFace Token (for dataset access + job submission)
831
+ - Modal Token ID + Secret (for Modal job submission)
832
+ - LLM Provider Keys (OpenAI, Anthropic, etc.)
833
+
834
+ **Preferences**:
835
+ - Default infrastructure (HF Jobs vs Modal)
836
+ - Default hardware tier
837
+ - Auto-refresh intervals
838
+
839
+ **Security**:
840
+ - Keys stored in browser session only (not server)
841
+ - HTTPS encryption for all API calls
842
+ - Keys never logged or exposed
843
+
844
+ #### How to Use
845
+
846
+ **Configure Essential Keys**:
847
+ ```
848
+ 1. Go to "βš™οΈ Settings" tab
849
+
850
+ 2. Enter Gemini API Key:
851
+ - Get from: https://ai.google.dev/
852
+ - Click "Get API Key" β†’ Create project β†’ Generate
853
+ - Paste into field
854
+ - Free tier: 1,500 requests/day
855
+
856
+ 3. Enter HuggingFace Token:
857
+ - Get from: https://huggingface.co/settings/tokens
858
+ - Click "New token" β†’ Name: "TraceMind"
859
+ - Permissions:
860
+ - Read (for viewing datasets)
861
+ - Write (for uploading results)
862
+ - Run Jobs (for evaluation submission)
863
+ - Paste into field
864
+
865
+ 4. Click "Save API Keys"
866
+ β†’ Keys stored in browser session
867
+ β†’ MCP server will use your keys
868
+ ```
869
+
870
+ **Configure for Job Submission** (Optional):
871
+
872
+ For **HuggingFace Jobs**:
873
+ ```
874
+ Already configured if you entered HF token above with "Run Jobs" permission.
875
+ ```
876
+
877
+ For **Modal** (Alternative):
878
+ ```
879
+ 1. Sign up: https://modal.com
880
+ 2. Get token: https://modal.com/settings/tokens
881
+ 3. Copy MODAL_TOKEN_ID (starts with 'ak-')
882
+ 4. Copy MODAL_TOKEN_SECRET (starts with 'as-')
883
+ 5. Paste both into Settings β†’ Save
884
+ ```
885
+
886
+ For **API Model Providers**:
887
+ ```
888
+ 1. Get API key from provider:
889
+ - OpenAI: https://platform.openai.com/api-keys
890
+ - Anthropic: https://console.anthropic.com/settings/keys
891
+ - Google: https://ai.google.dev/
892
+
893
+ 2. Paste into corresponding field in Settings
894
+ 3. Click "Save LLM Provider Keys"
895
+ ```
896
+
897
+ #### Security Best Practices
898
+
899
+ - **Use environment variables**: For production, set keys via HF Spaces secrets
900
+ - **Rotate keys regularly**: Generate new tokens every 3-6 months
901
+ - **Minimal permissions**: Only grant "Run Jobs" if you need to submit evaluations
902
+ - **Monitor usage**: Check API provider dashboards for unexpected charges
903
+
904
+ ---
905
+
906
+ ## Common Workflows
907
+
908
+ ### Workflow 1: Quick Model Comparison
909
+
910
+ ```
911
+ Goal: Compare GPT-4 vs Llama-3.1-8B for production use
912
+
913
+ Steps:
914
+ 1. Go to Leaderboard β†’ Load Leaderboard
915
+ 2. Read AI insights: "GPT-4 leads accuracy, Llama-3.1 best cost"
916
+ 3. Sort by Success Rate β†’ Note: GPT-4 (95.8%), Llama (93.4%)
917
+ 4. Sort by Cost β†’ Note: GPT-4 ($0.05), Llama ($0.002)
918
+ 5. Go to Agent Chat β†’ Ask: "Compare GPT-4 and Llama-3.1. Which should I use for 1M runs/month?"
919
+ β†’ Agent analyzes with MCP tools
920
+ β†’ Returns: "Llama saves $48K/month, only 2.4% accuracy drop"
921
+ 6. Decision: Use Llama-3.1-8B for production
922
+ ```
923
+
924
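+ The savings figure the agent returns in step 5 is just per-run cost times volume, which is easy to verify:
+
+ ```python
+ # Check the agent's claim using the per-run costs from steps 3-4.
+ runs_per_month = 1_000_000
+ gpt4_cost, llama_cost = 0.05, 0.002  # $/run, from the leaderboard
+
+ savings = runs_per_month * (gpt4_cost - llama_cost)
+ accuracy_drop = round(95.8 - 93.4, 1)
+ print(f"Monthly savings: ${savings:,.0f}")  # $48,000
+ print(f"Accuracy drop: {accuracy_drop}%")   # 2.4%
+ ```
+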
+ ### Workflow 2: Evaluate Custom Model
925
+
926
+ ```
927
+ Goal: Evaluate your fine-tuned model on SMOLTRACE benchmark
928
+
929
+ Steps:
930
+ 1. Ensure model is on HuggingFace: username/my-finetuned-model
931
+ 2. Go to Settings β†’ Configure HF token (with Run Jobs permission)
932
+ 3. Go to New Evaluation:
933
+ - Model: "username/my-finetuned-model"
934
+ - Infrastructure: HuggingFace Jobs
935
+ - Agent type: both
936
+ - Hardware: auto
937
+ 4. Click "Estimate Cost" β†’ Review: $1.50, 20 min
938
+ 5. Click "Submit Evaluation"
939
+ 6. Go to Job Monitoring β†’ Wait for "Completed" (15-25 min)
940
+ 7. Go to Leaderboard β†’ Refresh β†’ See your model in table
941
+ 8. Click your run β†’ Review detailed results
942
+ 9. Compare vs other models using Agent Chat
943
+ ```
944
+
945
+ ### Workflow 3: Debug Failed Test
946
+
947
+ ```
948
+ Goal: Understand why test_045 failed in your evaluation
949
+
950
+ Steps:
951
+ 1. Go to Leaderboard β†’ Find your run β†’ Click to open details
952
+ 2. Filter to failed tests only
953
+ 3. Click test_045 β†’ Opens trace visualization
954
+ 4. Examine waterfall:
955
+ - Span 1: LLM Call (OK)
956
+ - Span 2: Tool Call - "unknown_tool" (ERROR)
957
+ - No Span 3 (execution stopped)
958
+ 5. Ask Agent: "Why did test_045 fail?"
959
+ β†’ Agent uses debug_trace MCP tool
960
+ β†’ Returns: "Tool 'unknown_tool' not found. Add to agent's tool list."
961
+ 6. Fix: Update agent config to include missing tool
962
+ 7. Re-run evaluation with fixed config
963
+ ```
964
+
965
+ ---
966
+
967
+ ## Troubleshooting
968
+
969
+ ### Leaderboard Issues
970
+
971
+ **Problem**: "Load Leaderboard" button doesn't work
972
+ - **Solution**: Check HuggingFace token in Settings (needs Read permission)
973
+ - **Solution**: Verify leaderboard dataset exists: https://huggingface.co/datasets/kshitijthakkar/smoltrace-leaderboard
974
+
975
+ **Problem**: AI insights not showing
976
+ - **Solution**: Check Gemini API key in Settings
977
+ - **Solution**: Wait 5-10 seconds for AI generation to complete
978
+
979
+ ### Agent Chat Issues
980
+
981
+ **Problem**: Agent responds with "MCP server connection failed"
982
+ - **Solution**: Check MCP server status: https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind-mcp-server
983
+ - **Solution**: Configure Gemini API key in both TraceMind-AI and MCP server Settings
984
+
985
+ **Problem**: Agent gives incorrect information
986
+ - **Solution**: Agent may be using stale data. Ask: "Load the latest leaderboard data"
987
+ - **Solution**: Verify question is clear and specific
988
+
989
+ ### Evaluation Submission Issues
990
+
991
+ **Problem**: "Submit Evaluation" fails with auth error
992
+ - **Solution**: HF token needs "Run Jobs" permission
993
+ - **Solution**: Ensure HF Pro account is active ($9/month)
994
+ - **Solution**: Verify credit card is on file for compute charges
995
+
996
+ **Problem**: Job stuck in "Pending" status
997
+ - **Solution**: HuggingFace Jobs may have a queue; wait 5-10 minutes
998
+ - **Solution**: Try Modal as alternative infrastructure
999
+
1000
+ **Problem**: Job fails with "Out of Memory"
1001
+ - **Solution**: Model too large for selected hardware
1002
+ - **Solution**: Increase hardware tier (e.g., t4-small β†’ a10g-small)
1003
+ - **Solution**: Use auto hardware selection
1004
+
1005
+ ### Trace Visualization Issues
1006
+
1007
+ **Problem**: Traces not loading
1008
+ - **Solution**: Ensure evaluation completed successfully
1009
+ - **Solution**: Check traces dataset exists on HuggingFace
1010
+ - **Solution**: Verify HF token has Read permission
1011
+
1012
+ **Problem**: GPU metrics missing
1013
+ - **Solution**: Only available for GPU jobs (not API models)
1014
+ - **Solution**: Ensure evaluation was run with SMOLTRACE's GPU metrics enabled
1015
+
1016
+ ---
1017
+
1018
+ ## Getting Help
1019
+
1020
+ - **πŸ“§ GitHub Issues**: [TraceMind-AI/issues](https://github.com/Mandark-droid/TraceMind-AI/issues)
1021
+ - **πŸ’¬ HF Discord**: `#agents-mcp-hackathon-winter25`
1022
+ - **πŸ“– Documentation**: See [MCP_INTEGRATION.md](MCP_INTEGRATION.md) and [ARCHITECTURE.md](ARCHITECTURE.md)
1023
+
1024
+ ---
1025
+
1026
+ **Last Updated**: November 21, 2025