kshitijthakkar committed on
Commit
34f1a7a
·
1 Parent(s): 880ef7f

docs: Deploy final documentation package

Files changed (4)
  1. ARCHITECTURE.md +1035 -0
  2. MCP_INTEGRATION.md +706 -0
  3. README.md +318 -343
  4. USER_GUIDE.md +1026 -0
ARCHITECTURE.md ADDED
@@ -0,0 +1,1035 @@
+ # TraceMind-AI - Technical Architecture
+
+ This document provides a deep technical dive into the TraceMind-AI architecture, implementation details, and system design.
+
+ ## Table of Contents
+
+ - [System Overview](#system-overview)
+ - [Project Structure](#project-structure)
+ - [Core Components](#core-components)
+ - [MCP Client Architecture](#mcp-client-architecture)
+ - [Agent Framework Integration](#agent-framework-integration)
+ - [Data Flow](#data-flow)
+ - [Authentication & Authorization](#authentication--authorization)
+ - [Screen Navigation](#screen-navigation)
+ - [Job Submission Architecture](#job-submission-architecture)
+ - [Deployment](#deployment)
+ - [Performance Optimization](#performance-optimization)
+
+ ---
+
+ ## System Overview
+
+ TraceMind-AI is a comprehensive Gradio-based web application for evaluating AI agent performance. It serves as the user-facing platform in the TraceMind ecosystem, demonstrating enterprise MCP client usage (Track 2: MCP in Action).
+
+ ### Technology Stack
+
+ | Component | Technology | Version | Purpose |
+ |-----------|-----------|---------|---------|
+ | **UI Framework** | Gradio | 5.49.1 | Web interface with components |
+ | **MCP Client** | MCP Python SDK | Latest | Connect to MCP servers |
+ | **Agent Framework** | smolagents | 1.22.0+ | Autonomous agent with MCP tools |
+ | **Data Source** | HuggingFace Datasets | Latest | Load evaluation results |
+ | **Authentication** | HuggingFace OAuth | - | User authentication |
+ | **Job Platforms** | HF Jobs + Modal | - | Evaluation job submission |
+ | **Language** | Python | 3.10+ | Core implementation |
+
+ ### High-Level Architecture
+
+ ```
+ ┌──────────────────────────────────────────────────────────┐
+ │                      User Browser                        │
+ │  - Gradio Interface (React-based)                        │
+ │  - OAuth Flow (HuggingFace)                              │
+ └──────────────────────────────┬───────────────────────────┘
+                                │ HTTP/WebSocket
+                                ↓
+ ┌──────────────────────────────────────────────────────────┐
+ │           TraceMind-AI (Gradio App) - Track 2            │
+ │                                                          │
+ │  ┌────────────────────────────────────────────────────┐  │
+ │  │ Screen Layer (screens/)                            │  │
+ │  │  - Leaderboard                                     │  │
+ │  │  - Agent Chat                                      │  │
+ │  │  - New Evaluation                                  │  │
+ │  │  - Job Monitoring                                  │  │
+ │  │  - Trace Detail                                    │  │
+ │  │  - Settings                                        │  │
+ │  └─────────────────────────┬──────────────────────────┘  │
+ │                            │                             │
+ │  ┌─────────────────────────┴──────────────────────────┐  │
+ │  │ Component Layer (components/)                      │  │
+ │  │  - Leaderboard Table (Custom HTML)                 │  │
+ │  │  - Analytics Charts                                │  │
+ │  │  - Metric Displays                                 │  │
+ │  │  - Report Cards                                    │  │
+ │  └─────────────────────────┬──────────────────────────┘  │
+ │                            │                             │
+ │  ┌─────────────────────────┴──────────────────────────┐  │
+ │  │ Service Layer                                      │  │
+ │  │  ┌──────────────────┐  ┌──────────────────┐        │  │
+ │  │  │ MCP Client       │  │ Data Loader      │        │  │
+ │  │  │ (mcp_client/)    │  │ (data_loader.py) │        │  │
+ │  │  └──────────────────┘  └──────────────────┘        │  │
+ │  │  ┌──────────────────┐  ┌──────────────────┐        │  │
+ │  │  │ Agent (smolagents│  │ Job Submission   │        │  │
+ │  │  │ screens/chat.py) │  │ (utils/)         │        │  │
+ │  │  └──────────────────┘  └──────────────────┘        │  │
+ │  └────────────────────────────────────────────────────┘  │
+ │                                                          │
+ └───────────┬───────────────────────────────┬──────────────┘
+             │                               │
+             ↓                               ↓
+ ┌───────────────────────┐       ┌───────────────────────┐
+ │ TraceMind MCP Server  │       │ External Services     │
+ │ (Track 1)             │       │ - HF Datasets         │
+ │ - 11 AI Tools         │       │ - HF Jobs             │
+ │ - 3 Resources         │       │ - Modal               │
+ │ - 3 Prompts           │       │ - LLM APIs            │
+ └───────────────────────┘       └───────────────────────┘
+ ```
+
+ ---
+
+ ## Project Structure
+
+ ```
+ TraceMind-AI/
+ ├── app.py                       # Main entry point, Gradio app
+ │
+ ├── screens/                     # UI screens (6 primary tabs)
+ │   ├── __init__.py
+ │   ├── leaderboard.py           # Screen 1: Leaderboard with AI insights
+ │   ├── chat.py                  # Screen 2: Agent Chat (smolagents)
+ │   ├── dashboard.py             # Screen 3: New Evaluation
+ │   ├── job_monitoring.py        # Screen 4: Job Status Tracking
+ │   ├── trace_detail.py          # Screen 5: Trace Visualization
+ │   ├── settings.py              # Screen 6: API Key Configuration
+ │   ├── compare.py               # Screen 7: Run Comparison (optional)
+ │   ├── documentation.py         # Screen 8: API Documentation
+ │   └── mcp_helpers.py           # Shared MCP client helpers
+ │
+ ├── components/                  # Reusable UI components
+ │   ├── __init__.py
+ │   ├── leaderboard_table.py     # Custom HTML table component
+ │   ├── analytics_charts.py      # Performance charts (Plotly)
+ │   ├── metric_displays.py       # Metric cards and badges
+ │   ├── report_cards.py          # Summary report cards
+ │   └── thought_graph.py         # Agent reasoning visualization
+ │
+ ├── mcp_client/                  # MCP client implementation
+ │   ├── __init__.py
+ │   ├── client.py                # Async MCP client
+ │   └── sync_wrapper.py          # Synchronous wrapper for Gradio
+ │
+ ├── utils/                       # Utility modules
+ │   ├── __init__.py
+ │   ├── auth.py                  # HuggingFace OAuth
+ │   ├── navigation.py            # Screen navigation state
+ │   ├── hf_jobs_submission.py    # HuggingFace Jobs integration
+ │   └── modal_job_submission.py  # Modal integration
+ │
+ ├── styles/                      # Custom styling
+ │   ├── __init__.py
+ │   └── tracemind_theme.py       # Gradio theme customization
+ │
+ ├── data_loader.py               # Dataset loading and caching
+ ├── requirements.txt             # Python dependencies
+ ├── .env.example                 # Environment variable template
+ ├── .gitignore
+ ├── README.md                    # Project documentation
+ └── USER_GUIDE.md                # Complete user guide
+
+ Total: ~35 files, ~8,000 lines of code
+ ```
+
+ ### File Breakdown
+
+ | Directory | Files | Lines | Purpose |
+ |-----------|-------|-------|---------|
+ | `screens/` | 9 | ~3,500 | UI screen implementations |
+ | `components/` | 5 | ~1,200 | Reusable UI components |
+ | `mcp_client/` | 3 | ~800 | MCP client integration |
+ | `utils/` | 4 | ~1,500 | Authentication, jobs, navigation |
+ | `styles/` | 2 | ~300 | Custom theme and CSS |
+ | Root | 3 | ~700 | Main app, data loader, config |
+
+ ---
+
+ ## Core Components
+
+ ### 1. app.py - Main Application
+
+ **Purpose**: Entry point, orchestrates all screens and manages global state.
+
+ **Architecture**:
+
+ ```python
+ # app.py structure
+ import os
+
+ import gradio as gr
+
+ from screens import *
+ from mcp_client.sync_wrapper import get_sync_mcp_client
+ from utils.auth import auth_ui
+ from data_loader import DataLoader
+ from styles.tracemind_theme import tracemind_theme
+
+ DISABLE_OAUTH = os.getenv("DISABLE_OAUTH", "false").lower() == "true"
+
+ # 1. Initialize services
+ mcp_client = get_sync_mcp_client()
+ mcp_client.initialize()
+ data_loader = DataLoader()
+
+ # 2. Create Gradio app
+ with gr.Blocks(theme=tracemind_theme) as app:
+     # Global state
+     gr.State(...)  # User session, navigation, etc.
+
+     # Authentication (if not disabled)
+     if not DISABLE_OAUTH:
+         auth_ui()
+
+     # Main tabs
+     with gr.Tabs():
+         with gr.Tab("📊 Leaderboard"):
+             leaderboard_screen()
+
+         with gr.Tab("🤖 Agent Chat"):
+             chat_screen()
+
+         with gr.Tab("🚀 New Evaluation"):
+             dashboard_screen()
+
+         with gr.Tab("📈 Job Monitoring"):
+             job_monitoring_screen()
+
+         with gr.Tab("🔍 Trace Detail"):
+             trace_detail_screen()
+
+         with gr.Tab("⚙️ Settings"):
+             settings_screen()
+
+ # 3. Launch
+ if __name__ == "__main__":
+     app.launch(
+         server_name="0.0.0.0",
+         server_port=7860,
+         share=False
+     )
+ ```
+
+ **Key Responsibilities**:
+ - Initialize MCP client and data loader (global instances)
+ - Create tabbed interface with all screens
+ - Manage authentication flow
+ - Handle global state (user session, API keys)
+
+ ---
+
+ ### 2. Screen Layer (screens/)
+
+ Each screen is a self-contained module that returns a Gradio component tree; a minimal skeleton is sketched below.
+
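+ As a point of reference, a minimal screen module following this convention might look like the sketch below (the function name mirrors the project structure above; the component tree is illustrative, not the actual implementation):
+
+ ```python
+ # Skeleton of the screen-module convention (illustrative sketch)
+ import gradio as gr
+
+ def leaderboard_screen():
+     """Build the Leaderboard tab; called once from app.py inside gr.Blocks()."""
+     with gr.Column():
+         load_btn = gr.Button("Load Leaderboard")
+         insights_md = gr.Markdown()
+         table_html = gr.HTML()
+
+     # Event wiring lives inside the module, keeping screens self-contained
+     load_btn.click(fn=lambda: ("Loading...", ""), outputs=[insights_md, table_html])
+ ```
+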
+ #### screens/leaderboard.py
+
+ **Purpose**: Display evaluation results with AI-powered insights.
+
+ **Components**:
+ - Load button
+ - AI insights panel (Markdown) - powered by MCP server
+ - Leaderboard table (custom HTML component)
+ - Filter controls (agent type, provider)
+
+ **MCP Integration**:
+ ```python
+ import pandas as pd
+ from datasets import load_dataset
+
+ def load_leaderboard(mcp_client):
+     # 1. Load dataset
+     ds = load_dataset("kshitijthakkar/smoltrace-leaderboard", split="train")
+     df = pd.DataFrame(ds)
+
+     # 2. Get AI insights from MCP server
+     insights = mcp_client.analyze_leaderboard(
+         metric_focus="overall",
+         time_range="last_week",
+         top_n=5
+     )
+
+     # 3. Render table with custom component
+     table_html = render_leaderboard_table(df)
+
+     return insights, table_html
+ ```
+
+ #### screens/chat.py
+
+ **Purpose**: Autonomous agent interface with MCP tool access.
+
+ **Agent Setup**:
+ ```python
+ import os
+
+ from smolagents import ToolCallingAgent, MCPClient, HfApiModel
+
+ # Initialize agent with MCP client
+ # (MCP_SERVER_URL: module-level constant; see Deployment -> Environment Variables)
+ def create_agent():
+     mcp_client = MCPClient(MCP_SERVER_URL)
+
+     model = HfApiModel(
+         model_id="Qwen/Qwen2.5-Coder-32B-Instruct",
+         token=os.getenv("HF_TOKEN")
+     )
+
+     agent = ToolCallingAgent(
+         tools=[],  # MCP tools loaded automatically
+         model=model,
+         mcp_client=mcp_client,
+         max_steps=10
+     )
+
+     return agent
+
+ # Chat interaction (agent is a module-level instance created at startup)
+ def agent_chat(message, history, show_reasoning):
+     if show_reasoning:
+         agent.verbosity_level = 2  # Show tool execution
+     else:
+         agent.verbosity_level = 0  # Only final answer
+
+     response = agent.run(message)
+     history.append((message, response))
+
+     return history, ""
+ ```
+
+ **MCP Tool Access**:
+ The agent automatically discovers and uses all 11 MCP tools from the TraceMind MCP Server; a quick way to inspect them is sketched below.
+
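+ The snippet below lists the discovered tools via the shared sync client. It leans on the `available_tools` dict that `connect()` populates in `mcp_client/client.py` (an internal detail, used here purely for illustration):
+
+ ```python
+ # Inspect which MCP tools the client discovered at startup (illustrative)
+ from mcp_client.sync_wrapper import get_sync_mcp_client
+
+ client = get_sync_mcp_client()
+ client.initialize()
+ print(sorted(client.async_client.available_tools))
+ # e.g. ['analyze_leaderboard', 'compare_runs', 'debug_trace', 'estimate_cost', ...]
+ ```
+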
+ #### screens/dashboard.py
+
+ **Purpose**: Submit evaluation jobs to HuggingFace Jobs or Modal.
+
+ **Key Functions**:
+ - Model selection (text input)
+ - Infrastructure choice (HF Jobs / Modal)
+ - Hardware selection (auto / manual)
+ - Cost estimation (MCP-powered)
+ - Job submission
+
+ **Cost Estimation Flow**:
+ ```python
+ def estimate_cost_click(model, agent_type, num_tests, hardware, mcp_client):
+     # Call MCP server for cost estimate
+     estimate = mcp_client.estimate_cost(
+         model=model,
+         agent_type=agent_type,
+         num_tests=num_tests,
+         hardware=hardware
+     )
+
+     return estimate  # Display in dialog
+ ```
+
+ **Job Submission Flow**:
+ ```python
+ def submit_job(model, agent_type, hardware, infrastructure, api_keys):
+     if infrastructure == "HuggingFace Jobs":
+         job_id = submit_hf_job(model, agent_type, hardware, api_keys)
+     elif infrastructure == "Modal":
+         job_id = submit_modal_job(model, agent_type, hardware, api_keys)
+     else:
+         raise ValueError(f"Unknown infrastructure: {infrastructure}")
+
+     return f"✅ Job submitted: {job_id}"
+ ```
+
+ #### screens/job_monitoring.py
+
+ **Purpose**: Track status of submitted jobs.
+
+ **Data Source**: HuggingFace Jobs API or Modal API
+
+ **Refresh Strategy**:
+ - Manual refresh button
+ - Auto-refresh every 30 seconds (optional; see the sketch below)
+
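+ A minimal auto-refresh wiring, assuming Gradio's `gr.Timer` component (available in Gradio 5) and a hypothetical `fetch_job_statuses()` helper that polls the job APIs:
+
+ ```python
+ import gradio as gr
+
+ def fetch_job_statuses():
+     """Hypothetical helper: poll the HF Jobs / Modal APIs and return table rows."""
+     return [["job-123", "running"], ["job-456", "completed"]]
+
+ with gr.Blocks() as demo:
+     status_table = gr.Dataframe(headers=["job_id", "status"])
+     refresh_btn = gr.Button("Refresh")
+     timer = gr.Timer(30)  # fires every 30 seconds
+
+     refresh_btn.click(fn=fetch_job_statuses, outputs=status_table)
+     timer.tick(fn=fetch_job_statuses, outputs=status_table)
+ ```
+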
+ #### screens/trace_detail.py
+
+ **Purpose**: Visualize OpenTelemetry traces with GPU metrics.
+
+ **Components**:
+ - Waterfall diagram (spans timeline)
+ - Span details panel
+ - GPU metrics overlay (for GPU jobs)
+ - MCP-powered Q&A
+
+ **Trace Loading**:
+ ```python
+ from datasets import load_dataset
+
+ def load_trace(trace_id, traces_repo):
+     # Load trace dataset
+     ds = load_dataset(traces_repo, split="train")
+     trace_data = ds.filter(lambda x: x["trace_id"] == trace_id)[0]
+
+     # Render waterfall
+     waterfall_html = render_waterfall(trace_data["spans"])
+
+     return waterfall_html
+ ```
+
+ **MCP Q&A**:
+ ```python
+ def ask_trace_question(trace_id, traces_repo, question, mcp_client):
+     # Call MCP server to debug trace
+     answer = mcp_client.debug_trace(
+         trace_id=trace_id,
+         traces_repo=traces_repo,
+         question=question
+     )
+
+     return answer
+ ```
+
+ #### screens/settings.py
+
+ **Purpose**: Configure API keys and preferences.
+
+ **Security**:
+ - Keys stored in Gradio State (session-only, not server-side)
+ - All forms use `api_name=False` (not exposed via API)
+ - HTTPS encryption for all API calls
+
+ **Configuration Options**:
+ - Gemini API Key
+ - HuggingFace Token
+ - Modal Token ID + Secret
+ - LLM Provider Keys (OpenAI, Anthropic, etc.)
+
+ ---
+
+ ### 3. Component Layer (components/)
+
+ Reusable UI components that can be used across multiple screens.
+
+ #### components/leaderboard_table.py
+
+ **Purpose**: Custom HTML table with sorting, filtering, and styling.
+
+ **Why a Custom Component?**
+ - Gradio's default Dataframe component lacks advanced styling
+ - Need clickable rows for navigation
+ - Custom sorting and filtering logic
+ - Badge rendering for metrics
+
+ **Implementation**:
+ ```python
+ import pandas as pd
+
+ def render_leaderboard_table(df: pd.DataFrame) -> str:
+     """Render leaderboard as interactive HTML table"""
+
+     html = """
+     <style>
+         .leaderboard-table { ... }
+         .metric-badge { ... }
+     </style>
+     <table class="leaderboard-table">
+         <thead>
+             <tr>
+                 <th onclick="sortTable(0)">Model</th>
+                 <th onclick="sortTable(1)">Success Rate</th>
+                 <th onclick="sortTable(2)">Cost</th>
+                 ...
+             </tr>
+         </thead>
+         <tbody>
+     """
+
+     for idx, row in df.iterrows():
+         html += f"""
+         <tr onclick="selectRun('{row['run_id']}')">
+             <td>{row['model']}</td>
+             <td><span class="badge success">{row['success_rate']}%</span></td>
+             <td>${row['total_cost_usd']:.4f}</td>
+             ...
+         </tr>
+         """
+
+     html += """
+         </tbody>
+     </table>
+     <script>
+         function sortTable(col) { ... }
+         function selectRun(runId) {
+             // Trigger Gradio event to navigate to run detail
+             document.dispatchEvent(new CustomEvent('runSelected', {detail: runId}));
+         }
+     </script>
+     """
+
+     return html
+ ```
+
+ **Integration with Gradio**:
+ ```python
+ # In leaderboard screen (df comes from the DataLoader)
+ table_html = gr.HTML()
+
+ load_btn.click(
+     fn=lambda: render_leaderboard_table(df),
+     outputs=table_html
+ )
+ ```
+
+ #### components/analytics_charts.py
+
+ **Purpose**: Performance charts using Plotly.
+
+ **Charts Provided**:
+ - Success rate over time (line chart)
+ - Cost comparison (bar chart)
+ - Duration distribution (histogram)
+ - CO2 emissions by model (pie chart)
+
+ **Example**:
+ ```python
+ import plotly.graph_objects as go
+
+ def create_cost_comparison_chart(df):
+     fig = go.Figure(data=[
+         go.Bar(
+             x=df['model'],
+             y=df['total_cost_usd'],
+             marker_color='indianred'
+         )
+     ])
+
+     fig.update_layout(
+         title="Cost Comparison by Model",
+         xaxis_title="Model",
+         yaxis_title="Total Cost (USD)"
+     )
+
+     return fig
+ ```
+
+ #### components/thought_graph.py
+
+ **Purpose**: Visualize agent reasoning steps (for Agent Chat).
+
+ **Visualization**:
+ - Graph nodes: Reasoning steps, tool calls
+ - Edges: Flow between steps
+ - Annotations: Tool results, errors
+
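+ As a sketch of the idea (not the actual implementation), the steps could be serialized to Mermaid flowchart syntax and rendered through a `gr.Markdown` component; the `steps` structure here is an assumption:
+
+ ```python
+ def render_thought_graph(steps: list) -> str:
+     """Build Mermaid flowchart text from agent steps (illustrative sketch).
+
+     Assumes each step is a dict like {"label": "...", "kind": "reasoning" | "tool"}.
+     """
+     lines = ["flowchart TD"]
+     for i, step in enumerate(steps):
+         # Rounded nodes for tool calls, rectangles for reasoning steps
+         node = f"n{i}([{step['label']}])" if step["kind"] == "tool" else f"n{i}[{step['label']}]"
+         lines.append(f"    {node}")
+         if i > 0:
+             lines.append(f"    n{i - 1} --> n{i}")
+     return "\n".join(lines)  # wrap in a mermaid code fence when displaying
+ ```
+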
+ ---
+
+ ### 4. MCP Client Layer (mcp_client/)
+
+ #### mcp_client/client.py - Async MCP Client
+
+ **Purpose**: Connect to TraceMind MCP Server via MCP protocol.
+
+ **Implementation**: (See [MCP_INTEGRATION.md](MCP_INTEGRATION.md) for full code)
+
+ **Key Methods**:
+ - `connect()`: Establish SSE connection to MCP server
+ - `call_tool(tool_name, arguments)`: Call an MCP tool
+ - `analyze_leaderboard(**kwargs)`: Wrapper for analyze_leaderboard tool
+ - `estimate_cost(**kwargs)`: Wrapper for estimate_cost tool
+ - `debug_trace(**kwargs)`: Wrapper for debug_trace tool
+
+ #### mcp_client/sync_wrapper.py - Synchronous Wrapper
+
+ **Purpose**: Provide a synchronous API for Gradio event handlers.
+
+ **Why Needed?** Gradio event handlers are synchronous, but the MCP client is async.
+
+ **Pattern**:
+ ```python
+ import asyncio
+
+ class SyncMCPClient:
+     def __init__(self, mcp_server_url):
+         self.async_client = AsyncMCPClient(mcp_server_url)
+
+     def _run_async(self, coro):
+         """Run async coroutine in sync context"""
+         loop = asyncio.get_event_loop()
+         return loop.run_until_complete(coro)
+
+     def analyze_leaderboard(self, **kwargs):
+         """Synchronous wrapper"""
+         return self._run_async(self.async_client.analyze_leaderboard(**kwargs))
+ ```
+
+ ---
+
+ ### 5. Data Loader (data_loader.py)
+
+ **Purpose**: Load and cache HuggingFace datasets.
+
+ **Features**:
+ - In-memory caching (5-minute TTL)
+ - Error handling for missing datasets
+ - Automatic retry logic
+ - Dataset validation
+
+ **Implementation**:
+ ```python
+ import time
+
+ import pandas as pd
+ from datasets import load_dataset
+
+ class DataLoader:
+     def __init__(self):
+         self.cache = {}
+         self.cache_ttl = 300  # 5 minutes
+
+     def load_leaderboard(self, repo="kshitijthakkar/smoltrace-leaderboard"):
+         """Load leaderboard with caching"""
+         cache_key = f"leaderboard:{repo}"
+
+         # Check cache
+         if cache_key in self.cache:
+             cached_time, cached_data = self.cache[cache_key]
+             if time.time() - cached_time < self.cache_ttl:
+                 return cached_data
+
+         # Load fresh data
+         ds = load_dataset(repo, split="train")
+         df = pd.DataFrame(ds)
+
+         # Cache
+         self.cache[cache_key] = (time.time(), df)
+
+         return df
+
+     def load_results(self, repo):
+         """Load results dataset for specific run"""
+         ds = load_dataset(repo, split="train")
+         return pd.DataFrame(ds)
+
+     def load_traces(self, repo):
+         """Load traces dataset for specific run"""
+         ds = load_dataset(repo, split="train")
+         return ds  # Keep as Dataset for filtering
+ ```
+
+ ---
+
+ ## MCP Client Architecture
+
+ **Full details in**: [MCP_INTEGRATION.md](MCP_INTEGRATION.md)
+
+ **Summary**:
+ - **Async Client**: `mcp_client/client.py` - async MCP protocol implementation
+ - **Sync Wrapper**: `mcp_client/sync_wrapper.py` - synchronous API for Gradio
+ - **Global Instance**: Initialized once in `app.py`, shared across all screens
+
+ **Usage Pattern**:
+ ```python
+ # In app.py (initialization)
+ from mcp_client.sync_wrapper import get_sync_mcp_client
+ mcp_client = get_sync_mcp_client()
+ mcp_client.initialize()
+
+ # In a screen (usage)
+ def some_event_handler(mcp_client):
+     result = mcp_client.analyze_leaderboard(metric_focus="cost")
+     return result
+ ```
+
+ ---
+
+ ## Agent Framework Integration
+
+ **Full details in**: [MCP_INTEGRATION.md](MCP_INTEGRATION.md)
+
+ **Framework**: smolagents (HuggingFace's agent framework)
+
+ **Key Features**:
+ - Autonomous tool discovery from MCP server
+ - Multi-step reasoning with tool chaining
+ - Context-aware responses
+ - Reasoning visualization (optional)
+
+ **Agent Setup**:
+ ```python
+ from smolagents import ToolCallingAgent, MCPClient
+
+ agent = ToolCallingAgent(
+     tools=[],  # Empty - tools loaded from MCP server
+     model=HfApiModel("Qwen/Qwen2.5-Coder-32B-Instruct"),
+     mcp_client=MCPClient(MCP_SERVER_URL),
+     max_steps=10
+ )
+ ```
+
+ ---
+
+ ## Data Flow
+
+ ### Leaderboard Loading Flow
+
+ ```
+ 1. User clicks "Load Leaderboard"
+    ↓
+ 2. Gradio Event Handler (leaderboard.py)
+    load_leaderboard()
+    ↓
+ 3. Data Loader (data_loader.py)
+    ├─→ Check cache (5-min TTL)
+    │   └─→ If cached: return cached data
+    └─→ If not cached: load from HF Datasets
+        └─→ load_dataset("kshitijthakkar/smoltrace-leaderboard")
+    ↓
+ 4. MCP Client (sync_wrapper.py)
+    mcp_client.analyze_leaderboard(metric_focus="overall")
+    ↓
+ 5. MCP Server (TraceMind-mcp-server)
+    ├─→ Load data
+    ├─→ Call Gemini API
+    └─→ Return AI analysis
+    ↓
+ 6. Render Components
+    ├─→ AI Insights (Markdown)
+    └─→ Leaderboard Table (Custom HTML)
+    ↓
+ 7. Display to User
+ ```
+
+ ### Agent Chat Flow
+
+ ```
+ 1. User types message: "What are the top 3 models?"
+    ↓
+ 2. Gradio Event Handler (chat.py)
+    agent_chat(message, history, show_reasoning)
+    ↓
+ 3. smolagents Agent
+    agent.run(message)
+    ├─→ Step 1: Plan approach
+    │   └─→ "Need to get top models from leaderboard"
+    ├─→ Step 2: Discover MCP tools
+    │   └─→ Found: get_top_performers, analyze_leaderboard
+    ├─→ Step 3: Call MCP tool
+    │   └─→ get_top_performers(metric="success_rate", top_n=3)
+    ├─→ Step 4: Parse result
+    │   └─→ Extract model names, success rates, costs
+    └─→ Step 5: Format response
+        └─→ Generate markdown table with insights
+    ↓
+ 4. Return to user with full reasoning trace (if enabled)
+ ```
+
+ ### Job Submission Flow
+
+ ```
+ 1. User fills form → Clicks "Submit Evaluation"
+    ↓
+ 2. Gradio Event Handler (dashboard.py)
+    submit_job(model, agent_type, hardware, infrastructure)
+    ↓
+ 3. Job Submission Module (utils/)
+    if infrastructure == "HuggingFace Jobs":
+        ├─→ hf_jobs_submission.py
+        ├─→ Build job config (YAML)
+        ├─→ Submit via HF Jobs API
+        └─→ Return job_id
+    elif infrastructure == "Modal":
+        ├─→ modal_job_submission.py
+        ├─→ Build Modal app config
+        ├─→ Submit via Modal SDK
+        └─→ Return job_id
+    ↓
+ 4. Store job_id in session state
+    ↓
+ 5. Redirect to Job Monitoring screen
+    ↓
+ 6. Auto-refresh status every 30s
+ ```
+
+ ---
+
+ ## Authentication & Authorization
+
+ ### HuggingFace OAuth
+
+ **Implementation**: `utils/auth.py`
+
+ **Flow**:
+ ```
+ 1. User visits TraceMind-AI
+    ↓
+ 2. Check OAuth token in session
+    ├─→ If valid: proceed to app
+    └─→ If invalid: show login screen
+    ↓
+ 3. User clicks "Sign in with HuggingFace"
+    ↓
+ 4. Redirect to HuggingFace OAuth page
+    ├─→ User authorizes TraceMind-AI
+    └─→ HF redirects back with token
+    ↓
+ 5. Store token in Gradio State (session)
+    ↓
+ 6. Use token for:
+    ├─→ HF Datasets access
+    ├─→ HF Jobs submission
+    └─→ User identification
+ ```
+
+ **Code**:
+ ```python
+ # utils/auth.py
+ import gradio as gr
+
+ def auth_ui():
+     """Create OAuth login UI (Gradio's built-in HuggingFace OAuth button)"""
+     gr.LoginButton(value="Sign in with HuggingFace")
+
+ # In app.py
+ with gr.Blocks() as app:
+     if not DISABLE_OAUTH:
+         auth_ui()
+ ```
+
+ ### API Key Storage
+
+ **Strategy**: Session-only storage (not server-side persistence)
+
+ **Implementation**:
+ ```python
+ # In settings screen
+ import os
+
+ def save_api_keys(gemini_key, hf_token, session_state):
+     """Store keys in the session dict (wired to a gr.State output)"""
+     session_state["api_keys"] = {
+         "gemini_key": gemini_key,
+         "hf_token": hf_token
+     }
+
+     # Override default clients with user keys
+     if gemini_key:
+         os.environ["GEMINI_API_KEY"] = gemini_key
+     if hf_token:
+         os.environ["HF_TOKEN"] = hf_token
+
+     return "✅ API keys saved for this session", session_state
+ ```
+
+ **Security**:
+ - ✅ Keys stored only in browser memory
+ - ✅ Not saved to disk or database
+ - ✅ Forms use `api_name=False` (not exposed via API)
+ - ✅ HTTPS encryption
+
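+ When echoing a save confirmation back to the UI, a small masking helper keeps the secret itself out of logs and screenshots (illustrative sketch):
+
+ ```python
+ def mask_key(key: str) -> str:
+     """Show only the last 4 characters of a secret, e.g. '************abcd'."""
+     if not key:
+         return "(not set)"
+     return "*" * max(len(key) - 4, 0) + key[-4:]
+
+ # e.g. return f"✅ HF token saved: {mask_key(hf_token)}"
+ ```
+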
+ ---
+
+ ## Screen Navigation
+
+ ### State Management
+
+ **Pattern**: Gradio State components for session data
+
+ ```python
+ # In app.py
+ with gr.Blocks() as app:
+     # Global state
+     session_state = gr.State({
+         "user": None,
+         "current_run_id": None,
+         "current_trace_id": None,
+         "api_keys": {}
+     })
+
+     # Pass to all screens
+     leaderboard_screen(session_state)
+     chat_screen(session_state)
+ ```
+
+ ### Navigation Between Screens
+
+ **Pattern**: Click event triggers tab switch + state update
+
+ ```python
+ # In leaderboard screen
+ def row_click(run_id, session_state):
+     """Navigate to run detail when row clicked"""
+     session_state["current_run_id"] = run_id
+
+     # Switch to trace detail tab (tab index 4); Gradio 4+ returns an
+     # updated component instance instead of calling gr.Tabs.update()
+     return gr.Tabs(selected=4), session_state
+
+ table_component.select(
+     fn=row_click,
+     inputs=[gr.State(), session_state],
+     outputs=[main_tabs, session_state]
+ )
+ ```
+
+ ---
+
+ ## Job Submission Architecture
+
+ ### HuggingFace Jobs Integration
+
+ **File**: `utils/hf_jobs_submission.py`
+
+ **Key Functions**:
+ ```python
+ import requests
+
+ def submit_hf_job(model, agent_type, hardware, api_keys):
+     """Submit evaluation job to HuggingFace Jobs"""
+
+     # 1. Build job config (YAML)
+     job_config = {
+         "name": f"SMOLTRACE Eval - {model}",
+         "hardware": hardware,  # cpu-basic, t4-small, a10g-small, a100-large, h200
+         "environment": {
+             "MODEL": model,
+             "AGENT_TYPE": agent_type,
+             "HF_TOKEN": api_keys["hf_token"],
+             # ... other env vars
+         },
+         "command": [
+             "pip install smoltrace[otel,gpu]",
+             f"smoltrace-eval --model {model} --agent-type {agent_type} ..."
+         ]
+     }
+
+     # 2. Submit via HF Jobs API
+     response = requests.post(
+         "https://huggingface.co/api/jobs",
+         headers={"Authorization": f"Bearer {api_keys['hf_token']}"},
+         json=job_config
+     )
+
+     # 3. Return job ID
+     job_id = response.json()["id"]
+     return job_id
+ ```
+
+ ### Modal Integration
+
+ **File**: `utils/modal_job_submission.py`
+
+ **Key Functions**:
+ ```python
+ import modal
+
+ def submit_modal_job(model, agent_type, hardware, api_keys):
+     """Submit evaluation job to Modal"""
+
+     # 1. Create Modal app
+     app = modal.App("smoltrace-eval")
+
+     # 2. Define function with GPU
+     @app.function(
+         image=modal.Image.debian_slim().pip_install("smoltrace[otel,gpu]"),
+         gpu=hardware,  # A10, A100-80GB, H200
+         secrets=[
+             modal.Secret.from_dict({
+                 "HF_TOKEN": api_keys["hf_token"],
+                 # ... other secrets
+             })
+         ]
+     )
+     def run_evaluation():
+         import smoltrace
+         # Run evaluation
+         return smoltrace.evaluate(model=model, agent_type=agent_type)
+
+     # 3. Deploy and run; spawn() starts the job without blocking and
+     # returns a handle whose object_id serves as the job identifier
+     with app.run():
+         call = run_evaluation.spawn()
+
+     return call.object_id
+ ```
+
+ ---
+
+ ## Deployment
+
+ ### HuggingFace Spaces
+
+ **Platform**: HuggingFace Spaces
+ **SDK**: Gradio 5.49.1
+ **Hardware**: CPU Basic (upgradeable)
+ **URL**: https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind
+
+ ### Configuration
+
+ **Space Metadata** (README.md header):
+ ```yaml
+ ---
+ title: TraceMind AI
+ emoji: 🧠
+ colorFrom: indigo
+ colorTo: purple
+ sdk: gradio
+ sdk_version: 5.49.1
+ app_file: app.py
+ short_description: AI agent evaluation with MCP-powered intelligence
+ license: agpl-3.0
+ pinned: true
+ tags:
+ - mcp-in-action-track-enterprise
+ - agent-evaluation
+ - mcp-client
+ - leaderboard
+ - gradio
+ ---
+ ```
+
+ ### Environment Variables
+
+ **Set in HF Spaces Secrets**:
+ ```bash
+ # Required
+ GEMINI_API_KEY=your_gemini_key
+ HF_TOKEN=your_hf_token
+
+ # Optional
+ MCP_SERVER_URL=https://mcp-1st-birthday-tracemind-mcp-server.hf.space/gradio_api/mcp/sse
+ LEADERBOARD_REPO=kshitijthakkar/smoltrace-leaderboard
+ DISABLE_OAUTH=false  # Set to true for local development
+ ```
+
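+ For local development, the same variables can be read at startup with `os.getenv`, falling back to the documented defaults. This sketch assumes `python-dotenv` is used together with the `.env.example` template in the project root:
+
+ ```python
+ import os
+
+ from dotenv import load_dotenv  # assumption: python-dotenv, paired with .env.example
+
+ load_dotenv()  # no-op on HF Spaces, where secrets arrive as env vars
+
+ GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")  # required
+ HF_TOKEN = os.getenv("HF_TOKEN")              # required
+ MCP_SERVER_URL = os.getenv(
+     "MCP_SERVER_URL",
+     "https://mcp-1st-birthday-tracemind-mcp-server.hf.space/gradio_api/mcp/sse",
+ )
+ LEADERBOARD_REPO = os.getenv("LEADERBOARD_REPO", "kshitijthakkar/smoltrace-leaderboard")
+ DISABLE_OAUTH = os.getenv("DISABLE_OAUTH", "false").lower() == "true"
+ ```
+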
+ ---
+
+ ## Performance Optimization
+
+ ### 1. Data Caching
+
+ **Implementation**: `data_loader.py`
+ - In-memory cache with 5-minute TTL
+ - Reduces HF Datasets API calls
+ - Faster page loads (see the usage sketch below)
+
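+ Usage is transparent to callers; within the TTL window a second call is served from memory:
+
+ ```python
+ from data_loader import DataLoader
+
+ loader = DataLoader()
+ df = loader.load_leaderboard()        # first call: fetched from HF Datasets
+ df_again = loader.load_leaderboard()  # within 5 minutes: returned from cache
+ ```
+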
+ ### 2. Async MCP Calls
+
+ **Pattern**: Use async for non-blocking I/O
+ ```python
+ # Could be optimized to run in parallel (the *_async helpers are illustrative)
+ async def load_data_with_insights():
+     leaderboard_task = load_dataset_async(...)
+     insights_task = mcp_client.analyze_leaderboard_async(...)
+
+     leaderboard, insights = await asyncio.gather(leaderboard_task, insights_task)
+     return leaderboard, insights
+ ```
+
+ ### 3. Component Lazy Loading
+
+ **Strategy**: Load components only when tabs are activated
+ ```python
+ with gr.Tab("Trace Detail", visible=False) as trace_tab:
+     # Components created only when tab first shown
+     @trace_tab.select
+     def load_trace_components():
+         return build_trace_visualization()
+ ```
+
+ ---
+
+ ## Related Documentation
+
+ - [README.md](README.md) - Overview and quick start
+ - [USER_GUIDE.md](USER_GUIDE.md) - Complete screen-by-screen guide
+ - [MCP_INTEGRATION.md](MCP_INTEGRATION.md) - MCP client implementation
+ - [TraceMind MCP Server Architecture](ARCHITECTURE_MCP_SERVER.md) - Server-side architecture
+
+ ---
+
+ **Last Updated**: November 21, 2025
+ **Version**: 1.0.0
+ **Track**: MCP in Action (Enterprise)
MCP_INTEGRATION.md ADDED
@@ -0,0 +1,706 @@
1
+ # TraceMind-AI - MCP Integration Guide
2
+
3
+ This document explains how TraceMind-AI integrates with MCP servers to provide AI-powered agent evaluation.
4
+
5
+ ## Table of Contents
6
+
7
+ - [Overview](#overview)
8
+ - [Dual MCP Integration](#dual-mcp-integration)
9
+ - [Architecture](#architecture)
10
+ - [MCP Client Implementation](#mcp-client-implementation)
11
+ - [Agent Framework Integration](#agent-framework-integration)
12
+ - [MCP Tools Usage](#mcp-tools-usage)
13
+ - [Development Guide](#development-guide)
14
+
15
+ ---
16
+
17
+ ## Overview
18
+
19
+ TraceMind-AI demonstrates **enterprise MCP client usage** as part of the **Track 2: MCP in Action** submission. It showcases two distinct patterns of MCP integration:
20
+
21
+ 1. **Direct MCP Client**: Python-based client connecting to remote MCP server via SSE transport
22
+ 2. **Autonomous Agent**: `smolagents`-based agent with access to MCP tools for multi-step reasoning
23
+
24
+ Both patterns consume the same MCP server ([TraceMind-mcp-server](https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind-mcp-server)) to provide AI-powered analysis of agent evaluation data.
25
+
26
+ ---
27
+
28
+ ## Dual MCP Integration
29
+
30
+ ### Pattern 1: Direct MCP Client Integration
31
+
32
+ **Where**: Leaderboard insights, cost estimation dialogs, trace debugging
33
+
34
+ **How it works**:
35
+ ```python
36
+ # TraceMind-AI calls MCP server directly
37
+ mcp_client = get_sync_mcp_client()
38
+ insights = mcp_client.analyze_leaderboard(
39
+ metric_focus="overall",
40
+ time_range="last_week",
41
+ top_n=5
42
+ )
43
+ # Display insights in UI
44
+ ```
45
+
46
+ **Use cases**:
47
+ - Generate leaderboard insights when user clicks "Load Leaderboard"
48
+ - Estimate costs when user clicks "Estimate Cost" in New Evaluation form
49
+ - Debug traces when user asks questions in trace visualization
50
+
51
+ **Advantages**:
52
+ - Direct, fast execution
53
+ - Synchronous API (easy to integrate with Gradio)
54
+ - Predictable, structured responses
55
+
56
+ ---
57
+
58
+ ### Pattern 2: Autonomous Agent with MCP Tools
59
+
60
+ **Where**: Agent Chat tab
61
+
62
+ **How it works**:
63
+ ```python
64
+ # smolagents agent discovers and uses MCP tools autonomously
65
+ from smolagents import ToolCallingAgent, MCPClient
66
+
67
+ # Agent initialized with MCP client
68
+ agent = ToolCallingAgent(
69
+ tools=[], # Tools loaded from MCP server
70
+ model=model_client,
71
+ mcp_client=MCPClient(mcp_server_url)
72
+ )
73
+
74
+ # User asks question
75
+ result = agent.run("What are the top 3 models and their costs?")
76
+
77
+ # Agent plans:
78
+ # 1. Call get_top_performers MCP tool
79
+ # 2. Extract costs from results
80
+ # 3. Format and present to user
81
+ ```
82
+
83
+ **Use cases**:
84
+ - Answer complex questions requiring multi-step analysis
85
+ - Compare models across multiple dimensions
86
+ - Plan evaluation strategies with cost estimates
87
+ - Provide recommendations based on leaderboard data
88
+
89
+ **Advantages**:
90
+ - Natural language interface
91
+ - Multi-step reasoning
92
+ - Autonomous tool selection
93
+ - Context-aware responses
94
+
95
+ ---
96
+
97
+ ## Architecture
98
+
99
+ ### System Overview
100
+
101
+ ```
102
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
103
+ β”‚ TraceMind-AI (Gradio App) - Track 2 β”‚
104
+ β”‚ β”‚
105
+ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
106
+ β”‚ β”‚ UI Layer (Gradio) β”‚ β”‚
107
+ β”‚ β”‚ - Leaderboard tab β”‚ β”‚
108
+ β”‚ β”‚ - Agent Chat tab β”‚ β”‚
109
+ β”‚ β”‚ - New Evaluation tab β”‚ β”‚
110
+ β”‚ β”‚ - Trace Visualization tab β”‚ β”‚
111
+ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
112
+ β”‚ ↓ ↓ β”‚
113
+ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
114
+ β”‚ β”‚ Direct MCP Client β”‚ β”‚ Autonomous Agent β”‚ β”‚
115
+ β”‚ β”‚ (sync_wrapper.py) β”‚ β”‚ (smolagents) β”‚ β”‚
116
+ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚
117
+ β”‚ β”‚ - Synchronous API β”‚ β”‚ - Multi-step reasoning β”‚ β”‚
118
+ β”‚ β”‚ - Tool calling β”‚ β”‚ - Tool discovery β”‚ β”‚
119
+ β”‚ β”‚ - Error handling β”‚ β”‚ - Context management β”‚ β”‚
120
+ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
121
+ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
122
+ β”‚ ↓ β”‚
123
+ β”‚ MCP Protocol β”‚
124
+ β”‚ (SSE Transport) β”‚
125
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
126
+ ↓
127
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
128
+ β”‚ TraceMind MCP Server - Track 1 β”‚
129
+ β”‚ https://huggingface.co/spaces/MCP-1st-Birthday/ β”‚
130
+ β”‚ TraceMind-mcp-server β”‚
131
+ β”‚ β”‚
132
+ β”‚ 11 AI-Powered Tools: β”‚
133
+ β”‚ - analyze_leaderboard β”‚
134
+ β”‚ - debug_trace β”‚
135
+ β”‚ - estimate_cost β”‚
136
+ β”‚ - compare_runs β”‚
137
+ β”‚ - analyze_results β”‚
138
+ β”‚ - get_top_performers β”‚
139
+ β”‚ - get_leaderboard_summary β”‚
140
+ β”‚ - get_dataset β”‚
141
+ β”‚ - generate_synthetic_dataset β”‚
142
+ β”‚ - push_dataset_to_hub β”‚
143
+ β”‚ - generate_prompt_template β”‚
144
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
145
+ ```
146
+
147
+ ---
148
+
149
+ ## MCP Client Implementation
150
+
151
+ ### File Structure
152
+
153
+ ```
154
+ TraceMind-AI/
155
+ β”œβ”€β”€ mcp_client/
156
+ β”‚ β”œβ”€β”€ __init__.py
157
+ β”‚ β”œβ”€β”€ client.py # Async MCP client
158
+ β”‚ └── sync_wrapper.py # Synchronous wrapper for Gradio
159
+ β”œβ”€β”€ agent/
160
+ β”‚ β”œβ”€β”€ __init__.py
161
+ β”‚ └── smolagents_setup.py # Agent with MCP integration
162
+ └── app.py # Main Gradio app
163
+ ```
164
+
165
+ ### Async MCP Client (`client.py`)
166
+
167
+ ```python
168
+ from mcp import ClientSession, StdioServerParameters
169
+ import mcp.types as types
170
+
171
+ class TraceMindMCPClient:
172
+ """Async MCP client for TraceMind MCP Server"""
173
+
174
+ def __init__(self, mcp_server_url: str):
175
+ self.mcp_server_url = mcp_server_url
176
+ self.session = None
177
+
178
+ async def connect(self):
179
+ """Establish connection to MCP server via SSE"""
180
+ # For HTTP-based MCP servers (HuggingFace Spaces)
181
+ self.session = ClientSession(
182
+ ServerParameters(
183
+ url=self.mcp_server_url,
184
+ transport="sse"
185
+ )
186
+ )
187
+ await self.session.__aenter__()
188
+
189
+ # List available tools
190
+ tools_result = await self.session.list_tools()
191
+ self.available_tools = {tool.name: tool for tool in tools_result.tools}
192
+
193
+ print(f"Connected to MCP server. Available tools: {list(self.available_tools.keys())}")
194
+
195
+ async def call_tool(self, tool_name: str, arguments: dict) -> str:
196
+ """Call an MCP tool with given arguments"""
197
+ if not self.session:
198
+ raise RuntimeError("MCP client not connected. Call connect() first.")
199
+
200
+ if tool_name not in self.available_tools:
201
+ raise ValueError(f"Tool '{tool_name}' not available. Available: {list(self.available_tools.keys())}")
202
+
203
+ # Call the tool
204
+ result = await self.session.call_tool(tool_name, arguments=arguments)
205
+
206
+ # Extract text response
207
+ if result.content and len(result.content) > 0:
208
+ return result.content[0].text
209
+ return ""
210
+
211
+ async def analyze_leaderboard(self, **kwargs) -> str:
212
+ """Wrapper for analyze_leaderboard tool"""
213
+ return await self.call_tool("analyze_leaderboard", kwargs)
214
+
215
+ async def estimate_cost(self, **kwargs) -> str:
216
+ """Wrapper for estimate_cost tool"""
217
+ return await self.call_tool("estimate_cost", kwargs)
218
+
219
+ async def debug_trace(self, **kwargs) -> str:
220
+ """Wrapper for debug_trace tool"""
221
+ return await self.call_tool("debug_trace", kwargs)
222
+
223
+ async def compare_runs(self, **kwargs) -> str:
224
+ """Wrapper for compare_runs tool"""
225
+ return await self.call_tool("compare_runs", kwargs)
226
+
227
+ async def get_top_performers(self, **kwargs) -> str:
228
+ """Wrapper for get_top_performers tool"""
229
+ return await self.call_tool("get_top_performers", kwargs)
230
+
231
+ async def disconnect(self):
232
+ """Close MCP connection"""
233
+ if self.session:
234
+ await self.session.__aexit__(None, None, None)
235
+ ```
236
+
237
+ ### Synchronous Wrapper (`sync_wrapper.py`)
238
+
239
+ ```python
240
+ import asyncio
241
+ from typing import Optional
242
+ from .client import TraceMindMCPClient
243
+
244
+ class SyncMCPClient:
245
+ """Synchronous wrapper for async MCP client (Gradio-compatible)"""
246
+
247
+ def __init__(self, mcp_server_url: str):
248
+ self.mcp_server_url = mcp_server_url
249
+ self.async_client = TraceMindMCPClient(mcp_server_url)
250
+ self._connected = False
251
+
252
+ def _run_async(self, coro):
253
+ """Run async coroutine in sync context"""
254
+ try:
255
+ loop = asyncio.get_event_loop()
256
+ except RuntimeError:
257
+ loop = asyncio.new_event_loop()
258
+ asyncio.set_event_loop(loop)
259
+
260
+ return loop.run_until_complete(coro)
261
+
262
+ def initialize(self):
263
+ """Connect to MCP server"""
264
+ if not self._connected:
265
+ self._run_async(self.async_client.connect())
266
+ self._connected = True
267
+
268
+ def analyze_leaderboard(self, **kwargs) -> str:
269
+ """Synchronous wrapper for analyze_leaderboard"""
270
+ if not self._connected:
271
+ self.initialize()
272
+ return self._run_async(self.async_client.analyze_leaderboard(**kwargs))
273
+
274
+ def estimate_cost(self, **kwargs) -> str:
275
+ """Synchronous wrapper for estimate_cost"""
276
+ if not self._connected:
277
+ self.initialize()
278
+ return self._run_async(self.async_client.estimate_cost(**kwargs))
279
+
280
+ def debug_trace(self, **kwargs) -> str:
281
+ """Synchronous wrapper for debug_trace"""
282
+ if not self._connected:
283
+ self.initialize()
284
+ return self._run_async(self.async_client.debug_trace(**kwargs))
285
+
286
+ # ... (similar wrappers for other tools)
287
+
288
+ # Global instance for use in Gradio app
289
+ _mcp_client: Optional[SyncMCPClient] = None
290
+
291
+ def get_sync_mcp_client() -> SyncMCPClient:
292
+ """Get or create global sync MCP client instance"""
293
+ global _mcp_client
294
+ if _mcp_client is None:
295
+ mcp_server_url = os.getenv(
296
+ "MCP_SERVER_URL",
297
+ "https://mcp-1st-birthday-tracemind-mcp-server.hf.space/gradio_api/mcp/sse"
298
+ )
299
+ _mcp_client = SyncMCPClient(mcp_server_url)
300
+ return _mcp_client
301
+ ```
302
+
303
+ ### Usage in Gradio App
304
+
305
+ ```python
306
+ # app.py
307
+ from mcp_client.sync_wrapper import get_sync_mcp_client
308
+
309
+ # Initialize MCP client
310
+ mcp_client = get_sync_mcp_client()
311
+ mcp_client.initialize()
312
+
313
+ # Use in Gradio event handlers
314
+ def load_leaderboard():
315
+ """Load leaderboard and generate AI insights"""
316
+ # Load dataset
317
+ ds = load_dataset("kshitijthakkar/smoltrace-leaderboard")
318
+ df = pd.DataFrame(ds)
319
+
320
+ # Get AI insights from MCP server
321
+ try:
322
+ insights = mcp_client.analyze_leaderboard(
323
+ metric_focus="overall",
324
+ time_range="last_week",
325
+ top_n=5
326
+ )
327
+ except Exception as e:
328
+ insights = f"❌ Error generating insights: {str(e)}"
329
+
330
+ return df, insights
331
+
332
+ # Gradio UI
333
+ with gr.Blocks() as app:
334
+ with gr.Tab("πŸ“Š Leaderboard"):
335
+ load_btn = gr.Button("Load Leaderboard")
336
+ insights_md = gr.Markdown(label="AI Insights")
337
+ leaderboard_table = gr.Dataframe()
338
+
339
+ load_btn.click(
340
+ fn=load_leaderboard,
341
+ outputs=[leaderboard_table, insights_md]
342
+ )
343
+ ```
344
+
345
+ ---
346
+
347
+ ## Agent Framework Integration
348
+
349
+ ### smolagents Setup
350
+
351
+ ```python
352
+ # agent/smolagents_setup.py
353
+ from smolagents import ToolCallingAgent, MCPClient, HfApiModel
354
+ import os
355
+
356
+ def create_agent():
357
+ """Create smolagents agent with MCP tool access"""
358
+
359
+ # 1. Configure MCP client
360
+ mcp_server_url = os.getenv(
361
+ "MCP_SERVER_URL",
362
+ "https://mcp-1st-birthday-tracemind-mcp-server.hf.space/gradio_api/mcp/sse"
363
+ )
364
+
365
+ mcp_client = MCPClient(mcp_server_url)
366
+
367
+ # 2. Configure LLM
368
+ model = HfApiModel(
369
+ model_id="Qwen/Qwen2.5-Coder-32B-Instruct",
370
+ token=os.getenv("HF_TOKEN")
371
+ )
372
+
373
+ # 3. Create agent with MCP tools
374
+ agent = ToolCallingAgent(
375
+ tools=[], # MCP tools loaded automatically
376
+ model=model,
377
+ mcp_client=mcp_client,
378
+ max_steps=10,
379
+ verbosity_level=1
380
+ )
381
+
382
+ return agent
383
+
384
+ def run_agent_query(agent: ToolCallingAgent, query: str, show_reasoning: bool = False):
385
+ """Run agent query and return response"""
386
+ try:
387
+ # Set verbosity based on show_reasoning flag
388
+ if show_reasoning:
389
+ agent.verbosity_level = 2 # Show tool execution logs
390
+ else:
391
+ agent.verbosity_level = 0 # Only show final answer
392
+
393
+ # Run agent
394
+ result = agent.run(query)
395
+
396
+ return result
397
+ except Exception as e:
398
+ return f"❌ Agent error: {str(e)}"
399
+ ```
400
+
401
+ ### Agent Chat UI
402
+
403
+ ```python
404
+ # app.py
405
+ from agent.smolagents_setup import create_agent, run_agent_query
406
+
407
+ # Initialize agent (once at startup)
408
+ agent = create_agent()
409
+
410
+ def agent_chat(message: str, history: list, show_reasoning: bool):
411
+ """Handle agent chat interaction"""
412
+ # Run agent query
413
+ response = run_agent_query(agent, message, show_reasoning)
414
+
415
+ # Update chat history
416
+ history.append((message, response))
417
+
418
+ return history, ""
419
+
420
+ # Gradio UI
421
+ with gr.Blocks() as app:
422
+ with gr.Tab("πŸ€– Agent Chat"):
423
+ gr.Markdown("## Autonomous Agent with MCP Tools")
424
+ gr.Markdown("Ask questions about agent evaluations. The agent has access to all MCP tools.")
425
+
426
+ chatbot = gr.Chatbot(label="Agent Chat")
427
+ msg = gr.Textbox(label="Your Question", placeholder="What are the top 3 models and their costs?")
428
+ show_reasoning = gr.Checkbox(label="Show Agent Reasoning", value=False)
429
+
430
+ # Quick action buttons
431
+ with gr.Row():
432
+ quick_top = gr.Button("Quick: Top Models")
433
+ quick_cost = gr.Button("Quick: Cost Estimate")
434
+ quick_load = gr.Button("Quick: Load Leaderboard")
435
+
436
+ # Event handlers
437
+ msg.submit(agent_chat, [msg, chatbot, show_reasoning], [chatbot, msg])
438
+
439
+ quick_top.click(
440
+ lambda h, sr: agent_chat(
441
+ "What are the top 5 models by success rate with their costs?",
442
+ h,
443
+ sr
444
+ ),
445
+ [chatbot, show_reasoning],
446
+ [chatbot, msg]
447
+ )
448
+ ```
449
+
450
+ ---
451
+
452
+ ## MCP Tools Usage
453
+
454
+ ### Tools Used in TraceMind-AI
455
+
456
+ | Tool | Where Used | Purpose |
457
+ |------|-----------|---------|
458
+ | `analyze_leaderboard` | Leaderboard tab | Generate AI insights when user loads leaderboard |
459
+ | `estimate_cost` | New Evaluation tab | Predict costs before submitting evaluation |
460
+ | `debug_trace` | Trace Visualization | Answer questions about execution traces |
461
+ | `compare_runs` | Agent Chat | Compare two evaluation runs side-by-side |
462
+ | `analyze_results` | Agent Chat | Analyze detailed test results with optimization recommendations |
463
+ | `get_top_performers` | Agent Chat | Efficiently fetch top N models (90% token reduction) |
464
+ | `get_leaderboard_summary` | Agent Chat | Get high-level statistics (99% token reduction) |
465
+ | `get_dataset` | Agent Chat | Load SMOLTRACE datasets for detailed analysis |
466
+
467
+ ### Example Tool Calls
468
+
469
+ **Example 1: Leaderboard Insights**
470
+ ```python
471
+ # User clicks "Load Leaderboard" button
472
+ insights = mcp_client.analyze_leaderboard(
473
+ leaderboard_repo="kshitijthakkar/smoltrace-leaderboard",
474
+ metric_focus="overall",
475
+ time_range="last_week",
476
+ top_n=5
477
+ )
478
+
479
+ # Display in Gradio Markdown component
480
+ insights_md.value = insights
481
+ ```
482
+
483
+ **Example 2: Cost Estimation**
484
+ ```python
485
+ # User fills New Evaluation form and clicks "Estimate Cost"
486
+ estimate = mcp_client.estimate_cost(
487
+ model="meta-llama/Llama-3.1-8B",
488
+ agent_type="both",
489
+ num_tests=100,
490
+ hardware="auto"
491
+ )
492
+
493
+ # Display in dialog
494
+ gr.Info(estimate)
495
+ ```
496
+
497
+ **Example 3: Agent Multi-Step Query**
498
+ ```python
499
+ # User asks: "What are the top 3 models and how much do they cost?"
500
+
501
+ # Agent reasoning (internal):
502
+ # Step 1: Need to get top models by success rate
503
+ # β†’ Call get_top_performers(metric="success_rate", top_n=3)
504
+ #
505
+ # Step 2: Extract cost information from results
506
+ # β†’ Parse JSON response, get "total_cost_usd" field
507
+ #
508
+ # Step 3: Format response for user
509
+ # β†’ Create markdown table with model names, success rates, costs
510
+
511
+ # Agent response:
512
+ """
513
+ Here are the top 3 models by success rate:
514
+
515
+ 1. **GPT-4**: 95.8% success rate, $0.05 per run
516
+ 2. **Claude-3**: 94.1% success rate, $0.04 per run
517
+ 3. **Llama-3.1-8B**: 93.4% success rate, $0.002 per run
518
+
519
+ GPT-4 leads in accuracy but is 25x more expensive than Llama-3.1.
520
+ For cost-sensitive workloads, Llama-3.1 offers the best value.
521
+ """
522
+ ```
523
+
524
+ ---
525
+
526
+ ## Development Guide
527
+
528
+ ### Adding New MCP Tool Integration
529
+
530
+ 1. **Add method to async client** (`client.py`):
531
+ ```python
532
+ async def new_tool_name(self, **kwargs) -> str:
533
+ """Wrapper for new_tool_name MCP tool"""
534
+ return await self.call_tool("new_tool_name", kwargs)
535
+ ```
536
+
537
+ 2. **Add synchronous wrapper** (`sync_wrapper.py`):
538
+ ```python
539
+ def new_tool_name(self, **kwargs) -> str:
540
+ """Synchronous wrapper for new_tool_name"""
541
+ if not self._connected:
542
+ self.initialize()
543
+ return self._run_async(self.async_client.new_tool_name(**kwargs))
544
+ ```
545
+
546
+ 3. **Use in Gradio app** (`app.py`):
547
+ ```python
548
+ def handle_new_tool():
549
+ result = mcp_client.new_tool_name(param1="value1", param2="value2")
550
+ return result
551
+ ```
552
+
553
+ **Note**: Agent automatically discovers new tools from MCP server, no code changes needed!
554
+
555
+ ### Testing MCP Integration
556
+
557
+ **Test 1: Connection**
558
+ ```python
559
+ python -c "from mcp_client.sync_wrapper import get_sync_mcp_client; client = get_sync_mcp_client(); client.initialize(); print('βœ… MCP client connected')"
560
+ ```
561
+
562
+ **Test 2: Tool Call**
563
+ ```python
564
+ from mcp_client.sync_wrapper import get_sync_mcp_client
565
+
566
+ client = get_sync_mcp_client()
567
+ client.initialize()
568
+
569
+ result = client.analyze_leaderboard(
570
+ metric_focus="cost",
571
+ time_range="last_week",
572
+ top_n=3
573
+ )
574
+
575
+ print(result)
576
+ ```
577
+
578
+ **Test 3: Agent**
579
+ ```python
580
+ from agent.smolagents_setup import create_agent, run_agent_query
581
+
582
+ agent = create_agent()
583
+ response = run_agent_query(agent, "What are the top 3 models?", show_reasoning=True)
584
+ print(response)
585
+ ```
586
+
587
+ ### Debugging MCP Issues
588
+
589
+ **Issue**: Connection timeout
590
+ - **Check**: MCP server is running at specified URL
591
+ - **Check**: Network connectivity to HuggingFace Spaces
592
+ - **Check**: SSE transport is enabled on server
593
+
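+ A quick way to rule out network problems is to probe the SSE endpoint directly (an illustrative check; any HTTP client works):
+
+ ```python
+ import requests
+
+ # Request the MCP server's SSE endpoint without consuming the stream;
+ # a 200 status means the endpoint is reachable from this machine.
+ url = "https://mcp-1st-birthday-tracemind-mcp-server.hf.space/gradio_api/mcp/sse"
+ resp = requests.get(url, stream=True, timeout=10)
+ print(resp.status_code)
+ resp.close()
+ ```
+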
594
+ **Issue**: Tool not found
595
+ - **Check**: MCP server has the tool implemented
596
+ - **Check**: Tool name matches exactly (case-sensitive)
597
+ - **Check**: Client initialized successfully (call `initialize()` first)
598
+
599
+ **Issue**: Agent not using MCP tools
600
+ - **Check**: MCPClient is properly configured in agent setup
601
+ - **Check**: Agent has `max_steps > 0` to allow tool usage
602
+ - **Check**: Query requires tool usage (not answerable from agent's knowledge alone)
603
+
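+ To confirm the MCP tools actually reached the agent, listing them is usually enough (a sketch; the exact attribute may vary across smolagents versions):
+
+ ```python
+ from agent.smolagents_setup import create_agent
+
+ agent = create_agent()
+ print(list(agent.tools))  # MCP tool names should appear in this list
+ ```
+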
604
+ ---
605
+
606
+ ## Performance Considerations
607
+
608
+ ### Token Optimization
609
+
610
+ **Problem**: Loading full leaderboard dataset consumes excessive tokens
611
+ **Solution**: Use token-optimized MCP tools
612
+
613
+ ```python
614
+ # ❌ BAD: Loads all 51 runs (50K+ tokens)
615
+ leaderboard = mcp_client.get_dataset("kshitijthakkar/smoltrace-leaderboard")
616
+
617
+ # ✅ GOOD: Returns only top 5 (5K tokens, 90% reduction)
618
+ top_performers = mcp_client.get_top_performers(top_n=5)
619
+
620
+ # ✅ BETTER: Returns summary stats (500 tokens, 99% reduction)
621
+ summary = mcp_client.get_leaderboard_summary()
622
+ ```
623
+
624
+ ### Caching
625
+
626
+ **Problem**: Repeated identical MCP calls waste time and credits
627
+ **Solution**: Implement client-side caching
628
+
629
+ ```python
630
+ from functools import lru_cache
631
+ import time
632
+
633
+ @lru_cache(maxsize=32)
634
+ def cached_analyze_leaderboard(metric_focus: str, time_range: str, top_n: int, cache_key: int):
635
+ """Cached MCP call with TTL via cache_key"""
636
+ return mcp_client.analyze_leaderboard(
637
+ metric_focus=metric_focus,
638
+ time_range=time_range,
639
+ top_n=top_n
640
+ )
641
+
642
+ # Use with 5-minute cache TTL
643
+ cache_key = int(time.time() // 300) # Changes every 5 minutes
644
+ insights = cached_analyze_leaderboard("overall", "last_week", 5, cache_key)
645
+ ```
646
+
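+ The trick: `lru_cache` has no native TTL, but because `cache_key` changes every 300 seconds, calls in a new window are cache misses and refresh the result automatically.
+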
647
+ ### Async Optimization
648
+
649
+ **Problem**: Sequential MCP calls block UI
650
+ **Solution**: Use async for parallel calls
651
+
652
+ ```python
653
+ import asyncio
654
+
655
+ async def load_leaderboard_with_insights():
656
+ """Load leaderboard and insights in parallel"""
657
+ # Start both operations concurrently. Note: this requires the *async*
+ # MCP client (mcp_client/client.py) -- asyncio.create_task() needs
+ # coroutines, so the sync wrapper won't work here. load_dataset_async
+ # is an illustrative helper for non-blocking dataset loading.
+ leaderboard_task = asyncio.create_task(load_dataset_async("kshitijthakkar/smoltrace-leaderboard"))
+ insights_task = asyncio.create_task(mcp_client.analyze_leaderboard(metric_focus="overall"))
660
+
661
+ # Wait for both to complete
662
+ leaderboard, insights = await asyncio.gather(leaderboard_task, insights_task)
663
+
664
+ return leaderboard, insights
665
+ ```
666
+
667
+ ---
668
+
669
+ ## Security Considerations
670
+
671
+ ### API Key Management
672
+
673
+ **DO**:
674
+ - Store API keys in environment variables or HF Spaces secrets
675
+ - Use session-only storage in Gradio (not server-side persistence)
676
+ - Rotate keys regularly
677
+
678
+ **DON'T**:
679
+ - Hardcode API keys in source code
680
+ - Expose keys in client-side JavaScript
681
+ - Log API keys in console or files
682
+
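+ A minimal sketch of the recommended pattern (names are illustrative; HF Spaces secrets are exposed to the app as environment variables):
+
+ ```python
+ import os
+
+ # Read keys from the environment at startup -- never hardcode them.
+ GEMINI_API_KEY = os.environ.get("GEMINI_API_KEY", "")
+
+ def mask(key: str) -> str:
+     """Show only the last 4 characters so logs stay safe."""
+     return f"***{key[-4:]}" if key else "(unset)"
+
+ print(f"Gemini key loaded: {mask(GEMINI_API_KEY)}")
+ ```
+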
683
+ ### MCP Server Trust
684
+
685
+ **Verify MCP server authenticity**:
686
+ - Use HTTPS URLs only
687
+ - Verify domain ownership (huggingface.co spaces)
688
+ - Review MCP server code before connecting (open source)
689
+
690
+ **Limit tool access**:
691
+ - Only connect to trusted MCP servers
692
+ - Review tool permissions before use
693
+ - Implement rate limiting for tool calls
694
+
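+ Rate limiting can live entirely on the client side; a small sketch (thresholds are illustrative):
+
+ ```python
+ import time
+
+ class RateLimiter:
+     """Allow at most `max_calls` tool calls per `window` seconds."""
+
+     def __init__(self, max_calls: int = 10, window: float = 60.0):
+         self.max_calls, self.window = max_calls, window
+         self.calls: list[float] = []
+
+     def acquire(self) -> None:
+         now = time.time()
+         self.calls = [t for t in self.calls if now - t < self.window]
+         if len(self.calls) >= self.max_calls:
+             time.sleep(self.window - (now - self.calls[0]))
+         self.calls.append(time.time())
+
+ limiter = RateLimiter()
+ limiter.acquire()  # call before each MCP tool invocation
+ ```
+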
695
+ ---
696
+
697
+ ## Related Documentation
698
+
699
+ - [USER_GUIDE.md](USER_GUIDE.md) - Complete UI walkthrough
+ - [JOB_SUBMISSION.md](JOB_SUBMISSION.md) - Evaluation job guide
+ - [ARCHITECTURE.md](ARCHITECTURE.md) - Technical architecture
702
+ - [TraceMind MCP Server Documentation](https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind-mcp-server)
703
+
704
+ ---
705
+
706
+ **Last Updated**: November 21, 2025
README.md CHANGED
@@ -20,474 +20,449 @@ tags:
20
  # 🧠 TraceMind-AI
21
 
22
  <p align="center">
23
- <img src="https://raw.githubusercontent.com/Mandark-droid/TraceMind-AI/assets/TraceVerse_Logo.png" alt="TraceVerse Ecosystem" width="400"/>
24
- <br/>
25
- <br/>
26
  <img src="https://raw.githubusercontent.com/Mandark-droid/TraceMind-AI/assets/Logo.png" alt="TraceMind-AI Logo" width="200"/>
27
  </p>
28
 
29
  **Agent Evaluation Platform with MCP-Powered Intelligence**
30
 
31
  [![MCP's 1st Birthday Hackathon](https://img.shields.io/badge/MCP%27s%201st%20Birthday-Hackathon-blue)](https://github.com/modelcontextprotocol)
32
- [![Track](https://img.shields.io/badge/Track-MCP%20in%20Action%20(Enterprise)-purple)](https://github.com/modelcontextprotocol/hackathon)
33
  [![Powered by Gradio](https://img.shields.io/badge/Powered%20by-Gradio-orange)](https://gradio.app/)
34
 
35
  > **🎯 Track 2 Submission**: MCP in Action (Enterprise)
36
  > **📅 MCP's 1st Birthday Hackathon**: November 14-30, 2025
37
 
38
- ## Overview
39
-
40
- TraceMind-AI is a comprehensive platform for evaluating AI agent performance across different models, providers, and configurations. It provides real-time insights, cost analysis, and detailed trace visualization powered by the Model Context Protocol (MCP).
41
-
42
- ### πŸ—οΈ **Built on Open Source Foundation**
43
-
44
- This platform is part of a complete agent evaluation ecosystem built on two foundational open-source projects:
45
-
46
- **πŸ”­ TraceVerde (genai_otel_instrument)** - Automatic OpenTelemetry Instrumentation
47
- - **What**: Zero-code OTEL instrumentation for LLM frameworks (LiteLLM, Transformers, LangChain, etc.)
48
- - **Why**: Captures every LLM call, tool usage, and agent step automatically
49
- - **Links**: [GitHub](https://github.com/Mandark-droid/genai_otel_instrument) | [PyPI](https://pypi.org/project/genai-otel-instrument)
50
-
51
- **πŸ“Š SMOLTRACE** - Agent Evaluation Engine
52
- - **What**: Lightweight, production-ready evaluation framework with OTEL tracing built-in
53
- - **Why**: Generates structured datasets (leaderboard, results, traces, metrics) displayed in this UI
54
- - **Links**: [GitHub](https://github.com/Mandark-droid/SMOLTRACE) | [PyPI](https://pypi.org/project/smoltrace/)
55
-
56
- **The Flow**: `TraceVerde` instruments your agents β†’ `SMOLTRACE` evaluates them β†’ `TraceMind-AI` visualizes results with MCP-powered intelligence
57
-
58
  ---
59
 
60
- ## Features
61
 
62
- - **πŸ“Š Real-time Leaderboard**: Live evaluation data from HuggingFace datasets
63
- - **πŸ€– Autonomous Agent Chat**: Interactive agent powered by smolagents with MCP tools (Track 2)
64
- - **πŸ’¬ MCP Integration**: AI-powered analysis using remote MCP servers
65
- - **☁️ Multi-Cloud Evaluation**: Submit jobs to HuggingFace Jobs or Modal (H200, A100, A10 GPUs)
66
- - **πŸ’° Smart Cost Estimation**: Auto-select hardware and predict costs before running evaluations
67
- - **πŸ” Trace Visualization**: Detailed OpenTelemetry trace analysis with GPU metrics
68
- - **πŸ“ˆ Performance Metrics**: GPU utilization, CO2 emissions, token usage tracking
69
- - **🧠 Agent Reasoning**: View step-by-step agent planning and tool execution
70
 
71
- ## MCP Integration
 
 
 
 
 
72
 
73
- TraceMind demonstrates enterprise MCP client usage by connecting to [TraceMind-mcp-server](https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind-mcp-server) via the Model Context Protocol.
74
 
75
- **MCP Tools Used:**
76
- - `analyze_leaderboard` - AI-generated insights about evaluation trends
77
- - `estimate_cost` - Cost estimation with hardware recommendations
78
- - `debug_trace` - Interactive trace analysis and debugging
79
- - `compare_runs` - Side-by-side run comparison
80
- - `analyze_results` - Test case analysis with optimization recommendations
81
 
82
- ## Quick Start
83
 
84
- ### Prerequisites
 
 
 
 
85
 
86
- **For Viewing Leaderboard & Analysis:**
87
- - Python 3.10+
88
- - HuggingFace account (for authentication)
89
 
90
- **For Submitting Evaluation Jobs:**
91
- - ⚠️ **HuggingFace Pro account** ($9/month) with credit card
92
- - HuggingFace token with **Read + Write + Run Jobs** permissions
93
- - API keys for model providers (OpenAI, Anthropic, etc.)
94
 
95
- > **Note**: Job submission requires a paid HuggingFace Pro account to access compute infrastructure. Viewing existing results is free.
96
 
97
- ### Installation
 
 
 
98
 
99
- 1. Clone the repository:
100
- ```bash
101
- git clone https://github.com/Mandark-droid/TraceMind-AI.git
102
- cd TraceMind-AI
103
  ```
104
-
105
- 2. Install dependencies:
106
- ```bash
107
- pip install -r requirements.txt
 
 
 
 
 
 
 
 
 
 
 
 
 
108
  ```
109
 
110
- 3. Configure environment:
111
- ```bash
112
- cp .env.example .env
113
- # Edit .env with your configuration
114
- ```
115
 
116
- 4. Run the application:
117
- ```bash
118
- python app.py
119
- ```
120
-
121
- Visit http://localhost:7860
122
-
123
- ## 🎯 For Hackathon Judges & Visitors
124
 
125
- ### Using Your Own API Keys (Recommended)
 
 
126
 
127
- TraceMind-AI integrates with the TraceMind MCP Server to provide AI-powered analysis. To **prevent credit issues during evaluation**, we recommend configuring your own API keys:
128
 
129
- #### Step-by-Step Configuration
 
 
130
 
131
- **Step 1: Configure MCP Server** (Required for MCP tool features)
 
132
 
133
- 1. **Open MCP Server**: https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind-mcp-server
134
- 2. Go to **βš™οΈ Settings** tab
135
- 3. Enter your **Gemini API Key** and **HuggingFace Token**
136
- 4. Click **"Save & Override Keys"**
137
-
138
- **Step 2: Configure TraceMind-AI** (Optional, for additional features)
139
 
140
- 1. **Open TraceMind-AI**: https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind
141
- 2. Go to **βš™οΈ Settings** tab
142
- 3. Enter your **Gemini API Key** and **HuggingFace Token**
143
- 4. Click **"Save API Keys"**
144
 
145
- ### Why Configure Both?
146
 
147
- - **MCP Server**: Provides AI-powered tools (leaderboard analysis, trace debugging, cost estimation)
148
- - **TraceMind-AI**: Main UI that calls the MCP server for intelligent analysis
149
- - They run in **separate sessions** β†’ need separate configuration
150
- - Configuring both ensures your keys are used for the complete evaluation flow
151
 
152
- ### Getting Free API Keys
 
 
 
153
 
154
- Both APIs have generous free tiers:
 
 
 
 
155
 
156
- **Google Gemini API Key**:
157
- - Visit: https://ai.google.dev/
158
- - Click "Get API Key" β†’ Create project β†’ Generate key
159
- - **Free tier**: 1,500 requests/day (sufficient for evaluation)
160
 
161
- **HuggingFace Token** (for viewing):
162
- - Visit: https://huggingface.co/settings/tokens
163
- - Click "New token" β†’ Name it (e.g., "TraceMind Viewer")
164
- - **Permissions**:
165
- - Select "Read" for viewing datasets (sufficient for browsing leaderboard)
166
- - **Free tier**: No rate limits for public dataset access
167
 
168
- ### Default Configuration (Without Your Keys)
169
 
170
- If you don't configure your own keys:
171
- - Apps will use our pre-configured keys from HuggingFace Spaces Secrets
172
- - Fine for brief testing, but may hit rate limits during high traffic
173
- - Recommended to configure your keys for full evaluation
174
 
175
- ### Security Notes
176
-
177
- βœ… **Session-only storage**: Keys stored only in browser memory
178
- βœ… **No server persistence**: Keys never saved to disk
179
- βœ… **Not exposed via API**: Settings forms use `api_name=False`
180
- βœ… **HTTPS encryption**: All API calls over secure connections
181
 
182
- ## πŸš€ Submitting Evaluation Jobs
183
 
184
- TraceMind-AI allows you to submit evaluation jobs to **two cloud platforms**:
185
- - **HuggingFace Jobs**: Managed compute with H200, A100, A10, T4 GPUs
186
- - **Modal**: Serverless GPU compute with pay-per-second pricing
187
 
188
- ### ⚠️ Requirements for Job Submission
 
 
189
 
190
- **For HuggingFace Jobs:**
191
 
192
- 1. **HuggingFace Pro Account** ($9/month)
193
- - Sign up at: https://huggingface.co/pricing
194
- - **Credit card required** to pay for compute usage
195
- - Free accounts cannot submit jobs
 
196
 
197
- 2. **HuggingFace Token with Enhanced Permissions**
198
- - Visit: https://huggingface.co/settings/tokens
199
- - Create token with these permissions:
200
- - βœ… **Read** (view datasets)
201
- - βœ… **Write** (upload results)
202
- - βœ… **Run Jobs** (submit evaluation jobs)
203
- - ⚠️ Read-only tokens will NOT work
204
 
205
- **For Modal (Optional Alternative):**
 
 
206
 
207
- 1. **Modal Account** (Free tier available)
208
- - Sign up at: https://modal.com
209
- - Generate API token at: https://modal.com/settings/tokens
210
- - Pay-per-second billing (no monthly subscription)
211
 
212
- 2. **Configure Modal Credentials in Settings**
213
- - MODAL_TOKEN_ID (starts with `ak-`)
214
- - MODAL_TOKEN_SECRET (starts with `as-`)
215
 
216
- **Both Platforms Require:**
217
 
218
- 3. **Model Provider API Keys**
219
- - OpenAI, Anthropic, Google, etc.
220
- - Configure in Settings β†’ LLM Provider API Keys
221
- - Passed securely as job secrets
222
 
223
- ### Hardware Options & Pricing
 
 
224
 
225
- TraceMind **auto-selects optimal hardware** based on your model size and provider:
226
 
227
- **HuggingFace Jobs:**
228
- - **cpu-basic**: API models (OpenAI, Anthropic) - ~$0.05/hr
229
- - **t4-small**: Small models (4B-8B parameters) - ~$0.60/hr
230
- - **a10g-small**: Medium models (7B-13B) - ~$1.10/hr
231
- - **a100-large**: Large models (70B+) - ~$3.00/hr
232
- - Pricing: https://huggingface.co/pricing#spaces-pricing
233
 
234
- **Modal:**
235
- - **CPU**: API models - ~$0.0001/sec
236
- - **A10G**: Small-medium models (7B-13B) - ~$0.0006/sec
237
- - **A100-80GB**: Large models (70B+) - ~$0.0030/sec
238
- - **H200**: Fastest inference - ~$0.0050/sec
239
- - Pricing: https://modal.com/pricing
240
 
241
- ### How to Submit a Job
242
 
243
- 1. **Configure API Keys** (Settings tab):
244
- - Add HF Token (with Run Jobs permission) - **required for both platforms**
245
- - Add Modal credentials (MODAL_TOKEN_ID + MODAL_TOKEN_SECRET) - **for Modal only**
246
- - Add LLM provider keys (OpenAI, Anthropic, etc.)
247
-
248
- 2. **Create Evaluation** (New Evaluation tab):
249
- - **Select infrastructure**: HuggingFace Jobs or Modal
250
- - Choose model and agent type
251
- - Configure hardware (or use **"auto"** for smart selection)
252
- - Set timeout (default: 1h)
253
- - Click "πŸ’° Estimate Cost" to preview cost/duration
254
- - Click "Submit Evaluation"
255
-
256
- 3. **Monitor Job**:
257
- - View job ID and status in confirmation screen
258
- - **HF Jobs**: Track at https://huggingface.co/jobs or use Job Monitoring tab
259
- - **Modal**: Track at https://modal.com/apps
260
- - Results automatically appear in leaderboard when complete
261
-
262
- ### What Happens During a Job
263
-
264
- 1. Job starts on selected infrastructure (HF Jobs or Modal)
265
- 2. Docker container built with required dependencies
266
- 3. SMOLTRACE evaluates your model with OpenTelemetry tracing
267
- 4. Results uploaded to 4 HuggingFace datasets:
268
- - Leaderboard entry (summary stats)
269
- - Results dataset (test case details)
270
- - Traces dataset (OTEL spans)
271
- - Metrics dataset (GPU metrics, CO2 emissions)
272
- 5. Results appear in TraceMind leaderboard automatically
273
-
274
- **Expected Duration:**
275
- - CPU jobs (API models): 2-5 minutes
276
- - GPU jobs (local models): 15-30 minutes (includes model download)
277
 
278
- ## Configuration
 
 
 
 
279
 
280
- Create a `.env` file with the following variables:
 
 
 
 
281
 
282
- ```env
283
- # HuggingFace Configuration
284
- HF_TOKEN=your_token_here
285
 
286
- # Agent Model Configuration (for Chat Screen - Track 2)
287
- # Options: "hfapi" (default), "inference_client", "litellm"
288
- AGENT_MODEL_TYPE=hfapi
289
 
290
- # API Keys for different model types
291
- # Required if AGENT_MODEL_TYPE=litellm
292
- GEMINI_API_KEY=your_gemini_api_key_here
293
 
294
- # MCP Server URL (note: /sse endpoint for smolagents integration)
295
- MCP_SERVER_URL=https://mcp-1st-birthday-tracemind-mcp-server.hf.space/gradio_api/mcp/sse
296
 
297
- # Dataset Configuration
298
- LEADERBOARD_REPO=kshitijthakkar/smoltrace-leaderboard
 
 
 
299
 
300
- # Development Mode (optional - disables OAuth for local testing)
301
- DISABLE_OAUTH=true
302
- ```
303
 
304
- ### Agent Model Options
 
 
 
 
305
 
306
- The Agent Chat screen supports three model configurations:
307
 
308
- 1. **`hfapi` (Default)**: Uses HuggingFace Inference API
309
- - Model: `Qwen/Qwen2.5-Coder-32B-Instruct`
310
- - Requires: `HF_TOKEN`
311
- - Best for: General use, free tier available
312
 
313
- 2. **`inference_client`**: Uses Nebius provider
314
- - Model: `deepseek-ai/DeepSeek-V3-0324`
315
- - Requires: `HF_TOKEN`
316
- - Best for: Advanced reasoning, faster inference
317
 
318
- 3. **`litellm`**: Uses Google Gemini
319
- - Model: `gemini/gemini-2.5-flash`
320
- - Requires: `GEMINI_API_KEY`
321
- - Best for: Gemini-specific features
322
 
323
- ## Data Sources
 
 
 
324
 
325
- TraceMind-AI loads evaluation data from HuggingFace datasets:
 
 
 
326
 
327
- - **Leaderboard**: Aggregate statistics for all evaluation runs
328
- - **Results**: Individual test case results
329
- - **Traces**: OpenTelemetry trace data
330
- - **Metrics**: GPU metrics and performance data
331
 
332
- ## Architecture
333
 
334
- ### Project Structure
335
 
336
- ```
337
- TraceMind-AI/
338
- β”œβ”€β”€ app.py # Main Gradio application
339
- β”œβ”€β”€ data_loader.py # HuggingFace dataset integration
340
- β”œβ”€β”€ mcp_client/ # MCP client implementation
341
- β”‚ β”œβ”€β”€ client.py # Async MCP client
342
- β”‚ └── sync_wrapper.py # Synchronous wrapper
343
- β”œβ”€β”€ utils/ # Utilities
344
- β”‚ β”œβ”€β”€ auth.py # HuggingFace OAuth
345
- β”‚ └── navigation.py # Screen navigation
346
- β”œβ”€β”€ screens/ # UI screens
347
- β”œβ”€β”€ components/ # Reusable components
348
- └── styles/ # Custom CSS
349
- ```
350
 
351
- ### MCP Client Integration
352
 
353
- TraceMind-AI uses the MCP Python SDK to connect to remote MCP servers:
 
 
 
354
 
355
- ```python
356
- from mcp_client.sync_wrapper import get_sync_mcp_client
357
 
358
- # Initialize MCP client
359
- mcp_client = get_sync_mcp_client()
360
- mcp_client.initialize()
 
361
 
362
- # Call MCP tools
363
- insights = mcp_client.analyze_leaderboard(
364
- metric_focus="overall",
365
- time_range="last_week",
366
- top_n=5
367
- )
368
- ```
369
 
370
- ## Usage
 
 
 
371
 
372
- ### Viewing the Leaderboard
373
 
374
- 1. Log in with your HuggingFace account
375
- 2. Navigate to the "Leaderboard" tab
376
- 3. Click "Load Leaderboard" to fetch the latest data
377
- 4. View AI-powered insights generated by the MCP server
378
 
379
- ### Estimating Costs
 
 
 
 
 
 
 
380
 
381
- 1. Navigate to the "Cost Estimator" tab
382
- 2. Enter the model name (e.g., `openai/gpt-4`)
383
- 3. Select agent type and number of tests
384
- 4. Click "Estimate Cost" for AI-powered analysis
385
 
386
- ### Viewing Trace Details
387
 
388
- 1. Select an evaluation run from the leaderboard
389
- 2. Click on a specific test case
390
- 3. View detailed OpenTelemetry trace visualization
391
- 4. Ask questions about the trace using MCP-powered analysis
 
 
 
 
 
392
 
393
- ### Using the Agent Chat (Track 2)
394
 
395
- 1. Navigate to the "πŸ€– Agent Chat" tab
396
- 2. The autonomous agent will initialize with MCP tools from TraceMind MCP Server
397
- 3. Ask questions about agent evaluations:
398
- - "What are the top 3 performing models and their costs?"
399
- - "Estimate the cost of running 500 tests with DeepSeek-V3 on H200"
400
- - "Load the leaderboard and show me the last 5 run IDs"
401
- 4. Watch the agent plan, execute tools, and provide detailed answers
402
- 5. Enable "Show Agent Reasoning" to see step-by-step tool execution
403
- 6. Use Quick Action buttons for common queries
 
 
 
 
 
 
404
 
405
- **Example Questions:**
406
- - Analysis: "Analyze the current leaderboard and show me the top performing models with their costs"
407
- - Cost Comparison: "Compare the costs of the top 3 models - which one offers the best value?"
408
- - Recommendations: "Based on the leaderboard data, which model would you recommend for a production system?"
409
 
410
- ## Technology Stack
411
 
412
- - **UI Framework**: Gradio 5.49.1
413
- - **Agent Framework**: smolagents 1.22.0+ (Track 2)
414
- - **MCP Protocol**: MCP integration via Gradio & smolagents MCPClient
415
- - **Data**: HuggingFace Datasets API
416
- - **Authentication**: HuggingFace OAuth
417
- - **AI Models**:
418
- - Default: Qwen/Qwen2.5-Coder-32B-Instruct (HF Inference API)
419
- - Optional: DeepSeek-V3 (Nebius), Gemini 2.5 Flash
420
- - MCP Server: Google Gemini 2.5 Pro
421
 
422
- ## Development
423
 
424
- ### Running Locally
425
 
426
- ```bash
427
- # Install dependencies
428
- pip install -r requirements.txt
429
 
430
- # Set development mode (optional - disables OAuth)
431
- export DISABLE_OAUTH=true
 
 
 
 
432
 
433
- # Run the app
434
- python app.py
435
- ```
436
 
437
- ### Running on HuggingFace Spaces
 
 
 
438
 
439
- This application is configured for deployment on HuggingFace Spaces using the Gradio SDK. The `app.py` file serves as the entry point.
440
 
441
- ## Documentation
442
 
443
- For detailed implementation documentation, see:
444
- - [Data Loader API](data_loader.py) - Dataset loading and caching
445
- - [MCP Client API](mcp_client/client.py) - MCP protocol integration
446
- - [Authentication](utils/auth.py) - HuggingFace OAuth integration
447
 
448
- ## Demo Video
 
 
449
 
450
- [Link to demo video showing the application in action]
 
 
451
 
452
- ## Social Media
 
 
453
 
454
- [Link to social media post about this project]
 
 
455
 
456
- ## License
 
 
457
 
458
- AGPL-3.0 License
 
 
459
 
460
- This project is licensed under the GNU Affero General Public License v3.0. See the LICENSE file for details.
461
 
462
- ## Contributing
463
 
464
- Contributions are welcome! Please open an issue or submit a pull request.
465
 
466
- ## Built By
467
 
 
468
  **Track**: MCP in Action (Enterprise)
469
  **Author**: Kshitij Thakkar
470
- **Powered by**: MCP Servers (TraceMind-mcp-server) + Gradio
471
  **Built with**: Gradio 5.49.1 (MCP client integration)
472
 
 
 
 
 
 
473
  ---
474
 
475
- ## Acknowledgments
476
 
477
- - **MCP Team** - For the Model Context Protocol specification
478
- - **Gradio Team** - For Gradio 6 with MCP integration
479
- - **HuggingFace** - For Spaces hosting and dataset infrastructure
480
- - **Google** - For Gemini API access
481
- - **[Eliseu Silva](https://huggingface.co/elismasilva)** - For the [gradio_htmlplus](https://huggingface.co/spaces/elismasilva/gradio_htmlplus) custom component that powers our interactive leaderboard table. Eliseu's timely help and collaboration during the hackathon was invaluable!
482
 
483
- ## Links
484
 
485
- - **Live Demo**: https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind
486
- - **MCP Server**: https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind-mcp-server
487
- - **GitHub**: https://github.com/Mandark-droid/TraceMind-AI
488
- - **MCP Specification**: https://modelcontextprotocol.io
 
 
489
 
490
  ---
491
 
492
- **MCP's 1st Birthday Hackathon Submission**
493
- *Track: MCP in Action - Enterprise*
 
 
20
  # 🧠 TraceMind-AI
21
 
22
  <p align="center">
 
 
 
23
  <img src="https://raw.githubusercontent.com/Mandark-droid/TraceMind-AI/assets/Logo.png" alt="TraceMind-AI Logo" width="200"/>
24
  </p>
25
 
26
  **Agent Evaluation Platform with MCP-Powered Intelligence**
27
 
28
  [![MCP's 1st Birthday Hackathon](https://img.shields.io/badge/MCP%27s%201st%20Birthday-Hackathon-blue)](https://github.com/modelcontextprotocol)
29
+ [![Track 2: MCP in Action](https://img.shields.io/badge/Track-MCP%20in%20Action%20(Enterprise)-purple)](https://github.com/modelcontextprotocol/hackathon)
30
  [![Powered by Gradio](https://img.shields.io/badge/Powered%20by-Gradio-orange)](https://gradio.app/)
31
 
32
  > **🎯 Track 2 Submission**: MCP in Action (Enterprise)
33
  > **📅 MCP's 1st Birthday Hackathon**: November 14-30, 2025
34
 
 
 
 
 
 
35
  ---
36
 
37
+ ## Why TraceMind-AI?
38
 
39
+ **The Challenge**: Evaluating AI agents generates complex data across models, providers, and configurations. Making sense of it all is overwhelming.
 
 
 
 
 
 
 
40
 
41
+ **The Solution**: TraceMind-AI is your **intelligent agent evaluation command center**:
42
+ - 📊 **Live leaderboard** with real-time performance data
+ - 🤖 **Autonomous agent chat** powered by MCP tools
+ - 💰 **Smart cost estimation** before you run evaluations
+ - 🔍 **Deep trace analysis** to debug agent behavior
+ - ☁️ **Multi-cloud job submission** (HuggingFace Jobs + Modal)
47
 
48
+ All powered by the **Model Context Protocol** for AI-driven insights at every step.
49
 
50
+ ---
 
 
 
 
 
51
 
52
+ ## 🚀 Try It Now
53
 
54
+ - **🌐 Live Demo**: [TraceMind-AI Space](https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind)
55
+ - **πŸ› οΈ MCP Server**: [TraceMind-mcp-server](https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind-mcp-server) (Track 1)
56
+ - **πŸ“– Full Docs**: See [USER_GUIDE.md](USER_GUIDE.md) for complete walkthrough
57
+ - **🎬 MCP Server Quick Demo (5 min)**: [Watch on Loom](https://www.loom.com/share/d4d0003f06fa4327b46ba5c081bdf835)
58
+ - **πŸ“Ί MCP Server Full Demo (20 min)**: [Watch on Loom](https://www.loom.com/share/de559bb0aef749559c79117b7f951250)
59
 
60
+ ---
 
 
61
 
62
+ ## The TraceMind Ecosystem
 
 
 
63
 
64
+ TraceMind-AI is the **user-facing platform** in a complete 4-project agent evaluation ecosystem:
65
 
66
+ <p align="center">
67
+ <img src="https://raw.githubusercontent.com/Mandark-droid/TraceMind-AI/assets/TraceVerse_Logo.png" alt="TraceVerse Ecosystem" width="400"/>
68
+ <br/><br/>
69
+ </p>
70
 
 
 
 
 
71
  ```
72
+ 🔭 TraceVerde                     📊 SMOLTRACE
+ (genai_otel_instrument)           (Evaluation Engine)
+          ↓                                ↓
+     Instruments                       Evaluates
+      LLM calls                          agents
+          ↓                                ↓
+          └───────────────┬────────────────┘
+                          ↓
+                 Generates Datasets
+          (leaderboard, traces, metrics)
+                          ↓
+          ┌───────────────┴────────────────┐
+          ↓                                ↓
+ 🛠️ TraceMind MCP Server           🧠 TraceMind-AI
+ (Track 1 - Building MCP)          (This Project - Track 2)
+     Provides AI Tools               Consumes MCP Tools
+          └───────── MCP Protocol ─────────┘
  ```
90
 
91
+ ### The Foundation
 
 
 
 
92
 
93
+ **🔭 TraceVerde** - Automatic OpenTelemetry instrumentation for LLM frameworks
+ → Captures every LLM call, tool usage, and agent step
+ → [GitHub](https://github.com/Mandark-droid/genai_otel_instrument) | [PyPI](https://pypi.org/project/genai-otel-instrument)
 
 
 
 
 
96
 
97
+ **📊 SMOLTRACE** - Lightweight evaluation engine with built-in tracing
+ → Generates structured datasets (leaderboard, results, traces, metrics)
+ → [GitHub](https://github.com/Mandark-droid/SMOLTRACE) | [PyPI](https://pypi.org/project/smoltrace/)
100
 
101
+ ### The Platform
102
 
103
+ **πŸ› οΈ TraceMind MCP Server** - AI-powered analysis tools via MCP
104
+ β†’ [Live Demo](https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind-mcp-server) | [GitHub](https://github.com/Mandark-droid/TraceMind-mcp-server)
105
+ β†’ **Track 1**: Building MCP (Enterprise)
106
 
107
+ **🧠 TraceMind-AI** (This Project) - Interactive UI that consumes MCP tools
108
+ β†’ **Track 2**: MCP in Action (Enterprise)
109
 
110
+ ---
 
 
 
 
 
111
 
112
+ ## Key Features
 
 
 
113
 
114
+ ### 🎯 MCP Integration (Track 2)
115
 
116
+ TraceMind-AI demonstrates **enterprise MCP client usage** in two ways:
 
 
 
117
 
118
+ **1. Direct MCP Client Integration**
119
+ - Connects to TraceMind MCP Server via SSE transport
120
+ - Uses 5 AI-powered tools: `analyze_leaderboard`, `estimate_cost`, `debug_trace`, `compare_runs`, `analyze_results`
121
+ - Real-time insights powered by Google Gemini 2.5 Flash
122
 
123
+ **2. Autonomous Agent with MCP Tools**
124
+ - Built with `smolagents` framework
125
+ - Agent has access to all MCP server tools
126
+ - Natural language queries → autonomous tool execution
127
+ - Example: *"What are the top 3 models and how much do they cost?"*
128
 
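+ A minimal sketch of the agent wiring (assumes the smolagents `MCPClient` API; the project's actual setup lives in `agent/smolagents_setup.py`):
+
+ ```python
+ from smolagents import CodeAgent, InferenceClientModel, MCPClient
+
+ # Connect to the remote MCP server over SSE and hand its tools to the agent
+ mcp = MCPClient({"url": "https://mcp-1st-birthday-tracemind-mcp-server.hf.space/gradio_api/mcp/sse"})
+ agent = CodeAgent(tools=mcp.get_tools(), model=InferenceClientModel())
+
+ print(agent.run("What are the top 3 models and how much do they cost?"))
+ ```
+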
129
+ ### 📊 Agent Evaluation Features
 
 
 
130
 
131
+ - **Live Leaderboard**: View all evaluation runs with sortable metrics
132
+ - **Cost Estimation**: Auto-select hardware and predict costs before running
133
+ - **Trace Visualization**: Deep-dive into OpenTelemetry traces with GPU metrics
134
+ - **Multi-Cloud Jobs**: Submit evaluations to HuggingFace Jobs or Modal
135
+ - **Performance Analytics**: GPU utilization, CO2 emissions, token tracking
 
136
 
137
+ ### 💡 Smart Features
138
 
139
+ - **Auto Hardware Selection**: Based on model size and provider
140
+ - **Real-time Job Monitoring**: Track HuggingFace Jobs status
141
+ - **Agent Reasoning Visibility**: See step-by-step tool execution
142
+ - **Quick Action Buttons**: One-click common queries
143
 
144
+ ---
 
 
 
 
 
145
 
146
+ ## Quick Start
147
 
148
+ ### Option 1: Use the Live Demo (Recommended)
 
 
149
 
150
+ 1. **Visit**: https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind
151
+ 2. **Login**: Sign in with your HuggingFace account
152
+ 3. **Explore**: Browse the leaderboard, chat with the agent, visualize traces
153
 
154
+ ### Option 2: Run Locally
155
 
156
+ ```bash
157
+ # Clone and setup
158
+ git clone https://github.com/Mandark-droid/TraceMind-AI.git
159
+ cd TraceMind-AI
160
+ pip install -r requirements.txt
161
 
162
+ # Configure environment
163
+ cp .env.example .env
164
+ # Edit .env with your API keys (see Configuration section)
 
 
 
 
165
 
166
+ # Run the app
167
+ python app.py
168
+ ```
169
 
170
+ Visit http://localhost:7860
 
 
 
171
 
172
+ ---
 
 
173
 
174
+ ## Configuration
175
 
176
+ ### For Viewing (Free)
 
 
 
177
 
178
+ **Required**:
179
+ - HuggingFace account (free)
180
+ - HuggingFace token with **Read** permissions
181
 
182
+ ### For Submitting Jobs (Paid)
183
 
184
+ **Required**:
185
+ - ⚠️ **HuggingFace Pro** ($9/month) with credit card
186
+ - HuggingFace token with **Read + Write + Run Jobs** permissions
187
+ - LLM provider API keys (OpenAI, Anthropic, etc.)
 
 
188
 
189
+ **Optional (Modal Alternative)**:
190
+ - Modal account (pay-per-second, no subscription)
191
+ - Modal API token (MODAL_TOKEN_ID + MODAL_TOKEN_SECRET)
 
 
 
192
 
193
+ ### Using Your Own API Keys (Recommended for Judges)
194
 
195
+ To prevent rate limits during evaluation:
 
 
 
 
 
196
 
197
+ **Step 1: Configure MCP Server** (Required for AI tools)
198
+ 1. Visit: https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind-mcp-server
199
+ 2. Go to **⚙️ Settings** tab
200
+ 3. Enter: **Gemini API Key** + **HuggingFace Token**
201
+ 4. Click **"Save & Override Keys"**
202
 
203
+ **Step 2: Configure TraceMind-AI** (Optional)
204
+ 1. Visit: https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind
205
+ 2. Go to **⚙️ Settings** tab
206
+ 3. Enter: **Gemini API Key** + **HuggingFace Token**
207
+ 4. Click **"Save API Keys"**
208
 
209
+ **Get Free API Keys**:
210
+ - **Gemini**: https://ai.google.dev/ (1,500 requests/day)
211
+ - **HuggingFace**: https://huggingface.co/settings/tokens (unlimited for public datasets)
212
 
213
+ ---
 
 
214
 
215
+ ## For Hackathon Judges
 
 
216
 
217
+ ### ✅ Track 2 Compliance
 
218
 
219
+ - **MCP Client Integration**: Connects to remote MCP server via SSE transport
220
+ - **Autonomous Agent**: `smolagents` agent with MCP tool access
221
+ - **Enterprise Focus**: Cost optimization, job submission, performance analytics
222
+ - **Production-Ready**: Deployed to HuggingFace Spaces with OAuth authentication
223
+ - **Real Data**: Live HuggingFace datasets from SMOLTRACE evaluations
224
 
225
+ ### 🎯 Key Innovations
 
 
226
 
227
+ 1. **Dual MCP Integration**: Both direct MCP client + autonomous agent with MCP tools
228
+ 2. **Multi-Cloud Support**: HuggingFace Jobs + Modal for serverless compute
229
+ 3. **Auto Hardware Selection**: Smart hardware recommendations based on model size
230
+ 4. **Complete Ecosystem**: Part of 4-project platform demonstrating full evaluation workflow
231
+ 5. **Agent Reasoning Visibility**: See step-by-step MCP tool execution
232
 
233
+ ### 📹 Demo Materials
234
 
235
+ - **🎥 Demo Video**: [Coming Soon - Link to walkthrough]
+ - **📢 Social Post**: [Coming Soon - Link to announcement]
 
 
237
 
238
+ ### 🧪 Testing Suggestions
 
 
 
239
 
240
+ **1. Try the Agent Chat** (🤖 Agent Chat tab):
241
+ - "Analyze the current leaderboard and show me the top 5 models"
242
+ - "Compare the costs of the top 3 models"
243
+ - "Estimate the cost of running 100 tests with GPT-4"
244
 
245
+ **2. Explore the Leaderboard** (📊 Leaderboard tab):
246
+ - Click "Load Leaderboard" to see live data
247
+ - Read the AI-generated insights (powered by MCP server)
248
+ - Click on a run to see detailed test results
249
 
250
+ **3. Visualize Traces** (Select a run → View traces):
251
+ - See OpenTelemetry waterfall diagrams
252
+ - View GPU metrics overlay (for GPU jobs)
253
+ - Ask questions about the trace (MCP-powered debugging)
254
 
255
+ ---
 
 
 
256
 
257
+ ## What Can You Do?
258
 
259
+ ### 📊 View & Analyze
260
 
261
+ - **Browse leaderboard** with AI-powered insights
262
+ - **Compare models** side-by-side across metrics
263
+ - **Analyze traces** with interactive visualization
264
+ - **Ask questions** via autonomous agent
 
 
 
 
 
 
 
 
 
 
265
 
266
+ ### 💰 Estimate & Plan
267
 
268
+ - **Get cost estimates** before running evaluations
269
+ - **Compare hardware options** (CPU vs GPU tiers)
270
+ - **Preview duration** and CO2 emissions
271
+ - **See recommendations** from AI analysis
272
 
273
+ ### 🚀 Submit & Monitor
 
274
 
275
+ - **Submit evaluation jobs** to HuggingFace or Modal
276
+ - **Track job status** in real-time
277
+ - **View results** automatically when complete
278
+ - **Download datasets** for further analysis
279
 
280
+ ### 🧪 Generate & Customize
 
 
 
 
 
 
281
 
282
+ - **Generate synthetic datasets** for custom domains and tools
283
+ - **Create prompt templates** optimized for your use case
284
+ - **Push to HuggingFace Hub** with one click
285
+ - **Test evaluations** without writing code
286
 
287
+ ---
288
 
289
+ ## Documentation
 
 
 
290
 
291
+ **For quick evaluation**:
292
+ - Read this README for overview
293
+ - Visit the [Live Demo](https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind) to try it
294
+ - Check out the **🤖 Agent Chat** tab for autonomous MCP usage
295
+
296
+ **For deep dives**:
297
+ - [USER_GUIDE.md](USER_GUIDE.md) - Complete screen-by-screen walkthrough
298
+ - Leaderboard tab usage
299
+ - Agent chat interactions
300
+ - Synthetic data generator
301
+ - Job submission workflow
302
+ - Trace visualization guide
303
+ - [MCP_INTEGRATION.md](MCP_INTEGRATION.md) - MCP client architecture
304
+ - How TraceMind-AI connects to MCP server
305
+ - Agent framework integration (smolagents)
306
+ - MCP tool usage examples
307
+ - [JOB_SUBMISSION.md](JOB_SUBMISSION.md) - Evaluation job guide
308
+ - HuggingFace Jobs setup
309
+ - Modal integration
310
+ - Hardware selection guide
311
+ - Cost optimization tips
312
+ - [ARCHITECTURE.md](ARCHITECTURE.md) - Technical architecture
313
+ - Project structure
314
+ - Data flow
315
+ - Authentication
316
+ - Deployment
317
 
318
+ ---
 
 
 
319
 
320
+ ## Technology Stack
321
 
322
+ - **UI Framework**: Gradio 5.49.1
323
+ - **Agent Framework**: smolagents 1.22.0+
324
+ - **MCP Integration**: MCP Python SDK + smolagents MCPClient
325
+ - **Data Source**: HuggingFace Datasets API
326
+ - **Authentication**: HuggingFace OAuth
327
+ - **AI Models**:
328
+ - Agent: Qwen/Qwen2.5-Coder-32B-Instruct (HF API)
329
+ - MCP Server: Google Gemini 2.5 Flash
330
+ - **Cloud Platforms**: HuggingFace Jobs + Modal
331
 
332
+ ---
333
 
334
+ ## Example Workflows
335
+
336
+ ### Workflow 1: Quick Analysis
337
+ 1. Open TraceMind-AI
338
+ 2. Go to **🤖 Agent Chat**
339
+ 3. Click **"Quick: Top Models"**
340
+ 4. See agent fetch leaderboard and analyze top performers
341
+ 5. Ask follow-up: *"Which one is most cost-effective?"*
342
+
343
+ ### Workflow 2: Submit Evaluation Job
344
+ 1. Go to **⚙️ Settings** → Configure API keys
+ 2. Go to **🚀 New Evaluation**
346
+ 3. Select model (e.g., `meta-llama/Llama-3.1-8B`)
347
+ 4. Choose infrastructure (HuggingFace Jobs or Modal)
348
+ 5. Click **"πŸ’° Estimate Cost"** to preview
349
+ 6. Click **"Submit Evaluation"**
350
+ 7. Monitor job in **📈 Job Monitoring** tab
351
+ 8. View results in leaderboard when complete
352
+
353
+ ### Workflow 3: Debug Agent Behavior
354
+ 1. Browse **📊 Leaderboard**
355
+ 2. Click on a run with failures
356
+ 3. View **detailed test results**
357
+ 4. Click on a failed test to see trace
358
+ 5. Use MCP-powered Q&A: *"Why did this test fail?"*
359
+ 6. Get AI analysis of the execution trace
360
+
361
+ ### Workflow 4: Generate Custom Test Dataset
362
+ 1. Go to **🔬 Synthetic Data Generator**
363
+ 2. Configure:
364
+ - Domain: `finance`
365
+ - Tools: `get_stock_price,calculate_profit,send_alert`
366
+ - Number of tasks: `20`
367
+ - Difficulty: `balanced`
368
+ 3. Click **"Generate Dataset"**
369
+ 4. Review generated tasks and prompt template
370
+ 5. Enter repository name: `yourname/smoltrace-finance-tasks`
371
+ 6. Click **"Push to HuggingFace Hub"**
372
+ 7. Use your custom dataset in evaluations
373
 
374
+ ---
 
 
 
375
 
376
+ ## Screenshots
377
 
378
+ *See [SCREENSHOTS.md](SCREENSHOTS.md) for annotated screenshots of all screens*
 
 
 
 
 
 
 
 
379
 
380
+ ---
381
 
382
+ ## 🔗 Quick Links
383
 
384
+ ### 📦 Component Links
 
 
385
 
386
+ | Component | Description | Links |
387
+ |-----------|-------------|-------|
388
+ | **TraceVerde** | OTEL Instrumentation | [GitHub](https://github.com/Mandark-droid/genai_otel_instrument) • [PyPI](https://pypi.org/project/genai-otel-instrument) |
+ | **SMOLTRACE** | Evaluation Engine | [GitHub](https://github.com/Mandark-droid/SMOLTRACE) • [PyPI](https://pypi.org/project/smoltrace/) |
+ | **MCP Server** | Building MCP (Track 1) | [HF Space](https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind-mcp-server) • [GitHub](https://github.com/Mandark-droid/TraceMind-mcp-server) |
+ | **TraceMind-AI** | MCP in Action (Track 2) | [HF Space](https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind) • [GitHub](https://github.com/Mandark-droid/TraceMind-AI) |
392
 
393
+ ### 📢 Community Posts
 
 
394
 
395
+ - 🎉 [**TraceMind Teaser**](https://www.linkedin.com/posts/kshitij-thakkar-2061b924_mcpsfirstbirthdayhackathon-mcpsfirstbirthdayhackathon-activity-7395686529270013952-g_id) - MCP's 1st Birthday Hackathon announcement
+ - 📊 [**SMOLTRACE Launch**](https://www.linkedin.com/posts/kshitij-thakkar-2061b924_ai-machinelearning-llm-activity-7394350375908126720-im_T) - Lightweight agent evaluation engine
+ - 🔭 [**TraceVerde Launch**](https://www.linkedin.com/posts/kshitij-thakkar-2061b924_genai-opentelemetry-observability-activity-7390339855135813632-wqEg) - Zero-code OTEL instrumentation for LLMs
+ - 🙏 [**TraceVerde 3K Downloads**](https://www.linkedin.com/posts/kshitij-thakkar-2061b924_thank-you-open-source-community-a-week-activity-7392205780592132096-nu6U) - Thank you to the community!
399
 
400
+ ---
401
 
402
+ ## πŸ—ΊοΈ Future Roadmap
403
 
404
+ We're committed to making TraceMind the most comprehensive agent evaluation platform. Here's what's coming next:
 
 
 
405
 
406
+ ### 1. πŸ—οΈ Dynamic MCP Server Generator
407
+ Generate domain-specific MCP servers on-the-fly with custom tools via AI code generation.
408
+ **Use case**: Rapidly prototype MCP servers without writing boilerplate code.
409
 
410
+ ### 2. 🎯 Intelligent Model Router
411
+ Automatically select optimal models based on real-time leaderboard data, budget constraints, and accuracy requirements.
412
+ **Use case**: Optimize evaluation costs while maintaining quality for large-scale continuous evaluation.
413
 
414
+ ### 3. 🔬 Automated A/B Testing Framework
415
+ Compare multiple agent configurations with statistical significance testing and automatic winner selection.
416
+ **Use case**: Find optimal agent configuration scientifically before production deployment.
417
 
418
+ ### 4. 👥 Collaborative Evaluation Workspace
419
+ Real-time collaboration with shared runs, team comments, cost budgets, and stakeholder reports.
420
+ **Use case**: Streamline team workflows and coordinate evaluation efforts across distributed teams.
421
 
422
+ ### 5. 🔄 CI/CD Pipeline Integration
423
+ Automated agent evaluation on every PR with GitHub Actions, result comments, and merge blocking on quality drops.
424
+ **Use case**: Catch agent performance regressions before production and maintain quality standards automatically.
425
 
426
+ ### 6. 🧰 Integrated SMOLTRACE CLI Features
427
+ Bring all SMOLTRACE CLI tools into the UI: clean, copy, distill, merge, export, validate, anonymize datasets.
428
+ **Use case**: Manage evaluation datasets efficiently without command-line, with visual preview and undo capabilities.
429
 
430
+ ---
431
 
432
+ **Implementation Timeline**: Q1-Q4 2026 | **Want to contribute?** Join our community and help shape the future of agent evaluation!
433
 
434
+ ---
435
 
436
+ ## Credits
437
 
438
+ **Built for**: MCP's 1st Birthday Hackathon (Nov 14-30, 2025)
439
  **Track**: MCP in Action (Enterprise)
440
  **Author**: Kshitij Thakkar
441
+ **Powered by**: TraceMind MCP Server + Gradio + smolagents
442
  **Built with**: Gradio 5.49.1 (MCP client integration)
443
 
444
+ **Special Thanks**:
445
+ - **[Eliseu Silva](https://huggingface.co/elismasilva)** - For the [gradio_htmlplus](https://huggingface.co/spaces/elismasilva/gradio_htmlplus) custom component that powers our interactive leaderboard table. Eliseu's timely help and collaboration during the hackathon were invaluable!
446
+
447
+ **Sponsors**: HuggingFace • Google Gemini • Modal • Anthropic • Gradio • ElevenLabs • SambaNova • Blaxel
448
+
449
  ---
450
 
451
+ ## License
452
 
453
+ AGPL-3.0 - See [LICENSE](LICENSE) for details
 
 
 
 
454
 
455
+ ---
456
 
457
+ ## Support
458
+
459
+ - 📧 GitHub Issues: [TraceMind-AI/issues](https://github.com/Mandark-droid/TraceMind-AI/issues)
+ - 💬 HF Discord: `#mcp-1st-birthday-official🏆`
461
+ - 🏷️ Tag: `mcp-in-action-track-enterprise`
462
+ - 🐦 Twitter: [@TraceMindAI](https://twitter.com/TraceMindAI) (placeholder)
463
 
464
  ---
465
 
466
+ **Ready to evaluate your agents with AI-powered intelligence?**
467
+
468
+ 🌐 **Try the live demo**: https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind
USER_GUIDE.md ADDED
@@ -0,0 +1,1026 @@
 
 
1
+ # TraceMind-AI - Complete User Guide
2
+
3
+ This guide provides a comprehensive walkthrough of all features and screens in TraceMind-AI.
4
+
5
+ ## Table of Contents
6
+
7
+ - [Getting Started](#getting-started)
8
+ - [Screen-by-Screen Guide](#screen-by-screen-guide)
9
+   - [📊 Leaderboard](#-leaderboard)
+   - [🤖 Agent Chat](#-agent-chat)
+   - [🚀 New Evaluation](#-new-evaluation)
+   - [📈 Job Monitoring](#-job-monitoring)
+   - [🔍 Trace Visualization](#-trace-visualization)
+   - [🔬 Synthetic Data Generator](#-synthetic-data-generator)
+   - [⚙️ Settings](#️-settings)
16
+ - [Common Workflows](#common-workflows)
17
+ - [Troubleshooting](#troubleshooting)
18
+
19
+ ---
20
+
21
+ ## Getting Started
22
+
23
+ ### First-Time Setup
24
+
25
+ 1. **Visit** https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind
26
+ 2. **Sign in** with your HuggingFace account (required for viewing)
27
+ 3. **Configure API keys** (optional but recommended):
28
+    - Go to **⚙️ Settings** tab
+    - Enter Gemini API Key and HuggingFace Token
+    - Click **"Save API Keys"**
31
+
32
+ ### Navigation
33
+
34
+ TraceMind-AI is organized into tabs:
35
+ - **📊 Leaderboard**: View evaluation results with AI insights
+ - **🤖 Agent Chat**: Interactive autonomous agent powered by MCP tools
+ - **🚀 New Evaluation**: Submit evaluation jobs to HF Jobs or Modal
+ - **📈 Job Monitoring**: Track status of submitted jobs
+ - **🔍 Trace Visualization**: Deep-dive into agent execution traces
+ - **🔬 Synthetic Data Generator**: Create custom test datasets with AI
+ - **⚙️ Settings**: Configure API keys and preferences
42
+
43
+ ---
44
+
45
+ ## Screen-by-Screen Guide
46
+
47
+ ### 📊 Leaderboard
48
+
49
+ **Purpose**: Browse all evaluation runs with AI-powered insights and detailed analysis.
50
+
51
+ #### Features
52
+
53
+ **Main Table**:
54
+ - View all evaluation runs from the SMOLTRACE leaderboard
55
+ - Sortable columns: Model, Success Rate, Cost, Duration, CO2 emissions
56
+ - Click any row to see detailed test results
57
+
58
+ **AI Insights Panel** (Top of screen):
59
+ - Automatically generated insights from MCP server
60
+ - Powered by Google Gemini 2.5 Flash
61
+ - Updates when you click "Load Leaderboard"
62
+ - Shows top performers, trends, and recommendations
63
+
64
+ **Filter & Sort Options**:
65
+ - Filter by agent type (tool, code, both)
66
+ - Filter by provider (litellm, transformers)
67
+ - Sort by any metric (success rate, cost, duration)
68
+
69
+ #### How to Use
70
+
71
+ 1. **Load Data**:
72
+ ```
73
+ Click "Load Leaderboard" button
74
+ → Fetches latest evaluation runs from HuggingFace
+ → AI generates insights automatically
76
+ ```
77
+
78
+ 2. **Read AI Insights**:
79
+ - Located at top of screen
80
+ - Summary of evaluation trends
81
+ - Top performing models
82
+ - Cost/accuracy trade-offs
83
+ - Actionable recommendations
84
+
85
+ 3. **Explore Runs**:
86
+ - Scroll through table
87
+ - Sort by clicking column headers
88
+ - Click on any run to see details
89
+
90
+ 4. **View Details**:
91
+ ```
92
+ Click a row in the table
93
+ → Opens detail view with:
94
+ - All test cases (success/failure)
95
+ - Execution times
96
+ - Cost breakdown
97
+ - Link to trace visualization
98
+ ```
99
+
100
+ #### Example Workflow
101
+
102
+ ```
103
+ Scenario: Find the most cost-effective model for production
104
+
105
+ 1. Click "Load Leaderboard"
106
+ 2. Read AI insights: "Llama-3.1-8B offers best cost/performance at $0.002/run"
107
+ 3. Sort table by "Cost" (ascending)
108
+ 4. Compare top 3 cheapest models
109
+ 5. Click on Llama-3.1-8B run to see detailed results
110
+ 6. Review success rate (93.4%) and test case breakdowns
111
+ 7. Decision: Use Llama-3.1-8B for cost-sensitive workloads
112
+ ```
113
+
114
+ #### Tips
115
+
116
+ - **Refresh regularly**: Click "Load Leaderboard" to see new evaluation results
117
+ - **Compare models**: Use the sort function to compare across different metrics
118
+ - **Trust the AI**: The insights panel provides strategic recommendations based on all data
119
+
120
+ ---
121
+
122
+ ### 🤖 Agent Chat
123
+
124
+ **Purpose**: Interactive autonomous agent that can answer questions about evaluations using MCP tools.
125
+
126
+ **🎯 Track 2 Feature**: This demonstrates MCP client usage with smolagents framework.
127
+
128
+ #### Features
129
+
130
+ **Autonomous Agent**:
131
+ - Built with `smolagents` framework
132
+ - Has access to all TraceMind MCP Server tools
133
+ - Plans and executes multi-step actions
134
+ - Provides detailed, data-driven answers
135
+
136
+ **MCP Tools Available to Agent**:
137
+ - `analyze_leaderboard` - Get AI insights about top performers
138
+ - `estimate_cost` - Calculate evaluation costs before running
139
+ - `debug_trace` - Analyze execution traces
140
+ - `compare_runs` - Compare two evaluation runs
141
+ - `get_top_performers` - Fetch top N models efficiently
142
+ - `get_leaderboard_summary` - Get high-level statistics
143
+ - `get_dataset` - Load SMOLTRACE datasets
144
+ - `analyze_results` - Analyze detailed test results
145
+
146
+ **Agent Reasoning Visibility**:
147
+ - Toggle **"Show Agent Reasoning"** to see:
148
+ - Planning steps
149
+ - Tool execution logs
150
+ - Intermediate results
151
+ - Final synthesis
152
+
153
+ **Quick Action Buttons**:
154
+ - **"Quick: Top Models"**: Get top 5 models with costs
155
+ - **"Quick: Cost Estimate"**: Estimate cost for a model
156
+ - **"Quick: Load Leaderboard"**: Fetch leaderboard summary
157
+
158
+ #### How to Use
159
+
160
+ 1. **Start a Conversation**:
161
+ ```
162
+ Type your question in the chat box
163
+ Example: "What are the top 3 performing models and how much do they cost?"
164
+
165
+ Click "Send"
166
+ → Agent plans approach
+ → Executes MCP tools
+ → Returns synthesized answer
169
+ ```
170
+
171
+ 2. **Watch Agent Work** (optional):
172
+ ```
173
+ Enable "Show Agent Reasoning" checkbox
174
+ → See step-by-step execution:
175
+ Step 1: Calling get_top_performers with metric=success_rate, top_n=3
176
+ Step 2: Analyzing costs from results
177
+ Step 3: Formatting response with recommendations
178
+ ```
179
+
180
+ 3. **Ask Follow-Up Questions**:
181
+ ```
182
+ You: "Which one is most cost-effective?"
183
+ Agent: Uses previous context + MCP tools to answer
184
+ ```
185
+
186
+ 4. **Use Quick Actions**:
187
+ ```
188
+ Click "Quick: Top Models" button
189
+ → Agent automatically executes query
+ → Returns top 5 models with metrics
191
+ ```
192
+
193
+ #### Example Conversations
194
+
195
+ **Example 1: Cost Comparison**
196
+ ```
197
+ User: "Compare the costs of the top 3 models. Which one offers the best value?"
198
+
199
+ Agent:
200
+ Let me analyze the top 3 models by success rate:
201
+
202
+ [Calls: get_top_performers(metric="success_rate", top_n=3)]
203
+
204
+ Top 3 Models:
205
+ 1. GPT-4: 95.8% success, $0.05/run
206
+ 2. Claude-3: 94.1% success, $0.04/run
207
+ 3. Llama-3.1-8B: 93.4% success, $0.002/run
208
+
209
+ [Calls: analyze_leaderboard(metric_focus="cost")]
210
+
211
+ Cost Analysis:
212
+ - GPT-4: Highest accuracy but 25x more expensive than Llama
213
+ - Claude-3: Good middle ground at $0.04/run
214
+ - Llama-3.1: Best value at $0.002/run with only 2.4% accuracy drop
215
+
216
+ Recommendation for "best value":
217
+ → Llama-3.1-8B offers 93.4% accuracy at $0.002/run
+ → For 1M runs: Saves $48,000/month vs GPT-4
+ → Only 2.4% accuracy trade-off
220
+ ```
221
+
222
+ **Example 2: Evaluation Planning**
223
+ ```
224
+ User: "I want to evaluate DeepSeek-V3 with 100 tests. How much will it cost and which hardware should I use?"
225
+
226
+ Agent:
227
+ Let me estimate costs for DeepSeek-V3:
228
+
229
+ [Calls: estimate_cost(model="deepseek-ai/DeepSeek-V3", num_tests=100, hardware="auto")]
230
+
231
+ Cost Estimate:
232
+ - Model: deepseek-ai/DeepSeek-V3
233
+ - Tests: 100
234
+ - Recommended hardware: GPU A100 (70B+ parameters)
235
+
236
+ Breakdown:
237
+ - LLM inference cost: ~$0.01/test (self-hosted, no API fees)
238
+ - HF Jobs GPU cost: ~$3.00/hr for A100
239
+ - Estimated duration: 25 minutes
240
+ - Total compute cost: $1.25
241
+ - Total cost: $1.25 (100 tests)
242
+
243
+ Recommendation:
244
+ Use HuggingFace Jobs with A100 GPU. Total cost: $1.25 for 100 tests.
245
+ Alternative: Modal H200 for 2x faster inference at $2.50.
246
+ ```
247
+
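+ (Sanity check on the arithmetic above: 25 minutes on an A100 at ~$3.00/hr is 25/60 × $3.00 ≈ $1.25, which matches the quoted total.)
+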
248
+ #### Tips
249
+
250
+ - **Be specific**: Ask clear, focused questions for better answers
251
+ - **Use context**: Agent remembers conversation history
252
+ - **Watch reasoning**: Enable to understand how agent uses MCP tools
253
+ - **Try quick actions**: Fast way to get common information
254
+
255
+ ---
256
+
257
+ ### 🚀 New Evaluation
258
+
259
+ **Purpose**: Submit evaluation jobs to HuggingFace Jobs or Modal for running SMOLTRACE evaluations.
260
+
261
+ **⚠️ Requires**: HuggingFace Pro account ($9/month) with credit card, or Modal account.
262
+
263
+ #### Features
264
+
265
+ **Model Selection**:
266
+ - Enter any model name (format: `provider/model-name`)
267
+ - Examples: `openai/gpt-4`, `meta-llama/Llama-3.1-8B`, `deepseek-ai/DeepSeek-V3`
268
+ - Auto-detects if API model or local model
269
+
270
+ **Infrastructure Choice**:
271
+ - **HuggingFace Jobs**: Managed compute (H200, A100, A10, T4, CPU)
272
+ - **Modal**: Serverless GPU compute (pay-per-second)
273
+
274
+ **Hardware Selection**:
275
+ - **Auto** (recommended): Automatically selects optimal hardware based on model size
276
+ - **Manual**: Choose specific GPU tier (A10, A100, H200) or CPU
277
+
278
+ **Cost Estimation**:
279
+ - Click **"πŸ’° Estimate Cost"** before submitting
280
+ - Shows predicted:
281
+ - LLM API costs (for API models)
282
+ - Compute costs (for local models)
283
+ - Duration estimate
284
+ - CO2 emissions
285
+
286
+ **Agent Type**:
287
+ - **tool**: Test tool-calling capabilities
288
+ - **code**: Test code generation capabilities
289
+ - **both**: Test both (recommended)
290
+
291
+ #### How to Use
292
+
293
+ **Step 1: Configure Prerequisites** (One-time setup)
294
+
295
+ For **HuggingFace Jobs**:
296
+ ```
297
+ 1. Sign up for HF Pro: https://huggingface.co/pricing ($9/month)
298
+ 2. Add credit card for compute charges
299
+ 3. Create HF token with "Read + Write + Run Jobs" permissions
300
+ 4. Go to Settings tab → Enter HF token → Save
301
+ ```
302
+
303
+ For **Modal** (Alternative):
304
+ ```
305
+ 1. Sign up: https://modal.com (free tier available)
306
+ 2. Generate API token: https://modal.com/settings/tokens
307
+ 3. Go to Settings tab → Enter MODAL_TOKEN_ID + MODAL_TOKEN_SECRET → Save
308
+ ```
309
+
310
+ For **API Models** (OpenAI, Anthropic, etc.):
311
+ ```
312
+ 1. Get API key from provider (e.g., https://platform.openai.com/api-keys)
313
+ 2. Go to Settings tab → Enter provider API key → Save
314
+ ```
315
+
316
+ **Step 2: Create Evaluation**
317
+
318
+ ```
319
+ 1. Enter model name:
320
+ Example: "meta-llama/Llama-3.1-8B"
321
+
322
+ 2. Select infrastructure:
323
+ - HuggingFace Jobs (default)
324
+ - Modal (alternative)
325
+
326
+ 3. Choose agent type:
327
+ - "both" (recommended)
328
+
329
+ 4. Select hardware:
330
+ - "auto" (recommended - smart selection)
331
+ - Or choose manually: cpu-basic, t4-small, a10g-small, a100-large, h200
332
+
333
+ 5. Set timeout (optional):
334
+ - Default: 3600s (1 hour)
335
+ - Range: 300s - 7200s
336
+
337
+ 6. Click "πŸ’° Estimate Cost":
338
+ β†’ Shows predicted cost and duration
339
+ β†’ Example: "$2.00, 20 minutes, 0.5g CO2"
340
+
341
+ 7. Review estimate, then click "Submit Evaluation"
342
+ ```
343
+
344
+ **Step 3: Monitor Job**
345
+
346
+ ```
347
+ After submission:
348
+ β†’ Job ID displayed
349
+ β†’ Go to "πŸ“ˆ Job Monitoring" tab to track progress
350
+ β†’ Or visit HuggingFace Jobs dashboard: https://huggingface.co/jobs
351
+ ```
352
+
353
+ **Step 4: View Results**
354
+
355
+ ```
356
+ When job completes:
357
+ β†’ Results automatically uploaded to HuggingFace datasets
358
+ β†’ Appears in Leaderboard within 1-2 minutes
359
+ β†’ Click on your run to see detailed results
360
+ ```
361
+
362
+ #### Hardware Selection Guide
363
+
364
+ **For API Models** (OpenAI, Anthropic, Google):
365
+ - Use: `cpu-basic` (HF Jobs) or CPU (Modal)
366
+ - Cost: ~$0.05/hr (HF), ~$0.0001/sec (Modal)
367
+ - Why: No GPU needed for API calls
368
+
369
+ **For Small Models** (4B-8B parameters):
370
+ - Use: `t4-small` (HF) or A10G (Modal)
371
+ - Cost: ~$0.60/hr (HF), ~$0.0006/sec (Modal)
372
+ - Examples: Llama-3.1-8B, Mistral-7B
373
+
374
+ **For Medium Models** (9B-15B parameters):
375
+ - Use: `a10g-small` (HF) or A10G (Modal)
376
+ - Cost: ~$1.10/hr (HF), ~$0.0006/sec (Modal)
377
+ - Examples: Qwen2.5-14B, Mixtral-8x7B (β‰ˆ13B active parameters)
378
+
379
+ **For Large Models** (70B+ parameters):
380
+ - Use: `a100-large` (HF) or A100-80GB (Modal)
381
+ - Cost: ~$3.00/hr (HF), ~$0.0030/sec (Modal)
382
+ - Examples: Llama-3.1-70B, DeepSeek-V3
383
+
384
+ **For Fastest Inference**:
385
+ - Use: `h200` (HF or Modal)
386
+ - Cost: ~$5.00/hr (HF), ~$0.0050/sec (Modal)
387
+ - Best for: Time-sensitive evaluations, large batches
388
+
389
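+ The tiers above boil down to a size-based lookup. Here is a minimal sketch of what "auto" selection might look like (illustrative only; the real selector may weigh additional factors):
+
+ ```python
+ from typing import Optional
+
+ def auto_hardware(param_count_b: Optional[float]) -> str:
+     """Map model size (billions of parameters) to an HF Jobs flavor.
+     None means an API model, which needs no GPU."""
+     if param_count_b is None:
+         return "cpu-basic"    # API models (OpenAI, Anthropic, Google)
+     if param_count_b <= 8:
+         return "t4-small"     # e.g. Llama-3.1-8B, Mistral-7B
+     if param_count_b <= 15:
+         return "a10g-small"   # e.g. Qwen2.5-14B
+     return "a100-large"       # 70B+ models, e.g. Llama-3.1-70B
+
+ print(auto_hardware(None))  # cpu-basic
+ print(auto_hardware(70))    # a100-large
+ ```
+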
+ #### Example Workflows
390
+
391
+ **Workflow 1: Evaluate API Model (OpenAI GPT-4)**
392
+ ```
393
+ 1. Model: "openai/gpt-4"
394
+ 2. Infrastructure: HuggingFace Jobs
395
+ 3. Agent type: both
396
+ 4. Hardware: auto (selects cpu-basic)
397
+ 5. Estimate: $50.00 (mostly API costs), 45 min
398
+ 6. Submit β†’ Monitor β†’ View in leaderboard
399
+ ```
400
+
401
+ **Workflow 2: Evaluate Local Model (Llama-3.1-8B)**
402
+ ```
403
+ 1. Model: "meta-llama/Llama-3.1-8B"
404
+ 2. Infrastructure: Modal (for pay-per-second billing)
405
+ 3. Agent type: both
406
+ 4. Hardware: auto (selects A10G)
407
+ 5. Estimate: $0.20, 15 min
408
+ 6. Submit β†’ Monitor β†’ View in leaderboard
409
+ ```
410
+
411
+ #### Tips
412
+
413
+ - **Always estimate first**: Prevents surprise costs
414
+ - **Use "auto" hardware**: Smart selection based on model size
415
+ - **Start small**: Test with 10-20 tests before scaling to 100+
416
+ - **Monitor jobs**: Check Job Monitoring tab for status
417
+ - **Modal for experimentation**: Pay-per-second is cost-effective for testing
418
+
419
+ ---
420
+
421
+ ### πŸ“ˆ Job Monitoring
422
+
423
+ **Purpose**: Track status of submitted evaluation jobs.
424
+
425
+ #### Features
426
+
427
+ **Job Status Display**:
428
+ - Job ID
429
+ - Current status (pending, running, completed, failed)
430
+ - Start time
431
+ - Duration
432
+ - Infrastructure (HF Jobs or Modal)
433
+
434
+ **Real-time Updates**:
435
+ - Auto-refreshes every 30 seconds
436
+ - Manual refresh button
437
+
438
+ **Job Actions**:
439
+ - View logs
440
+ - Cancel job (if still running)
441
+ - View results (if completed)
442
+
443
+ #### How to Use
444
+
445
+ ```
446
+ 1. Go to "πŸ“ˆ Job Monitoring" tab
447
+ 2. See list of your submitted jobs
448
+ 3. Click "Refresh" for latest status
449
+ 4. When status = "completed":
450
+ β†’ Click "View Results"
451
+ β†’ Opens leaderboard filtered to your run
452
+ ```
453
+
454
+ #### Job Statuses
455
+
456
+ - **Pending**: Job queued, waiting for resources
457
+ - **Running**: Evaluation in progress
458
+ - **Completed**: Evaluation finished successfully
459
+ - **Failed**: Evaluation encountered an error
460
+
461
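+ If you would rather script the wait than watch the tab, a polling loop mirrors the UI's 30-second auto-refresh. A sketch that takes whatever status lookup you have (HF Jobs or Modal) as a callable:
+
+ ```python
+ import time
+
+ TERMINAL_STATUSES = {"completed", "failed"}
+
+ def wait_for_job(job_id: str, get_status, poll_seconds: int = 30,
+                  timeout: int = 7200) -> str:
+     """Poll `get_status(job_id)` until it returns a terminal status.
+     `get_status` stands in for your actual status source."""
+     deadline = time.time() + timeout
+     while time.time() < deadline:
+         status = get_status(job_id)
+         print(f"{job_id}: {status}")
+         if status in TERMINAL_STATUSES:
+             return status
+         time.sleep(poll_seconds)  # matches the tab's 30s auto-refresh
+     raise TimeoutError(f"Job {job_id} still running after {timeout}s")
+ ```
+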
+ #### Tips
462
+
463
+ - **Check logs** if job fails: Helps diagnose issues
464
+ - **Expected duration**:
465
+ - API models: 2-5 minutes
466
+ - Local models: 15-30 minutes (includes model download)
467
+
468
+ ---
469
+
470
+ ### πŸ” Trace Visualization
471
+
472
+ **Purpose**: Deep-dive into OpenTelemetry traces to understand agent execution.
473
+
474
+ **Access**: Click on any test case in a run's detail view
475
+
476
+ #### Features
477
+
478
+ **Waterfall Diagram**:
479
+ - Visual timeline of execution
480
+ - Spans show: LLM calls, tool executions, reasoning steps
481
+ - Duration bars (wider = slower)
482
+ - Parent-child relationships
483
+
484
+ **Span Details**:
485
+ - Span name (e.g., "LLM Call - Reasoning", "Tool Call - get_weather")
486
+ - Start/end times
487
+ - Duration
488
+ - Attributes (model, tokens, cost, tool inputs/outputs)
489
+ - Status (OK, ERROR)
490
+
491
+ **GPU Metrics Overlay** (for GPU jobs only):
492
+ - GPU utilization %
493
+ - Memory usage
494
+ - Temperature
495
+ - CO2 emissions
496
+
497
+ **MCP-Powered Q&A**:
498
+ - Ask questions about the trace
499
+ - Example: "Why was tool X called twice?"
500
+ - Agent uses `debug_trace` MCP tool to analyze
501
+
502
+ #### How to Use
503
+
504
+ ```
505
+ 1. From leaderboard β†’ Click a run β†’ Click a test case
506
+ 2. View waterfall diagram:
507
+ β†’ Spans arranged chronologically
508
+ β†’ Parent spans (e.g., "Agent Execution")
509
+ β†’ Child spans (e.g., "LLM Call", "Tool Call")
510
+
511
+ 3. Click any span:
512
+ β†’ See detailed attributes
513
+ β†’ Token counts, costs, inputs/outputs
514
+
515
+ 4. Ask questions (MCP-powered):
516
+ User: "Why did this test fail?"
517
+ β†’ Agent analyzes trace with debug_trace tool
518
+ β†’ Returns explanation with span references
519
+
520
+ 5. Check GPU metrics (if available):
521
+ β†’ Graph shows utilization over time
522
+    β†’ Overlaid on the execution timeline
523
+ ```
524
+
525
+ #### Example Analysis
526
+
527
+ **Scenario: Understanding a slow execution**
528
+
529
+ ```
530
+ 1. Open trace for test_045 (duration: 8.5s)
531
+ 2. Waterfall shows:
532
+ - Span 1: LLM Call - Reasoning (1.2s) βœ“
533
+ - Span 2: Tool Call - search_web (6.5s) ⚠️ SLOW
534
+ - Span 3: LLM Call - Final Response (0.8s) βœ“
535
+
536
+ 3. Click Span 2 (search_web):
537
+ - Input: {"query": "weather in Tokyo"}
538
+ - Output: 5 results
539
+ - Duration: 6.5s (6x slower than typical)
540
+
541
+ 4. Ask agent: "Why was the search_web call so slow?"
542
+ β†’ Agent analysis:
543
+ "The search_web call took 6.5s due to network latency.
544
+ Span attributes show API response time: 6.2s.
545
+ This is an external dependency issue, not agent code.
546
+ Recommendation: Implement timeout (5s) and fallback strategy."
547
+ ```
548
+
549
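+ Once a trace is exported as JSON, the same "find the slow span" check is easy to automate. A sketch over an assumed span structure (real OpenTelemetry attribute names may differ):
+
+ ```python
+ # Flag spans that dominate a trace, assuming each span dict carries
+ # "name" and "duration_s" keys (actual OTel field names may differ).
+ spans = [
+     {"name": "LLM Call - Reasoning", "duration_s": 1.2},
+     {"name": "Tool Call - search_web", "duration_s": 6.5},
+     {"name": "LLM Call - Final Response", "duration_s": 0.8},
+ ]
+
+ total = sum(s["duration_s"] for s in spans)
+ for span in sorted(spans, key=lambda s: s["duration_s"], reverse=True):
+     share = span["duration_s"] / total
+     flag = " ⚠️ SLOW" if share > 0.5 else ""
+     print(f"{span['name']}: {span['duration_s']}s ({share:.0%}){flag}")
+ ```
+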
+ #### Tips
550
+
551
+ - **Look for patterns**: Similar failures often have common spans
552
+ - **Use MCP Q&A**: Faster than manual trace analysis
553
+ - **Check GPU metrics**: Identify resource bottlenecks
554
+ - **Compare successful vs failed traces**: Spot differences
555
+
556
+ ---
557
+
558
+ ### πŸ”¬ Synthetic Data Generator
559
+
560
+ **Purpose**: Generate custom synthetic test datasets for agent evaluation using AI, complete with domain-specific tasks and prompt templates.
561
+
562
+ #### Features
563
+
564
+ **AI-Powered Dataset Generation**:
565
+ - Generate 5-100 synthetic tasks using Google Gemini 2.5 Flash
566
+ - Customizable domain, tools, difficulty, and agent type
567
+ - Automatic batching for large datasets (parallel generation)
568
+ - SMOLTRACE-format output ready for evaluation
569
+
570
+ **Prompt Template Generation**:
571
+ - Customized YAML templates based on smolagents format
572
+ - Optimized for your specific domain and tools
573
+ - Included automatically in dataset card
574
+
575
+ **Push to HuggingFace Hub**:
576
+ - One-click upload to HuggingFace Hub
577
+ - Public or private repositories
578
+ - Auto-generated README with usage instructions
579
+ - Ready to use with SMOLTRACE evaluations
580
+
581
+ #### How to Use
582
+
583
+ **Step 1: Configure & Generate Dataset**
584
+
585
+ 1. Navigate to **πŸ”¬ Synthetic Data Generator** tab
586
+
587
+ 2. Configure generation parameters:
588
+ - **Domain**: Topic/industry (e.g., "travel", "finance", "healthcare", "customer_support")
589
+ - **Tools**: Comma-separated list of tool names (e.g., "get_weather,search_flights,book_hotel")
590
+ - **Number of Tasks**: 5-100 tasks (slider)
591
+ - **Difficulty Level**:
592
+ - `balanced` (40% easy, 40% medium, 20% hard)
593
+ - `easy_only` (100% easy tasks)
594
+ - `medium_only` (100% medium tasks)
595
+ - `hard_only` (100% hard tasks)
596
+ - `progressive` (50% easy, 30% medium, 20% hard)
597
+ - **Agent Type**:
598
+ - `tool` (ToolCallingAgent only)
599
+ - `code` (CodeAgent only)
600
+ - `both` (50/50 mix)
601
+
602
+ 3. Click **"🎲 Generate Synthetic Dataset"**
603
+
604
+ 4. Wait for generation (30-120s depending on size):
605
+ - Shows progress message
606
+ - Automatic batching for >20 tasks
607
+ - Parallel API calls for faster generation
608
+
609
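+ The batching mentioned in step 4 amounts to chunking the request and generating chunks concurrently. A rough sketch, with `generate_batch(n)` standing in for one Gemini generation call:
+
+ ```python
+ from concurrent.futures import ThreadPoolExecutor
+
+ BATCH_SIZE = 20  # requests above this size are split automatically
+
+ def generate_in_batches(total: int, generate_batch) -> list:
+     """Split `total` tasks into batches and run `generate_batch(n)` concurrently.
+     `generate_batch` is a placeholder for the actual per-batch API call."""
+     sizes = [min(BATCH_SIZE, total - i) for i in range(0, total, BATCH_SIZE)]
+     with ThreadPoolExecutor() as pool:
+         batches = pool.map(generate_batch, sizes)
+     return [task for batch in batches for task in batch]
+
+ # e.g. generate_in_batches(100, my_gemini_call) -> five parallel batches of 20
+ ```
+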
+ **Step 2: Review Generated Content**
610
+
611
+ 1. **Dataset Preview Tab**:
612
+ - View all generated tasks in JSON format
613
+ - Check task IDs, prompts, expected tools, difficulty
614
+ - See dataset statistics:
615
+ - Total tasks
616
+ - Difficulty distribution
617
+ - Agent type distribution
618
+ - Tools coverage
619
+
620
+ 2. **Prompt Template Tab**:
621
+ - View customized YAML prompt template
622
+ - Based on smolagents templates
623
+ - Adapted for your domain and tools
624
+ - Ready to use with ToolCallingAgent or CodeAgent
625
+
626
+ **Step 3: Push to HuggingFace Hub** (Optional)
627
+
628
+ 1. Enter **Repository Name**:
629
+ - Format: `username/smoltrace-{domain}-tasks`
630
+ - Example: `alice/smoltrace-finance-tasks`
631
+ - Auto-filled with your HF username after generation
632
+
633
+ 2. Set **Visibility**:
634
+ - ☐ Private Repository (unchecked = public)
635
+ - β˜‘ Private Repository (checked = private)
636
+
637
+ 3. Provide **HuggingFace Token** (optional):
638
+ - Leave empty to use environment token (HF_TOKEN from Settings)
639
+ - Or paste token from https://huggingface.co/settings/tokens
640
+ - Requires write permissions
641
+
642
+ 4. Click **"πŸ“€ Push to HuggingFace Hub"**
643
+
644
+ 5. Wait for upload (5-30s):
645
+ - Creates dataset repository
646
+ - Uploads tasks
647
+ - Generates README with:
648
+ - Usage instructions
649
+ - Prompt template
650
+ - SMOLTRACE integration code
651
+ - Returns dataset URL
652
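+ For reference, step 4 is roughly equivalent to this `datasets` call (a sketch; the app additionally generates the README and prompt template for you):
+
+ ```python
+ from datasets import Dataset
+
+ tasks = [
+     {"id": "finance_stock_price_1",
+      "prompt": "What is the current price of AAPL stock?",
+      "expected_tool": "get_stock_price",
+      "difficulty": "easy",
+      "agent_type": "tool"},
+     # ... remaining generated tasks from the preview tab
+ ]
+
+ Dataset.from_list(tasks).push_to_hub(
+     "alice/smoltrace-finance-tasks",  # repository name from step 1
+     private=False,                    # matches the visibility checkbox
+     token=None,                       # None -> use HF_TOKEN from the environment
+ )
+ ```
+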
+
653
+ #### Example Workflow
654
+
655
+ ```
656
+ Scenario: Create finance evaluation dataset with 20 tasks
657
+
658
+ 1. Configure:
659
+ Domain: "finance"
660
+ Tools: "get_stock_price,calculate_roi,get_market_news,send_alert"
661
+ Number of Tasks: 20
662
+ Difficulty: "balanced"
663
+ Agent Type: "both"
664
+
665
+ 2. Click "Generate"
666
+ β†’ AI generates 20 tasks:
667
+ - 8 easy (single tool, straightforward)
668
+ - 8 medium (multiple tools or complex logic)
669
+ - 4 hard (complex reasoning, edge cases)
670
+ - 10 for ToolCallingAgent
671
+ - 10 for CodeAgent
672
+ β†’ Also generates customized prompt template
673
+
674
+ 3. Review Dataset Preview:
675
+ Task 1:
676
+ {
677
+ "id": "finance_stock_price_1",
678
+ "prompt": "What is the current price of AAPL stock?",
679
+ "expected_tool": "get_stock_price",
680
+ "difficulty": "easy",
681
+ "agent_type": "tool",
682
+ "expected_keywords": ["AAPL", "price", "$"]
683
+ }
684
+
685
+ Task 15:
686
+ {
687
+ "id": "finance_complex_analysis_15",
688
+ "prompt": "Calculate the ROI for investing $10,000 in AAPL last year and send an alert if ROI > 15%",
689
+ "expected_tool": "calculate_roi",
690
+ "expected_tool_calls": 2,
691
+ "difficulty": "hard",
692
+ "agent_type": "code",
693
+ "expected_keywords": ["ROI", "15%", "alert"]
694
+ }
695
+
696
+ 4. Review Prompt Template:
697
+ See customized YAML with:
698
+ - Finance-specific system prompt
699
+ - Tool descriptions for get_stock_price, calculate_roi, etc.
700
+ - Response format guidelines
701
+
702
+ 5. Push to Hub:
703
+ Repository: "yourname/smoltrace-finance-tasks"
704
+ Private: No (public)
705
+ Token: (empty, using environment token)
706
+
707
+ β†’ Uploads to https://huggingface.co/datasets/yourname/smoltrace-finance-tasks
708
+ β†’ README includes usage instructions and prompt template
709
+
710
+ 6. Use in evaluation:
711
+    # Load your custom dataset (Python)
+    from datasets import load_dataset
+    dataset = load_dataset("yourname/smoltrace-finance-tasks")
713
+
714
+ # Run SMOLTRACE evaluation
715
+ smoltrace-eval --model openai/gpt-4 \
716
+ --dataset-name yourname/smoltrace-finance-tasks \
717
+ --agent-type both
718
+ ```
719
+
720
+ #### Configuration Reference
721
+
722
+ **Difficulty Levels Explained**:
723
+
724
+ | Level | Characteristics | Example |
725
+ |-------|----------------|---------|
726
+ | **Easy** | Single tool call, straightforward input, clear expected output | "What's the weather in Tokyo?" β†’ get_weather("Tokyo") |
727
+ | **Medium** | Multiple tool calls OR complex input parsing OR conditional logic | "Compare weather in Tokyo and London" β†’ get_weather("Tokyo"), get_weather("London"), compare |
728
+ | **Hard** | Multiple tools, complex reasoning, edge cases, error handling | "Plan a trip with best weather, book flights if under $500, alert if unavailable" |
729
+
730
+ **Agent Types Explained**:
731
+
732
+ | Type | Description | Use Case |
733
+ |------|-------------|----------|
734
+ | **tool** | ToolCallingAgent - Declarative tool calling with structured outputs | API-based models that support function calling (GPT-4, Claude) |
735
+ | **code** | CodeAgent - Writes Python code to use tools programmatically | Models that excel at code generation (Qwen-Coder, DeepSeek-Coder) |
736
+ | **both** | 50/50 mix of tool and code agent tasks | Comprehensive evaluation across agent types |
737
+
738
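+ The difficulty presets correspond to fixed sampling weights. A small sketch of how a preset might translate into per-level task counts (weights taken from the preset list above):
+
+ ```python
+ # Difficulty presets as (easy, medium, hard) weights, per the options above.
+ PRESETS = {
+     "balanced":    (0.4, 0.4, 0.2),
+     "progressive": (0.5, 0.3, 0.2),
+     "easy_only":   (1.0, 0.0, 0.0),
+     "medium_only": (0.0, 1.0, 0.0),
+     "hard_only":   (0.0, 0.0, 1.0),
+ }
+
+ def difficulty_mix(preset: str, num_tasks: int) -> dict:
+     easy, medium, _hard = PRESETS[preset]
+     counts = {"easy": round(num_tasks * easy), "medium": round(num_tasks * medium)}
+     counts["hard"] = num_tasks - sum(counts.values())  # remainder keeps totals exact
+     return counts
+
+ print(difficulty_mix("balanced", 20))  # {'easy': 8, 'medium': 8, 'hard': 4}
+ ```
+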
+ #### Best Practices
739
+
740
+ **Domain Selection**:
741
+ - Be specific: "customer_support_saas" > "support"
742
+ - Match your use case: Use actual business domain
743
+ - Consider tools available: Domain should align with tools
744
+
745
+ **Tool Names**:
746
+ - Use descriptive names: "get_stock_price" > "fetch"
747
+ - Match actual tool implementations
748
+ - 3-8 tools is ideal (enough variety, not overwhelming)
749
+ - Include mix of data retrieval and action tools
750
+
751
+ **Number of Tasks**:
752
+ - 5-10 tasks: Quick testing, proof of concept
753
+ - 20-30 tasks: Solid evaluation dataset
754
+ - 50-100 tasks: Comprehensive benchmark
755
+
756
+ **Difficulty Distribution**:
757
+ - `balanced`: Best for general evaluation
758
+ - `progressive`: Good for learning/debugging
759
+ - `easy_only`: Quick sanity checks
760
+ - `hard_only`: Stress testing advanced capabilities
761
+
762
+ **Quality Assurance**:
763
+ - Always review generated tasks before pushing
764
+ - Check for domain relevance and variety
765
+ - Verify expected tools match your actual tools
766
+ - Ensure prompts are clear and executable
767
+
768
+ #### Troubleshooting
769
+
770
+ **Generation fails with "Invalid API key"**:
771
+ - Go to **βš™οΈ Settings**
772
+ - Configure Gemini API Key
773
+ - Get key from https://aistudio.google.com/apikey
774
+
775
+ **Generated tasks don't match domain**:
776
+ - Be more specific in domain description
777
+ - Try regenerating with adjusted parameters
778
+ - Review prompt template for domain alignment
779
+
780
+ **Push to Hub fails with "Authentication error"**:
781
+ - Verify HuggingFace token has write permissions
782
+ - Get token from https://huggingface.co/settings/tokens
783
+ - Check token in **βš™οΈ Settings** or provide directly
784
+
785
+ **Dataset generation is slow (>60s)**:
786
+ - Large requests (>20 tasks) are automatically batched
787
+ - Each batch takes 30-120s
788
+ - Example: 100 tasks = 5 batches at ~60s each, so allow up to ~5 minutes
789
+ - This is normal for large datasets
790
+
791
+ **Tasks are too easy/hard**:
792
+ - Adjust difficulty distribution
793
+ - Regenerate with different settings
794
+ - Mix difficulty levels with `balanced` or `progressive`
795
+
796
+ #### Advanced Tips
797
+
798
+ **Iterative Refinement**:
799
+ 1. Generate 10 tasks with `balanced` difficulty
800
+ 2. Review quality and variety
801
+ 3. If satisfied, generate 50-100 tasks with same settings
802
+ 4. If not, adjust domain/tools and regenerate
803
+
804
+ **Dataset Versioning**:
805
+ - Use version suffixes: `username/smoltrace-finance-tasks-v2`
806
+ - Iterate on datasets as tools evolve
807
+ - Keep track of which version was used for evaluations
808
+
809
+ **Combining Datasets**:
810
+ - Generate multiple small datasets for different domains
811
+ - Use SMOLTRACE CLI to merge datasets
812
+ - Create comprehensive multi-domain benchmarks
813
+
814
+ **Custom Prompt Templates**:
815
+ - Generate prompt template separately
816
+ - Customize further based on your needs
817
+ - Use in agent initialization before evaluation
818
+ - Include in dataset card for reproducibility
819
+
820
+ ---
821
+
822
+ ### βš™οΈ Settings
823
+
824
+ **Purpose**: Configure API keys, preferences, and authentication.
825
+
826
+ #### Features
827
+
828
+ **API Key Configuration**:
829
+ - Gemini API Key (for MCP server AI analysis)
830
+ - HuggingFace Token (for dataset access + job submission)
831
+ - Modal Token ID + Secret (for Modal job submission)
832
+ - LLM Provider Keys (OpenAI, Anthropic, etc.)
833
+
834
+ **Preferences**:
835
+ - Default infrastructure (HF Jobs vs Modal)
836
+ - Default hardware tier
837
+ - Auto-refresh intervals
838
+
839
+ **Security**:
840
+ - Keys stored in browser session only (not server)
841
+ - HTTPS encryption for all API calls
842
+ - Keys never logged or exposed
843
+
844
+ #### How to Use
845
+
846
+ **Configure Essential Keys**:
847
+ ```
848
+ 1. Go to "βš™οΈ Settings" tab
849
+
850
+ 2. Enter Gemini API Key:
851
+ - Get from: https://ai.google.dev/
852
+ - Click "Get API Key" β†’ Create project β†’ Generate
853
+ - Paste into field
854
+ - Free tier: 1,500 requests/day
855
+
856
+ 3. Enter HuggingFace Token:
857
+ - Get from: https://huggingface.co/settings/tokens
858
+ - Click "New token" β†’ Name: "TraceMind"
859
+ - Permissions:
860
+ - Read (for viewing datasets)
861
+ - Write (for uploading results)
862
+ - Run Jobs (for evaluation submission)
863
+ - Paste into field
864
+
865
+ 4. Click "Save API Keys"
866
+ β†’ Keys stored in browser session
867
+ β†’ MCP server will use your keys
868
+ ```
869
+
870
+ **Configure for Job Submission** (Optional):
871
+
872
+ For **HuggingFace Jobs**:
873
+ ```
874
+ Already configured if you entered HF token above with "Run Jobs" permission.
875
+ ```
876
+
877
+ For **Modal** (Alternative):
878
+ ```
879
+ 1. Sign up: https://modal.com
880
+ 2. Get token: https://modal.com/settings/tokens
881
+ 3. Copy MODAL_TOKEN_ID (starts with 'ak-')
882
+ 4. Copy MODAL_TOKEN_SECRET (starts with 'as-')
883
+ 5. Paste both into Settings β†’ Save
884
+ ```
885
+
886
+ For **API Model Providers**:
887
+ ```
888
+ 1. Get API key from provider:
889
+ - OpenAI: https://platform.openai.com/api-keys
890
+ - Anthropic: https://console.anthropic.com/settings/keys
891
+ - Google: https://ai.google.dev/
892
+
893
+ 2. Paste into corresponding field in Settings
894
+ 3. Click "Save LLM Provider Keys"
895
+ ```
896
+
897
+ #### Security Best Practices
898
+
899
+ - **Use environment variables**: For production, set keys via HF Spaces secrets
900
+ - **Rotate keys regularly**: Generate new tokens every 3-6 months
901
+ - **Minimal permissions**: Only grant "Run Jobs" if you need to submit evaluations
902
+ - **Monitor usage**: Check API provider dashboards for unexpected charges
903
+
904
+ ---
905
+
906
+ ## Common Workflows
907
+
908
+ ### Workflow 1: Quick Model Comparison
909
+
910
+ ```
911
+ Goal: Compare GPT-4 vs Llama-3.1-8B for production use
912
+
913
+ Steps:
914
+ 1. Go to Leaderboard β†’ Load Leaderboard
915
+ 2. Read AI insights: "GPT-4 leads accuracy, Llama-3.1 best cost"
916
+ 3. Sort by Success Rate β†’ Note: GPT-4 (95.8%), Llama (93.4%)
917
+ 4. Sort by Cost β†’ Note: GPT-4 ($0.05), Llama ($0.002)
918
+ 5. Go to Agent Chat β†’ Ask: "Compare GPT-4 and Llama-3.1. Which should I use for 1M runs/month?"
919
+ β†’ Agent analyzes with MCP tools
920
+ β†’ Returns: "Llama saves $48K/month, only 2.4% accuracy drop"
921
+ 6. Decision: Use Llama-3.1-8B for production
922
+ ```
923
+
924
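+ The savings figure the agent returns in step 5 is just per-run cost times volume, which is easy to verify:
+
+ ```python
+ # Check the agent's claim using the per-run costs from steps 3-4.
+ runs_per_month = 1_000_000
+ gpt4_cost, llama_cost = 0.05, 0.002  # $/run, from the leaderboard
+
+ savings = runs_per_month * (gpt4_cost - llama_cost)
+ accuracy_drop = round(95.8 - 93.4, 1)
+ print(f"Monthly savings: ${savings:,.0f}")  # $48,000
+ print(f"Accuracy drop: {accuracy_drop}%")   # 2.4%
+ ```
+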
+ ### Workflow 2: Evaluate Custom Model
925
+
926
+ ```
927
+ Goal: Evaluate your fine-tuned model on SMOLTRACE benchmark
928
+
929
+ Steps:
930
+ 1. Ensure model is on HuggingFace: username/my-finetuned-model
931
+ 2. Go to Settings β†’ Configure HF token (with Run Jobs permission)
932
+ 3. Go to New Evaluation:
933
+ - Model: "username/my-finetuned-model"
934
+ - Infrastructure: HuggingFace Jobs
935
+ - Agent type: both
936
+ - Hardware: auto
937
+ 4. Click "Estimate Cost" β†’ Review: $1.50, 20 min
938
+ 5. Click "Submit Evaluation"
939
+ 6. Go to Job Monitoring β†’ Wait for "Completed" (15-25 min)
940
+ 7. Go to Leaderboard β†’ Refresh β†’ See your model in table
941
+ 8. Click your run β†’ Review detailed results
942
+ 9. Compare vs other models using Agent Chat
943
+ ```
944
+
945
+ ### Workflow 3: Debug Failed Test
946
+
947
+ ```
948
+ Goal: Understand why test_045 failed in your evaluation
949
+
950
+ Steps:
951
+ 1. Go to Leaderboard β†’ Find your run β†’ Click to open details
952
+ 2. Filter to failed tests only
953
+ 3. Click test_045 β†’ Opens trace visualization
954
+ 4. Examine waterfall:
955
+ - Span 1: LLM Call (OK)
956
+ - Span 2: Tool Call - "unknown_tool" (ERROR)
957
+ - No Span 3 (execution stopped)
958
+ 5. Ask Agent: "Why did test_045 fail?"
959
+ β†’ Agent uses debug_trace MCP tool
960
+ β†’ Returns: "Tool 'unknown_tool' not found. Add to agent's tool list."
961
+ 6. Fix: Update agent config to include missing tool
962
+ 7. Re-run evaluation with fixed config
963
+ ```
964
+
965
+ ---
966
+
967
+ ## Troubleshooting
968
+
969
+ ### Leaderboard Issues
970
+
971
+ **Problem**: "Load Leaderboard" button doesn't work
972
+ - **Solution**: Check HuggingFace token in Settings (needs Read permission)
973
+ - **Solution**: Verify leaderboard dataset exists: https://huggingface.co/datasets/kshitijthakkar/smoltrace-leaderboard
974
+
975
+ **Problem**: AI insights not showing
976
+ - **Solution**: Check Gemini API key in Settings
977
+ - **Solution**: Wait 5-10 seconds for AI generation to complete
978
+
979
+ ### Agent Chat Issues
980
+
981
+ **Problem**: Agent responds with "MCP server connection failed"
982
+ - **Solution**: Check MCP server status: https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind-mcp-server
983
+ - **Solution**: Configure Gemini API key in both TraceMind-AI and MCP server Settings
984
+
985
+ **Problem**: Agent gives incorrect information
986
+ - **Solution**: Agent may be using stale data. Ask: "Load the latest leaderboard data"
987
+ - **Solution**: Verify question is clear and specific
988
+
989
+ ### Evaluation Submission Issues
990
+
991
+ **Problem**: "Submit Evaluation" fails with auth error
992
+ - **Solution**: HF token needs "Run Jobs" permission
993
+ - **Solution**: Ensure HF Pro account is active ($9/month)
994
+ - **Solution**: Verify credit card is on file for compute charges
995
+
996
+ **Problem**: Job stuck in "Pending" status
997
+ - **Solution**: HuggingFace Jobs may have a queue; wait 5-10 minutes
998
+ - **Solution**: Try Modal as alternative infrastructure
999
+
1000
+ **Problem**: Job fails with "Out of Memory"
1001
+ - **Solution**: Model too large for selected hardware
1002
+ - **Solution**: Increase hardware tier (e.g., t4-small β†’ a10g-small)
1003
+ - **Solution**: Use auto hardware selection
1004
+
1005
+ ### Trace Visualization Issues
1006
+
1007
+ **Problem**: Traces not loading
1008
+ - **Solution**: Ensure evaluation completed successfully
1009
+ - **Solution**: Check traces dataset exists on HuggingFace
1010
+ - **Solution**: Verify HF token has Read permission
1011
+
1012
+ **Problem**: GPU metrics missing
1013
+ - **Solution**: Only available for GPU jobs (not API models)
1014
+ - **Solution**: Ensure evaluation was run with SMOLTRACE's GPU metrics enabled
1015
+
1016
+ ---
1017
+
1018
+ ## Getting Help
1019
+
1020
+ - **πŸ“§ GitHub Issues**: [TraceMind-AI/issues](https://github.com/Mandark-droid/TraceMind-AI/issues)
1021
+ - **πŸ’¬ HF Discord**: `#agents-mcp-hackathon-winter25`
1022
+ - **πŸ“– Documentation**: See [MCP_INTEGRATION.md](MCP_INTEGRATION.md) and [ARCHITECTURE.md](ARCHITECTURE.md)
1023
+
1024
+ ---
1025
+
1026
+ **Last Updated**: November 21, 2025