# TraceMind-AI - Complete User Guide
This guide provides a comprehensive walkthrough of all features and screens in TraceMind-AI.
## Table of Contents
- [Getting Started](#getting-started)
- [Screen-by-Screen Guide](#screen-by-screen-guide)
  - [📊 Leaderboard](#-leaderboard)
  - [🤖 Agent Chat](#-agent-chat)
  - [🚀 New Evaluation](#-new-evaluation)
  - [📈 Job Monitoring](#-job-monitoring)
  - [🔍 Trace Visualization](#-trace-visualization)
  - [🔬 Synthetic Data Generator](#-synthetic-data-generator)
  - [⚙️ Settings](#️-settings)
- [Common Workflows](#common-workflows)
- [Troubleshooting](#troubleshooting)
---
## Getting Started
### First-Time Setup
1. **Visit** https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind
2. **Sign in** with your HuggingFace account (required for viewing)
3. **Configure API keys** (optional but recommended):
- Go to **⚙️ Settings** tab
- Enter Gemini API Key and HuggingFace Token
- Click **"Save API Keys"**
### Navigation
TraceMind-AI is organized into tabs:
- **📊 Leaderboard**: View evaluation results with AI insights
- **🤖 Agent Chat**: Interactive autonomous agent powered by MCP tools
- **🚀 New Evaluation**: Submit evaluation jobs to HF Jobs or Modal
- **📈 Job Monitoring**: Track status of submitted jobs
- **🔍 Trace Visualization**: Deep-dive into agent execution traces
- **🔬 Synthetic Data Generator**: Create custom test datasets with AI
- **⚙️ Settings**: Configure API keys and preferences
---
## Screen-by-Screen Guide
### 📊 Leaderboard
**Purpose**: Browse all evaluation runs with AI-powered insights and detailed analysis.
#### Features
**Main Table**:
- View all evaluation runs from the SMOLTRACE leaderboard
- Sortable columns: Model, Success Rate, Cost, Duration, CO2 emissions
- Click any row to see detailed test results
**AI Insights Panel** (Top of screen):
- Automatically generated insights from MCP server
- Powered by Google Gemini 2.5 Flash
- Updates when you click "Load Leaderboard"
- Shows top performers, trends, and recommendations
**Filter & Sort Options**:
- Filter by agent type (tool, code, both)
- Filter by provider (litellm, transformers)
- Sort by any metric (success rate, cost, duration)
#### How to Use
1. **Load Data**:
```
Click "Load Leaderboard" button
→ Fetches latest evaluation runs from HuggingFace
→ AI generates insights automatically
```
2. **Read AI Insights**:
- Located at top of screen
- Summary of evaluation trends
- Top performing models
- Cost/accuracy trade-offs
- Actionable recommendations
3. **Explore Runs**:
- Scroll through table
- Sort by clicking column headers
- Click on any run to see details
4. **View Details**:
```
Click a row in the table
→ Opens detail view with:
- All test cases (success/failure)
- Execution times
- Cost breakdown
- Link to trace visualization
```
#### Example Workflow
```
Scenario: Find the most cost-effective model for production
1. Click "Load Leaderboard"
2. Read AI insights: "Llama-3.1-8B offers best cost/performance at $0.002/run"
3. Sort table by "Cost" (ascending)
4. Compare top 3 cheapest models
5. Click on Llama-3.1-8B run to see detailed results
6. Review success rate (93.4%) and test case breakdowns
7. Decision: Use Llama-3.1-8B for cost-sensitive workloads
```
#### Tips
- **Refresh regularly**: Click "Load Leaderboard" to see new evaluation results
- **Compare models**: Use the sort function to compare across different metrics
- **Trust the AI**: The insights panel bases its strategic recommendations on the complete leaderboard data
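If you prefer to analyze the leaderboard outside the UI, you can load the underlying dataset directly. A minimal sketch using the `datasets` library; the repository name is the public leaderboard dataset referenced in Troubleshooting, while the split and column names are assumptions to verify against the actual schema:

```python
from datasets import load_dataset

# Public leaderboard dataset (repo linked in the Troubleshooting section)
runs = load_dataset("kshitijthakkar/smoltrace-leaderboard", split="train")  # split name assumed
df = runs.to_pandas()

# Inspect the real schema first -- the column names below are assumptions
print(df.columns.tolist())

# Example: the three cheapest runs, assuming 'model', 'success_rate', and 'cost' columns
print(df.sort_values("cost").head(3)[["model", "success_rate", "cost"]])
```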
---
### 🤖 Agent Chat
**Purpose**: Interactive autonomous agent that can answer questions about evaluations using MCP tools.
**🎯 Track 2 Feature**: This demonstrates MCP client usage with the smolagents framework.
#### Features
**Autonomous Agent**:
- Built with `smolagents` framework
- Has access to all TraceMind MCP Server tools
- Plans and executes multi-step actions
- Provides detailed, data-driven answers
**MCP Tools Available to Agent**:
- `analyze_leaderboard` - Get AI insights about top performers
- `estimate_cost` - Calculate evaluation costs before running
- `debug_trace` - Analyze execution traces
- `compare_runs` - Compare two evaluation runs
- `get_top_performers` - Fetch top N models efficiently
- `get_leaderboard_summary` - Get high-level statistics
- `get_dataset` - Load SMOLTRACE datasets
- `analyze_results` - Analyze detailed test results
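For orientation, here is a rough sketch of how an agent could be wired to these MCP tools with smolagents. The server URL, transport, and model id below are placeholders, not the deployed TraceMind configuration:

```python
from smolagents import CodeAgent, LiteLLMModel, ToolCollection

# Placeholder endpoint and model id -- substitute the real MCP server URL and your own keys
MCP_SERVER = {"url": "https://example-mcp-server.hf.space/sse"}
model = LiteLLMModel(model_id="gemini/gemini-2.5-flash")

# ToolCollection.from_mcp exposes the server's tools (analyze_leaderboard,
# estimate_cost, debug_trace, ...) to the agent as ordinary smolagents tools
with ToolCollection.from_mcp(MCP_SERVER, trust_remote_code=True) as mcp_tools:
    agent = CodeAgent(tools=[*mcp_tools.tools], model=model)
    print(agent.run("What are the top 3 performing models and how much do they cost?"))
```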
**Agent Reasoning Visibility**:
- Toggle **"Show Agent Reasoning"** to see:
- Planning steps
- Tool execution logs
- Intermediate results
- Final synthesis
**Quick Action Buttons**:
- **"Quick: Top Models"**: Get top 5 models with costs
- **"Quick: Cost Estimate"**: Estimate cost for a model
- **"Quick: Load Leaderboard"**: Fetch leaderboard summary
#### How to Use
1. **Start a Conversation**:
```
Type your question in the chat box
Example: "What are the top 3 performing models and how much do they cost?"
Click "Send"
→ Agent plans approach
→ Executes MCP tools
→ Returns synthesized answer
```
2. **Watch Agent Work** (optional):
```
Enable "Show Agent Reasoning" checkbox
→ See step-by-step execution:
Step 1: Calling get_top_performers with metric=success_rate, top_n=3
Step 2: Analyzing costs from results
Step 3: Formatting response with recommendations
```
3. **Ask Follow-Up Questions**:
```
You: "Which one is most cost-effective?"
Agent: Uses previous context + MCP tools to answer
```
4. **Use Quick Actions**:
```
Click "Quick: Top Models" button
→ Agent automatically executes query
→ Returns top 5 models with metrics
```
#### Example Conversations
**Example 1: Cost Comparison**
```
User: "Compare the costs of the top 3 models. Which one offers the best value?"
Agent:
Let me analyze the top 3 models by success rate:
[Calls: get_top_performers(metric="success_rate", top_n=3)]
Top 3 Models:
1. GPT-4: 95.8% success, $0.05/run
2. Claude-3: 94.1% success, $0.04/run
3. Llama-3.1-8B: 93.4% success, $0.002/run
[Calls: analyze_leaderboard(metric_focus="cost")]
Cost Analysis:
- GPT-4: Highest accuracy but 25x more expensive than Llama
- Claude-3: Good middle ground at $0.04/run
- Llama-3.1: Best value at $0.002/run with only 2.4% accuracy drop
Recommendation for "best value":
→ Llama-3.1-8B offers 93.4% accuracy at $0.002/run
→ For 1M runs: Saves $48,000/month vs GPT-4
→ Only 2.4% accuracy trade-off
```
**Example 2: Evaluation Planning**
```
User: "I want to evaluate DeepSeek-V3 with 100 tests. How much will it cost and which hardware should I use?"
Agent:
Let me estimate costs for DeepSeek-V3:
[Calls: estimate_cost(model="deepseek-ai/DeepSeek-V3", num_tests=100, hardware="auto")]
Cost Estimate:
- Model: deepseek-ai/DeepSeek-V3
- Tests: 100
- Recommended hardware: GPU A100 (70B+ parameters)
Breakdown:
- LLM inference cost: ~$0.01/test (self-hosted, no API fees)
- HF Jobs GPU cost: ~$3.00/hr for A100
- Estimated duration: 25 minutes
- Total compute cost: $1.25
- Total cost: $1.25 (100 tests)
Recommendation:
Use HuggingFace Jobs with A100 GPU. Total cost: $1.25 for 100 tests.
Alternative: Modal H200 for 2x faster inference at $2.50.
```
#### Tips
- **Be specific**: Ask clear, focused questions for better answers
- **Use context**: Agent remembers conversation history
- **Watch reasoning**: Enable to understand how agent uses MCP tools
- **Try quick actions**: Fast way to get common information
---
### 🚀 New Evaluation
**Purpose**: Submit evaluation jobs to HuggingFace Jobs or Modal for running SMOLTRACE evaluations.
**⚠️ Requires**: HuggingFace Pro account ($9/month) with credit card, or Modal account.
#### Features
**Model Selection**:
- Enter any model name (format: `provider/model-name`)
- Examples: `openai/gpt-4`, `meta-llama/Llama-3.1-8B`, `deepseek-ai/DeepSeek-V3`
- Auto-detects whether the model is an API model or a local (self-hosted) model
**Infrastructure Choice**:
- **HuggingFace Jobs**: Managed compute (H200, A100, A10, T4, CPU)
- **Modal**: Serverless GPU compute (pay-per-second)
**Hardware Selection**:
- **Auto** (recommended): Automatically selects optimal hardware based on model size
- **Manual**: Choose specific GPU tier (A10, A100, H200) or CPU
**Cost Estimation**:
- Click **"💰 Estimate Cost"** before submitting
- Shows predicted:
- LLM API costs (for API models)
- Compute costs (for local models)
- Duration estimate
- CO2 emissions
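The compute portion of the estimate is essentially hourly rate × expected duration. A small sketch of that arithmetic using the A100 figures quoted elsewhere in this guide ($3.00/hr, ~25 minutes); the real estimator also folds in API token pricing and CO2:

```python
# Illustrative compute-cost arithmetic only (not the actual estimator)
hourly_rate_usd = 3.00      # ~A100 rate on HF Jobs (see Hardware Selection Guide)
expected_minutes = 25       # predicted evaluation duration
compute_cost = hourly_rate_usd * (expected_minutes / 60)
print(f"Estimated compute cost: ${compute_cost:.2f}")  # -> $1.25
```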
**Agent Type**:
- **tool**: Test tool-calling capabilities
- **code**: Test code generation capabilities
- **both**: Test both (recommended)
#### How to Use
**Step 1: Configure Prerequisites** (One-time setup)
For **HuggingFace Jobs**:
```
1. Sign up for HF Pro: https://huggingface.co/pricing ($9/month)
2. Add credit card for compute charges
3. Create HF token with "Read + Write + Run Jobs" permissions
4. Go to Settings tab → Enter HF token → Save
```
For **Modal** (Alternative):
```
1. Sign up: https://modal.com (free tier available)
2. Generate API token: https://modal.com/settings/tokens
3. Go to Settings tab → Enter MODAL_TOKEN_ID + MODAL_TOKEN_SECRET → Save
```
For **API Models** (OpenAI, Anthropic, etc.):
```
1. Get API key from provider (e.g., https://platform.openai.com/api-keys)
2. Go to Settings tab → Enter provider API key → Save
```
**Step 2: Create Evaluation**
```
1. Enter model name:
Example: "meta-llama/Llama-3.1-8B"
2. Select infrastructure:
- HuggingFace Jobs (default)
- Modal (alternative)
3. Choose agent type:
- "both" (recommended)
4. Select hardware:
- "auto" (recommended - smart selection)
- Or choose manually: cpu-basic, t4-small, a10g-small, a100-large, h200
5. Set timeout (optional):
- Default: 3600s (1 hour)
- Range: 300s - 7200s
6. Click "💰 Estimate Cost":
→ Shows predicted cost and duration
→ Example: "$2.00, 20 minutes, 0.5g CO2"
7. Review estimate, then click "Submit Evaluation"
```
**Step 3: Monitor Job**
```
After submission:
→ Job ID displayed
→ Go to "📈 Job Monitoring" tab to track progress
→ Or visit HuggingFace Jobs dashboard: https://huggingface.co/jobs
```
**Step 4: View Results**
```
When job completes:
→ Results automatically uploaded to HuggingFace datasets
→ Appears in Leaderboard within 1-2 minutes
→ Click on your run to see detailed results
```
#### Hardware Selection Guide
**For API Models** (OpenAI, Anthropic, Google):
- Use: `cpu-basic` (HF Jobs) or CPU (Modal)
- Cost: ~$0.05/hr (HF), ~$0.0001/sec (Modal)
- Why: No GPU needed for API calls
**For Small Models** (4B-8B parameters):
- Use: `t4-small` (HF) or A10G (Modal)
- Cost: ~$0.60/hr (HF), ~$0.0006/sec (Modal)
- Examples: Llama-3.1-8B, Mistral-7B
**For Medium Models** (7B-13B parameters):
- Use: `a10g-small` (HF) or A10G (Modal)
- Cost: ~$1.10/hr (HF), ~$0.0006/sec (Modal)
- Examples: Qwen2.5-14B, Mixtral-8x7B
**For Large Models** (70B+ parameters):
- Use: `a100-large` (HF) or A100-80GB (Modal)
- Cost: ~$3.00/hr (HF), ~$0.0030/sec (Modal)
- Examples: Llama-3.1-70B, DeepSeek-V3
**For Fastest Inference**:
- Use: `h200` (HF or Modal)
- Cost: ~$5.00/hr (HF), ~$0.0050/sec (Modal)
- Best for: Time-sensitive evaluations, large batches
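The "auto" option applies this kind of mapping for you. A rough illustration of the idea only; the actual selection logic lives in TraceMind/SMOLTRACE and may differ:

```python
def pick_hardware(param_count_b: float | None, is_api_model: bool) -> str:
    """Illustrative heuristic mirroring the guide above -- not the real 'auto' logic."""
    if is_api_model or param_count_b is None:
        return "cpu-basic"     # API calls need no GPU
    if param_count_b <= 8:
        return "t4-small"      # e.g. Llama-3.1-8B, Mistral-7B
    if param_count_b <= 14:
        return "a10g-small"    # e.g. Qwen2.5-14B
    return "a100-large"        # 70B+ models; choose "h200" when speed matters most

print(pick_hardware(8, is_api_model=False))    # t4-small
print(pick_hardware(None, is_api_model=True))  # cpu-basic
```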
#### Example Workflows
**Workflow 1: Evaluate API Model (OpenAI GPT-4)**
```
1. Model: "openai/gpt-4"
2. Infrastructure: HuggingFace Jobs
3. Agent type: both
4. Hardware: auto (selects cpu-basic)
5. Estimate: $50.00 (mostly API costs), 45 min
6. Submit → Monitor → View in leaderboard
```
**Workflow 2: Evaluate Local Model (Llama-3.1-8B)**
```
1. Model: "meta-llama/Llama-3.1-8B"
2. Infrastructure: Modal (for pay-per-second billing)
3. Agent type: both
4. Hardware: auto (selects A10G)
5. Estimate: $0.20, 15 min
6. Submit → Monitor → View in leaderboard
```
#### Tips
- **Always estimate first**: Prevents surprise costs
- **Use "auto" hardware**: Smart selection based on model size
- **Start small**: Test with 10-20 tests before scaling to 100+
- **Monitor jobs**: Check Job Monitoring tab for status
- **Modal for experimentation**: Pay-per-second is cost-effective for testing
---
### 📈 Job Monitoring
**Purpose**: Track status of submitted evaluation jobs.
#### Features
**Job Status Display**:
- Job ID
- Current status (pending, running, completed, failed)
- Start time
- Duration
- Infrastructure (HF Jobs or Modal)
**Real-time Updates**:
- Auto-refreshes every 30 seconds
- Manual refresh button
**Job Actions**:
- View logs
- Cancel job (if still running)
- View results (if completed)
#### How to Use
```
1. Go to "📈 Job Monitoring" tab
2. See list of your submitted jobs
3. Click "Refresh" for latest status
4. When status = "completed":
→ Click "View Results"
→ Opens leaderboard filtered to your run
```
#### Job Statuses
- **Pending**: Job queued, waiting for resources
- **Running**: Evaluation in progress
- **Completed**: Evaluation finished successfully
- **Failed**: Evaluation encountered an error
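If you script around the UI, monitoring reduces to polling until one of the terminal statuses above is reached. A minimal sketch; `get_job_status` is a hypothetical helper standing in for whichever HF Jobs or Modal API/CLI call you use to look up status:

```python
import time

TERMINAL_STATUSES = {"completed", "failed"}

def wait_for_job(job_id: str, poll_seconds: int = 30) -> str:
    """Poll a job until it finishes; mirrors the UI's 30-second auto-refresh."""
    while True:
        status = get_job_status(job_id)  # hypothetical helper -- replace with a real lookup
        print(f"{job_id}: {status}")
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
```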
#### Tips
- **Check logs** if job fails: Helps diagnose issues
- **Expected duration**:
- API models: 2-5 minutes
- Local models: 15-30 minutes (includes model download)
---
### 🔍 Trace Visualization
**Purpose**: Deep-dive into OpenTelemetry traces to understand agent execution.
**Access**: Click on any test case in a run's detail view
#### Features
**Waterfall Diagram**:
- Visual timeline of execution
- Spans show: LLM calls, tool executions, reasoning steps
- Duration bars (wider = slower)
- Parent-child relationships
**Span Details**:
- Span name (e.g., "LLM Call - Reasoning", "Tool Call - get_weather")
- Start/end times
- Duration
- Attributes (model, tokens, cost, tool inputs/outputs)
- Status (OK, ERROR)
**GPU Metrics Overlay** (for GPU jobs only):
- GPU utilization %
- Memory usage
- Temperature
- CO2 emissions
**MCP-Powered Q&A**:
- Ask questions about the trace
- Example: "Why was tool X called twice?"
- Agent uses `debug_trace` MCP tool to analyze
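The spans shown here follow the OpenTelemetry model. For context, this is roughly what produces such a trace on the agent side; a minimal sketch with the OpenTelemetry Python SDK and a console exporter, with illustrative span names and attribute keys:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Print spans to stdout for illustration; SMOLTRACE exports them to a traces dataset instead
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
tracer = trace.get_tracer("agent-demo")

with tracer.start_as_current_span("Agent Execution"):                 # parent span
    with tracer.start_as_current_span("LLM Call - Reasoning") as s:   # child span
        s.set_attribute("llm.model", "meta-llama/Llama-3.1-8B")
        s.set_attribute("llm.tokens.total", 512)
    with tracer.start_as_current_span("Tool Call - get_weather") as s:
        s.set_attribute("tool.input", '{"city": "Tokyo"}')
```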
#### How to Use
```
1. From leaderboard → Click a run → Click a test case
2. View waterfall diagram:
→ Spans arranged chronologically
→ Parent spans (e.g., "Agent Execution")
→ Child spans (e.g., "LLM Call", "Tool Call")
3. Click any span:
→ See detailed attributes
→ Token counts, costs, inputs/outputs
4. Ask questions (MCP-powered):
User: "Why did this test fail?"
→ Agent analyzes trace with debug_trace tool
→ Returns explanation with span references
5. Check GPU metrics (if available):
→ Graph shows utilization over time
→ Overlaid on execution timeline
```
#### Example Analysis
**Scenario: Understanding a slow execution**
```
1. Open trace for test_045 (duration: 8.5s)
2. Waterfall shows:
- Span 1: LLM Call - Reasoning (1.2s) ✓
- Span 2: Tool Call - search_web (6.5s) ⚠️ SLOW
- Span 3: LLM Call - Final Response (0.8s) ✓
3. Click Span 2 (search_web):
- Input: {"query": "weather in Tokyo"}
- Output: 5 results
- Duration: 6.5s (6x slower than typical)
4. Ask agent: "Why was the search_web call so slow?"
→ Agent analysis:
"The search_web call took 6.5s due to network latency.
Span attributes show API response time: 6.2s.
This is an external dependency issue, not agent code.
Recommendation: Implement timeout (5s) and fallback strategy."
```
#### Tips
- **Look for patterns**: Similar failures often have common spans
- **Use MCP Q&A**: Faster than manual trace analysis
- **Check GPU metrics**: Identify resource bottlenecks
- **Compare successful vs failed traces**: Spot differences
---
### 🔬 Synthetic Data Generator
**Purpose**: Generate custom synthetic test datasets for agent evaluation using AI, complete with domain-specific tasks and prompt templates.
#### Features
**AI-Powered Dataset Generation**:
- Generate 5-100 synthetic tasks using Google Gemini 2.5 Flash
- Customizable domain, tools, difficulty, and agent type
- Automatic batching for large datasets (parallel generation)
- SMOLTRACE-format output ready for evaluation
**Prompt Template Generation**:
- Customized YAML templates based on smolagents format
- Optimized for your specific domain and tools
- Included automatically in dataset card
**Push to HuggingFace Hub**:
- One-click upload to HuggingFace Hub
- Public or private repositories
- Auto-generated README with usage instructions
- Ready to use with SMOLTRACE evaluations
#### How to Use
**Step 1: Configure & Generate Dataset**
1. Navigate to **🔬 Synthetic Data Generator** tab
2. Configure generation parameters:
- **Domain**: Topic/industry (e.g., "travel", "finance", "healthcare", "customer_support")
- **Tools**: Comma-separated list of tool names (e.g., "get_weather,search_flights,book_hotel")
- **Number of Tasks**: 5-100 tasks (slider)
- **Difficulty Level**:
- `balanced` (40% easy, 40% medium, 20% hard)
- `easy_only` (100% easy tasks)
- `medium_only` (100% medium tasks)
- `hard_only` (100% hard tasks)
- `progressive` (50% easy, 30% medium, 20% hard)
- **Agent Type**:
- `tool` (ToolCallingAgent only)
- `code` (CodeAgent only)
- `both` (50/50 mix)
3. Click **"🎲 Generate Synthetic Dataset"**
4. Wait for generation (30-120s depending on size):
- Shows progress message
- Automatic batching for >20 tasks
- Parallel API calls for faster generation
**Step 2: Review Generated Content**
1. **Dataset Preview Tab**:
- View all generated tasks in JSON format
- Check task IDs, prompts, expected tools, difficulty
- See dataset statistics:
- Total tasks
- Difficulty distribution
- Agent type distribution
- Tools coverage
2. **Prompt Template Tab**:
- View customized YAML prompt template
- Based on smolagents templates
- Adapted for your domain and tools
- Ready to use with ToolCallingAgent or CodeAgent
**Step 3: Push to HuggingFace Hub** (Optional)
1. Enter **Repository Name**:
- Format: `username/smoltrace-{domain}-tasks`
- Example: `alice/smoltrace-finance-tasks`
- Auto-filled with your HF username after generation
2. Set **Visibility**:
- ☐ Private Repository (unchecked = public)
- ☑ Private Repository (checked = private)
3. Provide **HuggingFace Token** (optional):
- Leave empty to use environment token (HF_TOKEN from Settings)
- Or paste token from https://huggingface.co/settings/tokens
- Requires write permissions
4. Click **"📤 Push to HuggingFace Hub"**
5. Wait for upload (5-30s):
- Creates dataset repository
- Uploads tasks
- Generates README with:
- Usage instructions
- Prompt template
- SMOLTRACE integration code
- Returns dataset URL
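The push step wraps a standard Hub dataset upload. If you ever need to do it yourself, for example after editing the generated tasks locally, a minimal sketch with the `datasets` library; the file path and repository name are placeholders:

```python
import json
from datasets import Dataset

# Tasks saved from the Dataset Preview tab (placeholder path)
with open("generated_tasks.json") as f:
    tasks = json.load(f)

ds = Dataset.from_list(tasks)
ds.push_to_hub(
    "yourname/smoltrace-finance-tasks",  # placeholder repo id
    private=False,                       # mirrors the "Private Repository" checkbox
    # token="hf_...",                    # omit to fall back to the HF_TOKEN in your environment
)
```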
#### Example Workflow
```
Scenario: Create finance evaluation dataset with 20 tasks
1. Configure:
Domain: "finance"
Tools: "get_stock_price,calculate_roi,get_market_news,send_alert"
Number of Tasks: 20
Difficulty: "balanced"
Agent Type: "both"
2. Click "Generate"
→ AI generates 20 tasks:
- 8 easy (single tool, straightforward)
- 8 medium (multiple tools or complex logic)
- 4 hard (complex reasoning, edge cases)
- 10 for ToolCallingAgent
- 10 for CodeAgent
→ Also generates customized prompt template
3. Review Dataset Preview:
Task 1:
{
"id": "finance_stock_price_1",
"prompt": "What is the current price of AAPL stock?",
"expected_tool": "get_stock_price",
"difficulty": "easy",
"agent_type": "tool",
"expected_keywords": ["AAPL", "price", "$"]
}
Task 15:
{
"id": "finance_complex_analysis_15",
"prompt": "Calculate the ROI for investing $10,000 in AAPL last year and send an alert if ROI > 15%",
"expected_tool": "calculate_roi",
"expected_tool_calls": 2,
"difficulty": "hard",
"agent_type": "code",
"expected_keywords": ["ROI", "15%", "alert"]
}
4. Review Prompt Template:
See customized YAML with:
- Finance-specific system prompt
- Tool descriptions for get_stock_price, calculate_roi, etc.
- Response format guidelines
5. Push to Hub:
Repository: "yourname/smoltrace-finance-tasks"
Private: No (public)
Token: (empty, using environment token)
→ Uploads to https://huggingface.co/datasets/yourname/smoltrace-finance-tasks
→ README includes usage instructions and prompt template
6. Use in evaluation:
# Load your custom dataset (Python)
from datasets import load_dataset
dataset = load_dataset("yourname/smoltrace-finance-tasks")
# Run SMOLTRACE evaluation
smoltrace-eval --model openai/gpt-4 \
--dataset-name yourname/smoltrace-finance-tasks \
--agent-type both
```
#### Configuration Reference
**Difficulty Levels Explained**:
| Level | Characteristics | Example |
|-------|----------------|---------|
| **Easy** | Single tool call, straightforward input, clear expected output | "What's the weather in Tokyo?" → get_weather("Tokyo") |
| **Medium** | Multiple tool calls OR complex input parsing OR conditional logic | "Compare weather in Tokyo and London" → get_weather("Tokyo"), get_weather("London"), compare |
| **Hard** | Multiple tools, complex reasoning, edge cases, error handling | "Plan a trip with best weather, book flights if under $500, alert if unavailable" |
**Agent Types Explained**:
| Type | Description | Use Case |
|------|-------------|----------|
| **tool** | ToolCallingAgent - Declarative tool calling with structured outputs | API-based models that support function calling (GPT-4, Claude) |
| **code** | CodeAgent - Writes Python code to use tools programmatically | Models that excel at code generation (Qwen-Coder, DeepSeek-Coder) |
| **both** | 50/50 mix of tool and code agent tasks | Comprehensive evaluation across agent types |
#### Best Practices
**Domain Selection**:
- Be specific: "customer_support_saas" > "support"
- Match your use case: Use actual business domain
- Consider tools available: Domain should align with tools
**Tool Names**:
- Use descriptive names: "get_stock_price" > "fetch"
- Match actual tool implementations
- 3-8 tools is ideal (enough variety, not overwhelming)
- Include mix of data retrieval and action tools
**Number of Tasks**:
- 5-10 tasks: Quick testing, proof of concept
- 20-30 tasks: Solid evaluation dataset
- 50-100 tasks: Comprehensive benchmark
**Difficulty Distribution**:
- `balanced`: Best for general evaluation
- `progressive`: Good for learning/debugging
- `easy_only`: Quick sanity checks
- `hard_only`: Stress testing advanced capabilities
**Quality Assurance**:
- Always review generated tasks before pushing
- Check for domain relevance and variety
- Verify expected tools match your actual tools
- Ensure prompts are clear and executable
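A quick programmatic check can complement the manual review. A sketch assuming the task fields shown in the example workflow above (`id`, `prompt`, `expected_tool`, `difficulty`, `agent_type`) and a placeholder file path:

```python
import json

MY_TOOLS = {"get_stock_price", "calculate_roi", "get_market_news", "send_alert"}
REQUIRED_FIELDS = {"id", "prompt", "expected_tool", "difficulty", "agent_type"}

with open("generated_tasks.json") as f:  # placeholder path for the exported tasks
    tasks = json.load(f)

for task in tasks:
    missing = REQUIRED_FIELDS - task.keys()
    if missing:
        print(f"{task.get('id', '?')}: missing fields {sorted(missing)}")
    if task.get("expected_tool") not in MY_TOOLS:
        print(f"{task.get('id', '?')}: expected_tool {task.get('expected_tool')!r} is not in your tool set")
```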
#### Troubleshooting
**Generation fails with "Invalid API key"**:
- Go to **⚙️ Settings**
- Configure Gemini API Key
- Get key from https://aistudio.google.com/apikey
**Generated tasks don't match domain**:
- Be more specific in domain description
- Try regenerating with adjusted parameters
- Review prompt template for domain alignment
**Push to Hub fails with "Authentication error"**:
- Verify HuggingFace token has write permissions
- Get token from https://huggingface.co/settings/tokens
- Check token in **⚙️ Settings** or provide directly
**Dataset generation is slow (>60s)**:
- Large requests (>20 tasks) are automatically batched
- Each batch takes 30-120s
- Example: 100 tasks = 5 batches × 60s = ~5 minutes
- This is normal for large datasets
**Tasks are too easy/hard**:
- Adjust difficulty distribution
- Regenerate with different settings
- Mix difficulty levels with `balanced` or `progressive`
#### Advanced Tips
**Iterative Refinement**:
1. Generate 10 tasks with `balanced` difficulty
2. Review quality and variety
3. If satisfied, generate 50-100 tasks with same settings
4. If not, adjust domain/tools and regenerate
**Dataset Versioning**:
- Use version suffixes: `username/smoltrace-finance-tasks-v2`
- Iterate on datasets as tools evolve
- Keep track of which version was used for evaluations
**Combining Datasets**:
- Generate multiple small datasets for different domains
- Use SMOLTRACE CLI to merge datasets
- Create comprehensive multi-domain benchmarks
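If you prefer to merge in Python rather than through the SMOLTRACE CLI, the `datasets` library's `concatenate_datasets` covers the same need; the repository names below are placeholders:

```python
from datasets import concatenate_datasets, load_dataset

finance = load_dataset("yourname/smoltrace-finance-tasks", split="train")
travel = load_dataset("yourname/smoltrace-travel-tasks", split="train")

combined = concatenate_datasets([finance, travel])
combined.push_to_hub("yourname/smoltrace-multi-domain-tasks")  # placeholder repo id
```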
**Custom Prompt Templates**:
- Generate prompt template separately
- Customize further based on your needs
- Use in agent initialization before evaluation
- Include in dataset card for reproducibility
---
### ⚙️ Settings
**Purpose**: Configure API keys, preferences, and authentication.
#### Features
**API Key Configuration**:
- Gemini API Key (for MCP server AI analysis)
- HuggingFace Token (for dataset access + job submission)
- Modal Token ID + Secret (for Modal job submission)
- LLM Provider Keys (OpenAI, Anthropic, etc.)
**Preferences**:
- Default infrastructure (HF Jobs vs Modal)
- Default hardware tier
- Auto-refresh intervals
**Security**:
- Keys stored in browser session only (not server)
- HTTPS encryption for all API calls
- Keys never logged or exposed
#### How to Use
**Configure Essential Keys**:
```
1. Go to "⚙️ Settings" tab
2. Enter Gemini API Key:
- Get from: https://ai.google.dev/
- Click "Get API Key" → Create project → Generate
- Paste into field
- Free tier: 1,500 requests/day
3. Enter HuggingFace Token:
- Get from: https://huggingface.co/settings/tokens
- Click "New token" → Name: "TraceMind"
- Permissions:
- Read (for viewing datasets)
- Write (for uploading results)
- Run Jobs (for evaluation submission)
- Paste into field
4. Click "Save API Keys"
→ Keys stored in browser session
→ MCP server will use your keys
```
**Configure for Job Submission** (Optional):
For **HuggingFace Jobs**:
```
Already configured if you entered HF token above with "Run Jobs" permission.
```
For **Modal** (Alternative):
```
1. Sign up: https://modal.com
2. Get token: https://modal.com/settings/tokens
3. Copy MODAL_TOKEN_ID (starts with 'ak-')
4. Copy MODAL_TOKEN_SECRET (starts with 'as-')
5. Paste both into Settings → Save
```
For **API Model Providers**:
```
1. Get API key from provider:
- OpenAI: https://platform.openai.com/api-keys
- Anthropic: https://console.anthropic.com/settings/keys
- Google: https://ai.google.dev/
2. Paste into corresponding field in Settings
3. Click "Save LLM Provider Keys"
```
#### Security Best Practices
- **Use environment variables**: For production, set keys via HF Spaces secrets
- **Rotate keys regularly**: Generate new tokens every 3-6 months
- **Minimal permissions**: Only grant "Run Jobs" if you need to submit evaluations
- **Monitor usage**: Check API provider dashboards for unexpected charges
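When keys come from HF Spaces secrets rather than the Settings tab, the application reads them from environment variables. A minimal sketch; `HF_TOKEN`, `MODAL_TOKEN_ID`, and `MODAL_TOKEN_SECRET` are the names used elsewhere in this guide, while `GEMINI_API_KEY` is an assumed name:

```python
import os

# Prefer secrets/environment values; missing keys stay None so failures are explicit
hf_token = os.environ.get("HF_TOKEN")
gemini_key = os.environ.get("GEMINI_API_KEY")  # assumed variable name
modal_id = os.environ.get("MODAL_TOKEN_ID")
modal_secret = os.environ.get("MODAL_TOKEN_SECRET")

if not hf_token:
    raise RuntimeError("HF_TOKEN is not set -- configure it as a Space secret")
```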
---
## Common Workflows
### Workflow 1: Quick Model Comparison
```
Goal: Compare GPT-4 vs Llama-3.1-8B for production use
Steps:
1. Go to Leaderboard → Load Leaderboard
2. Read AI insights: "GPT-4 leads accuracy, Llama-3.1 best cost"
3. Sort by Success Rate → Note: GPT-4 (95.8%), Llama (93.4%)
4. Sort by Cost → Note: GPT-4 ($0.05), Llama ($0.002)
5. Go to Agent Chat → Ask: "Compare GPT-4 and Llama-3.1. Which should I use for 1M runs/month?"
→ Agent analyzes with MCP tools
→ Returns: "Llama saves $48K/month, only 2.4% accuracy drop"
6. Decision: Use Llama-3.1-8B for production
```
### Workflow 2: Evaluate Custom Model
```
Goal: Evaluate your fine-tuned model on SMOLTRACE benchmark
Steps:
1. Ensure model is on HuggingFace: username/my-finetuned-model
2. Go to Settings → Configure HF token (with Run Jobs permission)
3. Go to New Evaluation:
- Model: "username/my-finetuned-model"
- Infrastructure: HuggingFace Jobs
- Agent type: both
- Hardware: auto
4. Click "Estimate Cost" → Review: $1.50, 20 min
5. Click "Submit Evaluation"
6. Go to Job Monitoring → Wait for "Completed" (15-25 min)
7. Go to Leaderboard → Refresh → See your model in table
8. Click your run → Review detailed results
9. Compare vs other models using Agent Chat
```
### Workflow 3: Debug Failed Test
```
Goal: Understand why test_045 failed in your evaluation
Steps:
1. Go to Leaderboard → Find your run → Click to open details
2. Filter to failed tests only
3. Click test_045 → Opens trace visualization
4. Examine waterfall:
- Span 1: LLM Call (OK)
- Span 2: Tool Call - "unknown_tool" (ERROR)
- No Span 3 (execution stopped)
5. Ask Agent: "Why did test_045 fail?"
→ Agent uses debug_trace MCP tool
→ Returns: "Tool 'unknown_tool' not found. Add to agent's tool list."
6. Fix: Update agent config to include missing tool
7. Re-run evaluation with fixed config
```
---
## Troubleshooting
### Leaderboard Issues
**Problem**: "Load Leaderboard" button doesn't work
- **Solution**: Check HuggingFace token in Settings (needs Read permission)
- **Solution**: Verify leaderboard dataset exists: https://huggingface.co/datasets/kshitijthakkar/smoltrace-leaderboard
**Problem**: AI insights not showing
- **Solution**: Check Gemini API key in Settings
- **Solution**: Wait 5-10 seconds for AI generation to complete
### Agent Chat Issues
**Problem**: Agent responds with "MCP server connection failed"
- **Solution**: Check MCP server status: https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind-mcp-server
- **Solution**: Configure Gemini API key in both TraceMind-AI and MCP server Settings
**Problem**: Agent gives incorrect information
- **Solution**: Agent may be using stale data. Ask: "Load the latest leaderboard data"
- **Solution**: Verify question is clear and specific
### Evaluation Submission Issues
**Problem**: "Submit Evaluation" fails with auth error
- **Solution**: HF token needs "Run Jobs" permission
- **Solution**: Ensure HF Pro account is active ($9/month)
- **Solution**: Verify credit card is on file for compute charges
**Problem**: Job stuck in "Pending" status
- **Solution**: HuggingFace Jobs may have queue. Wait 5-10 minutes.
- **Solution**: Try Modal as alternative infrastructure
**Problem**: Job fails with "Out of Memory"
- **Solution**: Model too large for selected hardware
- **Solution**: Increase hardware tier (e.g., t4-small → a10g-small)
- **Solution**: Use auto hardware selection
### Trace Visualization Issues
**Problem**: Traces not loading
- **Solution**: Ensure evaluation completed successfully
- **Solution**: Check traces dataset exists on HuggingFace
- **Solution**: Verify HF token has Read permission
**Problem**: GPU metrics missing
- **Solution**: Only available for GPU jobs (not API models)
- **Solution**: Ensure evaluation was run with SMOLTRACE's GPU metrics enabled
---
## Getting Help
- **📧 GitHub Issues**: [TraceMind-AI/issues](https://github.com/Mandark-droid/TraceMind-AI/issues)
- **💬 HF Discord**: `#agents-mcp-hackathon-winter25`
- **📖 Documentation**: See [MCP_INTEGRATION.md](MCP_INTEGRATION.md) and [ARCHITECTURE.md](ARCHITECTURE.md)
---
**Last Updated**: November 21, 2025