# TraceMind-AI - Complete User Guide
This guide provides a comprehensive walkthrough of all features and screens in TraceMind-AI.
## Table of Contents
- [Getting Started](#getting-started)
- [Screen-by-Screen Guide](#screen-by-screen-guide)
  - [📊 Leaderboard](#-leaderboard)
  - [🤖 Agent Chat](#-agent-chat)
  - [🚀 New Evaluation](#-new-evaluation)
  - [📈 Job Monitoring](#-job-monitoring)
  - [🔍 Trace Visualization](#-trace-visualization)
  - [🔬 Synthetic Data Generator](#-synthetic-data-generator)
  - [⚙️ Settings](#️-settings)
- [Common Workflows](#common-workflows)
- [Troubleshooting](#troubleshooting)
---
## Getting Started
### First-Time Setup
1. **Visit** https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind
2. **Sign in** with your HuggingFace account (required for viewing)
3. **Configure API keys** (optional but recommended):
- Go to **⚙️ Settings** tab
- Enter Gemini API Key and HuggingFace Token
- Click **"Save API Keys"**
### Navigation
TraceMind-AI is organized into tabs:
- **📊 Leaderboard**: View evaluation results with AI insights
- **🤖 Agent Chat**: Interactive autonomous agent powered by MCP tools
- **🚀 New Evaluation**: Submit evaluation jobs to HF Jobs or Modal
- **📈 Job Monitoring**: Track status of submitted jobs
- **🔍 Trace Visualization**: Deep-dive into agent execution traces
- **🔬 Synthetic Data Generator**: Create custom test datasets with AI
- **⚙️ Settings**: Configure API keys and preferences
---
## Screen-by-Screen Guide
### 📊 Leaderboard
**Purpose**: Browse all evaluation runs with AI-powered insights and detailed analysis.
#### Features
**Main Table**:
- View all evaluation runs from the SMOLTRACE leaderboard
- Sortable columns: Model, Success Rate, Cost, Duration, CO2 emissions
- Click any row to see detailed test results
**AI Insights Panel** (Top of screen):
- Automatically generated insights from MCP server
- Powered by Google Gemini 2.5 Flash
- Updates when you click "Load Leaderboard"
- Shows top performers, trends, and recommendations
**Filter & Sort Options**:
- Filter by agent type (tool, code, both)
- Filter by provider (litellm, transformers)
- Sort by any metric (success rate, cost, duration)
#### How to Use
1. **Load Data**:
```
Click "Load Leaderboard" button
→ Fetches latest evaluation runs from HuggingFace
→ AI generates insights automatically
```
2. **Read AI Insights**:
- Located at top of screen
- Summary of evaluation trends
- Top performing models
- Cost/accuracy trade-offs
- Actionable recommendations
3. **Explore Runs**:
- Scroll through table
- Sort by clicking column headers
- Click on any run to see details
4. **View Details**:
```
Click a row in the table
→ Opens detail view with:
- All test cases (success/failure)
- Execution times
- Cost breakdown
- Link to trace visualization
```
#### Example Workflow
```
Scenario: Find the most cost-effective model for production
1. Click "Load Leaderboard"
2. Read AI insights: "Llama-3.1-8B offers best cost/performance at $0.002/run"
3. Sort table by "Cost" (ascending)
4. Compare top 3 cheapest models
5. Click on Llama-3.1-8B run to see detailed results
6. Review success rate (93.4%) and test case breakdowns
7. Decision: Use Llama-3.1-8B for cost-sensitive workloads
```
#### Tips
- **Refresh regularly**: Click "Load Leaderboard" to see new evaluation results
- **Compare models**: Use the sort function to compare across different metrics
- **Trust the AI**: The insights panel bases its strategic recommendations on the complete leaderboard data
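If you prefer to analyze the leaderboard outside the UI, you can load the underlying dataset directly. A minimal sketch using the `datasets` library; the repository name is the public leaderboard dataset referenced in Troubleshooting, while the split and column names are assumptions to verify against the actual schema:

```python
from datasets import load_dataset

# Public leaderboard dataset (repo linked in the Troubleshooting section)
runs = load_dataset("kshitijthakkar/smoltrace-leaderboard", split="train")  # split name assumed
df = runs.to_pandas()

# Inspect the real schema first -- the column names below are assumptions
print(df.columns.tolist())

# Example: the three cheapest runs, assuming 'model', 'success_rate', and 'cost' columns
print(df.sort_values("cost").head(3)[["model", "success_rate", "cost"]])
```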
---
### 🤖 Agent Chat
**Purpose**: Interactive autonomous agent that can answer questions about evaluations using MCP tools.
**🎯 Track 2 Feature**: This demonstrates MCP client usage with the smolagents framework.
#### Features
**Autonomous Agent**:
- Built with `smolagents` framework
- Has access to all TraceMind MCP Server tools
- Plans and executes multi-step actions
- Provides detailed, data-driven answers
**MCP Tools Available to Agent**:
- `analyze_leaderboard` - Get AI insights about top performers
- `estimate_cost` - Calculate evaluation costs before running
- `debug_trace` - Analyze execution traces
- `compare_runs` - Compare two evaluation runs
- `get_top_performers` - Fetch top N models efficiently
- `get_leaderboard_summary` - Get high-level statistics
- `get_dataset` - Load SMOLTRACE datasets
- `analyze_results` - Analyze detailed test results
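For orientation, here is a rough sketch of how an agent could be wired to these MCP tools with smolagents. The server URL, transport, and model id below are placeholders, not the deployed TraceMind configuration:

```python
from smolagents import CodeAgent, LiteLLMModel, ToolCollection

# Placeholder endpoint and model id -- substitute the real MCP server URL and your own keys
MCP_SERVER = {"url": "https://example-mcp-server.hf.space/sse"}
model = LiteLLMModel(model_id="gemini/gemini-2.5-flash")

# ToolCollection.from_mcp exposes the server's tools (analyze_leaderboard,
# estimate_cost, debug_trace, ...) to the agent as ordinary smolagents tools
with ToolCollection.from_mcp(MCP_SERVER, trust_remote_code=True) as mcp_tools:
    agent = CodeAgent(tools=[*mcp_tools.tools], model=model)
    print(agent.run("What are the top 3 performing models and how much do they cost?"))
```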
**Agent Reasoning Visibility**:
- Toggle **"Show Agent Reasoning"** to see:
- Planning steps
- Tool execution logs
- Intermediate results
- Final synthesis
**Quick Action Buttons**:
- **"Quick: Top Models"**: Get top 5 models with costs
- **"Quick: Cost Estimate"**: Estimate cost for a model
- **"Quick: Load Leaderboard"**: Fetch leaderboard summary
#### How to Use
1. **Start a Conversation**:
```
Type your question in the chat box
Example: "What are the top 3 performing models and how much do they cost?"
Click "Send"
→ Agent plans approach
→ Executes MCP tools
→ Returns synthesized answer
```
2. **Watch Agent Work** (optional):
```
Enable "Show Agent Reasoning" checkbox
→ See step-by-step execution:
Step 1: Calling get_top_performers with metric=success_rate, top_n=3
Step 2: Analyzing costs from results
Step 3: Formatting response with recommendations
```
3. **Ask Follow-Up Questions**:
```
You: "Which one is most cost-effective?"
Agent: Uses previous context + MCP tools to answer
```
4. **Use Quick Actions**:
```
Click "Quick: Top Models" button
→ Agent automatically executes query
→ Returns top 5 models with metrics
```
#### Example Conversations
**Example 1: Cost Comparison**
```
User: "Compare the costs of the top 3 models. Which one offers the best value?"
Agent:
Let me analyze the top 3 models by success rate:
[Calls: get_top_performers(metric="success_rate", top_n=3)]
Top 3 Models:
1. GPT-4: 95.8% success, $0.05/run
2. Claude-3: 94.1% success, $0.04/run
3. Llama-3.1-8B: 93.4% success, $0.002/run
[Calls: analyze_leaderboard(metric_focus="cost")]
Cost Analysis:
- GPT-4: Highest accuracy but 25x more expensive than Llama
- Claude-3: Good middle ground at $0.04/run
- Llama-3.1: Best value at $0.002/run with only 2.4% accuracy drop
Recommendation for "best value":
→ Llama-3.1-8B offers 93.4% accuracy at $0.002/run
→ For 1M runs: Saves $48,000/month vs GPT-4
→ Only 2.4% accuracy trade-off
```
**Example 2: Evaluation Planning**
```
User: "I want to evaluate DeepSeek-V3 with 100 tests. How much will it cost and which hardware should I use?"
Agent:
Let me estimate costs for DeepSeek-V3:
[Calls: estimate_cost(model="deepseek-ai/DeepSeek-V3", num_tests=100, hardware="auto")]
Cost Estimate:
- Model: deepseek-ai/DeepSeek-V3
- Tests: 100
- Recommended hardware: GPU A100 (70B+ parameters)
Breakdown:
- LLM inference cost: ~$0.01/test (self-hosted, no API fees)
- HF Jobs GPU cost: ~$3.00/hr for A100
- Estimated duration: 25 minutes
- Total compute cost: $1.25
- Total cost: $1.25 (100 tests)
Recommendation:
Use HuggingFace Jobs with A100 GPU. Total cost: $1.25 for 100 tests.
Alternative: Modal H200 for 2x faster inference at $2.50.
```
#### Tips
- **Be specific**: Ask clear, focused questions for better answers
- **Use context**: Agent remembers conversation history
- **Watch reasoning**: Enable to understand how agent uses MCP tools
- **Try quick actions**: Fast way to get common information
---
### 🚀 New Evaluation
**Purpose**: Submit evaluation jobs to HuggingFace Jobs or Modal for running SMOLTRACE evaluations.
**⚠️ Requires**: HuggingFace Pro account ($9/month) with credit card, or Modal account.
#### Features
**Model Selection**:
- Enter any model name (format: `provider/model-name`)
- Examples: `openai/gpt-4`, `meta-llama/Llama-3.1-8B`, `deepseek-ai/DeepSeek-V3`
- Auto-detects whether the model is an API model or a local (self-hosted) model
**Infrastructure Choice**:
- **HuggingFace Jobs**: Managed compute (H200, A100, A10, T4, CPU)
- **Modal**: Serverless GPU compute (pay-per-second)
**Hardware Selection**:
- **Auto** (recommended): Automatically selects optimal hardware based on model size
- **Manual**: Choose specific GPU tier (A10, A100, H200) or CPU
**Cost Estimation**:
- Click **"💰 Estimate Cost"** before submitting
- Shows predicted:
- LLM API costs (for API models)
- Compute costs (for local models)
- Duration estimate
- CO2 emissions
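The compute portion of the estimate is essentially hourly rate × expected duration. A small sketch of that arithmetic using the A100 figures quoted elsewhere in this guide ($3.00/hr, ~25 minutes); the real estimator also folds in API token pricing and CO2:

```python
# Illustrative compute-cost arithmetic only (not the actual estimator)
hourly_rate_usd = 3.00      # ~A100 rate on HF Jobs (see Hardware Selection Guide)
expected_minutes = 25       # predicted evaluation duration
compute_cost = hourly_rate_usd * (expected_minutes / 60)
print(f"Estimated compute cost: ${compute_cost:.2f}")  # -> $1.25
```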
**Agent Type**:
- **tool**: Test tool-calling capabilities
- **code**: Test code generation capabilities
- **both**: Test both (recommended)
#### How to Use
**Step 1: Configure Prerequisites** (One-time setup)
For **HuggingFace Jobs**:
```
1. Sign up for HF Pro: https://huggingface.co/pricing ($9/month)
2. Add credit card for compute charges
3. Create HF token with "Read + Write + Run Jobs" permissions
4. Go to Settings tab → Enter HF token → Save
```
For **Modal** (Alternative):
```
1. Sign up: https://modal.com (free tier available)
2. Generate API token: https://modal.com/settings/tokens
3. Go to Settings tab → Enter MODAL_TOKEN_ID + MODAL_TOKEN_SECRET → Save
```
For **API Models** (OpenAI, Anthropic, etc.):
```
1. Get API key from provider (e.g., https://platform.openai.com/api-keys)
2. Go to Settings tab → Enter provider API key → Save
```
**Step 2: Create Evaluation**
```
1. Enter model name:
Example: "meta-llama/Llama-3.1-8B"
2. Select infrastructure:
- HuggingFace Jobs (default)
- Modal (alternative)
3. Choose agent type:
- "both" (recommended)
4. Select hardware:
- "auto" (recommended - smart selection)
- Or choose manually: cpu-basic, t4-small, a10g-small, a100-large, h200
5. Set timeout (optional):
- Default: 3600s (1 hour)
- Range: 300s - 7200s
6. Click "💰 Estimate Cost":
→ Shows predicted cost and duration
→ Example: "$2.00, 20 minutes, 0.5g CO2"
7. Review estimate, then click "Submit Evaluation"
```
**Step 3: Monitor Job**
```
After submission:
→ Job ID displayed
→ Go to "📈 Job Monitoring" tab to track progress
→ Or visit HuggingFace Jobs dashboard: https://huggingface.co/jobs
```
**Step 4: View Results**
```
When job completes:
→ Results automatically uploaded to HuggingFace datasets
→ Appears in Leaderboard within 1-2 minutes
→ Click on your run to see detailed results
```
#### Hardware Selection Guide
**For API Models** (OpenAI, Anthropic, Google):
- Use: `cpu-basic` (HF Jobs) or CPU (Modal)
- Cost: ~$0.05/hr (HF), ~$0.0001/sec (Modal)
- Why: No GPU needed for API calls
**For Small Models** (4B-8B parameters):
- Use: `t4-small` (HF) or A10G (Modal)
- Cost: ~$0.60/hr (HF), ~$0.0006/sec (Modal)
- Examples: Llama-3.1-8B, Mistral-7B
**For Medium Models** (7B-13B parameters):
- Use: `a10g-small` (HF) or A10G (Modal)
- Cost: ~$1.10/hr (HF), ~$0.0006/sec (Modal)
- Examples: Qwen2.5-14B, Mixtral-8x7B
**For Large Models** (70B+ parameters):
- Use: `a100-large` (HF) or A100-80GB (Modal)
- Cost: ~$3.00/hr (HF), ~$0.0030/sec (Modal)
- Examples: Llama-3.1-70B, DeepSeek-V3
**For Fastest Inference**:
- Use: `h200` (HF or Modal)
- Cost: ~$5.00/hr (HF), ~$0.0050/sec (Modal)
- Best for: Time-sensitive evaluations, large batches
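The "auto" option applies this kind of mapping for you. A rough illustration of the idea only; the actual selection logic lives in TraceMind/SMOLTRACE and may differ:

```python
def pick_hardware(param_count_b: float | None, is_api_model: bool) -> str:
    """Illustrative heuristic mirroring the guide above -- not the real 'auto' logic."""
    if is_api_model or param_count_b is None:
        return "cpu-basic"     # API calls need no GPU
    if param_count_b <= 8:
        return "t4-small"      # e.g. Llama-3.1-8B, Mistral-7B
    if param_count_b <= 14:
        return "a10g-small"    # e.g. Qwen2.5-14B
    return "a100-large"        # 70B+ models; choose "h200" when speed matters most

print(pick_hardware(8, is_api_model=False))    # t4-small
print(pick_hardware(None, is_api_model=True))  # cpu-basic
```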
#### Example Workflows
**Workflow 1: Evaluate API Model (OpenAI GPT-4)**
```
1. Model: "openai/gpt-4"
2. Infrastructure: HuggingFace Jobs
3. Agent type: both
4. Hardware: auto (selects cpu-basic)
5. Estimate: $50.00 (mostly API costs), 45 min
6. Submit → Monitor → View in leaderboard
```
**Workflow 2: Evaluate Local Model (Llama-3.1-8B)**
```
1. Model: "meta-llama/Llama-3.1-8B"
2. Infrastructure: Modal (for pay-per-second billing)
3. Agent type: both
4. Hardware: auto (selects A10G)
5. Estimate: $0.20, 15 min
6. Submit → Monitor → View in leaderboard
```
#### Tips
- **Always estimate first**: Prevents surprise costs
- **Use "auto" hardware**: Smart selection based on model size
- **Start small**: Test with 10-20 tests before scaling to 100+
- **Monitor jobs**: Check Job Monitoring tab for status
- **Modal for experimentation**: Pay-per-second is cost-effective for testing
---
### 📈 Job Monitoring
**Purpose**: Track status of submitted evaluation jobs.
#### Features
**Job Status Display**:
- Job ID
- Current status (pending, running, completed, failed)
- Start time
- Duration
- Infrastructure (HF Jobs or Modal)
**Real-time Updates**:
- Auto-refreshes every 30 seconds
- Manual refresh button
**Job Actions**:
- View logs
- Cancel job (if still running)
- View results (if completed)
#### How to Use
```
1. Go to "📈 Job Monitoring" tab
2. See list of your submitted jobs
3. Click "Refresh" for latest status
4. When status = "completed":
→ Click "View Results"
→ Opens leaderboard filtered to your run
```
#### Job Statuses
- **Pending**: Job queued, waiting for resources
- **Running**: Evaluation in progress
- **Completed**: Evaluation finished successfully
- **Failed**: Evaluation encountered an error
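If you script around the UI, monitoring reduces to polling until one of the terminal statuses above is reached. A minimal sketch; `get_job_status` is a hypothetical helper standing in for whichever HF Jobs or Modal API/CLI call you use to look up status:

```python
import time

TERMINAL_STATUSES = {"completed", "failed"}

def wait_for_job(job_id: str, poll_seconds: int = 30) -> str:
    """Poll a job until it finishes; mirrors the UI's 30-second auto-refresh."""
    while True:
        status = get_job_status(job_id)  # hypothetical helper -- replace with a real lookup
        print(f"{job_id}: {status}")
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
```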
#### Tips
- **Check logs** if job fails: Helps diagnose issues
- **Expected duration**:
- API models: 2-5 minutes
- Local models: 15-30 minutes (includes model download)
---
### 🔍 Trace Visualization
**Purpose**: Deep-dive into OpenTelemetry traces to understand agent execution.
**Access**: Click on any test case in a run's detail view
#### Features
**Waterfall Diagram**:
- Visual timeline of execution
- Spans show: LLM calls, tool executions, reasoning steps
- Duration bars (wider = slower)
- Parent-child relationships
**Span Details**:
- Span name (e.g., "LLM Call - Reasoning", "Tool Call - get_weather")
- Start/end times
- Duration
- Attributes (model, tokens, cost, tool inputs/outputs)
- Status (OK, ERROR)
**GPU Metrics Overlay** (for GPU jobs only):
- GPU utilization %
- Memory usage
- Temperature
- CO2 emissions
**MCP-Powered Q&A**:
- Ask questions about the trace
- Example: "Why was tool X called twice?"
- Agent uses `debug_trace` MCP tool to analyze
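The spans shown here follow the OpenTelemetry model. For context, this is roughly what produces such a trace on the agent side; a minimal sketch with the OpenTelemetry Python SDK and a console exporter, with illustrative span names and attribute keys:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Print spans to stdout for illustration; SMOLTRACE exports them to a traces dataset instead
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
tracer = trace.get_tracer("agent-demo")

with tracer.start_as_current_span("Agent Execution"):                 # parent span
    with tracer.start_as_current_span("LLM Call - Reasoning") as s:   # child span
        s.set_attribute("llm.model", "meta-llama/Llama-3.1-8B")
        s.set_attribute("llm.tokens.total", 512)
    with tracer.start_as_current_span("Tool Call - get_weather") as s:
        s.set_attribute("tool.input", '{"city": "Tokyo"}')
```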
#### How to Use
```
1. From leaderboard → Click a run → Click a test case
2. View waterfall diagram:
→ Spans arranged chronologically
→ Parent spans (e.g., "Agent Execution")
→ Child spans (e.g., "LLM Call", "Tool Call")
3. Click any span:
→ See detailed attributes
→ Token counts, costs, inputs/outputs
4. Ask questions (MCP-powered):
User: "Why did this test fail?"
→ Agent analyzes trace with debug_trace tool
→ Returns explanation with span references
5. Check GPU metrics (if available):
→ Graph shows utilization over time
→ Overlaid on execution timeline
```
#### Example Analysis
**Scenario: Understanding a slow execution**
```
1. Open trace for test_045 (duration: 8.5s)
2. Waterfall shows:
- Span 1: LLM Call - Reasoning (1.2s) ✓
- Span 2: Tool Call - search_web (6.5s) ⚠️ SLOW
- Span 3: LLM Call - Final Response (0.8s) ✓
3. Click Span 2 (search_web):
- Input: {"query": "weather in Tokyo"}
- Output: 5 results
- Duration: 6.5s (6x slower than typical)
4. Ask agent: "Why was the search_web call so slow?"
→ Agent analysis:
"The search_web call took 6.5s due to network latency.
Span attributes show API response time: 6.2s.
This is an external dependency issue, not agent code.
Recommendation: Implement timeout (5s) and fallback strategy."
```
#### Tips
- **Look for patterns**: Similar failures often have common spans
- **Use MCP Q&A**: Faster than manual trace analysis
- **Check GPU metrics**: Identify resource bottlenecks
- **Compare successful vs failed traces**: Spot differences
---
### 🔬 Synthetic Data Generator
**Purpose**: Generate custom synthetic test datasets for agent evaluation using AI, complete with domain-specific tasks and prompt templates.
#### Features
**AI-Powered Dataset Generation**:
- Generate 5-100 synthetic tasks using Google Gemini 2.5 Flash
- Customizable domain, tools, difficulty, and agent type
- Automatic batching for large datasets (parallel generation)
- SMOLTRACE-format output ready for evaluation
**Prompt Template Generation**:
- Customized YAML templates based on smolagents format
- Optimized for your specific domain and tools
- Included automatically in dataset card
**Push to HuggingFace Hub**:
- One-click upload to HuggingFace Hub
- Public or private repositories
- Auto-generated README with usage instructions
- Ready to use with SMOLTRACE evaluations
#### How to Use
**Step 1: Configure & Generate Dataset**
1. Navigate to **🔬 Synthetic Data Generator** tab
2. Configure generation parameters:
- **Domain**: Topic/industry (e.g., "travel", "finance", "healthcare", "customer_support")
- **Tools**: Comma-separated list of tool names (e.g., "get_weather,search_flights,book_hotel")
- **Number of Tasks**: 5-100 tasks (slider)
- **Difficulty Level**:
- `balanced` (40% easy, 40% medium, 20% hard)
- `easy_only` (100% easy tasks)
- `medium_only` (100% medium tasks)
- `hard_only` (100% hard tasks)
- `progressive` (50% easy, 30% medium, 20% hard)
- **Agent Type**:
- `tool` (ToolCallingAgent only)
- `code` (CodeAgent only)
- `both` (50/50 mix)
3. Click **"🎲 Generate Synthetic Dataset"**
4. Wait for generation (30-120s depending on size):
- Shows progress message
- Automatic batching for >20 tasks
- Parallel API calls for faster generation
**Step 2: Review Generated Content**
1. **Dataset Preview Tab**:
- View all generated tasks in JSON format
- Check task IDs, prompts, expected tools, difficulty
- See dataset statistics:
- Total tasks
- Difficulty distribution
- Agent type distribution
- Tools coverage
2. **Prompt Template Tab**:
- View customized YAML prompt template
- Based on smolagents templates
- Adapted for your domain and tools
- Ready to use with ToolCallingAgent or CodeAgent
**Step 3: Push to HuggingFace Hub** (Optional)
1. Enter **Repository Name**:
- Format: `username/smoltrace-{domain}-tasks`
- Example: `alice/smoltrace-finance-tasks`
- Auto-filled with your HF username after generation
2. Set **Visibility**:
- ☐ Private Repository (unchecked = public)
- ☑ Private Repository (checked = private)
3. Provide **HuggingFace Token** (optional):
- Leave empty to use environment token (HF_TOKEN from Settings)
- Or paste token from https://huggingface.co/settings/tokens
- Requires write permissions
4. Click **"📤 Push to HuggingFace Hub"**
5. Wait for upload (5-30s):
- Creates dataset repository
- Uploads tasks
- Generates README with:
- Usage instructions
- Prompt template
- SMOLTRACE integration code
- Returns dataset URL
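The push step wraps a standard Hub dataset upload. If you ever need to do it yourself, for example after editing the generated tasks locally, a minimal sketch with the `datasets` library; the file path and repository name are placeholders:

```python
import json
from datasets import Dataset

# Tasks saved from the Dataset Preview tab (placeholder path)
with open("generated_tasks.json") as f:
    tasks = json.load(f)

ds = Dataset.from_list(tasks)
ds.push_to_hub(
    "yourname/smoltrace-finance-tasks",  # placeholder repo id
    private=False,                       # mirrors the "Private Repository" checkbox
    # token="hf_...",                    # omit to fall back to the HF_TOKEN in your environment
)
```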
#### Example Workflow
```
Scenario: Create finance evaluation dataset with 20 tasks
1. Configure:
Domain: "finance"
Tools: "get_stock_price,calculate_roi,get_market_news,send_alert"
Number of Tasks: 20
Difficulty: "balanced"
Agent Type: "both"
2. Click "Generate"
→ AI generates 20 tasks:
- 8 easy (single tool, straightforward)
- 8 medium (multiple tools or complex logic)
- 4 hard (complex reasoning, edge cases)
- 10 for ToolCallingAgent
- 10 for CodeAgent
→ Also generates customized prompt template
3. Review Dataset Preview:
Task 1:
{
"id": "finance_stock_price_1",
"prompt": "What is the current price of AAPL stock?",
"expected_tool": "get_stock_price",
"difficulty": "easy",
"agent_type": "tool",
"expected_keywords": ["AAPL", "price", "$"]
}
Task 15:
{
"id": "finance_complex_analysis_15",
"prompt": "Calculate the ROI for investing $10,000 in AAPL last year and send an alert if ROI > 15%",
"expected_tool": "calculate_roi",
"expected_tool_calls": 2,
"difficulty": "hard",
"agent_type": "code",
"expected_keywords": ["ROI", "15%", "alert"]
}
4. Review Prompt Template:
See customized YAML with:
- Finance-specific system prompt
- Tool descriptions for get_stock_price, calculate_roi, etc.
- Response format guidelines
5. Push to Hub:
Repository: "yourname/smoltrace-finance-tasks"
Private: No (public)
Token: (empty, using environment token)
→ Uploads to https://huggingface.co/datasets/yourname/smoltrace-finance-tasks
→ README includes usage instructions and prompt template
6. Use in evaluation:
# Load your custom dataset (Python)
from datasets import load_dataset
dataset = load_dataset("yourname/smoltrace-finance-tasks")
# Run SMOLTRACE evaluation
smoltrace-eval --model openai/gpt-4 \
--dataset-name yourname/smoltrace-finance-tasks \
--agent-type both
```
#### Configuration Reference
**Difficulty Levels Explained**:
| Level | Characteristics | Example |
|-------|----------------|---------|
| **Easy** | Single tool call, straightforward input, clear expected output | "What's the weather in Tokyo?" → get_weather("Tokyo") |
| **Medium** | Multiple tool calls OR complex input parsing OR conditional logic | "Compare weather in Tokyo and London" → get_weather("Tokyo"), get_weather("London"), compare |
| **Hard** | Multiple tools, complex reasoning, edge cases, error handling | "Plan a trip with best weather, book flights if under $500, alert if unavailable" |
**Agent Types Explained**:
| Type | Description | Use Case |
|------|-------------|----------|
| **tool** | ToolCallingAgent - Declarative tool calling with structured outputs | API-based models that support function calling (GPT-4, Claude) |
| **code** | CodeAgent - Writes Python code to use tools programmatically | Models that excel at code generation (Qwen-Coder, DeepSeek-Coder) |
| **both** | 50/50 mix of tool and code agent tasks | Comprehensive evaluation across agent types |
#### Best Practices
**Domain Selection**:
- Be specific: "customer_support_saas" > "support"
- Match your use case: Use actual business domain
- Consider tools available: Domain should align with tools
**Tool Names**:
- Use descriptive names: "get_stock_price" > "fetch"
- Match actual tool implementations
- 3-8 tools is ideal (enough variety, not overwhelming)
- Include mix of data retrieval and action tools
**Number of Tasks**:
- 5-10 tasks: Quick testing, proof of concept
- 20-30 tasks: Solid evaluation dataset
- 50-100 tasks: Comprehensive benchmark
**Difficulty Distribution**:
- `balanced`: Best for general evaluation
- `progressive`: Good for learning/debugging
- `easy_only`: Quick sanity checks
- `hard_only`: Stress testing advanced capabilities
**Quality Assurance**:
- Always review generated tasks before pushing
- Check for domain relevance and variety
- Verify expected tools match your actual tools
- Ensure prompts are clear and executable
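A quick programmatic check can complement the manual review. A sketch assuming the task fields shown in the example workflow above (`id`, `prompt`, `expected_tool`, `difficulty`, `agent_type`) and a placeholder file path:

```python
import json

MY_TOOLS = {"get_stock_price", "calculate_roi", "get_market_news", "send_alert"}
REQUIRED_FIELDS = {"id", "prompt", "expected_tool", "difficulty", "agent_type"}

with open("generated_tasks.json") as f:  # placeholder path for the exported tasks
    tasks = json.load(f)

for task in tasks:
    missing = REQUIRED_FIELDS - task.keys()
    if missing:
        print(f"{task.get('id', '?')}: missing fields {sorted(missing)}")
    if task.get("expected_tool") not in MY_TOOLS:
        print(f"{task.get('id', '?')}: expected_tool {task.get('expected_tool')!r} is not in your tool set")
```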
#### Troubleshooting
**Generation fails with "Invalid API key"**:
- Go to **⚙️ Settings**
- Configure Gemini API Key
- Get key from https://aistudio.google.com/apikey
**Generated tasks don't match domain**:
- Be more specific in domain description
- Try regenerating with adjusted parameters
- Review prompt template for domain alignment
**Push to Hub fails with "Authentication error"**:
- Verify HuggingFace token has write permissions
- Get token from https://huggingface.co/settings/tokens
- Check token in **⚙️ Settings** or provide directly
**Dataset generation is slow (>60s)**:
- Large requests (>20 tasks) are automatically batched
- Each batch takes 30-120s
- Example: 100 tasks = 5 batches × 60s = ~5 minutes
- This is normal for large datasets
**Tasks are too easy/hard**:
- Adjust difficulty distribution
- Regenerate with different settings
- Mix difficulty levels with `balanced` or `progressive`
#### Advanced Tips
**Iterative Refinement**:
1. Generate 10 tasks with `balanced` difficulty
2. Review quality and variety
3. If satisfied, generate 50-100 tasks with same settings
4. If not, adjust domain/tools and regenerate
**Dataset Versioning**:
- Use version suffixes: `username/smoltrace-finance-tasks-v2`
- Iterate on datasets as tools evolve
- Keep track of which version was used for evaluations
**Combining Datasets**:
- Generate multiple small datasets for different domains
- Use SMOLTRACE CLI to merge datasets
- Create comprehensive multi-domain benchmarks
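If you prefer to merge in Python rather than through the SMOLTRACE CLI, the `datasets` library's `concatenate_datasets` covers the same need; the repository names below are placeholders:

```python
from datasets import concatenate_datasets, load_dataset

finance = load_dataset("yourname/smoltrace-finance-tasks", split="train")
travel = load_dataset("yourname/smoltrace-travel-tasks", split="train")

combined = concatenate_datasets([finance, travel])
combined.push_to_hub("yourname/smoltrace-multi-domain-tasks")  # placeholder repo id
```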
**Custom Prompt Templates**:
- Generate prompt template separately
- Customize further based on your needs
- Use in agent initialization before evaluation
- Include in dataset card for reproducibility
---
### ⚙️ Settings
**Purpose**: Configure API keys, preferences, and authentication.
#### Features
**API Key Configuration**:
- Gemini API Key (for MCP server AI analysis)
- HuggingFace Token (for dataset access + job submission)
- Modal Token ID + Secret (for Modal job submission)
- LLM Provider Keys (OpenAI, Anthropic, etc.)
**Preferences**:
- Default infrastructure (HF Jobs vs Modal)
- Default hardware tier
- Auto-refresh intervals
**Security**:
- Keys stored in browser session only (not server)
- HTTPS encryption for all API calls
- Keys never logged or exposed
#### How to Use
**Configure Essential Keys**:
```
1. Go to "⚙️ Settings" tab
2. Enter Gemini API Key:
- Get from: https://ai.google.dev/
- Click "Get API Key" → Create project → Generate
- Paste into field
- Free tier: 1,500 requests/day
3. Enter HuggingFace Token:
- Get from: https://huggingface.co/settings/tokens
- Click "New token" → Name: "TraceMind"
- Permissions:
- Read (for viewing datasets)
- Write (for uploading results)
- Run Jobs (for evaluation submission)
- Paste into field
4. Click "Save API Keys"
→ Keys stored in browser session
→ MCP server will use your keys
```
**Configure for Job Submission** (Optional):
For **HuggingFace Jobs**:
```
Already configured if you entered HF token above with "Run Jobs" permission.
```
For **Modal** (Alternative):
```
1. Sign up: https://modal.com
2. Get token: https://modal.com/settings/tokens
3. Copy MODAL_TOKEN_ID (starts with 'ak-')
4. Copy MODAL_TOKEN_SECRET (starts with 'as-')
5. Paste both into Settings → Save
```
For **API Model Providers**:
```
1. Get API key from provider:
- OpenAI: https://platform.openai.com/api-keys
- Anthropic: https://console.anthropic.com/settings/keys
- Google: https://ai.google.dev/
2. Paste into corresponding field in Settings
3. Click "Save LLM Provider Keys"
```
#### Security Best Practices
- **Use environment variables**: For production, set keys via HF Spaces secrets
- **Rotate keys regularly**: Generate new tokens every 3-6 months
- **Minimal permissions**: Only grant "Run Jobs" if you need to submit evaluations
- **Monitor usage**: Check API provider dashboards for unexpected charges
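When keys come from HF Spaces secrets rather than the Settings tab, the application reads them from environment variables. A minimal sketch; `HF_TOKEN`, `MODAL_TOKEN_ID`, and `MODAL_TOKEN_SECRET` are the names used elsewhere in this guide, while `GEMINI_API_KEY` is an assumed name:

```python
import os

# Prefer secrets/environment values; missing keys stay None so failures are explicit
hf_token = os.environ.get("HF_TOKEN")
gemini_key = os.environ.get("GEMINI_API_KEY")  # assumed variable name
modal_id = os.environ.get("MODAL_TOKEN_ID")
modal_secret = os.environ.get("MODAL_TOKEN_SECRET")

if not hf_token:
    raise RuntimeError("HF_TOKEN is not set -- configure it as a Space secret")
```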
---
## Common Workflows
### Workflow 1: Quick Model Comparison
```
Goal: Compare GPT-4 vs Llama-3.1-8B for production use
Steps:
1. Go to Leaderboard → Load Leaderboard
2. Read AI insights: "GPT-4 leads accuracy, Llama-3.1 best cost"
3. Sort by Success Rate → Note: GPT-4 (95.8%), Llama (93.4%)
4. Sort by Cost → Note: GPT-4 ($0.05), Llama ($0.002)
5. Go to Agent Chat → Ask: "Compare GPT-4 and Llama-3.1. Which should I use for 1M runs/month?"
→ Agent analyzes with MCP tools
→ Returns: "Llama saves $48K/month, only 2.4% accuracy drop"
6. Decision: Use Llama-3.1-8B for production
```
### Workflow 2: Evaluate Custom Model
```
Goal: Evaluate your fine-tuned model on SMOLTRACE benchmark
Steps:
1. Ensure model is on HuggingFace: username/my-finetuned-model
2. Go to Settings → Configure HF token (with Run Jobs permission)
3. Go to New Evaluation:
- Model: "username/my-finetuned-model"
- Infrastructure: HuggingFace Jobs
- Agent type: both
- Hardware: auto
4. Click "Estimate Cost" → Review: $1.50, 20 min
5. Click "Submit Evaluation"
6. Go to Job Monitoring → Wait for "Completed" (15-25 min)
7. Go to Leaderboard → Refresh → See your model in table
8. Click your run → Review detailed results
9. Compare vs other models using Agent Chat
```
### Workflow 3: Debug Failed Test
```
Goal: Understand why test_045 failed in your evaluation
Steps:
1. Go to Leaderboard → Find your run → Click to open details
2. Filter to failed tests only
3. Click test_045 → Opens trace visualization
4. Examine waterfall:
- Span 1: LLM Call (OK)
- Span 2: Tool Call - "unknown_tool" (ERROR)
- No Span 3 (execution stopped)
5. Ask Agent: "Why did test_045 fail?"
→ Agent uses debug_trace MCP tool
→ Returns: "Tool 'unknown_tool' not found. Add to agent's tool list."
6. Fix: Update agent config to include missing tool
7. Re-run evaluation with fixed config
```
---
## Troubleshooting
### Leaderboard Issues
**Problem**: "Load Leaderboard" button doesn't work
- **Solution**: Check HuggingFace token in Settings (needs Read permission)
- **Solution**: Verify leaderboard dataset exists: https://huggingface.co/datasets/kshitijthakkar/smoltrace-leaderboard
**Problem**: AI insights not showing
- **Solution**: Check Gemini API key in Settings
- **Solution**: Wait 5-10 seconds for AI generation to complete
### Agent Chat Issues
**Problem**: Agent responds with "MCP server connection failed"
- **Solution**: Check MCP server status: https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind-mcp-server
- **Solution**: Configure Gemini API key in both TraceMind-AI and MCP server Settings
**Problem**: Agent gives incorrect information
- **Solution**: Agent may be using stale data. Ask: "Load the latest leaderboard data"
- **Solution**: Verify question is clear and specific
### Evaluation Submission Issues
**Problem**: "Submit Evaluation" fails with auth error
- **Solution**: HF token needs "Run Jobs" permission
- **Solution**: Ensure HF Pro account is active ($9/month)
- **Solution**: Verify credit card is on file for compute charges
**Problem**: Job stuck in "Pending" status
- **Solution**: HuggingFace Jobs may have queue. Wait 5-10 minutes.
- **Solution**: Try Modal as alternative infrastructure
**Problem**: Job fails with "Out of Memory"
- **Solution**: Model too large for selected hardware
- **Solution**: Increase hardware tier (e.g., t4-small → a10g-small)
- **Solution**: Use auto hardware selection
### Trace Visualization Issues
**Problem**: Traces not loading
- **Solution**: Ensure evaluation completed successfully
- **Solution**: Check traces dataset exists on HuggingFace
- **Solution**: Verify HF token has Read permission
**Problem**: GPU metrics missing
- **Solution**: Only available for GPU jobs (not API models)
- **Solution**: Ensure evaluation was run with SMOLTRACE's GPU metrics enabled
---
## Getting Help
- **📧 GitHub Issues**: [TraceMind-AI/issues](https://github.com/Mandark-droid/TraceMind-AI/issues)
- **💬 HF Discord**: `#agents-mcp-hackathon-winter25`
- **📖 Documentation**: See [MCP_INTEGRATION.md](MCP_INTEGRATION.md) and [ARCHITECTURE.md](ARCHITECTURE.md)
---
**Last Updated**: November 21, 2025