# TraceMind-AI - Complete User Guide
This guide provides a comprehensive walkthrough of all features and screens in TraceMind-AI.
## Table of Contents
- [Getting Started](#getting-started)
- [Screen-by-Screen Guide](#screen-by-screen-guide)
  - [🏆 Leaderboard](#-leaderboard)
  - [🤖 Agent Chat](#-agent-chat)
  - [🚀 New Evaluation](#-new-evaluation)
  - [📊 Job Monitoring](#-job-monitoring)
  - [🔍 Trace Visualization](#-trace-visualization)
  - [🔬 Synthetic Data Generator](#-synthetic-data-generator)
  - [⚙️ Settings](#️-settings)
- [Common Workflows](#common-workflows)
- [Troubleshooting](#troubleshooting)
---
## Getting Started
### First-Time Setup
1. **Visit** https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind
2. **Sign in** with your HuggingFace account (required for viewing)
3. **Configure API keys** (optional but recommended):
- Go to **⚙️ Settings** tab
- Enter Gemini API Key and HuggingFace Token
- Click **"Save API Keys"**
### Navigation
TraceMind-AI is organized into tabs:
- **🏆 Leaderboard**: View evaluation results with AI insights
- **🤖 Agent Chat**: Interactive autonomous agent powered by MCP tools
- **🚀 New Evaluation**: Submit evaluation jobs to HF Jobs or Modal
- **📊 Job Monitoring**: Track status of submitted jobs
- **🔍 Trace Visualization**: Deep-dive into agent execution traces
- **🔬 Synthetic Data Generator**: Create custom test datasets with AI
- **⚙️ Settings**: Configure API keys and preferences
---
## Screen-by-Screen Guide
### 🏆 Leaderboard
**Purpose**: Browse all evaluation runs with AI-powered insights and detailed analysis.
#### Features
**Main Table**:
- View all evaluation runs from the SMOLTRACE leaderboard
- Sortable columns: Model, Success Rate, Cost, Duration, CO2 emissions
- Click any row to see detailed test results
**AI Insights Panel** (Top of screen):
- Automatically generated insights from MCP server
- Powered by Google Gemini 2.5 Flash
- Updates when you click "Load Leaderboard"
- Shows top performers, trends, and recommendations
**Filter & Sort Options**:
- Filter by agent type (tool, code, both)
- Filter by provider (litellm, transformers)
- Sort by any metric (success rate, cost, duration)
#### How to Use
1. **Load Data**:
```
Click "Load Leaderboard" button
→ Fetches latest evaluation runs from HuggingFace
→ AI generates insights automatically
```
2. **Read AI Insights**:
- Located at top of screen
- Summary of evaluation trends
- Top performing models
- Cost/accuracy trade-offs
- Actionable recommendations
3. **Explore Runs**:
- Scroll through table
- Sort by clicking column headers
- Click on any run to see details
4. **View Details**:
```
Click a row in the table
→ Opens detail view with:
- All test cases (success/failure)
- Execution times
- Cost breakdown
- Link to trace visualization
```
#### Example Workflow
```
Scenario: Find the most cost-effective model for production
1. Click "Load Leaderboard"
2. Read AI insights: "Llama-3.1-8B offers best cost/performance at $0.002/run"
3. Sort table by "Cost" (ascending)
4. Compare top 3 cheapest models
5. Click on Llama-3.1-8B run to see detailed results
6. Review success rate (93.4%) and test case breakdowns
7. Decision: Use Llama-3.1-8B for cost-sensitive workloads
```
#### Tips
- **Refresh regularly**: Click "Load Leaderboard" to see new evaluation results
- **Compare models**: Use the sort function to compare across different metrics
- **Trust the AI**: The insights panel provides strategic recommendations based on all data
---
### 🤖 Agent Chat
**Purpose**: Interactive autonomous agent that can answer questions about evaluations using MCP tools.
**🎯 Track 2 Feature**: This demonstrates MCP client usage with the smolagents framework.
#### Features
**Autonomous Agent**:
- Built with `smolagents` framework
- Has access to all TraceMind MCP Server tools
- Plans and executes multi-step actions
- Provides detailed, data-driven answers
**MCP Tools Available to Agent**:
- `analyze_leaderboard` - Get AI insights about top performers
- `estimate_cost` - Calculate evaluation costs before running
- `debug_trace` - Analyze execution traces
- `compare_runs` - Compare two evaluation runs
- `get_top_performers` - Fetch top N models efficiently
- `get_leaderboard_summary` - Get high-level statistics
- `get_dataset` - Load SMOLTRACE datasets
- `analyze_results` - Analyze detailed test results
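For orientation, here is a minimal sketch of how an agent can be wired to these tools with smolagents' MCP support (this may require the `mcp` extra, e.g. `pip install "smolagents[mcp]"`). The server URL, transport, and model choice below are illustrative assumptions, not TraceMind's exact configuration:

```python
# Illustrative sketch only - the endpoint URL and model are assumptions, not TraceMind's exact setup.
from smolagents import ToolCallingAgent, LiteLLMModel, ToolCollection

mcp_server = {"url": "https://example-tracemind-mcp-server.huggingface.co/gradio_api/mcp/sse"}  # hypothetical URL
model = LiteLLMModel(model_id="gemini/gemini-2.5-flash")  # any LiteLLM-supported model works

# ToolCollection.from_mcp exposes each MCP tool (analyze_leaderboard, estimate_cost, ...)
# as a smolagents tool the agent can plan over.
with ToolCollection.from_mcp(mcp_server, trust_remote_code=True) as tool_collection:
    agent = ToolCallingAgent(tools=[*tool_collection.tools], model=model)
    print(agent.run("What are the top 3 models by success rate, and what do they cost?"))
```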
**Agent Reasoning Visibility**:
- Toggle **"Show Agent Reasoning"** to see:
- Planning steps
- Tool execution logs
- Intermediate results
- Final synthesis
**Quick Action Buttons**:
- **"Quick: Top Models"**: Get top 5 models with costs
- **"Quick: Cost Estimate"**: Estimate cost for a model
- **"Quick: Load Leaderboard"**: Fetch leaderboard summary
#### How to Use
1. **Start a Conversation**:
```
Type your question in the chat box
Example: "What are the top 3 performing models and how much do they cost?"
Click "Send"
→ Agent plans approach
→ Executes MCP tools
→ Returns synthesized answer
```
2. **Watch Agent Work** (optional):
```
Enable "Show Agent Reasoning" checkbox
→ See step-by-step execution:
Step 1: Calling get_top_performers with metric=success_rate, top_n=3
Step 2: Analyzing costs from results
Step 3: Formatting response with recommendations
```
3. **Ask Follow-Up Questions**:
```
You: "Which one is most cost-effective?"
Agent: Uses previous context + MCP tools to answer
```
4. **Use Quick Actions**:
```
Click "Quick: Top Models" button
→ Agent automatically executes query
→ Returns top 5 models with metrics
```
#### Example Conversations
**Example 1: Cost Comparison**
```
User: "Compare the costs of the top 3 models. Which one offers the best value?"
Agent:
Let me analyze the top 3 models by success rate:
[Calls: get_top_performers(metric="success_rate", top_n=3)]
Top 3 Models:
1. GPT-4: 95.8% success, $0.05/run
2. Claude-3: 94.1% success, $0.04/run
3. Llama-3.1-8B: 93.4% success, $0.002/run
[Calls: analyze_leaderboard(metric_focus="cost")]
Cost Analysis:
- GPT-4: Highest accuracy but 25x more expensive than Llama
- Claude-3: Good middle ground at $0.04/run
- Llama-3.1: Best value at $0.002/run with only 2.4% accuracy drop
Recommendation for "best value":
✓ Llama-3.1-8B offers 93.4% accuracy at $0.002/run
✓ For 1M runs: Saves $48,000/month vs GPT-4
✓ Only 2.4% accuracy trade-off
```
**Example 2: Evaluation Planning**
```
User: "I want to evaluate DeepSeek-V3 with 100 tests. How much will it cost and which hardware should I use?"
Agent:
Let me estimate costs for DeepSeek-V3:
[Calls: estimate_cost(model="deepseek-ai/DeepSeek-V3", num_tests=100, hardware="auto")]
Cost Estimate:
- Model: deepseek-ai/DeepSeek-V3
- Tests: 100
- Recommended hardware: GPU A100 (70B+ parameters)
Breakdown:
- LLM inference cost: ~$0.01/test (self-hosted, no API fees)
- HF Jobs GPU cost: ~$3.00/hr for A100
- Estimated duration: 25 minutes
- Total compute cost: $1.25
- Total cost: $1.25 (100 tests)
Recommendation:
Use HuggingFace Jobs with A100 GPU. Total cost: $1.25 for 100 tests.
Alternative: Modal H200 for 2x faster inference at $2.50.
```
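The arithmetic behind an estimate like this is simple enough to sanity-check yourself; the sketch below reproduces the compute figure from the example (the hourly rate and duration are the example's own assumptions):

```python
# Reproduces the compute estimate above; rate and duration are the example's assumptions.
a100_rate_per_hour = 3.00   # HF Jobs A100, approximate $/hr
duration_minutes = 25       # estimated wall-clock time for 100 tests

compute_cost = a100_rate_per_hour * duration_minutes / 60
print(f"Compute cost: ${compute_cost:.2f} (~${compute_cost / 100:.4f}/test)")  # -> $1.25, ~$0.0125/test
```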
#### Tips
- **Be specific**: Ask clear, focused questions for better answers
- **Use context**: Agent remembers conversation history
- **Watch reasoning**: Enable to understand how agent uses MCP tools
- **Try quick actions**: Fast way to get common information
---
### 🚀 New Evaluation
**Purpose**: Submit evaluation jobs to HuggingFace Jobs or Modal for running SMOLTRACE evaluations.
**⚠️ Requires**: HuggingFace Pro account ($9/month) with credit card, or Modal account.
#### Features
**Model Selection**:
- Enter any model name (format: `provider/model-name`)
- Examples: `openai/gpt-4`, `meta-llama/Llama-3.1-8B`, `deepseek-ai/DeepSeek-V3`
- Auto-detects if API model or local model
**Infrastructure Choice**:
- **HuggingFace Jobs**: Managed compute (H200, A100, A10, T4, CPU)
- **Modal**: Serverless GPU compute (pay-per-second)
**Hardware Selection**:
- **Auto** (recommended): Automatically selects optimal hardware based on model size
- **Manual**: Choose specific GPU tier (A10, A100, H200) or CPU
**Cost Estimation**:
- Click **"π° Estimate Cost"** before submitting
- Shows predicted:
- LLM API costs (for API models)
- Compute costs (for local models)
- Duration estimate
- CO2 emissions
**Agent Type**:
- **tool**: Test tool-calling capabilities
- **code**: Test code generation capabilities
- **both**: Test both (recommended)
#### How to Use
**Step 1: Configure Prerequisites** (One-time setup)
For **HuggingFace Jobs**:
```
1. Sign up for HF Pro: https://huggingface.co/pricing ($9/month)
2. Add credit card for compute charges
3. Create HF token with "Read + Write + Run Jobs" permissions
4. Go to Settings tab → Enter HF token → Save
```
For **Modal** (Alternative):
```
1. Sign up: https://modal.com (free tier available)
2. Generate API token: https://modal.com/settings/tokens
3. Go to Settings tab → Enter MODAL_TOKEN_ID + MODAL_TOKEN_SECRET → Save
```
For **API Models** (OpenAI, Anthropic, etc.):
```
1. Get API key from provider (e.g., https://platform.openai.com/api-keys)
2. Go to Settings tab → Enter provider API key → Save
```
**Step 2: Create Evaluation**
```
1. Enter model name:
Example: "meta-llama/Llama-3.1-8B"
2. Select infrastructure:
- HuggingFace Jobs (default)
- Modal (alternative)
3. Choose agent type:
- "both" (recommended)
4. Select hardware:
- "auto" (recommended - smart selection)
- Or choose manually: cpu-basic, t4-small, a10g-small, a100-large, h200
5. Set timeout (optional):
- Default: 3600s (1 hour)
- Range: 300s - 7200s
6. Click "π° Estimate Cost":
β Shows predicted cost and duration
β Example: "$2.00, 20 minutes, 0.5g CO2"
7. Review estimate, then click "Submit Evaluation"
```
**Step 3: Monitor Job**
```
After submission:
→ Job ID displayed
→ Go to "📊 Job Monitoring" tab to track progress
→ Or visit HuggingFace Jobs dashboard: https://huggingface.co/jobs
```
**Step 4: View Results**
```
When job completes:
→ Results automatically uploaded to HuggingFace datasets
→ Appears in Leaderboard within 1-2 minutes
→ Click on your run to see detailed results
```
#### Hardware Selection Guide
**For API Models** (OpenAI, Anthropic, Google):
- Use: `cpu-basic` (HF Jobs) or CPU (Modal)
- Cost: ~$0.05/hr (HF), ~$0.0001/sec (Modal)
- Why: No GPU needed for API calls
**For Small Models** (4B-8B parameters):
- Use: `t4-small` (HF) or A10G (Modal)
- Cost: ~$0.60/hr (HF), ~$0.0006/sec (Modal)
- Examples: Llama-3.1-8B, Mistral-7B
**For Medium Models** (7B-13B parameters):
- Use: `a10g-small` (HF) or A10G (Modal)
- Cost: ~$1.10/hr (HF), ~$0.0006/sec (Modal)
- Examples: Qwen2.5-14B, Mixtral-8x7B
**For Large Models** (70B+ parameters):
- Use: `a100-large` (HF) or A100-80GB (Modal)
- Cost: ~$3.00/hr (HF), ~$0.0030/sec (Modal)
- Examples: Llama-3.1-70B, DeepSeek-V3
**For Fastest Inference**:
- Use: `h200` (HF or Modal)
- Cost: ~$5.00/hr (HF), ~$0.0050/sec (Modal)
- Best for: Time-sensitive evaluations, large batches
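If you prefer to reason about hardware programmatically, a rough helper mirroring the guide above might look like this. The thresholds and $/hr rates are approximations taken from this guide, not the exact logic behind the "auto" setting:

```python
# Rough mirror of the hardware guide above; thresholds and rates are approximations,
# not the exact rules used by TraceMind's "auto" selection.
HF_JOBS_RATES = {"cpu-basic": 0.05, "t4-small": 0.60, "a10g-small": 1.10,
                 "a100-large": 3.00, "h200": 5.00}  # approximate $/hr

def pick_hardware(is_api_model: bool, params_billion: float = 0.0) -> str:
    if is_api_model:
        return "cpu-basic"        # API calls need no GPU
    if params_billion <= 8:
        return "t4-small"         # e.g. Llama-3.1-8B, Mistral-7B
    if params_billion <= 14:
        return "a10g-small"       # e.g. Qwen2.5-14B
    return "a100-large"           # 70B+ class models

def compute_cost(flavor: str, duration_minutes: float) -> float:
    return HF_JOBS_RATES[flavor] * duration_minutes / 60

print(pick_hardware(False, 8), f"${compute_cost('t4-small', 15):.2f}")  # t4-small $0.15
```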
#### Example Workflows
**Workflow 1: Evaluate API Model (OpenAI GPT-4)**
```
1. Model: "openai/gpt-4"
2. Infrastructure: HuggingFace Jobs
3. Agent type: both
4. Hardware: auto (selects cpu-basic)
5. Estimate: $50.00 (mostly API costs), 45 min
6. Submit → Monitor → View in leaderboard
```
**Workflow 2: Evaluate Local Model (Llama-3.1-8B)**
```
1. Model: "meta-llama/Llama-3.1-8B"
2. Infrastructure: Modal (for pay-per-second billing)
3. Agent type: both
4. Hardware: auto (selects A10G)
5. Estimate: $0.20, 15 min
6. Submit → Monitor → View in leaderboard
```
#### Tips
- **Always estimate first**: Prevents surprise costs
- **Use "auto" hardware**: Smart selection based on model size
- **Start small**: Test with 10-20 tests before scaling to 100+
- **Monitor jobs**: Check Job Monitoring tab for status
- **Modal for experimentation**: Pay-per-second is cost-effective for testing
---
### 📊 Job Monitoring
**Purpose**: Track status of submitted evaluation jobs.
#### Features
**Job Status Display**:
- Job ID
- Current status (pending, running, completed, failed)
- Start time
- Duration
- Infrastructure (HF Jobs or Modal)
**Real-time Updates**:
- Auto-refreshes every 30 seconds
- Manual refresh button
**Job Actions**:
- View logs
- Cancel job (if still running)
- View results (if completed)
#### How to Use
```
1. Go to "π Job Monitoring" tab
2. See list of your submitted jobs
3. Click "Refresh" for latest status
4. When status = "completed":
β Click "View Results"
β Opens leaderboard filtered to your run
```
#### Job Statuses
- **Pending**: Job queued, waiting for resources
- **Running**: Evaluation in progress
- **Completed**: Evaluation finished successfully
- **Failed**: Evaluation encountered an error
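If you script around the dashboard, a simple polling loop mirrors what the auto-refresh does. `get_job_status` below is a hypothetical stand-in for whatever client you use (HF Jobs or Modal), not a TraceMind API:

```python
import time

def get_job_status(job_id: str) -> str:
    """Hypothetical stub - replace with a real lookup against HF Jobs or Modal."""
    return "completed"

def wait_for_job(job_id: str, poll_seconds: int = 30, timeout_seconds: int = 7200) -> str:
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        status = get_job_status(job_id)
        if status in ("completed", "failed"):
            return status
        time.sleep(poll_seconds)  # mirrors the UI's 30-second auto-refresh
    return "timeout"
```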
#### Tips
- **Check logs** if job fails: Helps diagnose issues
- **Expected duration**:
- API models: 2-5 minutes
- Local models: 15-30 minutes (includes model download)
---
### 🔍 Trace Visualization
**Purpose**: Deep-dive into OpenTelemetry traces to understand agent execution.
**Access**: Click on any test case in a run's detail view
#### Features
**Waterfall Diagram**:
- Visual timeline of execution
- Spans show: LLM calls, tool executions, reasoning steps
- Duration bars (wider = slower)
- Parent-child relationships
**Span Details**:
- Span name (e.g., "LLM Call - Reasoning", "Tool Call - get_weather")
- Start/end times
- Duration
- Attributes (model, tokens, cost, tool inputs/outputs)
- Status (OK, ERROR)
**GPU Metrics Overlay** (for GPU jobs only):
- GPU utilization %
- Memory usage
- Temperature
- CO2 emissions
**MCP-Powered Q&A**:
- Ask questions about the trace
- Example: "Why was tool X called twice?"
- Agent uses `debug_trace` MCP tool to analyze
#### How to Use
```
1. From leaderboard → Click a run → Click a test case
2. View waterfall diagram:
→ Spans arranged chronologically
→ Parent spans (e.g., "Agent Execution")
→ Child spans (e.g., "LLM Call", "Tool Call")
3. Click any span:
→ See detailed attributes
→ Token counts, costs, inputs/outputs
4. Ask questions (MCP-powered):
User: "Why did this test fail?"
→ Agent analyzes trace with debug_trace tool
→ Returns explanation with span references
5. Check GPU metrics (if available):
→ Graph shows utilization over time
→ Overlaid on execution timeline
```
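The waterfall view is essentially spans sorted by start time with durations drawn as bars. The sketch below does the same ranking on a simplified span structure; the field names are illustrative, not SMOLTRACE's exact OpenTelemetry schema:

```python
# Simplified stand-in for the span attributes shown in the UI; field names are illustrative.
spans = [
    {"name": "LLM Call - Reasoning",      "start_ms": 0,    "end_ms": 1200, "status": "OK"},
    {"name": "Tool Call - search_web",    "start_ms": 1200, "end_ms": 7700, "status": "OK"},
    {"name": "LLM Call - Final Response", "start_ms": 7700, "end_ms": 8500, "status": "OK"},
]

# Rank spans by duration to surface the bottleneck, as the waterfall does visually.
for span in sorted(spans, key=lambda s: s["end_ms"] - s["start_ms"], reverse=True):
    seconds = (span["end_ms"] - span["start_ms"]) / 1000
    print(f"{span['name']:<28} {seconds:>4.1f}s  {span['status']}")
```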
#### Example Analysis
**Scenario: Understanding a slow execution**
```
1. Open trace for test_045 (duration: 8.5s)
2. Waterfall shows:
- Span 1: LLM Call - Reasoning (1.2s) ✅
- Span 2: Tool Call - search_web (6.5s) ⚠️ SLOW
- Span 3: LLM Call - Final Response (0.8s) ✅
3. Click Span 2 (search_web):
- Input: {"query": "weather in Tokyo"}
- Output: 5 results
- Duration: 6.5s (6x slower than typical)
4. Ask agent: "Why was the search_web call so slow?"
→ Agent analysis:
"The search_web call took 6.5s due to network latency.
Span attributes show API response time: 6.2s.
This is an external dependency issue, not agent code.
Recommendation: Implement timeout (5s) and fallback strategy."
```
#### Tips
- **Look for patterns**: Similar failures often have common spans
- **Use MCP Q&A**: Faster than manual trace analysis
- **Check GPU metrics**: Identify resource bottlenecks
- **Compare successful vs failed traces**: Spot differences
---
### 🔬 Synthetic Data Generator
**Purpose**: Generate custom synthetic test datasets for agent evaluation using AI, complete with domain-specific tasks and prompt templates.
#### Features
**AI-Powered Dataset Generation**:
- Generate 5-100 synthetic tasks using Google Gemini 2.5 Flash
- Customizable domain, tools, difficulty, and agent type
- Automatic batching for large datasets (parallel generation)
- SMOLTRACE-format output ready for evaluation
**Prompt Template Generation**:
- Customized YAML templates based on smolagents format
- Optimized for your specific domain and tools
- Included automatically in dataset card
**Push to HuggingFace Hub**:
- One-click upload to HuggingFace Hub
- Public or private repositories
- Auto-generated README with usage instructions
- Ready to use with SMOLTRACE evaluations
#### How to Use
**Step 1: Configure & Generate Dataset**
1. Navigate to **🔬 Synthetic Data Generator** tab
2. Configure generation parameters:
- **Domain**: Topic/industry (e.g., "travel", "finance", "healthcare", "customer_support")
- **Tools**: Comma-separated list of tool names (e.g., "get_weather,search_flights,book_hotel")
- **Number of Tasks**: 5-100 tasks (slider)
- **Difficulty Level**:
- `balanced` (40% easy, 40% medium, 20% hard)
- `easy_only` (100% easy tasks)
- `medium_only` (100% medium tasks)
- `hard_only` (100% hard tasks)
- `progressive` (50% easy, 30% medium, 20% hard)
- **Agent Type**:
- `tool` (ToolCallingAgent only)
- `code` (CodeAgent only)
- `both` (50/50 mix)
3. Click **"π² Generate Synthetic Dataset"**
4. Wait for generation (30-120s depending on size):
- Shows progress message
- Automatic batching for >20 tasks
- Parallel API calls for faster generation
**Step 2: Review Generated Content**
1. **Dataset Preview Tab**:
- View all generated tasks in JSON format
- Check task IDs, prompts, expected tools, difficulty
- See dataset statistics:
- Total tasks
- Difficulty distribution
- Agent type distribution
- Tools coverage
2. **Prompt Template Tab**:
- View customized YAML prompt template
- Based on smolagents templates
- Adapted for your domain and tools
- Ready to use with ToolCallingAgent or CodeAgent
**Step 3: Push to HuggingFace Hub** (Optional)
1. Enter **Repository Name**:
- Format: `username/smoltrace-{domain}-tasks`
- Example: `alice/smoltrace-finance-tasks`
- Auto-filled with your HF username after generation
2. Set **Visibility**:
- ☐ Private Repository (unchecked = public)
- ☑ Private Repository (checked = private)
3. Provide **HuggingFace Token** (optional):
- Leave empty to use environment token (HF_TOKEN from Settings)
- Or paste token from https://huggingface.co/settings/tokens
- Requires write permissions
4. Click **"π€ Push to HuggingFace Hub"**
5. Wait for upload (5-30s):
- Creates dataset repository
- Uploads tasks
- Generates README with:
- Usage instructions
- Prompt template
- SMOLTRACE integration code
- Returns dataset URL
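Under the hood, the push step is roughly equivalent to the following `datasets` call. The repository name and task list are placeholders, and the auto-generated README is omitted here:

```python
# Roughly what the push button does; repo id and task list are placeholders.
from datasets import Dataset

tasks = [
    {"id": "finance_stock_price_1",
     "prompt": "What is the current price of AAPL stock?",
     "expected_tool": "get_stock_price",
     "difficulty": "easy",
     "agent_type": "tool"},
]

Dataset.from_list(tasks).push_to_hub(
    "username/smoltrace-finance-tasks",  # placeholder repo id
    private=False,
    token=None,  # None falls back to HF_TOKEN / cached login
)
```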
#### Example Workflow
```
Scenario: Create finance evaluation dataset with 20 tasks
1. Configure:
Domain: "finance"
Tools: "get_stock_price,calculate_roi,get_market_news,send_alert"
Number of Tasks: 20
Difficulty: "balanced"
Agent Type: "both"
2. Click "Generate"
→ AI generates 20 tasks:
- 8 easy (single tool, straightforward)
- 8 medium (multiple tools or complex logic)
- 4 hard (complex reasoning, edge cases)
- 10 for ToolCallingAgent
- 10 for CodeAgent
→ Also generates customized prompt template
3. Review Dataset Preview:
Task 1:
{
"id": "finance_stock_price_1",
"prompt": "What is the current price of AAPL stock?",
"expected_tool": "get_stock_price",
"difficulty": "easy",
"agent_type": "tool",
"expected_keywords": ["AAPL", "price", "$"]
}
Task 15:
{
"id": "finance_complex_analysis_15",
"prompt": "Calculate the ROI for investing $10,000 in AAPL last year and send an alert if ROI > 15%",
"expected_tool": "calculate_roi",
"expected_tool_calls": 2,
"difficulty": "hard",
"agent_type": "code",
"expected_keywords": ["ROI", "15%", "alert"]
}
4. Review Prompt Template:
See customized YAML with:
- Finance-specific system prompt
- Tool descriptions for get_stock_price, calculate_roi, etc.
- Response format guidelines
5. Push to Hub:
Repository: "yourname/smoltrace-finance-tasks"
Private: No (public)
Token: (empty, using environment token)
→ Uploads to https://huggingface.co/datasets/yourname/smoltrace-finance-tasks
→ README includes usage instructions and prompt template
6. Use in evaluation:
# Load your custom dataset
dataset = load_dataset("yourname/smoltrace-finance-tasks")
# Run SMOLTRACE evaluation
smoltrace-eval --model openai/gpt-4 \
--dataset-name yourname/smoltrace-finance-tasks \
--agent-type both
```
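A runnable version of the last step in the workflow above, assuming the dataset was pushed with a default `train` split and the `datasets` library is installed (the repo id is a placeholder):

```python
# Assumes a default "train" split; the repo id is a placeholder.
from collections import Counter
from datasets import load_dataset

ds = load_dataset("yourname/smoltrace-finance-tasks", split="train")
print(Counter(ds["difficulty"]))  # e.g. Counter({'easy': 8, 'medium': 8, 'hard': 4})

hard_code_tasks = ds.filter(lambda t: t["difficulty"] == "hard" and t["agent_type"] == "code")
print(hard_code_tasks[0]["prompt"])
```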
#### Configuration Reference
**Difficulty Levels Explained**:
| Level | Characteristics | Example |
|-------|----------------|---------|
| **Easy** | Single tool call, straightforward input, clear expected output | "What's the weather in Tokyo?" → get_weather("Tokyo") |
| **Medium** | Multiple tool calls OR complex input parsing OR conditional logic | "Compare weather in Tokyo and London" → get_weather("Tokyo"), get_weather("London"), compare |
| **Hard** | Multiple tools, complex reasoning, edge cases, error handling | "Plan a trip with best weather, book flights if under $500, alert if unavailable" |
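To see how the difficulty presets translate into task counts, here is the arithmetic for the two mixed distributions (shares come from the Step 1 list above; rounding is illustrative):

```python
# Shares come from the Step 1 difficulty descriptions.
DISTRIBUTIONS = {
    "balanced":    {"easy": 0.4, "medium": 0.4, "hard": 0.2},
    "progressive": {"easy": 0.5, "medium": 0.3, "hard": 0.2},
}

def task_counts(total_tasks: int, level: str = "balanced") -> dict:
    return {difficulty: round(total_tasks * share)
            for difficulty, share in DISTRIBUTIONS[level].items()}

print(task_counts(20, "balanced"))     # {'easy': 8, 'medium': 8, 'hard': 4}
print(task_counts(20, "progressive"))  # {'easy': 10, 'medium': 6, 'hard': 4}
```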
**Agent Types Explained**:
| Type | Description | Use Case |
|------|-------------|----------|
| **tool** | ToolCallingAgent - Declarative tool calling with structured outputs | API-based models that support function calling (GPT-4, Claude) |
| **code** | CodeAgent - Writes Python code to use tools programmatically | Models that excel at code generation (Qwen-Coder, DeepSeek-Coder) |
| **both** | 50/50 mix of tool and code agent tasks | Comprehensive evaluation across agent types |
#### Best Practices
**Domain Selection**:
- Be specific: "customer_support_saas" > "support"
- Match your use case: Use actual business domain
- Consider tools available: Domain should align with tools
**Tool Names**:
- Use descriptive names: "get_stock_price" > "fetch"
- Match actual tool implementations
- 3-8 tools is ideal (enough variety, not overwhelming)
- Include mix of data retrieval and action tools
**Number of Tasks**:
- 5-10 tasks: Quick testing, proof of concept
- 20-30 tasks: Solid evaluation dataset
- 50-100 tasks: Comprehensive benchmark
**Difficulty Distribution**:
- `balanced`: Best for general evaluation
- `progressive`: Good for learning/debugging
- `easy_only`: Quick sanity checks
- `hard_only`: Stress testing advanced capabilities
**Quality Assurance**:
- Always review generated tasks before pushing
- Check for domain relevance and variety
- Verify expected tools match your actual tools
- Ensure prompts are clear and executable
#### Troubleshooting
**Generation fails with "Invalid API key"**:
- Go to **⚙️ Settings**
- Configure Gemini API Key
- Get key from https://aistudio.google.com/apikey
**Generated tasks don't match domain**:
- Be more specific in domain description
- Try regenerating with adjusted parameters
- Review prompt template for domain alignment
**Push to Hub fails with "Authentication error"**:
- Verify HuggingFace token has write permissions
- Get token from https://huggingface.co/settings/tokens
- Check token in **⚙️ Settings** or provide directly
**Dataset generation is slow (>60s)**:
- Large requests (>20 tasks) are automatically batched
- Each batch takes 30-120s
- Example: 100 tasks = 5 batches × 60s = ~5 minutes
- This is normal for large datasets
**Tasks are too easy/hard**:
- Adjust difficulty distribution
- Regenerate with different settings
- Mix difficulty levels with `balanced` or `progressive`
#### Advanced Tips
**Iterative Refinement**:
1. Generate 10 tasks with `balanced` difficulty
2. Review quality and variety
3. If satisfied, generate 50-100 tasks with same settings
4. If not, adjust domain/tools and regenerate
**Dataset Versioning**:
- Use version suffixes: `username/smoltrace-finance-tasks-v2`
- Iterate on datasets as tools evolve
- Keep track of which version was used for evaluations
**Combining Datasets**:
- Generate multiple small datasets for different domains
- Use SMOLTRACE CLI to merge datasets
- Create comprehensive multi-domain benchmarks
**Custom Prompt Templates**:
- Generate prompt template separately
- Customize further based on your needs
- Use in agent initialization before evaluation
- Include in dataset card for reproducibility
---
### ⚙️ Settings
**Purpose**: Configure API keys, preferences, and authentication.
#### Features
**API Key Configuration**:
- Gemini API Key (for MCP server AI analysis)
- HuggingFace Token (for dataset access + job submission)
- Modal Token ID + Secret (for Modal job submission)
- LLM Provider Keys (OpenAI, Anthropic, etc.)
**Preferences**:
- Default infrastructure (HF Jobs vs Modal)
- Default hardware tier
- Auto-refresh intervals
**Security**:
- Keys stored in browser session only (not server)
- HTTPS encryption for all API calls
- Keys never logged or exposed
#### How to Use
**Configure Essential Keys**:
```
1. Go to "βοΈ Settings" tab
2. Enter Gemini API Key:
- Get from: https://ai.google.dev/
- Click "Get API Key" β Create project β Generate
- Paste into field
- Free tier: 1,500 requests/day
3. Enter HuggingFace Token:
- Get from: https://huggingface.co/settings/tokens
- Click "New token" β Name: "TraceMind"
- Permissions:
- Read (for viewing datasets)
- Write (for uploading results)
- Run Jobs (for evaluation submission)
- Paste into field
4. Click "Save API Keys"
→ Keys stored in browser session
→ MCP server will use your keys
```
**Configure for Job Submission** (Optional):
For **HuggingFace Jobs**:
```
Already configured if you entered HF token above with "Run Jobs" permission.
```
For **Modal** (Alternative):
```
1. Sign up: https://modal.com
2. Get token: https://modal.com/settings/tokens
3. Copy MODAL_TOKEN_ID (starts with 'ak-')
4. Copy MODAL_TOKEN_SECRET (starts with 'as-')
5. Paste both into Settings → Save
```
For **API Model Providers**:
```
1. Get API key from provider:
- OpenAI: https://platform.openai.com/api-keys
- Anthropic: https://console.anthropic.com/settings/keys
- Google: https://ai.google.dev/
2. Paste into corresponding field in Settings
3. Click "Save LLM Provider Keys"
```
#### Security Best Practices
- **Use environment variables**: For production, set keys via HF Spaces secrets
- **Rotate keys regularly**: Generate new tokens every 3-6 months
- **Minimal permissions**: Only grant "Run Jobs" if you need to submit evaluations
- **Monitor usage**: Check API provider dashboards for unexpected charges
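For a Space or other production deployment, read keys from environment variables (HF Spaces secrets) instead of pasting them into the UI. `HF_TOKEN` matches the name used elsewhere in this guide; the Gemini variable name below is an assumption:

```python
import os

# HF_TOKEN matches the name used elsewhere in this guide; GEMINI_API_KEY is an assumed name.
hf_token = os.environ.get("HF_TOKEN")
gemini_key = os.environ.get("GEMINI_API_KEY")

if not hf_token:
    raise RuntimeError("Set HF_TOKEN as a Space secret or configure it in the Settings tab")
```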
---
## Common Workflows
### Workflow 1: Quick Model Comparison
```
Goal: Compare GPT-4 vs Llama-3.1-8B for production use
Steps:
1. Go to Leaderboard → Load Leaderboard
2. Read AI insights: "GPT-4 leads accuracy, Llama-3.1 best cost"
3. Sort by Success Rate → Note: GPT-4 (95.8%), Llama (93.4%)
4. Sort by Cost → Note: GPT-4 ($0.05), Llama ($0.002)
5. Go to Agent Chat → Ask: "Compare GPT-4 and Llama-3.1. Which should I use for 1M runs/month?"
→ Agent analyzes with MCP tools
→ Returns: "Llama saves $48K/month, only 2.4% accuracy drop"
6. Decision: Use Llama-3.1-8B for production
```
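The savings figure in step 5 is easy to verify; the per-run costs below come from the leaderboard example above:

```python
# Per-run costs are taken from the leaderboard example above.
gpt4_cost_per_run, llama_cost_per_run, runs_per_month = 0.05, 0.002, 1_000_000
savings = (gpt4_cost_per_run - llama_cost_per_run) * runs_per_month
print(f"Monthly savings: ${savings:,.0f}")  # -> $48,000
```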
### Workflow 2: Evaluate Custom Model
```
Goal: Evaluate your fine-tuned model on SMOLTRACE benchmark
Steps:
1. Ensure model is on HuggingFace: username/my-finetuned-model
2. Go to Settings → Configure HF token (with Run Jobs permission)
3. Go to New Evaluation:
- Model: "username/my-finetuned-model"
- Infrastructure: HuggingFace Jobs
- Agent type: both
- Hardware: auto
4. Click "Estimate Cost" β Review: $1.50, 20 min
5. Click "Submit Evaluation"
6. Go to Job Monitoring → Wait for "Completed" (15-25 min)
7. Go to Leaderboard → Refresh → See your model in table
8. Click your run → Review detailed results
9. Compare vs other models using Agent Chat
```
### Workflow 3: Debug Failed Test
```
Goal: Understand why test_045 failed in your evaluation
Steps:
1. Go to Leaderboard → Find your run → Click to open details
2. Filter to failed tests only
3. Click test_045 → Opens trace visualization
4. Examine waterfall:
- Span 1: LLM Call (OK)
- Span 2: Tool Call - "unknown_tool" (ERROR)
- No Span 3 (execution stopped)
5. Ask Agent: "Why did test_045 fail?"
→ Agent uses debug_trace MCP tool
→ Returns: "Tool 'unknown_tool' not found. Add to agent's tool list."
6. Fix: Update agent config to include missing tool
7. Re-run evaluation with fixed config
```
---
## Troubleshooting
### Leaderboard Issues
**Problem**: "Load Leaderboard" button doesn't work
- **Solution**: Check HuggingFace token in Settings (needs Read permission)
- **Solution**: Verify leaderboard dataset exists: https://huggingface.co/datasets/kshitijthakkar/smoltrace-leaderboard
**Problem**: AI insights not showing
- **Solution**: Check Gemini API key in Settings
- **Solution**: Wait 5-10 seconds for AI generation to complete
### Agent Chat Issues
**Problem**: Agent responds with "MCP server connection failed"
- **Solution**: Check MCP server status: https://huggingface.co/spaces/MCP-1st-Birthday/TraceMind-mcp-server
- **Solution**: Configure Gemini API key in both TraceMind-AI and MCP server Settings
**Problem**: Agent gives incorrect information
- **Solution**: Agent may be using stale data. Ask: "Load the latest leaderboard data"
- **Solution**: Verify question is clear and specific
### Evaluation Submission Issues
**Problem**: "Submit Evaluation" fails with auth error
- **Solution**: HF token needs "Run Jobs" permission
- **Solution**: Ensure HF Pro account is active ($9/month)
- **Solution**: Verify credit card is on file for compute charges
**Problem**: Job stuck in "Pending" status
- **Solution**: HuggingFace Jobs may have queue. Wait 5-10 minutes.
- **Solution**: Try Modal as alternative infrastructure
**Problem**: Job fails with "Out of Memory"
- **Solution**: Model too large for selected hardware
- **Solution**: Increase hardware tier (e.g., t4-small → a10g-small)
- **Solution**: Use auto hardware selection
### Trace Visualization Issues
**Problem**: Traces not loading
- **Solution**: Ensure evaluation completed successfully
- **Solution**: Check traces dataset exists on HuggingFace
- **Solution**: Verify HF token has Read permission
**Problem**: GPU metrics missing
- **Solution**: Only available for GPU jobs (not API models)
- **Solution**: Ensure evaluation was run with SMOLTRACE's GPU metrics enabled
---
## Getting Help
- **🔧 GitHub Issues**: [TraceMind-AI/issues](https://github.com/Mandark-droid/TraceMind-AI/issues)
- **💬 HF Discord**: `#agents-mcp-hackathon-winter25`
- **📚 Documentation**: See [MCP_INTEGRATION.md](MCP_INTEGRATION.md) and [ARCHITECTURE.md](ARCHITECTURE.md)
---
**Last Updated**: November 21, 2025