Commit d78d01f
Parent(s): dc9db21
Fix JSON parsing: MCP tools return dicts not JSON strings
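In practice the change boils down to indexing the tool result directly instead of parsing it. A minimal sketch, assuming `run_get_top_performers` is injected into the agent's Python environment by the MCP runtime and returns the keys shown in the prompt examples below:

```python
# Assumption: run_get_top_performers is provided by the MCP runtime and
# returns a dict shaped like the prompt examples below.
top_models_data = run_get_top_performers(
    leaderboard_repo="kshitijthakkar/smoltrace-leaderboard",
    metric="success_rate",
    top_n=3,
)

# The result is already a dict, so index it directly...
for model in top_models_data["top_performers"]:
    print(f"{model['model']}: {model['success_rate']}% success")

# ...whereas parsing it as a JSON string would fail:
# json.loads(top_models_data)  -> TypeError: the JSON object must be str, bytes or bytearray, not dict
```

json.dumps still has a role when a tool explicitly expects a JSON string; see the sketch after the diff.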
prompts/code_agent.yaml  CHANGED  (+21, -27)
@@ -25,17 +25,15 @@ system_prompt: |-
 ---
 Task: "What are the top 3 performing models on the leaderboard and how much do they cost?"

-Thought: This is a "top N" query, so I should use the optimized `run_get_top_performers` tool instead of run_get_dataset to avoid loading all 51 runs (saves 90% tokens!). This
+Thought: This is a "top N" query, so I should use the optimized `run_get_top_performers` tool instead of run_get_dataset to avoid loading all 51 runs (saves 90% tokens!). This tool returns a dict ready to use (no json.loads needed).
 ```python
-
-top_models_json = run_get_top_performers(
+top_models_data = run_get_top_performers(
     leaderboard_repo="kshitijthakkar/smoltrace-leaderboard",
     metric="success_rate",
     top_n=3
 )
-
-
-for model in data['top_performers']:
+print(f"Top 3 models by {top_models_data['metric_ranked_by']}:")
+for model in top_models_data['top_performers']:
     print(f" - {model['model']}: {model['success_rate']}% success, ${model['total_cost_usd']}/run")
 ```
 Observation:
@@ -71,22 +69,21 @@ system_prompt: |-
 ---
 Task: "Analyze the current leaderboard and show me the top performing models with their costs"

-Thought: This is an overview question about the leaderboard. I should use run_get_leaderboard_summary for high-level statistics (99% token reduction!), then run_get_top_performers for the top models with costs. This is much more efficient than loading all 51 runs with run_get_dataset.
+Thought: This is an overview question about the leaderboard. I should use run_get_leaderboard_summary for high-level statistics (99% token reduction!), then run_get_top_performers for the top models with costs. This is much more efficient than loading all 51 runs with run_get_dataset. MCP tools return dicts ready to use.
 ```python
-import json
 # Get overview statistics
-
+summary_data = run_get_leaderboard_summary(
     leaderboard_repo="kshitijthakkar/smoltrace-leaderboard"
 )
-summary =
+summary = summary_data['summary']

 # Get top 5 performers
-
+top_models_data = run_get_top_performers(
     leaderboard_repo="kshitijthakkar/smoltrace-leaderboard",
     metric="success_rate",
     top_n=5
 )
-top_models =
+top_models = top_models_data['top_performers']

 print(f"Leaderboard Overview:")
 print(f" - Total runs: {summary['total_runs']}")
@@ -127,22 +124,20 @@ system_prompt: |-
 ---
 Task: "Create a synthetic dataset of 20 finance-related tasks for testing agents with stock price and ROI calculation tools"

-Thought: I will use the run_generate_synthetic_dataset tool to create domain-specific test tasks. I'll specify the finance domain, provide the tool names, and request 20 tasks with balanced difficulty.
+Thought: I will use the run_generate_synthetic_dataset tool to create domain-specific test tasks. I'll specify the finance domain, provide the tool names, and request 20 tasks with balanced difficulty. The tool returns a dict ready to use.
 ```python
-
-synthetic_dataset = run_generate_synthetic_dataset(
+synthetic_result = run_generate_synthetic_dataset(
     domain="finance",
     tool_names="get_stock_price,calculate_roi,fetch_company_info",
     num_tasks=20,
     difficulty_distribution="balanced",
     agent_type="both"
 )
-
-print(f"
-print(f"
-print(f"Difficulty distribution: {result['dataset_info']['difficulty_distribution']}")
+print(f"Generated {synthetic_result['dataset_info']['num_tasks_generated']} tasks")
+print(f"Batches used: {synthetic_result['dataset_info']['num_batches']}")
+print(f"Difficulty distribution: {synthetic_result['dataset_info']['difficulty_distribution']}")
 print(f"\nSample task IDs:")
-for task in
+for task in synthetic_result['tasks'][:3]:
     print(f" - {task['id']}: {task['prompt'][:60]}...")
 ```
 Observation:
@@ -169,7 +164,7 @@ system_prompt: |-
 ---
 Task: "Generate 50 customer support tasks and upload them to HuggingFace as 'my-org/smoltrace-customer-support-tasks'"

-Thought: I'll first generate the synthetic dataset with 50 tasks, then use run_push_dataset_to_hub to upload it to HuggingFace. This will require multiple batches since 50 tasks exceeds the 20-task single-batch limit.
+Thought: I'll first generate the synthetic dataset with 50 tasks, then use run_push_dataset_to_hub to upload it to HuggingFace. This will require multiple batches since 50 tasks exceeds the 20-task single-batch limit. MCP tools return dicts, so I need to convert to JSON string for push_dataset_to_hub.
 ```python
 import json
 # Step 1: Generate synthetic dataset
@@ -180,13 +175,12 @@ system_prompt: |-
     difficulty_distribution="progressive",
     agent_type="both"
 )
-
-print(f"Generated {dataset['dataset_info']['num_tasks_generated']} tasks in {dataset['dataset_info']['num_batches']} batches")
+print(f"Generated {synthetic_result['dataset_info']['num_tasks_generated']} tasks in {synthetic_result['dataset_info']['num_batches']} batches")

-# Step 2: Extract tasks array and convert to JSON string
-tasks_json = json.dumps(
+# Step 2: Extract tasks array and convert to JSON string for push_dataset_to_hub
+tasks_json = json.dumps(synthetic_result['tasks'])

-# Step 3: Push to HuggingFace Hub (Note:
+# Step 3: Push to HuggingFace Hub (Note: uses MCP server's configured token if empty)
 upload_result = run_push_dataset_to_hub(
     dataset_json=tasks_json,
     repo_name="my-org/smoltrace-customer-support-tasks",
@@ -238,7 +232,7 @@ system_prompt: |-
    - For overview questions (e.g., "how many runs", "average success rate"): Use `run_get_leaderboard_summary()` (99% token savings!)
    - For leaderboard analysis with AI insights: Use `run_analyze_leaderboard()`
    - ONLY use `run_get_dataset()` for non-leaderboard datasets (traces, results, metrics)
-   - All MCP tools return
+   - **IMPORTANT**: All MCP tools return dict/list objects ready to use - DO NOT use json.loads()! Only use json.dumps() when you need to convert a dict to a JSON string (e.g., for push_dataset_to_hub).
 5. Call a tool only when needed, and never re-do a tool call that you previously did with the exact same parameters.
 6. Don't name any new variable with the same name as a tool: for instance don't name a variable 'final_answer'.
 7. Never create any notional variables in our code, as having these in your logs will derail you from the true variables.
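The one remaining use of the json module in the updated prompt is serialization for upload. A short sketch under the same assumptions as above (run_generate_synthetic_dataset and run_push_dataset_to_hub are provided by the MCP runtime; the domain and tool_names values below are purely illustrative):

```python
import json

# Illustrative arguments; only the parameter names mirror the prompt examples above.
dataset = run_generate_synthetic_dataset(
    domain="customer_support",
    tool_names="lookup_order,issue_refund",
    num_tasks=20,
    difficulty_distribution="balanced",
    agent_type="both",
)
print(f"Generated {dataset['dataset_info']['num_tasks_generated']} tasks")  # dict access, no json.loads

# json.dumps is needed only here, because dataset_json must be a JSON string.
tasks_json = json.dumps(dataset["tasks"])
upload_result = run_push_dataset_to_hub(
    dataset_json=tasks_json,
    repo_name="my-org/smoltrace-customer-support-tasks",
)
```

The asymmetry matches the prompt's rules: results arrive as dicts, so json.loads is never needed, while json.dumps is applied only where a parameter such as dataset_json is documented as a JSON string.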