kshitijthakkar committed
Commit 6022c4b · 1 Parent(s): d78d01f

Add defensive type handling for MCP tool returns and fix prompt template agent_type

Files changed (2):
  1. prompts/code_agent.yaml  +27 -8
  2. screens/chat.py  +1 -1
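The pattern this commit bakes into the prompt is a small coercion step that tolerates MCP clients returning either a JSON string or an already-parsed object. A minimal standalone sketch of that idea (the `fake_mcp_tool` below is a made-up stub standing in for tools like `run_get_top_performers`, and its payload values are placeholders):

```python
import json
from typing import Any, Union


def coerce_result(raw: Union[str, dict, list]) -> Any:
    """Parse a JSON string if needed; pass dicts/lists through unchanged."""
    return json.loads(raw) if isinstance(raw, str) else raw


def fake_mcp_tool() -> str:
    # Stub: some MCP clients hand back the tool result as a JSON string.
    return '{"top_performers": [{"model": "model-a", "success_rate": 90.0}]}'


result = coerce_result(fake_mcp_tool())        # works when the return is a string
result_again = coerce_result({"summary": {}})  # and when it is already a dict
print(result["top_performers"][0]["model"])
```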
prompts/code_agent.yaml CHANGED
@@ -25,13 +25,17 @@ system_prompt: |-
  ---
  Task: "What are the top 3 performing models on the leaderboard and how much do they cost?"

- Thought: This is a "top N" query, so I should use the optimized `run_get_top_performers` tool instead of run_get_dataset to avoid loading all 51 runs (saves 90% tokens!). This tool returns a dict ready to use (no json.loads needed).
+ Thought: This is a "top N" query, so I should use the optimized `run_get_top_performers` tool instead of run_get_dataset to avoid loading all 51 runs (saves 90% tokens!). I'll handle both string and dict returns defensively.
  ```python
- top_models_data = run_get_top_performers(
+ import json
+ top_raw = run_get_top_performers(
      leaderboard_repo="kshitijthakkar/smoltrace-leaderboard",
      metric="success_rate",
      top_n=3
  )
+ # Defensive: handle both string and dict returns
+ top_models_data = json.loads(top_raw) if isinstance(top_raw, str) else top_raw
+
  print(f"Top 3 models by {top_models_data['metric_ranked_by']}:")
  for model in top_models_data['top_performers']:
      print(f" - {model['model']}: {model['success_rate']}% success, ${model['total_cost_usd']}/run")
@@ -69,20 +73,23 @@ system_prompt: |-
  ---
  Task: "Analyze the current leaderboard and show me the top performing models with their costs"

- Thought: This is an overview question about the leaderboard. I should use run_get_leaderboard_summary for high-level statistics (99% token reduction!), then run_get_top_performers for the top models with costs. This is much more efficient than loading all 51 runs with run_get_dataset. MCP tools return dicts ready to use.
+ Thought: This is an overview question about the leaderboard. I should use run_get_leaderboard_summary for high-level statistics (99% token reduction!), then run_get_top_performers for the top models with costs. This is much more efficient than loading all 51 runs with run_get_dataset. I'll defensively handle string/dict returns.
  ```python
+ import json
  # Get overview statistics
- summary_data = run_get_leaderboard_summary(
+ summary_raw = run_get_leaderboard_summary(
      leaderboard_repo="kshitijthakkar/smoltrace-leaderboard"
  )
+ summary_data = json.loads(summary_raw) if isinstance(summary_raw, str) else summary_raw
  summary = summary_data['summary']

  # Get top 5 performers
- top_models_data = run_get_top_performers(
+ top_raw = run_get_top_performers(
      leaderboard_repo="kshitijthakkar/smoltrace-leaderboard",
      metric="success_rate",
      top_n=5
  )
+ top_models_data = json.loads(top_raw) if isinstance(top_raw, str) else top_raw
  top_models = top_models_data['top_performers']

  print(f"Leaderboard Overview:")
@@ -124,15 +131,22 @@ system_prompt: |-
  ---
  Task: "Create a synthetic dataset of 20 finance-related tasks for testing agents with stock price and ROI calculation tools"

- Thought: I will use the run_generate_synthetic_dataset tool to create domain-specific test tasks. I'll specify the finance domain, provide the tool names, and request 20 tasks with balanced difficulty. The tool returns a dict ready to use.
+ Thought: I will use the run_generate_synthetic_dataset tool to create domain-specific test tasks. I'll specify the finance domain, provide the tool names, and request 20 tasks with balanced difficulty. MCP tools may return strings or dicts depending on the client, so I'll handle both cases safely.
  ```python
- synthetic_result = run_generate_synthetic_dataset(
+ import json
+ synthetic_raw = run_generate_synthetic_dataset(
      domain="finance",
      tool_names="get_stock_price,calculate_roi,fetch_company_info",
      num_tasks=20,
      difficulty_distribution="balanced",
      agent_type="both"
  )
+ # Defensive: handle both string and dict returns
+ if isinstance(synthetic_raw, str):
+     synthetic_result = json.loads(synthetic_raw)
+ else:
+     synthetic_result = synthetic_raw
+
  print(f"Generated {synthetic_result['dataset_info']['num_tasks_generated']} tasks")
  print(f"Batches used: {synthetic_result['dataset_info']['num_batches']}")
  print(f"Difficulty distribution: {synthetic_result['dataset_info']['difficulty_distribution']}")
@@ -232,7 +246,12 @@ system_prompt: |-
  - For overview questions (e.g., "how many runs", "average success rate"): Use `run_get_leaderboard_summary()` (99% token savings!)
  - For leaderboard analysis with AI insights: Use `run_analyze_leaderboard()`
  - ONLY use `run_get_dataset()` for non-leaderboard datasets (traces, results, metrics)
- - **IMPORTANT**: All MCP tools return dict/list objects ready to use - DO NOT use json.loads()! Only use json.dumps() when you need to convert a dict to a JSON string (e.g., for push_dataset_to_hub).
+ - **IMPORTANT - Defensive Type Handling**: MCP tools may return strings OR dicts depending on the client. ALWAYS use this pattern:
+   ```python
+   result_raw = run_tool(...)
+   result = json.loads(result_raw) if isinstance(result_raw, str) else result_raw
+   ```
+   Then access dict keys safely: `result['key']`. Use json.dumps() when converting dict to string (e.g., for push_dataset_to_hub).
  5. Call a tool only when needed, and never re-do a tool call that you previously did with the exact same parameters.
  6. Don't name any new variable with the same name as a tool: for instance don't name a variable 'final_answer'.
  7. Never create any notional variables in our code, as having these in your logs will derail you from the true variables.
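The updated guideline keeps `json.dumps()` for the opposite direction: turning a parsed dict back into a JSON string before handing it to a tool that expects text. A rough sketch of that flow, assuming a `push_dataset_to_hub`-style tool whose exact signature is not shown in this diff (the call is left commented out for that reason):

```python
import json

# Placeholder result shaped like the synthetic-dataset example above.
synthetic_result = {"dataset_info": {"num_tasks_generated": 20}, "tasks": []}

# Serialize before passing to a tool that takes a JSON string.
payload = json.dumps(synthetic_result)

# Hypothetical call shape; the real push_dataset_to_hub parameters may differ:
# push_dataset_to_hub(dataset_json=payload, repo_id="<target-repo>")
print(f"{len(payload)} characters ready to push")
```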
screens/chat.py CHANGED
@@ -600,6 +600,6 @@ def on_quick_action(action_type):
         "costs": "Compare the costs of the top 3 models - which one offers the best value?",
         "recommend": "Based on the leaderboard data, which model would you recommend for a production system that needs both good accuracy and reasonable cost?",
         "multi_tool": "Analyze the leaderboard with focus on cost and accuracy, identify the top 2 models, compare them, and estimate the cost of running 500 evaluations on the cheaper one",
-        "synthetic": "Generate a synthetic test dataset with 100 tasks for the food-delivery domain using these tools: search_restaurants, view_menu, place_order, track_delivery, apply_promo, rate_restaurant, contact_driver with difficulty_distribution='balanced' and agent_type='both'. Then create a prompt template for the same domain and tools, and push the dataset to MCP-1st-Birthday/smoltrace-food-delivery-tasks-v2"
+        "synthetic": "Generate a synthetic test dataset with 100 tasks for the food-delivery domain using these tools: search_restaurants, view_menu, place_order, track_delivery, apply_promo, rate_restaurant, contact_driver with difficulty_distribution='balanced' and agent_type='both'. Then create a prompt template for the same domain and tools using agent_type='tool', and push the dataset to MCP-1st-Birthday/smoltrace-food-delivery-tasks-v2"
     }
     return prompts.get(action_type, "")