Mandark-droid committed on
Commit 942ce50 · 1 Parent(s): 98dc4d3

Implement Run Detail and Trace Detail screens with full navigation

Day 3 (Guide): Run Detail + Trace Detail screens completed

Run Detail Screen (Screen 3):
- Enhanced with 3 tabs: Overview, Test Cases, Performance
- Overview tab: Run metadata with gradient card styling
- Test Cases tab: Interactive dataframe with click-to-trace navigation
- Performance tab: 4-chart dashboard (response time histogram, token usage, cost, success/failure pie)
- Added create_performance_charts() function for performance visualizations

Trace Detail Screen (Screen 4):
- Created complete screen with 5 tabs plus a Q&A accordion:
* Thought Graph: Network visualization of agent reasoning flow
* Waterfall: Interactive timeline diagram of span execution
* GPU Metrics: Time series dashboard + raw metrics data (2 sub-tabs)
* Span Details: Detailed table with tokens, cost, duration per span
* Raw Data: JSON view of OpenTelemetry trace data
* Ask About This Trace: Accordion with Q&A placeholder (for MCP integration)

Components Added:
- components/thought_graph.py: Network graph visualization of agent reasoning
- screens/trace_detail.py: All trace visualization functions
* create_span_visualization(): Waterfall chart with color-coded spans
  * create_gpu_metrics_dashboard(): Multi-panel GPU metrics time series (see the sketch after this list)
* create_gpu_summary_cards(): HTML summary cards for GPU metrics
* process_trace_data(): Trace data processor with timestamp handling
* create_span_table(): JSON view of span details
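
Note: the GPU dashboard/summary implementations land in screens/trace_detail.py below, but that file's diff is truncated before they appear. A minimal sketch of the multi-panel time-series approach — the column names `timestamp`, `gpu_utilization_pct`, and `gpu_memory_mb` are illustrative assumptions, not the actual TraceMind schema:

    # Sketch only: assumed metric columns, not the committed implementation
    import pandas as pd
    import plotly.graph_objects as go
    from plotly.subplots import make_subplots

    def sketch_gpu_dashboard(metrics_df: pd.DataFrame) -> go.Figure:
        # One stacked panel per metric, sharing the time axis
        fig = make_subplots(rows=2, cols=1,
                            subplot_titles=("GPU Utilization (%)", "GPU Memory (MB)"))
        fig.add_trace(go.Scatter(x=metrics_df["timestamp"],
                                 y=metrics_df["gpu_utilization_pct"],
                                 mode="lines", name="Utilization"), row=1, col=1)
        fig.add_trace(go.Scatter(x=metrics_df["timestamp"],
                                 y=metrics_df["gpu_memory_mb"],
                                 mode="lines", name="Memory"), row=2, col=1)
        fig.update_layout(height=500, title_text="GPU Metrics Over Time", title_x=0.5)
        return fig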

Navigation Handlers:
- on_test_case_select(): Navigate from Run Detail to Trace Detail
- go_back_to_run_detail(): Back button from Trace Detail to Run Detail
- create_trace_metadata_html(): Trace metadata HTML generator
- create_span_details_table(): Span details dataframe generator

Event Wiring:
- test_cases_table.select → on_test_case_select (loads trace, switches screens)
- back_to_run_detail_btn.click → go_back_to_run_detail (returns to run detail)
- Integrated all 11 trace detail outputs (graphs, tables, JSON); see the wiring sketch below
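
The select wiring relies on Gradio's pattern of returning a dict keyed by output components, so a single handler can update all 11 outputs at once. A minimal, self-contained sketch of that pattern (component names here are illustrative, not the app's):

    # Sketch of the dict-of-updates pattern, not the committed app code
    import gradio as gr

    with gr.Blocks() as demo:
        table = gr.Dataframe(value=[["t1"], ["t2"]], headers=["Task ID"])
        detail = gr.Markdown(visible=False)

        def on_select(evt: gr.SelectData):
            # evt.index[0] is the clicked row; return updates keyed by component
            return {detail: gr.update(visible=True, value=f"Row {evt.index[0]} selected")}

        table.select(fn=on_select, outputs=[detail])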

Navigation Flow:
Leaderboard (Screen 1) → Run Detail (Screen 3) → Trace Detail (Screen 4)
- Click DrillDown row → navigate to Run Detail with 3 tabs
- Click Test Case row → navigate to Trace Detail with 5 tabs
- Back buttons work correctly between all screens (visibility-toggle pattern sketched below)
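
Screens are sibling gr.Column blocks whose visibility gets toggled, so "navigation" is just flipping visible flags. A minimal sketch of the pattern (names illustrative):

    # Sketch of the visibility-toggle navigation pattern
    import gradio as gr

    with gr.Blocks() as demo:
        with gr.Column(visible=True) as screen_a:
            fwd = gr.Button("Open detail")
        with gr.Column(visible=False) as screen_b:
            back = gr.Button("Back")

        fwd.click(lambda: (gr.update(visible=False), gr.update(visible=True)),
                  outputs=[screen_a, screen_b])
        back.click(lambda: (gr.update(visible=True), gr.update(visible=False)),
                   outputs=[screen_a, screen_b])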

File Stats:
- app.py: 832 → 1193 lines (+361 lines)
- New files: components/thought_graph.py, screens/trace_detail.py
- All functions compile and type-check successfully

Files changed (3)
  1. app.py +447 -85
  2. components/thought_graph.py +398 -0
  3. screens/trace_detail.py +721 -0
app.py CHANGED
@@ -21,8 +21,339 @@ from components.analytics_charts import (
     create_cost_efficiency_scatter
 )
 from components.report_cards import generate_leaderboard_summary_card
 from utils.navigation import Navigator, Screen
 
 # Initialize data loader
 data_loader = create_data_loader_from_env()
 navigator = Navigator()
@@ -265,30 +596,8 @@ def on_html_table_row_click(row_index_str):
 
     results_df = data_loader.load_results(results_dataset)
 
-    # Create metadata HTML
-    metadata_html = f"""
-    <div style="background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
-                padding: 20px; border-radius: 10px; color: white; margin-bottom: 20px;">
-        <h2 style="margin: 0 0 10px 0;">📊 Run Detail: {run_data.get('model', 'Unknown')}</h2>
-        <div style="display: grid; grid-template-columns: 1fr 1fr 1fr; gap: 20px; margin-top: 15px;">
-            <div>
-                <strong>Agent Type:</strong> {run_data.get('agent_type', 'N/A')}<br>
-                <strong>Provider:</strong> {run_data.get('provider', 'N/A')}<br>
-                <strong>Success Rate:</strong> {run_data.get('success_rate', 0):.1f}%
-            </div>
-            <div>
-                <strong>Total Tests:</strong> {run_data.get('total_tests', 0)}<br>
-                <strong>Successful:</strong> {run_data.get('successful_tests', 0)}<br>
-                <strong>Failed:</strong> {run_data.get('failed_tests', 0)}
-            </div>
-            <div>
-                <strong>Total Cost:</strong> ${run_data.get('total_cost_usd', 0):.4f}<br>
-                <strong>Avg Duration:</strong> {run_data.get('avg_duration_ms', 0):.0f}ms<br>
-                <strong>Submitted By:</strong> {run_data.get('submitted_by', 'Unknown')}
-            </div>
-        </div>
-    </div>
-    """
 
     # Format results for display
     display_df = results_df.copy()
@@ -358,30 +667,8 @@ def load_run_detail(run_id):
 
     results_df = data_loader.load_results(results_dataset)
 
-    # Create metadata HTML
-    metadata_html = f"""
-    <div style="background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
-                padding: 20px; border-radius: 10px; color: white; margin-bottom: 20px;">
-        <h2 style="margin: 0 0 10px 0;">📊 Run Detail: {run_data.get('model', 'Unknown')}</h2>
-        <div style="display: grid; grid-template-columns: 1fr 1fr 1fr; gap: 20px; margin-top: 15px;">
-            <div>
-                <strong>Agent Type:</strong> {run_data.get('agent_type', 'N/A')}<br>
-                <strong>Provider:</strong> {run_data.get('provider', 'N/A')}<br>
-                <strong>Success Rate:</strong> {run_data.get('success_rate', 0):.1f}%
-            </div>
-            <div>
-                <strong>Total Tests:</strong> {run_data.get('total_tests', 0)}<br>
-                <strong>Successful:</strong> {run_data.get('successful_tests', 0)}<br>
-                <strong>Failed:</strong> {run_data.get('failed_tests', 0)}
-            </div>
-            <div>
-                <strong>Total Cost:</strong> ${run_data.get('total_cost_usd', 0):.4f}<br>
-                <strong>Avg Duration:</strong> {run_data.get('avg_duration_ms', 0):.0f}ms<br>
-                <strong>Submitted By:</strong> {run_data.get('submitted_by', 'Unknown')}
-            </div>
-        </div>
-    </div>
-    """
 
     # Format results for display
     display_df = results_df.copy()
@@ -458,30 +745,8 @@ def on_drilldown_select(evt: gr.SelectData, df):
 
     results_df = data_loader.load_results(results_dataset)
 
-    # Create metadata HTML
-    metadata_html = f"""
-    <div style="background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
-                padding: 20px; border-radius: 10px; color: white; margin-bottom: 20px;">
-        <h2 style="margin: 0 0 10px 0;">📊 Run Detail: {run_data.get('model', 'Unknown')}</h2>
-        <div style="display: grid; grid-template-columns: 1fr 1fr 1fr; gap: 20px; margin-top: 15px;">
-            <div>
-                <strong>Agent Type:</strong> {run_data.get('agent_type', 'N/A')}<br>
-                <strong>Provider:</strong> {run_data.get('provider', 'N/A')}<br>
-                <strong>Success Rate:</strong> {run_data.get('success_rate', 0):.1f}%
-            </div>
-            <div>
-                <strong>Total Tests:</strong> {run_data.get('total_tests', 0)}<br>
-                <strong>Successful:</strong> {run_data.get('successful_tests', 0)}<br>
-                <strong>Failed:</strong> {run_data.get('failed_tests', 0)}
-            </div>
-            <div>
-                <strong>Total Cost:</strong> ${run_data.get('total_cost_usd', 0):.4f}<br>
-                <strong>Avg Duration:</strong> {run_data.get('avg_duration_ms', 0):.0f}ms<br>
-                <strong>Submitted By:</strong> {run_data.get('submitted_by', 'Unknown')}
-            </div>
-        </div>
-    </div>
-    """
 
     # Format results for display
     display_df = results_df.copy()
@@ -697,23 +962,95 @@ with gr.Blocks(title="TraceMind-AI", theme=theme) as app:
     # Hidden textbox for row selection (JavaScript bridge)
     selected_row_index = gr.Textbox(visible=False, elem_id="selected_row_index")
 
-    # Screen 3: Run Detail
     with gr.Column(visible=False) as run_detail_screen:
         # Navigation
         with gr.Row():
             back_to_leaderboard_btn = gr.Button("⬅️ Back to Leaderboard", variant="secondary", size="sm")
-
-        # Run metadata display
-        run_metadata_html = gr.HTML()
-
-        # Test cases table
-        gr.Markdown("## 📋 Test Cases")
-        test_cases_table = gr.Dataframe(
-            headers=["Task ID", "Status", "Tool", "Duration", "Tokens", "Cost", "Trace ID"],
-            interactive=False,
-            wrap=True
-        )
-
     # Event handlers
     app.load(
        fn=load_leaderboard,
@@ -812,6 +1149,31 @@ with gr.Blocks(title="TraceMind-AI", theme=theme) as app:
         outputs=[leaderboard_screen, run_detail_screen]
     )
 
     # HTML table row click handler (JavaScript bridge via hidden textbox)
    selected_row_index.change(
        fn=on_html_table_row_click,
     create_cost_efficiency_scatter
 )
 from components.report_cards import generate_leaderboard_summary_card
+from screens.trace_detail import (
+    create_span_visualization,
+    create_span_table,
+    create_gpu_metrics_dashboard,
+    create_gpu_summary_cards
+)
 from utils.navigation import Navigator, Screen
 
+
+
+# Trace Detail handlers and helpers
+
+def create_span_details_table(spans):
+    """
+    Create table view of span details
+
+    Args:
+        spans: List of span dictionaries
+
+    Returns:
+        DataFrame with span details
+    """
+    try:
+        if not spans:
+            return pd.DataFrame(columns=["Span Name", "Kind", "Duration (ms)", "Tokens", "Cost (USD)", "Status"])
+
+        rows = []
+        for span in spans:
+            name = span.get('name', 'Unknown')
+            kind = span.get('kind', 'INTERNAL')
+
+            # Get attributes
+            attributes = span.get('attributes', {})
+            if isinstance(attributes, dict) and 'openinference.span.kind' in attributes:
+                kind = attributes.get('openinference.span.kind', kind)
+
+            # Calculate duration
+            start = span.get('startTime') or span.get('startTimeUnixNano', 0)
+            end = span.get('endTime') or span.get('endTimeUnixNano', 0)
+            duration = (end - start) / 1000000 if start and end else 0  # Convert to ms
+
+            status = span.get('status', {}).get('code', 'OK') if isinstance(span.get('status'), dict) else 'OK'
+
+            # Extract tokens and cost information
+            tokens_str = "-"
+            cost_str = "-"
+
+            if isinstance(attributes, dict):
+                # Check for token usage
+                prompt_tokens = attributes.get('gen_ai.usage.prompt_tokens') or attributes.get('llm.token_count.prompt')
+                completion_tokens = attributes.get('gen_ai.usage.completion_tokens') or attributes.get('llm.token_count.completion')
+                total_tokens = attributes.get('llm.usage.total_tokens')
+
+                # Build tokens string
+                if prompt_tokens is not None and completion_tokens is not None:
+                    total = int(prompt_tokens) + int(completion_tokens)
+                    tokens_str = f"{total} ({int(prompt_tokens)}+{int(completion_tokens)})"
+                elif total_tokens is not None:
+                    tokens_str = str(int(total_tokens))
+
+                # Check for cost
+                cost = attributes.get('gen_ai.usage.cost.total') or attributes.get('llm.usage.cost')
+                if cost is not None:
+                    cost_str = f"${float(cost):.6f}"
+
+            rows.append({
+                "Span Name": name,
+                "Kind": kind,
+                "Duration (ms)": round(duration, 2),
+                "Tokens": tokens_str,
+                "Cost (USD)": cost_str,
+                "Status": status
+            })
+
+        return pd.DataFrame(rows)
+
+    except Exception as e:
+        print(f"[ERROR] create_span_details_table: {e}")
+        import traceback
+        traceback.print_exc()
+        return pd.DataFrame(columns=["Span Name", "Kind", "Duration (ms)", "Tokens", "Cost (USD)", "Status"])
+
+
+def create_trace_metadata_html(trace_data: dict) -> str:
+    """Create HTML for trace metadata display"""
+    trace_id = trace_data.get('trace_id', 'Unknown')
+    spans = trace_data.get('spans', [])
+    if hasattr(spans, 'tolist'):
+        spans = spans.tolist()
+    elif not isinstance(spans, list):
+        spans = list(spans) if spans is not None else []
+
+    metadata_html = f"""
+    <div style="background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
+                padding: 20px; border-radius: 10px; color: white; margin-bottom: 20px;">
+        <h3 style="margin: 0 0 10px 0;">Trace Information</h3>
+        <div style="display: grid; grid-template-columns: 1fr 1fr; gap: 15px;">
+            <div>
+                <strong>Trace ID:</strong> {trace_id}<br>
+                <strong>Total Spans:</strong> {len(spans)}
+            </div>
+        </div>
+    </div>
+    """
+    return metadata_html
+
+
+def on_test_case_select(evt: gr.SelectData, df):
+    """Handle test case selection in run detail - navigate to trace detail"""
+    global current_selected_run, current_selected_trace
+
+    print(f"[DEBUG] on_test_case_select called with index: {evt.index}")
+
+    # Check if we have a selected run
+    if current_selected_run is None:
+        print("[ERROR] No run selected - current_selected_run is None")
+        gr.Warning("Please select a run from the leaderboard first")
+        return {}
+
+    try:
+        # Get selected test case
+        selected_idx = evt.index[0]
+        if df is None or df.empty or selected_idx >= len(df):
+            gr.Warning("Invalid test case selection")
+            return {}
+
+        test_case = df.iloc[selected_idx].to_dict()
+        trace_id = test_case.get('trace_id')
+
+        print(f"[DEBUG] Selected test case: {test_case.get('task_id', 'Unknown')} (trace_id: {trace_id})")
+
+        # Load trace data
+        traces_dataset = current_selected_run.get('traces_dataset')
+        if not traces_dataset:
+            gr.Warning("No traces dataset found in current run")
+            return {}
+
+        trace_data = data_loader.get_trace_by_id(traces_dataset, trace_id)
+
+        if not trace_data:
+            gr.Warning(f"Trace not found: {trace_id}")
+            return {}
+
+        current_selected_trace = trace_data
+
+        # Get spans and ensure it's a list
+        spans = trace_data.get('spans', [])
+        if hasattr(spans, 'tolist'):
+            spans = spans.tolist()
+        elif not isinstance(spans, list):
+            spans = list(spans) if spans is not None else []
+
+        print(f"[DEBUG] Loaded trace with {len(spans)} spans")
+
+        # Create visualizations
+        span_viz_plot = create_span_visualization(spans, trace_id)
+        span_details_json = create_span_table(spans).value
+
+        # Create thought graph
+        from components.thought_graph import create_thought_graph as create_network_graph
+        thought_graph_plot = create_network_graph(spans, trace_id)
+
+        # Create span details table
+        span_table_df = create_span_details_table(spans)
+
+        # Load GPU metrics (if available)
+        gpu_summary_html = "<div style='padding: 20px; text-align: center;'>⚠️ No GPU metrics available (expected for API models)</div>"
+        gpu_plot = None
+        gpu_json_data = {}
+
+        try:
+            if 'metrics_dataset' in current_selected_run and current_selected_run['metrics_dataset']:
+                metrics_dataset = current_selected_run['metrics_dataset']
+                gpu_metrics_data = data_loader.load_metrics(metrics_dataset)
+
+                if gpu_metrics_data is not None and not gpu_metrics_data.empty:
+                    gpu_plot = create_gpu_metrics_dashboard(gpu_metrics_data)
+                    gpu_summary_html = create_gpu_summary_cards(gpu_metrics_data)
+                    gpu_json_data = gpu_metrics_data.to_dict('records')
+        except Exception as e:
+            print(f"[WARNING] Could not load GPU metrics: {e}")
+
+        # Return dictionary with visibility updates and data
+        return {
+            run_detail_screen: gr.update(visible=False),
+            trace_detail_screen: gr.update(visible=True),
+            trace_title: gr.update(value=f"# 🔍 Trace Detail: {trace_id}"),
+            trace_metadata_html: gr.update(value=create_trace_metadata_html(trace_data)),
+            trace_thought_graph: gr.update(value=thought_graph_plot),
+            span_visualization: gr.update(value=span_viz_plot),
+            span_details_table: gr.update(value=span_table_df),
+            span_details_json: gr.update(value=span_details_json),
+            gpu_summary_cards_html: gr.update(value=gpu_summary_html),
+            gpu_metrics_plot: gr.update(value=gpu_plot),
+            gpu_metrics_json: gr.update(value=gpu_json_data)
+        }
+
+    except Exception as e:
+        print(f"[ERROR] on_test_case_select failed: {e}")
+        import traceback
+        traceback.print_exc()
+        gr.Warning(f"Error loading trace: {e}")
+        return {}
+
+
+def create_performance_charts(results_df):
+    """
+    Create performance analysis charts for the Performance tab
+
+    Args:
+        results_df: DataFrame with test results
+
+    Returns:
+        Plotly figure with performance metrics
+    """
+    import plotly.graph_objects as go
+    from plotly.subplots import make_subplots
+
+    try:
+        if results_df.empty:
+            fig = go.Figure()
+            fig.add_annotation(text="No performance data available", showarrow=False)
+            return fig
+
+        # Create 2x2 subplots
+        fig = make_subplots(
+            rows=2, cols=2,
+            subplot_titles=(
+                "Response Time Distribution",
+                "Token Usage per Test",
+                "Cost per Test",
+                "Success vs Failure"
+            ),
+            specs=[[{"type": "histogram"}, {"type": "bar"}],
+                   [{"type": "bar"}, {"type": "pie"}]]
+        )
+
+        # 1. Response Time Distribution (Histogram)
+        if 'execution_time_ms' in results_df.columns:
+            fig.add_trace(
+                go.Histogram(
+                    x=results_df['execution_time_ms'],
+                    nbinsx=20,
+                    marker_color='#3498DB',
+                    name='Response Time',
+                    showlegend=False
+                ),
+                row=1, col=1
+            )
+            fig.update_xaxes(title_text="Time (ms)", row=1, col=1)
+            fig.update_yaxes(title_text="Count", row=1, col=1)
+
+        # 2. Token Usage per Test (Bar)
+        if 'total_tokens' in results_df.columns:
+            test_indices = list(range(len(results_df)))
+            fig.add_trace(
+                go.Bar(
+                    x=test_indices,
+                    y=results_df['total_tokens'],
+                    marker_color='#9B59B6',
+                    name='Tokens',
+                    showlegend=False
+                ),
+                row=1, col=2
+            )
+            fig.update_xaxes(title_text="Test Index", row=1, col=2)
+            fig.update_yaxes(title_text="Tokens", row=1, col=2)
+
+        # 3. Cost per Test (Bar)
+        if 'cost_usd' in results_df.columns:
+            test_indices = list(range(len(results_df)))
+            fig.add_trace(
+                go.Bar(
+                    x=test_indices,
+                    y=results_df['cost_usd'],
+                    marker_color='#E67E22',
+                    name='Cost',
+                    showlegend=False
+                ),
+                row=2, col=1
+            )
+            fig.update_xaxes(title_text="Test Index", row=2, col=1)
+            fig.update_yaxes(title_text="Cost (USD)", row=2, col=1)
+
+        # 4. Success vs Failure (Pie)
+        if 'success' in results_df.columns:
+            # Convert to boolean if needed
+            success_series = results_df['success']
+            if success_series.dtype == object:
+                success_series = success_series == "✅"
+
+            success_count = int(success_series.sum())
+            failure_count = len(results_df) - success_count
+
+            fig.add_trace(
+                go.Pie(
+                    labels=['Success', 'Failure'],
+                    values=[success_count, failure_count],
+                    marker_colors=['#2ECC71', '#E74C3C'],
+                    showlegend=True
+                ),
+                row=2, col=2
+            )
+
+        # Update layout
+        fig.update_layout(
+            height=700,
+            showlegend=False,
+            title_text="Performance Analysis Dashboard",
+            title_x=0.5
+        )
+
+        return fig
+
+    except Exception as e:
+        print(f"[ERROR] create_performance_charts: {e}")
+        import traceback
+        traceback.print_exc()
+        fig = go.Figure()
+        fig.add_annotation(text=f"Error creating charts: {str(e)}", showarrow=False)
+        return fig
+
+
+def go_back_to_run_detail():
+    """Navigate from trace detail back to run detail"""
+    return {
+        run_detail_screen: gr.update(visible=True),
+        trace_detail_screen: gr.update(visible=False)
+    }
+
+
 # Initialize data loader
 data_loader = create_data_loader_from_env()
 navigator = Navigator()
596
 
597
  results_df = data_loader.load_results(results_dataset)
598
 
599
+ # Generate performance chart
600
+ perf_chart = create_performance_charts(results_df)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
601
 
602
  # Format results for display
603
  display_df = results_df.copy()
 
 
     results_df = data_loader.load_results(results_dataset)
 
+    # Generate performance chart
+    perf_chart = create_performance_charts(results_df)
 
     # Format results for display
     display_df = results_df.copy()
 
 
     results_df = data_loader.load_results(results_dataset)
 
+    # Generate performance chart
+    perf_chart = create_performance_charts(results_df)
 
     # Format results for display
     display_df = results_df.copy()
 
     # Hidden textbox for row selection (JavaScript bridge)
     selected_row_index = gr.Textbox(visible=False, elem_id="selected_row_index")
 
+    # Screen 3: Run Detail (Enhanced with Tabs)
     with gr.Column(visible=False) as run_detail_screen:
         # Navigation
         with gr.Row():
             back_to_leaderboard_btn = gr.Button("⬅️ Back to Leaderboard", variant="secondary", size="sm")
+
+        run_detail_title = gr.Markdown("# 📊 Run Detail")
+
+        with gr.Tabs():
+            with gr.TabItem("📋 Overview"):
+                gr.Markdown("*Run metadata and summary*")
+                run_metadata_html = gr.HTML("")
+
+            with gr.TabItem("✅ Test Cases"):
+                gr.Markdown("*Individual test case results*")
+                test_cases_table = gr.Dataframe(
+                    headers=["Task ID", "Status", "Tool", "Duration", "Tokens", "Cost", "Trace ID"],
+                    interactive=False,
+                    wrap=True
+                )
+                gr.Markdown("*Click a test case to view detailed trace (including Thought Graph)*")
+
+            with gr.TabItem("⚡ Performance"):
+                gr.Markdown("*Performance metrics and charts*")
+                performance_charts = gr.Plot(label="Performance Analysis", show_label=False)
+
+    # Screen 4: Trace Detail with Sub-tabs
+    with gr.Column(visible=False) as trace_detail_screen:
+        with gr.Row():
+            back_to_run_detail_btn = gr.Button("⬅️ Back to Run Detail", variant="secondary", size="sm")
+
+        trace_title = gr.Markdown("# 🔍 Trace Detail")
+        trace_metadata_html = gr.HTML("")
+
+        with gr.Tabs():
+            with gr.TabItem("🧠 Thought Graph"):
+                gr.Markdown("""
+                ### Agent Reasoning Flow
+
+                This interactive network graph shows **how your agent thinks** - the logical flow of reasoning steps,
+                tool calls, and LLM interactions.
+
+                **How to read it:**
+                - 🟣 **Purple nodes** = LLM reasoning steps
+                - 🟠 **Orange nodes** = Tool calls
+                - 🔵 **Blue nodes** = Chains/Agents
+                - **Arrows** = Flow from one step to the next
+                - **Hover** = See tokens, costs, and timing details
+                """)
+                trace_thought_graph = gr.Plot(label="Thought Graph", show_label=False)
+
+            with gr.TabItem("📊 Waterfall"):
+                gr.Markdown("*Interactive waterfall diagram showing span execution timeline*")
+                gr.Markdown("*Hover over spans for details. Drag to zoom, double-click to reset.*")
+                span_visualization = gr.Plot(label="Trace Waterfall", show_label=False)
+
+            with gr.TabItem("🖥️ GPU Metrics"):
+                gr.Markdown("*Performance metrics for GPU-based models (not available for API models)*")
+                gpu_summary_cards_html = gr.HTML(label="GPU Summary", show_label=False)
+
+                with gr.Tabs():
+                    with gr.TabItem("📈 Time Series Dashboard"):
+                        gpu_metrics_plot = gr.Plot(label="GPU Metrics Over Time", show_label=False)
+
+                    with gr.TabItem("📋 Raw Metrics Data"):
+                        gpu_metrics_json = gr.JSON(label="GPU Metrics Data")
+
+            with gr.TabItem("📝 Span Details"):
+                gr.Markdown("*Detailed span information with token and cost data*")
+                span_details_table = gr.Dataframe(
+                    headers=["Span Name", "Kind", "Duration (ms)", "Tokens", "Cost (USD)", "Status"],
+                    interactive=False,
+                    wrap=True,
+                    label="Span Breakdown"
+                )
+
+            with gr.TabItem("🔍 Raw Data"):
+                gr.Markdown("*Raw OpenTelemetry trace data (JSON)*")
+                span_details_json = gr.JSON()
+
+        with gr.Accordion("🤖 Ask About This Trace", open=False):
+            trace_question = gr.Textbox(
+                label="Question",
+                placeholder="e.g., Why was the tool called twice?",
+                lines=2
+            )
+            trace_ask_btn = gr.Button("Ask", variant="primary")
+            trace_answer = gr.Markdown("*Ask a question to get AI-powered insights*")
+
     # Event handlers
     app.load(
         fn=load_leaderboard,
 
         outputs=[leaderboard_screen, run_detail_screen]
     )
 
+    # Trace detail navigation
+    test_cases_table.select(
+        fn=on_test_case_select,
+        inputs=[test_cases_table],
+        outputs=[
+            run_detail_screen,
+            trace_detail_screen,
+            trace_title,
+            trace_metadata_html,
+            trace_thought_graph,
+            span_visualization,
+            span_details_table,
+            span_details_json,
+            gpu_summary_cards_html,
+            gpu_metrics_plot,
+            gpu_metrics_json
+        ]
+    )
+
+    back_to_run_detail_btn.click(
+        fn=go_back_to_run_detail,
+        outputs=[run_detail_screen, trace_detail_screen]
+    )
+
+
     # HTML table row click handler (JavaScript bridge via hidden textbox)
     selected_row_index.change(
         fn=on_html_table_row_click,
components/thought_graph.py ADDED
@@ -0,0 +1,398 @@
+"""
+Thought Graph Visualization Component
+Visualizes agent reasoning flow as an interactive network graph
+"""
+
+import plotly.graph_objects as go
+import networkx as nx
+from typing import List, Dict, Any, Tuple
+import colorsys
+
+
+def create_thought_graph(spans: List[Dict[str, Any]], trace_id: str = "Unknown") -> go.Figure:
+    """
+    Create an interactive thought graph showing agent reasoning flow
+
+    This is different from the waterfall chart - it shows the logical flow
+    of the agent's thinking process (LLM calls, Tool calls, etc.) as a
+    directed graph rather than a timeline.
+
+    Args:
+        spans: List of OpenTelemetry span dictionaries
+        trace_id: Trace identifier
+
+    Returns:
+        Plotly figure with interactive network graph
+    """
+
+    # Ensure spans is a list
+    if hasattr(spans, 'tolist'):
+        spans = spans.tolist()
+    elif not isinstance(spans, list):
+        spans = list(spans) if spans is not None else []
+
+    if not spans:
+        # Return empty figure with message
+        fig = go.Figure()
+        fig.add_annotation(
+            text="No reasoning steps to display",
+            xref="paper", yref="paper",
+            x=0.5, y=0.5, xanchor='center', yanchor='middle',
+            showarrow=False,
+            font=dict(size=20)
+        )
+        return fig
+
+    # Build graph from spans
+    G = nx.DiGraph()
+
+    # First pass: Add all nodes and build span_map
+    span_map = {}
+    for span in spans:
+        span_id = span.get('spanId') or span.get('span_id') or span.get('spanID')
+        if not span_id:
+            continue
+
+        # Get span details
+        name = span.get('name', 'Unknown')
+        kind = span.get('kind', 'INTERNAL')
+        attributes = span.get('attributes', {})
+
+        # Check for OpenInference span kind
+        if isinstance(attributes, dict) and 'openinference.span.kind' in attributes:
+            openinference_kind = attributes.get('openinference.span.kind', kind)
+            if openinference_kind:  # Only call .upper() if not None
+                kind = openinference_kind.upper()
+
+        # Extract metadata for node
+        node_data = {
+            'span_id': span_id,
+            'name': name,
+            'kind': kind,
+            'attributes': attributes,
+            'status': span.get('status', {}).get('code', 'OK')
+        }
+
+        # Add token and cost info if available
+        if isinstance(attributes, dict):
+            # Token info
+            if 'gen_ai.usage.prompt_tokens' in attributes:
+                node_data['prompt_tokens'] = attributes['gen_ai.usage.prompt_tokens']
+            if 'gen_ai.usage.completion_tokens' in attributes:
+                node_data['completion_tokens'] = attributes['gen_ai.usage.completion_tokens']
+
+            # Cost info
+            if 'gen_ai.usage.cost.total' in attributes:
+                node_data['cost'] = attributes['gen_ai.usage.cost.total']
+            elif 'llm.usage.cost' in attributes:
+                node_data['cost'] = attributes['llm.usage.cost']
+
+            # Model info
+            if 'gen_ai.request.model' in attributes:
+                node_data['model'] = attributes['gen_ai.request.model']
+            elif 'llm.model' in attributes:
+                node_data['model'] = attributes['llm.model']
+
+            # Tool info
+            if 'tool.name' in attributes:
+                node_data['tool_name'] = attributes['tool.name']
+
+        # Add node to graph
+        G.add_node(span_id, **node_data)
+        span_map[span_id] = span
+
+    # Second pass: Add all edges (now all nodes exist in span_map)
+    for span in spans:
+        span_id = span.get('spanId') or span.get('span_id') or span.get('spanID')
+        if not span_id:
+            continue
+
+        parent_id = span.get('parentSpanId') or span.get('parent_span_id') or span.get('parentSpanID')
+        if parent_id and parent_id in span_map:
+            G.add_edge(parent_id, span_id)
+            print(f"[DEBUG] Added edge: {parent_id} → {span_id}")
+
+    print(f"[DEBUG] Graph created: {G.number_of_nodes()} nodes, {G.number_of_edges()} edges")
+
+    if G.number_of_nodes() == 0:
+        # Return empty figure with message
+        fig = go.Figure()
+        fig.add_annotation(
+            text="No valid spans to display",
+            xref="paper", yref="paper",
+            x=0.5, y=0.5, xanchor='center', yanchor='middle',
+            showarrow=False,
+            font=dict(size=20)
+        )
+        return fig
+
+    # Calculate layout using hierarchical layout
+    try:
+        # Try to use hierarchical layout (for DAGs)
+        pos = nx.spring_layout(G, k=2, iterations=50, seed=42)
+
+        # If graph is a DAG, use hierarchical layout
+        if nx.is_directed_acyclic_graph(G):
+            # Get levels using longest_path_length
+            levels = {}
+            for node in G.nodes():
+                # Find longest path from any root to this node
+                try:
+                    # Get all paths from roots to this node
+                    roots = [n for n in G.nodes() if G.in_degree(n) == 0]
+                    max_depth = 0
+                    for root in roots:
+                        if nx.has_path(G, root, node):
+                            paths = list(nx.all_simple_paths(G, root, node))
+                            max_depth = max(max_depth, max(len(p) for p in paths) if paths else 0)
+                    levels[node] = max_depth
+                except:
+                    levels[node] = 0
+
+            # Create hierarchical layout
+            pos = create_hierarchical_layout(G, levels)
+    except Exception as e:
+        print(f"[DEBUG] Layout calculation error: {e}")
+        # Fallback to circular layout
+        pos = nx.circular_layout(G)
+
+    # Extract node positions
+    node_x = []
+    node_y = []
+    node_text = []
+    node_colors = []
+    node_sizes = []
+    hover_text = []
+
+    for node in G.nodes():
+        x, y = pos[node]
+        node_x.append(x)
+        node_y.append(y)
+
+        # Get node data
+        node_data = G.nodes[node]
+        name = node_data.get('name', 'Unknown')
+        kind = node_data.get('kind', 'INTERNAL')
+
+        # Create label (shortened)
+        label = shorten_label(name, max_length=20)
+        node_text.append(label)
+
+        # Assign color based on kind
+        color = get_node_color(kind, node_data.get('status', 'OK'))
+        node_colors.append(color)
+
+        # Size based on importance (LLM and AGENT nodes are larger)
+        size = 40 if kind in ['LLM', 'AGENT', 'CHAIN'] else 30
+        node_sizes.append(size)
+
+        # Create detailed hover text
+        hover = f"<b>{name}</b><br>"
+        hover += f"Type: {kind}<br>"
+        hover += f"Status: {node_data.get('status', 'OK')}<br>"
+
+        if 'model' in node_data:
+            hover += f"Model: {node_data['model']}<br>"
+        if 'tool_name' in node_data:
+            hover += f"Tool: {node_data['tool_name']}<br>"
+        if 'prompt_tokens' in node_data or 'completion_tokens' in node_data:
+            prompt = node_data.get('prompt_tokens', 0)
+            completion = node_data.get('completion_tokens', 0)
+            hover += f"Tokens: {prompt + completion} (p:{prompt}, c:{completion})<br>"
+        if 'cost' in node_data and node_data['cost'] is not None:
+            hover += f"Cost: ${node_data['cost']:.6f}<br>"
+
+        hover_text.append(hover)
+
+    # Extract edges
+    edge_x = []
+    edge_y = []
+    edge_traces = []
+
+    print(f"[DEBUG] Drawing {G.number_of_edges()} edges")
+    for edge in G.edges():
+        x0, y0 = pos[edge[0]]
+        x1, y1 = pos[edge[1]]
+        print(f"[DEBUG] Edge from ({x0:.2f}, {y0:.2f}) to ({x1:.2f}, {y1:.2f})")
+
+        # Create edge line (make it thicker and darker for visibility)
+        edge_trace = go.Scatter(
+            x=[x0, x1, None],
+            y=[y0, y1, None],
+            mode='lines',
+            line=dict(width=3, color='#555'),  # Increased width from 2 to 3, darker color
+            hoverinfo='none',
+            showlegend=False
+        )
+        edge_traces.append(edge_trace)
+
+        # Add arrow annotation
+        edge_traces.append(create_arrow_annotation(x0, y0, x1, y1))
+
+    # Create node trace
+    node_trace = go.Scatter(
+        x=node_x,
+        y=node_y,
+        mode='markers+text',
+        marker=dict(
+            size=node_sizes,
+            color=node_colors,
+            line=dict(width=2, color='white')
+        ),
+        text=node_text,
+        textposition='bottom center',
+        textfont=dict(size=10, color='#333'),
+        hovertext=hover_text,
+        hoverinfo='text',
+        showlegend=False
+    )
+
+    # Create figure
+    fig = go.Figure(data=edge_traces + [node_trace])
+
+    # Update layout with better visibility settings
+    fig.update_layout(
+        title={
+            'text': f"🧠 Agent Thought Graph: {trace_id}",
+            'x': 0.5,
+            'xanchor': 'center',
+            'font': {'size': 20}
+        },
+        showlegend=False,
+        hovermode='closest',
+        margin=dict(t=100, b=40, l=40, r=40),
+        height=600,
+        xaxis=dict(
+            showgrid=False,
+            zeroline=False,
+            showticklabels=False,
+            range=[-0.1, 1.1]  # Add padding to see edges at boundaries
+        ),
+        yaxis=dict(
+            showgrid=False,
+            zeroline=False,
+            showticklabels=False,
+            range=[-0.1, 1.1]  # Add padding to see edges at boundaries
+        ),
+        plot_bgcolor='white',  # Pure white background for maximum contrast
+        paper_bgcolor='#f8f9fa',  # Light gray paper
+        annotations=[
+            dict(
+                text="💡 Hover over nodes to see details | Arrows show execution flow",
+                xref="paper", yref="paper",
+                x=0.5, y=-0.05, xanchor='center', yanchor='top',
+                showarrow=False,
+                font=dict(size=11, color='#666')
+            )
+        ]
+    )
+
+    # Add legend for node types
+    legend_items = create_legend_items()
+    fig.add_annotation(
+        text=legend_items,
+        xref="paper", yref="paper",
+        x=1.0, y=1.0, xanchor='right', yanchor='top',
+        showarrow=False,
+        font=dict(size=10),
+        align='left',
+        bgcolor='white',
+        bordercolor='#ccc',
+        borderwidth=1,
+        borderpad=8
+    )
+
+    return fig
+
+
+def create_hierarchical_layout(G: nx.DiGraph, levels: Dict[str, int]) -> Dict[str, Tuple[float, float]]:
+    """Create a hierarchical layout for the graph"""
+    pos = {}
+
+    # Group nodes by level
+    level_nodes = {}
+    for node, level in levels.items():
+        if level not in level_nodes:
+            level_nodes[level] = []
+        level_nodes[level].append(node)
+
+    # Assign positions
+    max_level = max(levels.values()) if levels else 0
+    for level, nodes in level_nodes.items():
+        y = 1.0 - (level / max(max_level, 1))  # Top to bottom
+        num_nodes = len(nodes)
+        for i, node in enumerate(nodes):
+            x = (i + 1) / (num_nodes + 1)  # Spread evenly
+            pos[node] = (x, y)
+
+    return pos
+
+
+def get_node_color(kind: str, status: str) -> str:
+    """Get color for node based on kind and status"""
+
+    # Error status overrides kind color
+    if status == 'ERROR':
+        return '#DC143C'  # Crimson
+
+    # Color by kind
+    color_map = {
+        'LLM': '#9B59B6',        # Purple
+        'AGENT': '#1ABC9C',      # Turquoise
+        'CHAIN': '#3498DB',      # Light Blue
+        'TOOL': '#E67E22',       # Orange
+        'RETRIEVER': '#F39C12',  # Yellow-Orange
+        'EMBEDDING': '#8E44AD',  # Dark Purple
+        'CLIENT': '#4169E1',     # Royal Blue
+        'SERVER': '#2E8B57',     # Sea Green
+        'INTERNAL': '#95A5A6',   # Gray
+    }
+
+    return color_map.get(kind, '#4682B4')  # Steel Blue default
+
+
+def shorten_label(text: str, max_length: int = 20) -> str:
+    """Shorten label for display"""
+    if len(text) <= max_length:
+        return text
+    return text[:max_length-3] + '...'
+
+
+def create_arrow_annotation(x0: float, y0: float, x1: float, y1: float) -> go.Scatter:
+    """Create an arrow annotation between two points"""
+    # Calculate arrow position (70% along the line, closer to end)
+    arrow_x = x0 + 0.7 * (x1 - x0)
+    arrow_y = y0 + 0.7 * (y1 - y0)
+
+    # Calculate angle for arrow direction
+    import math
+    angle = math.atan2(y1 - y0, x1 - x0)
+
+    # Create arrow head (larger and more visible)
+    arrow_size = 0.03  # Increased from 0.02
+    arrow_dx = arrow_size * math.cos(angle + 2.8)
+    arrow_dy = arrow_size * math.sin(angle + 2.8)
+
+    arrow_trace = go.Scatter(
+        x=[arrow_x - arrow_dx, arrow_x, arrow_x + arrow_size * math.cos(angle - 2.8)],
+        y=[arrow_y - arrow_dy, arrow_y, arrow_y + arrow_size * math.sin(angle - 2.8)],
+        mode='lines',
+        line=dict(width=2, color='#555'),  # Match edge color
+        fill='toself',
+        fillcolor='#555',  # Darker fill color
+        hoverinfo='none',
+        showlegend=False
+    )
+
+    return arrow_trace
+
+
+def create_legend_items() -> str:
+    """Create HTML legend for node types"""
+    legend = "<b>Node Types:</b><br>"
+    legend += "🟣 LLM Call<br>"
+    legend += "🟠 Tool Call<br>"
+    legend += "🔵 Chain/Agent<br>"
+    legend += "⚪ Other<br>"
+    legend += "🔴 Error"
+    return legend
screens/trace_detail.py ADDED
@@ -0,0 +1,721 @@
1
+ """
2
+ Screen 4: Trace Detail View
3
+ Shows detailed OpenTelemetry trace visualization
4
+ """
5
+
6
+ import gradio as gr
7
+ import plotly.graph_objects as go
8
+ from plotly.subplots import make_subplots
9
+ from datetime import datetime
10
+ import pandas as pd
11
+ from typing import Optional, Callable, Dict, Any, List
12
+ from components.thought_graph import create_thought_graph
13
+
14
+
15
+ def create_trace_detail_screen(
16
+ trace_data: dict,
17
+ on_back: Optional[Callable] = None,
18
+ mcp_qa_enabled: bool = True
19
+ ) -> gr.Blocks:
20
+ """
21
+ Create the trace detail screen UI
22
+
23
+ Args:
24
+ trace_data: OpenTelemetry trace data
25
+ on_back: Callback for back button
26
+ mcp_qa_enabled: Enable MCP Q&A tool
27
+
28
+ Returns:
29
+ Gradio Blocks for trace detail screen
30
+ """
31
+
32
+ with gr.Blocks() as trace_detail:
33
+ with gr.Row():
34
+ if on_back:
35
+ back_btn = gr.Button("⬅️ Back to Run Detail", variant="secondary", size="sm")
36
+
37
+ gr.Markdown(f"# 🔍 Trace Detail: {trace_data.get('trace_id', 'Unknown')}")
38
+
39
+ # Safely extract spans
40
+ spans = trace_data.get('spans', [])
41
+ if hasattr(spans, 'tolist'):
42
+ spans = spans.tolist()
43
+ elif not isinstance(spans, list):
44
+ spans = list(spans) if spans is not None else []
45
+
46
+ # Trace metadata
47
+ with gr.Row():
48
+ gr.Markdown(f"""
49
+ **Trace ID:** `{trace_data.get('trace_id', 'N/A')}`
50
+ **Total Spans:** {len(spans)}
51
+ """)
52
+
53
+ # Tabs for different visualizations
54
+ with gr.Tabs() as tabs:
55
+ # Tab 1: Thought Graph (STAR FEATURE!)
56
+ with gr.Tab("🧠 Thought Graph"):
57
+ gr.Markdown("""
58
+ ### Agent Reasoning Flow
59
+ This graph visualizes how your agent thinks - showing the flow of reasoning steps,
60
+ tool calls, and LLM interactions as a network.
61
+
62
+ **Node Colors:**
63
+ - 🟣 Purple: LLM reasoning steps
64
+ - 🟠 Orange: Tool calls
65
+ - 🔵 Blue: Chains/Agents
66
+ - 🔴 Red: Errors
67
+ """)
68
+
69
+ # Create and display thought graph
70
+ thought_graph_plot = gr.Plot(
71
+ value=create_thought_graph(spans, trace_data.get('trace_id', 'Unknown')),
72
+ label=""
73
+ )
74
+
75
+ # Tab 2: Execution Timeline (Waterfall)
76
+ with gr.Tab("⏱️ Execution Timeline"):
77
+ gr.Markdown("""
78
+ ### Waterfall Chart
79
+ Timeline view showing when each span executed and for how long.
80
+ """)
81
+
82
+ # Span visualization
83
+ span_viz = gr.Plot(
84
+ value=create_span_visualization(spans, trace_data.get('trace_id', 'Unknown')),
85
+ label=""
86
+ )
87
+
88
+ # Tab 3: Span Details
89
+ with gr.Tab("📋 Span Details"):
90
+ gr.Markdown("""
91
+ ### Detailed Span Information
92
+ Raw span data with attributes, status, and metadata.
93
+ """)
94
+
95
+ # Span details table
96
+ span_table = create_span_table(spans)
97
+
98
+ # MCP Q&A Tool (below tabs)
99
+ gr.Markdown("---")
100
+ if mcp_qa_enabled:
101
+ with gr.Accordion("🤖 Ask About This Trace", open=False):
102
+ question_input = gr.Textbox(
103
+ label="Question",
104
+ placeholder="e.g., Why was the tool called twice? What tool did the agent use first?",
105
+ lines=2
106
+ )
107
+ ask_btn = gr.Button("Ask", variant="primary")
108
+ answer_output = gr.Markdown("*Ask a question to get AI-powered insights*")
109
+
110
+ # Wire up MCP Q&A (placeholder for now)
111
+ ask_btn.click(
112
+ fn=lambda q: f"**Answer:** This is a placeholder. MCP integration coming soon.\n\n**Your question:** {q}",
113
+ inputs=[question_input],
114
+ outputs=[answer_output]
115
+ )
116
+
117
+ # Wire up events
118
+ if on_back:
119
+ back_btn.click(fn=on_back, inputs=[], outputs=[])
120
+
121
+ return trace_detail
122
+
123
+
124
+ def process_trace_data(spans: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
125
+ """Process trace spans for waterfall visualization"""
126
+ # Ensure spans is a list
127
+ if hasattr(spans, 'tolist'):
128
+ spans = spans.tolist()
129
+ elif not isinstance(spans, list):
130
+ spans = list(spans) if spans is not None else []
131
+
132
+ if not spans:
133
+ return []
134
+
135
+ # Helper function to get timestamp from span (handles different field names)
136
+ def get_timestamp(span, field_name):
137
+ """Get timestamp handling different OpenTelemetry field name variations"""
138
+ # Try different variations of field names
139
+ variations = [
140
+ field_name, # e.g., 'startTime'
141
+ field_name.lower(), # e.g., 'starttime'
142
+ field_name.replace('Time', 'TimeUnixNano'), # e.g., 'startTimeUnixNano'
143
+ field_name[0].lower() + field_name[1:], # e.g., 'startTime'
144
+ # Add snake_case variations (start_time, end_time)
145
+ field_name.replace('Time', '_time').lower(), # e.g., 'start_time'
146
+ field_name.replace('Time', '_time_unix_nano').lower(), # e.g., 'start_time_unix_nano'
147
+ ]
148
+
149
+ for var in variations:
150
+ if var in span:
151
+ value = span[var]
152
+ # Handle both string and numeric timestamps
153
+ if isinstance(value, str):
154
+ return int(value)
155
+ return value
156
+
157
+ # If not found, return 0
158
+ return 0
159
+
160
+ # Calculate relative times
161
+ start_times = [get_timestamp(span, 'startTime') for span in spans]
162
+ min_start = min(start_times) if start_times else 0
163
+ max_start = max(start_times) if start_times else 0
164
+
165
+ # Check if we have any actual timing data
166
+ has_timing_data = min_start > 0 or max_start > 0
167
+
168
+ # Debug: Print first span's raw timestamps
169
+ if spans:
170
+ first_span = spans[0]
171
+ print(f"[DEBUG] First span raw data sample:")
172
+ print(f" startTime field: {first_span.get('startTime', 'NOT FOUND')}")
173
+ print(f" endTime field: {first_span.get('endTime', 'NOT FOUND')}")
174
+ print(f" startTimeUnixNano field: {first_span.get('startTimeUnixNano', 'NOT FOUND')}")
175
+ print(f" endTimeUnixNano field: {first_span.get('endTimeUnixNano', 'NOT FOUND')}")
176
+ print(f" HAS_TIMING_DATA: {has_timing_data}")
177
+ if 'attributes' in first_span:
178
+ attrs = first_span['attributes']
179
+ print(f" Sample attributes: {list(attrs.keys())[:5] if isinstance(attrs, dict) else 'N/A'}")
180
+ if isinstance(attrs, dict):
181
+ # Check for cost fields
182
+ cost_fields = [k for k in attrs.keys() if 'cost' in k.lower() or 'price' in k.lower()]
183
+ if cost_fields:
184
+ print(f" Cost-related fields found: {cost_fields}")
185
+
186
+ # Auto-detect timestamp unit based on magnitude
187
+ time_divisor = 1000000 # Default: assume nanoseconds, convert to milliseconds
188
+ if start_times and min_start > 0:
189
+ # If timestamp is > 1e15, it's likely nanoseconds
190
+ # If timestamp is > 1e12, it's likely microseconds
191
+ # If timestamp is > 1e9, it's likely milliseconds
192
+ # If timestamp is < 1e9, it's likely seconds
193
+ if min_start > 1e15:
194
+ time_divisor = 1000000 # nanoseconds to milliseconds
195
+ time_unit = "nanoseconds"
196
+ elif min_start > 1e12:
197
+ time_divisor = 1000 # microseconds to milliseconds
198
+ time_unit = "microseconds"
199
+ elif min_start > 1e9:
200
+ time_divisor = 1 # already in milliseconds
201
+ time_unit = "milliseconds"
202
+ else:
203
+ time_divisor = 0.001 # seconds to milliseconds
204
+ time_unit = "seconds"
205
+ print(f"[DEBUG] Auto-detected timestamp unit: {time_unit} (min_start={min_start}, divisor={time_divisor})")
206
+
207
+ processed_spans = []
208
+ for idx, span in enumerate(spans):
209
+ start_time = get_timestamp(span, 'startTime')
210
+ end_time = get_timestamp(span, 'endTime')
211
+
212
+ # Calculate relative start
213
+ relative_start = (start_time - min_start) / time_divisor if has_timing_data else 0
214
+
215
+ # Calculate duration - prefer duration_ms if available
216
+ if 'duration_ms' in span and span['duration_ms'] is not None:
217
+ actual_duration = float(span['duration_ms'])
218
+ else:
219
+ actual_duration = (end_time - start_time) / time_divisor
220
+
221
+ # Debug: Print first few durations
222
+ if idx < 3:
223
+ duration_source = 'duration_ms' if 'duration_ms' in span else 'calculated'
224
+ print(f"[DEBUG] Span {idx}: start={start_time}, end={end_time}, duration={actual_duration:.3f}ms ({duration_source})")
225
+
226
+ # Handle span ID variations
227
+ span_id = span.get('spanId') or span.get('span_id') or span.get('spanID') or f'span_{idx}'
228
+ parent_id = span.get('parentSpanId') or span.get('parent_span_id') or span.get('parentSpanID')
229
+
230
+ # Get span kind - check both top-level and OpenInference attributes
231
+ span_kind = span.get('kind', 'INTERNAL')
232
+ attributes = span.get('attributes', {})
233
+
234
+ # Check for OpenInference span kind in attributes
235
+ if isinstance(attributes, dict) and 'openinference.span.kind' in attributes:
236
+ openinference_kind = attributes.get('openinference.span.kind')
237
+ # Map OpenInference kinds to OpenTelemetry kinds for consistency
238
+ # OpenInference kinds: CHAIN, TOOL, LLM, RETRIEVER, EMBEDDING, AGENT, etc.
239
+ if openinference_kind:
240
+ span_kind = openinference_kind.upper()
241
+
242
+ # Extract token and cost information from attributes
243
+ token_info = {}
244
+ cost_info = {}
245
+ if isinstance(attributes, dict):
246
+ # Helper to safely extract numeric values
247
+ def safe_numeric(value):
248
+ """Safely convert to numeric, return None if invalid"""
249
+ if value is None:
250
+ return None
251
+ try:
252
+ if isinstance(value, (int, float)):
253
+ return value
254
+ return float(value)
255
+ except (ValueError, TypeError):
256
+ return None
257
+
258
+ # Check for token usage (various formats)
259
+ prompt_tokens = None
260
+ completion_tokens = None
261
+
262
+ if 'gen_ai.usage.prompt_tokens' in attributes:
263
+ prompt_tokens = safe_numeric(attributes['gen_ai.usage.prompt_tokens'])
264
+ if 'gen_ai.usage.completion_tokens' in attributes:
265
+ completion_tokens = safe_numeric(attributes['gen_ai.usage.completion_tokens'])
266
+ if 'llm.token_count.prompt' in attributes and prompt_tokens is None:
267
+ prompt_tokens = safe_numeric(attributes['llm.token_count.prompt'])
268
+ if 'llm.token_count.completion' in attributes and completion_tokens is None:
269
+ completion_tokens = safe_numeric(attributes['llm.token_count.completion'])
270
+
271
+ # Store valid token counts
272
+ if prompt_tokens is not None:
273
+ token_info['prompt_tokens'] = int(prompt_tokens)
274
+ if completion_tokens is not None:
275
+ token_info['completion_tokens'] = int(completion_tokens)
276
+
277
+ # Calculate total tokens
278
+ if 'prompt_tokens' in token_info and 'completion_tokens' in token_info:
279
+ token_info['total_tokens'] = token_info['prompt_tokens'] + token_info['completion_tokens']
280
+ elif 'llm.usage.total_tokens' in attributes:
281
+ total = safe_numeric(attributes['llm.usage.total_tokens'])
282
+ if total is not None:
283
+ token_info['total_tokens'] = int(total)
284
+
285
+ # Check for cost information (various formats)
286
+ if 'gen_ai.usage.cost.total' in attributes:
287
+ cost = safe_numeric(attributes['gen_ai.usage.cost.total'])
288
+ if cost is not None:
289
+ cost_info['total_cost'] = cost
290
+ elif 'llm.usage.cost' in attributes:
291
+ cost = safe_numeric(attributes['llm.usage.cost'])
292
+ if cost is not None:
293
+ cost_info['total_cost'] = cost
294
+
295
+ # Debug: Print cost info for LLM spans
296
+ if idx < 2 and span_kind == 'LLM':
297
+ print(f"[DEBUG] LLM Span {idx} cost extraction:")
298
+ print(f" gen_ai.usage.cost.total: {attributes.get('gen_ai.usage.cost.total', 'NOT FOUND')}")
299
+ print(f" llm.usage.cost: {attributes.get('llm.usage.cost', 'NOT FOUND')}")
300
+ print(f" cost_info: {cost_info}")
301
+
302
+ # Store actual duration for tooltip, use minimum for visualization
303
+ display_duration = max(actual_duration, 0.1) # Minimum width for visibility
304
+
305
+ processed_spans.append({
306
+ 'span_id': span_id,
307
+ 'parent_id': parent_id,
308
+ 'name': span.get('name', 'Unknown'),
309
+ 'kind': span_kind,
310
+ 'start_time': relative_start,
311
+ 'duration': display_duration, # For bar width
312
+ 'actual_duration': actual_duration, # For tooltip
313
+ 'end_time': relative_start + actual_duration, # Use actual for end time
314
+ 'attributes': attributes,
315
+ 'status': span.get('status', {}).get('code', 'UNKNOWN'),
316
+ 'tokens': token_info,
317
+ 'cost': cost_info
318
+ })
319
+
320
+ print(f"[DEBUG] Total spans in input: {len(spans)}")
321
+ print(f"[DEBUG] Processed spans: {len(processed_spans)}")
322
+
323
+ # Debug: Show span kinds and statuses detected
324
+ span_kinds = {}
325
+ span_statuses = {}
326
+ durations = []
327
+ spans_with_tokens = 0
328
+ spans_with_cost = 0
329
+ for span in processed_spans:
330
+ kind = span['kind']
331
+ status = span['status']
332
+ span_kinds[kind] = span_kinds.get(kind, 0) + 1
333
+ span_statuses[status] = span_statuses.get(status, 0) + 1
334
+ durations.append(span['actual_duration'])
335
+ if span['tokens']:
336
+ spans_with_tokens += 1
337
+ if span['cost']:
338
+ spans_with_cost += 1
339
+
340
+ print(f"[DEBUG] Span kinds detected: {span_kinds}")
341
+ print(f"[DEBUG] Span statuses detected: {span_statuses}")
342
+ if durations:
343
+ print(f"[DEBUG] Duration range: {min(durations):.3f}ms - {max(durations):.3f}ms")
344
+ print(f"[DEBUG] Spans with token info: {spans_with_tokens}/{len(processed_spans)}")
345
+ print(f"[DEBUG] Spans with cost info: {spans_with_cost}/{len(processed_spans)}")
346
+
347
+ return processed_spans
348
+
349
+
350
+ def create_span_visualization(spans: List[Dict[str, Any]], trace_id: str = "Unknown") -> go.Figure:
+     """Create an interactive Plotly waterfall visualization of spans"""
+     processed_spans = process_trace_data(spans)
+
+     print(f"[DEBUG] create_span_visualization - Received {len(spans)} spans")
+     print(f"[DEBUG] create_span_visualization - Processed {len(processed_spans)} spans")
+
+     if not processed_spans:
+         # Return an empty figure with an explanatory message
+         fig = go.Figure()
+         fig.add_annotation(
+             text="No spans to display",
+             xref="paper", yref="paper",
+             x=0.5, y=0.5, xanchor='center', yanchor='middle',
+             showarrow=False,
+             font=dict(size=20)
+         )
+         return fig
+
+     # Sort spans by start time for better visualization
+     processed_spans.sort(key=lambda x: x['start_time'])
+
+     # Create unique labels for each span (append the index to guarantee uniqueness)
+     for idx, span in enumerate(processed_spans):
+         span['display_name'] = f"{span['name']} [{idx}]"
+
+     # Color each span by status and kind; error status always wins.
+     # Kind colors cover both OpenTelemetry and OpenInference span kinds.
+     kind_colors = {
+         'SERVER': '#2E8B57',     # Sea Green
+         'CLIENT': '#4169E1',     # Royal Blue
+         'LLM': '#9B59B6',        # Purple for LLM calls
+         'TOOL': '#E67E22',       # Orange for tool calls
+         'CHAIN': '#3498DB',      # Light Blue for chains
+         'AGENT': '#1ABC9C',      # Turquoise for agents
+         'RETRIEVER': '#F39C12',  # Yellow-orange for retrievers
+         'EMBEDDING': '#8E44AD',  # Dark Purple for embeddings
+     }
+     colors = []
+     color_map = {}  # Track which color is assigned to each kind
+     for span in processed_spans:
+         status = span['status']
+         kind = span['kind']
+
+         # Only show red for actual errors (ERROR status)
+         if status == 'ERROR':
+             color = '#DC143C'  # Crimson for errors
+         else:
+             color = kind_colors.get(kind, '#4682B4')  # Steel Blue for INTERNAL/unknown
+
+         colors.append(color)
+         if kind not in color_map:
+             color_map[kind] = color
+
+     print(f"[DEBUG] Color assignments: {color_map}")
+
+     # Create the waterfall chart
+     fig = go.Figure()
+
+     # Prepare custom data for hover tooltips
+     customdata = []
+     for span in processed_spans:
+         # Build the token info string
+         token_str = ""
+         if span['tokens']:
+             tokens = span['tokens']
+             if 'total_tokens' in tokens:
+                 token_str = f"<br>Tokens: {tokens['total_tokens']}"
+                 if 'prompt_tokens' in tokens and 'completion_tokens' in tokens:
+                     token_str += f" (prompt: {tokens['prompt_tokens']}, completion: {tokens['completion_tokens']})"
+             elif 'prompt_tokens' in tokens or 'completion_tokens' in tokens:
+                 parts = []
+                 if 'prompt_tokens' in tokens:
+                     parts.append(f"prompt: {tokens['prompt_tokens']}")
+                 if 'completion_tokens' in tokens:
+                     parts.append(f"completion: {tokens['completion_tokens']}")
+                 token_str = f"<br>Tokens: {', '.join(parts)}"
+
+         # Build the cost info string
+         cost_str = ""
+         if span['cost'] and 'total_cost' in span['cost']:
+             cost_str = f"<br>Cost: ${span['cost']['total_cost']:.6f}"
+
+         customdata.append([
+             span['name'],
+             span['kind'],
+             span['span_id'],
+             span['end_time'],
+             span['actual_duration'],  # Show the actual duration, not the display duration
+             token_str,
+             cost_str
+         ])
+
+     # Add bars for each span (use display_name for unique y-axis labels)
+     fig.add_trace(go.Bar(
+         y=[span['display_name'] for span in processed_spans],
+         x=[span['duration'] for span in processed_spans],  # Display duration (min 0.1 ms)
+         base=[span['start_time'] for span in processed_spans],
+         orientation='h',
+         marker_color=colors,
+         hovertemplate=(
+             "<b>%{customdata[0]}</b><br>" +
+             "Type: %{customdata[1]}<br>" +
+             "Span ID: %{customdata[2]}<br>" +
+             "Duration: %{customdata[4]:.3f} ms<br>" +  # Actual duration, 3 decimal places
+             "Start: %{base:.2f} ms<br>" +
+             "End: %{customdata[3]:.2f} ms" +
+             "%{customdata[5]}" +  # Token info (already formatted)
+             "%{customdata[6]}" +  # Cost info (already formatted)
+             "<extra></extra>"
+         ),
+         customdata=customdata,
+         name="Spans"
+     ))
+
+     # Update the layout for better readability
+     fig.update_layout(
+         title={
+             'text': f"OpenTelemetry Trace: {trace_id}",
+             'x': 0.5,
+             'xanchor': 'center'
+         },
+         xaxis_title="Time (milliseconds)",
+         yaxis_title="Spans",
+         showlegend=False,
+         height=400 + len(processed_spans) * 30,  # Dynamic height based on span count
+         bargap=0.2,
+         hovermode='closest'
+     )
+
+     return fig
+
+
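For reference, a minimal smoke test for the waterfall chart. The field names (`spanId`, `startTime`/`endTime` as Unix-epoch nanoseconds, `status.code`) follow the naming variations handled by `process_trace_data` and `create_span_table`; real spans from an OpenTelemetry exporter may differ, so treat this as an illustrative sketch only:

```python
# Hypothetical raw spans; timestamps are Unix-epoch nanoseconds
sample_spans = [
    {"spanId": "a1", "parentSpanId": None, "name": "agent.run",
     "kind": "AGENT", "startTime": 1_700_000_000_000_000_000,
     "endTime": 1_700_000_000_050_000_000,
     "status": {"code": "OK"}, "attributes": {}},
    {"spanId": "b2", "parentSpanId": "a1", "name": "llm.call",
     "kind": "LLM", "startTime": 1_700_000_000_010_000_000,
     "endTime": 1_700_000_000_040_000_000,
     "status": {"code": "OK"}, "attributes": {}},
]

fig = create_span_visualization(sample_spans, trace_id="demo-trace")
fig.show()  # in the app, the figure is routed to a gr.Plot output instead
```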
+ def create_span_table(spans: List[Dict[str, Any]]) -> gr.JSON:
+     """Create a simplified span details view as a Gradio JSON component"""
+
+     # Ensure spans is a list
+     if hasattr(spans, 'tolist'):
+         spans = spans.tolist()
+     elif not isinstance(spans, list):
+         spans = list(spans) if spans is not None else []
+
+     # Helper to read a timestamp under several naming conventions
+     # (same approach as in process_trace_data)
+     def get_timestamp(span, field_name):
+         variations = [
+             field_name,
+             field_name.lower(),
+             field_name.replace('Time', 'TimeUnixNano'),
+             field_name[0].lower() + field_name[1:],
+         ]
+         for var in variations:
+             if var in span:
+                 value = span[var]
+                 if isinstance(value, str):
+                     return int(value)
+                 return value
+         return 0
+
+     # Simplify span data for display
+     simplified_spans = []
+     for span in spans:
+         start_time = get_timestamp(span, 'startTime')
+         end_time = get_timestamp(span, 'endTime')
+         # Timestamps are nanoseconds; convert the difference to milliseconds
+         duration_ms = (end_time - start_time) / 1_000_000 if (end_time and start_time) else 0
+
+         # Handle span ID naming variations
+         span_id = span.get('spanId') or span.get('span_id') or span.get('spanID') or 'N/A'
+         parent_id = span.get('parentSpanId') or span.get('parent_span_id') or span.get('parentSpanID') or 'root'
+
+         simplified_spans.append({
+             "Span ID": span_id,
+             "Parent": parent_id,
+             "Name": span.get('name', 'N/A'),
+             "Kind": span.get('kind', 'N/A'),
+             "Duration (ms)": round(duration_ms, 2),
+             "Attributes": span.get('attributes', {}),
+             "Status": span.get('status', {}).get('code', 'UNKNOWN')
+         })
+
+     return gr.JSON(value=simplified_spans, label="Span Details")
+
+
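The same hypothetical spans can exercise the JSON view:

```python
span_json = create_span_table(sample_spans)
# span_json is a gr.JSON component; in the app it backs the
# "Span Details" tab rather than being shown standalone
```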
+ # GPU Metrics Visualization Functions
+
+ def extract_metrics_data(metrics_df):
+     """
+     Extract and prepare GPU metrics data for visualization
+
+     Args:
+         metrics_df: DataFrame with a flat metrics structure (from the HuggingFace dataset)
+                     Expected columns: timestamp, gpu_utilization_percent, gpu_memory_used_mib,
+                     gpu_temperature_celsius, gpu_power_watts, co2_emissions_gco2e
+
+     Returns:
+         DataFrame ready for visualization
+     """
+     if metrics_df is None or metrics_df.empty:
+         return pd.DataFrame()
+
+     # Work on a copy so the caller's DataFrame is not mutated
+     metrics_df = metrics_df.copy()
+
+     # Ensure timestamp is datetime
+     if 'timestamp' in metrics_df.columns:
+         if not pd.api.types.is_datetime64_any_dtype(metrics_df['timestamp']):
+             metrics_df['timestamp'] = pd.to_datetime(metrics_df['timestamp'])
+
+     # Sort by timestamp
+     metrics_df = metrics_df.sort_values('timestamp')
+
+     return metrics_df
+
+
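A hypothetical metrics frame matching the flat column names above (real data comes from the HuggingFace dataset) can be used to exercise the GPU visualizations:

```python
import pandas as pd

metrics_df = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01 12:00:00", periods=5, freq="s"),
    "gpu_utilization_percent": [55.0, 72.5, 88.0, 90.5, 84.0],
    "gpu_memory_used_mib": [10240, 11264, 12288, 12300, 12100],
    "gpu_memory_total_mib": [24576] * 5,
    "gpu_temperature_celsius": [61, 64, 68, 70, 69],
    "gpu_power_watts": [180.0, 220.5, 260.0, 265.5, 240.0],
    "co2_emissions_gco2e": [0.0012, 0.0025, 0.0039, 0.0054, 0.0068],
})

df = extract_metrics_data(metrics_df)  # timestamps parsed and sorted
```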
+ def create_gpu_summary_cards(df):
+     """
+     Create summary cards for GPU metrics
+
+     Args:
+         df: DataFrame with a flat metrics structure (columns: gpu_utilization_percent, etc.)
+
+     Returns:
+         HTML string with summary cards
+     """
+     if df is None or df.empty:
+         return "<div style='padding: 20px; text-align: center;'>⚠️ No GPU metrics available (expected for API models)</div>"
+
+     # Get the latest row (assumes df is sorted by timestamp)
+     latest = df.iloc[-1]
+
+     # Extract values (with safe fallbacks)
+     utilization = latest.get('gpu_utilization_percent', 0)
+     memory_used = latest.get('gpu_memory_used_mib', 0)
+     temperature = latest.get('gpu_temperature_celsius', 0)
+     co2_emissions = latest.get('co2_emissions_gco2e', 0)
+     power = latest.get('gpu_power_watts', 0)
+
+     # Also get total memory, if available, to compute a percentage
+     memory_total = latest.get('gpu_memory_total_mib', 0)
+     memory_percent = (memory_used / memory_total * 100) if memory_total > 0 else 0
+
+     cards_html = f"""
+     <div style="display: grid; grid-template-columns: repeat(4, 1fr); gap: 15px; margin: 20px 0;">
+         <div style="background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); padding: 20px; border-radius: 10px; color: white; text-align: center;">
+             <h3 style="margin: 0 0 10px 0; font-size: 1em;">GPU Utilization</h3>
+             <h2 style="margin: 0; font-size: 2em;">{utilization:.1f}%</h2>
+         </div>
+         <div style="background: linear-gradient(135deg, #f093fb 0%, #f5576c 100%); padding: 20px; border-radius: 10px; color: white; text-align: center;">
+             <h3 style="margin: 0 0 10px 0; font-size: 1em;">GPU Memory</h3>
+             <h2 style="margin: 0; font-size: 2em;">{memory_used:.0f} MiB</h2>
+             <p style="margin: 5px 0 0 0; font-size: 0.8em; opacity: 0.9;">{memory_percent:.1f}% of {memory_total:.0f} MiB</p>
+         </div>
+         <div style="background: linear-gradient(135deg, #4facfe 0%, #00f2fe 100%); padding: 20px; border-radius: 10px; color: white; text-align: center;">
+             <h3 style="margin: 0 0 10px 0; font-size: 1em;">GPU Temperature</h3>
+             <h2 style="margin: 0; font-size: 2em;">{temperature:.0f}°C</h2>
+         </div>
+         <div style="background: linear-gradient(135deg, #43e97b 0%, #38f9d7 100%); padding: 20px; border-radius: 10px; color: white; text-align: center;">
+             <h3 style="margin: 0 0 10px 0; font-size: 1em;">CO2 Emissions</h3>
+             <h2 style="margin: 0; font-size: 2em;">{co2_emissions:.4f} g</h2>
+             <p style="margin: 5px 0 0 0; font-size: 0.8em; opacity: 0.9;">Power: {power:.1f} W</p>
+         </div>
+     </div>
+     """
+
+     return cards_html
+
+
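Given the `df` built above, the cards render as a single HTML string:

```python
cards_html = create_gpu_summary_cards(df)
# Rendered via a gr.HTML component in the GPU Metrics tab, e.g.:
# gpu_summary = gr.HTML(value=cards_html)
```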
+ def create_gpu_metrics_dashboard(metrics_df):
+     """
+     Create a combined dashboard of GPU metric charts
+
+     Args:
+         metrics_df: DataFrame with a flat metrics structure (from the HuggingFace dataset)
+
+     Returns:
+         Plotly figure with GPU metrics time series
+     """
+     # Prepare data (handles None/empty input and timestamp normalization)
+     df = extract_metrics_data(metrics_df)
+
+     if df is None or df.empty:
+         # Return an empty annotated figure rather than None, so downstream
+         # gr.Plot components always receive a figure
+         fig = go.Figure()
+         fig.add_annotation(
+             text="No GPU metrics available (expected for API models)",
+             xref="paper", yref="paper",
+             x=0.5, y=0.5, xanchor='center', yanchor='middle',
+             showarrow=False,
+             font=dict(size=16)
+         )
+         return fig
+
+     # Create subplots for the GPU metrics:
+     # utilization, memory, temperature, power, CO2 emissions, and power cost
+     fig = make_subplots(
+         rows=3, cols=2,
+         subplot_titles=[
+             'GPU Utilization (%)',
+             'GPU Memory (MiB)',
+             'GPU Temperature (°C)',
+             'GPU Power (W)',
+             'CO2 Emissions (g)',
+             'Power Cost (USD)'
+         ],
+         vertical_spacing=0.10,
+         horizontal_spacing=0.12,
+         specs=[[{}, {}], [{}, {}], [{}, {}]]
+     )
+
+     colors = ['#667eea', '#f093fb', '#4facfe', '#FFE66D', '#43e97b', '#FF6B6B']
+
+     # Map each metric column to its subplot position and color
+     metrics_config = [
+         ('gpu_utilization_percent', 'GPU Utilization (%)', 1, 1, colors[0]),
+         ('gpu_memory_used_mib', 'GPU Memory (MiB)', 1, 2, colors[1]),
+         ('gpu_temperature_celsius', 'GPU Temperature (°C)', 2, 1, colors[2]),
+         ('gpu_power_watts', 'GPU Power (W)', 2, 2, colors[3]),
+         ('co2_emissions_gco2e', 'CO2 Emissions (g)', 3, 1, colors[4]),
+         ('power_cost_usd', 'Power Cost (USD)', 3, 2, colors[5]),
+     ]
+
+     for col_name, title, row, col, color in metrics_config:
+         if col_name in df.columns:
+             fig.add_trace(
+                 go.Scatter(
+                     x=df['timestamp'],
+                     y=df[col_name],
+                     mode='lines+markers',
+                     name=title,
+                     line=dict(color=color, width=3),
+                     marker=dict(size=6, color=color),
+                     hovertemplate=(
+                         f"<b>{title}</b><br>" +
+                         "Time: %{x}<br>" +
+                         "Value: %{y:.2f}<br>" +
+                         "<extra></extra>"
+                     )
+                 ),
+                 row=row, col=col
+             )
+
+     # Add total memory as a dashed reference line, if available
+     if 'gpu_memory_total_mib' in df.columns:
+         total_memory = df['gpu_memory_total_mib'].iloc[0]
+         fig.add_hline(
+             y=total_memory,
+             line_dash="dash",
+             line_color="gray",
+             annotation_text=f"Total: {total_memory:.0f} MiB",
+             annotation_position="right",
+             row=1, col=2
+         )
+
+     fig.update_layout(
+         title_text="GPU Metrics Over Time",
+         height=900,
+         template="plotly_white",
+         showlegend=False,
+         hovermode='x unified'
+     )
+
+     # Show timestamps as HH:MM:SS on all x-axes
+     fig.update_xaxes(tickformat='%H:%M:%S')
+
+     return fig
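Usage mirrors the summary cards; columns absent from the frame (e.g. power_cost_usd in the sketch above) simply leave their subplot empty:

```python
fig = create_gpu_metrics_dashboard(metrics_df)
fig.show()  # in the app, the figure feeds a gr.Plot in the GPU Metrics tab
```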