VibecoderMcSwaggins committed on
Commit
7f11675
·
1 Parent(s): d36ce3c

docs: clean up resolved bug reports, update P3 commit hash

Browse files

Delete 13 obsolete bug docs (all resolved):
- FIX_PLAN_*.md (superseded by implementations)
- INVESTIGATION_*.md (completed)
- P0_*, P1_*, P2_* (all fixed)
- SENIOR_AGENT_*.md (one-time prompts)

Update ACTIVE_BUGS.md:
- P3 commit hash: (Pending) → d36ce3c
- Remove broken link to deleted P1_GRADIO_SETTINGS_CLEANUP.md

Bug index now shows all bugs resolved. Zero active bugs.

docs/bugs/ACTIVE_BUGS.md CHANGED
@@ -11,7 +11,7 @@
 ## Resolved Bugs
 
 ### ~~P3 - Magentic Mode Missing Termination Guarantee~~ FIXED
-**Commit**: `(Pending)` (2025-11-29)
+**Commit**: `d36ce3c` (2025-11-29)
 
 - Added `final_event_received` tracking in `orchestrator_magentic.py`
 - Added fallback yield for "max iterations reached" scenario
@@ -40,7 +40,6 @@
 - Users now see feedback during 2-5 minute initial processing
 
 ### ~~P1 - Gradio Settings Accordion~~ WONTFIX
-**File**: [P1_GRADIO_SETTINGS_CLEANUP.md](./P1_GRADIO_SETTINGS_CLEANUP.md)
 
 Decision: Removed nested Blocks, using ChatInterface directly.
 Accordion behavior is default Gradio - acceptable for demo.
docs/bugs/FIX_PLAN_CRITICAL_BUGS.md DELETED
@@ -1,36 +0,0 @@
-# Fix Plan: Critical Bugs (P0)
-
-**Date**: 2025-11-28
-**Status**: COMPLETED (2025-11-29)
-**Based on**: `docs/bugs/SENIOR_AUDIT_RESULTS.md`
-
----
-
-## Summary of Fixes
-
-### 1. Fixed Data Leak (Bug 4 & 2)
-- **Action**: Removed singleton `_embedding_service` in `src/services/embeddings.py`.
-- **Action**: Updated `EmbeddingService.__init__` to use a unique collection name (`evidence_{uuid}`) for complete isolation per instance.
-- **Action**: Refactored `SentenceTransformer` loading to a shared global to maintain performance while isolating state.
-- **Verified**: Unit tests passed, including new isolation verification.
-
-### 2. Fixed Advanced Mode BYOK (Bug 3)
-- **Action**: Updated `create_orchestrator` in `src/orchestrator_factory.py` to accept `api_key`.
-- **Action**: Updated `MagenticOrchestrator` to accept and use the `api_key` for the manager and agents.
-- **Action**: Updated `src/app.py` to pass the user's API key during orchestrator configuration.
-- **Verified**: `test_dual_mode_e2e.py` passed.
-
-### 3. Fixed Free Tier Experience (Bug 1)
-- **Action**: Updated `HFInferenceJudgeHandler` in `src/agent_factory/judges.py` to catch 402 (Payment Required) errors.
-- **Action**: Added logic to return a "synthesize" assessment with a clear error message when quota is exhausted, stopping the infinite loop.
-- **Verified**: Unit tests passed.
-
----
-
-## Verification
-
-All changes have been verified with:
-- `make check` (lint, typecheck, test) - ALL PASSED
-- Custom reproduction script for isolation - PASSED
-
-The system is now stable for the hackathon demo.
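The isolation fix in item 1 can be illustrated outside the codebase. This is a minimal sketch, not the actual `src/services/embeddings.py`: the `EmbeddingService` name and the `evidence_{uuid}` convention come from the plan above, while the stand-in model object replaces the real `SentenceTransformer` and ChromaDB client.

```python
# Sketch of the pattern: no singleton service, a unique collection name per
# instance, and a lazily loaded model shared across instances.
import uuid

_model = None  # shared, loaded once per process


def _get_model():
    """Lazy global cache (stands in for loading a SentenceTransformer)."""
    global _model
    if _model is None:
        _model = object()  # placeholder for the real model load
    return _model


class EmbeddingService:
    def __init__(self) -> None:
        # Unique per-instance collection name -> no cross-request pollution
        self.collection_name = f"evidence_{uuid.uuid4().hex}"
        self.model = _get_model()


a, b = EmbeddingService(), EmbeddingService()
assert a.collection_name != b.collection_name  # isolated state
assert a.model is b.model                      # shared model instance
```

The point of the split is that state (the collection) is per-request while the expensive asset (the model) stays cached.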
docs/bugs/FIX_PLAN_MAGENTIC_MODE.md DELETED
@@ -1,227 +0,0 @@
-# Fix Plan: Magentic Mode Report Generation
-
-**Related Bug**: `P0_MAGENTIC_MODE_BROKEN.md`
-**Approach**: Test-Driven Development (TDD)
-**Estimated Scope**: 4 tasks, ~2-3 hours
-
----
-
-## Problem Summary
-
-Magentic mode runs but fails to produce readable reports due to:
-
-1. **Primary Bug**: `MagenticFinalResultEvent.message` returns `ChatMessage` object, not text
-2. **Secondary Bug**: Max rounds (3) reached before ReportAgent completes
-3. **Tertiary Issues**: Stale "bioRxiv" references in prompts
-
----
-
-## Fix Order (TDD)
-
-### Phase 1: Write Failing Tests
-
-**Task 1.1**: Create test for ChatMessage text extraction
-
-```python
-# tests/unit/test_orchestrator_magentic.py
-
-def test_process_event_extracts_text_from_chat_message():
-    """Final result event should extract text from ChatMessage object."""
-    # Arrange: Mock ChatMessage with .content attribute
-    # Act: Call _process_event with MagenticFinalResultEvent
-    # Assert: Returned AgentEvent.message is a string, not object repr
-```
-
-**Task 1.2**: Create test for max rounds configuration
-
-```python
-def test_orchestrator_uses_configured_max_rounds():
-    """MagenticOrchestrator should use max_rounds from constructor."""
-    # Arrange: Create orchestrator with max_rounds=10
-    # Act: Build workflow
-    # Assert: Workflow has max_round_count=10
-```
-
-**Task 1.3**: Create test for bioRxiv reference removal
-
-```python
-def test_task_prompt_references_europe_pmc():
-    """Task prompt should reference Europe PMC, not bioRxiv."""
-    # Arrange: Create orchestrator
-    # Act: Check task string in run()
-    # Assert: Contains "Europe PMC", not "bioRxiv"
-```
-
----
-
-### Phase 2: Fix ChatMessage Text Extraction
-
-**File**: `src/orchestrator_magentic.py`
-**Lines**: 192-199
-
-**Current Code**:
-```python
-elif isinstance(event, MagenticFinalResultEvent):
-    text = event.message.text if event.message else "No result"
-```
-
-**Fixed Code**:
-```python
-elif isinstance(event, MagenticFinalResultEvent):
-    if event.message:
-        # ChatMessage may have .content or .text depending on version
-        if hasattr(event.message, 'content') and event.message.content:
-            text = str(event.message.content)
-        elif hasattr(event.message, 'text') and event.message.text:
-            text = str(event.message.text)
-        else:
-            # Fallback: convert entire message to string
-            text = str(event.message)
-    else:
-        text = "No result generated"
-```
-
-**Why**: The `agent_framework.ChatMessage` object structure may vary. We need defensive extraction.
-
----
-
-### Phase 3: Fix Max Rounds Configuration
-
-**File**: `src/orchestrator_magentic.py`
-**Lines**: 97-99
-
-**Current Code**:
-```python
-.with_standard_manager(
-    chat_client=manager_client,
-    max_round_count=self._max_rounds,  # Already uses config
-    max_stall_count=3,
-    max_reset_count=2,
-)
-```
-
-**Issue**: Default `max_rounds` in `__init__` is 10, but workflow may need more for complex queries.
-
-**Fix**: Verify the value flows through correctly. Add logging.
-
-```python
-logger.info(
-    "Building Magentic workflow",
-    max_rounds=self._max_rounds,
-    max_stall=3,
-    max_reset=2,
-)
-```
-
-**Also check**: `src/orchestrator_factory.py` passes config correctly:
-```python
-return MagenticOrchestrator(
-    max_rounds=config.max_iterations if config else 10,
-)
-```
-
----
-
-### Phase 4: Fix Stale bioRxiv References
-
-**Files to update**:
-
-| File | Line | Change |
-|------|------|--------|
-| `src/orchestrator_magentic.py` | 131 | "bioRxiv" → "Europe PMC" |
-| `src/agents/magentic_agents.py` | 32-33 | "bioRxiv" → "Europe PMC" |
-| `src/app.py` | 202-203 | "bioRxiv" → "Europe PMC" |
-
-**Search command to verify**:
-```bash
-grep -rn "bioRxiv\|biorxiv" src/
-```
-
----
-
-## Implementation Checklist
-
-```
-[ ] Phase 1: Write failing tests
-    [ ] 1.1 Test ChatMessage text extraction
-    [ ] 1.2 Test max rounds configuration
-    [ ] 1.3 Test Europe PMC references
-
-[ ] Phase 2: Fix ChatMessage extraction
-    [ ] Update _process_event() in orchestrator_magentic.py
-    [ ] Run test 1.1 - should pass
-
-[ ] Phase 3: Fix max rounds
-    [ ] Add logging to _build_workflow()
-    [ ] Verify factory passes config correctly
-    [ ] Run test 1.2 - should pass
-
-[ ] Phase 4: Fix bioRxiv references
-    [ ] Update orchestrator_magentic.py task prompt
-    [ ] Update magentic_agents.py descriptions
-    [ ] Update app.py UI text
-    [ ] Run test 1.3 - should pass
-    [ ] Run grep to verify no remaining refs
-
-[ ] Final Verification
-    [ ] make check passes
-    [ ] All tests pass (108+)
-    [ ] Manual test: run_magentic.py produces readable report
-```
-
----
-
-## Test Commands
-
-```bash
-# Run specific test file
-uv run pytest tests/unit/test_orchestrator_magentic.py -v
-
-# Run all tests
-uv run pytest tests/unit/ -v
-
-# Full check
-make check
-
-# Manual integration test
-set -a && source .env && set +a
-uv run python examples/orchestrator_demo/run_magentic.py "metformin alzheimer"
-```
-
----
-
-## Success Criteria
-
-1. `run_magentic.py` outputs a readable research report (not `<ChatMessage object>`)
-2. Report includes: Executive Summary, Key Findings, Drug Candidates, References
-3. No "Max round count reached" error with default settings
-4. No "bioRxiv" references anywhere in codebase
-5. All 108+ tests pass
-6. `make check` passes
-
----
-
-## Files Modified
-
-```
-src/
-├── orchestrator_magentic.py       # ChatMessage fix, logging
-├── agents/magentic_agents.py      # bioRxiv → Europe PMC
-└── app.py                         # bioRxiv → Europe PMC
-
-tests/unit/
-└── test_orchestrator_magentic.py  # NEW: 3 tests
-```
-
----
-
-## Notes for AI Agent
-
-When implementing this fix plan:
-
-1. **DO NOT** create mock data or fake responses
-2. **DO** write real tests that verify actual behavior
-3. **DO** run `make check` after each phase
-4. **DO** test with real OpenAI API key via `.env`
-5. **DO** preserve existing functionality - simple mode must still work
-6. **DO NOT** over-engineer - minimal changes to fix the specific bugs
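The Phase 2 extraction logic in the deleted plan can be lifted into a standalone helper and exercised without `agent_framework`. `extract_text` is a hypothetical name for this sketch, and `SimpleNamespace` stands in for the real `ChatMessage` object:

```python
# Defensive text extraction: prefer .content, fall back to .text, then str().
from types import SimpleNamespace


def extract_text(message) -> str:
    if message is None:
        return "No result generated"
    content = getattr(message, "content", None)
    if content:
        return str(content)
    text = getattr(message, "text", None)
    if text:
        return str(text)
    # Last resort: the object's string representation
    return str(message)


assert extract_text(SimpleNamespace(content="report body")) == "report body"
assert extract_text(SimpleNamespace(content=None, text="fallback")) == "fallback"
assert extract_text(None) == "No result generated"
```

Because each attribute is checked for truthiness, an empty `.content` falls through to `.text` rather than returning an empty report.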
docs/bugs/FIX_UI_SIMPLIFICATION.md DELETED
@@ -1,314 +0,0 @@
-# UI Simplification: Remove API Provider Dropdown
-
-**Issues**: #52, #53
-**Priority**: P1 - UX improvement for hackathon demo
-**Estimated Time**: 30 minutes
-**Senior Review**: ✅ Approved with changes (incorporated below)
-
----
-
-## Problem
-
-The current UI has confusing BYOK (Bring Your Own Key) settings:
-
-1. **Provider dropdown is misleading** - Shows "openai" but actually uses free tier when no key
-2. **Examples table shows useless columns** - API Key (empty), Provider (ignored)
-3. **Anthropic doesn't work with Advanced mode** - Only OpenAI has `agent-framework` support
-
-## Solution
-
-Remove `api_provider` dropdown entirely. Auto-detect provider from key prefix.
-
-**Functionality preserved:**
-- Simple mode: Free tier, OpenAI, OR Anthropic (all work)
-- Advanced mode: OpenAI only (Magentic multi-agent requires `OpenAIChatClient`)
-
----
-
-## Implementation
-
-### File: `src/app.py`
-
-#### Change 1: Update `configure_orchestrator()` signature (lines 23-28)
-
-```python
-# BEFORE
-def configure_orchestrator(
-    use_mock: bool = False,
-    mode: str = "simple",
-    user_api_key: str | None = None,
-    api_provider: str = "openai",  # ← REMOVE
-) -> tuple[Any, str]:
-
-# AFTER
-def configure_orchestrator(
-    use_mock: bool = False,
-    mode: str = "simple",
-    user_api_key: str | None = None,
-) -> tuple[Any, str]:
-```
-
-#### Change 2: Update docstring (lines 29-40)
-
-```python
-# AFTER
-"""
-Create an orchestrator instance.
-
-Args:
-    use_mock: If True, use MockJudgeHandler (no API key needed)
-    mode: Orchestrator mode ("simple" or "advanced")
-    user_api_key: Optional user-provided API key (BYOK) - auto-detects provider
-
-Returns:
-    Tuple of (Orchestrator instance, backend_name)
-"""
-```
-
-#### Change 3: Replace provider logic with auto-detection (lines 62-88)
-
-```python
-# BEFORE (lines 62-88) - complex provider checking with api_provider param
-
-# AFTER - auto-detect from key prefix
-# 2. Paid API Key (User provided or Env)
-elif user_api_key and user_api_key.strip():
-    # Auto-detect provider from key prefix
-    model: AnthropicModel | OpenAIModel
-    if user_api_key.startswith("sk-ant-"):
-        # Anthropic key
-        anthropic_provider = AnthropicProvider(api_key=user_api_key)
-        model = AnthropicModel(settings.anthropic_model, provider=anthropic_provider)
-        backend_info = "Paid API (Anthropic)"
-    elif user_api_key.startswith("sk-"):
-        # OpenAI key
-        openai_provider = OpenAIProvider(api_key=user_api_key)
-        model = OpenAIModel(settings.openai_model, provider=openai_provider)
-        backend_info = "Paid API (OpenAI)"
-    else:
-        raise ValueError(
-            "Invalid API key format. Expected sk-... (OpenAI) or sk-ant-... (Anthropic)"
-        )
-    judge_handler = JudgeHandler(model=model)
-
-# 3. Environment API Keys (fallback)
-elif os.getenv("OPENAI_API_KEY"):
-    judge_handler = JudgeHandler(model=None)  # Uses env key
-    backend_info = "Paid API (OpenAI from env)"
-
-elif os.getenv("ANTHROPIC_API_KEY"):
-    judge_handler = JudgeHandler(model=None)  # Uses env key
-    backend_info = "Paid API (Anthropic from env)"
-
-# 4. Free Tier (HuggingFace Inference)
-else:
-    judge_handler = HFInferenceJudgeHandler()
-    backend_info = "Free Tier (Llama 3.1 / Mistral)"
-```
-
-#### Change 4: Update `research_agent()` signature (lines 105-111)
-
-```python
-# BEFORE
-async def research_agent(
-    message: str,
-    history: list[dict[str, Any]],
-    mode: str = "simple",
-    api_key: str = "",
-    api_provider: str = "openai",  # ← REMOVE
-) -> AsyncGenerator[str, None]:
-
-# AFTER
-async def research_agent(
-    message: str,
-    history: list[dict[str, Any]],
-    mode: str = "simple",
-    api_key: str = "",
-) -> AsyncGenerator[str, None]:
-```
-
-#### Change 5: Update docstring (lines 112-124)
-
-```python
-# AFTER
-"""
-Gradio chat function that runs the research agent.
-
-Args:
-    message: User's research question
-    history: Chat history (Gradio format)
-    mode: Orchestrator mode ("simple" or "advanced")
-    api_key: Optional user-provided API key (BYOK - auto-detects provider)
-
-Yields:
-    Markdown-formatted responses for streaming
-"""
-```
-
-#### Change 6: Fix Advanced mode check (line 139)
-
-```python
-# BEFORE
-if mode == "advanced" and not (has_openai or (has_user_key and api_provider == "openai")):
-
-# AFTER - auto-detect OpenAI key from prefix
-is_openai_user_key = user_api_key and user_api_key.startswith("sk-") and not user_api_key.startswith("sk-ant-")
-if mode == "advanced" and not (has_openai or is_openai_user_key):
-    yield (
-        "⚠️ **Advanced mode requires OpenAI API key.** "
-        "Anthropic keys only work in Simple mode. Falling back to Simple.\n\n"
-    )
-    mode = "simple"
-```
-
-#### Change 7: Remove premature "Using your key" message (lines 146-151)
-
-```python
-# BEFORE - uses api_provider which no longer exists
-if has_user_key:
-    yield (
-        f"🔑 **Using your {api_provider.upper()} API key** - "
-        "Your key is used only for this session and is never stored.\n\n"
-    )
-
-# AFTER - remove this block entirely
-# The backend_name from configure_orchestrator already shows "Paid API (OpenAI)" or "Paid API (Anthropic)"
-# No need for duplicate messaging
-```
-
-#### Change 8: Update configure_orchestrator call (lines 165-170)
-
-```python
-# BEFORE
-orchestrator, backend_name = configure_orchestrator(
-    use_mock=False,
-    mode=mode,
-    user_api_key=user_api_key,
-    api_provider=api_provider,  # ← REMOVE
-)
-
-# AFTER
-orchestrator, backend_name = configure_orchestrator(
-    use_mock=False,
-    mode=mode,
-    user_api_key=user_api_key,
-)
-```
-
-#### Change 9: Simplify examples (lines 210-229)
-
-```python
-# BEFORE - 4 items per example
-examples=[
-    ["What drugs improve female libido post-menopause?", "simple", "", "openai"],
-    ["Clinical trials for erectile dysfunction alternatives to PDE5 inhibitors?", "simple", "", "openai"],
-    ["Evidence for testosterone therapy in women with HSDD?", "simple", "", "openai"],
-],
-
-# AFTER - 2 items per example (query, mode) - API key always empty in examples
-examples=[
-    ["What drugs improve female libido post-menopause?", "simple"],
-    ["Clinical trials for ED alternatives to PDE5 inhibitors?", "simple"],
-    ["Evidence for testosterone therapy in women with HSDD?", "simple"],
-],
-```
-
-#### Change 10: Update additional_inputs (lines 231-252)
-
-```python
-# BEFORE - 3 inputs (mode, api_key, api_provider)
-additional_inputs=[
-    gr.Radio(
-        choices=["simple", "advanced"],
-        value="simple",
-        label="Orchestrator Mode",
-        info="Simple: Linear (Free Tier Friendly) | Advanced: Multi-Agent (Requires OpenAI)",
-    ),
-    gr.Textbox(
-        label="🔑 API Key (Optional - BYOK)",
-        placeholder="sk-... or sk-ant-...",
-        type="password",
-        info="Enter your own API key. Never stored.",
-    ),
-    gr.Radio(  # ← REMOVE THIS ENTIRE BLOCK
-        choices=["openai", "anthropic"],
-        value="openai",
-        label="API Provider",
-        info="Select the provider for your API key",
-    ),
-],
-
-# AFTER - 2 inputs (mode, api_key)
-additional_inputs=[
-    gr.Radio(
-        choices=["simple", "advanced"],
-        value="simple",
-        label="Orchestrator Mode",
-        info="Simple: Works with any key or free tier | Advanced: Requires OpenAI key",
-    ),
-    gr.Textbox(
-        label="🔑 API Key (Optional)",
-        placeholder="sk-... (OpenAI) or sk-ant-... (Anthropic)",
-        type="password",
-        info="Leave empty for free tier. Auto-detects provider from key prefix.",
-    ),
-],
-```
-
-#### Change 11: Update accordion label (line 230)
-
-```python
-# BEFORE
-additional_inputs_accordion=gr.Accordion(label="⚙️ Settings", open=False),
-
-# AFTER
-additional_inputs_accordion=gr.Accordion(label="⚙️ Settings (Free tier works without API key)", open=False),
-```
-
----
-
-## Testing Checklist
-
-### Manual Tests
-- [ ] **No key**: Shows "Free Tier (Llama 3.1 / Mistral)" in backend
-- [ ] **OpenAI key (sk-...)**: Shows "Paid API (OpenAI)" in backend
-- [ ] **Anthropic key (sk-ant-...)**: Shows "Paid API (Anthropic)" in backend
-- [ ] **Invalid key format**: Shows error message
-- [ ] **Anthropic key + Advanced mode**: Falls back to Simple with warning
-- [ ] **OpenAI key + Advanced mode**: Uses full Magentic multi-agent
-- [ ] **Examples table**: Shows only 2 columns (query, mode)
-- [ ] **MCP server**: Still accessible at `/gradio_api/mcp/`
-
-### Unit Test Updates
-- [ ] `tests/unit/test_app_smoke.py` - may need update if checking input count
-
----
-
-## Definition of Done
-
-- [ ] `api_provider` parameter removed from `configure_orchestrator()`
-- [ ] `api_provider` parameter removed from `research_agent()`
-- [ ] Auto-detection logic works for `sk-` and `sk-ant-` prefixes
-- [ ] Advanced mode check uses auto-detection (not removed param)
-- [ ] "Using your X key" message removed (backend_name handles this)
-- [ ] Examples table shows 2 columns
-- [ ] Accordion label updated
-- [ ] Placeholder text shows both key formats
-- [ ] All existing tests pass
-- [ ] MCP server still works
-
----
-
-## Mode Compatibility Matrix (Unchanged)
-
-| Mode | No Key | OpenAI Key | Anthropic Key |
-|------|--------|------------|---------------|
-| **Simple** | ✅ Free tier | ✅ GPT-5.1 | ✅ Claude Sonnet 4.5 |
-| **Advanced** | ⚠️ Falls back | ✅ Full Magentic | ⚠️ Falls back to Simple |
-
----
-
-## Related
-- Issue #52: UI Polish - Examples table confusion
-- Issue #53: API Provider Simplification
-- Senior Review: Approved 2025-11-28
docs/bugs/INVESTIGATION_INVALID_MODELS.md DELETED
@@ -1,31 +0,0 @@
-# Bug Investigation: Invalid Default LLM Models
-
-## Status
-- **Date:** 2025-11-29
-- **Reporter:** CLI User
-- **Component:** `src/utils/config.py`
-- **Priority:** High (Magentic Mode Blocker)
-- **Resolution:** FIXED
-
-## Issue Description
-The user encountered a 403 error when running in Magentic mode:
-`Error code: 403 - {'error': {'message': 'Project ... does not have access to model gpt-5', ... 'code': 'model_not_found'}}`
-
-## Root Cause Analysis
-OpenAI deprecated the base `gpt-5` model. Tier 5 accounts now have access to:
-- `gpt-5.1` (current flagship)
-- `gpt-5-mini`
-- `gpt-5-nano`
-- `gpt-4.1`, `gpt-4.1-mini`, `gpt-4.1-nano`
-- `o3`, `o4-mini`
-
-The base `gpt-5` is NO LONGER available via API.
-
-## Solution Implemented
-Updated `src/utils/config.py` to use:
-- `openai_model`: `gpt-5.1` (the actual current model)
-- `anthropic_model`: `claude-sonnet-4-5-20250929` (unchanged)
-
-## Verification
-- `tests/unit/agent_factory/test_judges_factory.py` updated and passed.
-- User confirmed Tier 5 access to `gpt-5.1` via OpenAI dashboard.
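For reference, the updated defaults in the deleted investigation can be expressed as a minimal settings sketch; the real `src/utils/config.py` may use a different mechanism (e.g. pydantic settings), so this dataclass is only an illustration of the two values it changed:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    # Defaults after the fix: gpt-5 was replaced because the base model
    # is no longer available via the API.
    openai_model: str = "gpt-5.1"
    anthropic_model: str = "claude-sonnet-4-5-20250929"


settings = Settings()
assert settings.openai_model == "gpt-5.1"
```

Pinning defaults in one place like this keeps a `model_not_found` regression to a single-line fix.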
docs/bugs/INVESTIGATION_QUOTA_BLOCKER.md DELETED
@@ -1,49 +0,0 @@
-# Bug Investigation: HF Free Tier Quota Exhaustion
-
-## Status
-- **Date:** 2025-11-29
-- **Reporter:** CLI User
-- **Component:** `HFInferenceJudgeHandler`
-- **Priority:** High (UX Blocker for Free Tier)
-- **Resolution:** FIXED
-
-## Issue Description
-On a fresh run with a simple query ("What drugs improve female libido post-menopause?"), the system retrieved 20 valid sources but failed during the Judge/Analysis phase with:
-`⚠️ Free Tier Quota Exceeded ⚠️`
-
-This results in a "Synthesis" step that has 0 candidates and 0 findings, rendering the application useless for free users once the (very low) limit is hit, despite having valid search results.
-
-## Evidence
-Output provided:
-```text
-### Citations (20 sources)
-...
-### Reasoning
-⚠️ **Free Tier Quota Exceeded** ⚠️
-```
-
-## Root Cause Analysis
-1. **Search Success:** `SearchAgent` correctly found 20 documents (PubMed/EuropePMC).
-2. **Judge Failure:** `HFInferenceJudgeHandler` called the HF Inference API.
-3. **Quota Trap:** The API returned a 402 (Payment Required) or Quota error.
-4. **Previous Handling:** The handler caught this error and returned a `JudgeAssessment` with `sufficient=True` (to stop the loop) and *empty* fields.
-5. **Data Loss:** The 20 valid search results were effectively discarded from the "Analysis" perspective.
-
-## The "Deep Blocker"
-The system had a "hard failure" mode for quota exhaustion, assuming that if the LLM can't judge, we have *no* useful information. This "bricked" the UX for free users immediately upon hitting the limit.
-
-## Solution Implemented
-Modified `HFInferenceJudgeHandler._create_quota_exhausted_assessment` to:
-1. Accept the `evidence` list as an argument.
-2. Perform basic heuristic extraction (borrowed from `MockJudgeHandler` logic):
-   - Use titles as "Key Findings" (first 5 sources).
-   - Add a clear message in "Drug Candidates" telling the user to upgrade.
-3. Return this "Partial" assessment instead of an empty one.
-
-## Verification
-- Created `tests/unit/agent_factory/test_judges_hf_quota.py` to verify that:
-  - 402 errors are caught.
-  - `sufficient` is set to `True` (stops loop).
-  - `key_findings` are populated from search result titles.
-  - `reasoning` contains the warning message.
-- Ran existing tests `tests/unit/agent_factory/test_judges_hf.py` - All passed.
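The partial-assessment behavior described under Solution Implemented can be sketched as a plain function. The dict shape and the `quota_exhausted_assessment` name here are illustrative, not the real `JudgeAssessment` model; the point is that evidence titles survive even when the LLM judge is unavailable:

```python
def quota_exhausted_assessment(evidence: list[dict]) -> dict:
    """Heuristic fallback when the HF Inference API returns 402 (quota)."""
    return {
        "sufficient": True,  # stop the search/judge loop, do not retry forever
        "key_findings": [e["title"] for e in evidence[:5]],  # titles of first 5 sources
        "drug_candidates": ["(Upgrade to a paid API key for full LLM analysis)"],
        "reasoning": "⚠️ Free Tier Quota Exceeded ⚠️ Showing partial results.",
    }


sources = [{"title": f"Paper {i}"} for i in range(20)]
result = quota_exhausted_assessment(sources)
assert result["sufficient"] is True
assert len(result["key_findings"]) == 5
```

This turns a hard failure into degraded-but-useful output: the 20 retrieved sources are no longer discarded.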
docs/bugs/P0_CRITICAL_BUGS.md DELETED
@@ -1,43 +0,0 @@
-# P0 Critical Bugs - DeepBoner Demo Broken
-
-**Date**: 2025-11-28
-**Status**: RESOLVED (2025-11-29)
-**Priority**: P0 - Blocking hackathon submission
-
----
-
-## Summary
-
-The Gradio demo was non-functional due to 4 critical bugs. All have been fixed and verified.
-
----
-
-## Bug 1: Free Tier LLM Quota Exhausted (P0) - FIXED
-
-**Resolution**:
-- Implemented `QuotaExhaustedError` detection in `HFInferenceJudgeHandler`.
-- The agent now gracefully stops and displays a clear "Free Tier Quota Exceeded" message instead of looping infinitely.
-
-## Bug 2: Evidence Counter Shows 0 After Dedup (P1) - FIXED
-
-**Resolution**:
-- Fixed by resolving Bug 4 (Data Leak). Deduplication now works correctly on isolated per-request collections.
-
-## Bug 3: API Key Not Passed to Advanced Mode (P0) - FIXED
-
-**Resolution**:
-- Plumbed `api_key` from the UI through `configure_orchestrator` -> `create_orchestrator` -> `MagenticOrchestrator`.
-- Magentic agents now correctly use the user-provided OpenAI key.
-
-## Bug 4: Singleton EmbeddingService Causes Cross-Session Pollution (P0) - FIXED
-
-**Resolution**:
-- Removed the singleton pattern for `EmbeddingService`.
-- Each request now gets a fresh `EmbeddingService` with a unique, isolated ChromaDB collection (`evidence_{uuid}`).
-- `SentenceTransformer` model is lazily cached globally to maintain performance.
-
----
-
-## Verification
-
-Run `make check` to verify all tests pass.
docs/bugs/P0_GRADIO_EXAMPLE_CACHING_CRASH.md DELETED
@@ -1,134 +0,0 @@
-# P0 Bug Report: Gradio Example Caching Crash
-
-## Status
-- **Date:** 2025-11-29
-- **Priority:** P0 CRITICAL (Production Down)
-- **Component:** `src/app.py:131`
-- **Environment:** HuggingFace Spaces (Python 3.11, Gradio)
-
-## Error Message
-
-```text
-AttributeError: 'NoneType' object has no attribute 'strip'
-```
-
-## Full Stack Trace
-
-```text
-File "/app/src/app.py", line 131, in research_agent
-    user_api_key = (api_key.strip() or api_key_state.strip()) or None
-                    ^^^^^^^^^^^^^
-AttributeError: 'NoneType' object has no attribute 'strip'
-```
-
-## Root Cause Analysis
-
-### The Trigger
-Gradio's example caching mechanism runs the `research_agent` function during startup to pre-cache example outputs. This happens at:
-
-```text
-File "/usr/local/lib/python3.11/site-packages/gradio/helpers.py", line 509, in _start_caching
-    await self.cache()
-```
-
-### The Problem
-Our examples only provide values for 2 of the 4 function parameters:
-
-```python
-examples=[
-    ["What is the evidence for testosterone therapy in women with HSDD?", "simple"],
-    ["Promising drug candidates for endometriosis pain management", "simple"],
-]
-```
-
-These map to `[message, mode]` but **NOT** to `api_key` or `api_key_state`.
-
-When Gradio runs the function for caching, it passes `None` for the unprovided parameters:
-
-```python
-async def research_agent(
-    message: str,            # ✅ Provided by example
-    history: list[...],      # ✅ Empty list default
-    mode: str = "simple",    # ✅ Provided by example
-    api_key: str = "",       # ❌ Becomes None during caching!
-    api_key_state: str = ""  # ❌ Becomes None during caching!
-) -> AsyncGenerator[...]:
-```
-
-### The Crash
-Line 131 attempts to call `.strip()` on `None`:
-
-```python
-user_api_key = (api_key.strip() or api_key_state.strip()) or None
-#               ^^^^^^^^^^^^^
-#               NoneType has no attribute 'strip'
-```
-
-## Gradio Warning (Ignored)
-
-Gradio actually warned us about this:
-
-```text
-UserWarning: Examples will be cached but not all input components have
-example values. This may result in an exception being thrown by your function.
-```
-
-## Solution
-
-### Option A: Defensive None Handling (Recommended)
-Add None guards before calling `.strip()`:
-
-```python
-# Handle None values from Gradio example caching
-api_key_str = api_key or ""
-api_key_state_str = api_key_state or ""
-user_api_key = (api_key_str.strip() or api_key_state_str.strip()) or None
-```
-
-### Option B: Disable Example Caching
-Set `cache_examples=False` in ChatInterface:
-
-```python
-gr.ChatInterface(
-    fn=research_agent,
-    examples=[...],
-    cache_examples=False,  # Disable caching
-)
-```
-
-This avoids the crash but loses the UX benefit of pre-cached examples.
-
-### Option C: Provide Full Example Values
-Include all 4 columns in examples:
-
-```python
-examples=[
-    ["What is the evidence...", "simple", "", ""],  # [msg, mode, api_key, state]
-]
-```
-
-This is verbose and exposes internal state to users.
-
-## Recommendation
-
-**Option A** is the cleanest fix. It:
-1. Maintains cached examples for fast UX
-2. Handles edge cases defensively
-3. Doesn't expose internal state in examples
-
-## Pre-Merge Checklist
-
-- [ ] Fix applied to `src/app.py`
-- [ ] Unit test added for None parameter handling
-- [ ] `make check` passes
-- [ ] Test locally with `uv run python -m src.app`
-- [ ] Verify example caching works without crash
-- [ ] Deploy to HuggingFace Spaces
-- [ ] Verify Space starts without error
-
-## Lessons Learned
-
-1. Always test Gradio apps with example caching enabled locally before deploying
-2. Gradio's "partial examples" feature passes `None` for missing columns
-3. Default parameter values (`str = ""`) are ignored when Gradio explicitly passes `None`
-4. The Gradio warning about missing example values should be treated as an error
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
docs/bugs/P1_GRADIO_SETTINGS_CLEANUP.md DELETED
@@ -1,81 +0,0 @@
1
- # P1 Bug: Gradio Settings Accordion Not Collapsing
2
-
3
- **Priority**: P1 (UX Bug)
4
- **Status**: OPEN
5
- **Date**: 2025-11-27
6
- **Target Component**: `src/app.py`
7
-
8
- ---
9
-
10
- ## 1. Problem Description
11
-
12
- The "Settings" accordion in the Gradio UI (containing Orchestrator Mode, API Key, Provider) fails to collapse, even when configured with `open=False`. It remains permanently expanded, cluttering the interface and obscuring the chat history.
13
-
14
- ### Symptoms
15
- - Accordion arrow toggles visually, but content remains visible.
16
- - Occurs in both local development (`uv run src/app.py`) and HuggingFace Spaces.
17
-
18
- ---
19
-
20
- ## 2. Root Cause Analysis
21
-
22
- **Definitive Cause**: Nested `Blocks` Context Bug.
23
- `gr.ChatInterface` is itself a high-level abstraction that creates a `gr.Blocks` context. Wrapping `gr.ChatInterface` inside an external `with gr.Blocks():` context causes event listener conflicts, specifically breaking the JavaScript state management for `additional_inputs_accordion`.
24
-
25
- **Reference**: [Gradio Issue #8861](https://github.com/gradio-app/gradio/issues/8861) confirms that `additional_inputs_accordion` malfunctions when `ChatInterface` is not the top-level block.
26
-
27
- ---
28
-
29
- ## 3. Solution Strategy: "The Unwrap Fix"
30
-
31
- We will remove the redundant `gr.Blocks` wrapper. This restores the native behavior of `ChatInterface`, ensuring the accordion respects `open=False`.
32
-
33
- ### Implementation Plan
34
-
35
- **Refactor `src/app.py` / `create_demo()`**:
36
-
37
- 1. **Remove** the `with gr.Blocks() as demo:` context manager.
38
- 2. **Instantiate** `gr.ChatInterface` directly as the `demo` object.
39
- 3. **Migrate UI Elements**:
40
- * **Header**: Move the H1/Title text into the `title` parameter of `ChatInterface`.
41
- * **Footer**: Move the footer text ("MCP Server Active...") into the `description` parameter. `ChatInterface` supports Markdown in `description`, making it the ideal place for static info below the title but above the chat.
42
-
43
- ### Before (Buggy)
44
- ```python
45
- def create_demo():
46
- with gr.Blocks() as demo: # <--- CAUSE OF BUG
47
- gr.Markdown("# Title")
48
- gr.ChatInterface(..., additional_inputs_accordion=gr.Accordion(open=False))
49
- gr.Markdown("Footer")
50
- return demo
51
- ```
52
-
53
- ### After (Correct)
54
- ```python
55
- def create_demo():
56
- return gr.ChatInterface( # <--- FIX: Top-level component
57
- ...,
58
- title="🧬 DeepBoner",
59
- description="*AI-Powered Drug Repurposing Agent...*\n\n---\n**MCP Server Active**...",
60
- additional_inputs_accordion=gr.Accordion(label="⚙️ Settings", open=False)
61
- )
62
- ```
63
-
64
- ---
65
-
66
- ## 4. Validation
67
-
68
- 1. **Run**: `uv run python src/app.py`
69
- 2. **Check**: Open `http://localhost:7860`
70
- 3. **Verify**:
71
- * Settings accordion starts **COLLAPSED**.
72
- * Header title ("DeepBoner") is visible.
73
- * Footer text ("MCP Server Active") is visible in the description area.
74
- * Chat functionality works (Magentic/Simple modes).
75
-
76
- ---
77
-
78
- ## 5. Constraints & Notes
79
-
80
- - **Layout**: We lose the ability to place arbitrary elements *below* the chat box (footer will move to top, under title), but this is an acceptable trade-off for a working UI.
81
- - **CSS**: `ChatInterface` handles its own CSS; any custom class styling from the previous footer will be standardized to the description text style.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
docs/bugs/P1_MAGENTIC_STREAMING_AND_KEY_PERSISTENCE.md DELETED
@@ -1,181 +0,0 @@
1
- # Bug Report: Magentic Mode Integration Issues
2
-
3
- ## Status
4
- - **Date:** 2025-11-29
5
- - **Reporter:** CLI User
6
- - **Priority:** P1 (UX Degradation + Deprecation Warnings)
7
- - **Component:** `src/app.py`, `src/orchestrator_magentic.py`, `src/utils/llm_factory.py`
8
- - **Status:** ✅ FIXED (Bug 1 & Bug 2) - 2025-11-29
9
- - **Tests:** 138 passing (136 original + 2 new validation tests)
10
-
11
- ---
12
-
13
- ## Bug 1: Token-by-Token Streaming Spam ✅ FIXED
14
-
15
- ### Symptoms
16
- When running Magentic (Advanced) mode, the UI shows hundreds of individual lines like:
17
- ```text
18
- 📡 STREAMING: Below
19
- 📡 STREAMING: is
20
- 📡 STREAMING: a
21
- 📡 STREAMING: curated
22
- 📡 STREAMING: list
23
- ...
24
- ```
25
-
26
- Each token is displayed as a separate streaming event, creating visual spam and making it impossible to read the output until completion.
27
-
28
- ### Root Cause (VALIDATED)
29
- **File:** `src/orchestrator_magentic.py:247-254`
30
-
31
- ```python
32
- elif isinstance(event, MagenticAgentDeltaEvent):
33
- if event.text:
34
- return AgentEvent(
35
- type="streaming",
36
- message=event.text, # Single token!
37
- data={"agent_id": event.agent_id},
38
- iteration=iteration,
39
- )
40
- ```
41
-
42
- Every LLM token emits a `MagenticAgentDeltaEvent`, which creates an `AgentEvent(type="streaming")`.
43
-
44
- **File:** `src/app.py:171-192` (BEFORE FIX)
45
-
46
- ```python
47
- async for event in orchestrator.run(message):
48
- event_md = event.to_markdown()
49
- response_parts.append(event_md) # Appends EVERY token
50
-
51
- if event.type == "complete":
52
- yield event.message
53
- else:
54
- yield "\n\n".join(response_parts) # Yields ALL accumulated tokens
55
- ```
56
-
57
- For N tokens, this yields N times, each time showing all previous tokens. This is O(N²) string operations and creates massive visual spam.
58
-
59
- ### Fix Applied
60
- **File:** `src/app.py:175-204`
61
-
62
- Implemented streaming token buffering with live updates:
63
- 1. Added `streaming_buffer = ""` to accumulate tokens
64
- 2. For each streaming event: append to buffer, yield immediately (for live typing UX)
65
- 3. **Key fix**: Don't append streaming events to `response_parts` (prevents O(N²) list growth)
66
- 4. Each yield has only ONE `📡 STREAMING:` line (the accumulated buffer)
67
- 5. Flush buffer to `response_parts` only when non-streaming event occurs
68
-
69
- **Result**: Live typing feel preserved, but no visual spam (each update replaces, not accumulates)
70
-
71
- ### Proposed Fix Options
72
-
73
- **Option A: Buffer streaming tokens (recommended)**
74
- ```python
75
- # In app.py - accumulate streaming tokens, yield periodically
76
- streaming_buffer = ""
77
- last_yield_time = time.time()
78
-
79
- async for event in orchestrator.run(message):
80
- if event.type == "streaming":
81
- streaming_buffer += event.message
82
- # Only yield every 500ms or on newline
83
- if time.time() - last_yield_time > 0.5 or "\n" in event.message:
84
- yield f"📡 {streaming_buffer}"
85
- last_yield_time = time.time()
86
- elif event.type == "complete":
87
- yield event.message
88
- else:
89
- # Non-streaming events
90
- response_parts.append(event.to_markdown())
91
- yield "\n\n".join(response_parts)
92
- ```
93
-
94
- **Option B: Don't yield streaming events at all**
95
- ```python
96
- # In app.py - only yield meaningful events
97
- async for event in orchestrator.run(message):
98
- if event.type == "streaming":
99
- continue # Skip token-by-token spam
100
- # ... rest of logic
101
- ```
102
-
103
- **Option C: Fix at orchestrator level**
104
- Don't emit `AgentEvent` for every delta - buffer in `_process_event`.
105
-
106
- ---
107
-
108
- ## Bug 2: API Key Does Not Persist in Textbox ✅ FIXED
109
-
110
- ### Symptoms
111
- 1. User opens the "Mode & API Key" accordion
112
- 2. User pastes their API key into the password textbox
113
- 3. User clicks an example OR clicks elsewhere
114
- 4. The API key textbox is now empty - value lost
115
-
116
- ### Root Cause (VALIDATED)
117
- **File:** `src/app.py:255-267` (BEFORE FIX)
118
-
119
- ```python
120
- additional_inputs_accordion=additional_inputs_accordion,
121
- additional_inputs=[
122
- gr.Radio(...),
123
- gr.Textbox(
124
- label="🔑 API Key (Optional)",
125
- type="password",
126
- # No `value` parameter - defaults to empty
127
- # No state persistence mechanism
128
- ),
129
- ],
130
- ```
131
-
132
- Gradio's `ChatInterface` with `additional_inputs` has known issues:
133
- 1. Clicking examples resets additional inputs to defaults
134
- 2. The accordion state and input values may not persist correctly
135
- 3. No explicit state management for the API key
136
-
137
- ### Fix Applied
138
- **Files Modified:**
139
- 1. `src/app.py`
140
- 2. `src/utils/llm_factory.py`
141
-
142
- **Bug 1 (Streaming Spam):**
143
- - Accumulate tokens in `streaming_buffer`
144
- - Yield updates immediately for live typing UX
145
- - **Key**: Don't append to `response_parts` until stream segment complete
146
- - Each yield has ONE `📡 STREAMING:` line (not N accumulated lines)
147
-
148
- **Bug 2 (API Key Persistence):**
149
- - **Strategy:** Partial example list (relies on Gradio behavior)
150
- - Examples have only 2 elements `[message, mode]` instead of 4
151
- - Gradio only updates inputs with corresponding example values
152
- - Remaining inputs (api_key textbox) are left unchanged
153
- - `api_key_state` parameter exists as fallback but may be redundant
154
- - **Note:** This is a workaround relying on undocumented Gradio behavior
155
-
156
- **Bug 3 (OpenAIModel Deprecation):** ✅ FIXED
157
- - Replaced all `OpenAIModel` imports with `OpenAIChatModel` in `src/app.py` and `src/utils/llm_factory.py`.
158
-
159
- ### Test Results
160
- ```bash
161
- uv run pytest tests/ -q
162
- ============================= 138 passed in 20.60s =============================
163
- ```
164
-
165
- **Status:** ✅ All tests passing
166
-
167
- ### Why This Fix Works
168
-
169
- **Bug 1 (Streaming Spam):**
170
- - **Before:** Every token → `append()` to list → `yield` → List grew to size N → O(N²) complexity.
171
- - **After:** Every token → `yield` dynamically constructed string (buffer + history) → List stays size K (number of *events*).
172
- - **Impact:** Smooth streaming, no visual spam, no browser freeze.
173
-
174
- **Bug 2 (API Key):**
175
- - **Before:** Example click → Overwrote API Key textbox with `""`.
176
- - **After:** Example click → Updates only `message` and `mode` → API Key textbox untouched.
177
- - **Impact:** User input persists naturally.
178
-
179
- ### Remaining Work
180
- - **Bug 4 (Asyncio GC errors):** Monitoring only - likely Gradio/HF Spaces issue
181
-
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
docs/bugs/P1_MULTIPLE_UX_BUGS.md DELETED
@@ -1,49 +0,0 @@
1
- # P1 Bug Report: Multiple UX and Configuration Issues
2
-
3
- ## Status
4
- - **Date:** 2025-11-29
5
- - **Priority:** P1 (Multiple user-facing issues)
6
- - **Components:** `src/app.py`, `src/orchestrator_magentic.py`
7
-
8
- ## Resolved Issues (Fixed 2025-11-29)
9
-
10
- ### Bug 1: API Key Cleared When Clicking Examples
11
- **Fixed.** Updated `examples` in `app.py` to include explicit `None` values for additional inputs. Gradio preserves values when the example value is `None`.
12
-
13
- ### Bug 2: No Loading/Processing Indicator
14
- **Fixed.** `research_agent` yields an immediate "⏳ Processing..." message before starting the orchestrator.
15
-
16
- ### Bug 3: Advanced Mode Temperature Error
17
- **Fixed.** Explicitly set `temperature=1.0` for all Magentic agents in `src/agents/magentic_agents.py`. This is compatible with OpenAI reasoning models (o1/o3) which require `temperature=1` and were rejecting the default (likely 0.3 or None).
18
-
19
- ### Bug 4: HSDD Acronym Not Spelled Out
20
- **Fixed.** Updated example text in `app.py` to "HSDD (Hypoactive Sexual Desire Disorder)".
21
-
22
- ---
23
-
24
- ## Open / Deferred Issues
25
-
26
- ### Bug 5: Free Tier Quota Exhausted (UX Improvement)
27
- **Deferred.** Currently shows standard error message. Improve if users report confusion.
28
-
29
- ### Bug 6: Asyncio File Descriptor Warnings
30
- **Won't Fix.** Cosmetic issue only.
31
-
32
- ---
33
-
34
- ## Priority Order (Completed)
35
-
36
- 1. **Bug 4 (HSDD)** - Fixed
37
- 2. **Bug 2 (Loading indicator)** - Fixed
38
- 3. **Bug 3 (Temperature)** - Fixed
39
- 4. **Bug 1 (API key)** - Fixed
40
-
41
- ---
42
-
43
- ## Test Plan
44
- - [x] Fix HSDD acronym
45
- - [x] Add loading indicator yield
46
- - [x] Test advanced mode with temperature fix (Static analysis/Code change)
47
- - [x] Research Gradio example behavior for API key (Implemented None fix)
48
- - [ ] Run `make check`
49
- - [ ] Deploy and test on HuggingFace Spaces
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
docs/bugs/P2_MAGENTIC_THINKING_STATE.md DELETED
@@ -1,232 +0,0 @@
1
- # P2 Bug Report: Advanced Mode Missing "Thinking" State
2
-
3
- ## Status
4
- - **Date:** 2025-11-29
5
- - **Priority:** P2 (UX polish, not blocking functionality)
6
- - **Component:** `src/orchestrator_magentic.py`, `src/app.py`
7
-
8
- ---
9
-
10
- ## Symptoms
11
-
12
- User experience in **Advanced (Magentic) mode**:
13
- 1. Click example or submit query
14
- 2. See: `🚀 **STARTED**: Starting research (Magentic mode)...`
15
- 3. **2+ minutes of nothing** (no spinner, no progress, no indication work is happening)
16
- 4. Eventually see: `🧠 **JUDGING**: Manager (user_task)...`
17
-
18
- **User perception:** "Is it frozen? Did it crash?"
19
-
20
- ### Container Logs Confirm Work IS Happening
21
- ```
22
- 14:54:22 [info] Starting Magentic orchestrator query='...'
23
- 14:54:22 [info] Embedding service enabled
24
- ... 2+ MINUTES OF SILENCE (agent-framework doing internal LLM calls) ...
25
- 14:56:38 [info] Creating orchestrator mode=advanced
26
- ```
27
-
28
- The silence is because `workflow.run_stream()` doesn't yield events during its setup phase.
29
-
30
- ---
31
-
32
- ## Root Cause Analysis
33
-
34
- ### Current Flow (`src/orchestrator_magentic.py`)
35
- ```python
36
- async def run(self, query: str) -> AsyncGenerator[AgentEvent, None]:
37
- # 1. Immediately yields "started"
38
- yield AgentEvent(type="started", message=f"Starting research (Magentic mode): {query}")
39
-
40
- # 2. Setup (fast, no yield needed)
41
- embedding_service = self._init_embedding_service()
42
- init_magentic_state(embedding_service)
43
- workflow = self._build_workflow()
44
-
45
- # 3. GAP: workflow.run_stream() blocks for 2+ minutes before first event
46
- async for event in workflow.run_stream(task): # <-- THE BOTTLENECK
47
- yield self._process_event(event)
48
- ```
49
-
50
- The `agent-framework`'s `workflow.run_stream()` is calling OpenAI's API, building the manager prompt, coordinating agents, etc. **It doesn't yield events during this setup phase**.
51
-
52
- ---
53
-
54
- ## Gold Standard UX (What We'd Want)
55
-
56
- ### Gradio's Native Thinking Support
57
-
58
- Per [Gradio Chatbot Docs](https://www.gradio.app/docs/gradio/chatbot):
59
-
60
- > "The Gradio Chatbot can natively display intermediate thoughts and tool usage in a collapsible accordion next to a chat message. This makes it perfect for creating UIs for LLM agents and chain-of-thought (CoT) or reasoning demos."
61
-
62
- **Features available:**
63
- - `gr.ChatMessage` with `metadata={"status": "pending"}` shows spinner
64
- - `metadata={"title": "Thinking...", "status": "pending"}` creates collapsible accordion
65
- - Nested thoughts via `id` and `parent_id`
66
- - `duration` metadata shows time spent
67
-
68
- **Example from Gradio docs:**
69
- ```python
70
- import gradio as gr
71
-
72
- def chat_fn(message, history):
73
- # Yield thinking state with spinner
74
- yield gr.ChatMessage(
75
- role="assistant",
76
- metadata={"title": "🧠 Thinking...", "status": "pending"}
77
- )
78
-
79
- # Do work...
80
-
81
- # Update with completed thought
82
- yield gr.ChatMessage(
83
- role="assistant",
84
- content="Analysis complete",
85
- metadata={"title": "🧠 Thinking...", "status": "done", "duration": 5.2}
86
- )
87
-
88
- yield "Here's the final answer..."
89
- ```
90
-
91
- ---
92
-
93
- ## Why This is Complex for DeepBoner
94
-
95
- ### Constraint 1: ChatInterface Returns Strings
96
- Our `research_agent()` yields plain strings:
97
- ```python
98
- yield "🧠 **Backend**: {backend_name}\n\n"
99
- yield "⏳ **Processing...** Searching PubMed...\n"
100
- yield "\n\n".join(response_parts)
101
- ```
102
-
103
- Converting to `gr.ChatMessage` objects would require refactoring the entire response pipeline.
104
-
105
- ### Constraint 2: Agent-Framework is the Bottleneck
106
- The 2-minute gap is inside `workflow.run_stream(task)`, which is the `agent-framework` library. We can't inject yields into a third-party library's blocking call.
107
-
108
- ### Constraint 3: ChatInterface vs Blocks
109
- `gr.ChatInterface` is a convenience wrapper. The full `gr.ChatMessage` metadata features work best with raw `gr.Blocks` + `gr.Chatbot` components.
110
-
111
- ---
112
-
113
- ## Options
114
-
115
- ### Option A: Yield "Thinking" Before Blocking Call (Recommended)
116
- **Effort:** 5 minutes
117
- **Impact:** Users see *something* while waiting
118
-
119
- ```python
120
- # In src/orchestrator_magentic.py
121
- async def run(self, query: str) -> AsyncGenerator[AgentEvent, None]:
122
- yield AgentEvent(type="started", message=f"Starting research (Magentic mode): {query}")
123
-
124
- # NEW: Yield thinking state before the blocking call
125
- yield AgentEvent(
126
- type="thinking", # New event type
127
- message="🧠 Agents are reasoning... This may take 2-5 minutes for complex queries.",
128
- iteration=0,
129
- )
130
-
131
- # ... rest of setup ...
132
-
133
- async for event in workflow.run_stream(task):
134
- yield self._process_event(event)
135
- ```
136
-
137
- **Pros:**
138
- - Simple, doesn't require Gradio changes
139
- - Works with current string-based approach
140
- - Sets user expectations ("2-5 minutes")
141
-
142
- **Cons:**
143
- - No spinner/animation (static text)
144
- - Doesn't show real-time progress during the gap
145
-
146
- ### Option B: Use `gr.ChatMessage` with Metadata (Major Refactor)
147
- **Effort:** 2-4 hours
148
- **Impact:** Full gold-standard UX
149
-
150
- Would require:
151
- 1. Changing `research_agent()` to yield `gr.ChatMessage` objects
152
- 2. Adding thinking states with `metadata={"status": "pending"}`
153
- 3. Updating all event handlers to produce proper ChatMessage objects
154
-
155
- ### Option C: Heartbeat/Polling (Over-Engineering)
156
- **Effort:** 4+ hours
157
- **Impact:** Spinner during blocking call
158
-
159
- Create a background task that yields "still working..." every 10 seconds while waiting for the agent-framework. Requires:
160
- - `asyncio.create_task()` for heartbeat
161
- - Task cancellation when real events arrive
162
- - Proper cleanup
163
-
164
- **Verdict:** Over-engineering for a demo.
165
-
166
- ### Option D: Accept the Limitation (Document It)
167
- **Effort:** 0
168
- **Impact:** None (users still confused)
169
-
170
- Just document that Advanced mode takes 2-5 minutes and users should wait.
171
-
172
- ---
173
-
174
- ## Recommendation
175
-
176
- **Implement Option A** - Add a "thinking" yield before the blocking call.
177
-
178
- It's:
179
- 1. Minimal code change (5 minutes)
180
- 2. Sets user expectations clearly
181
- 3. Doesn't require Gradio refactoring
182
- 4. Better than silence
183
-
184
- ---
185
-
186
- ## Implementation Plan
187
-
188
- ### Step 1: Add "thinking" Event Type
189
- ```python
190
- # In src/utils/models.py
191
- class AgentEvent(BaseModel):
192
- type: Literal[
193
- "started", "thinking", "searching", ... # Add "thinking"
194
- ]
195
- ```
196
-
197
- ### Step 2: Yield Thinking Event in Magentic Orchestrator
198
- ```python
199
- # In src/orchestrator_magentic.py, run() method
200
- yield AgentEvent(
201
- type="thinking",
202
- message="🧠 Multi-agent reasoning in progress... This may take 2-5 minutes.",
203
- iteration=0,
204
- )
205
- ```
206
-
207
- ### Step 3: Handle in App
208
- ```python
209
- # In src/app.py, research_agent()
210
- if event.type == "thinking":
211
- yield f"⏳ {event.message}"
212
- ```
213
-
214
- ---
215
-
216
- ## Test Plan
217
-
218
- - [ ] Add `"thinking"` to AgentEvent type literals
219
- - [ ] Add yield before `workflow.run_stream()`
220
- - [ ] Handle in app.py
221
- - [ ] `make check` passes
222
- - [ ] Manual test: Advanced mode shows "reasoning in progress" message
223
- - [ ] Deploy to HuggingFace, verify UX improvement
224
-
225
- ---
226
-
227
- ## References
228
-
229
- - [Gradio ChatInterface Docs](https://www.gradio.app/docs/gradio/chatinterface)
230
- - [Gradio Chatbot Metadata](https://www.gradio.app/docs/gradio/chatbot)
231
- - [Agents and Tool Usage Guide](https://www.gradio.app/guides/agents-and-tool-usage)
232
- - [GitHub Issue: Streaming text not working](https://github.com/gradio-app/gradio/issues/11443)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
docs/bugs/SENIOR_AGENT_AUDIT_PROMPT.md DELETED
@@ -1,247 +0,0 @@
1
- # Senior Agent Audit Request: DeepBoner Codebase Bug Hunt
2
-
3
- **Date**: 2025-11-28
4
- **Requesting Agent**: Claude (Opus)
5
- **Purpose**: Comprehensive bug audit and verification of P0_CRITICAL_BUGS.md
6
-
7
- ---
8
-
9
- ## Your Mission
10
-
11
- You are a senior software engineer performing a comprehensive audit of the DeepBoner codebase. Your goals:
12
-
13
- 1. **VERIFY** the 4 bugs documented in `docs/bugs/P0_CRITICAL_BUGS.md` are accurately described
14
- 2. **FIND** any additional bugs (P0-P4) that could affect the demo
15
- 3. **TRACE** the complete code paths for Simple and Advanced modes
16
- 4. **IDENTIFY** any silent failures, race conditions, or edge cases
17
-
18
- ---
19
-
20
- ## Context: What DeepBoner Does
21
-
22
- DeepBoner is a Gradio-based biomedical research agent that:
23
- 1. Takes a research question from user
24
- 2. Searches PubMed, ClinicalTrials.gov, Europe PMC
25
- 3. Uses an LLM "judge" to evaluate if evidence is sufficient
26
- 4. Either loops for more evidence or synthesizes a final report
27
-
28
- **Two Modes**:
29
- - **Simple**: Linear orchestrator with search → judge → report loop
30
- - **Advanced**: Magentic multi-agent with SearchAgent, JudgeAgent, HypothesisAgent, ReportAgent
31
-
32
- **Three Backend Options**:
33
- - Free tier: HuggingFace Inference API (Llama/Mistral)
34
- - OpenAI: User-provided or env var key
35
- - Anthropic: User-provided or env var key (Simple mode only)
36
-
37
- ---
38
-
39
- ## Files to Audit (Priority Order)
40
-
41
- ### Critical Path Files:
42
- 1. `src/app.py` - Gradio UI, entry point, key routing
43
- 2. `src/orchestrator.py` - Simple mode main loop
44
- 3. `src/orchestrator_factory.py` - Mode selection and orchestrator creation
45
- 4. `src/orchestrator_magentic.py` - Advanced mode implementation
46
- 5. `src/services/embeddings.py` - Deduplication singleton (KNOWN BUG)
47
- 6. `src/agent_factory/judges.py` - LLM judge handlers (HF, OpenAI, Anthropic)
48
-
49
- ### Supporting Files:
50
- 7. `src/tools/search_handler.py` - Parallel search orchestration
51
- 8. `src/tools/pubmed.py` - PubMed API integration
52
- 9. `src/tools/clinicaltrials.py` - ClinicalTrials.gov API
53
- 10. `src/tools/europepmc.py` - Europe PMC API
54
- 11. `src/agents/magentic_agents.py` - Agent factories (KNOWN BUG: hardcoded env key)
55
- 12. `src/utils/config.py` - Settings and configuration
56
- 13. `src/utils/models.py` - Data models (Evidence, Citation, etc.)
57
-
58
- ---
59
-
60
- ## Known Bugs to Verify
61
-
62
- ### Bug 1: Free Tier LLM Quota Exhausted
63
- **Claim**: HuggingFace Inference returns 402, all 3 fallback models fail
64
- **Verify**:
65
- - Check `src/agent_factory/judges.py` class `HFInferenceJudgeHandler`
66
- - Trace the fallback chain: Llama → Mistral → Zephyr
67
- - Confirm what happens when ALL fail (does it return default "continue"?)
68
- - Check if the error message reaches the user or is swallowed
69
-
70
- ### Bug 2: Evidence Counter Shows 0 After Dedup
71
- **Claim**: `_deduplicate_and_rank()` can return empty list, losing all evidence
72
- **Verify**:
73
- - Check `src/orchestrator.py` lines 97-114 and 219
74
- - Trace what happens if `embeddings.deduplicate()` returns `[]`
75
- - Is there defensive handling? Does exception handler catch this?
76
- - Could this be a race condition in async code?
77
-
78
- ### Bug 3: API Key Not Passed to Advanced Mode
79
- **Claim**: User's API key from Gradio is never passed to MagenticOrchestrator
80
- **Verify**:
81
- - Trace: `app.py:research_agent()` → `configure_orchestrator()` → `orchestrator_factory.py`
82
- - Check if `user_api_key` is passed to `create_orchestrator()`
83
- - Check if `MagenticOrchestrator.__init__()` receives a key
84
- - Check `src/agents/magentic_agents.py` - do agents use `settings.openai_api_key`?
85
-
86
- ### Bug 4: Singleton EmbeddingService Cross-Session Pollution
87
- **Claim**: ChromaDB collection persists across requests, causing false duplicates
88
- **Verify**:
89
- - Check `src/services/embeddings.py` singleton pattern
90
- - Is `_embedding_service` ever reset?
91
- - What happens to ChromaDB collection between Gradio requests?
92
- - Could this cause "Found 20 new sources (0 total)"?
93
-
94
- ---
95
-
96
- ## Additional Bug Categories to Search For
97
-
98
- ### A. Error Handling Gaps
99
- - [ ] Silent `except: pass` blocks
100
- - [ ] Exceptions logged but not re-raised
101
- - [ ] Missing error messages to user
102
- - [ ] Swallowed API errors
103
-
104
- ### B. Async/Concurrency Issues
105
- - [ ] Race conditions in parallel searches
106
- - [ ] Shared mutable state across async calls
107
- - [ ] Missing `await` keywords
108
- - [ ] Event loop blocking (sync code in async context)
109
-
110
- ### C. API Integration Bugs
111
- - [ ] Missing rate limiting
112
- - [ ] Hardcoded timeouts that are too short
113
- - [ ] XML/JSON parsing failures not handled
114
- - [ ] Empty response handling
115
-
116
- ### D. State Management Issues
117
- - [ ] Global singletons that should be session-scoped
118
- - [ ] Gradio state not properly isolated between users
119
- - [ ] Memory leaks from accumulated data
120
-
121
- ### E. Configuration Bugs
122
- - [ ] Missing env var defaults
123
- - [ ] Type mismatches in settings
124
- - [ ] Hardcoded values that should be configurable
125
-
126
- ### F. UI/UX Bugs
127
- - [ ] Streaming not working properly
128
- - [ ] Progress messages misleading
129
- - [ ] Examples not matching actual functionality
130
- - [ ] Error messages not user-friendly
131
-
132
- ---
133
-
134
- ## Output Format
135
-
136
- Please produce a report with:
137
-
138
- ### 1. Verification of Known Bugs
139
- For each of the 4 bugs in P0_CRITICAL_BUGS.md:
140
- - **CONFIRMED** or **INCORRECT** or **PARTIALLY CORRECT**
141
- - Exact file:line references
142
- - Any corrections or additional details
143
-
144
- ### 2. New Bugs Found
145
- For each new bug:
146
- ```
147
- ## Bug N: [Title]
148
- **Priority**: P0/P1/P2/P3/P4
149
- **File**: path/to/file.py:line
150
- **Symptoms**: What the user sees
151
- **Root Cause**: Technical explanation
152
- **Code**:
153
- ```python
154
- # The buggy code
155
- ```
156
- **Fix**:
157
- ```python
158
- # The corrected code
159
- ```
160
- ```
161
-
162
- ### 3. Code Quality Concerns
163
- Any patterns that aren't bugs but could cause issues:
164
- - Technical debt
165
- - Missing tests for critical paths
166
- - Unclear error handling
167
-
168
- ### 4. Recommended Fix Order
169
- Prioritized list of what to fix first for a working demo.
170
-
171
- ---
172
-
173
- ## Commands to Help Your Investigation
174
-
175
- ```bash
176
- # Run the tests
177
- make check
178
-
179
- # Test search works
180
- uv run python -c "
181
- import asyncio
182
- from src.tools.pubmed import PubMedTool
183
- async def test():
184
- tool = PubMedTool()
185
- results = await tool.search('female libido', 5)
186
- print(f'Found {len(results)} results')
187
- asyncio.run(test())
188
- "
189
-
190
- # Test HF inference (will show 402 if quota exhausted)
191
- uv run python -c "
192
- from huggingface_hub import InferenceClient
193
- client = InferenceClient()
194
- try:
195
- resp = client.chat_completion(
196
- messages=[{'role': 'user', 'content': 'Hi'}],
197
- model='meta-llama/Llama-3.1-8B-Instruct',
198
- max_tokens=10
199
- )
200
- print(resp)
201
- except Exception as e:
202
- print(f'Error: {e}')
203
- "
204
-
205
- # Test full orchestrator (simple mode)
206
- uv run python -c "
207
- import asyncio
208
- from src.app import configure_orchestrator
209
- async def test():
210
- orch, backend = configure_orchestrator(use_mock=True, mode='simple')
211
- print(f'Backend: {backend}')
212
- async for event in orch.run('test query'):
213
- print(f'{event.type}: {event.message[:50] if event.message else \"\"}'[:60])
214
- asyncio.run(test())
215
- "
216
-
217
- # Check for hardcoded API keys (security)
218
- grep -r "sk-" src/ --include="*.py" | grep -v "sk-..." | grep -v "sk-ant-..."
219
-
220
- # Find all singletons
221
- grep -r "_.*: .* | None = None" src/ --include="*.py"
222
-
223
- # Find all except blocks
224
- grep -rn "except.*:" src/ --include="*.py" | head -50
225
- ```
226
-
227
- ---
228
-
229
- ## Important Notes
230
-
231
- 1. **DO NOT fix bugs** - just document them
232
- 2. **Be thorough** - check edge cases and error paths
233
- 3. **Be specific** - include file:line references
234
- 4. **Be skeptical** - verify claims in P0_CRITICAL_BUGS.md independently
235
- 5. **Think like a user** - what would break the demo experience?
236
-
237
- The hackathon deadline is approaching. We need a working demo. Your audit will determine what gets fixed first.
238
-
239
- ---
240
-
241
- ## Deliverable
242
-
243
- A comprehensive markdown report that:
244
- 1. Confirms or corrects the 4 known bugs
245
- 2. Lists any new bugs found (with priority)
246
- 3. Recommends the optimal fix order
247
- 4. Can be saved as `docs/bugs/SENIOR_AUDIT_RESULTS.md`
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
docs/bugs/SENIOR_AUDIT_RESULTS.md DELETED
@@ -1,84 +0,0 @@
- # Senior Agent Audit Results: DeepBoner Codebase
-
- **Date**: 2025-11-28
- **Auditor**: Claude (Senior Software Engineer)
- **Status**: COMPLETE
-
- ---
-
- ## Executive Summary
-
- The DeepBoner codebase has **4 critical defects** that render the demo non-functional for most users. The most severe is a **data leak** where the vector database persists across user sessions, causing search result corruption and potential privacy issues. Additionally, the "Advanced" mode ignores user-provided API keys, and the "Free Tier" mode fails silently when quotas are exhausted.
-
- **Recommendation**: Immediate remediation of P0 bugs is required before hackathon submission.
-
- ---
-
- ## 1. Verification of Known Bugs (P0_CRITICAL_BUGS.md)
-
- | Bug | Claim | Verification Status | Notes |
- | :--- | :--- | :--- | :--- |
- | **Bug 1** | Free Tier LLM Quota Exhausted | **CONFIRMED** | `HFInferenceJudgeHandler` catches errors but returns a fallback assessment with `recommendation="continue"`. This causes the orchestrator to loop uselessly until `max_iterations` is reached. The user sees no error message. |
- | **Bug 2** | Evidence Counter Shows 0 | **CONFIRMED** | Directly caused by Bug 4. Deduplication logic works correctly *in isolation*, but fails because the underlying ChromaDB collection is polluted with stale data from previous sessions. |
- | **Bug 3** | API Key Not Passed to Advanced | **CONFIRMED** | `create_orchestrator` in `orchestrator_factory.py` ignores the user's API key. `MagenticOrchestrator` and its agents fall back to `settings.openai_api_key` (env var), which is empty for BYOK users. |
- | **Bug 4** | Singleton EmbeddingService | **CONFIRMED** | `EmbeddingService` is a global singleton with an in-memory ChromaDB. The collection is never cleared. Data leaks between sessions, causing valid new results to be marked as duplicates of old results. |
-
- ---
-
- ## 2. New Bugs Found
-
- ### Bug 5: Search Error Swallowing (P2)
- **File**: `src/orchestrator.py` / `src/tools/search_handler.py`
- **Symptoms**: If all search tools fail (e.g., network issue, API limit), the UI shows "Found 0 sources" without explaining why.
- **Root Cause**: `SearchHandler` captures exceptions and returns them in an `errors` list, but `Orchestrator` only logs them to the console (`logger.warning`) and proceeds with empty evidence.
- **Fix**: Yield an `AgentEvent(type="error")` or include errors in the `search_complete` event message.
-
- ### Bug 6: Hardcoded Model Names (P3)
- **File**: `src/agent_factory/judges.py`
- **Symptoms**: Maintenance burden.
- **Root Cause**: Model names like `meta-llama/Llama-3.1-8B-Instruct` are hardcoded in the class `HFInferenceJudgeHandler` rather than pulled from `config.py`.
- **Fix**: Move to `Settings`.
-
- ---
-
- ## 3. Code Quality Concerns
-
- 1. **Singleton Abuse**: The `_embedding_service` global in `src/services/embeddings.py` is a major architectural flaw for a multi-user web app (even a demo). It should be scoped to the `Orchestrator` instance.
- 2. **Inconsistent Factory Signatures**: `create_orchestrator` does not accept `api_key`, forcing hacks or reliance on global env vars.
- 3. **Silent Failures**: The pervasive use of `try...except Exception` with only logging (no user feedback) makes debugging difficult for end-users.
-
- ---
-
- ## 4. Recommended Fix Order
-
- ### Step 1: Fix the Data Leak (Bug 4 & 2)
- **Why**: Prevents result corruption and cross-user data leakage.
- **Plan**:
- 1. Remove singleton pattern from `src/services/embeddings.py`.
- 2. Make `EmbeddingService` an instance variable of `Orchestrator`.
- 3. Initialize a fresh `EmbeddingService` (and ChromaDB collection) for each `run()`.
-
- ### Step 2: Fix Advanced Mode BYOK (Bug 3)
- **Why**: Enables the core "Advanced" feature for judges/users.
- **Plan**:
- 1. Update `create_orchestrator` signature to accept `api_key`.
- 2. Update `MagenticOrchestrator` to accept `api_key`.
- 3. Update `configure_orchestrator` in `app.py` to pass the key.
- 4. Ensure `MagenticOrchestrator` constructs `OpenAIChatClient` with the user's key.
-
- ### Step 3: Fix Free Tier Experience (Bug 1)
- **Why**: Ensures a usable fallback for those without keys.
- **Plan**:
- 1. In `HFInferenceJudgeHandler`, detect 402/429 errors.
- 2. If caught, return a `JudgeAssessment` that triggers a "Complete" event with a clear error message, rather than "Continue".
- 3. Add `HF_TOKEN` to the deployment environment if possible.
-
- ---
-
- ## Verification Plan
-
- After applying fixes, run:
- 1. **Unit Tests**: `make check`
- 2. **Manual Test (Simple)**: Run without key, verify 402 error is handled OR works if token added.
- 3. **Manual Test (Advanced)**: Run with OpenAI key, verify it proceeds past initialization.
- 4. **Manual Test (Dedup)**: Run same query twice. Second run should find same number of results (not 0).