VibecoderMcSwaggins committed on
Commit
7f11675
·
1 Parent(s): d36ce3c

docs: clean up resolved bug reports, update P3 commit hash

Browse files

Delete 13 obsolete bug docs (all resolved):
- FIX_PLAN_*.md (superseded by implementations)
- INVESTIGATION_*.md (completed)
- P0_*, P1_*, P2_* (all fixed)
- SENIOR_AGENT_*.md (one-time prompts)

Update ACTIVE_BUGS.md:
- P3 commit hash: (Pending) → d36ce3c
- Remove broken link to deleted P1_GRADIO_SETTINGS_CLEANUP.md

Bug index now shows all bugs resolved. Zero active bugs.

docs/bugs/ACTIVE_BUGS.md CHANGED
@@ -11,7 +11,7 @@
 ## Resolved Bugs
 
 ### ~~P3 - Magentic Mode Missing Termination Guarantee~~ FIXED
-**Commit**: `(Pending)` (2025-11-29)
+**Commit**: `d36ce3c` (2025-11-29)
 
 - Added `final_event_received` tracking in `orchestrator_magentic.py`
 - Added fallback yield for "max iterations reached" scenario
@@ -40,7 +40,6 @@
 - Users now see feedback during 2-5 minute initial processing
 
 ### ~~P1 - Gradio Settings Accordion~~ WONTFIX
-**File**: [P1_GRADIO_SETTINGS_CLEANUP.md](./P1_GRADIO_SETTINGS_CLEANUP.md)
 
 Decision: Removed nested Blocks, using ChatInterface directly.
 Accordion behavior is default Gradio - acceptable for demo.
docs/bugs/FIX_PLAN_CRITICAL_BUGS.md DELETED
@@ -1,36 +0,0 @@
-# Fix Plan: Critical Bugs (P0)
-
-**Date**: 2025-11-28
-**Status**: COMPLETED (2025-11-29)
-**Based on**: `docs/bugs/SENIOR_AUDIT_RESULTS.md`
-
----
-
-## Summary of Fixes
-
-### 1. Fixed Data Leak (Bug 4 & 2)
-- **Action**: Removed singleton `_embedding_service` in `src/services/embeddings.py`.
-- **Action**: Updated `EmbeddingService.__init__` to use a unique collection name (`evidence_{uuid}`) for complete isolation per instance.
-- **Action**: Refactored `SentenceTransformer` loading to a shared global to maintain performance while isolating state.
-- **Verified**: Unit tests passed, including new isolation verification.
-
-### 2. Fixed Advanced Mode BYOK (Bug 3)
-- **Action**: Updated `create_orchestrator` in `src/orchestrator_factory.py` to accept `api_key`.
-- **Action**: Updated `MagenticOrchestrator` to accept and use the `api_key` for the manager and agents.
-- **Action**: Updated `src/app.py` to pass the user's API key during orchestrator configuration.
-- **Verified**: `test_dual_mode_e2e.py` passed.
-
-### 3. Fixed Free Tier Experience (Bug 1)
-- **Action**: Updated `HFInferenceJudgeHandler` in `src/agent_factory/judges.py` to catch 402 (Payment Required) errors.
-- **Action**: Added logic to return a "synthesize" assessment with a clear error message when quota is exhausted, stopping the infinite loop.
-- **Verified**: Unit tests passed.
-
----
-
-## Verification
-
-All changes have been verified with:
-- `make check` (lint, typecheck, test) - ALL PASSED
-- Custom reproduction script for isolation - PASSED
-
-The system is now stable for the hackathon demo.
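The isolation fix in item 1 can be illustrated outside the codebase. This is a minimal sketch, not the actual `src/services/embeddings.py`: the `EmbeddingService` name and the `evidence_{uuid}` convention come from the plan above, while the stand-in model object replaces the real `SentenceTransformer` and ChromaDB client.

```python
# Sketch of the pattern: no singleton service, a unique collection name per
# instance, and a lazily loaded model shared across instances.
import uuid

_model = None  # shared, loaded once per process


def _get_model():
    """Lazy global cache (stands in for loading a SentenceTransformer)."""
    global _model
    if _model is None:
        _model = object()  # placeholder for the real model load
    return _model


class EmbeddingService:
    def __init__(self) -> None:
        # Unique per-instance collection name -> no cross-request pollution
        self.collection_name = f"evidence_{uuid.uuid4().hex}"
        self.model = _get_model()


a, b = EmbeddingService(), EmbeddingService()
assert a.collection_name != b.collection_name  # isolated state
assert a.model is b.model                      # shared model instance
```

The point of the split is that state (the collection) is per-request while the expensive asset (the model) stays cached.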
docs/bugs/FIX_PLAN_MAGENTIC_MODE.md DELETED
@@ -1,227 +0,0 @@
-# Fix Plan: Magentic Mode Report Generation
-
-**Related Bug**: `P0_MAGENTIC_MODE_BROKEN.md`
-**Approach**: Test-Driven Development (TDD)
-**Estimated Scope**: 4 tasks, ~2-3 hours
-
----
-
-## Problem Summary
-
-Magentic mode runs but fails to produce readable reports due to:
-
-1. **Primary Bug**: `MagenticFinalResultEvent.message` returns `ChatMessage` object, not text
-2. **Secondary Bug**: Max rounds (3) reached before ReportAgent completes
-3. **Tertiary Issues**: Stale "bioRxiv" references in prompts
-
----
-
-## Fix Order (TDD)
-
-### Phase 1: Write Failing Tests
-
-**Task 1.1**: Create test for ChatMessage text extraction
-
-```python
-# tests/unit/test_orchestrator_magentic.py
-
-def test_process_event_extracts_text_from_chat_message():
-    """Final result event should extract text from ChatMessage object."""
-    # Arrange: Mock ChatMessage with .content attribute
-    # Act: Call _process_event with MagenticFinalResultEvent
-    # Assert: Returned AgentEvent.message is a string, not object repr
-```
-
-**Task 1.2**: Create test for max rounds configuration
-
-```python
-def test_orchestrator_uses_configured_max_rounds():
-    """MagenticOrchestrator should use max_rounds from constructor."""
-    # Arrange: Create orchestrator with max_rounds=10
-    # Act: Build workflow
-    # Assert: Workflow has max_round_count=10
-```
-
-**Task 1.3**: Create test for bioRxiv reference removal
-
-```python
-def test_task_prompt_references_europe_pmc():
-    """Task prompt should reference Europe PMC, not bioRxiv."""
-    # Arrange: Create orchestrator
-    # Act: Check task string in run()
-    # Assert: Contains "Europe PMC", not "bioRxiv"
-```
-
----
-
-### Phase 2: Fix ChatMessage Text Extraction
-
-**File**: `src/orchestrator_magentic.py`
-**Lines**: 192-199
-
-**Current Code**:
-```python
-elif isinstance(event, MagenticFinalResultEvent):
-    text = event.message.text if event.message else "No result"
-```
-
-**Fixed Code**:
-```python
-elif isinstance(event, MagenticFinalResultEvent):
-    if event.message:
-        # ChatMessage may have .content or .text depending on version
-        if hasattr(event.message, 'content') and event.message.content:
-            text = str(event.message.content)
-        elif hasattr(event.message, 'text') and event.message.text:
-            text = str(event.message.text)
-        else:
-            # Fallback: convert entire message to string
-            text = str(event.message)
-    else:
-        text = "No result generated"
-```
-
-**Why**: The `agent_framework.ChatMessage` object structure may vary. We need defensive extraction.
-
----
-
-### Phase 3: Fix Max Rounds Configuration
-
-**File**: `src/orchestrator_magentic.py`
-**Lines**: 97-99
-
-**Current Code**:
-```python
-.with_standard_manager(
-    chat_client=manager_client,
-    max_round_count=self._max_rounds,  # Already uses config
-    max_stall_count=3,
-    max_reset_count=2,
-)
-```
-
-**Issue**: Default `max_rounds` in `__init__` is 10, but workflow may need more for complex queries.
-
-**Fix**: Verify the value flows through correctly. Add logging.
-
-```python
-logger.info(
-    "Building Magentic workflow",
-    max_rounds=self._max_rounds,
-    max_stall=3,
-    max_reset=2,
-)
-```
-
-**Also check**: `src/orchestrator_factory.py` passes config correctly:
-```python
-return MagenticOrchestrator(
-    max_rounds=config.max_iterations if config else 10,
-)
-```
-
----
-
-### Phase 4: Fix Stale bioRxiv References
-
-**Files to update**:
-
-| File | Line | Change |
-|------|------|--------|
-| `src/orchestrator_magentic.py` | 131 | "bioRxiv" → "Europe PMC" |
-| `src/agents/magentic_agents.py` | 32-33 | "bioRxiv" → "Europe PMC" |
-| `src/app.py` | 202-203 | "bioRxiv" → "Europe PMC" |
-
-**Search command to verify**:
-```bash
-grep -rn "bioRxiv\|biorxiv" src/
-```
-
----
-
-## Implementation Checklist
-
-```
-[ ] Phase 1: Write failing tests
-    [ ] 1.1 Test ChatMessage text extraction
-    [ ] 1.2 Test max rounds configuration
-    [ ] 1.3 Test Europe PMC references
-
-[ ] Phase 2: Fix ChatMessage extraction
-    [ ] Update _process_event() in orchestrator_magentic.py
-    [ ] Run test 1.1 - should pass
-
-[ ] Phase 3: Fix max rounds
-    [ ] Add logging to _build_workflow()
-    [ ] Verify factory passes config correctly
-    [ ] Run test 1.2 - should pass
-
-[ ] Phase 4: Fix bioRxiv references
-    [ ] Update orchestrator_magentic.py task prompt
-    [ ] Update magentic_agents.py descriptions
-    [ ] Update app.py UI text
-    [ ] Run test 1.3 - should pass
-    [ ] Run grep to verify no remaining refs
-
-[ ] Final Verification
-    [ ] make check passes
-    [ ] All tests pass (108+)
-    [ ] Manual test: run_magentic.py produces readable report
-```
-
----
-
-## Test Commands
-
-```bash
-# Run specific test file
-uv run pytest tests/unit/test_orchestrator_magentic.py -v
-
-# Run all tests
-uv run pytest tests/unit/ -v
-
-# Full check
-make check
-
-# Manual integration test
-set -a && source .env && set +a
-uv run python examples/orchestrator_demo/run_magentic.py "metformin alzheimer"
-```
-
----
-
-## Success Criteria
-
-1. `run_magentic.py` outputs a readable research report (not `<ChatMessage object>`)
-2. Report includes: Executive Summary, Key Findings, Drug Candidates, References
-3. No "Max round count reached" error with default settings
-4. No "bioRxiv" references anywhere in codebase
-5. All 108+ tests pass
-6. `make check` passes
-
----
-
-## Files Modified
-
-```
-src/
-├── orchestrator_magentic.py       # ChatMessage fix, logging
-├── agents/magentic_agents.py      # bioRxiv → Europe PMC
-└── app.py                         # bioRxiv → Europe PMC
-
-tests/unit/
-└── test_orchestrator_magentic.py  # NEW: 3 tests
-```
-
----
-
-## Notes for AI Agent
-
-When implementing this fix plan:
-
-1. **DO NOT** create mock data or fake responses
-2. **DO** write real tests that verify actual behavior
-3. **DO** run `make check` after each phase
-4. **DO** test with real OpenAI API key via `.env`
-5. **DO** preserve existing functionality - simple mode must still work
-6. **DO NOT** over-engineer - minimal changes to fix the specific bugs
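The Phase 2 extraction logic in the deleted plan can be lifted into a standalone helper and exercised without `agent_framework`. `extract_text` is a hypothetical name for this sketch, and `SimpleNamespace` stands in for the real `ChatMessage` object:

```python
# Defensive text extraction: prefer .content, fall back to .text, then str().
from types import SimpleNamespace


def extract_text(message) -> str:
    if message is None:
        return "No result generated"
    content = getattr(message, "content", None)
    if content:
        return str(content)
    text = getattr(message, "text", None)
    if text:
        return str(text)
    # Last resort: the object's string representation
    return str(message)


assert extract_text(SimpleNamespace(content="report body")) == "report body"
assert extract_text(SimpleNamespace(content=None, text="fallback")) == "fallback"
assert extract_text(None) == "No result generated"
```

Because each attribute is checked for truthiness, an empty `.content` falls through to `.text` rather than returning an empty report.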
docs/bugs/FIX_UI_SIMPLIFICATION.md DELETED
@@ -1,314 +0,0 @@
-# UI Simplification: Remove API Provider Dropdown
-
-**Issues**: #52, #53
-**Priority**: P1 - UX improvement for hackathon demo
-**Estimated Time**: 30 minutes
-**Senior Review**: ✅ Approved with changes (incorporated below)
-
----
-
-## Problem
-
-The current UI has confusing BYOK (Bring Your Own Key) settings:
-
-1. **Provider dropdown is misleading** - Shows "openai" but actually uses free tier when no key
-2. **Examples table shows useless columns** - API Key (empty), Provider (ignored)
-3. **Anthropic doesn't work with Advanced mode** - Only OpenAI has `agent-framework` support
-
-## Solution
-
-Remove `api_provider` dropdown entirely. Auto-detect provider from key prefix.
-
-**Functionality preserved:**
-- Simple mode: Free tier, OpenAI, OR Anthropic (all work)
-- Advanced mode: OpenAI only (Magentic multi-agent requires `OpenAIChatClient`)
-
----
-
-## Implementation
-
-### File: `src/app.py`
-
-#### Change 1: Update `configure_orchestrator()` signature (lines 23-28)
-
-```python
-# BEFORE
-def configure_orchestrator(
-    use_mock: bool = False,
-    mode: str = "simple",
-    user_api_key: str | None = None,
-    api_provider: str = "openai",  # ← REMOVE
-) -> tuple[Any, str]:
-
-# AFTER
-def configure_orchestrator(
-    use_mock: bool = False,
-    mode: str = "simple",
-    user_api_key: str | None = None,
-) -> tuple[Any, str]:
-```
-
-#### Change 2: Update docstring (lines 29-40)
-
-```python
-# AFTER
-"""
-Create an orchestrator instance.
-
-Args:
-    use_mock: If True, use MockJudgeHandler (no API key needed)
-    mode: Orchestrator mode ("simple" or "advanced")
-    user_api_key: Optional user-provided API key (BYOK) - auto-detects provider
-
-Returns:
-    Tuple of (Orchestrator instance, backend_name)
-"""
-```
-
-#### Change 3: Replace provider logic with auto-detection (lines 62-88)
-
-```python
-# BEFORE (lines 62-88) - complex provider checking with api_provider param
-
-# AFTER - auto-detect from key prefix
-# 2. Paid API Key (User provided or Env)
-elif user_api_key and user_api_key.strip():
-    # Auto-detect provider from key prefix
-    model: AnthropicModel | OpenAIModel
-    if user_api_key.startswith("sk-ant-"):
-        # Anthropic key
-        anthropic_provider = AnthropicProvider(api_key=user_api_key)
-        model = AnthropicModel(settings.anthropic_model, provider=anthropic_provider)
-        backend_info = "Paid API (Anthropic)"
-    elif user_api_key.startswith("sk-"):
-        # OpenAI key
-        openai_provider = OpenAIProvider(api_key=user_api_key)
-        model = OpenAIModel(settings.openai_model, provider=openai_provider)
-        backend_info = "Paid API (OpenAI)"
-    else:
-        raise ValueError(
-            "Invalid API key format. Expected sk-... (OpenAI) or sk-ant-... (Anthropic)"
-        )
-    judge_handler = JudgeHandler(model=model)
-
-# 3. Environment API Keys (fallback)
-elif os.getenv("OPENAI_API_KEY"):
-    judge_handler = JudgeHandler(model=None)  # Uses env key
-    backend_info = "Paid API (OpenAI from env)"
-
-elif os.getenv("ANTHROPIC_API_KEY"):
-    judge_handler = JudgeHandler(model=None)  # Uses env key
-    backend_info = "Paid API (Anthropic from env)"
-
-# 4. Free Tier (HuggingFace Inference)
-else:
-    judge_handler = HFInferenceJudgeHandler()
-    backend_info = "Free Tier (Llama 3.1 / Mistral)"
-```
-
-#### Change 4: Update `research_agent()` signature (lines 105-111)
-
-```python
-# BEFORE
-async def research_agent(
-    message: str,
-    history: list[dict[str, Any]],
-    mode: str = "simple",
-    api_key: str = "",
-    api_provider: str = "openai",  # ← REMOVE
-) -> AsyncGenerator[str, None]:
-
-# AFTER
-async def research_agent(
-    message: str,
-    history: list[dict[str, Any]],
-    mode: str = "simple",
-    api_key: str = "",
-) -> AsyncGenerator[str, None]:
-```
-
-#### Change 5: Update docstring (lines 112-124)
-
-```python
-# AFTER
-"""
-Gradio chat function that runs the research agent.
-
-Args:
-    message: User's research question
-    history: Chat history (Gradio format)
-    mode: Orchestrator mode ("simple" or "advanced")
-    api_key: Optional user-provided API key (BYOK - auto-detects provider)
-
-Yields:
-    Markdown-formatted responses for streaming
-"""
-```
-
-#### Change 6: Fix Advanced mode check (line 139)
-
-```python
-# BEFORE
-if mode == "advanced" and not (has_openai or (has_user_key and api_provider == "openai")):
-
-# AFTER - auto-detect OpenAI key from prefix
-is_openai_user_key = user_api_key and user_api_key.startswith("sk-") and not user_api_key.startswith("sk-ant-")
-if mode == "advanced" and not (has_openai or is_openai_user_key):
-    yield (
-        "⚠️ **Advanced mode requires OpenAI API key.** "
-        "Anthropic keys only work in Simple mode. Falling back to Simple.\n\n"
-    )
-    mode = "simple"
-```
-
-#### Change 7: Remove premature "Using your key" message (lines 146-151)
-
-```python
-# BEFORE - uses api_provider which no longer exists
-if has_user_key:
-    yield (
-        f"🔑 **Using your {api_provider.upper()} API key** - "
-        "Your key is used only for this session and is never stored.\n\n"
-    )
-
-# AFTER - remove this block entirely
-# The backend_name from configure_orchestrator already shows "Paid API (OpenAI)" or "Paid API (Anthropic)"
-# No need for duplicate messaging
-```
-
-#### Change 8: Update configure_orchestrator call (lines 165-170)
-
-```python
-# BEFORE
-orchestrator, backend_name = configure_orchestrator(
-    use_mock=False,
-    mode=mode,
-    user_api_key=user_api_key,
-    api_provider=api_provider,  # ← REMOVE
-)
-
-# AFTER
-orchestrator, backend_name = configure_orchestrator(
-    use_mock=False,
-    mode=mode,
-    user_api_key=user_api_key,
-)
-```
-
-#### Change 9: Simplify examples (lines 210-229)
-
-```python
-# BEFORE - 4 items per example
-examples=[
-    ["What drugs improve female libido post-menopause?", "simple", "", "openai"],
-    ["Clinical trials for erectile dysfunction alternatives to PDE5 inhibitors?", "simple", "", "openai"],
-    ["Evidence for testosterone therapy in women with HSDD?", "simple", "", "openai"],
-],
-
-# AFTER - 2 items per example (query, mode) - API key always empty in examples
-examples=[
-    ["What drugs improve female libido post-menopause?", "simple"],
-    ["Clinical trials for ED alternatives to PDE5 inhibitors?", "simple"],
-    ["Evidence for testosterone therapy in women with HSDD?", "simple"],
-],
-```
-
-#### Change 10: Update additional_inputs (lines 231-252)
-
-```python
-# BEFORE - 3 inputs (mode, api_key, api_provider)
-additional_inputs=[
-    gr.Radio(
-        choices=["simple", "advanced"],
-        value="simple",
-        label="Orchestrator Mode",
-        info="Simple: Linear (Free Tier Friendly) | Advanced: Multi-Agent (Requires OpenAI)",
-    ),
-    gr.Textbox(
-        label="🔑 API Key (Optional - BYOK)",
-        placeholder="sk-... or sk-ant-...",
-        type="password",
-        info="Enter your own API key. Never stored.",
-    ),
-    gr.Radio(  # ← REMOVE THIS ENTIRE BLOCK
-        choices=["openai", "anthropic"],
-        value="openai",
-        label="API Provider",
-        info="Select the provider for your API key",
-    ),
-],
-
-# AFTER - 2 inputs (mode, api_key)
-additional_inputs=[
-    gr.Radio(
-        choices=["simple", "advanced"],
-        value="simple",
-        label="Orchestrator Mode",
-        info="Simple: Works with any key or free tier | Advanced: Requires OpenAI key",
-    ),
-    gr.Textbox(
-        label="🔑 API Key (Optional)",
-        placeholder="sk-... (OpenAI) or sk-ant-... (Anthropic)",
-        type="password",
-        info="Leave empty for free tier. Auto-detects provider from key prefix.",
-    ),
-],
-```
-
-#### Change 11: Update accordion label (line 230)
-
-```python
-# BEFORE
-additional_inputs_accordion=gr.Accordion(label="⚙️ Settings", open=False),
-
-# AFTER
-additional_inputs_accordion=gr.Accordion(label="⚙️ Settings (Free tier works without API key)", open=False),
-```
-
----
-
-## Testing Checklist
-
-### Manual Tests
-- [ ] **No key**: Shows "Free Tier (Llama 3.1 / Mistral)" in backend
-- [ ] **OpenAI key (sk-...)**: Shows "Paid API (OpenAI)" in backend
-- [ ] **Anthropic key (sk-ant-...)**: Shows "Paid API (Anthropic)" in backend
-- [ ] **Invalid key format**: Shows error message
-- [ ] **Anthropic key + Advanced mode**: Falls back to Simple with warning
-- [ ] **OpenAI key + Advanced mode**: Uses full Magentic multi-agent
-- [ ] **Examples table**: Shows only 2 columns (query, mode)
-- [ ] **MCP server**: Still accessible at `/gradio_api/mcp/`
-
-### Unit Test Updates
-- [ ] `tests/unit/test_app_smoke.py` - may need update if checking input count
-
----
-
-## Definition of Done
-
-- [ ] `api_provider` parameter removed from `configure_orchestrator()`
-- [ ] `api_provider` parameter removed from `research_agent()`
-- [ ] Auto-detection logic works for `sk-` and `sk-ant-` prefixes
-- [ ] Advanced mode check uses auto-detection (not removed param)
-- [ ] "Using your X key" message removed (backend_name handles this)
-- [ ] Examples table shows 2 columns
-- [ ] Accordion label updated
-- [ ] Placeholder text shows both key formats
-- [ ] All existing tests pass
-- [ ] MCP server still works
-
----
-
-## Mode Compatibility Matrix (Unchanged)
-
-| Mode | No Key | OpenAI Key | Anthropic Key |
-|------|--------|------------|---------------|
-| **Simple** | ✅ Free tier | ✅ GPT-5.1 | ✅ Claude Sonnet 4.5 |
-| **Advanced** | ⚠️ Falls back | ✅ Full Magentic | ⚠️ Falls back to Simple |
-
----
-
-## Related
-- Issue #52: UI Polish - Examples table confusion
-- Issue #53: API Provider Simplification
-- Senior Review: Approved 2025-11-28
docs/bugs/INVESTIGATION_INVALID_MODELS.md DELETED
@@ -1,31 +0,0 @@
-# Bug Investigation: Invalid Default LLM Models
-
-## Status
-- **Date:** 2025-11-29
-- **Reporter:** CLI User
-- **Component:** `src/utils/config.py`
-- **Priority:** High (Magentic Mode Blocker)
-- **Resolution:** FIXED
-
-## Issue Description
-The user encountered a 403 error when running in Magentic mode:
-`Error code: 403 - {'error': {'message': 'Project ... does not have access to model gpt-5', ... 'code': 'model_not_found'}}`
-
-## Root Cause Analysis
-OpenAI deprecated the base `gpt-5` model. Tier 5 accounts now have access to:
-- `gpt-5.1` (current flagship)
-- `gpt-5-mini`
-- `gpt-5-nano`
-- `gpt-4.1`, `gpt-4.1-mini`, `gpt-4.1-nano`
-- `o3`, `o4-mini`
-
-The base `gpt-5` is NO LONGER available via API.
-
-## Solution Implemented
-Updated `src/utils/config.py` to use:
-- `openai_model`: `gpt-5.1` (the actual current model)
-- `anthropic_model`: `claude-sonnet-4-5-20250929` (unchanged)
-
-## Verification
-- `tests/unit/agent_factory/test_judges_factory.py` updated and passed.
-- User confirmed Tier 5 access to `gpt-5.1` via OpenAI dashboard.
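For reference, the updated defaults in the deleted investigation can be expressed as a minimal settings sketch; the real `src/utils/config.py` may use a different mechanism (e.g. pydantic settings), so this dataclass is only an illustration of the two values it changed:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    # Defaults after the fix: gpt-5 was replaced because the base model
    # is no longer available via the API.
    openai_model: str = "gpt-5.1"
    anthropic_model: str = "claude-sonnet-4-5-20250929"


settings = Settings()
assert settings.openai_model == "gpt-5.1"
```

Pinning defaults in one place like this keeps a `model_not_found` regression to a single-line fix.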
docs/bugs/INVESTIGATION_QUOTA_BLOCKER.md DELETED
@@ -1,49 +0,0 @@
-# Bug Investigation: HF Free Tier Quota Exhaustion
-
-## Status
-- **Date:** 2025-11-29
-- **Reporter:** CLI User
-- **Component:** `HFInferenceJudgeHandler`
-- **Priority:** High (UX Blocker for Free Tier)
-- **Resolution:** FIXED
-
-## Issue Description
-On a fresh run with a simple query ("What drugs improve female libido post-menopause?"), the system retrieved 20 valid sources but failed during the Judge/Analysis phase with:
-`⚠️ Free Tier Quota Exceeded ⚠️`
-
-This results in a "Synthesis" step that has 0 candidates and 0 findings, rendering the application useless for free users once the (very low) limit is hit, despite having valid search results.
-
-## Evidence
-Output provided:
-```text
-### Citations (20 sources)
-...
-### Reasoning
-⚠️ **Free Tier Quota Exceeded** ⚠️
-```
-
-## Root Cause Analysis
-1. **Search Success:** `SearchAgent` correctly found 20 documents (PubMed/EuropePMC).
-2. **Judge Failure:** `HFInferenceJudgeHandler` called the HF Inference API.
-3. **Quota Trap:** The API returned a 402 (Payment Required) or Quota error.
-4. **Previous Handling:** The handler caught this error and returned a `JudgeAssessment` with `sufficient=True` (to stop the loop) and *empty* fields.
-5. **Data Loss:** The 20 valid search results were effectively discarded from the "Analysis" perspective.
-
-## The "Deep Blocker"
-The system had a "hard failure" mode for quota exhaustion, assuming that if the LLM can't judge, we have *no* useful information. This "bricked" the UX for free users immediately upon hitting the limit.
-
-## Solution Implemented
-Modified `HFInferenceJudgeHandler._create_quota_exhausted_assessment` to:
-1. Accept the `evidence` list as an argument.
-2. Perform basic heuristic extraction (borrowed from `MockJudgeHandler` logic):
-   - Use titles as "Key Findings" (first 5 sources).
-   - Add a clear message in "Drug Candidates" telling the user to upgrade.
-3. Return this "Partial" assessment instead of an empty one.
-
-## Verification
-- Created `tests/unit/agent_factory/test_judges_hf_quota.py` to verify that:
-  - 402 errors are caught.
-  - `sufficient` is set to `True` (stops loop).
-  - `key_findings` are populated from search result titles.
-  - `reasoning` contains the warning message.
-- Ran existing tests `tests/unit/agent_factory/test_judges_hf.py` - All passed.
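The partial-assessment behavior described under Solution Implemented can be sketched as a plain function. The dict shape and the `quota_exhausted_assessment` name here are illustrative, not the real `JudgeAssessment` model; the point is that evidence titles survive even when the LLM judge is unavailable:

```python
def quota_exhausted_assessment(evidence: list[dict]) -> dict:
    """Heuristic fallback when the HF Inference API returns 402 (quota)."""
    return {
        "sufficient": True,  # stop the search/judge loop, do not retry forever
        "key_findings": [e["title"] for e in evidence[:5]],  # titles of first 5 sources
        "drug_candidates": ["(Upgrade to a paid API key for full LLM analysis)"],
        "reasoning": "⚠️ Free Tier Quota Exceeded ⚠️ Showing partial results.",
    }


sources = [{"title": f"Paper {i}"} for i in range(20)]
result = quota_exhausted_assessment(sources)
assert result["sufficient"] is True
assert len(result["key_findings"]) == 5
```

This turns a hard failure into degraded-but-useful output: the 20 retrieved sources are no longer discarded.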
docs/bugs/P0_CRITICAL_BUGS.md DELETED
@@ -1,43 +0,0 @@
-# P0 Critical Bugs - DeepBoner Demo Broken
-
-**Date**: 2025-11-28
-**Status**: RESOLVED (2025-11-29)
-**Priority**: P0 - Blocking hackathon submission
-
----
-
-## Summary
-
-The Gradio demo was non-functional due to 4 critical bugs. All have been fixed and verified.
-
----
-
-## Bug 1: Free Tier LLM Quota Exhausted (P0) - FIXED
-
-**Resolution**:
-- Implemented `QuotaExhaustedError` detection in `HFInferenceJudgeHandler`.
-- The agent now gracefully stops and displays a clear "Free Tier Quota Exceeded" message instead of looping infinitely.
-
-## Bug 2: Evidence Counter Shows 0 After Dedup (P1) - FIXED
-
-**Resolution**:
-- Fixed by resolving Bug 4 (Data Leak). Deduplication now works correctly on isolated per-request collections.
-
-## Bug 3: API Key Not Passed to Advanced Mode (P0) - FIXED
-
-**Resolution**:
-- Plumbed `api_key` from the UI through `configure_orchestrator` -> `create_orchestrator` -> `MagenticOrchestrator`.
-- Magentic agents now correctly use the user-provided OpenAI key.
-
-## Bug 4: Singleton EmbeddingService Causes Cross-Session Pollution (P0) - FIXED
-
-**Resolution**:
-- Removed the singleton pattern for `EmbeddingService`.
-- Each request now gets a fresh `EmbeddingService` with a unique, isolated ChromaDB collection (`evidence_{uuid}`).
-- `SentenceTransformer` model is lazily cached globally to maintain performance.
-
----
-
-## Verification
-
-Run `make check` to verify all tests pass.
docs/bugs/P0_GRADIO_EXAMPLE_CACHING_CRASH.md DELETED
@@ -1,134 +0,0 @@
-# P0 Bug Report: Gradio Example Caching Crash
-
-## Status
-- **Date:** 2025-11-29
-- **Priority:** P0 CRITICAL (Production Down)
-- **Component:** `src/app.py:131`
-- **Environment:** HuggingFace Spaces (Python 3.11, Gradio)
-
-## Error Message
-
-```text
-AttributeError: 'NoneType' object has no attribute 'strip'
-```
-
-## Full Stack Trace
-
-```text
-File "/app/src/app.py", line 131, in research_agent
-    user_api_key = (api_key.strip() or api_key_state.strip()) or None
-                    ^^^^^^^^^^^^^
-AttributeError: 'NoneType' object has no attribute 'strip'
-```
-
-## Root Cause Analysis
-
-### The Trigger
-Gradio's example caching mechanism runs the `research_agent` function during startup to pre-cache example outputs. This happens at:
-
-```text
-File "/usr/local/lib/python3.11/site-packages/gradio/helpers.py", line 509, in _start_caching
-    await self.cache()
-```
-
-### The Problem
-Our examples only provide values for 2 of the 4 function parameters:
-
-```python
-examples=[
-    ["What is the evidence for testosterone therapy in women with HSDD?", "simple"],
-    ["Promising drug candidates for endometriosis pain management", "simple"],
-]
-```
-
-These map to `[message, mode]` but **NOT** to `api_key` or `api_key_state`.
-
-When Gradio runs the function for caching, it passes `None` for the unprovided parameters:
-
-```python
-async def research_agent(
-    message: str,            # ✅ Provided by example
-    history: list[...],      # ✅ Empty list default
-    mode: str = "simple",    # ✅ Provided by example
-    api_key: str = "",       # ❌ Becomes None during caching!
-    api_key_state: str = ""  # ❌ Becomes None during caching!
-) -> AsyncGenerator[...]:
-```
-
-### The Crash
-Line 131 attempts to call `.strip()` on `None`:
-
-```python
-user_api_key = (api_key.strip() or api_key_state.strip()) or None
-#               ^^^^^^^^^^^^^
-#               NoneType has no attribute 'strip'
-```
-
-## Gradio Warning (Ignored)
-
-Gradio actually warned us about this:
-
-```text
-UserWarning: Examples will be cached but not all input components have
-example values. This may result in an exception being thrown by your function.
-```
-
-## Solution
-
-### Option A: Defensive None Handling (Recommended)
-Add None guards before calling `.strip()`:
-
-```python
-# Handle None values from Gradio example caching
-api_key_str = api_key or ""
-api_key_state_str = api_key_state or ""
-user_api_key = (api_key_str.strip() or api_key_state_str.strip()) or None
-```
-
-### Option B: Disable Example Caching
-Set `cache_examples=False` in ChatInterface:
-
-```python
-gr.ChatInterface(
-    fn=research_agent,
-    examples=[...],
-    cache_examples=False,  # Disable caching
-)
-```
-
-This avoids the crash but loses the UX benefit of pre-cached examples.
-
-### Option C: Provide Full Example Values
-Include all 4 columns in examples:
-
-```python
-examples=[
-    ["What is the evidence...", "simple", "", ""],  # [msg, mode, api_key, state]
-]
-```
-
-This is verbose and exposes internal state to users.
-
-## Recommendation
-
-**Option A** is the cleanest fix. It:
-1. Maintains cached examples for fast UX
-2. Handles edge cases defensively
-3. Doesn't expose internal state in examples
-
-## Pre-Merge Checklist
-
-- [ ] Fix applied to `src/app.py`
-- [ ] Unit test added for None parameter handling
-- [ ] `make check` passes
-- [ ] Test locally with `uv run python -m src.app`
-- [ ] Verify example caching works without crash
-- [ ] Deploy to HuggingFace Spaces
-- [ ] Verify Space starts without error
-
-## Lessons Learned
-
-1. Always test Gradio apps with example caching enabled locally before deploying
-2. Gradio's "partial examples" feature passes `None` for missing columns
-3. Default parameter values (`str = ""`) are ignored when Gradio explicitly passes `None`
-4. The Gradio warning about missing example values should be treated as an error
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
docs/bugs/P1_GRADIO_SETTINGS_CLEANUP.md DELETED
@@ -1,81 +0,0 @@
1
- # P1 Bug: Gradio Settings Accordion Not Collapsing
2
-
3
- **Priority**: P1 (UX Bug)
4
- **Status**: OPEN
5
- **Date**: 2025-11-27
6
- **Target Component**: `src/app.py`
7
-
8
- ---
9
-
10
- ## 1. Problem Description
11
-
12
- The "Settings" accordion in the Gradio UI (containing Orchestrator Mode, API Key, Provider) fails to collapse, even when configured with `open=False`. It remains permanently expanded, cluttering the interface and obscuring the chat history.
13
-
14
- ### Symptoms
15
- - Accordion arrow toggles visually, but content remains visible.
16
- - Occurs in both local development (`uv run src/app.py`) and HuggingFace Spaces.
17
-
18
- ---
19
-
20
- ## 2. Root Cause Analysis
21
-
22
- **Definitive Cause**: Nested `Blocks` Context Bug.
23
- `gr.ChatInterface` is itself a high-level abstraction that creates a `gr.Blocks` context. Wrapping `gr.ChatInterface` inside an external `with gr.Blocks():` context causes event listener conflicts, specifically breaking the JavaScript state management for `additional_inputs_accordion`.
24
-
25
- **Reference**: [Gradio Issue #8861](https://github.com/gradio-app/gradio/issues/8861) confirms that `additional_inputs_accordion` malfunctions when `ChatInterface` is not the top-level block.
26
-
27
- ---
28
-
29
- ## 3. Solution Strategy: "The Unwrap Fix"
30
-
31
- We will remove the redundant `gr.Blocks` wrapper. This restores the native behavior of `ChatInterface`, ensuring the accordion respects `open=False`.
32
-
33
- ### Implementation Plan
34
-
35
- **Refactor `src/app.py` / `create_demo()`**:
36
-
37
- 1. **Remove** the `with gr.Blocks() as demo:` context manager.
38
- 2. **Instantiate** `gr.ChatInterface` directly as the `demo` object.
39
- 3. **Migrate UI Elements**:
40
- * **Header**: Move the H1/Title text into the `title` parameter of `ChatInterface`.
41
- * **Footer**: Move the footer text ("MCP Server Active...") into the `description` parameter. `ChatInterface` supports Markdown in `description`, making it the ideal place for static info below the title but above the chat.
42
-
43
- ### Before (Buggy)
44
- ```python
45
- def create_demo():
46
- with gr.Blocks() as demo: # <--- CAUSE OF BUG
47
- gr.Markdown("# Title")
48
- gr.ChatInterface(..., additional_inputs_accordion=gr.Accordion(open=False))
49
- gr.Markdown("Footer")
50
- return demo
51
- ```
52
-
53
- ### After (Correct)
54
- ```python
55
- def create_demo():
56
- return gr.ChatInterface( # <--- FIX: Top-level component
57
- ...,
58
- title="🧬 DeepBoner",
59
- description="*AI-Powered Drug Repurposing Agent...*\n\n---\n**MCP Server Active**...",
60
- additional_inputs_accordion=gr.Accordion(label="⚙️ Settings", open=False)
61
- )
62
- ```
63
-
64
- ---
65
-
66
- ## 4. Validation
67
-
68
- 1. **Run**: `uv run python src/app.py`
69
- 2. **Check**: Open `http://localhost:7860`
70
- 3. **Verify**:
71
- * Settings accordion starts **COLLAPSED**.
72
- * Header title ("DeepBoner") is visible.
73
- * Footer text ("MCP Server Active") is visible in the description area.
74
- * Chat functionality works (Magentic/Simple modes).
75
-
76
- ---
77
-
78
- ## 5. Constraints & Notes
79
-
80
- - **Layout**: We lose the ability to place arbitrary elements *below* the chat box (footer will move to top, under title), but this is an acceptable trade-off for a working UI.
81
- - **CSS**: `ChatInterface` handles its own CSS; any custom class styling from the previous footer will be standardized to the description text style.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
docs/bugs/P1_MAGENTIC_STREAMING_AND_KEY_PERSISTENCE.md DELETED
@@ -1,181 +0,0 @@
1
- # Bug Report: Magentic Mode Integration Issues
2
-
3
- ## Status
4
- - **Date:** 2025-11-29
5
- - **Reporter:** CLI User
6
- - **Priority:** P1 (UX Degradation + Deprecation Warnings)
7
- - **Component:** `src/app.py`, `src/orchestrator_magentic.py`, `src/utils/llm_factory.py`
8
- - **Status:** ✅ FIXED (Bug 1 & Bug 2) - 2025-11-29
9
- - **Tests:** 138 passing (136 original + 2 new validation tests)
10
-
11
- ---
12
-
13
- ## Bug 1: Token-by-Token Streaming Spam ✅ FIXED
14
-
15
- ### Symptoms
16
- When running Magentic (Advanced) mode, the UI shows hundreds of individual lines like:
17
- ```text
18
- 📡 STREAMING: Below
19
- 📡 STREAMING: is
20
- 📡 STREAMING: a
21
- 📡 STREAMING: curated
22
- 📡 STREAMING: list
23
- ...
24
- ```
25
-
26
- Each token is displayed as a separate streaming event, creating visual spam and making it impossible to read the output until completion.
27
-
28
- ### Root Cause (VALIDATED)
29
- **File:** `src/orchestrator_magentic.py:247-254`
30
-
31
- ```python
32
- elif isinstance(event, MagenticAgentDeltaEvent):
33
- if event.text:
34
- return AgentEvent(
35
- type="streaming",
36
- message=event.text, # Single token!
37
- data={"agent_id": event.agent_id},
38
- iteration=iteration,
39
- )
40
- ```
41
-
42
- Every LLM token emits a `MagenticAgentDeltaEvent`, which creates an `AgentEvent(type="streaming")`.
43
-
44
- **File:** `src/app.py:171-192` (BEFORE FIX)
45
-
46
- ```python
47
- async for event in orchestrator.run(message):
48
- event_md = event.to_markdown()
49
- response_parts.append(event_md) # Appends EVERY token
50
-
51
- if event.type == "complete":
52
- yield event.message
53
- else:
54
- yield "\n\n".join(response_parts) # Yields ALL accumulated tokens
55
- ```
56
-
57
- For N tokens, this yields N times, each time showing all previous tokens. This is O(N²) string operations and creates massive visual spam.
58
-
59
- ### Fix Applied
60
- **File:** `src/app.py:175-204`
61
-
62
- Implemented streaming token buffering with live updates:
63
- 1. Added `streaming_buffer = ""` to accumulate tokens
64
- 2. For each streaming event: append to buffer, yield immediately (for live typing UX)
65
- 3. **Key fix**: Don't append streaming events to `response_parts` (prevents O(N²) list growth)
66
- 4. Each yield has only ONE `📡 STREAMING:` line (the accumulated buffer)
67
- 5. Flush buffer to `response_parts` only when non-streaming event occurs
68
-
69
- **Result**: Live typing feel preserved, but no visual spam (each update replaces, not accumulates)
70
-
71
- ### Proposed Fix Options
72
-
73
- **Option A: Buffer streaming tokens (recommended)**
74
- ```python
75
- # In app.py - accumulate streaming tokens, yield periodically
76
- streaming_buffer = ""
77
- last_yield_time = time.time()
78
-
79
- async for event in orchestrator.run(message):
80
- if event.type == "streaming":
81
- streaming_buffer += event.message
82
- # Only yield every 500ms or on newline
83
- if time.time() - last_yield_time > 0.5 or "\n" in event.message:
84
- yield f"📡 {streaming_buffer}"
85
- last_yield_time = time.time()
86
- elif event.type == "complete":
87
- yield event.message
88
- else:
89
- # Non-streaming events
90
- response_parts.append(event.to_markdown())
91
- yield "\n\n".join(response_parts)
92
- ```
93
-
94
- **Option B: Don't yield streaming events at all**
95
- ```python
96
- # In app.py - only yield meaningful events
97
- async for event in orchestrator.run(message):
98
- if event.type == "streaming":
99
- continue # Skip token-by-token spam
100
- # ... rest of logic
101
- ```
102
-
103
- **Option C: Fix at orchestrator level**
104
- Don't emit `AgentEvent` for every delta - buffer in `_process_event`.
105
-
106
- ---
107
-
108
- ## Bug 2: API Key Does Not Persist in Textbox ✅ FIXED
109
-
110
- ### Symptoms
111
- 1. User opens the "Mode & API Key" accordion
112
- 2. User pastes their API key into the password textbox
113
- 3. User clicks an example OR clicks elsewhere
114
- 4. The API key textbox is now empty - value lost
115
-
116
- ### Root Cause (VALIDATED)
117
- **File:** `src/app.py:255-267` (BEFORE FIX)
118
-
119
- ```python
120
- additional_inputs_accordion=additional_inputs_accordion,
121
- additional_inputs=[
122
- gr.Radio(...),
123
- gr.Textbox(
124
- label="🔑 API Key (Optional)",
125
- type="password",
126
- # No `value` parameter - defaults to empty
127
- # No state persistence mechanism
128
- ),
129
- ],
130
- ```
131
-
132
- Gradio's `ChatInterface` with `additional_inputs` has known issues:
133
- 1. Clicking examples resets additional inputs to defaults
134
- 2. The accordion state and input values may not persist correctly
135
- 3. No explicit state management for the API key
136
-
137
- ### Fix Applied
138
- **Files Modified:**
139
- 1. `src/app.py`
140
- 2. `src/utils/llm_factory.py`
141
-
142
- **Bug 1 (Streaming Spam):**
143
- - Accumulate tokens in `streaming_buffer`
144
- - Yield updates immediately for live typing UX
145
- - **Key**: Don't append to `response_parts` until stream segment complete
146
- - Each yield has ONE `📡 STREAMING:` line (not N accumulated lines)
147
-
148
- **Bug 2 (API Key Persistence):**
149
- - **Strategy:** Partial example list (relies on Gradio behavior)
150
- - Examples have only 2 elements `[message, mode]` instead of 4
151
- - Gradio only updates inputs with corresponding example values
152
- - Remaining inputs (api_key textbox) are left unchanged
153
- - `api_key_state` parameter exists as fallback but may be redundant
154
- - **Note:** This is a workaround relying on undocumented Gradio behavior
155
-
156
- **Bug 3 (OpenAIModel Deprecation):** ✅ FIXED
157
- - Replaced all `OpenAIModel` imports with `OpenAIChatModel` in `src/app.py` and `src/utils/llm_factory.py`.
158
-
159
- ### Test Results
160
- ```bash
161
- uv run pytest tests/ -q
162
- ============================= 138 passed in 20.60s =============================
163
- ```
164
-
165
- **Status:** ✅ All tests passing
166
-
167
- ### Why This Fix Works
168
-
169
- **Bug 1 (Streaming Spam):**
170
- - **Before:** Every token → `append()` to list → `yield` → List grew to size N → O(N²) complexity.
171
- - **After:** Every token → `yield` dynamically constructed string (buffer + history) → List stays size K (number of *events*).
172
- - **Impact:** Smooth streaming, no visual spam, no browser freeze.
173
-
174
- **Bug 2 (API Key):**
175
- - **Before:** Example click → Overwrote API Key textbox with `""`.
176
- - **After:** Example click → Updates only `message` and `mode` → API Key textbox untouched.
177
- - **Impact:** User input persists naturally.
178
-
179
- ### Remaining Work
180
- - **Bug 4 (Asyncio GC errors):** Monitoring only - likely Gradio/HF Spaces issue
181
-
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
docs/bugs/P1_MULTIPLE_UX_BUGS.md DELETED
@@ -1,49 +0,0 @@
1
- # P1 Bug Report: Multiple UX and Configuration Issues
2
-
3
- ## Status
4
- - **Date:** 2025-11-29
5
- - **Priority:** P1 (Multiple user-facing issues)
6
- - **Components:** `src/app.py`, `src/orchestrator_magentic.py`
7
-
8
- ## Resolved Issues (Fixed 2025-11-29)
9
-
10
- ### Bug 1: API Key Cleared When Clicking Examples
11
- **Fixed.** Updated `examples` in `app.py` to include explicit `None` values for additional inputs. Gradio preserves values when the example value is `None`.
12
-
13
- ### Bug 2: No Loading/Processing Indicator
14
- **Fixed.** `research_agent` yields an immediate "⏳ Processing..." message before starting the orchestrator.
15
-
16
- ### Bug 3: Advanced Mode Temperature Error
17
- **Fixed.** Explicitly set `temperature=1.0` for all Magentic agents in `src/agents/magentic_agents.py`. This is compatible with OpenAI reasoning models (o1/o3) which require `temperature=1` and were rejecting the default (likely 0.3 or None).
18
-
19
- ### Bug 4: HSDD Acronym Not Spelled Out
20
- **Fixed.** Updated example text in `app.py` to "HSDD (Hypoactive Sexual Desire Disorder)".
21
-
22
- ---
23
-
24
- ## Open / Deferred Issues
25
-
26
- ### Bug 5: Free Tier Quota Exhausted (UX Improvement)
27
- **Deferred.** Currently shows standard error message. Improve if users report confusion.
28
-
29
- ### Bug 6: Asyncio File Descriptor Warnings
30
- **Won't Fix.** Cosmetic issue only.
31
-
32
- ---
33
-
34
- ## Priority Order (Completed)
35
-
36
- 1. **Bug 4 (HSDD)** - Fixed
37
- 2. **Bug 2 (Loading indicator)** - Fixed
38
- 3. **Bug 3 (Temperature)** - Fixed
39
- 4. **Bug 1 (API key)** - Fixed
40
-
41
- ---
42
-
43
- ## Test Plan
44
- - [x] Fix HSDD acronym
45
- - [x] Add loading indicator yield
46
- - [x] Test advanced mode with temperature fix (Static analysis/Code change)
47
- - [x] Research Gradio example behavior for API key (Implemented None fix)
48
- - [ ] Run `make check`
49
- - [ ] Deploy and test on HuggingFace Spaces
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
docs/bugs/P2_MAGENTIC_THINKING_STATE.md DELETED
@@ -1,232 +0,0 @@
1
- # P2 Bug Report: Advanced Mode Missing "Thinking" State
2
-
3
- ## Status
4
- - **Date:** 2025-11-29
5
- - **Priority:** P2 (UX polish, not blocking functionality)
6
- - **Component:** `src/orchestrator_magentic.py`, `src/app.py`
7
-
8
- ---
9
-
10
- ## Symptoms
11
-
12
- User experience in **Advanced (Magentic) mode**:
13
- 1. Click example or submit query
14
- 2. See: `🚀 **STARTED**: Starting research (Magentic mode)...`
15
- 3. **2+ minutes of nothing** (no spinner, no progress, no indication work is happening)
16
- 4. Eventually see: `🧠 **JUDGING**: Manager (user_task)...`
17
-
18
- **User perception:** "Is it frozen? Did it crash?"
19
-
20
- ### Container Logs Confirm Work IS Happening
21
- ```
22
- 14:54:22 [info] Starting Magentic orchestrator query='...'
23
- 14:54:22 [info] Embedding service enabled
24
- ... 2+ MINUTES OF SILENCE (agent-framework doing internal LLM calls) ...
25
- 14:56:38 [info] Creating orchestrator mode=advanced
26
- ```
27
-
28
- The silence is because `workflow.run_stream()` doesn't yield events during its setup phase.
29
-
30
- ---
31
-
32
- ## Root Cause Analysis
33
-
34
- ### Current Flow (`src/orchestrator_magentic.py`)
35
- ```python
36
- async def run(self, query: str) -> AsyncGenerator[AgentEvent, None]:
37
- # 1. Immediately yields "started"
38
- yield AgentEvent(type="started", message=f"Starting research (Magentic mode): {query}")
39
-
40
- # 2. Setup (fast, no yield needed)
41
- embedding_service = self._init_embedding_service()
42
- init_magentic_state(embedding_service)
43
- workflow = self._build_workflow()
44
-
45
- # 3. GAP: workflow.run_stream() blocks for 2+ minutes before first event
46
- async for event in workflow.run_stream(task): # <-- THE BOTTLENECK
47
- yield self._process_event(event)
48
- ```
49
-
50
- The `agent-framework`'s `workflow.run_stream()` is calling OpenAI's API, building the manager prompt, coordinating agents, etc. **It doesn't yield events during this setup phase**.
51
-
52
- ---
53
-
54
- ## Gold Standard UX (What We'd Want)
55
-
56
- ### Gradio's Native Thinking Support
57
-
58
- Per [Gradio Chatbot Docs](https://www.gradio.app/docs/gradio/chatbot):
59
-
60
- > "The Gradio Chatbot can natively display intermediate thoughts and tool usage in a collapsible accordion next to a chat message. This makes it perfect for creating UIs for LLM agents and chain-of-thought (CoT) or reasoning demos."
61
-
62
- **Features available:**
63
- - `gr.ChatMessage` with `metadata={"status": "pending"}` shows spinner
64
- - `metadata={"title": "Thinking...", "status": "pending"}` creates collapsible accordion
65
- - Nested thoughts via `id` and `parent_id`
66
- - `duration` metadata shows time spent
67
-
68
- **Example from Gradio docs:**
69
- ```python
70
- import gradio as gr
71
-
72
- def chat_fn(message, history):
73
- # Yield thinking state with spinner
74
- yield gr.ChatMessage(
75
- role="assistant",
76
- metadata={"title": "🧠 Thinking...", "status": "pending"}
77
- )
78
-
79
- # Do work...
80
-
81
- # Update with completed thought
82
- yield gr.ChatMessage(
83
- role="assistant",
84
- content="Analysis complete",
85
- metadata={"title": "🧠 Thinking...", "status": "done", "duration": 5.2}
86
- )
87
-
88
- yield "Here's the final answer..."
89
- ```
90
-
91
- ---
92
-
93
- ## Why This is Complex for DeepBoner
94
-
95
- ### Constraint 1: ChatInterface Returns Strings
96
- Our `research_agent()` yields plain strings:
97
- ```python
98
- yield "🧠 **Backend**: {backend_name}\n\n"
99
- yield "⏳ **Processing...** Searching PubMed...\n"
100
- yield "\n\n".join(response_parts)
101
- ```
102
-
103
- Converting to `gr.ChatMessage` objects would require refactoring the entire response pipeline.
104
-
105
- ### Constraint 2: Agent-Framework is the Bottleneck
106
- The 2-minute gap is inside `workflow.run_stream(task)`, which is the `agent-framework` library. We can't inject yields into a third-party library's blocking call.
107
-
108
- ### Constraint 3: ChatInterface vs Blocks
109
- `gr.ChatInterface` is a convenience wrapper. The full `gr.ChatMessage` metadata features work best with raw `gr.Blocks` + `gr.Chatbot` components.
110
-
111
- ---
112
-
113
- ## Options
114
-
115
- ### Option A: Yield "Thinking" Before Blocking Call (Recommended)
116
- **Effort:** 5 minutes
117
- **Impact:** Users see *something* while waiting
118
-
119
- ```python
120
- # In src/orchestrator_magentic.py
121
- async def run(self, query: str) -> AsyncGenerator[AgentEvent, None]:
122
- yield AgentEvent(type="started", message=f"Starting research (Magentic mode): {query}")
123
-
124
- # NEW: Yield thinking state before the blocking call
125
- yield AgentEvent(
126
- type="thinking", # New event type
127
- message="🧠 Agents are reasoning... This may take 2-5 minutes for complex queries.",
128
- iteration=0,
129
- )
130
-
131
- # ... rest of setup ...
132
-
133
- async for event in workflow.run_stream(task):
134
- yield self._process_event(event)
135
- ```
136
-
137
- **Pros:**
138
- - Simple, doesn't require Gradio changes
139
- - Works with current string-based approach
140
- - Sets user expectations ("2-5 minutes")
141
-
142
- **Cons:**
143
- - No spinner/animation (static text)
144
- - Doesn't show real-time progress during the gap
145
-
146
- ### Option B: Use `gr.ChatMessage` with Metadata (Major Refactor)
147
- **Effort:** 2-4 hours
148
- **Impact:** Full gold-standard UX
149
-
150
- Would require:
151
- 1. Changing `research_agent()` to yield `gr.ChatMessage` objects
152
- 2. Adding thinking states with `metadata={"status": "pending"}`
153
- 3. Updating all event handlers to produce proper ChatMessage objects
154
-
155
- ### Option C: Heartbeat/Polling (Over-Engineering)
156
- **Effort:** 4+ hours
157
- **Impact:** Spinner during blocking call
158
-
159
- Create a background task that yields "still working..." every 10 seconds while waiting for the agent-framework. Requires:
160
- - `asyncio.create_task()` for heartbeat
161
- - Task cancellation when real events arrive
162
- - Proper cleanup
163
-
164
- **Verdict:** Over-engineering for a demo.
165
-
166
- ### Option D: Accept the Limitation (Document It)
167
- **Effort:** 0
168
- **Impact:** None (users still confused)
169
-
170
- Just document that Advanced mode takes 2-5 minutes and users should wait.
171
-
172
- ---
173
-
174
- ## Recommendation
175
-
176
- **Implement Option A** - Add a "thinking" yield before the blocking call.
177
-
178
- It's:
179
- 1. Minimal code change (5 minutes)
180
- 2. Sets user expectations clearly
181
- 3. Doesn't require Gradio refactoring
182
- 4. Better than silence
183
-
184
- ---
185
-
186
- ## Implementation Plan
187
-
188
- ### Step 1: Add "thinking" Event Type
189
- ```python
190
- # In src/utils/models.py
191
- class AgentEvent(BaseModel):
192
- type: Literal[
193
- "started", "thinking", "searching", ... # Add "thinking"
194
- ]
195
- ```
196
-
197
- ### Step 2: Yield Thinking Event in Magentic Orchestrator
198
- ```python
199
- # In src/orchestrator_magentic.py, run() method
200
- yield AgentEvent(
201
- type="thinking",
202
- message="🧠 Multi-agent reasoning in progress... This may take 2-5 minutes.",
203
- iteration=0,
204
- )
205
- ```
206
-
207
- ### Step 3: Handle in App
208
- ```python
209
- # In src/app.py, research_agent()
210
- if event.type == "thinking":
211
- yield f"⏳ {event.message}"
212
- ```
213
-
214
- ---
215
-
216
- ## Test Plan
217
-
218
- - [ ] Add `"thinking"` to AgentEvent type literals
219
- - [ ] Add yield before `workflow.run_stream()`
220
- - [ ] Handle in app.py
221
- - [ ] `make check` passes
222
- - [ ] Manual test: Advanced mode shows "reasoning in progress" message
223
- - [ ] Deploy to HuggingFace, verify UX improvement
224
-
225
- ---
226
-
227
- ## References
228
-
229
- - [Gradio ChatInterface Docs](https://www.gradio.app/docs/gradio/chatinterface)
230
- - [Gradio Chatbot Metadata](https://www.gradio.app/docs/gradio/chatbot)
231
- - [Agents and Tool Usage Guide](https://www.gradio.app/guides/agents-and-tool-usage)
232
- - [GitHub Issue: Streaming text not working](https://github.com/gradio-app/gradio/issues/11443)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
docs/bugs/SENIOR_AGENT_AUDIT_PROMPT.md DELETED
@@ -1,247 +0,0 @@
1
- # Senior Agent Audit Request: DeepBoner Codebase Bug Hunt
2
-
3
- **Date**: 2025-11-28
4
- **Requesting Agent**: Claude (Opus)
5
- **Purpose**: Comprehensive bug audit and verification of P0_CRITICAL_BUGS.md
6
-
7
- ---
8
-
9
- ## Your Mission
10
-
11
- You are a senior software engineer performing a comprehensive audit of the DeepBoner codebase. Your goals:
12
-
13
- 1. **VERIFY** the 4 bugs documented in `docs/bugs/P0_CRITICAL_BUGS.md` are accurately described
14
- 2. **FIND** any additional bugs (P0-P4) that could affect the demo
15
- 3. **TRACE** the complete code paths for Simple and Advanced modes
16
- 4. **IDENTIFY** any silent failures, race conditions, or edge cases
17
-
18
- ---
19
-
20
- ## Context: What DeepBoner Does
21
-
22
- DeepBoner is a Gradio-based biomedical research agent that:
23
- 1. Takes a research question from user
24
- 2. Searches PubMed, ClinicalTrials.gov, Europe PMC
25
- 3. Uses an LLM "judge" to evaluate if evidence is sufficient
26
- 4. Either loops for more evidence or synthesizes a final report
27
-
28
- **Two Modes**:
29
- - **Simple**: Linear orchestrator with search → judge → report loop
30
- - **Advanced**: Magentic multi-agent with SearchAgent, JudgeAgent, HypothesisAgent, ReportAgent
31
-
32
- **Three Backend Options**:
33
- - Free tier: HuggingFace Inference API (Llama/Mistral)
34
- - OpenAI: User-provided or env var key
35
- - Anthropic: User-provided or env var key (Simple mode only)
36
-
37
- ---
38
-
39
- ## Files to Audit (Priority Order)
40
-
41
- ### Critical Path Files:
42
- 1. `src/app.py` - Gradio UI, entry point, key routing
43
- 2. `src/orchestrator.py` - Simple mode main loop
44
- 3. `src/orchestrator_factory.py` - Mode selection and orchestrator creation
45
- 4. `src/orchestrator_magentic.py` - Advanced mode implementation
46
- 5. `src/services/embeddings.py` - Deduplication singleton (KNOWN BUG)
47
- 6. `src/agent_factory/judges.py` - LLM judge handlers (HF, OpenAI, Anthropic)
48
-
49
- ### Supporting Files:
50
- 7. `src/tools/search_handler.py` - Parallel search orchestration
51
- 8. `src/tools/pubmed.py` - PubMed API integration
52
- 9. `src/tools/clinicaltrials.py` - ClinicalTrials.gov API
53
- 10. `src/tools/europepmc.py` - Europe PMC API
54
- 11. `src/agents/magentic_agents.py` - Agent factories (KNOWN BUG: hardcoded env key)
55
- 12. `src/utils/config.py` - Settings and configuration
56
- 13. `src/utils/models.py` - Data models (Evidence, Citation, etc.)
57
-
58
- ---
59
-
60
- ## Known Bugs to Verify
61
-
62
- ### Bug 1: Free Tier LLM Quota Exhausted
63
- **Claim**: HuggingFace Inference returns 402, all 3 fallback models fail
64
- **Verify**:
65
- - Check `src/agent_factory/judges.py` class `HFInferenceJudgeHandler`
66
- - Trace the fallback chain: Llama → Mistral → Zephyr
67
- - Confirm what happens when ALL fail (does it return default "continue"?)
68
- - Check if the error message reaches the user or is swallowed
69
-
70
- ### Bug 2: Evidence Counter Shows 0 After Dedup
71
- **Claim**: `_deduplicate_and_rank()` can return empty list, losing all evidence
72
- **Verify**:
73
- - Check `src/orchestrator.py` lines 97-114 and 219
74
- - Trace what happens if `embeddings.deduplicate()` returns `[]`
75
- - Is there defensive handling? Does exception handler catch this?
76
- - Could this be a race condition in async code?
77
-
78
- ### Bug 3: API Key Not Passed to Advanced Mode
79
- **Claim**: User's API key from Gradio is never passed to MagenticOrchestrator
80
- **Verify**:
81
- - Trace: `app.py:research_agent()` → `configure_orchestrator()` → `orchestrator_factory.py`
82
- - Check if `user_api_key` is passed to `create_orchestrator()`
83
- - Check if `MagenticOrchestrator.__init__()` receives a key
84
- - Check `src/agents/magentic_agents.py` - do agents use `settings.openai_api_key`?
85
-
86
- ### Bug 4: Singleton EmbeddingService Cross-Session Pollution
87
- **Claim**: ChromaDB collection persists across requests, causing false duplicates
88
- **Verify**:
89
- - Check `src/services/embeddings.py` singleton pattern
90
- - Is `_embedding_service` ever reset?
91
- - What happens to ChromaDB collection between Gradio requests?
92
- - Could this cause "Found 20 new sources (0 total)"?
93
-
94
- ---
95
-
96
- ## Additional Bug Categories to Search For
97
-
98
- ### A. Error Handling Gaps
99
- - [ ] Silent `except: pass` blocks
100
- - [ ] Exceptions logged but not re-raised
101
- - [ ] Missing error messages to user
102
- - [ ] Swallowed API errors
103
-
104
- ### B. Async/Concurrency Issues
105
- - [ ] Race conditions in parallel searches
106
- - [ ] Shared mutable state across async calls
107
- - [ ] Missing `await` keywords
108
- - [ ] Event loop blocking (sync code in async context)
109
-
110
- ### C. API Integration Bugs
111
- - [ ] Missing rate limiting
112
- - [ ] Hardcoded timeouts that are too short
113
- - [ ] XML/JSON parsing failures not handled
114
- - [ ] Empty response handling
115
-
116
- ### D. State Management Issues
117
- - [ ] Global singletons that should be session-scoped
118
- - [ ] Gradio state not properly isolated between users
119
- - [ ] Memory leaks from accumulated data
120
-
121
- ### E. Configuration Bugs
122
- - [ ] Missing env var defaults
123
- - [ ] Type mismatches in settings
124
- - [ ] Hardcoded values that should be configurable
125
-
126
- ### F. UI/UX Bugs
127
- - [ ] Streaming not working properly
128
- - [ ] Progress messages misleading
129
- - [ ] Examples not matching actual functionality
130
- - [ ] Error messages not user-friendly
131
-
132
- ---
133
-
134
- ## Output Format
135
-
136
- Please produce a report with:
137
-
138
- ### 1. Verification of Known Bugs
139
- For each of the 4 bugs in P0_CRITICAL_BUGS.md:
140
- - **CONFIRMED** or **INCORRECT** or **PARTIALLY CORRECT**
141
- - Exact file:line references
142
- - Any corrections or additional details
143
-
144
- ### 2. New Bugs Found
145
- For each new bug:
146
- ```
147
- ## Bug N: [Title]
148
- **Priority**: P0/P1/P2/P3/P4
149
- **File**: path/to/file.py:line
150
- **Symptoms**: What the user sees
151
- **Root Cause**: Technical explanation
152
- **Code**:
153
- ```python
154
- # The buggy code
155
- ```
156
- **Fix**:
157
- ```python
158
- # The corrected code
159
- ```
160
- ```
161
-
162
- ### 3. Code Quality Concerns
163
- Any patterns that aren't bugs but could cause issues:
164
- - Technical debt
165
- - Missing tests for critical paths
166
- - Unclear error handling
167
-
168
- ### 4. Recommended Fix Order
169
- Prioritized list of what to fix first for a working demo.
170
-
171
- ---
172
-
173
- ## Commands to Help Your Investigation
174
-
175
- ```bash
176
- # Run the tests
177
- make check
178
-
179
- # Test search works
180
- uv run python -c "
181
- import asyncio
182
- from src.tools.pubmed import PubMedTool
183
- async def test():
184
- tool = PubMedTool()
185
- results = await tool.search('female libido', 5)
186
- print(f'Found {len(results)} results')
187
- asyncio.run(test())
188
- "
189
-
190
- # Test HF inference (will show 402 if quota exhausted)
191
- uv run python -c "
192
- from huggingface_hub import InferenceClient
193
- client = InferenceClient()
194
- try:
195
- resp = client.chat_completion(
196
- messages=[{'role': 'user', 'content': 'Hi'}],
197
- model='meta-llama/Llama-3.1-8B-Instruct',
198
- max_tokens=10
199
- )
200
- print(resp)
201
- except Exception as e:
202
- print(f'Error: {e}')
203
- "
204
-
205
- # Test full orchestrator (simple mode)
206
- uv run python -c "
207
- import asyncio
208
- from src.app import configure_orchestrator
209
- async def test():
210
- orch, backend = configure_orchestrator(use_mock=True, mode='simple')
211
- print(f'Backend: {backend}')
212
- async for event in orch.run('test query'):
213
- print(f'{event.type}: {event.message[:50] if event.message else \"\"}'[:60])
214
- asyncio.run(test())
215
- "
216
-
217
- # Check for hardcoded API keys (security)
218
- grep -r "sk-" src/ --include="*.py" | grep -v "sk-..." | grep -v "sk-ant-..."
219
-
220
- # Find all singletons
221
- grep -r "_.*: .* | None = None" src/ --include="*.py"
222
-
223
- # Find all except blocks
224
- grep -rn "except.*:" src/ --include="*.py" | head -50
225
- ```
226
-
227
- ---
228
-
229
- ## Important Notes
230
-
231
- 1. **DO NOT fix bugs** - just document them
232
- 2. **Be thorough** - check edge cases and error paths
233
- 3. **Be specific** - include file:line references
234
- 4. **Be skeptical** - verify claims in P0_CRITICAL_BUGS.md independently
235
- 5. **Think like a user** - what would break the demo experience?
236
-
237
- The hackathon deadline is approaching. We need a working demo. Your audit will determine what gets fixed first.
238
-
239
- ---
240
-
241
- ## Deliverable
242
-
243
- A comprehensive markdown report that:
244
- 1. Confirms or corrects the 4 known bugs
245
- 2. Lists any new bugs found (with priority)
246
- 3. Recommends the optimal fix order
247
- 4. Can be saved as `docs/bugs/SENIOR_AUDIT_RESULTS.md`
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
docs/bugs/SENIOR_AUDIT_RESULTS.md DELETED
@@ -1,84 +0,0 @@
- # Senior Agent Audit Results: DeepBoner Codebase
-
- **Date**: 2025-11-28
- **Auditor**: Claude (Senior Software Engineer)
- **Status**: COMPLETE
-
- ---
-
- ## Executive Summary
-
- The DeepBoner codebase has **4 critical defects** that render the demo non-functional for most users. The most severe is a **data leak** where the vector database persists across user sessions, causing search result corruption and potential privacy issues. Additionally, the "Advanced" mode ignores user-provided API keys, and the "Free Tier" mode fails silently when quotas are exhausted.
-
- **Recommendation**: Immediate remediation of P0 bugs is required before hackathon submission.
-
- ---
-
- ## 1. Verification of Known Bugs (P0_CRITICAL_BUGS.md)
-
- | Bug | Claim | Verification Status | Notes |
- | :--- | :--- | :--- | :--- |
- | **Bug 1** | Free Tier LLM Quota Exhausted | **CONFIRMED** | `HFInferenceJudgeHandler` catches errors but returns a fallback assessment with `recommendation="continue"`. This causes the orchestrator to loop uselessly until `max_iterations` is reached. The user sees no error message. |
- | **Bug 2** | Evidence Counter Shows 0 | **CONFIRMED** | Directly caused by Bug 4. Deduplication logic works correctly *in isolation*, but fails because the underlying ChromaDB collection is polluted with stale data from previous sessions. |
- | **Bug 3** | API Key Not Passed to Advanced | **CONFIRMED** | `create_orchestrator` in `orchestrator_factory.py` ignores the user's API key. `MagenticOrchestrator` and its agents fall back to `settings.openai_api_key` (env var), which is empty for BYOK users. |
- | **Bug 4** | Singleton EmbeddingService | **CONFIRMED** | `EmbeddingService` is a global singleton with an in-memory ChromaDB. The collection is never cleared. Data leaks between sessions, causing valid new results to be marked as duplicates of old results. |
-
- ---
-
- ## 2. New Bugs Found
-
- ### Bug 5: Search Error Swallowing (P2)
- **File**: `src/orchestrator.py` / `src/tools/search_handler.py`
- **Symptoms**: If all search tools fail (e.g., network issue, API limit), the UI shows "Found 0 sources" without explaining why.
- **Root Cause**: `SearchHandler` captures exceptions and returns them in an `errors` list, but `Orchestrator` only logs them to the console (`logger.warning`) and proceeds with empty evidence.
- **Fix**: Yield an `AgentEvent(type="error")` or include errors in the `search_complete` event message.
-
- ### Bug 6: Hardcoded Model Names (P3)
- **File**: `src/agent_factory/judges.py`
- **Symptoms**: Maintenance burden.
- **Root Cause**: Model names like `meta-llama/Llama-3.1-8B-Instruct` are hardcoded in the class `HFInferenceJudgeHandler` rather than pulled from `config.py`.
- **Fix**: Move to `Settings`.
-
- ---
-
- ## 3. Code Quality Concerns
-
- 1. **Singleton Abuse**: The `_embedding_service` global in `src/services/embeddings.py` is a major architectural flaw for a multi-user web app (even a demo). It should be scoped to the `Orchestrator` instance.
- 2. **Inconsistent Factory Signatures**: `create_orchestrator` does not accept `api_key`, forcing hacks or reliance on global env vars.
- 3. **Silent Failures**: The pervasive use of `try...except Exception` with only logging (no user feedback) makes debugging difficult for end-users.
-
- ---
-
- ## 4. Recommended Fix Order
-
- ### Step 1: Fix the Data Leak (Bug 4 & 2)
- **Why**: Prevents result corruption and cross-user data leakage.
- **Plan**:
- 1. Remove singleton pattern from `src/services/embeddings.py`.
- 2. Make `EmbeddingService` an instance variable of `Orchestrator`.
- 3. Initialize a fresh `EmbeddingService` (and ChromaDB collection) for each `run()`.
-
- ### Step 2: Fix Advanced Mode BYOK (Bug 3)
- **Why**: Enables the core "Advanced" feature for judges/users.
- **Plan**:
- 1. Update `create_orchestrator` signature to accept `api_key`.
- 2. Update `MagenticOrchestrator` to accept `api_key`.
- 3. Update `configure_orchestrator` in `app.py` to pass the key.
- 4. Ensure `MagenticOrchestrator` constructs `OpenAIChatClient` with the user's key.
-
- ### Step 3: Fix Free Tier Experience (Bug 1)
- **Why**: Ensures a usable fallback for those without keys.
- **Plan**:
- 1. In `HFInferenceJudgeHandler`, detect 402/429 errors.
- 2. If caught, return a `JudgeAssessment` that triggers a "Complete" event with a clear error message, rather than "Continue".
- 3. Add `HF_TOKEN` to the deployment environment if possible.
-
- ---
-
- ## Verification Plan
-
- After applying fixes, run:
- 1. **Unit Tests**: `make check`
- 2. **Manual Test (Simple)**: Run without key, verify 402 error is handled OR works if token added.
- 3. **Manual Test (Advanced)**: Run with OpenAI key, verify it proceeds past initialization.
- 4. **Manual Test (Dedup)**: Run same query twice. Second run should find same number of results (not 0).