feat(agent): add batch mode and model selection docs
- Add --batch flag to run all 29 genius-browser.js prompt examples
- Add --prompts-file flag to run custom prompt lists
- Extract _connect() for shared DB/client across batch runs
- run_agent() returns AgentResult dataclass with structured metrics
- run_batch() prints summary table with pass/fail, timing, eval scores
- Make _setup_logging() idempotent for repeated calls
- Document model selection by device RAM (8GB-32GB+)
- Document benchmark results: 26/29 pass on qwen3.5:9b
- Document model comparison on failure cases across 4 models
- Add TASK-309 with full test results
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
title: Test all Genius prompt examples against agent.py
status: Done
assignee: []
created_date: '2026-04-04 04:07'
updated_date: '2026-04-04 04:40'
labels:
- testing
- genius
- agent
dependencies: []
references:
- app/frontend/js/components/genius-browser.js
- scripts/agent.py
- docs/agent.md
priority: medium
---

## Description

<!-- SECTION:DESCRIPTION:BEGIN -->
Run each of the 29 prompt examples from `app/frontend/js/components/genius-browser.js` through `scripts/agent.py` using the default qwen3.5:9b model. Record which prompts succeed (valid playlist generated) and which fail (parse failure, no matches, bad output). This establishes a baseline for prompt coverage against the current library.
<!-- SECTION:DESCRIPTION:END -->
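A minimal harness for this kind of baseline run might look like the following sketch. `run_agent` here is a stand-in for the real entry point in `scripts/agent.py` (which the commit describes as returning a structured result); the success check is simplified to truthiness.

```python
# Feed each example prompt to the agent and tally pass/fail.
# `run_agent` is a placeholder for the real agent entry point.

def run_baseline(prompts, run_agent):
    """Run every prompt and return (passes, failures) as (index, prompt) lists."""
    passes, failures = [], []
    for i, prompt in enumerate(prompts, start=1):
        try:
            result = run_agent(prompt)
            ok = bool(result)  # a valid playlist was generated
        except Exception:
            ok = False  # parse failure / bad output counts as a fail
        (passes if ok else failures).append((i, prompt))
    return passes, failures
```

Run against the 29 genius-browser prompts, a tally like this is what produces the 26/29 pass figure reported below.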

## Acceptance Criteria
<!-- AC:BEGIN -->
- [x] #1 All 29 prompt examples tested against agent.py with qwen3.5:9b
| 28 | a party mix from what I already have | PASS | 12 | 11 | 2/5 | 2 | 2 | 2 | 2.00 |
| 29 | blues and classic rock deep cuts | FAIL | - | - | - | - | - | - | - |

### Failure Analysis

All 3 failures share the same root cause: **model dumps 50-100+ track IDs then tries to self-correct multiple times**, never producing a clean single-line `Playlist:` / `Tracks:` output. The `**Playlist:**` markdown bold formatting also breaks the parser.

- **#14 "tracks with heavy bass lines"** — model found many matching tracks, dumped all IDs, then looped trying to reduce the list
- **#23 "melancholy but beautiful tracks"** — same pattern, 130+ IDs dumped, repeated failed attempts to curate
- **#29 "blues and classic rock deep cuts"** — model listed tracks in prose format instead of using the Playlist:/Tracks: format
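One way to harden the parser against the `**Playlist:**` bold formatting noted above is to strip markdown emphasis before matching the expected lines. This is a sketch of the idea, not the shipped parser:

```python
import re


def parse_playlist(text):
    """Extract the playlist name and track IDs, tolerating **bold** markers.

    Returns (name, track_ids) or None when the expected
    Playlist:/Tracks: lines are absent (e.g. prose output).
    """
    cleaned = re.sub(r"\*\*(.+?)\*\*", r"\1", text)  # drop markdown bold
    name_m = re.search(r"^Playlist:\s*(.+)$", cleaned, re.MULTILINE)
    tracks_m = re.search(r"^Tracks:\s*(.+)$", cleaned, re.MULTILINE)
    if not (name_m and tracks_m):
        return None  # prose-style output (failure #29) still fails cleanly
    ids = [t.strip() for t in tracks_m.group(1).split(",") if t.strip()]
    return name_m.group(1).strip(), ids
```

Stripping emphasis would only fix the `**Playlist:**` symptom; the dumped-ID loops in #14 and #23 are a model-behavior problem that no parser change can recover.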
### Library Coverage Gaps (Concept score = 0)

- **#10 "jazz and soul from the 60s and 70s"** — library has no jazz/soul, model fell back to post-punk from that era
- **#24 "hip-hop and R&B from the 90s"** — library has no hip-hop/R&B, model fell back to 90s alternative/indie
### Low Variety Scores

- **#3 "post-punk artists"** — 25 tracks from only 7 artists (too many per artist)
- **#6 "deep cuts"** — 25 tracks from 6 artists
- **#15 "most played this year"** — Variety=0, 25 tracks from 16 artists (judge was harsh)
- **#26 "instrumental tracks"** — 25 tracks from 9 artists
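The variety complaints above reduce to a tracks-per-artist ratio. A score along these lines could be computed as follows; the judge's actual rubric is not shown here, so this is only an assumed shape:

```python
from collections import Counter


def artist_variety(artists):
    """Unique artists divided by track count: 1.0 = all distinct, near 0 = monotone."""
    if not artists:
        return 0.0
    counts = Counter(artists)
    return len(counts) / len(artists)
```

Under this measure, #3's 25 tracks from 7 artists scores 7/25 = 0.28, and #15's 16 artists over 25 tracks scores 0.64, which makes the judge's Variety=0 on #15 look harsh, as noted.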
### Raw output saved to `/tmp/genius_prompt_results/`