Commit a203129

feat(agent): add batch mode and model selection docs
- Add `--batch` flag to run all 29 genius-browser.js prompt examples
- Add `--prompts-file` flag to run custom prompt lists
- Extract `_connect()` for shared DB/client across batch runs
- `run_agent()` returns `AgentResult` dataclass with structured metrics
- `run_batch()` prints summary table with pass/fail, timing, eval scores
- Make `_setup_logging()` idempotent for repeated calls
- Document model selection by device RAM (8GB-32GB+)
- Document benchmark results: 26/29 pass on qwen3.5:9b
- Document model comparison on failure cases across 4 models
- Add TASK-309 with full test results

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 8604bd6 commit a203129

File tree

3 files changed
+477 -39 lines changed
Lines changed: 93 additions & 0 deletions
---
id: TASK-309
title: Test all Genius prompt examples against agent.py
status: Done
assignee: []
created_date: '2026-04-04 04:07'
updated_date: '2026-04-04 04:40'
labels:
  - testing
  - genius
  - agent
dependencies: []
references:
  - app/frontend/js/components/genius-browser.js
  - scripts/agent.py
  - docs/agent.md
priority: medium
---
## Description

<!-- SECTION:DESCRIPTION:BEGIN -->
Run each of the 29 prompt examples from `app/frontend/js/components/genius-browser.js` through `scripts/agent.py` using the default qwen3.5:9b model. Record which prompts succeed (valid playlist generated) and which fail (parse failure, no matches, bad output). This establishes a baseline for prompt coverage against the current library.
<!-- SECTION:DESCRIPTION:END -->

## Acceptance Criteria

<!-- AC:BEGIN -->
- [x] #1 All 29 prompt examples tested against agent.py with qwen3.5:9b
- [x] #2 Results documented: pass/fail status, track count, artist variety, eval scores
- [x] #3 Known failures identified (e.g. genres not well-represented in library)
<!-- AC:END -->
## Final Summary

<!-- SECTION:FINAL_SUMMARY:BEGIN -->
### Results: 26/29 pass (89.7%), 3 parse failures, 0 errors

### Full Results (qwen3.5:9b, default settings)

Columns C/I/V are the three 0-2 eval scores (C = Concept, V = Variety); H is their harmonic mean.

| # | Prompt | Result | Tracks | Artists | Turns | C | I | V | H |
|---|--------|--------|--------|---------|-------|---|---|---|---|
| 1 | make me a chill playlist from my library | PASS | 15 | 15 | 2/5 | 2 | 2 | 2 | 2.00 |
| 2 | something similar to what I listened to recently | PASS | 25 | 24 | 3/5 | 2 | 2 | 2 | 2.00 |
| 3 | find me post-punk artists I don't usually listen to | PASS | 25 | 7 | 2/5 | 2 | 2 | 1 | 1.50 |
| 4 | upbeat tracks for a morning run | PASS | 20 | 19 | 4/5 | 1 | 2 | 2 | 1.50 |
| 5 | rainy day songs with acoustic guitars | PASS | 17 | 17 | 3/5 | 2 | 2 | 2 | 2.00 |
| 6 | deep cuts I haven't played in months | PASS | 25 | 6 | 3/5 | 2 | 2 | 1 | 1.50 |
| 7 | a late-night driving mix | PASS | 20 | 9 | 3/5 | 2 | 2 | 2 | 2.00 |
| 8 | something moody and atmospheric | PASS | 20 | 20 | 4/5 | 2 | 2 | 2 | 2.00 |
| 9 | high energy tracks for cleaning the house | PASS | 25 | 6 | 2/5 | 2 | 2 | 2 | 2.00 |
| 10 | jazz and soul from the 60s and 70s | PASS | 25 | 5 | 5/5 | 0 | 2 | 2 | 0.00 |
| 11 | songs that build slowly then explode | PASS | 24 | 15 | 2/5 | 2 | 2 | 2 | 2.00 |
| 12 | artists similar to Radiohead in my library | PASS | 20 | 16 | 2/5 | 2 | 2 | 2 | 2.00 |
| 13 | a Sunday morning coffee playlist | PASS | 18 | 18 | 2/5 | 2 | 2 | 2 | 2.00 |
| 14 | tracks with heavy bass lines | FAIL | - | - | - | - | - | - | - |
| 15 | my most played songs from this year | PASS | 25 | 16 | 5/5 | 2 | 2 | 0 | 0.00 |
| 16 | something dreamy and shoegaze-y | PASS | 12 | 11 | 2/5 | 2 | 2 | 2 | 2.00 |
| 17 | a workout mix that keeps escalating | PASS | 25 | 8 | 3/5 | 2 | 2 | 2 | 2.00 |
| 18 | underrated albums I barely touched | PASS | 24 | 24 | 4/5 | 2 | 2 | 2 | 2.00 |
| 19 | folksy singer-songwriter vibes | PASS | 20 | 20 | 5/5 | 2 | 2 | 2 | 2.00 |
| 20 | electronic music that isn't too intense | PASS | 23 | 20 | 4/5 | 2 | 2 | 2 | 2.00 |
| 21 | songs to cook dinner to | PASS | 25 | 7 | 3/5 | 2 | 2 | 2 | 2.00 |
| 22 | a road trip playlist from my collection | PASS | 25 | 21 | 2/5 | 2 | 2 | 2 | 2.00 |
| 23 | melancholy but beautiful tracks | FAIL | - | - | - | - | - | - | - |
| 24 | hip-hop and R&B from the 90s | PASS | 25 | 16 | 5/5 | 0 | 2 | 2 | 0.00 |
| 25 | everything by female vocalists | PASS | 13 | 10 | 5/5 | 2 | 2 | 2 | 2.00 |
| 26 | instrumental tracks only | PASS | 25 | 9 | 5/5 | 2 | 2 | 1 | 1.50 |
| 27 | songs under three minutes | PASS | 20 | 16 | 4/5 | 2 | 2 | 2 | 2.00 |
| 28 | a party mix from what I already have | PASS | 12 | 11 | 2/5 | 2 | 2 | 2 | 2.00 |
| 29 | blues and classic rock deep cuts | FAIL | - | - | - | - | - | - | - |
### Failure Analysis

All 3 failures share the same root cause: **the model dumps 50-100+ track IDs, then tries to self-correct multiple times**, never producing a clean single-line `Playlist:` / `Tracks:` output. The `**Playlist:**` markdown bold formatting also breaks the parser.

- **#14 "tracks with heavy bass lines"** — model found many matching tracks, dumped all IDs, then looped trying to reduce the list
- **#23 "melancholy but beautiful tracks"** — same pattern, 130+ IDs dumped, repeated failed attempts to curate
- **#29 "blues and classic rock deep cuts"** — model listed tracks in prose instead of the Playlist:/Tracks: format

### Library Coverage Gaps (Concept score = 0)

- **#10 "jazz and soul from the 60s and 70s"** — library has no jazz/soul; model fell back to post-punk from that era
- **#24 "hip-hop and R&B from the 90s"** — library has no hip-hop/R&B; model fell back to 90s alternative/indie

### Low Variety Scores

- **#3 "post-punk artists"** — 25 tracks from only 7 artists (too many per artist)
- **#6 "deep cuts"** — 25 tracks from 6 artists
- **#15 "most played this year"** — Variety=0 despite 25 tracks from 16 artists (judge was harsh)
- **#26 "instrumental tracks"** — 25 tracks from 9 artists

### Raw output saved to `/tmp/genius_prompt_results/`
<!-- SECTION:FINAL_SUMMARY:END -->

docs/agent.md

Lines changed: 111 additions & 2 deletions
## Usage

```bash
# Single prompt
uv run scripts/agent.py "make me a chill playlist"

# With options
uv run scripts/agent.py --model qwen3.5:9b --seed 42 --temperature 0.1 "shoegaze deep cuts"

# Extended thinking
uv run scripts/agent.py --think --max-turns 8 "jazz from my library"

# Batch: run all 29 built-in prompt examples
uv run scripts/agent.py --batch

# Batch with a different model
uv run scripts/agent.py --batch --model qwen3:14b

# Batch from a file (one prompt per line, # comments ignored)
uv run scripts/agent.py --batch --prompts-file prompts.txt
```
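The prompts-file format (one prompt per line, blank lines and `#` comments ignored) takes only a few lines to parse. This is a hedged sketch with a hypothetical helper name, not the script's actual loader:

```python
from pathlib import Path

def load_prompts(path: str) -> list[str]:
    """Parse a prompts file: one prompt per line, blank lines and
    '#' comment lines skipped. Illustrative sketch only; the real
    loader in scripts/agent.py may differ."""
    prompts = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            prompts.append(line)
    return prompts
```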
Batch mode shares a single DB connection and Ollama client across all prompts,
prints per-prompt results as they complete, and outputs a summary table:
```
================================================================================
BATCH SUMMARY — qwen3.5:9b
================================================================================
 #  Result  Time  Tracks  Artists  Turns  C  I  V  H     Prompt
--------------------------------------------------------------------------------
 1  PASS    35s   18      18       2/5    2  2  2  2.00  a Sunday morning coffee playlist
 2  PASS    20s   12      12       2/5    2  2  2  2.00  artists similar to Radiohead in my library
 3  FAIL    45s   -       -        5/5    -  -  -  -     tracks with heavy bass lines (parse_failure)
--------------------------------------------------------------------------------
Pass: 2/3 (67%)  Avg time: 33s  Total: 100s  Avg harmonic (pass): 2.00
```
`run_agent()` returns an `AgentResult` dataclass with status, playlist name,
track IDs, valid count, unique artists, turns used, eval scores, harmonic mean,
and elapsed time. `run_batch()` collects these for programmatic use.
## Configuration
| `AGENT_MAX_PLAYLIST_TRACKS` | `25` ||
| `AGENT_LOG_FILE` | `/tmp/ollama_python_agent.jsonl` | `--log-file` |
| `LASTFM_API_KEY` |||
||| `--batch` |
||| `--prompts-file` |
78109
## Tools
79110

The evaluation uses `temperature=0.0` for deterministic judging and a 128-token
cap since only three scores are needed.
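The H column in the result tables is consistent with the harmonic mean of the three 0-2 eval scores, where any single zero collapses H to 0 (e.g. C=0, I=2, V=2 gives H=0.00). A minimal sketch, not the script's actual scoring code:

```python
def harmonic(scores: list[float]) -> float:
    # Harmonic mean of the eval scores. Any zero score collapses the
    # result to 0, matching rows like C=0, I=2, V=2 -> H=0.00.
    if any(s == 0 for s in scores):
        return 0.0
    return len(scores) / sum(1 / s for s in scores)
```

For example, `harmonic([2, 2, 1])` gives 1.5 and `harmonic([2, 2, 2])` gives 2.0, matching the H values seen throughout the benchmark tables.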

264+
## Model Selection
265+
266+
The default model is `qwen3.5:9b` — chosen to fit 8GB unified memory devices
267+
(e.g. MacBook Air M3) while maintaining reliable tool calling. Larger models
268+
improve quality and reduce turn count but require more RAM.
269+
270+
### Requirements
271+
272+
The agent needs a model that can:
273+
274+
- Make **parallel tool calls** (multiple tools in a single turn)
275+
- Follow a complex 8-tool system prompt with strategy routing
276+
- Produce structured output (`Playlist: name` / `Tracks: comma-separated IDs`)
277+
- Reason about user intent to select the right tool combination
278+
279+
Models below ~4B parameters (e.g. llama3.2:1b) lack the reasoning capacity for
280+
this task. Parallel tool calling support in Ollama is required — models that only
281+
support single tool calls per turn (e.g. gpt-oss) double the number of turns needed.
282+
283+
### Recommended models by device RAM
284+
285+
| Device RAM | Model | Size (Q4) | Active Params | Notes |
286+
|------------|-------|-----------|---------------|-------|
287+
| 8GB | `qwen3.5:9b` | ~7GB | 9B dense | Default. Fits tight but works |
288+
| 8GB | `qwen3:8b` | ~5GB | 8B dense | Fallback if 3.5 has issues |
289+
| 16GB | `qwen3:14b` | ~9GB | 14B dense | Highest tool F1 (0.971) |
290+
| 32GB+ | `qwen3-coder:30b-a3b` | ~18GB | 3B active (MoE) | Fast inference, good quality |
291+
| 32GB+ | `glm-4.7-flash` | ~19GB | dense | Strong agent benchmarks |
292+
293+
Sticking with the Qwen family across tiers keeps prompt behavior consistent —
294+
same tool calling format, same instruction following patterns.
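The RAM tiers above can be turned into an automatic model pick. A hedged sketch (the thresholds are the table's; `os.sysconf` works on Linux/macOS, and the helper name is hypothetical):

```python
import os

def pick_model() -> str:
    # Map physical RAM to the model tiers from the table above.
    ram_gb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024**3
    if ram_gb >= 32:
        return "qwen3-coder:30b-a3b"
    if ram_gb >= 16:
        return "qwen3:14b"
    return "qwen3.5:9b"  # 8GB default
```

The chosen name could then be passed straight to `--model`.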
### Benchmark: 29 prompt examples (2026-04-04)

Tested all 29 prompt examples from `genius-browser.js` against `qwen3.5:9b`
with default settings. Full results in TASK-309.

**Overall: 26/29 pass (89.7%), 3 parse failures, 0 errors**

The 3 failures share a root cause: the model dumps 50-100+ track IDs then
loops trying to self-correct, never producing clean `Playlist:` / `Tracks:`
output. Affected prompts: "tracks with heavy bass lines", "melancholy but
beautiful tracks", "blues and classic rock deep cuts".
Two prompts scored Concept=0 due to library coverage gaps (no jazz/soul or
hip-hop/R&B in the test library) — the model correctly identified the gap
and fell back to related genres.
### Model comparison on failure cases

Tested the 3 failed prompts plus 2 reference prompts across larger models:

| Prompt | qwen3.5:9b | qwen3-coder:30b-a3b | glm-4.7-flash | qwen3.5:35b-a3b |
|--------|-----------|---------------------|---------------|-----------------|
| tracks with heavy bass lines | **FAIL** | PASS 42s H=2.00 | PASS 46s H=2.00 | PASS 60s H=1.50 |
| melancholy but beautiful | **FAIL** | PASS 39s H=1.50 | PASS 75s H=2.00 | PASS 46s H=2.00 |
| blues/classic rock deep cuts | **FAIL** | PASS 26s H=1.20 | PASS 58s H=1.50 | PASS 68s H=1.20 |
| chill playlist | PASS ~30s H=2.00 | PASS 38s H=1.50 | PASS 47s H=2.00 | PASS 27s H=2.00 |
| similar to Radiohead | PASS ~30s H=2.00 | PASS 23s H=2.00 | PASS 25s H=2.00 | PASS 41s H=2.00 |
All 3 larger models pass the prompts that `qwen3.5:9b` failed — the failures
are a reasoning/self-control issue at 9B scale, not a tool calling issue.

`qwen3-coder:30b-a3b` (MoE, 3B active) is the fastest larger model due to its low
active parameter count on Apple Silicon. `glm-4.7-flash` has the most
consistent eval scores. `qwen3.5:35b-a3b` was slower than expected and did not
improve over the other two.
### Speed observations

End-to-end prompt completion time is dominated by **number of turns**, not raw
token speed. A model completing in 2 turns at 30 tok/s beats a model needing 5
turns at 100 tok/s. The primary optimization path is reducing turn count through
better prompt engineering, not switching to faster models.
## Applying to Rust Backend

The script mirrors the Rust agent in `crates/mt-tauri/src/agent/`:

| `tool_get_similar_tracks()` | `tools.rs::GetSimilarTracks::call()` |
| `_lastfm_get()` | `lastfm/client.rs::api_call()` |
| `parse_response()` | `mod.rs::parse_agent_response()` |
| `run_agent()` → `AgentResult` | `mod.rs::agent_generate_playlist()` |
| `run_batch()` | — (Python-only test harness) |
| `_connect()` | Managed by Tauri app state |
| `BATCH_PROMPTS` | — (Python-only test data) |

Changes validated in the Python script should be ported to Rust: