Commit a203129

feat(agent): add batch mode and model selection docs
- Add `--batch` flag to run all 29 genius-browser.js prompt examples
- Add `--prompts-file` flag to run custom prompt lists
- Extract `_connect()` for shared DB/client across batch runs
- `run_agent()` returns `AgentResult` dataclass with structured metrics
- `run_batch()` prints summary table with pass/fail, timing, eval scores
- Make `_setup_logging()` idempotent for repeated calls
- Document model selection by device RAM (8GB-32GB+)
- Document benchmark results: 26/29 pass on qwen3.5:9b
- Document model comparison on failure cases across 4 models
- Add TASK-309 with full test results

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 8604bd6 commit a203129

File tree

3 files changed
+477 -39 lines changed
Lines changed: 93 additions & 0 deletions
---
id: TASK-309
title: Test all Genius prompt examples against agent.py
status: Done
assignee: []
created_date: '2026-04-04 04:07'
updated_date: '2026-04-04 04:40'
labels:
  - testing
  - genius
  - agent
dependencies: []
references:
  - app/frontend/js/components/genius-browser.js
  - scripts/agent.py
  - docs/agent.md
priority: medium
---
## Description

<!-- SECTION:DESCRIPTION:BEGIN -->
Run each of the 29 prompt examples from `app/frontend/js/components/genius-browser.js` through `scripts/agent.py` using the default qwen3.5:9b model. Record which prompts succeed (valid playlist generated) and which fail (parse failure, no matches, bad output). This establishes a baseline for prompt coverage against the current library.
<!-- SECTION:DESCRIPTION:END -->

## Acceptance Criteria

<!-- AC:BEGIN -->
- [x] #1 All 29 prompt examples tested against agent.py with qwen3.5:9b
- [x] #2 Results documented: pass/fail status, track count, artist variety, eval scores
- [x] #3 Known failures identified (e.g. genres not well-represented in library)
<!-- AC:END -->
## Final Summary

<!-- SECTION:FINAL_SUMMARY:BEGIN -->
### Results: 26/29 pass (89.7%), 3 parse failures, 0 errors

### Full Results (qwen3.5:9b, default settings)

Columns C/I/V are the three 0-2 eval scores (C = Concept, V = Variety); H is their harmonic mean.

| # | Prompt | Result | Tracks | Artists | Turns | C | I | V | H |
|---|--------|--------|--------|---------|-------|---|---|---|---|
| 1 | make me a chill playlist from my library | PASS | 15 | 15 | 2/5 | 2 | 2 | 2 | 2.00 |
| 2 | something similar to what I listened to recently | PASS | 25 | 24 | 3/5 | 2 | 2 | 2 | 2.00 |
| 3 | find me post-punk artists I don't usually listen to | PASS | 25 | 7 | 2/5 | 2 | 2 | 1 | 1.50 |
| 4 | upbeat tracks for a morning run | PASS | 20 | 19 | 4/5 | 1 | 2 | 2 | 1.50 |
| 5 | rainy day songs with acoustic guitars | PASS | 17 | 17 | 3/5 | 2 | 2 | 2 | 2.00 |
| 6 | deep cuts I haven't played in months | PASS | 25 | 6 | 3/5 | 2 | 2 | 1 | 1.50 |
| 7 | a late-night driving mix | PASS | 20 | 9 | 3/5 | 2 | 2 | 2 | 2.00 |
| 8 | something moody and atmospheric | PASS | 20 | 20 | 4/5 | 2 | 2 | 2 | 2.00 |
| 9 | high energy tracks for cleaning the house | PASS | 25 | 6 | 2/5 | 2 | 2 | 2 | 2.00 |
| 10 | jazz and soul from the 60s and 70s | PASS | 25 | 5 | 5/5 | 0 | 2 | 2 | 0.00 |
| 11 | songs that build slowly then explode | PASS | 24 | 15 | 2/5 | 2 | 2 | 2 | 2.00 |
| 12 | artists similar to Radiohead in my library | PASS | 20 | 16 | 2/5 | 2 | 2 | 2 | 2.00 |
| 13 | a Sunday morning coffee playlist | PASS | 18 | 18 | 2/5 | 2 | 2 | 2 | 2.00 |
| 14 | tracks with heavy bass lines | FAIL | - | - | - | - | - | - | - |
| 15 | my most played songs from this year | PASS | 25 | 16 | 5/5 | 2 | 2 | 0 | 0.00 |
| 16 | something dreamy and shoegaze-y | PASS | 12 | 11 | 2/5 | 2 | 2 | 2 | 2.00 |
| 17 | a workout mix that keeps escalating | PASS | 25 | 8 | 3/5 | 2 | 2 | 2 | 2.00 |
| 18 | underrated albums I barely touched | PASS | 24 | 24 | 4/5 | 2 | 2 | 2 | 2.00 |
| 19 | folksy singer-songwriter vibes | PASS | 20 | 20 | 5/5 | 2 | 2 | 2 | 2.00 |
| 20 | electronic music that isn't too intense | PASS | 23 | 20 | 4/5 | 2 | 2 | 2 | 2.00 |
| 21 | songs to cook dinner to | PASS | 25 | 7 | 3/5 | 2 | 2 | 2 | 2.00 |
| 22 | a road trip playlist from my collection | PASS | 25 | 21 | 2/5 | 2 | 2 | 2 | 2.00 |
| 23 | melancholy but beautiful tracks | FAIL | - | - | - | - | - | - | - |
| 24 | hip-hop and R&B from the 90s | PASS | 25 | 16 | 5/5 | 0 | 2 | 2 | 0.00 |
| 25 | everything by female vocalists | PASS | 13 | 10 | 5/5 | 2 | 2 | 2 | 2.00 |
| 26 | instrumental tracks only | PASS | 25 | 9 | 5/5 | 2 | 2 | 1 | 1.50 |
| 27 | songs under three minutes | PASS | 20 | 16 | 4/5 | 2 | 2 | 2 | 2.00 |
| 28 | a party mix from what I already have | PASS | 12 | 11 | 2/5 | 2 | 2 | 2 | 2.00 |
| 29 | blues and classic rock deep cuts | FAIL | - | - | - | - | - | - | - |
### Failure Analysis

All 3 failures share the same root cause: **the model dumps 50-100+ track IDs, then tries to self-correct multiple times**, never producing a clean single-line `Playlist:` / `Tracks:` output. The `**Playlist:**` markdown bold formatting also breaks the parser.

- **#14 "tracks with heavy bass lines"** — model found many matching tracks, dumped all IDs, then looped trying to reduce the list
- **#23 "melancholy but beautiful tracks"** — same pattern, 130+ IDs dumped, repeated failed attempts to curate
- **#29 "blues and classic rock deep cuts"** — model listed tracks in prose instead of the Playlist:/Tracks: format

### Library Coverage Gaps (Concept score = 0)

- **#10 "jazz and soul from the 60s and 70s"** — library has no jazz/soul; model fell back to post-punk from that era
- **#24 "hip-hop and R&B from the 90s"** — library has no hip-hop/R&B; model fell back to 90s alternative/indie

### Low Variety Scores

- **#3 "post-punk artists"** — 25 tracks from only 7 artists (too many per artist)
- **#6 "deep cuts"** — 25 tracks from 6 artists
- **#15 "most played this year"** — Variety=0 despite 25 tracks from 16 artists (judge was harsh)
- **#26 "instrumental tracks"** — 25 tracks from 9 artists

### Raw output saved to `/tmp/genius_prompt_results/`
<!-- SECTION:FINAL_SUMMARY:END -->

docs/agent.md

Lines changed: 111 additions & 2 deletions
## Usage

```bash
# Single prompt
uv run scripts/agent.py "make me a chill playlist"

# With options
uv run scripts/agent.py --model qwen3.5:9b --seed 42 --temperature 0.1 "shoegaze deep cuts"

# Extended thinking
uv run scripts/agent.py --think --max-turns 8 "jazz from my library"

# Batch: run all 29 built-in prompt examples
uv run scripts/agent.py --batch

# Batch with a different model
uv run scripts/agent.py --batch --model qwen3:14b

# Batch from a file (one prompt per line, # comments ignored)
uv run scripts/agent.py --batch --prompts-file prompts.txt
```
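The prompts-file format (one prompt per line, blank lines and `#` comments ignored) takes only a few lines to parse. This is a hedged sketch with a hypothetical helper name, not the script's actual loader:

```python
from pathlib import Path

def load_prompts(path: str) -> list[str]:
    """Parse a prompts file: one prompt per line, blank lines and
    '#' comment lines skipped. Illustrative sketch only; the real
    loader in scripts/agent.py may differ."""
    prompts = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            prompts.append(line)
    return prompts
```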
Batch mode shares a single DB connection and Ollama client across all prompts,
prints per-prompt results as they complete, and outputs a summary table:
```
================================================================================
BATCH SUMMARY — qwen3.5:9b
================================================================================
 #  Result  Time  Tracks  Artists  Turns  C  I  V  H     Prompt
--------------------------------------------------------------------------------
 1  PASS    35s   18      18       2/5    2  2  2  2.00  a Sunday morning coffee playlist
 2  PASS    20s   12      12       2/5    2  2  2  2.00  artists similar to Radiohead in my library
 3  FAIL    45s   -       -        5/5    -  -  -  -     tracks with heavy bass lines (parse_failure)
--------------------------------------------------------------------------------
Pass: 2/3 (67%)  Avg time: 33s  Total: 100s  Avg harmonic (pass): 2.00
```
`run_agent()` returns an `AgentResult` dataclass with status, playlist name,
track IDs, valid count, unique artists, turns used, eval scores, harmonic mean,
and elapsed time. `run_batch()` collects these for programmatic use.
## Configuration
| `AGENT_MAX_PLAYLIST_TRACKS` | `25` ||
| `AGENT_LOG_FILE` | `/tmp/ollama_python_agent.jsonl` | `--log-file` |
| `LASTFM_API_KEY` |||
||| `--batch` |
||| `--prompts-file` |
78109
## Tools
79110

The evaluation uses `temperature=0.0` for deterministic judging and a 128-token
cap since only three scores are needed.
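The H column in the result tables is consistent with the harmonic mean of the three 0-2 eval scores, where any single zero collapses H to 0 (e.g. C=0, I=2, V=2 gives H=0.00). A minimal sketch, not the script's actual scoring code:

```python
def harmonic(scores: list[float]) -> float:
    # Harmonic mean of the eval scores. Any zero score collapses the
    # result to 0, matching rows like C=0, I=2, V=2 -> H=0.00.
    if any(s == 0 for s in scores):
        return 0.0
    return len(scores) / sum(1 / s for s in scores)
```

For example, `harmonic([2, 2, 1])` gives 1.5 and `harmonic([2, 2, 2])` gives 2.0, matching the H values seen throughout the benchmark tables.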

264+
## Model Selection
265+
266+
The default model is `qwen3.5:9b` — chosen to fit 8GB unified memory devices
267+
(e.g. MacBook Air M3) while maintaining reliable tool calling. Larger models
268+
improve quality and reduce turn count but require more RAM.
269+
270+
### Requirements
271+
272+
The agent needs a model that can:
273+
274+
- Make **parallel tool calls** (multiple tools in a single turn)
275+
- Follow a complex 8-tool system prompt with strategy routing
276+
- Produce structured output (`Playlist: name` / `Tracks: comma-separated IDs`)
277+
- Reason about user intent to select the right tool combination
278+
279+
Models below ~4B parameters (e.g. llama3.2:1b) lack the reasoning capacity for
280+
this task. Parallel tool calling support in Ollama is required — models that only
281+
support single tool calls per turn (e.g. gpt-oss) double the number of turns needed.
282+
283+
### Recommended models by device RAM
284+
285+
| Device RAM | Model | Size (Q4) | Active Params | Notes |
286+
|------------|-------|-----------|---------------|-------|
287+
| 8GB | `qwen3.5:9b` | ~7GB | 9B dense | Default. Fits tight but works |
288+
| 8GB | `qwen3:8b` | ~5GB | 8B dense | Fallback if 3.5 has issues |
289+
| 16GB | `qwen3:14b` | ~9GB | 14B dense | Highest tool F1 (0.971) |
290+
| 32GB+ | `qwen3-coder:30b-a3b` | ~18GB | 3B active (MoE) | Fast inference, good quality |
291+
| 32GB+ | `glm-4.7-flash` | ~19GB | dense | Strong agent benchmarks |
292+
293+
Sticking with the Qwen family across tiers keeps prompt behavior consistent —
294+
same tool calling format, same instruction following patterns.
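The RAM tiers above can be turned into an automatic model pick. A hedged sketch (the thresholds are the table's; `os.sysconf` works on Linux/macOS, and the helper name is hypothetical):

```python
import os

def pick_model() -> str:
    # Map physical RAM to the model tiers from the table above.
    ram_gb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024**3
    if ram_gb >= 32:
        return "qwen3-coder:30b-a3b"
    if ram_gb >= 16:
        return "qwen3:14b"
    return "qwen3.5:9b"  # 8GB default
```

The chosen name could then be passed straight to `--model`.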
### Benchmark: 29 prompt examples (2026-04-04)

Tested all 29 prompt examples from `genius-browser.js` against `qwen3.5:9b`
with default settings. Full results in TASK-309.

**Overall: 26/29 pass (89.7%), 3 parse failures, 0 errors**

The 3 failures share a root cause: the model dumps 50-100+ track IDs then
loops trying to self-correct, never producing clean `Playlist:` / `Tracks:`
output. Affected prompts: "tracks with heavy bass lines", "melancholy but
beautiful tracks", "blues and classic rock deep cuts".
Two prompts scored Concept=0 due to library coverage gaps (no jazz/soul or
hip-hop/R&B in the test library) — the model correctly identified the gap
and fell back to related genres.
### Model comparison on failure cases

Tested the 3 failed prompts plus 2 reference prompts across larger models:

| Prompt | qwen3.5:9b | qwen3-coder:30b-a3b | glm-4.7-flash | qwen3.5:35b-a3b |
|--------|-----------|---------------------|---------------|-----------------|
| tracks with heavy bass lines | **FAIL** | PASS 42s H=2.00 | PASS 46s H=2.00 | PASS 60s H=1.50 |
| melancholy but beautiful | **FAIL** | PASS 39s H=1.50 | PASS 75s H=2.00 | PASS 46s H=2.00 |
| blues/classic rock deep cuts | **FAIL** | PASS 26s H=1.20 | PASS 58s H=1.50 | PASS 68s H=1.20 |
| chill playlist | PASS ~30s H=2.00 | PASS 38s H=1.50 | PASS 47s H=2.00 | PASS 27s H=2.00 |
| similar to Radiohead | PASS ~30s H=2.00 | PASS 23s H=2.00 | PASS 25s H=2.00 | PASS 41s H=2.00 |
All 3 larger models pass the prompts that `qwen3.5:9b` failed — the failures
are a reasoning/self-control issue at 9B scale, not a tool calling issue.

`qwen3-coder:30b-a3b` (MoE, 3B active) is the fastest larger model due to its low
active parameter count on Apple Silicon. `glm-4.7-flash` has the most
consistent eval scores. `qwen3.5:35b-a3b` was slower than expected and did not
improve over the other two.
### Speed observations

End-to-end prompt completion time is dominated by **number of turns**, not raw
token speed. A model completing in 2 turns at 30 tok/s beats a model needing 5
turns at 100 tok/s. The primary optimization path is reducing turn count through
better prompt engineering, not switching to faster models.
## Applying to Rust Backend

The script mirrors the Rust agent in `crates/mt-tauri/src/agent/`:

| `tool_get_similar_tracks()` | `tools.rs::GetSimilarTracks::call()` |
| `_lastfm_get()` | `lastfm/client.rs::api_call()` |
| `parse_response()` | `mod.rs::parse_agent_response()` |
| `run_agent()` → `AgentResult` | `mod.rs::agent_generate_playlist()` |
| `run_batch()` | — (Python-only test harness) |
| `_connect()` | Managed by Tauri app state |
| `BATCH_PROMPTS` | — (Python-only test data) |

Changes validated in the Python script should be ported to Rust: