Commit d5b5416

feat(agent): add Python prompt override harness
1 parent 3f433ad commit d5b5416

2 files changed: +97 -174 lines changed

backlog/tasks/task-277 - Genius-playlist-creator.md

Lines changed: 57 additions & 150 deletions
@@ -4,15 +4,16 @@ title: Genius playlist creator
 status: In Progress
 assignee: []
 created_date: '2026-02-18 05:58'
-updated_date: '2026-04-02 21:01'
+updated_date: '2026-04-02 22:34'
 labels:
 - feature
 - playlists
 - recommendation
 - agent
 - ollama
 - lastfm
-dependencies: []
+dependencies:
+- TASK-308
 references:
 - docs/genius.md
 priority: high
@@ -150,163 +151,69 @@ Cfg-gate fix (fd8593a): After merging main (which had reverted the lastfm discov
 - Alpine.js store for agent state, basecoat/Tailwind for UI components

 Must satisfy AC #1 (prompt input), AC #2 (LLM interpretation), AC #3 (8 tools visible), AC #4 (local tracks only), AC #5 (graceful degradation UX).
+
+Documented next-step options only; not started in this commit:
+1. Continue Python-only validation before any Rust port.
+2. Prototype deterministic candidate aggregation/finalization in `scripts/agent.py` so the LLM does discovery/routing while Python enforces playlist policy.
+3. Candidate guard rails to evaluate: track count bounds, one-track-per-artist default, seed-artist cap for artist-based prompts, scoring by supporting tool sources, local genre, Last.fm similarity/tag evidence, and recency/history signals.
+4. Once Python business logic is stable across mood, artist-based, and mixed-history prompts, port the proven rules to Rust.
 <!-- SECTION:PLAN:END -->

 ## Implementation Notes

 <!-- SECTION:NOTES:BEGIN -->
-## Current State (2026-03-31)
-
-- Phase 1 (deps) complete — commit fc5846b
-- Phase 2 (types + prompt + module scaffold) complete — commit 2f121f7
-- Phase 3 (tools) complete — commit d9a37c9
-- Phase 4 (agent loop + Tauri commands) complete — commit d9a37c9
-- Phase 5 (onboarding + setup) complete — commit f1cf52c
-- Phase 6 (evals) complete — commit 27f9c2e, cfg-gate fix fd8593a
-- Merged into main and pushed (fast-forward merge fd8593a)
-
-753/753 tests pass with `cargo nextest run --workspace --features agent`.
-Both `cargo check --features agent` and `cargo check` compile cleanly (zero warnings).
-
-## Phase 5 Summary
-
-Added `setup.rs` with 4 public functions + 13 new tests:
-- `check_ollama_status()` — health check returning OllamaStatus (available/unavailable + model list)
-- `pull_model(app, model)` — POST streaming to Ollama, emits `agent://pull-progress` Tauri events
-- `get_onboarding_state(app)` — reads from `agent.json` store
-- `set_onboarding_complete(app, model)` — writes to `agent.json` store
-
-New types in `types.rs`: OllamaStatus, PullProgress, OnboardingState, PullModelResult
-
-4 new Tauri command pairs (cfg/not-cfg) wired in lib.rs:
-- `agent_check_ollama`, `agent_pull_model`, `agent_get_onboarding_state`, `agent_set_onboarding_complete`
-
-## Phase 6 Summary
-
-13 eval tests in `evals.rs` using wiremock mock Ollama server.
-Categories: tool execution (3), output format (6), degradation (5).
-Refactored `build_agent()` and `check_ollama()` to accept `base_url: &str` for test injection.
-Added `wiremock = "0.6"` to dev-dependencies.
-
-Cfg-gate fix: After merging main (which had reverted lastfm discovery methods), restored them and gated all discovery types, methods, and tests behind `#[cfg(feature = "agent")]`.
+## 2026-04-02: Python to Rust Migration Complete

-## Key Design Decisions
-
-- Feature-flagged (`agent`) — zero overhead when disabled
-- Uses llama3.2:1b via Ollama — small enough for local inference
-- 8 tools covering local library + Last.fm APIs with graceful degradation
-- Heuristic evals (no LLM judge) for deterministic CI
-- Thin wrapper Tauri commands in lib.rs with cfg pairs (agent/not-agent) to keep generate_handler! unconditional
-- Agent module functions are plain async fns (not #[tauri::command]) — lib.rs wrappers handle Tauri integration
-- LastFmClient constructed fresh inside agent_generate_playlist (not managed as Tauri state)
-- Onboarding state persisted via tauri-plugin-store (`agent.json` with key `agent_onboarding`)
-- Model pull uses streaming POST to Ollama with Tauri event emission for progress
-
-## Rig API Notes (v0.27 + experimental)
-
-- `rig::client::CompletionClient` trait needed for `.agent()` method on client
-- `rig::completion::Prompt` trait needed for `.prompt()` method on agent
-- `.multi_turn(N)` (NOT `.max_turns(N)`) controls tool-call loop depth
-- `ollama::Client::builder().api_key(rig::client::Nothing).base_url(URL).build()` — builder pattern avoids env var panic
-- Agent construction requires Tokio runtime (for tool registration) — test with `#[tokio::test]`
-
-## Files Created/Modified
-
-- `crates/mt-tauri/src/agent/mod.rs` — module root, parse_agent_response, parse_model_names, has_default_model, check_ollama, build_agent, agent_generate_playlist, agent_check_status
-- `crates/mt-tauri/src/agent/types.rs` — AgentResponse, TrackSummary, AgentContext, AgentError, ParsedPlaylist, AgentStatusResponse, OllamaStatus, PullProgress, OnboardingState, PullModelResult
-- `crates/mt-tauri/src/agent/prompt.rs` — SYSTEM_PROMPT, DEFAULT_MODEL, OLLAMA_BASE_URL, MAX_AGENT_TURNS
-- `crates/mt-tauri/src/agent/tools.rs` — 8 Tool implementations
-- `crates/mt-tauri/src/agent/setup.rs` — check_ollama_status, pull_model, get/set_onboarding_state, parse_pull_progress_line
-- `crates/mt-tauri/src/agent/evals.rs` — 13 eval tests with wiremock
-- `crates/mt-tauri/src/lastfm/client.rs` — 5 discovery methods (cfg-gated behind agent feature)
-- `crates/mt-tauri/src/lastfm/types.rs` — response types for discovery methods (cfg-gated behind agent feature)
-- `crates/mt-tauri/Cargo.toml` — rig-core + schemars deps, wiremock dev-dependency
-- `crates/mt-tauri/src/lib.rs` — agent module declaration + 6 wrapper Tauri commands (cfg pairs)
-
-## Remaining
-
-- Phase 7: Frontend — Genius sidebar category, prompt UI, onboarding wizard UI
-
-## 2026-04-02: Agent Performance Fixes
-
-### Issues Identified from Logs
-1. **Parallel tool execution not working**: `with_tool_concurrency(8)` was configured but model wasn't calling multiple tools per turn
-2. **Token generation too long**: Final LLM turn took 63 seconds generating ~57 IDs with multiple recounts
+Successfully migrated the Python reference implementation to Rust:

 ### Changes Made

-**mod.rs - Agent builder:**
-- Added `.max_tokens(1024)` to cap response length (prevents endless recounting)
-
-**prompt.rs - System prompt:**
-- Enhanced RULES section with explicit parallel tool calling instructions:
-- "Call MULTIPLE tools PER TURN in PARALLEL"
-- "When planning your strategy, call ALL independent tools at once"
-
-### Why These Fixes Work
-
-1. **max_tokens(1024)**: Limits the LLM to ~1024 tokens for the final response. The playlist format (name + 25 track IDs) needs only ~200-500 tokens. This prevents the model from generating excessive intermediate reasoning (listing 57 IDs, recounting, selecting, etc.) that was causing the 63-second response time.
-
-2. **Explicit parallel instructions**: The previous prompt mentioned "Call multiple tools per turn" but wasn't emphatic enough. The new language uses ALL CAPS for key concepts and provides concrete examples ("get_similar_artists + search_library + get_track_tags together") to guide the model toward parallel tool calling.
-
-### Test Results
-- All 757 tests pass with `cargo nextest run --workspace --features agent`
-- Both `cargo check --features agent` and `cargo check` compile cleanly
-
-## 2026-04-02: Python Reference Implementation Complete
-
-**Working Solution**: `scripts/agent.py` serves as the reference implementation for the Genius playlist creator.
-
-### Key Features Implemented (Python Script)
-
-1. **8 Agent Tools** - Full implementation matching Rust tool specifications:
-- `get_recently_played` - Recently played tracks from local library
-- `get_top_artists` - Top artists by play history
-- `search_library` - Text search across title/artist/album
-- `get_similar_tracks` - Last.fm similar tracks, cross-referenced with library
-- `get_similar_artists` - Last.fm similar artists with library sample tracks
-- `get_track_tags` - Last.fm mood/genre tags
-- `get_top_artists_by_tag` - Genre-based artist discovery
-- `get_top_tracks_by_country` - Regional track discovery
-
-2. **Artist Variety Priority** - System prompt enforces 1 track per artist by default, only adding 2nd tracks when artist diversity is exhausted
-
-3. **Shuffled Playlist Output** - `_shuffle_spread_artists()` uses greedy algorithm to spread same-artist tracks apart
-
-4. **Configurable Track Counts** - Environment variables:
-- `AGENT_MIN_PLAYLIST_TRACKS` (default: 12)
-- `AGENT_MAX_PLAYLIST_TRACKS` (default: 25)
-
-5. **Performance Tuning** - `num_predict: 2048` prevents response truncation
-
-### Rust Implementation Notes
-
-The Python script demonstrates the correct behavior for playlist generation:
-- Multi-turn tool calling with parallel execution
-- Last.fm cross-referencing with local library
-- Maximum artist variety (N tracks from N artists)
-- Post-generation shuffle for playback order
-
-When implementing the Rust version, use `scripts/agent.py` as the behavioral reference for:
-- System prompt wording (artist variety rules)
-- Tool response format and hint messages
-- Playlist shuffling algorithm
-- Token limit settings (2048 via `max_tokens`)
-
-### Usage
-
-```bash
-# Default: 12-25 tracks with maximum artist variety
-./scripts/agent.py "make me a chill playlist"
-
-# Custom range: 5-10 tracks
-AGENT_MIN_PLAYLIST_TRACKS=5 AGENT_MAX_PLAYLIST_TRACKS=10 ./scripts/agent.py "quick mix"
-
-# Faster execution with fewer turns
-./scripts/agent.py --max-turns 2 "make me a chill playlist"
-```
-
-Reference file: `scripts/agent.py` (1114 lines, fully functional)
+**prompt.rs** - Updated system prompt with enhanced artist variety rules:
+- Added "DEFAULT to 1 track per artist for MAXIMUM variety"
+- Added "PRIORITY: 20 tracks from 20 different artists > 20 tracks from 10 artists with 2 each"
+- Added "A playlist should feel like a JOURNEY through different artists, not an artist deep dive"
+- Added "When compiling: pick the BEST track from each artist, then move on"
+- Converted to raw string literal (`r#"..."#`) for cleaner syntax
+- Added CRITICAL section warning against common mistakes (keyword searches)
+
+**mod.rs** - Added artist-spreading shuffle algorithm and duplicate name handling:
+- `shuffle_spread_artists()` function uses greedy approach to spread same-artist tracks apart
+- `generate_unique_playlist_name()` automatically appends (2), (3), etc. for duplicate names
+- Groups tracks by artist, shuffles each group locally, then greedily selects from the artist with most remaining tracks (excluding the last-played artist when possible)
+- Updated `agent_generate_playlist()` to:
+1. Fetch track details (id + artist) for parsed track IDs
+2. Shuffle using `shuffle_spread_artists()` before adding to playlist
+3. Handle duplicate playlist names by appending numbers
+4. Log validation count and shuffling info
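
Reviewer sketch (not part of the commit): the greedy spreading step described for `shuffle_spread_artists()` could look roughly like the Python below. It assumes tracks arrive as `(track_id, artist)` pairs and that ties between equally large artist groups are broken by first appearance; neither detail is confirmed by the diff, and the optional `rng` parameter is added here only to make the behavior reproducible.

```python
import random
from collections import defaultdict


def shuffle_spread_artists(tracks, rng=None):
    """Reorder (track_id, artist) pairs so same-artist tracks are spread apart."""
    if rng is None:
        rng = random.Random()

    # Group tracks by artist, then shuffle within each group.
    groups = defaultdict(list)
    for track_id, artist in tracks:
        groups[artist].append((track_id, artist))
    for group in groups.values():
        rng.shuffle(group)

    ordered = []
    last_artist = None
    while groups:
        # Exclude the artist that was just placed whenever another one is available.
        candidates = [a for a in groups if a != last_artist] or list(groups)
        # Greedy step: draw from the artist with the most remaining tracks.
        artist = max(candidates, key=lambda a: len(groups[a]))
        ordered.append(groups[artist].pop())
        if not groups[artist]:
            del groups[artist]
        last_artist = artist
    return ordered
```

Adjacent same-artist tracks can still occur when one artist holds more than half of the input, which is why the description says "when possible".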
+
+### Key Features Ported from Python
+1. **Artist variety priority** - System prompt enforces 1 track per artist by default
+2. **Shuffled output** - Greedy algorithm spreads same-artist tracks apart in final playlist
+3. **Track validation** - Verifies all track IDs exist in library before shuffling
+4. **Unique playlist names** - Automatically appends (2), (3), etc. if name exists
+5. **Configurable limits** - Min/max tracks dynamically calculated from MAX_PLAYLIST_TRACKS
+
+### Test Coverage
+- 5 unit tests for `shuffle_spread_artists()`:
+- Empty input returns empty
+- Spreads same-artist tracks apart (no adjacent duplicates)
+- Preserves all tracks in output
+- Single track works correctly
+- Unique artists handled properly
+- 2 unit tests for `generate_unique_playlist_name()`:
+- Returns base name when available
+- Appends number when name exists
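
Reviewer sketch (not part of the commit): the two `generate_unique_playlist_name()` cases listed above can be illustrated in Python as follows. The signature is hypothetical; the diff does not show how the Rust function receives the set of existing playlist names.

```python
def generate_unique_playlist_name(base_name, existing_names):
    """Return base_name, or base_name with the first free " (N)" suffix, N >= 2."""
    existing = set(existing_names)
    if base_name not in existing:
        return base_name
    suffix = 2
    # Walk upward until an unused numbered variant is found.
    while f"{base_name} ({suffix})" in existing:
        suffix += 1
    return f"{base_name} ({suffix})"
```

For example, with `{"Chill", "Chill (2)"}` already present, a new playlist called `Chill` would be stored as `Chill (3)`.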
+
+Total: 764 tests pass (762 existing + 2 new)
+
+2026-04-02: Added `PROMPT` override support to `scripts/agent.py` via python-decouple so prompt experiments can run without changing the default built-in system prompt. `_build_system_prompt()` now keeps the existing prompt for normal runs and uses the env-provided override when present, still interpolating `{min_tracks}` and `{max_tracks}`. Console output now reports whether the run used the default prompt or the override.
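
Reviewer sketch (not part of the commit): a minimal stand-in for the override logic described above. The real script reads `PROMPT` through python-decouple; this version reads a plain mapping (defaulting to `os.environ`) so it has no third-party dependency. `DEFAULT_PROMPT`, the `env` parameter, and the returned `source` label are all placeholders invented for illustration.

```python
import os

# Placeholder text; the real built-in system prompt lives in scripts/agent.py.
DEFAULT_PROMPT = "Build a playlist of {min_tracks}-{max_tracks} local tracks."


def build_system_prompt(min_tracks, max_tracks, env=None):
    """Return (prompt, source), where source is "default" or "override"."""
    env = os.environ if env is None else env
    override = env.get("PROMPT", "").strip()
    template = override or DEFAULT_PROMPT
    source = "override" if override else "default"
    # Plain replace() instead of str.format() so literal braces in an
    # override (for example a JSON snippet) do not raise an error.
    prompt = (
        template.replace("{min_tracks}", str(min_tracks))
        .replace("{max_tracks}", str(max_tracks))
    )
    return prompt, source
```

Returning the source label alongside the prompt is one way to drive the console message about which prompt a run used.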
+
+2026-04-02: Prompt experiment results in Python:
+- Mood request (`make me a chill playlist`): default prompt used 4 turns; override prompt used 2 turns with valid library-only output.
+- Artist-based request (`make me a playlist like Radiohead`): default prompt used 3 turns; override prompt used 2 turns, but prompt-only steering still leaked multiple seed-artist tracks.
+- Mixed-history request (`make me a chill playlist like what I listened to last Friday`): default prompt used 4 turns; override prompt used 2 turns and treated weak recent-history results as a weighting signal instead of spending extra turns matching them exactly.
+
+2026-04-02: Conclusion from prompt experiments: tighter stop rules materially reduce turn count, but prompt-only business-rule enforcement remains unreliable for cases like seed-artist caps. The most promising direction is to keep LLM-driven discovery/tool routing while moving playlist compilation and policy enforcement into deterministic business logic that scores and filters candidates using empirical evidence (tool source overlap, local genre, Last.fm tags/similarity, last played date, play history, and explicit duplicate-artist caps).
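
Reviewer sketch (not part of the commit, and explicitly not started in it): one possible shape for the deterministic finalization step proposed above. It scores each candidate only by how many distinct tools returned it, then applies a per-artist cap, a seed-artist cap, and an upper track bound. The function name, parameters, and input layout are hypothetical, and the other signals mentioned (local genre, Last.fm tags, recency, play history) and the minimum track bound are left out.

```python
from collections import defaultdict


def finalize_playlist(tool_results, max_tracks=25, per_artist_cap=1,
                      seed_artist=None, seed_artist_cap=1):
    """tool_results maps a tool name to a list of (track_id, artist) pairs."""
    sources = defaultdict(set)
    artist_of = {}
    for tool, tracks in tool_results.items():
        for track_id, artist in tracks:
            sources[track_id].add(tool)
            artist_of[track_id] = artist

    # Rank by number of supporting tools; track_id breaks ties deterministically.
    ranked = sorted(sources, key=lambda t: (-len(sources[t]), t))

    picked = []
    per_artist = defaultdict(int)
    for track_id in ranked:
        artist = artist_of[track_id]
        cap = seed_artist_cap if artist == seed_artist else per_artist_cap
        if per_artist[artist] >= cap:
            continue  # Policy is enforced in code, not by the prompt.
        picked.append(track_id)
        per_artist[artist] += 1
        if len(picked) == max_tracks:
            break
    return picked
```

Because the caps are applied after discovery, an artist-based prompt such as the Radiohead case cannot leak extra seed-artist tracks regardless of what the model returns.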
 <!-- SECTION:NOTES:END -->

 ## Definition of Done
