@@ -36,6 +36,44 @@ to the Rust backend (`crates/mt-tauri/src/agent/`).
3636 outcome is logged to a structured JSONL file for analysis.
3737- ** Hard cap** β ` parse_response() ` deduplicates and truncates track IDs to
3838 ` MAX_PLAYLIST_TRACKS ` regardless of model output.
39+ - ** Last-turn nudge** β On the final turn, a user message is injected telling
40+ the model to output the playlist immediately. Prevents turn exhaustion when
41+ ` repeat_penalty ` discourages the model from reusing its "compile now" pattern.
42+ - ** Creative naming** β System prompt instructs the model to use evocative
43+ synonyms for playlist names instead of parroting the user's request
44+ (e.g. "chill" -> "Velvet Haze").
45+
46+ ## Usage
47+
48+ ``` bash
49+ # Basic
50+ uv run scripts/agent.py " make me a chill playlist"
51+
52+ # With options
53+ uv run scripts/agent.py --model qwen3.5:9b --seed 42 --temperature 0.1 " shoegaze deep cuts"
54+
55+ # Extended thinking
56+ uv run scripts/agent.py --think --max-turns 8 " jazz from my library"
57+ ```
58+
59+ ## Configuration
60+
61+ All env vars are read from ` .env ` via ` python-decouple ` . CLI flags override
62+ env var defaults.
63+
64+ | Env Var | Default | CLI Flag |
65+ | ---------| ---------| ----------|
66+ | ` OLLAMA_MODEL ` | ` qwen3.5:9b ` | ` --model ` |
67+ | ` OLLAMA_HOST ` | ` http://localhost:11434 ` | ` --host ` |
68+ | ` AGENT_MAX_TURNS ` | ` 5 ` | ` --max-turns ` |
69+ | ` AGENT_TEMPERATURE ` | ` 0.3 ` | ` --temperature ` |
70+ | ` AGENT_THINK ` | ` false ` | ` --think ` |
71+ | ` AGENT_SEED ` | ` 0 ` | ` --seed ` |
72+ | ` AGENT_REPEAT_PENALTY ` | ` 1.1 ` | ` --repeat-penalty ` |
73+ | ` AGENT_MIN_PLAYLIST_TRACKS ` | ` 12 ` | β |
74+ | ` AGENT_MAX_PLAYLIST_TRACKS ` | ` 25 ` | β |
75+ | ` AGENT_LOG_FILE ` | ` /tmp/ollama_python_agent.jsonl ` | ` --log-file ` |
76+ | ` LASTFM_API_KEY ` | β | β |
3977
4078## Tools
4179
@@ -120,50 +158,78 @@ on error messages as navigation.
120158
121159Exhausted 5 turns. Produced a 2-track playlist of keyword matches.
122160
123- ** After** (Last.fm tools, strategy prompt, hints, temp=0.2, seed=42 ):
161+ ** After** (Last.fm tools, strategy prompt, hints, repeat penalty, temp=0.3 ):
124162
125163``` jsonl
126- {"event" :" session_start" ,"data" :{"temperature" :0.2 , "seed" : 42 ,"prompt" :" make me a chill playlist" }}
164+ {"event" :" session_start" ,"data" :{"temperature" :0.3 , "repeat_penalty" : 1.1 ,"prompt" :" make me a chill playlist" }}
127165{"event" :" tool_call" ,"data" :{"tool" :" get_top_artists_by_tag" ,"args" :{"tag" :" chillout" ,"limit" :50 }}}
128166{"event" :" tool_call" ,"data" :{"tool" :" get_top_artists_by_tag" ,"args" :{"tag" :" dream pop" ,"limit" :50 }}}
129- {"event" :" tool_call" ,"data" :{"tool" :" get_top_artists_by_tag" ,"args" :{"tag" :" shoegaze" ,"limit" :50 }}}
130- {"event" :" tool_result" ,"data" :{"tool" :" get_top_artists_by_tag" ,"count" :6 }}
131- {"event" :" tool_call" ,"data" :{"tool" :" get_similar_tracks" ,"args" :{"artist" :" Cigarettes After Sex" ,"track" :" K." }}}
132- {"event" :" tool_call" ,"data" :{"tool" :" get_similar_tracks" ,"args" :{"artist" :" Beach House" ,"track" :" Sparks" }}}
133- {"event" :" parse_success" ,"data" :{"playlist_name" :" Chill Vibes Collection" ,"track_ids" :[69727 ,70192 ,71486 ," ...21 more" ],"valid_count" :25 }}
134- {"event" :" session_end" ,"data" :{"reason" :" success" ,"turns_used" :4 }}
167+ {"event" :" tool_call" ,"data" :{"tool" :" get_top_artists_by_tag" ,"args" :{"tag" :" lo-fi" ,"limit" :50 }}}
168+ {"event" :" tool_call" ,"data" :{"tool" :" get_top_artists_by_tag" ,"args" :{"tag" :" ambient" ,"limit" :50 }}}
169+ {"event" :" parse_success" ,"data" :{"playlist_name" :" Velvet Haze" ,"track_ids" :[68967 ,69901 ," ...12 more" ],"valid_count" :14 }}
170+ {"event" :" eval_scores" ,"data" :{"concept" :2 ,"instruction" :2 ,"variety" :2 ,"harmonic_mean" :2.0 }}
171+ {"event" :" session_end" ,"data" :{"reason" :" success" ,"turns_used" :2 }}
135172```
136173
137- 25/25 valid tracks in 4 turns. Artists: Beach House, Cocteau Twins,
138- Cigarettes After Sex, Alvvays, girl in red, The Radio Dept., M83, Grimes.
174+ 14/14 valid tracks in 2 turns, 13 artists. Eval: 2/2 across all criteria.
139175
140176** Artist variety after shuffling** β The final playlist order spreads same-artist
141177tracks apart via greedy algorithm:
142178
143179```
144180SHUFFLED order (artists spread out):
145- [68876] The Radio Dept. - Four Months In The Shade
181+ [70060] Massive Attack - Angel
146182 [68658] Beach House - Sparks
147- [68791] Cocteau Twins - Tishbite
148- [68924] M83 - Karl
183+ [68669] Car Seat Headrest - Sunburned Shirts
184+ [68671] Cocteau Twins - Iceblink Luck
149185 [68709] Grimes - Symphonia IX (My Wait Is U)
150- [69848] Alvvays - Next Of Kin
151- ... (10 different artists, 1-2 tracks each, no same-artist adjacency)
186+ [68734] Alvvays - Dives
187+ ... (13 different artists, 1 track each)
188+
189+ Summary: 14 tracks, 13 artists, 2/5 turns
152190```
153191
154192## Determinism Controls
155193
156194| Lever | Default | Effect |
157195| -------| ---------| --------|
158- | ` AGENT_TEMPERATURE ` | 0.2 | Lower = more deterministic token sampling |
196+ | ` AGENT_TEMPERATURE ` | 0.3 | Lower = more deterministic token sampling |
159197| ` top_p ` | 0.9 | Nucleus sampling cutoff (hardcoded) |
160198| ` num_predict ` | 2048 | Maximum tokens to generate (prevents truncation) |
199+ | ` AGENT_REPEAT_PENALTY ` | 1.1 | Penalizes repeated tokens to reduce gibberish (CTRL-style) |
161200| ` AGENT_SEED ` | 0 (random) | Fixed seed for reproducible output |
162201| ` AGENT_MIN_PLAYLIST_TRACKS ` | 12 | Minimum tracks to include in playlist |
163202| ` AGENT_MAX_PLAYLIST_TRACKS ` | 25 | Hard cap on output track count |
164203| ` parse_response() ` | β | Deduplicates + truncates regardless of model output |
165204| ` _shuffle_spread_artists() ` | β | Greedy shuffle to spread same-artist tracks apart |
166205
206+ ## LLM-as-Judge Evaluation
207+
208+ After generating a playlist, the agent runs an automated evaluation pass using
209+ the same Ollama model as judge. Inspired by the AxBench metrics from the
210+ [ Eiffel Tower Llama] ( https://huggingface.co/spaces/dlouapre/eiffel-tower-llama )
211+ paper, three criteria are scored 0-2:
212+
213+ | Criterion | What it measures |
214+ | -----------| -----------------|
215+ | ** Concept match** | Does the playlist match the requested mood/genre/theme? |
216+ | ** Instruction following** | Valid playlist format, correct track count? |
217+ | ** Track variety** | Diverse artists vs. repetitive? |
218+
219+ A ** harmonic mean** of the three scores penalizes playlists that fail on any
220+ single dimension (e.g. on-theme but all from one artist scores poorly).
221+
222+ Scores are logged to JSONL as ` eval_scores ` events, enabling A/B comparison
223+ of prompt or parameter changes:
224+
225+ ``` bash
226+ # Compare harmonic means across runs
227+ jq ' select(.event == "eval_scores") | .data' /tmp/ollama_python_agent.jsonl
228+ ```
229+
230+ The evaluation uses ` temperature=0.0 ` for deterministic judging and a 128-token
231+ cap since only three scores are needed.
232+
167233## Applying to Rust Backend
168234
169235The script mirrors the Rust agent in ` crates/mt-tauri/src/agent/ ` :
@@ -183,37 +249,8 @@ Changes validated in the Python script should be ported to Rust:
1832492 . ** Actionable hints** β Add hint metadata to Rust tool ` Output ` types
1842503 . ** Default limits** β Increase ` get_top_artists_by_tag ` default from 10 to 50
1852514 . ** Hard cap** β Add dedup + truncation to ` parse_agent_response() `
186- 5 . ** Temperature/seed** β Pass through Ollama options in ` build_agent() `
252+ 5 . ** Temperature/seed/repeat_penalty ** β Pass through Ollama options in ` build_agent() `
1872536 . ** Token limit** β Set ` max_tokens: 2048 ` in ` build_agent() ` to prevent response truncation
1882547 . ** Track shuffling** β Port ` _shuffle_spread_artists() ` greedy algorithm to shuffle playlist order
189-
190- ## Usage
191-
192- ``` bash
193- # Basic
194- uv run scripts/agent.py " make me a chill playlist"
195-
196- # With options
197- uv run scripts/agent.py --model qwen3.5:9b --seed 42 --temperature 0.1 " shoegaze deep cuts"
198-
199- # Extended thinking
200- uv run scripts/agent.py --think --max-turns 8 " jazz from my library"
201- ```
202-
203- ## Configuration
204-
205- All env vars are read from ` .env ` via ` python-decouple ` . CLI flags override
206- env var defaults.
207-
208- | Env Var | Default | CLI Flag |
209- | ---------| ---------| ----------|
210- | ` OLLAMA_MODEL ` | ` qwen3.5:9b ` | ` --model ` |
211- | ` OLLAMA_HOST ` | ` http://localhost:11434 ` | ` --host ` |
212- | ` AGENT_MAX_TURNS ` | ` 5 ` | ` --max-turns ` |
213- | ` AGENT_TEMPERATURE ` | ` 0.2 ` | ` --temperature ` |
214- | ` AGENT_THINK ` | ` false ` | ` --think ` |
215- | ` AGENT_SEED ` | ` 0 ` | ` --seed ` |
216- | ` AGENT_MIN_PLAYLIST_TRACKS ` | ` 12 ` | β |
217- | ` AGENT_MAX_PLAYLIST_TRACKS ` | ` 25 ` | β |
218- | ` AGENT_LOG_FILE ` | ` /tmp/ollama_python_agent.jsonl ` | ` --log-file ` |
219- | ` LASTFM_API_KEY ` | β | β |
255+ 8 . ** Last-turn nudge** β Inject "output now" message on final turn to prevent exhaustion
256+ 9 . ** Creative naming** β Port synonym-based playlist naming instructions to ` prompt.rs `
0 commit comments