Skip to content

Commit 4ac5ca6

Browse files
feat(agent): add repeat penalty, LLM-as-judge eval, creative naming
- Add repeat_penalty=1.1 to Ollama options (CTRL-style token repetition penalty) with AGENT_REPEAT_PENALTY env var and --repeat-penalty CLI flag - Bump default temperature from 0.2 to 0.3 for more creative output - Add LLM-as-judge evaluation pass scoring concept match, instruction following, and track variety (0-2 each) with harmonic mean; uses /no_think and temperature=0.0 for deterministic judging - Add last-turn nudge injecting "output now" message to prevent turn exhaustion caused by repeat penalty discouraging reuse patterns - Add creative playlist naming instructions (synonyms over parroting) - Add summary line with track count, unique artists, and turns used - Update docs with new parameters, eval section, and current examples Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent b9263a2 commit 4ac5ca6

File tree

2 files changed

+227
-50
lines changed

2 files changed

+227
-50
lines changed

β€Ždocs/agent.mdβ€Ž

Lines changed: 85 additions & 48 deletions
Original file line numberDiff line numberDiff line change
@@ -36,6 +36,44 @@ to the Rust backend (`crates/mt-tauri/src/agent/`).
3636
outcome is logged to a structured JSONL file for analysis.
3737
- **Hard cap** β€” `parse_response()` deduplicates and truncates track IDs to
3838
`MAX_PLAYLIST_TRACKS` regardless of model output.
39+
- **Last-turn nudge** β€” On the final turn, a user message is injected telling
40+
the model to output the playlist immediately. Prevents turn exhaustion when
41+
`repeat_penalty` discourages the model from reusing its "compile now" pattern.
42+
- **Creative naming** β€” System prompt instructs the model to use evocative
43+
synonyms for playlist names instead of parroting the user's request
44+
(e.g. "chill" -> "Velvet Haze").
45+
46+
## Usage
47+
48+
```bash
49+
# Basic
50+
uv run scripts/agent.py "make me a chill playlist"
51+
52+
# With options
53+
uv run scripts/agent.py --model qwen3.5:9b --seed 42 --temperature 0.1 "shoegaze deep cuts"
54+
55+
# Extended thinking
56+
uv run scripts/agent.py --think --max-turns 8 "jazz from my library"
57+
```
58+
59+
## Configuration
60+
61+
All env vars are read from `.env` via `python-decouple`. CLI flags override
62+
env var defaults.
63+
64+
| Env Var | Default | CLI Flag |
65+
|---------|---------|----------|
66+
| `OLLAMA_MODEL` | `qwen3.5:9b` | `--model` |
67+
| `OLLAMA_HOST` | `http://localhost:11434` | `--host` |
68+
| `AGENT_MAX_TURNS` | `5` | `--max-turns` |
69+
| `AGENT_TEMPERATURE` | `0.3` | `--temperature` |
70+
| `AGENT_THINK` | `false` | `--think` |
71+
| `AGENT_SEED` | `0` | `--seed` |
72+
| `AGENT_REPEAT_PENALTY` | `1.1` | `--repeat-penalty` |
73+
| `AGENT_MIN_PLAYLIST_TRACKS` | `12` | β€” |
74+
| `AGENT_MAX_PLAYLIST_TRACKS` | `25` | β€” |
75+
| `AGENT_LOG_FILE` | `/tmp/ollama_python_agent.jsonl` | `--log-file` |
76+
| `LASTFM_API_KEY` | β€” | β€” |
3977

4078
## Tools
4179

@@ -120,50 +158,78 @@ on error messages as navigation.
120158

121159
Exhausted 5 turns. Produced a 2-track playlist of keyword matches.
122160

123-
**After** (Last.fm tools, strategy prompt, hints, temp=0.2, seed=42):
161+
**After** (Last.fm tools, strategy prompt, hints, repeat penalty, temp=0.3):
124162

125163
```jsonl
126-
{"event":"session_start","data":{"temperature":0.2,"seed":42,"prompt":"make me a chill playlist"}}
164+
{"event":"session_start","data":{"temperature":0.3,"repeat_penalty":1.1,"prompt":"make me a chill playlist"}}
127165
{"event":"tool_call","data":{"tool":"get_top_artists_by_tag","args":{"tag":"chillout","limit":50}}}
128166
{"event":"tool_call","data":{"tool":"get_top_artists_by_tag","args":{"tag":"dream pop","limit":50}}}
129-
{"event":"tool_call","data":{"tool":"get_top_artists_by_tag","args":{"tag":"shoegaze","limit":50}}}
130-
{"event":"tool_result","data":{"tool":"get_top_artists_by_tag","count":6}}
131-
{"event":"tool_call","data":{"tool":"get_similar_tracks","args":{"artist":"Cigarettes After Sex","track":"K."}}}
132-
{"event":"tool_call","data":{"tool":"get_similar_tracks","args":{"artist":"Beach House","track":"Sparks"}}}
133-
{"event":"parse_success","data":{"playlist_name":"Chill Vibes Collection","track_ids":[69727,70192,71486,"...21 more"],"valid_count":25}}
134-
{"event":"session_end","data":{"reason":"success","turns_used":4}}
167+
{"event":"tool_call","data":{"tool":"get_top_artists_by_tag","args":{"tag":"lo-fi","limit":50}}}
168+
{"event":"tool_call","data":{"tool":"get_top_artists_by_tag","args":{"tag":"ambient","limit":50}}}
169+
{"event":"parse_success","data":{"playlist_name":"Velvet Haze","track_ids":[68967,69901,"...12 more"],"valid_count":14}}
170+
{"event":"eval_scores","data":{"concept":2,"instruction":2,"variety":2,"harmonic_mean":2.0}}
171+
{"event":"session_end","data":{"reason":"success","turns_used":2}}
135172
```
136173

137-
25/25 valid tracks in 4 turns. Artists: Beach House, Cocteau Twins,
138-
Cigarettes After Sex, Alvvays, girl in red, The Radio Dept., M83, Grimes.
174+
14/14 valid tracks in 2 turns, 13 artists. Eval: 2/2 across all criteria.
139175

140176
**Artist variety after shuffling** β€” The final playlist order spreads same-artist
141177
tracks apart via greedy algorithm:
142178

143179
```
144180
SHUFFLED order (artists spread out):
145-
[68876] The Radio Dept. - Four Months In The Shade
181+
[70060] Massive Attack - Angel
146182
[68658] Beach House - Sparks
147-
[68791] Cocteau Twins - Tishbite
148-
[68924] M83 - Karl
183+
[68669] Car Seat Headrest - Sunburned Shirts
184+
[68671] Cocteau Twins - Iceblink Luck
149185
[68709] Grimes - Symphonia IX (My Wait Is U)
150-
[69848] Alvvays - Next Of Kin
151-
... (10 different artists, 1-2 tracks each, no same-artist adjacency)
186+
[68734] Alvvays - Dives
187+
... (13 different artists, 1 track each)
188+
189+
Summary: 14 tracks, 13 artists, 2/5 turns
152190
```
153191

154192
## Determinism Controls
155193

156194
| Lever | Default | Effect |
157195
|-------|---------|--------|
158-
| `AGENT_TEMPERATURE` | 0.2 | Lower = more deterministic token sampling |
196+
| `AGENT_TEMPERATURE` | 0.3 | Lower = more deterministic token sampling |
159197
| `top_p` | 0.9 | Nucleus sampling cutoff (hardcoded) |
160198
| `num_predict` | 2048 | Maximum tokens to generate (prevents truncation) |
199+
| `AGENT_REPEAT_PENALTY` | 1.1 | Penalizes repeated tokens to reduce gibberish (CTRL-style) |
161200
| `AGENT_SEED` | 0 (random) | Fixed seed for reproducible output |
162201
| `AGENT_MIN_PLAYLIST_TRACKS` | 12 | Minimum tracks to include in playlist |
163202
| `AGENT_MAX_PLAYLIST_TRACKS` | 25 | Hard cap on output track count |
164203
| `parse_response()` | β€” | Deduplicates + truncates regardless of model output |
165204
| `_shuffle_spread_artists()` | β€” | Greedy shuffle to spread same-artist tracks apart |
166205

206+
## LLM-as-Judge Evaluation
207+
208+
After generating a playlist, the agent runs an automated evaluation pass using
209+
the same Ollama model as judge. Inspired by the AxBench metrics from the
210+
[Eiffel Tower Llama](https://huggingface.co/spaces/dlouapre/eiffel-tower-llama)
211+
paper, three criteria are scored 0-2:
212+
213+
| Criterion | What it measures |
214+
|-----------|-----------------|
215+
| **Concept match** | Does the playlist match the requested mood/genre/theme? |
216+
| **Instruction following** | Valid playlist format, correct track count? |
217+
| **Track variety** | Diverse artists vs. repetitive? |
218+
219+
A **harmonic mean** of the three scores penalizes playlists that fail on any
220+
single dimension (e.g. on-theme but all from one artist scores poorly).
221+
222+
Scores are logged to JSONL as `eval_scores` events, enabling A/B comparison
223+
of prompt or parameter changes:
224+
225+
```bash
226+
# Compare harmonic means across runs
227+
jq 'select(.event == "eval_scores") | .data' /tmp/ollama_python_agent.jsonl
228+
```
229+
230+
The evaluation uses `temperature=0.0` for deterministic judging and a 128-token
231+
cap since only three scores are needed.
232+
167233
## Applying to Rust Backend
168234

169235
The script mirrors the Rust agent in `crates/mt-tauri/src/agent/`:
@@ -183,37 +249,8 @@ Changes validated in the Python script should be ported to Rust:
183249
2. **Actionable hints** β€” Add hint metadata to Rust tool `Output` types
184250
3. **Default limits** β€” Increase `get_top_artists_by_tag` default from 10 to 50
185251
4. **Hard cap** β€” Add dedup + truncation to `parse_agent_response()`
186-
5. **Temperature/seed** β€” Pass through Ollama options in `build_agent()`
252+
5. **Temperature/seed/repeat_penalty** β€” Pass through Ollama options in `build_agent()`
187253
6. **Token limit** β€” Set `max_tokens: 2048` in `build_agent()` to prevent response truncation
188254
7. **Track shuffling** β€” Port `_shuffle_spread_artists()` greedy algorithm to shuffle playlist order
189-
190-
## Usage
191-
192-
```bash
193-
# Basic
194-
uv run scripts/agent.py "make me a chill playlist"
195-
196-
# With options
197-
uv run scripts/agent.py --model qwen3.5:9b --seed 42 --temperature 0.1 "shoegaze deep cuts"
198-
199-
# Extended thinking
200-
uv run scripts/agent.py --think --max-turns 8 "jazz from my library"
201-
```
202-
203-
## Configuration
204-
205-
All env vars are read from `.env` via `python-decouple`. CLI flags override
206-
env var defaults.
207-
208-
| Env Var | Default | CLI Flag |
209-
|---------|---------|----------|
210-
| `OLLAMA_MODEL` | `qwen3.5:9b` | `--model` |
211-
| `OLLAMA_HOST` | `http://localhost:11434` | `--host` |
212-
| `AGENT_MAX_TURNS` | `5` | `--max-turns` |
213-
| `AGENT_TEMPERATURE` | `0.2` | `--temperature` |
214-
| `AGENT_THINK` | `false` | `--think` |
215-
| `AGENT_SEED` | `0` | `--seed` |
216-
| `AGENT_MIN_PLAYLIST_TRACKS` | `12` | β€” |
217-
| `AGENT_MAX_PLAYLIST_TRACKS` | `25` | β€” |
218-
| `AGENT_LOG_FILE` | `/tmp/ollama_python_agent.jsonl` | `--log-file` |
219-
| `LASTFM_API_KEY` | β€” | β€” |
255+
8. **Last-turn nudge** β€” Inject "output now" message on final turn to prevent exhaustion
256+
9. **Creative naming** β€” Port synonym-based playlist naming instructions to `prompt.rs`

0 commit comments

Comments
Β (0)