- Lyrics Workflow
- Clustering Workflow
- Pipeline steps
- What the feature vector means
- What stratified sampling means
- What the evolutionary search means (explore vs exploit)
- What a single iteration does
- What the fitness score means (the seven metrics)
- What cluster-to-playlist filtering means
- What batch orchestration means (concurrency, timeouts, recovery)
- What post-processing means (the finalization pipeline)
- Restarting the clustering
The lyrics pipeline turns a track into a multilingual text embedding (plus axis
scores), or falls back to an instrumental sentinel when no usable lyrics exist.
Lyrics are preferred from text sources (media server, external API); only if
those miss does the track go through Whisper-small ASR on the audio. The
embedding model (gte-multilingual-base) is language-agnostic, so there is no
translation step - language detection is used only for metadata and as a
quality gate.
| # | Step | Control made |
|---|---|---|
| 1 | musicnn instrumental check | If musicnn flagged the track as instrumental → skip everything, emit instrumental sentinel |
| 2 | Media-server lyrics fetch | Fetch by track_id; _sanitize_lyrics_text; if non-empty text → use it and skip STEPS 3-5 |
| 3 | External lyrics API | If enabled + artist/track present and media-server missed → fetch & _sanitize_lyrics_text; HIT skips STEPS 4-5; MISS falls back to ASR |
| 4 | Audio prep | Trim/load audio up to MAX_AUDIO_SECONDS (240s) for ASR |
| 4b | VAD (Silero ONNX) | _apply_vad: keep only voiced audio; too little voice → instrumental (unless musicnn flagged a vocalist) |
| 5 | Whisper-small ASR | Transcribe (300s timeout → empty); _sanitize_lyrics_text; record asr_lang, avg_logprob; _resolve_lang_and_quality → drop to instrumental if junk |
| 6 | langdetect (text lyrics only, whisper_raw_len==0) | detect_langs → lang + confidence; _resolve_lang_and_quality; if no CJK and conf < 0.70 → drop (the text-path reliability gate - see _asr_should_drop) |
| 7 | ASR reliability gate | _asr_should_drop (low logprob / null lang) → drop. Language was already resolved in STEP 5 |
| 8 | Final text gate | _text_quality_reject on final text (with the resolved language) → drop to instrumental if junk |
| 9 | Embedding + axis scoring | If len ≥ MIN_CHARS_FOR_EMBEDDING (250) → embed (gte-multilingual) + score axes; else / empty embedding → instrumental sentinel |
STEPS 4-5 (audio + VAD + ASR) only run when no text lyrics were found in STEP 2 or 3. Each chapter below explains one function from the table.
VAD (Voice Activity Detection) runs the Silero ONNX model over the prepared audio to find the segments that actually contain a voice, before sending anything to ASR. Functionally:
- Detect speech - scans the clip and returns voiced timestamps using LYRICS_VAD_THRESHOLD (0.2). If nothing is found, it retries once at a lower floor (LYRICS_VAD_RETRY_FLOOR, 0.15).
- No speech at all - if even the retry finds nothing, it falls back to sending the full clip to ASR (rather than dropping the track outright).
- Too little voice - if voiced audio is below VAD_VOICE_RECOGNITION seconds, the track is treated as instrumental - unless musicnn already flagged a vocalist (vocal_prior), in which case the gate is bypassed and the full clip is sent to ASR anyway.
- Enough voice - keeps only the concatenated voiced segments and passes them to Whisper, so ASR isn't fed long instrumental stretches.
This both improves transcription quality and filters instrumentals before the expensive ASR step.
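The routing described above can be sketched roughly as follows. This is a minimal illustration, not the project's _apply_vad: the function names, the `detect` callable, and the `min_voice_seconds` parameter (standing in for VAD_VOICE_RECOGNITION, whose value is not stated here) are assumptions.

```python
from typing import Callable, List, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds) of voiced audio


def detect_with_retry(detect: Callable[[float], List[Segment]],
                      threshold: float = 0.2,
                      retry_floor: float = 0.15) -> List[Segment]:
    """Run the detector once, then retry once at the lower floor if empty."""
    segments = detect(threshold)
    if not segments:
        segments = detect(retry_floor)
    return segments


def route_after_vad(segments: List[Segment], clip_seconds: float,
                    min_voice_seconds: float, vocal_prior: bool):
    """Return (decision, segments_to_transcribe) for one track."""
    full_clip = [(0.0, clip_seconds)]
    if not segments:
        # No speech even after the retry: send the full clip to ASR.
        return "asr_full_clip", full_clip
    voiced = sum(end - start for start, end in segments)
    if voiced < min_voice_seconds:
        # Too little voice: instrumental, unless musicnn saw a vocalist.
        if vocal_prior:
            return "asr_full_clip", full_clip
        return "instrumental", []
    # Enough voice: transcribe only the voiced segments.
    return "asr_voiced_only", list(segments)
```

Passing the detector in as a callable keeps the sketch independent of how the Silero model is actually loaded and invoked.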
Sanitizing is applied to every text source - media server, lyrics API, and the Whisper ASR transcript - stripping everything that is not actual sung content, so the embedding sees clean lyrics instead of formatting noise. Functionally it removes:
- Invisible / control characters - BOM, zero-width spaces, and ASCII control codes that carry no meaning but distort the text.
- Non-text symbols - emoji and the various pictographic / decorative Unicode blocks (arrows, boxes, dingbats, regional-indicator flags, etc.).
- Embedded markup - <script>/<style> blocks and any other HTML-like tags, which appear when an API leaks a web page instead of plain lyrics.
- LRC timing data - inline [mm:ss.xx] timestamps and full LRC metadata lines ([ar:], [ti:], [al:], [length:], [offset:], …).
- Structural section headers - standalone lines like Chorus, Verse 2, Bridge, Intro, Hook, Outro, etc., which describe structure rather than being lyrics.
- Excess blank lines - collapses runs of empty lines down to a single separator.
- Runaway length - truncates to a maximum word count (300 words) so a pathologically long blob can't dominate.
The result is trimmed plain text. If sanitizing leaves nothing, the source is treated as a miss.
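As an illustration of the removal rules above, here is a simplified sketch. It is not the project's _sanitize_lyrics_text: the regular expressions, the symbol ranges, and the name `sanitize_lyrics` are assumptions chosen to mirror the listed categories.

```python
import re

# All patterns below are illustrative approximations of the listed categories.
_INVISIBLE = re.compile(r"[\ufeff\u200b-\u200d\u2060\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")
_SYMBOLS = re.compile(r"[\u2190-\u21ff\u2500-\u27bf\U0001F000-\U0001FAFF]")
_SCRIPT_STYLE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.IGNORECASE | re.DOTALL)
_TAG = re.compile(r"<[^>]+>")
_LRC_META = re.compile(r"^\[(ar|ti|al|length|offset):[^\]]*\]$", re.IGNORECASE)
_LRC_TIME = re.compile(r"\[\d{1,2}:\d{2}(?:[.:]\d{1,3})?\]")
_SECTION = re.compile(r"^(chorus|verse|bridge|intro|hook|outro)(\s+\d+)?:?$", re.IGNORECASE)


def sanitize_lyrics(text: str, max_words: int = 300) -> str:
    text = _INVISIBLE.sub("", text)
    text = _SYMBOLS.sub("", text)
    text = _SCRIPT_STYLE.sub(" ", text)
    text = _TAG.sub(" ", text)

    kept = []
    for raw in text.splitlines():
        if _LRC_META.match(raw.strip()):
            continue  # full LRC metadata line
        line = " ".join(_LRC_TIME.sub("", raw).split())
        if _SECTION.match(line):
            continue  # structural header such as "Verse 2"
        if not line and (not kept or not kept[-1]):
            continue  # collapse runs of blank lines
        kept.append(line)

    # Truncate to a word budget while preserving line breaks.
    out, budget = [], max_words
    for line in kept:
        words = line.split()
        if len(words) > budget:
            out.append(" ".join(words[:budget]))
            break
        out.append(line)
        budget -= len(words)
    return "\n".join(out).strip()
```

An empty return value corresponds to the "source is treated as a miss" case.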
_resolve_lang_and_quality is the shared language + content control, called identically from both
STEP 5 (ASR) and STEP 6 (text). Given the text and a candidate language
(Whisper's asr_lang for ASR, langdetect's result for text), it does two things
in order:
- CJK-script override - if the text contains enough Hangul / kana / Han characters (≥ LYRICS_CJK_SCRIPT_MIN_RATIO, 0.10 of letters), it forces the language to ko/ja/zh regardless of the candidate. Script presence is a far more reliable CJK signal than either detector.
- Content quality reject - runs _text_quality_reject on the resolved language and returns its verdict.
Because both paths call this one function, the CJK override and the content
checks are guaranteed to be identical no matter where the lyrics came from. It
deliberately does not include the reliability gate - that signal is
source-specific and lives outside this function (see
_asr_should_drop).
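The two-step order can be sketched as below. This is a hedged illustration rather than the real _resolve_lang_and_quality: the code-point ranges, the tie-breaking rule between ko/ja/zh, and the `quality_reject` callable are assumptions.

```python
from typing import Callable, Optional, Tuple


def resolve_language(text: str, candidate_lang: Optional[str],
                     min_ratio: float = 0.10) -> Optional[str]:
    """Force ko/ja/zh when enough of the letters are Hangul, kana, or Han."""
    hangul = kana = han = letters = 0
    for ch in text:
        if not ch.isalpha():
            continue
        letters += 1
        cp = ord(ch)
        if 0xAC00 <= cp <= 0xD7A3 or 0x1100 <= cp <= 0x11FF:
            hangul += 1
        elif 0x3040 <= cp <= 0x30FF:
            kana += 1
        elif 0x4E00 <= cp <= 0x9FFF:
            han += 1
    if letters and (hangul + kana + han) / letters >= min_ratio:
        # Assumed tie-break: Hangul dominant -> ko; any kana -> ja; else zh.
        if hangul > 0 and hangul >= kana and hangul >= han:
            return "ko"
        if kana > 0:
            return "ja"
        return "zh"
    return candidate_lang


def resolve_lang_and_quality(
    text: str, candidate_lang: Optional[str],
    quality_reject: Callable[[str, Optional[str]], Optional[str]],
) -> Tuple[Optional[str], Optional[str]]:
    """Step 1: CJK override. Step 2: content check on the resolved language."""
    lang = resolve_language(text, candidate_lang)
    return lang, quality_reject(text, lang)
```

Because the content check receives the resolved language, a CJK override also changes which script the mismatch check expects.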
After text exists (whether from an API or from ASR), the _text_quality_reject gate decides whether the text is good enough to embed. It returns a reason string when the text should be dropped to the instrumental sentinel, or nothing when the text is accepted. The functional checks are:
- Too short - fewer than MIN_CHARS_FOR_EMBEDDING (250) characters. Short fragments don't carry enough signal for a meaningful embedding.
- Too repetitive - the zlib compression ratio of the text exceeds the threshold (LYRICS_TEXT_MAX_COMPRESSION_RATIO, default 15). A very high ratio means the text is mostly the same line repeated (ad-lib spam, "la la la" loops, ASR hallucination), so it is rejected while genuinely chorus-heavy songs still pass.
- Script/language mismatch - when the resolved language is a non-Latin-script language (Korean, Japanese, Chinese, Arabic, Russian, Thai, Hindi, etc.) but the text is ≥90% Latin characters, the content is inconsistent with the claimed language (garbled/mojibake or wrong text) and is rejected.
If none of these fire, the text is kept and proceeds to embedding.
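A minimal sketch of these three checks, under stated assumptions: the ratio is taken as uncompressed bytes divided by zlib-compressed bytes, the language set is a short illustrative subset, and the function and reason names are hypothetical.

```python
import unicodedata
import zlib
from typing import Optional

# Illustrative subset of non-Latin-script language codes (assumption).
NON_LATIN_LANGS = {"ko", "ja", "zh", "ar", "ru", "th", "hi"}


def compression_ratio(text: str) -> float:
    """Uncompressed size / compressed size; higher means more repetitive."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw))


def latin_ratio(text: str) -> float:
    """Share of alphabetic characters that are Latin-script letters."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    latin = sum(1 for c in letters
                if c.isascii() or "LATIN" in unicodedata.name(c, ""))
    return latin / len(letters)


def text_quality_reject(text: str, lang: Optional[str],
                        min_chars: int = 250,
                        max_ratio: float = 15.0,
                        latin_limit: float = 0.90) -> Optional[str]:
    """Return a reason string to drop the text, or None to accept it."""
    if len(text) < min_chars:
        return "too_short"
    if compression_ratio(text) > max_ratio:
        return "too_repetitive"
    if lang in NON_LATIN_LANGS and latin_ratio(text) >= latin_limit:
        return "script_mismatch"
    return None
```

The checks run cheapest-first, so a short fragment never reaches the compression step.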
The reliability gate is a separate signal from the content checks in
_resolve_lang_and_quality. It exists once per source, in different steps, and
is the one control that is intentionally not symmetric - because a low
confidence score does not mean the same thing on each source:
- Text path (STEP 6) - the language-confidence gate: the inline conf < LANG_CONFIDENCE_MIN (0.70) drop. langdetect only classifies text that already exists; it doesn't produce it. So low confidence does not prove the text is bad - it may simply be a language langdetect handles poorly or doesn't support. This makes it a weak signal: it catches mojibake/garbage, but it can also wrongly flag valid lyrics in an under-supported language.
- ASR path (STEP 7) - _asr_should_drop. Whisper generates the text from audio, so its low confidence directly means the produced transcript is wrong (a hallucination). This is a strong, trustworthy "bad content" signal. It drops the transcript when:
  - logprob < -1.0 (LYRICS_ASR_MIN_AVG_LOGPROB) - universal hallucination floor, every language;
  - asr_lang is null / unknown;
  - the transcript is non-English and logprob < -0.85 (LYRICS_ASR_NON_ENGLISH_MIN_LOGPROB) - a stricter floor than English faces.
The text-path confidence gate sits after the CJK branch, so when CJK script is
detected it is skipped. _asr_should_drop has no CJK branch and runs
unconditionally. The reason is what a low score actually proves on each
source:
- API / music server: low langdetect confidence is unreliable - it may just mean the language is under-supported (Latin-script bias, code-mixed K-pop / J-pop), not that the text is junk. So dropping on it risks throwing away valid lyrics. The CJK bypass exists because CJK is the most common victim: the presence of Hangul / kana / Han proves the text is genuine CJK, so we ignore langdetect's untrustworthy low score. (Other under-supported languages with no script test can still be wrongly dropped - a known limitation of this gate.)
- ASR: low logprob is a trustworthy "this is wrong" signal, because Whisper produced the text - there is nothing to forgive. And CJK characters can't rescue it, since Whisper may have hallucinated them. So the gate runs regardless of script.
On both paths CJK still passes through the content checks
(_resolve_lang_and_quality); the only
thing CJK ever bypasses is the API-path confidence drop.
The extra -0.85 floor for non-English ASR is deliberate: it is the same
"is the transcription real?" idea, calibrated per language. Whisper-small is
less reliable on non-English audio, so a medium-confidence non-English transcript
is more likely to be a hallucination than an English one at the same score, and
the stricter floor demands higher confidence before trusting it. Trade-off: a
genuine non-English song scoring between -1.0 and -0.85 (which an English
song would survive) is dropped to instrumental. This only affects the ASR path.
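The asymmetry between the two gates can be summarized in a short sketch. The function signatures are assumptions; only the thresholds and conditions come from the description above.

```python
from typing import Optional


def asr_should_drop(asr_lang: Optional[str], avg_logprob: float,
                    min_logprob: float = -1.0,
                    non_english_min_logprob: float = -0.85) -> bool:
    """ASR gate: runs unconditionally, with no CJK bypass."""
    if not asr_lang:
        return True  # null / unknown language
    if avg_logprob < min_logprob:
        return True  # universal hallucination floor
    if asr_lang != "en" and avg_logprob < non_english_min_logprob:
        return True  # stricter floor for non-English transcripts
    return False


def text_should_drop(confidence: float, cjk_detected: bool,
                     min_confidence: float = 0.70) -> bool:
    """Text-path gate: skipped entirely when CJK script was detected."""
    return (not cjk_detected) and confidence < min_confidence
```

For example, a Spanish transcript at -0.9 is dropped while an English one at the same score is kept, which is the trade-off described above.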
To re-run lyrics analysis from scratch, drop the three lyrics tables. On the next analysis run they are recreated and every track is reprocessed through the pipeline above:

    DROP TABLE IF EXISTS lyrics_embedding;
    DROP TABLE IF EXISTS lyrics_index_data;
    DROP TABLE IF EXISTS lyrics_axes_index_data;

- lyrics_embedding - per-track lyrics text, language, and embedding vector.
- lyrics_index_data - the semantic similarity index built from those embeddings.
- lyrics_axes_index_data - the axis-score index used for axis-based search.
Dropping these affects lyrics only; audio/musicnn analysis is untouched.
The clustering pipeline turns the analyzed library into a set of automatic playlists. It does not run one clustering pass - it runs an evolutionary search over clustering configurations: hundreds or thousands of independent iterations, each clustering a stratified sample of the library with slightly different parameters, each scored by a single weighted fitness number. The best-scoring iteration wins, and its clusters are post-processed into the playlists that are actually created.
There are three layers:
- Orchestrator (run_clustering_task) - prepares the data, splits the requested number of runs into batch jobs, monitors them, then finalizes the single best result into playlists.
- Batch worker (run_clustering_batch_task) - an RQ job that runs a fixed number of iterations and reports back its best one.
- Iteration (_perform_single_clustering_iteration) - one clustering attempt: sample → scale → pick parameters → (PCA) → cluster → filter → score.
Input comes from the score table (per-track tempo, energy, mood_vector,
other_features, author) and, when embedding clustering is enabled, the
embedding table. Output is a set of media-server playlists plus the
playlist table.
| # | Step | What happens |
|---|---|---|
| 1 | Load lightweight data | Fetch item_id, author, mood_vector for every track with a non-empty mood_vector; abort if fewer tracks than the minimum cluster count |
| 2 | Build genre map + targets | Stratify: bucket tracks by predominant STRATIFIED_GENRES mood; compute target_songs_per_genre from a percentile of bucket sizes |
| 3 | Plan batches | Split num_clustering_runs into batches of ITERATIONS_PER_BATCH_JOB (20); recover any child tasks already in the DB |
| 4 | Run iterations (per batch) | Each iteration: re-sample the subset, pick parameters evolutionarily, cluster + score; keep the batch's best |
| 5 | Monitor & aggregate | Orchestrate batches up to MAX_CONCURRENT_BATCH_JOBS (10); fold each batch's best into the global best + the elite pool; timeout/staleness watchdogs prevent hangs |
| 6 | Post-process winner | On the global best: duplicate filter → min-size filter → Top-N diverse selection |
| 7 | Name + create | AI-name each surviving cluster, Fisher-Yates shuffle, chunk by MAX_SONGS_PER_CLUSTER, delete old _automatic playlists, create the new ones |
Every track is reduced to one numeric vector by
score_vector. The layout is fixed and every later step
indexes into it positionally:

    [ tempo_norm, energy_norm, mood_0 … mood_n, other_0 … other_5 ]
      index 0     index 1      index 2 …        index 2+len(moods) …

- tempo / energy - normalized to 0–1 against TEMPO_MIN/MAX_BPM (40–200) and ENERGY_MIN/MAX (0.01–0.15), then clipped.
- moods - one slot per active mood label (the top-N moods, controlled by the top_n_moods parameter), filled from the track's mood_vector string.
- other features - the six OTHER_FEATURE_LABELS (danceable, aggressive, happy, party, relaxed, sad).
This feature vector is always what names and scores a cluster. What gets
clustered is either this same vector (default) or the track's raw semantic
embedding when enable_clustering_embeddings is on - in that case the feature
vector is still used afterwards to label and score the resulting clusters.
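A minimal sketch of how such a vector could be assembled. The "label:score,label:score" string format, the helper names, and the zero default for missing labels are assumptions; only the layout and the normalization bounds come from the text above.

```python
from typing import Dict, List

OTHER_FEATURE_LABELS = ["danceable", "aggressive", "happy", "party", "relaxed", "sad"]


def _norm(value: float, lo: float, hi: float) -> float:
    """Min-max normalize to 0-1, then clip."""
    return min(1.0, max(0.0, (value - lo) / (hi - lo)))


def _parse_pairs(encoded: str) -> Dict[str, float]:
    """Parse 'rock:0.6,pop:0.2' (assumed format) into a dict."""
    pairs = {}
    for part in encoded.split(","):
        if ":" in part:
            label, score = part.split(":", 1)
            pairs[label.strip()] = float(score)
    return pairs


def score_vector(tempo: float, energy: float, mood_vector: str,
                 other_features: str, mood_labels: List[str]) -> List[float]:
    moods = _parse_pairs(mood_vector)
    others = _parse_pairs(other_features)
    return (
        [_norm(tempo, 40.0, 200.0), _norm(energy, 0.01, 0.15)]  # index 0, 1
        + [moods.get(m, 0.0) for m in mood_labels]              # index 2 ...
        + [others.get(o, 0.0) for o in OTHER_FEATURE_LABELS]    # last six
    )
```

With three active moods the vector has 2 + 3 + 6 = 11 entries, and the first "other" feature sits at index 2 + len(mood_labels).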
A single iteration does not cluster the whole library - it clusters a representative subset, so that thousands of iterations stay fast and each sees a balanced cross-section.
- Genre buckets - each track is assigned a single predominant genre by taking the highest-scoring label among STRATIFIED_GENRES in its mood_vector (everything else falls into __other__).
- Per-genre target - target_songs_per_genre is the stratified_sampling_target_percentile percentile of the genre bucket sizes, floored at min_songs_per_genre_for_stratification. This is what keeps a huge genre from swamping a small one.
- Sampling - _get_stratified_song_subset draws up to the target from each genre.
- Perturbation between iterations - iteration 0 of a batch reuses the previous subset; every later iteration churns it by SAMPLING_PERCENTAGE_CHANGE_PER_RUN (0.2) - keep 80%, redraw 20%. This is the stochasticity that lets the search see different slices of the library over time.
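The target computation and the keep-80%/redraw-20% churn can be sketched as follows. This is an illustration only: the nearest-rank percentile, the per-genre dictionaries, and the function names are assumptions, not the behavior of _get_stratified_song_subset.

```python
import math
import random
from typing import Dict, List, Optional


def target_per_genre(bucket_sizes: List[int], percentile: float, floor: int) -> int:
    """Nearest-rank percentile of bucket sizes, floored (interpolation assumed)."""
    ordered = sorted(bucket_sizes)
    rank = math.ceil(percentile / 100 * len(ordered)) - 1
    rank = max(0, min(len(ordered) - 1, rank))
    return max(floor, ordered[rank])


def sample_subset(genre_map: Dict[str, List[int]], target: int,
                  previous: Optional[Dict[str, List[int]]] = None,
                  change: float = 0.2,
                  rng: Optional[random.Random] = None) -> Dict[str, List[int]]:
    """Draw up to `target` tracks per genre, keeping most of the previous draw."""
    rng = rng or random.Random()
    result = {}
    for genre, tracks in genre_map.items():
        n = min(target, len(tracks))
        kept: List[int] = []
        if previous and genre in previous:
            # Keep (1 - change) of the quota from the previous subset.
            n_keep = min(len(previous[genre]), int(round(n * (1 - change))))
            kept = rng.sample(previous[genre], n_keep)
        kept_set = set(kept)
        pool = [t for t in tracks if t not in kept_set]
        result[genre] = kept + rng.sample(pool, n - len(kept))
    return result
```

Small genres contribute everything they have, while large genres are capped at the target, which is what balances the cross-section.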
The search has no gradient; it explores the parameter space and keeps what
works. Each iteration's parameters come from
_generate_evolutionary_parameters, which
chooses one of two modes:
- Explore (random) - generate a fresh random parameter set within the configured ranges (PCA components, cluster count / DBSCAN eps & min_samples / GMM components / spectral clusters, depending on the method).
- Exploit (mutate an elite) - take one of the best solutions found so far and apply small random deltas (MUTATION_INT_ABS_DELTA 3, MUTATION_FLOAT_ABS_DELTA 0.05).
The switch between them:
- Exploitation is off for the first EXPLOITATION_START_FRACTION (0.2) of all runs - the search explores broadly before it has anything worth refining.
- After that, each iteration exploits with probability EXPLOITATION_PROBABILITY_CONFIG (0.7), otherwise still explores.
- The elite pool is the top TOP_N_ELITES (10) scoring parameter sets seen across all batches. The orchestrator passes the current elites into each new batch, so improvements propagate as the run progresses.
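The explore/exploit switch can be sketched like this. It is a simplified stand-in for _generate_evolutionary_parameters: the flat `ranges` dictionary, the clamping to the configured range, and the function names are assumptions.

```python
import random
from typing import Dict, List, Tuple, Union

Number = Union[int, float]


def random_value(lo: Number, hi: Number, rng: random.Random) -> Number:
    """Explore: uniform draw; integer ranges stay integer."""
    if isinstance(lo, int) and isinstance(hi, int):
        return rng.randint(lo, hi)
    return rng.uniform(lo, hi)


def mutate(value: Number, lo: Number, hi: Number, rng: random.Random,
           int_delta: int = 3, float_delta: float = 0.05) -> Number:
    """Exploit: small absolute delta, clamped back into the range."""
    if isinstance(value, int):
        value += rng.randint(-int_delta, int_delta)
    else:
        value += rng.uniform(-float_delta, float_delta)
    return min(hi, max(lo, value))


def generate_parameters(run_index: int, total_runs: int,
                        elites: List[Dict[str, Number]],
                        ranges: Dict[str, Tuple[Number, Number]],
                        rng: random.Random,
                        start_fraction: float = 0.2,
                        exploit_prob: float = 0.7) -> Dict[str, Number]:
    can_exploit = bool(elites) and run_index >= start_fraction * total_runs
    if can_exploit and rng.random() < exploit_prob:
        base = rng.choice(elites)
        return {k: mutate(base[k], lo, hi, rng) for k, (lo, hi) in ranges.items()}
    return {k: random_value(lo, hi, rng) for k, (lo, hi) in ranges.items()}
```

Even late in the run roughly 30% of iterations still explore, which keeps the search from collapsing onto one region of the parameter space.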
_perform_single_clustering_iteration is the
unit of work. In order:
- Fetch + vectorize - load full track data for the subset and build the feature vectors (and embeddings, if enabled). Tracks with missing/garbled data are dropped.
- Scale - StandardScaler on whichever matrix will be clustered (embeddings or features).
- Pick parameters - explore or exploit.
- PCA (optional) - if the chosen parameters enable it, reduce dimensionality before clustering; the actual component count is recorded.
- Cluster - fit the chosen model in _apply_clustering_model: KMeans, DBSCAN, GMM (GMM_COVARIANCE_TYPE, reg_covar=1e-4), or Spectral (affinity='nearest_neighbors', SPECTRAL_N_NEIGHBORS). Degenerate configurations (e.g. k < 2, or k ≥ sample size) are rejected with a fitness_score of -1.0. GPU models are used when USE_GPU_CLUSTERING is on and the GPU module is available, with automatic CPU fallback.
- Filter + score - turn clusters into candidate playlists and compute the fitness score.
The return value carries the fitness score, the named playlists, per-cluster centroids (both the feature-space details used for naming and the clustered-space vector used for Top-N diversity), and the parameters that produced them.
Each iteration is reduced to one number: a weighted sum of seven metrics, with weights supplied by the user. A metric is only computed when its weight is non-zero. The three structural metrics are all rescaled so higher is always better:
- silhouette - (silhouette_score + 1) / 2, mapped to 0–1.
- davies_bouldin - 1 / (1 + davies_bouldin_score); Davies-Bouldin is lower-is-better, so this inverts it.
- calinski_harabasz - 1 - exp(-CH / 500), a saturating 0–1 squash.
These three need ≥ 2 clusters and fewer clusters than samples, or they stay 0.
The four content metrics describe how musically coherent the playlists are.
Each is computed as a raw sum, passed through log1p, then z-normalized
against precomputed corpus statistics (mean/sd) so the four are comparable
before weighting - and there are separate stats for embedding-based vs
feature-based clustering (LN_*_EMBEDING_STATS vs LN_*_STATS):
- mood_diversity - sums the predominant-mood score of each distinct playlist mood; rewards a set of playlists that between them span many moods.
- mood_purity - within each playlist, how strongly its songs actually carry the playlist's top TOP_K_MOODS_FOR_PURITY_CALCULATION (3) moods; rewards internally consistent playlists.
- other_feature_diversity / other_feature_purity - the same two ideas applied to the six "other features", gated by OTHER_FEATURE_PREDOMINANCE_THRESHOLD_FOR_PURITY (0.3) so only features a cluster genuinely leans into count.
final_score = Σ weightₖ · metricₖ. Diversity and purity pull against each
other (more, narrower playlists vs fewer, broader ones), and the weights are how
the user tunes that trade-off.
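The arithmetic of the score can be sketched in a few lines. The function names and the dictionary-based interface are assumptions; the formulas are the ones stated above.

```python
import math
from typing import Dict


def rescale_structural(silhouette: float, davies_bouldin: float,
                       calinski_harabasz: float) -> Dict[str, float]:
    """Map the three structural metrics to 0-1, higher is better."""
    return {
        "silhouette": (silhouette + 1) / 2,
        "davies_bouldin": 1 / (1 + davies_bouldin),
        "calinski_harabasz": 1 - math.exp(-calinski_harabasz / 500),
    }


def normalize_content(raw_sum: float, mean: float, sd: float) -> float:
    """log1p, then z-normalize against precomputed corpus statistics."""
    return (math.log1p(raw_sum) - mean) / sd


def fitness(metrics: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum; metrics with a zero weight are skipped."""
    return sum(w * metrics.get(name, 0.0)
               for name, w in weights.items() if w != 0)
```

For instance, a silhouette of 0.0 rescales to 0.5, and with weights {silhouette: 1.0, mood_purity: 0.5} and a normalized mood_purity of 2.0 the final score is 1.0 · 0.5 + 0.5 · 2.0 = 1.5.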
Raw cluster membership is not used directly; each cluster is trimmed into a
candidate playlist inside
_format_and_score_iteration_result:
- Distance gate - every point's distance to its cluster center is normalized to 0–1; members beyond MAX_DISTANCE (0.5) are dropped, so loose outliers don't dilute a playlist. DBSCAN noise (label -1) is excluded outright.
- Closest-first - surviving members are sorted by distance to the center, so the most representative tracks are kept first.
- Per-artist cap - at most MAX_SONGS_PER_ARTIST (3) songs per artist (case-insensitive author key); set ≤ 0 to disable. Consistent with the path and voyager managers.
- Per-cluster cap - at most max_songs_per_cluster songs (0 = unlimited).
- Naming - _name_cluster inverts the centroid back to feature space and builds a name from the tempo band (Slow/Medium/Fast), the top moods, and any strongly-present other features (e.g. Happy_Party_Fast). When clustering on embeddings, the name is derived from the cluster's mean feature vector instead.
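The trimming steps (distance gate, closest-first, per-artist cap, per-cluster cap) can be sketched for a single cluster as below. Normalizing by the largest distance within the cluster is an assumption, since the text does not say how the 0–1 normalization is done; the tuple layout and function name are also hypothetical.

```python
from typing import List, Tuple

Member = Tuple[str, str, float]  # (item_id, author, distance_to_center)


def trim_cluster(members: List[Member], max_distance: float = 0.5,
                 max_per_artist: int = 3, max_per_cluster: int = 0) -> List[str]:
    if not members:
        return []
    # Assumed normalization: divide by the largest distance in the cluster.
    largest = max(dist for _, _, dist in members) or 1.0
    kept: List[str] = []
    per_artist = {}
    # Closest-first, so the most representative tracks win the caps.
    for item_id, author, dist in sorted(members, key=lambda m: m[2]):
        if dist / largest > max_distance:
            break  # everything after this is farther still
        key = (author or "").strip().lower()  # case-insensitive author key
        if max_per_artist > 0 and per_artist.get(key, 0) >= max_per_artist:
            continue
        per_artist[key] = per_artist.get(key, 0) + 1
        kept.append(item_id)
        if max_per_cluster > 0 and len(kept) >= max_per_cluster:
            break
    return kept
```

Sorting before applying the caps matters: the artist cap then discards an artist's least representative songs rather than arbitrary ones.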
Iterations are expensive and run as RQ jobs, so the orchestrator manages them defensively - the overriding goal is that the task always finishes, even if individual batches die.
- Batching - num_clustering_runs is split into batches of ITERATIONS_PER_BATCH_JOB (20). Up to MAX_CONCURRENT_BATCH_JOBS (10) run at once.
- Aggregation - _monitor_and_process_batches collects each finished batch's best result, updates the global best, and feeds the elite pool (pruned to TOP_N_ELITES).
- Per-batch timeout - a batch running longer than CLUSTERING_BATCH_TIMEOUT_MINUTES (60) is declared failed, its runs are counted as done anyway (so the total can complete), and it's cleared from the active set.
- Failure ceiling - once CLUSTERING_MAX_FAILED_BATCHES (10) batches have failed, no new batches launch and the remaining runs are force-completed.
- Staleness watchdog - if runs_completed doesn't advance for CLUSTERING_BATCH_TIMEOUT_MINUTES, the task force-completes with the best result found so far, rather than hanging near the end.
- State recovery / idempotency - on restart the task reloads child tasks from the DB and resumes; a task already in a terminal state is skipped.
If no valid solution was found across every run, finalization raises rather than creating empty playlists.
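The batching and guard conditions above reduce to a few small predicates. The sketch below is illustrative only: the function names are hypothetical and it does not touch RQ or the database.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def plan_batches(num_runs: int, per_batch: int = 20) -> List[Tuple[int, int]]:
    """Split the run count into (start_run, runs_in_batch) pairs."""
    return [(start, min(per_batch, num_runs - start))
            for start in range(0, num_runs, per_batch)]


def can_launch(active: int, failed: int,
               max_concurrent: int = 10, max_failed: int = 10) -> bool:
    """A new batch launches only below both the concurrency and failure caps."""
    return active < max_concurrent and failed < max_failed


def timed_out(started_at: datetime, now: datetime,
              timeout_minutes: int = 60) -> bool:
    """Used for the per-batch timeout and, with the time of the last
    progress update as started_at, for the staleness watchdog."""
    return now - started_at > timedelta(minutes=timeout_minutes)
```

For example, 50 requested runs become three batches of 20, 20, and 10 iterations.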
The single winning result is cleaned up before any playlist is created
(tasks/clustering_postprocessing.py),
in order:
- Duplicate filtering (apply_duplicate_filtering_to_clustering_result) - within each playlist: sort by title (so near-identical titles are adjacent), drop exact title/artist duplicates (normalizing away (Remastered), [Explicit], - Radio Edit, etc.), then drop songs whose embedding distance to a recent neighbor is below the duplicate threshold, using the same metric and thresholds as the voyager manager (DUPLICATE_DISTANCE_CHECK_LOOKBACK, default lookback 1). Vectors are read straight from the embedding table; if none exist it falls back to title/artist matching only. The playlist is then shuffled.
- Minimum-size filter (apply_minimum_size_filter_to_clustering_result) - drop any playlist with fewer than MIN_PLAYLIST_SIZE_FOR_TOP_N (20) songs.
- Top-N diverse selection (select_top_n_diverse_playlists) - only if more playlists survive than top_n_playlists. Greedy max-min selection in centroid space: start from the largest playlist, then repeatedly add the candidate maximizing a combined score of distance to the already-selected set (diversity) × log size (usefulness). This keeps a spread-out, non-redundant set rather than many near-identical playlists.
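The greedy max-min selection can be sketched as follows. Euclidean distance, the use of log1p for the size term, and the dictionary layout are assumptions; the real select_top_n_diverse_playlists may differ in all three.

```python
import math
from typing import Dict, List, Tuple

Playlist = Tuple[List[float], int]  # (centroid, number_of_songs)


def select_top_n_diverse(playlists: Dict[str, Playlist], n: int) -> List[str]:
    names = list(playlists)
    if len(names) <= n:
        return names  # nothing to prune

    # Seed with the largest playlist.
    selected = [max(names, key=lambda k: playlists[k][1])]

    def score(name: str) -> float:
        centroid, size = playlists[name]
        # Diversity: distance to the nearest already-selected centroid.
        nearest = min(math.dist(centroid, playlists[s][0]) for s in selected)
        # Usefulness: log of playlist size (log1p is an assumption).
        return nearest * math.log1p(size)

    while len(selected) < n:
        candidates = [k for k in names if k not in selected]
        selected.append(max(candidates, key=score))
    return selected
```

Using the distance to the nearest selected centroid (rather than the average) is what makes a candidate that duplicates any one existing playlist score poorly.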
After post-processing, _name_and_prepare_playlists
AI-names the survivors (Ollama / OpenAI / Gemini / Mistral, falling back to the
generated name on error), a final Fisher-Yates shuffle randomizes order, and
playlists larger than MAX_SONGS_PER_CLUSTER are split into numbered chunks.
Existing _automatic playlists are deleted and the new ones created on the media
server and recorded in the playlist table.
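The shuffle and chunking steps can be illustrated with a short sketch. The chunk naming scheme ("name (1)", "name (2)", ...) and both function names are assumptions; the text only says that oversized playlists are split into numbered chunks.

```python
import random
from typing import Dict, List, Sequence


def fisher_yates_shuffle(items: Sequence, rng: random.Random) -> List:
    """Return a uniformly shuffled copy (classic Fisher-Yates, back to front)."""
    shuffled = list(items)
    for i in range(len(shuffled) - 1, 0, -1):
        j = rng.randint(0, i)  # inclusive on both ends
        shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
    return shuffled


def chunk_playlist(name: str, songs: List, max_songs: int) -> Dict[str, List]:
    """Split an oversized playlist into numbered chunks (naming is hypothetical)."""
    if max_songs <= 0 or len(songs) <= max_songs:
        return {name: songs}
    return {
        f"{name} ({start // max_songs + 1})": songs[start:start + max_songs]
        for start in range(0, len(songs), max_songs)
    }
```

Shuffling before chunking means each numbered chunk is a random slice of the cluster rather than a block of its closest or farthest members.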
Clustering is idempotent at the output level: every run begins by deleting
the existing _automatic playlists and ends by recreating them, so re-running
simply replaces the previous set - there are no clustering tables to drop. To
get a fresh result, just start a new clustering task (optionally with different
score weights, method, or run count).
Clustering only reads the analysis tables (score, embedding); it never
modifies them, so re-running clustering never requires re-analyzing audio or
lyrics.