This repository was archived by the owner on Apr 27, 2026. It is now read-only.

Normalize per-sample embeddings before averaging centroid #5

Open
ComputelessComputer wants to merge 1 commit into main from fix/l2-normalize-embedding-centroid

Conversation

@ComputelessComputer
Collaborator

Summary

Speaker embeddings must be L2-normalized before averaging so samples with larger raw magnitudes (typically longer or louder clips) don't bias the centroid. The previous normalizedEmbeddingCentroid summed raw WeSpeaker outputs and only L2-normalized the result at the end.

What changed

src-tauri/swift-permissions/src/speech_bridge.swift — per-sample L2 normalization before summation inside normalizedEmbeddingCentroid. Zero-magnitude samples are skipped. The final L2-normalization of the summed vector is preserved.
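
The change described above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the names `l2Norm` and `sketchNormalizedCentroid`, the `[[Float]]` input type, and the optional return value are all assumptions made for the example.

```swift
import Foundation

// Hypothetical helper: Euclidean (L2) norm of a vector.
func l2Norm(_ v: [Float]) -> Float {
    return v.reduce(Float(0)) { $0 + $1 * $1 }.squareRoot()
}

// Sketch of a centroid that L2-normalizes each sample before summing.
// The name and signature are assumptions, not the PR's actual function.
func sketchNormalizedCentroid(_ embeddings: [[Float]]) -> [Float]? {
    guard let dim = embeddings.first?.count, dim > 0 else { return nil }
    var sum = [Float](repeating: 0, count: dim)
    var used = 0
    for e in embeddings where e.count == dim {
        let norm = l2Norm(e)
        guard norm > 0 else { continue }  // skip zero-magnitude samples
        for i in 0..<dim {
            sum[i] += e[i] / norm         // each sample contributes a unit vector
        }
        used += 1
    }
    // Final L2 normalization of the summed vector, as before.
    let total = l2Norm(sum)
    guard used > 0, total > 0 else { return nil }
    return sum.map { $0 / total }
}
```

For two samples `[10, 0]` and `[0, 1]`, summing the raw vectors and normalizing once gives roughly `[0.995, 0.0995]`, which is almost entirely the first sample. Normalizing each sample first gives roughly `[0.707, 0.707]`, so both samples count equally.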

Why it helps

Centroid embeddings drive speaker similarity comparisons (used today for cross-meeting speaker identification and, after #5, for constraining over-segmented diarization). Giving each sample equal weight — regardless of raw magnitude — matches the standard recipe for averaging speaker embeddings and reduces drift when a speaker has one long monologue plus several short contributions.

What's not in this PR

  • constrainDiarizedSegments embedding-based reassignment (separate PR).
  • Stratified sampling across segments in selectSpeakerEmbeddingSegments (H3 in the issue) — follow-up.

Testing notes

Swift-only change. bun run build for the frontend still passes. Please verify with the existing Swift test suite and, if available, the in-app speaker-suggestion flow with a known speaker to confirm match quality is the same or better.

Addresses #4.

Speaker embeddings must be L2-normalized before averaging so high-magnitude samples don't dominate the centroid. The old code summed raw WeSpeaker outputs and only normalized at the end, which biases the centroid toward louder or longer clips. Now each sample is L2-normalized before summation; the resulting mean is re-normalized as before.