Benchmark tooling for two emotion-audio annotation subsets and CLAP-style model evaluation against them.
- `emolia-emo`: 3-level ordinal rating (`not_present` / `weakly_present` / `strongly_present`) per `(file, queried_emotion, task_type)`. Used to score whether a clip expresses a queried emotion. Variables are self-explanatory emotion names.
- `emolia-dim`: binary `yes` / `no` rating per `(file, dimension, level, polarity)` against a written rubric in `dataset/emolia-dim/variables.json`. Used to score whether a clip matches a specific level of a perceptual dimension (e.g. `TEMP` level 5 = "fast").
annotations_raw/                # gitignored; real usernames live here
  emolia-emo/{annotations.csv, users.csv}
  emolia-dim/{annotations.csv, users.csv}
annotations/                    # committed; usernames anonymized to user_0, user_1, …
  emolia-emo/{annotations.csv, users.csv}
  emolia-dim/{annotations.csv, users.csv}
dataset/
  emolia-emo/data/<Emotion>_best/<stem>.{mp3,json}
  emolia-dim/data/<DIM>/<level>/<polarity>/sample_NN.{mp3,json}
  emolia-dim/variables.json     # rubric for prompts
analysis_outputs/<subset>/
  benchmark_labels.csv  per_*_summary.csv  summary.json  incomplete_items.csv
analysis_outputs/report.md      # combined paper-ready summary
benchmark_outputs/<subset>/
  predictions.csv  metrics_by_*.csv  summary.json  report.md
uv venv --python 3.13 # only needed once
uv run anonymize.py # then any other entry point

All scripts use `uv run`.
annotations_raw/ is gitignored because usernames are still in there. Run
the anonymizer any time the raw CSVs are replaced:
uv run anonymize.py

This rewrites `annotations/<subset>/annotations.csv` and `annotations/<subset>/users.csv` with usernames replaced by `user_0`, `user_1`, … (sorted username order; mapping saved to `annotations_raw/<subset>/_anon_map.csv`).
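The renaming rule described above (pseudonyms numbered in sorted username order) can be sketched in a few lines. This is an illustration only, not the code in `anonymize.py`: the `username` column name, the helper names, and the sample rows are assumptions.

```python
def build_anon_map(usernames):
    """Map each distinct username to user_<i>, numbered in sorted username order."""
    return {name: f"user_{i}" for i, name in enumerate(sorted(set(usernames)))}


def anonymize_rows(rows, mapping, column="username"):
    """Return copies of the rows with the username column replaced by its pseudonym."""
    return [{**row, column: mapping[row[column]]} for row in rows]


# Made-up raw annotations; the real CSV columns may differ.
raw_rows = [
    {"username": "mallory", "file": "a.mp3", "rating": "yes"},
    {"username": "alice", "file": "a.mp3", "rating": "no"},
    {"username": "mallory", "file": "b.mp3", "rating": "yes"},
]
anon_map = build_anon_map(row["username"] for row in raw_rows)
anon_rows = anonymize_rows(raw_rows, anon_map)
print(anon_map)  # {'alice': 'user_0', 'mallory': 'user_1'}
```

Because the numbering depends on the full sorted set of usernames, adding a new annotator can shift existing pseudonyms, which is one reason to rerun the anonymizer whenever the raw CSVs change.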
uv run analysis.py

For each subset this writes to `analysis_outputs/<subset>/`:

- `benchmark_labels.csv`: one row per item with majority-vote target (`majority_present`), per-rating vote counts, `n_raters`, `all_agree_binary`, and `benchmark_bucket` (`unanimous_*`, `majority_*`, `single_rater_*`).
- `per_*_summary.csv`: slice tables (task type / emotion / dimension / polarity).
- `incomplete_items.csv`: items lacking 3 raters.
- `summary.json`: machine-readable summary including `human_upper_bound_binary` (mean pairwise exact agreement) and rater-coverage histogram.
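To make the majority-vote label, the bucket families, and the pairwise-agreement ceiling concrete, here is a small sketch on already-binarized votes. It is not the logic of `analysis.py`: how 3-level ratings are binarized, how ties are handled, and whether agreement is pooled over all rater pairs (as done here) or averaged per item are assumptions, and the bucket names are reduced to their prefixes.

```python
from itertools import combinations


def label_item(votes):
    """Return (majority_present, bucket_prefix) for one item's boolean votes."""
    positives = sum(votes)
    negatives = len(votes) - positives
    if positives == negatives:
        return None, "tie"  # assumption: ties (and empty vote lists) stay unlabeled
    majority = positives > negatives
    if len(votes) == 1:
        return majority, "single_rater"
    if positives == 0 or negatives == 0:
        return majority, "unanimous"
    return majority, "majority"


def pairwise_agreement(items):
    """Share of rater pairs, pooled over all items, that gave the same vote."""
    agree = total = 0
    for votes in items:
        for a, b in combinations(votes, 2):
            agree += a == b
            total += 1
    return agree / total if total else float("nan")


# Three hypothetical items: unanimous, 2-vs-1 majority, and a single rater.
items = [[True, True, True], [True, False, True], [False]]
labels = [label_item(votes) for votes in items]
agreement = pairwise_agreement(items)
print(labels)
print(round(agreement, 3))  # 4 agreeing pairs out of 6 -> 0.667
```

Single-rater items contribute no pairs, so they do not affect the agreement figure in this reading.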
`analysis.py` also writes a single combined paper-ready report to
analysis_outputs/report.md. Numbers there are formatted for direct use in a
methods section: total annotations, annotators / demographics, kappa /
Fleiss kappa, per-task and per-emotion / per-dimension breakdowns.
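For readers who want to check the reported agreement statistic by hand, the standard Fleiss' kappa formula can be sketched as follows. This is the textbook definition, not necessarily the exact computation in `analysis.py`; it assumes every item has the same number of raters, so items listed in `incomplete_items.csv` would have to be excluded first. The example counts are made up.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for counts[i][j] = number of raters placing item i in category j.

    Every item must have the same number of raters (at least 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number (>= 2) of raters")
    n_cats = len(counts[0])

    # Overall share of all ratings that fell into each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters) for j in range(n_cats)
    ]
    # Per-item agreement: fraction of rater pairs that agree on that item.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_item) / n_items          # observed agreement
    p_exp = sum(p * p for p in p_cat)      # agreement expected by chance
    if p_exp == 1.0:
        return float("nan")                # all ratings in one category
    return (p_bar - p_exp) / (1.0 - p_exp)


# Four hypothetical items, three raters each, three rating levels.
example_counts = [[3, 0, 0], [0, 3, 0], [0, 0, 3], [2, 1, 0]]
kappa = fleiss_kappa(example_counts)
print(round(kappa, 4))  # 35/47 -> 0.7447
```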
Sham mode (deterministic fake similarity, no audio is read):
uv run benchmark.py

Remote endpoint mode:

uv run benchmark.py --endpoint http://127.0.0.1:8765/v1/similarity

In-process CLAP mode (loads a SentenceTransformer audio-text model on GPU and scores in-process, with no HTTP round-trip):
uv run benchmark.py \
--clap-model /path/to/snapshot_or_hf_repo \
--clap-device cuda --clap-dtype bfloat16

Published VoiceCLAP HF models have a `--model` shortcut that picks the right in-process backend (SentenceTransformer for `voiceclap-large`, AutoModel for `voiceclap-small`):
uv run benchmark.py --model voiceclap-small
uv run benchmark.py --model voiceclap-large

In-process modes need `torch`, `sentence-transformers`, `transformers`, and `librosa`, declared as the `clap` extra:

uv sync --extra clap

CLAP-mode tuning flags (only active with `--clap-model` / `--model`):

- `--clap-device cuda|cpu` (default `cuda`).
- `--clap-dtype bfloat16|float16|float32` (default `bfloat16`).
- `--clap-audio-batch-size N` (default 8).
- `--clap-text-batch-size N` (default 8).
- `--clap-max-seconds S`: truncate audio before encoding (default 30s).
Useful flags:
- `--subset emolia-emo|emolia-dim|both` (default `both`).
- `--limit N`: quick smoke test on first N rows.
- `--threshold 0.0`: predict positive if similarity ≥ threshold.
- `--no-audio-send`: with `--endpoint`, send only stem + text in the JSON payload (server reads files itself).
- Filter flags (default keeps every row in `benchmark_labels.csv`):
  - `--min-raters N`: drop items with fewer than N human raters.
  - `--exclude-flagged`: drop items annotators flagged as broken audio.
  - `--unanimous-only`: keep only items where every rater agreed.
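The three filter flags compose as successive row filters. The sketch below mirrors their documented meaning in plain Python so it runs without pandas; it is not the code in `benchmark.py`. The function name and sample rows are hypothetical, the bucket suffixes are placeholders, and the values are assumed to be already typed (rows read with `csv.DictReader` would arrive as strings).

```python
def filter_items(rows, min_raters=1, exclude_flagged=False, unanimous_only=False):
    """Keep the rows that survive the requested filters, in their original order."""
    kept = []
    for row in rows:
        if row["n_raters"] < min_raters:
            continue  # --min-raters N
        if exclude_flagged and row["flagged"]:
            continue  # --exclude-flagged
        if unanimous_only and not row["benchmark_bucket"].startswith("unanimous_"):
            continue  # --unanimous-only
        kept.append(row)
    return kept


# Made-up label rows; "_x" stands in for the real bucket suffixes.
rows = [
    {"item": "a", "n_raters": 3, "flagged": False, "benchmark_bucket": "unanimous_x"},
    {"item": "b", "n_raters": 1, "flagged": False, "benchmark_bucket": "single_rater_x"},
    {"item": "c", "n_raters": 2, "flagged": True, "benchmark_bucket": "majority_x"},
    {"item": "d", "n_raters": 2, "flagged": False, "benchmark_bucket": "majority_x"},
]
everything = filter_items(rows)
recommended = filter_items(rows, min_raters=2, exclude_flagged=True)
strictest = filter_items(rows, min_raters=2, exclude_flagged=True, unanimous_only=True)
print([r["item"] for r in recommended])  # ['a', 'd']
```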
benchmark_labels.csv keeps every annotated item, including ones with only
one rater and ones an annotator flagged. The columns n_raters,
benchmark_bucket, and flagged describe each row so you can pick the
subset that matches your evaluation goal:
The most permissive view. Every annotated item counts, and the label
`majority_present` is the majority of whatever votes are available (a single
rater's vote when only one human rated the item).
uv run benchmark.py
# or, equivalently in pandas:
labels = pd.read_csv("analysis_outputs/emolia-dim/benchmark_labels.csv")

Use this when you want maximum data volume and don't mind that ~25% of emolia-dim labels come from a single rater.
The recommended evaluation subset for headline numbers in a paper. Every included item has at least two human ratings of the same clip, whether they agree or not, and broken-audio items are excluded.
uv run benchmark.py --min-raters 2 --exclude-flagged
# or, in pandas:
labels = pd.read_csv("analysis_outputs/emolia-dim/benchmark_labels.csv")
labels = labels[(labels["n_raters"] >= 2) & (~labels["flagged"])]

This matches the "2 humans + Gemini preselection = 3-rater panel" that the human upper bound is computed on, so the model and the human ceiling are comparable.
If you want to see how a model performs only on the items annotators flagged, or audit them yourself:
# Just the flagged items, in pandas:
labels = pd.read_csv("analysis_outputs/emolia-dim/benchmark_labels.csv")
flagged = labels[labels["flagged"]]
# Or skim the raw flag reasons:
flags = pd.read_csv("annotations/emolia-dim/flags.csv")

The 37 flagged items in emolia-dim are 0.2% of the corpus and the headline metrics move by < 0.001 with or without them, but they are the right thing to inspect when training a model that should robustly skip clips with no speech.
Add --unanimous-only to either of the modes above to keep only items
where every rater on the panel agreed. Smallest sample but highest
label confidence.
uv run benchmark.py --min-raters 2 --exclude-flagged --unanimous-only

For each subset, `benchmark.py` writes to `benchmark_outputs/<subset>/`:

- `predictions.csv`
- `metrics_by_<task_type|polarity>.csv`
- `metrics_by_benchmark_bucket.csv`
- `summary.json`
- `report.md`
The report starts and ends with a score rubric:
| Band | Balanced accuracy | Notes |
|---|---|---|
| Bad | < 0.55 | At or near chance (0.5); model isn't learning |
| Weak | 0.55 – 0.65 | Some signal, far from human |
| Medium | 0.65 – 0.75 | Useful but lossy; decent training target |
| Good | 0.75 – 0.85 | Strong CLAP-style performance |
| Excellent | ≥ 0.85 | Approaches human inter-rater agreement |
Headline metric: balanced accuracy on majority_present. The unanimous
subset (benchmark_bucket starting with unanimous_) is the cleanest target
to optimize against.
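As a rough illustration of the headline metric, the sketch below thresholds similarity scores, computes balanced accuracy against `majority_present`, and looks up the band from the rubric table. It follows the standard definition of balanced accuracy (mean of true-positive and true-negative rates) and the "similarity ≥ threshold" rule; the function names and the sample scores are made up, and `benchmark.py` may compute or round things differently.

```python
def predict(similarities, threshold=0.0):
    """Predict positive when similarity >= threshold."""
    return [s >= threshold for s in similarities]


def balanced_accuracy(y_true, y_pred):
    """Mean of recall on the positive class and recall on the negative class."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth and pred:
            tp += 1
        elif truth:
            fn += 1
        elif pred:
            fp += 1
        else:
            tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative item")
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))


def band(score):
    """Map a balanced-accuracy score onto the rubric bands."""
    for upper, name in [(0.55, "Bad"), (0.65, "Weak"), (0.75, "Medium"), (0.85, "Good")]:
        if score < upper:
            return name
    return "Excellent"


# Hypothetical labels and model similarities for five items.
majority_present = [True, True, True, False, False]
similarities = [0.4, 0.1, -0.2, -0.3, 0.2]
score = balanced_accuracy(majority_present, predict(similarities, threshold=0.0))
print(round(score, 3), band(score))  # 0.583 Weak
```

Balanced accuracy is insensitive to class imbalance, so a model that always answers "present" scores 0.5 regardless of how many positives the subset contains.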
benchmark.py posts JSON to your endpoint:
{
"text": "Speech audio in which the speaker expresses or conveys Sadness.",
"audio_filename": "EN_B00025_S06526_W000000.mp3",
"audio_base64": "<base64 mp3 bytes>"
}

The server should return one of `{"similarity": …}`, `{"score": …}`, or `{"logit": …}`. The local stub server `sham_clap_server.py` implements this contract for testing.
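The request and response shapes above can be exercised without a network connection. The sketch below builds the payload, produces a deterministic fake score in the spirit of sham mode, and reads back whichever response key is present. None of this is taken from `benchmark.py` or `sham_clap_server.py`: the hashing scheme, the key precedence, the helper names, and the choice to simply omit `audio_base64` for `--no-audio-send` are assumptions.

```python
import base64
import hashlib
import json


def build_payload(text, audio_filename, audio_bytes=None):
    """Build the JSON body; audio_base64 is left out when no bytes are given."""
    payload = {"text": text, "audio_filename": audio_filename}
    if audio_bytes is not None:
        payload["audio_base64"] = base64.b64encode(audio_bytes).decode("ascii")
    return payload


def sham_similarity(payload):
    """Deterministic fake score in [-1, 1] from filename and text; audio is ignored."""
    key = f"{payload['audio_filename']}\n{payload['text']}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    fraction = int.from_bytes(digest[:8], "big") / (2**64 - 1)
    return 2.0 * fraction - 1.0


def extract_score(response):
    """Accept any of the three documented response keys."""
    for key in ("similarity", "score", "logit"):
        if key in response:
            return float(response[key])
    raise KeyError("response has none of: similarity, score, logit")


payload = build_payload(
    "Speech audio in which the speaker expresses or conveys Sadness.",
    "EN_B00025_S06526_W000000.mp3",
    audio_bytes=b"placeholder bytes, not a real mp3",
)
# Round-trip through JSON the way an HTTP client and stub server would.
request_body = json.dumps(payload)
response_body = json.dumps({"similarity": sham_similarity(json.loads(request_body))})
score = extract_score(json.loads(response_body))
print(score == sham_similarity(payload))  # True: same input, same score
```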
uv run sham_clap_server.py --port 8765
uv run benchmark.py --endpoint http://127.0.0.1:8765/v1/similarity

# Drop new raw CSVs into annotations_raw/<subset>/, then:
uv run anonymize.py
uv run analysis.py
uv run benchmark.py