emolia-bench

Benchmark tooling for two emotion-audio annotation subsets and CLAP-style model evaluation against them.

Subsets

emolia-emo — 3-level ordinal rating (not_present / weakly_present / strongly_present) per (file, queried_emotion, task_type). Used to score whether a clip expresses a queried emotion. Variables are self-explanatory emotion names.
emolia-dim — binary yes / no rating per (file, dimension, level, polarity) against a written rubric in dataset/emolia-dim/variables.json. Used to score whether a clip matches a specific level of a perceptual dimension (e.g. TEMP level 5 = "fast").

Repository layout

annotations_raw/          # gitignored; real usernames live here
  emolia-emo/{annotations.csv, users.csv}
  emolia-dim/{annotations.csv, users.csv}
annotations/              # committed; usernames anonymized to user_0, user_1, …
  emolia-emo/{annotations.csv, users.csv}
  emolia-dim/{annotations.csv, users.csv}
dataset/
  emolia-emo/data/<Emotion>_best/<stem>.{mp3,json}
  emolia-dim/data/<DIM>/<level>/<polarity>/sample_NN.{mp3,json}
  emolia-dim/variables.json   # rubric for prompts
analysis_outputs/<subset>/
  benchmark_labels.csv per_*_summary.csv summary.json incomplete_items.csv
analysis_outputs/report.md   # combined paper-ready summary
benchmark_outputs/<subset>/
  predictions.csv metrics_by_*.csv summary.json report.md

Environment

uv venv --python 3.13       # only needed once
uv run anonymize.py         # then any other entry point

All scripts use uv run.

Pipeline

1. Refresh anonymized annotations

annotations_raw/ is gitignored because usernames are still in there. Run the anonymizer any time the raw CSVs are replaced:

uv run anonymize.py

This rewrites annotations/<subset>/annotations.csv and annotations/<subset>/users.csv with usernames replaced by user_0, user_1, … (sorted username order; mapping saved to annotations_raw/<subset>/_anon_map.csv).
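
As an illustration of the sorted-order mapping described above, here is a minimal sketch. It is not the code in anonymize.py: the column name `username`, the helper names, and the in-memory CSV are all assumptions made for the example.

```python
import csv
import io


def build_anon_map(usernames):
    """Map each distinct username to user_<i>, with i following sorted username order."""
    return {name: f"user_{i}" for i, name in enumerate(sorted(set(usernames)))}


def anonymize_rows(rows, anon_map, column="username"):
    """Return copies of the rows with the username column replaced by its alias."""
    return [{**row, column: anon_map[row[column]]} for row in rows]


# Tiny in-memory stand-in for annotations_raw/<subset>/annotations.csv.
# The column names here are hypothetical.
raw_csv = "username,file,rating\nzoe,a.mp3,yes\nadam,a.mp3,no\nzoe,b.mp3,yes\n"
rows = list(csv.DictReader(io.StringIO(raw_csv)))

anon_map = build_anon_map(row["username"] for row in rows)
anonymized = anonymize_rows(rows, anon_map)
print(anon_map)
```

Because the aliases follow sorted order, adding a new annotator can shift the index of existing ones, which is one reason to keep the saved mapping file alongside the raw data.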

2. Build benchmark labels and the agreement summary

uv run analysis.py

For each subset this writes to analysis_outputs/<subset>/:

benchmark_labels.csv — one row per item with majority-vote target (majority_present), per-rating vote counts, n_raters, all_agree_binary, and benchmark_bucket (unanimous_*, majority_*, single_rater_*).
per_*_summary.csv — slice tables (task type / emotion / dimension / polarity).
incomplete_items.csv — items lacking 3 raters.
summary.json — machine-readable summary including human_upper_bound_binary (mean pairwise exact agreement) and rater-coverage histogram.
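
To make the benchmark_labels.csv columns concrete, the following sketch derives them for one item. It rests on several assumptions that the text above does not confirm: weakly_present and strongly_present both count as "present" when binarizing, a tie resolves to "absent", and the bucket suffixes are `present` / `absent`. The function name `summarize_item` is hypothetical.

```python
from collections import Counter

# Assumption: weak and strong presence both count as "present" when binarizing.
PRESENT_RATINGS = {"weakly_present", "strongly_present", "yes"}


def summarize_item(ratings):
    """Collapse one item's ratings into a binary majority label and a bucket name."""
    votes = [r in PRESENT_RATINGS for r in ratings]
    n_raters = len(votes)
    n_present = sum(votes)
    # Assumption: an exact tie (possible with two raters) resolves to "absent".
    majority_present = n_present * 2 > n_raters
    all_agree_binary = n_present in (0, n_raters)
    suffix = "present" if majority_present else "absent"  # hypothetical suffixes
    if n_raters == 1:
        bucket = f"single_rater_{suffix}"
    elif all_agree_binary:
        bucket = f"unanimous_{suffix}"
    else:
        bucket = f"majority_{suffix}"
    return {
        "vote_counts": dict(Counter(ratings)),
        "n_raters": n_raters,
        "majority_present": majority_present,
        "all_agree_binary": all_agree_binary,
        "benchmark_bucket": bucket,
    }


example = summarize_item(["strongly_present", "weakly_present", "not_present"])
print(example["benchmark_bucket"])
```

Note that under the binarization assumed here, a strongly_present vote and a weakly_present vote agree with each other, so all_agree_binary can be true even when the raw ordinal ratings differ.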

It also writes a single combined paper-ready report to analysis_outputs/report.md. Numbers there are formatted for direct use in a methods section: total annotations, annotators / demographics, kappa / Fleiss kappa, per-task and per-emotion / per-dimension breakdowns.
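
The human_upper_bound_binary value in summary.json is described as mean pairwise exact agreement. A minimal sketch of that statistic follows, assuming the mean is pooled over every rater pair across all items (a per-item average would be a different, equally plausible reading). The function name is hypothetical.

```python
from itertools import combinations


def mean_pairwise_agreement(items):
    """Fraction of within-item rater pairs whose binary votes match.

    `items` is a list of per-item vote lists, e.g. [[True, False, True], ...].
    Items with a single rater contribute no pairs.
    """
    matches = 0
    pairs = 0
    for votes in items:
        for a, b in combinations(votes, 2):
            matches += a == b
            pairs += 1
    return matches / pairs if pairs else float("nan")


votes_per_item = [
    [True, True, True],    # 3 of 3 pairs agree
    [True, False, True],   # 1 of 3 pairs agree
    [False, False],        # 1 of 1 pair agrees
    [True],                # single rater: no pairs
]
agreement = mean_pairwise_agreement(votes_per_item)
print(round(agreement, 4))
```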

3. Run the model benchmark

Sham mode (deterministic fake similarity, no audio is read):

uv run benchmark.py
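
How benchmark.py produces its fake scores is not documented here. One common way to get a deterministic similarity without reading audio is to hash the inputs, sketched below; the function name and the choice of SHA-256 are assumptions for illustration only.

```python
import hashlib


def sham_similarity(audio_filename, text):
    """Deterministic pseudo-similarity in [-1, 1] derived from a hash of the inputs."""
    digest = hashlib.sha256(f"{audio_filename}\n{text}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer and rescale to [-1, 1].
    fraction = int.from_bytes(digest[:8], "big") / (2**64 - 1)
    return 2.0 * fraction - 1.0


first = sham_similarity("EN_B00025_S06526_W000000.mp3", "Sadness")
second = sham_similarity("EN_B00025_S06526_W000000.mp3", "Sadness")
print(first == second)
```

A score like this carries no signal, so it is useful only for checking that the pipeline runs end to end and writes its output files.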

Remote endpoint mode:

uv run benchmark.py --endpoint http://127.0.0.1:8765/v1/similarity

In-process CLAP mode (loads a SentenceTransformer audio-text model on GPU and scores in-process — no HTTP round-trip):

uv run benchmark.py \
  --clap-model /path/to/snapshot_or_hf_repo \
  --clap-device cuda --clap-dtype bfloat16

Published VoiceCLAP HF models have a --model shortcut that picks the right in-process backend (SentenceTransformer for voiceclap-large, AutoModel for voiceclap-small):

uv run benchmark.py --model voiceclap-small
uv run benchmark.py --model voiceclap-large

In-process modes need torch, sentence-transformers, transformers, and librosa, declared as the clap extra:

uv sync --extra clap

CLAP-mode tuning flags (only active with --clap-model / --model):

--clap-device cuda|cpu (default cuda).
--clap-dtype bfloat16|float16|float32 (default bfloat16).
--clap-audio-batch-size N (default 8).
--clap-text-batch-size N (default 8).
--clap-max-seconds S — truncate audio before encoding (default 30s).

Useful flags:

--subset emolia-emo|emolia-dim|both (default both).
--limit N — quick smoke test on first N rows.
--threshold 0.0 — predict positive if similarity ≥ threshold.
--no-audio-send — with --endpoint, send only stem + text in the JSON payload (server reads files itself).
Filter flags (default keeps every row in benchmark_labels.csv):
- --min-raters N — drop items with fewer than N human raters.
- --exclude-flagged — drop items annotators flagged as broken audio.
- --unanimous-only — keep only items where every rater agreed.

The three standard use modes

benchmark_labels.csv keeps every annotated item, including ones with only one rater and ones an annotator flagged. The columns n_raters, benchmark_bucket, and flagged describe each row so you can pick the subset that matches your evaluation goal:

1. All samples (default — including single-rater items)

The most permissive view. Every annotated item counts, and the label majority_present is the majority of whatever votes are available (a single rater's vote when only one human rated the item).

uv run benchmark.py
# or, to load the same unfiltered labels in pandas:
import pandas as pd
labels = pd.read_csv("analysis_outputs/emolia-dim/benchmark_labels.csv")

Use this when you want maximum data volume and don't mind that ~25% of emolia-dim labels come from a single rater.

2. Stronger-evidence cases (≥2 raters, no flagged audio)

The recommended evaluation subset for headline numbers in a paper. Every included item was rated by at least two humans, so each label rests on more than one judgment, and broken-audio items are excluded.

uv run benchmark.py --min-raters 2 --exclude-flagged
# or, in pandas:
import pandas as pd
labels = pd.read_csv("analysis_outputs/emolia-dim/benchmark_labels.csv")
labels = labels[(labels["n_raters"] >= 2) & (~labels["flagged"])]

This matches the "2 humans + Gemini preselection = 3-rater panel" that the human upper bound is computed on, so the model and the human ceiling are comparable.

3. Auditing the broken-audio items (flagged-only or flag-aware)

If you want to see how a model performs only on the items annotators flagged, or audit them yourself:

# Just the flagged items, in pandas:
import pandas as pd
labels = pd.read_csv("analysis_outputs/emolia-dim/benchmark_labels.csv")
flagged = labels[labels["flagged"]]
# Or skim the raw flag reasons:
flags = pd.read_csv("annotations/emolia-dim/flags.csv")

The 37 flagged items in emolia-dim are 0.2% of the corpus, and the headline metrics move by < 0.001 with or without them. They are still the right items to inspect when training a model that should robustly skip clips with no speech.

Bonus: unanimous-only (the cleanest possible labels)

Add --unanimous-only to either of the modes above to keep only items where every rater on the panel agreed. Smallest sample but highest label confidence.

uv run benchmark.py --min-raters 2 --exclude-flagged --unanimous-only
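
For readers filtering benchmark_labels.csv outside of benchmark.py, the three filter flags can be approximated row by row as sketched below. This is not the tool's own code: it assumes rows come from `csv.DictReader` (so values are strings), that the flagged column is serialized as "True" / "False", and that unanimous items are identified by the `unanimous_` bucket prefix. The bucket suffixes in the sample rows are made up.

```python
def keep_row(row, min_raters=0, exclude_flagged=False, unanimous_only=False):
    """Decide whether one benchmark_labels.csv row survives the three filters."""
    if int(row["n_raters"]) < min_raters:
        return False
    # Assumption: the flagged column is serialized as the strings "True" / "False".
    if exclude_flagged and str(row["flagged"]).strip().lower() == "true":
        return False
    if unanimous_only and not row["benchmark_bucket"].startswith("unanimous_"):
        return False
    return True


# Hypothetical rows, shaped like csv.DictReader output.
rows = [
    {"n_raters": "3", "flagged": "False", "benchmark_bucket": "unanimous_present"},
    {"n_raters": "1", "flagged": "False", "benchmark_bucket": "single_rater_absent"},
    {"n_raters": "2", "flagged": "True", "benchmark_bucket": "majority_present"},
    {"n_raters": "2", "flagged": "False", "benchmark_bucket": "majority_absent"},
]

strong = [r for r in rows if keep_row(r, min_raters=2, exclude_flagged=True)]
cleanest = [r for r in strong if keep_row(r, unanimous_only=True)]
print(len(rows), len(strong), len(cleanest))
```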

For each subset, benchmark.py writes to benchmark_outputs/<subset>/:

predictions.csv
metrics_by_<task_type|polarity>.csv
metrics_by_benchmark_bucket.csv
summary.json
report.md

The report starts and ends with a score rubric:

Band	Balanced accuracy	Notes
Bad	< 0.55	At or below random; model isn't learning
Weak	0.55 – 0.65	Some signal, far from human
Medium	0.65 – 0.75	Useful but lossy; decent training target
Good	0.75 – 0.85	Strong CLAP-style performance
Excellent	≥ 0.85	Approaches human inter-rater agreement

Headline metric: balanced accuracy on majority_present. The unanimous subset (benchmark_bucket starting with unanimous_) is the cleanest target to optimize against.
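
As a worked example of the headline metric, the sketch below thresholds similarities into predictions, computes balanced accuracy (the mean of recall on the positive class and recall on the negative class), and looks up the rubric band. The function names and sample numbers are invented, and treating each band's lower bound as inclusive is an assumption where the rubric is ambiguous.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of recall on the positive class and recall on the negative class."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    tn = sum((not t) and (not p) for t, p in zip(y_true, y_pred))
    n_pos = sum(bool(t) for t in y_true)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("balanced accuracy needs both classes in y_true")
    return 0.5 * (tp / n_pos + tn / n_neg)


def band(score):
    """Map a balanced-accuracy score to the rubric band (lower bounds inclusive)."""
    if score < 0.55:
        return "Bad"
    if score < 0.65:
        return "Weak"
    if score < 0.75:
        return "Medium"
    if score < 0.85:
        return "Good"
    return "Excellent"


# Made-up similarities and majority_present labels for six items.
similarities = [0.4, -0.2, 0.1, -0.5, 0.3, 0.0]
labels = [True, False, True, False, False, True]
threshold = 0.0

# Predict positive if similarity >= threshold, as with --threshold.
predictions = [s >= threshold for s in similarities]
score = balanced_accuracy(labels, predictions)
print(round(score, 4), band(score))
```

Here all three positives are recovered (recall 1.0) and two of three negatives are rejected (recall 2/3), giving a balanced accuracy of 5/6, which falls in the Good band.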

Endpoint contract

benchmark.py posts JSON to your endpoint:

{
  "text": "Speech audio in which the speaker expresses or conveys Sadness.",
  "audio_filename": "EN_B00025_S06526_W000000.mp3",
  "audio_base64": "<base64 mp3 bytes>"
}

The server should return one of {"similarity": …}, {"score": …}, or {"logit": …}. The local stub server sham_clap_server.py implements this contract for testing.

uv run sham_clap_server.py --port 8765
uv run benchmark.py --endpoint http://127.0.0.1:8765/v1/similarity
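
The request and response shapes above can be sketched with the standard library alone. The helper names are hypothetical, and two details are assumptions: the order in which the three response keys are checked, and that omitting audio_base64 is how a payload without audio would look (the exact fields sent under --no-audio-send are not spelled out here).

```python
import base64
import json


def build_payload(text, audio_filename, audio_bytes=None):
    """Build the JSON-serializable request body; audio is optional."""
    payload = {"text": text, "audio_filename": audio_filename}
    if audio_bytes is not None:
        payload["audio_base64"] = base64.b64encode(audio_bytes).decode("ascii")
    return payload


def extract_score(response_body):
    """Read the score from a response using any of the three accepted keys."""
    data = json.loads(response_body)
    # Assumption: if several keys are present, check them in this order.
    for key in ("similarity", "score", "logit"):
        if key in data:
            return float(data[key])
    raise KeyError("response has none of: similarity, score, logit")


payload = build_payload(
    "Speech audio in which the speaker expresses or conveys Sadness.",
    "EN_B00025_S06526_W000000.mp3",
    audio_bytes=b"fake mp3 bytes",  # placeholder, not real audio
)
body = json.dumps(payload)
print(sorted(json.loads(body)))
print(extract_score('{"score": 0.31}'))
```

A server implementing the contract only needs to decode audio_base64 (or locate the file by name), score it against text, and return one of the three keys.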

End-to-end refresh after new annotations

# Drop new raw CSVs into annotations_raw/<subset>/, then:
uv run anonymize.py
uv run analysis.py
uv run benchmark.py
