EEA Unit Tests

Unit tests for the CoEval Experiment Evaluation Analyzer (EEA).

Running the Tests

python -m pytest Tests/analyzer/

Add -v for verbose output or -x to stop on the first failure.

Tests for loading Experiment Storage Set (EES) data from disk:

EES directory loading: verifying correct file discovery and parsing
Status detection: identifying complete, partial, and failed runs
Model classification: correctly labeling models as teacher, student, or judge based on config and file layout

Tests for evaluation metrics and scoring:

SPA (Simple Percent Agreement): pairwise judge agreement calculation
WPA (Weighted Percent Agreement): distance-weighted variant of SPA
Cohen's kappa: inter-rater reliability with expected-agreement correction
Teacher scoring formulas: aggregation of teacher-assigned scores across items
Robust filter: handling of empty responses, malformed JSON, out-of-range scores, and other edge cases
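
As a rough illustration of what the metric and filter tests exercise, the sketch below filters raw judge responses and then computes the three agreement measures for one pair of judges. It is not the EEA implementation. The function names, the {"score": n} response shape, the 1 to 5 score range, and the linear distance weight used for WPA are all assumptions made for this example. Only the Cohen's kappa formula, (p_o - p_e) / (1 - p_e), is the standard definition.

```python
import json
from collections import Counter


def parse_score(raw, lo=1, hi=5):
    """Hypothetical robust filter: return an int score in [lo, hi], else None."""
    if not raw or not raw.strip():
        return None  # empty response
    try:
        value = json.loads(raw)["score"]  # assumed response shape
    except (json.JSONDecodeError, KeyError, TypeError):
        return None  # malformed JSON or missing field
    if isinstance(value, bool) or not isinstance(value, int):
        return None  # wrong type
    return value if lo <= value <= hi else None  # out-of-range check


def spa(a, b):
    """Simple percent agreement: fraction of items with identical scores."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def wpa(a, b, lo=1, hi=5):
    """Weighted agreement, assuming weight = 1 - |x - y| / (hi - lo)."""
    span = hi - lo
    return sum(1 - abs(x - y) / span for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_o = spa(a, b)
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        return float("nan")  # undefined when both judges use one identical label
    return (p_o - p_e) / (1 - p_e)


# Made-up responses from two judges; the last three pairs should be dropped.
raw_pairs = [
    ('{"score": 1}', '{"score": 1}'),
    ('{"score": 2}', '{"score": 2}'),
    ('{"score": 3}', '{"score": 3}'),
    ('{"score": 3}', '{"score": 5}'),
    ('', '{"score": 4}'),              # empty response
    ('{"score": 9}', '{"score": 2}'),  # out-of-range score
    ('{score: 2', '{"score": 2}'),     # malformed JSON
]
parsed = [(parse_score(x), parse_score(y)) for x, y in raw_pairs]
kept = [(x, y) for x, y in parsed if x is not None and y is not None]
judge_a = [x for x, _ in kept]
judge_b = [y for _, y in kept]

print(spa(judge_a, judge_b))                     # 0.75
print(wpa(judge_a, judge_b))                     # 0.875
print(round(cohens_kappa(judge_a, judge_b), 3))  # 0.667
```

Dropping an item when either judge's response is invalid keeps the two score lists aligned, which all three pairwise measures require. The actual suite may handle partial items differently.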

Smoke tests for report generation.