
analysis/ — Evaluation Experiment Analyzer (EEA)

This package implements the EEA pipeline (`coeval analyze`). It reads a completed or in-progress Experiment Storage Set (EES) produced by `coeval run` and generates reports: Excel workbooks, interactive HTML pages, and filtered JSONL/Parquet exports.

Package Contents

analysis/
├── loader.py           ← EES data loader: reads all 5 phase folders into EESDataModel
├── metrics.py          ← Computation engine: SPA/WPA/kappa agreement, teacher/student scores,
│                          robust filter (J*, T*, consistency threshold)
├── main.py             ← EEA dispatch: load once, route subcommand to report generator
│
├── reports/            ← One module per report type
│   ├── html_base.py    ← Shared HTML scaffold (Plotly, CSS, navigation)
│   ├── excel.py        ← complete-report (Excel workbook)
│   ├── coverage.py     ← coverage-summary (HTML)
│   ├── score_dist.py   ← score-distribution (HTML)
│   ├── teacher_report.py   ← teacher-report (HTML)
│   ├── judge_report.py     ← judge-report (HTML)
│   ├── student_report.py   ← student-report (HTML)
│   ├── interaction.py      ← interaction-matrix (HTML)
│   ├── consistency.py      ← judge-consistency (HTML)
│   ├── robust.py           ← robust-summary (HTML)
│   └── export_benchmark.py ← export-benchmark (JSONL / Parquet)
│
├── tests/              ← Unit tests (run with: python -m pytest analysis/tests/)
├── samples/            ← Sample EES fixtures for offline testing
└── docs/               ← Analysis user manual + COEVAL-SPEC-002

Available Reports

Subcommand	Output	What it shows
`complete-report`	Excel	All raw scores, attributes, model names — full audit trail
`coverage-summary`	HTML	Phase coverage per task/model, invalid record breakdown
`score-distribution`	HTML	Score histograms by rubric aspect, model, and attribute
`teacher-report`	HTML	Teacher differentiation scores — how varied each teacher's prompts are
`judge-report`	HTML	Judge agreement (SPA/WPA/kappa) and reliability across models
`student-report`	HTML	Student performance ranking across tasks and aspects
`interaction-matrix`	HTML	Heatmap: student scores cross-tabulated by teacher and student model
`judge-consistency`	HTML	Within-judge consistency: same prompt scored twice
`robust-summary`	HTML	Student ranking after robust filter (top-half judges + consistency threshold)
`export-benchmark`	JSONL/Parquet	High-quality datapoints that passed all robust filters
`all`	all above	All HTML reports + Excel + robust summary in one call
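
To make the kappa column of the judge report concrete, here is a minimal sketch of Cohen's kappa for two judges scoring the same items. It assumes that "kappa" refers to Cohen's kappa and that scores are categorical labels; the function name `cohen_kappa` and the label values are illustrative and are not taken from `metrics.py`.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical scores."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both judges gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each judge's marginal label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both judges used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical scores from two judges on six responses.
judge_1 = ["High", "High", "Low", "Medium", "Low", "High"]
judge_2 = ["High", "Medium", "Low", "Medium", "Low", "Low"]
kappa = cohen_kappa(judge_1, judge_2)
print(round(kappa, 2))  # 0.52
```

Here the judges agree on 4 of 6 items (0.667) while chance agreement is 11/36 (0.306), so kappa is 13/25 = 0.52.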

Key Classes

Class	Module	Role
`EESDataModel`	`loader.py`	In-memory representation of one EES (phases 1–5 data, validity flags)
`RobustFilterResult`	`metrics.py`	Output of the robust filter: J*, T*, consistency-passing datapoints
`run_analyze`	`main.py`	Top-level dispatch: load EES, call the right report function
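
The "load once, route subcommand to report generator" pattern described for `main.py` can be sketched as a lookup table of report functions. Everything below is a hypothetical stand-in: `load_ees`, `dispatch`, the `REPORTS` table, and the two report functions are illustrative names, and a plain dict takes the place of `EESDataModel`. The actual signature of `run_analyze` is not shown in this README.

```python
from typing import Callable, Dict


def load_ees(run_path: str) -> dict:
    # Placeholder loader; the real loader builds an EESDataModel.
    return {"run": run_path, "records": [1, 2, 3]}


# Hypothetical report generators; the real ones live in analysis/reports/.
def student_report(model: dict, out: str) -> str:
    return f"student-report -> {out} ({len(model['records'])} records)"


def coverage_summary(model: dict, out: str) -> str:
    return f"coverage-summary -> {out} ({len(model['records'])} records)"


# Subcommand name -> report function.
REPORTS: Dict[str, Callable[[dict, str], str]] = {
    "student-report": student_report,
    "coverage-summary": coverage_summary,
}


def dispatch(subcommand: str, run_path: str, out: str) -> str:
    """Validate the subcommand, load the EES once, then call one report."""
    if subcommand not in REPORTS:
        raise ValueError(f"unknown subcommand: {subcommand}")
    model = load_ees(run_path)
    return REPORTS[subcommand](model, out)


print(dispatch("student-report", "eval_runs/my-experiment", "report.html"))
```

Keeping the table in one place means a new report type only needs a new module and one new entry.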

Robust Filtering

The robust filter selects a subset of high-confidence datapoints for the benchmark export:

1. Judge selection (J*): rank judges by agreement metric (SPA / WPA / kappa) and keep either the top half or all of them
2. Teacher selection (T*): rank teachers by differentiation score and keep the best
3. Consistency filter: keep only datapoints where the fraction of judges that agree reaches the threshold theta

# --judge-selection: top_half or all
# --agreement-metric: spa, wpa, or kappa
# --agreement-threshold: fraction of judges that must agree
coeval analyze export-benchmark \
    --run eval_runs/my-experiment \
    --out my-benchmark.jsonl \
    --judge-selection top_half \
    --agreement-metric spa \
    --agreement-threshold 0.8
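
The three filtering steps can be sketched in plain Python. This is an illustration under stated assumptions, not the code in `metrics.py`: all function names and the datapoint layout are hypothetical, "top half" is assumed to round up, "keep best" is read as keeping the single top-ranked teacher, and a datapoint counts as consistent when the share of selected judges giving the most common score is at least theta.

```python
def select_judges(agreement, mode="top_half"):
    """Rank judges by agreement score; keep the top half (rounded up) or all."""
    ranked = sorted(agreement, key=agreement.get, reverse=True)
    if mode == "all":
        return ranked
    return ranked[: max(1, (len(ranked) + 1) // 2)]


def select_teachers(differentiation):
    """Keep the teacher with the highest differentiation score (assumption)."""
    return [max(differentiation, key=differentiation.get)]


def is_consistent(scores, judges, theta):
    """True if the most common score among selected judges has share >= theta."""
    votes = [scores[j] for j in judges if j in scores]
    if not votes:
        return False
    top = max(votes.count(v) for v in set(votes))
    return top / len(votes) >= theta


def robust_filter(datapoints, agreement, differentiation, theta=0.8, mode="top_half"):
    judges = select_judges(agreement, mode)
    teachers = select_teachers(differentiation)
    kept = [
        d for d in datapoints
        if d["teacher"] in teachers and is_consistent(d["scores"], judges, theta)
    ]
    return judges, teachers, kept


# Hypothetical inputs.
agreement = {"judge_a": 0.91, "judge_b": 0.74, "judge_c": 0.88, "judge_d": 0.60}
differentiation = {"t1": 0.42, "t2": 0.30}
datapoints = [
    {"id": "dp1", "teacher": "t1", "scores": {"judge_a": "High", "judge_c": "High", "judge_b": "Low"}},
    {"id": "dp2", "teacher": "t1", "scores": {"judge_a": "High", "judge_c": "Low"}},
    {"id": "dp3", "teacher": "t2", "scores": {"judge_a": "High", "judge_c": "High"}},
]

judges, teachers, kept = robust_filter(datapoints, agreement, differentiation)
print(judges, teachers, [d["id"] for d in kept])
```

With these inputs only `dp1` survives: `dp2` fails the consistency check because the two selected judges disagree, and `dp3` comes from a teacher that was not selected.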

Quick Commands

# Generate all reports at once
coeval analyze all \
    --run eval_runs/my-experiment \
    --out eval_runs/my-experiment-reports/

# Single reports
coeval analyze student-report  --run eval_runs/my-experiment --out report.html
coeval analyze complete-report --run eval_runs/my-experiment --out report.xlsx

# Export benchmark dataset
coeval analyze export-benchmark \
    --run eval_runs/my-experiment \
    --out my-benchmark.jsonl

# Analyze an in-progress experiment (skip completeness warning)
coeval analyze student-report \
    --run eval_runs/my-experiment \
    --out report.html \
    --partial-ok

# Run tests
python -m pytest analysis/tests/ -v
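
After an `export-benchmark` run, a quick sanity check on the JSONL output is to count records and list the top-level keys that appear. The sketch below assumes only that each non-empty line is one JSON object; it makes no assumption about the field names, which are defined by the exporter. The helper name `summarize_jsonl` is illustrative.

```python
import json
from collections import Counter


def summarize_jsonl(path):
    """Return (record count, {top-level key: occurrences}) for a JSONL file."""
    count = 0
    keys = Counter()
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            count += 1
            keys.update(record.keys())
    return count, dict(keys)


if __name__ == "__main__":
    import sys

    # Usage: python summarize.py my-benchmark.jsonl
    if len(sys.argv) > 1:
        n, key_counts = summarize_jsonl(sys.argv[1])
        print(f"{n} records")
        for key, seen in sorted(key_counts.items()):
            print(f"  {key}: present in {seen} records")
```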

Documentation

Document	Description
`docs/running_analysis.md`	User manual — all subcommands, filtering options, output formats, FAQ
`docs/spec_phase2_claude.md`	COEVAL-SPEC-002 — formal EEA specification
`samples/`	Sample EES folders for offline testing and report preview
`../../docs/developer_guide.md`	Full developer guide covering both EER and EEA
