This page covers the result-analysis scripts in analysis/ (plus scripts/read_subtask_status_from_hdf5.py) for summarizing, comparing, and auditing experiment results after evaluation runs. For the output directory structure, HDF5 layout, and episode result fields, see Data Storage and Output.
read_results.py is the primary script for reading and summarizing experiment results from episode_results.jsonl (or a legacy .json file). It supports multiple summarization modes, filtering, CSV export, and multi-folder aggregation.
```
python analysis/read_results.py <folder> [<folder> ...]
```

`<folder>` can be:

- A folder name relative to the default output directory (e.g., `2025-09-02_13-15-34`)
- An absolute path (e.g., `/data/experiments/my_run`)
- A glob pattern (e.g., `pi0_*`), which prompts for confirmation before proceeding
Multiple folders can be passed to aggregate results across runs.
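Each run folder's episode_results.jsonl holds one JSON record per episode. As a rough mental model of the per-task aggregation (not the script's actual code; the `success` field name and file location are assumptions, see Data Storage and Output for the real schema):

```python
import json
from collections import defaultdict
from pathlib import Path

# Rough sketch of the per-task aggregation read_results.py performs.
# The "success" field name and the file location are assumptions about
# the episode record schema (see Data Storage and Output).
stats = defaultdict(lambda: {"success": 0, "total": 0})
with Path("output/2025-09-02_13-15-34/episode_results.jsonl").open() as f:
    for line in f:
        episode = json.loads(line)
        s = stats[episode["task_name"]]        # group episodes by task
        s["total"] += 1
        s["success"] += bool(episode.get("success"))

for task, s in sorted(stats.items()):
    print(f"{task:<30} {s['success']}/{s['total']}  {100 * s['success'] / s['total']:.1f}%")
```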
By default, the script prints a per-task summary table with success rate, score, and trajectory metrics. Additional modes provide different views of the same data:
| Flag | Description |
|---|---|
| (default) | Per-task table with success/failure counts, percentages, scores, and trajectory metrics |
| `--by-attributes` | Groups tasks by benchmark categories (visual, relational, procedural) with attribute breakdowns |
| `--by-difficulty` | Summarizes results grouped by difficulty label (simple, moderate, complex) |
| `--by-scene` | Aggregates results by scene instead of by task |
| `--by-wrong-objects` | Per-task breakdown of wrong object grasps: success count, fail count, and which objects were grabbed |
| `--by-instruction-type` | Pivot table comparing success rates across instruction types (default, vague, specific, etc.) |
| `--show-episodes` | Appends a detailed per-episode table after the summary |
| Flag | Description |
|---|---|
| `--task TASK [TASK ...]` | Show only the specified task name(s) |
| `--filter-pattern PATTERN` | Glob-style pattern to filter results (e.g., `pick_*`, `*cube*`; see the sketch after this table) |
| `--filter-field FIELD` | Field to apply the filter on. Default: `env_name`. Other options: `task_name`, `scene`, `attributes` |
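`--filter-pattern` uses glob-style matching, which Python's fnmatch module expresses directly. A minimal sketch of the equivalent check (not the script's actual implementation):

```python
from fnmatch import fnmatch

# Glob-style episode filtering equivalent to --filter-pattern/--filter-field.
# Sketch only; the script's real implementation may differ.
def keep(episode: dict, pattern: str, field: str = "env_name") -> bool:
    return fnmatch(str(episode.get(field, "")), pattern)

episodes = [{"env_name": "pick_red_cube"}, {"env_name": "open_drawer"}]
print([e for e in episodes if keep(e, "*cube*")])   # [{'env_name': 'pick_red_cube'}]
```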
| Flag | Description |
|---|---|
| `--csv` | Print results in CSV format (tab-separated) for copy-pasting into spreadsheets |
| `--csv-compact` | CSV with stddev in the same column as the value, e.g., `-9.14 (± 4.72)` (implies `--csv`) |
| `--output-csv FILE` | Write CSV output to a file instead of stdout. If the path is relative, it is placed inside the first data folder (implies `--csv`) |
| Flag | Description |
|---|---|
| `--verbose` | Show stddev columns, wrong object details, and episode IDs |
| `--no-metrics` | Hide trajectory metrics columns (EE SPARC, Path Length, Speed) |
| `--timing` | Show wall-clock timing columns: average iteration speed (it/s) and wall time per episode in minutes (Walltime(m)) |
| `--exclude-containers` | Exclude container objects (bin, crate, box, etc.) from wrong-object-grabbed counts |
```
# Basic summary for a single run
python analysis/read_results.py 2025-09-02_13-15-34

# Verbose summary with all details
python analysis/read_results.py 2025-09-02_13-15-34 --verbose

# Aggregate results across multiple runs
python analysis/read_results.py pi0_run1 pi0_run2 pi0_run3

# Aggregate with glob pattern
python analysis/read_results.py "pi0_*"

# Filter to specific tasks
python analysis/read_results.py 2025-09-02_13-15-34 --task RubiksCubeTask BananaInBowlTask

# Filter by env_name pattern
python analysis/read_results.py 2025-09-02_13-15-34 --filter-pattern "*cube*"

# Group results by benchmark category
python analysis/read_results.py 2025-09-02_13-15-34 --by-attributes

# Compare instruction types
python analysis/read_results.py 2025-09-02_13-15-34 --by-instruction-type

# Export to CSV file
python analysis/read_results.py 2025-09-02_13-15-34 --output-csv summary.csv

# Compact CSV for spreadsheets (stddev in same column)
python analysis/read_results.py 2025-09-02_13-15-34 --csv-compact

# Summary without trajectory metrics
python analysis/read_results.py 2025-09-02_13-15-34 --no-metrics

# Wrong object analysis, excluding containers
python analysis/read_results.py 2025-09-02_13-15-34 --by-wrong-objects --exclude-containers
```

The default output includes trajectory metrics columns (EE SPARC, Path Length, Speed):

```
---------------------------------------------- EXPERIMENT SUMMARY ----------------------------------------------
Task Name Success % Score(total) Score(fail) Time(s) EE SPARC PathLen(m) Speed(cm/s)
----------------------------------------------------------------------------------------------------------------
TOTAL (2 tasks) 6/20 30.0% 0.400 0.143 65.59 -12.86 7.33 2.9
----------------------------------------------------------------------------------------------------------------
AnimalsInBinTask 0/10 0.0% 0.000 0.000 - -7.49 2.02 2.2
AppleAndYogurtInBowlTask 6/10 60.0% 0.800 0.500 65.59 -18.23 12.63 3.5
----------------------------------------------------------------------------------------------------------------
```

Score columns:

- Score(total): mean per-episode score across all episodes (successes contribute 1.0; failures contribute their fractional subtask progress in [0, 1)).
- Score(fail): mean per-episode score over failed episodes only — "how close did the failures get."

Score(total) = success_rate + (1 − success_rate) · Score(fail).
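The identity can be verified against the TOTAL row of the sample output above:

```python
# TOTAL row: 6/20 successes, Score(fail) = 0.143
success_rate = 6 / 20
score_total = success_rate + (1 - success_rate) * 0.143
print(f"{score_total:.3f}")   # 0.400, matching the Score(total) column
```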
EE SPARC is the spectral arc length (smoothness) metric; more negative = less smooth. Stationary trajectories return NaN and are excluded from the average. Use `--no-metrics` to hide the trajectory metrics columns.
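For intuition about the metric, a minimal spectral arc length implementation over a 1-D speed profile might look like the sketch below. The 10 Hz cutoff and padding level are illustrative defaults, not necessarily what this pipeline uses:

```python
import numpy as np

def spectral_arc_length(speed, dt, f_cutoff=10.0, pad_exp=4):
    """Sketch of the SPARC smoothness metric for a 1-D speed profile.
    Cutoff and padding are illustrative defaults, not the pipeline's settings."""
    n_fft = 2 ** (int(np.ceil(np.log2(len(speed)))) + pad_exp)  # zero-pad for resolution
    freq = np.fft.rfftfreq(n_fft, d=dt)
    mag = np.abs(np.fft.rfft(speed, n_fft))
    if mag.max() == 0.0:                       # stationary trajectory: undefined
        return float("nan")
    mag = mag / mag.max()                      # normalize magnitude spectrum
    keep = freq <= f_cutoff                    # restrict to the low-frequency band
    f_norm = freq[keep] / f_cutoff             # map frequency axis to [0, 1]
    # Arc length of the normalized spectrum; more negative = less smooth.
    return -np.sum(np.sqrt(np.diff(f_norm) ** 2 + np.diff(mag[keep]) ** 2))
```

A flat (stationary) speed profile yields an all-zero spectrum, hence the NaN noted above.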
check_results.py validates that episode results are consistent with run_*.hdf5 files: it checks that every episode entry has a matching demo in the HDF5 and reports missing or corrupt data (a conceptual sketch follows the examples below).
Usage:

```
python analysis/check_results.py <folder> [<folder> ...] [--verbose] [--diagnose]
```

Arguments:
| Flag | Description | Default |
|---|---|---|
| `folder` (positional) | Folder(s) or absolute path(s) containing results | (required) |
| `--verbose` | Print status for every episode, not only errors | False |
| `--diagnose` | Extra HDF5 diagnostics (available demos, numbering gaps, etc.) | False |
Example:

```
# Quick sanity check
python analysis/check_results.py 2025-09-02_13-15-34

# Full diagnostics
python analysis/check_results.py 2025-09-02_13-15-34 --verbose --diagnose
```
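Conceptually, the check pairs each JSONL record with a demo group in the run's HDF5 files. A rough sketch, with the file locations, the episode_index field, and the data/demo_{i} layout all assumed rather than confirmed (see Data Storage and Output for the actual schema):

```python
import json
from pathlib import Path

import h5py

# Rough sketch of the consistency check: every episode record should have a
# matching demo group in its run_*.hdf5 file. File locations, the
# episode_index field, and the data/demo_{i} layout are assumptions here.
folder = Path("output/2025-09-02_13-15-34")
episodes = [json.loads(line) for line in (folder / "episode_results.jsonl").open()]

with h5py.File(folder / "RubiksCubeTask" / "run_0.hdf5", "r") as f:
    demos = set(f["data"].keys())             # e.g. {"demo_0", "demo_1", ...}

for ep in episodes:
    demo = f"demo_{ep['episode_index']}"      # assumed field linking JSONL to HDF5
    if demo not in demos:
        print(f"Missing demo for episode: {demo}")
```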
compile_results.py compiles and merges experiment results. It supports two modes.

Compile mode reads episode_results.jsonl (or legacy .json) from one or more folders and writes a single output file:

```
python analysis/compile_results.py "pi05_batch*" -o results.jsonl
python analysis/compile_results.py "pi05_batch*" -o results.json # JSON array format
python analysis/compile_results.py "pi05_batch*" -o results         # defaults to .jsonl
```
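Compile mode amounts to validating and concatenating the per-folder JSONL files. A simplified sketch (glob expansion, confirmation prompts, and legacy .json handling omitted):

```python
import json
from pathlib import Path

# Simplified sketch of compile mode: concatenate episode_results.jsonl
# records from several run folders into one output file.
sources = [Path("pi05_batch1"), Path("pi05_batch2")]
with open("results.jsonl", "w") as out:
    for src in sources:
        with (src / "episode_results.jsonl").open() as f:
            for line in f:
                record = json.loads(line)            # validate each record
                out.write(json.dumps(record) + "\n")
```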
Merge mode moves task subdirectories and merges results into a single output folder. It aborts if any task folder appears in multiple sources (a conflict). Source folders are removed after the merge by default:

```
python analysis/compile_results.py "pi05_batch*" --merge output_folder
python analysis/compile_results.py "pi05_batch*" --merge output_folder --keep   # preserve sources
```

Arguments:
| Flag | Description | Default |
|---|---|---|
| `folders` (positional) | Folders to compile/merge (glob patterns supported) | (required) |
| `-o` / `--output` | Output file path (compile mode). Extension determines format. | — |
| `--merge` | Output folder path (merge mode). Moves task folders and merges results. | — |
| `--keep` | Keep source folders after merge | False (remove) |
| `-y` / `--yes` | Skip confirmation when globs expand to many folders | False |
| `--task FILTER` | Filter episodes (e.g., wrong object) | None |
Examples:

```
# Compile batch results into one file
python analysis/compile_results.py run_1 run_2 run_3 -o combined.jsonl

# Merge batch folders into one folder
python analysis/compile_results.py "pi05_batch*" --merge pi05_merged
```

extract_initial_poses.py extracts initial camera and object poses from HDF5 files and writes episode_initial_poses.json. It is useful for analyzing pose distributions or debugging scene initialization (a hedged sketch follows the examples below).
Usage:

```
python analysis/extract_initial_poses.py <folder> [<folder> ...]
```

Arguments:
| Flag | Description | Default |
|---|---|---|
| `folder` (positional) | Folder(s) or absolute path(s) containing results | (required) |
| `--overwrite` | Recompute even if episode_initial_poses.json exists | False |
| `--csv` | CSV-style output | False |
| `--summary` | Summary table (counts) instead of per-episode detail | False |
| `--all` | Include all pose columns (all cameras/objects) | False |
| `--compact` | Compact poses (xyz only, no orientation) | False |
| `--output-file FILE` | Write CSV to this path instead of stdout | None |
Example:

```
# Extract poses and print summary
python analysis/extract_initial_poses.py 2025-09-02_13-15-34 --summary

# Export all poses as CSV
python analysis/extract_initial_poses.py 2025-09-02_13-15-34 --csv --all --output-file poses.csv
```
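As a sketch of what the extraction involves: read each demo's pose observations at the first timestep. The dataset names below (data/demo_0/obs, the *_pose suffix) are assumptions about the HDF5 layout, which is documented in Data Storage and Output:

```python
import h5py

# Sketch: dump the first-timestep value of every pose-like observation in
# demo_0. The group path and the *_pose naming convention are assumptions,
# not the documented layout.
with h5py.File("output/2025-09-02_13-15-34/RubiksCubeTask/run_0.hdf5", "r") as f:
    obs = f["data/demo_0/obs"]
    for name, dataset in obs.items():
        if name.endswith("_pose"):          # assumed naming convention
            print(name, dataset[0])         # pose at t = 0
```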
read_subtask_status_from_hdf5.py reads and displays subtask completion status directly from an HDF5 data file. It extracts timing, status codes, completion flags, and scores for each subtask step during episode execution (a reading sketch follows the arguments table).

Usage:
```
python scripts/read_subtask_status_from_hdf5.py <hdf5_file> [-e EPISODE]
```

Arguments:
| Flag | Description | Default |
|---|---|---|
| `file` (positional) | Path to the HDF5 data file | (required) |
| `-e` / `--episode` | Episode index (e.g., 0 for demo_0). If omitted, shows all episodes | None |
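A hedged sketch of reading such records directly with h5py; the subtask_status dataset name and its per-step record layout are assumptions, and the script itself knows the real schema:

```python
import h5py

# Sketch: print per-step subtask records for one episode. The
# data/demo_0/subtask_status path is an assumed dataset name.
with h5py.File("output/2025-09-02_13-15-34/RubiksCubeTask/run_0.hdf5", "r") as f:
    for row in f["data/demo_0/subtask_status"][:]:
        print(row)   # e.g., timing, status code, completion flag, score
```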
Example:
# Display all episodes
python scripts/read_subtask_status_from_hdf5.py output/2025-09-02_13-15-34/RubiksCubeTask/run_0.hdf5
# Display specific episode
python scripts/read_subtask_status_from_hdf5.py output/2025-09-02_13-15-34/RubiksCubeTask/run_0.hdf5 -e 0- Data Storage and Output — Output directory structure, HDF5 layout, and episode result fields