This guide covers the usage of the visualization scripts for HOSER evaluation results.
Three visualization scripts are available at the project root, all designed to work with evaluation directories created by setup_evaluation.py:
- visualize_trajectories.py - Interactive trajectory visualizations with scenario-aware sampling
- create_distribution_plots.py - Statistical distribution comparisons
- create_analysis_figures.py - Publication-quality analysis figures including scenario breakdowns
All scripts automatically:
- Detect the dataset from evaluation.yaml in the eval directory
- Create output directories within the eval directory (figures/)
- Support both Beijing and Porto datasets
- Load scenario analysis results if available
- Visualize real vs. generated trajectories
- Multiple sampling strategies (random, length-based, representative, scenario)
- Scenario-aware sampling (uses scenario analysis results if available)
- Separate and overlaid comparison modes
- Cross-model comparisons for same OD pairs
- NEW: Scenario-based cross-model comparisons - Compare all models across different scenarios
- Road network overlay
# Visualize trajectories from an evaluation directory
uv run python visualize_trajectories.py --eval-dir hoser-distill-optuna-6
# Use scenario-based sampling (if scenario analysis was run)
uv run python visualize_trajectories.py --eval-dir hoser-distill-optuna-6 --sample_strategy scenario
# Generate cross-model comparisons
uv run python visualize_trajectories.py --eval-dir hoser-distill-optuna-6 --cross_model
# Only overlaid plots (no separate)
uv run python visualize_trajectories.py --eval-dir hoser-distill-optuna-6 --no_separate

- length_based (default): Samples short, medium, and long trajectories
- random: Random sampling from all trajectories
- representative: Median-length trajectory (most representative)
- scenario: Samples from top N scenarios (requires scenario analysis results)
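As an illustration of what length-based sampling could look like, here is a minimal sketch. It assumes trajectories are plain sequences of road IDs and that "short", "medium", and "long" are the three equal-sized thirds of the length-sorted list. The function name `sample_by_length` and the bucketing rule are assumptions for illustration, not the script's actual implementation.

```python
import random
from typing import Dict, List, Sequence


def sample_by_length(
    trajectories: Sequence[Sequence[int]],
    samples_per_type: int = 1,
    seed: int = 42,
) -> Dict[str, List[Sequence[int]]]:
    """Pick short, medium, and long trajectories from length-sorted thirds."""
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    ordered = sorted(trajectories, key=len)
    n = len(ordered)
    # Assumed rule: split the sorted list into three roughly equal buckets.
    buckets = {
        "short": ordered[: n // 3],
        "medium": ordered[n // 3 : 2 * n // 3],
        "long": ordered[2 * n // 3 :],
    }
    return {
        name: rng.sample(items, min(samples_per_type, len(items)))
        for name, items in buckets.items()
    }


# Toy trajectories whose lengths range from 3 to 120 road segments.
trajs = [list(range(k)) for k in (3, 5, 8, 12, 20, 35, 50, 80, 120)]
picked = sample_by_length(trajs, samples_per_type=1, seed=42)
for name, items in picked.items():
    print(name, [len(t) for t in items])
```

The defaults mirror the documented `--samples_per_type` (1) and `--random_seed` (42) values.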
--eval-dir EVAL_DIR # Required: evaluation directory
--dataset DATASET # Optional: override auto-detected dataset
--sample_strategy STRATEGY # Sampling strategy (default: length_based)
--samples_per_type N # Number of samples per type (default: 1)
--max_scenarios N # Max scenarios to plot (default: 5, for scenario strategy)
--random_seed SEED # Random seed (default: 42)
--no_separate # Skip separate plots
--no_overlaid # Skip overlaid plots
--cross_model # Generate cross-model comparisons
--no_real # Exclude real trajectories from cross-model
--dpi DPI # Output resolution (default: 300)

- Saved to: {eval_dir}/figures/trajectories/
- Formats: PNG images
- Distance distribution comparisons
- Duration distribution comparisons
- Statistical overlays (KDE, histograms)
- Real vs. generated comparisons across all models
# Generate distribution plots
uv run python create_distribution_plots.py --eval-dir hoser-distill-optuna-6
# Specify dataset explicitly
uv run python create_distribution_plots.py --eval-dir hoser-distill-optuna-6 --dataset Beijing
# Enable verbose logging
uv run python create_distribution_plots.py --eval-dir hoser-distill-optuna-6 --verbose

--eval-dir EVAL_DIR # Required: evaluation directory
--dataset DATASET # Optional: override auto-detected dataset
--verbose # Enable verbose logging

- Saved to: {eval_dir}/figures/distributions/
- Formats: PNG images
- Plots: Distance distributions, duration distributions, KDE overlays
The scenario cross-model mode filters trajectories by scenario, then uses the same OD-matching logic as regular cross-model comparisons. This ensures fair comparisons where all models are evaluated on the same routes within each scenario context (peak/off-peak, weekday/weekend, city center, etc.).
- OD-matched comparisons within scenarios: Only compares trajectories with matching origin-destination pairs
- Multiple plots per scenario: Generates one plot for each common OD pair found within the scenario
- Fair route comparisons: All models navigate the same route in the same scenario conditions
- No arbitrary filtering: Processes all scenarios regardless of trajectory count
- Reuses proven logic: Same plot style and OD-matching as regular cross-model mode
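The two-step logic described above (filter by scenario, then keep only OD pairs shared by every model) can be sketched as follows. The record layout `(trajectory_id, origin, destination, scenario_tags)` and the function name `common_od_pairs` are hypothetical; the real script may store and match trajectories differently.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

ODPair = Tuple[int, int]
# Hypothetical record: (trajectory_id, origin_road_id, destination_road_id, scenario_tags)
Record = Tuple[str, int, int, Set[str]]


def common_od_pairs(
    models: Dict[str, List[Record]], scenario: str
) -> Dict[ODPair, Dict[str, List[str]]]:
    """Return the OD pairs that every model covers within one scenario."""
    per_model: Dict[str, Dict[ODPair, List[str]]] = {}
    for model, records in models.items():
        by_od = defaultdict(list)
        for traj_id, origin, dest, tags in records:
            if scenario in tags:  # step 1: keep only this scenario's trajectories
                by_od[(origin, dest)].append(traj_id)
        per_model[model] = by_od
    if not per_model:
        return {}
    # Step 2: keep only OD pairs present for all models.
    shared = set.intersection(*(set(d) for d in per_model.values()))
    return {od: {m: per_model[m][od] for m in per_model} for od in sorted(shared)}


models = {
    "vanilla": [("v1", 997, 26798, {"off_peak", "weekday"}), ("v2", 1154, 36175, {"suburban"})],
    "distilled": [("d1", 997, 26798, {"off_peak"}), ("d2", 17308, 16404, {"off_peak"})],
    "real": [("r1", 997, 26798, {"off_peak", "weekday"}), ("r2", 17308, 16404, {"off_peak"})],
}
matches = common_od_pairs(models, "off_peak")
print(matches)
```

Each key in the result corresponds to one comparison plot for that scenario; a scenario with no shared OD pair yields no plots.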
# Generate scenario-based cross-model comparisons (OD-matched within scenarios)
uv run python visualize_trajectories.py \
--eval-dir hoser-distill-optuna-6 \
--scenario_cross_model
# Combine with other modes (skip overlaid/separate, only do scenario cross-model)
uv run python visualize_trajectories.py \
--eval-dir hoser-distill-optuna-6 \
--scenario_cross_model \
--no_separate --no_overlaid

--scenario_cross_model # Enable scenario-based cross-model mode with OD matching

Per-scenario OD comparison plots organized by scenario:
- scenario_cross_model/train/off_peak/train_od_comparison_1_origin997_dest26798.{pdf,png}
- scenario_cross_model/train/off_peak/train_od_comparison_2_origin17308_dest16404.{pdf,png}
- scenario_cross_model/train/weekday/train_od_comparison_1_origin997_dest26798.{pdf,png}
- scenario_cross_model/train/suburban/train_od_comparison_1_origin1154_dest36175.{pdf,png}
- ... (multiple plots per scenario, one for each matching OD pair)
Directory structure:
scenario_cross_model/
├── train/
│   ├── off_peak/    # 3 OD comparisons
│   ├── weekday/     # 1 OD comparison
│   ├── suburban/    # 3 OD comparisons
│   └── ...
└── test/
    ├── off_peak/    # 6 OD comparisons
    ├── weekday/     # 6 OD comparisons
    └── ...
Each plot shows all models (vanilla, distilled, distilled_seed44, real) for the same OD pair within the specific scenario context.
- Requires scenario analysis to be run first (using tools/analyze_scenarios.py)
- Or run python_pipeline.py --run-scenarios to generate scenarios automatically
- The scenario analysis must generate the trajectory_scenarios.json mapping file
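Before running the scenario cross-model mode, it can help to confirm that the mapping files exist. The sketch below assumes the `{eval_dir}/scenarios/{od_source}/{model}/` layout described elsewhere in this guide. The helper name `find_scenario_mappings` is hypothetical and not part of the project.

```python
from pathlib import Path
from typing import List


def find_scenario_mappings(eval_dir: str) -> List[Path]:
    """List trajectory_scenarios.json files under {eval_dir}/scenarios/."""
    root = Path(eval_dir) / "scenarios"
    if not root.is_dir():
        return []
    # Assumed layout: scenarios/{od_source}/{model}/trajectory_scenarios.json
    return sorted(root.glob("*/*/trajectory_scenarios.json"))


found = find_scenario_mappings("hoser-distill-optuna-6")
if found:
    for path in found:
        print("found mapping:", path)
else:
    print("no trajectory_scenarios.json found; run scenario analysis first")
```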
Titles include scenario context to make it clear what conditions are being compared:
"Off Peak Scenario - TRAIN OD: All Models - Origin 997 → Destination 26798"
"Weekday Scenario - TEST OD: All Models - Origin 832 → Destination 17361"
"City Center Scenario - TRAIN OD: All Models - Origin 17496 → Destination 33590"
New Feature: For OD pairs that appear in 2+ scenarios, a concatenated multi-panel plot is automatically generated showing the same route under different conditions side-by-side.
Example output:
scenario_cross_model/train/multi_scenario/train_origin997_dest26798_multi_scenario.{pdf,png}
- Shows Origin 997 → Dest 26798 across 3 scenarios: Off Peak | Suburban | Weekday
- Each panel shows all models (vanilla, distilled, distilled_seed44, real) for that scenario
- Perfect for analyzing subtle differences in the same route under varying conditions
Benefits:
- Side-by-side comparison of same OD pair across scenarios
- Identify scenario-specific model behavior (e.g., does the model perform worse during peak hours?)
- Publication-ready multi-panel figures showing condition variations
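Selecting the OD pairs that qualify for a multi-panel plot amounts to grouping plotted pairs by scenario and keeping those seen in at least two. A minimal sketch, assuming each plotted comparison is available as a `(scenario, (origin, destination))` tuple. The function name and input shape are illustrative only.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

ODPair = Tuple[int, int]


def multi_scenario_od_pairs(
    plotted: Iterable[Tuple[str, ODPair]], min_scenarios: int = 2
) -> Dict[ODPair, List[str]]:
    """Map each OD pair to its scenarios, keeping pairs seen in 2+ scenarios."""
    scenarios_by_od = defaultdict(set)
    for scenario, od in plotted:
        scenarios_by_od[od].add(scenario)
    return {
        od: sorted(names)
        for od, names in scenarios_by_od.items()
        if len(names) >= min_scenarios
    }


plotted = [
    ("off_peak", (997, 26798)),
    ("weekday", (997, 26798)),
    ("suburban", (997, 26798)),
    ("suburban", (1154, 36175)),  # appears in one scenario only, so it is dropped
]
panels = multi_scenario_od_pairs(plotted)
print(panels)
```

Each surviving entry would correspond to one concatenated figure with one panel per listed scenario.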
Output structure:
scenario_cross_model/
├── train/
│   ├── multi_scenario/    # NEW: Multi-scenario concatenated plots
│   │   ├── train_origin997_dest26798_multi_scenario.{pdf,png}    # 3 scenarios
│   │   └── train_origin17308_dest16404_multi_scenario.{pdf,png}  # 2 scenarios
│   ├── off_peak/          # Individual scenario plots
│   └── ...
└── test/
    └── multi_scenario/    # NEW: Multi-scenario concatenated plots
        ├── test_origin31410_dest374_multi_scenario.{pdf,png}     # 3 scenarios
        └── ...
NEW: YAML-driven plotting system with configurable metric filtering and individual plot selection.
- 13 publication-quality analysis plots organized by type
- YAML-based configuration for extensibility
- Individual plot selection for fast iteration
- Dataset-specific metric filtering
- Plot groups for convenience
- Self-documenting with --list-plots
# List all available plots and groups
uv run python create_scenario_plots.py --list-plots
# Generate all plots (default)
uv run python create_scenario_plots.py --eval-dir hoser-distill-optuna-6
# Generate only heatmap visualizations (fast)
uv run python create_scenario_plots.py --eval-dir hoser-distill-optuna-6 --plots heatmaps_only
# Generate a single plot (development/iteration)
uv run python create_scenario_plots.py --eval-dir hoser-distill-optuna-6 \
--plots application.improvement_heatmaps
# Generate multiple specific plots
uv run python create_scenario_plots.py --eval-dir hoser-distill-optuna-6 \
--plots metrics.scenario_heatmap,application.radar_charts

Metrics Plots (metrics.*)
- metrics.scenario_heatmap - Scenario × Metric performance heatmap
- metrics.model_comparison - Cross-model bar chart comparison
- metrics.aggregate_comparison - Aggregate metrics across models
- metrics.metric_distributions - Distribution plots for key metrics

Temporal/Spatial Plots (temporal_spatial.*)
- temporal_spatial.temporal_patterns - Time-based performance patterns
- temporal_spatial.spatial_patterns - Spatial coverage and performance

Robustness Plots (robustness.*)
- robustness.scenario_robustness - Performance stability across scenarios
- robustness.metric_sensitivity - Metric sensitivity analysis

Analysis Plots (analysis.*)
- analysis.difficulty_ranking - Scenario difficulty ranking
- analysis.metric_sensitivity_by_scenario - Most sensitive metrics per scenario
- analysis.duration_ceiling - Duration prediction ceiling effect

Application Plots (application.*)
- application.improvement_heatmaps - Distance heatmaps showing raw metric values (lower=better, individual + grid)
- application.radar_charts - Application use case radar charts
Pre-defined groups for convenience:
- all - All plots (default)
- core - Quick overview (scenario_heatmap + distance heatmaps)
- heatmaps_only - Only heatmap visualizations
- full_analysis - Complete analysis suite
- application - Application-focused plots (distance heatmaps + radars)
- metrics, analysis, robustness, temporal - Type-specific groups
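To make the selection rules concrete, the sketch below shows one way a `--plots` value (plot IDs, group names, or a comma-separated mix) might be expanded into a list of plot IDs. The `PLOT_IDS` subset, the `GROUPS` table, and `resolve_plots` are assumptions for illustration. The real catalogue lives in the YAML configuration and may resolve names differently.

```python
from typing import Dict, List

# Subset of the plot IDs listed above; kept short for the example.
PLOT_IDS = [
    "metrics.scenario_heatmap",
    "metrics.model_comparison",
    "analysis.difficulty_ranking",
    "application.improvement_heatmaps",
    "application.radar_charts",
]
# Group contents follow the descriptions above; exact membership is an assumption.
GROUPS: Dict[str, List[str]] = {
    "all": PLOT_IDS,
    "core": ["metrics.scenario_heatmap", "application.improvement_heatmaps"],
    "application": ["application.improvement_heatmaps", "application.radar_charts"],
}


def resolve_plots(selection: str = "all") -> List[str]:
    """Expand a comma-separated mix of plot IDs and group names, without duplicates."""
    resolved: List[str] = []
    for token in (t.strip() for t in selection.split(",")):
        if not token:
            continue
        if token in GROUPS:
            candidates = GROUPS[token]
        elif token in PLOT_IDS:
            candidates = [token]
        else:
            raise ValueError(f"Unknown plot or group: {token!r}")
        for plot_id in candidates:
            if plot_id not in resolved:  # preserve first-seen order
                resolved.append(plot_id)
    return resolved


print(resolve_plots("core"))
print(resolve_plots("core,application.radar_charts"))
```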
Metric Filtering (eval_dir/config/scenarios_*.yaml):
plotting:
  metrics:
    exclude_patterns:
      - "*_real_*"  # Exclude metrics mirroring input data
    exclude:
      - "total_generated_od_pairs"
      - "matched_od_pairs"

Plot-Specific Settings (config/scenario_plots.yaml):
plot_types:
  application:
    plots:
      improvement_heatmaps:
        config:
          colormap: "RdYlGn_r"  # Reversed Red-Yellow-Green (green=good, red=bad)
          percentile_range: [5, 95]

The distance heatmaps visualize how close each model's outputs are to real data across different scenarios and metrics.
1. Raw Distance Values
- Shows actual metric values (JSD, Hausdorff, DTW, EDR)
- Lower values = closer to real data (better performance)
- All metrics measure distance from reality where 0 = perfect match
2. Per-Metric Color Normalization
- Each metric column has its own color scale
- Green = best value for that metric (across all scenarios/models)
- Red = worst value for that metric
- Makes it easy to see relative performance within each metric
- Example: JSD of 0.001 might be green, Hausdorff of 5.0 might also be green (both are best in their respective columns)
3. Two Visualization Modes
Individual Heatmaps (plot_improvement_heatmaps_individual):
- One heatmap per distilled model
- Rows = scenarios (e.g., peak, off_peak, city_center)
- Columns = metrics (e.g., JSD, Hausdorff, DTW, EDR)
- Annotations show actual metric values
- Filename: distance_heatmap_{model_name}.pdf
Grid Heatmap (plot_improvement_heatmap_grid):
- All models shown side-by-side in a grid
- Enables easy comparison between vanilla and distilled models
- Same per-metric normalization applied consistently across all models
- Filename: distance_heatmap_grid.pdf
4. Configurable Colormap
- Default: RdYlGn_r (reversed Red-Yellow-Green)
- Can be changed via config/scenario_plots.yaml
- Any matplotlib colormap name is valid (e.g., Spectral_r, viridis, coolwarm)
- Green cells: Model performs well on this metric/scenario combination
- Yellow cells: Model has average performance
- Red cells: Model performs poorly on this metric/scenario combination
- Numbers: Actual distance values for precise comparison
- Colors: Relative performance for visual pattern recognition
Distance heatmap showing Distill model:
              JSD    Hausdorff  DTW  EDR
peak          0.012  4.5        3.2  0.15  (green/yellow/yellow/green)
off_peak      0.008  3.8        2.9  0.12  (green/green/green/green)
city_center   0.015  5.2        4.1  0.18  (yellow/red/red/yellow)
In this example:
- JSD values range 0.008-0.015 (all green/yellow = good)
- Hausdorff values range 3.8-5.2 (green to red = varies)
- Model performs best during off_peak hours
- City center scenarios are most challenging
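Per-metric color normalization can be sketched as scaling each metric column independently, so that the best (lowest) value in a column maps to 0 and the worst maps to 1. The sketch below uses plain min-max scaling on the example values above. This is a simplifying assumption: the configuration mentions a `percentile_range`, so the actual plots may clip to percentiles instead, and the function name `normalize_per_metric` is hypothetical.

```python
from typing import Dict


def normalize_per_metric(values: Dict[str, Dict[str, float]]) -> Dict[str, Dict[str, float]]:
    """Scale each metric column to [0, 1]; 0 = best (lowest), 1 = worst (highest)."""
    metrics = sorted({m for row in values.values() for m in row})
    scaled: Dict[str, Dict[str, float]] = {scenario: {} for scenario in values}
    for metric in metrics:
        column = [row[metric] for row in values.values() if metric in row]
        lo, hi = min(column), max(column)
        span = hi - lo
        for scenario, row in values.items():
            if metric in row:
                # A constant column carries no ranking information; map it to 0.
                scaled[scenario][metric] = 0.0 if span == 0 else (row[metric] - lo) / span
    return scaled


table = {
    "peak": {"JSD": 0.012, "Hausdorff": 4.5, "DTW": 3.2, "EDR": 0.15},
    "off_peak": {"JSD": 0.008, "Hausdorff": 3.8, "DTW": 2.9, "EDR": 0.12},
    "city_center": {"JSD": 0.015, "Hausdorff": 5.2, "DTW": 4.1, "EDR": 0.18},
}
scaled = normalize_per_metric(table)
for scenario, row in scaled.items():
    print(scenario, {metric: round(score, 2) for metric, score in row.items()})
```

Because every column is scaled on its own, a JSD of 0.008 and a Hausdorff distance of 3.8 both map to 0 here, which is why values of very different magnitude can share the same color.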
--eval-dir EVAL_DIR # Required: evaluation directory
--plots SELECTION # Plot IDs, groups, or comma-separated (default: all)
--dpi DPI # Output resolution (default: 300)
--list-plots # List all available plots and groups, then exit

- Saved to: {eval_dir}/figures/scenarios/
- Formats: PDF (publication-quality) and PNG
- All 13 plots with consistent styling
- Fast iteration - Generate only the plot you're working on
- No code changes - Add/remove plots via YAML configuration
- Dataset-specific - Different metric filtering per dataset
- Self-documenting - Use --list-plots to see all options
- Extensible - Easy to add new plots without touching main code
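The dataset-specific metric filtering driven by the `exclude_patterns` and `exclude` configuration keys can be pictured as a glob filter plus an exact-name blocklist. The sketch below assumes shell-style glob semantics for the patterns. The metric names other than the two excluded ones are made up for the example, and `filter_metrics` is a hypothetical helper, not the project's API.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def filter_metrics(
    names: Iterable[str],
    exclude_patterns: Iterable[str],
    exclude: Iterable[str],
) -> List[str]:
    """Drop metrics that match a glob pattern or appear in the exact-name list."""
    patterns = list(exclude_patterns)
    blocked = set(exclude)
    return [
        name
        for name in names
        if name not in blocked and not any(fnmatchcase(name, p) for p in patterns)
    ]


# Hypothetical metric names for illustration only.
available = ["Distance_JSD", "Distance_real_mean", "Hausdorff_km", "matched_od_pairs"]
kept = filter_metrics(
    available,
    exclude_patterns=["*_real_*"],
    exclude=["total_generated_od_pairs", "matched_od_pairs"],
)
print(kept)
```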
# Development: iterate on single plot
uv run python create_scenario_plots.py --eval-dir DIR \
--plots application.improvement_heatmaps
# Publication: high-DPI core visualizations
uv run python create_scenario_plots.py --eval-dir DIR \
--plots core --dpi 600
# Analysis: generate all heatmaps
uv run python create_scenario_plots.py --eval-dir DIR \
--plots heatmaps_only
# Custom: specific plots you need
uv run python create_scenario_plots.py --eval-dir DIR \
--plots metrics.scenario_heatmap,application.radar_charts,analysis.difficulty_ranking

See scenario_plots/README.md for detailed documentation on adding new plots and configuration options.
- Publication-quality PDF and PNG figures
- Metric comparisons across models
- Performance radar charts
- Scenario-based visualizations (if scenario analysis was run)
# Generate all analysis figures
uv run python create_analysis_figures.py --eval-dir hoser-distill-optuna-6
# Specify dataset explicitly
uv run python create_analysis_figures.py --eval-dir hoser-distill-optuna-6 --dataset Porto

--eval-dir EVAL_DIR # Required: evaluation directory
--dataset DATASET # Optional: override auto-detected dataset
--skip-performance-plots # Optional: disable generation-performance figures

- Saved to: {eval_dir}/figures/analysis/
- Formats: PDF (publication-quality) and PNG
- Figures:
- Metric comparisons
- Performance radar charts
- Model comparisons
When you only need throughput/latency visuals (no trajectory metrics), run the
dedicated tool. This script reuses cached _perf.json data and does not rerun
any pipeline phase.
# Core plots (efficiency, latency, latency percentiles, speed ranking, heatmap)
uv run python tools/perf_analysis_plots.py \
--eval-dir hoser-distill-beijing \
--plots efficiency,latency,latency_percentiles,speed_ranking,heatmap
# Use all plots with default output directory (figures/performance)
uv run python tools/perf_analysis_plots.py --eval-dir hoser-distill-beijing

Key CLI flags:
- --plots – comma-separated list (efficiency, latency, latency_percentiles, speed_ranking, heatmap, slope, length_speed, length_accuracy) or all.
  - latency_percentiles visualizes the cached per-trajectory median and 95th percentile latencies for each model so tail slowdowns can be inspected without altering the throughput ranking.
  - speed_ranking now adds leftward error bars computed from the latency percentiles (total_time_median vs total_time_p95) so tail slowdowns are visible without dwarfing faster trajectories.
- --style – Matplotlib style sheet for quick aesthetic tweaks.
- --include-cross / --include-abnormal – opt-in for those run types.
Performance plots now run by default when you call create_analysis_figures.py.
Use --skip-performance-plots if you only need the trajectory-quality figures.
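For readers who want to see what the median and 95th-percentile latency figures represent, here is a minimal sketch that derives both from a list of per-trajectory generation times. The input values are invented, the result keys reuse the `total_time_median` and `total_time_p95` names mentioned above, and the `tail_gap` field is only an assumption about how an error-bar length could be obtained. The cached `_perf.json` schema is not shown in this guide.

```python
import statistics
from typing import Dict, Sequence


def latency_summary(latencies: Sequence[float]) -> Dict[str, float]:
    """Summarize per-trajectory latencies by median and 95th percentile."""
    median = statistics.median(latencies)
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    p95 = statistics.quantiles(latencies, n=100, method="inclusive")[94]
    return {
        "total_time_median": median,
        "total_time_p95": p95,
        "tail_gap": p95 - median,  # assumed error-bar length, for illustration
    }


# Invented per-trajectory latencies in seconds, with one slow outlier.
samples = [0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.5, 1.8, 2.4, 6.0]
summary = latency_summary(samples)
print(summary)
```

A large gap between the two values signals a heavy tail, even when the median (and therefore the throughput ranking) looks healthy.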
If scenario analysis has been run (using tools/analyze_scenarios.py), the ScenarioVisualizer class provides additional plots:
# Scenario analysis creates its own plots automatically
uv run python tools/analyze_scenarios.py --eval-dir hoser-distill-optuna-6 --config config/scenarios_beijing.yaml
# Or trigger from python_pipeline.py with --run-scenarios flag

Scenario plots include:
- Scenario distribution bar charts
- Metric comparison heatmaps
- Hierarchical breakdown pie charts
Output location: {eval_dir}/scenarios/{od_source}/{model}/
All visualization scripts auto-detect and enhance their output if scenario analysis results are available:
- Trajectory Visualization:
  - Enables --sample_strategy scenario option
  - Samples trajectories from top N scenarios
  - Adds scenario labels to plots
- Distribution Plots:
  - Can be filtered by scenario (future enhancement)
- Analysis Figures:
  - Includes scenario-specific metric breakdowns
  - Uses ScenarioVisualizer class for scenario plots
# 1. Setup evaluation directory
uv run python setup_evaluation.py --dataset Beijing --models distilled vanilla
# 2. Run evaluation pipeline
cd eval_Beijing_20251024_123456
uv run python ../python_pipeline.py --run-scenarios
# 3. Generate visualizations (from project root)
cd ..
uv run python visualize_trajectories.py --eval-dir eval_Beijing_20251024_123456 --sample_strategy scenario
uv run python create_distribution_plots.py --eval-dir eval_Beijing_20251024_123456
uv run python create_scenario_plots.py --eval-dir eval_Beijing_20251024_123456
uv run python create_analysis_figures.py --eval-dir eval_Beijing_20251024_123456
# 4. All figures are now in eval_Beijing_20251024_123456/figures/

After running all visualization scripts:
eval_directory/
├── figures/
│   ├── trajectories/              # From visualize_trajectories.py
│   │   ├── separate/              # Individual model plots
│   │   ├── overlaid/              # Combined plots per model
│   │   ├── cross_model/           # Same OD pair, different models
│   │   └── scenario_cross_model/  # Scenario-based comparisons
│   │       ├── train/
│   │       │   ├── off_peak_comparison.{pdf,png}
│   │       │   ├── weekday_comparison.{pdf,png}
│   │       │   ├── city_center_comparison.{pdf,png}
│   │       │   └── all_scenarios_grid.{pdf,png}
│   │       └── test/
│   │           └── ... (same structure)
│   ├── distributions/             # From create_distribution_plots.py
│   │   ├── distance_distribution.png
│   │   └── duration_distribution.png
│   ├── scenarios/                 # NEW: From create_scenario_plots.py (13 plots)
│   │   ├── improvement_heatmap_*.{pdf,png}
│   │   ├── improvement_heatmap_grid.{pdf,png}
│   │   ├── scenario_metrics_heatmap.{pdf,png}
│   │   ├── application_use_case_radar.{pdf,png}
│   │   ├── scenario_difficulty_ranking.{pdf,png}
│   │   └── ... (all 13 scenario analysis plots)
│   ├── analysis/                  # From create_analysis_figures.py
│   │   ├── metric_comparison.pdf
│   │   ├── performance_radar.pdf
│   │   └── ...
│   └── old_scenarios/             # From ScenarioVisualizer (legacy, if scenarios run)
│       ├── train/
│       │   └── vanilla/
│       │       ├── scenario_distribution.png
│       │       ├── metric_comparison.png
│       │       └── hierarchical_*.png
│       └── test/
└── scenarios/                     # Raw scenario analysis results
    ├── train/
    │   └── vanilla/
    │       ├── scenario_analysis.json
    │       └── trajectory_scenarios.json
    └── test/
- Run scenario analysis first to enable scenario-aware sampling in trajectory visualization
- Use --verbose for debugging if plots aren't generating as expected
- Check dataset detection - scripts auto-detect from evaluation.yaml, but you can override with --dataset
- All scripts work from project root - they find data and save outputs relative to the eval directory
- PDF outputs are publication-quality from create_analysis_figures.py
- Ensure trajectory generation completed successfully
- Check that gene/{dataset}/ exists in the eval directory
- Verify model names match expected patterns (vanilla, distilled, etc.)
- Scenario analysis must be run first: tools/analyze_scenarios.py
- Or enable in pipeline: python_pipeline.py --run-scenarios
- Check that scenarios/ directory exists with scenario_analysis.json files
- Verify config/evaluation.yaml exists in eval directory
- Use --dataset flag to override: --dataset Beijing or --dataset Porto
- Ensure data directory exists: data/Beijing/ or data/porto_hoser/
- Check that roadmap.geo file is present
- Data paths are relative to project root, not eval directory
# Process all eval directories with full visualization suite
for eval_dir in eval_*/; do
echo "Processing $eval_dir..."
uv run python visualize_trajectories.py --eval-dir "$eval_dir"
uv run python create_distribution_plots.py --eval-dir "$eval_dir"
uv run python create_scenario_plots.py --eval-dir "$eval_dir"
uv run python create_analysis_figures.py --eval-dir "$eval_dir"
done
# Quick batch processing with only core plots
for eval_dir in eval_*/; do
echo "Processing $eval_dir (core plots only)..."
uv run python create_scenario_plots.py --eval-dir "$eval_dir" --plots core
done

# In visualize_trajectories.py, the ScenarioSampler can be extended
# to implement custom sampling logic based on scenario tags

- Setup Guide: docs/SETUP_EVALUATION_GUIDE.md
- Scenario Analysis: docs/SCENARIO_ANALYSIS_GUIDE.md
- Evaluation Pipeline: docs/EVALUATION_PIPELINE_GUIDE.md