This document catalogs the public datasets and benchmarking protocols integrated into the LayerForge-X framework. It delineates the role of each benchmark, its current implementation status, and the roadmap for future extensions.

Use the orchestration helper when public datasets are mounted locally:

    python scripts/run_public_benchmarks_if_present.py \
      --data-root data \
      --output-root results/public_benchmarks \
      --preset world_best \
      --max-images 512

It checks for the expected COCO Panoptic, ADE20K, and DIODE directories before launching any benchmark. Missing datasets are reported as skipped; present datasets are executed through the existing `layerforge benchmark-*` CLIs. The runner writes `public_benchmark_run_report.json` and `public_benchmark_run_report.md`.

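
The skip-or-run behaviour described above can be sketched roughly as follows. This is a minimal illustration and not the actual script: the sub-directory names in `EXPECTED_DIRS` and the functions `plan_benchmarks` and `build_report` are hypothetical, and the real runner may check different paths and record more detail.

```python
from pathlib import Path

# Hypothetical layout: these sub-directory names are assumptions for
# illustration, not the paths the real script checks.
EXPECTED_DIRS = {
    "coco-panoptic": "coco_panoptic",
    "ade20k": "ade20k",
    "diode": "diode",
}


def plan_benchmarks(data_root):
    """Split benchmarks into those whose dataset directory exists and those to skip."""
    root = Path(data_root)
    present, skipped = [], []
    for name, subdir in EXPECTED_DIRS.items():
        if (root / subdir).is_dir():
            present.append(name)
        else:
            skipped.append(name)
    return present, skipped


def build_report(present, skipped):
    """Return a JSON-serialisable summary of what would run and what is skipped."""
    return {"to_run": sorted(present), "skipped": sorted(skipped)}
```

Deciding the full plan before launching anything keeps a missing dataset from aborting the benchmarks that can run.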
- Role: Visible semantic and instance-level grouping benchmark.
- Reference: COCO Dataset
- Protocol: Evaluates coarse-group mean Intersection-over-Union (mIoU) for "thing" (instance) and "stuff" (region) categories.
- Implementation: Automated download and evaluation harness provided in `scripts/download_coco_panoptic_val.py` and `layerforge benchmark-coco-panoptic`.
- Measured Performance (Mask2Former):
  - Images Evaluated: 512
  - Group mIoU: 0.5660
  - Thing mIoU: 0.5842
  - Stuff mIoU: 0.5479
- Summary Artifact: `report_artifacts/metrics_snapshots/coco_panoptic_group_benchmark_summary.json`
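
A coarse-group mIoU of this kind might be computed along the following lines. The sketch assumes every pixel has already been mapped to one of two coarse groups, `"thing"` or `"stuff"`, and that pixel counts are pooled over all images. Both are assumptions, as are the function names `group_iou` and `group_miou`. The reported Group mIoU is consistent with the mean of the thing and stuff values ((0.5842 + 0.5479) / 2 ≈ 0.566), which is what the sketch reproduces, but the actual harness may differ in detail.

```python
import math


def group_iou(pred_maps, gt_maps, group):
    """IoU of one coarse group, with pixel counts pooled over all images."""
    inter = union = 0
    for pred, gt in zip(pred_maps, gt_maps):
        for p, g in zip(pred, gt):
            in_pred, in_gt = p == group, g == group
            inter += int(in_pred and in_gt)
            union += int(in_pred or in_gt)
    return inter / union if union else math.nan


def group_miou(pred_maps, gt_maps, groups=("thing", "stuff")):
    """Mean IoU over the coarse groups that occur at least once."""
    ious = {g: group_iou(pred_maps, gt_maps, g) for g in groups}
    valid = [v for v in ious.values() if not math.isnan(v)]
    mean = sum(valid) / len(valid) if valid else math.nan
    return mean, ious


# Toy example: one image of four pixels, stored as a flat list of group labels.
pred = [["thing", "stuff", "stuff", "stuff"]]
gt = [["thing", "thing", "stuff", "stuff"]]
mean, per_group = group_miou(pred, gt)
```

In the toy example the thing IoU is 1/2 and the stuff IoU is 2/3, so the group mIoU is their mean, 7/12.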
- Role: High-density scene parsing and background/stuff categorization.
- Reference: ADE20K Dataset
- Protocol: Employs the same coarse-group mIoU protocol as the COCO benchmark, providing a more challenging environment for dense scene understanding.
- Implementation: Automated harness via `scripts/download_ade20k.py` and `layerforge benchmark-ade20k`.
- Measured Performance (Mask2Former):
  - Images Evaluated: 512
  - Group mIoU: 0.6015
  - Thing mIoU: 0.5579
  - Stuff mIoU: 0.6451
  - Mean Image mIoU: 0.5569
- Summary Artifact: `report_artifacts/metrics_snapshots/ade20k_group_benchmark_summary.json`
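
The ADE20K results list a Mean Image mIoU next to the group-level scores. One plausible reading, offered here as an assumption rather than a description of the harness, is that mIoU is first computed per image and then averaged over images, instead of pooling pixel counts across the whole set. The sketch below illustrates that reading; `image_miou` and `mean_image_miou` are hypothetical names, and each image is represented as a flat list of coarse group labels.

```python
import math


def image_miou(pred, gt, groups=("thing", "stuff")):
    """Mean IoU over the groups present in a single image (flat label lists)."""
    ious = []
    for group in groups:
        inter = sum(p == group and g == group for p, g in zip(pred, gt))
        union = sum(p == group or g == group for p, g in zip(pred, gt))
        if union:  # skip groups absent from both prediction and ground truth
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else math.nan


def mean_image_miou(pred_maps, gt_maps):
    """Average the per-image mIoU over all images that have a defined score."""
    scores = [image_miou(p, g) for p, g in zip(pred_maps, gt_maps)]
    scores = [s for s in scores if not math.isnan(s)]
    return sum(scores) / len(scores) if scores else math.nan


# Toy example: a perfectly predicted image and a half-wrong image.
preds = [["thing", "thing"], ["thing", "stuff"]]
gts = [["thing", "thing"], ["stuff", "stuff"]]
score = mean_image_miou(preds, gts)
```

Per-image averaging weights every image equally, so a small image with a poor prediction lowers the score as much as a large one. That is one reason this figure can differ from a pooled group mIoU.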
- Role: Diverse indoor and outdoor monocular depth benchmark.
- Reference: DIODE Dataset
- Protocol: Validates monocular geometry models (e.g., Depth Pro) against ground-truth RGB-D data. Supports both raw metric evaluation and scale-aligned comparative studies.
- Implementation: Automated harness via `scripts/download_diode_val.py` and `layerforge benchmark-diode`.
- Measured Performance (Depth Pro):
  - Raw Metric: AbsRel 0.5230, Delta1 0.4057, SILog 26.8766
  - Scale-Aligned: AbsRel 0.3629, RMSE 6.1891, Delta1 0.6452
- Summary Artifacts: `report_artifacts/metrics_snapshots/diode_depth_benchmark_summary.json`, `diode_depth_scale_benchmark_summary.json`
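
For reference, the depth metrics named above can be sketched with their commonly used definitions. The code is illustrative only. It assumes valid, strictly positive depth pairs, uses the KITTI-style SILog scaled by 100, and uses median scaling as the alignment step. The source does not state which alignment the harness applies, so `median_scale_align` is an assumption, and both function names are hypothetical.

```python
import math
from statistics import median


def depth_metrics(pred, gt):
    """AbsRel, RMSE, Delta1 and SILog over paired positive depth values."""
    n = len(gt)
    abs_rel = sum(abs(p - g) / g for p, g in zip(pred, gt)) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / n)
    # Delta1: fraction of pixels whose depth ratio is within a factor of 1.25.
    delta1 = sum(max(p / g, g / p) < 1.25 for p, g in zip(pred, gt)) / n
    # SILog: standard deviation of the log-depth error, scaled by 100.
    log_diff = [math.log(p) - math.log(g) for p, g in zip(pred, gt)]
    mean_d = sum(log_diff) / n
    var_d = sum(d * d for d in log_diff) / n - mean_d ** 2
    silog = 100.0 * math.sqrt(max(var_d, 0.0))
    return {"abs_rel": abs_rel, "rmse": rmse, "delta1": delta1, "silog": silog}


def median_scale_align(pred, gt):
    """Rescale predictions so their median matches the ground-truth median."""
    scale = median(gt) / median(pred)
    return [p * scale for p in pred]


# Toy example: predictions that are exactly twice the true depth.
gt = [1.0, 2.0, 4.0]
pred = [2.0, 4.0, 8.0]
raw = depth_metrics(pred, gt)
aligned = depth_metrics(median_scale_align(pred, gt), gt)
```

In the toy example the raw AbsRel is 1.0 and Delta1 is 0, while SILog is essentially zero because the error is a pure scale factor. After median scaling, AbsRel and RMSE drop to 0 and Delta1 rises to 1, which shows why the raw and scale-aligned numbers are reported separately.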
The following datasets are identified as high-priority targets for future integration to further validate specific DALG attributes:
- Objective: Establish a robust indoor pairwise-order and depth-accuracy benchmark.
- Rationale: NYU Depth V2 provides the highest-quality ground truth for indoor scene geometry and relative object ordering.
- Objective: Quantitative validation of amodal segmentation and occlusion reasoning.
- Rationale: These datasets provide ground-truth amodal masks for street scenes and general object categories, facilitating a rigorous assessment of hidden-region estimation.
- Objective: Standardized evaluation of albedo and shading decomposition.
- Rationale: While intrinsic decomposition is currently a stretch goal, integration with IIW/WHDR metrics will allow for direct comparison against specialized intrinsic decomposition architectures.
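
As a pointer to what an IIW-style evaluation involves, the sketch below computes a Weighted Human Disagreement Rate (WHDR) over pairwise reflectance judgements. It follows the usual definition with a relative threshold of 0.10. The tuple layout `(r1, r2, human_label, weight)` and the function names are hypothetical, and nothing here reflects an existing LayerForge-X implementation.

```python
def relative_judgement(r1, r2, delta=0.10, eps=1e-10):
    """Classify which point is darker from two predicted reflectance values."""
    r1, r2 = max(r1, eps), max(r2, eps)
    if r2 / r1 > 1.0 + delta:
        return "1"  # point 1 is darker
    if r1 / r2 > 1.0 + delta:
        return "2"  # point 2 is darker
    return "E"  # approximately equal


def whdr(comparisons, delta=0.10):
    """Weighted fraction of comparisons where the prediction disagrees with humans."""
    total = sum(weight for _, _, _, weight in comparisons)
    errors = sum(
        weight
        for r1, r2, human, weight in comparisons
        if relative_judgement(r1, r2, delta) != human
    )
    return errors / total if total else float("nan")


# Toy example: two agreements and one disagreement (the last comparison).
comparisons = [
    (0.2, 0.5, "1", 1.0),
    (0.5, 0.5, "E", 2.0),
    (0.6, 0.3, "1", 1.0),
]
score = whdr(comparisons)
```

Lower is better: the toy example disagrees on one comparison of weight 1 out of a total weight of 4, giving a WHDR of 0.25.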
The integration of these diverse public benchmarks ensures that LayerForge-X is not evaluated in isolation. By measuring performance across visibility (COCO/ADE20K), geometry (DIODE), and synthetic structure (LayerBench), the framework establishes a verifiable and transparent performance baseline for the task of layered scene decomposition.