
Migrate dataset distribution to HuggingFace-native #2

@mohitgargai


Context

Today the dataset is distributed two different ways and they have drifted:

  • GCS zip: gs://lica-ml/gdb-dataset.zip (public URL: https://storage.googleapis.com/lica-ml/gdb-dataset.zip), pointed at by scripts/download_data.py. It contains the canonical on-disk layout that both the upstream run_benchmarks.py and the Harbor adapter load from:
    gdb-dataset/
    ├── benchmarks/      # ~3.4 GB per-benchmark JSON definitions
    └── lica-data/       # ~1.0 GB (metadata.csv, layouts, images, annotations)
    
  • HuggingFace: lica-world/GDB, currently ~62 parquet files across 39 benchmark configs. It has only per-benchmark sample rows, no lica-data/.

Recent incidents that motivate this cleanup:

  1. template-{4,5}.json shipped context_layouts / designated_layout / ground_truth as JSON-encoded strings (instead of dicts). This caused a template-5 evaluator crash and doubly-escaped prompts for template-4. The fix landed locally and on HF, but the GCS zip and the harbor-datasets tasks built from it both stayed out of sync until a manual refresh (harbor PR #1433 + companion harbor-datasets PR #196).
  2. A stalled gsutil upload in the same session left gs://lica-ml/gdb-dataset.zip as 404 for several hours, silently breaking scripts/download_data.py.
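
The first incident is the kind of drift a small load-time guard could catch. A minimal sketch, assuming each sample is a plain dict and that only the three fields named above are affected; normalize_sample is a hypothetical helper, not something that exists in the repository today:

```python
import json

# Fields that incident 1 reports as having shipped JSON-encoded.
JSON_FIELDS = ("context_layouts", "designated_layout", "ground_truth")


def normalize_sample(sample: dict) -> dict:
    """Return a copy of `sample` with JSON-encoded dict/list fields decoded."""
    fixed = dict(sample)
    for key in JSON_FIELDS:
        value = fixed.get(key)
        if not isinstance(value, str):
            continue  # already a dict/list, or absent
        try:
            decoded = json.loads(value)
        except json.JSONDecodeError:
            continue  # a genuinely free-form string; leave it alone
        # Only accept containers, so a plain string like "123" is not
        # silently turned into an int.
        if isinstance(decoded, (dict, list)):
            fixed[key] = decoded
    return fixed
```

A guard like this would only mask the symptom; the proposal below removes the cause by having one source of truth.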

Proposal

Move the dataset to a HF-native layout and make lica-world/GDB the single source of truth.

Scope

  • Decide on the HF representation for lica-data/:
    • Option A: Embed image/layout bytes in the parquet rows (datasets.Image() / datasets.Value('binary')). Clean but grows the parquet size.
    • Option B: Publish lica-data/ as a separate HF dataset (lica-world/GDB-assets) using the imagefolder builder or raw LFS files, and have datasets.load_dataset('lica-world/GDB', <benchmark>) reference asset paths.
    • Option C: Keep lica-data/ on GCS as a pure CDN and only migrate benchmarks/ to HF. Simplest, but it doesn't get us off GCS.
  • Rewrite scripts/download_data.py:
    • Default source: HF (via huggingface_hub.snapshot_download or datasets.load_dataset(...).save_to_disk).
    • Reconstruct the on-disk gdb-dataset/ layout so run_benchmarks.py and the Harbor adapter keep working unchanged.
    • Keep --from-zip as the local-archive escape hatch; drop the GCS URL as the default.
  • Update BaseBenchmark.load_data(...) paths and/or the per-task load_data implementations if we go Option A (reading from HF rows instead of disk).
  • Add a CI job on lica-world/GDB that publishes tagged releases to HF on every main merge (prevents future GCS/HF drift).
  • Deprecate the GCS zip: leave it up with a README note pointing at HF, or redirect via a CDN alias.
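
As a rough illustration of the scripts/download_data.py rewrite described above, here is a minimal sketch. It assumes the HF repo already mirrors the benchmarks/ and lica-data/ layout as raw files, which is not the case today (HF currently holds parquet only), so reconstructing the layout from parquet rows would need an extra conversion step that is not shown. The --dest flag and the function names are hypothetical; --from-zip and the repo id come from this issue.

```python
import argparse
import zipfile
from pathlib import Path

HF_REPO_ID = "lica-world/GDB"  # single source of truth per the proposal


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Fetch the GDB dataset.")
    parser.add_argument("--from-zip", type=Path, default=None,
                        help="Extract a local archive instead of downloading from HF.")
    # Hypothetical flag: where the on-disk layout should be written.
    parser.add_argument("--dest", type=Path, default=Path("gdb-dataset"))
    return parser


def fetch_from_zip(zip_path: Path, dest: Path) -> Path:
    """Local-archive escape hatch: extract the archive contents under `dest`."""
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    return dest


def fetch_from_hf(dest: Path) -> Path:
    """Default path: download a snapshot of the HF dataset repo into `dest`."""
    # Imported lazily so --from-zip works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    snapshot_download(repo_id=HF_REPO_ID, repo_type="dataset", local_dir=dest)
    return dest


def main(argv=None) -> Path:
    args = build_parser().parse_args(argv)
    if args.from_zip is not None:
        return fetch_from_zip(args.from_zip, args.dest)
    return fetch_from_hf(args.dest)
```

Calling main([]) would take the HF path, while main(["--from-zip", "gdb-dataset.zip"]) would stay fully offline.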

Non-goals (explicitly out of scope)

  • Moving the media URLs embedded inside layouts (https://storage.googleapis.com/lica-video/<uuid>.png) off GCS. These are public, stable, and referenced across hundreds of assets; that's a separate cleanup.

Success criteria

  • pip install lica-gdb && python -c 'from datasets import load_dataset; load_dataset("lica-world/GDB", "template-5")' works standalone.
  • python scripts/download_data.py fetches from HF by default and produces an identical on-disk layout to the current zip extract.
  • Removing gs://lica-ml/gdb-dataset.zip does not break any supported workflow.
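
The "identical on-disk layout" criterion could be checked mechanically by comparing the two extracted trees file by file. A minimal sketch using only the standard library; tree_digest and diff_trees are hypothetical names, and it assumes byte-for-byte equality is the intended meaning of "identical":

```python
import hashlib
from pathlib import Path


def tree_digest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to `root`) to its SHA-256 hex digest."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        sha = hashlib.sha256()
        with path.open("rb") as handle:
            # Hash in 1 MiB chunks so multi-GB files are not read into memory.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                sha.update(chunk)
        digests[path.relative_to(root).as_posix()] = sha.hexdigest()
    return digests


def diff_trees(expected: Path, actual: Path) -> dict[str, list[str]]:
    """Report files missing from, extra in, or changed in `actual`."""
    exp, act = tree_digest(expected), tree_digest(actual)
    return {
        "missing": sorted(exp.keys() - act.keys()),
        "extra": sorted(act.keys() - exp.keys()),
        "changed": sorted(k for k in exp.keys() & act.keys() if exp[k] != act[k]),
    }
```

An empty result in all three lists for the zip extract versus the HF-sourced directory would satisfy the criterion under that assumption.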

Why now

The Harbor adapter PR (harbor-framework/harbor#1433) is landing, and it will become a second public consumer of this dataset. This is the right time to converge on one distribution path before more downstream users build on top of the GCS URL.


Filed as a follow-up from the Scenario-2 parity work on the harbor-adapter branch.
