Skip to content

Expose enable_sample_tables in prepare_dataset and as --no-sample-tables CLI flag#230

Open
pei-li-hedgehog wants to merge 9 commits into
NVIDIA:developfrom
pei-li-hedgehog:pli-expose-enable-sample-tables
Open

Expose enable_sample_tables in prepare_dataset and as --no-sample-tables CLI flag#230
pei-li-hedgehog wants to merge 9 commits into
NVIDIA:developfrom
pei-li-hedgehog:pli-expose-enable-sample-tables

Conversation

@pei-li-hedgehog
Copy link
Copy Markdown

@pei-li-hedgehog pei-li-hedgehog commented May 19, 2026

Closes #231.

Summary

SqliteIndexWriter already supports enable_sample_tables=False to skip creating the samples and sample_parts tables, but that knob isn't plumbed through BaseWebdatasetFactory.prepare_dataset() or the energon prepare CLI. On very large datasets (100M+ samples) the SQLite inserts and the post-insert btree builds dominate preparation time. For datasets consumed purely by the integer-indexed loader (ShardInfosITarReader), the samples table isn't queried at training time — checkpoint/resume uses SliceState integer offsets resolved via .info.json + .tar.idx.

Concretely, on a 244M-row / 4000-shard text webdataset we hit a 13-hour K8s pod timeout in the indexing phase. We need a way to skip the SQLite cost for monolithic pretraining datasets while keeping everything the integer-indexed loader needs (.tar.idx, .info.json, split config).

Changes

  • BaseWebdatasetFactory.prepare_dataset() accepts enable_sample_tables=True (default — backwards compatible), passes through to SqliteIndexWriterAggregator.
  • energon prepare --no-sample-tables exposes the same option. Mutually exclusive with --tar-index-only (which operates on an already-prepared dataset, different semantics).
  • append_samples / append_parts become no-ops when the writer was constructed with enable_sample_tables=False, so producers can keep emitting rows unmodified.
  • SqliteIndexWriter always drops stale samples / sample_parts when reset_tables=True, regardless of whether the new run intends to repopulate them. Avoids stale tables when re-preparing with --no-sample-tables after a previous full prepare.

Out of scope

Sample-key lookups (WebdatasetFileStore / SqliteITarEntryReader), polylithic joins (which INNER JOIN ... ON samples.sample_key = ... at metadataset-prepare time), and aux-data access on CrudeSample datasets all rely on the SQLite sample tables and will not work for datasets prepared with --no-sample-tables. That's the intended trade-off — documented in the new docstring/help text. Failures are loud (clear SQLite no such table errors), not silent.

A larger follow-up would also skip emitting IndexSample / IndexSamplePart items from _preprocess_tar (saving IPC overhead in addition to SQLite cost). Out of scope here.

Testing

All four existing test_prepare_dataset* tests pass (backwards compatibility).

Two new tests:

  • test_prepare_dataset_no_sample_tables — covers the prepare path:

    • --no-sample-tables prepare succeeds; .info.json and per-tar .tar.idx are produced.
    • index.sqlite exists but does not contain the samples / sample_parts tables.
    • --no-sample-tables + --tar-index-only is rejected with the documented UsageError.
    • Constructing a SqliteIndexReader against the prepared dataset and calling sample-key methods (get_sample_pointer_by_key, get_sample_count, get_sample_part) raises MissingSamplesTableError with a descriptive message — not the raw sqlite3.OperationalError: no such table: samples.
  • test_prepare_dataset_no_sample_tables_save_restore — covers the training-time checkpoint/resume path on a --no-sample-tables dataset:

    • Reference: uninterrupted iteration of 20 samples.
    • Save state mid-stream after 10 samples via save_state_rank(); continue iterating to capture the next 10 (post_save).
    • Build a fresh loader, restore_state_rank(state), iterate 10 samples (post_restore).
    • Asserts first_half + post_save == reference and post_restore == post_save.
    • This empirically verifies that ShardInfosITarReader + SliceState (the integer-indexed loader/state path) work end-to-end on a dataset prepared without the SQLite samples tables.

ruff check and ruff format --check clean on all touched files.

…les CLI flag

The SqliteIndexWriter already supports enable_sample_tables=False to skip
the samples and sample_parts tables, but the flag was not plumbed through
BaseWebdatasetFactory.prepare_dataset() or the `energon prepare` CLI. On
very large datasets (100M+ samples) the SQLite inserts and post-insert
btree builds dominate preparation runtime; for datasets consumed purely
by the integer-indexed loader (ShardInfosITarReader), the samples table
is not needed at training time.

Changes:
- prepare_dataset() accepts enable_sample_tables=True (default), passes
  through to SqliteIndexWriterAggregator.
- `energon prepare --no-sample-tables` flag exposes the same option.
  Cannot be combined with --tar-index-only (which operates on an
  already-prepared dataset).
- append_samples/append_parts become no-ops when the writer was constructed
  with enable_sample_tables=False, so producers can keep emitting rows
  without changes.
- SqliteIndexWriter always drops stale samples/sample_parts on
  reset_tables=True, regardless of whether the new run intends to
  repopulate them. This avoids leaving stale tables when re-preparing
  a dataset with --no-sample-tables.
- Adds a unit test covering the CLI flag, table-absence, and the
  --no-sample-tables vs --tar-index-only conflict.

Signed-off-by: Pei Li <pei.li@kaiko.ai>
Adds user-facing docs for the new option to skip the SQLite samples
and sample_parts tables during prepare:

- docs/source/basic/data_prep.md: subsection under
  "index.sqlite and index.uuid" explaining when and why to use
  `energon prepare --no-sample-tables`, including the trade-off
  (polylithic joins, FileStore access, and `energon mount` won't work).
- docs/source/advanced/data_prep_api.md: section showing the equivalent
  `BaseWebdatasetFactory.prepare_dataset(..., enable_sample_tables=False)`
  programmatic usage with the same trade-off note.

Signed-off-by: Pei Li <pei.li@kaiko.ai>
Copy link
Copy Markdown
Collaborator

@voegtlel voegtlel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @pei-li-hedgehog , thank you for this feature! Basically looks good to me, I'd just reduce the docs a bit, to be more on the spot.

Comment thread docs/source/basic/data_prep.md
Comment thread docs/source/advanced/data_prep_api.md
Comment thread docs/source/advanced/data_prep_api.md Outdated
@voegtlel
Copy link
Copy Markdown
Collaborator

One more thing: We may need to additionally add a proper error message when loading such a dataset as aux fails, i.e. if the sqlite doesn't contain the samples table, but we try to load it, explaining that the dataset was created without an index, and that it needs to be prepared accordingly.

Drop the `:class: warning` admonition (no other doc in the repo uses it
that way) and trim both new subsections to focus on the single consequence
that matters to users: a dataset prepared with --no-sample-tables /
enable_sample_tables=False cannot be used as an auxiliary dataset.

Matches the length and style of neighboring subsections.

Signed-off-by: Pei Li <pei.li@kaiko.ai>
…e-tables datasets

Without this, code paths that load a dataset prepared with
--no-sample-tables as auxiliary data (or any other use of the SQLite
sample-key index) would fail with the unhelpful raw
`sqlite3.OperationalError: no such table: samples`.

This adds a new MissingSamplesTableError that names the cause and
points to the fix (re-prepare without --no-sample-tables). Every
SqliteIndexReader method that queries the samples / sample_parts
tables runs a one-time check on first use and raises the descriptive
error if the table is absent. get_media_metadata is unaffected since
the media_metadata table is independent.

Extends the test to verify the error type and message for
get_sample_pointer_by_key, get_sample_count, and get_sample_part.

Signed-off-by: Pei Li <pei.li@kaiko.ai>
@pei-li-hedgehog
Copy link
Copy Markdown
Author

pei-li-hedgehog commented May 19, 2026

Thanks for the review @voegtlel! All four points addressed.

Good that you mentioned about adding a proper error message when loading such a dataset as aux fails. I've made the following change:

  • New MissingSamplesTableError in indexing.py. Message names the cause (--no-sample-tables / enable_sample_tables=False) and points to the fix (re-prepare without the flag).
  • SqliteIndexReader now runs a one-time check on first use of any samples-touching method. get_media_metadata is unaffected — the media_metadata table is independent.
  • Test extended to cover the error type + message for three of the methods.

Empirically verifies that the integer-indexed loader's checkpoint /
resume path works on a dataset prepared with --no-sample-tables.
ShardInfosITarReader and SliceState never touch the SQLite samples
tables, so the load-bearing claim of the flag is that training-time
save/restore still produces the same sample sequence. This test
exercises the round-trip:

  1. Reference: an uninterrupted iteration of 20 samples.
  2. Capture state mid-stream (after 10 samples) via save_state_rank().
  3. Continue iterating to capture the next 10 samples (post_save).
  4. Build a fresh loader, restore_state_rank(state), iterate 10 samples
     (post_restore).
  5. Assert first_half + post_save == reference (no divergence from
     the reference run) and post_restore == post_save (resumed
     iteration matches continued iteration).

Re-prepares the test fixture as CaptioningSample + --no-sample-tables
so get_train_dataset returns decodable samples.

Signed-off-by: Pei Li <pei.li@kaiko.ai>
Comment thread docs/source/basic/data_prep.md Outdated
Comment thread src/megatron/energon/flavors/webdataset/indexing.py Outdated
@pei-li-hedgehog pei-li-hedgehog force-pushed the pli-expose-enable-sample-tables branch from 4c7ab4a to f205b90 Compare May 19, 2026 19:08
…ntryReader

- data_prep.md: replace "consumed sequentially" with the precise constraint
  ("not used as auxiliary or mounted").
- SqliteIndexReader: expose has_sample_tables as a constructor-time attribute
  (mirrors db_has_sample_parts); drop the per-method _check_samples_table guard.
- SqliteITarEntryReader: raise MissingSamplesTableError at __init__ when the
  samples table is missing — fail fast at the boundary that actually requires it.
- Test updated to assert at SqliteITarEntryReader construction.

Signed-off-by: Pei Li <pei.li@kaiko.ai>
@pei-li-hedgehog pei-li-hedgehog force-pushed the pli-expose-enable-sample-tables branch from f205b90 to e35c68d Compare May 19, 2026 20:30
@pei-li-hedgehog pei-li-hedgehog requested a review from voegtlel May 19, 2026 20:34
Copy link
Copy Markdown
Collaborator

@voegtlel voegtlel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just this very small edit

Comment thread src/megatron/energon/flavors/webdataset/itar_reader.py Outdated
Comment thread src/megatron/energon/flavors/webdataset/itar_reader.py Outdated
pei-li-hedgehog and others added 2 commits May 20, 2026 09:58
Co-authored-by: Lukas Voegtle <5764745+voegtlel@users.noreply.github.com>
Signed-off-by: pei-li-hedgehog <pei.li@kaiko.ai>
Co-authored-by: Lukas Voegtle <5764745+voegtlel@users.noreply.github.com>
Signed-off-by: pei-li-hedgehog <pei.li@kaiko.ai>
@pei-li-hedgehog pei-li-hedgehog requested a review from voegtlel May 20, 2026 08:21
@pei-li-hedgehog pei-li-hedgehog force-pushed the pli-expose-enable-sample-tables branch from 4bcce3b to 2678503 Compare May 20, 2026 08:24
Signed-off-by: Pei Li <pei.li@kaiko.ai>
@pei-li-hedgehog pei-li-hedgehog force-pushed the pli-expose-enable-sample-tables branch from 2678503 to 0914862 Compare May 21, 2026 08:06
@pei-li-hedgehog
Copy link
Copy Markdown
Author

Suggestions applied. Thx!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add option to skip SQLite samples/sample_parts tables during prepare (multi-hour cost on 100M+ sample datasets)

2 participants