Expose enable_sample_tables in prepare_dataset and as --no-sample-tables CLI flag#230
Open
pei-li-hedgehog wants to merge 9 commits into
Open
Conversation
…les CLI flag The SqliteIndexWriter already supports enable_sample_tables=False to skip the samples and sample_parts tables, but the flag was not plumbed through BaseWebdatasetFactory.prepare_dataset() or the `energon prepare` CLI. On very large datasets (100M+ samples) the SQLite inserts and post-insert btree builds dominate preparation runtime; for datasets consumed purely by the integer-indexed loader (ShardInfosITarReader), the samples table is not needed at training time. Changes: - prepare_dataset() accepts enable_sample_tables=True (default), passes through to SqliteIndexWriterAggregator. - `energon prepare --no-sample-tables` flag exposes the same option. Cannot be combined with --tar-index-only (which operates on an already-prepared dataset). - append_samples/append_parts become no-ops when the writer was constructed with enable_sample_tables=False, so producers can keep emitting rows without changes. - SqliteIndexWriter always drops stale samples/sample_parts on reset_tables=True, regardless of whether the new run intends to repopulate them. This avoids leaving stale tables when re-preparing a dataset with --no-sample-tables. - Adds a unit test covering the CLI flag, table-absence, and the --no-sample-tables vs --tar-index-only conflict. Signed-off-by: Pei Li <pei.li@kaiko.ai>
Adds user-facing docs for the new option to skip the SQLite samples and sample_parts tables during prepare: - docs/source/basic/data_prep.md: subsection under "index.sqlite and index.uuid" explaining when and why to use `energon prepare --no-sample-tables`, including the trade-off (polylithic joins, FileStore access, and `energon mount` won't work). - docs/source/advanced/data_prep_api.md: section showing the equivalent `BaseWebdatasetFactory.prepare_dataset(..., enable_sample_tables=False)` programmatic usage with the same trade-off note. Signed-off-by: Pei Li <pei.li@kaiko.ai>
voegtlel
requested changes
May 19, 2026
Collaborator
voegtlel
left a comment
There was a problem hiding this comment.
Hi @pei-li-hedgehog , thank you for this feature! Basically looks good to me, I'd just reduce the docs a bit, to be more on the spot.
Collaborator
|
One more thing: We may need to additionally add a proper error message when loading such a dataset as aux fails, i.e. if the sqlite doesn't contain the |
Drop the `:class: warning` admonition (no other doc in the repo uses it that way) and trim both new subsections to focus on the single consequence that matters to users: a dataset prepared with --no-sample-tables / enable_sample_tables=False cannot be used as an auxiliary dataset. Matches the length and style of neighboring subsections. Signed-off-by: Pei Li <pei.li@kaiko.ai>
…e-tables datasets Without this, code paths that load a dataset prepared with --no-sample-tables as auxiliary data (or any other use of the SQLite sample-key index) would fail with the unhelpful raw `sqlite3.OperationalError: no such table: samples`. This adds a new MissingSamplesTableError that names the cause and points to the fix (re-prepare without --no-sample-tables). Every SqliteIndexReader method that queries the samples / sample_parts tables runs a one-time check on first use and raises the descriptive error if the table is absent. get_media_metadata is unaffected since the media_metadata table is independent. Extends the test to verify the error type and message for get_sample_pointer_by_key, get_sample_count, and get_sample_part. Signed-off-by: Pei Li <pei.li@kaiko.ai>
Author
|
Thanks for the review @voegtlel! All four points addressed. Good that you mentioned about
|
Empirically verifies that the integer-indexed loader's checkpoint /
resume path works on a dataset prepared with --no-sample-tables.
ShardInfosITarReader and SliceState never touch the SQLite samples
tables, so the load-bearing claim of the flag is that training-time
save/restore still produces the same sample sequence. This test
exercises the round-trip:
1. Reference: an uninterrupted iteration of 20 samples.
2. Capture state mid-stream (after 10 samples) via save_state_rank().
3. Continue iterating to capture the next 10 samples (post_save).
4. Build a fresh loader, restore_state_rank(state), iterate 10 samples
(post_restore).
5. Assert first_half + post_save == reference (no divergence from
the reference run) and post_restore == post_save (resumed
iteration matches continued iteration).
Re-prepares the test fixture as CaptioningSample + --no-sample-tables
so get_train_dataset returns decodable samples.
Signed-off-by: Pei Li <pei.li@kaiko.ai>
voegtlel
requested changes
May 19, 2026
4c7ab4a to
f205b90
Compare
…ntryReader
- data_prep.md: replace "consumed sequentially" with the precise constraint
("not used as auxiliary or mounted").
- SqliteIndexReader: expose has_sample_tables as a constructor-time attribute
(mirrors db_has_sample_parts); drop the per-method _check_samples_table guard.
- SqliteITarEntryReader: raise MissingSamplesTableError at __init__ when the
samples table is missing — fail fast at the boundary that actually requires it.
- Test updated to assert at SqliteITarEntryReader construction.
Signed-off-by: Pei Li <pei.li@kaiko.ai>
f205b90 to
e35c68d
Compare
voegtlel
requested changes
May 20, 2026
Co-authored-by: Lukas Voegtle <5764745+voegtlel@users.noreply.github.com> Signed-off-by: pei-li-hedgehog <pei.li@kaiko.ai>
Co-authored-by: Lukas Voegtle <5764745+voegtlel@users.noreply.github.com> Signed-off-by: pei-li-hedgehog <pei.li@kaiko.ai>
4bcce3b to
2678503
Compare
Signed-off-by: Pei Li <pei.li@kaiko.ai>
2678503 to
0914862
Compare
Author
|
Suggestions applied. Thx! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #231.
Summary
SqliteIndexWriteralready supportsenable_sample_tables=Falseto skip creating thesamplesandsample_partstables, but that knob isn't plumbed throughBaseWebdatasetFactory.prepare_dataset()or theenergon prepareCLI. On very large datasets (100M+ samples) the SQLite inserts and the post-insert btree builds dominate preparation time. For datasets consumed purely by the integer-indexed loader (ShardInfosITarReader), thesamplestable isn't queried at training time — checkpoint/resume usesSliceStateinteger offsets resolved via.info.json+.tar.idx.Concretely, on a 244M-row / 4000-shard text webdataset we hit a 13-hour K8s pod timeout in the indexing phase. We need a way to skip the SQLite cost for monolithic pretraining datasets while keeping everything the integer-indexed loader needs (
.tar.idx,.info.json, split config).Changes
BaseWebdatasetFactory.prepare_dataset()acceptsenable_sample_tables=True(default — backwards compatible), passes through toSqliteIndexWriterAggregator.energon prepare --no-sample-tablesexposes the same option. Mutually exclusive with--tar-index-only(which operates on an already-prepared dataset, different semantics).append_samples/append_partsbecome no-ops when the writer was constructed withenable_sample_tables=False, so producers can keep emitting rows unmodified.SqliteIndexWriteralways drops stalesamples/sample_partswhenreset_tables=True, regardless of whether the new run intends to repopulate them. Avoids stale tables when re-preparing with--no-sample-tablesafter a previous full prepare.Out of scope
Sample-key lookups (
WebdatasetFileStore/SqliteITarEntryReader), polylithic joins (whichINNER JOIN ... ON samples.sample_key = ...at metadataset-prepare time), and aux-data access onCrudeSampledatasets all rely on the SQLite sample tables and will not work for datasets prepared with--no-sample-tables. That's the intended trade-off — documented in the new docstring/help text. Failures are loud (clear SQLiteno such tableerrors), not silent.A larger follow-up would also skip emitting
IndexSample/IndexSamplePartitems from_preprocess_tar(saving IPC overhead in addition to SQLite cost). Out of scope here.Testing
All four existing
test_prepare_dataset*tests pass (backwards compatibility).Two new tests:
test_prepare_dataset_no_sample_tables— covers the prepare path:--no-sample-tablesprepare succeeds;.info.jsonand per-tar.tar.idxare produced.index.sqliteexists but does not contain thesamples/sample_partstables.--no-sample-tables+--tar-index-onlyis rejected with the documentedUsageError.SqliteIndexReaderagainst the prepared dataset and calling sample-key methods (get_sample_pointer_by_key,get_sample_count,get_sample_part) raisesMissingSamplesTableErrorwith a descriptive message — not the rawsqlite3.OperationalError: no such table: samples.test_prepare_dataset_no_sample_tables_save_restore— covers the training-time checkpoint/resume path on a--no-sample-tablesdataset:save_state_rank(); continue iterating to capture the next 10 (post_save).restore_state_rank(state), iterate 10 samples (post_restore).first_half + post_save == referenceandpost_restore == post_save.ShardInfosITarReader+SliceState(the integer-indexed loader/state path) work end-to-end on a dataset prepared without the SQLite samples tables.ruff checkandruff format --checkclean on all touched files.