Example: out-of-core minibatch variational inference with DataLoader and Trainer #888
Draft
YichengYang-Ethan wants to merge 17 commits into
Minibatch ADVI fed from Parquet shards on disk (flat memory), shown equivalent to in-RAM pm.Minibatch, with ELBO / posterior / memory figures and the shuffling caveat. Paired .ipynb + .myst.md.
Update the out-of-core variational inference example to the reworked streaming API: a DataLoader over parquet_source, a pm.Data placeholder, and the callback-free Trainer (replacing the removed StreamingDataset and fit_callback). Replace the non-reproducible private-data memory note with measured peak-RSS figures on the public Criteo benchmark. The notebook executes end-to-end; outputs and figures regenerated.
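The commit message does not say how peak RSS was measured. As a minimal sketch, assuming a Unix host, the standard-library `resource` module reports the peak resident set size of the current process. The helper name `peak_rss_mib` and the 64 MiB allocation are illustrative and not taken from the notebook.

```python
import resource
import sys


def peak_rss_mib() -> float:
    """Peak resident set size of this process so far, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


before = peak_rss_mib()
# Build 64 MiB of non-zero bytes so the pages are actually resident.
block = bytes(range(256)) * (256 * 1024)
after = peak_rss_mib()
print(f"peak RSS before: {before:.1f} MiB, after: {after:.1f} MiB")
```

Because the value is a high-water mark, it never decreases. To compare a streaming fit against an in-RAM fit this way, each would need to run in its own process.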
Set layout=tight at figure creation instead of calling tight_layout on a constrained-layout figure, which emitted a warning into the committed outputs. Declare pyarrow via extra_dependencies and the standard install note. State that the Criteo numbers come from a run outside the notebook, fix the out-of-memory extrapolation to follow from the stated measurements, and correct two sentences that overstated what stays in memory.
12f92fc to 5d13f6e
- pass random_seed to the in-RAM pm.fit and to both posterior sample calls so the comparison is reproducible
- the streaming memory line now includes the shuffle buffer and the current source chunk, matching the surrounding prose
- say 'no callbacks to write' (the Trainer uses pm.fit's hook internally) and 'can bias' for the bounded-buffer caveat
- note that the dropped end-of-pass remainder is re-drawn each epoch under shuffle=True
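The bounded-buffer caveat can be illustrated without the streaming API. The sketch below is a generic reservoir-style shuffle buffer in pure Python, not the DataLoader's implementation. The function name `buffered_shuffle`, the buffer size, and the toy label stream are all assumptions made for illustration.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def buffered_shuffle(stream: Iterable[T], buffer_size: int, seed: int) -> Iterator[T]:
    """Approximate shuffle that holds at most `buffer_size` items in memory."""
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in stream:
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling the buffer
            continue
        # Emit a random buffered item and put the new item in its slot.
        slot = rng.randrange(buffer_size)
        yield buffer[slot]
        buffer[slot] = item
    rng.shuffle(buffer)  # drain what is left at the end of the pass
    yield from buffer


# Hypothetical toy stream: labels sorted on disk, 500 zeros then 500 ones.
labels = [0] * 500 + [1] * 500
shuffled = list(buffered_shuffle(labels, buffer_size=50, seed=0))

first_batch = shuffled[:128]
print(sum(first_batch) / len(first_batch))  # fraction of ones in the first minibatch
```

With a buffer of 50, the item emitted at output position p can come from no later than input position p + 49. The first minibatch of 128 therefore contains only zeros, whatever the seed. That is the sense in which a bounded buffer can bias minibatches when the data on disk are ordered.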
The largest streaming vs in-RAM posterior gap was 0.12 on a feature coefficient, not ~0.1 on the intercept (the intercept gap was 0.01), and the comparison ran on a 1M-row slice; the OOM extrapolation from the fitted slope is ~240M rows, not 250M.
The artifact shows two near-zero coefficients flipping sign between the two stochastic fits, so 'agreed coefficient for coefficient' overstated it; state the max gap against the coefficient scale and disclose the sign flips. parquet_source reads row groups, not whole shards.
Plainer section headers (Write the dataset to disk; Compare with in-RAM pm.Minibatch), walk-through connectors, and a tighter stream-and-fit lead-in that no longer repeats the intro. Markdown only; outputs unchanged.
On the 1M-row slice two near-zero slopes flip sign and the streaming estimates sit ~5 sd from zero, so 'two near-zero coefficients differed in sign' understated it; say plainly that the weak slopes disagree. The memory paragraph now notes the figure ignores the np.concatenate copy, and the comparison paragraph states the fits share everything but the minibatch source.
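The np.concatenate remark refers to the fact that concatenation allocates a new array rather than reusing its inputs. A small sketch, with hypothetical chunk shapes standing in for data read from disk, makes the extra copy visible:

```python
import numpy as np

# Hypothetical stand-in for chunks read one at a time from disk.
chunks = [np.ones((1000, 8)) for _ in range(4)]
combined = np.concatenate(chunks)

chunk_bytes = sum(c.nbytes for c in chunks)
print(combined.nbytes == chunk_bytes)         # the copy is as large as all chunks together
print(np.shares_memory(combined, chunks[0]))  # the result does not reuse chunk memory
```

While both the chunk list and the combined array are alive, roughly twice the data size is held in memory. A memory figure that counts only the final array therefore understates the in-RAM peak.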
Draft for mentor review (GSoC 2026, Streaming Variational Inference). Pairs with pymc-devs/pymc#8325. Not for merge yet.
A how-to notebook for out-of-core minibatch variational inference with the new pymc.variational.streaming API: a DataLoader over parquet_source, a pm.Data placeholder, and the callback-free Trainer.

It writes a synthetic logistic dataset to Parquet shards, deletes the combined in-memory table (X and y stay only for the in-RAM baseline comparison), streams minibatches off disk with ADVI, and shows that the streaming posterior agrees with an in-RAM pm.Minibatch fit on the toy problem while peak memory stays flat in N. Both fits and both posterior draws are seeded, so the comparison is reproducible.

The memory-at-scale numbers come from a separate run of the same model on the public Criteo 1TB benchmark. The notebook itself runs end to end in seconds, and the outputs and figures are committed.
Open question: where the example data should live for a fully self-contained run (a download script vs HuggingFace hosting). Happy to raise this in Discord.