Example: out-of-core minibatch variational inference with DataLoader and Trainer #888

Draft
YichengYang-Ethan wants to merge 17 commits into pymc-devs:main from YichengYang-Ethan:streaming-dataset-example

YichengYang-Ethan (Contributor) commented Jun 9, 2026

Draft for mentor review (GSoC 2026, Streaming Variational Inference). Pairs with pymc-devs/pymc#8325. Not for merge yet.

A how-to notebook for out-of-core minibatch variational inference with the new pymc.variational.streaming API: a DataLoader over parquet_source, a pm.Data placeholder, and the callback-free Trainer.

It writes a synthetic logistic dataset to Parquet shards and deletes the combined in-memory table (X and y stay only for the in-RAM baseline comparison). It then streams minibatches off disk with ADVI and shows that the streaming posterior agrees with an in-RAM pm.Minibatch fit on the toy problem while peak memory stays flat in N. Both fits and both posterior draws are seeded, so the comparison is reproducible.

The memory-at-scale numbers come from a separate run of the same model on the public Criteo 1TB benchmark. The notebook itself runs end to end in seconds, and the outputs and figures are committed.

Open question: where the example data should live for a fully self-contained run (a download script vs HuggingFace hosting). Happy to raise this in Discord.

Minibatch ADVI fed from Parquet shards on disk (flat memory), shown
equivalent to in-RAM pm.Minibatch, with ELBO / posterior / memory
figures and the shuffling caveat. Paired .ipynb + .myst.md.

Update the out-of-core variational inference example to the reworked
streaming API: a DataLoader over parquet_source, a pm.Data placeholder, and
the callback-free Trainer (replacing the removed StreamingDataset and
fit_callback). Replace the non-reproducible private-data memory note with
measured peak-RSS figures on the public Criteo benchmark. The notebook
executes end-to-end; outputs and figures regenerated.
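
One way to take a peak-RSS reading like the ones mentioned in this commit is the standard-library `resource` module. This is a minimal sketch under stated assumptions, not the script used for the Criteo run: `peak_rss_mib` is a hypothetical helper name, the 64 MiB allocation is an arbitrary stand-in for a model fit, and `resource` is only available on Unix-like systems.

```python
import resource
import sys


def peak_rss_mib() -> float:
    """Return the peak resident set size of this process so far, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kibibytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


before = peak_rss_mib()
# Stand-in workload (hypothetical): build a 64 MiB buffer of non-zero bytes.
block = bytearray(b"\x01") * (64 * 1024 * 1024)
after = peak_rss_mib()

print(f"peak RSS before: {before:.1f} MiB, after: {after:.1f} MiB")
```

Because the value is a high-water mark for the whole process, it never decreases, so comparing a streaming fit with an in-RAM fit is cleaner when each one runs in its own fresh process.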

Set layout=tight at figure creation instead of calling tight_layout on
a constrained-layout figure, which emitted a warning into the committed
outputs. Declare pyarrow via extra_dependencies and the standard
install note. State that the Criteo numbers come from a run outside the
notebook, fix the out-of-memory extrapolation to follow from the stated
measurements, and correct two sentences that overstated what stays in
memory.

YichengYang-Ethan force-pushed the streaming-dataset-example branch from 12f92fc to 5d13f6e on June 11, 2026 05:03

- pass random_seed to the in-RAM pm.fit and to both posterior sample
  calls so the comparison is reproducible
- the streaming memory line now includes the shuffle buffer and the
  current source chunk, matching the surrounding prose
- say 'no callbacks to write' (the Trainer uses pm.fit's hook
  internally) and 'can bias' for the bounded-buffer caveat
- note that the dropped end-of-pass remainder is re-drawn each epoch
  under shuffle=True
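
The bounded-buffer and dropped-remainder points in the list above can be illustrated without any PyMC code. The sketch below is a generic shuffle buffer written for illustration, not the `DataLoader` implementation: `shuffle_buffer`, `minibatches`, and `run_epoch` are hypothetical names, and the row count, buffer size, and batch size are arbitrary.

```python
import random
from typing import Iterable, Iterator, List, Tuple


def shuffle_buffer(rows: Iterable[int], buffer_size: int, rng: random.Random) -> Iterator[int]:
    """Yield rows in approximately shuffled order using a bounded buffer."""
    buffer: List[int] = []
    for row in rows:
        if len(buffer) < buffer_size:
            buffer.append(row)
            continue
        # Buffer is full: emit a random resident and store the new row in its slot.
        slot = rng.randrange(buffer_size)
        yield buffer[slot]
        buffer[slot] = row
    # Source exhausted: drain the remaining rows in random order.
    rng.shuffle(buffer)
    yield from buffer


def minibatches(rows: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Group rows into full batches; a partial batch at the end is dropped."""
    batch: List[int] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []


def run_epoch(n_rows: int, buffer_size: int, batch_size: int, seed: int) -> Tuple[List[List[int]], List[int]]:
    """Run one pass and report the batches plus the rows that were dropped."""
    rng = random.Random(seed)
    stream = shuffle_buffer(range(n_rows), buffer_size, rng)
    batches = list(minibatches(stream, batch_size))
    seen = {row for batch in batches for row in batch}
    dropped = sorted(set(range(n_rows)) - seen)
    return batches, dropped


# Two epochs over the same 103 rows, each with its own shuffle seed.
batches_a, dropped_a = run_epoch(n_rows=103, buffer_size=16, batch_size=10, seed=0)
batches_b, dropped_b = run_epoch(n_rows=103, buffer_size=16, batch_size=10, seed=1)
first_batch_max = max(batches_a[0])

print("rows dropped in epoch A:", dropped_a)
print("rows dropped in epoch B:", dropped_b)
print("largest row index in the first batch:", first_batch_max)
```

With a 16-row buffer, the k-th emitted row (counting from 1) can only come from the first 15 + k source rows, so the first batch of 10 never contains a row index above 24. That locality is why a bounded buffer can bias minibatches when the on-disk order is correlated with the data. Each epoch drops 3 of the 103 rows, and which 3 depends on that epoch's shuffle.
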

The largest streaming vs in-RAM posterior gap was 0.12 on a feature
coefficient, not ~0.1 on the intercept (the intercept gap was 0.01),
and the comparison ran on a 1M-row slice; the OOM extrapolation from
the fitted slope is ~240M rows, not 250M.

The artifact shows two near-zero coefficients flipping sign between the
two stochastic fits, so 'agreed coefficient for coefficient' overstated
it; state the max gap against the coefficient scale and disclose the
sign flips. parquet_source reads row groups, not whole shards.
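
Reporting the max gap against the coefficient scale, together with any sign flips, can be sketched as below. The helper `compare_coefficients` is a hypothetical name, and the two coefficient vectors are made-up illustrative values, not the notebook's posterior means.

```python
from typing import Dict, List, Sequence, Union


def compare_coefficients(a: Sequence[float], b: Sequence[float]) -> Dict[str, Union[float, int, List[int]]]:
    """Summarise how two sets of coefficient estimates differ."""
    if len(a) != len(b):
        raise ValueError("coefficient vectors must have the same length")
    gaps = [abs(x - y) for x, y in zip(a, b)]
    max_gap = max(gaps)
    # Scale: the largest coefficient magnitude seen in either fit.
    scale = max(abs(v) for v in list(a) + list(b))
    # A sign flip is a pair of estimates with strictly opposite signs.
    sign_flips = [i for i, (x, y) in enumerate(zip(a, b)) if x * y < 0]
    return {
        "max_gap": max_gap,
        "argmax": gaps.index(max_gap),
        "relative_gap": max_gap / scale,
        "sign_flips": sign_flips,
    }


# Hypothetical posterior means, for illustration only.
streaming = [1.50, -2.00, 0.03, -0.02, 0.75]
in_ram = [1.45, -2.10, -0.02, 0.01, 0.75]

report = compare_coefficients(streaming, in_ram)
print(report)
```

In this made-up example the largest gap sits on a large coefficient and is small relative to the scale, while the two sign flips occur on coefficients close to zero. A single "max gap" figure would hide those flips, which is why both numbers are reported.
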

Plainer section headers (Write the dataset to disk; Compare with in-RAM
pm.Minibatch), walk-through connectors, and a tighter stream-and-fit
lead-in that no longer repeats the intro. Markdown only; outputs
unchanged.

On the 1M-row slice two near-zero slopes flip sign and the streaming
estimates sit ~5 sd from zero, so 'two near-zero coefficients differed
in sign' understated it; say plainly that the weak slopes disagree. The
memory paragraph now notes the figure ignores the np.concatenate copy,
and the comparison paragraph states the fits share everything but the
minibatch source.