Example: out-of-core minibatch variational inference with DataLoader and Trainer by YichengYang-Ethan · Pull Request #888 · pymc-devs/pymc-examples

YichengYang-Ethan · 2026-06-09T03:43:55Z

Draft for mentor review (GSoC 2026, Streaming Variational Inference). Pairs with pymc-devs/pymc#8325. Not for merge yet.

A how-to notebook for out-of-core minibatch variational inference with the new pymc.variational.streaming API: a DataLoader over parquet_source, a pm.Data placeholder, and the callback-free Trainer.

It writes a synthetic logistic dataset to Parquet shards, deletes the combined in-memory table (X and y stay only for the in-RAM baseline comparison), streams minibatches off disk with ADVI, and shows that the streaming posterior agrees with an in-RAM pm.Minibatch fit on the toy problem while peak memory stays flat in N. Both fits and both posterior draws are seeded, so the comparison is reproducible.

The memory-at-scale numbers come from a separate run of the same model on the public Criteo 1TB benchmark. The notebook itself runs end to end in seconds, and the outputs and figures are committed.

Open question: where the example data should live for a fully self-contained run (a download script vs HuggingFace hosting). Happy to raise this in Discord.

Minibatch ADVI fed from Parquet shards on disk (flat memory), shown equivalent to in-RAM pm.Minibatch, with ELBO / posterior / memory figures and the shuffling caveat. Paired .ipynb + .myst.md.

Update the out-of-core variational inference example to the reworked streaming API: a DataLoader over parquet_source, a pm.Data placeholder, and the callback-free Trainer (replacing the removed StreamingDataset and fit_callback). Replace the non-reproducible private-data memory note with measured peak-RSS figures on the public Criteo benchmark. The notebook executes end-to-end; outputs and figures regenerated.

review-notebook-app · 2026-06-09T03:44:00Z

Check out this pull request on ReviewNB to see visual diffs and provide feedback on Jupyter Notebooks.

Set layout=tight at figure creation instead of calling tight_layout on a constrained-layout figure, which emitted a warning into the committed outputs. Declare pyarrow via extra_dependencies and the standard install note. State that the Criteo numbers come from a run outside the notebook, fix the out-of-memory extrapolation to follow from the stated measurements, and correct two sentences that overstated what stays in memory.

- pass random_seed to the in-RAM pm.fit and to both posterior sample calls so the comparison is reproducible
- the streaming memory line now includes the shuffle buffer and the current source chunk, matching the surrounding prose
- say 'no callbacks to write' (the Trainer uses pm.fit's hook internally) and 'can bias' for the bounded-buffer caveat
- note that the dropped end-of-pass remainder is re-drawn each epoch under shuffle=True
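To make the bounded-buffer caveat and the dropped remainder concrete, here is a minimal stdlib sketch of a shuffle buffer feeding fixed-size minibatches. It is an illustration only, not the DataLoader implementation from pymc-devs/pymc#8325; the names `buffered_rows` and `minibatches` and the parameters `buffer_size` and `seed` are hypothetical.

```python
import random
from typing import Iterable, Iterator, List


def buffered_rows(rows: Iterable[int], buffer_size: int, rng: random.Random) -> Iterator[int]:
    """Emit rows in approximately shuffled order using a bounded buffer."""
    buffer: List[int] = []
    for row in rows:
        buffer.append(row)
        if len(buffer) == buffer_size:
            # Swap a random element to the end and emit it, so memory stays
            # bounded by buffer_size no matter how long the stream is.
            i = rng.randrange(buffer_size)
            buffer[i], buffer[-1] = buffer[-1], buffer[i]
            yield buffer.pop()
    # End of stream: flush whatever is still buffered, in random order.
    rng.shuffle(buffer)
    yield from buffer


def minibatches(
    rows: Iterable[int], batch_size: int, buffer_size: int, seed: int
) -> Iterator[List[int]]:
    """Yield full minibatches only; the incomplete final batch is dropped."""
    rng = random.Random(seed)
    batch: List[int] = []
    for row in buffered_rows(rows, buffer_size, rng):
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    # Any rows left in `batch` here are the dropped end-of-pass remainder.


# One pass over 10 rows: two batches of 4, and 2 rows are dropped.
epoch = list(minibatches(range(10), batch_size=4, buffer_size=3, seed=0))
flat = [row for batch in epoch for row in batch]
```

In this sketch the row at output position k can only come from the first `buffer_size + k` source rows, which is why a small buffer over sorted or clustered shards can bias early minibatches. Using a different seed for each pass changes which rows end up in the dropped remainder.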

The largest streaming vs in-RAM posterior gap was 0.12 on a feature coefficient, not ~0.1 on the intercept (the intercept gap was 0.01), and the comparison ran on a 1M-row slice; the OOM extrapolation from the fitted slope is ~240M rows, not 250M.
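The arithmetic behind an extrapolation of this kind can be sketched as a straight-line fit of peak memory against row count, solved for the row count at which the line reaches the machine's memory limit. The measurements and the 16 GiB limit below are made-up placeholders chosen to keep the arithmetic easy to check; they are not the figures behind the ~240M estimate, and `fit_line` and `rows_at_limit` are hypothetical helper names.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rows_at_limit(slope: float, intercept: float, limit: float) -> float:
    """Row count at which the fitted line reaches the memory limit."""
    return (limit - intercept) / slope


# Hypothetical measurements: row counts in millions, peak RSS in GiB.
rows_millions = [1.0, 2.0, 4.0]
peak_gib = [0.75, 1.0, 1.5]

slope, intercept = fit_line(rows_millions, peak_gib)
oom_rows_millions = rows_at_limit(slope, intercept, limit=16.0)
```

With these placeholder inputs the slope is 0.25 GiB per million rows and the intercept is 0.5 GiB, so the line crosses 16 GiB at 62 million rows. The point of the sketch is only that the extrapolated row count has to follow from the stated slope, intercept, and limit.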

The artifact shows two near-zero coefficients flipping sign between the two stochastic fits, so 'agreed coefficient for coefficient' overstated it; state the max gap against the coefficient scale and disclose the sign flips. parquet_source reads row groups, not whole shards.
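One way to report agreement in these terms is to compute the largest absolute gap, express it relative to the largest coefficient magnitude, and list the coefficients whose signs differ. The sketch below does that for two lists of posterior means; the numbers are invented for illustration and are not the notebook's results, and `compare_fits` is a hypothetical helper.

```python
from typing import Dict, List, Union


def compare_fits(
    streaming: List[float], in_ram: List[float]
) -> Dict[str, Union[float, int, List[int]]]:
    """Summarize how two sets of posterior means differ."""
    gaps = [abs(s - r) for s, r in zip(streaming, in_ram)]
    worst = max(range(len(gaps)), key=gaps.__getitem__)
    # Use the largest in-RAM coefficient magnitude as the reference scale.
    scale = max(abs(r) for r in in_ram)
    # A sign flip is a pair of estimates on opposite sides of zero.
    sign_flips = [i for i, (s, r) in enumerate(zip(streaming, in_ram)) if s * r < 0]
    return {
        "max_gap": gaps[worst],
        "max_gap_index": worst,
        "relative_gap": gaps[worst] / scale,
        "sign_flips": sign_flips,
    }


# Invented posterior means, for illustration only.
streaming_means = [0.5, -1.0, 0.02, 1.5]
in_ram_means = [0.5, -1.25, -0.03, 1.5]
report = compare_fits(streaming_means, in_ram_means)
```

Here the largest gap (0.25) is on coefficient 1, while the sign flip is on coefficient 2, whose estimates are both close to zero. Reporting the two separately avoids implying that a small maximum gap means every coefficient agrees.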

Plainer section headers (Write the dataset to disk; Compare with in-RAM pm.Minibatch), walk-through connectors, and a tighter stream-and-fit lead-in that no longer repeats the intro. Markdown only; outputs unchanged.

On the 1M-row slice two near-zero slopes flip sign and the streaming estimates sit ~5 sd from zero, so 'two near-zero coefficients differed in sign' understated it; say plainly that the weak slopes disagree. The memory paragraph now notes the figure ignores the np.concatenate copy, and the comparison paragraph states the fits share everything but the minibatch source.

YichengYang-Ethan added 2 commits June 5, 2026 11:05

Add out-of-core StreamingDataset variational inference example

ef86f63

YichengYang-Ethan mentioned this pull request Jun 9, 2026

Streaming variational inference: out-of-core DataLoader for minibatch ADVI pymc-devs/pymc#8325

Draft

YichengYang-Ethan added 10 commits June 8, 2026 22:49

Apply pre-commit formatting (black line-length 100, jupytext sync)

aa8be27

Clean up memory figure layout (title/legend/annotation overlap)

505382a

Tighten example prose

26ea088

Trim code comments and prose emphasis

9d49f19

Use the bibliography directive and standard watermark section

2e79e44

Drop the redundant ADVI tag

52d15cd

State the memory behavior precisely

015cda3

Attribute the agreement numbers to the right comparison

e413b2e

State the loader memory behavior precisely in the intro

2d9771e

YichengYang-Ethan force-pushed the streaming-dataset-example branch from 12f92fc to 5d13f6e June 11, 2026 05:03

YichengYang-Ethan added 5 commits June 11, 2026 10:50

Match the example's prose to the pymc-examples house style

712ed93

YichengYang-Ethan mentioned this pull request Jun 23, 2026

Streaming variational inference: out-of-core DataLoader for minibatch ADVI pymc-devs/pymc-extras#698

Draft

YichengYang-Ethan wants to merge 17 commits into pymc-devs:main from YichengYang-Ethan:streaming-dataset-example
