Streaming variational inference: out-of-core DataLoader for minibatch ADVI by YichengYang-Ethan · Pull Request #698 · pymc-devs/pymc-extras

YichengYang-Ethan · 2026-06-23T01:53:25Z

Adds a streaming data layer for minibatch variational inference on data that doesn't fit in memory. Ported from pymc-devs/pymc#8325 — moving it here per the discussion on that PR (@ricardoV94's call to start in extras).

pm.Minibatch indexes an in-memory array, so peak memory is O(N). This streams minibatches from an out-of-core source into a pm.Data placeholder instead, so peak memory is set by the batch, the source chunk, and the optional shuffle buffer — independent of N.
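To illustrate why peak memory does not depend on N, here is a minimal sketch of re-slicing a chunked stream into fixed-size batches. It is not the code in this PR: `fixed_size_batches`, `chunked_source`, and the `drop_last` behaviour are hypothetical names and assumptions chosen for illustration. At any moment it holds at most one source chunk plus fewer than `batch_size` leftover rows.

```python
from typing import Iterable, Iterator, List, Sequence


def fixed_size_batches(
    chunks: Iterable[Sequence], batch_size: int, drop_last: bool = True
) -> Iterator[List]:
    """Re-slice a stream of variable-size chunks into fixed-size batches.

    Hypothetical helper: `pending` never holds more than
    (batch_size - 1) leftover rows plus one incoming chunk.
    """
    pending: List = []
    for chunk in chunks:
        pending.extend(chunk)
        while len(pending) >= batch_size:
            yield pending[:batch_size]
            pending = pending[batch_size:]
    # Assumption: a trailing partial batch is dropped unless requested.
    if pending and not drop_last:
        yield pending


def chunked_source(n_rows: int, chunk_size: int) -> Iterator[List[int]]:
    """Stand-in for an out-of-core source: yields one chunk of rows at a time."""
    for start in range(0, n_rows, chunk_size):
        yield list(range(start, min(start + chunk_size, n_rows)))


batches = list(fixed_size_batches(chunked_source(10, 4), batch_size=3))
```

With 10 rows, chunks of 4, and a batch size of 3, this yields three full batches and drops the single leftover row.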

The API mirrors torch.utils.data:

IterableDataset — a re-iterable, out-of-core source of rows (e.g. parquet_source over a directory of shards).
DataLoader — fixed-size, optionally shuffled minibatches; sized, with len(loader) == N for total_size.
shuffle_buffer — a bounded shuffle over the stream.
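A bounded shuffle over a stream is commonly implemented with a fixed-size buffer, as sketched below. This is an assumption about the general technique, not the `shuffle_buffer` implementation in this PR; the name `bounded_shuffle` and its parameters are hypothetical. Memory is bounded by `buffer_size` rows, and the shuffle is only approximate: a row can move at most as far as the buffer lets it.

```python
import random
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def bounded_shuffle(
    stream: Iterable[T], buffer_size: int, seed: Optional[int] = None
) -> Iterator[T]:
    """Approximately shuffle a stream while holding at most `buffer_size` items."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in stream:
        if len(buffer) < buffer_size:
            # Fill phase: nothing is emitted until the buffer is full.
            buffer.append(item)
            continue
        # Emit a random buffered item and put the new item in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Drain phase: emit whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(bounded_shuffle(range(100), buffer_size=16, seed=0))
```

Every input row is emitted exactly once. A larger buffer gives a better approximation of a full shuffle at the cost of more memory, and `buffer_size=1` degenerates to the original order.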

The unbiased-gradient rescaling reuses the existing create_minibatch_rv (the same N / batch_size as pm.Minibatch), via total_size=len(loader).
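The reason for the N / batch_size factor can be checked on a toy example: scaling a minibatch sum of per-row log-likelihood terms by N / batch_size gives an estimate whose average over all equally likely batches equals the full-data sum. The sketch below uses made-up numbers and plain Python; it does not call `create_minibatch_rv`.

```python
import math
from itertools import combinations
from statistics import mean

# Made-up per-row log-likelihood terms for a dataset of N = 6 rows.
row_terms = [-0.5, -1.25, -2.0, -0.75, -3.0, -1.5]
N = len(row_terms)
batch_size = 2

full_sum = sum(row_terms)

# Enumerate every batch drawn without replacement and rescale its sum.
scaled_estimates = [
    (N / batch_size) * sum(batch) for batch in combinations(row_terms, batch_size)
]

# Averaging over all batches recovers the full-data sum, so the
# rescaled minibatch term is unbiased (and so is its gradient, by linearity).
expected_estimate = mean(scaled_estimates)
```

Here `expected_estimate` matches `full_sum` up to floating-point error, which is what passing `total_size=len(loader)` is meant to achieve for the streamed batches.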

Notes:

Lives at pymc_extras/variational/streaming.py — happy to move it.
pyarrow is an optional dependency, imported lazily only for the Parquet source.
Tests in tests/variational/. End-to-end example: "Example: out-of-core minibatch variational inference with DataLoader and Trainer" (pymc-examples#888).
The Trainer ("Streaming variational inference: Trainer for minibatch ADVI", pymc#8333) will move over as a follow-up.
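The lazy optional import mentioned in the notes can be sketched as follows. This is one common pattern, not necessarily how the PR does it: `require_optional` and `iter_parquet_rows` are hypothetical names, and the only pyarrow calls assumed are `pyarrow.parquet.ParquetFile(...).iter_batches(...)` and `RecordBatch.to_pylist()`.

```python
import importlib
from typing import Any, Dict, Iterator


def require_optional(module_name: str, feature: str) -> Any:
    """Import an optional dependency on demand, with an actionable error."""
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        raise ImportError(
            f"{feature} requires the optional dependency '{module_name}'. "
            "Install it to use this feature."
        ) from err


def iter_parquet_rows(path: str, chunk_rows: int = 1024) -> Iterator[Dict[str, Any]]:
    """Hypothetical Parquet reader: streams rows one record batch at a time.

    pyarrow is imported only when iteration starts, so importing this
    module never requires pyarrow to be installed.
    """
    pq = require_optional("pyarrow.parquet", "The Parquet source")
    for record_batch in pq.ParquetFile(path).iter_batches(batch_size=chunk_rows):
        yield from record_batch.to_pylist()
```

Users who never touch the Parquet source pay no import cost and see no error; those who do get a message naming the missing package.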

Ports the streaming data layer (IterableDataset, parquet_source, DataLoader, shuffle_buffer) from pymc-devs/pymc#8325. Self-contained numpy/pyarrow data layer with no pymc-internal coupling; public names mirror torch.utils.data. Tests moved alongside.

YichengYang-Ethan wants to merge 1 commit into pymc-devs:main from YichengYang-Ethan:streaming-dataset
