Streaming variational inference: out-of-core DataLoader for minibatch ADVI #698

Draft
YichengYang-Ethan wants to merge 1 commit into pymc-devs:main from YichengYang-Ethan:streaming-dataset

@YichengYang-Ethan

Adds a streaming data layer for minibatch variational inference on data that doesn't fit in memory. Ported from pymc-devs/pymc#8325 — moving it here per the discussion on that PR (@ricardoV94's call to start in extras).

pm.Minibatch indexes an in-memory array, so peak memory is O(N). This streams minibatches from an out-of-core source into a pm.Data placeholder instead, so peak memory is set by the batch, the source chunk, and the optional shuffle buffer — independent of N.
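The memory claim above can be illustrated with a minimal sketch of re-slicing variable-size source chunks into fixed-size batches. This is not the code in this PR: `iter_fixed_batches` and `fake_chunks` are hypothetical names, rows are plain floats instead of arrays, and dropping the trailing partial batch is an assumption made here to keep batch shapes constant.

```python
from typing import Iterable, Iterator, List, Sequence


def iter_fixed_batches(
    chunks: Iterable[Sequence[float]], batch_size: int
) -> Iterator[List[float]]:
    """Yield fixed-size batches from a stream of variable-size chunks (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    pending: List[float] = []  # rows read from the source but not yet emitted
    for chunk in chunks:
        pending.extend(chunk)
        # Emit as many full batches as the pending rows allow.
        while len(pending) >= batch_size:
            yield pending[:batch_size]
            pending = pending[batch_size:]
    # Assumption of this sketch: a trailing partial batch is dropped.


def fake_chunks() -> Iterator[List[float]]:
    """Stand-in for reading shards from disk one chunk at a time."""
    for start in range(0, 10, 4):
        yield [float(i) for i in range(start, min(start + 4, 10))]


batches = list(iter_fixed_batches(fake_chunks(), batch_size=3))
```

At any point the generator holds at most one chunk plus fewer than `batch_size` leftover rows, so its footprint does not grow with the total number of rows.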

The API mirrors torch.utils.data:

  • IterableDataset: a re-iterable, out-of-core source of rows (e.g. parquet_source over a directory of shards).
  • DataLoader: yields fixed-size, optionally shuffled minibatches. It is sized, with len(loader) == N (the total number of rows), so it can be passed directly as total_size.
  • shuffle_buffer: a bounded shuffle over the stream.
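For readers unfamiliar with bounded shuffling, the idea behind the last item can be sketched as follows. This is one common way to shuffle a stream in bounded memory and is not necessarily how `shuffle_buffer` in this PR is implemented; `bounded_shuffle`, `buffer_size`, and `seed` are hypothetical names.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def bounded_shuffle(
    stream: Iterable[T], buffer_size: int, seed: int = 0
) -> Iterator[T]:
    """Approximately shuffle a stream while holding at most buffer_size items."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in stream:
        if len(buffer) < buffer_size:
            # Fill phase: nothing is emitted until the buffer is full.
            buffer.append(item)
            continue
        # Buffer full: emit a random resident and store the new item in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Source exhausted: drain the remaining items in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(bounded_shuffle(range(10), buffer_size=4))
```

Every input item is emitted exactly once, and memory is capped by `buffer_size`. The trade-off is that an item can never appear more than `buffer_size - 1` positions earlier than its position in the source, so a small buffer over sorted shards only mixes rows locally.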

The unbiased-gradient rescaling reuses the existing create_minibatch_rv (the same N / batch_size as pm.Minibatch), via total_size=len(loader).
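The arithmetic behind the N / batch_size factor can be checked numerically. The sketch below does not call create_minibatch_rv or any PyMC code. It uses stand-in per-row log-likelihood terms and a hypothetical helper, `scaled_batch_sum`, to show that averaging the rescaled batch sums over a partition of the data recovers the full-data sum.

```python
import math
import random

rng = random.Random(0)
N, batch_size = 1_000, 50

# Stand-in per-row log-likelihood terms (illustrative values only).
loglik = [rng.gauss(0.0, 1.0) for _ in range(N)]
full_sum = math.fsum(loglik)


def scaled_batch_sum(batch, total_size):
    """Rescale a minibatch sum by total_size / len(batch) (hypothetical helper)."""
    return (total_size / len(batch)) * math.fsum(batch)


# Partition the rows into disjoint batches and rescale each batch sum.
batches = [loglik[i:i + batch_size] for i in range(0, N, batch_size)]
estimates = [scaled_batch_sum(b, N) for b in batches]

# A uniformly chosen batch has this expected estimate.
mean_estimate = math.fsum(estimates) / len(estimates)
```

Each individual estimate is noisy, but `mean_estimate` matches `full_sum` up to floating-point error, which is the sense in which the rescaled minibatch term is unbiased.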

Notes:

Ports the streaming data layer (IterableDataset, parquet_source, DataLoader, shuffle_buffer) from pymc-devs/pymc#8325. Self-contained numpy/pyarrow data layer with no pymc-internal coupling; public names mirror torch.utils.data. Tests moved alongside.