
Change sample_posterior_predictive API wrt to volatility #8209

Merged
ricardoV94 merged 2 commits into pymc-devs:v6 from ricardoV94:sample_posterior_predictive
Apr 27, 2026

Conversation

@ricardoV94
Member

@ricardoV94 ricardoV94 commented Mar 21, 2026

Keep in mind this is like our second most important function in PyMC, and it was easily a footgun whenever var_names was involved or there were unobserved RVs that depended on modified data/coordinates (see #7069).

After this change, RVs present in the trace are never resampled by default, even if that may leave the graph invalid (in shape or semantics). A warning is issued, directing users to explicitly add variables "deemed volatile" to sample_vars (to resample them) or freeze_vars (to confirm this was intended).

Regular uses that don't touch var_names should behave the same as before. After these changes I think var_names becomes very intuitive (and less important as well).

Main knobs

  • sample_vars — variables to force sampling. RVs present in the trace will never be resampled unless they show up here. Deterministics are still recomputed automatically based on volatility, but can be included here in case the volatility analysis wouldn't trigger them (there's a pm.do use case in the tests/docstrings). Variables that don't exist in the trace (including observed variables) are always sampled, but can be added here for completeness.
  • freeze_vars — variables to explicitly reuse from the trace. Should also be used for RVs in the trace that seem volatile but are intended not to be resampled. Also works for deterministics — e.g. the HSGP-style case of a training-time mean-centering that shouldn't be rederived from prediction data.
  • ImplicitFreezeWarning in pymc.exceptions — emitted when a trace RV is deemed volatile but is not explicitly in sample_vars or freeze_vars. This is the knob that alerts users they need to make a decision, reducing silent "footguns".
  • var_names is strictly output-gating now: it just determines which variables appear in the returned trace.
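The rules above can be sketched as a toy, pure-Python volatility walk. This is a minimal sketch of the described semantics, not PyMC's actual implementation: plan_resampling, the deps dict, and the variable names are all hypothetical, and the special automatic handling of deterministics is ignored.

```python
import warnings


class ImplicitFreezeWarning(UserWarning):
    """Stand-in for the pymc.exceptions.ImplicitFreezeWarning added by this PR."""


def plan_resampling(deps, trace_vars, changed, sample_vars=(), freeze_vars=()):
    """Toy sketch of the volatility rules (hypothetical helper, not PyMC API).

    deps maps each variable name to its parents and is assumed to be listed
    in topological (parents-first) order. changed holds modified data/coords.
    Returns the set of variables that will be (re)sampled.
    """
    volatile = set(changed)
    resample = set()
    for var, parents in deps.items():
        parent_volatile = any(p in volatile for p in parents)
        if var not in trace_vars:
            # Variables absent from the trace are always sampled.
            resample.add(var)
            volatile.add(var)
        elif var in sample_vars:
            # Explicitly requested: force resampling.
            resample.add(var)
            volatile.add(var)
        elif var in freeze_vars:
            pass  # explicitly reuse trace values; volatility stops propagating here
        elif parent_volatile:
            # Deemed volatile but no explicit user decision: keep frozen and warn.
            warnings.warn(
                f"{var} is deemed volatile; add it to sample_vars or freeze_vars",
                ImplicitFreezeWarning,
            )
    return resample
```

For example, with deps = {"data": (), "mu": ("data",), "x": ("mu",), "y": ("x",)}, trace_vars = {"mu", "x"}, and changed = {"data"}: data and y are sampled (absent from the trace), mu triggers an ImplicitFreezeWarning (its parent data changed), and x is silently reused because its frozen parent mu cuts the volatility propagation.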

Other changes

Closes #7069
Closes #6876
Closes #6614

TODO:

  • clean code
  • clean tests

@ricardoV94 ricardoV94 force-pushed the sample_posterior_predictive branch from 8a6137a to 9999e3d on March 23, 2026 15:56
@codecov

codecov Bot commented Mar 23, 2026

Codecov Report

❌ Patch coverage is 97.31544% with 4 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (v6@733088a). Learn more about missing BASE report.

Files with missing lines    Patch %    Lines
pymc/sampling/forward.py    97.27%     4 Missing ⚠️
Additional details and impacted files


@@          Coverage Diff          @@
##             v6    #8209   +/-   ##
=====================================
  Coverage      ?   91.76%           
=====================================
  Files         ?      124           
  Lines         ?    20131           
  Branches      ?        0           
=====================================
  Hits          ?    18473           
  Misses        ?     1658           
  Partials      ?        0           
Files with missing lines    Coverage Δ
pymc/exceptions.py          100.00% <100.00%> (ø)
pymc/sampling/forward.py    96.77% <97.27%> (ø)

@ricardoV94 ricardoV94 added major Include in major changes release notes section samplers labels Mar 23, 2026
@ricardoV94 ricardoV94 force-pushed the sample_posterior_predictive branch from 9999e3d to 4f7810c on March 23, 2026 17:42
@ricardoV94 ricardoV94 changed the title from Finer grained sample_posterior_predictive to Change sample_posterior_predictive API wrt to volatility on Mar 23, 2026
@ricardoV94 ricardoV94 force-pushed the sample_posterior_predictive branch from 4f7810c to 6025611 on April 3, 2026 11:17
@ricardoV94 ricardoV94 force-pushed the sample_posterior_predictive branch from 6025611 to c3eb5fd on April 3, 2026 15:46
@ricardoV94 ricardoV94 added this to the v6 milestone Apr 8, 2026
@ricardoV94 ricardoV94 force-pushed the sample_posterior_predictive branch 4 times, most recently from c56451f to c997f8e on April 23, 2026 14:38
@ricardoV94 ricardoV94 force-pushed the sample_posterior_predictive branch 2 times, most recently from 32f1ce6 to 7fdaffd on April 23, 2026 15:57
Comment thread pymc/sampling/forward.py Outdated
@ricardoV94 ricardoV94 force-pushed the sample_posterior_predictive branch from 7fdaffd to 5d85bee on April 23, 2026 22:50
@ricardoV94 ricardoV94 marked this pull request as ready for review April 23, 2026 23:14
@ricardoV94 ricardoV94 force-pushed the sample_posterior_predictive branch from 5d85bee to a6c7134 on April 24, 2026 00:03
@ricardoV94 ricardoV94 force-pushed the sample_posterior_predictive branch 2 times, most recently from cbf271e to 73f2f5e on April 24, 2026 00:13
@ricardoV94 ricardoV94 requested review from lucianopaz and zaxtax April 24, 2026 00:31
@ricardoV94 ricardoV94 marked this pull request as draft April 24, 2026 00:32
@ricardoV94 ricardoV94 force-pushed the sample_posterior_predictive branch 3 times, most recently from a35c162 to c5fd97f on April 24, 2026 12:00
@ricardoV94 ricardoV94 marked this pull request as ready for review April 24, 2026 12:05
Comment thread pymc/sampling/forward.py
@ricardoV94
Member Author

There is one test failing due to pymc-devs/pytensor#2051

Verified that the patch fixes it.

Our step samplers don't eval before multiprocess splitting, so numba compiles in each process, which is wasteful and surfaces the bug here.

We'll just wait, but we may want to tweak the steps to eval once after compile.

The regular NUTS pipeline does eval before the split (in the NUTS init jitter).

@ricardoV94 ricardoV94 force-pushed the sample_posterior_predictive branch from f5d5eb3 to 83975e3 on April 24, 2026 21:31
Member

@jessegrabowski jessegrabowski left a comment


General comment: the process of walking the fgraph and classifying nodes as volatile or not is exactly the function of a Feature. Is there a reason why you didn't use one here?

Member

@jessegrabowski jessegrabowski left a comment


Some comments.

With this refactor I think we should really think about renaming this function to something like conditional_resample, then implement sample_posterior_predictive as a special case that only generates the posterior predictive. My logic is that the functionality we now offer goes well beyond just sampling the posterior predictive distribution, which is a well-defined object in the literature.

Comment thread pymc/sampling/forward.py
Comment thread pymc/sampling/forward.py
Comment thread pymc/sampling/forward.py Outdated
# Default copies old `x**2` values from the trace.
pm.sample_posterior_predictive(idata, var_names=["det"], **kwargs)
# Force recomputation using the new `x**3` graph.
pm.sample_posterior_predictive(idata, var_names=["det"], sample_vars=["det"], **kwargs)
Member


use sample_vars without var_names in this example?

Member Author


I think this is easier to follow. I'll add a note saying var_names wasn't needed, as it would have defaulted to the same.

Member Author


Sorry, I thought this was on the section on volatility. Agree. Also removed the first example. The nudge being that users shouldn't think about var_names (or at least not first), but sample_vars.

Comment thread pymc/sampling/forward.py
Comment thread pymc/sampling/forward.py Outdated
Note: The samples of ``y`` are equivalent to its prior, since it does not depend on any other variables.
In contrast, the samples of ``z`` and ``det`` depend on the new samples of ``y`` and the posterior samples of
``x`` found in the trace.
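The quoted note can be illustrated with a numpy sketch. The concrete distributions and the "trace" values below are made up for illustration; only the dependency structure mirrors the docstring's x, y, z, and det.

```python
import numpy as np

rng = np.random.default_rng(42)
n_draws = 1000

# Posterior draws of x, reused from the trace (simulated here).
x_post = rng.normal(1.0, 0.1, size=n_draws)

# y has no parents, so resampling it is just drawing from its prior.
y_new = rng.normal(0.0, 1.0, size=n_draws)

# z and det combine the fresh y draws with the trace's posterior x.
z_new = rng.normal(loc=y_new + x_post, scale=1.0)
det_new = y_new * x_post
```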
Member


We might want this to be in a .. warning:: directive

Member Author


okay

Member Author


Done, but it reads a bit weird with the 'In contrast' part coming after the .. warning::?

Comment thread pymc/sampling/forward.py
Comment thread tests/model/transform/test_conditioning.py
@ricardoV94
Member Author

ricardoV94 commented Apr 27, 2026

General comment: the process of walking the fgraph and classifying nodes as volatile or not is exactly the function of a Feature, is there a reason why you didn't use one here?

A Feature is specifically about adding/keeping meta information while a graph is being rewritten; we are not doing any graph rewrite here, just a snapshot analysis. I can't think of any angle where it would help.

@ricardoV94
Member Author

ricardoV94 commented Apr 27, 2026

With this refactor I think we should really think about renaming this function to something like conditional_resample then implement sample_posterior_predictive as a special case that only generates the posterior predictive. My logic is that the functionality we now offer goes well beyond just sampling the posterior predictive distribution, which is a well-defined object in the literature.

This was discussed in the original issue (and I think more internally) and rejected. Back then I was advocating for it, although I didn't have as good a name as yours (except the re- in resample, as we are not always "re-sampling").

But now I don't agree so much. What is the strict criterion for distinguishing whether this is a sample_posterior_predictive or not? The freeze_vars is not, imo. Missing vars is not (think of an extra group in hierarchical models). Discarding trace variables? Again it depends; maybe you are discarding because you are sampling new coordinates only, and you are just telling the function that.

So... I'm inclined to think there is no obvious set of arguments/restrictions that would constrain this function to only do "sample posterior predictive". A bit like "sample" may very well be used to get prior / prior_predictive draws (or compute deterministics), the meaning of the operation is not something you can pinpoint from the computation being run.

If you agree with this, the argument becomes: why should we have a function with another name that has exactly the same signature/behavior? Since for most users the function does what they want, let the power users make their own alias if they want, and keep the official naming simple.

@jessegrabowski
Member

This was discussed in the original issue (and I think more internally) and rejected, Back then I was advocating for it, although I didn't have as good a name as yours (except the re- in resample, as we are not always "re-sampling").

I don't agree that the ppd is hard to pinpoint though. It's defined by the fact that we're sampling the observed distribution, conditioned on the posterior samples -- it's the base case. The advantage would be to hide complexity from the median user. That said, it is likely sufficient that pm.sample_posterior_predictive(idata) does the right thing.

If this was already discussed let's move on instead of re-litigating it ad nauseam. I accept the judgement of the collective.

@ricardoV94
Member Author

ricardoV94 commented Apr 27, 2026

It's defined by the fact that we're sampling the observed distribution, conditioned on the posterior samples -- it's the base case.

As long as you use even one sample from the posterior it fits the mold. I don't think it must use all samples from the posterior (trivial case: a dead path), or that it can't introduce new variables (e.g., marginalized variables being recovered, a stochastic approximation to the posterior, a conjugate posterior with a closed-form solution).

Also maybe you didn't mark your variable as observed from PyMC's standpoint (but you know this is your observation). I don't think the meaning can be decided in general.

@ricardoV94 ricardoV94 force-pushed the sample_posterior_predictive branch 2 times, most recently from cd189b9 to 0925446 on April 27, 2026 14:59
@ricardoV94 ricardoV94 force-pushed the sample_posterior_predictive branch from 0925446 to d96c1da on April 27, 2026 16:11
@ricardoV94 ricardoV94 merged commit 39158be into pymc-devs:v6 Apr 27, 2026
42 checks passed
