Soften SC sensitivity narrative; flag #876 reproducibility gap

drbenvincent · drbenvincent · commit d15eb3ff3e70 · 2026-04-30T15:00:39.000+01:00
The PlaceboInTime verdict on this dataset is borderline rather than a clean fail, and small MCMC sampling noise (issue #876) can flip it either side of the 0.95 threshold between runs. Update the docs to treat it as such instead of asserting a specific verdict:

- Rewrite "Interpreting the result on this dataset" to describe the outcome as borderline and explicitly point at the reproducibility note, instead of asserting NOT SUPPORTED.
- Drop the over-specific sd numbers (sd≈30, sd≈9) so the narrative stays consistent if MCMC noise shifts them slightly.
- Expand the :::note::: callout into "Known limitations being tracked" with two bullets: #875 (placebo windows that land too early) and the newly-filed #876 (unseeded pm.sample_posterior_predictive in the hierarchical-null fit).

No code or executed-output changes; the rendered cell still prints P(actual outside null) = 0.923 and NOT SUPPORTED, which the new text treats as one borderline draw rather than the headline verdict.

Refs #789, #875, #876

Made-with: Cursor
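For context on #875, the fold placement the notebook text describes (intervention length taken as `data.index.max() - treatment_time`, then shifted backward in multiples) can be sketched in a few lines. This is a rough illustration, not CausalPy's implementation: `placebo_treatment_times` and `min_pre_period` are hypothetical names, and the 100-row series treated at index 70 is an assumption chosen because it reproduces the index-12 and 41-row pre-periods quoted in the diff.

```python
def placebo_treatment_times(index_max, treatment_time, n_folds, min_pre_period=0):
    """Hypothetical helper, not part of CausalPy.

    Places placebo folds on an integer index that starts at 0.
    """
    # Intervention length as the docs describe it: index max minus treatment time.
    intervention_length = index_max - treatment_time
    # Shift the treatment time back by multiples of that length;
    # fold 1 is the earliest window, the last fold sits closest to the real treatment.
    candidates = [
        treatment_time - k * intervention_length for k in range(n_folds, 0, -1)
    ]
    # On a 0-based integer index, the pseudo-treatment time equals the
    # number of pre-period rows, so it can be compared to the cutoff directly.
    return [t for t in candidates if t >= min_pre_period]


# Assumed values: a 100-row series (index 0..99) treated at index 70.
folds = placebo_treatment_times(index_max=99, treatment_time=70, n_folds=2)
print(folds)  # [12, 41]

# An illustrative cutoff of 20 pre-period rows would skip the weak first fold.
print(placebo_treatment_times(99, 70, 2, min_pre_period=20))  # [41]
```

Under these assumptions fold 1 gets only 12 pre-treatment rows against 41 for fold 2, which is the imbalance the new "Known limitations" bullet attributes the wide null to. The `min_pre_period` filter only mirrors the idea proposed in #875; the actual parameter name and semantics are not settled by this commit.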
diff --git a/docs/source/notebooks/sc_pymc.ipynb b/docs/source/notebooks/sc_pymc.ipynb
@@ -1196,12 +1196,15 @@
     "3. **Inspect the printed `CheckResult.text`** for the verdict and per-fold summaries; the full hierarchical-null draws are available in `metadata[\"null_samples\"]`.\n",
     "4. **Open the HTML report** (iframe) for a consolidated view alongside the main estimates.\n",
     "\n",
-    "**Interpreting the result on this dataset.** On the example dataset above, the check reports `NOT SUPPORTED` with `P(actual outside null) ≈ 0.92` — *just* below the 0.95 default threshold. The actual cumulative impact (≈ −38) is clearly large, but the hierarchical null inferred from the placebo folds is wide enough to swallow it. Drilling into the per-fold summaries explains why: fold 1's pseudo-treatment time lands at index 12, leaving only 12 pre-treatment observations to fit a synthetic control on 7 donors. That model is poorly identified, so its posterior cumulative impact has `sd ≈ 30`, which inflates `tau` in the hierarchical status-quo model and broadens the null distribution. Fold 2, with a 41-row pre-period, behaves much better (`sd ≈ 9`).\n",
+    "**Interpreting the result on this dataset.** On the example dataset above, the verdict sits close to the 0.95 threshold: the printed `P(actual outside null)` lands near 0.92, and small MCMC sampling noise can push it either side of the cutoff between runs (see the reproducibility note below). Read the verdict as borderline rather than as a clean pass or fail. The actual cumulative impact (≈ −38) is clearly large, but the hierarchical null inferred from the placebo folds is wide enough that the check cannot discriminate sharply on this data. Drilling into the per-fold summaries explains why: fold 1's pseudo-treatment time lands at index 12, leaving only 12 pre-treatment observations to fit a synthetic control on 7 donors. That model is poorly identified, so its posterior cumulative impact carries a large standard deviation, which inflates `tau` in the hierarchical status-quo model and broadens the null distribution. Fold 2, with a 41-row pre-period, behaves much better.\n",
     "\n",
-    "So this is *not* a clean rejection of the headline causal estimate; it is the placebo-in-time check telling you, correctly, that with this particular dataset and `n_folds=2` the placebo windows do not provide a tight enough null to discriminate. In general, a **pass** means the actual cumulative impact is unlikely under the status-quo null inferred from placebo folds — consistent with a real treatment effect. A **fail** means the placebo folds produced effects of similar magnitude, so the real effect is hard to distinguish from background variability. Like any single diagnostic, this is *necessary but not sufficient*: passing does not prove identification, and failing does not prove the absence of an effect {cite:p}`reichardt2019quasi`.\n",
+    "Treat this as an illustrative borderline case rather than a substantive verdict on the headline causal estimate — with only `n_folds=2` on a short series, this dataset does not give the placebo-in-time check enough independent evidence to call it either way. In general, a **pass** means the actual cumulative impact is unlikely under the status-quo null inferred from placebo folds — consistent with a real treatment effect. A **fail** means the placebo folds produced effects of similar magnitude, so the real effect is hard to distinguish from background variability. Like any single diagnostic, this is *necessary but not sufficient*: passing does not prove identification, and failing does not prove the absence of an effect {cite:p}`reichardt2019quasi`.\n",
     "\n",
     ":::{note}\n",
-    "**Known limitation: placebo windows that land too early.** `PlaceboInTime` derives `intervention_length` from `data.index.max() - treatment_time` and shifts the placebo treatment time backward by multiples of that length. When the resulting earliest fold has a short pre-period (as fold 1 does here), the synthetic control fit on that fold is noisy and can dominate the hierarchical null. This is being tracked in [issue #875](https://github.com/pymc-labs/CausalPy/issues/875), which proposes letting users pass an explicit `intervention_length` and adding a configurable minimum pre-period per fold so weak folds can be skipped. As an interim workaround you can pass an `experiment_factory` to `PlaceboInTime` that fits placebo folds on a smaller donor pool or a shorter intervention window.\n",
+    "**Known limitations being tracked.** Two upstream issues affect how this check behaves on the example data above. Until both are resolved, treat borderline `P(actual outside null)` values on this dataset as exactly that — borderline — rather than as evidence for or against the headline estimate.\n",
+    "\n",
+    "- **Placebo windows that land too early ([#875](https://github.com/pymc-labs/CausalPy/issues/875)):** `PlaceboInTime` derives `intervention_length` from `data.index.max() - treatment_time` and shifts the placebo treatment time backward by multiples of that length. When the resulting earliest fold has a short pre-period (as fold 1 does here), the synthetic control fit on that fold is noisy and can dominate the hierarchical null. The proposed fix lets users pass an explicit `intervention_length` and adds a configurable minimum pre-period per fold so weak folds can be skipped. As an interim workaround you can pass an `experiment_factory` to `PlaceboInTime` that fits placebo folds on a smaller donor pool or a shorter intervention window.\n",
+    "- **Reproducibility gap in the internal posterior predictive ([#876](https://github.com/pymc-labs/CausalPy/issues/876)):** even with `random_seed` plumbed through both `sample_kwargs` and the constructor, the internal `pm.sample_posterior_predictive` call for `theta_new` is not currently seeded, so the printed `P(actual outside null)` can drift by ~0.01–0.02 between runs and may straddle the 0.95 threshold. The proposed fix unifies the seed surface so the constructor's `random_seed` deterministically governs every stochastic stage of the check.\n",
     ":::\n",
     "\n",
     "#### If this check fails\n",