Add outcome falsification, random placebo folds, and model clone support by cetagostini · Pull Request #826 · pymc-labs/CausalPy

cetagostini · 2026-04-06T07:39:21Z

Summary

Fix interrogate coverage: Exclude .marimo/ and .scratch/ directories from docstring coverage checks (pre-existing failure at 82.8%)
Add _clone methods to BayesianBasisExpansionTimeSeries and StateSpaceTimeSeries so sensitivity checks can create fresh, unfitted copies
Add random fold selection to PlaceboInTime with min_training_pct, min_gap, and exclude_periods constraints; fix assurance simulation to correctly model alternative hypothesis as theta_new + expected_effect
Add OutcomeFalsification sensitivity check that re-fits experiments with alternative outcome formulas and reports effect sizes with HDI intervals (informational, no pass/fail)
Add article notebook (its_placebo_in_time_analysis.ipynb) documenting placebo-in-time and outcome-falsification methodology with worked examples

Contributes towards #914 (Bayesian assurance / operating-characteristics thinking for design assessment) but does not close it — #914 tracks a separate methodology review of iid Gaussian noise injection in SyntheticControl.power_analysis().

Test plan

Unit tests for OutcomeFalsification construction, validation, and repr
Integration tests for OutcomeFalsification.run() with single/multiple formulas and failed formulas
Unit tests for PlaceboInTime random selection construction, validation, and geometry
Integration tests for PlaceboInTime with random selection mode
Verify interrogate passes on CI (was failing pre-existing at 82.8%)
Verify all pre-commit hooks pass

🤖 Generated with Claude Code

These directories contain exploratory/demo scripts that are not part of the main package and should not count toward docstring coverage. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…TimeSeries Sensitivity checks like PlaceboInTime and OutcomeFalsification need to create fresh, unfitted copies of models. These _clone methods preserve all configuration (components, sample_kwargs, mode) while resetting fitted state. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…Time Adds a "random" selection_method that randomly samples eligible placebo windows from the pre-intervention period, with constraints on minimum training fraction, minimum gap between folds, and optional period exclusion. Also fixes the assurance simulation to correctly model the alternative hypothesis as null baseline noise + expected treatment effect (theta_new + expected_effect), matching the paper's formulation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Introduces outcome falsification, which re-fits the experiment with alternative outcome formulas and reports their estimated effect sizes with HDI intervals. This is an informational check (no pass/fail) that lets researchers assess whether the pattern of effects across outcomes is consistent with their causal story. Inspired by the "causal detective" approach in Gallea (2026). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Adds a developer notebook documenting the placebo-in-time and outcome falsification sensitivity check methodology with worked examples. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

review-notebook-app · 2026-04-06T07:39:29Z

Check out this pull request on

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

codecov · 2026-04-06T07:45:16Z

Codecov Report

❌ Patch coverage is 96.81621% with 22 lines in your changes missing coverage. Please review.
✅ Project coverage is 95.12%. Comparing base (6fe2ec3) to head (ab50a40).

Files with missing lines	Patch %	Lines
causalpy/tests/test_timeseries_model_coverage.py	80.00%	9 Missing ⚠️
causalpy/checks/outcome_falsification.py	92.47%	3 Missing and 4 partials ⚠️
causalpy/checks/placebo_in_time.py	95.34%	2 Missing and 2 partials ⚠️
causalpy/pymc_models.py	66.66%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #826      +/-   ##
==========================================
+ Coverage   95.03%   95.12%   +0.08%     
==========================================
  Files          90       92       +2     
  Lines       14117    14803     +686     
  Branches      851      890      +39     
==========================================
+ Hits        13416    14081     +665     
- Misses        490      505      +15     
- Partials      211      217       +6

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

read-the-docs-community · 2026-04-06T07:45:31Z

Documentation build overview

📚 causalpy | 🛠️ Build #33072823 | 📁 Comparing ab50a40 against latest (78ec91d)

🔍 Preview build

260 files changed · + 65 added · ± 194 modified · - 1 deleted

+ Added

± Modified

- Deleted

api/generated/causalpy.experiments.base.BaseExperiment.plot.html

drbenvincent · 2026-04-14T09:05:27Z

Bit of a mega payload here. There are a bunch of definitely different themes going on. In an ideal world, future PR's would be broken up into slightly more discrete PR's. Partly because it could be easier to revert any changes, partly because it's easier to review smaller PR's. But if there are genuine interactions/dependencies then cool.

Review incoming...

drbenvincent

Thanks for the PR! The sensitivity-checking work here is solid — the OutcomeFalsification check and the random fold selection for PlaceboInTime are both well-motivated features, and the article notebook is pedagogically excellent.

I have comments grouped by severity below.

Must-fix

1. `_clone()` methods should forward user priors (`pymc_models.py`)

Both new _clone() methods (BayesianBasisExpansionTimeSeries and StateSpaceTimeSeries) omit priors=self._user_priors, unlike the base PyMCModel._clone() which passes it through. If a user provided custom priors to the original model, the clone silently drops them, meaning the cloned model may behave differently from the original. Both methods should forward self._user_priors for consistency with the base class pattern.

2. `selection_method` should use `Literal` (`placebo_in_time.py`)

The project's AGENTS.md type-checking rules specify: "use Literal for constrained string parameters." Literal is already used across 15+ source files in the codebase. selection_method: str should be selection_method: Literal["sequential", "random"].

Should-fix

3. `min_gap` semantics are confusing (`placebo_in_time.py`)

The docstring says min_gap is the "minimum index distance between any two selected candidate positions in the sorted candidate list." But test_random_fold_respects_min_gap asserts that the actual treatment times differ by >= min_gap, not the candidate-list indices. These are different quantities — a gap of 5 in candidate-list positions doesn't necessarily mean 5 units of actual time between folds (candidates may not be contiguous).

Either:

Clarify the docstring to match the actual intent, or
Change the constraint to operate on actual time distance between selected folds, which is what users likely care about.

4. Assurance formula fix needs a targeted test (`placebo_in_time.py`)

The change to the assurance simulation (from true_effect = expected_effect to true_effect = theta_new + expected_effect) is a behavioral change to existing functionality. The commit message explains the rationale well, but there's no test that specifically validates the new formula is correct (or demonstrates the old one was wrong). A test that checks assurance power changes in the expected direction under a known scenario would give confidence in this fix.

5. Bare `except Exception` is too broad (`outcome_falsification.py`)

In the run() loop, except Exception swallows everything. This is convenient for robustness, but it can hide real bugs during development (e.g., an AttributeError from a code change in the experiment class). Consider catching a more targeted set of exceptions (e.g., (ValueError, PatsyError, RuntimeError)) and letting unexpected errors propagate. The exc_info=True logging is good, but users who don't check logs will never know something unexpected happened.

6. `FalsificationResult` stores the full fitted experiment (`outcome_falsification.py`)

Each FalsificationResult holds a reference to the complete fitted BaseExperiment (including its InferenceData). For a check with many falsification formulas, this could consume significant memory. Consider whether storing only the summary statistics (already captured in effect_mean, hdi_lower, hdi_upper) would suffice for most use cases, and document that the full experiments are retained for users who want to inspect posteriors.

Suggestions / nits

7. External data dependency in notebook

The notebook fetches data at runtime from https://nyc3.digitaloceanspaces.com/owid-public/data/.... If these URLs change or go down, the notebook breaks. Consider bundling the dataset as a CausalPy example dataset, or at minimum noting the dependency in a comment.

8. "Co2" should be "CO₂" in prose

Throughout the notebook markdown, "Co2" is used where "CO₂" (or at least "CO2") would be the standard scientific notation. Variable/column names like coal_co2 are fine, but the prose should use the conventional form.

9. Test setup duplication

Several integration tests in test_outcome_falsification.py repeat the exact same setup block (create data, create experiment, create context, populate experiment_config). A shared pytest fixture would reduce ~60 lines of duplication and make the tests easier to maintain.

10. Greedy fold selection can fail unnecessarily (`placebo_in_time.py`)

The random fold selection algorithm picks one fold at a time without backtracking. This means a bad early pick can make it impossible to select remaining folds even when valid selections exist. For small candidate pools with large min_gap, this could lead to unnecessary ValueErrors. A note in the docstring would help, or the algorithm could retry with shuffled order.

Things I liked

The "causal detective" pedagogical framing in the notebook is genuinely excellent.
The informational-only design (passed=None) for OutcomeFalsification is the right call — this check should inform, not gate.
The test coverage is thorough, with both unit tests (no sampling) and integration tests for both new features.
The interrogate config fix is a nice drive-by cleanup.

Must-fix: - pymc_models: forward user priors through BayesianBasisExpansionTimeSeries and StateSpaceTimeSeries _clone() (add priors kwarg to both __init__s) - placebo_in_time: tighten selection_method to Literal["sequential", "random"] Should-fix: - placebo_in_time: min_gap now measures observation-count distance between selected folds (tracked via each candidate's position in the sorted pre-period index), matching the test's intent - test_placebo_in_time: add targeted unit tests pinning the corrected assurance formula (true_effect = theta_new + expected_effect) - outcome_falsification: narrow the broad except to (PatsyError, FormulaException, DataException, ValueError, KeyError, RuntimeError, LinAlgError) and warnings.warn on skip so silent failures surface - outcome_falsification: add store_experiments flag so callers can drop fitted experiments and keep only the summary stats Nits: - notebook: add comment flagging the OWID URL runtime dependency - notebook: CO2 -> CO2 (with subscript) in markdown prose - test_outcome_falsification: extract shared its_context fixture - placebo_in_time: document greedy-selection failure mode in docstring and error message Tests: - add regression tests that _clone() preserves user_priors on both time-series models Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…n_placebos

drbenvincent

Review by claude-opus-4-7-xhigh. Code-only review — Ben will manually review the new its_place_in_time_analysis.ipynb docs page.

Thanks for the thorough turnaround on the prior round — the two new assurance-formula tests in particular are a nice way to pin the corrected math, and the _clone() priors regression tests directly address what was flagged. CI is green, patch coverage is 96.65%, and every item from the previous review is accounted for in 3ab9fd6.

A few remaining concerns and nits below, none of which I think are blockers.

Should-fix

1. `min_gap=1` is silently a no-op (`placebo_in_time.py`)

With sampling without replacement, abs(pos_i - pos_j) >= 1 is trivially true for any two distinct candidates (positions are unique integers). So the documented default "minimum 1 observation gap" doesn't actually constrain anything beyond "don't pick the same candidate twice". Either:

document this honestly (e.g., "default 1 imposes no spacing constraint"), or
pick a more useful default such as intervention_length, which would nudge users toward non-overlapping placebo windows by default.

2. Random placebo folds can overlap each other (`placebo_in_time.py`)

The candidate filter pseudo_end > treatment_time prevents a fold from overlapping the real intervention, but nothing prevents two random folds whose pseudo-windows overlap each other (e.g., positions 40 and 45 with intervention_length=10). Overlapping folds share observations, which violates the exchangeability assumption implicit in the hierarchical status-quo model — each fold_mean is treated as an independent draw from a common mu_status_quo. A min_gap = intervention_length default in random mode would solve both this and (1); at minimum, a docstring warning would help.

3. `store_experiments=True` is still the default (`outcome_falsification.py`)

Adding the flag addresses the memory concern as an opt-out, but for the common case (a handful of falsification formulas, each retaining a full BaseExperiment with InferenceData) the footprint can run into hundreds of MB on PiecewiseITS or large samples. The summary stats are what most users actually need; consider flipping the default to False and documenting True as the inspect-the-posteriors opt-in.

4. `_draw_expected_effect_samples` ignores `n` for numpy arrays (`placebo_in_time.py`)

def _draw_expected_effect_samples(self, n: int) -> np.ndarray:
    """Draw samples from the expected-effect prior."""
    prior = self.expected_effect_prior
    if prior is None:
        raise ValueError("expected_effect_prior is not set.")
    if isinstance(prior, np.ndarray):
        return prior

When the user passes a numpy array, it is returned verbatim regardless of the requested n. _compute_assurance then wraps with i % len(expected_samples), so it doesn't crash, but a 10-element user-supplied array will cycle 10-to-1 against a 4000-sample theta_new without any signal to the user. Either tile/subsample to n, or document the cycling behaviour in the docstring (and ideally warn when len(prior) < n).

5. Greedy failure path is stochastic and unfriendly to reproducibility

The "retry with a different random_seed" suggestion in the error message defeats the purpose of passing a seed. A bounded internal retry loop (e.g., up to N reshuffles using deterministic sub-seeds derived from random_seed) would eliminate the failure for almost all realistic configurations while preserving reproducibility. Not a must-fix, but cheap and user-friendly.

Nits

6. `OutcomeFalsification.repr` always shows `store_experiments`

Even at default. PlaceboInTime.__repr__ uses the nicer "only show non-default" pattern — worth mirroring here.

7. `np.linalg.LinAlgError` in the caught-exceptions tuple is dead defense

It's not actually raised by the PyMC sampling path for these models. Harmless, just noting.

8. Random-mode pre-period is effectively halved when `treatment_end_time` is unset

_compute_intervention_length falls back to data.index.max() - treatment_time, so the candidate filter pseudo_end > treatment_time requires pos_val + post_length < treatment_time. For a typical 75/25 split that leaves ~50% of the pre-period as eligible. Worth a one-line note in the random-mode docstring so users aren't surprised.

9. `test_run_handles_failed_formula` uses a syntactically invalid formula

The comment correctly attributes this to a Python 3.13 / patsy traceback interaction, but the more common real-world failure mode is a missing-column formula. A short link to the upstream bug (or a TODO to swap back when fixed) would be useful long term.

Process note (for future PRs, not this one)

Echoing the earlier "mega payload" point: five logically independent themes (interrogate config, time-series _clone(), random placebo folds, new OutcomeFalsification check, PanelRegression plot bug) in one PR makes review slower and revert harder. Worth keeping in mind for the next round rather than reshuffling this one.

drbenvincent

Follow-up by claude-opus-4-7-xhigh.

Must-fix: notebook is not wired into the docs

docs/source/notebooks/its_place_in_time_analysis.ipynb exists in the repo but is not referenced from any toctree. Specifically:

It is not in docs/source/notebooks/index.md (no toctree entry under any section).
It is not linked from docs/source/notebooks/sensitivity_checks.md, even though that page maintains a "Where examples already exist" list and a dedicated PlaceboInTime description that would be the natural pointers.
No :glob: directive picks it up automatically.

The Read the Docs preview happens to render the page because Sphinx builds every .ipynb it finds, but the page will be an orphan: not reachable from the sidebar or any index, and Sphinx will emit a "document isn't included in any toctree" warning at build time.

Question on placement

Where do you want this notebook to live in the navigation? My hunch is ITS-specific — the content is built around an ITS analysis of the UK 2008 climate intervention and PlaceboInTime is one of the headline diagnostics for ITS. In that case, it belongs in the existing Interrupted Time Series toctree in docs/source/notebooks/index.md:

:::{toctree}
:caption: Interrupted Time Series
:maxdepth: 1

its_skl.ipynb
its_pymc.ipynb
its_post_intervention_analysis.ipynb
its_covid.ipynb
its_lift_test.ipynb
its_place_in_time_analysis.ipynb
:::

Alternatively, since PlaceboInTime is a sensitivity check and the prose page sensitivity_checks.md already has a section for it, it could go in the Sensitivity Checks toctree:

:::{toctree}
:caption: Sensitivity Checks
:maxdepth: 1

sensitivity_checks.md
its_place_in_time_analysis.ipynb
:::

Could you confirm which section you want it under? Either way, please also add a discoverability link from sensitivity_checks.md — at minimum extending the existing line:

- `PlaceboInTime`: {doc}`pipeline_workflow`, {doc}`report_demo`, {doc}`its_place_in_time_analysis`

so users land on the worked example from the sensitivity-checks prose entry point.

drbenvincent · 2026-04-23T19:43:40Z

Ben thinks ITS is the most appropriate section

drbenvincent

Comments on the docs

For code cells focussing on plotting, add the hide-input cell tag. That will collapse the code so that it's hidden, but viewable if people want.

See what you can do about the sampler warnings

For cells with lots of output, add the hide-output tag. That should also be collapsable. It's fine if the output is the important point, but where you get lots of lines of sampler output and it's not a core part of the story, then just hide/collapse that stuff

More from me on the notebook soon - trying to fit this review in in the evening, but sleepiness is winning.

drbenvincent · 2026-04-23T19:57:22Z

+            "Chain 1 reached the maximum tree depth. Increase `max_treedepth`, increase `target_accept` or reparameterize.\n",
+            "Chain 2 reached the maximum tree depth. Increase `max_treedepth`, increase `target_accept` or reparameterize.\n",
+            "Chain 3 reached the maximum tree depth. Increase `max_treedepth`, increase `target_accept` or reparameterize.\n",
+            "Sampling: [beta, delta, fourier_beta


Suggestion: promote plot_placebo_calibration into the codebase

claude-opus-4-7-xhigh here. Ben will manually review the docs page itself; this is a code-shape suggestion based on what the notebook defines.

The notebook defines a ~130-line helper plot_placebo_calibration(pit_check, original_result, title=...) and then calls it three times on different datasets. That repeated reuse is a strong signal it's general, not example-specific.

Why it belongs in the library

Looking at what the function actually consumes, it's entirely keyed on the PlaceboInTime CheckResult contract:

pit_check.metadata["fold_results"] (list of PlaceboFoldResult)

pit_check.metadata["null_samples"]

pit_check.metadata["actual_cumulative_mean"]

pit_check.metadata["p_effect_outside_null"]

original_result.post_impact — present on every applicable_methods member of PlaceboInTime (ITS and SyntheticControl)

It already handles both datetime and integer pseudo_treatment_time formatting, so it's general for every experiment type PlaceboInTime supports, not ITS-specific.

There is also already a documented home for this in CheckResult:

# causalpy/checks/base.py figures : list Optional matplotlib figures produced by the check. ... figures: list[Any] = field(default_factory=list)

…but no check currently populates it, so GenerateReport never has plots to surface. Promoting this helper would be the first concrete user of that field.

Suggested shape

Two clean options, not mutually exclusive:

Method on the check class — PlaceboInTime.plot_calibration(check_result, experiment) -> Figure. Mirrors how BaseExperiment exposes its plotting methods. Lives next to the metadata schema it depends on, in causalpy/checks/placebo_in_time.py.

Auto-populate CheckResult.figures — call the plotter inside PlaceboInTime.run() (gated behind something like make_figures: bool = True) and append the Figure to the result. This makes GenerateReport strictly more useful for free.

I'd lean toward (1) as the building block and (2) as the default behaviour.

Cleanups before promoting

The original_result arg is only used to extract post_impact. Since run() already has experiment in scope, you could compute actual_samples once during run() and stash it in metadata (e.g. metadata["actual_cumulative_samples"]), then the plotter takes just the CheckResult — no second positional arg, no easy way to mismatch them.

Hard-coded FOLD_COLORS (5 colors) silently wraps past 5 folds; promote to a module-level _DEFAULT_FOLD_COLORS or use matplotlib's color cycle.

The print(...) fallback for "not enough folds completed" should be warnings.warn (or return an empty figure with an annotation) — print is noisy inside library code.

Drop the trailing plt.show(); return the Figure and let the caller / notebook display it.

Add figsize and axes kwargs so it's composable and unit-testable.

Scope

Happy for this to be a follow-up PR rather than expanding this one further — the notebook can keep its inline helper for now, and a follow-up can do the move + cleanups + a test that just asserts run(..., make_figures=True) produces one Figure of the expected shape. Flag if you'd rather roll it in here.

Ben's review from 2026-04-23: PlaceboInTime (checks/placebo_in_time.py): - Add allow_overlap parameter (default False) that enforces non-overlap of pseudo-intervention windows, so random folds no longer violate the hierarchical model's exchangeability assumption by default. - Replace the "retry with a different random_seed" error path with a bounded greedy-selection retry loop (MAX_RANDOM_SELECTION_RETRIES=16) using deterministic sub-seeds derived from random_seed; failure message now names the knobs to relax (allow_overlap, min_gap, n_folds). - Warn when a pre-drawn numpy expected_effect_prior is shorter than the number of replications requested by the assurance simulation, documenting the cycling behaviour in _draw_expected_effect_samples. - Surface allow_overlap=True in __repr__ following the same "non-default only" pattern as selection_method. - Document how intervention_length falls back to data.index.max() - treatment_time when treatment_end_time is unset, shrinking the random-mode eligible window. OutcomeFalsification (checks/outcome_falsification.py): - Warn at run() when storing >= 3 fitted experiments, explaining that store_experiments=False keeps only summary statistics. - Rewrite __repr__ to hide default alpha and store_experiments flags. - Drop dead np.linalg.LinAlgError from the caught-exception tuple and the now-unused numpy import. Docs: - Wire its_place_in_time_analysis.ipynb into the ITS toctree so it stops being an orphan page. - Cross-link the notebook from sensitivity_checks.md under the "Where examples already exist" list. - Tag plot-only cells with hide-input and sampler-heavy cells with hide-output so the rendered page collapses non-essential chunks. Tests: - Pin allow_overlap default, the non-overlap invariant, the _windows_overlap helper for numeric and datetime indices, the allow_overlap opt-out, the bounded-retry reproducibility and exhaustion paths, and the expected-effect-prior cycling warning. - Pin the new OutcomeFalsification __repr__ pattern, the store_experiments memory warning at run() time, and its opt-out and below-threshold paths. - Expand the upstream-bug TODO in test_run_handles_failed_formula. Made-with: Cursor

drbenvincent · 2026-04-24T20:54:53Z

Hi @cetagostini — heads up: this branch contains two commits that touch causalpy/experiments/panel_regression.py and overlap with the fix that just landed on main via #853 (which closes #852):

8c91967d — "Fix mypy union-attr errors in PanelRegression"
f50caff5 — "Fix PanelRegression.plot_unit_effects with duplicate coeffs labels"

The merged solution in #853 takes a slightly different (and reviewed/CI-validated) approach for both:

Uses # type: ignore[union-attr] only, rather than [attr-defined,union-attr] — only union-attr is the real error today, so the extra code is unnecessary.
Keeps .sel(coeffs=label) and adds .squeeze("treated_units", drop=True).item() to match the existing idiom at lines 852–862 of the same file. (Patsy column names are unique by construction, so the "duplicate coeffs labels" diagnosis didn't apply to PanelRegression. If you have a real reproducer for that scenario, it's worth filing as a separate issue.)

Suggested next step:

Drop 8c91967d and f50caff5 from this branch (e.g. git rebase -i main and remove those two commits, or git reset + cherry-pick everything else).
Merge/rebase main to pick up the canonical fix from Fix mypy type: ignore codes in panel_regression.py #853.
This should also help unblock the current CONFLICTING merge state.

Happy to help walk through the rebase if useful.

Ben's review from 2026-04-23: PlaceboInTime (checks/placebo_in_time.py): - Add allow_overlap parameter (default False) that enforces non-overlap of pseudo-intervention windows, so random folds no longer violate the hierarchical model's exchangeability assumption by default. - Replace the "retry with a different random_seed" error path with a bounded greedy-selection retry loop (MAX_RANDOM_SELECTION_RETRIES=16) using deterministic sub-seeds derived from random_seed; failure message now names the knobs to relax (allow_overlap, min_gap, n_folds). - Warn when a pre-drawn numpy expected_effect_prior is shorter than the number of replications requested by the assurance simulation, documenting the cycling behaviour in _draw_expected_effect_samples. - Surface allow_overlap=True in __repr__ following the same "non-default only" pattern as selection_method. - Document how intervention_length falls back to data.index.max() - treatment_time when treatment_end_time is unset, shrinking the random-mode eligible window. OutcomeFalsification (checks/outcome_falsification.py): - Warn at run() when storing >= 3 fitted experiments, explaining that store_experiments=False keeps only summary statistics. - Rewrite __repr__ to hide default alpha and store_experiments flags. - Drop dead np.linalg.LinAlgError from the caught-exception tuple and the now-unused numpy import. Docs: - Wire its_place_in_time_analysis.ipynb into the ITS toctree so it stops being an orphan page. - Cross-link the notebook from sensitivity_checks.md under the "Where examples already exist" list. - Tag plot-only cells with hide-input and sampler-heavy cells with hide-output so the rendered page collapses non-essential chunks. Tests: - Pin allow_overlap default, the non-overlap invariant, the _windows_overlap helper for numeric and datetime indices, the allow_overlap opt-out, the bounded-retry reproducibility and exhaustion paths, and the expected-effect-prior cycling warning. - Pin the new OutcomeFalsification __repr__ pattern, the store_experiments memory warning at run() time, and its opt-out and below-threshold paths. - Expand the upstream-bug TODO in test_run_handles_failed_formula. Made-with: Cursor

…n_placebos

cetagostini · 2026-04-24T21:23:12Z

@drbenvincent all good, I did already! The feedback was implemented already as well.

cetagostini · 2026-05-07T05:14:28Z

@drbenvincent quick ping here. All should be good now!

cetagostini · 2026-05-21T10:34:34Z

Maybe @juanitorduz can take a look as well this should solve #914 and similars.

…n_placebos

drbenvincent · 2026-05-25T10:33:30Z

NOTE TO SELF: I think this just needs a quick read over the rendered docs from me and @juanitorduz maybe

cetagostini · 2026-06-03T07:46:46Z

@drbenvincent @juanitorduz small nudge again! jiji

juanitorduz

small initial comments

juanitorduz · 2026-06-04T07:03:10Z

+    # ------------------------------------------------------------------
+    # Validation
+    # ------------------------------------------------------------------


remove these comments :)

juanitorduz · 2026-06-04T07:03:36Z

+    # ------------------------------------------------------------------
+    # Factory helper
+    # ------------------------------------------------------------------


juanitorduz · 2026-06-04T07:03:56Z

+    # ------------------------------------------------------------------
+    # Effect extraction
+    # ------------------------------------------------------------------


same here (and all the following ones)

daimon-pymclabs · 2026-06-04T09:16:20Z

Detailed Review — PR #826

Reviewer: @juanitorduz (via Daimon)

This PR adds outcome falsification, random placebo fold selection, model _clone() support, and an article notebook. It has already gone through two rounds of review with Ben, which addressed many issues. Below is a fresh pass focusing on correctness, modularity, maintainability, type hints, docstrings, and tests.

1. Correctness

Assurance formula fix (placebo_in_time.py) — good catch, correctly fixed. The old code set true_effect = expected_effect, ignoring the theta_new null baseline noise. The fix true_effect = null_component + treatment_component correctly models the alternative hypothesis as the sum of both components, matching the paper's formulation. The two new unit tests (test_assurance_formula_zero_expected_matches_null and test_assurance_formula_large_expected_effect_dominates_baseline) pin the corrected behavior well.

_windows_overlap boundary semantics are correct. later < earlier + intervention_length gives half-open [start, start+L) windows, so back-to-back windows return False (non-overlapping). The datetime and numeric tests confirm this.

_extract_effect_stats assumes table.iloc[0] always has the right row. This is fine for the current effect_summary API, but if that method ever returns multi-row tables (e.g. multi-period effects), this will silently take the wrong row. Consider adding an assertion or at least a comment noting the assumption.

Minor: _draw_expected_effect_samples cycling behavior. When prior is an ndarray shorter than n, the method now warns but still silently cycles via prior[i % len(prior)]. The warning is good. However, the cycling logic itself lives in _compute_assurance, not in _draw_expected_effect_samples — the method just returns prior as-is and the cycling happens at index time. This is fine but the warning message could be confusing since the cycling isn't actually happening in the method that warns.

2. Code Modularity & Design

OutcomeFalsification is well-structured. Clean separation between validation, experiment building, effect extraction, and result assembly. The _build_experiment / _extract_effect_stats static methods make the class easy to test and extend.

_compute_random_fold_treatment_times is long (~100 lines) but logical. The candidate filtering → greedy selection → retry loop decomposition is clean. The _try_greedy_selection helper handles the constraint-satisfaction logic separately, which is good.

clone_model in base.py is a nice pattern — duck-typing _clone() with a deepcopy fallback keeps the model classes decoupled from the check infrastructure.

Suggestion: consider extracting _compute_random_fold_treatment_times's candidate-building loop. The date formatting + exclusion + min_training + pseudo_end filtering could be a _build_candidate_pool method. This would make _compute_random_fold_treatment_times easier to read and test the candidate logic in isolation.

3. Maintainability

Exception handling in OutcomeFalsification.run() is well-scoped. The caught tuple (PatsyError, FormulaException, DataException, ValueError, KeyError, RuntimeError) is specific enough to avoid swallowing unexpected errors while still being resilient to the realistic failure modes. The warnings.warn on skip is a good choice — silent failures would be much harder to debug.

MAX_RANDOM_SELECTION_RETRIES = 16 is a module-level constant, which is good. The error message when retries are exhausted is actionable — it names the knobs to relax (allow_overlap, min_gap, n_folds).

_STORE_EXPERIMENTS_WARN_THRESHOLD = 3 is reasonable. Three fitted PyMC experiments with full InferenceData can easily reach hundreds of MB.

Minor concern: pool.remove(pick) in _try_greedy_selection is O(n). Not a real problem since n_folds and candidate pools are small in practice, but worth noting.

4. Type Hints

Good overall. Key functions have return type annotations and parameter types. Specific notes:

selection_method: Literal["sequential", "random"] — correctly tightened from str per review feedback. ✅
_compute_random_fold_treatment_times returns list[Any] — the Any is unavoidable since treatment times can be numeric or datetime. Consider list[Any] → a type alias like TreatmentTime = Union[int, float, pd.Timestamp] to make intent clearer, but this is a nit.
FalsificationResult.experiment: BaseExperiment | None = None — correct, None when store_experiments=False. ✅
_clone() -> "PyMCModel" — forward reference string is fine. ✅
_windows_overlap(idx_a: Any, idx_b: Any, intervention_length: Any) -> bool — all Any is a bit loose. Same as above — a type alias would help readability but isn't blocking.

5. Docstrings

Module-level docstrings are excellent. outcome_falsification.py has a clear description of purpose, references Gallea (2026), and lists supported experiment types.

FalsificationResult dataclass has an Attributes section. ✅

OutcomeFalsification class docstring documents all parameters. ✅

_clone() methods have a one-liner docstring. Appropriate for internal methods. ✅

_try_greedy_selection and _windows_overlap lack docstrings. These are internal methods, so this is borderline. _windows_overlap in particular would benefit from a one-line docstring clarifying the half-open interval semantics (e.g., "True iff [a, a+L) and [b, b+L) share at least one point").

_compute_random_fold_treatment_times has no docstring. Given its complexity (~100 lines, multiple parameters, retry logic), a Parameters/Returns/Raises docstring would help future contributors understand the contract without reading the full body.

6. Tests

Coverage is thorough and well-organized. The test suite covers:

Construction/validation (no sampling — fast) ✅
Geometry/constraints (no sampling — fast) ✅
Integration (PyMC sampling — slow but necessary) ✅
Edge cases (failed formulas, infeasible constraints, cycling warnings) ✅

Specific strengths:

test_run_handles_failed_formula documents the Python 3.13/patsy interaction bug in a TODO — good practice.
The its_context fixture in test_outcome_falsification.py avoids repeated setup.
test_greedy_retry_raises_when_infeasible and test_greedy_retry_error_suggests_relaxing_constraints pin the error UX.
TestTimeSeriesModelClonePreservesPriors is a regression test for the priors forwarding fix — exactly the right kind of test.

One gap: no test for OutcomeFalsification with DifferenceInDifferences or PiecewiseITS. All integration tests use InterruptedTimeSeries. A parametrized test across the three applicable_methods would increase confidence, though this is arguably low-risk since the formula-swapping logic is method-agnostic.

Another gap: no negative test for _clone() verifying that the clone is truly unfitted. The tests verify _user_priors is preserved, but don't assert that the cloned model has no idata / fitted state. This would guard against future regressions where _clone() accidentally copies fitted state.

7. Notebook

I didn't do a line-by-line review of the notebook, but the commit history shows it's been through extensive iteration with Ben (heading structure, plot sizing, retina rendering, prose clarity, codespell/ruff fixes). The methodology documentation (fossil-fuel substitution case study, calibration plot interpretation, four-question structure) looks well-scaffolded.

Summary

This is a solid PR that adds meaningful sensitivity-analysis capabilities to CausalPy. The assurance formula bug fix is a real correctness improvement. The code is well-structured, the exception handling is thoughtful, and the test coverage is comprehensive. The main suggestions are:

Add a docstring to _compute_random_fold_treatment_times (given its complexity)
Add a one-liner docstring to _windows_overlap clarifying half-open interval semantics
Consider a negative _clone() test asserting no fitted state is carried over
Consider a parametrized OutcomeFalsification integration test across DiD/PiecewiseITS (low priority)
The _extract_effect_stats iloc[0] assumption is fragile if effect_summary ever changes — a defensive assertion would be cheap insurance

None of these are blockers. The PR is in good shape.

juanitorduz · 2026-06-04T12:35:55Z

@cetagostini I think these are valid comments :) I suggest addressing them

cetagostini · 2026-06-04T12:49:10Z

@cetagostini I think these are valid comments :) I suggest addressing them

Yes, I can do in a few hours. The 5 points by Daimon?

juanitorduz · 2026-06-04T13:16:14Z

yes, if possible :)

daimon-pymclabs · 2026-06-04T16:24:41Z

Re-review — PR #826 (round 2)

Reviewer: @juanitorduz (via Daimon)

Carlos addressed all five suggestions from the first review. Here's a point-by-point check plus notes on the additional changes.

Addressed items

_extract_effect_stats defensive assertion — Added validation for empty table and missing required columns (mean, hdi_lower, hdi_upper) with clear RuntimeError messages. The docstring now documents the Raises contract. ✅
_windows_overlap docstring — Rewritten to clearly describe the half-open [idx, idx + L) semantics, the overlap condition abs(a - b) < L, and the back-to-back non-overlap convention. Also explains why idx + intervention_length arithmetic is used instead of direct Timedelta. ✅
_compute_random_fold_treatment_times docstring — Expanded significantly into a two-stage algorithm description (candidate pool + greedy selection with retry). Documents the three candidate eligibility constraints, the retry mechanism with deterministic sub-seeds, and the error path. ✅
Negative _clone() tests — New TestTimeSeriesModelCloneIsUnfitted class with four tests:
- LinearRegression, BayesianBasisExpansionTimeSeries, StateSpaceTimeSeries each assert cloned.idata is None and cloned is not original
- test_clone_after_fit_does_not_inherit_idata actually fits a LinearRegression with real sampling, then clones and verifies the clone has no idata while the original is untouched. This is the strongest possible test for the invariant. ✅
Cross-method parametrized tests — test_run_across_applicable_methods covers DiD and PiecewiseITS with proper data generators, context builders, and assertions on the full CheckResult shape. ITS was already covered by existing tests. ✅

Additional changes (not from my suggestions)

Removed section-separator comments (# ------- Validation ------- etc.) from placebo_in_time.py and outcome_falsification.py. Cleaner — the class is readable without them. ✅

Enriched PlaceboInTime.run() metadata with rope_half_width, threshold, expected_effect_prior, and fold_sds. Two new tests pin this:

test_run_metadata_carries_design_configuration — verifies all four fields round-trip correctly, including identity check on the prior array and shape/positivity check on fold_sds.
test_run_metadata_carries_defaults_when_unconfigured — verifies None/default values when ROPE/prior are not configured.

This is a good addition — downstream consumers (report generation, plotting) can now access the design configuration without needing to inspect the check object. ✅

type: ignore[arg-type] annotations on round_num calls in interrupted_time_series.py, piecewise_its.py, regression_discontinuity.py, and synthetic_control.py. These suppress a mypy complaint about the round_to parameter. Fine as a pragmatic fix. ✅

Verdict

All review feedback is addressed. The new tests are thorough and well-motivated. No new issues found. This PR is ready to merge from a code-quality perspective.

cetagostini · 2026-06-04T16:33:28Z

Everything address @juanitorduz

Rename its_place_in_time_analysis.ipynb to its_placebo_in_time_analysis.ipynb and update notebook index and sensitivity_checks cross-links. Merge latest main to stay current with upstream. Co-authored-by: Cursor <cursoragent@cursor.com>

drbenvincent · 2026-06-10T10:04:27Z

@cetagostini sorry about the duration on this one. I had a few minor requests, but I just implemented them so I'm not standing in your way any more. Will merge once remote CI is green

cetagostini and others added 5 commits April 6, 2026 10:37

Fix interrogate coverage by excluding .marimo and .scratch directories

312357b

These directories contain exploratory/demo scripts that are not part of the main package and should not count toward docstring coverage. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add sensitivity checks article notebook

d092a15

Adds a developer notebook documenting the placebo-in-time and outcome falsification sensitivity check methodology with worked examples. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

cetagostini requested a review from drbenvincent April 6, 2026 08:10

cetagostini added 2 commits April 6, 2026 11:12

Update

ab02595

Preparing PR

aa74ac3

cetagostini requested a review from NathanielF April 11, 2026 19:02

cetagostini mentioned this pull request Apr 11, 2026

Centralize AI config and reorganize skills #837

Closed

5 tasks

drbenvincent requested changes Apr 14, 2026

View reviewed changes

cetagostini requested a review from drbenvincent April 16, 2026 19:19

Merge remote-tracking branch 'origin/main' into cetagostini/working_o…

8d954f8

…n_placebos

cetagostini mentioned this pull request Apr 16, 2026

Comparison notebook: Why flexibility matters? #703

Open

drbenvincent reviewed Apr 23, 2026

View reviewed changes

drbenvincent requested changes Apr 23, 2026

View reviewed changes

drbenvincent reviewed Apr 23, 2026

View reviewed changes

cetagostini added 2 commits April 25, 2026 00:18

Merge remote-tracking branch 'origin/main' into cetagostini/working_o…

76b477c

…n_placebos

cetagostini force-pushed the cetagostini/working_on_placebos branch from 8342c7d to 76b477c Compare April 24, 2026 21:22

daimon-pymclabs mentioned this pull request May 12, 2026

Add geo-experiment design workflow and sensitivity check plotting for Synthetic Control #819

Draft

5 tasks

Merge branch 'main' into cetagostini/working_on_placebos

0a14b05

Merge remote-tracking branch 'origin/main' into cetagostini/working_o…

f831084

…n_placebos

Merge branch 'main' into cetagostini/working_on_placebos

68d230e

juanitorduz reviewed Jun 4, 2026

View reviewed changes

cetagostini added 4 commits June 5, 2026 01:05

changes

03ccecc

Update outcome_falsification.py

73687c7

Update placebo_in_time.py

e83c06f

precommit

56f1586

3.14

272510a

Merge branch 'main' into cetagostini/working_on_placebos

3fd7da5

cetagostini requested a review from drbenvincent June 7, 2026 15:22

juanitorduz approved these changes Jun 8, 2026

View reviewed changes

drbenvincent and others added 2 commits June 10, 2026 11:01

Merge branch 'main' into cetagostini/working_on_placebos

cdd1afd

Rename ITS placebo notebook and merge main

ab50a40

Rename its_place_in_time_analysis.ipynb to its_placebo_in_time_analysis.ipynb and update notebook index and sensitivity_checks cross-links. Merge latest main to stay current with upstream. Co-authored-by: Cursor <cursoragent@cursor.com>

drbenvincent approved these changes Jun 10, 2026

View reviewed changes

drbenvincent merged commit 261a84e into main Jun 10, 2026
19 checks passed

drbenvincent deleted the cetagostini/working_on_placebos branch June 10, 2026 10:14

anevolbap mentioned this pull request Jun 11, 2026

Feature parity with Google's CausalImpact #758

Open

24 tasks

Uh oh!

Conversation

cetagostini commented Apr 6, 2026 • edited by drbenvincent Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

review-notebook-app Bot commented Apr 6, 2026

Uh oh!

codecov Bot commented Apr 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

read-the-docs-community Bot commented Apr 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Documentation build overview

Uh oh!

drbenvincent commented Apr 14, 2026

Uh oh!

drbenvincent left a comment

Choose a reason for hiding this comment

Must-fix

1. _clone() methods should forward user priors (pymc_models.py)

2. selection_method should use Literal (placebo_in_time.py)

Should-fix

3. min_gap semantics are confusing (placebo_in_time.py)

4. Assurance formula fix needs a targeted test (placebo_in_time.py)

5. Bare except Exception is too broad (outcome_falsification.py)

6. FalsificationResult stores the full fitted experiment (outcome_falsification.py)

Suggestions / nits

7. External data dependency in notebook

8. "Co2" should be "CO₂" in prose

9. Test setup duplication

10. Greedy fold selection can fail unnecessarily (placebo_in_time.py)

Things I liked

Uh oh!

drbenvincent left a comment

Choose a reason for hiding this comment

Should-fix

1. min_gap=1 is silently a no-op (placebo_in_time.py)

2. Random placebo folds can overlap each other (placebo_in_time.py)

3. store_experiments=True is still the default (outcome_falsification.py)

4. _draw_expected_effect_samples ignores n for numpy arrays (placebo_in_time.py)

5. Greedy failure path is stochastic and unfriendly to reproducibility

Nits

6. OutcomeFalsification.__repr__ always shows store_experiments

7. np.linalg.LinAlgError in the caught-exceptions tuple is dead defense

8. Random-mode pre-period is effectively halved when treatment_end_time is unset

9. test_run_handles_failed_formula uses a syntactically invalid formula

Process note (for future PRs, not this one)

Uh oh!

drbenvincent left a comment

Choose a reason for hiding this comment

Must-fix: notebook is not wired into the docs

Question on placement

Uh oh!

drbenvincent commented Apr 23, 2026

Uh oh!

drbenvincent left a comment

Choose a reason for hiding this comment

Comments on the docs

Uh oh!

drbenvincent Apr 23, 2026

Choose a reason for hiding this comment

Suggestion: promote plot_placebo_calibration into the codebase

Why it belongs in the library

Suggested shape

Cleanups before promoting

Scope

Uh oh!

drbenvincent commented Apr 24, 2026

Uh oh!

cetagostini commented Apr 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cetagostini commented May 7, 2026

Uh oh!

cetagostini commented May 21, 2026

Uh oh!

drbenvincent commented May 25, 2026

cetagostini commented Apr 6, 2026 •

edited by drbenvincent

Loading

codecov Bot commented Apr 6, 2026 •

edited

Loading

read-the-docs-community Bot commented Apr 6, 2026 •

edited

Loading

1. `_clone()` methods should forward user priors (`pymc_models.py`)

2. `selection_method` should use `Literal` (`placebo_in_time.py`)

3. `min_gap` semantics are confusing (`placebo_in_time.py`)

4. Assurance formula fix needs a targeted test (`placebo_in_time.py`)

5. Bare `except Exception` is too broad (`outcome_falsification.py`)

6. `FalsificationResult` stores the full fitted experiment (`outcome_falsification.py`)

10. Greedy fold selection can fail unnecessarily (`placebo_in_time.py`)

1. `min_gap=1` is silently a no-op (`placebo_in_time.py`)

2. Random placebo folds can overlap each other (`placebo_in_time.py`)

3. `store_experiments=True` is still the default (`outcome_falsification.py`)

4. `_draw_expected_effect_samples` ignores `n` for numpy arrays (`placebo_in_time.py`)

6. `OutcomeFalsification.repr` always shows `store_experiments`

7. `np.linalg.LinAlgError` in the caught-exceptions tuple is dead defense

8. Random-mode pre-period is effectively halved when `treatment_end_time` is unset

9. `test_run_handles_failed_formula` uses a syntactically invalid formula

Suggestion: promote `plot_placebo_calibration` into the codebase

cetagostini commented Apr 24, 2026 •

edited

Loading