Skip to content

Add outcome falsification, random placebo folds, and model clone support#826

Merged
drbenvincent merged 40 commits into
mainfrom
cetagostini/working_on_placebos
Jun 10, 2026
Merged

Add outcome falsification, random placebo folds, and model clone support#826
drbenvincent merged 40 commits into
mainfrom
cetagostini/working_on_placebos

Conversation

@cetagostini

@cetagostini cetagostini commented Apr 6, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • Fix interrogate coverage: Exclude .marimo/ and .scratch/ directories from docstring coverage checks (pre-existing failure at 82.8%)
  • Add _clone methods to BayesianBasisExpansionTimeSeries and StateSpaceTimeSeries so sensitivity checks can create fresh, unfitted copies
  • Add random fold selection to PlaceboInTime with min_training_pct, min_gap, and exclude_periods constraints; fix assurance simulation to correctly model alternative hypothesis as theta_new + expected_effect
  • Add OutcomeFalsification sensitivity check that re-fits experiments with alternative outcome formulas and reports effect sizes with HDI intervals (informational, no pass/fail)
  • Add article notebook (its_placebo_in_time_analysis.ipynb) documenting placebo-in-time and outcome-falsification methodology with worked examples

Contributes towards #914 (Bayesian assurance / operating-characteristics thinking for design assessment) but does not close it — #914 tracks a separate methodology review of iid Gaussian noise injection in SyntheticControl.power_analysis().

Test plan

  • Unit tests for OutcomeFalsification construction, validation, and repr
  • Integration tests for OutcomeFalsification.run() with single/multiple formulas and failed formulas
  • Unit tests for PlaceboInTime random selection construction, validation, and geometry
  • Integration tests for PlaceboInTime with random selection mode
  • Verify interrogate passes on CI (was failing pre-existing at 82.8%)
  • Verify all pre-commit hooks pass

🤖 Generated with Claude Code

cetagostini and others added 5 commits April 6, 2026 10:37
These directories contain exploratory/demo scripts that are not part
of the main package and should not count toward docstring coverage.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…TimeSeries

Sensitivity checks like PlaceboInTime and OutcomeFalsification need
to create fresh, unfitted copies of models. These _clone methods
preserve all configuration (components, sample_kwargs, mode) while
resetting fitted state.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…Time

Adds a "random" selection_method that randomly samples eligible placebo
windows from the pre-intervention period, with constraints on minimum
training fraction, minimum gap between folds, and optional period
exclusion. Also fixes the assurance simulation to correctly model the
alternative hypothesis as null baseline noise + expected treatment
effect (theta_new + expected_effect), matching the paper's formulation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Introduces outcome falsification, which re-fits the experiment with
alternative outcome formulas and reports their estimated effect sizes
with HDI intervals. This is an informational check (no pass/fail)
that lets researchers assess whether the pattern of effects across
outcomes is consistent with their causal story. Inspired by the
"causal detective" approach in Gallea (2026).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds a developer notebook documenting the placebo-in-time and
outcome falsification sensitivity check methodology with worked
examples.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@review-notebook-app

Copy link
Copy Markdown

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@codecov

codecov Bot commented Apr 6, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 96.81621% with 22 lines in your changes missing coverage. Please review.
✅ Project coverage is 95.12%. Comparing base (6fe2ec3) to head (ab50a40).

Files with missing lines Patch % Lines
causalpy/tests/test_timeseries_model_coverage.py 80.00% 9 Missing ⚠️
causalpy/checks/outcome_falsification.py 92.47% 3 Missing and 4 partials ⚠️
causalpy/checks/placebo_in_time.py 95.34% 2 Missing and 2 partials ⚠️
causalpy/pymc_models.py 66.66% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #826      +/-   ##
==========================================
+ Coverage   95.03%   95.12%   +0.08%     
==========================================
  Files          90       92       +2     
  Lines       14117    14803     +686     
  Branches      851      890      +39     
==========================================
+ Hits        13416    14081     +665     
- Misses        490      505      +15     
- Partials      211      217       +6     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@read-the-docs-community

read-the-docs-community Bot commented Apr 6, 2026

Copy link
Copy Markdown

@cetagostini cetagostini requested a review from drbenvincent April 6, 2026 08:10
@drbenvincent

Copy link
Copy Markdown
Collaborator

Bit of a mega payload here. There are a bunch of definitely different themes going on. In an ideal world, future PR's would be broken up into slightly more discrete PR's. Partly because it could be easier to revert any changes, partly because it's easier to review smaller PR's. But if there are genuine interactions/dependencies then cool.

Review incoming...

@drbenvincent drbenvincent left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! The sensitivity-checking work here is solid — the OutcomeFalsification check and the random fold selection for PlaceboInTime are both well-motivated features, and the article notebook is pedagogically excellent.

I have comments grouped by severity below.


Must-fix

1. _clone() methods should forward user priors (pymc_models.py)

Both new _clone() methods (BayesianBasisExpansionTimeSeries and StateSpaceTimeSeries) omit priors=self._user_priors, unlike the base PyMCModel._clone() which passes it through. If a user provided custom priors to the original model, the clone silently drops them, meaning the cloned model may behave differently from the original. Both methods should forward self._user_priors for consistency with the base class pattern.

2. selection_method should use Literal (placebo_in_time.py)

The project's AGENTS.md type-checking rules specify: "use Literal for constrained string parameters." Literal is already used across 15+ source files in the codebase. selection_method: str should be selection_method: Literal["sequential", "random"].


Should-fix

3. min_gap semantics are confusing (placebo_in_time.py)

The docstring says min_gap is the "minimum index distance between any two selected candidate positions in the sorted candidate list." But test_random_fold_respects_min_gap asserts that the actual treatment times differ by >= min_gap, not the candidate-list indices. These are different quantities — a gap of 5 in candidate-list positions doesn't necessarily mean 5 units of actual time between folds (candidates may not be contiguous).

Either:

  • Clarify the docstring to match the actual intent, or
  • Change the constraint to operate on actual time distance between selected folds, which is what users likely care about.

4. Assurance formula fix needs a targeted test (placebo_in_time.py)

The change to the assurance simulation (from true_effect = expected_effect to true_effect = theta_new + expected_effect) is a behavioral change to existing functionality. The commit message explains the rationale well, but there's no test that specifically validates the new formula is correct (or demonstrates the old one was wrong). A test that checks assurance power changes in the expected direction under a known scenario would give confidence in this fix.

5. Bare except Exception is too broad (outcome_falsification.py)

In the run() loop, except Exception swallows everything. This is convenient for robustness, but it can hide real bugs during development (e.g., an AttributeError from a code change in the experiment class). Consider catching a more targeted set of exceptions (e.g., (ValueError, PatsyError, RuntimeError)) and letting unexpected errors propagate. The exc_info=True logging is good, but users who don't check logs will never know something unexpected happened.

6. FalsificationResult stores the full fitted experiment (outcome_falsification.py)

Each FalsificationResult holds a reference to the complete fitted BaseExperiment (including its InferenceData). For a check with many falsification formulas, this could consume significant memory. Consider whether storing only the summary statistics (already captured in effect_mean, hdi_lower, hdi_upper) would suffice for most use cases, and document that the full experiments are retained for users who want to inspect posteriors.


Suggestions / nits

7. External data dependency in notebook

The notebook fetches data at runtime from https://nyc3.digitaloceanspaces.com/owid-public/data/.... If these URLs change or go down, the notebook breaks. Consider bundling the dataset as a CausalPy example dataset, or at minimum noting the dependency in a comment.

8. "Co2" should be "CO₂" in prose

Throughout the notebook markdown, "Co2" is used where "CO₂" (or at least "CO2") would be the standard scientific notation. Variable/column names like coal_co2 are fine, but the prose should use the conventional form.

9. Test setup duplication

Several integration tests in test_outcome_falsification.py repeat the exact same setup block (create data, create experiment, create context, populate experiment_config). A shared pytest fixture would reduce ~60 lines of duplication and make the tests easier to maintain.

10. Greedy fold selection can fail unnecessarily (placebo_in_time.py)

The random fold selection algorithm picks one fold at a time without backtracking. This means a bad early pick can make it impossible to select remaining folds even when valid selections exist. For small candidate pools with large min_gap, this could lead to unnecessary ValueErrors. A note in the docstring would help, or the algorithm could retry with shuffled order.


Things I liked

  • The "causal detective" pedagogical framing in the notebook is genuinely excellent.
  • The informational-only design (passed=None) for OutcomeFalsification is the right call — this check should inform, not gate.
  • The test coverage is thorough, with both unit tests (no sampling) and integration tests for both new features.
  • The interrogate config fix is a nice drive-by cleanup.

Must-fix:
- pymc_models: forward user priors through BayesianBasisExpansionTimeSeries
  and StateSpaceTimeSeries _clone() (add priors kwarg to both __init__s)
- placebo_in_time: tighten selection_method to Literal["sequential",
  "random"]

Should-fix:
- placebo_in_time: min_gap now measures observation-count distance
  between selected folds (tracked via each candidate's position in the
  sorted pre-period index), matching the test's intent
- test_placebo_in_time: add targeted unit tests pinning the corrected
  assurance formula (true_effect = theta_new + expected_effect)
- outcome_falsification: narrow the broad except to (PatsyError,
  FormulaException, DataException, ValueError, KeyError, RuntimeError,
  LinAlgError) and warnings.warn on skip so silent failures surface
- outcome_falsification: add store_experiments flag so callers can drop
  fitted experiments and keep only the summary stats

Nits:
- notebook: add comment flagging the OWID URL runtime dependency
- notebook: CO2 -> CO2 (with subscript) in markdown prose
- test_outcome_falsification: extract shared its_context fixture
- placebo_in_time: document greedy-selection failure mode in docstring
  and error message

Tests:
- add regression tests that _clone() preserves user_priors on both
  time-series models

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@drbenvincent drbenvincent left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review by claude-opus-4-7-xhigh. Code-only review — Ben will manually review the new its_place_in_time_analysis.ipynb docs page.

Thanks for the thorough turnaround on the prior round — the two new assurance-formula tests in particular are a nice way to pin the corrected math, and the _clone() priors regression tests directly address what was flagged. CI is green, patch coverage is 96.65%, and every item from the previous review is accounted for in 3ab9fd6.

A few remaining concerns and nits below, none of which I think are blockers.


Should-fix

1. min_gap=1 is silently a no-op (placebo_in_time.py)

With sampling without replacement, abs(pos_i - pos_j) >= 1 is trivially true for any two distinct candidates (positions are unique integers). So the documented default "minimum 1 observation gap" doesn't actually constrain anything beyond "don't pick the same candidate twice". Either:

  • document this honestly (e.g., "default 1 imposes no spacing constraint"), or
  • pick a more useful default such as intervention_length, which would nudge users toward non-overlapping placebo windows by default.

2. Random placebo folds can overlap each other (placebo_in_time.py)

The candidate filter pseudo_end > treatment_time prevents a fold from overlapping the real intervention, but nothing prevents two random folds whose pseudo-windows overlap each other (e.g., positions 40 and 45 with intervention_length=10). Overlapping folds share observations, which violates the exchangeability assumption implicit in the hierarchical status-quo model — each fold_mean is treated as an independent draw from a common mu_status_quo. A min_gap = intervention_length default in random mode would solve both this and (1); at minimum, a docstring warning would help.

3. store_experiments=True is still the default (outcome_falsification.py)

Adding the flag addresses the memory concern as an opt-out, but for the common case (a handful of falsification formulas, each retaining a full BaseExperiment with InferenceData) the footprint can run into hundreds of MB on PiecewiseITS or large samples. The summary stats are what most users actually need; consider flipping the default to False and documenting True as the inspect-the-posteriors opt-in.

4. _draw_expected_effect_samples ignores n for numpy arrays (placebo_in_time.py)

def _draw_expected_effect_samples(self, n: int) -> np.ndarray:
    """Draw samples from the expected-effect prior."""
    prior = self.expected_effect_prior
    if prior is None:
        raise ValueError("expected_effect_prior is not set.")
    if isinstance(prior, np.ndarray):
        return prior

When the user passes a numpy array, it is returned verbatim regardless of the requested n. _compute_assurance then wraps with i % len(expected_samples), so it doesn't crash, but a 10-element user-supplied array will cycle 10-to-1 against a 4000-sample theta_new without any signal to the user. Either tile/subsample to n, or document the cycling behaviour in the docstring (and ideally warn when len(prior) < n).

5. Greedy failure path is stochastic and unfriendly to reproducibility

The "retry with a different random_seed" suggestion in the error message defeats the purpose of passing a seed. A bounded internal retry loop (e.g., up to N reshuffles using deterministic sub-seeds derived from random_seed) would eliminate the failure for almost all realistic configurations while preserving reproducibility. Not a must-fix, but cheap and user-friendly.


Nits

6. OutcomeFalsification.__repr__ always shows store_experiments

Even at default. PlaceboInTime.__repr__ uses the nicer "only show non-default" pattern — worth mirroring here.

7. np.linalg.LinAlgError in the caught-exceptions tuple is dead defense

It's not actually raised by the PyMC sampling path for these models. Harmless, just noting.

8. Random-mode pre-period is effectively halved when treatment_end_time is unset

_compute_intervention_length falls back to data.index.max() - treatment_time, so the candidate filter pseudo_end > treatment_time requires pos_val + post_length < treatment_time. For a typical 75/25 split that leaves ~50% of the pre-period as eligible. Worth a one-line note in the random-mode docstring so users aren't surprised.

9. test_run_handles_failed_formula uses a syntactically invalid formula

The comment correctly attributes this to a Python 3.13 / patsy traceback interaction, but the more common real-world failure mode is a missing-column formula. A short link to the upstream bug (or a TODO to swap back when fixed) would be useful long term.


Process note (for future PRs, not this one)

Echoing the earlier "mega payload" point: five logically independent themes (interrogate config, time-series _clone(), random placebo folds, new OutcomeFalsification check, PanelRegression plot bug) in one PR makes review slower and revert harder. Worth keeping in mind for the next round rather than reshuffling this one.

@drbenvincent drbenvincent left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up by claude-opus-4-7-xhigh.

Must-fix: notebook is not wired into the docs

docs/source/notebooks/its_place_in_time_analysis.ipynb exists in the repo but is not referenced from any toctree. Specifically:

  • It is not in docs/source/notebooks/index.md (no toctree entry under any section).
  • It is not linked from docs/source/notebooks/sensitivity_checks.md, even though that page maintains a "Where examples already exist" list and a dedicated PlaceboInTime description that would be the natural pointers.
  • No :glob: directive picks it up automatically.

The Read the Docs preview happens to render the page because Sphinx builds every .ipynb it finds, but the page will be an orphan: not reachable from the sidebar or any index, and Sphinx will emit a "document isn't included in any toctree" warning at build time.

Question on placement

Where do you want this notebook to live in the navigation? My hunch is ITS-specific — the content is built around an ITS analysis of the UK 2008 climate intervention and PlaceboInTime is one of the headline diagnostics for ITS. In that case, it belongs in the existing Interrupted Time Series toctree in docs/source/notebooks/index.md:

:::{toctree}
:caption: Interrupted Time Series
:maxdepth: 1

its_skl.ipynb
its_pymc.ipynb
its_post_intervention_analysis.ipynb
its_covid.ipynb
its_lift_test.ipynb
its_place_in_time_analysis.ipynb
:::

Alternatively, since PlaceboInTime is a sensitivity check and the prose page sensitivity_checks.md already has a section for it, it could go in the Sensitivity Checks toctree:

:::{toctree}
:caption: Sensitivity Checks
:maxdepth: 1

sensitivity_checks.md
its_place_in_time_analysis.ipynb
:::

Could you confirm which section you want it under? Either way, please also add a discoverability link from sensitivity_checks.md — at minimum extending the existing line:

- `PlaceboInTime`: {doc}`pipeline_workflow`, {doc}`report_demo`, {doc}`its_place_in_time_analysis`

so users land on the worked example from the sensitivity-checks prose entry point.

@drbenvincent

Copy link
Copy Markdown
Collaborator

Ben thinks ITS is the most appropriate section

@drbenvincent drbenvincent left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments on the docs

For code cells focussing on plotting, add the hide-input cell tag. That will collapse the code so that it's hidden, but viewable if people want.

See what you can do about the sampler warnings

For cells with lots of output, add the hide-output tag. That should also be collapsable. It's fine if the output is the important point, but where you get lots of lines of sampler output and it's not a core part of the story, then just hide/collapse that stuff

More from me on the notebook soon - trying to fit this review in in the evening, but sleepiness is winning.

"Chain 1 reached the maximum tree depth. Increase `max_treedepth`, increase `target_accept` or reparameterize.\n",
"Chain 2 reached the maximum tree depth. Increase `max_treedepth`, increase `target_accept` or reparameterize.\n",
"Chain 3 reached the maximum tree depth. Increase `max_treedepth`, increase `target_accept` or reparameterize.\n",
"Sampling: [beta, delta, fourier_beta

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: promote plot_placebo_calibration into the codebase

claude-opus-4-7-xhigh here. Ben will manually review the docs page itself; this is a code-shape suggestion based on what the notebook defines.

The notebook defines a ~130-line helper plot_placebo_calibration(pit_check, original_result, title=...) and then calls it three times on different datasets. That repeated reuse is a strong signal it's general, not example-specific.

Why it belongs in the library

Looking at what the function actually consumes, it's entirely keyed on the PlaceboInTime CheckResult contract:

  • pit_check.metadata["fold_results"] (list of PlaceboFoldResult)
  • pit_check.metadata["null_samples"]
  • pit_check.metadata["actual_cumulative_mean"]
  • pit_check.metadata["p_effect_outside_null"]
  • original_result.post_impact — present on every applicable_methods member of PlaceboInTime (ITS and SyntheticControl)

It already handles both datetime and integer pseudo_treatment_time formatting, so it's general for every experiment type PlaceboInTime supports, not ITS-specific.

There is also already a documented home for this in CheckResult:

# causalpy/checks/base.py
figures : list
    Optional matplotlib figures produced by the check.
...
figures: list[Any] = field(default_factory=list)

…but no check currently populates it, so GenerateReport never has plots to surface. Promoting this helper would be the first concrete user of that field.

Suggested shape

Two clean options, not mutually exclusive:

  1. Method on the check classPlaceboInTime.plot_calibration(check_result, experiment) -> Figure. Mirrors how BaseExperiment exposes its plotting methods. Lives next to the metadata schema it depends on, in causalpy/checks/placebo_in_time.py.
  2. Auto-populate CheckResult.figures — call the plotter inside PlaceboInTime.run() (gated behind something like make_figures: bool = True) and append the Figure to the result. This makes GenerateReport strictly more useful for free.

I'd lean toward (1) as the building block and (2) as the default behaviour.

Cleanups before promoting

  • The original_result arg is only used to extract post_impact. Since run() already has experiment in scope, you could compute actual_samples once during run() and stash it in metadata (e.g. metadata["actual_cumulative_samples"]), then the plotter takes just the CheckResult — no second positional arg, no easy way to mismatch them.
  • Hard-coded FOLD_COLORS (5 colors) silently wraps past 5 folds; promote to a module-level _DEFAULT_FOLD_COLORS or use matplotlib's color cycle.
  • The print(...) fallback for "not enough folds completed" should be warnings.warn (or return an empty figure with an annotation) — print is noisy inside library code.
  • Drop the trailing plt.show(); return the Figure and let the caller / notebook display it.
  • Add figsize and axes kwargs so it's composable and unit-testable.

Scope

Happy for this to be a follow-up PR rather than expanding this one further — the notebook can keep its inline helper for now, and a follow-up can do the move + cleanups + a test that just asserts run(..., make_figures=True) produces one Figure of the expected shape. Flag if you'd rather roll it in here.

cetagostini added a commit that referenced this pull request Apr 24, 2026
Ben's review from 2026-04-23:

PlaceboInTime (checks/placebo_in_time.py):
- Add allow_overlap parameter (default False) that enforces non-overlap
  of pseudo-intervention windows, so random folds no longer violate
  the hierarchical model's exchangeability assumption by default.
- Replace the "retry with a different random_seed" error path with a
  bounded greedy-selection retry loop (MAX_RANDOM_SELECTION_RETRIES=16)
  using deterministic sub-seeds derived from random_seed; failure
  message now names the knobs to relax (allow_overlap, min_gap,
  n_folds).
- Warn when a pre-drawn numpy expected_effect_prior is shorter than
  the number of replications requested by the assurance simulation,
  documenting the cycling behaviour in _draw_expected_effect_samples.
- Surface allow_overlap=True in __repr__ following the same
  "non-default only" pattern as selection_method.
- Document how intervention_length falls back to
  data.index.max() - treatment_time when treatment_end_time is unset,
  shrinking the random-mode eligible window.

OutcomeFalsification (checks/outcome_falsification.py):
- Warn at run() when storing >= 3 fitted experiments, explaining that
  store_experiments=False keeps only summary statistics.
- Rewrite __repr__ to hide default alpha and store_experiments flags.
- Drop dead np.linalg.LinAlgError from the caught-exception tuple and
  the now-unused numpy import.

Docs:
- Wire its_place_in_time_analysis.ipynb into the ITS toctree so it
  stops being an orphan page.
- Cross-link the notebook from sensitivity_checks.md under the
  "Where examples already exist" list.
- Tag plot-only cells with hide-input and sampler-heavy cells with
  hide-output so the rendered page collapses non-essential chunks.

Tests:
- Pin allow_overlap default, the non-overlap invariant, the
  _windows_overlap helper for numeric and datetime indices, the
  allow_overlap opt-out, the bounded-retry reproducibility and
  exhaustion paths, and the expected-effect-prior cycling warning.
- Pin the new OutcomeFalsification __repr__ pattern, the
  store_experiments memory warning at run() time, and its opt-out and
  below-threshold paths.
- Expand the upstream-bug TODO in test_run_handles_failed_formula.

Made-with: Cursor
@drbenvincent

Copy link
Copy Markdown
Collaborator

Hi @cetagostini — heads up: this branch contains two commits that touch causalpy/experiments/panel_regression.py and overlap with the fix that just landed on main via #853 (which closes #852):

  • 8c91967d — "Fix mypy union-attr errors in PanelRegression"
  • f50caff5 — "Fix PanelRegression.plot_unit_effects with duplicate coeffs labels"

The merged solution in #853 takes a slightly different (and reviewed/CI-validated) approach for both:

  1. Uses # type: ignore[union-attr] only, rather than [attr-defined,union-attr] — only union-attr is the real error today, so the extra code is unnecessary.
  2. Keeps .sel(coeffs=label) and adds .squeeze("treated_units", drop=True).item() to match the existing idiom at lines 852–862 of the same file. (Patsy column names are unique by construction, so the "duplicate coeffs labels" diagnosis didn't apply to PanelRegression. If you have a real reproducer for that scenario, it's worth filing as a separate issue.)

Suggested next step:

  • Drop 8c91967d and f50caff5 from this branch (e.g. git rebase -i main and remove those two commits, or git reset + cherry-pick everything else).
  • Merge/rebase main to pick up the canonical fix from Fix mypy type: ignore codes in panel_regression.py #853.
  • This should also help unblock the current CONFLICTING merge state.

Happy to help walk through the rebase if useful.

Ben's review from 2026-04-23:

PlaceboInTime (checks/placebo_in_time.py):
- Add allow_overlap parameter (default False) that enforces non-overlap
  of pseudo-intervention windows, so random folds no longer violate
  the hierarchical model's exchangeability assumption by default.
- Replace the "retry with a different random_seed" error path with a
  bounded greedy-selection retry loop (MAX_RANDOM_SELECTION_RETRIES=16)
  using deterministic sub-seeds derived from random_seed; failure
  message now names the knobs to relax (allow_overlap, min_gap,
  n_folds).
- Warn when a pre-drawn numpy expected_effect_prior is shorter than
  the number of replications requested by the assurance simulation,
  documenting the cycling behaviour in _draw_expected_effect_samples.
- Surface allow_overlap=True in __repr__ following the same
  "non-default only" pattern as selection_method.
- Document how intervention_length falls back to
  data.index.max() - treatment_time when treatment_end_time is unset,
  shrinking the random-mode eligible window.

OutcomeFalsification (checks/outcome_falsification.py):
- Warn at run() when storing >= 3 fitted experiments, explaining that
  store_experiments=False keeps only summary statistics.
- Rewrite __repr__ to hide default alpha and store_experiments flags.
- Drop dead np.linalg.LinAlgError from the caught-exception tuple and
  the now-unused numpy import.

Docs:
- Wire its_place_in_time_analysis.ipynb into the ITS toctree so it
  stops being an orphan page.
- Cross-link the notebook from sensitivity_checks.md under the
  "Where examples already exist" list.
- Tag plot-only cells with hide-input and sampler-heavy cells with
  hide-output so the rendered page collapses non-essential chunks.

Tests:
- Pin allow_overlap default, the non-overlap invariant, the
  _windows_overlap helper for numeric and datetime indices, the
  allow_overlap opt-out, the bounded-retry reproducibility and
  exhaustion paths, and the expected-effect-prior cycling warning.
- Pin the new OutcomeFalsification __repr__ pattern, the
  store_experiments memory warning at run() time, and its opt-out and
  below-threshold paths.
- Expand the upstream-bug TODO in test_run_handles_failed_formula.

Made-with: Cursor
@cetagostini cetagostini force-pushed the cetagostini/working_on_placebos branch from 8342c7d to 76b477c Compare April 24, 2026 21:22
@cetagostini

cetagostini commented Apr 24, 2026

Copy link
Copy Markdown
Collaborator Author

@drbenvincent all good, I did already! The feedback was implemented already as well.

@cetagostini

Copy link
Copy Markdown
Collaborator Author

@drbenvincent quick ping here. All should be good now!

@cetagostini

Copy link
Copy Markdown
Collaborator Author

Maybe @juanitorduz can take a look as well this should solve #914 and similars.

@drbenvincent

Copy link
Copy Markdown
Collaborator

NOTE TO SELF: I think this just needs a quick read over the rendered docs from me and @juanitorduz maybe

@cetagostini

Copy link
Copy Markdown
Collaborator Author

@drbenvincent @juanitorduz small nudge again! jiji

@juanitorduz juanitorduz left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small initial comments

Comment on lines +142 to +144
# ------------------------------------------------------------------
# Validation
# ------------------------------------------------------------------

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove these comments :)

Comment on lines +168 to +170
# ------------------------------------------------------------------
# Factory helper
# ------------------------------------------------------------------

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

Comment on lines +200 to +202
# ------------------------------------------------------------------
# Effect extraction
# ------------------------------------------------------------------

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here (and all the following ones)

@daimon-pymclabs

Copy link
Copy Markdown

Detailed Review — PR #826

Reviewer: @juanitorduz (via Daimon)

This PR adds outcome falsification, random placebo fold selection, model _clone() support, and an article notebook. It has already gone through two rounds of review with Ben, which addressed many issues. Below is a fresh pass focusing on correctness, modularity, maintainability, type hints, docstrings, and tests.


1. Correctness

Assurance formula fix (placebo_in_time.py) — good catch, correctly fixed. The old code set true_effect = expected_effect, ignoring the theta_new null baseline noise. The fix true_effect = null_component + treatment_component correctly models the alternative hypothesis as the sum of both components, matching the paper's formulation. The two new unit tests (test_assurance_formula_zero_expected_matches_null and test_assurance_formula_large_expected_effect_dominates_baseline) pin the corrected behavior well.

_windows_overlap boundary semantics are correct. later < earlier + intervention_length gives half-open [start, start+L) windows, so back-to-back windows return False (non-overlapping). The datetime and numeric tests confirm this.

_extract_effect_stats assumes table.iloc[0] always has the right row. This is fine for the current effect_summary API, but if that method ever returns multi-row tables (e.g. multi-period effects), this will silently take the wrong row. Consider adding an assertion or at least a comment noting the assumption.

Minor: _draw_expected_effect_samples cycling behavior. When prior is an ndarray shorter than n, the method now warns but still silently cycles via prior[i % len(prior)]. The warning is good. However, the cycling logic itself lives in _compute_assurance, not in _draw_expected_effect_samples — the method just returns prior as-is and the cycling happens at index time. This is fine but the warning message could be confusing since the cycling isn't actually happening in the method that warns.


2. Code Modularity & Design

OutcomeFalsification is well-structured. Clean separation between validation, experiment building, effect extraction, and result assembly. The _build_experiment / _extract_effect_stats static methods make the class easy to test and extend.

_compute_random_fold_treatment_times is long (~100 lines) but logical. The candidate filtering → greedy selection → retry loop decomposition is clean. The _try_greedy_selection helper handles the constraint-satisfaction logic separately, which is good.

clone_model in base.py is a nice pattern — duck-typing _clone() with a deepcopy fallback keeps the model classes decoupled from the check infrastructure.

Suggestion: consider extracting _compute_random_fold_treatment_times's candidate-building loop. The date formatting + exclusion + min_training + pseudo_end filtering could be a _build_candidate_pool method. This would make _compute_random_fold_treatment_times easier to read and test the candidate logic in isolation.


3. Maintainability

Exception handling in OutcomeFalsification.run() is well-scoped. The caught tuple (PatsyError, FormulaException, DataException, ValueError, KeyError, RuntimeError) is specific enough to avoid swallowing unexpected errors while still being resilient to the realistic failure modes. The warnings.warn on skip is a good choice — silent failures would be much harder to debug.

MAX_RANDOM_SELECTION_RETRIES = 16 is a module-level constant, which is good. The error message when retries are exhausted is actionable — it names the knobs to relax (allow_overlap, min_gap, n_folds).

_STORE_EXPERIMENTS_WARN_THRESHOLD = 3 is reasonable. Three fitted PyMC experiments with full InferenceData can easily reach hundreds of MB.

Minor concern: pool.remove(pick) in _try_greedy_selection is O(n). Not a real problem since n_folds and candidate pools are small in practice, but worth noting.


4. Type Hints

Good overall. Key functions have return type annotations and parameter types. Specific notes:

  • selection_method: Literal["sequential", "random"] — correctly tightened from str per review feedback. ✅
  • _compute_random_fold_treatment_times returns list[Any] — the Any is unavoidable since treatment times can be numeric or datetime. Consider list[Any] → a type alias like TreatmentTime = Union[int, float, pd.Timestamp] to make intent clearer, but this is a nit.
  • FalsificationResult.experiment: BaseExperiment | None = None — correct, None when store_experiments=False. ✅
  • _clone() -> "PyMCModel" — forward reference string is fine. ✅
  • _windows_overlap(idx_a: Any, idx_b: Any, intervention_length: Any) -> bool — all Any is a bit loose. Same as above — a type alias would help readability but isn't blocking.

5. Docstrings

Module-level docstrings are excellent. outcome_falsification.py has a clear description of purpose, references Gallea (2026), and lists supported experiment types.

FalsificationResult dataclass has an Attributes section.

OutcomeFalsification class docstring documents all parameters.

_clone() methods have a one-liner docstring. Appropriate for internal methods. ✅

_try_greedy_selection and _windows_overlap lack docstrings. These are internal methods, so this is borderline. _windows_overlap in particular would benefit from a one-line docstring clarifying the half-open interval semantics (e.g., "True iff [a, a+L) and [b, b+L) share at least one point").

_compute_random_fold_treatment_times has no docstring. Given its complexity (~100 lines, multiple parameters, retry logic), a Parameters/Returns/Raises docstring would help future contributors understand the contract without reading the full body.


6. Tests

Coverage is thorough and well-organized. The test suite covers:

  • Construction/validation (no sampling — fast) ✅
  • Geometry/constraints (no sampling — fast) ✅
  • Integration (PyMC sampling — slow but necessary) ✅
  • Edge cases (failed formulas, infeasible constraints, cycling warnings) ✅

Specific strengths:

  • test_run_handles_failed_formula documents the Python 3.13/patsy interaction bug in a TODO — good practice.
  • The its_context fixture in test_outcome_falsification.py avoids repeated setup.
  • test_greedy_retry_raises_when_infeasible and test_greedy_retry_error_suggests_relaxing_constraints pin the error UX.
  • TestTimeSeriesModelClonePreservesPriors is a regression test for the priors forwarding fix — exactly the right kind of test.

One gap: no test for OutcomeFalsification with DifferenceInDifferences or PiecewiseITS. All integration tests use InterruptedTimeSeries. A parametrized test across the three applicable_methods would increase confidence, though this is arguably low-risk since the formula-swapping logic is method-agnostic.

Another gap: no negative test for _clone() verifying that the clone is truly unfitted. The tests verify _user_priors is preserved, but don't assert that the cloned model has no idata / fitted state. This would guard against future regressions where _clone() accidentally copies fitted state.


7. Notebook

I didn't do a line-by-line review of the notebook, but the commit history shows it's been through extensive iteration with Ben (heading structure, plot sizing, retina rendering, prose clarity, codespell/ruff fixes). The methodology documentation (fossil-fuel substitution case study, calibration plot interpretation, four-question structure) looks well-scaffolded.


Summary

This is a solid PR that adds meaningful sensitivity-analysis capabilities to CausalPy. The assurance formula bug fix is a real correctness improvement. The code is well-structured, the exception handling is thoughtful, and the test coverage is comprehensive. The main suggestions are:

  1. Add a docstring to _compute_random_fold_treatment_times (given its complexity)
  2. Add a one-liner docstring to _windows_overlap clarifying half-open interval semantics
  3. Consider a negative _clone() test asserting no fitted state is carried over
  4. Consider a parametrized OutcomeFalsification integration test across DiD/PiecewiseITS (low priority)
  5. The _extract_effect_stats iloc[0] assumption is fragile if effect_summary ever changes — a defensive assertion would be cheap insurance

None of these are blockers. The PR is in good shape.

@juanitorduz

Copy link
Copy Markdown
Collaborator

@cetagostini I think these are valid comments :) I suggest addressing them

@cetagostini

Copy link
Copy Markdown
Collaborator Author

@cetagostini I think these are valid comments :) I suggest addressing them

Yes, I can do in a few hours. The 5 points by Daimon?

@juanitorduz

Copy link
Copy Markdown
Collaborator

yes, if possible :)

@daimon-pymclabs

Copy link
Copy Markdown

Re-review — PR #826 (round 2)

Reviewer: @juanitorduz (via Daimon)

Carlos addressed all five suggestions from the first review. Here's a point-by-point check plus notes on the additional changes.


Addressed items

  1. _extract_effect_stats defensive assertion — Added validation for empty table and missing required columns (mean, hdi_lower, hdi_upper) with clear RuntimeError messages. The docstring now documents the Raises contract. ✅

  2. _windows_overlap docstring — Rewritten to clearly describe the half-open [idx, idx + L) semantics, the overlap condition abs(a - b) < L, and the back-to-back non-overlap convention. Also explains why idx + intervention_length arithmetic is used instead of direct Timedelta. ✅

  3. _compute_random_fold_treatment_times docstring — Expanded significantly into a two-stage algorithm description (candidate pool + greedy selection with retry). Documents the three candidate eligibility constraints, the retry mechanism with deterministic sub-seeds, and the error path. ✅

  4. Negative _clone() tests — New TestTimeSeriesModelCloneIsUnfitted class with four tests:

    • LinearRegression, BayesianBasisExpansionTimeSeries, StateSpaceTimeSeries each assert cloned.idata is None and cloned is not original
    • test_clone_after_fit_does_not_inherit_idata actually fits a LinearRegression with real sampling, then clones and verifies the clone has no idata while the original is untouched. This is the strongest possible test for the invariant. ✅
  5. Cross-method parametrized teststest_run_across_applicable_methods covers DiD and PiecewiseITS with proper data generators, context builders, and assertions on the full CheckResult shape. ITS was already covered by existing tests. ✅


Additional changes (not from my suggestions)

Removed section-separator comments (# ------- Validation ------- etc.) from placebo_in_time.py and outcome_falsification.py. Cleaner — the class is readable without them. ✅

Enriched PlaceboInTime.run() metadata with rope_half_width, threshold, expected_effect_prior, and fold_sds. Two new tests pin this:

  • test_run_metadata_carries_design_configuration — verifies all four fields round-trip correctly, including identity check on the prior array and shape/positivity check on fold_sds.
  • test_run_metadata_carries_defaults_when_unconfigured — verifies None/default values when ROPE/prior are not configured.

This is a good addition — downstream consumers (report generation, plotting) can now access the design configuration without needing to inspect the check object. ✅

type: ignore[arg-type] annotations on round_num calls in interrupted_time_series.py, piecewise_its.py, regression_discontinuity.py, and synthetic_control.py. These suppress a mypy complaint about the round_to parameter. Fine as a pragmatic fix. ✅


Verdict

All review feedback is addressed. The new tests are thorough and well-motivated. No new issues found. This PR is ready to merge from a code-quality perspective.

@cetagostini

Copy link
Copy Markdown
Collaborator Author

Everything address @juanitorduz

@cetagostini cetagostini requested a review from drbenvincent June 7, 2026 15:22
drbenvincent and others added 2 commits June 10, 2026 11:01
Rename its_place_in_time_analysis.ipynb to its_placebo_in_time_analysis.ipynb and update notebook index and sensitivity_checks cross-links. Merge latest main to stay current with upstream.

Co-authored-by: Cursor <cursoragent@cursor.com>
@drbenvincent

Copy link
Copy Markdown
Collaborator

@cetagostini sorry about the duration on this one. I had a few minor requests, but I just implemented them so I'm not standing in your way any more. Will merge once remote CI is green

@drbenvincent drbenvincent merged commit 261a84e into main Jun 10, 2026
19 checks passed
@drbenvincent drbenvincent deleted the cetagostini/working_on_placebos branch June 10, 2026 10:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants