Skip to content

adding hierarchical its#833

Open
NathanielF wants to merge 9 commits into
pymc-labs:mainfrom
NathanielF:hierarchical_its_event_study
Open

adding hierarchical its#833
NathanielF wants to merge 9 commits into
pymc-labs:mainfrom
NathanielF:hierarchical_its_event_study

Conversation

@NathanielF
Copy link
Copy Markdown
Collaborator

Working on this ticket: #830

Signed-off-by: Nathaniel <NathanielF@users.noreply.github.com>
@review-notebook-app
Copy link
Copy Markdown

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 9, 2026

Codecov Report

❌ Patch coverage is 96.60537% with 24 lines in your changes missing coverage. Please review.
✅ Project coverage is 94.66%. Comparing base (711970c) to head (657f54d).
⚠️ Report is 15 commits behind head on main.

Files with missing lines Patch % Lines
causalpy/pymc_models.py 77.66% 10 Missing and 13 partials ⚠️
...xperiments/hierarchical_interrupted_time_series.py 99.72% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #833      +/-   ##
==========================================
+ Coverage   94.20%   94.66%   +0.45%     
==========================================
  Files          78       82       +4     
  Lines       11880    13515    +1635     
  Branches      695      850     +155     
==========================================
+ Hits        11192    12794    +1602     
- Misses        496      501       +5     
- Partials      192      220      +28     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@read-the-docs-community
Copy link
Copy Markdown

read-the-docs-community Bot commented Apr 9, 2026

Documentation build overview

📚 causalpy | 🛠️ Build #32424019 | 📁 Comparing 657f54d against latest (deb8774)

  🔍 Preview build  

278 files changed · + 100 added · ± 117 modified · - 61 deleted

+ Added

± Modified

- Deleted

Signed-off-by: Nathaniel <NathanielF@users.noreply.github.com>
Signed-off-by: Nathaniel <NathanielF@users.noreply.github.com>
Signed-off-by: Nathaniel <NathanielF@users.noreply.github.com>
Signed-off-by: Nathaniel <NathanielF@users.noreply.github.com>
Signed-off-by: Nathaniel <NathanielF@users.noreply.github.com>
Signed-off-by: Nathaniel <NathanielF@users.noreply.github.com>
@NathanielF NathanielF marked this pull request as ready for review April 10, 2026 17:10
@NathanielF
Copy link
Copy Markdown
Collaborator Author

I think this is good for a review.

@NathanielF
Copy link
Copy Markdown
Collaborator Author

Some things to consider when reviewing.

  • Seasonality is shared, not hierarchical. This is deliberate to control parameter count, but debatable; AR residuals compensate for unit-level deviations.
  • AR requires balanced panels I could probably do something with masking to handle imbalanced panels, but AR is opt-in via ar_residuals=True.
  • mu includes AR contribution — impact calculation is valid because AR cancels in the observed−counterfactual difference.
  • sigma_ar / sigma can trade off — recommend checking pair plots for identification.
  • Trend and AR can compete on slow drift. hierarchical shrinkage and the deterministic-vs-stochastic distinction should provide soft identification.

@drbenvincent drbenvincent added enhancement New feature or request major labels Apr 13, 2026
@drbenvincent
Copy link
Copy Markdown
Collaborator

drbenvincent commented Apr 24, 2026

Code review (claude-opus-4-4-xhigh)

Reviewing the code in this PR only — I have not reviewed docs/source/notebooks/hierarchical_its_launch.ipynb.

A genuinely useful addition — staggered-launch panels with hierarchical pooling and event-study/placebo parameterizations are a real gap in CausalPy, and the PyMC model is well-structured (non-centered everywhere, sensible data-adaptive priors, optional Fourier/AR machinery). The 32 tests pass locally. Below is a critical-but-constructive pass focused on correctness, repo consistency, and test coverage.


Critical issues (correctness)

1. Silent data-corruption bug in AR(1) within_unit_tidx

causalpy/experiments/hierarchical_interrupted_time_series.py lines 246–249:

self._n_time_steps = int(counts[0])
# Compute within-unit sequential index (assumes data sorted by unit)
self._within_unit_tidx = np.concatenate(
    [np.arange(self._n_time_steps) for _ in range(n_units)]
)

The comment ("assumes data sorted by unit") admits the assumption, but the code never validates it and never sorts. If a user passes a panel that isn't sorted unit-by-unit (very common — pandas operations often shuffle), within_unit_tidx becomes meaningless and the AR(1) innovations get associated with the wrong (unit, time) cells in ar_resid_matrix[within_unit_tidx_, unit_idx_] (pymc_models.py:2352).

Reproducer on a shuffled panel (no model fit needed):

unit_idx (shuffled order):     [2 0 1 0 2 0 1 1 2]
within_unit_tidx (assigned):   [0 1 2 0 1 2 0 1 2]
within_unit_tidx (correct):    [0 0 0 1 1 2 1 2 2]   # df.groupby(unit).cumcount()

Worse, this even applies to time_col ordering: _n_time_steps only checks counts.min() == counts.max(); a unit with the same number of rows but different timestamps still passes silently.

Fix: sort df by (unit_col, time_col) early in _prepare_data, recompute unit_idx, and derive within_unit_tidx via df.groupby(unit_col).cumcount(). A regression test on a df.sample(frac=1)-shuffled panel would have caught this.

2. _aux zeroing skips the pre-launch leads in placebo

if self.effect_type == "instant":
    post = self._post
    if not effect_on and post is not None:
        post = np.zeros_like(post)
    aux["post"] = post
else:
    D = self._D
    if not effect_on and D is not None:
        D = np.zeros_like(D)
    aux["D"] = D

For effect_type="placebo", zeroing all of D to construct the counterfactual zeroes both pre-launch leads and post-launch effects. The pre-launch bins are placebos — they belong in the "no-intervention" world. Zeroing them inflates the counterfactual deviation in the pre-period and contaminates the impact trace and any cumulative summaries.

Fix: for placebo the counterfactual mask should keep the pre-launch columns "on" and zero only the post columns (D[:, K_pre:] = 0). A test should assert that impact[pre_period].mean() is small relative to impact[post_period].mean() for a synthetic dataset with no anticipation.

3. treatment_time_col is never validated for constancy within a unit

A user could legitimately pass a column with different launch_week values for different rows of the same unit (e.g. accidental join). The model uses the per-row value to compute tau, so inconsistent rows silently produce a meaningless event-time mapping. A simple df.groupby(unit_col)[treatment_time_col].nunique().max() == 1 check belongs in _validate_inputs.

4. print_coefficients() is broken on this experiment

BaseExperiment.print_coefficients() calls self.model.print_coefficients(self.labels, …) (base.py:88), and PyMCModel.print_coefficients does az.extract(self.idata.posterior, var_names="beta").sel(treated_units=unit) (pymc_models.py:453). The hierarchical model's beta has dims ["unit", "coeffs"], not ["treated_units", "coeffs"], and the sigma is sigma_beta rather than y_hat_sigma. Calling result.print_coefficients() will raise.

Fix: either override print_coefficients in the experiment to print mu_beta/sigma_beta/per-unit alpha, or document that it's unsupported and raise a clearer error. Add a test either way.


High-value issues (style, API, consistency)

5. Naming inconsistent with every other experiment

Every other experiment uses public method names algorithm() (called from __init__); this PR uses _algorithm() and _prepare_data():

causalpy/experiments/regression_kink.py:90:        self.algorithm()
causalpy/experiments/staggered_did.py:191:        self.algorithm()
…
causalpy/experiments/hierarchical_interrupted_time_series.py:160:        self._algorithm()

This is the lone holdout among 10 experiments. Rename for consistency.

6. effect_summary accepts 8 parameters and ignores 7 of them

def effect_summary(
    self,
    *,
    window: ... = "post",
    direction: ... = "increase",
    alpha: float = 0.05,
    cumulative: bool = True,
    relative: bool = True,
    min_effect: float | None = None,
    treated_unit: str | None = None,
    period: ... = None,
    prefix: str = "Post-period",
    **kwargs: Any,
) -> EffectSummary:

Only alpha and prefix are read. The base class requires accepting these args, but the implementation should at minimum:

  • use direction to compute the appropriate tail probability (right now prob_positive is hardcoded samples > 0 regardless of direction="decrease" — users asking about a decrease get the wrong column),
  • use min_effect for ROPE,
  • emit warnings.warn (or raise NotImplementedError) for the truly unsupported ones (window, cumulative, relative, period) so users don't think they did something.

See causalpy/experiments/interrupted_time_series.py for how the existing ITS handles direction/min_effect.

7. Triple-call posterior-predictive sampling

def _algorithm(self) -> None:
    model: HierarchicalLaunchITS = self.model  # type: ignore[assignment]
    model.fit(X=self.X, y=self.y, coords=self._coords, aux=self._aux(effect_on=True))
    self.score = model.score(X=self.X, y=self.y)
    self.observed_pred = model.predict(X=self.X, aux=self._aux(effect_on=True))
    self.counterfactual_pred = model.predict(X=self.X, aux=self._aux(effect_on=False))

model.score(X, y) internally calls model.predict(X) (which runs pm.sample_posterior_predictive), then observed_pred = model.predict(...) runs it again with the same aux. For real (non-mocked) sampling this can dominate fit time. Consider computing score from the cached observed_pred.

A subtler related issue: score calls model.predict(X=self.X) without aux. This works because _pred_aux falls back to self._aux set during fit, but it's implicit and fragile.

8. Side-effecting __init__ swallows unknown kwargs

def __init__(self, …, **kwargs: Any) (line 138) collects **kwargs and silently drops them. A typo like seasonalty=… will be silently swallowed. Either drop **kwargs or pass to super().

9. Pytensor scan deprecation warning

DeprecationWarning: Scan return signature will change. Updates dict will not be returned…
Pass `return_updates=False` to conform to the new API and avoid this warning

Triggered by pytensor.scan(...) around pymc_models.py:2344. Add return_updates=False so the new code lands without immediately accumulating warnings on AR runs.

10. Standardization of covariates is irreversible and undocumented

if X_values.shape[1] > 0:
    self._x_mean = X_values.mean(axis=0)
    self._x_std = X_values.std(axis=0)
    self._x_std[self._x_std == 0] = 1.0
    X_values = (X_values - self._x_mean) / self._x_std

Posterior mu_beta/sigma_beta are now on the standardized scale, which makes downstream interpretation tricky (an ETable showing beta in standardized units is misleading). Either document this in summary()/effect_summary() or expose unstandardized coefficients (e.g. as Deterministics scaled back).

11. Fourier _fourier_terms uses raw time_col units

def _fourier_terms(t: np.ndarray, period: float, K: int) -> np.ndarray:
    …
    cols.append(np.sin(2 * np.pi * k * t / period))

period is in the same units as time_col. With unhelpful time_col values (offset integer counters, datetimes coerced via .view('i8'), etc.) this is a footgun. Add a docstring example, and have test_with_seasonality actually assert columns are non-degenerate (currently only checks the variable exists).


Test coverage gaps

The test suite is thoughtful — happy paths plus a TestValidation block covering 7 negative scenarios — but several important behaviours are untested.

Missing tests (high priority)

  1. AR with shuffled rows (point 1 above): test_ar_residuals_unsorted_panel — pass panel.sample(frac=1, random_state=0).reset_index(drop=True) and assert that the model either raises or fits with the correct within_unit_tidx.
  2. Inconsistent treatment_time_col per unit (point 3 above): assert ValueError when one unit has two different launch_week values.
  3. Counterfactual correctness for placebo (point 2 above): synthesize a panel with no true effect, fit placebo, and assert that pre-period impact.mean(("chain","draw")) is near zero.
  4. print_coefficients() doesn't crash (point 4 above): currently has zero coverage.
  5. generate_report() end-to-end — none of the experiment-level reporting machinery is exercised. Other experiments cover this.
  6. get_plot_data_bayesian() / get_plot_data() — not implemented and not tested. The base class default raises NotImplementedError. Either implement it (returning data with predicted/counterfactual/impact columns appended) or test that it raises a clean error.
  7. Maketables integrationresult.__maketables_coef_table__ likely fails for the same reason as print_coefficients (it expects treated_units dim). Add a regression test in test_maketables_plugin.py style.
  8. set_maketables_options(hdi_prob=…) — never tested for this experiment.
  9. predictive_for_new_unit shape & finiteness for event_study with seasonality + AR — combined-feature happy paths are completely missing (every test enables one feature in isolation).
  10. model=None path — does the default HierarchicalLaunchITS() get instantiated with sensible defaults? _default_model_class wiring isn't covered.

Missing validation tests

  1. Empty data / single rowpd.DataFrame() and a one-row dataframe.
  2. Single unit — does the hierarchical model degenerate gracefully when n_units == 1?
  3. All units have the same launch_week (i.e. simultaneous treatment).
  4. time_col with NaN / negative values / floats.
  5. Bin edges that don't cover any observations — e.g. bin_edges=[100, 200, 300] for tau in [-10, 30]. _assign_bins returns -1 for everything and D is all zeros. Should this raise?
  6. bin_edges of length 1 (zero bins): K_bins = 0, D becomes shape (n, 0), mu_delta has event_bin size 0. Likely crashes deep in PyMC; needs an explicit check.
  7. placebo_edges overlapping with bin_edges (e.g. placebo_edges=[-4, 2], bin_edges=[0, 4]). Currently silently wrong.
  8. Seasonality with K=0 — produces empty design.
  9. Missing keys in seasonality dict ({"period": 7} without "K") — currently raises KeyError, should raise ValueError with a clearer message.

Test-quality / robustness

  1. The panel fixture is constructed unit-by-unit, so it's already sorted — that masks the AR sorting bug (point 1) entirely. Consider shuffling in the fixture as a default, or adding a parametrized [sorted, shuffled] axis.
  2. _make_panel(true_lift=8.0) is set up but the tests never assert recovered mu_lift is near 8.0 (only existence in the InferenceData). With mock_pymc_sample recovery isn't possible, but a single non-mocked test using small tune=20, draws=20, chains=2 gated behind @pytest.mark.slow would catch a regression where the model accidentally fits the wrong sign or magnitude. See test_pymc_models.py:25 for the established pattern.
  3. test_predictive_unfitted_model uses cls.__new__(cls) to skip __init__. That's brittle — if anyone reorders attributes the test still passes for the wrong reason. Better: call predictive_for_new_unit on a freshly-constructed (unfitted) HierarchicalLaunchITS directly.
  4. No @pytest.mark.integration marker used (the project defines it in pyproject.toml:121). Marking the heavier "fit + plot + summary + report" flow as integration would help CI partitioning.
  5. Codecov reports 35 missing lines on the patch — most are likely the AR branch and the placebo branches in pymc_models.py plus the rarely-exercised plotting branches.

Minor / nits

  • seasonality: dict | None = None — the inner shape is undocumented ({"period": float, "K": int}). Use a TypedDict for clarity.
  • _placebo_check_text hardcodes HDI_PROB (94%) for pass/fail, while effect_summary accepts alpha. Share a single threshold.
  • Docstring at module top says "event-study-style" but the class supports three explicit types — update the wording.
  • Line 248: np.concatenate([np.arange(n) for _ in range(n_units)]) is np.tile(np.arange(n), n_units).
  • _fourier_terms returns shape (n, 2K); the label generator emits f0..f(2K-1), which doesn't distinguish sin/cos. Consider f_sin_1, f_cos_1, f_sin_2, … for posterior interpretability.
  • self.score = model.score(...) overwrites whatever score attribute might be inherited; not a bug, just surprising.

Suggested ordering for the author

  1. Block on: fix points 1–4 above (AR sorting, placebo counterfactual, treatment-time validation, print_coefficients), with a regression test for each.
  2. Should fix before merge: rename _algorithmalgorithm, audit effect_summary ignored args, kill the pytensor scan deprecation warning.
  3. Defer to a follow-up issue: standardization surfacing in summaries, get_plot_data_bayesian/report integration, combined-feature tests, recovery test gated behind @pytest.mark.slow.

The bones of this PR are solid and the modeling is well-thought-out — the AR indexing bug and the placebo counterfactual issue are the two I would not let through without fixes plus tests.

Signed-off-by: Nathaniel <NathanielF@users.noreply.github.com>
Signed-off-by: Nathaniel <NathanielF@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request major

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants