Skip to content

Commit 1ee72cf

Browse files
committed
docs(tweedie-post): add reference links for claims
Link key claims to their sources: Dunn & Smyth (2005) paper, statsmodels GLM docs, statmod tweedie docs, Akshat's blog post, dataCar dataset, PyMC WAIC, and pymc-bart.
1 parent 0e42449 commit 1ee72cf

1 file changed

Lines changed: 10 additions & 10 deletions

File tree

docs/blog/posts/2026/pearson-phi-broken-tweedie.md

Lines changed: 10 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -11,11 +11,11 @@ comments: true
1111

1212
## The Problem
1313

14-
Insurance pure premium data has a distinctive shape: 90%+ of policies have zero claims, while the remaining few have positive amounts that are right-skewed and occasionally extreme. The Tweedie distribution is the standard tool for this setting — it naturally handles the zero-mass point and continuous positive tail through a single compound Poisson-Gamma process.
14+
Insurance pure premium data has a distinctive shape: 90%+ of policies have zero claims, while the remaining few have positive amounts that are right-skewed and occasionally extreme. The [Tweedie distribution](https://doi.org/10.1007/s11222-005-4070-y) is the standard tool for this setting — it naturally handles the zero-mass point and continuous positive tail through a single compound Poisson-Gamma process.
1515

16-
Here is the paradox. A well-known blog post on Tweedie GLMs for insurance reported something strange: the posterior predictive check predicted **99.95% zeros** against an observed **~94%**. The model collapsed to almost-all-zero predictions. This is not because Tweedie is the wrong distribution — it is because the dispersion parameter φ was estimated using the wrong tool.
16+
Here is the paradox. A [well-known blog post on Tweedie GLMs for insurance](https://akshat.blog/posts/fitting-tweedie-models-to-claims-data/) reported something strange: the posterior predictive check predicted **99.95% zeros** against an observed **~94%**. The model collapsed to almost-all-zero predictions. This is not because Tweedie is the wrong distribution — it is because the dispersion parameter φ was estimated using the wrong tool.
1717

18-
The default dispersion estimator in both R's `statmod::tweedie` and Python's `statsmodels` GLM is the **Pearson chi-squared statistic** divided by residual degrees of freedom. For zero-inflated data, this estimator is catastrophically biased — inflating φ by a factor of 5-56×. The consequence is a model that predicts nearly all zeros.
18+
The default dispersion estimator in both R's [`statmod::tweedie`](https://search.r-project.org/CRAN/refmans/statmod/html/tweedie.html) and Python's [`statsmodels` GLM](https://www.statsmodels.org/stable/glm.html) is the **Pearson chi-squared statistic** ([Wikipedia](https://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test)) divided by residual degrees of freedom. For zero-inflated data, this estimator is catastrophically biased — inflating φ by a factor of 5-56×. The consequence is a model that predicts nearly all zeros.
1919

2020
## Why Pearson φ Fails
2121

@@ -43,7 +43,7 @@ The MLE parameters are $10^{1000+}$ times more probable than the Pearson paramet
4343

4444
## The Fix: Series Log-PDF
4545

46-
The reason practitioners reach for Pearson φ is that the Tweedie density does not have a closed form. The likelihood is an infinite series (Dunn & Smyth, 2005):
46+
The reason practitioners reach for Pearson φ is that the Tweedie density does not have a closed form. The likelihood is an infinite series ([Dunn & Smyth, 2005](https://doi.org/10.1007/s11222-005-4070-y)):
4747

4848
$$ f(y; \mu, \phi, p) = \frac{1}{y} \sum_{j=1}^{\infty} W_j $$
4949

@@ -97,9 +97,9 @@ This `tweedie_logp_series` function becomes the `logp` for the `Tweedie` CustomD
9797
```
9898

9999
!!! info "Validation Against Reference"
100-
Our log-pdf matches the `tweedie` Python reference package to machine precision (all tested values show difference of exactly 0.000000). The implementation is verified for both datasets across the full support of the distribution.
100+
Our log-pdf matches the [`tweedie` Python reference package](https://pypi.org/project/tweedie/) to machine precision (all tested values show difference of exactly 0.000000). The implementation is verified for both datasets across the full support of the distribution.
101101

102-
R users will recognize this series likelihood — `statmod::tweedie.profile()` estimates φ and p via MLE using the same Dunn & Smyth (2005) expansion. The Bayesian approach builds on that foundation, adding full posterior uncertainty and predictive distributions via MCMC instead of point estimates.
102+
R users will recognize this series likelihood — [`statmod::tweedie.profile()`](https://www.rdocumentation.org/packages/tweedie/versions/2.3.5/topics/tweedie.profile) estimates φ and p via MLE using the same [Dunn & Smyth (2005)](https://doi.org/10.1007/s11222-005-4070-y) expansion. The Bayesian approach builds on that foundation, adding full posterior uncertainty and predictive distributions via MCMC instead of point estimates.
103103

104104
Together, two functions power the `Tweedie` CustomDist wrapper below: `tweedie_logp_series` for the log-pdf (inference) and `tweedie_random` for random draws (posterior predictive sampling):
105105

@@ -194,8 +194,8 @@ We fit the model on two canonical insurance datasets:
194194

195195
| Dataset | Policies | Zero Rate | Weighted Mean |
196196
|---------|----------|-----------|---------------|
197-
| **dataCar** (De Jong & Heller 2008) | 67,856 | 93.2% | \$293 |
198-
| **French Motor TPL** (Akshat's setup, 60k subset) | 60,000 | 96.3% | \$207 |
197+
| **dataCar** ([De Jong & Heller 2008](https://doi.org/10.1017/CBO9780511755408)) | 67,856 | 93.2% | \$293 |
198+
| **French Motor TPL** ([Akshat's setup](https://akshat.blog/posts/fitting-tweedie-models-to-claims-data/), 60k subset) | 60,000 | 96.3% | \$207 |
199199

200200
!!! note "Real Data vs Figure Scripts"
201201
The tables in this section contain results from fitting the Bayesian model to the actual dataCar and French TPL datasets. The figure scripts in the repository instead generate synthetic data from known Tweedie parameters that mimic these datasets — this keeps them self-contained and reproducible without requiring external data files. The figures illustrate the same phenomena (PPC quality, zero rate inflation, etc.) using parameter values matched to each dataset.
@@ -361,7 +361,7 @@ Dispersion remains virtually unchanged:
361361
| Intercept-only | 174.3 | 266.8 |
362362
| μ-GLM (22 / 7 features) | 174.9 | 265.9 |
363363

364-
Model comparison via Watanabe–Akaike Information Criterion (WAIC) confirms that the extra covariates do not materially improve predictive fit:
364+
Model comparison via [Watanabe–Akaike Information Criterion (WAIC)](https://www.pymc.io/projects/docs/en/stable/api/generated/pymc.waic.html) confirms that the extra covariates do not materially improve predictive fit:
365365

366366
| Model | WAIC | ΔWAIC | pWAIC | Weight |
367367
|-------|------|-------|-------|--------|
@@ -387,7 +387,7 @@ Three practical recommendations:
387387
## Possible Extensions
388388

389389
- **$p > 2$ for severity modeling** — the alternating sin series handles this case, though identifiability weakens
390-
- **BART for the mean structure** — nonparametric mean estimation via `pymc-bart` for automatic interaction and nonlinearity detection
390+
- **BART for the mean structure** — nonparametric mean estimation via [`pymc-bart`](https://www.pymc.io/projects/bart) for automatic interaction and nonlinearity detection
391391
- **Hurdle models** — separate models for claim frequency and severity for heavy-tailed data
392392
- **Double GLM (μ-φ DGLM)** — regressing dispersion $\phi$ on covariates could capture heteroskedasticity by risk class
393393

0 commit comments

Comments
 (0)