You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
docs(tweedie-post): add reference links for claims
Link key claims to their sources: Dunn & Smyth (2005) paper, statsmodels
GLM docs, statmod tweedie docs, Akshat's blog post, dataCar dataset,
PyMC WAIC, and pymc-bart.
Copy file name to clipboardExpand all lines: docs/blog/posts/2026/pearson-phi-broken-tweedie.md
+10-10Lines changed: 10 additions & 10 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -11,11 +11,11 @@ comments: true
11
11
12
12
## The Problem
13
13
14
-
Insurance pure premium data has a distinctive shape: 90%+ of policies have zero claims, while the remaining few have positive amounts that are right-skewed and occasionally extreme. The Tweedie distribution is the standard tool for this setting — it naturally handles the zero-mass point and continuous positive tail through a single compound Poisson-Gamma process.
14
+
Insurance pure premium data has a distinctive shape: 90%+ of policies have zero claims, while the remaining few have positive amounts that are right-skewed and occasionally extreme. The [Tweedie distribution](https://doi.org/10.1007/s11222-005-4070-y) is the standard tool for this setting — it naturally handles the zero-mass point and continuous positive tail through a single compound Poisson-Gamma process.
15
15
16
-
Here is the paradox. A well-known blog post on Tweedie GLMs for insurance reported something strange: the posterior predictive check predicted **99.95% zeros** against an observed **~94%**. The model collapsed to almost-all-zero predictions. This is not because Tweedie is the wrong distribution — it is because the dispersion parameter φ was estimated using the wrong tool.
16
+
Here is the paradox. A [well-known blog post on Tweedie GLMs for insurance](https://akshat.blog/posts/fitting-tweedie-models-to-claims-data/) reported something strange: the posterior predictive check predicted **99.95% zeros** against an observed **~94%**. The model collapsed to almost-all-zero predictions. This is not because Tweedie is the wrong distribution — it is because the dispersion parameter φ was estimated using the wrong tool.
17
17
18
-
The default dispersion estimator in both R's `statmod::tweedie` and Python's `statsmodels` GLM is the **Pearson chi-squared statistic** divided by residual degrees of freedom. For zero-inflated data, this estimator is catastrophically biased — inflating φ by a factor of 5-56×. The consequence is a model that predicts nearly all zeros.
18
+
The default dispersion estimator in both R's [`statmod::tweedie`](https://search.r-project.org/CRAN/refmans/statmod/html/tweedie.html) and Python's [`statsmodels` GLM](https://www.statsmodels.org/stable/glm.html) is the **Pearson chi-squared statistic** ([Wikipedia](https://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test)) divided by residual degrees of freedom. For zero-inflated data, this estimator is catastrophically biased — inflating φ by a factor of 5-56×. The consequence is a model that predicts nearly all zeros.
19
19
20
20
## Why Pearson φ Fails
21
21
@@ -43,7 +43,7 @@ The MLE parameters are $10^{1000+}$ times more probable than the Pearson paramet
43
43
44
44
## The Fix: Series Log-PDF
45
45
46
-
The reason practitioners reach for Pearson φ is that the Tweedie density does not have a closed form. The likelihood is an infinite series (Dunn & Smyth, 2005):
46
+
The reason practitioners reach for Pearson φ is that the Tweedie density does not have a closed form. The likelihood is an infinite series ([Dunn & Smyth, 2005](https://doi.org/10.1007/s11222-005-4070-y)):
@@ -97,9 +97,9 @@ This `tweedie_logp_series` function becomes the `logp` for the `Tweedie` CustomD
97
97
```
98
98
99
99
!!! info "Validation Against Reference"
100
-
Our log-pdf matches the `tweedie` Python reference package to machine precision (all tested values show difference of exactly 0.000000). The implementation is verified for both datasets across the full support of the distribution.
100
+
Our log-pdf matches the [`tweedie` Python reference package](https://pypi.org/project/tweedie/) to machine precision (all tested values show difference of exactly 0.000000). The implementation is verified for both datasets across the full support of the distribution.
101
101
102
-
R users will recognize this series likelihood — `statmod::tweedie.profile()` estimates φ and p via MLE using the same Dunn & Smyth (2005) expansion. The Bayesian approach builds on that foundation, adding full posterior uncertainty and predictive distributions via MCMC instead of point estimates.
102
+
R users will recognize this series likelihood — [`statmod::tweedie.profile()`](https://www.rdocumentation.org/packages/tweedie/versions/2.3.5/topics/tweedie.profile) estimates φ and p via MLE using the same [Dunn & Smyth (2005)](https://doi.org/10.1007/s11222-005-4070-y) expansion. The Bayesian approach builds on that foundation, adding full posterior uncertainty and predictive distributions via MCMC instead of point estimates.
103
103
104
104
Together, two functions power the `Tweedie` CustomDist wrapper below: `tweedie_logp_series` for the log-pdf (inference) and `tweedie_random` for random draws (posterior predictive sampling):
105
105
@@ -194,8 +194,8 @@ We fit the model on two canonical insurance datasets:
194
194
195
195
| Dataset | Policies | Zero Rate | Weighted Mean |
The tables in this section contain results from fitting the Bayesian model to the actual dataCar and French TPL datasets. The figure scripts in the repository instead generate synthetic data from known Tweedie parameters that mimic these datasets — this keeps them self-contained and reproducible without requiring external data files. The figures illustrate the same phenomena (PPC quality, zero rate inflation, etc.) using parameter values matched to each dataset.
Model comparison via Watanabe–Akaike Information Criterion (WAIC) confirms that the extra covariates do not materially improve predictive fit:
364
+
Model comparison via [Watanabe–Akaike Information Criterion (WAIC)](https://www.pymc.io/projects/docs/en/stable/api/generated/pymc.waic.html) confirms that the extra covariates do not materially improve predictive fit:
365
365
366
366
| Model | WAIC | ΔWAIC | pWAIC | Weight |
367
367
|-------|------|-------|-------|--------|
@@ -387,7 +387,7 @@ Three practical recommendations:
387
387
## Possible Extensions
388
388
389
389
-**$p > 2$ for severity modeling** — the alternating sin series handles this case, though identifiability weakens
390
-
-**BART for the mean structure** — nonparametric mean estimation via `pymc-bart` for automatic interaction and nonlinearity detection
390
+
-**BART for the mean structure** — nonparametric mean estimation via [`pymc-bart`](https://www.pymc.io/projects/bart) for automatic interaction and nonlinearity detection
391
391
-**Hurdle models** — separate models for claim frequency and severity for heavy-tailed data
392
392
-**Double GLM (μ-φ DGLM)** — regressing dispersion $\phi$ on covariates could capture heteroskedasticity by risk class
0 commit comments