Coverage-standardized Hill diversity (tl.hill_diversity_profile) by KilianMaire · Pull Request #714 · scverse/scirpy

KilianMaire · 2026-06-17T15:01:56Z

Close #535

Summary

Adds coverage-standardized Hill-number diversity to scirpy.tl:

tl.hill_diversity_profile computes a Hill diversity profile over a range of orders q, standardized to a common sample coverage so that profiles are comparable across samples of different sequencing depth.
tl.convert_hill_table converts a profile into the classical alpha diversity indices (observed richness, Shannon entropy, inverse Simpson, Gini-Simpson) and into evenness measures.

This builds on #535 and addresses the open question raised there about sequencing-depth correction. Credit to @MKanetscheider for the original hill_diversity_profile / convert_hill_table design and for the convert_hill_table conversions requested by @FFinotello; this PR keeps those public signatures and replaces the estimation engine.

Motivation

The plug-in Hill estimator grows with sampling depth at every order q (richness most strongly, but inverse Simpson too), so two samples sequenced to different depths show different profiles even when the underlying repertoire is identical. This is the confounder discussed in #535, and it is more acute for scRNA-seq, where the number of cells varies widely between samples.

The established fix is the iNEXT framework (Chao et al. 2014; Hsieh, Ma & Chao 2016): estimate Hill numbers and standardize them to a common sample coverage before comparing. tl.hill_diversity_profile standardizes all groups to a shared coverage (iNEXT's Cmax rule) and returns the standardized profile.
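The depth dependence described above can be illustrated with a small numpy sketch. This is not the scirpy or hillrep implementation: `hill_number` is the naive plug-in estimator, `good_turing_coverage` is the simple Good-Turing coverage estimate (iNEXT uses a refined variant), and the simulated repertoire is made up purely for illustration.

```python
import numpy as np


def hill_number(counts, q):
    """Plug-in Hill number of order q from clonotype counts."""
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()
    if np.isclose(q, 1.0):
        # q = 1 is defined as the limit: exponential of the Shannon entropy.
        return float(np.exp(-np.sum(p * np.log(p))))
    return float(np.sum(p**q) ** (1.0 / (1.0 - q)))


def good_turing_coverage(counts):
    """Estimated sample coverage: 1 - (number of singletons) / (sample size)."""
    counts = np.asarray(counts)
    n = counts.sum()
    f1 = np.sum(counts == 1)
    return float(1.0 - f1 / n)


# Hypothetical repertoire: 500 clonotypes with a skewed abundance distribution.
rng = np.random.default_rng(0)
true_p = rng.dirichlet(np.full(500, 0.3))

# One deep sample of 2000 cells; the shallow sample is its first 200 cells,
# so both come from the same underlying repertoire.
deep_cells = rng.choice(500, size=2000, p=true_p)
shallow_cells = deep_cells[:200]

deep_counts = np.bincount(deep_cells, minlength=500)
shallow_counts = np.bincount(shallow_cells, minlength=500)

for q in (0, 1, 2):
    print(q, hill_number(shallow_counts, q), hill_number(deep_counts, q))
print("coverage:", good_turing_coverage(shallow_counts), good_turing_coverage(deep_counts))
```

Because the shallow sample is a subset of the deep one, its richness (q = 0) can never exceed that of the deep sample, even though the repertoire is the same. The coverage estimate quantifies how complete each sample is, which is the quantity the profiles are standardized to.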

What is new compared to #535

The naive plug-in engine is replaced with the coverage-standardized estimator.
When the groups cannot be standardized to a fully reliable shared coverage (for example one group is heavily undersampled), a warning is raised rather than silently returning a number.
Unit tests are added for both functions, including the warning path. The numeric output is checked against the underlying estimator to a relative tolerance of 1e-9, for both AnnData and MuData.
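The warning path can be pictured roughly as follows. This is only a sketch under stated assumptions: the helper name `shared_coverage`, the `min_reliable` threshold, and the use of the smallest per-group coverage as the shared target are hypothetical simplifications, not the rule implemented in this PR or in hillrep.

```python
import warnings


def shared_coverage(coverages, min_reliable=0.5):
    """Pick a common coverage target and warn if it looks unreliable.

    `coverages` maps group name -> estimated sample coverage in [0, 1].
    Both the min-rule and the threshold are illustrative assumptions.
    """
    target = min(coverages.values())
    if target < min_reliable:
        worst = min(coverages, key=coverages.get)
        warnings.warn(
            f"Group {worst!r} has coverage {target:.2f}; "
            "profiles may not be comparable across groups.",
            UserWarning,
            stacklevel=2,
        )
    return target


# Usage: one well-sampled group and one heavily undersampled group.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    target = shared_coverage({"A": 0.95, "B": 0.31})
print(target, [str(w.message) for w in caught])
```

The point of the design is that a number is still returned, but the caller is told that the comparison rests on weak ground.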

Implementation notes

Estimation is delegated to the hillrep package, whose kernels are validated against R iNEXT 3.0.2 to a relative tolerance of 1e-6. hillrep is pure numpy/scipy/pandas and is added as an optional dependency under the existing diversity extra, alongside scikit-bio.
The backend is isolated behind a single private helper (_coverage_hill_profile), so swapping the dependency for a vendored estimator later would be a localized change.
Group counts are extracted via DataHandler, so both AnnData and MuData are supported (tests cover both).

Open question for maintainers

One design decision I would like your call on:

(A) scirpy takes hillrep as an optional dependency (as in this PR), so the validated kernels live in one place and stay in sync with the R reference; or
(B) the coverage estimator is vendored into _diversity.py, keeping scirpy dependency-free at the cost of duplicating the math.

This PR implements (A) because hillrep is dependency-light, but it is your dependency policy and your call. Switching to (B) would be localized to _coverage_hill_profile.

API

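import scirpy as ir

# Hill diversity profile per sample, for orders from q_min to q_max in steps
# of q_step, standardized to a shared sample coverage.
profile = ir.tl.hill_diversity_profile(
    adata, groupby="sample", target_col="clone_id", q_min=0, q_max=2, q_step=1
)
# Convert the profile into the classical alpha diversity indices.
indices = ir.tl.convert_hill_table(profile, convert_to="diversity")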

hill_diversity_profile returns a DataFrame with one row per diversity order q and one column per group, which plots directly with seaborn and flows into convert_hill_table.
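The relationship between a Hill profile and the classical indices is standard math and can be sketched independently of scirpy. The example below is not `convert_hill_table` itself: the function name `classical_indices`, the output layout, and the numbers in `profile` are assumptions made for illustration, and only the described shape (index holds q, one column per group) is taken from the PR.

```python
import numpy as np
import pandas as pd

# Hypothetical profile with the described shape: index = order q, columns = groups.
profile = pd.DataFrame(
    {"sample_A": [120.0, 45.0, 20.0], "sample_B": [80.0, 60.0, 50.0]},
    index=pd.Index([0, 1, 2], name="q"),
)


def classical_indices(profile):
    """Derive classical alpha diversity indices from Hill numbers of order 0, 1, 2."""
    missing = [q for q in (0, 1, 2) if q not in profile.index]
    if missing:
        raise ValueError(f"Profile is missing required orders: {missing}")
    d0, d1, d2 = profile.loc[0], profile.loc[1], profile.loc[2]
    return pd.DataFrame(
        {
            "observed_richness": d0,         # Hill number of order 0
            "shannon_entropy": np.log(d1),   # ln of the order-1 Hill number
            "inverse_simpson": d2,           # Hill number of order 2
            "gini_simpson": 1.0 - 1.0 / d2,  # 1 - 1 / (order-2 Hill number)
        }
    ).T


indices = classical_indices(profile)
print(indices)
```

The conversion only needs orders 0, 1 and 2, which is why a profile that lacks one of them has to be rejected rather than interpolated.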

Checklist

CHANGELOG.md updated
Tests added (for the new functions, including the not-comparable warning)
Tutorial updated (Hill diversity section added to the 5k BCR tutorial)

Add tl.hill_diversity_profile and tl.convert_hill_table for coverage-based Hill-number diversity. Profiles are standardized to a common sample coverage (iNEXT framework) so they are comparable across samples of different sequencing depth, and a warning is raised when a fair comparison is not supported. Estimation is delegated to the hillrep package, added as an optional dependency under the diversity extra. Builds on the hill_diversity_profile / convert_hill_table design from scverse#535 by Mario Kanetscheider, keeping those public signatures and replacing the plug-in estimator with the coverage-standardized one.

KilianMaire · 2026-06-17T15:10:43Z

Note on the red CI: all the tests added in this PR pass on every environment (test_hill_diversity_profile[AnnData], [MuData], test_convert_hill_table, test_hill_diversity_profile_warns_when_not_comparable).

The two failing items are pre-existing on main and unrelated to this PR:

test_io.py::test_convert_dandelion fails with ImportError: Please install dandelion (sc-dandelion not available in the test env).
test_plotting.py fails at collection with duplicate parametrization of 'adata_clonotype'.

Both also fail on the latest main run and on other open PRs (e.g. #711) with the same messages, so they are not introduced here. Happy to rebase once CI on main is green again.

grst · 2026-06-18T07:09:19Z

Sorry for being slow with reviews, I still need to take a closer look. I'll also take care of CI.

One thing I was wondering at first glance: Do you think it makes sense to also provide a corresponding plotting function (scirpy.pl.hill_diversity_profile)? But if it's just a seaborn oneliner, I would also be fine with just adding that to the tutorial.

Fill in the previously empty Clonotype Diversity section with a coverage-standardized Hill diversity profile across patient status groups, a seaborn plot of the profile, and the conversion to classical alpha diversity indices via convert_hill_table.

review-notebook-app · 2026-06-18T08:33:47Z

Check out this pull request on ReviewNB.

See visual diffs & provide feedback on Jupyter Notebooks.

KilianMaire · 2026-06-18T08:34:08Z

Thanks! I went with the tutorial route for now, since the output of hill_diversity_profile is a tidy DataFrame (one row per q, one column per group) and plotting it is essentially a seaborn lineplot oneliner. A dedicated pl.hill_diversity_profile would mostly be a thin wrapper around that, so I would rather not add the maintenance surface unless you prefer it for consistency with pl.alpha_diversity. Happy to add it as a follow-up if you do.
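For reference, the plotting step mentioned here could look roughly like the sketch below. The helper names `profile_to_long` and `plot_profile` and the example values are hypothetical, and the profile layout (index holds q, one column per group) is assumed from the description; seaborn is imported lazily so the reshaping part runs without it.

```python
import pandas as pd


def profile_to_long(profile):
    """Reshape a wide profile (index = q, columns = groups) into long form."""
    return (
        profile.rename_axis("q")
        .reset_index()
        .melt(id_vars="q", var_name="group", value_name="hill_number")
    )


def plot_profile(profile):
    """Draw one line per group; returns the matplotlib Axes."""
    import seaborn as sns  # imported lazily, only needed for plotting

    return sns.lineplot(
        data=profile_to_long(profile), x="q", y="hill_number", hue="group", marker="o"
    )


# Hypothetical profile values, for illustration only.
profile = pd.DataFrame(
    {"healthy": [120.0, 45.0, 20.0], "patient": [80.0, 60.0, 50.0]},
    index=[0, 1, 2],
)
long = profile_to_long(profile)
print(long)
```

seaborn also accepts wide-form data, so `sns.lineplot(data=profile)` should give a comparable plot in one call; the long form is mainly useful when axis and legend labels need to be controlled explicitly.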

I filled in the previously empty "Clonotype Diversity" section of the 5k BCR tutorial: it computes the coverage-standardized profile across patient status groups, plots it with seaborn, and shows the conversion to the classical indices via convert_hill_table. I ran the full tutorial up to that section locally to confirm the new cells execute and the outputs are committed.

The only open design point from my side is the dependency question in the description (hillrep as an optional dep vs vendoring the estimator), whenever you get to it. No rush.

codecov · 2026-06-18T11:07:29Z

Codecov Report

❌ Patch coverage is 19.51220% with 33 lines in your changes missing coverage. Please review.
✅ Project coverage is 19.67%. Comparing base (13779ec) to head (fff867d).
⚠️ Report is 3 commits behind head on main.

Files with missing lines	Patch %	Lines
src/scirpy/tl/_diversity.py	17.50%	33 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #714      +/-   ##
==========================================
- Coverage   19.71%   19.67%   -0.04%     
==========================================
  Files          51       51              
  Lines        4581     4620      +39     
==========================================
+ Hits          903      909       +6     
- Misses       3678     3711      +33

Files with missing lines	Coverage Δ
src/scirpy/tl/__init__.py	`100.00% <100.00%> (ø)`
src/scirpy/tl/_diversity.py	`17.04% <17.50%> (-1.33%)`	⬇️

grst · 2026-06-18T11:28:41Z

> The only open design point from my side is the dependency question in the description (hillrep as an optional dep vs vendoring the estimator), whenever you get to it. No rush.

It of course also depends a bit on your commitment to maintain the package in the future! But given that hillrep adds no additional transitive dependencies and doesn't have compiled code, I'm fine with adding it as a dependency.

The 5k BCR tutorial now uses hill_diversity_profile, so the docs environment needs hillrep. Add scirpy[diversity] to the doc dependency group so the tutorial executes on CI. Also add tests for the convert_hill_table evenness modes, the missing-order error, and the missing-hillrep import error, which run without optional deps.

KilianMaire · 2026-06-18T12:15:47Z

Great, thanks. And yes, I am committed to maintaining hillrep: it is my package, under active development, and the estimators are validated against R iNEXT to a relative tolerance of 1e-6 with the golden values committed, so regressions are caught.

Two follow-ups I just pushed:

The tutorial-execution job was failing because the docs environment did not have hillrep (it only pulls the doc group, not test). Fixed by adding scirpy[diversity] to the doc dependency group, so the new tutorial section runs on CI.
Added tests for the remaining convert_hill_table branches (the evenness modes and the missing-order error) and for the missing-hillrep import error. These need no optional dependency, so they run in every environment.

Let me know if you would like anything changed in the tutorial section or the API.

Kilian added 2 commits June 17, 2026 17:01

Reference PR number in changelog

b3590e7

Merge branch 'main' into feat/coverage-hill-diversity

819a11d

grst added the skip-gpu-ci label Jun 18, 2026

[pre-commit.ci] auto fixes from pre-commit.com hooks

fff867d

for more information, see https://pre-commit.ci

grst mentioned this pull request Jun 23, 2026

Hill diversity profile #535

Closed

3 tasks

KilianMaire wants to merge 6 commits into scverse:main from KilianMaire:feat/coverage-hill-diversity
