Coverage-standardized Hill diversity (tl.hill_diversity_profile) #714

Open: KilianMaire wants to merge 6 commits into scverse:main from KilianMaire:feat/coverage-hill-diversity

@KilianMaire

KilianMaire commented Jun 17, 2026

Close #535

Summary

Adds coverage-standardized Hill-number diversity to scirpy.tl:

  • tl.hill_diversity_profile computes a Hill diversity profile over a range of orders q, standardized to a common sample coverage so that profiles are comparable across samples of different sequencing depth.
  • tl.convert_hill_table converts a profile into the classical alpha diversity indices (observed richness, Shannon entropy, inverse Simpson, Gini-Simpson) and into evenness measures.

This builds on #535 and addresses the open question raised there about sequencing-depth correction. Credit to @MKanetscheider for the original hill_diversity_profile / convert_hill_table design and for the convert_hill_table conversions requested by @FFinotello; this PR keeps those public signatures and replaces the estimation engine.

Motivation

The plug-in Hill estimator grows with sampling depth at every order q (richness most strongly, but inverse Simpson too), so two samples sequenced to different depths show different profiles even when the underlying repertoire is identical. This is the confounder discussed in #535, and it is more acute for scRNA-seq, where the number of cells varies widely between samples.
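
To make the depth effect concrete, here is a minimal sketch of the plug-in estimator in plain Python. It is not the code in this PR: `plugin_hill`, `sample_counts`, and the simulated Zipf-like repertoire of 1000 clones are illustrative assumptions. The same repertoire is sampled at two depths, and the plug-in richness of the shallow sample is capped by the number of cells drawn.

```python
import math
import random
from collections import Counter


def plugin_hill(counts, q):
    """Plug-in (empirical) Hill number of order q from clone counts."""
    counts = [c for c in counts if c > 0]
    n = sum(counts)
    p = [c / n for c in counts]
    if q == 1:
        # Limit q -> 1: exponential of the Shannon entropy.
        return math.exp(-sum(pi * math.log(pi) for pi in p))
    return sum(pi**q for pi in p) ** (1 / (1 - q))


# Hypothetical repertoire: 1000 clones with Zipf-like frequencies.
rng = random.Random(0)
clones = range(1000)
weights = [1 / (rank + 1) for rank in clones]


def sample_counts(depth):
    """Draw `depth` cells from the repertoire and count cells per clone."""
    return list(Counter(rng.choices(clones, weights=weights, k=depth)).values())


shallow = sample_counts(100)
deep = sample_counts(10_000)
for q in (0, 1, 2):
    # Same repertoire, two depths: the plug-in values are not comparable.
    print(q, round(plugin_hill(shallow, q), 1), round(plugin_hill(deep, q), 1))
```

The gap is largest at q = 0 and typically shrinks at higher orders, which matches the "richness most strongly" remark above.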

The established fix is the iNEXT framework (Chao et al. 2014; Hsieh, Ma & Chao 2016): estimate Hill numbers and standardize them to a common sample coverage before comparing. tl.hill_diversity_profile standardizes all groups to a shared coverage (iNEXT's Cmax rule) and returns the standardized profile.
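
As a rough illustration of what "a shared coverage" means, the sketch below estimates sample coverage from singleton and doubleton counts (the estimator of Chao and Jost that iNEXT builds on) and takes the minimum across groups as a common target. The function name and the example counts are made up for illustration, this is not hillrep's API, and the target here uses observed coverage only, which is a simplification of iNEXT's rule.

```python
def sample_coverage(counts):
    """Estimated sample coverage from clone counts (Chao and Jost estimator)."""
    counts = [c for c in counts if c > 0]
    n = sum(counts)
    f1 = sum(1 for c in counts if c == 1)  # clones seen exactly once
    f2 = sum(1 for c in counts if c == 2)  # clones seen exactly twice
    if f1 == 0:
        return 1.0
    if f2 > 0:
        a = (n - 1) * f1 / ((n - 1) * f1 + 2 * f2)
    else:
        # Fallback when no doubletons are observed.
        a = (n - 1) * (f1 - 1) / ((n - 1) * (f1 - 1) + 2)
    return 1 - (f1 / n) * a


# Hypothetical clone counts for two groups.
groups = {
    "well_sampled": [40, 25, 10, 5, 2, 2, 1],
    "undersampled": [2, 1, 1, 1, 1, 1],
}
coverage = {name: sample_coverage(c) for name, c in groups.items()}

# Simplified common target: the lowest coverage any group reaches.
shared_target = min(coverage.values())
print(coverage, shared_target)
```

A group dominated by singletons pulls the shared target down, which is the situation the not-comparable warning in this PR is meant to flag.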

What is new compared to #535

  • The naive plug-in engine is replaced with the coverage-standardized estimator.
  • When the groups cannot be standardized to a fully reliable shared coverage (for example one group is heavily undersampled), a warning is raised rather than silently returning a number.
  • Unit tests are added for both functions, including the warning path. The numeric output is checked against the underlying estimator to a relative tolerance of 1e-9, for both AnnData and MuData.
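
The warning path and the tolerance check can be pictured with the following standard-library sketch. `warn_if_not_comparable` and its 0.5 threshold are hypothetical placeholders, not the helper or the cutoff used in this PR; the point is only the shape of the test.

```python
import math
import warnings


def warn_if_not_comparable(coverage_by_group, min_coverage=0.5):
    """Warn when the shared coverage falls below a reliability threshold.

    The 0.5 default is an arbitrary placeholder for this sketch.
    """
    shared = min(coverage_by_group.values())
    if shared < min_coverage:
        worst = min(coverage_by_group, key=coverage_by_group.get)
        warnings.warn(
            f"Group {worst!r} limits the shared coverage to {shared:.2f}; "
            "profiles may not be comparable.",
            UserWarning,
            stacklevel=2,
        )
    return shared


# Test-style check: one undersampled group must trigger exactly one warning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    shared = warn_if_not_comparable({"a": 0.99, "b": 0.33})

assert len(caught) == 1 and issubclass(caught[0].category, UserWarning)
# Numeric output compared with a tight relative tolerance, as in the PR's tests.
assert math.isclose(shared, 0.33, rel_tol=1e-9)
```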

Implementation notes

  • Estimation is delegated to the hillrep package, whose kernels are validated against R iNEXT 3.0.2 to a relative tolerance of 1e-6. hillrep is pure numpy/scipy/pandas and is added as an optional dependency under the existing diversity extra, alongside scikit-bio.
  • The backend is isolated behind a single private helper (_coverage_hill_profile), so swapping the dependency for a vendored estimator later would be a localized change.
  • Group counts are extracted via DataHandler, so both AnnData and MuData are supported (tests cover both).

Open question for maintainers

One design decision I would like your call on:

  • (A) scirpy takes hillrep as an optional dependency (as in this PR), so the validated kernels live in one place and stay in sync with the R reference; or
  • (B) the coverage estimator is vendored into _diversity.py, keeping scirpy dependency-free at the cost of duplicating the math.

This PR implements (A) because hillrep is dependency-light, but it is your dependency policy and your call. Switching to (B) would be localized to _coverage_hill_profile.

API

import scirpy as ir

profile = ir.tl.hill_diversity_profile(
    adata, groupby="sample", target_col="clone_id", q_min=0, q_max=2, q_step=1
)
indices = ir.tl.convert_hill_table(profile, convert_to="diversity")

hill_diversity_profile returns a DataFrame with one row per diversity order q and one column per group, which plots directly with seaborn and flows into convert_hill_table.
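
For reference, the classical indices follow from the Hill numbers at q = 0, 1, 2 by standard identities, sketched below for a single group with made-up values. This shows the arithmetic only and is an assumption about what the conversion does, not the actual implementation of convert_hill_table.

```python
import math

# Hypothetical profile for one group: Hill number at orders q = 0, 1, 2.
hill = {0: 120.0, 1: 45.0, 2: 20.0}

indices = {
    "observed_richness": hill[0],          # D(q=0)
    "shannon_entropy": math.log(hill[1]),  # ln D(q=1)
    "inverse_simpson": hill[2],            # D(q=2)
    "gini_simpson": 1 - 1 / hill[2],       # 1 - 1 / D(q=2)
}
print(indices)
```

The conversion needs orders 0, 1 and 2 to be present, which is why the call above uses q_min=0, q_max=2, q_step=1. Since the profile is wide-form (index q, one column per group), `sns.lineplot(data=profile)` should draw one line per group.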

Checklist

  • CHANGELOG.md updated
  • Tests added (for the new functions, including the not-comparable warning)
  • Tutorial updated (Hill diversity section added to the 5k BCR tutorial)

Kilian added 2 commits June 17, 2026 17:01
Add tl.hill_diversity_profile and tl.convert_hill_table for coverage-based
Hill-number diversity. Profiles are standardized to a common sample coverage
(iNEXT framework) so they are comparable across samples of different sequencing
depth, and a warning is raised when a fair comparison is not supported.
Estimation is delegated to the hillrep package, added as an optional dependency
under the diversity extra.

Builds on the hill_diversity_profile / convert_hill_table design from scverse#535 by
Mario Kanetscheider, keeping those public signatures and replacing the plug-in
estimator with the coverage-standardized one.
@KilianMaire

Author

Note on the red CI: all the tests added in this PR pass on every environment (test_hill_diversity_profile[AnnData], [MuData], test_convert_hill_table, test_hill_diversity_profile_warns_when_not_comparable).

The two failing items are pre-existing on main and unrelated to this PR:

  • test_io.py::test_convert_dandelion fails with ImportError: Please install dandelion (sc-dandelion not available in the test env).
  • test_plotting.py fails at collection with duplicate parametrization of 'adata_clonotype'.

Both also fail on the latest main run and on other open PRs (e.g. #711) with the same messages, so they are not introduced here. Happy to rebase once CI on main is green again.

@grst

grst commented Jun 18, 2026

Collaborator

Sorry for being slow with reviews, I still need to take a closer look. I'll also take care of CI.

One thing I was wondering at first glance: do you think it makes sense to also provide a corresponding plotting function (scirpy.pl.hill_diversity_profile)? If it's just a seaborn one-liner, I would also be fine with just adding that to the tutorial.

Fill in the previously empty Clonotype Diversity section with a
coverage-standardized Hill diversity profile across patient status groups,
a seaborn plot of the profile, and the conversion to classical alpha
diversity indices via convert_hill_table.
@KilianMaire

Author

Thanks! I went with the tutorial route for now, since the output of hill_diversity_profile is a wide-form DataFrame (one row per q, one column per group) and plotting it is essentially a seaborn lineplot one-liner. A dedicated pl.hill_diversity_profile would mostly be a thin wrapper around that, so I would rather not add the maintenance surface unless you prefer it for consistency with pl.alpha_diversity. Happy to add it as a follow-up if you do.

I filled in the previously empty "Clonotype Diversity" section of the 5k BCR tutorial: it computes the coverage-standardized profile across patient status groups, plots it with seaborn, and shows the conversion to the classical indices via convert_hill_table. I ran the full tutorial up to that section locally to confirm the new cells execute and the outputs are committed.

The only open design point from my side is the dependency question in the description (hillrep as an optional dep vs vendoring the estimator), whenever you get to it. No rush.

@codecov

codecov Bot commented Jun 18, 2026


Codecov Report

❌ Patch coverage is 19.51220% with 33 lines in your changes missing coverage. Please review.
✅ Project coverage is 19.67%. Comparing base (13779ec) to head (fff867d).
⚠️ Report is 3 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/scirpy/tl/_diversity.py | 17.50% | 33 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #714      +/-   ##
==========================================
- Coverage   19.71%   19.67%   -0.04%     
==========================================
  Files          51       51              
  Lines        4581     4620      +39     
==========================================
+ Hits          903      909       +6     
- Misses       3678     3711      +33     
| Files with missing lines | Coverage Δ |
|---|---|
| src/scirpy/tl/__init__.py | 100.00% <100.00%> (ø) |
| src/scirpy/tl/_diversity.py | 17.04% <17.50%> (-1.33%) ⬇️ |

@grst

grst commented Jun 18, 2026

Collaborator

> The only open design point from my side is the dependency question in the description (hillrep as an optional dep vs vendoring the estimator), whenever you get to it. No rush.

It of course also depends a bit on your commitment to maintain the package in the future! But given that hillrep adds no additional transitive dependencies and doesn't have compiled code, I'm fine with adding it as a dependency.

The 5k BCR tutorial now uses hill_diversity_profile, so the docs environment
needs hillrep. Add scirpy[diversity] to the doc dependency group so the tutorial
executes on CI.

Also add tests for the convert_hill_table evenness modes, the missing-order
error, and the missing-hillrep import error, which run without optional deps.
@KilianMaire

Author

Great, thanks. And yes, I am committed to maintaining hillrep: it is my package, under active development, and the estimators are validated against R iNEXT to a relative tolerance of 1e-6 with the golden values committed, so regressions are caught.

Two follow-ups I just pushed:

  • The tutorial-execution job was failing because the docs environment did not have hillrep (it only pulls the doc group, not test). Fixed by adding scirpy[diversity] to the doc dependency group, so the new tutorial section runs on CI.
  • Added tests for the remaining convert_hill_table branches (the evenness modes and the missing-order error) and for the missing-hillrep import error. These need no optional dependency, so they run in every environment.

Let me know if you would like anything changed in the tutorial section or the API.

grst mentioned this pull request Jun 23, 2026 (3 tasks)