Coverage-standardized Hill diversity (tl.hill_diversity_profile)#714
Coverage-standardized Hill diversity (tl.hill_diversity_profile)#714KilianMaire wants to merge 6 commits into
Conversation
Add tl.hill_diversity_profile and tl.convert_hill_table for coverage-based Hill-number diversity. Profiles are standardized to a common sample coverage (iNEXT framework) so they are comparable across samples of different sequencing depth, and a warning is raised when a fair comparison is not supported. Estimation is delegated to the hillrep package, added as an optional dependency under the diversity extra. Builds on the hill_diversity_profile / convert_hill_table design from scverse#535 by Mario Kanetscheider, keeping those public signatures and replacing the plug-in estimator with the coverage-standardized one.
|
Note on the red CI: all the tests added in this PR pass on every environment ( The two failing items are pre-existing on
Both also fail on the latest |
|
Sorry for being slow with reviews, I still need to take a closer look. I'll also take care of CI. One thing I was wondering at first glance: Do you think it makes sense to also provide a corresponding plotting function (scirpy.pl.hill_diversity_profile)? But if it's just a seaborn oneliner, I would also be fine with just adding that to the tutorial. |
Fill in the previously empty Clonotype Diversity section with a coverage-standardized Hill diversity profile across patient status groups, a seaborn plot of the profile, and the conversion to classical alpha diversity indices via convert_hill_table.
|
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
|
Thanks! I went with the tutorial route for now, since the output of I filled in the previously empty "Clonotype Diversity" section of the 5k BCR tutorial: it computes the coverage-standardized profile across patient status groups, plots it with seaborn, and shows the conversion to the classical indices via The only open design point from my side is the dependency question in the description (hillrep as an optional dep vs vendoring the estimator), whenever you get to it. No rush. |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #714 +/- ##
==========================================
- Coverage 19.71% 19.67% -0.04%
==========================================
Files 51 51
Lines 4581 4620 +39
==========================================
+ Hits 903 909 +6
- Misses 3678 3711 +33
🚀 New features to boost your workflow:
|
It of course also depends a bit on your commitment to maintain the package in the future! But given that hillrep adds no additional transitive dependencies and doesn't have compiled code, I'm fine with adding it as a dependency. |
The 5k BCR tutorial now uses hill_diversity_profile, so the docs environment needs hillrep. Add scirpy[diversity] to the doc dependency group so the tutorial executes on CI. Also add tests for the convert_hill_table evenness modes, the missing-order error, and the missing-hillrep import error, which run without optional deps.
|
Great, thanks. And yes, I am committed to maintaining hillrep: it is my package, under active development, and the estimators are validated against R iNEXT to a relative tolerance of 1e-6 with the golden values committed, so regressions are caught. Two follow-ups I just pushed:
Let me know if you would like anything changed in the tutorial section or the API. |
for more information, see https://pre-commit.ci
Close #535
Summary
Adds coverage-standardized Hill-number diversity to
scirpy.tl:tl.hill_diversity_profilecomputes a Hill diversity profile over a range of ordersq, standardized to a common sample coverage so that profiles are comparable across samples of different sequencing depth.tl.convert_hill_tableconverts a profile into the classical alpha diversity indices (observed richness, Shannon entropy, inverse Simpson, Gini-Simpson) and into evenness measures.This builds on #535 and addresses the open question raised there about sequencing-depth correction. Credit to @MKanetscheider for the original
hill_diversity_profile/convert_hill_tabledesign and for theconvert_hill_tableconversions requested by @FFinotello; this PR keeps those public signatures and replaces the estimation engine.Motivation
The plug-in Hill estimator grows with sampling depth at every order of
q(richness most strongly, but inverse Simpson too), so two samples sequenced to different depth show different profiles even when the underlying repertoire is identical. This is the confounder discussed in #535, and it is more acute for scRNA-seq where the number of cells varies a lot between samples.The established fix is the iNEXT framework (Chao et al. 2014; Hsieh, Ma & Chao 2016): estimate Hill numbers and standardize them to a common sample coverage before comparing.
tl.hill_diversity_profilestandardizes all groups to a shared coverage (iNEXT'sCmaxrule) and returns the standardized profile.What is new compared to #535
Implementation notes
hillrepis pure numpy/scipy/pandas and is added as an optional dependency under the existingdiversityextra, alongsidescikit-bio._coverage_hill_profile), so swapping the dependency for a vendored estimator later would be a localized change.DataHandler, so bothAnnDataandMuDataare supported (tests cover both).Open question for maintainers
One design decision I would like your call on:
hillrepas an optional dependency (as in this PR), so the validated kernels live in one place and stay in sync with the R reference; or_diversity.py, keeping scirpy dependency-free at the cost of duplicating the math.This PR implements (A) because
hillrepis dependency-light, but it is your dependency policy and your call. Switching to (B) would be localized to_coverage_hill_profile.API
hill_diversity_profilereturns aDataFramewith one row per diversity orderqand one column per group, which plots directly with seaborn and flows intoconvert_hill_table.Checklist