Implement LeverageSHAP approximator#524
Forward-looking spec for the 3 new SV approximators (LeverageSHAP, PolySHAP, OddSHAP). Approximator classes are looked up dynamically by name, so the file auto-skips classes that have not yet been registered in shapiq.approximator. As each implementation lands, the corresponding parametrizations activate.
- Interface conformance (always required): index='SV', n_players, max_order/min_order, values shape and dtype, interaction_lookup.
- Numerical convergence vs ExactComputer (xfail strict=False): atol schedule by budget percentage.
- Determinism: same (n, random_state, budget, game) -> bit-identical output.
75 tests, all currently SKIP on main. Will activate as classes land.
Honors the cross-method testing platform promised to the tutor:
unified harness covering every SV approximator in shapiq (the
existing 11 — KernelSHAP, SVARM, Permutation*, ProxySPEX, ... — and
the 3 new ones from this project) instead of only the new line-up.
Approximator list is sourced dynamically from
shapiq.approximator.SV_APPROXIMATORS (canonical registry) plus the
3 new project names, deduplicated. Future shapiq additions land in
the harness automatically.
Split into two scopes:
* test_interface_conformance — strict shape/dtype/index/lookup
contract from the API spec. Applied ONLY to the 3 new
approximators (the contract is ours; existing methods have
different default output conventions like ProxySPEX defaulting
to FBII and max_order=n).
* test_numerical_convergence_vs_exact + test_determinism — apply
to ALL SV approximators. Cross-method comparison against
ExactComputer ground truth on identical SOUM games. xfail with
strict=False so methods that do converge surface as XPASS;
methods still under development surface as XFAIL.
Two robustness helpers:
* _construct_or_skip — tries (n=, index='SV', max_order=1,
random_state=) first (covers multi-index methods like SPEX,
ProxySPEX, ProxySHAP, MSRBiased, kADDSHAP), then falls back to
minimal signature for SV-only methods (KernelSHAP, OwenSamplingSV).
* _safe_approximate — skips on ValueError raised by approximators
that explicitly refuse a regime (e.g. SPEX 'Insufficient budget
to compute the transform' at low budgets).
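The construction fallback described above can be sketched as follows. This is a minimal illustration, not the helper from the test file: the three toy classes and the function body are hypothetical stand-ins, and letting a ValueError take priority over a TypeError follows the rule stated for construct_for_sv() in the later benchmark commit.

```python
class SVOnly:
    """Hypothetical SV-only method: minimal constructor, no index argument."""

    def __init__(self, n, random_state=None):
        self.n = n
        self.index = "SV"
        self.random_state = random_state


class MultiIndex:
    """Hypothetical multi-index method that does not default to SV."""

    def __init__(self, n, index="FBII", max_order=None, random_state=None):
        self.n = n
        self.index = index
        self.max_order = n if max_order is None else max_order
        self.random_state = random_state


class Refusing:
    """Hypothetical method whose matched signature rejects the configuration."""

    def __init__(self, n, index, max_order, random_state):
        raise ValueError("unsupported configuration")


def construct_for_sv(cls, n, random_state=0):
    """Return (estimator, None) on success, else (None, most informative exception)."""
    attempts = (
        # Explicit SV signature first (covers multi-index methods).
        {"n": n, "index": "SV", "max_order": 1, "random_state": random_state},
        # Minimal signature as a fallback (covers SV-only methods).
        {"n": n, "random_state": random_state},
    )
    best_exc = None
    for kwargs in attempts:
        try:
            return cls(**kwargs), None
        except ValueError as exc:
            # The signature matched but the method refused: most informative.
            best_exc = exc
        except TypeError as exc:
            # Signature mismatch: keep it only if nothing better was seen.
            if not isinstance(best_exc, ValueError):
                best_exc = exc
    return None, best_exc
```

A caller in a test would typically skip the parametrization when the first element is None and report the returned exception as the skip reason.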
Results: 10 passed, 95 skipped, 90 xfailed, 23 xpassed. The 23
xpassed are existing shapiq SV methods that converge cleanly at
full budget on small SOUM — a useful baseline for the upcoming
benchmark report.
Drop-in framework that any teammate can merge into their feature branch
to run head-to-head benchmarks against ExactComputer across every SV
approximator in shapiq, then plot the standard SHAP-literature metric
curves. No source files are modified — adds a top-level benchmark/
package, a single test file, and a small in-place test-helper sys.path
hook. Does not touch pyproject.toml or any other upstream config.
Files added:
* benchmark/__init__.py: makes the runner a proper Python package so
invocation is 'python -m benchmark.performance'.
* benchmark/_discovery.py: single source of truth for SV approximator
discovery + SV-mode construction. Holds:
- PROJECT_APPROXIMATOR_NAMES: LeverageSHAP, PolySHAP,
PolySHAPKAdd / Partial / Prior, OddSHAP.
- _SV_CONSTRUCT_OVERRIDES: per-class kwargs for non-standard
constructors (PolySHAP variants need max_order /
n_explanation_terms / q_prior).
- construct_for_sv(): three-stage construction (override ->
explicit SV signature -> minimal signature), returning
(estimator, exc) so the caller can report the most informative
exception. A ValueError from inside a matched signature wins
over a TypeError from a signature mismatch.
- safe_approximate(): catches ValueError and RuntimeError so
sparse approximators that refuse a budget regime (SPEX,
ProxySPEX, ...) skip the cell cleanly instead of crashing.
* benchmark/performance.py: CLI runner that consumes _discovery,
sweeps (method, game, budget, seed), records every cell in a
long-format CSV, and emits one PNG per (game, metric) plus a
runtime PNG. Seven metrics chosen from the union of LeverageSHAP,
PolySHAP, OddSHAP and shapiq.benchmark.metrics literature:
MSE / MAE / SSE / SAE / Precision@5 / Precision@10 / KendallTau.
Includes a '--check' interface-probe mode that prints a
constructibility table without running a sweep.
* benchmark/README.md: usage doc covering merge workflow, --check,
sweep CLI, output layout, CSV format, metric definitions, plot
conventions, and notes on the multi-index approximators that need
explicit (index='SV', max_order=1).
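The seven metrics named above can be sketched in a few lines of NumPy. This is an illustrative version, not the code in benchmark/performance.py: ranking Precision@k by absolute value and using the tie-free tau-a form of Kendall's tau are assumptions, and all function names are hypothetical.

```python
import numpy as np


def error_metrics(exact, approx):
    """MSE / MAE / SSE / SAE between exact and approximated Shapley values."""
    diff = np.asarray(exact, dtype=float) - np.asarray(approx, dtype=float)
    return {
        "MSE": float(np.mean(diff**2)),
        "MAE": float(np.mean(np.abs(diff))),
        "SSE": float(np.sum(diff**2)),
        "SAE": float(np.sum(np.abs(diff))),
    }


def precision_at_k(exact, approx, k):
    """Share of the k largest |exact| players that are also in the top k of |approx|."""
    top_exact = set(np.argsort(-np.abs(exact))[:k].tolist())
    top_approx = set(np.argsort(-np.abs(approx))[:k].tolist())
    return len(top_exact & top_approx) / k


def kendall_tau(exact, approx):
    """Kendall tau-a: (concordant - discordant) pairs over all pairs (no tie correction)."""
    n = len(exact)
    score = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            score += np.sign(exact[i] - exact[j]) * np.sign(approx[i] - approx[j])
    return float(score / (n * (n - 1) / 2))


exact = np.array([0.5, -0.3, 0.1, 0.0])
approx = np.array([0.4, -0.25, 0.2, 0.0])
metrics = error_metrics(exact, approx)
```

The quadratic loop in kendall_tau is fine for the small player counts used against ExactComputer; a library implementation would be preferable for larger n.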
Files modified:
* tests/shapiq/tests_unit/tests_approximators/test_approximators_vs_exact.py:
now imports the shared helpers from benchmark._discovery via a
tightly-scoped sys.path hook at the top of the file. Picks up the
ValueError-priority construction and the RuntimeError-catch that
the test file previously did not have. Interface conformance is
now applied to the project's six new approximator names
(LeverageSHAP, PolySHAP + 3 variants, OddSHAP), so Matthias's
PolySHAP variants are no longer silently skipped by the contract
check.
Verified locally:
* pytest test_approximators_vs_exact.py: 10 passed, 170 skipped,
87 xfailed, 26 xpassed. No failures.
* python -m benchmark.performance --check: surfaces all 17 method
names (11 existing on main + 6 project additions) correctly.
* Drop-in compatibility verified by temporary merge into all three
feature branches (oddshap_approximator, leverageSHAP, PolySHAP) —
clean merge in each, --check picks up the local approximator.
Finally all tests pass again after:
========================================== 1270 passed, 166 skipped, 124 xfailed, 32 xpassed, 5181 warnings in 956.40s (0:15:56) ==========================================
real 16m2,138s
user 119m51,705s
sys 0m37,665s
…tions set, Z_list and probs_list and create all-true and all-false coalitions
…x pre-commit errors
…determinism on LeverageSHAP()
…ferent game variables to avoid access counters interfering; Also compare metadata
…ames produce (slightly) different outputs
…d tiny-n edge case and add comments to document and explain the test
…use its core claim was not reliable With n = 6, a budget of 100 is above 2^n = 64, so the implementation enters the full-budget/exact regime. In that regime, the result should be identical no matter which seed you use, so asserting that different seeds must differ is false and will fail even though the code is correct. => I lowered the budget to budget=20
…o test_exact_regime_seed_independence and test_stochastic_regime_seed_variability
…stochastic regime
…est_exact_matches_multiple_small_games, test_null_player_axiom and test_minimal_budget_sweep
…her n to avoid minimal floating errors
…to base regression class
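The seed-test commits above rest on one piece of arithmetic: with n = 6 there are only 2^6 = 64 coalitions, so a budget of 100 covers all of them, while a budget of 20 does not. A minimal sketch of that check follows; the helper name is hypothetical, and treating a budget exactly equal to 2^n as exact is an assumption.

```python
def is_exact_regime(n_players: int, budget: int) -> bool:
    """True when the budget can cover every one of the 2**n coalitions."""
    return budget >= 2**n_players


# Budget used before the fix: every coalition is evaluated, seeds cannot matter.
full_budget = is_exact_regime(6, 100)
# Budget after the fix: only a sample is evaluated, seeds may change the result.
reduced_budget = is_exact_regime(6, 20)
```

Under this reading, a seed-variability assertion only makes sense when the check returns False, and a seed-independence assertion only when it returns True.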
Pull request overview
This PR adds a new regression-based Shapley value approximator, LeverageSHAP, and refactors the regression solver to support a more robust SVD-backed path, with extensive unit tests validating numerical stability and reproducibility.
Changes:
- Implement LeverageSHAP (Musco & Witter, 2024) with leverage-score-guided paired coalition sampling and centered regression.
- Refactor regression solving into a shared solve_regression(..., use_svd=...) utility + Regression.solve_regression() wrapper, adding numerical guards and fallback behavior.
- Add comprehensive unit tests for LeverageSHAP behavior and for regression-solver edge cases.
Reviewed changes
Copilot reviewed 7 out of 11 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/shapiq/tests_unit/tests_approximators/test_approximator_regression_base.py | Adds targeted tests for new solve_regression(..., use_svd=...) behavior and guards. |
| tests/shapiq/tests_unit/tests_approximators/test_approximator_leverageshap.py | Adds a full unit test suite for LeverageSHAP (accuracy, axioms, seeds, numerical stability). |
| src/shapiq/approximator/regression/leverageshap.py | New LeverageSHAP approximator implementation including custom sampling and IS reweighting. |
| src/shapiq/approximator/regression/base.py | Introduces solve_regression(..., use_svd=...) + class wrapper and updates internal call sites. |
| src/shapiq/approximator/regression/__init__.py | Exports LeverageSHAP from the regression approximators package. |
| src/shapiq/approximator/__init__.py | Exposes LeverageSHAP at the top-level approximator API and lists it in SV_APPROXIMATORS. |
| notebooks/data/communities.names | Adds UCI Communities & Crime metadata used by an existing notebook. |
| .gitignore | Ignores benchmark result output files. |
Advueu963
left a comment
Overall a very nice implementation, with nice safeguards for large n. Yet I think we can get rid of the log procedure you introduced, as _find_c is already the bottleneck for large n, if I see this correctly^^.
I have also left a comment on how the CoalitionSampler can be used to produce a very similar sampling procedure. See Figure 9 of the LeverageSHAP paper for where the differences lie. The CoalitionSampler in our code essentially works without Bernoulli sampling.
Is this file really necessary? Can you not use the load_communities_and_crime function in shapiq_games?
Good point, thanks for pointing this out! I implemented your suggestion in b98d617. Changes:
- Replace custom dataset loading with shapiq_games.datasets.load_communities_and_crime() and refactor all code cells accordingly
- Also load communities data set and extend code cells to skip computing exact Shapley values for n > 15 (for Communities dataset)
- Skip experiments with communities dataset where exact_svs is None
- Add fallback to src/shapiq/approximator/regression/leverageshap._sample_without_replacement() for astronomically large binomial pools
I committed the ruff style-fixes in a separate commit (ad74f97) so that the diff doesn't blow up in one commit. I had to ignore some ruff rules in a comment in the first cell as I believe it makes sense to ignore those rules in a scientific notebook environment:
# ruff: noqa: T201, RUF001, RUF002, RUF003, E402
# Justification for rule suppressions:
# - T201 (print found): Standard print statements are intentionally utilized for inline
# execution logging. Standard logging modules would introduce unnecessary verbosity,
# thereby reducing the readability of the notebook's experimental flow.
# - RUF001 / RUF002 / RUF003 (Ambiguous characters): The inclusion of specific typographic
# symbols (such as mathematical multiplication or minus signs) is intentional to maintain
# standard notation and ensure formal clarity within text cells and documentation strings.
# - E402 (Import not at top): In an interactive notebook environment, contextualizing
# imports within specific cells ensures logical modularity and encapsulation. This prevents
# unnecessary global scope clutter and allows for isolated cell execution during
#   iterative experimentation without re-running the initial setup.
Could you please review the changes? Thank you.
I think this file is not necessary, as shapiq_games should already be able to load communities_and_crime.
for i, s in enumerate(sizes):  # for each sampled coalition of size s
    log_w = (
        math.lgamma(s) + math.lgamma(n - s) - math.lgamma(n + 1)
    )  # log Shapley kernel w(s)
    log_C = (
        math.lgamma(n + 1) - math.lgamma(s + 1) - math.lgamma(n - s + 1)
    )  # log C(n,s)
    log_p = log_2c - log_C  # log(2c * l_z) = log(2c / C(n,s))
    log_min_p = min(0.0, log_p)  # cap probability at 1 (log 1 = 0)
    log_weights[i] = log_w - log_min_p  # IS weight = w(s) / min(1, 2c*l_z)
log_weights -= log_weights.max()  # shift so exp doesn't overflow
I understand why you are guarding against the size of s, due to the explosion with high numbers. But the algorithm will actually already fail in _find_c, since there you also need to construct all the binomial terms.
So I would rather argue that this might not even be suitable here, and could be removed to ensure more clarity. But I am open to other opinions on this.
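For readers following the discussion, the log-space weighting in the quoted snippet can be run as a standalone function. This is a sketch that mirrors the quoted lines, not the merged implementation: the function name is hypothetical, log_2c is passed in as a plain parameter, and the exponentiation at the end is an assumption about what happens after the shift.

```python
import math

import numpy as np


def log_space_is_weights(sizes, n, log_2c):
    """Importance-sampling weights per coalition size, computed in log space."""
    log_weights = np.empty(len(sizes))
    for i, s in enumerate(sizes):
        # log of (s-1)! (n-s-1)! / n!, the Shapley kernel up to a constant factor
        log_w = math.lgamma(s) + math.lgamma(n - s) - math.lgamma(n + 1)
        # log C(n, s)
        log_c = math.lgamma(n + 1) - math.lgamma(s + 1) - math.lgamma(n - s + 1)
        # log(2c / C(n, s)), capped at log(1) = 0 because it is a probability
        log_p = min(0.0, log_2c - log_c)
        log_weights[i] = log_w - log_p
    # Shift so that the largest weight is exactly 1 before exponentiating.
    log_weights -= log_weights.max()
    return np.exp(log_weights)


weights = log_space_is_weights([1, 1000, 1999], n=2000, log_2c=math.log(200.0))
```

Whenever the probability is below 1, the weight simplifies to 1 / (2c * s * (n - s)), so the huge factorials cancel; the log-space form only matters for how the intermediate terms are evaluated.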
Thank you. I implemented the changes here: c28409b
Could you please review the changes?
if target <= 0:
    return 0.0  # nothing left to sample beyond empty + grand

binoms = [float(math.comb(n, s)) for s in range(1, n)]  # C(n,s) for each interior size
This is the part where the algorithm will already collapse for large n.
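The point about the binomial terms can be checked directly. The sketch below is illustrative only and uses hypothetical helper names: it shows that Python's exact integer binomial stays usable for large n, that converting it to float stops working once the value exceeds the float range (CPython raises OverflowError), and that the lgamma form stays finite.

```python
import math


def float_binom_or_none(n, s):
    """float(C(n, s)), or None when the exact integer does not fit into a float."""
    try:
        return float(math.comb(n, s))
    except OverflowError:
        return None


def log_binom(n, s):
    """log C(n, s) via lgamma, which stays finite for large n."""
    return math.lgamma(n + 1) - math.lgamma(s + 1) - math.lgamma(n - s + 1)


small = float_binom_or_none(20, 10)     # fits comfortably into a float
large = float_binom_or_none(1100, 550)  # roughly 10**329, beyond the float range
large_log = log_binom(1100, 550)        # finite
```

Any strategy that keeps the binomials as Python ints, or works with their logarithms, avoids the conversion problem; which one suits _find_c is a design decision for the authors.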
I have seen that you have done quite some work in reproducing Algorithms 2 and 3 from the paper. Really nice job! Yet I would like to point out that the CoalitionSampler inherently already supports sampling based on leverage scores using sampling_weights=np.ones(n_players + 1). Most importantly, this should be very similar to the approach you have implemented here, although you are somewhat more efficient in how you deal with duplicates (you avoid them completely; the CoalitionSampler accounts for them in the sampling adjustment weights). But as n increases, it becomes quite difficult to tell the two apart. So, before merging, I would be interested in the difference between your implementation and an implementation using the CoalitionSampler with the weights described above. It should then come down to something very similar to what is depicted in Figure 9 of the LeverageSHAP paper.
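Why uniform size weights correspond to leverage-score sampling can be seen from the leverage score l_z = 1 / C(n, |z|) used in the code comment quoted earlier: each interior size s has C(n, s) coalitions, so every size carries the same total mass. The sketch below illustrates this with hypothetical helper names; it is not shapiq's CoalitionSampler and, unlike the PR's Bernoulli sampler, it can draw duplicates.

```python
import math
from fractions import Fraction

import numpy as np


def size_distribution(n):
    """Probability of each interior size s = 1..n-1 under l_z = 1 / C(n, |z|)."""
    mass = [math.comb(n, s) * Fraction(1, math.comb(n, s)) for s in range(1, n)]
    total = sum(mass)
    return [m / total for m in mass]


def sample_coalition(n, rng):
    """Draw a size uniformly from 1..n-1, then a uniform coalition of that size."""
    size = int(rng.integers(1, n))  # upper bound is exclusive
    members = rng.choice(n, size=size, replace=False)
    coalition = np.zeros(n, dtype=bool)
    coalition[members] = True
    return coalition


probabilities = size_distribution(6)
example = sample_coalition(6, np.random.default_rng(0))
```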
Great catch! I wrote a new Jupyter NB in e98266a that reproduces what you are asking for. In the NB I compare four setups: standard KernelSHAP; LeverageSHAP with our custom Bernoulli sampling implementation; LeverageSHAP without Bernoulli sampling, using a class override inside the NB to force the use of CoalitionSampler instead of our custom sampler; and KernelSHAP with sampling_weights=np.ones(n_players + 1). From my understanding, KernelSHAP with sampling_weights=np.ones(n_players + 1) should behave exactly the same as "LeverageSHAP without Bernoulli" (assuming the same fixed seed), but for illustration purposes I plot both.
Below you can find the results (you can also directly view them inside the committed NB). As you can see our custom implementation consistently beats KernelSHAP + np.ones / LeverageSHAP w/o Bernoulli after a certain budget count (for m >= 42 consistently). Because of that, I believe keeping the custom Bernoulli sampling is worth the extra lines of code. What do you think/suggest?
The plot in words:
Median ℓ₂-norm error (Lower is better):
----------------------------------------------------------------------------------------------------
Budget (m) | KernelSHAP (Standard) | LeverageSHAP w/o Bernoulli (Override) | KernelSHAP + np.ones (Reviewer Setup) | LeverageSHAP (Custom Bernoulli)
2 0.90230 0.90230 0.90230 0.90230
8 0.77130 0.77382 0.77382 0.77667
15 0.45012 0.27952 0.28783 0.12703
22 0.04356 0.04092 0.04092 0.04499
29 0.06454 0.05521 0.05521 0.02961
36 0.02105 0.02224 0.02224 0.02422
42 0.01673 0.02007 0.02007 0.01984
49 0.03694 0.03514 0.03514 0.01719
56 0.01262 0.01592 0.01592 0.01398
63 0.02596 0.02355 0.02355 0.01203
70 0.01104 0.01159 0.01159 0.01040
77 0.02186 0.01517 0.01517 0.00896
83 0.01464 0.01680 0.01680 0.00897
90 0.00835 0.00917 0.00917 0.00890
97 0.01366 0.01357 0.01357 0.00786
104 0.00740 0.00845 0.00845 0.00712
111 0.01297 0.01079 0.01079 0.00661
118 0.00647 0.00769 0.00769 0.00667
124 0.00605 0.00705 0.00705 0.00600
131 0.00841 0.00997 0.00997 0.00521
138 0.00559 0.00650 0.00650 0.00450
145 0.00964 0.00817 0.00817 0.00459
152 0.00429 0.00603 0.00603 0.00433
159 0.00688 0.00645 0.00645 0.00364
165 0.00591 0.00551 0.00551 0.00323
172 0.00377 0.00363 0.00363 0.00329
179 0.00499 0.00565 0.00565 0.00308
186 0.00329 0.00340 0.00340 0.00276
193 0.00448 0.00467 0.00467 0.00248
200 0.00286 0.00289 0.00289 0.00249
====================================================================================================
DIRECT COMPARISON: Custom Bernoulli vs. Reviewer Setup (np.ones)
====================================================================================================
m=2 : Reviewer Setup WINS! Error is 0.0% lower.
m=8 : Reviewer Setup WINS! Error is 0.4% lower.
m=15 : Custom Sampling WINS! Error is 55.9% lower.
m=22 : Reviewer Setup WINS! Error is 9.9% lower.
m=29 : Custom Sampling WINS! Error is 46.4% lower.
m=36 : Reviewer Setup WINS! Error is 8.9% lower.
m=42 : Custom Sampling WINS! Error is 1.1% lower.
m=49 : Custom Sampling WINS! Error is 51.1% lower.
m=56 : Custom Sampling WINS! Error is 12.2% lower.
m=63 : Custom Sampling WINS! Error is 48.9% lower.
m=70 : Custom Sampling WINS! Error is 10.3% lower.
m=77 : Custom Sampling WINS! Error is 41.0% lower.
m=83 : Custom Sampling WINS! Error is 46.6% lower.
m=90 : Custom Sampling WINS! Error is 2.9% lower.
m=97 : Custom Sampling WINS! Error is 42.1% lower.
m=104: Custom Sampling WINS! Error is 15.8% lower.
m=111: Custom Sampling WINS! Error is 38.8% lower.
m=118: Custom Sampling WINS! Error is 13.3% lower.
m=124: Custom Sampling WINS! Error is 14.9% lower.
m=131: Custom Sampling WINS! Error is 47.7% lower.
m=138: Custom Sampling WINS! Error is 30.8% lower.
m=145: Custom Sampling WINS! Error is 43.8% lower.
m=152: Custom Sampling WINS! Error is 28.2% lower.
m=159: Custom Sampling WINS! Error is 43.6% lower.
m=165: Custom Sampling WINS! Error is 41.4% lower.
m=172: Custom Sampling WINS! Error is 9.4% lower.
m=179: Custom Sampling WINS! Error is 45.5% lower.
m=186: Custom Sampling WINS! Error is 18.8% lower.
m=193: Custom Sampling WINS! Error is 46.8% lower.
m=200: Custom Sampling WINS! Error is 13.9% lower.
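The percentages in the direct comparison appear consistent with measuring the gap relative to the reviewer-setup error and reporting its absolute value. The notebook code is not shown in this thread, so the sketch below is an inference from the printed numbers, with a hypothetical function name.

```python
def compare(custom_err, reviewer_err):
    """Return (winner, percentage by which the winner's error is lower)."""
    relative = (reviewer_err - custom_err) / reviewer_err
    winner = "Custom Sampling" if relative > 0 else "Reviewer Setup"
    return winner, abs(relative) * 100


# Median errors taken from the table above.
m15 = compare(custom_err=0.12703, reviewer_err=0.28783)
m22 = compare(custom_err=0.04499, reviewer_err=0.04092)
m2 = compare(custom_err=0.90230, reviewer_err=0.90230)
```

Because the denominator is always the reviewer-setup error, a reported "9.9% lower" for a reviewer win is relative to the reviewer's own error, not to the custom sampler's. An exact tie falls to the reviewer setup with 0.0%, as in the m=2 row.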
Dear @Advueu963, thank you very much for the feedback! I will implement your suggestions asap.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
- I accidentally only accepted one out of three autofix suggestions by Copilot; the other two are now marked as outdated and can't be committed inside GitHub anymore, which is why I commit them manually
- "_find_c() converts binomial coefficients to float (`float(math.comb(...))`). For large `n`, `math.comb(n, s)` can exceed the maximum finite float and become `inf`, which then makes `hi` infinite and the bisection never meaningfully converges (returning `inf` for `c`). This can break sampling/weight computation for large-player games. Keep binomials as Python ints and choose an upper bound for `c` via exponential search (or another integer-safe strategy) instead of `max_binom/2` in float space."
- "_sample_without_replacement() uses rejection sampling for `total >= 10**6`. This is only efficient when `total >> k`, but in this implementation `k` can be a large fraction of `total` (e.g., when `prob` is close to 1 but still < 1). In that case the `while len(seen) < k:` loop can take an extremely long time due to heavy collisions. Python's `random.sample()` supports sampling directly from `range(total)` without materializing it and handles the `k` vs `total` regime robustly, so it's safer to use it for the large-pool case too."
- mmschlk#524
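The second suggestion quoted above relies on random.sample() accepting a range object. A minimal sketch of that idea follows; the function name and seed handling are hypothetical and not the PR's _sample_without_replacement(). One caveat worth knowing: random.sample() calls len() on the population, and len(range(total)) raises OverflowError once total exceeds sys.maxsize, so extremely large pools still need a separate path.

```python
import math
import random


def sample_indices(total, k, seed=0):
    """Draw k distinct indices from range(total) without materializing the range."""
    rng = random.Random(seed)
    return rng.sample(range(total), k)


pool = math.comb(60, 30)            # about 1.18e17 coalitions, far too many to list
sparse = sample_indices(pool, 1000)  # k much smaller than total
dense = sample_indices(10, 9)        # k close to total also works
```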
…ities_and_crime() and added fallback mechanism in the random sampling method
- Implement feedback from mmschlk#524
- Replace custom dataset loading with shapiq_games.datasets.load_communities_and_crime() and refactor all code cells accordingly
- Also load communities data set and extend code cells to skip computing exact Shapley values for n > 15 (for Communities dataset)
- Skip experiments with communities dataset where exact_svs is None
- Add fallback to src/shapiq/approximator/regression/leverageshap._sample_without_replacement() for astronomically large binomial pools
…ges into this separate commit) for easier reviewing
…s, boolean traps, and df naming in SOUM notebook
…AP (with our custom bernoulli sampling), LeverageSHAP w/o bernoulli using class override and KernelSHAP + np.ones
Dear @mmschlk, I refactored the solve_regression() method in src/shapiq/approximator/regression/base.py as discussed in our meeting, removed a (now outdated/unused) old unit test in e70f9d6, and resolved all remaining ruff style warnings in 67d05fc. The method is now cleaned up, more readable, and introduces the "safe" try-except code block as well as the … Could you please review the changes? Thank you.
I've also run the following commands locally to ensure that all code quality checks (pre-commit), unit tests, and coverage pass (the same commands as in the GitHub CI pipeline):
uv sync --group lint --all-extras
uv run pre-commit run --all-files --show-diff-on-failure
uv sync --all-extras
uv run pytest "tests/shapiq" --cov=shapiq --cov-report=term -n logical
uv sync --no-dev --all-extras --group all_ml
uv run --no-sync python -c "import shapiq; print('✅ shapiq imported successfully')"
uv run --no-sync python -c "import shapiq_games; print('✅ shapiq_games imported successfully')"
If you're okay with the changes, please feel free to re-run the workflow in GitHub.
Codecov Report❌ Patch coverage is
Thanks for running the CI. I see that the documentation is not building correctly and that some lines are still missing coverage. I'll have a look at it ASAP.
…sets and different n
…want to make wrong assumptions)
@Advueu963 As requested in the last meeting, I have updated and expanded the benchmark notebook to evaluate our custom LeverageSHAP implementation (Algorithm 2) against the uniform KernelSHAP baseline across 25 configurations and 4 distinct datasets.
The results demonstrate that our method surpasses KernelSHAP + np.zeros, with the custom implementation achieving a lower absolute L2 error and winning in 22 to 28 out of 30 budget steps across almost every configuration (please see the table below). Also, our Bernoulli sampling mechanism resolves the "zig-zag" pattern seen in 9 out of 25 baseline plots for KernelSHAP + np.zeros. I assume the "zig-zag" occurs when the KernelSHAP + np.zeros setup runs out of budget mid-layer.
Some configurations show extreme negative percentages in the relative "Avg Improvement" column. I think this is caused when the baseline randomly hits a near-zero error in its symmetry valleys, which then distorts the relative average.
The fully cleaned and documented notebook is now ready for review. I also added an "Empirical Evaluation: Custom LeverageSHAP vs. Uniform Weighting Baseline" section which describes and tries to explain the observations. In the Jupyter NB we also plot every single experiment (i.e. each row of the table below).
You can also find the Jupyter NB exported as a PDF here:
…NBs will not be pushed to the shapiq repository according to the last meeting)
@Advueu963 I moved the notebooks located in |


Motivation and Context
In this PR we added the implementation of the LeverageSHAP approximator, based on the paper by Musco & Witter (2024). We (1) built a custom sampler for leverage score sampling (uniform size + paired sampling), (2) implemented the regression solver using the row-centering trick (Lemma 3.1), and (3) added various test cases. We originally opened a PR in our fork (FabianK-Dev#1), which I have now closed and am reopening here.
A few disclaimers / important notes:
src/shapiq/approximator/regression/our_impl_progress.md (as requested in the project description) but this file is still work-in-progress and not really ready for review, yet.
Public API Changes
Details: Added LeverageSHAP class to shapiq.approximator.regression.
How Has This Been Tested?
We added many unit tests to cover the following things:
lstsq solver maintains the efficiency axiom on ill-conditioned matrices.
You can run all new unit tests using:
uv run pytest tests/shapiq/tests_unit/tests_approximators/test_approximator_leverageshap.py
Tests are passing:
Checklist
We haven't completed all points on the checklist yet, as this is still an early PR where we ask for feedback but don't plan to merge it into the upstream repository, yet.
Documentation has been updated (if the public API or usage changes).
An entry has been added to CHANGELOG.md (if relevant for users).
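As background for the axiom-based tests mentioned in this PR (efficiency, and the null-player axiom named in the commit list), the properties can be checked against a brute-force Shapley computation on a tiny game. The sketch below is independent of shapiq; the game and all names are hypothetical, and the enumeration is only feasible for small n.

```python
import itertools
import math


def exact_shapley(n, value):
    """Brute-force Shapley values for a game given as value(frozenset) -> float."""
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = math.factorial(size) * math.factorial(n - size - 1) / math.factorial(n)
            for subset in itertools.combinations(others, size):
                coalition = frozenset(subset)
                phi[i] += weight * (value(coalition | {i}) - value(coalition))
    return phi


def toy_game(coalition):
    """Player 0 contributes 2, players 0 and 1 together add 3, player 2 is a null player."""
    total = 0.0
    if 0 in coalition:
        total += 2.0
    if 0 in coalition and 1 in coalition:
        total += 3.0
    return total


phi = exact_shapley(3, toy_game)
efficiency_gap = sum(phi) - (toy_game(frozenset({0, 1, 2})) - toy_game(frozenset()))
```

An approximator can then be compared against phi: efficiency requires the values to sum to v(N) - v(∅), and the null player's value should be zero.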