Vision-Language Model Imputer Module #543
… tests
- Add CustomSegmenter: user-provided binary masks as players (registered as "custom_segmenter", file: segmenters/custom.py)
- Add VisionExplainer(Explainer) with auto-dispatch for HF VLMs (file: explainer/vision.py, registered in ExplainerTypes)
- Add ExactComputer correctness test with manually verified SV values
- Add Player×Masker matrix test (4 combos × forward_1d + InteractionValues)
- Add blur masker combos for both patch and custom segmenters
- Add VisionExplainer integration + auto-dispatch tests
99 tests passing (78 existing + 21 new)
Split the former vision/base.py into three files by responsibility:
- vision/base.py: data transfer protocol only
  (SpatialLayout, PhysicalMask, ProcessorOutput)
- vision/segmenters/base.py: Segmenter(ABC), SegmenterConfig,
  per-strategy param dataclasses (PatchParams,
  SlicParams, GradientGuidedParams, CustomSegmenterParams)
- vision/maskers/base.py: Masker(ABC), MaskerConfig,
  per-strategy param dataclasses (CrossModalMeanParams,
  CrossModalBlurParams, VisionMeanParams,
  VisionBlurParams, TextAttentionParams)
All cross-references within the vision package updated to relative imports.
99 tests pass.
clean unrelated files
restore to origin
restore to origin
Please ping me here as soon as you want a PR review. Note, however, that for a review the CI pipeline needs to be green first (all tests pass, the code-quality checks are okay, and the docs-building pipeline compiles). :)
Fix all ruff lint and ty type-checker errors across the vision module
to satisfy pre-commit CI requirements (ruff: 203→0, ty: 0 errors).
Changes:
- Add missing docstrings (D102/D107/D205/D417) to all public methods
- Add missing type annotations (ANN*) and replace Any with concrete types
- Convert relative imports to absolute (TID252)
- Move type-only imports to TYPE_CHECKING guards (TC001/TC002)
- Replace exception string literals with msg variable (EM101/EM102/TRY003)
- Fix dict() -> {} literals (C408), redundant assignments (RET504)
- Fix ambiguous × characters in docstrings/tests (RUF001/RUF002)
- Make bool params keyword-only (FBT001/FBT002)
- Add __all__ and noqa for registry __init__.py pattern (E402/F401)
- Fix ty errors: kwargs type annotations, import guard patterns
- Fix test files: replace × with x, unused variable
- Fix notebook: replace × with x, dict() -> {}, add per-file-ignores
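To make a few of the rule families listed above concrete, here is a minimal sketch of the before/after patterns (EM101/TRY003, FBT001/FBT002, C408). The helper `load_patch_size` and its config key are hypothetical and exist only for illustration; they are not taken from the PR.

```python
from __future__ import annotations


def load_patch_size(config: dict[str, int], *, strict: bool = False) -> int:
    """Return the patch size from a config mapping (hypothetical helper).

    The bare ``*`` makes ``strict`` keyword-only, which is the usual fix
    for the boolean-positional-argument rules (FBT001/FBT002).
    """
    if "patch_size" not in config:
        if strict:
            # EM101/TRY003: assign the message to a variable first, then
            # raise with that variable instead of an inline string literal.
            msg = "config is missing the 'patch_size' key"
            raise ValueError(msg)
        return 16  # illustrative fallback value
    return config["patch_size"]


# C408: prefer a dict literal over dict(patch_size=32).
config = {"patch_size": 32}
print(load_patch_size(config))
print(load_patch_size({}, strict=False))
```

Calling `load_patch_size({}, True)` raises a `TypeError`, because the flag can no longer be passed positionally.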
The HuggingFace VLM check (hasattr(model, 'vision_model')) was overwriting already-matched model types because Mock() returns True for all hasattr checks. Added _model_type == 'tabular' guard so VLM detection only fires when no prior check matched.
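The pitfall described above can be reproduced with the standard library alone. The sketch below is a toy dispatch function, not the PR's `_infer_model_type`; the names `TabularModel`, `FakeVLM`, `infer_model_type`, and the `is_tree_model` attribute are assumptions made up for this illustration.

```python
from unittest.mock import Mock


class TabularModel:
    """Hypothetical stand-in for a model without a vision tower."""

    def predict(self, x):
        return x


class FakeVLM:
    """Hypothetical stand-in for a HF VLM exposing ``vision_model``."""

    vision_model = object()


def infer_model_type(model: object) -> str:
    """Toy dispatch: start at 'tabular', upgrade only if nothing else matched."""
    model_type = "tabular"
    if hasattr(model, "is_tree_model"):  # hypothetical earlier check
        model_type = "tree"
    # The guard keeps the VLM check from overwriting an earlier match.
    if hasattr(model, "vision_model") and model_type == "tabular":
        model_type = "vision"
    return model_type


loose = Mock()                        # auto-creates any attribute on access
strict = Mock(spec=TabularModel)      # only exposes TabularModel's attributes

print(hasattr(loose, "vision_model"))   # True
print(hasattr(strict, "vision_model"))  # False
print(infer_model_type(loose))          # tree (the guard blocks the overwrite)
print(infer_model_type(FakeVLM()))      # vision
```

In tests, passing `spec=` to `Mock` is an alternative way to keep `hasattr`-based detection honest, independent of the guard.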
Hi @mmschlk, I have fixed all the issues, and the CI pipeline is now green except for the "upload coverage to codecov" step. This remaining failure appears to be a configuration or token issue with the Codecov service rather than a code quality or test failure. I am looking forward to your review of this PR!
mmschlk left a comment
Thank you already for your very detailed work. Overall, I am already quite happy with the extent and breadth of what you did to bring VLMs to shapiq. I left a couple of nitpick comments. I also have two more elaborate comments, which I will detail further here.
First, the break in the API design (also see the big comment in the Explainer's init). Currently your Explainer cannot be called with different x (image + text) inputs inside its explain function; instead, all the instance-relevant information has to be passed at init time. This is not consistent with shapiq's core API. Explainer inits carry the information "how explanation will be done in this setting", and explain functions bring the instance-related information, i.e. which x is to be explained. You can then use the same explainer to explain multiple x one after another. This is currently not possible, but would be nice to achieve.
Second, practitioners and novice users currently will not really understand what the Explainer is actually about (also see the notebook comments). For this we need proper examples showcasing how to use the explainer, why the API and its different choices (e.g. for maskers) matter, and what they change. A set of example scripts alongside the existing examples would be very welcome. Note, however, that these examples must not run too long on the doc-building runners; if you want, they can also be manually turned off so that they do not always run automatically. So please provide more examples.
Actually, we do not allow .ipynb files anymore inside the docs or the examples folder. So you cannot have notebook examples but only real script files. You can see how this is done with scripts here.
3 demo files are added to examples/vision/ now
"""Valid index types for the VisionExplainer."""

class VisionExplainer(Explainer):
I do not like the name VisionExplainer for this explainer class as it only really deals with VisionLanguage models. I would rather refactor it to be named as such: VisionLanguageExplainer.
VisionExplainer is renamed to VisionLanguageExplainer in the new commit.
    and not (model_class or "").startswith("torch.nn.")
    and _model_type == "tabular"
):
    _model_type = "vision"
Similar comment as for the explainer name: "vision" is too generic for the thing you built regarding vision-language models. Please address the naming refactor consistently across the PR from there.
@@ -0,0 +1,239 @@
"""Vision Explainer for shapiq.

The :class:`VisionExplainer` explains vision-language model predictions using
I have already marked this in the notebook file. It would be nice to also get a couple of example scripts (in the correct file structure, as executable Python scripts) that showcase how to do the explanations for laypeople. This is a rather bigger comment, but the codebase should also be documented well for someone who does not really know how vision-language models may be explained. Two or three examples would be very welcome here.
interaction_values.baseline_value = self.baseline_value
return interaction_values

# ─── Internal helpers ─────────────────────────────────────────────────
I do not like these AI delimiters.
ExplainerTypes = Literal[
-    "tabular", "tree", "tabpfn", "game", "product_kernel", "knn", "wknn", "tnn"
+    "tabular", "tree", "tabpfn", "game", "product_kernel", "knn", "wknn", "tnn", "vision"

Operates exclusively on pixel_values. Must never touch input_ids or
attention_mask.

.. note::
It's implementation task allocation, I will remove this comment.
max_order: int = 1,
random_state: int | None = None,
verbose: bool = False,
use_amp: bool = False,
Inside shapiq we currently do not support AMP anywhere else, which is why I would like to remove it here as well.
model: Any,  # noqa: ANN401
data: Any = None,  # noqa: ANN401
*,
text: str = "",
I do not like defaults like this. An empty text should not be possible as a default. If a default is nonsensical, then the parameter should not have a default. As this is a vision-language game, it needs an input text and an input image, both of which need to be provided. So for consistency, no default should be provided here either. However, there is a more important issue here that actually breaks the current shapiq API:
Explainer instantiations are not expected to carry information about the local explanation that you can do with them, but rather the boilerplate setup that governs how "explaining" with this explainer will work (like setting the masking strategy, the processor steps, or the interaction index we are interested in, together with the approximators that are available).
The actual "explanation point" is then always provided once the user calls the explainer's explain function, via the x parameter, which in this case would carry the image and the text. Of course, the overall setup can change in your VLM case compared, for example, to explaining tabular models (with different text lengths you will have a different n_players at each explain time). So you would need to reinstantiate your approximators for each new x.
This issue can also be seen in the explain function's docstring:
x: Ignored for vision models (the image is fixed at construction time). Kept for API compatibility.
You should not achieve API compatibility by offering dead parameters, but actually conform to the API.
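A minimal sketch of the split described in this comment, assuming a toy setting: configuration goes into `__init__`, the instance goes into `explain(x)`, and the game is rebuilt on every call because the number of players depends on the image size and text length. Everything here (`ToyVisionLanguageExplainer`, `toy_model`, whitespace tokenization, zero-masking, image dimensions divisible by the patch size) is a hypothetical illustration and not shapiq's actual API.

```python
from __future__ import annotations

Image = list[list[float]]


def toy_model(image: Image, tokens: list[str]) -> float:
    """Stand-in for a VLM score: pixel sum plus token count."""
    return sum(map(sum, image)) + len(tokens)


class ToyVisionLanguageExplainer:
    """Hypothetical explainer holding setup only, no instance data."""

    def __init__(self, model, *, patch_size: int) -> None:
        self.model = model
        self.patch_size = patch_size

    def explain(self, x: dict) -> dict:
        """Build a fresh game for this x; n_players varies per call."""
        image, tokens = x["image"], x["text"].split()
        p = self.patch_size
        rows, cols = len(image) // p, len(image[0]) // p
        n_patches = rows * cols
        n_players = n_patches + len(tokens)

        def game(coalition: tuple[bool, ...]) -> float:
            # Absent patches are zeroed, absent tokens are dropped.
            masked = [
                [
                    pixel if coalition[(r // p) * cols + (c // p)] else 0.0
                    for c, pixel in enumerate(row)
                ]
                for r, row in enumerate(image)
            ]
            kept = [t for t, keep in zip(tokens, coalition[n_patches:]) if keep]
            return self.model(masked, kept)

        return {
            "n_players": n_players,
            "baseline_value": game((False,) * n_players),
            "full_value": game((True,) * n_players),
        }


explainer = ToyVisionLanguageExplainer(toy_model, patch_size=2)
first = explainer.explain({"image": [[1.0] * 4 for _ in range(4)], "text": "a cat"})
second = explainer.explain({"image": [[1.0] * 4 for _ in range(2)], "text": "a photo of a dog"})
print(first["n_players"], second["n_players"])  # 6 7
```

The same explainer object handles both inputs; only the per-call game changes, which is the place where an approximator would also have to be re-created.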
@dataclass
class CrossModalMeanParams:
Most of these dataclasses are empty and carry no meaning. I was not able to see where they are all used. Maybe I missed something, but please double-check whether this structure is necessary as it is here.
…r fix
CHANGES:
- VisionExplainer renamed to VisionLanguageExplainer
- image/text moved from __init__ to explain(x={"image": ..., "text": ...})
- SegmenterConfig / MaskerConfig replaced per-strategy fields
  (patch/slic/custom_segmenter/crossmodal_mean, ...) with single params field
Features:
- VisionLanguageExplainer: new API conforming to shapiq convention;
  game + approximator rebuilt per explain() call (handles varying n_players)
- Example scripts: plot_vision_language_clip.py (CLIP Patch + SLIC),
  plot_vision_explainer_custom.py (custom masks + blur masker)
- safe_processor_call: fallback for Fast processors (Transformers 4.51+)
Cleanup:
- Remove .ipynb notebook, replace with .py Sphinx-Gallery scripts
- Remove notebook-only ruff rules from pyproject.toml
- Remove use_amp parameter from explainer, imputer, and factory
- Collapse 6 empty params dataclasses into single EmptyParams alias
Description
This PR introduces a new shapiq.imputer.vision sub-package that extends shapiq's imputer framework with pluggable segmenters and maskers for vision-language model explanation.
Keywords: CLIP, SigLIP, VLM explanation, image occlusion, patch segmentation, SLIC superpixels
Motivation
shapiq currently supports tabular data imputation (MarginalImputer, BaselineImputer, GaussianImputer, etc.) and nearest-neighbour explainer games (NNExplainerGameBase). However, explaining vision-language models (VLMs) such as CLIP and SigLIP requires a different paradigm:
This PR adds a modular vision pipeline that follows shapiq's existing Game contract while introducing two new abstractions, Segmenter and Masker, that can be mixed and matched.
Solution
A new sub-package shapiq/imputer/vision/ with:
- Segmenter(ABC)
- Masker(ABC)
- VisionImputer
- VisionImputerFactory
- VisionLanguageGame: shapiq.Game adapter for approximators
- PatchSegmenter
- SLICSegmenter
- CustomSegmenter
- VisionMeanMasker
- VisionBlurMasker
- TextAttentionMasker
- CrossModalMeanMasker
- CrossModalBlurMasker
Related Work
This follows the same architectural pattern as shapiq.explainer.nn.games.NNExplainerGameBase(Game): a domain-specific Game subclass that does not inherit from Imputer. Like the NN explainer games, VisionLanguageGame(Game) can be used directly with any shapiq approximator.
Changes
New files
VisionExplainer
New explainer integration at src/shapiq/explainer/vision.py:
- VisionExplainer(Explainer): wraps VisionImputerFactory → VisionLanguageGame → approximator
- shapiq.Explainer(model, data=image, text=..., processor=...) automatically routes HF VLMs to VisionExplainer
- ExplainerTypes ("vision") and get_explainers()
Tests
99 unit tests covering:
Dependencies
torch, shapiq.explainer.nn, transformers, scikit-image, Pillow
Usage
Testing
Expected output: 99 passed for the vision + explainer test suite.
Design Rationale
Why a new sub-package instead of extending Imputer?
The existing Imputer base class has a tabular-specific constructor signature (model, data, x, sample_size, categorical_features). A VLM imputer needs a HuggingFace model + processor, PIL.Image, and pluggable Segmenter/Masker components. Forcing these into the existing Imputer hierarchy would require breaking backward compatibility and would create a semantic mismatch between "imputing missing values" (tabular) and "occluding features" (vision).
Following the precedent of NNExplainerGameBase(Game) under explainer/nn/games/, the vision module introduces its own Game subclass that delegates all masking/batching logic to a VisionImputer orchestration layer.
CPU Planning, GPU Execution
Segmenters compute pixel-to-player mappings on CPU (e.g., skimage.segmentation.slic) and upload the result to GPU once. Mask application runs entirely on GPU via tensor ops. This keeps the expensive per-coalition work on GPU while respecting skimage's CPU-only API.
Open for extension
- Segmenter.get_layout() + generate_masks(), decorate with @register_segmenter("name").
- Masker.apply(), decorate with @register_masker("name").
- model.name_or_path; new model families require adding a detection branch in _infer_model_type.
Checklist
- 137 passed, 3 skipped for all imputer tests; skipped tests are tabpfn optional dependency, unrelated to vision module)
- docs/examples/vision_language_clip.ipynb) added
- __all__ defined in __init__.py
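Returning to the design rationale ("CPU Planning, GPU Execution" and "Open for extension"): the two ideas can be sketched in pure Python. The layout is planned once per image, then reused for every coalition, and new strategies are added through a decorator-based registry. The registry dict, the method signature of `get_layout`, `ToyPatchSegmenter`, and the global-mean masking rule are assumptions for illustration; the PR's real classes operate on tensors and may differ in every detail.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from typing import Callable

# Hypothetical registry; the PR's real registry may look different.
SEGMENTERS: dict[str, type] = {}


def register_segmenter(name: str) -> Callable[[type], type]:
    """Class decorator that files a segmenter under a string key."""

    def decorator(cls: type) -> type:
        if name in SEGMENTERS:
            msg = f"segmenter {name!r} is already registered"
            raise ValueError(msg)
        SEGMENTERS[name] = cls
        return cls

    return decorator


class Segmenter(ABC):
    @abstractmethod
    def get_layout(self, height: int, width: int) -> list[list[int]]:
        """Return a pixel-to-player label map (the planning step)."""


@register_segmenter("toy_patch")
class ToyPatchSegmenter(Segmenter):
    def __init__(self, patch_size: int) -> None:
        self.patch_size = patch_size

    def get_layout(self, height: int, width: int) -> list[list[int]]:
        p = self.patch_size
        cols = -(-width // p)  # number of patch columns, rounded up
        return [[(r // p) * cols + (c // p) for c in range(width)] for r in range(height)]


def apply_mean_mask(image, layout, coalition):
    """Replace pixels of absent players with the global image mean."""
    pixels = [px for row in image for px in row]
    mean = sum(pixels) / len(pixels)
    return [
        [px if coalition[layout[r][c]] else mean for c, px in enumerate(row)]
        for r, row in enumerate(image)
    ]


segmenter = SEGMENTERS["toy_patch"](patch_size=2)
layout = segmenter.get_layout(4, 4)  # planned once, reused per coalition
image = [
    [0.0, 0.0, 4.0, 4.0],
    [0.0, 0.0, 4.0, 4.0],
    [8.0, 8.0, 12.0, 12.0],
    [8.0, 8.0, 12.0, 12.0],
]
masked = apply_mean_mask(image, layout, (True, False, True, True))
print(layout[0], masked[0])  # [0, 0, 1, 1] [0.0, 0.0, 6.0, 6.0]
```

In a tensor-based version, `layout` would be uploaded to the device once and `apply_mean_mask` would become a batched indexing operation over many coalitions at a time.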