Fix DiceGenetic.compute_proximity_loss for all-categorical datasets (#276) #471
Open
jbbqqf wants to merge 1 commit into interpretml:main from
When the dataset has no continuous features, `continuous_feature_indexes` is empty, so `feature_weights` is an empty np.array and the original implementation hit either:

* `proximity_loss / sum(feature_weights)` ⇒ ZeroDivisionError / RuntimeWarning + NaN losses (the symptom @kburchfiel reported in interpretml#276 with the original quoted snippet), or
* `product.reshape(-1, product.shape[-1])` ⇒ ValueError on a 0-sized array,

depending on input shape.

Both paths poison `compute_loss` with NaN/exceptions and break the genetic search for legitimate all-categorical use cases.

Proximity is conceptually undefined when there are no continuous distances to weigh, so short-circuit with a zero loss vector matching the population shape. The categorical penalty in `compute_loss` already accounts for categorical sparsity, so dropping the proximity contribution is the correct semantics, and it matches what users expect when they explicitly set up a categorical-only `dice_ml.Data`.

Adds `TestComputeProximityLossNoContinuousFeatures` covering the all-categorical path. The test fails on `origin/main` with the ValueError reshape variant of this bug.

Closes interpretml#276.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
`DiceGenetic.compute_proximity_loss` divides by `sum(feature_weights)`, where `feature_weights` is restricted to the continuous feature indexes. For an all-categorical dataset (e.g. survey responses, OHE-only inputs) that array is empty, and the original code hit one of two bugs depending on input shape:

* `proximity_loss / sum(feature_weights)` ⇒ ZeroDivisionError / RuntimeWarning + NaN (the snippet @kburchfiel quoted in "proximity loss when there are not continous features" #276), or
* `product.reshape(-1, product.shape[-1])` ⇒ ValueError on the 0-sized array.

Both poison `compute_loss` with NaN/exceptions and break genetic search for legitimate categorical-only use.
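For reviewers who want to see the two failure modes without installing `dice_ml`, here is a minimal NumPy-only sketch. It is not the library code: the array shapes (3 candidates, 0 continuous columns) and the variable names are assumptions chosen to mirror the two expressions quoted above.

```python
import numpy as np

# Assumption: an all-categorical dataset leaves no continuous weights.
feature_weights = np.array([])

# Failure mode 1: normalizing by an empty weight sum.
# The builtin sum() of an empty array is 0, so every entry becomes 0/0 = NaN.
proximity_loss = np.zeros(3)  # one entry per candidate (hypothetical size)
with np.errstate(divide="ignore", invalid="ignore"):
    normalized = proximity_loss / sum(feature_weights)
all_nan = bool(np.isnan(normalized).all())

# Failure mode 2: reshaping with -1 when the last axis has length 0.
# NumPy cannot infer the -1 dimension for a 0-sized array.
product = np.empty((3, 0))
try:
    product.reshape(-1, product.shape[-1])
    reshape_failed = False
except ValueError:
    reshape_failed = True

print(all_nan, reshape_failed)
```

Without the `np.errstate` context the first division also emits a RuntimeWarning; with plain Python floats instead of arrays the same division raises ZeroDivisionError.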
Why
Proximity is conceptually undefined when there are no continuous distances to weigh, so the function short-circuits with a zero loss vector matching the population shape. The `categorical_penalty` term in `compute_loss` already handles categorical sparsity, so dropping the proximity contribution is the correct semantics, and it matches what users expect when they explicitly set up a categorical-only `dice_ml.Data`.

The change is one early-return on `len(feature_weights) == 0`. A comment in the code explains the choice for future reviewers.
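The shape of the guard can be sketched as a standalone function. This is an illustration under stated assumptions, not the patched method itself: `proximity_loss_sketch`, its parameters, and the weighted absolute-difference formula are hypothetical stand-ins, and the real method works on `self.data_interface` rather than on explicit arguments.

```python
import numpy as np


def proximity_loss_sketch(x_hat, query, continuous_indexes, weights):
    """Weighted absolute distance over continuous columns (hypothetical helper).

    x_hat: (population, n_features) candidates, assumed already normalized.
    query: (n_features,) query instance, assumed already normalized.
    continuous_indexes: list of column indexes of continuous features.
    weights: per-feature weights, indexed by column.
    """
    x_hat = np.atleast_2d(np.asarray(x_hat, dtype=float))
    query = np.asarray(query, dtype=float)
    feature_weights = np.array(
        [weights[i] for i in continuous_indexes], dtype=float
    )

    # Early return: with no continuous features there is nothing to weigh,
    # so every candidate gets a proximity loss of zero.
    if len(feature_weights) == 0:
        return np.zeros(x_hat.shape[0])

    diff = np.abs(x_hat - query)[:, continuous_indexes]
    # Normalize by the weight sum, which is non-empty past the guard.
    return (diff * feature_weights).sum(axis=1) / feature_weights.sum()


# All-categorical case: no continuous indexes, four candidates.
print(proximity_loss_sketch(np.ones((4, 3)), np.zeros(3), [], [1, 1, 1]))
```

Placing the guard before any indexing or reshaping means neither of the two failing expressions is reached on the all-categorical path, while mixed datasets follow the unchanged computation.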
Reproduce BEFORE/AFTER yourself (copy-paste)
What I ran locally
* The new test fails on `origin/main` with `ValueError: cannot reshape array of size 0 into shape (0)` (the original snippet's `proximity_loss / sum(...)` path is reached only with a slightly different input shape, but both have the same root cause)
* `tests/test_dice_interface/test_dice_genetic.py` ran: same 4 pre-existing failures as `origin/main` (the regression suite for `TestDiceGeneticRegressionMethods` is broken on main for separate reasons), 0 new failures introduced

Edge cases
`np.zeros(population)` · `proximity_weight = 0` · `compute_loss`

AI disclosure
This change was prepared with the assistance of Claude (Anthropic).
The author reviewed every line and is responsible for the final result.