fix(checks): refresh dispatcher for built-in checks with user names (#2042)#2330
Closed
jbbqqf wants to merge 1 commit into
Closed
fix(checks): refresh dispatcher for built-in checks with user names (#2042)#2330jbbqqf wants to merge 1 commit into
jbbqqf wants to merge 1 commit into
Conversation
…nionai-oss#2042) ``Check.__call__`` reloads ``self._check_fn`` from the class-level registry so that per-type entries added by lazy backend registration (during ``schema.validate``) are visible — schema validation deep-copies Check instances, leaving the deep-copied ``Dispatcher`` with a stale per-type function map. The reload was guarded by ``self.is_builtin_check(self.name)``, but ``self.name`` is the user-supplied label when constructing via ``Check.gt(0, name="my_check")``, so the guard never fires for named built-in checks. The result was ``KeyError: <class 'pandas.Series'>`` when validating a column with a renamed built-in check. Track the underlying built-in registry key on the instance via ``_builtin_check_name`` (set by ``BaseCheck.from_builtin_check_name``) and consult it in addition to ``self.name`` when deciding whether to refresh the dispatcher. Add a regression test that fails on origin/main and passes on this branch. Co-Authored-By: Claude Code <noreply@anthropic.com> Signed-off-by: jbb <jbaptiste.braun@gmail.com>
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #2330 +/- ##
=======================================
Coverage 83.51% 83.52%
=======================================
Files 190 190
Lines 16613 16616 +3
=======================================
+ Hits 13875 13878 +3
Misses 2738 2738 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Collaborator
|
Thanks for the contributions @jbbqqf ! On this one, need to make the linters happy with Check out https://github.com/unionai-oss/pandera/blob/main/AGENTS.md, which should make Claude more useful on this project. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Built-in check builders accept a
name=kwarg used to label the check in error messages (Check.gt(0, name="my_check")). On schema validation, this raisedKeyError: <class 'pandas.Series'>becauseCheck.__call__'s dispatcher-refresh guard relied onself.namematching a registered built-in name — which it never does once the user has supplied a custom label. Track the underlying built-in registry key separately and use it for the refresh decision.Resolves #2042.
Context
Check.gt(0)(default name) andCheck.gt(0, name="positive")follow different paths inCheck.__call__:The reload exists because schema validation deep-copies Check instances (via
copy.deepcopy(self)in manypandera/api/dataframe/container.pymethods). Deep-copying aDispatcherproduces a new instance with a new_function_registrydict — so per-type entries that lazy backend registration (register_pandas_backends) adds to the original dispatcher are invisible to the deep-copy.The reload re-fetches the live dispatcher from the class-level
CHECK_FUNCTION_REGISTRY, papering over the deep-copy. But it gates onself.name, which the user is allowed to override; with a custom name the gate never opens, the deep-copied stale dispatcher stays in place, anddispatcher.__call__KeyErrors when it triestype(args[0])lookup.Verified by tracing dispatcher object identity: at validation time
Check.gt(0, name="my_check")._check_fnhas a differentid()thanBaseCheck.CHECK_FUNCTION_REGISTRY["greater_than"], with a registry containing only{typing.Any}(the base function registered before any pandas-specific overload).Changes
pandera/api/base/checks.py:from_builtin_check_namenow setsinstance._builtin_check_name = nameon the returned check, capturing the registry key independently of the user-supplied label. A 7-line comment explains the deep-copy interaction so a future reader doesn't have to re-derive it.pandera/api/checks.py:Check.__call__now resolves abuiltin_key = getattr(self, "_builtin_check_name", None) or self.nameand uses that (notself.namedirectly) to decide whether to refreshself._check_fn. Falls back toself.namefor instances built outside the factory (e.g. user-defined Check subclasses), preserving existing behaviour.tests/pandas/test_checks.py: regression testtest_builtin_check_with_user_supplied_namecovering both green path (validation passes for valid input) and red path (invalid input still produces a cleanSchemaErrorwith the underlyinggreater_than(0)formatting, not aKeyError).Reproduce BEFORE/AFTER yourself (copy-paste)
What I ran locally
pytest tests/pandas/test_checks.py::test_builtin_check_with_user_supplied_name -v→ 1/1 passed (regression test).pytest tests/pandas/test_checks.py tests/pandas/test_checks_builtin.py -q→ 285 passed.origin/mainwith the documentedKeyError: <class 'pandas.Series'>traceback.Edge cases tested
Check.gt(0, name="positive")on a column that satisfies the checkCheck.gt(0, name="positive")on a column that fails the checkgreater_than(0)failure-case formatting (not a KeyError)Check.gt(0)(noname=) — pre-existing happy pathtests/pandas/test_checks*.pysuite (285 passed)Checksubclasses without_builtin_check_nameself.name, original behaviour preservedRisk / blast radius
The new attribute
_builtin_check_nameis set only on instances created viafrom_builtin_check_name; everywhere else,getattr(self, "_builtin_check_name", None)returns None and the resolution falls through toself.name— i.e. exactly the prior behaviour. No public API changes. The attribute is set by all existing call sites inpandera/api/checks.py(everycls.from_builtin_check_name(...)builder) and inpandera/api/hypotheses.py.PR drafted with assistance from Claude Code. The change was reviewed manually against pandera's check / dispatcher / backend layers and against the reporter's traceback in #2042.