Drop purge_asymmetric_pollution + KNOWN_BAD_NARROWMATCH (Codex-#558 closed upstream)#560
Closed
realmarcin wants to merge 1 commit into
Closed
Conversation
…losed upstream) MIM PRs #2, #3, #4 closed the Codex-#558 hardening pass on the upstream side. The kg-microbe-side workarounds added to compensate for upstream invariant violations are now redundant. Removed: - purge_asymmetric_pollution() (95 LOC) and its main() call site. Swept stray child-label synonyms and MIM xrefs that earlier consolidator runs leaked from narrowMatch-MIM subjects onto OBO parents. Now redundant: MIM PR #4 Rule B4 enforces canonical object_label, MIM PR #2 Rule A refuses zero-token-overlap auto-classifier rows; the label drift that fed the pollution is gone at the source. - Test docstring reference to purge_asymmetric_pollution updated to point at the upstream MIM PRs that now enforce the invariant. KNOWN_BAD_NARROWMATCH was already removed in commit ed421d0 ("Sync MIM SSSOM and remove redundant KNOWN_BAD_NARROWMATCH filter") — its 5 entries no longer appear in the published MIM SSSOM (PR #2 removed them upstream), and the claw build_mim_ingredient_sssom.py defensive token-overlap gate (e00d4f5) plus categorize_residual_p25.py producer-side guards (a3e36db) provide defense-in-depth. Verification: - Consolidator runs end-to-end clean. Output: 595,755 lines (595,681 mappings) vs baseline 595,436. The "Purged N stray child-label synonym(s)..." log line is gone. - tests/test_chemical_mapping_utils.py + test_chemical_stereochemistry.py: 76/76 passing. test_vermont_soil_resolves_to_child still resolves to the kgmicrobe child correctly without the purge step. Net diff: 105 lines deleted, 2 lines updated.
Contributor
There was a problem hiding this comment.
Pull request overview
Removes kg-microbe-side defensive cleanup logic that was previously needed to compensate for upstream MediaIngredientMech (MIM) SSSOM invariant violations, now resolved upstream (Codex-#558 follow-ups).
Changes:
- Deleted
purge_asymmetric_pollution()and itsmain()call site fromscripts/consolidate_chemical_mappings.py. - Updated the regression-test assertion message in
tests/test_chemical_mapping_utils.pyto reference upstream MIM PRs instead of the removed purge step.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| scripts/consolidate_chemical_mappings.py | Removes the now-redundant asymmetric pollution purge step from the consolidator workflow. |
| tests/test_chemical_mapping_utils.py | Updates regression-test messaging to point at upstream hardening rather than a local cleanup routine. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Collaborator
Author
|
Superseded — applied directly on mapping-cleanup-canonical-hub as commit 937a285 per team-lead course correction. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
MIM PRs #2, #3, #4 closed the Codex-#558 hardening pass on the upstream side. The kg-microbe-side workarounds added to compensate for upstream invariant violations are now redundant.
Removed:
purge_asymmetric_pollution()(95 LOC) + itsmain()call site inscripts/consolidate_chemical_mappings.py. This swept stray child-label synonyms and MIM xrefs that earlier consolidator runs leaked from narrowMatch-MIM subjects onto OBO parents.purge_asymmetric_pollutionupdated to point at the upstream MIM PRs that now enforce the invariant (tests/test_chemical_mapping_utils.py).Already removed (this PR confirms it stays gone):
KNOWN_BAD_NARROWMATCHconstant + filter call site — removed in commited421d0e("Sync MIM SSSOM and remove redundant KNOWN_BAD_NARROWMATCH filter") on this branch. Its 5 entries no longer appear in the published MIM SSSOM (PR ingest trait table data from #2 removed them upstream).Why these are now redundant
purge_asymmetric_pollutionobject_label) refuses rows where label doesn't match source ontology's canonical name. MIM PR #2 Rule A refuses auto-classifier rows with zero token overlap. Label drift that fed the pollution is gone at the source.KNOWN_BAD_NARROWMATCHbuild_mim_ingredient_sssom.pytoken-overlap gate (commite00d4f5) refuses CONSIDER_SPECIFIC overrides where MIM and kg-microbe labels share zero tokens;categorize_residual_p25.pyproducer-side guards (commita3e36db) prevent the residual JSON from emitting bad CONSIDER_SPECIFIC entries.Behaviorally,
KNOWN_BAD_NARROWMATCHlookup is now unconditionally false (input data is hardened upstream), andpurge_asymmetric_pollution()was a no-op (no parent records to clean) under the new invariants — so the deletions are safe.Test plan
pytest tests/test_chemical_mapping_utils.py tests/test_chemical_stereochemistry.py -x: 76/76 passing.test_vermont_soil_resolves_to_childstill resolves correctly to thekgmicrobe.ingredient:vermont_soilchild without the purge step.parent_relationsand_PARENT_INDEXruntime indexes preserved (still used byfind_chebi_by_name/get_parentslookups and byexport_unified_sssom).Diff stats
Cross-repo references
MediaIngredientMech.e00d4f5(defensive token-overlap gate inbuild_mim_ingredient_sssom.py),a3e36db(producer-side guards incategorize_residual_p25.py).ed421d0e(removedKNOWN_BAD_NARROWMATCH).