Drop purge_asymmetric_pollution + KNOWN_BAD_NARROWMATCH (Codex-#558 closed upstream) #560

Closed
realmarcin wants to merge 1 commit into mapping-cleanup-canonical-hub from drop-purge-asymmetric-pollution-codex-558-closed

@realmarcin

Collaborator

Summary

MediaIngredientMech (MIM) PRs #2, #3, and #4 closed the Codex-#558 hardening pass on the upstream side. The kg-microbe-side workarounds that were added to compensate for upstream invariant violations are now redundant.

Removed:

  • purge_asymmetric_pollution() (95 LOC) + its main() call site in scripts/consolidate_chemical_mappings.py. This swept stray child-label synonyms and MIM xrefs that earlier consolidator runs leaked from narrowMatch-MIM subjects onto OBO parents.
  • Test docstring reference to purge_asymmetric_pollution updated to point at the upstream MIM PRs that now enforce the invariant (tests/test_chemical_mapping_utils.py).
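
For readers who never saw the removed function, the kind of sweep described above can be sketched as follows. This is a minimal illustration, not the deleted code: the `Record` shape, the `purge_stray_child_data` name, the `MIM:` xref prefix, and the `EX:` identifiers are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set, Tuple


@dataclass
class Record:
    """Hypothetical in-memory shape for one consolidated mapping record."""
    label: str
    synonyms: Set[str] = field(default_factory=set)
    xrefs: Set[str] = field(default_factory=set)


def purge_stray_child_data(
    records: Dict[str, Record],
    narrow_pairs: Iterable[Tuple[str, str]],
    mim_prefix: str = "MIM:",  # assumed xref prefix, for illustration only
) -> Tuple[int, int]:
    """Remove child labels and child MIM xrefs that leaked onto parents.

    narrow_pairs holds (child_id, parent_id) tuples, meaning the child is a
    narrowMatch of the parent. Returns (synonyms_removed, xrefs_removed).
    """
    synonyms_removed = 0
    xrefs_removed = 0
    for child_id, parent_id in narrow_pairs:
        child = records.get(child_id)
        parent = records.get(parent_id)
        if child is None or parent is None:
            continue  # nothing to clean if either side is missing

        # A parent synonym equal to the child's label is treated as stray.
        child_label = child.label.casefold()
        stray_synonyms = {s for s in parent.synonyms if s.casefold() == child_label}
        parent.synonyms -= stray_synonyms
        synonyms_removed += len(stray_synonyms)

        # A MIM xref present on both child and parent is treated as stray.
        stray_xrefs = {x for x in parent.xrefs & child.xrefs if x.startswith(mim_prefix)}
        parent.xrefs -= stray_xrefs
        xrefs_removed += len(stray_xrefs)
    return synonyms_removed, xrefs_removed


records = {
    "kgmicrobe.ingredient:vermont_soil": Record("Vermont soil", xrefs={"MIM:0001"}),
    "EX:soil": Record("soil", synonyms={"Vermont soil", "dirt"}, xrefs={"MIM:0001", "EX:other"}),
}
counts = purge_stray_child_data(records, [("kgmicrobe.ingredient:vermont_soil", "EX:soil")])
```

On input that already satisfies the upstream invariants, a sweep of this shape returns `(0, 0)` and changes nothing, which is the sense in which the step became redundant.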

Already removed (this PR confirms it stays gone):

  • KNOWN_BAD_NARROWMATCH constant + filter call site: removed in commit ed421d0e ("Sync MIM SSSOM and remove redundant KNOWN_BAD_NARROWMATCH filter") on this branch. Its 5 entries no longer appear in the published MIM SSSOM (MIM PR #2 removed them upstream).
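
One way to confirm that previously filtered pairs are absent from an SSSOM TSV is sketched below. It assumes the standard SSSOM TSV layout (a `#`-commented metadata header followed by tab-separated rows with `subject_id`, `predicate_id`, and `object_id` columns). The function name and the pairs passed to it are hypothetical; the actual 5 entries are not listed in this PR.

```python
import csv
from typing import Iterable, List, Set, Tuple


def find_listed_rows(
    sssom_lines: Iterable[str],
    listed_pairs: Set[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Return (subject_id, object_id) narrowMatch rows that appear in listed_pairs."""
    # Skip blank lines and the commented metadata block at the top of the file.
    data_lines = (ln for ln in sssom_lines if ln.strip() and not ln.startswith("#"))
    reader = csv.DictReader(data_lines, delimiter="\t")
    return [
        (row["subject_id"], row["object_id"])
        for row in reader
        if row["predicate_id"] == "skos:narrowMatch"
        and (row["subject_id"], row["object_id"]) in listed_pairs
    ]
```

An empty result for the old pairs is consistent with the filter no longer having anything to match.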

Why these are now redundant

| Workaround | Upstream resolution |
| --- | --- |
| purge_asymmetric_pollution | MIM PR #4 Rule B4 (canonical object_label) refuses rows where label doesn't match source ontology's canonical name. MIM PR #2 Rule A refuses auto-classifier rows with zero token overlap. Label drift that fed the pollution is gone at the source. |
| KNOWN_BAD_NARROWMATCH | MIM PR #2 removed all 5 rows from the published SSSOM. Defense-in-depth: claw build_mim_ingredient_sssom.py token-overlap gate (commit e00d4f5) refuses CONSIDER_SPECIFIC overrides where MIM and kg-microbe labels share zero tokens; categorize_residual_p25.py producer-side guards (commit a3e36db) prevent the residual JSON from emitting bad CONSIDER_SPECIFIC entries. |
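
The two kinds of gate named in the table, a zero-token-overlap check and a canonical-label check, can be illustrated with a short sketch. The tokenization rule (case-folded ASCII alphanumeric runs) and the function names are assumptions for the example; the upstream rules may tokenize and compare differently.

```python
import re
from typing import Set


def tokens(label: str) -> Set[str]:
    """Split a label into case-folded ASCII alphanumeric tokens (assumed rule)."""
    return set(re.findall(r"[a-z0-9]+", label.casefold()))


def shares_token(label_a: str, label_b: str) -> bool:
    """True when the two labels have at least one token in common."""
    return bool(tokens(label_a) & tokens(label_b))


def label_is_canonical(row_label: str, canonical_label: str) -> bool:
    """True when a row's label matches the ontology's canonical name,
    ignoring case and surrounding whitespace (assumed comparison)."""
    return row_label.strip().casefold() == canonical_label.strip().casefold()
```

Under these assumptions, a row pairing "Vermont soil" with "soil extract" passes the overlap gate because both contain the token "soil", while a row whose two labels share no token at all would be refused.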

Behaviorally, the KNOWN_BAD_NARROWMATCH lookup would now always return false because the input data is hardened upstream, and purge_asymmetric_pollution() had become a no-op under the new invariants because no parent records were left to clean, so both deletions are safe.

Test plan

  • Consolidator runs end-to-end cleanly. Output: 595,755 lines (595,681 mappings) vs baseline 595,436 — net delta near 0 as expected. The "Purged N stray child-label synonym(s) and M stray MIM xref(s)..." log line is gone (no longer called).
  • pytest tests/test_chemical_mapping_utils.py tests/test_chemical_stereochemistry.py -x: 76/76 passing. test_vermont_soil_resolves_to_child still resolves correctly to the kgmicrobe.ingredient:vermont_soil child without the purge step.
  • parent_relations and _PARENT_INDEX runtime indexes preserved (still used by find_chebi_by_name / get_parents lookups and by export_unified_sssom).
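
The size of the output delta reported in the test plan can be worked out directly. The description does not say whether the baseline of 595,436 counts lines or mappings, so this sketch computes the difference under both readings; the `delta` helper is a name introduced here for illustration.

```python
from typing import Tuple


def delta(new: int, baseline: int) -> Tuple[int, float]:
    """Return (absolute difference, difference as a percentage of baseline)."""
    diff = new - baseline
    return diff, 100.0 * diff / baseline


# Reading 1: the baseline is a line count.
lines_delta = delta(595_755, 595_436)
# Reading 2: the baseline is a mapping count.
mappings_delta = delta(595_681, 595_436)
```

The absolute differences are 319 and 245 respectively, and both are well under a tenth of a percent of the baseline.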

Diff stats

 scripts/consolidate_chemical_mappings.py | 105 -------------------------------
 tests/test_chemical_mapping_utils.py     |   4 +-
 2 files changed, 2 insertions(+), 107 deletions(-)

Cross-repo references

…losed upstream)

MIM PRs #2, #3, #4 closed the Codex-#558 hardening pass on the upstream
side. The kg-microbe-side workarounds added to compensate for upstream
invariant violations are now redundant.

Removed:
  - purge_asymmetric_pollution() (95 LOC) and its main() call site.
    Swept stray child-label synonyms and MIM xrefs that earlier
    consolidator runs leaked from narrowMatch-MIM subjects onto OBO
    parents. Now redundant: MIM PR #4 Rule B4 enforces canonical
    object_label, MIM PR #2 Rule A refuses zero-token-overlap
    auto-classifier rows; the label drift that fed the pollution is
    gone at the source.
  - Test docstring reference to purge_asymmetric_pollution updated to
    point at the upstream MIM PRs that now enforce the invariant.

KNOWN_BAD_NARROWMATCH was already removed in commit ed421d0 ("Sync MIM
SSSOM and remove redundant KNOWN_BAD_NARROWMATCH filter") — its 5
entries no longer appear in the published MIM SSSOM (PR #2 removed
them upstream), and the claw build_mim_ingredient_sssom.py
defensive token-overlap gate (e00d4f5) plus categorize_residual_p25.py
producer-side guards (a3e36db) provide defense-in-depth.

Verification:
  - Consolidator runs end-to-end clean. Output: 595,755 lines
    (595,681 mappings) vs baseline 595,436. The "Purged N stray
    child-label synonym(s)..." log line is gone.
  - tests/test_chemical_mapping_utils.py + test_chemical_stereochemistry.py:
    76/76 passing. test_vermont_soil_resolves_to_child still resolves
    to the kgmicrobe child correctly without the purge step.

Net diff: 105 lines deleted, 2 lines updated.
Copilot AI review requested due to automatic review settings May 5, 2026 01:28

Copilot AI left a comment

Contributor

Pull request overview

Removes kg-microbe-side defensive cleanup logic that was previously needed to compensate for upstream MediaIngredientMech (MIM) SSSOM invariant violations, now resolved upstream (Codex-#558 follow-ups).

Changes:

  • Deleted purge_asymmetric_pollution() and its main() call site from scripts/consolidate_chemical_mappings.py.
  • Updated the regression-test assertion message in tests/test_chemical_mapping_utils.py to reference upstream MIM PRs instead of the removed purge step.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| scripts/consolidate_chemical_mappings.py | Removes the now-redundant asymmetric pollution purge step from the consolidator workflow. |
| tests/test_chemical_mapping_utils.py | Updates regression-test messaging to point at upstream hardening rather than a local cleanup routine. |

@realmarcin

Collaborator Author

Superseded — applied directly on mapping-cleanup-canonical-hub as commit 937a285 per team-lead course correction.

@realmarcin realmarcin closed this May 5, 2026
@realmarcin realmarcin deleted the drop-purge-asymmetric-pollution-codex-558-closed branch May 5, 2026 01:30
2 participants