Convert clean_metals_inplace.py to load/dump roundtrip with write_validated_community
Previously this script mutated 149 community YAMLs in place via regex /
line-edit surgery on the raw text, then wrote the result with
path.write_text(string). The recent schema-gap audit (PR #85) flagged
it as the highest-risk writer in the repo: no validation gate, no
curation_history entry on changed files, and the audit detector itself
was blind to it until the previous PR broadened the heuristic to catch
read_text/write_text round-trips against kb/communities/.
Now: load through yaml.safe_load, run the same cleaning logic on the
parsed dict, append a CLEAN_METAL_FIELDS CurationEvent, and write back
via write_validated_community (with backup-restore on validation
failure, matching the apply_pmc_conversions / link_growth_media pattern
from PR #85). The --dry-run default + --apply opt-in flag matches the
existing repo convention.
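
The backup-restore behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the repo's actual write_validated_community: the function name write_with_backup, the ".bak" suffix, and the validate callable (assumed to return a list of error strings) are all hypothetical.

```python
from __future__ import annotations

import shutil
from pathlib import Path
from typing import Callable


def write_with_backup(
    path: Path,
    new_text: str,
    validate: Callable[[Path], list[str]],  # hypothetical validator: returns error messages
) -> bool:
    """Write new_text to path; restore the original file if validation fails."""
    backup = path.with_name(path.name + ".bak")  # assumed backup naming
    shutil.copy2(path, backup)                   # keep a copy of the original
    path.write_text(new_text, encoding="utf-8")

    errors = validate(path)
    if errors:
        backup.replace(path)  # roll back: the backup overwrites the bad write
        return False

    backup.unlink()           # success: the backup is no longer needed
    return True
```

The validation gate runs on the file as written, so a failure leaves the community file byte-identical to its pre-run state.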
Cleaning semantics preserved verbatim: the same AMBIGUOUS_METAL_KEYWORDS
table is consulted (TITANIUM / GOLD / PALLADIUM), and the same
word-bounded evidence search runs over the document minus its
metals_present block — implemented now by yaml.safe_dump'ing a shallow
copy of the doc with metals_present stripped, which keyword_in_text
scans identically (its regex anchors on non-alphanumeric boundaries, so
YAML structural characters do not change match outcomes).
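
As an illustration of the evidence search, here is a hedged sketch of the cleaning step on a parsed dict. The keyword lists, the assumption that metals_present is a flat list of strings, and the body of keyword_in_text are guesses for illustration only; json.dumps stands in for yaml.safe_dump so the sketch runs on the standard library alone.

```python
from __future__ import annotations

import json
import re

# Hypothetical stand-in for the script's AMBIGUOUS_METAL_KEYWORDS table.
AMBIGUOUS_METAL_KEYWORDS = {
    "TITANIUM": ["titanium"],
    "GOLD": ["gold"],
    "PALLADIUM": ["palladium"],
}


def keyword_in_text(keyword: str, text: str) -> bool:
    """Case-insensitive match bounded by non-alphanumeric characters."""
    pattern = r"(?<![A-Za-z0-9])" + re.escape(keyword) + r"(?![A-Za-z0-9])"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def clean_metals(doc: dict) -> list[str]:
    """Drop ambiguous metals with no supporting evidence; return what was removed."""
    # Serialize a shallow copy of the doc without metals_present, so the
    # block under review cannot count as its own evidence.
    rest = {k: v for k, v in doc.items() if k != "metals_present"}
    haystack = json.dumps(rest)  # stand-in for yaml.safe_dump

    kept, removed = [], []
    for metal in doc.get("metals_present", []):
        keywords = AMBIGUOUS_METAL_KEYWORDS.get(metal)
        if keywords is not None and not any(
            keyword_in_text(k, haystack) for k in keywords
        ):
            removed.append(metal)  # ambiguous metal with no evidence
        else:
            kept.append(metal)     # unambiguous, or evidence was found
    if "metals_present" in doc:
        doc["metals_present"] = kept
    return removed
```

Because the pattern anchors on non-alphanumeric boundaries, quotes, colons, and hyphens added by serialization act as boundaries just as whitespace does, while a keyword embedded in a longer word does not match.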
Smoke-tested against a /tmp copy of the corpus:
- Baseline corpus (already cleaned by a prior buggy-extractor run):
reports 0/265 files would change, as expected.
- With an injected false-positive (TITANIUM in a doc with no titanium
evidence): correctly removed in dry-run preview and in --apply, with
a schema-valid curation_history event appended and backup unlinked
on success.
Audit row, before -> after:
  writes_yaml                yes -> yes
  appends_curation_history   no  -> yes
  has_write_safeguard        yes -> yes
  validates_before_write     no  -> yes
  wired_into_just            no  -> no
wired_into_just stays no: this script is a one-shot cleanup, not a
recurring pipeline step.
Baselines remain green: scripts/validate_strict.py reports 0 errors
across 265 files; pytest tests/ 136 passed / 9 skipped; ruff clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>