Convert clean_metals_inplace.py to validated yaml roundtrip#86
Merged
Conversation
…idated_community Previously this script mutated 149 community YAMLs in place via regex / line-edit surgery on the raw text, then wrote the result with path.write_text(string). The recent schema-gap audit (PR #85) flagged it as the highest-risk writer in the repo: no validation gate, no curation_history entry on changed files, and the audit detector itself was blind to it until the previous PR broadened the heuristic to catch read_text/write_text round-trips against kb/communities/. Now: load through yaml.safe_load, run the same cleaning logic on the parsed dict, append a CLEAN_METAL_FIELDS CurationEvent, and write back via write_validated_community (with backup-restore on validation failure, matching the apply_pmc_conversions / link_growth_media pattern from PR #85). The --dry-run default + --apply opt-in flag matches the existing repo convention. Cleaning semantics preserved verbatim: the same AMBIGUOUS_METAL_KEYWORDS table is consulted (TITANIUM / GOLD / PALLADIUM), and the same word-bounded evidence search runs over the document minus its metals_present block — implemented now by yaml.safe_dump'ing a shallow copy of the doc with metals_present stripped, which keyword_in_text scans identically (its regex anchors on non-alphanumeric boundaries, so YAML structural characters do not change match outcomes). Smoke-tested against a /tmp copy of the corpus: - Baseline corpus (already cleaned by a prior buggy-extractor run): reports 0/265 files would change, as expected. - With an injected false-positive (TITANIUM in a doc with no titanium evidence): correctly removed in dry-run preview and in --apply, with a schema-valid curation_history event appended and backup unlinked on success. Audit row before: yes no yes no no Audit row after: yes yes yes yes no (writes_yaml / appends_curation_history / has_write_safeguard / validates_before_write / wired_into_just — wired_into_just stays no; this script is a one-shot cleanup, not a recurring pipeline step.) Baselines remain green: scripts/validate_strict.py reports 0 errors across 265 files; pytest tests/ 136 passed / 9 skipped; ruff clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR refactors scripts/clean_metals_inplace.py away from regex/text-based in-place YAML mutation to the repo’s validated-writer pattern (load YAML → mutate dict → write via write_validated_community), and adds curation-history recording when changes are applied.
Changes:
- Replaces raw text line-editing with
yaml.safe_load+ in-memory cleaning viaclean_metal_fields_in_doc. - Adds
record_curation_event(..., action="CLEAN_METAL_FIELDS")and writes throughwrite_validated_community. - Introduces
--applyopt-in with dry-run default, plus a backup/restore flow around the write.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Two findings: - The backup/restore flow only caught ValidationFailedError. If any other exception fired after path.rename(backup) (disk error, permission denied, unexpected serializer fault), the original would be left as .yaml.bak_metals with no .yaml. Catch BaseException too: restore the backup before re-raising so the corpus is never left with a missing canonical file. - _evidence_text serializes via yaml.safe_dump, which drops comments and can reflow scalars — strictly not a perfect raw-text equivalent. Document the caveat: the keywords we search for live in YAML content (description, evidence snippet, ecological_state), never in comments, so the on-disk behavior is identical for the corpus today. If a future workflow starts encoding evidence in YAML comments, the function should switch to path.read_text() with a line-strip filter. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
realmarcin
added a commit
that referenced
this pull request
May 26, 2026
PR #86 (Convert clean_metals_inplace.py) just merged into main. After rebasing, scripts/clean_metals_inplace.py is now gated, so the appends_curation_history / validates_before_write columns for it flip to yes. Re-running scripts/audit_writers.py produces a 1-row delta; commit it so the report reflects the actual post-rebase state. Combined post-merge: 9/16 appends_curation_history, 10/16 validates_before_write (was 5/16 and 6/16 respectively at the start of this PR series). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
realmarcin
added a commit
that referenced
this pull request
May 26, 2026
…ance_strain_data + add_evidence_source) (#87) * Convert 3 more CommunityMech writers to use shared validate-and-record helpers Brings CommunityMech writer coverage from 5/16 to 8/16. Continues the pattern established in PR #84 and refined by PR #85 (restore-on-failure backup handling): every script that mutates a community YAML loads through yaml.safe_load, mutates the dict, records a CurationEvent via record_curation_event(), and writes back via write_validated_community() which gates on closed-schema LinkML validation. Converted scripts: - scripts/intelligent_snippet_fixer.py (LLM-driven snippet repair; llm_assisted=True; action=FIX_SNIPPETS_LLM). Uses skip_if_recent=True on the curation event so a session of auto-approved fixes collapses into a single trail entry instead of one per snippet. The existing .yaml.bak_intelligent backup created at session start by shutil.copy2 remains the user-visible safety net. - scripts/enhance_strain_data.py (strain-ID enrichment; action=ENHANCE_STRAIN_DATA). Previously the script extracted strain data but only emitted a copy-paste snippets file; this PR adds an --apply mode that writes strain_designation entries back into matching taxonomy[*] entries via write_validated_community(). Default behavior preserves the historical extract-only flow (no kb/communities writes without --apply). --overwrite controls whether to replace existing curator-authored strain_designation values. - scripts/add_evidence_source.py (evidence_source enum backfill; action=BACKFILL_EVIDENCE_SOURCE). Uses the backup-then-rename pattern from PR #85 — the original is moved to .yaml.bak_source before the validated write; on ValidationFailedError the backup is renamed back in place so the batch loop can continue without leaving a half-written community on disk. Each per-record loop continues on ValidationFailedError so one bad file can't kill the batch. CLI surfaces (--auto, --interactive, --dry-run, --auto-approve, --file, --apply, etc.) preserved. After-state: scripts/audit_writers.py reports 8/16 writers gated (was 5/16). The remaining un-converted writers (apply_strain_designations, apply_taxonomy_corrections, backfill_metals, clean_metals_inplace, fix_reference_formats, plus a handful of smaller src/ writers) follow the same conversion pattern; left as future work to keep this PR focused. Note: scripts/add_evidence_source.py and scripts/intelligent_snippet_fixer.py import communitymech.literature_enhanced (a pre-existing module that does not currently exist in the repo) at module top-level. This PR does not introduce or fix that — the scripts have always failed at the import step when invoked from CLI. Out of scope for this conversion; tracked separately. Baseline (unchanged): - validate_strict: 0 ERROR rows / 265 files - pytest tests/: 136 passed, 9 skipped Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Refresh audit TSV after rebase onto #86 PR #86 (Convert clean_metals_inplace.py) just merged into main. After rebasing, scripts/clean_metals_inplace.py is now gated, so the appends_curation_history / validates_before_write columns for it flip to yes. Re-running scripts/audit_writers.py produces a 1-row delta; commit it so the report reflects the actual post-rebase state. Combined post-merge: 9/16 appends_curation_history, 10/16 validates_before_write (was 5/16 and 6/16 respectively at the start of this PR series). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Context
The schema-gap writer audit (PR #85) flagged
scripts/clean_metals_inplace.pyas the highest-risk writer in the repo: it mutated 149 community YAMLs in place via regex / line-edit surgery on the raw file text, then wrote the result back withpath.write_text(string). That bypassed:write_validated_community),record_curation_event),read_text/write_textround-trips againstkb/communities/).What this PR does
Converts the script to the same pattern every other validated-writer script in the repo now uses:
yaml.safe_load→ mutate the parsed dict →write_validated_community.CLEAN_METAL_FIELDSCurationEventviarecord_curation_eventwhen something actually changes..yaml.bak_metalsbefore write; restore if validation raises; unlink on success). This matches the pattern introduced inapply_pmc_conversions.pyandlink_growth_media.pyin PR Close audit-writers blind-spot for text-templated writers + lock in growth_media id patterns #85.--dry-rundefault +--applyopt-in flag, matching the repo convention. Dry-run does the in-memory cleaning + logs the planned mutations but does not record a CurationEvent or write.The actual metal-field-cleaning logic moves into a helper
clean_metal_fields_in_doc(doc) -> (removed, kept)so the I/O and the cleaning are testable independently.Cleaning semantics — preserved
The same
AMBIGUOUS_METAL_KEYWORDStable is consulted (TITANIUM/GOLD/PALLADIUM), and the same word-bounded evidence search runs over the document minus itsmetals_presentblock. That exclusion is now done byyaml.safe_dump'ing a shallow copy of the doc withmetals_presentstripped (so an ambiguous entry can never self-satisfy by matching its own listing — same invariant the original text-based implementation enforced vialines[:start] + lines[end:]).keyword_in_textanchors on non-alphanumeric boundaries, so YAML structural characters (:,-, newlines, quotes) do not change match outcomes versus the original raw-text view of the file.The only externally visible difference is that the file is re-emitted as canonical YAML on the changed files — the same cost paid by every other validated-writer script in the repo. Unchanged files are not touched.
Out-of-scope (unchanged from the original docstring)
metal_notesandmetal_relevanceare never touched.rare_earth_elements_presentis left alone — no REE keyword is short enough to false-match the substring pattern that affects metals.Audit row
Before:
After:
(columns: writes_yaml / appends_curation_history / has_write_safeguard / validates_before_write / wired_into_just)
wired_into_juststaysno: this script is a one-shot cleanup of a known buggy-extractor run, not a recurring pipeline step. The orchestrator can wire it in if it becomes recurring.Test plan
uv run ruff check scripts/clean_metals_inplace.py→ All checks passedyes no yes no notoyes yes yes yes no--apply, schema-valid curation event appended, backup unlinked on successkb/communities/corpus — that's an operator step (the corpus is already clean from the prior buggy-extractor cleanup).🤖 Generated with Claude Code