Skip to content

Convert clean_metals_inplace.py to validated yaml roundtrip#86

Merged
realmarcin merged 2 commits into
mainfrom
convert/clean-metals-inplace
May 26, 2026
Merged

Convert clean_metals_inplace.py to validated yaml roundtrip#86
realmarcin merged 2 commits into
mainfrom
convert/clean-metals-inplace

Conversation

@realmarcin
Copy link
Copy Markdown
Contributor

Context

The schema-gap writer audit (PR #85) flagged scripts/clean_metals_inplace.py as the highest-risk writer in the repo: it mutated 149 community YAMLs in place via regex / line-edit surgery on the raw file text, then wrote the result back with path.write_text(string). That bypassed:

What this PR does

Converts the script to the same pattern every other validated-writer script in the repo now uses:

  1. yaml.safe_load → mutate the parsed dict → write_validated_community.
  2. Append a CLEAN_METAL_FIELDS CurationEvent via record_curation_event when something actually changes.
  3. Backup-restore on validation failure (rename to .yaml.bak_metals before write; restore if validation raises; unlink on success). This matches the pattern introduced in apply_pmc_conversions.py and link_growth_media.py in PR Close audit-writers blind-spot for text-templated writers + lock in growth_media id patterns #85.
  4. --dry-run default + --apply opt-in flag, matching the repo convention. Dry-run does the in-memory cleaning + logs the planned mutations but does not record a CurationEvent or write.

The actual metal-field-cleaning logic moves into a helper clean_metal_fields_in_doc(doc) -> (removed, kept) so the I/O and the cleaning are testable independently.

Cleaning semantics — preserved

The same AMBIGUOUS_METAL_KEYWORDS table is consulted (TITANIUM / GOLD / PALLADIUM), and the same word-bounded evidence search runs over the document minus its metals_present block. That exclusion is now done by yaml.safe_dump'ing a shallow copy of the doc with metals_present stripped (so an ambiguous entry can never self-satisfy by matching its own listing — same invariant the original text-based implementation enforced via lines[:start] + lines[end:]).

keyword_in_text anchors on non-alphanumeric boundaries, so YAML structural characters (:, -, newlines, quotes) do not change match outcomes versus the original raw-text view of the file.

The only externally visible difference is that the file is re-emitted as canonical YAML on the changed files — the same cost paid by every other validated-writer script in the repo. Unchanged files are not touched.

Out-of-scope (unchanged from the original docstring)

  • metal_notes and metal_relevance are never touched.
  • Adding new metal entries is also out of scope.
  • rare_earth_elements_present is left alone — no REE keyword is short enough to false-match the substring pattern that affects metals.

Audit row

Before:

scripts/clean_metals_inplace.py  yes  no   yes  no   no

After:

scripts/clean_metals_inplace.py  yes  yes  yes  yes  no

(columns: writes_yaml / appends_curation_history / has_write_safeguard / validates_before_write / wired_into_just)

wired_into_just stays no: this script is a one-shot cleanup of a known buggy-extractor run, not a recurring pipeline step. The orchestrator can wire it in if it becomes recurring.

Test plan

  • Ruff clean: uv run ruff check scripts/clean_metals_inplace.py → All checks passed
  • Baseline strict validation: 0 errors across 265 community YAMLs
  • Baseline pytest: 136 passed / 9 skipped / 0 failed
  • Audit row flips from yes no yes no no to yes yes yes yes no
  • Smoke test on /tmp copy of the corpus:
    • Baseline (already cleaned by prior run) — reports 0/265 files would change, as expected
    • With injected false-positive (TITANIUM in doc with no titanium evidence) — correctly removed in dry-run preview and in --apply, schema-valid curation event appended, backup unlinked on success
  • Not run against the live kb/communities/ corpus — that's an operator step (the corpus is already clean from the prior buggy-extractor cleanup).

🤖 Generated with Claude Code

…idated_community

Previously this script mutated 149 community YAMLs in place via regex /
line-edit surgery on the raw text, then wrote the result with
path.write_text(string). The recent schema-gap audit (PR #85) flagged
it as the highest-risk writer in the repo: no validation gate, no
curation_history entry on changed files, and the audit detector itself
was blind to it until the previous PR broadened the heuristic to catch
read_text/write_text round-trips against kb/communities/.

Now: load through yaml.safe_load, run the same cleaning logic on the
parsed dict, append a CLEAN_METAL_FIELDS CurationEvent, and write back
via write_validated_community (with backup-restore on validation
failure, matching the apply_pmc_conversions / link_growth_media pattern
from PR #85). The --dry-run default + --apply opt-in flag matches the
existing repo convention.

Cleaning semantics preserved verbatim: the same AMBIGUOUS_METAL_KEYWORDS
table is consulted (TITANIUM / GOLD / PALLADIUM), and the same
word-bounded evidence search runs over the document minus its
metals_present block — implemented now by yaml.safe_dump'ing a shallow
copy of the doc with metals_present stripped, which keyword_in_text
scans identically (its regex anchors on non-alphanumeric boundaries, so
YAML structural characters do not change match outcomes).

Smoke-tested against a /tmp copy of the corpus:
- Baseline corpus (already cleaned by a prior buggy-extractor run):
  reports 0/265 files would change, as expected.
- With an injected false-positive (TITANIUM in a doc with no titanium
  evidence): correctly removed in dry-run preview and in --apply, with
  a schema-valid curation_history event appended and backup unlinked
  on success.

Audit row before:  yes  no   yes  no   no
Audit row after:   yes  yes  yes  yes  no
(writes_yaml / appends_curation_history / has_write_safeguard /
validates_before_write / wired_into_just — wired_into_just stays no;
this script is a one-shot cleanup, not a recurring pipeline step.)

Baselines remain green: scripts/validate_strict.py reports 0 errors
across 265 files; pytest tests/ 136 passed / 9 skipped; ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 26, 2026 02:09
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors scripts/clean_metals_inplace.py away from regex/text-based in-place YAML mutation to the repo’s validated-writer pattern (load YAML → mutate dict → write via write_validated_community), and adds curation-history recording when changes are applied.

Changes:

  • Replaces raw text line-editing with yaml.safe_load + in-memory cleaning via clean_metal_fields_in_doc.
  • Adds record_curation_event(..., action="CLEAN_METAL_FIELDS") and writes through write_validated_community.
  • Introduces --apply opt-in with dry-run default, plus a backup/restore flow around the write.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread scripts/clean_metals_inplace.py
Comment thread scripts/clean_metals_inplace.py Outdated
Two findings:

- The backup/restore flow only caught ValidationFailedError. If any
  other exception fired after path.rename(backup) (disk error,
  permission denied, unexpected serializer fault), the original would
  be left as .yaml.bak_metals with no .yaml. Catch BaseException too:
  restore the backup before re-raising so the corpus is never left
  with a missing canonical file.

- _evidence_text serializes via yaml.safe_dump, which drops comments
  and can reflow scalars — strictly not a perfect raw-text equivalent.
  Document the caveat: the keywords we search for live in YAML
  content (description, evidence snippet, ecological_state), never in
  comments, so the on-disk behavior is identical for the corpus today.
  If a future workflow starts encoding evidence in YAML comments, the
  function should switch to path.read_text() with a line-strip filter.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@realmarcin realmarcin merged commit e2894f1 into main May 26, 2026
@realmarcin realmarcin deleted the convert/clean-metals-inplace branch May 26, 2026 02:13
realmarcin added a commit that referenced this pull request May 26, 2026
PR #86 (Convert clean_metals_inplace.py) just merged into main. After
rebasing, scripts/clean_metals_inplace.py is now gated, so the
appends_curation_history / validates_before_write columns for it flip
to yes. Re-running scripts/audit_writers.py produces a 1-row delta;
commit it so the report reflects the actual post-rebase state.

Combined post-merge: 9/16 appends_curation_history, 10/16
validates_before_write (was 5/16 and 6/16 respectively at the start
of this PR series).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
realmarcin added a commit that referenced this pull request May 26, 2026
…ance_strain_data + add_evidence_source) (#87)

* Convert 3 more CommunityMech writers to use shared validate-and-record helpers

Brings CommunityMech writer coverage from 5/16 to 8/16. Continues the
pattern established in PR #84 and refined by PR #85 (restore-on-failure
backup handling): every script that mutates a community YAML loads
through yaml.safe_load, mutates the dict, records a CurationEvent via
record_curation_event(), and writes back via write_validated_community()
which gates on closed-schema LinkML validation.

Converted scripts:
- scripts/intelligent_snippet_fixer.py (LLM-driven snippet repair;
  llm_assisted=True; action=FIX_SNIPPETS_LLM). Uses skip_if_recent=True
  on the curation event so a session of auto-approved fixes collapses
  into a single trail entry instead of one per snippet. The existing
  .yaml.bak_intelligent backup created at session start by
  shutil.copy2 remains the user-visible safety net.
- scripts/enhance_strain_data.py (strain-ID enrichment;
  action=ENHANCE_STRAIN_DATA). Previously the script extracted strain
  data but only emitted a copy-paste snippets file; this PR adds an
  --apply mode that writes strain_designation entries back into
  matching taxonomy[*] entries via write_validated_community(). Default
  behavior preserves the historical extract-only flow (no kb/communities
  writes without --apply). --overwrite controls whether to replace
  existing curator-authored strain_designation values.
- scripts/add_evidence_source.py (evidence_source enum backfill;
  action=BACKFILL_EVIDENCE_SOURCE). Uses the backup-then-rename pattern
  from PR #85 — the original is moved to .yaml.bak_source before the
  validated write; on ValidationFailedError the backup is renamed back
  in place so the batch loop can continue without leaving a half-written
  community on disk.

Each per-record loop continues on ValidationFailedError so one bad file
can't kill the batch. CLI surfaces (--auto, --interactive, --dry-run,
--auto-approve, --file, --apply, etc.) preserved.

After-state: scripts/audit_writers.py reports 8/16 writers gated
(was 5/16). The remaining un-converted writers (apply_strain_designations,
apply_taxonomy_corrections, backfill_metals, clean_metals_inplace,
fix_reference_formats, plus a handful of smaller src/ writers) follow
the same conversion pattern; left as future work to keep this PR focused.

Note: scripts/add_evidence_source.py and scripts/intelligent_snippet_fixer.py
import communitymech.literature_enhanced (a pre-existing module that does
not currently exist in the repo) at module top-level. This PR does not
introduce or fix that — the scripts have always failed at the import step
when invoked from CLI. Out of scope for this conversion; tracked separately.

Baseline (unchanged):
- validate_strict: 0 ERROR rows / 265 files
- pytest tests/: 136 passed, 9 skipped

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Refresh audit TSV after rebase onto #86

PR #86 (Convert clean_metals_inplace.py) just merged into main. After
rebasing, scripts/clean_metals_inplace.py is now gated, so the
appends_curation_history / validates_before_write columns for it flip
to yes. Re-running scripts/audit_writers.py produces a 1-row delta;
commit it so the report reflects the actual post-rebase state.

Combined post-merge: 9/16 appends_curation_history, 10/16
validates_before_write (was 5/16 and 6/16 respectively at the start
of this PR series).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants