Skip to content

Commit e2894f1

Browse files
realmarcinclaude
andauthored
Convert clean_metals_inplace.py to validated yaml roundtrip (#86)
* Convert clean_metals_inplace.py to load/dump roundtrip with write_validated_community Previously this script mutated 149 community YAMLs in place via regex / line-edit surgery on the raw text, then wrote the result with path.write_text(string). The recent schema-gap audit (PR #85) flagged it as the highest-risk writer in the repo: no validation gate, no curation_history entry on changed files, and the audit detector itself was blind to it until the previous PR broadened the heuristic to catch read_text/write_text round-trips against kb/communities/. Now: load through yaml.safe_load, run the same cleaning logic on the parsed dict, append a CLEAN_METAL_FIELDS CurationEvent, and write back via write_validated_community (with backup-restore on validation failure, matching the apply_pmc_conversions / link_growth_media pattern from PR #85). The --dry-run default + --apply opt-in flag matches the existing repo convention. Cleaning semantics preserved verbatim: the same AMBIGUOUS_METAL_KEYWORDS table is consulted (TITANIUM / GOLD / PALLADIUM), and the same word-bounded evidence search runs over the document minus its metals_present block — implemented now by yaml.safe_dump'ing a shallow copy of the doc with metals_present stripped, which keyword_in_text scans identically (its regex anchors on non-alphanumeric boundaries, so YAML structural characters do not change match outcomes). Smoke-tested against a /tmp copy of the corpus: - Baseline corpus (already cleaned by a prior buggy-extractor run): reports 0/265 files would change, as expected. - With an injected false-positive (TITANIUM in a doc with no titanium evidence): correctly removed in dry-run preview and in --apply, with a schema-valid curation_history event appended and backup unlinked on success. Audit row before: yes no yes no no Audit row after: yes yes yes yes no (writes_yaml / appends_curation_history / has_write_safeguard / validates_before_write / wired_into_just — wired_into_just stays no; this script is a one-shot cleanup, not a recurring pipeline step.) Baselines remain green: scripts/validate_strict.py reports 0 errors across 265 files; pytest tests/ 136 passed / 9 skipped; ruff clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Address Copilot review on PR #86 Two findings: - The backup/restore flow only caught ValidationFailedError. If any other exception fired after path.rename(backup) (disk error, permission denied, unexpected serializer fault), the original would be left as .yaml.bak_metals with no .yaml. Catch BaseException too: restore the backup before re-raising so the corpus is never left with a missing canonical file. - _evidence_text serializes via yaml.safe_dump, which drops comments and can reflow scalars — strictly not a perfect raw-text equivalent. Document the caveat: the keywords we search for live in YAML content (description, evidence snippet, ecological_state), never in comments, so the on-disk behavior is identical for the corpus today. If a future workflow starts encoding evidence in YAML comments, the function should switch to path.read_text() with a line-strip filter. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 9d863f8 commit e2894f1

1 file changed

Lines changed: 145 additions & 53 deletions

File tree

scripts/clean_metals_inplace.py

Lines changed: 145 additions & 53 deletions
Original file line numberDiff line numberDiff line change
@@ -18,10 +18,7 @@
1818
Out of scope deliberately:
1919
2020
- `metal_notes` and `metal_relevance` are never touched. Curator-authored
21-
values in those fields would otherwise be silently clobbered, and
22-
rewriting a multi-line YAML scalar from a key-line regex is unsafe
23-
(it leaves orphaned continuation lines that get re-folded into the
24-
new value).
21+
values in those fields would otherwise be silently clobbered.
2522
- Adding new metal entries is also out of scope. The fixed extractor
2623
may find legitimate metals the old buggy run missed (e.g., via CHEBI
2724
tier-1 matching), but adding them here would surprise curators and
@@ -31,20 +28,48 @@
3128
short enough to false-match the substring pattern that affects metals.
3229
3330
Usage:
34-
uv run python scripts/clean_metals_inplace.py --dry-run
35-
uv run python scripts/clean_metals_inplace.py
31+
uv run python scripts/clean_metals_inplace.py # dry-run (default)
32+
uv run python scripts/clean_metals_inplace.py --apply # write changes
3633
3734
The script self-bootstraps `src/` onto `sys.path`, so PYTHONPATH does
3835
not need to be set when invoking it directly.
36+
37+
History
38+
-------
39+
Originally this script mutated each YAML in place via regex/line edits
40+
on the raw text and `path.write_text(string)`. That bypassed the
41+
write-time validation gate and never recorded a CurationEvent — both
42+
flagged by the writers audit (PR #85). This rewrite loads the document
43+
through `yaml.safe_load`, mutates the in-memory dict, records a
44+
`CLEAN_METAL_FIELDS` curation event, and writes back through
45+
`write_validated_community` (with backup-restore on validation failure,
46+
matching the apply_pmc_conversions / link_growth_media pattern).
47+
48+
The cleaning semantics are preserved verbatim: the same ambiguous-symbol
49+
table is consulted, and the same word-bounded evidence search runs over
50+
the document minus its `metals_present` block. The only externally
51+
visible difference is that the file is now re-emitted as canonical YAML
52+
(comments and incidental formatting are not preserved across the
53+
roundtrip — this is the same cost paid by every other validated-writer
54+
script in the repo).
3955
"""
4056

57+
from __future__ import annotations
58+
4159
import argparse
4260
import sys
4361
from pathlib import Path
4462

63+
import yaml
64+
4565
sys.path.insert(0, str(Path(__file__).parent.parent / "src"))
4666

67+
from communitymech.curate.curation_event import record_curation_event
4768
from communitymech.metal_extraction import keyword_in_text
69+
from communitymech.validation.write_validated import (
70+
ValidationFailedError,
71+
write_validated_community,
72+
)
4873

4974
# Metals whose keyword list contains a short symbol that the old buggy
5075
# substring matcher false-fired on. For each, the "unambiguous" tokens
@@ -57,37 +82,43 @@
5782
}
5883

5984

60-
def _read_metals_block(lines: list[str]) -> tuple[int, int, list[str]] | None:
61-
"""Locate the metals_present block. Returns (start_idx, end_idx, entries)."""
62-
for i, line in enumerate(lines):
63-
if line.rstrip("\n") == "metals_present:" or line.startswith("metals_present: "):
64-
inline = line.rstrip("\n").removeprefix("metals_present:").strip()
65-
if inline:
66-
# Inline scalar form (e.g., `metals_present: []`) — nothing to clean.
67-
return i, i + 1, []
68-
entries: list[str] = []
69-
end = i + 1
70-
while end < len(lines) and lines[end].startswith("- "):
71-
entries.append(lines[end][2:].strip())
72-
end += 1
73-
return i, end, entries
74-
return None
75-
76-
77-
def clean_file(path: Path, dry_run: bool) -> tuple[bool, str]:
78-
text = path.read_text()
79-
lines = text.splitlines(keepends=True)
80-
located = _read_metals_block(lines)
81-
if located is None:
82-
return False, ""
83-
start, end, entries = located
85+
def _evidence_text(doc: dict) -> str:
86+
"""Serialize ``doc`` minus ``metals_present`` for word-bounded keyword search.
87+
88+
The original (text-based) implementation built the evidence string by
89+
concatenating every line of the source file *except* the
90+
``metals_present:`` block, so that an ambiguous entry would never
91+
self-satisfy by matching its own listing. We mirror that exclusion
92+
here by stripping the key from a shallow copy before dumping.
93+
94+
Caveat — this is a near-but-not-perfect equivalent of the raw-text
95+
view: ``yaml.safe_dump`` drops YAML comments and may reflow scalars
96+
(e.g. line widths on long descriptions). Practically, the keywords
97+
we search for (``zinc``, ``copper``, ``calcium``, …) live in
98+
``description``, ``ecological_state``, evidence ``snippet`` values,
99+
and other YAML *content* — never in YAML comments. So the on-disk
100+
behavior is identical for the corpus today. If a future workflow
101+
starts encoding evidence in YAML comments, this assumption breaks
102+
and the function should switch to ``path.read_text()`` with a
103+
``metals_present`` line-strip filter to preserve full fidelity.
104+
"""
105+
scratch = {k: v for k, v in doc.items() if k != "metals_present"}
106+
return yaml.safe_dump(scratch, allow_unicode=True, sort_keys=False)
107+
108+
109+
def clean_metal_fields_in_doc(doc: dict) -> tuple[list[str], list[str]]:
110+
"""Remove false-positive ambiguous metals from ``doc['metals_present']``.
111+
112+
Mutates ``doc`` in place. Returns ``(removed, kept)`` so callers can
113+
log what changed. When ``metals_present`` is missing, empty, or has
114+
no entries to drop, returns ``([], existing_list)`` and leaves the
115+
doc untouched.
116+
"""
117+
entries = doc.get("metals_present")
84118
if not entries:
85-
return False, ""
119+
return [], list(entries or [])
86120

87-
# Exclude the metals_present block itself when searching for evidence —
88-
# otherwise the `- TITANIUM` entry we're trying to validate would match
89-
# its own keywords and never be flagged as a false positive.
90-
evidence_text = "".join(lines[:start] + lines[end:])
121+
evidence = _evidence_text(doc)
91122

92123
kept: list[str] = []
93124
removed: list[str] = []
@@ -96,47 +127,108 @@ def clean_file(path: Path, dry_run: bool) -> tuple[bool, str]:
96127
if unambig is None:
97128
kept.append(entry)
98129
continue
99-
if any(keyword_in_text(kw, evidence_text) for kw in unambig):
130+
if any(keyword_in_text(kw, evidence) for kw in unambig):
100131
kept.append(entry)
101132
else:
102133
removed.append(entry)
103134

104-
if not removed:
105-
return False, ""
135+
if removed:
136+
doc["metals_present"] = kept
137+
return removed, kept
106138

107-
# Rebuild only the metals_present block; everything else (including
108-
# metal_relevance, metal_notes, and any comments) is preserved.
109-
if kept:
110-
body = "".join(f"- {e}\n" for e in kept)
111-
new_block = "metals_present:\n" + body
112-
else:
113-
new_block = "metals_present: []\n"
114139

115-
new_lines = lines[:start] + [new_block] + lines[end:]
116-
new_text = "".join(new_lines)
140+
def clean_file(path: Path, apply: bool) -> tuple[bool, str]:
141+
"""Load, clean, and (if ``apply``) write a single community YAML.
142+
143+
Returns ``(changed, summary)``. ``changed`` is True iff at least one
144+
false-positive metal was removed. ``summary`` is a human-readable
145+
one-liner suitable for printing alongside the file name.
146+
"""
147+
with path.open() as f:
148+
doc = yaml.safe_load(f)
149+
if not isinstance(doc, dict):
150+
return False, ""
151+
152+
removed, kept = clean_metal_fields_in_doc(doc)
153+
if not removed:
154+
return False, ""
117155

118156
summary = f" removed: {removed}; kept: {kept}"
119-
if not dry_run:
120-
path.write_text(new_text)
157+
158+
if not apply:
159+
return True, summary
160+
161+
record_curation_event(
162+
doc,
163+
curator="clean_metals_inplace",
164+
action="CLEAN_METAL_FIELDS",
165+
changes=(
166+
f"Removed {len(removed)} false-positive metal(s) from metals_present: "
167+
f"{', '.join(removed)}"
168+
),
169+
)
170+
171+
# Restore-on-failure pattern (matches apply_pmc_conversions /
172+
# link_growth_media from PR #85): rename to .bak first so a failed
173+
# write doesn't leave the original missing on disk. Catch BOTH
174+
# ValidationFailedError (the expected-error case, surfaced with a
175+
# short summary) and *any other* exception (disk error, permission
176+
# denied, unexpected serializer fault) so the bak file always gets
177+
# rolled back rather than leaving the user with `.yaml.bak_metals`
178+
# and no `.yaml`.
179+
backup = path.with_suffix(".yaml.bak_metals")
180+
path.rename(backup)
181+
try:
182+
write_validated_community(doc, path)
183+
except ValidationFailedError as exc:
184+
backup.rename(path)
185+
print(
186+
f" ✗ validation failed for {path.name}: {exc.summary()} "
187+
"(original restored from backup)",
188+
file=sys.stderr,
189+
)
190+
return False, summary
191+
except BaseException:
192+
# Restore the backup before re-raising so the corpus is never
193+
# left with a missing canonical file.
194+
if backup.exists() and not path.exists():
195+
backup.rename(path)
196+
raise
197+
backup.unlink() # success — drop the backup
121198
return True, summary
122199

123200

124201
def main() -> None:
125-
parser = argparse.ArgumentParser()
126-
parser.add_argument("--dry-run", action="store_true")
202+
parser = argparse.ArgumentParser(description=__doc__)
203+
group = parser.add_mutually_exclusive_group()
204+
group.add_argument(
205+
"--dry-run",
206+
action="store_true",
207+
help="Preview changes without writing (default).",
208+
)
209+
group.add_argument(
210+
"--apply",
211+
action="store_true",
212+
help="Actually write changes back to disk.",
213+
)
127214
args = parser.parse_args()
128215

216+
apply = args.apply # dry-run is the default; --apply is the explicit opt-in
217+
129218
community_dir = Path("kb/communities")
130219
files = sorted(community_dir.glob("*.yaml"))
131220
changed = 0
132221
for f in files:
133-
did_change, summary = clean_file(f, dry_run=args.dry_run)
222+
did_change, summary = clean_file(f, apply=apply)
134223
if did_change:
135224
changed += 1
136225
print(f"{f.name}")
137226
print(summary)
138-
verb = "would change" if args.dry_run else "changed"
227+
228+
verb = "changed" if apply else "would change"
139229
print(f"\n{verb} {changed}/{len(files)} files")
230+
if not apply:
231+
print("(dry-run; re-run with --apply to write)")
140232

141233

142234
if __name__ == "__main__":

0 commit comments

Comments
 (0)