Skip to content

Commit 50b34f0

Browse files
realmarcinclaude
andcommitted
Convert clean_metals_inplace.py to load/dump roundtrip with write_validated_community
Previously this script mutated 149 community YAMLs in place via regex / line-edit surgery on the raw text, then wrote the result with path.write_text(string). The recent schema-gap audit (PR #85) flagged it as the highest-risk writer in the repo: no validation gate, no curation_history entry on changed files, and the audit detector itself was blind to it until the previous PR broadened the heuristic to catch read_text/write_text round-trips against kb/communities/. Now: load through yaml.safe_load, run the same cleaning logic on the parsed dict, append a CLEAN_METAL_FIELDS CurationEvent, and write back via write_validated_community (with backup-restore on validation failure, matching the apply_pmc_conversions / link_growth_media pattern from PR #85). The --dry-run default + --apply opt-in flag matches the existing repo convention. Cleaning semantics preserved verbatim: the same AMBIGUOUS_METAL_KEYWORDS table is consulted (TITANIUM / GOLD / PALLADIUM), and the same word-bounded evidence search runs over the document minus its metals_present block — implemented now by yaml.safe_dump'ing a shallow copy of the doc with metals_present stripped, which keyword_in_text scans identically (its regex anchors on non-alphanumeric boundaries, so YAML structural characters do not change match outcomes). Smoke-tested against a /tmp copy of the corpus: - Baseline corpus (already cleaned by a prior buggy-extractor run): reports 0/265 files would change, as expected. - With an injected false-positive (TITANIUM in a doc with no titanium evidence): correctly removed in dry-run preview and in --apply, with a schema-valid curation_history event appended and backup unlinked on success. Audit row before: yes no yes no no Audit row after: yes yes yes yes no (writes_yaml / appends_curation_history / has_write_safeguard / validates_before_write / wired_into_just — wired_into_just stays no; this script is a one-shot cleanup, not a recurring pipeline step.) Baselines remain green: scripts/validate_strict.py reports 0 errors across 265 files; pytest tests/ 136 passed / 9 skipped; ruff clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 9d863f8 commit 50b34f0

1 file changed

Lines changed: 127 additions & 53 deletions

File tree

scripts/clean_metals_inplace.py

Lines changed: 127 additions & 53 deletions
Original file line numberDiff line numberDiff line change
@@ -18,10 +18,7 @@
1818
Out of scope deliberately:
1919
2020
- `metal_notes` and `metal_relevance` are never touched. Curator-authored
21-
values in those fields would otherwise be silently clobbered, and
22-
rewriting a multi-line YAML scalar from a key-line regex is unsafe
23-
(it leaves orphaned continuation lines that get re-folded into the
24-
new value).
21+
values in those fields would otherwise be silently clobbered.
2522
- Adding new metal entries is also out of scope. The fixed extractor
2623
may find legitimate metals the old buggy run missed (e.g., via CHEBI
2724
tier-1 matching), but adding them here would surprise curators and
@@ -31,20 +28,48 @@
3128
short enough to false-match the substring pattern that affects metals.
3229
3330
Usage:
34-
uv run python scripts/clean_metals_inplace.py --dry-run
35-
uv run python scripts/clean_metals_inplace.py
31+
uv run python scripts/clean_metals_inplace.py # dry-run (default)
32+
uv run python scripts/clean_metals_inplace.py --apply # write changes
3633
3734
The script self-bootstraps `src/` onto `sys.path`, so PYTHONPATH does
3835
not need to be set when invoking it directly.
36+
37+
History
38+
-------
39+
Originally this script mutated each YAML in place via regex/line edits
40+
on the raw text and `path.write_text(string)`. That bypassed the
41+
write-time validation gate and never recorded a CurationEvent — both
42+
flagged by the writers audit (PR #85). This rewrite loads the document
43+
through `yaml.safe_load`, mutates the in-memory dict, records a
44+
`CLEAN_METAL_FIELDS` curation event, and writes back through
45+
`write_validated_community` (with backup-restore on validation failure,
46+
matching the apply_pmc_conversions / link_growth_media pattern).
47+
48+
The cleaning semantics are preserved verbatim: the same ambiguous-symbol
49+
table is consulted, and the same word-bounded evidence search runs over
50+
the document minus its `metals_present` block. The only externally
51+
visible difference is that the file is now re-emitted as canonical YAML
52+
(comments and incidental formatting are not preserved across the
53+
roundtrip — this is the same cost paid by every other validated-writer
54+
script in the repo).
3955
"""
4056

57+
from __future__ import annotations
58+
4159
import argparse
4260
import sys
4361
from pathlib import Path
4462

63+
import yaml
64+
4565
sys.path.insert(0, str(Path(__file__).parent.parent / "src"))
4666

67+
from communitymech.curate.curation_event import record_curation_event
4768
from communitymech.metal_extraction import keyword_in_text
69+
from communitymech.validation.write_validated import (
70+
ValidationFailedError,
71+
write_validated_community,
72+
)
4873

4974
# Metals whose keyword list contains a short symbol that the old buggy
5075
# substring matcher false-fired on. For each, the "unambiguous" tokens
@@ -57,37 +82,36 @@
5782
}
5883

5984

60-
def _read_metals_block(lines: list[str]) -> tuple[int, int, list[str]] | None:
61-
"""Locate the metals_present block. Returns (start_idx, end_idx, entries)."""
62-
for i, line in enumerate(lines):
63-
if line.rstrip("\n") == "metals_present:" or line.startswith("metals_present: "):
64-
inline = line.rstrip("\n").removeprefix("metals_present:").strip()
65-
if inline:
66-
# Inline scalar form (e.g., `metals_present: []`) — nothing to clean.
67-
return i, i + 1, []
68-
entries: list[str] = []
69-
end = i + 1
70-
while end < len(lines) and lines[end].startswith("- "):
71-
entries.append(lines[end][2:].strip())
72-
end += 1
73-
return i, end, entries
74-
return None
75-
76-
77-
def clean_file(path: Path, dry_run: bool) -> tuple[bool, str]:
78-
text = path.read_text()
79-
lines = text.splitlines(keepends=True)
80-
located = _read_metals_block(lines)
81-
if located is None:
82-
return False, ""
83-
start, end, entries = located
85+
def _evidence_text(doc: dict) -> str:
86+
"""Serialize ``doc`` minus ``metals_present`` for word-bounded keyword search.
87+
88+
The original (text-based) implementation built the evidence string by
89+
concatenating every line of the source file *except* the
90+
``metals_present:`` block, so that an ambiguous entry would never
91+
self-satisfy by matching its own listing. We mirror that exclusion
92+
here by stripping the key from a shallow copy before dumping.
93+
94+
``keyword_in_text`` anchors on non-alphanumeric boundaries, so YAML
95+
structural characters (``:``, ``-``, newlines, quotes) do not affect
96+
match outcomes versus the original raw-text view.
97+
"""
98+
scratch = {k: v for k, v in doc.items() if k != "metals_present"}
99+
return yaml.safe_dump(scratch, allow_unicode=True, sort_keys=False)
100+
101+
102+
def clean_metal_fields_in_doc(doc: dict) -> tuple[list[str], list[str]]:
103+
"""Remove false-positive ambiguous metals from ``doc['metals_present']``.
104+
105+
Mutates ``doc`` in place. Returns ``(removed, kept)`` so callers can
106+
log what changed. When ``metals_present`` is missing, empty, or has
107+
no entries to drop, returns ``([], existing_list)`` and leaves the
108+
doc untouched.
109+
"""
110+
entries = doc.get("metals_present")
84111
if not entries:
85-
return False, ""
112+
return [], list(entries or [])
86113

87-
# Exclude the metals_present block itself when searching for evidence —
88-
# otherwise the `- TITANIUM` entry we're trying to validate would match
89-
# its own keywords and never be flagged as a false positive.
90-
evidence_text = "".join(lines[:start] + lines[end:])
114+
evidence = _evidence_text(doc)
91115

92116
kept: list[str] = []
93117
removed: list[str] = []
@@ -96,47 +120,97 @@ def clean_file(path: Path, dry_run: bool) -> tuple[bool, str]:
96120
if unambig is None:
97121
kept.append(entry)
98122
continue
99-
if any(keyword_in_text(kw, evidence_text) for kw in unambig):
123+
if any(keyword_in_text(kw, evidence) for kw in unambig):
100124
kept.append(entry)
101125
else:
102126
removed.append(entry)
103127

104-
if not removed:
105-
return False, ""
128+
if removed:
129+
doc["metals_present"] = kept
130+
return removed, kept
131+
106132

107-
# Rebuild only the metals_present block; everything else (including
108-
# metal_relevance, metal_notes, and any comments) is preserved.
109-
if kept:
110-
body = "".join(f"- {e}\n" for e in kept)
111-
new_block = "metals_present:\n" + body
112-
else:
113-
new_block = "metals_present: []\n"
133+
def clean_file(path: Path, apply: bool) -> tuple[bool, str]:
134+
"""Load, clean, and (if ``apply``) write a single community YAML.
135+
136+
Returns ``(changed, summary)``. ``changed`` is True iff at least one
137+
false-positive metal was removed. ``summary`` is a human-readable
138+
one-liner suitable for printing alongside the file name.
139+
"""
140+
with path.open() as f:
141+
doc = yaml.safe_load(f)
142+
if not isinstance(doc, dict):
143+
return False, ""
114144

115-
new_lines = lines[:start] + [new_block] + lines[end:]
116-
new_text = "".join(new_lines)
145+
removed, kept = clean_metal_fields_in_doc(doc)
146+
if not removed:
147+
return False, ""
117148

118149
summary = f" removed: {removed}; kept: {kept}"
119-
if not dry_run:
120-
path.write_text(new_text)
150+
151+
if not apply:
152+
return True, summary
153+
154+
record_curation_event(
155+
doc,
156+
curator="clean_metals_inplace",
157+
action="CLEAN_METAL_FIELDS",
158+
changes=(
159+
f"Removed {len(removed)} false-positive metal(s) from metals_present: "
160+
f"{', '.join(removed)}"
161+
),
162+
)
163+
164+
# Restore-on-failure pattern (matches apply_pmc_conversions /
165+
# link_growth_media from PR #85): rename to .bak first so a failed
166+
# validation doesn't leave the original missing on disk.
167+
backup = path.with_suffix(".yaml.bak_metals")
168+
path.rename(backup)
169+
try:
170+
write_validated_community(doc, path)
171+
except ValidationFailedError as exc:
172+
backup.rename(path)
173+
print(
174+
f" ✗ validation failed for {path.name}: {exc.summary()} "
175+
"(original restored from backup)",
176+
file=sys.stderr,
177+
)
178+
return False, summary
179+
backup.unlink() # success — drop the backup
121180
return True, summary
122181

123182

124183
def main() -> None:
125-
parser = argparse.ArgumentParser()
126-
parser.add_argument("--dry-run", action="store_true")
184+
parser = argparse.ArgumentParser(description=__doc__)
185+
group = parser.add_mutually_exclusive_group()
186+
group.add_argument(
187+
"--dry-run",
188+
action="store_true",
189+
help="Preview changes without writing (default).",
190+
)
191+
group.add_argument(
192+
"--apply",
193+
action="store_true",
194+
help="Actually write changes back to disk.",
195+
)
127196
args = parser.parse_args()
128197

198+
apply = args.apply # dry-run is the default; --apply is the explicit opt-in
199+
129200
community_dir = Path("kb/communities")
130201
files = sorted(community_dir.glob("*.yaml"))
131202
changed = 0
132203
for f in files:
133-
did_change, summary = clean_file(f, dry_run=args.dry_run)
204+
did_change, summary = clean_file(f, apply=apply)
134205
if did_change:
135206
changed += 1
136207
print(f"{f.name}")
137208
print(summary)
138-
verb = "would change" if args.dry_run else "changed"
209+
210+
verb = "changed" if apply else "would change"
139211
print(f"\n{verb} {changed}/{len(files)} files")
212+
if not apply:
213+
print("(dry-run; re-run with --apply to write)")
140214

141215

142216
if __name__ == "__main__":

0 commit comments

Comments
 (0)