Skip to content

Commit 4e6dcd0

Browse files
realmarcinclaude
andcommitted
Address Copilot review on PR #86
Two findings: - The backup/restore flow only caught ValidationFailedError. If any other exception fired after path.rename(backup) (disk error, permission denied, unexpected serializer fault), the original would be left as .yaml.bak_metals with no .yaml. Catch BaseException too: restore the backup before re-raising so the corpus is never left with a missing canonical file. - _evidence_text serializes via yaml.safe_dump, which drops comments and can reflow scalars — strictly not a perfect raw-text equivalent. Document the caveat: the keywords we search for live in YAML content (description, evidence snippet, ecological_state), never in comments, so the on-disk behavior is identical for the corpus today. If a future workflow starts encoding evidence in YAML comments, the function should switch to path.read_text() with a line-strip filter. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 50b34f0 commit 4e6dcd0

1 file changed

Lines changed: 22 additions & 4 deletions

File tree

scripts/clean_metals_inplace.py

Lines changed: 22 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -91,9 +91,16 @@ def _evidence_text(doc: dict) -> str:
9191
self-satisfy by matching its own listing. We mirror that exclusion
9292
here by stripping the key from a shallow copy before dumping.
9393
94-
``keyword_in_text`` anchors on non-alphanumeric boundaries, so YAML
95-
structural characters (``:``, ``-``, newlines, quotes) do not affect
96-
match outcomes versus the original raw-text view.
94+
Caveat — this is a near-but-not-perfect equivalent of the raw-text
95+
view: ``yaml.safe_dump`` drops YAML comments and may reflow scalars
96+
(e.g. line widths on long descriptions). Practically, the keywords
97+
we search for (``zinc``, ``copper``, ``calcium``, …) live in
98+
``description``, ``ecological_state``, evidence ``snippet`` values,
99+
and other YAML *content* — never in YAML comments. So the on-disk
100+
behavior is identical for the corpus today. If a future workflow
101+
starts encoding evidence in YAML comments, this assumption breaks
102+
and the function should switch to ``path.read_text()`` with a
103+
``metals_present`` line-strip filter to preserve full fidelity.
97104
"""
98105
scratch = {k: v for k, v in doc.items() if k != "metals_present"}
99106
return yaml.safe_dump(scratch, allow_unicode=True, sort_keys=False)
@@ -163,7 +170,12 @@ def clean_file(path: Path, apply: bool) -> tuple[bool, str]:
163170

164171
# Restore-on-failure pattern (matches apply_pmc_conversions /
165172
# link_growth_media from PR #85): rename to .bak first so a failed
166-
# validation doesn't leave the original missing on disk.
173+
# write doesn't leave the original missing on disk. Catch BOTH
174+
# ValidationFailedError (the expected-error case, surfaced with a
175+
# short summary) and *any other* exception (disk error, permission
176+
# denied, unexpected serializer fault) so the bak file always gets
177+
# rolled back rather than leaving the user with `.yaml.bak_metals`
178+
# and no `.yaml`.
167179
backup = path.with_suffix(".yaml.bak_metals")
168180
path.rename(backup)
169181
try:
@@ -176,6 +188,12 @@ def clean_file(path: Path, apply: bool) -> tuple[bool, str]:
176188
file=sys.stderr,
177189
)
178190
return False, summary
191+
except BaseException:
192+
# Restore the backup before re-raising so the corpus is never
193+
# left with a missing canonical file.
194+
if backup.exists() and not path.exists():
195+
backup.rename(path)
196+
raise
179197
backup.unlink() # success — drop the backup
180198
return True, summary
181199

0 commit comments

Comments
 (0)