Commit 9d863f8

realmarcin and claude authored
Close audit-writers blind-spot + lock in growth_media id patterns (#85)
Two small follow-ups to the recent schema-gap audit (#84):

- F7: scripts/audit_writers.py's writer-detection heuristic only caught yaml.dump / yaml.safe_dump / path.write_text(yaml.dump(...)) patterns. scripts/clean_metals_inplace.py mutates 149 community YAMLs via regex/line-edit substitutions + path.write_text(string), bypassing the detector entirely. Broaden the heuristic: a file that references the kb/communities path literal AND round-trips the same path variable through .read_text() and .write_text(...) is also a community-YAML writer. clean_metals_inplace.py is the only script that matches this pattern today; it now surfaces in the audit TSV (correctly flagged validates_before_write=no, appends_curation_history=no; slated for conversion as separate work).

- Fix #5: GrowthMedia.culturemech_id and GrowthMediaComponent.media_ingredient_mech_id had no pattern declared, while the sibling RelatedMedia.culturemech_id / RelatedIngredient.mediaingredientmech_id slots do. Live data already conforms to the standard ^CultureMech:\d{6}$ / ^MediaIngredientMech:\d{6}$ format, so just lock it in to catch future drift at write time.

Schema regenerated via `just gen-python`; `validate_strict.py` still reports 0 ERROR rows across 265 files.

This PR does NOT convert clean_metals_inplace.py itself; that's a separate, more invasive change (regex surgery → load/dump roundtrip across 149 files). The audit-blind-spot fix above ensures the next audit cycle surfaces it.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent d584f55 commit 9d863f8

4 files changed

Lines changed: 28 additions & 4 deletions

reports/pipeline_writers_audit.tsv

Lines changed: 1 addition & 0 deletions
@@ -5,6 +5,7 @@ scripts/apply_pmc_conversions.py yes yes yes yes no
 scripts/apply_strain_designations.py yes no no no no
 scripts/apply_taxonomy_corrections.py yes no no no no
 scripts/backfill_metals.py yes no yes no no
+scripts/clean_metals_inplace.py yes no yes no no
 scripts/enhance_strain_data.py yes no no no no
 scripts/fix_network_integrity.py yes yes yes yes no
 scripts/fix_reference_formats.py yes no yes no no

scripts/audit_writers.py

Lines changed: 20 additions & 1 deletion
@@ -38,6 +38,12 @@
 
 # Patterns
 _WRITE_TEXT_OF_YAML = re.compile(r"\.write_text\s*\(\s*yaml\.(?:safe_)?dump")
+# Match the path literal "kb/communities" appearing in source; used to
+# anchor the in-place-mutation heuristic below to scripts that actually
+# operate on community YAMLs.
+_COMMUNITIES_PATH_LITERAL = re.compile(r"kb/communities")
+_READ_TEXT_VAR = re.compile(r"(\w+)\.read_text\s*\(")
+_WRITE_TEXT_VAR = re.compile(r"(\w+)\.write_text\s*\(")
 _CURATION_APPEND = re.compile(
     r"curation_history.*?(append|\+=|\.insert)"
     r"|['\"]curator['\"]\s*:"
@@ -73,7 +79,20 @@ def looks_like_yaml_writer(text: str) -> bool:
         return True
     # write_validated_community is the closed-schema-gated wrapper that
     # callers route through instead of yaml.dump directly.
-    return "write_validated_community(" in text
+    if "write_validated_community(" in text:
+        return True
+    # Text-templated in-place writers: scripts that round-trip the same
+    # path variable through .read_text() and .write_text(...) AND
+    # reference the kb/communities directory are mutating community
+    # YAMLs via raw text surgery (e.g. clean_metals_inplace.py uses
+    # regex/line edits + path.write_text(string)). They bypass yaml.dump
+    # but are still community-YAML writers and must be audited.
+    if _COMMUNITIES_PATH_LITERAL.search(text):
+        read_vars = set(_READ_TEXT_VAR.findall(text))
+        write_vars = set(_WRITE_TEXT_VAR.findall(text))
+        if read_vars & write_vars:
+            return True
+    return False
 
 
 def audit(path: Path, justfile_text: str) -> dict | None:
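
The new branch can be exercised in isolation. The sketch below copies the three regexes from the diff above into a hypothetical helper, `looks_like_inplace_text_writer`, which is not part of the repository, and runs it against made-up script bodies. The third sample suggests one limitation of matching on variable names: a script that reads through one variable and writes through another would presumably slip past the check.

```python
import re

# Regexes copied from the audit_writers.py diff above.
_COMMUNITIES_PATH_LITERAL = re.compile(r"kb/communities")
_READ_TEXT_VAR = re.compile(r"(\w+)\.read_text\s*\(")
_WRITE_TEXT_VAR = re.compile(r"(\w+)\.write_text\s*\(")


def looks_like_inplace_text_writer(text: str) -> bool:
    """Hypothetical helper: only the new branch of looks_like_yaml_writer."""
    if not _COMMUNITIES_PATH_LITERAL.search(text):
        return False
    read_vars = set(_READ_TEXT_VAR.findall(text))
    write_vars = set(_WRITE_TEXT_VAR.findall(text))
    # Flag the script only if some variable name is both read and written.
    return bool(read_vars & write_vars)


# Made-up script bodies, for illustration only.
INPLACE_EDITOR = '''
from pathlib import Path
import re
for path in Path("kb/communities").glob("*.yaml"):
    text = path.read_text()
    text = re.sub("old", "new", text)
    path.write_text(text)
'''

READ_ONLY = '''
from pathlib import Path
for path in Path("kb/communities").glob("*.yaml"):
    print(len(path.read_text()))
'''

DIFFERENT_VARS = '''
from pathlib import Path
src = Path("kb/communities/a.yaml")
dst = Path("kb/communities/a.yaml")
dst.write_text(src.read_text().replace("old", "new"))
'''

print(looks_like_inplace_text_writer(INPLACE_EDITOR))  # True
print(looks_like_inplace_text_writer(READ_ONLY))       # False
print(looks_like_inplace_text_writer(DIFFERENT_VARS))  # False (missed writer)
```

The heuristic compares variable names, not the paths they hold, so it is a cheap textual signal for the audit report and not a proof that a script writes community YAMLs.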

src/communitymech/datamodel/communitymech.py

Lines changed: 5 additions & 3 deletions
@@ -1,5 +1,5 @@
 # Auto generated from communitymech.yaml by pythongen.py version: 0.0.1
-# Generation date: 2026-05-25T01:44:55
+# Generation date: 2026-05-25T18:12:41
 # Schema: communitymech
 #
 # id: https://w3id.org/communitymech
@@ -1949,7 +1949,8 @@ class slots:
                    model_uri=COMMUNITYMECH.growthMediaComponent__name, domain=None, range=str)
 
 slots.growthMediaComponent__media_ingredient_mech_id = Slot(uri=COMMUNITYMECH.media_ingredient_mech_id, name="growthMediaComponent__media_ingredient_mech_id", curie=COMMUNITYMECH.curie('media_ingredient_mech_id'),
-                   model_uri=COMMUNITYMECH.growthMediaComponent__media_ingredient_mech_id, domain=None, range=Optional[str])
+                   model_uri=COMMUNITYMECH.growthMediaComponent__media_ingredient_mech_id, domain=None, range=Optional[str],
+                   pattern=re.compile(r'^MediaIngredientMech:\d{6}$'))
 
 slots.growthMediaComponent__media_ingredient_mech_url = Slot(uri=COMMUNITYMECH.media_ingredient_mech_url, name="growthMediaComponent__media_ingredient_mech_url", curie=COMMUNITYMECH.curie('media_ingredient_mech_url'),
                    model_uri=COMMUNITYMECH.growthMediaComponent__media_ingredient_mech_url, domain=None, range=Optional[str])
@@ -1970,7 +1971,8 @@ class slots:
                    model_uri=COMMUNITYMECH.growthMedia__name, domain=None, range=str)
 
 slots.growthMedia__culturemech_id = Slot(uri=COMMUNITYMECH.culturemech_id, name="growthMedia__culturemech_id", curie=COMMUNITYMECH.curie('culturemech_id'),
-                   model_uri=COMMUNITYMECH.growthMedia__culturemech_id, domain=None, range=Optional[str])
+                   model_uri=COMMUNITYMECH.growthMedia__culturemech_id, domain=None, range=Optional[str],
+                   pattern=re.compile(r'^CultureMech:\d{6}$'))
 
 slots.growthMedia__culturemech_url = Slot(uri=COMMUNITYMECH.culturemech_url, name="growthMedia__culturemech_url", curie=COMMUNITYMECH.curie('culturemech_url'),
                    model_uri=COMMUNITYMECH.growthMedia__culturemech_url, domain=None, range=Optional[str])

src/communitymech/schema/communitymech.yaml

Lines changed: 2 additions & 0 deletions
@@ -699,6 +699,7 @@ classes:
         required: true
       media_ingredient_mech_id:
         description: Identifier in MediaIngredientMech database (e.g., MediaIngredientMech:000001)
+        pattern: "^MediaIngredientMech:\\d{6}$"
       media_ingredient_mech_url:
         description: URL to MediaIngredientMech ingredient entry
       concentration:
@@ -719,6 +720,7 @@ classes:
         required: true
       culturemech_id:
         description: Identifier in CultureMech media database (e.g., CultureMech:000001)
+        pattern: "^CultureMech:\\d{6}$"
       culturemech_url:
         description: URL to CultureMech media entry
       composition:
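
To see what the two new patterns accept and reject, here is a minimal sketch using Python's `re` module directly. The `conforms` helper and the sample identifiers are illustrative assumptions. How the project's own validator applies the pattern is not shown in this commit, so this only demonstrates the regexes themselves.

```python
import re

# Patterns as declared in the schema diff above.
CULTUREMECH_ID = re.compile(r"^CultureMech:\d{6}$")
MEDIA_INGREDIENT_MECH_ID = re.compile(r"^MediaIngredientMech:\d{6}$")


def conforms(pattern: re.Pattern, value: str) -> bool:
    """Hypothetical helper: True if value matches the anchored pattern."""
    return pattern.match(value) is not None


print(conforms(CULTUREMECH_ID, "CultureMech:000001"))    # True
print(conforms(CULTUREMECH_ID, "CultureMech:1"))         # False: too few digits
print(conforms(CULTUREMECH_ID, "CultureMech:0000001"))   # False: seven digits
print(conforms(CULTUREMECH_ID, "culturemech:000001"))    # False: wrong case
print(conforms(MEDIA_INGREDIENT_MECH_ID, "MediaIngredientMech:000001"))  # True
print(conforms(MEDIA_INGREDIENT_MECH_ID, "CultureMech:000001"))          # False

# Python-specific caveat: "$" also matches just before a trailing newline,
# so match() accepts this value while fullmatch() rejects it.
print(conforms(CULTUREMECH_ID, "CultureMech:000001\n"))                  # True
print(CULTUREMECH_ID.fullmatch("CultureMech:000001\n") is not None)      # False
```

Because both patterns are anchored with `^` and `$`, an identifier with extra prefix or suffix text, or with the wrong digit count, is rejected instead of partially matched.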
