Commit d584f55

realmarcin and claude authored
Port audit machinery from CultureMech (schema extension + validate_strict + write_validated_community + record_curation_event + audit_writers) (#84)
* Port audit machinery from CultureMech: schema extension + validate_strict + write_validated_community + record_curation_event + audit_writers

Brings CommunityMech to parity with the audit-machinery ports recently landed in CultureMech (source), MediaIngredientMech (#32), and TraitMech (#76). CommunityMech is the last sibling; the lift is larger than MIM / TraitMech because the schema did not yet define CurationEvent or curation_history.

Schema additions (additive, no migration needed):
- New CurationEvent class with timestamp / curator / action / changes / llm_assisted attributes, mirroring the shape used by sibling Mech repos so cross-repo tooling reads curation events uniformly.
- New curation_history slot on MicrobialCommunity, multivalued + inlined + optional. Existing community YAMLs continue to validate without modification.
- src/communitymech/datamodel/communitymech.py regenerated (just gen-python).

New helpers:
- src/communitymech/validation/write_validated.py: write_validated_community() refuses to dump a MicrobialCommunity that fails closed-schema LinkML validation; raises ValidationFailedError. Single-root-class schema so no target_class routing needed. Default yaml opts match the repo's existing emission convention (default_flow_style=False, sort_keys=False, allow_unicode=True, width=120, indent=2) so existing files roundtrip byte-identically.
- src/communitymech/curate/curation_event.py: record_curation_event() is the standard helper for appending a CurationEvent to doc['curation_history']. Schema-aligned signature; whole-second + Z suffix timestamps; skip_if_recent support for idempotent re-runs.

New scripts:
- scripts/validate_strict.py: strict closed-schema parallel walk of kb/communities/ (with backups/ + snapshots/ excluded). Emits reports/instance_validation_failures.tsv categorized by error class, exits non-zero on ERROR. Strictly stronger than the per-file linkml-validate loop in just validate-all (open-mode, swallows exit codes).
- scripts/audit_writers.py: inventory of every YAML-writing module under scripts/ + src/communitymech/, flags whether each script validates before writing and appends a curation_history event.

Writer conversions (5 of ~15):
- scripts/add_community_ids.py (action=ASSIGN_COMMUNITY_ID; also gained a --dry-run safeguard it lacked before)
- scripts/apply_pmc_conversions.py (action=CONVERT_PMC_TO_PMID)
- scripts/fix_network_integrity.py (action=FIX_NETWORK_INTEGRITY)
- scripts/link_growth_media.py (action=LINK_GROWTH_MEDIA)
- src/communitymech/network/llm_repair.py (action=LLM_REPAIR_APPLIED, llm_assisted=True)

Each one was wrapped in try/except ValidationFailedError on the write call so one bad record can't kill a batch run. Existing CLI surfaces preserved.

Justfile:
- New validate-strict + audit-writers recipes.
- qc composite extended to include validate-strict.

Baseline:
- just validate-strict: 265 files, 0 ERROR rows (clean).
- just audit-writers: 15 writers; 5 now validate before write + append curation_history. The other 10 are flagged in the TSV as future-work conversions (apply_strain_designations, apply_taxonomy_corrections, apply_suggested_fixes / suggested_snippets, backfill_metals, batch_snippet_fixer, clean_metals_inplace, curate_evidence_with_pdfs, enhance_strain_data, fix_invalid_snippets, fix_reference_formats, intelligent_snippet_fixer, etc.); converting them follows the same pattern as the 5 above.
- pytest tests/: 136 passed, 9 skipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Address Copilot review on PR #84

5 findings, all real and addressed:
- scripts/apply_pmc_conversions.py + scripts/link_growth_media.py (both process_single_community and process_all_communities paths): all three code paths rename the source file to a `.bak` backup before writing the validated result. Previously, if write_validated_community raised ValidationFailedError the handler only logged and continued, leaving the original path missing on disk (only the .bak existed). Now restore the backup on validation failure before logging.
- scripts/audit_writers.py: replace the substring check for `wired_into_just` with a per-line check that ignores comments and requires a word-boundary match on the full filename. The previous check gave a false positive when a justfile comment merely mentioned the filename, e.g. write_validated.py matched the justfile comment referencing write_validated_community(). Drops the wired-into-just count from 3 (with false positives) to 1 (genuine: link_growth_media).
- scripts/add_community_ids.py: guard against running on already-IDed YAMLs. The previous flow built `{"id": community_id}.update(data)`, which silently retained the source file's existing id while the curation event still recorded "Assigned id=<new>", a misleading audit entry. Skip such files with an explanatory log line instead.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
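The commit message describes record_curation_event() only in prose (append to doc['curation_history'], whole-second timestamps with a Z suffix, skip_if_recent for idempotent re-runs). The following is a minimal sketch of that behavior, not the repo's implementation: the function name `record_curation_event_sketch`, the `now` parameter, and the reading of `skip_if_recent` as a time window compared against the last event's curator and action are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

TS_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # whole seconds, UTC, trailing Z


def record_curation_event_sketch(
    doc: dict,
    *,
    curator: str,
    action: str,
    changes: str,
    llm_assisted: bool = False,
    skip_if_recent: Optional[timedelta] = None,  # assumed semantics
    now: Optional[datetime] = None,  # injectable clock, for testing only
) -> bool:
    """Append an event to doc['curation_history']; return False if skipped."""
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    history = doc.get("curation_history") or []
    doc["curation_history"] = history

    # Idempotent re-runs: skip when the last event is the same curator and
    # action and falls inside the window (an assumed interpretation).
    if skip_if_recent is not None and history:
        last = history[-1]
        last_ts = datetime.strptime(last["timestamp"], TS_FORMAT).replace(tzinfo=timezone.utc)
        same_kind = last.get("curator") == curator and last.get("action") == action
        if same_kind and now - last_ts <= skip_if_recent:
            return False

    history.append(
        {
            "timestamp": now.replace(microsecond=0).strftime(TS_FORMAT),
            "curator": curator,
            "action": action,
            "changes": changes,
            "llm_assisted": llm_assisted,
        }
    )
    return True
```

Appending in place keeps the helper usable on the plain dict returned by yaml.safe_load, which is how the converted scripts in this commit call it.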
1 parent 5979c43 commit d584f55

16 files changed

Lines changed: 1792 additions & 1797 deletions

justfile

Lines changed: 18 additions & 2 deletions
@@ -23,6 +23,22 @@ validate-all:
         uv run linkml-validate -s src/communitymech/schema/communitymech.yaml "$file"
     done
 
+# Strict in-process validation in *closed* mode (rejects unknown fields).
+# Emits reports/instance_validation_failures.tsv and exits 1 on any ERROR.
+# Catches the same drift class that gave CultureMech 59k silent errors;
+# closed-mode + non-zero exit is what the per-file linkml-validate loop
+# above silently passes today. Use this for the corpus-wide health check.
+validate-strict *args:
+    uv run python scripts/validate_strict.py {{args}}
+
+# Audit every YAML-writing Python module under scripts/ and
+# src/communitymech/ for safeguards (curation_history append,
+# --dry-run/--apply, validates before write, wired into justfile).
+# Writes reports/pipeline_writers_audit.tsv. Useful for tracking
+# adoption of write_validated_community + record_curation_event.
+audit-writers *args:
+    uv run python scripts/audit_writers.py {{args}}
+
 # Validate evidence references in a community file
 validate-references FILE:
     uv run linkml-reference-validator validate data {{FILE}} -s src/communitymech/schema/communitymech.yaml --config conf/reference_validator.yaml
@@ -111,8 +127,8 @@ lint:
     uv run ruff check src/ tests/
     uv run mypy src/
 
-# Full QC (validate + lint + test)
-qc: validate-all validate-terms-all validate-references-all lint test
+# Full QC (validate + strict validate + lint + test)
+qc: validate-all validate-strict validate-terms-all validate-references-all lint test
     @echo "✅ All QC checks passed!"
 
 # Check which community strains are represented in UniProt reference proteomes
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+file	category	detail	path	message

reports/pipeline_writers_audit.tsv

Lines changed: 16 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,16 @@
1+
path writes_yaml appends_curation_history has_write_safeguard validates_before_write wired_into_just
2+
scripts/add_community_ids.py yes yes yes yes no
3+
scripts/add_evidence_source.py yes no yes no no
4+
scripts/apply_pmc_conversions.py yes yes yes yes no
5+
scripts/apply_strain_designations.py yes no no no no
6+
scripts/apply_taxonomy_corrections.py yes no no no no
7+
scripts/backfill_metals.py yes no yes no no
8+
scripts/enhance_strain_data.py yes no no no no
9+
scripts/fix_network_integrity.py yes yes yes yes no
10+
scripts/fix_reference_formats.py yes no yes no no
11+
scripts/intelligent_snippet_fixer.py yes no no no no
12+
scripts/link_growth_media.py yes yes yes yes yes
13+
src/communitymech/cli.py yes no yes no no
14+
src/communitymech/network/batch_reporter.py yes no no no no
15+
src/communitymech/network/llm_repair.py yes yes yes yes no
16+
src/communitymech/validation/write_validated.py yes no no yes no
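For tracking adoption, the audit TSV can be filtered for writers that still lack either safeguard. The snippet below is a small sketch using only the standard library; the helper name `unconverted` is hypothetical, and the sample embeds three rows copied from the table above rather than reading reports/pipeline_writers_audit.tsv from disk.

```python
import csv
import io

# Header plus three rows copied from the audit table above.
SAMPLE = (
    "path\twrites_yaml\tappends_curation_history\thas_write_safeguard\t"
    "validates_before_write\twired_into_just\n"
    "scripts/add_community_ids.py\tyes\tyes\tyes\tyes\tno\n"
    "scripts/backfill_metals.py\tyes\tno\tyes\tno\tno\n"
    "src/communitymech/validation/write_validated.py\tyes\tno\tno\tyes\tno\n"
)


def unconverted(tsv_text: str) -> list:
    """Return paths that do not both validate before write and append history."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        row["path"]
        for row in reader
        if row["validates_before_write"] != "yes"
        or row["appends_curation_history"] != "yes"
    ]


print(unconverted(SAMPLE))
```

On this sample the filter reports backfill_metals.py and also write_validated.py. The latter is the validated writer itself, so a real report would probably want to exclude it by path.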

scripts/add_community_ids.py

Lines changed: 52 additions & 14 deletions
@@ -1,11 +1,19 @@
 #!/usr/bin/env python
 """Add sequential CommunityMech IDs to all community YAML files."""
 
-import yaml
+import sys
 from pathlib import Path
 
+import yaml
+
+from communitymech.curate.curation_event import record_curation_event
+from communitymech.validation.write_validated import (
+    ValidationFailedError,
+    write_validated_community,
+)
 
-def add_ids_to_communities(communities_dir: Path = Path("kb/communities")):
+
+def add_ids_to_communities(communities_dir: Path = Path("kb/communities"), dry_run: bool = False):
     """Add sequential IDs to all community YAML files."""
     yaml_files = sorted(communities_dir.glob("*.yaml"))
 
@@ -19,26 +27,56 @@ def add_ids_to_communities(communities_dir: Path = Path("kb/communities")):
         with open(yaml_file) as f:
             data = yaml.safe_load(f)
 
-        # Add ID as first field
+        # If a file already has an id, leave it alone — this script is
+        # an id-minter, not a re-assignment tool. Without this guard the
+        # `{"id": ...}; data_with_id.update(data)` sequence below silently
+        # took the source file's existing id while the curation event
+        # still claimed "Assigned id=<new>" — misleading audit trail.
+        if data.get("id"):
+            print(f" · {yaml_file.name} already has id={data['id']}, skipping")
+            continue
+
+        # Add ID as first field (data has no `id` per the guard above,
+        # so the .update() will not clobber it).
         data_with_id = {"id": community_id}
         data_with_id.update(data)
 
-        # Write back with proper formatting
-        with open(yaml_file, "w") as f:
-            # Use default_flow_style=False for block style
-            yaml.dump(
-                data_with_id,
-                f,
-                default_flow_style=False,
-                sort_keys=False,
-                allow_unicode=True,
-                width=100,
+        # Record curation event before write
+        record_curation_event(
+            data_with_id,
+            curator="add_community_ids",
+            action="ASSIGN_COMMUNITY_ID",
+            changes=f"Assigned id={community_id}",
+        )
+
+        if dry_run:
+            print(f" (dry-run) would assign {community_id}{yaml_file.name}")
+            continue
+
+        # Write back via validated writer (replaces direct yaml.dump)
+        try:
+            write_validated_community(data_with_id, yaml_file)
+        except ValidationFailedError as exc:
+            print(
+                f" ✗ validation failed for {yaml_file.name}: {exc.summary()}",
+                file=sys.stderr,
             )
+            continue
 
         print(f" ✓ {community_id}{yaml_file.name}")
 
     print(f"\n✅ Added IDs to {len(yaml_files)} communities")
 
 
 if __name__ == "__main__":
-    add_ids_to_communities()
+    import argparse
+
+    parser = argparse.ArgumentParser(description="Add sequential CommunityMech IDs to YAML files")
+    parser.add_argument(
+        "--dry-run",
+        action="store_true",
+        help="Preview changes without modifying files",
+    )
+    args = parser.parse_args()
+
+    add_ids_to_communities(dry_run=args.dry_run)
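The guard added in this diff exists because of how dict.update() interacts with the "id first" trick. The standalone illustration below reproduces the pre-fix flow; the `assign_id` wrapper and the `CommunityMech:00000N` id strings are made up for the example and are not taken from the repo.

```python
def assign_id(data: dict, community_id: str) -> dict:
    """Pre-fix flow: seed the mapping with the new id, then merge the source over it."""
    data_with_id = {"id": community_id}
    data_with_id.update(data)  # an existing data["id"] overwrites community_id here
    return data_with_id


# File without an id: the new id lands first in key order, as intended.
fresh = assign_id({"name": "Community A"}, "CommunityMech:000001")
print(list(fresh))  # ['id', 'name']

# File that already has an id: the freshly minted id is silently discarded.
existing = assign_id({"id": "CommunityMech:000042", "name": "Community B"}, "CommunityMech:000002")
print(existing["id"])  # CommunityMech:000042
```

Because the written file keeps the old id in the second case, a curation event saying "Assigned id=<new>" would not match the data, which is why the script now skips such files instead.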

scripts/apply_pmc_conversions.py

Lines changed: 34 additions & 14 deletions
@@ -17,12 +17,16 @@
 - PMC4187173 → PMID:25369810
 """
 
-import yaml
-import re
+import sys
 from pathlib import Path
-from typing import Dict, List
-from collections import defaultdict
 
+import yaml
+
+from communitymech.curate.curation_event import record_curation_event
+from communitymech.validation.write_validated import (
+    ValidationFailedError,
+    write_validated_community,
+)
 
 # Known PMC → PMID conversions from special_references_report.txt
 PMC_TO_PMID = {
@@ -38,10 +42,10 @@
 }
 
 
-def apply_pmc_conversions(yaml_path: Path, dry_run: bool = True) -> Dict:
+def apply_pmc_conversions(yaml_path: Path, dry_run: bool = True) -> dict:
     """Apply PMC→PMID conversions to a YAML file"""
 
-    with open(yaml_path, 'r') as f:
+    with open(yaml_path) as f:
         data = yaml.safe_load(f)
 
     changes = []
@@ -120,14 +124,30 @@ def apply_pmc_conversions(yaml_path: Path, dry_run: bool = True) -> Dict:
         backup_path = yaml_path.with_suffix('.yaml.bak_pmc')
         yaml_path.rename(backup_path)
 
-        # Write updated
-        with open(yaml_path, 'w') as f:
-            yaml.dump(data, f,
-                      default_flow_style=False,
-                      sort_keys=False,
-                      allow_unicode=True,
-                      width=120,
-                      indent=2)
+        # Record curation event before writing
+        contexts = sorted({ctx for _, _, ctx in changes})
+        record_curation_event(
+            data,
+            curator="apply_pmc_conversions",
+            action="CONVERT_PMC_TO_PMID",
+            changes=(
+                f"Converted {len(changes)} PMC reference(s) to PMID across "
+                f"{', '.join(contexts) if contexts else 'no'} section(s)"
+            ),
+        )
+
+        # Write updated via validated writer. Restore the backup on
+        # validation failure so the original file isn't left missing
+        # (it's still on disk as backup_path until the write succeeds).
+        try:
+            write_validated_community(data, yaml_path)
+        except ValidationFailedError as exc:
+            backup_path.rename(yaml_path)
+            print(
+                f" ✗ validation failed for {yaml_path.name}: {exc.summary()} "
+                "(original restored from backup)",
+                file=sys.stderr,
+            )
 
     return {
         'file': yaml_path.name,
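The rename-then-write flow in the last hunk can be exercised in isolation. This is a sketch under stated assumptions: `ValidationFailedError` here is a local stand-in for the repo's exception, `write_validated_stub` is a hypothetical writer that only rejects empty content (the real helper runs LinkML validation), and `replace_with_backup` is a name invented for the example.

```python
import tempfile
from pathlib import Path


class ValidationFailedError(Exception):
    """Local stand-in for the repo's exception of the same name."""


def write_validated_stub(text: str, path: Path) -> None:
    """Hypothetical writer: refuse empty content, otherwise write it."""
    if not text.strip():
        raise ValidationFailedError("empty document")
    path.write_text(text)


def replace_with_backup(path: Path, new_text: str) -> bool:
    """Move the original aside, write the new content, restore on failure."""
    backup = path.with_suffix(path.suffix + ".bak")
    path.rename(backup)  # from here on, `path` does not exist on disk
    try:
        write_validated_stub(new_text, path)
    except ValidationFailedError:
        backup.rename(path)  # put the original back before reporting
        return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "community.yaml"
    target.write_text("name: original\n")
    ok = replace_with_backup(target, "")  # the stub rejects empty content
    print(ok, target.exists(), target.read_text() == "name: original\n")
```

Without the restore step, the failed write would leave only the backup on disk, which is the situation the Copilot review flagged.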

scripts/audit_writers.py

Lines changed: 174 additions & 0 deletions
@@ -0,0 +1,174 @@
+#!/usr/bin/env python3
+"""Audit YAML-writing scripts in CommunityMech.
+
+For every Python module under ``scripts/`` and the central
+``src/communitymech/`` package that writes a YAML (looks for
+``yaml.dump``, ``yaml.safe_dump``, ``write_validated_community``, or a
+``path.write_text(yaml.safe_dump(...))`` flow), record:
+
+- ``appends_curation_history``: does the script append a CurationEvent
+  to ``community['curation_history']``?
+- ``has_write_safeguard``: a ``--dry-run`` opt-out OR ``--apply``/``--write``
+  opt-in flag.
+- ``validates_before_write``: does it route through
+  ``write_validated_community`` or call ``validate_community`` /
+  ``linkml-validate`` first?
+- ``wired_into_just``: is the script invoked from a justfile recipe?
+
+TSV columns: path, writes_yaml, appends_curation_history,
+has_write_safeguard, validates_before_write, wired_into_just.
+
+Output: TSV to stdout (and via ``--out`` to a file).
+
+Ported from CultureMech / MediaIngredientMech / TraitMech.
+"""
+
+from __future__ import annotations
+
+import argparse
+import csv
+import re
+import sys
+from pathlib import Path
+
+SEARCH_DIRS = [
+    Path("scripts"),
+    Path("src/communitymech"),
+]
+
+# Patterns
+_WRITE_TEXT_OF_YAML = re.compile(r"\.write_text\s*\(\s*yaml\.(?:safe_)?dump")
+_CURATION_APPEND = re.compile(
+    r"curation_history.*?(append|\+=|\.insert)"
+    r"|['\"]curator['\"]\s*:"
+    r"|append_curation_event"
+    r"|record_curation_event"
+)
+_WRITE_SAFEGUARD = re.compile(
+    r"--dry[-_]run|dry_run\s*[:=]"
+    r"|--apply\b|args\.apply\b"
+    r"|--write\b|args\.write\b"
+)
+_VALIDATE_BEFORE_WRITE = re.compile(
+    r"linkml[._-]?validate"
+    r"|validate_community\("
+    r"|validator\.validate\("
+    r"|write_validated_community\("
+)
+
+
+def script_paths() -> list[Path]:
+    out: list[Path] = []
+    for d in SEARCH_DIRS:
+        if not d.exists():
+            continue
+        out.extend(sorted(p for p in d.rglob("*.py") if "__pycache__" not in str(p)))
+    return out
+
+
+def looks_like_yaml_writer(text: str) -> bool:
+    if "yaml.safe_dump(" in text or "yaml.dump(" in text:
+        return True
+    if _WRITE_TEXT_OF_YAML.search(text):
+        return True
+    # write_validated_community is the closed-schema-gated wrapper that
+    # callers route through instead of yaml.dump directly.
+    return "write_validated_community(" in text
+
+
+def audit(path: Path, justfile_text: str) -> dict | None:
+    # Suppress self-match: this module's regex source contains
+    # `yaml.safe_dump` etc., so it would otherwise appear in its own output.
+    if path.resolve() == Path(__file__).resolve():
+        return None
+    try:
+        text = path.read_text()
+    except (UnicodeDecodeError, OSError):
+        return None
+    if not looks_like_yaml_writer(text):
+        return None
+    return {
+        "path": str(path),
+        "writes_yaml": "yes",
+        "appends_curation_history": "yes" if _CURATION_APPEND.search(text) else "no",
+        "has_write_safeguard": "yes" if _WRITE_SAFEGUARD.search(text) else "no",
+        "validates_before_write": "yes" if _VALIDATE_BEFORE_WRITE.search(text) else "no",
+        "wired_into_just": "yes" if _is_wired_into_just(path, justfile_text) else "no",
+    }
+
+
+def _is_wired_into_just(path: Path, justfile_text: str) -> bool:
+    """Detect whether a justfile recipe actually invokes this script.
+
+    The earlier substring check (``path.stem in justfile_text``) had false
+    positives — e.g. ``write_validated.py`` matched a justfile comment
+    referencing ``write_validated_community``. Require the filename to
+    appear as an explicit ``python ... <name>.py`` invocation, which is
+    how every justfile recipe actually runs a script.
+    """
+    needle = re.compile(rf"\b{re.escape(path.name)}\b")
+    for line in justfile_text.splitlines():
+        stripped = line.strip()
+        # Ignore comment-only lines so a mention in docs doesn't count.
+        if stripped.startswith("#"):
+            continue
+        if needle.search(stripped):
+            return True
+    return False
+
+
+def main() -> int:
+    ap = argparse.ArgumentParser(description=__doc__)
+    ap.add_argument("--out", type=Path, default=None, help="TSV output path (default stdout)")
+    args = ap.parse_args()
+
+    justfile_path = Path("justfile")
+    justfile_text = justfile_path.read_text() if justfile_path.exists() else ""
+
+    rows: list[dict] = []
+    for p in script_paths():
+        row = audit(p, justfile_text)
+        if row is not None:
+            rows.append(row)
+
+    fields = [
+        "path",
+        "writes_yaml",
+        "appends_curation_history",
+        "has_write_safeguard",
+        "validates_before_write",
+        "wired_into_just",
+    ]
+
+    if args.out:
+        args.out.parent.mkdir(parents=True, exist_ok=True)
+        with args.out.open("w", newline="") as fh:
+            w = csv.DictWriter(fh, fieldnames=fields, delimiter="\t", lineterminator="\n")
+            w.writeheader()
+            for row in rows:
+                w.writerow(row)
+        print(f"Wrote {len(rows)} rows to {args.out}", file=sys.stderr)
+    else:
+        w = csv.DictWriter(sys.stdout, fieldnames=fields, delimiter="\t")
+        w.writeheader()
+        for row in rows:
+            w.writerow(row)
+
+    def count(field: str, val: str) -> int:
+        return sum(1 for r in rows if r[field] == val)
+
+    print("", file=sys.stderr)
+    print(f"=== writers audit summary ({len(rows)} writers) ===", file=sys.stderr)
+    print(f" appends curation_history: {count('appends_curation_history', 'yes')} / {len(rows)}",
+          file=sys.stderr)
+    print(f" has write safeguard: {count('has_write_safeguard', 'yes')} / {len(rows)}",
+          file=sys.stderr)
+    print(f" validates before write: {count('validates_before_write', 'yes')} / {len(rows)}",
+          file=sys.stderr)
+    print(f" wired into justfile: {count('wired_into_just', 'yes')} / {len(rows)}",
+          file=sys.stderr)
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
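The difference between the old substring check and the per-line check in `_is_wired_into_just` shows up on a three-line justfile excerpt. The sketch below re-implements both checks as standalone functions (`substring_check` and `per_line_check` are names chosen for the example); the justfile text is copied from the recipe added in this commit.

```python
import re
from pathlib import Path

JUSTFILE = """\
# adoption of write_validated_community + record_curation_event.
audit-writers *args:
    uv run python scripts/audit_writers.py {{args}}
"""


def substring_check(path: Path, justfile_text: str) -> bool:
    """Old behavior: any mention of the stem counts, comments included."""
    return path.stem in justfile_text


def per_line_check(path: Path, justfile_text: str) -> bool:
    """New behavior: full filename, word-bounded, on a non-comment line."""
    needle = re.compile(rf"\b{re.escape(path.name)}\b")
    return any(
        needle.search(line)
        for line in justfile_text.splitlines()
        if not line.strip().startswith("#")
    )


helper = Path("src/communitymech/validation/write_validated.py")
script = Path("scripts/audit_writers.py")

print(substring_check(helper, JUSTFILE))  # True: the comment mentions write_validated_community
print(per_line_check(helper, JUSTFILE))   # False: no recipe line names write_validated.py
print(per_line_check(script, JUSTFILE))   # True: invoked by the audit-writers recipe
```

Note that the per-line check still accepts any non-comment mention of the filename, so it is somewhat looser than the "explicit python invocation" wording in the docstring suggests.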
