Convert 3 more CommunityMech writers (intelligent_snippet_fixer + enhance_strain_data + add_evidence_source) #87
Merged
…d helpers

Brings CommunityMech writer coverage from 5/16 to 8/16. Continues the pattern established in PR #84 and refined by PR #85 (restore-on-failure backup handling): every script that mutates a community YAML loads through yaml.safe_load, mutates the dict, records a CurationEvent via record_curation_event(), and writes back via write_validated_community(), which gates on closed-schema LinkML validation.

Converted scripts:

- scripts/intelligent_snippet_fixer.py (LLM-driven snippet repair; llm_assisted=True; action=FIX_SNIPPETS_LLM). Uses skip_if_recent=True on the curation event so a session of auto-approved fixes collapses into a single trail entry instead of one per snippet. The existing .yaml.bak_intelligent backup created at session start by shutil.copy2 remains the user-visible safety net.
- scripts/enhance_strain_data.py (strain-ID enrichment; action=ENHANCE_STRAIN_DATA). Previously the script extracted strain data but only emitted a copy-paste snippets file; this PR adds an --apply mode that writes strain_designation entries back into matching taxonomy[*] entries via write_validated_community(). Default behavior preserves the historical extract-only flow (no kb/communities writes without --apply). --overwrite controls whether to replace existing curator-authored strain_designation values.
- scripts/add_evidence_source.py (evidence_source enum backfill; action=BACKFILL_EVIDENCE_SOURCE). Uses the backup-then-rename pattern from PR #85: the original is moved to .yaml.bak_source before the validated write; on ValidationFailedError the backup is renamed back in place so the batch loop can continue without leaving a half-written community on disk.

Each per-record loop continues on ValidationFailedError so one bad file can't kill the batch. CLI surfaces (--auto, --interactive, --dry-run, --auto-approve, --file, --apply, etc.) preserved.

After-state: scripts/audit_writers.py reports 8/16 writers gated (was 5/16).

The remaining un-converted writers (apply_strain_designations, apply_taxonomy_corrections, backfill_metals, clean_metals_inplace, fix_reference_formats, plus a handful of smaller src/ writers) follow the same conversion pattern; left as future work to keep this PR focused.

Note: scripts/add_evidence_source.py and scripts/intelligent_snippet_fixer.py import communitymech.literature_enhanced (a module that does not currently exist in the repo) at module top-level. This PR does not introduce or fix that; the scripts have always failed at the import step when invoked from CLI. Out of scope for this conversion; tracked separately.

Baseline (unchanged):

- validate_strict: 0 ERROR rows / 265 files
- pytest tests/: 136 passed, 9 skipped

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
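The record, validated-write, and continue-on-failure sequence described in this commit message can be sketched as follows. This is a minimal illustration, not the project's code: `record_curation_event`, `write_validated_community`, and `ValidationFailedError` are local stand-ins that only mimic the names and keyword arguments quoted above, while `ALLOWED_KEYS`, `convert_batch`, the event's `date` field, and the `curator`/`action` values are hypothetical. YAML loading is left out so the sketch stays stdlib-only.

```python
from datetime import date
from pathlib import Path


class ValidationFailedError(Exception):
    """Stand-in for the project's error of the same name."""


# Hypothetical closed schema: any top-level key outside this set is rejected.
ALLOWED_KEYS = {"name", "taxonomy", "curation_history"}


def record_curation_event(doc, *, curator, action, changes, llm_assisted=False):
    """Stand-in: append one audit-trail entry to the document in place."""
    doc.setdefault("curation_history", []).append({
        "date": date.today().isoformat(),
        "curator": curator,
        "action": action,
        "changes": changes,
        "llm_assisted": llm_assisted,
    })


def write_validated_community(doc, path):
    """Stand-in: refuse to write a document that drifts from the schema."""
    unknown = set(doc) - ALLOWED_KEYS
    if unknown:
        raise ValidationFailedError(f"{path}: unknown keys {sorted(unknown)}")
    Path(path).write_text(repr(doc))  # the real writer emits YAML


def convert_batch(docs_by_path, mutate):
    """Apply `mutate` to each document; one invalid file must not stop the rest."""
    written, failed = [], []
    for path, doc in docs_by_path.items():
        changes = mutate(doc)
        if not changes:
            continue  # nothing changed, so no event and no write
        record_curation_event(doc, curator="example-script",
                              action="EXAMPLE_ACTION", changes=changes)
        try:
            write_validated_community(doc, path)
            written.append(path)
        except ValidationFailedError:
            failed.append(path)  # note the failure and move on to the next file
    return written, failed
```

Collecting failures in a list rather than re-raising is one way to get the "one bad file can't kill the batch" behavior; the converted scripts may report failures differently.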
PR #86 (Convert clean_metals_inplace.py) just merged into main. After rebasing, scripts/clean_metals_inplace.py is now gated, so the appends_curation_history / validates_before_write columns for it flip to yes. Re-running scripts/audit_writers.py produces a 1-row delta; commit it so the report reflects the actual post-rebase state.

Combined post-merge: 9/16 appends_curation_history, 10/16 validates_before_write (was 5/16 and 6/16 respectively at the start of this PR series).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
7aa14dc to a9f062a
Pull request overview
Continues the audit-machinery rollout from PR #84/#85 by converting three more writers (intelligent_snippet_fixer.py, enhance_strain_data.py, add_evidence_source.py) to route through record_curation_event + write_validated_community, bringing coverage from 5/16 to 8/16.
Changes:
- Convert three writers to use the curation-event + closed-schema-validated writer pattern, each with a script-specific action label
- Add `--apply` / `--overwrite` / `--kb-dir` flags to `enhance_strain_data.py` (previously extract-only); the historical default still writes nothing to `kb/communities/`
- Refresh `reports/pipeline_writers_audit.tsv` so the three converted scripts now show `appends_curation_history=yes` and `validates_before_write=yes`
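The `--overwrite` semantics in the second bullet (existing curator-authored values are kept unless the flag is set) could look roughly like the sketch below. It is an illustration under assumptions: the function name, the `strain_by_taxon` mapping, and matching taxonomy entries on a `name` key are hypothetical; only `taxonomy` and `strain_designation` come from the PR text.

```python
def apply_strain_designations(doc, strain_by_taxon, overwrite=False):
    """Copy strain designations into matching taxonomy entries.

    Returns the number of entries that were changed.
    """
    changed = 0
    for entry in doc.get("taxonomy", []):
        # Hypothetical match key: the entry's "name" field.
        new_value = strain_by_taxon.get(entry.get("name"))
        if new_value is None:
            continue  # no extracted strain data for this taxon
        existing = entry.get("strain_designation")
        if existing and not overwrite:
            continue  # keep the curator-authored value
        if existing != new_value:
            entry["strain_designation"] = new_value
            changed += 1
    return changed
```

Returning the change count lets a caller skip both the curation event and the write when nothing was modified.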
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| scripts/intelligent_snippet_fixer.py | Wrap snippet-fix write in record_curation_event(skip_if_recent=True, llm_assisted=True) + write_validated_community; pre-existing .yaml.bak_intelligent backup remains the user-facing safety net |
| scripts/enhance_strain_data.py | Add apply_strain_data_to_community method and --apply / --overwrite / --kb-dir CLI flags; preserves curator-authored strain_designation values unless --overwrite is set |
| scripts/add_evidence_source.py | Summarize per-file changes in a curation event, rename original to .yaml.bak_source, write via validated writer, restore backup on ValidationFailedError |
| reports/pipeline_writers_audit.tsv | Update three rows to reflect new appends_curation_history / validates_before_write status |
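The backup-then-restore behavior listed for `scripts/add_evidence_source.py` in the table above can be sketched as follows. This is a hedged illustration, not the script's code: `write_with_restore` and the `write_validated` callable are hypothetical, and `ValidationFailedError` is a local stand-in for the project's exception. Only the `.yaml.bak_source` naming and the rename-back-on-failure idea come from the PR.

```python
from pathlib import Path


class ValidationFailedError(Exception):
    """Stand-in for the project's error of the same name."""


def write_with_restore(path, new_text, write_validated, backup_suffix=".bak_source"):
    """Move the original aside, attempt a validated write, restore on failure.

    `write_validated(text, path)` is a hypothetical callable that either
    writes the file or raises ValidationFailedError.
    """
    path = Path(path)
    backup = path.with_name(path.name + backup_suffix)  # x.yaml -> x.yaml.bak_source
    path.replace(backup)  # rename; the original is now the backup
    try:
        write_validated(new_text, path)
    except ValidationFailedError:
        path.unlink(missing_ok=True)  # drop any partial output
        backup.replace(path)  # put the original back in place
        return False
    return True
```

On success the backup stays on disk next to the new file; on failure the directory looks exactly as it did before the call, which is what lets a batch loop move on to the next file.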
realmarcin added a commit that referenced this pull request on May 26, 2026
* Fix broken literature_enhanced imports in two writer scripts

  scripts/add_evidence_source.py and scripts/intelligent_snippet_fixer.py both import EnhancedLiteratureFetcher from communitymech.literature_enhanced, a module that was never committed to git (only a stale .pyc was shadowing the missing source locally). Both scripts have raised ModuleNotFoundError on import for as long as anyone has tried to run them, which was surfaced as a pre-existing-state heads-up by the recent writer-conversion PR #87.

  Swap to LiteratureFetcher from communitymech.literature, which exposes the same fetch_pubmed_abstract + fetch_paper surface plus a richer DOI fallback chain (CrossRef → PubMed via DOI lookup → PMC full-text → OpenAlex → Semantic Scholar → Europe PMC → publisher meta-tag scrape) that subsumes what fetch_abstract_for_doi did.

  API differences:

  - fetch_paper returns (abstract, pdf_url), not a dict; tuple-unpack at call sites.
  - LiteratureFetcher.fetch_paper has no download_pdf kwarg (the older version's flag was a no-op in the LiteratureFetcher pipeline; the pdf URL is just returned alongside the abstract).
  - Title field is unavailable separately. In add_evidence_source.py's guess_evidence_source classifier the title was filter(None, …)-merged with snippet and abstract anyway; losing it degrades classification marginally (PubMed abstracts include the title in the abstract text, so PMID references are unaffected). If richer DOI classification is needed later, LiteratureFetcher.fetch_doi_metadata() returns CrossRef metadata with a title field.

  After-state: both scripts now import and run their initialization paths cleanly. pytest tests/ still passes (136 passed, 9 skipped).

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Address Copilot review: drop dead title param from guess_evidence_source

  Copilot flagged that title was assigned None and then passed through guess_evidence_source as a parameter that the classifier merged into its keyword-matching text via filter(None, ...). With title always None the parameter was dead code that just clutters the call sites.

  Remove the title parameter from guess_evidence_source and from both caller blocks. PubMed abstracts already embed the title in the abstract text (so PMID-driven classification is unchanged), and CrossRef titles for DOI references are available via LiteratureFetcher.fetch_doi_metadata() if richer classification is wanted later; that's now a clear future-work hook rather than a hard-coded-None pretense.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
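The call-site change described in the commit above (tuple-unpacking the `fetch_paper` result and merging only snippet and abstract once the title parameter is gone) might look like the sketch below. `StubLiteratureFetcher`, `build_classifier_text`, the reference string, and the sample texts are hypothetical placeholders; only the `(abstract, pdf_url)` return shape and the `filter(None, ...)` merge come from the commit message.

```python
class StubLiteratureFetcher:
    """Hypothetical stand-in that only mimics the (abstract, pdf_url) return shape."""

    def fetch_paper(self, reference):
        # Placeholder data; a real fetcher would look the reference up.
        return ("Metagenomic survey of a biofilm community.", None)


def build_classifier_text(snippet, abstract):
    """Join the available text fields, skipping None or empty ones."""
    return " ".join(filter(None, [snippet, abstract])).lower()


fetcher = StubLiteratureFetcher()

# Tuple-unpack at the call site instead of indexing into a dict.
abstract, pdf_url = fetcher.fetch_paper("doi:10.0000/example")  # placeholder reference

text = build_classifier_text("16S amplicon reads", abstract)
```

Because `filter(None, ...)` drops missing fields, a reference with no abstract still yields usable text from the snippet alone.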
Summary
Continues the audit-machinery rollout from #84 (initial port: 5 writers) and #85 (audit-detector blind spot fix + restore-on-failure pattern). Brings CommunityMech writer coverage from 5/16 to 8/16 gated through `record_curation_event` + `write_validated_community`.

Each converted script now:

- Loads the community YAML through `yaml.safe_load`
- Records the change via `record_curation_event(doc, curator=..., action=..., changes=..., llm_assisted=...)`
- Writes back via `write_validated_community(doc, path)`; the closed-schema LinkML gate refuses any drift
- Continues past `ValidationFailedError` per record, so one bad doc can't kill a batch

Conversions

| Script | Action | Notes |
|---|---|---|
| `scripts/intelligent_snippet_fixer.py` | `FIX_SNIPPETS_LLM` | Uses `skip_if_recent=True` so a session of auto-approved snippet fixes collapses into one curation_history entry. Pre-existing `.yaml.bak_intelligent` backup (from `shutil.copy2` at session start) is unchanged and is still the user-visible safety net. |
| `scripts/enhance_strain_data.py` | `ENHANCE_STRAIN_DATA` | Adds an `--apply` mode that writes `strain_designation` into matching `taxonomy[*]` entries via the validated writer. Default behavior preserves the historical extract-only flow (no `kb/communities` writes without `--apply`). `--overwrite` controls whether to replace existing curator-authored values. |
| `scripts/add_evidence_source.py` | `BACKFILL_EVIDENCE_SOURCE` | The original is moved to `.yaml.bak_source` before the validated write; on `ValidationFailedError` the backup is renamed back into place so the batch loop continues cleanly. |

All existing CLI flags preserved (`--auto`, `--interactive`, `--dry-run`, `--auto-approve`, `--only-invalid`, `--file`, `--verbose`, `--relaxed`).

After-state audit

The remaining un-converted writers (`apply_strain_designations.py`, `apply_taxonomy_corrections.py`, `backfill_metals.py`, `clean_metals_inplace.py` [in flight on a sibling PR], `fix_reference_formats.py`, plus a couple of `src/` writers) follow the same pattern; left as future work to keep this PR focused.

Heads-up: pre-existing import errors (not changed by this PR)

`scripts/add_evidence_source.py` and `scripts/intelligent_snippet_fixer.py` import `communitymech.literature_enhanced` at the module top level, a module that does not exist in this repo and never has. Both scripts have always failed at the import step when invoked from CLI. This PR does not introduce or fix that; it's tracked separately. The conversion itself (imports + writer routing) is independently verifiable by ruff + ast-parse, which is what this PR verifies.

Baseline (unchanged)

- `uv run python scripts/validate_strict.py --quiet`
- `uv run pytest tests/ -q`
- `uv run ruff check scripts/{add_evidence_source,enhance_strain_data,intelligent_snippet_fixer}.py`

Test plan

- `uv run python -c "import ast; ast.parse(open('scripts/<name>.py').read())"`: all three parse
- `uv run ruff check scripts/<name>.py`: net error reduction (removed pre-existing unused imports); no new findings
- `uv run python scripts/audit_writers.py`: all 3 now show `appends_curation_history=yes` AND `validates_before_write=yes`
- `uv run python scripts/validate_strict.py --quiet`: 0 ERROR rows / 265 files
- `uv run pytest tests/`: 136 passed, 9 skipped
- `uv run python scripts/enhance_strain_data.py --help`: new `--apply` / `--overwrite` / `--kb-dir` flags surface cleanly

🤖 Generated with Claude Code
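The ast-parse step in the test plan checks syntax without importing the module, which matters here because two of the scripts fail at import time. A minimal sketch of running that check over several files at once, assuming a hypothetical helper named `syntax_errors`:

```python
import ast
from pathlib import Path


def syntax_errors(paths):
    """Return {path: message} for every file that fails to parse."""
    errors = {}
    for path in paths:
        try:
            # Parsing never executes the file, so a broken import inside it
            # does not matter for this check.
            ast.parse(Path(path).read_text(), filename=str(path))
        except SyntaxError as exc:
            errors[str(path)] = f"line {exc.lineno}: {exc.msg}"
    return errors
```

An empty result means every file parsed; it says nothing about whether the imports inside those files resolve.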