Convert 3 more CommunityMech writers (intelligent_snippet_fixer + enhance_strain_data + add_evidence_source) #87

Merged: realmarcin merged 2 commits into main from convert/three-more-writers on May 26, 2026

@realmarcin
Contributor

Summary

Continues the audit-machinery rollout from #84 (initial port: 5 writers) and #85 (audit-detector blind spot fix + restore-on-failure pattern). Raises the number of CommunityMech writers gated through record_curation_event + write_validated_community from 5/16 to 8/16.

Each converted script now:

  1. Loads the community YAML via yaml.safe_load
  2. Mutates the in-memory dict
  3. Calls record_curation_event(doc, curator=..., action=..., changes=..., llm_assisted=...)
  4. Writes via write_validated_community(doc, path) — closed-schema LinkML gate refuses any drift
  5. Catches ValidationFailedError per-record so one bad doc can't kill a batch
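
The five steps above can be sketched as follows. This is a minimal illustration, not the repository's code: record_curation_event, write_validated_community and ValidationFailedError are local stand-ins that only mirror the names used in this PR, the "closed schema" is a toy key whitelist, and JSON replaces yaml.safe_load so the sketch runs on the standard library alone.

```python
import json
import tempfile
from pathlib import Path

# Toy closed schema: any other top-level key counts as drift (assumption).
ALLOWED_KEYS = {"name", "taxonomy", "curation_history"}


class ValidationFailedError(Exception):
    """Stand-in for the project's validation error."""


def record_curation_event(doc, *, curator, action, changes, llm_assisted=False):
    """Stand-in: append one audit entry to doc['curation_history']."""
    doc.setdefault("curation_history", []).append(
        {"curator": curator, "action": action,
         "changes": changes, "llm_assisted": llm_assisted}
    )


def write_validated_community(doc, path):
    """Stand-in: validate first; touch the disk only if validation passes."""
    unknown = sorted(set(doc) - ALLOWED_KEYS)
    if unknown:
        raise ValidationFailedError(f"{path}: unexpected keys {unknown}")
    Path(path).write_text(json.dumps(doc, indent=2))


def add_taxon(doc):
    """Example mutation; returns a change summary for the audit trail."""
    doc["taxonomy"].append({"name": "Example taxon"})
    return "added 1 taxonomy entry"


def process_batch(paths):
    """Run the load/mutate/record/write cycle per file; collect failures."""
    written, failed = [], []
    for path in paths:
        doc = json.loads(Path(path).read_text())      # 1. load
        changes = add_taxon(doc)                      # 2. mutate in memory
        record_curation_event(                        # 3. record the event
            doc, curator="example_script",
            action="EXAMPLE_ACTION", changes=changes,
        )
        try:
            write_validated_community(doc, path)      # 4. gated write
        except ValidationFailedError as exc:          # 5. isolate the failure
            print(f"skipped: {exc}")
            failed.append(path)
            continue
        written.append(path)
    return written, failed


with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.json"
    bad = Path(tmp) / "bad.json"
    good.write_text(json.dumps({"name": "A", "taxonomy": []}))
    bad_original = json.dumps({"name": "B", "taxonomy": [], "stray_field": 1})
    bad.write_text(bad_original)

    written, failed = process_batch([good, bad])
    good_after = json.loads(good.read_text())
    bad_unchanged = bad.read_text() == bad_original
```

In this sketch the rejected document is never written, so the file on disk keeps its previous content and the loop moves on to the next record.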

Conversions

| Script | Action label | LLM | Notes |
| --- | --- | --- | --- |
| scripts/intelligent_snippet_fixer.py | FIX_SNIPPETS_LLM | yes | Uses skip_if_recent=True so a session of auto-approved snippet fixes collapses into one curation_history entry. The pre-existing .yaml.bak_intelligent backup (from shutil.copy2 at session start) is unchanged and remains the user-visible safety net. |
| scripts/enhance_strain_data.py | ENHANCE_STRAIN_DATA | no | Previously extract-only; added an --apply mode that writes strain_designation into matching taxonomy[*] entries via the validated writer. Default behavior preserves the historical extract-only flow (no kb/communities writes without --apply). --overwrite controls whether to replace existing curator-authored values. |
| scripts/add_evidence_source.py | BACKFILL_EVIDENCE_SOURCE | no | Backup-then-rename pattern from #85: the original is renamed to .yaml.bak_source before the validated write; on ValidationFailedError the backup is renamed back into place so the batch loop continues cleanly. |
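
The backup-then-rename behavior described for add_evidence_source.py might look roughly like the sketch below. It is an illustration under stated assumptions rather than the script's actual code: write_with_restore and toy_validated_write are hypothetical names, and the toy writer rejects any text containing "INVALID" in place of real LinkML validation.

```python
import tempfile
from pathlib import Path


class ValidationFailedError(Exception):
    """Stand-in for the project's validation error."""


def toy_validated_write(text, path):
    """Hypothetical writer: rejects any text containing 'INVALID'."""
    if "INVALID" in text:
        raise ValidationFailedError(f"{path} failed validation")
    Path(path).write_text(text)


def write_with_restore(path, new_text, writer=toy_validated_write):
    """Move the original aside, write, and restore the original on failure."""
    path = Path(path)
    backup = path.with_name(path.name + ".bak_source")
    path.rename(backup)                   # the original becomes the backup
    try:
        writer(new_text, path)
    except ValidationFailedError:
        path.unlink(missing_ok=True)      # drop any partial output
        backup.rename(path)               # put the original back in place
        return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "community.yaml"
    backup = Path(tmp) / "community.yaml.bak_source"

    # Failed write: the original content must be back at the target path.
    target.write_text("name: original\n")
    ok_bad = write_with_restore(target, "INVALID\n")
    restored_text = target.read_text()
    backup_left_after_failure = backup.exists()

    # Successful write: new content at the target, original kept as backup.
    ok_good = write_with_restore(target, "name: updated\n")
    updated_text = target.read_text()
    backup_text = backup.read_text()
```

Returning a boolean instead of re-raising lets a batch loop count the failure and carry on with the next file.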

All existing CLI flags preserved (--auto, --interactive, --dry-run, --auto-approve, --only-invalid, --file, --verbose, --relaxed).

After-state audit

=== writers audit summary (16 writers) ===
  appends curation_history:   8 / 16   (was 5/16)
  has write safeguard:        11 / 16
  validates before write:     9 / 16   (was 6/16)
  wired into justfile:        1 / 16

The remaining un-converted writers (apply_strain_designations.py, apply_taxonomy_corrections.py, backfill_metals.py, clean_metals_inplace.py [in flight on a sibling PR], fix_reference_formats.py, plus a couple of src/ writers) follow the same pattern; left as future work to keep this PR focused.

Heads-up: pre-existing import errors (not changed by this PR)

scripts/add_evidence_source.py and scripts/intelligent_snippet_fixer.py import communitymech.literature_enhanced at the module top level — a module that does not exist in this repo and never has. Both scripts have always failed at the import step when invoked from CLI. This PR does not introduce or fix that; it's tracked separately. The conversion itself (imports + writer routing) is independently verifiable by ruff + ast-parse, which is what this PR verifies.

Baseline (unchanged)

| Check | Result |
| --- | --- |
| uv run python scripts/validate_strict.py --quiet | 0 ERROR rows / 265 files |
| uv run pytest tests/ -q | 136 passed, 9 skipped, 7 deselected |
| uv run ruff check scripts/{add_evidence_source,enhance_strain_data,intelligent_snippet_fixer}.py | net -4 errors (all I001 / F401 from removing unused imports); no new findings |

Test plan

  • uv run python -c "import ast; ast.parse(open('scripts/<name>.py').read())" — all three parse
  • uv run ruff check scripts/<name>.py — net error reduction (removed pre-existing unused imports); no new findings
  • uv run python scripts/audit_writers.py — all 3 now show appends_curation_history=yes AND validates_before_write=yes
  • uv run python scripts/validate_strict.py --quiet — 0 ERROR rows / 265 files
  • uv run pytest tests/ — 136 passed, 9 skipped
  • uv run python scripts/enhance_strain_data.py --help — new --apply / --overwrite / --kb-dir flags surface cleanly

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings May 26, 2026 02:14
realmarcin and others added 2 commits May 25, 2026 19:15
…d helpers

Brings CommunityMech writer coverage from 5/16 to 8/16. Continues the
pattern established in PR #84 and refined by PR #85 (restore-on-failure
backup handling): every script that mutates a community YAML loads
through yaml.safe_load, mutates the dict, records a CurationEvent via
record_curation_event(), and writes back via write_validated_community()
which gates on closed-schema LinkML validation.

Converted scripts:
- scripts/intelligent_snippet_fixer.py (LLM-driven snippet repair;
  llm_assisted=True; action=FIX_SNIPPETS_LLM). Uses skip_if_recent=True
  on the curation event so a session of auto-approved fixes collapses
  into a single trail entry instead of one per snippet. The existing
  .yaml.bak_intelligent backup created at session start by
  shutil.copy2 remains the user-visible safety net.
- scripts/enhance_strain_data.py (strain-ID enrichment;
  action=ENHANCE_STRAIN_DATA). Previously the script extracted strain
  data but only emitted a copy-paste snippets file; this PR adds an
  --apply mode that writes strain_designation entries back into
  matching taxonomy[*] entries via write_validated_community(). Default
  behavior preserves the historical extract-only flow (no kb/communities
  writes without --apply). --overwrite controls whether to replace
  existing curator-authored strain_designation values.
- scripts/add_evidence_source.py (evidence_source enum backfill;
  action=BACKFILL_EVIDENCE_SOURCE). Uses the backup-then-rename pattern
  from PR #85 — the original is moved to .yaml.bak_source before the
  validated write; on ValidationFailedError the backup is renamed back
  in place so the batch loop can continue without leaving a half-written
  community on disk.
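
One way the skip_if_recent collapsing could work is sketched here. The real semantics of record_curation_event are not shown in this PR, so everything below is an assumption: record_event is a hypothetical stand-in, "recent" is taken to mean the same curator and action within a 30-minute window, and the now parameter exists only to make the example deterministic.

```python
from datetime import datetime, timedelta, timezone


def record_event(doc, *, curator, action, changes,
                 skip_if_recent=False, window=timedelta(minutes=30), now=None):
    """Append a history entry unless a matching recent one already exists.

    Returns True if an entry was appended, False if it was skipped.
    """
    now = now or datetime.now(timezone.utc)
    history = doc.setdefault("curation_history", [])
    if skip_if_recent and history:
        last = history[-1]
        same_source = (last["curator"], last["action"]) == (curator, action)
        age = now - datetime.fromisoformat(last["timestamp"])
        if same_source and age <= window:
            return False                  # collapse into the existing entry
    history.append({
        "curator": curator,
        "action": action,
        "changes": changes,
        "timestamp": now.isoformat(),
    })
    return True


doc = {"name": "Example community"}
start = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)

# Three fixes within one session, one minute apart.
for minute in range(3):
    record_event(
        doc, curator="snippet_fixer", action="FIX_SNIPPETS_LLM",
        changes=f"fixed snippet {minute}", skip_if_recent=True,
        now=start + timedelta(minutes=minute),
    )
entries_after_session = len(doc["curation_history"])

# A later run, outside the window, gets its own entry.
record_event(
    doc, curator="snippet_fixer", action="FIX_SNIPPETS_LLM",
    changes="fixed snippet later", skip_if_recent=True,
    now=start + timedelta(hours=2),
)
entries_after_later_run = len(doc["curation_history"])
```

Under these assumptions the session leaves one trail entry and the later run adds a second. A trade-off of this design is that the skipped calls' change summaries are not recorded individually.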

Each per-record loop continues on ValidationFailedError so one bad file
can't kill the batch. CLI surfaces (--auto, --interactive, --dry-run,
--auto-approve, --file, --apply, etc.) preserved.

After-state: scripts/audit_writers.py reports 8/16 writers gated
(was 5/16). The remaining un-converted writers (apply_strain_designations,
apply_taxonomy_corrections, backfill_metals, clean_metals_inplace,
fix_reference_formats, plus a handful of smaller src/ writers) follow
the same conversion pattern; left as future work to keep this PR focused.

Note: scripts/add_evidence_source.py and scripts/intelligent_snippet_fixer.py
import communitymech.literature_enhanced (a pre-existing module that does
not currently exist in the repo) at module top-level. This PR does not
introduce or fix that — the scripts have always failed at the import step
when invoked from CLI. Out of scope for this conversion; tracked separately.

Baseline (unchanged):
- validate_strict: 0 ERROR rows / 265 files
- pytest tests/: 136 passed, 9 skipped

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #86 (Convert clean_metals_inplace.py) just merged into main. After
rebasing, scripts/clean_metals_inplace.py is now gated, so the
appends_curation_history / validates_before_write columns for it flip
to yes. Re-running scripts/audit_writers.py produces a 1-row delta;
commit it so the report reflects the actual post-rebase state.

Combined post-merge: 9/16 appends_curation_history, 10/16
validates_before_write (was 5/16 and 6/16 respectively at the start
of this PR series).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@realmarcin force-pushed the convert/three-more-writers branch from 7aa14dc to a9f062a on May 26, 2026 02:16

Copilot AI left a comment

Pull request overview

Continues the audit-machinery rollout from PR #84/#85 by converting three more writers (intelligent_snippet_fixer.py, enhance_strain_data.py, add_evidence_source.py) to route through record_curation_event + write_validated_community, bringing coverage from 5/16 to 8/16.

Changes:

  • Convert three writers to use the curation-event + closed-schema-validated writer pattern, each with a script-specific action label
  • Add --apply / --overwrite / --kb-dir flags to enhance_strain_data.py (previously extract-only); the historical default still writes nothing to kb/communities/
  • Refresh reports/pipeline_writers_audit.tsv so the three converted scripts now show appends_curation_history=yes and validates_before_write=yes

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| scripts/intelligent_snippet_fixer.py | Wrap snippet-fix write in record_curation_event(skip_if_recent=True, llm_assisted=True) + write_validated_community; pre-existing .yaml.bak_intelligent backup remains the user-facing safety net |
| scripts/enhance_strain_data.py | Add apply_strain_data_to_community method and --apply / --overwrite / --kb-dir CLI flags; preserves curator-authored strain_designation values unless --overwrite is set |
| scripts/add_evidence_source.py | Summarize per-file changes in a curation event, rename original to .yaml.bak_source, write via validated writer, restore backup on ValidationFailedError |
| reports/pipeline_writers_audit.tsv | Update three rows to reflect new appends_curation_history / validates_before_write status |
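
The --overwrite guard described for enhance_strain_data.py can be illustrated with a small sketch. It is not the apply_strain_data_to_community method itself: apply_strain_designations is a hypothetical helper, matching taxonomy entries by a "name" key is an assumption, and the taxon names and strain values are placeholders.

```python
import copy


def apply_strain_designations(doc, designations, overwrite=False):
    """Set strain_designation on matching taxonomy entries.

    Existing values are kept unless overwrite is True.
    Returns a list of human-readable change summaries.
    """
    changes = []
    for taxon in doc.get("taxonomy", []):
        new_value = designations.get(taxon.get("name"))
        if new_value is None:
            continue                      # no extracted data for this taxon
        current = taxon.get("strain_designation")
        if current == new_value:
            continue                      # already up to date
        if current and not overwrite:
            continue                      # keep the curator-authored value
        taxon["strain_designation"] = new_value
        changes.append(f"{taxon['name']}: {current!r} -> {new_value!r}")
    return changes


original = {
    "name": "Example community",
    "taxonomy": [
        {"name": "Taxon A"},
        {"name": "Taxon B", "strain_designation": "curated-1"},
        {"name": "Taxon C"},
    ],
}
extracted = {"Taxon A": "ex-1", "Taxon B": "ex-2"}

# Default: only the entry without an existing value is filled in.
doc_default = copy.deepcopy(original)
default_changes = apply_strain_designations(doc_default, extracted)

# With overwrite=True the curated value is replaced as well.
doc_overwrite = copy.deepcopy(original)
overwrite_changes = apply_strain_designations(doc_overwrite, extracted, overwrite=True)
```

Returning the change summaries gives the caller something concrete to pass as the changes argument of a curation event, and an empty list signals that no write is needed.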

@realmarcin merged commit a49f889 into main on May 26, 2026
@realmarcin deleted the convert/three-more-writers branch on May 26, 2026 02:17
realmarcin added a commit that referenced this pull request May 26, 2026
* Fix broken literature_enhanced imports in two writer scripts

scripts/add_evidence_source.py and scripts/intelligent_snippet_fixer.py
both import EnhancedLiteratureFetcher from communitymech.literature_enhanced
— a module that was never committed to git (only a stale .pyc was
shadowing the missing source locally). Both scripts have raised
ModuleNotFoundError on import for as long as anyone has tried to run
them, which was surfaced as a pre-existing-state heads-up by the recent
writer-conversion PR #87.

Swap to LiteratureFetcher from communitymech.literature, which exposes
the same fetch_pubmed_abstract + fetch_paper surface plus a richer
DOI fallback chain (CrossRef → PubMed via DOI lookup → PMC full-text →
OpenAlex → Semantic Scholar → Europe PMC → publisher meta-tag scrape)
that subsumes what fetch_abstract_for_doi did. API differences:

- fetch_paper returns (abstract, pdf_url) not a dict; tuple-unpack at
  call sites.
- LiteratureFetcher.fetch_paper has no download_pdf kwarg (the older
  version's flag was a no-op in the LiteratureFetcher pipeline; the
  pdf URL is just returned alongside the abstract).
- Title field is unavailable separately. In add_evidence_source.py's
  guess_evidence_source classifier the title was filter(None, …)-merged
  with snippet and abstract anyway; losing it degrades classification
  marginally (PubMed abstracts include the title in the abstract text,
  so PMID references are unaffected). If richer DOI classification is
  needed later, LiteratureFetcher.fetch_doi_metadata() returns CrossRef
  metadata with a title field.

After-state: both scripts now import and run their initialization paths
cleanly. pytest tests/ still passes (136 passed, 9 skipped).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Address Copilot review: drop dead title param from guess_evidence_source

Copilot flagged that title was assigned None and then passed through
guess_evidence_source as a parameter that the classifier merged into
its keyword-matching text via filter(None, ...). With title always None
the parameter was dead code that just clutters the call sites.

Remove the title parameter from guess_evidence_source and from both
caller blocks. PubMed abstracts already embed the title in the
abstract text (so PMID-driven classification is unchanged), and
CrossRef titles for DOI references are available via
LiteratureFetcher.fetch_doi_metadata() if richer classification is
wanted later — that's now a clear future-work hook rather than a
hard-coded-None pretense.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>