Commit 7fed895

realmarcin and claude authored
Wetland #30 backfill, metals extractor bug fix, lint cleanup, cross-repo ID validator (#80)
* Backfill #30 wetlands, fix metals extractor bug, lint cleanup, cross-repo validator

Combines four follow-ups against #30 (cross-repo environmental linking) plus an unrelated lint cleanup, all of which build on each other and share the same test surface.

1. Wetland backfill (#30 Phase 5)

Apply the SPRUCE related_ingredients pattern to 6 more peatland and wetland communities (Stordalen Mire, Prairie Pothole, MUCC Freshwater Wetland, Asgard Wetland Soil, Coastal Forested Wetland, Wetland Oxygen-Sulfate GHG). Each entry uses CHEBI terms and evidence anchored to already-cached PubMed abstracts; no MediaIngredientMech IDs are minted yet.

2. Metals extractor bug fix + 65-file cleanup

metal_extraction.py used plain substring matching against 2-letter element symbols ('ti' for TITANIUM, 'au' for GOLD), which matched inside unrelated words ('characteristic', 'australia') and salted metals_present with TITANIUM in 56/67 metal-annotated YAMLs and GOLD in several more. Switched to non-alphanumeric-boundary regex matching (case-insensitive), with tests pinning the behavior.

scripts/clean_metals_inplace.py re-runs extraction and rewrites only the metals_present / rare_earth_elements_present / metal_relevance / metal_notes blocks via line-based replacement, preserving comments and unrelated formatting (unlike backfill_metals.py's yaml.dump path). Applied once across the corpus: 65 community YAMLs corrected.

3. Lint cleanup (just lint ruff/black)

178 pre-existing ruff errors -> 0. Removed T20 (print) from the ruff selection with rationale: src/communitymech/ ships CLI entry points that legitimately use print. The remaining 44 non-print errors were fixed inline (unused imports, raise-from chains, collapsible ifs, redundant list() calls, zip strict, line splits, import order in batch_reporter.py) or suppressed with a per-file E501 ignore for llm/prompts.py (long prompt strings) and targeted `# noqa` lines with comments for S301/S701/S704/S112 cases that are intentional within their internal-only contexts. mypy still reports 256 pre-existing errors and is out of scope here.

4. Cross-repo ID validator (#30 Phase 3, local half)

New module communitymech.validators.cross_repo_ids with a pattern + existence checker, plus a CLI (scripts/validate_cross_repo_ids.py) and justfile entries (validate-cross-repo-ids, validate-cross-repo-ids-all). Sibling repo paths are opt-in via env or flags; when omitted, the validator emits info-level skip notices rather than silently passing. 10 new tests cover pattern, existence, and edge cases.

Test plan: just test (136 passed, 9 skipped), just validate-all (all 265 communities clean), ruff/black green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Address Copilot review: surgical removal-only metals cleanup

Copilot flagged two serious bugs in scripts/clean_metals_inplace.py from the prior commit:

1. _replace_scalar rewrote metal_notes by substituting only the first physical line of the YAML key. When the existing value spanned multiple lines (as PyYAML's folded scalars often do), the indented continuation lines were left orphaned and silently re-folded by the parser into the new value, producing strings like "...(context-validated) measurements; ...(context-validated)" and, on Ngawha, merging curator prose about mercury cycling into the auto-generated note.

2. The script unconditionally overwrote metal_notes and metal_relevance and removed any metals_present entries the (newly fixed) extractor wouldn't infer. That clobbered curator-authored values (Ngawha's MERCURY + curator note, Oak Ridge's NICKEL/COBALT/ZINC, Bayan Obo notes, etc.): entries the extractor cannot derive but that are curator decisions to keep.

Reverted all 65 YAMLs the prior commit touched, then rewrote the script to be surgical:

- Touches only metals_present. Never reads or writes metal_relevance or metal_notes, which sidesteps the multi-line scalar bug entirely and preserves curator metadata.
- Removes only entries whose extractor keyword list contains a known ambiguous short symbol (`ti`/`au`/`pd`) AND whose unambiguous tokens (full element name, charged ionic forms) do not appear anywhere else in the file as word-bounded tokens. Anything else is kept, including curator-added entries the extractor couldn't have inferred.
- Never adds metals. Surprising additions (e.g., Trichodesmium IRON via newly-correct CHEBI tier-1 matching) are out of scope; running `scripts/backfill_metals.py --dry-run` surfaces them for separate curator review.

Result: 56 files (down from 65), each diff is a 1-2 line removal of TITANIUM and/or GOLD. Ngawha MERCURY, Oak Ridge metals, all curator metal_notes preserved verbatim. 136 tests pass, all 265 communities validate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
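
The extractor fix described in the commit message (substring matching replaced by non-alphanumeric-boundary, case-insensitive regex matching) can be illustrated with a minimal sketch. The function names below are hypothetical and are not taken from metal_extraction.py; the lookaround pattern is one plausible way to express "non-alphanumeric boundary", not necessarily the one the commit uses.

```python
import re


def contains_symbol_substring(text: str, symbol: str) -> bool:
    """Old behaviour: plain case-insensitive substring test."""
    return symbol.lower() in text.lower()


def contains_symbol_bounded(text: str, symbol: str) -> bool:
    """Match only when the symbol is not flanked by a letter or digit."""
    pattern = rf"(?<![A-Za-z0-9]){re.escape(symbol)}(?![A-Za-z0-9])"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


description = "A characteristic microbial mat sampled in Australia"

# Substring matching fires inside 'characteristic' and 'Australia'.
false_hits = [s for s in ("ti", "au") if contains_symbol_substring(description, s)]

# Boundary matching ignores those words but still accepts a standalone symbol.
true_hits = [s for s in ("ti", "au") if contains_symbol_bounded(description, s)]
standalone = contains_symbol_bounded("Au nanoparticles and Ti(IV) oxide", "ti")
```

Explicit lookarounds are used instead of `\b` so that the underscore counts as a boundary, which matches the commit's wording of "non-alphanumeric" rather than "non-word".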
1 parent 84af7bd commit 7fed895

83 files changed

Lines changed: 1080 additions & 142 deletions

docs/cross_repo_linking.md

Lines changed: 27 additions & 0 deletions
@@ -299,6 +299,8 @@ All new fields are optional:

 ## Validation

+### Schema-level tests
+
 Run the cross-repo linking tests:

 ```bash
@@ -311,6 +313,31 @@ Test data files are in `tests/data/test_cross_repo_linking/`:
 - `community_no_links.yaml` -- Backward compatibility
 - `community_all_relationship_types.yaml` -- All 5 enum values

+### Cross-repo ID validator
+
+`just validate-cross-repo-ids FILE` checks that `culturemech_id` /
+`mediaingredientmech_id` values match their CURIE patterns and, when
+sibling-repo paths are configured, that the referenced IDs actually
+exist in those repos.
+
+```bash
+# Pattern check only (no sibling-repo paths)
+just validate-cross-repo-ids kb/communities/SPRUCE_Peatland_Methane_Cycling_Community.yaml
+
+# Pattern + existence check
+COMMUNITYMECH_SIBLING_REPOS="CultureMech=../CultureMech/kb/media,MediaIngredientMech=../MediaIngredientMech/kb/ingredients" \
+just validate-cross-repo-ids-all
+```
+
+The validator returns:
+- `error` for malformed CURIEs or IDs missing from a configured sibling repo
+- `info` for IDs whose existence check was skipped because the relevant
+sibling-repo path wasn't configured
+- nothing if a community has no cross-repo IDs at all
+
+Sibling-repo paths can also be passed via `--culturemech` /
+`--mediaingredientmech` flags to `scripts/validate_cross_repo_ids.py`.
+
 ## See Also

 - [Growth Media Linking](media_linking.md) -- Existing cultivation-based linking
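
The three outcomes documented above (`error`, `info`, nothing) can be sketched as follows. This is an illustration only, not the code in communitymech.validators.cross_repo_ids: `check_id` and `PATTERNS` are hypothetical names, and the CURIE shapes are assumptions extrapolated from the single `CultureMech:000139` example visible in this commit.

```python
import re

# Assumed CURIE shapes (hypothetical); the real patterns live in the validator module.
PATTERNS = {
    "culturemech_id": re.compile(r"CultureMech:\d{6}"),
    "mediaingredientmech_id": re.compile(r"MediaIngredientMech:\d{6}"),
}


def check_id(field, value, known_ids=None):
    """Return a (level, message) tuple, or None when there is nothing to report.

    known_ids is the set of IDs found in the sibling repo, or None when no
    sibling-repo path was configured for this field.
    """
    if not PATTERNS[field].fullmatch(value):
        return ("error", f"{field}: malformed CURIE {value!r}")
    if known_ids is None:
        # Pattern is fine, but existence cannot be checked: report, don't pass silently.
        return ("info", f"{field}: existence check skipped for {value}")
    if value not in known_ids:
        return ("error", f"{field}: {value} not found in sibling repo")
    return None


sibling_ids = {"CultureMech:000139"}
ok = check_id("culturemech_id", "CultureMech:000139", sibling_ids)
skipped = check_id("culturemech_id", "CultureMech:000139")
missing = check_id("culturemech_id", "CultureMech:000999", sibling_ids)
malformed = check_id("culturemech_id", "culturemech-139", sibling_ids)
```

Returning an explicit `info` result when `known_ids` is `None` mirrors the documented design choice that an unconfigured sibling repo produces a visible skip notice rather than a silent pass.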

justfile

Lines changed: 10 additions & 0 deletions
@@ -35,6 +35,16 @@ validate-references-all:
         uv run linkml-reference-validator validate data "$file" -s src/communitymech/schema/communitymech.yaml --config conf/reference_validator.yaml
     done

+# Validate cross-repo IDs (CultureMech, MediaIngredientMech) in one community file.
+# Pattern checks always run; existence checks run when sibling-repo paths are
+# configured via COMMUNITYMECH_SIBLING_REPOS env (Name=path,Name=path).
+validate-cross-repo-ids FILE:
+    PYTHONPATH=src uv run python scripts/validate_cross_repo_ids.py {{FILE}}
+
+# Validate cross-repo IDs across all community files.
+validate-cross-repo-ids-all:
+    PYTHONPATH=src uv run python scripts/validate_cross_repo_ids.py kb/communities/*.yaml
+
 # Validate ontology terms in a community file
 validate-terms FILE:
     uv run linkml-term-validator validate-data {{FILE}} -s src/communitymech/schema/communitymech.yaml --labels
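
The justfile comment gives the `COMMUNITYMECH_SIBLING_REPOS` format as `Name=path,Name=path`. A minimal sketch of parsing that format is shown here; `parse_sibling_repos` is a hypothetical helper, and how the real script handles empty or malformed entries is not shown in this commit, so the error handling is an assumption.

```python
def parse_sibling_repos(raw: str) -> dict[str, str]:
    """Parse 'Name=path,Name=path' into {name: path}.

    Empty input yields an empty mapping (no sibling repos configured).
    Raising on entries without '=' is an assumption of this sketch.
    """
    repos: dict[str, str] = {}
    for item in raw.split(","):
        item = item.strip()
        if not item:
            continue
        name, sep, path = item.partition("=")
        if not sep or not name.strip() or not path.strip():
            raise ValueError(f"expected Name=path, got {item!r}")
        repos[name.strip()] = path.strip()
    return repos


configured = parse_sibling_repos(
    "CultureMech=../CultureMech/kb/media,"
    "MediaIngredientMech=../MediaIngredientMech/kb/ingredients"
)
unconfigured = parse_sibling_repos("")
```

In a script this would typically be called as `parse_sibling_repos(os.environ.get("COMMUNITYMECH_SIBLING_REPOS", ""))`, so that an unset variable falls through to the pattern-only check.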

kb/communities/AMD_Acidophile_Heterotroph_Network.yaml

Lines changed: 0 additions & 1 deletion
@@ -765,7 +765,6 @@ metals_present:
 - COPPER
 - GOLD
 - IRON
-- TITANIUM
 metal_relevance: PRIMARY
 metal_notes: Metal/REE detected via CHEBI terms in metabolites; Metal/REE detected via environmental factor
   measurements; Metal/REE detected via keyword matching in description (context-validated)
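
The `metal_notes` value in this file spans two physical lines, which is the shape that triggered the first Copilot-flagged bug. The sketch below reproduces that failure mode on plain text: `replace_first_line_only` and `replace_whole_scalar` are hypothetical functions written for illustration, not the repository's `_replace_scalar`, and they assume a top-level key whose continuation lines are indented. The commit's actual fix was to stop touching `metal_notes` altogether.

```python
def replace_first_line_only(lines, key, new_value):
    """Buggy approach: swap the key's first physical line, ignore continuations."""
    return [f"{key}: {new_value}" if ln.startswith(f"{key}:") else ln for ln in lines]


def replace_whole_scalar(lines, key, new_value):
    """Also drop the indented continuation lines that belong to the old value."""
    out, skipping = [], False
    for ln in lines:
        if ln.startswith(f"{key}:"):
            out.append(f"{key}: {new_value}")
            skipping = True
        elif skipping and ln.startswith((" ", "\t")):
            continue  # continuation of the old multi-line value
        else:
            skipping = False
            out.append(ln)
    return out


original = [
    "metal_relevance: PRIMARY",
    "metal_notes: old note, first physical line",
    "  old note, continuation line",
    "other_key: 1",
]

buggy = replace_first_line_only(original, "metal_notes", "new note")
fixed = replace_whole_scalar(original, "metal_notes", "new note")
```

In `buggy`, the orphaned continuation line still follows the new value, so a YAML parser would fold it into the scalar; in `fixed` it is gone.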

kb/communities/AMD_Nitrososphaerota_Archaeal.yaml

Lines changed: 0 additions & 2 deletions
@@ -699,9 +699,7 @@ environmental_factors:
     explanation: Demonstrates value of genomic data for understanding archaeal adaptations
 metals_present:
 - COPPER
-- GOLD
 - IRON
-- TITANIUM
 metal_relevance: PRIMARY
 metal_notes: Metal/REE detected via CHEBI terms in metabolites; Metal/REE detected via environmental factor
   measurements; Metal/REE detected via keyword matching in description (context-validated)

kb/communities/Asgard_Wetland_Soil_Methanogenesis_Substrate_Community.yaml

Lines changed: 44 additions & 0 deletions
@@ -179,6 +179,50 @@ environmental_factors:
     snippet: contributions in soil ecosystems remain unknown
     explanation: Supports terrestrial soil as the focus environment.
 growth_media: []
+related_ingredients:
+- preferred_term: acetate
+  chebi_term:
+    id: CHEBI:30089
+    label: acetate
+  relevance: Asgard archaeal acetogens in this wetland soil generate acetate
+    from carbohydrate breakdown via the Wood-Ljungdahl pathway; acetate is
+    therefore both the headline output of Asgard metabolism in this community
+    and the substrate that feeds the co-resident acetoclastic methanogens.
+  evidence:
+  - reference: PMID:39085194
+    supports: SUPPORT
+    evidence_source: COMPUTATIONAL
+    snippet: carbohydrate breakdown to acetate and formate
+    explanation: Anchors acetate as the central Asgard metabolic output that
+      modulates downstream methanogenesis substrates in wetland soil.
+- preferred_term: formate
+  chebi_term:
+    id: CHEBI:15740
+    label: formate
+  relevance: Formate co-produced with acetate is a major C1 substrate for
+    hydrogenotrophic and formate-utilizing methanogens, defining a second
+    Asgard-mediated methanogenesis-substrate channel in this community.
+  evidence:
+  - reference: PMID:39085194
+    supports: SUPPORT
+    evidence_source: COMPUTATIONAL
+    snippet: carbohydrate breakdown to acetate and formate
+    explanation: Anchors formate as the second Asgard-derived methanogenesis
+      substrate alongside acetate.
+- preferred_term: dihydrogen
+  chebi_term:
+    id: CHEBI:18276
+    label: dihydrogen
+  relevance: Expression of [NiFe]-hydrogenases by both Atabeyarchaeia and
+    Freyarchaeia genomes implicates H2 cycling as a core Asgard activity in
+    this wetland soil; any cultivation medium designed around the community
+    would need an H2 headspace.
+  evidence:
+  - reference: PMID:39085194
+    supports: SUPPORT
+    evidence_source: COMPUTATIONAL
+    snippet: expression of genes for [NiFe]-hydrogenases
+    explanation: Anchors H2 cycling as an in situ expressed Asgard activity.
 external_resources:
 - name: Primary publication for the Asgard wetland soil methanogenesis-substrate
   community

kb/communities/At_RSPHERE_SynCom.yaml

Lines changed: 1 addition & 2 deletions
@@ -427,8 +427,7 @@ environmental_factors:
   plant-microbe interactions, and synthetic community design

   '
-metals_present:
-- TITANIUM
+metals_present: []
 metal_relevance: INCIDENTAL
 metal_notes: Metal/REE detected via environmental factor measurements; Metal/REE detected
   via keyword matching in description (context-validated)

kb/communities/Australian_Lead_Zinc_Polymetallic.yaml

Lines changed: 0 additions & 2 deletions
@@ -966,10 +966,8 @@ environmental_factors:
     explanation: Documents long-term weathering profile development
 metals_present:
 - COPPER
-- GOLD
 - IRON
 - LEAD
-- TITANIUM
 - ZINC
 metal_relevance: SIGNIFICANT
 metal_notes: Metal/REE detected via CHEBI terms in metabolites; Metal/REE detected via environmental factor

kb/communities/Bayan_Obo_REE_Tailings_Consortium.yaml

Lines changed: 0 additions & 2 deletions
@@ -544,9 +544,7 @@ environmental_factors:
       was narrated for plausible real-world use
     explanation: Describes REE mineralogy at Bayan Obo
 metals_present:
-- GOLD
 - IRON
-- TITANIUM
 rare_earth_elements_present:
 - CERIUM
 - LANTHANUM

kb/communities/Chlamydomonas_Bacterial_H2_Consortium.yaml

Lines changed: 1 addition & 2 deletions
@@ -351,7 +351,6 @@ growth_media:
     explanation: Establishes mannitol and yeast extract as key medium components for sustained H2 production
   culturemech_id: CultureMech:000139
   culturemech_url: https://github.com/CultureBotAI/CultureMech/tree/main/kb/media/CultureMech:000139
-metals_present:
-- TITANIUM
+metals_present: []
 metal_relevance: SIGNIFICANT
 metal_notes: Metal/REE detected via environmental factor measurements

kb/communities/Chlamydomonas_Methylobacterium_Mutualism.yaml

Lines changed: 0 additions & 1 deletion
@@ -327,6 +327,5 @@ growth_media:
   culturemech_url: https://github.com/CultureBotAI/CultureMech/tree/main/kb/media/CultureMech:000139
 metals_present:
 - IRON
-- TITANIUM
 metal_relevance: SIGNIFICANT
 metal_notes: Metal/REE detected via environmental factor measurements