
Commit 4801893

realmarcin and claude authored
Add publisher-meta scraper fallback + 3 Springer cache refreshes (#59)
Drive `just validate-references-all` errors from 43 to 39 (and from session-start 101 to 39) by adding a last-resort DOI page scraper to the literature fetcher and refreshing the three Springer caches it unblocks.

Fetcher (src/communitymech/literature.py):
- fetch_publisher_meta_abstract(): GET https://doi.org/<DOI>, follow redirects, and pull the abstract excerpt out of the page's twitter:description / og:description / description meta tag. Springer publishes the first ~200 characters of the abstract in twitter:description even for paywalled articles where Crossref / OpenAlex / Semantic Scholar / Europe PMC have no abstract. Includes on-disk caching as publisher_<safe-doi>.txt and strips the "Journal Name - " prefix Springer adds to that field. Elsevier ScienceDirect intentionally serves a bot-detection page and yields nothing - that's the residual cap.
- fetch_paper() fallback chain now: CrossRef -> PMID -> PMC -> OpenAlex -> Semantic Scholar -> Europe PMC -> publisher meta scrape.

Cache refresh (recovers 4 ERROR rows):
- DOI_10.1007_s10311-019-00911-y (Ewaste copper bioleaching, Springer)
- DOI_10.1007_s10230-008-0059-z (Iberian meromictic pit lakes, Springer)
- DOI_10.1007_BF02106205 (Acidobacterium taxonomy paper, Current Microbiology / Springer; cited 2x in AMD_Acidophile_Heterotroph_Network)

Snippet repairs:
- Ewaste_Bioleaching_Consortium: replace title quote with the abstract's verbatim e-waste bioleaching framing.
- Iberian_Pit_Lake_Stratified_Community: upgrade PARTIAL to SUPPORT and expand the snippet to the abstract's vertical-gradient quote.
- AMD_Acidophile_Heterotroph_Network: replace two title quotes with the abstract's verbatim genus proposal.

Remaining 39 "No content available" errors are all Elsevier 2024-2025 papers (j.jece.2025.120403, j.cej.2024.153492, j.ibiod.2025.106190, 10889868.2024.2407240) plus one ResearchGate preprint - their abstracts are not in any aggregator we can query and the publisher pages serve bot-detection HTML.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
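
The fallback order in the message above amounts to "try each source in turn and keep the first non-empty abstract". A minimal sketch of that pattern follows; `first_abstract` and the two stand-in fetchers are hypothetical names for illustration and are not part of the commit, which chains `if not abstract:` checks inside fetch_paper() instead.

```python
from typing import Callable, Iterable, Optional

# A fetcher takes a DOI and returns an abstract, or None when it has nothing.
Fetcher = Callable[[str], Optional[str]]


def first_abstract(doi: str, fetchers: Iterable[Fetcher]) -> Optional[str]:
    """Return the first non-empty abstract, trying fetchers in order."""
    for fetch in fetchers:
        abstract = fetch(doi)
        if abstract:  # None and "" both fall through to the next source
            return abstract
    return None


# Hypothetical stand-ins for the real fetchers (assumptions, not the repo's API).
def aggregator_without_abstract(doi: str) -> Optional[str]:
    return None


def publisher_meta_scrape(doi: str) -> Optional[str]:
    return f"excerpt for {doi}"


result = first_abstract(
    "10.1007/BF02106205",
    [aggregator_without_abstract, publisher_meta_scrape],
)
```

Putting the scraper last keeps it a true last resort: it only runs when every aggregator has returned nothing.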
1 parent 85f6b2e commit 4801893

7 files changed

Lines changed: 91 additions & 13 deletions

kb/communities/AMD_Acidophile_Heterotroph_Network.yaml

Lines changed: 4 additions & 4 deletions
@@ -229,8 +229,8 @@ taxonomy:
   - reference: doi:10.1007/BF02106205
     supports: SUPPORT
     evidence_source: IN_VITRO
-    snippet: 'Acidobacterium capsulatum gen. nov., sp. nov.: An acidophilic chemoorganotrophic bacterium
-      containing menaquinone from acidic mineral environment'
+    snippet: Acidobacterium is proposed as a new genus for the acidophilic, chemoorganotrophic
+      bacteria containing menaquinone isolated from acidic mineral environments
     explanation: Establishes isolation from acidic mineral environment
   - reference: doi:10.3389/fmicb.2016.00744
     supports: SUPPORT
@@ -622,8 +622,8 @@ ecological_interactions:
   - reference: doi:10.1007/BF02106205
     supports: SUPPORT
     evidence_source: IN_VITRO
-    snippet: 'Acidobacterium capsulatum gen. nov., sp. nov.: An acidophilic chemoorganotrophic bacterium
-      containing menaquinone from acidic mineral environment'
+    snippet: Acidobacterium is proposed as a new genus for the acidophilic, chemoorganotrophic
+      bacteria containing menaquinone isolated from acidic mineral environments
     explanation: Establishes chemoorganotrophic metabolism
   - reference: doi:10.3389/fmicb.2016.00744
     supports: SUPPORT

kb/communities/Ewaste_Bioleaching_Consortium.yaml

Lines changed: 5 additions & 3 deletions
@@ -136,9 +136,11 @@ taxonomy:
   - reference: doi:10.1007/s10311-019-00911-y
     supports: SUPPORT
     evidence_source: IN_VITRO
-    snippet: Enhanced bioleaching of copper from circuit boards of computer waste by Acidithiobacillus
-      ferrooxidans
-    explanation: Establishes At. ferrooxidans efficacy for e-waste copper recovery
+    snippet: Computer circuit boards are a major electronic waste containing higher concentrations
+      of copper, gold and silver. These metals may be recovered by bioleaching
+    explanation: Establishes bioleaching of computer circuit boards for recovery of copper
+      and other metals; the paper specifically evaluates Acidithiobacillus ferrooxidans
+      efficacy.
 - taxon_term:
     preferred_term: Acidithiobacillus thiooxidans
     term:

kb/communities/Iberian_Pit_Lake_Stratified_Community.yaml

Lines changed: 5 additions & 3 deletions
@@ -697,10 +697,12 @@ environmental_factors:
       with Proteobacteria
     explanation: Documents meromictic stability
   - reference: doi:10.1007/s10230-008-0059-z
-    supports: PARTIAL
+    supports: SUPPORT
     evidence_source: IN_VIVO
-    snippet: permanently stratified acidic pit lake
-    explanation: Documents thermal stratification maintaining meromixis
+    snippet: A marked vertical trend of increasing temperature and dissolved metal concentrations
+      is observed in the monimolimnia of some meromictic pit lakes of the Iberian
+    explanation: Documents the vertical temperature and metal gradients that maintain
+      meromictic stratification in Iberian acidic pit lakes.
 - name: Microbial Diversity Gradient
   value: 72-206
   unit: ASVs per sample
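
The snippet repairs in the three YAML files all replace text that was not in the cached content with a verbatim quote from the newly cached excerpt. The validator's actual matching rules are not shown in this commit; the sketch below assumes a whitespace-insensitive, case-insensitive substring test, and `normalize` and `snippet_supported` are hypothetical names.

```python
import re


def normalize(text: str) -> str:
    """Collapse runs of whitespace (including YAML line wraps) and lowercase."""
    return re.sub(r"\s+", " ", text).strip().lower()


def snippet_supported(snippet: str, cached_content: str) -> bool:
    """True when the snippet appears verbatim in the cached content (assumed rule)."""
    return normalize(snippet) in normalize(cached_content)


# Excerpt cached for doi:10.1007/s10230-008-0059-z in this commit.
CACHED = (
    "A marked vertical trend of increasing temperature and dissolved metal "
    "concentrations is observed in the monimolimnia of some meromictic pit "
    "lakes of the Iberian"
)

# New snippet, wrapped across two lines as in the YAML file.
NEW_SNIPPET = (
    "A marked vertical trend of increasing temperature and dissolved metal concentrations\n"
    "      is observed in the monimolimnia of some meromictic pit lakes of the Iberian"
)

# Old snippet removed by this commit.
OLD_SNIPPET = "permanently stratified acidic pit lake"

new_ok = snippet_supported(NEW_SNIPPET, CACHED)
old_ok = snippet_supported(OLD_SNIPPET, CACHED)
```

Under this assumed rule the new snippet matches the cached excerpt and the old one does not.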

references_cache/DOI_10.1007_BF02106205.md

Lines changed: 3 additions & 1 deletion
@@ -8,7 +8,7 @@ authors:
 journal: Current Microbiology
 year: '1991'
 doi: 10.1007/BF02106205
-content_type: unavailable
+content_type: abstract_only
 ---

 # Acidobacterium capsulatum gen. nov., sp. nov.: An acidophilic chemoorganotrophic bacterium containing menaquinone from acidic mineral environment
@@ -17,3 +17,5 @@ content_type: unavailable
 **DOI:** [10.1007/BF02106205](https://doi.org/10.1007/BF02106205)

 ## Content
+
+Acidobacterium is proposed as a new genus for the acidophilic, chemoorganotrophic bacteria containing menaquinone isolated from acidic mineral environments.Acidobacterium

references_cache/DOI_10.1007_s10230-008-0059-z.md

Lines changed: 3 additions & 1 deletion
@@ -9,7 +9,7 @@ authors:
 journal: Mine Water and the Environment
 year: '2009'
 doi: 10.1007/s10230-008-0059-z
-content_type: unavailable
+content_type: abstract_only
 ---

 # Physico-chemical gradients and meromictic stratification in Cueva de la Mora and other acidic pit lakes of the Iberian Pyrite Belt
@@ -18,3 +18,5 @@ content_type: unavailable
 **DOI:** [10.1007/s10230-008-0059-z](https://doi.org/10.1007/s10230-008-0059-z)

 ## Content
+
+A marked vertical trend of increasing temperature and dissolved metal concentrations is observed in the monimolimnia of some meromictic pit lakes of the Iberian

references_cache/DOI_10.1007_s10311-019-00911-y.md

Lines changed: 3 additions & 1 deletion
@@ -7,7 +7,7 @@ authors:
 journal: Environmental Chemistry Letters
 year: '2019'
 doi: 10.1007/s10311-019-00911-y
-content_type: unavailable
+content_type: abstract_only
 ---

 # Enhanced bioleaching of copper from circuit boards of computer waste by Acidithiobacillus ferrooxidans
@@ -16,3 +16,5 @@ content_type: unavailable
 **DOI:** [10.1007/s10311-019-00911-y](https://doi.org/10.1007/s10311-019-00911-y)

 ## Content
+
+Computer circuit boards are a major electronic waste containing higher concentrations of copper, gold and silver. These metals may be recovered by bioleaching, an
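
The three cache files above share a naming pattern that can be read off their paths: `DOI_` followed by the DOI with `/` replaced by `_`. A minimal sketch of that mapping follows. `normalize_doi` and `reference_cache_path` are hypothetical helpers inferred from the file names in this commit, not functions shown in the repository, and the handling of `http://` and `dx.doi.org` prefixes is an added assumption.

```python
import re
from pathlib import Path


def normalize_doi(raw: str) -> str:
    """Strip a leading 'doi:' or doi.org URL prefix and surrounding whitespace."""
    doi = raw.strip()
    doi = re.sub(r"^doi:", "", doi, flags=re.IGNORECASE)
    doi = re.sub(r"^https?://(dx\.)?doi\.org/", "", doi, flags=re.IGNORECASE)
    return doi.strip()


def reference_cache_path(doi: str, root: Path = Path("references_cache")) -> Path:
    """Map a DOI to its cache file, e.g. DOI_10.1007_BF02106205.md (inferred pattern)."""
    safe = normalize_doi(doi).replace("/", "_")
    return root / f"DOI_{safe}.md"


path = reference_cache_path("doi:10.1007/BF02106205")
```

Normalizing before building the file name means `doi:10.1007/BF02106205` and `https://doi.org/10.1007/BF02106205` resolve to the same cache entry.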

src/communitymech/literature.py

Lines changed: 68 additions & 0 deletions
@@ -368,6 +368,72 @@ def fetch_europepmc_abstract(self, doi: str) -> str | None:
             print(f"Error fetching Europe PMC for {doi}: {e}")
             return None

+    def fetch_publisher_meta_abstract(self, doi: str) -> str | None:
+        """
+        Last-resort scrape of the DOI landing page for an abstract excerpt.
+
+        Most publishers expose the abstract (or its first ~200 chars) in
+        page-level meta tags - typically `twitter:description`,
+        `og:description`, or the standard `<meta name="description">` -
+        even when the article itself is paywalled and Crossref / OpenAlex /
+        Semantic Scholar / Europe PMC carry no abstract. Springer/Nature
+        is the most reliable source for this pattern; Elsevier
+        ScienceDirect serves a bot-detection page and yields nothing.
+
+        Args:
+            doi: DOI string (with or without "doi:" prefix)
+
+        Returns:
+            Abstract excerpt (may be truncated to ~200 chars by the
+            publisher) or None.
+        """
+        doi = re.sub(r"^(?i:doi:)", "", doi).replace("https://doi.org/", "").strip()
+        cache_file = self._abstract_cache_path("publisher", doi)
+        if cache_file.exists():
+            return cache_file.read_text() or None
+
+        try:
+            response = self.session.get(
+                f"https://doi.org/{doi}",
+                headers={"User-Agent": "Mozilla/5.0 (Macintosh)"},
+                allow_redirects=True,
+                timeout=30,
+            )
+            response.raise_for_status()
+            html = response.text
+        except requests.exceptions.RequestException as e:
+            print(f"Error scraping publisher page for {doi}: {e}")
+            return None
+
+        for tag in ("twitter:description", "og:description", "description"):
+            match = re.search(
+                rf'<meta\s+[^>]*name=["\']?{tag}["\']?\s+content=["\']([^"\']+)["\']',
+                html,
+                re.IGNORECASE,
+            )
+            if not match:
+                match = re.search(
+                    rf'<meta\s+[^>]*property=["\']?{tag}["\']?\s+content=["\']([^"\']+)["\']',
+                    html,
+                    re.IGNORECASE,
+                )
+            if match:
+                text = match.group(1)
+                # Decode common HTML entities
+                text = (
+                    text.replace("&amp;", "&")
+                    .replace("&quot;", '"')
+                    .replace("&#x27;", "'")
+                    .replace("&lt;", "<")
+                    .replace("&gt;", ">")
+                )
+                # Strip Springer's "Journal Name - " prefix and trailing ellipsis
+                text = re.sub(r"^[^-]+-\s*", "", text).rstrip(" .")
+                if len(text) > 80:  # skip nav-text and similar short snippets
+                    cache_file.write_text(text)
+                    return text
+        return None
+
     def fetch_unpaywall(self, doi: str, email: str = "noreply@example.com") -> str | None:
         """
         Try to fetch open access PDF URL from Unpaywall.
@@ -458,6 +524,8 @@ def fetch_paper(
                 abstract = self.fetch_semantic_scholar_abstract(doi)
         if not abstract:
             abstract = self.fetch_europepmc_abstract(doi)
+        if not abstract:
+            abstract = self.fetch_publisher_meta_abstract(doi)

         return (abstract, pdf_url)

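
The regexes in fetch_publisher_meta_abstract() expect the `name` or `property` attribute to come directly before `content`, and the captured value stops at the first quote character of either kind. As a point of comparison, here is a minimal sketch of the same tag lookup using the standard-library `html.parser`, which accepts any attribute order and decodes entities itself. This is an alternative illustration, not the committed implementation: `MetaDescriptionParser`, `strip_journal_prefix`, and `extract_meta_abstract` are hypothetical names, and stripping the prefix only when a known journal name matches is an assumption that differs from the committed "everything up to the first hyphen" rule.

```python
from html.parser import HTMLParser
from typing import Dict, Optional

# Same tag priority as the committed function.
TAGS = ("twitter:description", "og:description", "description")


class MetaDescriptionParser(HTMLParser):
    """Collect the first content value seen for each description meta tag."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.found: Dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        key = (attr.get("name") or attr.get("property") or "").lower()
        content = attr.get("content", "").strip()
        if key in TAGS and content and key not in self.found:
            self.found[key] = content


def strip_journal_prefix(text: str, journal: Optional[str] = None) -> str:
    """Remove a leading 'Journal Name - ' only when the journal name is known."""
    if journal and text.startswith(f"{journal} - "):
        text = text[len(journal) + 3:]
    return text.rstrip(" .")


def extract_meta_abstract(
    html: str, journal: Optional[str] = None, min_len: int = 80
) -> Optional[str]:
    """Return the first sufficiently long description excerpt, or None."""
    parser = MetaDescriptionParser()
    parser.feed(html)
    for key in TAGS:
        text = parser.found.get(key)
        if text:
            text = strip_journal_prefix(text, journal)
            if len(text) > min_len:  # skip nav text and similar short values
                return text
    return None


# Illustrative page fragment (made up for this sketch): content precedes name.
SAMPLE = (
    '<head><meta content="Current Microbiology - Acidobacterium is proposed as '
    "a new genus for the acidophilic, chemoorganotrophic bacteria containing "
    'menaquinone isolated from acidic mineral environments." '
    'name="twitter:description"/></head>'
)

excerpt = extract_meta_abstract(SAMPLE, journal="Current Microbiology")
```

The journal name needed for the prefix match is already present in each cache file's front matter (`journal: Current Microbiology`), so an excerpt that happens to contain a hyphen, such as "e-waste", would not be cut at that hyphen.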
