Commit 5a822d4

Author: Ang
Release 0.1.1: align abstract retrieval semantics

- Remove IdentifierModule._check_doi_content_consistency and the
  consistency_score / low_consistency warning path. Fuzzy
  string-similarity was empirically unable to detect subtle
  LLM-hallucinated references and only surfaced as a logger.warning
  that downstream tools could not observe. Citation-authenticity
  belongs at the semantic abstract-vs-claim layer in the consuming
  tool (e.g. the sci skill), not at the bibliographic-string layer
  here.
- Add DOI-only abstract fallback cascade: CrossRef -> Semantic Scholar
  (/paper/DOI:) -> PubMed (ESearch DOI->PMID, EFetch PMID->abstract).
  Gated on raw input carrying a DOI; DOIs inferred by fuzzy search do
  not trigger it. Title-based PubMed fallback is removed because it
  empirically returned the abstract of an unrelated paper for at least
  one DOI, which is strictly worse than returning None for downstream
  semantic checks.
- Rename _complete_fields(..., allow_pubmed_fallback=...) to
  allow_abstract_fallback. Old name kept as deprecated alias for one
  release cycle (emits DeprecationWarning).
- journal_article_full template declares abstract as an optional field
  to match what the enricher emits. journal_article_with_abstract
  retained as compatibility alias.
- Regression test
  test_enrich_single_entry_no_doi_in_raw_skips_abstract_fallback pins
  the no-DOI-in-raw => no Semantic-Scholar/PubMed network call
  guarantee at the _enrich_single_entry layer.
- Sync CHANGELOG.md and docs/changelog.rst for 0.1.1, bump 4 version
  sites (pyproject.toml, __init__.py, CITATION.cff, enricher.py
  User-Agent), and add truncated abstract line to README /
  quick_start.rst / output_formats.rst sample outputs.
1 parent e5fe3ac commit 5a822d4
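
The gating rule and fallback order summarised in the commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not OneCite's actual code: `fetch_abstract_by_doi`, `abstract_for_entry`, and the `AbstractSource` type are hypothetical names, and every source is modelled as a plain callable so the sketch runs without network access.

```python
from typing import Callable, Dict, Optional, Sequence

# Hypothetical type: a source maps a DOI to an abstract, or None on a miss.
AbstractSource = Callable[[str], Optional[str]]


def fetch_abstract_by_doi(doi: str, sources: Sequence[AbstractSource]) -> Optional[str]:
    """Return the first non-empty abstract from DOI-keyed sources, tried in order."""
    for source in sources:
        abstract = source(doi)
        if abstract and abstract.strip():
            return abstract.strip()
    # Every source missed: return None rather than guessing by title.
    return None


def abstract_for_entry(
    raw_entry: Dict[str, str],
    sources: Sequence[AbstractSource],
) -> Optional[str]:
    """Run the cascade only if the raw input itself carried a DOI."""
    raw_doi = (raw_entry.get("doi") or "").strip()
    if not raw_doi:
        # A DOI inferred later by fuzzy search must not open the gate.
        return None
    return fetch_abstract_by_doi(raw_doi, sources)
```

Passing the sources in order (CrossRef, then Semantic Scholar, then PubMed) reproduces the described cascade; an entry without a `doi` key never reaches any of them.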

14 files changed

Lines changed: 608 additions & 164 deletions

CHANGELOG.md

Lines changed: 70 additions & 0 deletions
@@ -6,6 +6,76 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/), and this
 
 ## [Unreleased]
 
+## [0.1.1] - 2026-04-17
+
+Maintenance release focused on **aligning the abstract-retrieval
+semantics across code, templates, docs, tests, and metadata**. No
+breaking public-API changes; the one renamed kwarg keeps its old name
+as a deprecated alias for this release cycle.
+
+### Added
+- Abstract retrieval now falls back through a DOI-only cascade when
+CrossRef does not return an abstract:
+Semantic Scholar (`/paper/DOI:{doi}?fields=abstract`) → PubMed
+(ESearch DOI→PMID, then EFetch PMID→abstract). The cascade is only
+invoked when the user's **original raw input** carried a DOI; DOIs
+inferred by fuzzy search do not trigger it, so that a possibly-wrong
+candidate does not cost extra roundtrips. In particular, a local
+BibTeX entry with no DOI field — regardless of whether other stages
+would later resolve one — does not trigger the abstract cascade.
+- Semantic Scholar search results now carry the `abstract` field, which
+propagates through `_convert_search_metadata` into the final BibTeX
+output whenever the identification stage already resolved the entry
+through SS.
+- `EnricherModule._get_semantic_scholar_abstract(doi)` helper for
+DOI-based Semantic Scholar abstract retrieval. Handles `404` / `429`
+gracefully by returning `None`.
+- `_complete_fields` gained an `allow_abstract_fallback` kwarg
+(default `False`) that gates the new cascade. `_enrich_single_entry`
+passes `True` only when the raw entry contributed a DOI.
+- Default `journal_article_full` template now lists `abstract` as an
+optional field, so the template declaration matches what the enricher
+emits. The older `journal_article_with_abstract` template is retained
+as a compatibility alias and will stay available for at least one
+release cycle.
+- Regression test `test_enrich_single_entry_no_doi_in_raw_skips_abstract_fallback`
+pinning the "no-DOI-in raw ⇒ no Semantic-Scholar / PubMed network
+call" guarantee at the `_enrich_single_entry` layer, so a future
+refactor of the `raw_has_doi` gate cannot silently start leaking
+network calls for local-only inputs.
+
+### Changed
+- `_get_pubmed_abstract` now requires a DOI and no longer falls back to
+PubMed title search. The removed title-based path empirically returned
+the abstract of an unrelated paper (e.g., the Zhang 2020 AI Review DOI
+`10.1007/s10462-019-09792-7` pulled the abstract of a different RSI
+segmentation paper), which is strictly worse than returning `None` for
+downstream semantic cross-checks such as the `sci` skill.
+- Abstract coverage on an internal 10-DOI cross-publisher spot-check
+(Nature, Science, PLOS, Cell, IEEE CVPR, Frontiers, arXiv, Springer,
+ACM, plus one deliberately invalid DOI) rose from 4/9 to 8/9. This
+number is a local indicator, **not** a release gate: reproducing it
+requires a live network and the probe scripts are no longer in the
+repository.
+
+### Deprecated
+- `_complete_fields(..., allow_pubmed_fallback=...)` is deprecated in
+favour of `allow_abstract_fallback`. The old name still works for one
+release cycle and emits `DeprecationWarning`. It was renamed because
+the flag actually gates the entire Semantic-Scholar + PubMed cascade,
+not PubMed alone.
+
+### Removed
+- `IdentifierModule._check_doi_content_consistency` and the
+`consistency_score` / `low_consistency` warning path. The fuzzy
+string-similarity score was empirically unable to detect subtle
+LLM-hallucinated references (scored 85/100 on author-only
+hallucinations against a real DOI) and was only surfaced as a
+`logger.warning` that downstream tools could not observe, producing
+false reassurance. Citation authenticity verification belongs at the
+abstract-vs-claim semantic layer in the consuming tool (e.g. the
+`sci` skill), not at the bibliographic-string layer here.
+
 ## [0.1.0] - 2026-04-17
 
 First formal PyPI release since `0.0.12`. Incorporates the complete
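
The endpoints named in the "Added" entries can be illustrated with a few URL builders and a helper that maps `404` / `429` to `None`. This is a hedged sketch, not the project's `_get_semantic_scholar_abstract`: the function names and the injected `http_get` transport are hypothetical, the base URLs are the public Semantic Scholar Graph API and NCBI E-utilities hosts, and the exact query parameters OneCite sends (for example `retmode`) are assumptions.

```python
from typing import Callable, Optional, Tuple
from urllib.parse import quote, urlencode

S2_BASE = "https://api.semanticscholar.org/graph/v1"
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

# Hypothetical transport: takes a URL, returns (status_code, parsed JSON or None).
HttpGet = Callable[[str], Tuple[int, Optional[dict]]]


def semantic_scholar_url(doi: str) -> str:
    """DOI-keyed paper lookup that requests only the abstract field."""
    return f"{S2_BASE}/paper/DOI:{quote(doi, safe='/')}?fields=abstract"


def pubmed_esearch_url(doi: str) -> str:
    """ESearch step: resolve a DOI to a PMID via the [doi] field tag."""
    query = urlencode({"db": "pubmed", "term": f"{doi}[doi]", "retmode": "json"})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


def pubmed_efetch_url(pmid: str) -> str:
    """EFetch step: fetch the record (including its abstract) for a PMID."""
    query = urlencode({"db": "pubmed", "id": pmid, "retmode": "xml"})
    return f"{EUTILS_BASE}/efetch.fcgi?{query}"


def get_semantic_scholar_abstract(doi: str, http_get: HttpGet) -> Optional[str]:
    """Return the abstract, or None on any non-200 status (404, 429, ...)."""
    status, payload = http_get(semantic_scholar_url(doi))
    if status != 200 or not payload:
        return None
    # The API may return "abstract": null; normalise that to None as well.
    return payload.get("abstract") or None
```

Injecting the transport keeps the status handling testable offline; a real caller would wrap its HTTP client of choice in a function with the same shape.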

CITATION.cff

Lines changed: 1 addition & 1 deletion
@@ -19,6 +19,6 @@ keywords:
 - arxiv
 - research
 license: MIT
-version: "0.1.0"
+version: "0.1.1"
 date-released: "2026-04-17"

README.md

Lines changed: 2 additions & 1 deletion
@@ -65,7 +65,7 @@ OneCite solves this by accepting **any mix of identifiers and text queries** and
 | **Fuzzy Matching** | Match references against multiple academic databases even from incomplete or inaccurate info. |
 | **Multiple Formats** | Input `.txt`/`.bib` → Output **BibTeX**. |
 | **4-stage Pipeline** | A 4-stage process (clean → query → validate → format) to produce consistent output. |
-| **Field Completion** | Enrich entries by filling in missing fields like journal, volume, pages, and authors. |
+| **Field Completion** | Enrich entries by filling in missing fields like journal, volume, pages, authors, and abstract. |
 | 🎓 **7+ Citation Types** | Handles journal articles, conference papers, books, software, datasets, theses, and preprints. |
 | **Multi-Source Lookup** | Queries CrossRef, arXiv, PubMed, Semantic Scholar, Google Books, and others for every entry. |
 | **Many Identifier Types** | Accepts DOI, PMID, arXiv ID, ISBN, GitHub URL, Zenodo DOI, or plain text queries. |
@@ -143,6 +143,7 @@ Your `results.bib` file now contains entries of different types.
 publisher = "Springer Science and Business Media LLC",
 url = "https://doi.org/10.1038/nature14539",
 type = "journal-article",
+abstract = "Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction...",
 }
 @inproceedings{Vaswani2017Attention,
 arxiv = "1706.03762",

docs/api/pipeline.rst

Lines changed: 35 additions & 6 deletions
@@ -142,17 +142,46 @@ requires.
 - ``author``, ``title``, ``journal`` / ``booktitle``, ``year``
 - ``volume``, ``number``, ``pages``, ``publisher``
 - ``doi``, ``url``, ``arxiv`` / ``arxiv_id``
-- ``abstract`` (from CrossRef when available, otherwise from PubMed's
-``EFetch`` endpoint when the entry has a resolvable PMID)
+- ``abstract`` — returned directly by CrossRef or Semantic Scholar when
+the identification stage resolved the entry through them; otherwise
+filled in by a post-hoc DOI-only cascade described below.
 
 The ``_get_crossref_metadata`` method requests each DOI with a proper
 ``User-Agent`` header and a ``mailto`` query-string parameter, per
 CrossRef's etiquette (fixes #21).
 
-``_complete_fields`` is intentionally a no-op pass-through:
-template-driven field completion from external scrapers was removed
-(see #16, #29). The template now only affects which ``entry_type`` the
-formatter falls back to when classification is ambiguous.
+``_complete_fields`` intentionally performs **only one** kind of
+completion: abstract back-fill, through a DOI-only cascade
+
+.. code-block:: text
+
+Semantic Scholar (/paper/DOI:{doi}?fields=abstract)
+↓ (empty or 4xx)
+PubMed ESearch (DOI → PMID) + EFetch (PMID → abstract)
+
+The cascade is gated by ``allow_abstract_fallback`` and is only invoked
+when the caller's **raw input** carried a DOI; DOIs inferred by fuzzy
+search never trigger it, so a possibly-wrong candidate does not cost
+extra roundtrips. Title-based fallback is intentionally not used
+anywhere on this path — in testing it silently returned the abstract
+of an unrelated paper for at least one DOI
+(``10.1007/s10462-019-09792-7``), which is strictly worse than
+returning ``None`` for downstream semantic cross-checks.
+
+Wider template-driven field completion from external scrapers (the
+Google Scholar path flagged in review #29) was removed in 0.1.0 and is
+**not** being reintroduced here. The template still controls which
+``entry_type`` the formatter falls back to when classification is
+ambiguous, and continues to determine the declared field set; as of
+this release, the default ``journal_article_full`` template lists
+``abstract`` as an optional field so its declaration matches what the
+enricher actually emits.
+
+The legacy kwarg name ``allow_pubmed_fallback`` is retained as a
+deprecated alias for one release cycle and emits
+``DeprecationWarning`` when used — its replacement
+``allow_abstract_fallback`` reflects that the flag gates the full
+Semantic-Scholar + PubMed cascade, not just PubMed.
 
 Stage 4: Format (``FormatterModule``)
 -------------------------------------
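
The deprecated-alias behaviour documented in this hunk could look roughly like the sketch below. It is a stand-in that shows only the keyword handling: `complete_fields` and the `_fallback_enabled` marker are hypothetical, and the precedence when both keywords are passed (logical OR here) is an assumption, since the docs do not specify it.

```python
import warnings
from typing import Optional


def complete_fields(
    entry: dict,
    *,
    allow_abstract_fallback: bool = False,
    allow_pubmed_fallback: Optional[bool] = None,  # deprecated alias
) -> dict:
    """Hypothetical stand-in for _complete_fields; only the kwarg handling is real."""
    if allow_pubmed_fallback is not None:
        warnings.warn(
            "allow_pubmed_fallback is deprecated; use allow_abstract_fallback",
            DeprecationWarning,
            stacklevel=2,  # attribute the warning to the caller's line
        )
        # Assumption: either spelling being truthy enables the cascade.
        allow_abstract_fallback = allow_abstract_fallback or allow_pubmed_fallback

    result = dict(entry)
    # Marker used only in this sketch to make the resolved flag observable.
    result["_fallback_enabled"] = bool(allow_abstract_fallback)
    return result
```

Using `None` as the default for the old keyword lets the function distinguish "not passed" from an explicit `False`, so callers who never used the old name see no warning.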

docs/changelog.rst

Lines changed: 78 additions & 0 deletions
@@ -8,6 +8,84 @@ The format is based on `Keep a Changelog <https://keepachangelog.com/>`_, and th
 Unreleased
 ----------
 
+[0.1.1] - 2026-04-17
+---------------------
+
+Maintenance release focused on **aligning the abstract-retrieval
+semantics across code, templates, docs, tests, and metadata**. No
+breaking public-API changes; the one renamed kwarg keeps its old name
+as a deprecated alias for this release cycle.
+
+Added
+~~~~~
+
+- Abstract retrieval now falls back through a DOI-only cascade when
+CrossRef does not return an abstract: Semantic Scholar
+(``/paper/DOI:{doi}?fields=abstract``) → PubMed (ESearch DOI→PMID,
+then EFetch PMID→abstract). The cascade is only invoked when the
+user's **original raw input** carried a DOI; DOIs inferred by fuzzy
+search do not trigger it, so a possibly-wrong candidate does not cost
+extra roundtrips. In particular, a local BibTeX entry with no
+``doi`` field — regardless of whether other stages would later
+resolve one — does not trigger the abstract cascade.
+- Semantic Scholar search results now carry the ``abstract`` field,
+which propagates through ``_convert_search_metadata`` into the final
+BibTeX output whenever the identification stage already resolved the
+entry through SS.
+- ``EnricherModule._get_semantic_scholar_abstract(doi)`` helper for
+DOI-based Semantic Scholar abstract retrieval. Handles ``404`` /
+``429`` gracefully by returning ``None``.
+- ``_complete_fields`` gained an ``allow_abstract_fallback`` kwarg
+(default ``False``) that gates the new cascade.
+``_enrich_single_entry`` passes ``True`` only when the raw entry
+contributed a DOI.
+- Default ``journal_article_full`` template now lists ``abstract`` as
+an optional field so the declaration matches what the enricher
+emits. The older ``journal_article_with_abstract`` template is
+retained as a compatibility alias and will stay available for at
+least one release cycle.
+- Regression test ``test_enrich_single_entry_no_doi_in_raw_skips_abstract_fallback``
+pinning the "no-DOI-in raw ⇒ no Semantic-Scholar/PubMed network
+call" guarantee at the ``_enrich_single_entry`` layer.
+
+Changed
+~~~~~~~
+
+- ``_get_pubmed_abstract`` now requires a DOI and no longer falls back
+to PubMed title search. The removed title-based path empirically
+returned the abstract of an unrelated paper (e.g. the Zhang 2020 AI
+Review DOI ``10.1007/s10462-019-09792-7`` pulled the abstract of a
+different RSI segmentation paper), which is strictly worse than
+returning ``None`` for downstream semantic cross-checks such as the
+``sci`` skill.
+- Abstract coverage on an internal 10-DOI cross-publisher spot-check
+rose from 4/9 to 8/9. This number is a local indicator, **not** a
+release gate: reproducing it requires a live network and the probe
+scripts are no longer in the repository.
+
+Deprecated
+~~~~~~~~~~
+
+- ``_complete_fields(..., allow_pubmed_fallback=...)`` is deprecated in
+favour of ``allow_abstract_fallback``. The old name still works for
+one release cycle and emits ``DeprecationWarning``. It was renamed
+because the flag actually gates the entire Semantic-Scholar + PubMed
+cascade, not PubMed alone.
+
+Removed
+~~~~~~~
+
+- ``IdentifierModule._check_doi_content_consistency`` and the
+``consistency_score`` / ``low_consistency`` warning path. The fuzzy
+string-similarity score was empirically unable to detect subtle
+LLM-hallucinated references (scored 85/100 on author-only
+hallucinations against a real DOI) and was only surfaced as a
+``logger.warning`` that downstream tools could not observe, producing
+false reassurance. Citation-authenticity verification belongs at
+the abstract-vs-claim semantic layer in the consuming tool
+(e.g. the ``sci`` skill), not at the bibliographic-string layer
+here.
+
 [0.1.0] - 2026-04-17
 ---------------------
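
The shape of the regression test named in the "Added" list can be sketched with `unittest.mock`. The `Enricher` class below is a hypothetical miniature written only for this illustration; the real `EnricherModule` internals are not shown in this commit view, so method bodies and the `resolved_doi` parameter are assumptions.

```python
from unittest import mock


class Enricher:
    """Hypothetical miniature of the enricher, just enough to show the test shape."""

    def _get_semantic_scholar_abstract(self, doi):
        raise RuntimeError("network call not expected in this sketch")

    def _get_pubmed_abstract(self, doi):
        raise RuntimeError("network call not expected in this sketch")

    def _complete_fields(self, entry, allow_abstract_fallback=False):
        if allow_abstract_fallback and not entry.get("abstract"):
            doi = entry["doi"]
            entry["abstract"] = (
                self._get_semantic_scholar_abstract(doi)
                or self._get_pubmed_abstract(doi)
            )
        return entry

    def _enrich_single_entry(self, raw_entry, resolved_doi=None):
        raw_has_doi = bool(raw_entry.get("doi"))
        entry = dict(raw_entry)
        if resolved_doi and not raw_has_doi:
            # An inferred DOI is recorded but does not open the gate.
            entry["doi"] = resolved_doi
        return self._complete_fields(entry, allow_abstract_fallback=raw_has_doi)


def test_no_doi_in_raw_skips_abstract_fallback():
    enricher = Enricher()
    with mock.patch.object(enricher, "_get_semantic_scholar_abstract") as s2, \
         mock.patch.object(enricher, "_get_pubmed_abstract") as pubmed:
        result = enricher._enrich_single_entry(
            {"title": "Deep learning"}, resolved_doi="10.1038/nature14539"
        )
    # Neither abstract source may be touched for a raw entry without a DOI.
    s2.assert_not_called()
    pubmed.assert_not_called()
    assert "abstract" not in result
```

Asserting on the mocked fetchers, rather than on the returned entry alone, is what pins the "no network call" guarantee: a refactor that fetched an abstract and then discarded it would still fail the test.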

docs/output_formats.rst

Lines changed: 1 addition & 0 deletions
@@ -25,6 +25,7 @@ Format Specification
 doi = "10.1038/nature14539",
 title = "Deep Learning",
 author = "LeCun, Yann and Bengio, Yoshua and Hinton, Geoffrey",
+abstract = "Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction...",
 journal = "Nature",
 year = 2015,
 volume = 521,

docs/quick_start.rst

Lines changed: 1 addition & 0 deletions
@@ -51,6 +51,7 @@ Your ``results.bib`` file now contains entries in BibTeX format::
 doi = "10.1038/nature14539",
 title = "Deep learning",
 author = "LeCun, Yann and Bengio, Yoshua and Hinton, Geoffrey",
+abstract = "Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction...",
 journal = "Nature",
 year = 2015,
 volume = 521,

docs/templates.rst

Lines changed: 9 additions & 1 deletion
@@ -8,19 +8,27 @@ Template Basics
 
 Templates define the **structure and metadata requirements** for different citation types. OneCite comes with built-in templates for:
 
-- **journal_article_full** - Journal articles with complete metadata
+- **journal_article_full** - Journal articles with complete metadata (includes ``abstract`` as an optional field)
 - **conference_paper** - Conference proceedings papers
 - **book** - Books and monographs
 - **thesis** - Theses and dissertations
 - **software** - Software and code repositories
 - **dataset** - Research datasets
 
+A legacy ``journal_article_with_abstract`` template is also shipped for
+backwards compatibility with older configurations. Since ``journal_article_full``
+now also declares ``abstract`` as an optional field, the two templates
+behave equivalently for journal articles; new configurations should
+prefer ``journal_article_full`` and treat ``journal_article_with_abstract``
+as deprecated.
+
 Default Templates Location
 ~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 Built-in templates are located in the ``onecite/templates/`` directory:
 
 - ``journal_article_full.yaml``
+- ``journal_article_with_abstract.yaml`` *(deprecated alias of the above)*
 - ``conference_paper.yaml``
 - ``book.yaml``
 - ``thesis.yaml``

onecite/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@
 citations in multiple formats.
 """
 
-__version__ = "0.1.0"
+__version__ = "0.1.1"
 __author__ = "OneCite Team"
 __email__ = "ang@hezhiang.com"
 __license__ = "MIT"
