HzaCode
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 70 additions & 0 deletions
diff --git a/CITATION.cff b/CITATION.cff
Lines changed: 1 addition & 1 deletion
diff --git a/README.md b/README.md
Lines changed: 2 additions & 1 deletion
diff --git a/docs/api/pipeline.rst b/docs/api/pipeline.rst
Lines changed: 35 additions & 6 deletions
diff --git a/docs/changelog.rst b/docs/changelog.rst
Lines changed: 78 additions & 0 deletions
diff --git a/docs/output_formats.rst b/docs/output_formats.rst
Lines changed: 1 addition & 0 deletions
diff --git a/docs/quick_start.rst b/docs/quick_start.rst
Lines changed: 1 addition & 0 deletions
diff --git a/docs/templates.rst b/docs/templates.rst
Lines changed: 9 additions & 1 deletion
diff --git a/onecite/__init__.py b/onecite/__init__.py
Lines changed: 1 addition & 1 deletion
@@ -6,6 +6,76 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/), and this
 
 ## [Unreleased]
 
+## [0.1.1] - 2026-04-17
+
+Maintenance release focused on **aligning the abstract-retrieval
+semantics across code, templates, docs, tests, and metadata**. No
+breaking public-API changes; the one renamed kwarg keeps its old name
+as a deprecated alias for this release cycle.
+
+### Added
+- Abstract retrieval now falls back through a DOI-only cascade when
+  CrossRef does not return an abstract:
+  Semantic Scholar (`/paper/DOI:{doi}?fields=abstract`) → PubMed
+  (ESearch DOI→PMID, then EFetch PMID→abstract). The cascade is only
+  invoked when the user's **original raw input** carried a DOI; DOIs
+  inferred by fuzzy search do not trigger it, so that a possibly-wrong
+  candidate does not cost extra roundtrips. In particular, a local
+  BibTeX entry with no DOI field — regardless of whether other stages
+  would later resolve one — does not trigger the abstract cascade.
+- Semantic Scholar search results now carry the `abstract` field, which
+  propagates through `_convert_search_metadata` into the final BibTeX
+  output whenever the identification stage already resolved the entry
+  through SS.
+- `EnricherModule._get_semantic_scholar_abstract(doi)` helper for
+  DOI-based Semantic Scholar abstract retrieval. Handles `404` / `429`
+  gracefully by returning `None`.
+- `_complete_fields` gained an `allow_abstract_fallback` kwarg
+  (default `False`) that gates the new cascade. `_enrich_single_entry`
+  passes `True` only when the raw entry contributed a DOI.
+- Default `journal_article_full` template now lists `abstract` as an
+  optional field, so the template declaration matches what the enricher
+  emits. The older `journal_article_with_abstract` template is retained
+  as a compatibility alias and will stay available for at least one
+  release cycle.
+- Regression test `test_enrich_single_entry_no_doi_in_raw_skips_abstract_fallback`
+  pinning the "no-DOI-in-raw ⇒ no Semantic-Scholar / PubMed network
+  call" guarantee at the `_enrich_single_entry` layer, so a future
+  refactor of the `raw_has_doi` gate cannot silently start leaking
+  network calls for local-only inputs.
+
+### Changed
+- `_get_pubmed_abstract` now requires a DOI and no longer falls back to
+  PubMed title search. The removed title-based path empirically returned
+  the abstract of an unrelated paper (e.g., the Zhang 2020 AI Review DOI
+  `10.1007/s10462-019-09792-7` pulled the abstract of a different RSI
+  segmentation paper), which is strictly worse than returning `None` for
+  downstream semantic cross-checks such as the `sci` skill.
+- Abstract coverage on an internal 10-DOI cross-publisher spot-check
+  (Nature, Science, PLOS, Cell, IEEE CVPR, Frontiers, arXiv, Springer,
+  ACM, plus one deliberately invalid DOI) rose from 4/9 to 8/9. This
+  number is a local indicator, **not** a release gate: reproducing it
+  requires a live network and the probe scripts are no longer in the
+  repository.
+
+### Deprecated
+- `_complete_fields(..., allow_pubmed_fallback=...)` is deprecated in
+  favour of `allow_abstract_fallback`. The old name still works for one
+  release cycle and emits `DeprecationWarning`. It was renamed because
+  the flag actually gates the entire Semantic-Scholar + PubMed cascade,
+  not PubMed alone.
+
+### Removed
+- `IdentifierModule._check_doi_content_consistency` and the
+  `consistency_score` / `low_consistency` warning path. The fuzzy
+  string-similarity score was empirically unable to detect subtle
+  LLM-hallucinated references (scored 85/100 on author-only
+  hallucinations against a real DOI) and was only surfaced as a
+  `logger.warning` that downstream tools could not observe, producing
+  false reassurance. Citation authenticity verification belongs at the
+  abstract-vs-claim semantic layer in the consuming tool (e.g. the
+  `sci` skill), not at the bibliographic-string layer here.
+
 ## [0.1.0] - 2026-04-17
 
 First formal PyPI release since `0.0.12`.  Incorporates the complete
 
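The `_get_semantic_scholar_abstract(doi)` entry in the 0.1.1 notes above says the helper returns `None` on `404` / `429`. A minimal sketch of that behaviour follows; it is not OneCite's actual implementation. The function name `fetch_s2_abstract`, the injectable `opener` parameter, the timeout, and the full base URL are assumptions made for illustration (the changelog only gives the `/paper/DOI:{doi}?fields=abstract` path).

```python
import json
import urllib.error
import urllib.parse
import urllib.request
from typing import Callable, Optional

# Assumption: the public Graph API base URL; the changelog only states the path.
S2_PAPER_URL = (
    "https://api.semanticscholar.org/graph/v1/paper/DOI:{doi}?fields=abstract"
)


def fetch_s2_abstract(
    doi: str,
    opener: Callable = urllib.request.urlopen,  # injectable, so tests need no network
    timeout: float = 10.0,
) -> Optional[str]:
    """Return the abstract for `doi`, or None when none can be retrieved."""
    url = S2_PAPER_URL.format(doi=urllib.parse.quote(doi, safe="/"))
    try:
        with opener(url, timeout=timeout) as response:
            payload = json.load(response)
    except urllib.error.HTTPError as exc:
        # 404 (unknown DOI) and 429 (rate limited) both mean "no abstract".
        if exc.code in (404, 429):
            return None
        raise  # other HTTP errors are left for the caller to handle
    # A missing, null, or empty "abstract" value is normalised to None.
    return payload.get("abstract") or None
```

Passing the opener in as a parameter is a design choice for this sketch only: it lets the `None`-on-`404`/`429` contract be checked offline with a fake response object.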
@@ -19,6 +19,6 @@ keywords:
   - arxiv
   - research
 license: MIT
-version: "0.1.0"
+version: "0.1.1"
 date-released: "2026-04-17"
 
@@ -65,7 +65,7 @@ OneCite solves this by accepting **any mix of identifiers and text queries** and
 | **Fuzzy Matching**          | Match references against multiple academic databases even from incomplete or inaccurate info.         |
 | **Multiple Formats**        | Input `.txt`/`.bib` → Output **BibTeX**.                                                             |
 | **4-stage Pipeline**        | A 4-stage process (clean → query → validate → format) to produce consistent output.                  |
-| **Field Completion**        | Enrich entries by filling in missing fields like journal, volume, pages, and authors.                |
+| **Field Completion**        | Enrich entries by filling in missing fields like journal, volume, pages, authors, and abstract. |
 | 🎓 **7+ Citation Types**    | Handles journal articles, conference papers, books, software, datasets, theses, and preprints.        |
 | **Multi-Source Lookup**     | Queries CrossRef, arXiv, PubMed, Semantic Scholar, Google Books, and others for every entry.         |
 | **Many Identifier Types**   | Accepts DOI, PMID, arXiv ID, ISBN, GitHub URL, Zenodo DOI, or plain text queries.                    |
@@ -143,6 +143,7 @@ Your `results.bib` file now contains entries of different types.
   publisher = "Springer Science and Business Media LLC",
   url = "https://doi.org/10.1038/nature14539",
   type = "journal-article",
+  abstract = "Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction...",
 }
 @inproceedings{Vaswani2017Attention,
   arxiv = "1706.03762",
 
@@ -142,17 +142,46 @@ requires.
 - ``author``, ``title``, ``journal`` / ``booktitle``, ``year``
 - ``volume``, ``number``, ``pages``, ``publisher``
 - ``doi``, ``url``, ``arxiv`` / ``arxiv_id``
-- ``abstract`` (from CrossRef when available, otherwise from PubMed's
-  ``EFetch`` endpoint when the entry has a resolvable PMID)
+- ``abstract`` — returned directly by CrossRef or Semantic Scholar when
+  the identification stage resolved the entry through them; otherwise
+  filled in by a post-hoc DOI-only cascade described below.
 
 The ``_get_crossref_metadata`` method requests each DOI with a proper
 ``User-Agent`` header and a ``mailto`` query-string parameter, per
 CrossRef's etiquette (fixes #21).
 
-``_complete_fields`` is intentionally a no-op pass-through:
-template-driven field completion from external scrapers was removed
-(see #16, #29).  The template now only affects which ``entry_type`` the
-formatter falls back to when classification is ambiguous.
+``_complete_fields`` intentionally performs **only one** kind of
+completion, abstract back-fill, through a DOI-only cascade:
+
+.. code-block:: text
+
+    Semantic Scholar (/paper/DOI:{doi}?fields=abstract)
+      ↓  (empty or 4xx)
+    PubMed ESearch (DOI → PMID) + EFetch (PMID → abstract)
+
+The cascade is gated by ``allow_abstract_fallback`` and is only invoked
+when the caller's **raw input** carried a DOI; DOIs inferred by fuzzy
+search never trigger it, so a possibly-wrong candidate does not cost
+extra roundtrips.  Title-based fallback is intentionally not used
+anywhere on this path — in testing it silently returned the abstract
+of an unrelated paper for at least one DOI
+(``10.1007/s10462-019-09792-7``), which is strictly worse than
+returning ``None`` for downstream semantic cross-checks.
+
+Wider template-driven field completion from external scrapers (the
+Google Scholar path flagged in review #29) was removed in 0.1.0 and is
+**not** being reintroduced here.  The template still controls which
+``entry_type`` the formatter falls back to when classification is
+ambiguous, and continues to determine the declared field set; as of
+this release, the default ``journal_article_full`` template lists
+``abstract`` as an optional field so its declaration matches what the
+enricher actually emits.
+
+The legacy kwarg name ``allow_pubmed_fallback`` is retained as a
+deprecated alias for one release cycle and emits
+``DeprecationWarning`` when used — its replacement
+``allow_abstract_fallback`` reflects that the flag gates the full
+Semantic-Scholar + PubMed cascade, not just PubMed.
 
 Stage 4: Format (``FormatterModule``)
 -------------------------------------
 
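The gating rule documented in the `pipeline.rst` hunk above (the cascade runs only when the raw input carried a DOI, and stops at the first source that returns an abstract) can be sketched as below. This is an illustration under stated assumptions, not the code in `EnricherModule`: the names `backfill_abstract` and `enrich`, the dict-shaped entries, and the idea of passing the Semantic Scholar and PubMed lookups in as plain callables are all hypothetical.

```python
from typing import Callable, Optional, Sequence

# Hypothetical fetcher type: takes a DOI, returns an abstract or None.
AbstractFetcher = Callable[[str], Optional[str]]


def backfill_abstract(
    entry: dict,
    fetchers: Sequence[AbstractFetcher],
    allow_abstract_fallback: bool = False,
) -> dict:
    """Return a copy of `entry`, with `abstract` filled in when possible."""
    result = dict(entry)
    doi = result.get("doi")
    # Skip the cascade when an abstract already exists (e.g. from CrossRef),
    # when the gate is closed, or when there is no DOI to look up.
    if result.get("abstract") or not allow_abstract_fallback or not doi:
        return result
    for fetch in fetchers:  # e.g. Semantic Scholar first, then PubMed
        abstract = fetch(doi)
        if abstract:
            result["abstract"] = abstract
            break  # first non-empty answer wins
    return result


def enrich(raw_entry: dict, identified: dict,
           fetchers: Sequence[AbstractFetcher]) -> dict:
    # The gate depends on the raw input only: a DOI inferred later by
    # fuzzy search lives in `identified` and does not open it.
    raw_has_doi = bool(raw_entry.get("doi"))
    return backfill_abstract(
        identified, fetchers, allow_abstract_fallback=raw_has_doi
    )
```

With fetchers that record their calls, an entry whose raw form has no `doi` key produces zero lookups even when `identified` carries a DOI, which is the property the regression test in this release pins down.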
@@ -8,6 +8,84 @@ The format is based on `Keep a Changelog <https://keepachangelog.com/>`_, and th
 Unreleased
 ----------
 
+[0.1.1] - 2026-04-17
+---------------------
+
+Maintenance release focused on **aligning the abstract-retrieval
+semantics across code, templates, docs, tests, and metadata**.  No
+breaking public-API changes; the one renamed kwarg keeps its old name
+as a deprecated alias for this release cycle.
+
+Added
+~~~~~
+
+- Abstract retrieval now falls back through a DOI-only cascade when
+  CrossRef does not return an abstract: Semantic Scholar
+  (``/paper/DOI:{doi}?fields=abstract``) → PubMed (ESearch DOI→PMID,
+  then EFetch PMID→abstract).  The cascade is only invoked when the
+  user's **original raw input** carried a DOI; DOIs inferred by fuzzy
+  search do not trigger it, so a possibly-wrong candidate does not cost
+  extra roundtrips.  In particular, a local BibTeX entry with no
+  ``doi`` field — regardless of whether other stages would later
+  resolve one — does not trigger the abstract cascade.
+- Semantic Scholar search results now carry the ``abstract`` field,
+  which propagates through ``_convert_search_metadata`` into the final
+  BibTeX output whenever the identification stage already resolved the
+  entry through SS.
+- ``EnricherModule._get_semantic_scholar_abstract(doi)`` helper for
+  DOI-based Semantic Scholar abstract retrieval.  Handles ``404`` /
+  ``429`` gracefully by returning ``None``.
+- ``_complete_fields`` gained an ``allow_abstract_fallback`` kwarg
+  (default ``False``) that gates the new cascade.
+  ``_enrich_single_entry`` passes ``True`` only when the raw entry
+  contributed a DOI.
+- Default ``journal_article_full`` template now lists ``abstract`` as
+  an optional field so the declaration matches what the enricher
+  emits.  The older ``journal_article_with_abstract`` template is
+  retained as a compatibility alias and will stay available for at
+  least one release cycle.
+- Regression test ``test_enrich_single_entry_no_doi_in_raw_skips_abstract_fallback``
+  pinning the "no-DOI-in-raw ⇒ no Semantic-Scholar/PubMed network
+  call" guarantee at the ``_enrich_single_entry`` layer.
+
+Changed
+~~~~~~~
+
+- ``_get_pubmed_abstract`` now requires a DOI and no longer falls back
+  to PubMed title search.  The removed title-based path empirically
+  returned the abstract of an unrelated paper (e.g. the Zhang 2020 AI
+  Review DOI ``10.1007/s10462-019-09792-7`` pulled the abstract of a
+  different RSI segmentation paper), which is strictly worse than
+  returning ``None`` for downstream semantic cross-checks such as the
+  ``sci`` skill.
+- Abstract coverage on an internal 10-DOI cross-publisher spot-check
+  (nine valid DOIs plus one deliberately invalid DOI) rose from 4/9 to
+  8/9.  This is a local indicator, **not** a release gate: reproducing
+  it needs a live network, and the probe scripts are no longer in the repository.
+
+Deprecated
+~~~~~~~~~~
+
+- ``_complete_fields(..., allow_pubmed_fallback=...)`` is deprecated in
+  favour of ``allow_abstract_fallback``.  The old name still works for
+  one release cycle and emits ``DeprecationWarning``.  It was renamed
+  because the flag actually gates the entire Semantic-Scholar + PubMed
+  cascade, not PubMed alone.
+
+Removed
+~~~~~~~
+
+- ``IdentifierModule._check_doi_content_consistency`` and the
+  ``consistency_score`` / ``low_consistency`` warning path.  The fuzzy
+  string-similarity score was empirically unable to detect subtle
+  LLM-hallucinated references (scored 85/100 on author-only
+  hallucinations against a real DOI) and was only surfaced as a
+  ``logger.warning`` that downstream tools could not observe, producing
+  false reassurance.  Citation-authenticity verification belongs at
+  the abstract-vs-claim semantic layer in the consuming tool
+  (e.g. the ``sci`` skill), not at the bibliographic-string layer
+  here.
+
 [0.1.0] - 2026-04-17
 ---------------------
 
 
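The Deprecated entry above describes a renamed kwarg whose old name still works for one release cycle and emits `DeprecationWarning`. One common way to write such an alias is sketched below. The function name `resolve_abstract_gate` is hypothetical, and how the two names combine when both are passed is an assumption, since the changelog does not say.

```python
import warnings
from typing import Optional


def resolve_abstract_gate(
    allow_abstract_fallback: bool = False,
    allow_pubmed_fallback: Optional[bool] = None,  # deprecated alias
) -> bool:
    """Return the effective gate value, honouring the deprecated name."""
    if allow_pubmed_fallback is not None:
        warnings.warn(
            "allow_pubmed_fallback is deprecated; "
            "use allow_abstract_fallback instead",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller's line
        )
        # Assumption: either name being truthy opens the gate.
        allow_abstract_fallback = allow_abstract_fallback or allow_pubmed_fallback
    return allow_abstract_fallback
```

Using `None` as the alias default, rather than `False`, lets the function tell "caller did not pass the old name" apart from "caller passed `False`", so the warning fires only for callers who actually use the old spelling.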
@@ -25,6 +25,7 @@ Format Specification
       doi = "10.1038/nature14539",
       title = "Deep Learning",
       author = "LeCun, Yann and Bengio, Yoshua and Hinton, Geoffrey",
+      abstract = "Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction...",
       journal = "Nature",
       year = 2015,
       volume = 521,
 
@@ -51,6 +51,7 @@ Your ``results.bib`` file now contains entries in BibTeX format::
       doi = "10.1038/nature14539",
       title = "Deep learning",
       author = "LeCun, Yann and Bengio, Yoshua and Hinton, Geoffrey",
+      abstract = "Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction...",
       journal = "Nature",
       year = 2015,
       volume = 521,
 
@@ -8,19 +8,27 @@ Template Basics
 
 Templates define the **structure and metadata requirements** for different citation types. OneCite comes with built-in templates for:
 
-- **journal_article_full** - Journal articles with complete metadata
+- **journal_article_full** - Journal articles with complete metadata (includes ``abstract`` as an optional field)
 - **conference_paper** - Conference proceedings papers
 - **book** - Books and monographs
 - **thesis** - Theses and dissertations
 - **software** - Software and code repositories
 - **dataset** - Research datasets
 
+A legacy ``journal_article_with_abstract`` template is also shipped for
+backwards compatibility with older configurations. Since ``journal_article_full``
+now also declares ``abstract`` as an optional field, the two templates
+behave equivalently for journal articles; new configurations should
+prefer ``journal_article_full`` and treat ``journal_article_with_abstract``
+as deprecated.
+
 Default Templates Location
 ~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 Built-in templates are located in the ``onecite/templates/`` directory:
 
 - ``journal_article_full.yaml``
+- ``journal_article_with_abstract.yaml`` *(deprecated alias of the above)*
 - ``conference_paper.yaml``
 - ``book.yaml``
 - ``thesis.yaml``
 
@@ -8,7 +8,7 @@
 citations in multiple formats.
 """
 
-__version__ = "0.1.0"
+__version__ = "0.1.1"
 __author__ = "OneCite Team"
 __email__ = "ang@hezhiang.com"
 __license__ = "MIT"