
Commit e5934f2

author
Ang
committed
docs: fix docs
1 parent f1782e9 commit e5934f2

6 files changed

Lines changed: 71 additions & 80 deletions


CHANGELOG.md

Lines changed: 29 additions & 33 deletions
Original file line number | Diff line number | Diff line change
@@ -67,19 +67,17 @@ as a deprecated alias for this release cycle.
6767

6868
### Removed
6969
- `IdentifierModule._check_doi_content_consistency` and the
70-
`consistency_score` / `low_consistency` warning path. The fuzzy
71-
string-similarity score was empirically unable to detect subtle
72-
LLM-hallucinated references (scored 85/100 on author-only
73-
hallucinations against a real DOI) and was only surfaced as a
74-
`logger.warning` that downstream tools could not observe, producing
75-
false reassurance. Citation authenticity verification belongs at the
76-
abstract-vs-claim semantic layer in the consuming tool (e.g. the
77-
`sci` skill), not at the bibliographic-string layer here.
70+
`consistency_score` / `low_consistency` warning path. A fuzzy
71+
string-similarity score on bibliographic fields is not a reliable
72+
signal for detecting fabricated references, and it was only emitted
73+
as a `logger.warning` that downstream tools could not act on.
74+
Citation-authenticity verification belongs at the abstract-vs-claim
75+
semantic layer in the consuming tool, not at the bibliographic-string
76+
layer here.
7877

7978
## [0.1.0] - 2026-04-17
8079

81-
First formal PyPI release since `0.0.12`. Incorporates the complete
82-
pyOpenSci review pass (issues #3, #5–#34, #36) plus follow-up cleanup.
80+
First formal PyPI release since `0.0.12`.
8381

8482
### Added
8583
- RST documentation using Sphinx
@@ -94,25 +92,25 @@ pyOpenSci review pass (issues #3, #5–#34, #36) plus follow-up cleanup.
9492
- **Split monolithic `pipeline.py` (~3000 lines)** into a proper
9593
`onecite/pipeline/` package with one module per stage
9694
(`parser.py` / `identifier.py` / `enricher.py` / `formatter.py`)
97-
plus a `_utils.py` for shared helpers (#17). Public imports
95+
plus a `_utils.py` for shared helpers. Public imports
9896
(`from onecite.pipeline import IdentifierModule`) and mocking targets
9997
(`patch("onecite.pipeline.requests.get", ...)`) continue to work
10098
unchanged because `__init__.py` re-exports every public symbol and
10199
keeps `requests` at the package level.
102-
- Unify CrossRef request and parsing methods (#26); all CrossRef calls
100+
- Unify CrossRef request and parsing methods; all CrossRef calls
103101
now go through a single helper with a proper `User-Agent` header and
104-
`mailto` query-string parameter (#21).
102+
`mailto` query-string parameter.
105103
- Rewrite fuzzy-search scoring as a weighted title / author / year /
106104
venue model with three confidence tiers (auto-adopt / interactive /
107-
cautious) and a unified low-confidence threshold (#3, #23, #27).
105+
cautious) and a unified low-confidence threshold.
108106
- Simplify identifier routing; CrossRef and Semantic Scholar are always
109107
consulted for text queries, with signal-based additional queries to
110-
PubMed / Google Books / OpenAIRE / BASE (#8, #23).
111-
- Use `bibtexparser.dumps()` for BibTeX rendering (#30).
108+
PubMed / Google Books / OpenAIRE / BASE.
109+
- Use `bibtexparser.dumps()` for BibTeX rendering.
112110
- Expose `use_google_scholar` as a real CLI flag and API parameter
113-
instead of a hard-coded `False` (#10).
111+
instead of a hard-coded `False`.
114112
- Clarify that templates define metadata-field requirements and a
115-
fallback BibTeX entry type, not output formatting (#16, #29).
113+
fallback BibTeX entry type, not output formatting.
116114
- Refactored exception hierarchy
117115
- Added type hints to Python API
118116
- Updated README examples
@@ -125,42 +123,40 @@ pyOpenSci review pass (issues #3, #5–#34, #36) plus follow-up cleanup.
125123
- APA and MLA output renderers; they produced inconsistent output and
126124
the CLI now rejects anything other than `--output-format bibtex`.
127125
Users wanting APA/MLA should post-process the BibTeX through pandoc
128-
or citeproc-py (#31, #32).
126+
or citeproc-py.
129127
- Hard-coded "well-known paper" shortcut that masked failures on the
130-
main example input (#19).
128+
main example input.
131129
- MCP integration page and all related references
132130
- `.readthedocs.yml` (docs now hosted on GitHub Pages)
133131
- `docs/_build/` build artifacts from repository
134132

135133
### Fixed
136134
- README / `docs/index.rst` / `docs/faq.rst` no longer advertise
137135
OpenAlex or dblp as data sources — they were never wired into the
138-
code (#6).
136+
code.
139137
- README quick-start example now shows `booktitle` (NeurIPS) instead
140-
of `journal = "arXiv preprint"` for the `@inproceedings` sample
141-
(#28).
138+
of `journal = "arXiv preprint"` for the `@inproceedings` sample.
142139
- `docs/api/pipeline.rst` rewritten to match the actual module
143140
structure; removed references to classes and methods that never
144141
existed (`Validator` / `Identifier` / `Completer` / `Formatter`,
145-
`set_source_priority`, `set_timeout`, `add_template_path`) (#11).
142+
`set_source_priority`, `set_timeout`, `add_template_path`).
146143
- `docs/output_formats.rst`, `docs/faq.rst`, `docs/quick_start.rst`,
147144
`docs/python_api.rst`, `docs/templates.rst`, `docs/index.rst` and
148145
docstrings in `core.py` / `formatter.py` no longer advertise APA /
149-
MLA output (#31, #32).
146+
MLA output.
150147
- Crossref author names parsed as `given family` instead of mangled
151-
concatenations (#22).
148+
concatenations.
152149
- Semantic Scholar HTTP 429 responses return an empty candidate list
153-
cleanly instead of bubbling up (#25).
150+
cleanly instead of bubbling up.
154151
- Previously-unused exception classes (`ParseError`, `ValidationError`,
155-
`FormatError`) are now actually raised in the right places (#13).
152+
`FormatError`) are now actually raised in the right places.
156153
- `CONTRIBUTING.md` no longer tells developers to use a `requirements.txt`
157-
that does not exist; the documented install is `pip install -e .[dev]`
158-
(#12).
154+
that does not exist; the documented install is `pip install -e .[dev]`.
159155
- `black` formatting is enforced via `pyproject.toml` `[tool.black]`
160-
plus a pre-commit hook (#15).
161-
- URL-bearing entries are no longer queried twice (#20).
156+
plus a pre-commit hook.
157+
- URL-bearing entries are no longer queried twice.
162158
- Fallback paths mark entries as `identification_failed` rather than
163-
fabricating plausible-looking but invented metadata (#24).
159+
fabricating plausible-looking but invented metadata.
164160
- CrossRef and Semantic Scholar response parsing edge cases
165161
- API documentation using incorrect return value fields (`output_content` -> `results`)
166162
- Version number inconsistencies across metadata files
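
The "Changed" entry above about routing every CrossRef call through one helper with a `User-Agent` header and a `mailto` query parameter can be illustrated roughly as follows. This is only a sketch under stated assumptions: the function name `build_crossref_request`, the `tool` string and the e-mail address are hypothetical, and the standard library is used here so the example is self-contained, whereas the project itself appears to go through `requests`. The request is built but never sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Public CrossRef REST endpoint for works queries.
CROSSREF_WORKS_URL = "https://api.crossref.org/works"


def build_crossref_request(query: str, mailto: str,
                           tool: str = "example-tool/0.0") -> Request:
    """Build (but do not send) a CrossRef query that identifies the caller.

    Hypothetical helper for illustration; not OneCite's actual function.
    """
    # `mailto` travels in the query string, as the changelog entry describes.
    params = urlencode({
        "query.bibliographic": query,
        "rows": 5,
        "mailto": mailto,
    })
    # The same contact address is repeated in the User-Agent header.
    headers = {"User-Agent": f"{tool} (mailto:{mailto})"}
    return Request(f"{CROSSREF_WORKS_URL}?{params}", headers=headers)


req = build_crossref_request("attention is all you need", "dev@example.org")
```

Keeping both pieces of contact information in a single builder means no call site can forget one of them, which is presumably the point of unifying the CrossRef calls.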

docs/api/pipeline.rst

Lines changed: 6 additions & 6 deletions
@@ -12,10 +12,9 @@ into formatted BibTeX:
1212
3. **Enrich** — fetch full metadata for the identified entries
1313
4. **Format** — render the completed entries as BibTeX
1414

15-
Since pyOpenSci review issue #17, the implementation lives in the
16-
``onecite/pipeline/`` package with one module per stage. For
17-
backwards-compatibility all public symbols remain importable from
18-
``onecite.pipeline``:
15+
The implementation lives in the ``onecite/pipeline/`` package with one
16+
module per stage. For backwards-compatibility all public symbols remain
17+
importable from ``onecite.pipeline``:
1918

2019
.. code-block:: python
2120
@@ -116,8 +115,9 @@ year / venue similarity to the query. The decision logic in
116115
- ``match_score >= 50`` and a title is present → adopt cautiously
117116
- otherwise → mark the entry as ``identification_failed``
118117

119-
This matches what the pyOpenSci review flagged in issues #3, #23 and
120-
#27. Fallbacks never fabricate data (see #24).
118+
Fallback paths never fabricate data: an entry that cannot be resolved is
119+
marked ``identification_failed`` rather than filled with invented
120+
metadata.
121121

122122
.. code-block:: python
123123
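
The tiered decision described in this hunk could be sketched as below. Only the cautious rule (`match_score >= 50` with a title present) and the `identification_failed` fallback are visible in the diff; the cut-offs for the two upper tiers are not shown, so they are taken here as required, hypothetical parameters rather than guessed. The names `Decision` and `decide` are likewise assumptions for illustration, as is the choice not to require a title in the upper tiers.

```python
from enum import Enum
from typing import Optional


class Decision(Enum):
    AUTO_ADOPT = "auto_adopt"
    INTERACTIVE = "interactive"
    CAUTIOUS = "cautious"
    FAILED = "identification_failed"


def decide(match_score: float, title: Optional[str], *,
           auto_threshold: float,
           interactive_threshold: float,
           cautious_threshold: float = 50.0) -> Decision:
    """Map a match score to a confidence tier.

    `auto_threshold` and `interactive_threshold` are hypothetical: the
    real values are outside the visible diff context.
    """
    if match_score >= auto_threshold:
        return Decision.AUTO_ADOPT
    if match_score >= interactive_threshold:
        return Decision.INTERACTIVE
    # Visible in the diff: >= 50 and a title present -> adopt cautiously.
    if match_score >= cautious_threshold and title:
        return Decision.CAUTIOUS
    # Never fabricate: an unresolved entry is flagged, not filled in.
    return Decision.FAILED
```

Returning an explicit `FAILED` value, instead of a partially invented record, keeps the failure observable to whatever consumes the result.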

docs/changelog.rst

Lines changed: 25 additions & 30 deletions
@@ -76,22 +76,18 @@ Removed
7676
~~~~~~~
7777

7878
- ``IdentifierModule._check_doi_content_consistency`` and the
79-
``consistency_score`` / ``low_consistency`` warning path. The fuzzy
80-
string-similarity score was empirically unable to detect subtle
81-
LLM-hallucinated references (scored 85/100 on author-only
82-
hallucinations against a real DOI) and was only surfaced as a
83-
``logger.warning`` that downstream tools could not observe, producing
84-
false reassurance. Citation-authenticity verification belongs at
85-
the abstract-vs-claim semantic layer in the consuming tool
86-
(e.g. the ``sci`` skill), not at the bibliographic-string layer
87-
here.
79+
``consistency_score`` / ``low_consistency`` warning path. A fuzzy
80+
string-similarity score on bibliographic fields is not a reliable
81+
signal for detecting fabricated references, and it was only emitted
82+
as a ``logger.warning`` that downstream tools could not act on.
83+
Citation-authenticity verification belongs at the abstract-vs-claim
84+
semantic layer in the consuming tool, not at the bibliographic-string
85+
layer here.
8886

8987
[0.1.0] - 2026-04-17
9088
---------------------
9189

92-
First formal PyPI release since ``0.0.12``. Incorporates the complete
93-
pyOpenSci review pass (issues #3, #5–#34, #36) plus follow-up cleanup.
94-
See ``CHANGELOG.md`` at the repository root for the full per-issue list.
90+
First formal PyPI release since ``0.0.12``.
9591

9692
Added
9793
~~~~~
@@ -108,18 +104,18 @@ Changed
108104
~~~~~~~
109105

110106
- **Split monolithic pipeline.py (~3000 lines)** into a proper
111-
``onecite/pipeline/`` package with one module per stage (#17)
107+
``onecite/pipeline/`` package with one module per stage
112108
- Unify CrossRef request and parsing methods, with ``User-Agent`` and
113-
``mailto`` set per CrossRef etiquette (#21, #26)
109+
``mailto`` set per CrossRef etiquette
114110
- Rewrite fuzzy-search scoring as a weighted title/author/year/venue
115-
model with three confidence tiers (#3, #23, #27)
111+
model with three confidence tiers
116112
- Simplify identifier routing; CrossRef and Semantic Scholar are the
117113
always-on sources, with signal-based PubMed / Google Books /
118-
OpenAIRE / BASE queries (#8, #23)
119-
- Use ``bibtexparser.dumps()`` for BibTeX rendering (#30)
120-
- Expose ``use_google_scholar`` as a real CLI flag and API parameter (#10)
114+
OpenAIRE / BASE queries
115+
- Use ``bibtexparser.dumps()`` for BibTeX rendering
116+
- Expose ``use_google_scholar`` as a real CLI flag and API parameter
121117
- Clarify that templates define metadata-field requirements and a
122-
fallback BibTeX entry type, not output formatting (#16, #29)
118+
fallback BibTeX entry type, not output formatting
123119
- Refactored exception hierarchy
124120
- Added type hints to Python API
125121

@@ -128,9 +124,9 @@ Removed
128124

129125
- APA and MLA output renderers; the CLI now rejects anything other than
130126
``--output-format bibtex``. Use pandoc or citeproc-py to convert the
131-
generated BibTeX to APA / MLA (#31, #32)
127+
generated BibTeX to APA / MLA
132128
- Hard-coded "well-known paper" shortcut that masked failures on the
133-
main example input (#19)
129+
main example input
134130
- MCP integration page and all related references
135131
- ``.readthedocs.yml`` (docs now hosted on GitHub Pages)
136132
- ``docs/_build/`` build artifacts from repository
@@ -139,20 +135,19 @@ Fixed
139135
~~~~~
140136

141137
- OpenAlex and dblp no longer listed as data sources — they were never
142-
wired into the code (#6)
138+
wired into the code
143139
- ``docs/api/pipeline.rst`` rewritten to match the real modules;
144-
removed references to nonexistent classes / methods (#11)
140+
removed references to nonexistent classes / methods
145141
- README and docs ``@inproceedings`` example now uses ``booktitle``
146-
instead of ``journal = "arXiv preprint"`` (#28)
147-
- Crossref author names parsed as ``given family`` (#22)
148-
- Semantic Scholar HTTP 429 handled cleanly (#25)
142+
instead of ``journal = "arXiv preprint"``
143+
- Crossref author names parsed as ``given family``
144+
- Semantic Scholar HTTP 429 handled cleanly
149145
- Previously-unused exception classes now raised in the right places
150-
(#13)
151146
- ``CONTRIBUTING.md`` documents ``pip install -e .[dev]`` instead of
152-
the non-existent ``requirements.txt`` (#12)
153-
- URL-bearing entries no longer queried twice (#20)
147+
the non-existent ``requirements.txt``
148+
- URL-bearing entries no longer queried twice
154149
- Fallback paths mark entries as ``identification_failed`` rather than
155-
fabricating invented metadata (#24)
150+
fabricating invented metadata
156151
- CrossRef and Semantic Scholar response parsing edge cases
157152
- API documentation using incorrect return value fields
158153
- Version number inconsistencies across metadata files
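
For the "Crossref author names parsed as ``given family``" fix listed above, a minimal sketch of the idea follows. It assumes the usual shape of author objects in CrossRef JSON (`given` and `family` keys, with `name` used for organisational authors); the function name `format_crossref_author` is hypothetical and this is not the project's actual parser.

```python
from typing import Any, Dict


def format_crossref_author(author: Dict[str, Any]) -> str:
    """Render one CrossRef author object as 'given family'.

    Hypothetical helper for illustration only.
    """
    given = (author.get("given") or "").strip()
    family = (author.get("family") or "").strip()
    # Join only the parts that are present, so a missing given name
    # does not leave a stray leading space.
    full = " ".join(part for part in (given, family) if part)
    # Organisational authors typically carry a single `name` field.
    return full or (author.get("name") or "").strip()


authors = [
    {"given": "Ashish", "family": "Vaswani"},
    {"family": "Plato"},
    {"name": "The Example Consortium"},
]
formatted = [format_crossref_author(a) for a in authors]
```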

onecite/pipeline/__init__.py

Lines changed: 3 additions & 3 deletions
@@ -3,9 +3,9 @@
33

44
"""OneCite's 4-stage processing pipeline.
55
6-
Historically this lived in a single ``pipeline.py`` of ~3000 lines. It was
7-
split per pyOpenSci review issue #17 into one module per stage. All public
8-
symbols are re-exported here so callers and tests that do
6+
Historically this lived in a single ``pipeline.py`` of ~3000 lines. It has
7+
been split into one module per stage. All public symbols are re-exported
8+
here so callers and tests that do
99
1010
from onecite.pipeline import IdentifierModule
1111
import onecite.pipeline as pm # and then: patch("onecite.pipeline.requests.get", ...)

onecite/pipeline/enricher.py

Lines changed: 6 additions & 5 deletions
@@ -581,11 +581,12 @@ def _complete_fields(self, base_record: Dict, template: Dict,
581581
``10.1007/s10462-019-09792-7``), which is strictly worse for
582582
downstream semantic checks than returning ``None``.
583583
584-
Older versions of this function attempted template-driven completion
585-
of many fields across several sources (including Google Scholar),
586-
which the pyOpenSci review (#29) correctly flagged as a no-op in
587-
the default CLI path and as structurally wrong. That machinery is
588-
not being reintroduced. The narrow abstract cascade here is
584+
Older versions of this function attempted template-driven
585+
completion of many fields across several sources (including Google
586+
Scholar), which was a no-op in the default CLI path and
587+
structurally wrong (the declared sources were never actually wired
588+
for broad field completion). That machinery is not being
589+
reintroduced. The narrow abstract cascade here is
589590
directly observable by downstream tools via the ``abstract`` field
590591
in the emitted BibTeX and was empirically the only way to bridge
591592
the gap between CrossRef-only (~44% coverage on a 10-DOI

tests/test_pipeline_unit.py

Lines changed: 2 additions & 3 deletions
@@ -1207,9 +1207,8 @@ def test_conference_proceedings_type(self):
12071207

12081208
def test_complete_fields_no_google_scholar_for_abstract(self):
12091209
"""Even when template asks for google_scholar_scraper, we never call
1210-
Google Scholar from _complete_fields (pyOpenSci #29). PubMed is the
1211-
only abstract fallback, and if it returns nothing the result stays
1212-
untouched."""
1210+
Google Scholar from _complete_fields. PubMed is the only abstract
1211+
fallback, and if it returns nothing the result stays untouched."""
12131212
enr = EnricherModule(use_google_scholar=False)
12141213
base = {'title': 'T', 'author': 'A', 'year': '2020'}
12151214
template = {'fields': [{'name': 'abstract', 'source_priority': ['google_scholar_scraper']}]}
