
Commit 9be0085

Author: Ang
refactor: split pipeline.py into per-stage package (#17)
The monolithic onecite/pipeline.py (~3000 lines) is replaced with a proper package:

    onecite/pipeline/
        __init__.py    - re-exports + `requests` at package level
        _utils.py      - _safe_year helper
        parser.py      - ParserModule
        identifier.py  - IdentifierModule (largest stage, kept as one file)
        enricher.py    - EnricherModule
        formatter.py   - FormatterModule

Backward-compat is preserved:
  * `from onecite.pipeline import IdentifierModule` still works
  * `patch('onecite.pipeline.requests.get', ...)` still works because
    __init__.py keeps `import requests` at package level and Python's
    module cache means all child modules share the same `requests` object.

Tests:
  * 3 tests in test_pipeline_unit.py that used
    `patch.object(pipeline_mod, 'scholarly', ...)` now patch the concrete
    submodule (identifier / enricher) where `scholarly` is imported.
  * test_integration.py gained a missing `import pytest` so pytest.skip()
    works when the mocked first pass returns no results.
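The backward-compat claim about `patch('onecite.pipeline.requests.get', ...)` can be checked in isolation. The following is a minimal sketch, not code from this commit: it builds a throwaway package called `demo_pkg` (a hypothetical name) in a temporary directory and uses the standard-library `json` module as a stand-in for `requests`, so that it runs without third-party dependencies. It assumes the child modules call the function through the module name (`json.dumps(...)`), which matches the `requests.get` usage the commit message describes.

```python
import importlib
import sys
import tempfile
from pathlib import Path
from unittest.mock import patch

# Build a throwaway package on disk: demo_pkg/__init__.py and demo_pkg/identifier.py.
# `json` stands in for `requests`; `demo_pkg` and `encode` are hypothetical names.
root = Path(tempfile.mkdtemp())
pkg_dir = root / "demo_pkg"
pkg_dir.mkdir()
(pkg_dir / "__init__.py").write_text("import json  # kept at package level\n")
(pkg_dir / "identifier.py").write_text(
    "import json\n"
    "\n"
    "def encode(obj):\n"
    "    return json.dumps(obj)\n"
)
sys.path.insert(0, str(root))

demo_pkg = importlib.import_module("demo_pkg")
identifier = importlib.import_module("demo_pkg.identifier")

# Both names are bound to the single cached module object in sys.modules.
same_object = demo_pkg.json is identifier.json is sys.modules["json"]

# Patching through the package-level name replaces an attribute on that shared
# module object, so the child module sees the mock as well.
with patch("demo_pkg.json.dumps", return_value="mocked"):
    patched_result = identifier.encode({"a": 1})

# Leaving the context manager restores the real function.
restored_result = identifier.encode({"a": 1})

print(same_object, patched_result, restored_result)
```

Note that this only works because the patch replaces an attribute on the shared module object. A child module that did `from json import dumps` would hold its own reference and would not see the mock.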
1 parent 682e5d7 commit 9be0085

8 files changed

Lines changed: 1072 additions & 976 deletions


onecite/pipeline/__init__.py

Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
+# !/usr/bin/env python3
+# -*- coding: utf-8 -*-
+
+"""OneCite's 4-stage processing pipeline.
+
+Historically this lived in a single ``pipeline.py`` of ~3000 lines. It was
+split per pyOpenSci review issue #17 into one module per stage. All public
+symbols are re-exported here so callers and tests that do
+
+    from onecite.pipeline import IdentifierModule
+    import onecite.pipeline as pm  # and then: patch("onecite.pipeline.requests.get", ...)
+
+keep working unchanged.
+"""
+
+# Keep ``requests`` at package level so that tests which do
+# ``patch("onecite.pipeline.requests.get", ...)`` resolve the attribute
+# correctly. Because Python caches modules, this is the same ``requests``
+# module object that all sub-modules import — so the patch reaches them too.
+import requests  # noqa: F401
+
+from ._utils import _safe_year
+from .parser import ParserModule
+from .identifier import IdentifierModule
+from .enricher import EnricherModule
+from .formatter import FormatterModule
+
+__all__ = [
+    "ParserModule",
+    "IdentifierModule",
+    "EnricherModule",
+    "FormatterModule",
+    "_safe_year",
+]
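This `__init__.py` binds `requests` at package level but not `scholarly`, which is why the tests that used `patch.object(pipeline_mod, 'scholarly', ...)` had to move to the submodule. The sketch below illustrates the difference under stated assumptions: `demo_split`, `lookup`, and the use of `json` as a stand-in for `scholarly` are all hypothetical, chosen so the example runs on the standard library alone. Rebinding a name only works in the namespace where that name is actually looked up.

```python
import importlib
import sys
import tempfile
from pathlib import Path
from unittest.mock import MagicMock, patch

# Throwaway package: the __init__ re-exports `lookup` but never binds `json`.
# `json` plays the role of `scholarly`; all names here are hypothetical.
root = Path(tempfile.mkdtemp())
pkg_dir = root / "demo_split"
pkg_dir.mkdir()
(pkg_dir / "__init__.py").write_text("from .identifier import lookup\n")
(pkg_dir / "identifier.py").write_text(
    "import json\n"
    "\n"
    "def lookup(title):\n"
    "    return json.dumps({'title': title})\n"
)
sys.path.insert(0, str(root))

pkg = importlib.import_module("demo_split")
identifier = importlib.import_module("demo_split.identifier")

# Patching the name on the package fails: the package namespace has no `json`.
try:
    with patch.object(pkg, "json", MagicMock()):
        pass
    package_patch_error = None
except AttributeError as exc:
    package_patch_error = exc

# Patching the name in the submodule that imported it does take effect,
# even when the function is called through the package-level re-export.
fake = MagicMock()
fake.dumps.return_value = "stubbed"
with patch.object(identifier, "json", fake):
    submodule_patch_result = pkg.lookup("some title")

print(type(package_patch_error).__name__, submodule_patch_result)
```

The `requests` case differs because those tests patch an attribute (`get`) on the shared module object rather than rebinding the module name itself.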

onecite/pipeline/_utils.py

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+# !/usr/bin/env python3
+# -*- coding: utf-8 -*-
+
+"""Small shared helpers used by more than one pipeline stage."""
+
+
+def _safe_year(date_obj):
+    """Safely extract year from a CrossRef date object like {'date-parts': [[2015, 3, 1]]}."""
+    if not date_obj:
+        return None
+    parts = date_obj.get('date-parts', [])
+    if parts and isinstance(parts, list) and len(parts) > 0:
+        inner = parts[0]
+        if isinstance(inner, list) and len(inner) > 0:
+            return inner[0]
+    return None
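For reference, a short usage sketch of `_safe_year`, with the function body copied from the diff so the block is self-contained. The sample inputs are illustrative and not taken from the project's test suite.

```python
def _safe_year(date_obj):
    """Safely extract year from a CrossRef date object like {'date-parts': [[2015, 3, 1]]}."""
    if not date_obj:
        return None
    parts = date_obj.get('date-parts', [])
    if parts and isinstance(parts, list) and len(parts) > 0:
        inner = parts[0]
        if isinstance(inner, list) and len(inner) > 0:
            return inner[0]
    return None


# Illustrative inputs (assumed shapes, not real API responses).
full_date = _safe_year({'date-parts': [[2015, 3, 1]]})   # year, month, day
year_only = _safe_year({'date-parts': [[2015]]})         # year alone
empty_inner = _safe_year({'date-parts': [[]]})           # no year present
missing_key = _safe_year({'other': 1})                   # no 'date-parts' key
no_object = _safe_year(None)                             # field absent upstream

print(full_date, year_only, empty_inner, missing_key, no_object)
# prints: 2015 2015 None None None
```

The helper returns whatever sits in the first slot of the first inner list, so a non-integer value in that position would be passed through unchanged.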
