KeyError: 'year' in parse_pubmed_xml when no date element contains a year #176

@KadenMc

Version: latest (pip install pubmed-parser)
Python: 3.12

Description

Calling parse_pubmed_xml on certain PMC OA articles raises a KeyError: 'year'. The function already has fallback logic that tries the "ppub" date first and falls back to "collection" if "year" is absent, but the except clause on the subsequent int() conversion only catches TypeError, not KeyError. If neither date type contains a "year" key, the exception propagates uncaught.

Minimal reproduction

Pass any PMC XML whose <pub-date> elements lack a <year> child (observed in several articles from the PMC Open Access bulk download):

import pubmed_parser as pp
pp.parse_pubmed_xml("path/to/affected_article.xml")

Traceback:

File pubmed_parser/pubmed_oa_parser.py:198, in parse_pubmed_xml(path, ...)
    197     try:
--> 198         pub_year = int(pub_date_dict["year"])
    199     except TypeError:
    200         pub_year = None

KeyError: 'year'
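
To find candidate files in a bulk download, a rough pre-check along these lines may help. It is only a sketch: `has_pub_year` is a hypothetical helper, not part of pubmed_parser, and it assumes the `<pub-date>` and `<year>` elements are not namespaced and that the file parses with the standard library. It also checks every `<pub-date>`, which is broader than the "ppub" and "collection" types the parser consults.

```python
import xml.etree.ElementTree as ET


def has_pub_year(xml_text: str) -> bool:
    """Return True if any <pub-date> element has a non-empty <year> child."""
    root = ET.fromstring(xml_text)
    for pub_date in root.iter("pub-date"):
        year = pub_date.find("year")
        if year is not None and (year.text or "").strip():
            return True
    return False


# Illustrative fragments, not taken from real PMC articles.
WITH_YEAR = (
    '<article><front><article-meta>'
    '<pub-date pub-type="ppub"><month>5</month><year>2020</year></pub-date>'
    '</article-meta></front></article>'
)
WITHOUT_YEAR = (
    '<article><front><article-meta>'
    '<pub-date pub-type="ppub"><month>5</month></pub-date>'
    '</article-meta></front></article>'
)

print(has_pub_year(WITH_YEAR))
print(has_pub_year(WITHOUT_YEAR))
```

Files for which this returns False are the ones worth passing to parse_pubmed_xml when trying to reproduce the error.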

Root cause

In parse_pubmed_xml (around line 192):

pub_date_dict = parse_date(tree, "ppub")
if "year" not in pub_date_dict:
    pub_date_dict = parse_date(tree, "collection")   # fallback
pub_date = format_date(pub_date_dict)

try:
    pub_year = int(pub_date_dict["year"])
except TypeError:          # ← handles int(None) only; KeyError is not caught
    pub_year = None

The if "year" not in pub_date_dict guard correctly handles the "ppub" case, but if the "collection" fallback also lacks "year", pub_date_dict["year"] raises KeyError, which is not caught.
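
The failure mode can be shown without the library, using plain dicts as stand-ins for what parse_date returns. This is a sketch under that assumption: `year_current` is a hypothetical name that mirrors the try/except quoted above, and the dict contents are illustrative.

```python
def year_current(pub_date_dict):
    # Mirrors the current try/except: only TypeError is handled.
    try:
        return int(pub_date_dict["year"])
    except TypeError:
        return None


print(year_current({"year": "2020"}))  # normal case
print(year_current({"year": None}))    # int(None) raises TypeError, which is caught

try:
    year_current({"month": "5"})       # no "year" key at all
except KeyError as exc:
    print("not handled by the current except clause:", repr(exc))
```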

Fix

Extend the except clause to also catch KeyError (a one-line change):

try:
    pub_year = int(pub_date_dict["year"])
except (TypeError, KeyError):   # ← add KeyError
    pub_year = None

Alternatively, use .get to make the KeyError impossible:

pub_year = pub_date_dict.get("year")
pub_year = int(pub_year) if pub_year is not None else None
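
As a quick sanity check, the two variants can be wrapped in helpers and compared on the relevant inputs. The function names and sample dicts below are hypothetical and only stand in for the output of parse_date.

```python
def year_with_except(pub_date_dict):
    # Variant 1: catch both TypeError (value is None) and KeyError (key missing).
    try:
        return int(pub_date_dict["year"])
    except (TypeError, KeyError):
        return None


def year_with_get(pub_date_dict):
    # Variant 2: .get returns None for a missing key, so no KeyError can occur.
    year = pub_date_dict.get("year")
    return int(year) if year is not None else None


cases = [{"year": "2020"}, {"year": None}, {"month": "5"}, {}]
for case in cases:
    assert year_with_except(case) == year_with_get(case)
    print(case, "->", year_with_get(case))
```

Both variants return None for a missing or None year. Neither handles a non-numeric string (for example, int("") raises ValueError); whether parse_date can produce such a value is not established here.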
