KeyError: 'year' in parse_pubmed_xml when no date element contains a year
Version: latest (pip install pubmed-parser)
Python: 3.12
Description
Calling parse_pubmed_xml on certain PMC OA articles raises a KeyError: 'year'. The function already has fallback logic that tries the "ppub" date first and falls back to "collection" if "year" is absent, but the except clause on the subsequent int() conversion only catches TypeError, not KeyError. If neither date type contains a "year" key, the exception propagates uncaught.
Minimal reproduction
Pass any PMC XML whose <pub-date> elements lack a <year> child (observed in several articles from the PMC Open Access bulk download):
import pubmed_parser as pp
pp.parse_pubmed_xml("path/to/affected_article.xml")
Traceback:
File pubmed_parser/pubmed_oa_parser.py:198, in parse_pubmed_xml(path, ...)
    197 try:
--> 198     pub_year = int(pub_date_dict["year"])
    199 except TypeError:
    200     pub_year = None

KeyError: 'year'
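To check whether a given file falls into this case before handing it to the parser, something like the following sketch may help. It uses only the standard library; the helper name pub_date_types_with_year and the inline SAMPLE document are hypothetical, and it assumes the dates are selected by the pub-type attribute on <pub-date>, which is inferred from the "ppub" and "collection" arguments rather than confirmed against the library source.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for an affected article: neither <pub-date> has a <year> child.
SAMPLE = """<article><front><article-meta>
<pub-date pub-type="ppub"><month>5</month></pub-date>
<pub-date pub-type="collection"><season>Spring</season></pub-date>
</article-meta></front></article>"""


def pub_date_types_with_year(xml_text):
    """Return the pub-type values of every <pub-date> that has a <year> child."""
    root = ET.fromstring(xml_text)
    found = set()
    for pub_date in root.iter("pub-date"):
        if pub_date.find("year") is not None:
            found.add(pub_date.get("pub-type"))
    return found


types_with_year = pub_date_types_with_year(SAMPLE)
# Assumption: a file is affected when neither "ppub" nor "collection" carries a year.
if not ({"ppub", "collection"} & types_with_year):
    print("neither ppub nor collection has a <year>; expect KeyError: 'year'")
```

For a file on disk, read its contents and pass the text to the helper. Real PMC files may need extra handling (for example, entity declarations) that this sketch does not attempt.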
Root cause
In parse_pubmed_xml (around line 192):
pub_date_dict = parse_date(tree, "ppub")
if "year" not in pub_date_dict:
    pub_date_dict = parse_date(tree, "collection")  # fallback
pub_date = format_date(pub_date_dict)
try:
    pub_year = int(pub_date_dict["year"])
except TypeError:  # ← only catches None; not KeyError
    pub_year = None
The if "year" not in pub_date_dict guard correctly handles the "ppub" case, but if the "collection" fallback also lacks "year", pub_date_dict["year"] raises KeyError, which is not caught.
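The behavior can be shown in isolation, without the library or any XML. The sketch below copies only the try/except from the snippet above into a hypothetical helper, year_current, and feeds it plain dicts standing in for the result of parse_date. It is an illustration of the exception handling, not the library's actual code path.

```python
def year_current(pub_date_dict):
    """Mirror of the try/except quoted above (hypothetical helper name)."""
    try:
        return int(pub_date_dict["year"])
    except TypeError:  # raised by int(None), so a None year is handled
        return None


print(year_current({"year": "2019"}))  # year present as a string
print(year_current({"year": None}))    # year present but None: TypeError is caught

try:
    year_current({"month": "5"})       # no "year" key at all
except KeyError as exc:
    print("escaped the except clause:", repr(exc))
```

The missing key raises KeyError at the dictionary lookup, before int() is ever called, so the TypeError handler never sees it.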
Fix
Extend the except clause to also catch KeyError (a one-line change):
try:
    pub_year = int(pub_date_dict["year"])
except (TypeError, KeyError):  # ← add KeyError
    pub_year = None
Alternatively, use .get to make the KeyError impossible:
pub_year = pub_date_dict.get("year")
pub_year = int(pub_year) if pub_year is not None else None
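As a quick sanity check, the two proposed fixes can be compared side by side on plain dicts. The function names year_with_except and year_with_get are hypothetical wrappers around the two snippets above, and the input dicts are made-up stand-ins for parse_date results.

```python
def year_with_except(pub_date_dict):
    """First proposal: widen the except clause (hypothetical wrapper)."""
    try:
        return int(pub_date_dict["year"])
    except (TypeError, KeyError):
        return None


def year_with_get(pub_date_dict):
    """Second proposal: use .get so the lookup cannot raise (hypothetical wrapper)."""
    year = pub_date_dict.get("year")
    return int(year) if year is not None else None


# Made-up inputs: valid year, None year, missing key, empty dict.
cases = [{"year": "2019"}, {"year": None}, {"month": "5"}, {}]
for case in cases:
    assert year_with_except(case) == year_with_get(case)
    print(case, "->", year_with_get(case))
```

Both variants agree on these inputs. Neither catches the ValueError that int() raises for a non-numeric string, which matches the current behavior and is outside the scope of this report.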