Download PDFs for list.bib entries from Zotero API #19

Draft
Copilot wants to merge 5 commits into master from copilot/download-pdfs-from-zotero-api

Copilot AI commented Feb 19, 2026

PDF attachments for publication list entries are not downloaded from Zotero, leaving bib/docs/preprints/ unpopulated.

Changes

download-pdfs.py (new)

  • Reads the JSON data file produced by download.py
  • Filters items by tag using --tag argument (same literal-match logic as extract-bibtex.py)
  • Resolves PDF attachments (itemType: attachment, contentType: application/pdf) via parent key lookup
  • Names each PDF using data['citationKey'] (the structured field populated by Better BibTeX), with fallback to Citation Key: in the extra field, then the original attachment filename
  • Skips download if the file already exists in the output directory (idempotent)
  • Downloads each via GET https://api.zotero.org/{prefix}/items/{key}/file with env var validation and network error handling
  • After a new download, if tex.preprint is not set on the parent Zotero item, PATCHes the item via the API to add tex.preprint: https://ag-gipp.github.io/bib/preprints/<citation_key>.pdf
  • Existing tex.preprint values are never overwritten
  • --max-writes N limits the number of Zotero PATCH requests in a single run (useful for testing; default: unlimited)
python ./download-pdfs.py <json_file> [--tag TAG] [--output-dir DIR] [--max-writes N]
# e.g.
python ./download-pdfs.py "$ZB_FILE" --tag '!ms_author' --output-dir bib/docs/preprints --max-writes 1
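
The tag filter and the parent-key lookup listed above can be sketched roughly as follows. This is an illustration under assumptions, not the code in download-pdfs.py: the helper names `has_tag`, `pdf_children_by_parent` and `file_url` are hypothetical, and the sketch assumes the JSON file is a list of Zotero API item objects in which child attachments carry the parent key in `data.parentItem`.

```python
from collections import defaultdict

API_BASE = "https://api.zotero.org"


def has_tag(item, tag):
    """Literal (exact-string) tag match on a Zotero item (hypothetical helper)."""
    tags = item.get("data", {}).get("tags", [])
    return any(t.get("tag") == tag for t in tags)


def pdf_children_by_parent(items):
    """Map each parent item key to its PDF attachment items."""
    children = defaultdict(list)
    for item in items:
        data = item.get("data", {})
        is_pdf = (
            data.get("itemType") == "attachment"
            and data.get("contentType") == "application/pdf"
        )
        # Assumption: child attachments reference their parent via "parentItem".
        if is_pdf and data.get("parentItem"):
            children[data["parentItem"]].append(item)
    return children


def file_url(prefix, attachment_key):
    """File endpoint for an attachment; prefix is e.g. 'groups/2480461'."""
    return f"{API_BASE}/{prefix}/items/{attachment_key}/file"


# Tiny in-memory example using the keys that appear in this conversation.
items = [
    {"key": "SIEJ7T47", "data": {"key": "SIEJ7T47", "itemType": "conferencePaper",
                                 "tags": [{"tag": "!ms_author"}]}},
    {"key": "IKC87KPI", "data": {"key": "IKC87KPI", "itemType": "attachment",
                                 "contentType": "application/pdf",
                                 "parentItem": "SIEJ7T47"}},
]
children = pdf_children_by_parent(items)
selected = [i for i in items if has_tag(i, "!ms_author")]
urls = [file_url("groups/2480461", a["key"])
        for parent in selected for a in children[parent["key"]]]
```

Grouping the attachments once up front keeps the per-item lookup to a single dictionary access, which matters little for small libraries but keeps the filter loop simple.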

CI workflow

Added a step after list.bib extraction that downloads PDFs to bib/docs/preprints. The downloaded files are committed and pushed with the rest of the bib repo updates.

Makefile

Added pdfs target for local use:

pdfs: $(ZB_FILE)
    mkdir -p bib/docs/preprints
    $(ZB_PYTHON) ./download-pdfs.py $(ZB_FILE) --tag $(ZB_SEARCH_TAG) --output-dir bib/docs/preprints

…list.bib

Co-authored-by: physikerwelt <2777736+physikerwelt@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Download PDFs for list.bib using Zotero API" to "Download PDFs for list.bib entries from Zotero API" Feb 19, 2026
Copilot AI requested a review from physikerwelt February 19, 2026 18:23
@physikerwelt
Member

@copilot that works nicely. As a follow-up: only download the PDF if it does not already exist in bib/docs/preprints, and rename it according to the naming convention (derived from the unique tex key). New files should also be committed and referenced as the preprint if the field tex.preprint is not set. All files placed in bib/docs/preprints will be available from https://ag-gipp.github.io/bib/preprints/filename.pdf

….preprint

Co-authored-by: physikerwelt <2777736+physikerwelt@users.noreply.github.com>
Author

Copilot AI commented Feb 19, 2026

@copilot that works nicely. As a follow up: Only download the PDF if the PDF does not exist in bib/docs/preprints and rename it according to the naming convention (from the unique te...

Done in commit 6bccbef. Here's what was changed in download-pdfs.py:

  • Skip existing files: checks bib/docs/preprints/<citation_key>.pdf before downloading; skips if it already exists
  • Rename by citation key: reads Citation Key: <key> from the item's extra field and uses <key>.pdf as the filename (falls back to the original attachment filename if no citation key is found)
  • Set tex.preprint: after a new download, if tex.preprint is not present in the parent item's extra field, the script PATCHes the Zotero item to add tex.preprint: https://ag-gipp.github.io/bib/preprints/<citation_key>.pdf
  • The CI step and Makefile pdfs target now use bib/docs/preprints as the output directory
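
The skip-if-present check and the tex.preprint update can be sketched like this. It is a hedged illustration rather than the committed code: `target_path`, `needs_download` and `build_preprint_patch` are hypothetical names, the standard-library `urllib.request` is used here only to keep the sketch self-contained (the logs suggest the script itself goes through urllib3), and the `Zotero-API-Key` and `If-Unmodified-Since-Version` headers reflect my understanding of Zotero Web API write requests. The request is only built, never sent.

```python
import json
import urllib.request
from pathlib import Path

PREPRINT_BASE = "https://ag-gipp.github.io/bib/preprints"


def target_path(output_dir, citation_key):
    """Path the PDF is expected at: <output_dir>/<citation_key>.pdf."""
    return Path(output_dir) / f"{citation_key}.pdf"


def needs_download(output_dir, citation_key):
    """True only when the file is not already present (keeps reruns idempotent)."""
    return not target_path(output_dir, citation_key).exists()


def build_preprint_patch(prefix, item, citation_key, api_key):
    """Build (but do not send) a PATCH that appends tex.preprint to 'extra'."""
    data = item["data"]
    extra = data.get("extra", "")
    line = f"tex.preprint: {PREPRINT_BASE}/{citation_key}.pdf"
    # Keep existing extra lines such as "tex.topic: mathir" untouched.
    new_extra = f"{extra}\n{line}" if extra else line
    return urllib.request.Request(
        f"https://api.zotero.org/{prefix}/items/{data['key']}",
        data=json.dumps({"extra": new_extra}).encode("utf-8"),
        method="PATCH",
        headers={
            "Zotero-API-Key": api_key,
            "Content-Type": "application/json",
            # Assumption: optimistic concurrency check via the item version.
            "If-Unmodified-Since-Version": str(data["version"]),
        },
    )
```

A caller would check `needs_download(...)` first, download the file, and only then build and send the PATCH, so that an item is never marked with a preprint URL for a file that was not written.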

@physikerwelt
Member

@copilot not bad, but the files are not renamed. For example, it downloads to bib/docs/preprints/Petrera2021--MORS--zbmath_open_api_solutions_and_research_challenges.pdf instead of bib/docs/preprints/PetreraTBE21a.pdf, which would be derived from the bibtex key. Also, the API key I used for testing has no write access; see the logs below.

2026-02-19 19:49:13,980 - root - INFO - Downloaded IKC87KPI -> bib/docs/preprints/Petrera2021--MORS--zbmath_open_api_solutions_and_research_challenges.pdf
2026-02-19 19:49:13,983 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): api.zotero.org:443
2026-02-19 19:49:14,366 - urllib3.connectionpool - DEBUG - https://api.zotero.org:443 "PATCH /groups/2480461/items/SIEJ7T47 HTTP/1.1" 403 19
2026-02-19 19:49:14,367 - root - WARNING - Failed to update SIEJ7T47: HTTP 403
2026-02-19 19:49:14,369 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): api.zotero.org:443
2026-02-19 19:49:15,080 - urllib3.connectionpool - DEBUG - https://api.zotero.org:443 "GET /groups/2480461/items/IU8S39XN/file HTTP/1.1" 302 0
2026-02-19 19:49:15,084 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): zoterofilestorage.s3.us-east-1.amazonaws.com:443
2026-02-19 19:49:15,729 - urllib3.connectionpool - DEBUG - https://zoterofilestorage.s3.us-east-1.amazonaws.com:443 "GET....

Please add a top-level entry point or a CLI parameter, e.g. max-writes=1, so I can manually test writing changes to individual entries. Double-check that existing tex.preprint entries are not overwritten or modified.

…d argparse CLI

Co-authored-by: physikerwelt <2777736+physikerwelt@users.noreply.github.com>
Author

Copilot AI commented Feb 19, 2026

@copilot not bad, however the files are not renamed. So it downloads for example to bib/docs/preprints/Petrera2021--MORS--zbmath_open_api_solutions_and_research_challenges.pdf instead...

Fixed in commit 81ad754:

  • Renaming: the citation key is now extracted with a regex directly from the biblatex entry header (@article{PetreraTBE21a,...}), which is always populated. The previous approach only checked the extra field. Citation Key: in the extra field remains as a fallback for robustness.
  • --max-writes N: added via argparse — pass --max-writes 1 to test writing to a single entry. Writes beyond the limit log a message and are skipped.
  • tex.preprint guard: has_preprint_field checks case-insensitively for tex.preprint: in the extra field before patching, so existing values are never overwritten.

Example usage for testing a single write:

python ./download-pdfs.py "$ZB_FILE" --tag '!ms_author' --output-dir bib/docs/preprints --max-writes 1
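
For readers who want to see how the three pieces fit together, here is a minimal sketch. Only the flag names and `has_preprint_field` come from this conversation; the `WriteBudget` class, the `build_parser` helper, the argument defaults and the line-anchored regex are assumptions made for illustration and may differ from what the commit actually does.

```python
import argparse
import re

# Assumption: a tex.preprint entry sits on its own line inside "extra".
PREPRINT_RE = re.compile(r"^\s*tex\.preprint\s*:", re.IGNORECASE | re.MULTILINE)


def has_preprint_field(extra):
    """Case-insensitive check for an existing tex.preprint line."""
    return bool(PREPRINT_RE.search(extra or ""))


class WriteBudget:
    """Counts PATCH requests; max_writes=None means unlimited (hypothetical)."""

    def __init__(self, max_writes=None):
        self.max_writes = max_writes
        self.used = 0

    def try_consume(self):
        """Return True and count one write if the budget still allows it."""
        if self.max_writes is not None and self.used >= self.max_writes:
            return False
        self.used += 1
        return True


def build_parser():
    """CLI matching the usage line; the defaults here are assumptions."""
    parser = argparse.ArgumentParser(description="Download PDFs for tagged items")
    parser.add_argument("json_file")
    parser.add_argument("--tag", default=None)
    parser.add_argument("--output-dir", default="bib/docs/preprints")
    parser.add_argument("--max-writes", type=int, default=None)
    return parser


def should_patch(extra, budget):
    """Patch only when tex.preprint is absent and the budget is not exhausted."""
    # Check the field first so protected items never use up the budget.
    return not has_preprint_field(extra) and budget.try_consume()
```

Checking the guard before consuming the budget means an item that already has tex.preprint never counts against --max-writes.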

@physikerwelt
Member

@copilot The filename is still incorrect. For the example mentioned above, the JSON snippet looks like this:

{
    "biblatex": "\n@inproceedings{Petrera2021,\n\ttitle = {{zbMATH} Open: {API} Solutions and Research Challenges},\n\tvolume = {2976},\n\turl = {http://ceur-ws.org/Vol-2976/paper-1.pdf},\n\tseries = {{CEUR} Workshop Proceedings},\n\tpages = {4--13},\n\tbooktitle = {Proceedings of the Workshop on Digital Infrastructures for Scholarly Content Objects ({DISCO} 2021) co-located with {ACM}/{IEEE} Joint Conference on Digital Libraries 2021 ({JCDL} 2021), Online (due to the global pandemic), September 30, 2021},\n\tpublisher = {{CEUR}-{WS}.org},\n\tauthor = {Petrera, Matteo and Trautwein, Dennis and Beckenbach, Isabel and Ehsani, Dariush and M\u00fcller, Fabian and Teschke, Olaf and Gipp, Bela and Schubotz, Moritz},\n\teditor = {Balke, Wolf-Tilo and de Waard, Anita and Fu, Yuanxi and Hua, Bolin and Schneider, Jodi and Song, Ningyuan and Wang, Xiaoguang},\n\tdate = {2021},\n\tnote = {tex.topic: mathir},\n\tkeywords = {!ms\\_author, no-doi, \u26d4 No {DOI} found},\n}\n",
    "data": {
      "DOI": "",
      "ISBN": "",
      "ISSN": "",
      "abstractNote": "",
      "accessDate": "",
      "archive": "",
      "archiveLocation": "",
      "callNumber": "",
      "citationKey": "Petrera2021",
      "collections": [],
      "conferenceName": "",
      "creators": [
        {
          "creatorType": "author",
          "firstName": "Matteo",
          "lastName": "Petrera"
        },
        {
          "creatorType": "author",
          "firstName": "Dennis",
          "lastName": "Trautwein"
        },
        {
          "creatorType": "author",
          "firstName": "Isabel",
          "lastName": "Beckenbach"
        },
        {
          "creatorType": "author",
          "firstName": "Dariush",
          "lastName": "Ehsani"
        },
        {
          "creatorType": "author",
          "firstName": "Fabian",
          "lastName": "M\u00fcller"
        },
        {
          "creatorType": "author",
          "firstName": "Olaf",
          "lastName": "Teschke"
        },
        {
          "creatorType": "author",
          "firstName": "Bela",
          "lastName": "Gipp"
        },
        {
          "creatorType": "author",
          "firstName": "Moritz",
          "lastName": "Schubotz"
        },
        {
          "creatorType": "editor",
          "firstName": "Wolf-Tilo",
          "lastName": "Balke"
        },
        {
          "creatorType": "editor",
          "firstName": "Anita",
          "lastName": "de Waard"
        },
        {
          "creatorType": "editor",
          "firstName": "Yuanxi",
          "lastName": "Fu"
        },
        {
          "creatorType": "editor",
          "firstName": "Bolin",
          "lastName": "Hua"
        },
        {
          "creatorType": "editor",
          "firstName": "Jodi",
          "lastName": "Schneider"
        },
        {
          "creatorType": "editor",
          "firstName": "Ningyuan",
          "lastName": "Song"
        },
        {
          "creatorType": "editor",
          "firstName": "Xiaoguang",
          "lastName": "Wang"
        }
      ],
      "date": "2021",
      "dateAdded": "2023-10-03T19:54:33Z",
      "dateModified": "2025-05-29T10:30:37Z",
      "eventPlace": "",
      "extra": "tex.topic: mathir",
      "issue": "",
      "itemType": "conferencePaper",
      "key": "SIEJ7T47",
      "language": "",
      "libraryCatalog": "",
      "numberOfVolumes": "",
      "pages": "4\u201313",
      "place": "",
      "proceedingsTitle": "Proceedings of the Workshop on Digital Infrastructures for Scholarly Content Objects (DISCO 2021) co-located with ACM/IEEE Joint Conference on Digital Libraries 2021 (JCDL 2021), Online (due to the global pandemic), September 30, 2021",
      "publisher": "CEUR-WS.org",
      "relations": {},
      "rights": "",
      "series": "CEUR Workshop Proceedings",
      "seriesNumber": "",
      "shortTitle": "",
      "tags": [
        {
          "tag": "!ms_author"
        },
        {
          "tag": "no-doi"
        },
        {
          "tag": "\u26d4 No DOI found",
          "type": 1
        }
      ],
      "title": "zbMATH Open: API Solutions and Research Challenges",
      "url": "http://ceur-ws.org/Vol-2976/paper-1.pdf",
      "version": 39746,
      "volume": "2976"
    },
    "key": "SIEJ7T47",
    "library": {
      "id": 2480461,
      "links": {
        "alternate": {
          "href": "https://www.zotero.org/groups/2480461",
          "type": "text/html"
        }
      },
      "name": "ag-gipp",
      "type": "group"
    },
    "links": {
      "alternate": {
        "href": "https://www.zotero.org/groups/2480461/items/SIEJ7T47",
        "type": "text/html"
      },
      "attachment": {
        "attachmentSize": 1734052,
        "attachmentType": "application/pdf",
        "href": "https://api.zotero.org/groups/2480461/items/IKC87KPI",
        "type": "application/json"
      },
      "self": {
        "href": "https://api.zotero.org/groups/2480461/items/SIEJ7T47",
        "type": "application/json"
      }
    },
    "meta": {
      "createdByUser": {
        "id": 3075416,
        "links": {
          "alternate": {
            "href": "https://www.zotero.org/physikerwelt",
            "type": "text/html"
          }
        },
        "name": "Moritz Schubotz",
        "username": "physikerwelt"
      },
      "creatorSummary": "Petrera et al.",
      "lastModifiedByUser": {
        "id": 8024352,
        "links": {
          "alternate": {
            "href": "https://www.zotero.org/frederickirstein",
            "type": "text/html"
          }
        },
        "name": "",
        "username": "FredericKirstein"
      },
      "numChildren": 1,
      "parsedDate": "2021"
    },
    "version": 39746
  },
So you don't want to parse the dirty bibtex entry; instead, use "citationKey": "Petrera2021" as the filename.

Co-authored-by: physikerwelt <2777736+physikerwelt@users.noreply.github.com>
Author

Copilot AI commented Feb 19, 2026

@copilot The filename is still incorrect. For the example mentioned above the JSON snippet looks like

...

Fixed in commit 3d9fa90. get_citation_key() now reads data["citationKey"] directly (the structured field populated by Better BibTeX), so the example from your JSON will save as bib/docs/preprints/Petrera2021.pdf. The extra-field fallback is still there for entries that haven't been processed by Better BibTeX.
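
The lookup order described here (structured field first, then the extra field, then the original attachment filename) might look roughly like the sketch below. `get_citation_key` is the name used in this thread, but the body is an assumption, and `pdf_filename` and the regex are hypothetical helpers added for illustration.

```python
import re

# Assumption: the extra-field fallback is a line of the form "Citation Key: <key>".
EXTRA_KEY_RE = re.compile(
    r"^\s*Citation Key\s*:\s*(\S+)", re.IGNORECASE | re.MULTILINE
)


def get_citation_key(item):
    """Prefer data['citationKey']; fall back to 'Citation Key:' in extra."""
    data = item.get("data", {})
    key = (data.get("citationKey") or "").strip()
    if key:
        return key
    match = EXTRA_KEY_RE.search(data.get("extra") or "")
    return match.group(1) if match else None


def pdf_filename(item, attachment_filename):
    """Use <citation_key>.pdf, or keep the attachment's own name as a last resort."""
    key = get_citation_key(item)
    return f"{key}.pdf" if key else attachment_filename


# With the item quoted above, the structured field wins over everything else.
example = {"data": {"citationKey": "Petrera2021", "extra": "tex.topic: mathir"}}
name = pdf_filename(
    example,
    "Petrera2021--MORS--zbmath_open_api_solutions_and_research_challenges.pdf",
)
```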
