The tool's processing pipeline is composed of five exported functions in the TypeScript source tree:
- `main` in `src/index.ts`
- `searchArticlesBySpecies` in `src/processor/searchArticleBySpecies.ts`
- `fetchArticleDetails` in `src/processor/fetchArticleDetails.ts`
- `parseFigures` in `src/processor/parseFigures.ts`
- `downloadArticlePackage` in `src/processor/downloadArticlePackage.ts`

The package URL resolver used by the downloader is:

- `fetchPackageUrl` in `src/processor/fetchPackageUrl.ts`

The calls proceed in this order:

1. `main` loads species from `src/data/species.json`
2. `main` calls `searchArticlesBySpecies` for each species
3. `main` calls `fetchArticleDetails` when PMC IDs are returned
4. `fetchArticleDetails` requests XML batches (50 IDs per batch) from EFetch
5. `fetchArticleDetails` calls `parseFigures` with XML payloads
6. `parseFigures` extracts PMC IDs and calls `downloadArticlePackage`
7. `downloadArticlePackage` fetches the OA package URL, downloads the `.tar.gz`, extracts files, and copies selected image files to `build/output/[species]/[pmcid]/`

flowchart TD
accTitle: API Processing Pipeline
accDescr: Flow of function calls from main through species search, batched XML retrieval, XML parsing, package download, extraction, and cache updates.
A[main in src/index.ts] --> B[Load species keys from src/data/species.json]
B --> C[searchArticlesBySpecies for each species]
C --> D{PMC IDs returned?}
D -->|No| E[Log no articles for species]
D -->|Yes| F[fetchArticleDetails]
F --> G[Read cache build/output/cache/id.json]
G --> H[Batch IDs in groups of 50]
H --> I[Skip IDs already in cache]
I --> J{New IDs in batch?}
J -->|No| K[Continue to next batch]
J -->|Yes| L[Request EFetch XML]
L --> M[parseFigures]
M --> N[Extract PMC ID from article XML]
N --> O[downloadArticlePackage]
O --> P[fetchPackageUrl from OA service]
P --> Q[Download .tar.gz package]
Q --> R[Extract files and pick preferred image extension]
R --> S[Copy images to build/output/species/pmcid]
S --> T[Append batch IDs to cache file]
T --> K
E --> U[Next species]
K --> U
- Location: `src/index.ts` (`main`)
- Behavior:
  - Configures API request throughput via `throttled-queue`
  - Iterates over all species keys in `src/data/species.json`
  - Dispatches species-level processing through `searchArticlesBySpecies` and `fetchArticleDetails`
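To illustrate what request throttling does, here is a minimal, dependency-free sketch of a sliding-window limiter. It is a stand-in for the idea, not the `throttled-queue` API and not the project's configuration; `createReserver` and the numbers in the usage note are hypothetical.

```typescript
// Hypothetical helper: returns a function that, given the current time in
// milliseconds, reserves a slot and reports how long the caller must wait.
// It allows at most `maxPerInterval` request starts in any `intervalMs` window.
// Callers are assumed to pass non-decreasing `now` values.
function createReserver(
  maxPerInterval: number,
  intervalMs: number
): (now: number) => number {
  const starts: number[] = []; // reserved start times, oldest first

  return (now: number): number => {
    // Once the window is full, the next request may start no earlier than
    // one interval after the request `maxPerInterval` positions back.
    const gate =
      starts.length >= maxPerInterval
        ? starts[starts.length - maxPerInterval] + intervalMs
        : now;
    const start = Math.max(now, gate);
    starts.push(start); // grows without bound; acceptable for a sketch
    return start - now; // milliseconds to wait before sending
  };
}
```

For example, with `createReserver(3, 1000)` and seven calls all made at time `0`, the first three calls wait 0 ms, the next three wait 1000 ms, and the seventh waits 2000 ms.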
- Location: `src/processor/searchArticleBySpecies.ts` (`searchArticlesBySpecies`)
- Behavior:
  - Builds an NCBI ESearch query with `term=<species>[organism]`
  - Calls `esearch.fcgi` with `db=pmc`, `retmode=json`, and `retmax=1000000`
  - Adds `api_key` when `NCBI_API_KEY` is present
  - Returns `response.data.esearchresult.idlist`
  - Returns `[]` on request errors
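The query construction described above can be sketched as a pure function that only builds the URL and performs no request. The helper name `buildESearchUrl` is hypothetical; the parameters mirror the list above, and the base URL is the public E-utilities endpoint.

```typescript
// Public NCBI E-utilities ESearch endpoint.
const ESEARCH_URL =
  "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi";

// Hypothetical helper: builds the request URL only; it performs no I/O.
function buildESearchUrl(species: string, apiKey?: string): string {
  const params = new URLSearchParams({
    db: "pmc",
    term: `${species}[organism]`,
    retmode: "json",
    retmax: "1000000",
  });
  // The key is optional; add it only when one is configured.
  if (apiKey) {
    params.set("api_key", apiKey);
  }
  return `${ESEARCH_URL}?${params.toString()}`;
}
```

A caller would presumably pass the value of the `NCBI_API_KEY` environment variable as `apiKey`. `URLSearchParams` percent-encodes the brackets and spaces in the term, so the species name needs no manual escaping.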
- Location: `src/processor/fetchArticleDetails.ts` (`fetchArticleDetails`)
- Behavior:
  - Reads and writes cached IDs in `build/output/cache/id.json`
  - Splits IDs into 50-item batches
  - Skips IDs already present in the cache
  - Fetches article XML through `efetch.fcgi`
  - Calls `parseFigures` for each fetched batch
  - Appends processed IDs to the cache
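The batching and cache-skip steps can be sketched without any network or file access. The helpers `chunk` and `planBatches` are hypothetical names; the order (batch first, then drop cached IDs, then skip empty batches) follows the flowchart.

```typescript
// Split a list into consecutive groups of at most `size` items.
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError("size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Hypothetical helper: for each batch, keep only the IDs not yet cached,
// and drop batches that end up with nothing new to fetch.
function planBatches(
  ids: readonly string[],
  cached: readonly string[],
  batchSize = 50
): string[][] {
  const seen = new Set(cached);
  return chunk(ids, batchSize)
    .map((batch) => batch.filter((id) => !seen.has(id)))
    .filter((batch) => batch.length > 0);
}
```

Using a `Set` for the cached IDs keeps each lookup constant-time, which matters when a species search returns a very large ID list.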
- Location: `src/processor/parseFigures.ts` (`parseFigures`)
- Behavior:
  - Parses XML with `xml2js`
  - Extracts each article's PMC ID from `article.front[0]["article-meta"][0]["article-id"]`
  - Creates per-article output directories under `build/output`
  - Calls `downloadArticlePackage` for each parsed PMC ID
  - Continues processing when an individual article package fails
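As an illustration of the PMC ID lookup, the sketch below walks an object that is already in the shape `xml2js` produces with its default options (child elements as arrays, attributes under `$`, text under `_`). The helper `extractPmcId` and the choice to match on `pub-id-type="pmc"` are assumptions; the source does not state which attribute value is used.

```typescript
// Default xml2js shape for an element: a bare string when it has only text,
// or an object with attributes under "$" and text under "_".
type ArticleIdNode = string | { _?: string; $?: Record<string, string> };

interface ParsedArticle {
  front?: Array<{
    "article-meta"?: Array<{ "article-id"?: ArticleIdNode[] }>;
  }>;
}

// Hypothetical helper. ASSUMPTION: the PMC ID is the <article-id> whose
// pub-id-type attribute equals `idType`.
function extractPmcId(
  article: ParsedArticle,
  idType = "pmc"
): string | undefined {
  const ids = article.front?.[0]?.["article-meta"]?.[0]?.["article-id"] ?? [];
  for (const node of ids) {
    if (typeof node === "string") continue; // no attributes to match on
    if (node.$?.["pub-id-type"] === idType && node._) {
      return node._.trim();
    }
  }
  return undefined; // caller decides whether to skip or report the article
}
```

Returning `undefined` instead of throwing fits the behavior listed above, where one bad article should not stop the rest of the batch.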
- Location: `src/processor/downloadArticlePackage.ts` (`downloadArticlePackage`)
- Behavior:
  - Resolves OA package links via `fetchPackageUrl`
  - Downloads a `.tar.gz` package stream
  - Extracts package contents with `tar`
  - Selects one image per basename using the extension priority from `src/constants.ts`
  - Copies selected files into `outputDir`
  - Removes temporary extraction files
  - Returns extracted filenames
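The "one image per basename" rule can be sketched as follows. The priority list `EXAMPLE_PRIORITY` is purely illustrative (the real order lives in `src/constants.ts` and is not shown here), and `selectPreferredImages` is a hypothetical name.

```typescript
// ASSUMPTION: illustrative order only; earlier entries win.
const EXAMPLE_PRIORITY: readonly string[] = [".jpg", ".png", ".gif", ".tif"];

// Split "dir/fig1.JPG" into base "fig1" and lowercased extension ".jpg".
function splitName(file: string): { base: string; ext: string } {
  const name = file.slice(file.lastIndexOf("/") + 1); // drop directories
  const dot = name.lastIndexOf(".");
  return dot <= 0
    ? { base: name, ext: "" }
    : { base: name.slice(0, dot), ext: name.slice(dot).toLowerCase() };
}

// Hypothetical helper: keep, for each basename, the file whose extension
// ranks highest in `priority`; files with unlisted extensions are ignored.
function selectPreferredImages(
  files: readonly string[],
  priority: readonly string[] = EXAMPLE_PRIORITY
): string[] {
  const best = new Map<string, { file: string; rank: number }>();
  for (const file of files) {
    const { base, ext } = splitName(file);
    const rank = priority.indexOf(ext);
    if (rank === -1) continue; // not a listed image type
    const current = best.get(base);
    if (!current || rank < current.rank) {
      best.set(base, { file, rank });
    }
  }
  return Array.from(best.values(), (entry) => entry.file);
}
```

This sketch keys on the basename alone, so two files with the same name in different directories of a package would compete with each other; whether the project does the same is not stated in the source.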
- Location: `src/processor/fetchPackageUrl.ts` (`fetchPackageUrl`)
- Behavior:
  - Calls the PMC OA service endpoint `https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi`
  - Normalizes IDs to `PMC...`
  - Parses the XML response and extracts package links
  - Converts `ftp://` link prefixes to `https://`
  - Throws when article/package data is unavailable
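The two string transformations in this resolver, ID normalization and the `ftp://` to `https://` rewrite, are small enough to sketch directly. Both helper names are hypothetical, and the accepted input forms for IDs are an assumption.

```typescript
// Hypothetical helper: accept "12345", "pmc12345", or "PMC12345"
// and return the canonical "PMC12345" form.
function normalizePmcId(id: string): string {
  const digits = id.trim().replace(/^pmc/i, "");
  if (!/^\d+$/.test(digits)) {
    throw new Error(`Not a PMC ID: ${id}`);
  }
  return `PMC${digits}`;
}

// Rewrite only a leading ftp:// scheme; any other link passes through unchanged.
function toHttps(link: string): string {
  return link.replace(/^ftp:\/\//i, "https://");
}
```

Anchoring the pattern with `^` avoids touching an `ftp://` substring that happens to appear later in a URL, such as inside a query string.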
- `extractFigureUrls` in `src/processor/extractFigureUrls.ts` is currently a standalone utility and is not invoked by the active main pipeline.
- The cache file format is a JSON array of PMC ID strings, not an object.
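Because the cache is a plain JSON array of strings, reading it safely amounts to a shape check. The sketch below works on the file contents as a string and leaves file I/O to the caller; `parseIdCache` and `appendToCache` are hypothetical names, and de-duplicating on append is an assumption rather than documented behavior.

```typescript
// Parse cache contents, rejecting anything that is not an array of strings.
function parseIdCache(json: string): string[] {
  const value: unknown = JSON.parse(json);
  if (
    !Array.isArray(value) ||
    !value.every((item) => typeof item === "string")
  ) {
    throw new TypeError("cache must be a JSON array of strings");
  }
  return value;
}

// ASSUMPTION: IDs already in the cache are not written twice.
// Returns the serialized array, ready to be written back to disk.
function appendToCache(
  cache: readonly string[],
  newIds: readonly string[]
): string {
  return JSON.stringify(Array.from(new Set([...cache, ...newIds])));
}
```

For example, `appendToCache(["1", "2"], ["2", "3"])` yields the string `["1","2","3"]`.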