API Documentation

Overview

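The tool's processing pipeline is composed of five exported functions in the TypeScript source tree:

- main in src/index.ts
- searchArticlesBySpecies in src/processor/searchArticleBySpecies.ts
- fetchArticleDetails in src/processor/fetchArticleDetails.ts
- parseFigures in src/processor/parseFigures.ts
- downloadArticlePackage in src/processor/downloadArticlePackage.ts

The package URL resolver used by the downloader is:

- fetchPackageUrl in src/processor/fetchPackageUrl.ts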

Execution Order

1. main loads species from src/data/species.json
2. main calls searchArticlesBySpecies for each species
3. main calls fetchArticleDetails when PMC IDs are returned
4. fetchArticleDetails requests XML batches (50 IDs per batch) from EFetch
5. fetchArticleDetails calls parseFigures with XML payloads
6. parseFigures extracts PMC IDs and calls downloadArticlePackage
7. downloadArticlePackage fetches the OA package URL, downloads the .tar.gz, extracts files, and copies selected image files to build/output/[species]/[pmcid]/

Pipeline Diagram

flowchart TD
    accTitle: API Processing Pipeline
    accDescr: Flow of function calls from main through species search, batched XML retrieval, XML parsing, package download, extraction, and cache updates.

    A[main in src/index.ts] --> B[Load species keys from src/data/species.json]
    B --> C[searchArticlesBySpecies for each species]
    C --> D{PMC IDs returned?}
    D -->|No| E[Log no articles for species]
    D -->|Yes| F[fetchArticleDetails]
    F --> G[Read cache build/output/cache/id.json]
    G --> H[Batch IDs in groups of 50]
    H --> I[Skip IDs already in cache]
    I --> J{New IDs in batch?}
    J -->|No| K[Continue to next batch]
    J -->|Yes| L[Request EFetch XML]
    L --> M[parseFigures]
    M --> N[Extract PMC ID from article XML]
    N --> O[downloadArticlePackage]
    O --> P[fetchPackageUrl from OA service]
    P --> Q[Download .tar.gz package]
    Q --> R[Extract files and pick preferred image extension]
    R --> S[Copy images to build/output/species/pmcid]
    S --> T[Append batch IDs to cache file]
    T --> K
    E --> U[Next species]
    K --> U

Function Reference

`main(): Promise<void>`

Location: src/index.ts
Behavior:
- Configures API request throughput via throttled-queue
- Iterates all species keys in src/data/species.json
- Dispatches species-level processing through searchArticlesBySpecies and fetchArticleDetails

`searchArticlesBySpecies(throttle, species): Promise<string[]>`

Location: src/processor/searchArticleBySpecies.ts
Behavior:
- Builds an NCBI ESearch query with term=<species>[organism]
- Calls esearch.fcgi with db=pmc, retmode=json, and retmax=1000000
- Adds api_key when NCBI_API_KEY is present
- Returns response.data.esearchresult.idlist
- Returns [] on request errors
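
As an illustration of the query described above, the sketch below builds the ESearch URL and reads the ID list defensively. The helper names `buildESearchUrl` and `readIdList` are hypothetical, and the base URL is the standard E-utilities host, which this tool is assumed to use.

```typescript
// Standard NCBI E-utilities ESearch endpoint (assumed to be the one the tool calls).
const ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi";

// Hypothetical helper: assembles the query parameters listed in the reference.
function buildESearchUrl(species: string, apiKey?: string): string {
  const params = new URLSearchParams({
    db: "pmc",
    term: `${species}[organism]`,
    retmode: "json",
    retmax: "1000000",
  });
  // api_key is only sent when a key is available (for example from NCBI_API_KEY).
  if (apiKey) params.set("api_key", apiKey);
  return `${ESEARCH_URL}?${params.toString()}`;
}

// Hypothetical helper: returns esearchresult.idlist, or [] when it is missing.
function readIdList(body: unknown): string[] {
  const list = (body as { esearchresult?: { idlist?: unknown } } | null)
    ?.esearchresult?.idlist;
  return Array.isArray(list) ? list.map(String) : [];
}
```

`URLSearchParams` percent-encodes the square brackets in the term, so the species name can be passed through unmodified.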

`fetchArticleDetails(throttle, pmids, species): Promise<void>`

Location: src/processor/fetchArticleDetails.ts
Behavior:
- Reads/writes cached IDs in build/output/cache/id.json
- Splits IDs into 50-item batches
- Skips IDs already present in cache
- Fetches article XML through efetch.fcgi
- Calls parseFigures for each fetched batch
- Appends processed IDs to cache
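
The batching and cache-skip steps can be sketched as pure functions. This is a minimal illustration, not the code in fetchArticleDetails.ts: `toBatches` and `planBatches` are hypothetical names, and the order of operations (batch first, then drop cached IDs) follows the pipeline diagram.

```typescript
const BATCH_SIZE = 50;

// Splits a list into consecutive groups of at most `size` items.
function toBatches<T>(items: T[], size: number = BATCH_SIZE): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Hypothetical helper: batches the IDs, removes IDs already in the cache,
// and drops batches that have nothing new left to fetch.
function planBatches(ids: string[], cached: Iterable<string>): string[][] {
  const seen = new Set(cached);
  return toBatches(ids)
    .map((batch) => batch.filter((id) => !seen.has(id)))
    .filter((batch) => batch.length > 0);
}
```

Because filtering happens after batching, a batch can shrink below 50 IDs; only a fully cached batch is skipped outright.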

`parseFigures(throttle, xmlData, species): Promise<void>`

Location: src/processor/parseFigures.ts
Behavior:
- Parses XML with xml2js
- Extracts each article's PMC ID from article.front[0]["article-meta"][0]["article-id"]
- Creates per-article output directories under build/output
- Calls downloadArticlePackage for each parsed PMC ID
- Continues processing when an individual article package fails
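
To make the extraction path concrete, the sketch below walks an object shaped like xml2js output with its default options, where attributes sit under `$` and element text under `_`. The `extractPmcId` name and the interfaces are hypothetical, and matching on a `pub-id-type` of `pmc` or `pmcid` is an assumption about the XML rather than something this document confirms.

```typescript
// Shape of one <article-id> element as xml2js emits it by default.
interface ArticleIdNode {
  _?: string;
  $?: { "pub-id-type"?: string };
}

// Hypothetical minimal view of a parsed article; only the fields used here.
interface ParsedArticle {
  front?: Array<{
    "article-meta"?: Array<{ "article-id"?: Array<ArticleIdNode | string> }>;
  }>;
}

// Returns the text of the first article-id tagged as a PMC identifier,
// or undefined when the article carries none.
function extractPmcId(article: ParsedArticle): string | undefined {
  const ids = article.front?.[0]?.["article-meta"]?.[0]?.["article-id"] ?? [];
  for (const node of ids) {
    if (typeof node === "string") continue; // element without attributes
    const type = node.$?.["pub-id-type"];
    // Assumption: PMC IDs are marked with pub-id-type "pmc" or "pmcid".
    if ((type === "pmc" || type === "pmcid") && node._) {
      return node._.trim();
    }
  }
  return undefined;
}
```

Returning undefined instead of throwing lets the caller skip one article and continue, in line with the last bullet above.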

`downloadArticlePackage(throttle, pmcId, outputDir): Promise<string[]>`

Location: src/processor/downloadArticlePackage.ts
Behavior:
- Resolves OA package links via fetchPackageUrl
- Downloads a .tar.gz package stream
- Extracts package contents with tar
- Selects one image per basename using extension priority from src/constants.ts
- Copies selected files into outputDir
- Removes temporary extraction files
- Returns extracted filenames
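
The per-basename selection can be sketched as follows. The priority list here is a placeholder, since the real one lives in src/constants.ts and is not shown in this document; `splitName` and `selectPreferredImages` are hypothetical names.

```typescript
// Placeholder priority, most preferred first. The actual list is defined
// in src/constants.ts and may differ.
const EXTENSION_PRIORITY = [".jpg", ".png", ".gif", ".tif"];

// Splits "fig1.JPG" into base "fig1" and lower-cased extension ".jpg".
function splitName(file: string): { base: string; ext: string } {
  const dot = file.lastIndexOf(".");
  if (dot <= 0) return { base: file, ext: "" };
  return { base: file.slice(0, dot), ext: file.slice(dot).toLowerCase() };
}

// Keeps one file per basename: the one whose extension ranks highest.
// Files with extensions outside the priority list are ignored.
function selectPreferredImages(
  files: string[],
  priority: string[] = EXTENSION_PRIORITY,
): string[] {
  const best = new Map<string, { file: string; rank: number }>();
  for (const file of files) {
    const { base, ext } = splitName(file);
    const rank = priority.indexOf(ext);
    if (rank === -1) continue;
    const current = best.get(base);
    if (!current || rank < current.rank) best.set(base, { file, rank });
  }
  return Array.from(best.values(), (entry) => entry.file);
}
```

For example, given `fig1.gif` and `fig1.jpg`, only `fig1.jpg` is kept under the placeholder priority.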

`fetchPackageUrl(pmcId): Promise<PackageInfo>`

Location: src/processor/fetchPackageUrl.ts
Behavior:
- Calls PMC OA service endpoint https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi
- Normalizes IDs to PMC...
- Parses XML response and extracts package links
- Converts ftp:// link prefixes to https://
- Throws when article/package data is unavailable
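
Two of the steps above are simple enough to sketch in isolation: building the OA service request and rewriting the link scheme. The helper names `buildOaUrl` and `toHttps` are hypothetical, and the ID is passed through as given because the normalization rule is not spelled out here.

```typescript
const OA_ENDPOINT = "https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi";

// Hypothetical helper: the OA service takes the article ID in the `id`
// query parameter. The ID is assumed to be normalized by the caller.
function buildOaUrl(pmcId: string): string {
  return `${OA_ENDPOINT}?${new URLSearchParams({ id: pmcId }).toString()}`;
}

// Hypothetical helper: swaps a leading ftp:// for https:// and leaves
// any other link untouched.
function toHttps(link: string): string {
  const prefix = "ftp://";
  return link.startsWith(prefix) ? `https://${link.slice(prefix.length)}` : link;
}
```

Only the scheme prefix is replaced, so host and path are preserved exactly as the service returned them.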

Notes

extractFigureUrls in src/processor/extractFigureUrls.ts is currently a standalone utility and is not invoked by the active main pipeline.
Cache file format is a JSON array of PMC ID strings, not an object.
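
Since the cache is a flat JSON array, reading and updating it can be sketched with two small functions. These are illustrative only: `parseCache` and `appendToCache` are hypothetical names, and both the fallback to an empty list on malformed input and the de-duplication on append are assumptions rather than documented behavior.

```typescript
// Parses cache file contents. Anything that is not a JSON array is
// treated as an empty cache (an assumption), and non-string entries are dropped.
function parseCache(text: string): string[] {
  let data: unknown;
  try {
    data = JSON.parse(text);
  } catch {
    return [];
  }
  if (!Array.isArray(data)) return [];
  return data.filter((entry): entry is string => typeof entry === "string");
}

// Returns the serialized cache after adding a processed batch,
// keeping each ID once (de-duplication is an assumption).
function appendToCache(cached: string[], batch: string[]): string {
  const merged = Array.from(new Set([...cached, ...batch]));
  return JSON.stringify(merged, null, 2);
}
```

Keeping these functions free of file I/O leaves the read and write of build/output/cache/id.json to the caller.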
