This guide provides step-by-step instructions for using the Publication Figure Retrieval Tool, from basic installation to advanced configuration options.
Before you begin, ensure your system meets the following requirements:
- Node.js: Version 20 or higher
- RAM: Minimum 4GB available
- Internet: Stable connection with 7+ Mbps download speed
git clone https://github.com/AlexJSully/Publication-Figure-Retrieval.git
cd Publication-Figure-Retrieval
# Install all dependencies
npm ci
# Run tests to verify everything is working
npm test
# Validate
npm run validate
# Build the project
npm run build
Expect no errors.
# Start the figure retrieval process
npm run start
The tool will:
- Load species configuration from src/data/species.json
- Initialize rate limiting (3 requests/second without API key)
- Process each species sequentially
- Download article packages and extract images into build/output/[species]/[pmcid]/ (see src/processor/parseFigures.ts and src/processor/downloadArticlePackage.ts)
- Cache progress for resume capability
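The rate-limiting step above can be pictured as spacing request start times evenly. The following is a minimal sketch of that idea, not the tool's actual implementation; the function names `minIntervalMs` and `delayBeforeNext` are hypothetical and do not come from the repository.

```typescript
// Hypothetical helpers illustrating evenly spaced requests; not taken from the tool's source.

/** Minimum spacing, in milliseconds, between request starts for a given limit. */
function minIntervalMs(requestsPerSecond: number): number {
  if (!(requestsPerSecond > 0)) {
    throw new Error("requestsPerSecond must be positive");
  }
  // Round up so the limit is never exceeded.
  return Math.ceil(1000 / requestsPerSecond);
}

/**
 * How long to wait before starting the next request.
 * lastStartMs is the start time of the previous request, or null for the first one.
 */
function delayBeforeNext(
  lastStartMs: number | null,
  nowMs: number,
  requestsPerSecond: number,
): number {
  if (lastStartMs === null) return 0;
  const earliestNextStart = lastStartMs + minIntervalMs(requestsPerSecond);
  return Math.max(0, earliestNextStart - nowMs);
}
```

In an async loop, the returned delay would typically be awaited (for example with a `setTimeout`-based sleep) before each request. At 3 requests/second the spacing works out to 334 ms.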
Searching articles for the species: Arabidopsis_thaliana...
Found 1,247 articles for Arabidopsis_thaliana
Fetching Arabidopsis thaliana article details for batch 1-50...
Processing article PMC ID: PMC123456
Fetching package URL for PMC123456...
Downloading package from https://.../PMC123456.tar.gz
Package downloaded. Extracting images...
Extracted image: figure1.jpg (priority: jpg)
Extracted image: figure2.png (priority: png)
Successfully extracted 2 images from package.
Successfully processed article package for PMC123456
Create a .env file in the project root:
# Optional: NCBI API key for faster processing
NCBI_API_KEY=your_api_key_here
- Visit NCBI API Key Registration
- Follow the registration process
- Add your key to the .env file
| Feature | Without API Key | With API Key |
|---|---|---|
| Rate Limit | 3 requests/second | 10 requests/second |
| Processing Speed | Baseline | Up to ~3x faster |
| Large Dataset Handling | Limited | Better performance |
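As a rough illustration of the table above, the request limit can be derived from whether a key is present. This is a sketch under assumptions: the function name `requestsPerSecond` and the `Env` type are hypothetical, and only the variable name `NCBI_API_KEY` and the 3 and 10 requests/second figures come from this guide.

```typescript
// Hypothetical helper; the tool's real configuration code may differ.
type Env = Record<string, string | undefined>;

/** Pick the request limit from the table: 10/s with a key, 3/s without. */
function requestsPerSecond(env: Env): number {
  const key = env["NCBI_API_KEY"];
  // Treat a missing or whitespace-only value as "no key".
  const hasKey = key !== undefined && key.trim() !== "";
  return hasKey ? 10 : 3;
}
```

In Node.js this could be called as `requestsPerSecond(process.env)` once the .env file has been loaded. Note that the placeholder `your_api_key_here` would count as a key here, so it needs to be replaced with a real one.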
Edit src/data/species.json to customize which species to process:
{
"Arabidopsis_thaliana": {
"alias": ["Arabidopsis thaliana", "Mouse-ear cress", "Thale cress"]
},
"Cannabis_sativa": {
"alias": ["Cannabis sativa", "Hemp", "Marijuana"]
},
"Custom_species": {
"alias": ["Scientific name", "Common name 1", "Common name 2"]
}
}
Scenario: You need figures for a particular research organism.
# Edit src/data/species.json to include only your target species
{
"Homo_sapiens": {
"alias": ["Homo sapiens", "Human", "Humans"]
}
}
# Run the tool
npm run start
Expected Results:
build/output/
└── Homo_sapiens/
├── PMC123456/
│ ├── figure1.jpg
│ └── figure2.png
└── PMC789012/
└── figure1.jpg
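Because a malformed src/data/species.json is an easy mistake to make when trimming the list to one organism, it can help to check the file's shape before a long run. The sketch below assumes the shape shown in this guide (an object keyed by species name, each entry holding an `alias` array of strings). The function name `validateSpeciesConfig` is hypothetical and is not part of the tool.

```typescript
// Hypothetical pre-flight check for the species.json shape shown in this guide.
type SpeciesConfig = Record<string, { alias: string[] }>;

/** Return a list of human-readable problems; an empty list means the shape looks valid. */
function validateSpeciesConfig(raw: unknown): string[] {
  if (typeof raw !== "object" || raw === null || Array.isArray(raw)) {
    return ["top level must be an object keyed by species name"];
  }
  const problems: string[] = [];
  const record = raw as Record<string, unknown>;
  for (const name of Object.keys(record)) {
    const entry = record[name] as { alias?: unknown } | null | undefined;
    const alias = entry?.alias;
    if (!Array.isArray(alias) || alias.length === 0) {
      problems.push(`${name}: "alias" must be a non-empty array`);
    } else if (!alias.every((a) => typeof a === "string" && a.trim() !== "")) {
      problems.push(`${name}: every alias must be a non-empty string`);
    }
  }
  return problems;
}
```

A typical use would be to parse the file with `JSON.parse`, pass the result to `validateSpeciesConfig`, and stop before `npm run start` if any problems are reported.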
Scenario: The process was interrupted and you want to continue.
# Simply run the tool again - it will resume automatically
npm run start
The tool automatically:
- Reads the cache file (build/output/cache/id.json)
- Skips already processed PMC IDs
- Continues from where it left off
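The resume behaviour described above amounts to filtering out IDs that are already in the cache. The sketch below assumes the cache can be read as a flat list of PMC ID strings; the real layout of build/output/cache/id.json may differ, and the function name `pendingIds` is hypothetical.

```typescript
// Hypothetical illustration of skip-on-resume; assumes the cache is a list of PMC ID strings.

/** Return the IDs from allIds that are not yet recorded in cachedIds, preserving order. */
function pendingIds(allIds: string[], cachedIds: string[]): string[] {
  const done = new Set(cachedIds);
  return allIds.filter((id) => !done.has(id));
}
```

For example, with `["PMC1", "PMC2", "PMC3"]` found and `["PMC2"]` cached, only `PMC1` and `PMC3` would remain to be processed.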
Scenario: You want to download figures for many species (research dataset creation).
# The tool processes species sequentially to respect API limits
# For large lists, consider running overnight or over weekends
npm run start
# Monitor progress in real time from a second terminal
# (only applies if you redirected output to a log file, e.g. npm run start > output.log 2>&1)
tail -f output.log
Performance Estimation:
- Without API key: ~1,000 articles per hour
- With API key: ~3,000 articles per hour
- Varies based on number of figures per article
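To turn these rates into a rough run time, divide the article count by the hourly rate. The sketch below uses only the approximate figures quoted above; the names `ARTICLES_PER_HOUR` and `estimateHours` are hypothetical, and real throughput varies with the number of figures per article.

```typescript
// Approximate throughput figures from this guide; treat them as rough estimates.
const ARTICLES_PER_HOUR = { withoutKey: 1000, withKey: 3000 } as const;

/** Estimated run time in hours for a given number of articles. */
function estimateHours(articleCount: number, hasApiKey: boolean): number {
  if (articleCount < 0) {
    throw new Error("articleCount must not be negative");
  }
  const rate = hasApiKey ? ARTICLES_PER_HOUR.withKey : ARTICLES_PER_HOUR.withoutKey;
  return articleCount / rate;
}
```

For the 1,247 articles in the sample output earlier in this guide, this gives about 1.25 hours without an API key and roughly 25 minutes with one.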
graph TD
A[Define Research Question] --> B[Select Target Species]
B --> C[Edit species.json]
C --> D[Configure API Key]
D --> E[Run Tool]
E --> F[Monitor Progress]
F --> G[Verify Downloads]
G --> H[Analyze Figures]
Step-by-step:
- Plan your research: Determine which species are relevant
- Configure species list: Edit src/data/species.json
- Set up environment: Add API key to .env
- Start processing: Run npm run start
- Monitor progress: Watch console output
- Verify results: Check build/output/ directory structure
- Analyze data: Use downloaded figures for research
sequenceDiagram
participant R as Researcher
participant T as Tool
participant PMC as PMC Database
participant A as Analysis Software
R->>T: Configure target species
T->>PMC: Search for articles
PMC-->>T: Return article lists
T->>PMC: Download figures
PMC-->>T: Figure files
T-->>R: Organized figure dataset
R->>A: Load figures for analysis
A-->>R: Comparative results
For processing multiple research projects:
#!/bin/bash
# batch_process.sh
# Stop at the first failing command so a failed run is not moved as if it had succeeded
set -euo pipefail
# Project 1: Plant species
echo "Processing plant species..."
cp configs/plant_species.json src/data/species.json
npm run start
# Moving build/output also moves its cache, so the next project starts fresh
mv build/output build/output_plants
# Project 2: Animal species
echo "Processing animal species..."
cp configs/animal_species.json src/data/species.json
npm run start
mv build/output build/output_animals
echo "Batch processing complete!"
# Monitor memory usage during processing
# Large species lists may require memory monitoring
# For very large datasets, process in smaller batches
node --max-old-space-size=8192 build/index.js
# Estimate storage requirements
# Typical figure: 100KB - 2MB
# 1000 articles × 2 figures × 500KB = ~1GB
# Monitor disk space during processing
df -h build/output/
# Count processed articles
find build/output -name "PMC*" -type d | wc -l
# Count downloaded figures
find build/output -name "*.jpg" -o -name "*.png" | wc -l
# Check cache status
jq length build/output/cache/id.json
- API Documentation - Detailed function and module references
- Contributing - How to extend and modify the tool
- FAQ - Common questions and advanced troubleshooting