Usage Guide

Quick Start

This guide provides step-by-step instructions for using the Publication Figure Retrieval Tool, from basic installation to advanced configuration options.

Prerequisites

Before you begin, ensure your system meets the following requirements:

Node.js: Version 20 or higher
RAM: Minimum 4GB available
Internet: Stable connection with 7+ Mbps download speed

Installation

1. Clone the Repository

git clone https://github.com/AlexJSully/Publication-Figure-Retrieval.git
cd Publication-Figure-Retrieval

2. Install Dependencies

# Install all dependencies
npm ci

3. Verify Installation

# Run tests to verify everything is working
npm test

# Validate
npm run validate

# Build the project
npm run build

All three commands should finish without errors.

Basic Usage

Running the Tool

# Start the figure retrieval process
npm run start

The tool will:

Load species configuration from src/data/species.json
Initialize rate limiting (3 requests/second without API key)
Process each species sequentially
Download article packages and extract images into build/output/[species]/[pmcid]/ (see src/processor/parseFigures.ts and src/processor/downloadArticlePackage.ts)
Cache progress for resume capability

Example Output

Searching articles for the species: Arabidopsis_thaliana...
Found 1,247 articles for Arabidopsis_thaliana
Fetching Arabidopsis thaliana article details for batch 1-50...
Processing article PMC ID: PMC123456
Fetching package URL for PMC123456...
Downloading package from https://.../PMC123456.tar.gz
Package downloaded. Extracting images...
Extracted image: figure1.jpg (priority: jpg)
Extracted image: figure2.png (priority: png)
Successfully extracted 2 images from package.
Successfully processed article package for PMC123456

Configuration Options

Environment Variables

Create a .env file in the project root:

# Optional: NCBI API key for faster processing
NCBI_API_KEY=your_api_key_here

API Key Configuration

Getting an NCBI API Key

Visit NCBI API Key Registration
Follow the registration process
Add your key to the .env file as NCBI_API_KEY

Benefits of Using an API Key

Feature	Without API Key	With API Key
Rate Limit	3 requests/second	10 requests/second
Processing Speed	Baseline	Up to ~3x faster
Large Dataset Handling	Limited	Better performance
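The rate limits in the table translate directly into a minimum spacing between requests. The following is a minimal sketch of that idea, not the tool's actual rate limiter: `requestsPerSecond`, `minDelayMs`, and `runThrottled` are hypothetical names, and only the 3 and 10 requests/second figures and the `NCBI_API_KEY` variable come from this guide.

```typescript
// Request rates taken from the table above (requests per second).
const RATE_WITHOUT_KEY = 3;
const RATE_WITH_KEY = 10;

// Hypothetical helper: picks the rate from an environment-like object
// such as process.env. An empty or missing NCBI_API_KEY counts as "no key".
function requestsPerSecond(env: Record<string, string | undefined>): number {
	const key = env["NCBI_API_KEY"];
	return key !== undefined && key.trim() !== "" ? RATE_WITH_KEY : RATE_WITHOUT_KEY;
}

// Minimum spacing between the starts of two consecutive requests, in milliseconds.
function minDelayMs(rate: number): number {
	return Math.ceil(1000 / rate);
}

// Hypothetical throttle: runs tasks one at a time and never starts two
// tasks less than minDelayMs(rate) apart.
async function runThrottled<T>(tasks: Array<() => Promise<T>>, rate: number): Promise<T[]> {
	const delay = minDelayMs(rate);
	const results: T[] = [];
	let lastStart = 0;
	for (const task of tasks) {
		const wait = lastStart + delay - Date.now();
		if (wait > 0) {
			await new Promise<void>((resolve) => setTimeout(resolve, wait));
		}
		lastStart = Date.now();
		results.push(await task());
	}
	return results;
}
```

Under these assumptions, requests are spaced about 334 ms apart without a key and 100 ms apart with one.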

Species Configuration

Edit src/data/species.json to customize which species to process:

{
	"Arabidopsis_thaliana": {
		"alias": ["Arabidopsis thaliana", "Mouse-ear cress", "Thale cress"]
	},
	"Cannabis_sativa": {
		"alias": ["Cannabis sativa", "Hemp", "Marijuana"]
	},
	"Custom_species": {
		"alias": ["Scientific name", "Common name 1", "Common name 2"]
	}
}
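A malformed species.json is easier to catch before a long run than during one. The sketch below checks the shape shown above (an object whose entries each carry a non-empty "alias" array of strings). `SpeciesConfig` and `parseSpeciesConfig` are hypothetical names used for illustration; the tool may validate its configuration differently or not at all.

```typescript
// Shape of src/data/species.json as shown in the example above.
interface SpeciesEntry {
	alias: string[];
}
type SpeciesConfig = Record<string, SpeciesEntry>;

function isNonEmptyString(value: unknown): value is string {
	return typeof value === "string" && value.trim() !== "";
}

// Hypothetical helper: parses the JSON text and throws a descriptive
// error if any species entry lacks a usable "alias" array.
function parseSpeciesConfig(json: string): SpeciesConfig {
	const parsed: unknown = JSON.parse(json);
	if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
		throw new Error("species.json must contain a JSON object");
	}
	const config: SpeciesConfig = {};
	for (const [name, entry] of Object.entries(parsed as Record<string, unknown>)) {
		const alias =
			typeof entry === "object" && entry !== null
				? (entry as { alias?: unknown }).alias
				: undefined;
		if (!Array.isArray(alias) || alias.length === 0 || !alias.every(isNonEmptyString)) {
			throw new Error(`Species "${name}" needs a non-empty "alias" array of strings`);
		}
		config[name] = { alias };
	}
	return config;
}
```

In a Node.js script, the input string would typically come from reading src/data/species.json with the file system module.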

Common Use Cases

1. Download Figures for Specific Species

Scenario: You need figures for a particular research organism.

# Edit src/data/species.json to include only your target species
{
  "Homo_sapiens": {
    "alias": ["Homo sapiens", "Human", "Humans"]
  }
}

# Run the tool
npm run start

Expected Results:

build/output/
└── Homo_sapiens/
    ├── PMC123456/
    │   ├── figure1.jpg
    │   └── figure2.png
    └── PMC789012/
        └── figure1.jpg
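When post-processing the results, it can help to map between this directory layout and structured records. The following sketch assumes only the build/output/[species]/[pmcid]/ layout documented in this guide; `FigureLocation`, `figurePath`, and `parseFigurePath` are hypothetical helpers, not part of the tool.

```typescript
// Root of the layout shown in the tree above.
const OUTPUT_ROOT = "build/output";

interface FigureLocation {
	species: string;
	pmcid: string;
	fileName: string;
}

// Hypothetical helper: builds build/output/[species]/[pmcid]/[fileName].
function figurePath(loc: FigureLocation): string {
	return [OUTPUT_ROOT, loc.species, loc.pmcid, loc.fileName].join("/");
}

// Hypothetical inverse: returns null when the path does not follow the
// layout (for example the cache file under build/output/cache/).
function parseFigurePath(path: string): FigureLocation | null {
	const match = /^build\/output\/([^/]+)\/(PMC\d+)\/([^/]+)$/.exec(path);
	if (match === null) {
		return null;
	}
	return { species: match[1], pmcid: match[2], fileName: match[3] };
}
```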

2. Resume Interrupted Downloads

Scenario: The process was interrupted and you want to continue.

# Simply run the tool again - it will resume automatically
npm run start

The tool automatically:

Reads the cache file (build/output/cache/id.json)
Skips already processed PMC IDs
Continues from where it left off
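The resume behaviour described above amounts to a set difference between the PMC IDs found for a species and the IDs already recorded in the cache. The sketch below illustrates that logic only. It assumes the cache file is a JSON array of PMC ID strings, which this guide does not confirm, and `parseCache`, `pendingIds`, and `markProcessed` are hypothetical names.

```typescript
// Assumption: the cache is a JSON array of PMC IDs, e.g. ["PMC123456", "PMC789012"].
function parseCache(json: string): Set<string> {
	const parsed: unknown = JSON.parse(json);
	if (!Array.isArray(parsed)) {
		throw new Error("expected a JSON array of PMC IDs");
	}
	return new Set(parsed.filter((id): id is string => typeof id === "string"));
}

// Returns the IDs that still need processing, preserving their original order.
function pendingIds(found: string[], processed: Set<string>): string[] {
	return found.filter((id) => !processed.has(id));
}

// Records a finished ID and returns the JSON text to write back to the cache file.
function markProcessed(processed: Set<string>, pmcid: string): string {
	processed.add(pmcid);
	return JSON.stringify(Array.from(processed));
}
```

Writing the cache after every article, rather than once at the end, is what would let an interrupted run lose at most the article in progress.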

3. Process Large Species Lists

Scenario: You want to download figures for many species (research dataset creation).

# The tool processes species sequentially to respect API limits
# For large lists, consider running overnight or over weekends
npm run start

# Monitor progress in real time from a second terminal.
# This only works if you redirected the output, e.g. npm run start > output.log 2>&1
tail -f output.log

Performance Estimation:

Without API key: ~1,000 articles per hour
With API key: ~3,000 articles per hour
Actual throughput varies with the number of figures per article
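These rates make it simple to estimate run time before starting. The sketch below uses the two approximate figures listed above; `estimateHours` is a hypothetical helper, and real throughput will differ.

```typescript
// Approximate throughput figures from the list above (articles per hour).
const ARTICLES_PER_HOUR_WITHOUT_KEY = 1000;
const ARTICLES_PER_HOUR_WITH_KEY = 3000;

// Hypothetical helper: rough run time in hours for a given article count.
function estimateHours(articleCount: number, hasApiKey: boolean): number {
	const perHour = hasApiKey ? ARTICLES_PER_HOUR_WITH_KEY : ARTICLES_PER_HOUR_WITHOUT_KEY;
	return articleCount / perHour;
}
```

For the 1,247 articles in the example output earlier in this guide, this gives about 1.2 hours without an API key and roughly 25 minutes with one.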

Workflow Examples

Research Dataset Creation

graph TD
    A[Define Research Question] --> B[Select Target Species]
    B --> C[Edit species.json]
    C --> D[Configure API Key]
    D --> E[Run Tool]
    E --> F[Monitor Progress]
    F --> G[Verify Downloads]
    G --> H[Analyze Figures]

Step-by-step:

Plan your research: Determine which species are relevant
Configure species list: Edit src/data/species.json
Set up environment: Add API key to .env
Start processing: Run npm run start
Monitor progress: Watch console output
Verify results: Check build/output/ directory structure
Analyze data: Use downloaded figures for research

Comparative Analysis Workflow

sequenceDiagram
    participant R as Researcher
    participant T as Tool
    participant PMC as PMC Database
    participant A as Analysis Software

    R->>T: Configure target species
    T->>PMC: Search for articles
    PMC-->>T: Return article lists
    T->>PMC: Download figures
    PMC-->>T: Figure files
    T-->>R: Organized figure dataset
    R->>A: Load figures for analysis
    A-->>R: Comparative results

Batch Processing Workflow

For processing multiple research projects:

#!/bin/bash
# batch_process.sh
# Stop at the first failing command so a failed run is never archived as complete
set -euo pipefail

# Project 1: Plant species
echo "Processing plant species..."
cp configs/plant_species.json src/data/species.json
npm run start
# Moving build/output also moves its cache, so the next project starts fresh
mv build/output build/output_plants

# Project 2: Animal species
echo "Processing animal species..."
cp configs/animal_species.json src/data/species.json
npm run start
mv build/output build/output_animals

echo "Batch processing complete!"

Performance Optimization

Memory Management

# Large species lists may need more memory; monitor usage during processing.
# If Node.js runs out of heap memory, raise the V8 heap limit (value in MB, 8192 = 8 GB)
# or split the species list and process it in smaller batches.
node --max-old-space-size=8192 build/index.js

Storage Considerations

# Estimate storage requirements
# Typical figure: 100KB - 2MB
# 1000 articles × 2 figures × 500KB = ~1GB

# Monitor disk space during processing
df -h build/output/
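The storage estimate above is a straightforward product, which the following sketch turns into a reusable function. `estimateStorageGB` is a hypothetical helper; it uses decimal units (1 GB = 1,000,000 KB) to match the rounding in the comment above.

```typescript
// Hypothetical helper: approximate disk usage in GB for a planned run.
// Uses decimal units (1 GB = 1,000,000 KB), matching the estimate above.
function estimateStorageGB(
	articleCount: number,
	figuresPerArticle: number,
	averageFigureKB: number,
): number {
	return (articleCount * figuresPerArticle * averageFigureKB) / 1_000_000;
}
```

With the guide's numbers, `estimateStorageGB(1000, 2, 500)` evaluates to 1. Using the upper end of the typical figure size (2 MB, i.e. 2000 KB) raises the same run to 4 GB, so it is worth leaving headroom.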

Monitoring and Troubleshooting

Progress Monitoring

# Count processed articles
find build/output -name "PMC*" -type d | wc -l

# Count downloaded figures (JPG and PNG files only)
find build/output -type f \( -name "*.jpg" -o -name "*.png" \) | wc -l

# Check cache status (requires jq)
jq length build/output/cache/id.json

Next Steps

API Documentation - Detailed function and module references
Contributing - How to extend and modify the tool
FAQ - Common questions and advanced troubleshooting
