Commit 883dd62

Fixed unable to download images (#233)
1 parent a52fc34 commit 883dd62

27 files changed

Lines changed: 1735 additions & 1160 deletions

.github/workflows/code-qa-js.yaml

Lines changed: 22 additions & 4 deletions
Original file line number | Diff line number | Diff line change
@@ -1,10 +1,28 @@
1-
name: Code Quality Assurance
1+
name: Code Quality Assurance - JavaScript/TypeScript
22

33
on:
44
push:
55
branches: [master]
6+
paths:
7+
- "**/*.js"
8+
- "**/*.json"
9+
- "**/*.ts"
10+
- ".github/workflows/code-qa-js.yaml"
11+
- "eslint.config.js"
12+
- "jest.config.js"
13+
- "package*.json"
14+
- "tsconfig*.json"
615
pull_request:
716
branches: [master]
17+
paths:
18+
- "**/*.js"
19+
- "**/*.json"
20+
- "**/*.ts"
21+
- ".github/workflows/code-qa-js.yaml"
22+
- "eslint.config.js"
23+
- "jest.config.js"
24+
- "package*.json"
25+
- "tsconfig*.json"
826

927
permissions:
1028
contents: read
@@ -19,13 +37,13 @@ jobs:
1937

2038
steps:
2139
- name: Checkout repository
22-
uses: actions/checkout@v5
40+
uses: actions/checkout@v6
2341
- name: Use Node.js ${{ matrix.node-version }}
24-
uses: actions/setup-node@v5
42+
uses: actions/setup-node@v6
2543
with:
2644
node-version: ${{ matrix.node-version }}
2745
- name: Cache Node.js modules
28-
uses: actions/cache@v4
46+
uses: actions/cache@v5
2947
with:
3048
path: ~/.npm
3149
key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }}

.github/workflows/markdown-lint.yaml

Lines changed: 8 additions & 6 deletions
@@ -1,20 +1,22 @@
1-
name: Markdown Lint
1+
name: Code Quality Assurance - Markdown
22

33
on:
44
push:
55
branches: [master]
66
paths:
77
- "**/*.md"
8+
- ".github/workflows/markdown-lint.yaml"
89
- ".markdownlint.json"
910
- ".markdownlintignore"
10-
- ".github/workflows/markdown-lint.yaml"
11+
- "package*.json"
1112
pull_request:
1213
branches: [master]
1314
paths:
1415
- "**/*.md"
16+
- ".github/workflows/markdown-lint.yaml"
1517
- ".markdownlint.json"
1618
- ".markdownlintignore"
17-
- ".github/workflows/markdown-lint.yaml"
19+
- "package*.json"
1820

1921
permissions:
2022
contents: read
@@ -24,13 +26,13 @@ jobs:
2426

2527
steps:
2628
- name: Checkout repository
27-
uses: actions/checkout@v5
29+
uses: actions/checkout@v6
2830
- name: Use Node.js ${{ matrix.node-version }}
29-
uses: actions/setup-node@v5
31+
uses: actions/setup-node@v6
3032
with:
3133
node-version: ${{ matrix.node-version }}
3234
- name: Cache Node.js modules
33-
uses: actions/cache@v4
35+
uses: actions/cache@v5
3436
with:
3537
path: ~/.npm
3638
key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }}

README.md

Lines changed: 8 additions & 8 deletions
@@ -121,17 +121,17 @@ For comprehensive documentation, see the [`docs/`](docs/) folder:
121121

122122
## Maintenance Mode
123123

124-
This project is currently in maintenance mode. This means that:
124+
This project is currently in **maintenance mode**. This means that:
125125

126-
- Only critical bug fixes and security updates will be addressed.
127-
- New feature requests are unlikely to be implemented.
126+
- **Critical bug fixes** will be addressed
127+
- **Security updates** will be implemented promptly
128+
- **Minor improvements** to existing functionality may be accepted
129+
- **New features** are unlikely to be implemented
130+
- **Major refactoring** or architectural changes will not be pursued
128131

129-
## Sponsorship
132+
**Response Time:** While we strive to address issues promptly, response times may vary. Critical security issues will be prioritized.
130133

131-
If you want to support my work, you can do so through the following methods:
132-
133-
- [BTC](3Lp4pwF5nXqwFA62BYx4DSvDswyYpskBog) - 3Lp4pwF5nXqwFA62BYx4DSvDswyYpskBog
134-
- [ETH](0xc6EB17BD7cbe5976Bfc4f845669cD66Ff340a1A2) - 0xc6EB17BD7cbe5976Bfc4f845669cD66Ff340a1A2
134+
**Contributing:** Pull requests for bug fixes and security patches are still welcome. Please review the [Contributing Guidelines](CONTRIBUTING.md) before submitting.
135135

136136
## Authors
137137

docs/architecture/index.md

Lines changed: 30 additions & 41 deletions
@@ -143,56 +143,45 @@ graph TD
143143

144144
### 4. Parse Module (`src/processor/parseFigures.ts`)
145145

146-
Processes XML article data to extract figure information:
146+
Processes XML article data to extract PMC IDs and orchestrate package downloads:
147147

148148
```mermaid
149149
graph LR
150-
A[XML Article Data] --> B[Parse XML Structure]
151-
B --> C[Extract Article Metadata]
152-
C --> D[Find Figure Elements]
153-
D --> E[Extract Figure URLs]
154-
E --> F[Process Each Figure]
155-
F --> G[Download Figure]
156-
G --> H[Save to File System]
150+
A[XML Article Data] --> B[Parse XML Structure]
151+
B --> C[Extract Article Metadata]
152+
C --> D[Locate PMC ID]
153+
D --> E[Request Article Package]
154+
E --> F[Extract Images from Package]
155+
F --> G[Save Images to File System]
157156
```
158157

159-
**XML Structure Navigation:**
158+
**XML Structure Navigation (PMC ID extraction):**
159+
160+
The parser locates the PMC identifier in the article front matter (see implementation: [`src/processor/parseFigures.ts`](../src/processor/parseFigures.ts)).
160161

161162
```xml
162163
<pmc-articleset>
163-
<article>
164-
<front>
165-
<article-meta>
166-
<article-id pub-id-type="pmc">PMC123456</article-id>
167-
</article-meta>
168-
</front>
169-
<body>
170-
<fig>
171-
<graphic xlink:href="figure1.jpg"/>
172-
</fig>
173-
</body>
174-
</article>
164+
<article>
165+
<front>
166+
<article-meta>
167+
<article-id pub-id-type="pmcid">PMC123456</article-id>
168+
</article-meta>
169+
</front>
170+
</article>
175171
</pmc-articleset>
176172
```
177173

178-
### 5. Download Module (`src/processor/downloadImage.ts`)
174+
### 5. Download Module (`src/processor/downloadArticlePackage.ts`)
179175

180-
Handles actual file downloads with proper error handling:
176+
Downloads a complete PMC article package (.tar.gz) and extracts image files. The implementation fetches a package URL from the OA Web Service API, downloads the archive, extracts media, and selects the highest-priority image format per basename before copying results to the output directory (see implementation: [`src/processor/downloadArticlePackage.ts`](../src/processor/downloadArticlePackage.ts)).
181177

182-
```mermaid
183-
stateDiagram-v2
184-
[*] --> Validate_URL
185-
Validate_URL --> Create_Directory
186-
Create_Directory --> Download_File
187-
Download_File --> Success : HTTP 200
188-
Download_File --> Retry : Network Error
189-
Download_File --> Skip : HTTP 404
190-
Retry --> Download_File : Max 3 attempts
191-
Retry --> Failed : Exceeded retries
192-
Success --> [*]
193-
Skip --> [*]
194-
Failed --> [*]
195-
```
178+
Key implementation behaviors (implementation proof):
179+
180+
- Fetches OA package metadata via the OA API and converts FTP links to HTTPS (see [`src/processor/fetchPackageUrl.ts`](../src/processor/fetchPackageUrl.ts)).
181+
- Downloads the package archive and extracts it to a temporary directory (see [`src/processor/downloadArticlePackage.ts`](../src/processor/downloadArticlePackage.ts)).
182+
- Groups files by basename and keeps the highest-priority extension using the `IMAGE_EXTENSIONS` priority map (see [`src/constants.ts`](../src/constants.ts)).
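
The FTP-to-HTTPS conversion in the first bullet above can be illustrated with a minimal sketch. The helper name `ftpToHttps` and the example host are assumptions made for illustration; the actual logic lives in `src/processor/fetchPackageUrl.ts` and may differ.

```typescript
// Hypothetical helper (not taken from the repository): rewrite an ftp:// link
// to https:// and leave every other URL unchanged.
function ftpToHttps(link: string): string {
  return link.replace(/^ftp:\/\//i, "https://");
}

// The host and path below are placeholders for illustration only.
const packageLink = "ftp://ftp.example.org/packages/PMC123456.tar.gz";
console.log(ftpToHttps(packageLink));
```

Anchoring the pattern at the start of the string means only the scheme is rewritten, so an `ftp://` substring later in the URL is left untouched.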
183+
184+
Console-level messages written by the implementation include `Fetching package URL for <PMCID>`, `Package downloaded. Extracting images...`, `Extracted image: <filename>`, and `Successfully extracted <N> images from package.` (see [`src/processor/downloadArticlePackage.ts`](../src/processor/downloadArticlePackage.ts)).
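
As a rough sketch of the basename grouping described above: the priority values and the function name `pickPreferredImages` below are assumptions, not the repository's code. The real `IMAGE_EXTENSIONS` map in `src/constants.ts` may use different extensions or a different ordering.

```typescript
// Assumed priority map for illustration: a lower number means higher priority.
const ASSUMED_PRIORITY: Record<string, number> = {
  jpg: 0,
  png: 1,
  tiff: 2,
  gif: 3,
  svg: 4,
};

// Hypothetical helper: keep one file per basename, choosing the extension
// with the best (lowest) rank. Files with unknown extensions are skipped.
function pickPreferredImages(files: string[]): string[] {
  const best = new Map<string, { file: string; rank: number }>();
  for (const file of files) {
    const dot = file.lastIndexOf(".");
    if (dot <= 0) continue; // no extension, or a dotfile
    const base = file.slice(0, dot);
    const ext = file.slice(dot + 1).toLowerCase();
    const rank = ASSUMED_PRIORITY[ext];
    if (rank === undefined) continue; // not a recognized image extension
    const current = best.get(base);
    if (current === undefined || rank < current.rank) {
      best.set(base, { file, rank });
    }
  }
  return Array.from(best.values()).map((entry) => entry.file);
}

console.log(pickPreferredImages(["fig1.tiff", "fig1.jpg", "fig2.png", "notes.txt"]));
```

With the assumed ordering, `fig1.jpg` wins over `fig1.tiff` because both share the basename `fig1`, and `notes.txt` is dropped because its extension is not in the map.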
196185

197186
## Data Flow Architecture
198187

@@ -211,13 +200,13 @@ graph TD
211200
212201
subgraph "Content Processing"
213202
D --> E[XML Parsing]
214-
E --> F[Figure URL Extraction]
215-
F --> G[URL Validation]
203+
E --> F[PMC ID Extraction]
204+
F --> G[Request Article Package]
205+
G --> H[Extract Images from Package]
216206
end
217207
218208
subgraph "File Operations"
219-
G --> H[Directory Creation]
220-
H --> I[Figure Download]
209+
H --> I[Directory Creation]
221210
I --> J[File System Storage]
222211
end
223212

docs/faq.md

Lines changed: 9 additions & 9 deletions
@@ -11,12 +11,12 @@ The Publication Figure Retrieval Tool is an open-source Node.js application that
1111
### What formats are supported?
1212

1313
- **Input**: Species names (scientific and common names)
14-
- **Output**: JPEG images from publication figures
14+
- **Output**: Image formats extracted from article packages; supported extensions are defined in the code (`IMAGE_EXTENSIONS`) and include common formats such as `jpg`, `png`, `tiff`, `gif`, and `svg` (see [`src/constants.ts`](../src/constants.ts)).
1515
- **Data**: JSON metadata files with article and figure information
1616

1717
### Is this tool free to use?
1818

19-
Yes, this is an open-source tool released under the MIT License. However, please respect the NCBI API usage guidelines and publication copyright restrictions.
19+
The project is open-source; consult the repository [`package.json`](../package.json) for the declared license. Users must comply with NCBI API usage guidelines and any applicable publication copyright restrictions.
2020

2121
## Installation and Setup
2222

@@ -99,16 +99,16 @@ For the example above:
9999

100100
### Q: Where are the downloaded figures saved?
101101

102-
**A:** Figures are saved in the `src/output/` directory, organized by species:
102+
**A:** At runtime the tool writes extracted images to the `build/output/` directory (when running the compiled JavaScript). The layout is organized by species and PMC ID; the package extraction and write behavior are implemented in [`src/processor/parseFigures.ts`](../src/processor/parseFigures.ts) and [`src/processor/downloadArticlePackage.ts`](../src/processor/downloadArticlePackage.ts). Example:
103103

104104
```text
105-
src/output/
105+
build/output/
106+
├── cache/
107+
│ └── id.json
106108
├── Homo_sapiens/
107-
│ ├── figures/
108-
│ │ ├── PMC123456_figure1.jpg
109-
│ │ └── PMC123456_figure2.jpg
110-
│ └── metadata/
111-
│ └── PMC123456_metadata.json
109+
│ ├── PMC123456/
110+
│ │ ├── figure1.jpg
111+
│ │ └── figure2.png
112112
```
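
The species/PMC ID layout shown above could be composed along these lines. This is a sketch only: `outputDirFor` is a hypothetical name, and the repository may build its paths differently (for example with Node's `path` module).

```typescript
// Hypothetical helper: build the per-article output directory from the
// species name and PMC ID, following the build/output/[species]/[pmcid]/ layout.
function outputDirFor(species: string, pmcId: string, root: string = "build/output"): string {
  return [root, species, pmcId].join("/");
}

console.log(outputDirFor("Homo_sapiens", "PMC123456"));
```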
113113

114114
## Troubleshooting

docs/index.md

Lines changed: 9 additions & 10 deletions
@@ -38,8 +38,8 @@ graph TD
3838
D --> E[Get Article PMC IDs]
3939
E --> F[Fetch Article Details]
4040
F --> G[Parse XML Response]
41-
G --> H[Extract Figure URLs]
42-
H --> I[Download Figures]
41+
G --> H[Download Article Package]
42+
H --> I[Extract Images from Package]
4343
I --> J[Save to Species/PMCID Directory]
4444
J --> K{More Species?}
4545
K -->|Yes| C
@@ -102,10 +102,10 @@ npm run start
102102

103103
The tool will:
104104

105-
1. Read species from `src/data/species.json`
106-
2. Search PMC for each species
107-
3. Download figures to `build/output/[species]/[pmcid]/`
108-
4. Cache progress in `build/output/cache/id.json`
105+
1. Read species from [`src/data/species.json`](../src/data/species.json)
106+
2. Search PMC for each species (see [`src/processor/searchArticleBySpecies.ts`](../src/processor/searchArticleBySpecies.ts))
107+
3. For each article: fetch article XML, identify the PMC ID, download the article package (.tar.gz) and extract images into `build/output/[species]/[pmcid]/` (see [`src/processor/parseFigures.ts`](../src/processor/parseFigures.ts) and [`src/processor/downloadArticlePackage.ts`](../src/processor/downloadArticlePackage.ts))
108+
4. Cache progress in `build/output/cache/id.json` to enable resume
109109

110110
### With API Key (Recommended)
111111

@@ -141,10 +141,9 @@ sequenceDiagram
141141
Fetch->>PMC: efetch.fcgi?db=pmc&id=batch
142142
PMC-->>Fetch: XML article data
143143
144-
Fetch->>Parse: parseFigures(xmlData)
145-
Parse->>Parse: extractFigureUrls()
146-
Parse->>Download: downloadImage(figureUrl)
147-
Download-->>Parse: Downloaded figure
144+
Fetch->Parse: parseFigures(xmlData)
145+
Parse->>Download: downloadArticlePackage(pmcId)
146+
Download-->>Parse: Extracted images saved to disk
148147
Parse-->>User: Organized files
149148
```
150149

docs/usage/api/searchArticleBySpecies.md

Lines changed: 7 additions & 7 deletions
@@ -27,7 +27,7 @@ graph TD
2727
### Signature
2828

2929
```typescript
30-
export async function searchArticlesBySpecies(throttle: any, species: string): Promise<string[]>;
30+
export async function searchArticlesBySpecies(throttle: ThrottleFunction, species: string): Promise<string[]>;
3131
```
3232

3333
### Parameters
@@ -40,8 +40,8 @@ export async function searchArticlesBySpecies(throttle: any, species: string): P
4040
### Return Value
4141

4242
- **Type**: `Promise<string[]>`
43-
- **Description**: Array of PMC IDs (without "PMC" prefix)
44-
- **Example**: `["123456", "789012", "345678"]`
43+
- **Description**: The function returns the ID list provided by the NCBI response at `response.data.esearchresult.idlist`. The implementation returns the value directly from the API response (see [`src/processor/searchArticleBySpecies.ts`](../src/processor/searchArticleBySpecies.ts)).
44+
- **Example**: `["PMC123456", "PMC789012"]` (exact contents depend on the API response)
4545

4646
### API Integration
4747

@@ -155,11 +155,11 @@ const pmcIds = await searchArticlesBySpecies(throttleWithKey, "Cannabis_sativa")
155155
```typescript
156156
// From the actual implementation
157157
try {
158-
const response = await throttle(async () => await axios.get(url));
159-
return response.data.esearchresult.idlist; // Returns array of PMC IDs
158+
const response = await throttle(async () => await axios.get(url));
159+
return response.data.esearchresult.idlist; // Returns array of PMC IDs
160160
} catch (error) {
161-
console.error("Error fetching articles:", error);
162-
return []; // Returns empty array on error
161+
console.error("Error fetching articles:", error);
162+
return []; // Returns empty array on error
163163
}
164164
```
165165

docs/usage/index.md

Lines changed: 9 additions & 6 deletions
@@ -54,10 +54,10 @@ npm run start
5454

5555
The tool will:
5656

57-
1. **Load species configuration** from `src/data/species.json`
57+
1. **Load species configuration** from [`src/data/species.json`](../src/data/species.json)
5858
2. **Initialize rate limiting** (3 requests/second without API key)
5959
3. **Process each species** sequentially
60-
4. **Download figures** to `build/output/[species]/[pmcid]/`
60+
4. **Download article packages and extract images** into `build/output/[species]/[pmcid]/` (see [`src/processor/parseFigures.ts`](../src/processor/parseFigures.ts) and [`src/processor/downloadArticlePackage.ts`](../src/processor/downloadArticlePackage.ts))
6161
5. **Cache progress** for resume capability
6262

6363
### Example Output
@@ -67,10 +67,13 @@ Searching articles for the species: Arabidopsis_thaliana...
6767
Found 1,247 articles for Arabidopsis_thaliana
6868
Fetching Arabidopsis thaliana article details for batch 1-50...
6969
Processing article PMC ID: PMC123456
70-
Found 3 figures in the article.
71-
Downloaded image: figure1.jpg
72-
Downloaded image: figure2.png
73-
Downloaded image: supplementary1.tiff
70+
Fetching package URL for PMC123456...
71+
Downloading package from https://.../PMC123456.tar.gz
72+
Package downloaded. Extracting images...
73+
Extracted image: figure1.jpg (priority: jpg)
74+
Extracted image: figure2.png (priority: png)
75+
Successfully extracted 2 images from package.
76+
Successfully processed article package for PMC123456
7477
```
7578

7679
## Configuration Options
