Commit 3df0505

committed
updated README
1 parent d9b83ea commit 3df0505

6 files changed

Lines changed: 54 additions & 24 deletions

File tree

README.md

Lines changed: 37 additions & 17 deletions
Original file line number | Diff line number | Diff line change
@@ -1,16 +1,14 @@
11
# Publication Figure Retrieval Tool
22

3-
This tool provides a method for scraping through NCBI's [PMC](https://www.ncbi.nlm.nih.gov/labs/pmc/) publications and retrieving (downloading) the figures from open access and publicly available publications.
3+
This tool provides a method for retrieving figures from NCBI's [PMC](https://www.ncbi.nlm.nih.gov/labs/pmc/) publications using the Entrez API.
44

55
[![Follow on Twitter](https://img.shields.io/twitter/follow/alexjsully?style=social)](https://twitter.com/alexjsully)
6-
[![GitHub repo size](https://img.shields.io/github/repo-size/AlexJSully/Publication-Figures-Web-Scraping)](https://github.com/AlexJSully/Publication-Figures-Web-Scraping)
7-
[![GitHub](https://img.shields.io/github/license/AlexJSully/Publication-Figures-Web-Scraping)](https://github.com/AlexJSully/Publication-Figures-Web-Scraping)
6+
[![GitHub repo size](https://img.shields.io/github/repo-size/AlexJSully/Publication-Figure-Retrieval)](https://github.com/AlexJSully/Publication-Figure-Retrieval)
7+
[![GitHub](https://img.shields.io/github/license/AlexJSully/Publication-Figure-Retrieval)](https://github.com/AlexJSully/Publication-Figure-Retrieval)
88

99
## Disclaimer
1010

11-
Please note that as of August 20, 2024, scraping figures from PMC (PubMed Central) publications is no longer permitted under their Copyright Notice ([https://www.ncbi.nlm.nih.gov/pmc/about/copyright/](https://www.ncbi.nlm.nih.gov/pmc/about/copyright/)). Attempting to do so may result in a 403 error and an IP ban from NIH/PMC. Users are fully responsible for ensuring their use of this tool complies with current PMC policies and applicable laws. Use at your own risk.
12-
13-
This code is maintained for educational and historical reference purposes only. While it remains functional, it is not intended for use in scraping PMC data, as this is no longer allowed under current policies. The tool was originally developed for academic research, but due to these policy changes, it is not recommended for use in this capacity anymore.
11+
This code is maintained for educational and historical reference purposes only. The tool was originally developed for academic research. Please note that the use of this tool for retrieving figures from PMC publications is subject to NCBI's policies. Use at your own risk.
1412

1513
## Requirements
1614

@@ -20,25 +18,41 @@ This code is maintained for educational and historical reference purposes only.
2018

2119
## Installation & Setup
2220

23-
If you would like to run or modify the publication figure web scraping tool locally, clone the repository with git by running the following command:
21+
If you would like to run or modify the publication figure retrieval tool locally, clone the repository with git by running the following command:
22+
23+
```bash
24+
git clone https://github.com/AlexJSully/Publication-Figure-Retrieval.git
25+
```
26+
27+
Then run
28+
29+
```bash
30+
npm install
31+
```
32+
33+
followed by
2434

25-
```git
26-
git clone https://github.com/AlexJSully/Publication-Figures-Web-Scraping.git
35+
```bash
36+
npm start
2737
```
2838

29-
Then run `npm install` then `npm start`. This tool runs within your node environment. On Windows, this script needs to run in an administrator mode.
39+
This tool runs within your Node.js environment. On Windows, this script may need to run in administrator mode.
40+
41+
### Usage
3042

31-
The images are downloaded then downloaded locally within this containing directory under [src/data/figures/{species}/{PMC ID}](./src/data/figures).
43+
The images are downloaded locally within the `build/processor/output` directory.
3244

33-
If you would like to run against commercial use publications, you will need to download [`oa_comm_use_file.list.txt`](https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_comm_use_file.list.txt) from [ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/](https://ftp.ncbi.nlm.nih.gov/pub/pmc/) then run `npm run process`. Once that is done, set [index.js](./src/index.js) init function to true (`await init(true);`)
45+
### API Key
3446

35-
The publication figure scraper will resume where you last left off. If you would like to reset the scraper, empty [species-pmid-list.json](./src/data/species-pmid-list.json), [data-retrieved.json](./src/data/data-retrieved.json) and [data-empty-pubs.json](./src/data/data-empty-pubs.json) to contain only just an empty JSON object (`{}`).
47+
If you have an API key, create a `.env` file in the root directory and add your API key as follows:
3648

37-
If you would like to add more species support for publications to be scraped, add the species to [species.json](./src/data/species.json) and then run `npm start`. Currently, this JSON includes species' common aliases which are not currently being used but may be useful in the future. If you would like to scrape a single species, then change `speciesList` in [index.js](./src/index.js) to an array of species scientific name(s) to scrape. For example: `speciesList = ['Arabidopsis thaliana']; // Or whatever species name(s) you would like to scrape`. Currently, it is set to scrape all species within the [species.json](./src/data/species.json) file.
49+
```bash
50+
NCBI_API_KEY=your_api_key_here
51+
```
3852

39-
If in the instance that you do not have an internet connection/speed greater than 7mb/s, you will need to change all the Axios request timeouts in [data-retrieval.js](./src/scripts/data-retrieval.js) to a value of at least half of your speed (e.g. down speed of 10mb/s, set timeout to 5s).
53+
With an API key, the tool can make up to 10 requests per second instead of 3. Details on obtaining an API key can be found [here](https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/).
4054

41-
## Known issues
55+
## Known Issues
4256

4357
We aim to make this tool as perfect as possible but unfortunately, there may be some unforeseen bugs. If you manage to find one that is not here, feel free to create a bug report so we can fix it.
4458

@@ -48,6 +62,12 @@ We aim to make this tool as perfect as possible but unfortunately, there may be
4862

4963
Please read [CONTRIBUTING.md](CONTRIBUTING.md) for more details.
5064

65+
Before contributing, ensure that all tests pass by running:
66+
67+
```bash
68+
npm run validate
69+
```
70+
5171
## License
5272

5373
[GPL-2.0](LICENSE.md)
@@ -61,7 +81,7 @@ This project is currently in maintenance mode. This means that:
6181

6282
## Sponsorship
6383

64-
If you want to support my work, you can through the following methods:
84+
If you want to support my work, you can do so through the following methods:
6585

6686
- [BTC](3Lp4pwF5nXqwFA62BYx4DSvDswyYpskBog) - 3Lp4pwF5nXqwFA62BYx4DSvDswyYpskBog
6787
- [ETH](0xc6EB17BD7cbe5976Bfc4f845669cD66Ff340a1A2) - 0xc6EB17BD7cbe5976Bfc4f845669cD66Ff340a1A2

package.json

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -47,11 +47,11 @@
4747
],
4848
"private": true,
4949
"bugs": {
50-
"url": "https://github.com/AlexJSully/Publication-Figures-Web-Scraping/issues"
50+
"url": "https://github.com/AlexJSully/Publication-Figure-Retrieval/issues"
5151
},
5252
"repository": {
5353
"type": "git",
54-
"url": "git+https://github.com/AlexJSully/Publication-Figures-Web-Scraping.git"
54+
"url": "git+https://github.com/AlexJSully/Publication-Figure-Retrieval.git"
5555
},
5656
"keywords": [
5757
"science",
@@ -60,5 +60,5 @@
6060
"publication-data"
6161
],
6262
"license": "GNU General Public License v2.0",
63-
"homepage": "https://github.com/AlexJSully/Publication-Figures-Web-Scraping#readme"
63+
"homepage": "https://github.com/AlexJSully/Publication-Figure-Retrieval#readme"
6464
}

src/index.ts

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -5,7 +5,7 @@ import { searchArticlesBySpecies } from "./processor/searchArticleBySpecies";
55

66
// Check if NCBI API key is present in environment variables
77
/** The API key for the NCBI E-utilities. */
8-
const ncbiApiKey = process.env.NCBI_API_KEY;
8+
const ncbiApiKey = process?.env?.NCBI_API_KEY;
99
/**
1010
* The number of API calls allowed per second.
1111
* If an API key is provided, we can make up to 10 calls per second.
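
The rate-limit comment in this hunk can be sketched as below. This is a minimal illustration under stated assumptions, not the repository's code: `EnvHost`, `readNcbiApiKey`, `callsPerSecond`, and `minDelayMs` are hypothetical names, and the limits (10 calls per second with a key, 3 without) are taken from the README text in this commit. One detail worth noting: `process?.env` only guards against `process` being `null` or `undefined`. If the identifier is not declared at all (for example in a browser), the expression still throws a `ReferenceError`, so the sketch reads it through `globalThis` instead.

```typescript
/** Minimal shape of the Node.js global, declared locally so the sketch needs no type packages. */
type EnvHost = { process?: { env?: Record<string, string | undefined> } };

/** Hypothetical helper: reads NCBI_API_KEY without throwing when `process` is not defined. */
function readNcbiApiKey(host: EnvHost = globalThis as unknown as EnvHost): string | undefined {
  return host.process?.env?.NCBI_API_KEY;
}

/** Calls allowed per second: 10 with an API key, 3 without (figures from the README). */
function callsPerSecond(apiKey: string | undefined): number {
  return apiKey ? 10 : 3;
}

/** Smallest whole-millisecond gap between calls that stays within the limit. */
function minDelayMs(apiKey: string | undefined): number {
  return Math.ceil(1000 / callsPerSecond(apiKey));
}

const apiKey = readNcbiApiKey();
console.log(`Up to ${callsPerSecond(apiKey)} calls/s, at least ${minDelayMs(apiKey)} ms apart`);
```

Passing the host object as a parameter keeps the helper easy to exercise without touching the real environment.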

src/processor/extractFigureUrls.ts

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -57,6 +57,7 @@ export function extractFigureUrls(
5757
// Construct the correct absolute URL
5858
/** The absolute URL of the figure graphic. */
5959
const absoluteUrl = `https://www.ncbi.nlm.nih.gov/pmc/articles/PMC${pmcId}/bin/${figureUrl}`;
60+
6061
figureUrls.push(absoluteUrl);
6162
}
6263
});

src/processor/fetchArticleDetails.ts

Lines changed: 6 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -34,7 +34,12 @@ export async function fetchArticleDetails(
3434
/** Comma-separated list of PMCIDs. */
3535
const ids = batch.join(",");
3636
/** The URL to fetch article details for the current batch. */
37-
const url = `https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=${ids}&retmode=xml`;
37+
let url = `https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=${ids}&retmode=xml`;
38+
// Check if there is a NCBI API key available and if so, add it to the URL
39+
if (process?.env?.NCBI_API_KEY) {
40+
url += `&api_key=${process.env.NCBI_API_KEY}`;
41+
}
42+
3843
console.log(`Fetching article details for batch ${i + 1}-${i + batch.length}...`);
3944

4045
try {
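
The same "append `api_key` if present" block is added in both `fetchArticleDetails.ts` and `searchArticleBySpecies.ts`. One way to avoid the duplication is a small shared helper, sketched here. The name `withApiKey` is hypothetical and does not exist in the repository; the sketch also URL-encodes the key, which the committed code does not do.

```typescript
/**
 * Hypothetical helper: returns `url` unchanged when no key is given,
 * otherwise appends an `api_key` query parameter.
 */
function withApiKey(url: string, apiKey: string | undefined): string {
  if (!apiKey) {
    return url;
  }
  // Use "?" when the URL has no query string yet, "&" otherwise.
  const separator = url.indexOf("?") === -1 ? "?" : "&";
  return `${url}${separator}api_key=${encodeURIComponent(apiKey)}`;
}

// Example with an efetch URL shaped like the one in this hunk (IDs are made up).
const efetchUrl = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=1,2&retmode=xml";
console.log(withApiKey(efetchUrl, "example_key"));
console.log(withApiKey(efetchUrl, undefined));
```

With such a helper the URL could stay a `const` at each call site, since the conditional mutation moves into one place.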

src/processor/searchArticleBySpecies.ts

Lines changed: 6 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -22,9 +22,13 @@ export async function searchArticlesBySpecies(
2222
): Promise<string[]> {
2323
// Construct query for species and open-access filter
2424
const query = `${species}[organism]`;
25-
const url = `https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pmc&term=${encodeURIComponent(
25+
let url = `https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pmc&term=${encodeURIComponent(
2626
query,
27-
)}&retmode=json&retmax=100000`;
27+
)}&retmode=json&retmax=1000000`;
28+
// Check if there is a NCBI API key available and if so, add it to the URL
29+
if (process?.env?.NCBI_API_KEY) {
30+
url += `&api_key=${process.env.NCBI_API_KEY}`;
31+
}
2832

2933
try {
3034
// Make HTTP request to NCBI E-utilities API to search for articles
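
This hunk raises `retmax` from 100000 to 1000000. As far as I understand the E-utilities documentation, ESearch caps the number of UIDs returned per request (to my knowledge at 10,000), and larger result sets are retrieved by stepping `retstart`. The sketch below shows that paging idea only; `esearchPageUrls` is a hypothetical function, and the page size is passed in as a parameter rather than hard-coded because the exact cap is an assumption here.

```typescript
/**
 * Hypothetical helper: builds one ESearch URL per page of results,
 * advancing `retstart` by `pageSize` until `totalCount` is covered.
 */
function esearchPageUrls(term: string, totalCount: number, pageSize: number): string[] {
  if (pageSize <= 0) {
    throw new RangeError("pageSize must be positive");
  }
  const base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi";
  const urls: string[] = [];
  for (let retstart = 0; retstart < totalCount; retstart += pageSize) {
    urls.push(
      `${base}?db=pmc&term=${encodeURIComponent(term)}&retmode=json&retmax=${pageSize}&retstart=${retstart}`,
    );
  }
  return urls;
}

// Example: 25,000 matching articles fetched in pages of 10,000 (both numbers are illustrative).
const pages = esearchPageUrls("Arabidopsis thaliana[organism]", 25000, 10000);
console.log(pages.length);
```

The total count would come from the `count` field of a first ESearch response; that request is left out to keep the sketch free of network calls.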
