Commit e73bb7f

feat: add PDF processing resumption via filename tracking, centralise DefaultPaths constants and add VLM paper citation
1 parent 3176597 commit e73bb7f

20 files changed

Lines changed: 342 additions & 55 deletions

CHANGELOG.md

Lines changed: 10 additions & 0 deletions
@@ -1,3 +1,13 @@
+# Unreleased
+
+### Added
+
+- Added `is_track_pdfs` and `track_pdfs_report_path` to `process_articles()` for local PDF workflows. When enabled (default), each processed PDF is recorded as a `filename<TAB>doi` entry in `logs/{keyword}_pdf_processed_dois.txt`, allowing re-runs to skip already-processed PDFs before any conversion or API calls. Falls back to scanning the output CSV when the tracking file does not yet exist.
+
+- Centralised non-keyword default file paths (`results/failed_automated_articles.txt`, `agentic_evaluation_result.json`, `detailed_evaluation.json`) as class-level constants on `DefaultPaths` so they can be changed in one place.
+
+---
+
 # 2026.05.19

 ### Added

CITATION.cff

Lines changed: 22 additions & 0 deletions
@@ -50,6 +50,28 @@ preferred-citation:
   title: "ComProScanner: a multi-agent based framework for composition-property structured data extraction from scientific literature"
   type: article
   url: "https://doi.org/10.1039/D5DD00521C"
+references:
+  - authors:
+      - family-names: Roy
+        given-names: Aritra
+        orcid: "https://orcid.org/0000-0003-0243-9124"
+      - family-names: Grisan
+        given-names: Enrico
+        orcid: "https://orcid.org/0000-0002-7365-5652"
+      - family-names: Gattinoni
+        given-names: Chiara
+        orcid: "https://orcid.org/0000-0002-3376-6374"
+      - family-names: Buckeridge
+        given-names: John
+        orcid: "https://orcid.org/0000-0002-2537-5082"
+    identifiers:
+      - type: other
+        value: "arXiv:2606.00065"
+        description: "arXiv preprint"
+    title: "Beyond Text and Tables: Vision-Language Model Integration in ComProScanner for Extracting Materials Data from Scientific Figures with High Accuracy"
+    type: article
+    year: 2026
+    url: "https://arxiv.org/abs/2606.00065"
 repository-code: "https://github.com/slimeslab/ComProScanner"
 license: MIT
 title: "ComProScanner: A multi-agent based framework for composition-property structured data extraction from scientific literature"

README.md

Lines changed: 23 additions & 13 deletions
@@ -2,7 +2,7 @@
 <img src="https://raw.githubusercontent.com/aritraroy24/ComProScanner/refs/heads/main/assets/comproscanner_logo.png" alt="ComProScanner Logo" width="500"/>
 </p>

-[![Python Version](https://img.shields.io/badge/python-3.12%20%7C%203.13-blue.svg?logo=python&logoColor=white)](https://www.python.org/downloads/) [![License: MIT](https://custom-icon-badges.demolab.com/badge/license-MIT-brown.svg?logo=law&logoColor=white)](https://opensource.org/licenses/MIT) [![PyPI](https://img.shields.io/pypi/v/comproscanner?logo=pypi&logoColor=white)](https://pypi.org/project/comproscanner/) [![Documentation](https://custom-icon-badges.demolab.com/badge/docs-latest-brightgreen.svg?logo=materialformkdocs&logoColor=white)](https://slimeslab.github.io/ComProScanner/) [![Coverage](https://img.shields.io/codecov/c/github/aritraroy24/ComProScanner?logo=codecov&logoColor=white&label=coverage&color=e62277)](https://codecov.io/gh/aritraroy24/ComProScanner) [![PyPI - Downloads](https://custom-icon-badges.demolab.com/pypi/dm/comproscanner?logo=download&logoColor=white&color=purple)](https://pypistats.org/packages/comproscanner) [![Ask DeepWiki](https://custom-icon-badges.demolab.com/badge/Ask%20DeepWiki-brightgreen.svg?logo=deepwikidevin&logoColor=white&labelColor=grey&color=5ab998)](https://deepwiki.com/slimeslab/ComProScanner) [![Digital Discovery](https://custom-icon-badges.demolab.com/badge/Digital_Discovery-10.1039/D5DD00521C-brightgreen.svg?logo=rsc&logoColor=white&color=c8c300)](https://doi.org/10.1039/D5DD00521C)
+[![Python Version](https://img.shields.io/badge/python-3.12%20%7C%203.13-blue.svg?logo=python&logoColor=white)](https://www.python.org/downloads/) [![License: MIT](https://custom-icon-badges.demolab.com/badge/license-MIT-brown.svg?logo=law&logoColor=white)](https://opensource.org/licenses/MIT) [![PyPI](https://img.shields.io/pypi/v/comproscanner?logo=pypi&logoColor=white)](https://pypi.org/project/comproscanner/) [![Documentation](https://custom-icon-badges.demolab.com/badge/docs-latest-brightgreen.svg?logo=materialformkdocs&logoColor=white)](https://slimeslab.github.io/ComProScanner/) [![Coverage](https://img.shields.io/codecov/c/github/aritraroy24/ComProScanner?logo=codecov&logoColor=white&label=coverage&color=e62277)](https://codecov.io/gh/aritraroy24/ComProScanner) [![PyPI - Downloads](https://custom-icon-badges.demolab.com/pypi/dm/comproscanner?logo=download&logoColor=white&color=purple)](https://pypistats.org/packages/comproscanner) [![Ask DeepWiki](https://custom-icon-badges.demolab.com/badge/Ask%20DeepWiki-brightgreen.svg?logo=deepwikidevin&logoColor=white&labelColor=grey&color=5ab998)](https://deepwiki.com/slimeslab/ComProScanner) [![Digital Discovery](https://custom-icon-badges.demolab.com/badge/Digital_Discovery-10.1039/D5DD00521C-brightgreen.svg?logo=rsc&logoColor=white&color=c8c300)](https://doi.org/10.1039/D5DD00521C) [![arXiv Preprint](https://custom-icon-badges.demolab.com/badge/arXiv-2606.00065-brightgreen.svg?logo=arxiv&logoColor=white&color=b22929)](https://arxiv.org/abs/2606.00065)

 # ComProScanner

@@ -129,20 +129,30 @@ The ComProScanner workflow consists of four main stages:

 ## Citation

-If you use ComProScanner in your research, please cite:
+If you use ComProScanner in your research, please cite the following papers:

 ```bibtex
-@Article{roy2026comproscanner,
-title={ComProScanner: a multi-agent based framework for composition-property structured data extraction from scientific literature},
-author={Roy, Aritra and Grisan, Enrico and Buckeridge, John and Gattinoni, Chiara},
-journal={Digital Discovery},
-volume={5},
-number={4},
-pages={1794--1808},
-year={2026},
-publisher={Royal Society of Chemistry},
-doi ="10.1039/D5DD00521C",
-url ="https://doi.org/10.1039/D5DD00521C"
+@article{roy2026comproscanner,
+title={ComProScanner: a multi-agent based framework for composition-property structured data extraction from scientific literature},
+author={Roy, Aritra and Grisan, Enrico and Buckeridge, John and Gattinoni, Chiara},
+journal={Digital Discovery},
+volume={5},
+number={4},
+pages={1794--1808},
+year={2026},
+publisher={Royal Society of Chemistry},
+doi ="10.1039/D5DD00521C",
+url ="https://doi.org/10.1039/D5DD00521C"
+}
+@misc{roy2026comproscanner_vlm,
+title={Beyond Text and Tables: Vision-Language Model Integration in ComProScanner for Extracting Materials Data from Scientific Figures with High Accuracy},
+author={Aritra Roy and Enrico Grisan and Chiara Gattinoni and John Buckeridge},
+year={2026},
+eprint={2606.00065},
+archivePrefix={arXiv},
+primaryClass={cs.IR},
+doi={10.48550/arXiv.2606.00065},
+url={https://arxiv.org/abs/2606.00065},
 }
 ```

docs/about/changelog.md

Lines changed: 14 additions & 0 deletions
@@ -1,3 +1,13 @@
+# Unreleased
+
+### Added
+
+- Added `is_track_pdfs` and `track_pdfs_report_path` to `process_articles()` for local PDF workflows. When enabled (default), each processed PDF is recorded as a `filename<TAB>doi` entry in `logs/{keyword}_pdf_processed_dois.txt`, allowing re-runs to skip already-processed PDFs before any conversion or API calls. Falls back to scanning the output CSV when the tracking file does not yet exist.
+
+- Centralised non-keyword default file paths (`results/failed_automated_articles.txt`, `agentic_evaluation_result.json`, `detailed_evaluation.json`) as class-level constants on `DefaultPaths` so they can be changed in one place.
+
+---
+
 # 2026.05.19

 ### Added
@@ -28,6 +38,10 @@

 - Added `save_failed_automated_report` and `failed_automated_report_path` to `process_articles()` for automated publisher sources (Elsevier, Springer Nature, IOP, Wiley), mirroring the existing PDF failure report. Failed articles are written as tab-separated `doi`, `publisher`, `reason` entries to `results/failed_automated_articles.txt` by default.

+- Added `is_track_pdfs` and `track_pdfs_report_path` to `process_articles()` for local PDF workflows. When enabled (default), each processed PDF is recorded as a `filename<TAB>doi` entry in `logs/{keyword}_pdf_processed_dois.txt`, allowing re-runs to skip already-processed PDFs before any conversion or API calls. Falls back to scanning the output CSV when the tracking file does not yet exist.
+
+- Centralised default file paths (`results/failed_automated_articles.txt`, `agentic_evaluation_result.json`, `detailed_evaluation.json`) as class-level constants on `DefaultPaths` so they can be changed in one place.
+
 - Added image-aware fallback in `DataExtractionFlow.identify_materials_data_presence()`:

 - The Materials Data Identifier still runs text RAG first.

docs/about/citation.md

Lines changed: 21 additions & 11 deletions
@@ -3,16 +3,26 @@
 If you use ComProScanner in your research, please cite our related paper:

 ```bibtex
-@Article{roy2026comproscanner,
-title={ComProScanner: a multi-agent based framework for composition-property structured data extraction from scientific literature},
-author={Roy, Aritra and Grisan, Enrico and Buckeridge, John and Gattinoni, Chiara},
-journal={Digital Discovery},
-volume={5},
-number={4},
-pages={1794--1808},
-year={2026},
-publisher={Royal Society of Chemistry},
-doi ="10.1039/D5DD00521C",
-url ="https://doi.org/10.1039/D5DD00521C"
+@article{roy2026comproscanner,
+title={ComProScanner: a multi-agent based framework for composition-property structured data extraction from scientific literature},
+author={Roy, Aritra and Grisan, Enrico and Buckeridge, John and Gattinoni, Chiara},
+journal={Digital Discovery},
+volume={5},
+number={4},
+pages={1794--1808},
+year={2026},
+publisher={Royal Society of Chemistry},
+doi ="10.1039/D5DD00521C",
+url ="https://doi.org/10.1039/D5DD00521C"
+}
+@misc{roy2026comproscanner_vlm,
+title={Beyond Text and Tables: Vision-Language Model Integration in ComProScanner for Extracting Materials Data from Scientific Figures with High Accuracy},
+author={Aritra Roy and Enrico Grisan and Chiara Gattinoni and John Buckeridge},
+year={2026},
+eprint={2606.00065},
+archivePrefix={arXiv},
+primaryClass={cs.IR},
+doi={10.48550/arXiv.2606.00065},
+url={https://arxiv.org/abs/2606.00065},
 }
 ```

docs/index.md

Lines changed: 20 additions & 9 deletions
@@ -13,6 +13,7 @@
 <a href="https://pypistats.org/packages/comproscanner"><img src="https://custom-icon-badges.demolab.com/pypi/dm/comproscanner?logo=download&logoColor=white&color=purple" alt="Downloads"></a>
 <a href="https://deepwiki.com/slimeslab/ComProScanner"><img src="https://custom-icon-badges.demolab.com/badge/Ask%20DeepWiki-brightgreen.svg?logo=deepwikidevin&logoColor=white&labelColor=grey&color=5ab998" alt="Ask DeepWiki"></a>
 <a href="https://doi.org/10.1039/D5DD00521C"><img src="https://custom-icon-badges.demolab.com/badge/Digital_Discovery-10.1039/D5DD00521C-brightgreen.svg?logo=rsc&logoColor=white&color=c8c300" alt="Digital Discovery"></a>
+<a href="https://arxiv.org/abs/2606.00065"><img src="https://custom-icon-badges.demolab.com/badge/arXiv-2606.00065-brightgreen.svg?logo=arxiv&logoColor=white&color=b22929" alt="arXiv Preprint"></a>
 </p>

 ## Welcome
@@ -155,15 +156,25 @@ Read the details of ComProScanner in the following Digital Discovery paper: [10.
 If you use ComProScanner in your research, please cite:

 ```bibtex
-@Article{roy2026comproscannermultiagentbasedframework,
-author ="Roy, Aritra and Grisan, Enrico and Buckeridge, John and Gattinoni, Chiara",
-title ="ComProScanner: a multi-agent based framework for composition-property structured data extraction from scientific literature",
-journal ="Digital Discovery",
-year ="2026",
-pages ="Accepted",
-publisher ="RSC",
-doi ="10.1039/D5DD00521C",
-url ="https://doi.org/10.1039/D5DD00521C"
+@article{roy2026comproscannermultiagentbasedframework,
+author ="Roy, Aritra and Grisan, Enrico and Buckeridge, John and Gattinoni, Chiara",
+title ="ComProScanner: a multi-agent based framework for composition-property structured data extraction from scientific literature",
+journal ="Digital Discovery",
+year ="2026",
+pages ="Accepted",
+publisher ="RSC",
+doi ="10.1039/D5DD00521C",
+url ="https://doi.org/10.1039/D5DD00521C"
+}
+@misc{roy2026comproscanner_vlm,
+title={Beyond Text and Tables: Vision-Language Model Integration in ComProScanner for Extracting Materials Data from Scientific Figures with High Accuracy},
+author={Aritra Roy and Enrico Grisan and Chiara Gattinoni and John Buckeridge},
+year={2026},
+eprint={2606.00065},
+archivePrefix={arXiv},
+primaryClass={cs.IR},
+doi={10.48550/arXiv.2606.00065},
+url={https://arxiv.org/abs/2606.00065},
 }
 ```

docs/news/posts/graphtool-vlm-benchmark.md

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 ---
 date:
-  created: 2026-05-08
+  created: 2026-06-03
 authors:
   - aritraroy24
 cover_image: https://i.ibb.co/jPJf86Qg/comproscanner-vlm-integration.png

docs/usage/article-processing.md

Lines changed: 9 additions & 1 deletion
@@ -132,6 +132,14 @@ For `source_list=["pdfs"]` processing only. If `True`, saves a text report for P

 For `source_list=["pdfs"]` processing only. Custom output path for the failed PDF filename report. If not provided, defaults to `{folder_path}/failed_pdf_filenames.txt`.

+#### :material-square-medium:`is_track_pdfs` _(bool)_
+
+For `source_list=["pdfs"]` processing only. If `True`, each successfully processed PDF is recorded in a plain-text tracking file as a tab-separated `filename<TAB>doi` entry so that re-runs skip already-processed PDFs before any conversion or API calls. Falls back to scanning the output CSV when the tracking file does not yet exist. Defaults to `True`.
+
+#### :material-square-medium:`track_pdfs_report_path` _(str)_
+
+For `source_list=["pdfs"]` processing only. Custom path for the PDF tracking file. If not provided, defaults to `logs/{keyword}_pdf_processed_dois.txt`.
+
 #### :material-square-medium:`save_failed_automated_report` _(bool)_

 For automated publisher sources (`elsevier`, `springer`, `iop`, `wiley`). If `True`, appends a tab-separated record for every article that could not be downloaded or parsed to the failure report. Each line contains three fields: `doi`, `publisher`, and a short reason code:
@@ -150,7 +158,7 @@ Custom output path for the automated failure report. If not provided, defaults t

 !!! info "Default Values"

-    :material-square-small:**`source_list`** = ["elsevier", "wiley", "iop", "springer"]<br>:material-square-small:**`folder_path`** = None<br>:material-square-small:**`doi_list`** = None<br>:material-square-small:**`is_sql_db`** = False<br>:material-square-small:**`is_save_xml`** = False<br>:material-square-small:**`is_save_pdf`** = False<br>:material-square-small:**`rag_db_path`** = "db"<br>:material-square-small:**`chunk_size`** = 1000<br>:material-square-small:**`chunk_overlap`** = 25<br>:material-square-small:**`embedding_model`** = "huggingface:thellert/physbert_cased"<br>:material-square-small:**`main_figure_keywords`** = `property_keywords`<br>:material-square-small:**`additional_figure_keywords`** = None<br>:material-square-small:**`save_failed_pdf_report`** = True<br>:material-square-small:**`failed_pdf_report_path`** = None (auto: `{folder_path}/failed_pdf_filenames.txt`)<br>:material-square-small:**`save_failed_automated_report`** = True<br>:material-square-small:**`failed_automated_report_path`** = None (auto: `results/failed_automated_articles.txt`)
+    :material-square-small:**`source_list`** = ["elsevier", "wiley", "iop", "springer"]<br>:material-square-small:**`folder_path`** = None<br>:material-square-small:**`doi_list`** = None<br>:material-square-small:**`is_sql_db`** = False<br>:material-square-small:**`is_save_xml`** = False<br>:material-square-small:**`is_save_pdf`** = False<br>:material-square-small:**`rag_db_path`** = "db"<br>:material-square-small:**`chunk_size`** = 1000<br>:material-square-small:**`chunk_overlap`** = 25<br>:material-square-small:**`embedding_model`** = "huggingface:thellert/physbert_cased"<br>:material-square-small:**`main_figure_keywords`** = `property_keywords`<br>:material-square-small:**`additional_figure_keywords`** = None<br>:material-square-small:**`save_failed_pdf_report`** = True<br>:material-square-small:**`failed_pdf_report_path`** = None (auto: `{folder_path}/failed_pdf_filenames.txt`)<br>:material-square-small:**`is_track_pdfs`** = True<br>:material-square-small:**`track_pdfs_report_path`** = None (auto: `logs/{keyword}_pdf_processed_dois.txt`)<br>:material-square-small:**`save_failed_automated_report`** = True<br>:material-square-small:**`failed_automated_report_path`** = None (auto: `results/failed_automated_articles.txt`)

 ## Processing Workflow
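
Note on the tracking-file format: the docs in this commit describe a plain-text file of `filename<TAB>doi` entries that lets a re-run skip PDFs it has already handled. The sketch below illustrates that idea only. The helper names (`load_processed`, `record_processed`, `pending_pdfs`) are hypothetical and are not taken from ComProScanner, and the CSV fallback mentioned in the docs is not covered.

```python
from pathlib import Path


def load_processed(tracking_path: Path) -> dict[str, str]:
    """Return {filename: doi} parsed from a tab-separated tracking file."""
    processed: dict[str, str] = {}
    if not tracking_path.exists():
        return processed
    for line in tracking_path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # ignore blank lines
        filename, _, doi = line.partition("\t")
        processed[filename] = doi
    return processed


def record_processed(tracking_path: Path, filename: str, doi: str) -> None:
    """Append one filename<TAB>doi entry, creating the parent folder if needed."""
    tracking_path.parent.mkdir(parents=True, exist_ok=True)
    with tracking_path.open("a", encoding="utf-8") as handle:
        handle.write(f"{filename}\t{doi}\n")


def pending_pdfs(folder: Path, tracking_path: Path) -> list[Path]:
    """List PDFs in `folder` whose filenames are not yet in the tracking file."""
    done = load_processed(tracking_path)
    return sorted(p for p in folder.glob("*.pdf") if p.name not in done)
```

Keying the lookup on the filename rather than the DOI is what allows the skip to happen before any conversion or API call, since the DOI of a local PDF is typically only known after it has been processed.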

src/comproscanner/__init__.py

Lines changed: 2 additions & 1 deletion
@@ -16,6 +16,7 @@

 import sys
 import os
+from .utils.configs.paths_config import DefaultPaths

 # Check if the script is running under pytest
 _is_testing = "pytest" in sys.modules or "PYTEST_CURRENT_TEST" in os.environ
@@ -221,7 +222,7 @@ def evaluate_semantic(
 def evaluate_agentic(
     ground_truth_file=None,
     test_data_file=None,
-    output_file="agentic_evaluation_result.json",
+    output_file=DefaultPaths.AGENTIC_EVALUATION_RESULT_FILENAME,
     extraction_agent_model_name="gpt-4o-mini",
     is_synthesis_evaluation=True,
     weights=None,

src/comproscanner/article_processors/elsevier_processor.py

Lines changed: 1 addition & 1 deletion
@@ -161,7 +161,7 @@ def __init__(
         )
         self.save_failed_automated_report = save_failed_automated_report
         self.failed_automated_report_path = (
-            failed_automated_report_path or "results/failed_automated_articles.txt"
+            failed_automated_report_path or self.all_paths.FAILED_AUTOMATED_ARTICLES_FILENAME
         )
         self.failed_automated_count = 0
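
Note on the `DefaultPaths` pattern: the two source hunks above replace string literals with class-level constants. A minimal sketch of that pattern follows, assuming a simplified stand-in class. The two constant names appear in the diff and their values come from the changelog entry; the `Processor` class and the way `all_paths` is constructed are assumptions made for illustration.

```python
class DefaultPaths:
    """Simplified stand-in for the project's DefaultPaths class (assumption)."""

    # Names taken from the diff; values taken from the changelog entry.
    FAILED_AUTOMATED_ARTICLES_FILENAME = "results/failed_automated_articles.txt"
    AGENTIC_EVALUATION_RESULT_FILENAME = "agentic_evaluation_result.json"


def evaluate_agentic(output_file=DefaultPaths.AGENTIC_EVALUATION_RESULT_FILENAME):
    """Reduced signature: only the argument touched by the diff is shown."""
    return output_file


class Processor:
    """Hypothetical processor mirroring the `or`-fallback seen in the diff."""

    def __init__(self, failed_automated_report_path=None):
        self.all_paths = DefaultPaths()  # construction is an assumption
        self.failed_automated_report_path = (
            failed_automated_report_path
            or self.all_paths.FAILED_AUTOMATED_ARTICLES_FILENAME
        )
```

One consequence worth keeping in mind: Python evaluates default arguments once, when the function is defined, so a constant used as a default (as in `evaluate_agentic`) is fixed at import time, whereas the `or` fallback inside `__init__` reads the constant each time an instance is created.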
