Commit d580d7d

Merge pull request #2 from aritraroy24/release/0.1.6
Release/0.1.6
2 parents f1cb4e3 + 2df5f0a commit d580d7d

24 files changed, +1491 -94 lines changed

.gitignore

Lines changed: 9 additions & 1 deletion
@@ -183,4 +183,12 @@ examples/db/10.*
 tests example/
 
 applications
-vlm_test
+vlm_test
+examples/vlm_piezo_test
+
+# Test results
+db
+results
+elsevier_test.xml
+springer_test.xml
+wiley_test.pdf

CHANGELOG.md

Lines changed: 38 additions & 8 deletions
@@ -1,6 +1,13 @@
-## [Unreleased]
+# Unreleased
+- New `value_error_thresholds` parameter added to both `evaluate_semantic()` and `evaluate_agentic()` for range-based absolute error tolerances on numeric property value comparisons:
 
-### Added
+- Accepts a dict mapping `(min, max)` tuples to absolute error thresholds. When a ground-truth value falls inside a range, the extracted value is accepted if `|extracted - ground_truth| ≤ threshold`. Values outside all configured ranges fall back to exact comparison.
+
+- **Semantic evaluation**: handled inside `_is_value_in_range()` via the new `_get_error_threshold()` helper in `MaterialsDataSemanticEvaluator`.
+
+- **Agentic evaluation**: a new `GetValueErrorThresholdTool` (CrewAI `BaseTool`) is added to the composition evaluator agent when thresholds are configured. The agent calls this tool with the reference value to retrieve the tolerance before deciding on each numeric match. No tool is added and no prompt changes are made when no thresholds are provided.
+
+- Exposed `value_error_thresholds` in public evaluation methods: `ComProScanner.evaluate_semantic()`, `ComProScanner.evaluate_agentic()`, `comproscanner.evaluate_semantic()`, and `comproscanner.evaluate_agentic()`.
 
 - VLM-based graph data extraction added across all publishers and PDF processors:

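The range-based tolerance described in the `value_error_thresholds` entry of the hunk above can be sketched as follows. This is an illustration under assumptions: the names `get_error_threshold` and `values_match` are hypothetical, the range bounds are treated as inclusive, and the first matching range wins. It is not the library's `_get_error_threshold()` or `_is_value_in_range()` code, which this diff does not show.

```python
from typing import Dict, Optional, Tuple

# Maps (min, max) ranges of the ground-truth value to an absolute tolerance.
Thresholds = Dict[Tuple[float, float], float]


def get_error_threshold(
    ground_truth: float, thresholds: Optional[Thresholds]
) -> Optional[float]:
    """Return the tolerance of the first range containing ground_truth, else None."""
    if not thresholds:
        return None
    for (low, high), tolerance in thresholds.items():
        # Inclusive bounds are an assumption; the changelog only says "inside a range".
        if low <= ground_truth <= high:
            return tolerance
    return None


def values_match(
    extracted: float, ground_truth: float, thresholds: Optional[Thresholds] = None
) -> bool:
    """Accept within the configured tolerance, or fall back to exact comparison."""
    tolerance = get_error_threshold(ground_truth, thresholds)
    if tolerance is None:
        return extracted == ground_truth
    return abs(extracted - ground_truth) <= tolerance


# Hypothetical configuration: +/-1 for values in 0..100, +/-10 for values in 100..1000.
thresholds = {(0, 100): 1.0, (100, 1000): 10.0}
print(values_match(50.5, 50.0, thresholds))      # inside the first range
print(values_match(505.0, 500.0, thresholds))    # inside the second range
print(values_match(5005.0, 5000.0, thresholds))  # outside all ranges, exact comparison
```

Keeping the lookup separate from the comparison mirrors the changelog's description of a threshold helper that both evaluation paths can query.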
@@ -12,7 +19,25 @@
 
 - New unit tests added for all three agent tools in `tests/test_agent_tools/`.
 
-## [0.1.5] - 08-02-2026
+### Fixed
+
+- `process_articles()` now routes user-provided `doi_list` by `general_publisher` from metadata and sends each DOI only to its matching source processor.
+
+---
+## [0.1.6] - 2026-04-02
+### Changed
+- Updated [README.md](README.md), [CITATION.cff](CITATION.cff) and docs with the published version (advance article) of the ComProScanner paper in _Digital Discovery_ as fully open access:
+- [ComProScanner: a multi-agent based framework for composition-property structured data extraction from scientific literature](https://doi.org/10.1039/D5DD00521C)
+
+### Added
+- Guide for API key creation for various LLM providers and publisher APIs added to the documentation at `docs/getting-started/api-key-guide.md` with detailed instructions for each provider.
+
+### Fixed
+- Model prefix handling in `rag_tool.py` standardized to reflect the docs.
+- `HF_TOKEN` documentation clarified as optional — only required for gated or private Hugging Face models.
+
+---
+## [0.1.5] - 2026-02-08
 
 ### Added
 - Data related to comparison with other agentic data extraction frameworks added for the ComProScanner paper in the `examples/piezo_test/comparing_existing_frameworks` folder.
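The `doi_list` routing fix in the hunk above amounts to grouping DOIs by publisher before dispatch. The sketch below assumes a hypothetical `metadata` mapping from DOI to a `general_publisher` string and a hypothetical function `route_dois_by_publisher`; the DOIs and publisher labels are placeholders. The changelog does not say how `process_articles()` treats DOIs missing from the metadata, so collecting them under `"unknown"` is also an assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def route_dois_by_publisher(
    doi_list: Iterable[str], metadata: Dict[str, str]
) -> Dict[str, List[str]]:
    """Group DOIs by publisher so each one reaches only its matching processor."""
    routed: Dict[str, List[str]] = defaultdict(list)
    for doi in doi_list:
        # Assumption: DOIs without metadata are set aside rather than sent everywhere.
        publisher = metadata.get(doi, "unknown")
        routed[publisher].append(doi)
    return dict(routed)


# Placeholder DOIs and publisher labels, for illustration only.
metadata = {
    "10.1016/example.1": "elsevier",
    "10.1007/example.2": "springer",
    "10.1002/example.3": "wiley",
}
routed = route_dois_by_publisher(
    ["10.1016/example.1", "10.1002/example.3", "10.9999/example.4"], metadata
)
for publisher, dois in routed.items():
    print(publisher, dois)
```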
@@ -83,7 +108,8 @@
 
 - README badges section converted from HTML to markdown format for better compatibility across platforms.
 
-## [0.1.4] - 02-12-2025
+---
+## [0.1.4] - 2025-12-02
 
 ### Added

@@ -118,30 +144,34 @@
 - [ComProScanner Logo](https://raw.githubusercontent.com/aritraroy24/ComProScanner/main/assets/comproscanner_logo.png)
 - [ComProScanner Workflow](https://raw.githubusercontent.com/aritraroy24/ComProScanner/main/assets/overall_workflow.png)
 
-## [0.1.3] - 04-11-2025
+---
+## [0.1.3] - 2025-11-04
 
 ### Fixed
 
 - **RecursiveCharacterTextSplitter** importing updated for latest _langchain_ version to avoid import errors:
 - Changed from `from langchain.text_splitter import RecursiveCharacterTextSplitter`
 - To `from langchain.text_splitter.recursive_character import RecursiveCharacterTextSplitter`
 
-## [0.1.2] - 24-10-2025
+---
+## [0.1.2] - 2025-10-24
 
 ### Added
 
 - Link to ComProScanner preprint on arXiv in the documentation index page and README.md:
 - [arXiv:2510.20362](https://arxiv.org/abs/2510.20362)
 
-## [0.1.1] - 22-10-2025
+---
+## [0.1.1] - 2025-10-22
 
 ### Fixed
 
 - README images updated with external image link to fix PyPI rendering issue.
 - [ComProScanner Logo](https://i.ibb.co/whHSbGvT/comproscanner-logo.png)
 - [ComProScanner Workflow](https://i.ibb.co/QWd2qd3/overall-workflow.png)
 
-## [0.1.0] - 22-10-2025
+---
+## [0.1.0] - 2025-10-22
 
 ### Added

CITATION.cff

Lines changed: 13 additions & 6 deletions
@@ -16,7 +16,7 @@ contact:
 - family-names: Roy
 given-names: Aritra
 orcid: "https://orcid.org/0000-0002-4928-2935"
-message: If you use this software, please cite our article on arXiv.
+message: If you use this software, please cite our article in Digital Discovery.
 preferred-citation:
 authors:
 - family-names: Roy
@@ -31,21 +31,28 @@ preferred-citation:
 - family-names: Gattinoni
 given-names: Chiara
 orcid: "https://orcid.org/0000-0002-3376-6374"
-date-published: 2025-10-23
+doi: "10.1039/D5DD00521C"
 identifiers:
+- type: doi
+value: "10.1039/D5DD00521C"
+description: "Peer-reviewed article"
 - type: other
 value: "arXiv:2510.20362"
 description: "arXiv preprint"
-title: "ComProScanner: A multi-agent based framework for composition-property structured data extraction from scientific literature"
+journal: "Digital Discovery"
+publisher:
+name: "RSC"
+status: advance-online
+title: "ComProScanner: a multi-agent based framework for composition-property structured data extraction from scientific literature"
 type: article
-url: "https://arxiv.org/abs/2510.20362"
+url: "https://doi.org/10.1039/D5DD00521C"
 repository-code: "https://github.com/slimeslab/ComProScanner"
 license: MIT
 title: "ComProScanner: A multi-agent based framework for composition-property structured data extraction from scientific literature"
 type: software
 url: "https://slimeslab.github.io/ComProScanner/"
-version: "0.1.4"
-date-released: 2025-12-03
+version: "0.1.6"
+date-released: 2026-04-02
 keywords:
 - materials science
 - data extraction

README.md

Lines changed: 9 additions & 8 deletions
@@ -169,14 +169,15 @@ eval_visualizer.plot_multiple_radar_charts(
 If you use ComProScanner in your research, please cite:
 
 ```bibtex
-@misc{roy2025comproscannermultiagentbasedframework,
-title={ComProScanner: A multi-agent based framework for composition-property structured data extraction from scientific literature},
-author={Aritra Roy and Enrico Grisan and John Buckeridge and Chiara Gattinoni},
-year={2025},
-eprint={2510.20362},
-archivePrefix={arXiv},
-primaryClass={physics.comp-ph},
-url={https://arxiv.org/abs/2510.20362},
+@Article{roy2026comproscannermultiagentbasedframework,
+author ="Roy, Aritra and Grisan, Enrico and Buckeridge, John and Gattinoni, Chiara",
+title ="ComProScanner: a multi-agent based framework for composition-property structured data extraction from scientific literature",
+journal ="Digital Discovery",
+year ="2026",
+pages ="Accepted",
+publisher ="RSC",
+doi ="10.1039/D5DD00521C",
+url ="https://doi.org/10.1039/D5DD00521C"
 }
 ```

docs/about/changelog.md

Lines changed: 60 additions & 9 deletions
@@ -1,6 +1,42 @@
-## Unreleased
+# Unreleased
+- New `value_error_thresholds` parameter added to both `evaluate_semantic()` and `evaluate_agentic()` for range-based absolute error tolerances on numeric property value comparisons:
+
+- Accepts a dict mapping `(min, max)` tuples to absolute error thresholds. When a ground-truth value falls inside a range, the extracted value is accepted if `|extracted - ground_truth| ≤ threshold`. Values outside all configured ranges fall back to exact comparison.
+
+- **Semantic evaluation**: handled inside `_is_value_in_range()` via the new `_get_error_threshold()` helper in `MaterialsDataSemanticEvaluator`.
+
+- **Agentic evaluation**: a new `GetValueErrorThresholdTool` (CrewAI `BaseTool`) is added to the composition evaluator agent when thresholds are configured. The agent calls this tool with the reference value to retrieve the tolerance before deciding on each numeric match. No tool is added and no prompt changes are made when no thresholds are provided.
+
+- Exposed `value_error_thresholds` in public evaluation methods: `ComProScanner.evaluate_semantic()`, `ComProScanner.evaluate_agentic()`, `comproscanner.evaluate_semantic()`, and `comproscanner.evaluate_agentic()`.
+
+- VLM-based graph data extraction added across all publishers and PDF processors:
+
+- New `GraphExtractorTool` — a CrewAI agent tool that reads saved figures for a given DOI and uses a vision LLM to extract composition-property value pairs from graphs and charts. Default VLM: `gemini/gemini-3-flash-preview`.
+
+- New `FigureExtractor` utility — shared helper for caption keyword-based figure filtering and saving, used by all article processors.
+
+- New `caption_keywords` parameter in `process_articles()` and `extract_composition_property_data()`, and new `vlm_model` and `related_figures_base_path` parameters in `extract_composition_property_data()`.
+
+- New unit tests added for all three agent tools in `tests/test_agent_tools/`.
+
+### Fixed
+
+- `process_articles()` now routes user-provided `doi_list` by `general_publisher` from metadata and sends each DOI only to its matching source processor.
+
+---
+## [0.1.6] - 2026-04-02
+### Changed
+- Updated [README.md](README.md), [CITATION.cff](CITATION.cff) and docs with the published version (advance article) of the ComProScanner paper in _Digital Discovery_ as fully open access:
+- [ComProScanner: a multi-agent based framework for composition-property structured data extraction from scientific literature](https://doi.org/10.1039/D5DD00521C)
+
+### Added
+- Guide for API key creation for various LLM providers and publisher APIs added to the documentation at `docs/getting-started/api-key-guide.md` with detailed instructions for each provider.
+
+---
+## [0.1.5] - 2026-02-08
 
 ### Added
+- Data related to comparison with other agentic data extraction frameworks added for the ComProScanner paper in the `examples/piezo_test/comparing_existing_frameworks` folder.
 
 - New parameter `apply_advanced_cleaning` added to data cleaning methods in `data_cleaner.py`. When set to `True`, it triggers the advanced cleaning pipeline.

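The caption keyword-based figure filtering mentioned for `FigureExtractor` and `caption_keywords` in the hunk above could look roughly like the sketch below. The `Figure` dataclass, the `filter_figures_by_caption` function, and the case-insensitive substring rule are all assumptions for illustration; the utility's real interface is not part of this diff.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Figure:
    """Hypothetical stand-in for a figure parsed from an article."""
    label: str
    caption: str


def filter_figures_by_caption(
    figures: Sequence[Figure], caption_keywords: Sequence[str]
) -> List[Figure]:
    """Keep figures whose caption contains at least one keyword (case-insensitive)."""
    keywords = [kw.lower() for kw in caption_keywords if kw]
    return [
        fig for fig in figures
        if any(kw in fig.caption.lower() for kw in keywords)
    ]


# Made-up captions, for illustration only.
figures = [
    Figure("Fig. 1", "XRD patterns of the sintered ceramics"),
    Figure("Fig. 2", "Piezoelectric coefficient d33 as a function of composition"),
]
selected = filter_figures_by_caption(figures, ["d33", "piezoelectric"])
print([fig.label for fig in selected])
```

Filtering on captions before calling a vision model keeps the number of images sent to the VLM small, which is presumably the motivation for the shared helper.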
@@ -38,6 +74,11 @@
 - [CITATION.cff](https://github.com/slimeslab/ComProScanner/blob/main/CITATION.cff) added for standardized citation information based on the latest release and arXiv preprint.
 
 ### Fixed
+- OAWorks API is replaced with OpenAlex API as OAWorks is no longer available.
+
+- Empty/corrupted PDF handled in `pdf_processor.py` and `wiley_processor.py` to avoid having GLYPH errors during text extraction.
+
+- Data extraction failures fixed if composition-property text data is empty.
 
 - CSV progress tracking in `elsevier_processor.py`:

@@ -63,7 +104,8 @@
 
 - README badges section converted from HTML to markdown format for better compatibility across platforms.
 
-## [0.1.4] - 02-12-2025
+---
+## [0.1.4] - 2025-12-02
 
 ### Added

@@ -94,29 +136,38 @@
 
 ### Changed
 
-- README images updated with raw GitHub links for better reliability: [ComProScanner Logo](https://raw.githubusercontent.com/aritraroy24/ComProScanner/main/assets/comproscanner_logo.png), [ComProScanner Workflow](https://raw.githubusercontent.com/aritraroy24/ComProScanner/main/assets/overall_workflow.png)
+- README images updated with raw GitHub links for better reliability:
+- [ComProScanner Logo](https://raw.githubusercontent.com/aritraroy24/ComProScanner/main/assets/comproscanner_logo.png)
+- [ComProScanner Workflow](https://raw.githubusercontent.com/aritraroy24/ComProScanner/main/assets/overall_workflow.png)
 
-## [0.1.3] - 04-11-2025
+---
+## [0.1.3] - 2025-11-04
 
 ### Fixed
 
 - **RecursiveCharacterTextSplitter** importing updated for latest _langchain_ version to avoid import errors:
 - Changed from `from langchain.text_splitter import RecursiveCharacterTextSplitter`
 - To `from langchain.text_splitter.recursive_character import RecursiveCharacterTextSplitter`
 
-## [0.1.2] - 24-10-2025
+---
+## [0.1.2] - 2025-10-24
 
 ### Added
 
-- Link to ComProScanner preprint on arXiv in the documentation index page and README.md: [arXiv:2510.20362](https://arxiv.org/abs/2510.20362)
+- Link to ComProScanner preprint on arXiv in the documentation index page and README.md:
+- [arXiv:2510.20362](https://arxiv.org/abs/2510.20362)
 
-## [0.1.1] - 22-10-2025
+---
+## [0.1.1] - 2025-10-22
 
 ### Fixed
 
-- README images updated with external image link to fix PyPI rendering issue. [ComProScanner Logo](https://i.ibb.co/whHSbGvT/comproscanner-logo.png), [ComProScanner Workflow](https://i.ibb.co/QWd2qd3/overall-workflow.png)
+- README images updated with external image link to fix PyPI rendering issue.
+- [ComProScanner Logo](https://i.ibb.co/whHSbGvT/comproscanner-logo.png)
+- [ComProScanner Workflow](https://i.ibb.co/QWd2qd3/overall-workflow.png)
 
-## [0.1.0] - 22-10-2025
+---
+## [0.1.0] - 2025-10-22
 
 ### Added

docs/about/citation.md

Lines changed: 9 additions & 8 deletions
@@ -3,13 +3,14 @@
 If you use ComProScanner in your research, please cite our related paper:
 
 ```bibtex
-@misc{roy2025comproscannermultiagentbasedframework,
-title={ComProScanner: A multi-agent based framework for composition-property structured data extraction from scientific literature},
-author={Aritra Roy and Enrico Grisan and John Buckeridge and Chiara Gattinoni},
-year={2025},
-eprint={2510.20362},
-archivePrefix={arXiv},
-primaryClass={physics.comp-ph},
-url={https://arxiv.org/abs/2510.20362},
+@Article{roy2026comproscannermultiagentbasedframework,
+author ="Roy, Aritra and Grisan, Enrico and Buckeridge, John and Gattinoni, Chiara",
+title ="ComProScanner: a multi-agent based framework for composition-property structured data extraction from scientific literature",
+journal ="Digital Discovery",
+year ="2026",
+pages ="Accepted",
+publisher ="RSC",
+doi ="10.1039/D5DD00521C",
+url ="https://doi.org/10.1039/D5DD00521C"
 }
 ```
