Commit ec10352

Merge pull request aritraroy24#7 from aritraroy24/release/0.1.6
Release/0.1.6
2 parents c1dcbb4 + 2df5f0a commit ec10352

9 files changed

Lines changed: 449 additions & 56 deletions

CHANGELOG.md

Lines changed: 26 additions & 10 deletions
Original file line number | Diff line number | Diff line change
@@ -1,7 +1,4 @@
1-
## [Unreleased]
2-
3-
### Added
4-
1+
# Unreleased
52
- New `value_error_thresholds` parameter added to both `evaluate_semantic()` and `evaluate_agentic()` for range-based absolute error tolerances on numeric property value comparisons:
63

74
- Accepts a dict mapping `(min, max)` tuples to absolute error thresholds. When a ground-truth value falls inside a range, the extracted value is accepted if `|extracted - ground_truth| ≤ threshold`. Values outside all configured ranges fall back to exact comparison.
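
The range-based tolerance described in this changelog entry can be sketched as follows. This is a minimal illustration, not the library's code: the function names `get_error_threshold` and `values_match` are hypothetical, and two behaviours the entry does not specify are assumed here, namely that range bounds are inclusive and that the first matching range wins when ranges overlap.

```python
from typing import Dict, Optional, Tuple

# A (min, max) range mapped to an absolute error threshold.
Range = Tuple[float, float]


def get_error_threshold(
    reference: float, thresholds: Dict[Range, float]
) -> Optional[float]:
    """Return the tolerance of the first range containing `reference`, else None.

    Assumption: bounds are inclusive and dict insertion order decides ties.
    """
    for (low, high), tolerance in thresholds.items():
        if low <= reference <= high:
            return tolerance
    return None


def values_match(
    extracted: float, reference: float, thresholds: Dict[Range, float]
) -> bool:
    """Accept `extracted` if it lies within the tolerance for `reference`.

    Falls back to exact comparison when no configured range covers `reference`.
    """
    tolerance = get_error_threshold(reference, thresholds)
    if tolerance is None:
        return extracted == reference
    return abs(extracted - reference) <= tolerance


# Hypothetical configuration: tighter tolerance for small values.
thresholds = {(0, 10): 0.5, (10, 100): 2.0}

print(values_match(5.3, 5.0, thresholds))      # True: |5.3 - 5.0| is within 0.5
print(values_match(52.5, 50.0, thresholds))    # False: |52.5 - 50.0| exceeds 2.0
print(values_match(500.1, 500.0, thresholds))  # False: no range, exact comparison
```

Keying the dict on ground-truth ranges rather than on the extracted value keeps the tolerance a property of the reference data, so the same extracted value is always judged against the same threshold.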
@@ -26,7 +23,21 @@
2623

2724
- `process_articles()` now routes user-provided `doi_list` by `general_publisher` from metadata and sends each DOI only to its matching source processor.
2825

29-
## [0.1.5] - 08-02-2026
26+
---
27+
## [0.1.6] - 2026-04-02
28+
### Changed
29+
- Updated [README.md](README.md), [CITATION.cff](CITATION.cff) and docs with the published version (advance article) of the ComProScanner paper in _Digital Discovery_ as fully open access:
30+
- [ComProScanner: a multi-agent based framework for composition-property structured data extraction from scientific literature](https://doi.org/10.1039/D5DD00521C)
31+
32+
### Added
33+
- Guide for API key creation for various LLM providers and publisher APIs added to the documentation at `docs/getting-started/api-key-guide.md` with detailed instructions for each provider.
34+
35+
### Fixed
36+
- Model prefix handling in `rag_tool.py` standardized to reflect the docs.
37+
- `HF_TOKEN` documentation clarified as optional — only required for gated or private Hugging Face models.
38+
39+
---
40+
## [0.1.5] - 2026-02-08
3041

3142
### Added
3243
- Data related to comparison with other agentic data extraction frameworks added for the ComProScanner paper in the `examples/piezo_test/comparing_existing_frameworks` folder.
@@ -97,7 +108,8 @@
97108

98109
- README badges section converted from HTML to markdown format for better compatibility across platforms.
99110

100-
## [0.1.4] - 02-12-2025
111+
---
112+
## [0.1.4] - 2025-12-02
101113

102114
### Added
103115

@@ -132,30 +144,34 @@
132144
- [ComProScanner Logo](https://raw.githubusercontent.com/aritraroy24/ComProScanner/main/assets/comproscanner_logo.png)
133145
- [ComProScanner Workflow](https://raw.githubusercontent.com/aritraroy24/ComProScanner/main/assets/overall_workflow.png)
134146

135-
## [0.1.3] - 04-11-2025
147+
---
148+
## [0.1.3] - 2025-11-04
136149

137150
### Fixed
138151

139152
- **RecursiveCharacterTextSplitter** importing updated for latest _langchain_ version to avoid import errors:
140153
- Changed from `from langchain.text_splitter import RecursiveCharacterTextSplitter`
141154
- To `from langchain.text_splitter.recursive_character import RecursiveCharacterTextSplitter`
142155

143-
## [0.1.2] - 24-10-2025
156+
---
157+
## [0.1.2] - 2025-10-24
144158

145159
### Added
146160

147161
- Link to ComProScanner preprint on arXiv in the documentation index page and README.md:
148162
- [arXiv:2510.20362](https://arxiv.org/abs/2510.20362)
149163

150-
## [0.1.1] - 22-10-2025
164+
---
165+
## [0.1.1] - 2025-10-22
151166

152167
### Fixed
153168

154169
- README images updated with external image link to fix PyPI rendering issue.
155170
- [ComProScanner Logo](https://i.ibb.co/whHSbGvT/comproscanner-logo.png)
156171
- [ComProScanner Workflow](https://i.ibb.co/QWd2qd3/overall-workflow.png)
157172

158-
## [0.1.0] - 22-10-2025
173+
---
174+
## [0.1.0] - 2025-10-22
159175

160176
### Added
161177

CITATION.cff

Lines changed: 13 additions & 6 deletions
@@ -16,7 +16,7 @@ contact:
1616
- family-names: Roy
1717
given-names: Aritra
1818
orcid: "https://orcid.org/0000-0002-4928-2935"
19-
message: If you use this software, please cite our article on arXiv.
19+
message: If you use this software, please cite our article in Digital Discovery.
2020
preferred-citation:
2121
authors:
2222
- family-names: Roy
@@ -31,21 +31,28 @@ preferred-citation:
3131
- family-names: Gattinoni
3232
given-names: Chiara
3333
orcid: "https://orcid.org/0000-0002-3376-6374"
34-
date-published: 2025-10-23
34+
doi: "10.1039/D5DD00521C"
3535
identifiers:
36+
- type: doi
37+
value: "10.1039/D5DD00521C"
38+
description: "Peer-reviewed article"
3639
- type: other
3740
value: "arXiv:2510.20362"
3841
description: "arXiv preprint"
39-
title: "ComProScanner: A multi-agent based framework for composition-property structured data extraction from scientific literature"
42+
journal: "Digital Discovery"
43+
publisher:
44+
name: "RSC"
45+
status: advance-online
46+
title: "ComProScanner: a multi-agent based framework for composition-property structured data extraction from scientific literature"
4047
type: article
41-
url: "https://arxiv.org/abs/2510.20362"
48+
url: "https://doi.org/10.1039/D5DD00521C"
4249
repository-code: "https://github.com/slimeslab/ComProScanner"
4350
license: MIT
4451
title: "ComProScanner: A multi-agent based framework for composition-property structured data extraction from scientific literature"
4552
type: software
4653
url: "https://slimeslab.github.io/ComProScanner/"
47-
version: "0.1.4"
48-
date-released: 2025-12-03
54+
version: "0.1.6"
55+
date-released: 2026-04-02
4956
keywords:
5057
- materials science
5158
- data extraction

README.md

Lines changed: 9 additions & 8 deletions
@@ -169,14 +169,15 @@ eval_visualizer.plot_multiple_radar_charts(
169169
If you use ComProScanner in your research, please cite:
170170

171171
```bibtex
172-
@misc{roy2025comproscannermultiagentbasedframework,
173-
title={ComProScanner: A multi-agent based framework for composition-property structured data extraction from scientific literature},
174-
author={Aritra Roy and Enrico Grisan and John Buckeridge and Chiara Gattinoni},
175-
year={2025},
176-
eprint={2510.20362},
177-
archivePrefix={arXiv},
178-
primaryClass={physics.comp-ph},
179-
url={https://arxiv.org/abs/2510.20362},
172+
@Article{roy2026comproscannermultiagentbasedframework,
173+
author ="Roy, Aritra and Grisan, Enrico and Buckeridge, John and Gattinoni, Chiara",
174+
title ="ComProScanner: a multi-agent based framework for composition-property structured data extraction from scientific literature",
175+
journal ="Digital Discovery",
176+
year ="2026",
177+
pages ="Accepted",
178+
publisher ="RSC",
179+
doi ="10.1039/D5DD00521C",
180+
url ="https://doi.org/10.1039/D5DD00521C"
180181
}
181182
```
182183

docs/about/changelog.md

Lines changed: 60 additions & 15 deletions
@@ -1,6 +1,42 @@
1-
## Unreleased
1+
# Unreleased
2+
- New `value_error_thresholds` parameter added to both `evaluate_semantic()` and `evaluate_agentic()` for range-based absolute error tolerances on numeric property value comparisons:
3+
4+
- Accepts a dict mapping `(min, max)` tuples to absolute error thresholds. When a ground-truth value falls inside a range, the extracted value is accepted if `|extracted - ground_truth| ≤ threshold`. Values outside all configured ranges fall back to exact comparison.
5+
6+
- **Semantic evaluation**: handled inside `_is_value_in_range()` via the new `_get_error_threshold()` helper in `MaterialsDataSemanticEvaluator`.
7+
8+
- **Agentic evaluation**: a new `GetValueErrorThresholdTool` (CrewAI `BaseTool`) is added to the composition evaluator agent when thresholds are configured. The agent calls this tool with the reference value to retrieve the tolerance before deciding on each numeric match. No tool is added and no prompt changes are made when no thresholds are provided.
9+
10+
- Exposed `value_error_thresholds` in public evaluation methods: `ComProScanner.evaluate_semantic()`, `ComProScanner.evaluate_agentic()`, `comproscanner.evaluate_semantic()`, and `comproscanner.evaluate_agentic()`.
11+
12+
- VLM-based graph data extraction added across all publishers and PDF processors:
13+
14+
- New `GraphExtractorTool` — a CrewAI agent tool that reads saved figures for a given DOI and uses a vision LLM to extract composition-property value pairs from graphs and charts. Default VLM: `gemini/gemini-3-flash-preview`.
15+
16+
- New `FigureExtractor` utility — shared helper for caption keyword-based figure filtering and saving, used by all article processors.
17+
18+
- New `caption_keywords` parameter in `process_articles()` and `extract_composition_property_data()`, and new `vlm_model` and `related_figures_base_path` parameters in `extract_composition_property_data()`.
19+
20+
- New unit tests added for all three agent tools in `tests/test_agent_tools/`.
21+
22+
### Fixed
23+
24+
- `process_articles()` now routes user-provided `doi_list` by `general_publisher` from metadata and sends each DOI only to its matching source processor.
25+
26+
---
27+
## [0.1.6] - 2026-04-02
28+
### Changed
29+
- Updated [README.md](README.md), [CITATION.cff](CITATION.cff) and docs with the published version (advance article) of the ComProScanner paper in _Digital Discovery_ as fully open access:
30+
- [ComProScanner: a multi-agent based framework for composition-property structured data extraction from scientific literature](https://doi.org/10.1039/D5DD00521C)
231

332
### Added
33+
- Guide for API key creation for various LLM providers and publisher APIs added to the documentation at `docs/getting-started/api-key-guide.md` with detailed instructions for each provider.
34+
35+
---
36+
## [0.1.5] - 2026-02-08
37+
38+
### Added
39+
- Data related to comparison with other agentic data extraction frameworks added for the ComProScanner paper in the `examples/piezo_test/comparing_existing_frameworks` folder.
440

541
- New parameter `apply_advanced_cleaning` added to data cleaning methods in `data_cleaner.py`. When set to `True`, it triggers the advanced cleaning pipeline.
642

@@ -37,9 +73,12 @@
3773

3874
- [CITATION.cff](https://github.com/slimeslab/ComProScanner/blob/main/CITATION.cff) added for standardized citation information based on the latest release and arXiv preprint.
3975

40-
- Exposed `value_error_thresholds` in public evaluation methods: `ComProScanner.evaluate_semantic()`, `ComProScanner.evaluate_agentic()`, `comproscanner.evaluate_semantic()`, and `comproscanner.evaluate_agentic()`.
41-
4276
### Fixed
77+
- OAWorks API is replaced with OpenAlex API as OAWorks is no longer available.
78+
79+
- Empty/corrupted PDF handled in `pdf_processor.py` and `wiley_processor.py` to avoid having GLYPH errors during text extraction.
80+
81+
- Data extraction failures fixed if composition-property text data is empty.
4382

4483
- CSV progress tracking in `elsevier_processor.py`:
4584

@@ -61,13 +100,12 @@
61100
- GitHub Actions CI disk space issue:
62101
- Added `--no-cache-dir` flag to pip install to reduce disk usage
63102

64-
- `process_articles()` now routes user-provided `doi_list` by `general_publisher` from metadata and sends each DOI only to its matching source processor.
65-
66103
### Changed
67104

68105
- README badges section converted from HTML to markdown format for better compatibility across platforms.
69106

70-
## [0.1.4] - 02-12-2025
107+
---
108+
## [0.1.4] - 2025-12-02
71109

72110
### Added
73111

@@ -98,32 +136,39 @@
98136

99137
### Changed
100138

101-
- README images updated with raw GitHub links for better reliability: [ComProScanner Logo](https://raw.githubusercontent.com/aritraroy24/ComProScanner/main/assets/comproscanner_logo.png), [ComProScanner Workflow](https://raw.githubusercontent.com/aritraroy24/ComProScanner/main/assets/overall_workflow.png)
139+
- README images updated with raw GitHub links for better reliability:
140+
- [ComProScanner Logo](https://raw.githubusercontent.com/aritraroy24/ComProScanner/main/assets/comproscanner_logo.png)
141+
- [ComProScanner Workflow](https://raw.githubusercontent.com/aritraroy24/ComProScanner/main/assets/overall_workflow.png)
102142

103-
## [0.1.3] - 04-11-2025
143+
---
144+
## [0.1.3] - 2025-11-04
104145

105146
### Fixed
106147

107148
- **RecursiveCharacterTextSplitter** importing updated for latest _langchain_ version to avoid import errors:
108149
- Changed from `from langchain.text_splitter import RecursiveCharacterTextSplitter`
109150
- To `from langchain.text_splitter.recursive_character import RecursiveCharacterTextSplitter`
110151

111-
## [0.1.2] - 24-10-2025
152+
---
153+
## [0.1.2] - 2025-10-24
112154

113155
### Added
114156

115-
- Link to ComProScanner preprint on arXiv in the documentation index page and README.md: [arXiv:2510.20362](https://arxiv.org/abs/2510.20362)
157+
- Link to ComProScanner preprint on arXiv in the documentation index page and README.md:
158+
- [arXiv:2510.20362](https://arxiv.org/abs/2510.20362)
116159

117-
## [0.1.1] - 22-10-2025
160+
---
161+
## [0.1.1] - 2025-10-22
118162

119163
### Fixed
120164

121-
- README images updated with external image link to fix PyPI rendering issue. [ComProScanner Logo](https://i.ibb.co/whHSbGvT/comproscanner-logo.png), [ComProScanner Workflow](https://i.ibb.co/QWd2qd3/overall-workflow.png)
165+
- README images updated with external image link to fix PyPI rendering issue.
166+
- [ComProScanner Logo](https://i.ibb.co/whHSbGvT/comproscanner-logo.png)
167+
- [ComProScanner Workflow](https://i.ibb.co/QWd2qd3/overall-workflow.png)
122168

123-
## [0.1.0] - 22-10-2025
169+
---
170+
## [0.1.0] - 2025-10-22
124171

125172
### Added
126173

127174
- Initial release of ComProScanner.
128-
129-

docs/about/citation.md

Lines changed: 9 additions & 8 deletions
@@ -3,13 +3,14 @@
33
If you use ComProScanner in your research, please cite our related paper:
44

55
```bibtex
6-
@misc{roy2025comproscannermultiagentbasedframework,
7-
title={ComProScanner: A multi-agent based framework for composition-property structured data extraction from scientific literature},
8-
author={Aritra Roy and Enrico Grisan and John Buckeridge and Chiara Gattinoni},
9-
year={2025},
10-
eprint={2510.20362},
11-
archivePrefix={arXiv},
12-
primaryClass={physics.comp-ph},
13-
url={https://arxiv.org/abs/2510.20362},
6+
@Article{roy2026comproscannermultiagentbasedframework,
7+
author ="Roy, Aritra and Grisan, Enrico and Buckeridge, John and Gattinoni, Chiara",
8+
title ="ComProScanner: a multi-agent based framework for composition-property structured data extraction from scientific literature",
9+
journal ="Digital Discovery",
10+
year ="2026",
11+
pages ="Accepted",
12+
publisher ="RSC",
13+
doi ="10.1039/D5DD00521C",
14+
url ="https://doi.org/10.1039/D5DD00521C"
1415
}
1516
```
