You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: CHANGELOG.md
+14Lines changed: 14 additions & 0 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -1,3 +1,17 @@
1
+
## [Unreleased]
2
+
3
+
### Added
4
+
5
+
- VLM-based graph data extraction added across all publishers and PDF processors:
6
+
7
+
- New `GraphExtractorTool` — a CrewAI agent tool that reads saved figures for a given DOI and uses a vision LLM to extract composition-property value pairs from graphs and charts. Default VLM: `gemini/gemini-3-flash-preview`.
8
+
9
+
- New `FigureExtractor` utility — shared helper for caption keyword-based figure filtering and saving, used by all article processors.
10
+
11
+
- New `caption_keywords` parameter in `process_articles()` and `extract_composition_property_data()`, and new `vlm_model` and `related_figures_base_path` parameters in `extract_composition_property_data()`.
12
+
13
+
- New unit tests added for all three agent tools in `tests/test_agent_tools/`.
Dictionary of keyword lists used to filter figures during article processing. Only figures whose captions match these keywords are saved for later VLM-based graph extraction. If not provided, defaults to `property_keywords`.
F --> |Yes| G[Save Article's<br>Full Text to CSV<br>and Vector DB]
131
+
F --> |Yes| I{Caption Keywords<br>Provided?}
132
+
I --> |Yes| J[Extract & Save<br>Matching Figures]
133
+
I --> |No| K[Skip Figure Extraction]
120
134
F --> |No| H[Skip Article]
121
135
```
122
136
@@ -204,6 +218,25 @@ scanner.process_articles(
204
218
)
205
219
```
206
220
221
+
### Figure Extraction for VLM-Based Graph Analysis
222
+
223
+
When `caption_keywords` are provided, figures whose captions match those keywords are automatically extracted and saved during article processing. These saved figures are later used by the `GraphExtractorTool` during data extraction to read composition-property values directly from graphs and charts using a vision LLM.
Saved figures are stored under `results/extracted_data/{main_property_keyword}/related_figures/{doi}/` alongside an `info.json` file that maps each figure to its caption text.
Copy file name to clipboardExpand all lines: docs/usage/data-extraction.md
+13-1Lines changed: 13 additions & 1 deletion
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -152,13 +152,21 @@ Number of top relevant documents to retrieve from the vector database for RAG.
152
152
153
153
Base URL for the RAG model service, used for custom or local model deployments.
154
154
155
+
#### :material-square-medium:`vlm_model`_(str)_
156
+
157
+
Name of the vision LLM model used by `GraphExtractorTool` to read composition-property values from saved figures. Supports any provider prefix supported by [LiteLLM](https://docs.litellm.ai/docs/providers) (e.g., `gemini/...`, `openai/...`, `anthropic/...`).
Path to the directory where figures were saved during article processing. Defaults to `results/extracted_data/{main_property_keyword}/related_figures`.
Optional arguments for the MaterialsFlow class to customize extraction behavior by giving additional notes, examples, and allowed methods/techniques.
158
166
159
167
!!! info "Default Values"
160
168
161
-
:material-square-small:**`start_row`** = 0<br>:material-square-small:**`num_rows`** = All rows<br>:material-square-small:**`is_test_data_preparation`** = False<br>:material-square-small:**`test_doi_list_file`** = None<br>:material-square-small:**`total_test_data`** = 50<br>:material-square-small:**`is_only_consider_test_doi_list`** = False<br>:material-square-small:**`test_random_seed`** = 42<br>:material-square-small:**`checked_doi_list_file`** = "checked_dois.txt"<br>:material-square-small:**`json_results_file`** = "results.json"<br>:material-square-small:**`csv_results_file`** = "results.csv"<br>:material-square-small:**`is_extract_synthesis_data`** = True<br>:material-square-small:**`is_save_csv`** = False<br>:material-square-small:**`is_save_relevant`** = True<br>:material-square-small:**`materials_data_identifier_query`** = "Is there any material chemical composition and corresponding {main_property_keyword} value mentioned in the paper? Give one word answer. Either yes or no."<br>:material-square-small:**`model`** = "gpt-4o-mini"<br>:material-square-small:**`api_base`** = None<br>:material-square-small:**`base_url`** = None<br>:material-square-small:**`api_key`** = None<br>:material-square-small:**`output_log_folder`** = None<br>:material-square-small:**`is_log_json`** = False<br>:material-square-small:**`task_output_folder`** = None<br>:material-square-small:**`verbose`** = True<br>:material-square-small:**`temperature`** = 0.1<br>:material-square-small:**`top_p`** = 0.9<br>:material-square-small:**`timeout`** = 60<br>:material-square-small:**`frequency_penalty`** = None<br>:material-square-small:**`max_tokens`** = 2048<br>:material-square-small:**`rag_db_path`** = "db"<br>:material-square-small:**`embedding_model`** = "huggingface:thellert/physbert_cased"<br>:material-square-small:**`rag_chat_model`** = "gpt-4o-mini"<br>:material-square-small:**`rag_max_tokens`** = 512<br>:material-square-small:**`rag_top_k`** = 3<br>:material-square-small:**`rag_base_url`** = None<br>:material-square-small:**`flow_optional_args`** = {}
169
+
:material-square-small:**`start_row`** = 0<br>:material-square-small:**`num_rows`** = All rows<br>:material-square-small:**`is_test_data_preparation`** = False<br>:material-square-small:**`test_doi_list_file`** = None<br>:material-square-small:**`total_test_data`** = 50<br>:material-square-small:**`is_only_consider_test_doi_list`** = False<br>:material-square-small:**`test_random_seed`** = 42<br>:material-square-small:**`checked_doi_list_file`** = "checked_dois.txt"<br>:material-square-small:**`json_results_file`** = "results.json"<br>:material-square-small:**`csv_results_file`** = "results.csv"<br>:material-square-small:**`is_extract_synthesis_data`** = True<br>:material-square-small:**`is_save_csv`** = False<br>:material-square-small:**`is_save_relevant`** = True<br>:material-square-small:**`materials_data_identifier_query`** = "Is there any material chemical composition and corresponding {main_property_keyword} value mentioned in the paper? Give one word answer. Either yes or no."<br>:material-square-small:**`model`** = "gpt-4o-mini"<br>:material-square-small:**`api_base`** = None<br>:material-square-small:**`base_url`** = None<br>:material-square-small:**`api_key`** = None<br>:material-square-small:**`output_log_folder`** = None<br>:material-square-small:**`is_log_json`** = False<br>:material-square-small:**`task_output_folder`** = None<br>:material-square-small:**`verbose`** = True<br>:material-square-small:**`temperature`** = 0.1<br>:material-square-small:**`top_p`** = 0.9<br>:material-square-small:**`timeout`** = 60<br>:material-square-small:**`frequency_penalty`** = None<br>:material-square-small:**`max_tokens`** = 2048<br>:material-square-small:**`rag_db_path`** = "db"<br>:material-square-small:**`embedding_model`** = "huggingface:thellert/physbert_cased"<br>:material-square-small:**`rag_chat_model`** = "gpt-4o-mini"<br>:material-square-small:**`rag_max_tokens`** = 512<br>:material-square-small:**`rag_top_k`** = 3<br>:material-square-small:**`rag_base_url`** = None<br>:material-square-small:**`vlm_model`** = "gemini/gemini-3-flash-preview"<br>:material-square-small:**`related_figures_base_path`** = "results/extracted_data/{main_property_keyword}/related_figures"<br>:material-square-small:**`flow_optional_args`** = {}
162
170
163
171
## Extraction Agents
164
172
@@ -207,6 +215,10 @@ Is there any material chemical composition and corresponding {main_property_keyw
207
215
208
216
MaterialParser Tool is used by the `Composition-Property Data Formatter` agent. Material-parser is a deep learning model, developed by [Foppiano et al.](https://doi.org/10.1080/27660400.2022.2153633), specifically designed for parsing chemical compositions with multiple fractions denoted as variables e.g., $Na_{(1-x)}Li_xTiO_3$ where x = 0.1, 0.3, and 0.4. This tool incorporates the material-parser model to accurately extract and standardize complex chemical compositions with variable fractions into the final compositions. For e.g., the previous example would be parsed into three distinct compositions: **Na(0.9)Li(0.1)TiO3**, **Na(0.7)Li(0.3)TiO3**, and **Na(0.6)Li(0.4)TiO3**.
209
217
218
+
!!! example "Graph Extractor Tool"
219
+
220
+
Graph Extractor Tool is used by the `Composition-Property Data Extractor` agent when figures have been saved during article processing. It scans the saved figure directory for the given DOI, sends each image to a configurable vision LLM (default: `gemini/gemini-3-flash-preview`), and extracts composition-property value pairs directly from graphs and charts. The extracted data is returned as structured JSON and used alongside the text-based extraction to improve coverage of graphical data. For graph extraction, the figures are saved during article processing (using `caption_keywords` in `process_articles`) and specify the VLM model at extraction time:
221
+
210
222
### 3. Synthesis Data Extractor (4️⃣) & Synthesis Data Formatter (5️⃣)
211
223
212
224
**Purpose**: `Synthesis Data Extractor` extracts synthesis related data including method, precursors, steps, and characterization techniques from the article text and finally `Synthesis Data Formatter` formats the extracted data into structured JSON similar to the following example.
Copy file name to clipboardExpand all lines: src/comproscanner/extract_flow/crews/composition_crew/composition_extraction_crew/composition_extraction_crew.py
0 commit comments