Commit f1cb4e3 (parent 433f17a)

test, doc added for figure extraction functionality

13 files changed: +568 −23 lines

CHANGELOG.md

Lines changed: 14 additions & 0 deletions
```diff
@@ -1,3 +1,17 @@
+## [Unreleased]
+
+### Added
+
+- VLM-based graph data extraction added across all publishers and PDF processors:
+
+  - New `GraphExtractorTool` — a CrewAI agent tool that reads saved figures for a given DOI and uses a vision LLM to extract composition-property value pairs from graphs and charts. Default VLM: `gemini/gemini-3-flash-preview`.
+
+  - New `FigureExtractor` utility — shared helper for caption keyword-based figure filtering and saving, used by all article processors.
+
+  - New `caption_keywords` parameter in `process_articles()` and `extract_composition_property_data()`, and new `vlm_model` and `related_figures_base_path` parameters in `extract_composition_property_data()`.
+
+  - New unit tests added for all three agent tools in `tests/test_agent_tools/`.
+
 ## [0.1.5] - 08-02-2026
 
 ### Added
```
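The `FigureExtractor`'s caption keyword-based filtering can be pictured with a short sketch; the exact-token vs. substring semantics below are an assumption inferred from the `exact_keywords`/`substring_keywords` names, not the package's actual matcher:

```python
def caption_matches(caption: str, caption_keywords: dict) -> bool:
    """Return True if a figure caption matches any configured keyword.

    Hypothetical semantics: exact_keywords match whole whitespace-separated
    tokens, substring_keywords match anywhere in the lowercased caption.
    """
    lowered = caption.lower()
    tokens = lowered.split()
    exact = [k.lower() for k in caption_keywords.get("exact_keywords", [])]
    subs = [k.lower() for k in caption_keywords.get("substring_keywords", [])]
    return any(k in tokens for k in exact) or any(s in lowered for s in subs)

keywords = {"exact_keywords": ["d33"], "substring_keywords": [" d 33 "]}
print(caption_matches("Fig. 2: d33 versus composition x", keywords))  # True
print(caption_matches("Fig. 3: dielectric loss spectra", keywords))   # False
```

Only figures passing such a check would be saved for the later VLM extraction step.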

docs/usage/article-processing.md

Lines changed: 34 additions & 1 deletion
````diff
@@ -102,9 +102,20 @@ Overlap size between chunks for creating vector databases for RAG.
 
 Name of the embedding model to use for creating vector databases for RAG.
 
+#### :material-square-medium:`caption_keywords` _(dict)_
+
+Dictionary of keyword lists used to filter figures during article processing. Only figures whose captions match these keywords are saved for later VLM-based graph extraction. If not provided, defaults to `property_keywords`.
+
+```python
+caption_keywords = {
+    "exact_keywords": ["d33"],
+    "substring_keywords": [" d 33 "]
+}
+```
+
 !!! info "Default Values"
 
-    :material-square-small:**`source_list`** = ["elsevier", "wiley", "iop", "springer"]<br>:material-square-small:**`folder_path`** = None<br>:material-square-small:**`doi_list`** = None<br>:material-square-small:**`is_sql_db`** = False<br>:material-square-small:**`is_save_xml`** = False<br>:material-square-small:**`is_save_pdf`** = False<br>:material-square-small:**`rag_db_path`** = "db"<br>:material-square-small:**`chunk_size`** = 1000<br>:material-square-small:**`chunk_overlap`** = 25<br>:material-square-small:**`embedding_model`** = "huggingface:thellert/physbert_cased"
+    :material-square-small:**`source_list`** = ["elsevier", "wiley", "iop", "springer"]<br>:material-square-small:**`folder_path`** = None<br>:material-square-small:**`doi_list`** = None<br>:material-square-small:**`is_sql_db`** = False<br>:material-square-small:**`is_save_xml`** = False<br>:material-square-small:**`is_save_pdf`** = False<br>:material-square-small:**`rag_db_path`** = "db"<br>:material-square-small:**`chunk_size`** = 1000<br>:material-square-small:**`chunk_overlap`** = 25<br>:material-square-small:**`embedding_model`** = "huggingface:thellert/physbert_cased"<br>:material-square-small:**`caption_keywords`** = `property_keywords`
 
 ## Processing Workflow
 
@@ -117,6 +128,9 @@ graph TB
     D --> E
     E --> F{Is Keyword Present?}
     F --> |Yes| G[Save Article's<br>Full Text to CSV<br>and Vector DB]
+    F --> |Yes| I{Caption Keywords<br>Provided?}
+    I --> |Yes| J[Extract & Save<br>Matching Figures]
+    I --> |No| K[Skip Figure Extraction]
     F --> |No| H[Skip Article]
 ```
 
@@ -204,6 +218,25 @@ scanner.process_articles(
 )
 ```
 
+### Figure Extraction for VLM-Based Graph Analysis
+
+When `caption_keywords` are provided, figures whose captions match those keywords are automatically extracted and saved during article processing. These saved figures are later used by the `GraphExtractorTool` during data extraction to read composition-property values directly from graphs and charts using a vision LLM.
+
+```python
+caption_keywords = {
+    "exact_keywords": ["d33"],
+    "substring_keywords": [" d 33 "]
+}
+
+scanner.process_articles(
+    property_keywords=property_keywords,
+    caption_keywords=caption_keywords,
+    source_list=["elsevier", "springer", "wiley", "iop", "pdfs"]
+)
+```
+
+Saved figures are stored under `results/extracted_data/{main_property_keyword}/related_figures/{doi}/` alongside an `info.json` file that maps each figure to its caption text.
+
 ### RAG Vector Database
 
 ```python
````
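The per-DOI layout with its `info.json` caption map lends itself to a small reader helper. A hedged sketch — the flat `{filename: caption}` schema and the `/` → `_` DOI sanitization are assumptions for illustration, not taken from the package:

```python
import json
import tempfile
from pathlib import Path

def load_figure_captions(base_path: str, doi: str) -> dict:
    """Read the info.json caption map saved for one DOI.

    Assumptions: the DOI's '/' is sanitized to '_' in the directory name,
    and info.json is a flat {figure_filename: caption} mapping.
    """
    info_file = Path(base_path) / doi.replace("/", "_") / "info.json"
    if not info_file.exists():
        return {}
    return json.loads(info_file.read_text())

# Demonstration against a throwaway layout mirroring the documented path scheme
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / "results/extracted_data/d33/related_figures"
    doi_dir = base / "10.1000_example.doi"
    doi_dir.mkdir(parents=True)
    (doi_dir / "info.json").write_text(
        json.dumps({"figure_3.png": "d33 as a function of x"})
    )
    captions = load_figure_captions(str(base), "10.1000/example.doi")
    print(captions)  # {'figure_3.png': 'd33 as a function of x'}
```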

docs/usage/data-extraction.md

Lines changed: 13 additions & 1 deletion
```diff
@@ -152,13 +152,21 @@ Number of top relevant documents to retrieve from the vector database for RAG.
 
 Base URL for the RAG model service, used for custom or local model deployments.
 
+#### :material-square-medium:`vlm_model` _(str)_
+
+Name of the vision LLM model used by `GraphExtractorTool` to read composition-property values from saved figures. Supports any provider prefix supported by [LiteLLM](https://docs.litellm.ai/docs/providers) (e.g., `gemini/...`, `openai/...`, `anthropic/...`).
+
+#### :material-square-medium:`related_figures_base_path` _(str)_
+
+Path to the directory where figures were saved during article processing. Defaults to `results/extracted_data/{main_property_keyword}/related_figures`.
+
 #### :material-square-medium:`**flow_optional_args` _(dict)_
 
 Optional arguments for the MaterialsFlow class to customize extraction behavior by giving additional notes, examples, and allowed methods/techniques.
 
 !!! info "Default Values"
 
-    :material-square-small:**`start_row`** = 0<br>:material-square-small:**`num_rows`** = All rows<br>:material-square-small:**`is_test_data_preparation`** = False<br>:material-square-small:**`test_doi_list_file`** = None<br>:material-square-small:**`total_test_data`** = 50<br>:material-square-small:**`is_only_consider_test_doi_list`** = False<br>:material-square-small:**`test_random_seed`** = 42<br>:material-square-small:**`checked_doi_list_file`** = "checked_dois.txt"<br>:material-square-small:**`json_results_file`** = "results.json"<br>:material-square-small:**`csv_results_file`** = "results.csv"<br>:material-square-small:**`is_extract_synthesis_data`** = True<br>:material-square-small:**`is_save_csv`** = False<br>:material-square-small:**`is_save_relevant`** = True<br>:material-square-small:**`materials_data_identifier_query`** = "Is there any material chemical composition and corresponding {main_property_keyword} value mentioned in the paper? Give one word answer. Either yes or no."<br>:material-square-small:**`model`** = "gpt-4o-mini"<br>:material-square-small:**`api_base`** = None<br>:material-square-small:**`base_url`** = None<br>:material-square-small:**`api_key`** = None<br>:material-square-small:**`output_log_folder`** = None<br>:material-square-small:**`is_log_json`** = False<br>:material-square-small:**`task_output_folder`** = None<br>:material-square-small:**`verbose`** = True<br>:material-square-small:**`temperature`** = 0.1<br>:material-square-small:**`top_p`** = 0.9<br>:material-square-small:**`timeout`** = 60<br>:material-square-small:**`frequency_penalty`** = None<br>:material-square-small:**`max_tokens`** = 2048<br>:material-square-small:**`rag_db_path`** = "db"<br>:material-square-small:**`embedding_model`** = "huggingface:thellert/physbert_cased"<br>:material-square-small:**`rag_chat_model`** = "gpt-4o-mini"<br>:material-square-small:**`rag_max_tokens`** = 512<br>:material-square-small:**`rag_top_k`** = 3<br>:material-square-small:**`rag_base_url`** = None<br>:material-square-small:**`flow_optional_args`** = {}
+    :material-square-small:**`start_row`** = 0<br>:material-square-small:**`num_rows`** = All rows<br>:material-square-small:**`is_test_data_preparation`** = False<br>:material-square-small:**`test_doi_list_file`** = None<br>:material-square-small:**`total_test_data`** = 50<br>:material-square-small:**`is_only_consider_test_doi_list`** = False<br>:material-square-small:**`test_random_seed`** = 42<br>:material-square-small:**`checked_doi_list_file`** = "checked_dois.txt"<br>:material-square-small:**`json_results_file`** = "results.json"<br>:material-square-small:**`csv_results_file`** = "results.csv"<br>:material-square-small:**`is_extract_synthesis_data`** = True<br>:material-square-small:**`is_save_csv`** = False<br>:material-square-small:**`is_save_relevant`** = True<br>:material-square-small:**`materials_data_identifier_query`** = "Is there any material chemical composition and corresponding {main_property_keyword} value mentioned in the paper? Give one word answer. Either yes or no."<br>:material-square-small:**`model`** = "gpt-4o-mini"<br>:material-square-small:**`api_base`** = None<br>:material-square-small:**`base_url`** = None<br>:material-square-small:**`api_key`** = None<br>:material-square-small:**`output_log_folder`** = None<br>:material-square-small:**`is_log_json`** = False<br>:material-square-small:**`task_output_folder`** = None<br>:material-square-small:**`verbose`** = True<br>:material-square-small:**`temperature`** = 0.1<br>:material-square-small:**`top_p`** = 0.9<br>:material-square-small:**`timeout`** = 60<br>:material-square-small:**`frequency_penalty`** = None<br>:material-square-small:**`max_tokens`** = 2048<br>:material-square-small:**`rag_db_path`** = "db"<br>:material-square-small:**`embedding_model`** = "huggingface:thellert/physbert_cased"<br>:material-square-small:**`rag_chat_model`** = "gpt-4o-mini"<br>:material-square-small:**`rag_max_tokens`** = 512<br>:material-square-small:**`rag_top_k`** = 3<br>:material-square-small:**`rag_base_url`** = None<br>:material-square-small:**`vlm_model`** = "gemini/gemini-3-flash-preview"<br>:material-square-small:**`related_figures_base_path`** = "results/extracted_data/{main_property_keyword}/related_figures"<br>:material-square-small:**`flow_optional_args`** = {}
 
 ## Extraction Agents
 
@@ -207,6 +215,10 @@ Is there any material chemical composition and corresponding {main_property_keyword} value mentioned in the paper? Give one word answer. Either yes or no.
 
 MaterialParser Tool is used by the `Composition-Property Data Formatter` agent. Material-parser is a deep learning model, developed by [Foppiano et al.](https://doi.org/10.1080/27660400.2022.2153633), specifically designed for parsing chemical compositions with multiple fractions denoted as variables e.g., $Na_{(1-x)}Li_xTiO_3$ where x = 0.1, 0.3, and 0.4. This tool incorporates the material-parser model to accurately extract and standardize complex chemical compositions with variable fractions into the final compositions. E.g., the previous example would be parsed into three distinct compositions: **Na(0.9)Li(0.1)TiO3**, **Na(0.7)Li(0.3)TiO3**, and **Na(0.6)Li(0.4)TiO3**.
 
+!!! example "Graph Extractor Tool"
+
+    Graph Extractor Tool is used by the `Composition-Property Data Extractor` agent when figures have been saved during article processing. It scans the saved figure directory for the given DOI, sends each image to a configurable vision LLM (default: `gemini/gemini-3-flash-preview`), and extracts composition-property value pairs directly from graphs and charts. The extracted data is returned as structured JSON and used alongside the text-based extraction to improve coverage of graphical data. To enable graph extraction, save the figures during article processing (using `caption_keywords` in `process_articles`) and specify the VLM model at extraction time.
+
 ### 3. Synthesis Data Extractor (4️⃣) & Synthesis Data Formatter (5️⃣)
 
 **Purpose**: `Synthesis Data Extractor` extracts synthesis related data including method, precursors, steps, and characterization techniques from the article text and finally `Synthesis Data Formatter` formats the extracted data into structured JSON similar to the following example.
```
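Under the hood, sending a saved figure to a LiteLLM-routed vision model uses the standard base64 image content part. A minimal sketch of that packaging step — this mirrors the generic LiteLLM/OpenAI vision message shape, not the tool's actual code, and `figure_to_content_part` is a hypothetical helper:

```python
import base64
import tempfile
from pathlib import Path

def figure_to_content_part(path: str, mime: str = "image/png") -> dict:
    """Package a saved figure as an OpenAI/LiteLLM-style image content part.

    LiteLLM accepts this shape for vision-capable providers such as
    gemini/... and openai/... models.
    """
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime};base64,{encoded}"},
    }

# Demonstration with a throwaway file standing in for a saved figure
with tempfile.TemporaryDirectory() as tmp:
    fig = Path(tmp) / "figure_1.png"
    fig.write_bytes(b"fake")
    part = figure_to_content_part(str(fig))
```

A call like `litellm.completion(model=vlm_model, messages=[{"role": "user", "content": [text_part, image_part]}])` would then carry one text prompt plus one such image part per figure.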

src/comproscanner/article_processors/pdfs_processor.py

Lines changed: 5 additions & 1 deletion
```diff
@@ -342,7 +342,11 @@ def process_pdfs(self):
 
         # Extract and save figures matching caption_keywords
         if self.caption_keywords:
-            pdf_to_md.extract_and_save_figures(self.doi, self.caption_keywords)
+            pdf_to_md.extract_and_save_figures(
+                self.doi,
+                self.caption_keywords,
+                base_path=f"results/extracted_data/{self.keyword}/related_figures",
+            )
 
         # Process sections
         all_sections = pdf_to_md.clean_text(md_text)
```

src/comproscanner/article_processors/wiley_processor.py

Lines changed: 5 additions & 1 deletion
```diff
@@ -563,7 +563,11 @@ def _process_articles(self):
 
         # Extract and save figures matching caption_keywords
         if self.caption_keywords:
-            pdf_to_md.extract_and_save_figures(row["doi"], self.caption_keywords)
+            pdf_to_md.extract_and_save_figures(
+                row["doi"],
+                self.caption_keywords,
+                base_path=f"results/extracted_data/{self.keyword}/related_figures",
+            )
 
         # Process the markdown text
         all_sections = pdf_to_md.clean_text(md_text)
```

src/comproscanner/comproscanner.py

Lines changed: 0 additions & 3 deletions
```diff
@@ -291,7 +291,6 @@ def extract_composition_property_data(
     rag_base_url: Optional[str] = None,
     vlm_model: str = "gemini/gemini-3-flash-preview",
     related_figures_base_path: Optional[str] = None,
-    caption_keywords: Optional[Dict] = None,
     **flow_optional_args,
 ):
     """Extract the composition-property data and synthesis data if the property is present in the article.
@@ -334,7 +333,6 @@
         vlm_model (str, optional): Vision LLM model for graph data extraction from saved figures. Defaults to "gemini/gemini-3-flash-preview".
         related_figures_base_path (str, optional): Base path where saved figures are stored. Defaults to
             "results/extracted_data/{main_property_keyword}/related_figures".
-        caption_keywords (dict, optional): Keywords used for caption matching (propagated to GraphExtractorTool). Defaults to None.
         **flow_optional_args (Any): Optional keyword arguments for the MaterialsFlow class.
 
     Raises:
@@ -468,7 +466,6 @@ def _has_composition_data(comp_data):
         is_extract_synthesis_data=is_extract_synthesis_data,
         vlm_model=vlm_model,
         related_figures_base_path=related_figures_base_path,
-        caption_keywords=caption_keywords,
         rag_config=rag_config,
         output_log_folder=output_log_folder,
         task_output_folder=task_output_folder,
```

src/comproscanner/extract_flow/crews/composition_crew/composition_extraction_crew/composition_extraction_crew.py

Lines changed: 4 additions & 4 deletions
```diff
@@ -50,7 +50,7 @@ def __init__(
         verbose: Optional[bool] = True,
         vlm_model: str = "gemini/gemini-3-flash-preview",
         related_figures_base_path: str = "results/related_figures",
-        caption_keywords: Optional[Dict] = None,
+        main_extraction_keyword: str = "",
     ):
         """
         Initialize the MaterialsDataIdentifierCrew.
@@ -64,7 +64,7 @@ def __init__(
         - verbose: Optional boolean for verbosity. Default is True.
         - vlm_model (str, optional): Vision LLM model for graph extraction. Defaults to "gemini/gemini-3-flash-preview".
         - related_figures_base_path (str, optional): Base path for saved figures. Defaults to "results/related_figures".
-        - caption_keywords (dict, optional): Keywords used for caption matching (propagated to GraphExtractorTool). Defaults to None.
+        - main_extraction_keyword (str, optional): Property keyword used to label axes in the VLM extraction prompt. Defaults to "".
         """
         if doi is None:
             raise ValueError("DOI must be provided")
@@ -77,7 +77,7 @@ def __init__(
         self.verbose = verbose
         self.vlm_model = vlm_model
         self.related_figures_base_path = related_figures_base_path
-        self.caption_keywords = caption_keywords or {}
+        self.main_extraction_keyword = main_extraction_keyword
 
         # Initialize output file paths as None
         self.output_log_file = None
@@ -112,7 +112,7 @@ def composition_property_extractor(self) -> Agent:
         graph_tool = GraphExtractorTool(
             vlm_model=self.vlm_model,
             related_figures_base_path=self.related_figures_base_path,
-            caption_keywords=self.caption_keywords,
+            vlm_property_name=self.main_extraction_keyword,
         )
         return Agent(
             config=self.agents_config["composition_property_extractor"],
```

src/comproscanner/extract_flow/main_extraction_flow.py

Lines changed: 2 additions & 5 deletions
```diff
@@ -58,7 +58,6 @@ class MaterialsState(BaseModel):
     is_extract_synthesis_data: bool = True
     vlm_model: str = "gemini/gemini-3-flash-preview"
     related_figures_base_path: str = "results/related_figures"
-    caption_keywords: Dict = {}
    llm: Optional[LLM] = None
     rag_config: Optional[RAGConfig] = None
     output_log_folder: Optional[str] = None
@@ -124,7 +123,6 @@ def __init__(
         is_extract_synthesis_data: bool = True,
         vlm_model: str = "gemini/gemini-3-flash-preview",
         related_figures_base_path: str = "results/related_figures",
-        caption_keywords: Optional[Dict] = None,
         rag_config: Optional[RAGConfig] = None,
         output_log_folder: Optional[str] = None,
         task_output_folder: Optional[str] = None,
@@ -156,7 +154,6 @@ def __init__(
         self.state.is_extract_synthesis_data = is_extract_synthesis_data
         self.state.vlm_model = vlm_model
         self.state.related_figures_base_path = related_figures_base_path
-        self.state.caption_keywords = caption_keywords or {}
         self.state.rag_config = rag_config
         self.state.output_log_folder = output_log_folder
         self.state.task_output_folder = task_output_folder
@@ -591,7 +588,7 @@ def extract_composition_property_data(self):
                 verbose=self.state.verbose,
                 vlm_model=self.state.vlm_model,
                 related_figures_base_path=self.state.related_figures_base_path,
-                caption_keywords=self.state.caption_keywords,
+                main_extraction_keyword=self.state.main_extraction_keyword,
             ).crew()
         else:
             composition_property_crew = CompositionExtractionCrew(
@@ -602,7 +599,7 @@ def extract_composition_property_data(self):
                 verbose=self.state.verbose,
                 vlm_model=self.state.vlm_model,
                 related_figures_base_path=self.state.related_figures_base_path,
-                caption_keywords=self.state.caption_keywords,
+                main_extraction_keyword=self.state.main_extraction_keyword,
             ).crew()
 
         result = composition_property_crew.kickoff(
```

src/comproscanner/extract_flow/tools/graph_extractor_tool.py

Lines changed: 3 additions & 7 deletions
```diff
@@ -11,7 +11,7 @@
 import os
 import json
 import base64
-from typing import Type, Dict, Any
+from typing import Type, Dict
 
 # Third-party imports
 from crewai.tools import BaseTool
@@ -56,7 +56,7 @@ class GraphExtractorTool(BaseTool):
 
     vlm_model: str = "gemini/gemini-3-flash-preview"
     related_figures_base_path: str = "results/related_figures"
-    caption_keywords: Dict[str, Any] = Field(default_factory=dict)
+    vlm_property_name: str = "the target property"
 
     def _run(self, doi: str) -> str:
         """
@@ -99,11 +99,7 @@ def _run(self, doi: str) -> str:
                 "Captions available: " + json.dumps(captions)
             )
 
-        # Determine property name from caption_keywords for the prompt
-        property_name = "the target property"
-        exact_kws = self.caption_keywords.get("exact_keywords", [])
-        if exact_kws:
-            property_name = exact_kws[0]
+        property_name = self.vlm_property_name or "the target property"
 
         results: Dict[str, Any] = {}
 
```
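The removed derivation and its one-line replacement are easy to compare side by side; this standalone sketch (not the tool's code) checks that behavior is preserved when an exact keyword exists and that the `or` fallback covers the empty default:

```python
def old_property_name(caption_keywords: dict) -> str:
    # Pre-commit behavior: derive the label from the first exact keyword, if any
    name = "the target property"
    exact_kws = caption_keywords.get("exact_keywords", [])
    if exact_kws:
        name = exact_kws[0]
    return name

def new_property_name(vlm_property_name: str) -> str:
    # Post-commit behavior: explicit field with an `or` fallback for ""
    return vlm_property_name or "the target property"

print(old_property_name({"exact_keywords": ["d33"]}))  # d33
print(new_property_name("d33"))                        # d33
print(new_property_name(""))                           # the target property
```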
