Commit f1cb4e3 (parent 433f17a)

test, doc added for figure extraction functionality

13 files changed: +568 −23 lines

CHANGELOG.md

Lines changed: 14 additions & 0 deletions
```diff
@@ -1,3 +1,17 @@
+## [Unreleased]
+
+### Added
+
+- VLM-based graph data extraction added across all publishers and PDF processors:
+
+  - New `GraphExtractorTool` — a CrewAI agent tool that reads saved figures for a given DOI and uses a vision LLM to extract composition-property value pairs from graphs and charts. Default VLM: `gemini/gemini-3-flash-preview`.
+
+  - New `FigureExtractor` utility — shared helper for caption keyword-based figure filtering and saving, used by all article processors.
+
+  - New `caption_keywords` parameter in `process_articles()` and `extract_composition_property_data()`, and new `vlm_model` and `related_figures_base_path` parameters in `extract_composition_property_data()`.
+
+  - New unit tests added for all three agent tools in `tests/test_agent_tools/`.
+
 ## [0.1.5] - 08-02-2026
 
 ### Added
```
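The `FigureExtractor`'s caption keyword-based filtering can be pictured with a short sketch; the exact-token vs. substring semantics below are an assumption inferred from the `exact_keywords`/`substring_keywords` names, not the package's actual matcher:

```python
def caption_matches(caption: str, caption_keywords: dict) -> bool:
    """Return True if a figure caption matches any configured keyword.

    Hypothetical semantics: exact_keywords match whole whitespace-separated
    tokens, substring_keywords match anywhere in the lowercased caption.
    """
    lowered = caption.lower()
    tokens = lowered.split()
    exact = [k.lower() for k in caption_keywords.get("exact_keywords", [])]
    subs = [k.lower() for k in caption_keywords.get("substring_keywords", [])]
    return any(k in tokens for k in exact) or any(s in lowered for s in subs)

keywords = {"exact_keywords": ["d33"], "substring_keywords": [" d 33 "]}
print(caption_matches("Fig. 2: d33 versus composition x", keywords))  # True
print(caption_matches("Fig. 3: dielectric loss spectra", keywords))   # False
```

Only figures passing such a check would be saved for the later VLM extraction step.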

docs/usage/article-processing.md

Lines changed: 34 additions & 1 deletion
````diff
@@ -102,9 +102,20 @@ Overlap size between chunks for creating vector databases for RAG.
 
 Name of the embedding model to use for creating vector databases for RAG.
 
+#### :material-square-medium:`caption_keywords` _(dict)_
+
+Dictionary of keyword lists used to filter figures during article processing. Only figures whose captions match these keywords are saved for later VLM-based graph extraction. If not provided, defaults to `property_keywords`.
+
+```python
+caption_keywords = {
+    "exact_keywords": ["d33"],
+    "substring_keywords": [" d 33 "]
+}
+```
+
 !!! info "Default Values"
 
-    :material-square-small:**`source_list`** = ["elsevier", "wiley", "iop", "springer"]<br>:material-square-small:**`folder_path`** = None<br>:material-square-small:**`doi_list`** = None<br>:material-square-small:**`is_sql_db`** = False<br>:material-square-small:**`is_save_xml`** = False<br>:material-square-small:**`is_save_pdf`** = False<br>:material-square-small:**`rag_db_path`** = "db"<br>:material-square-small:**`chunk_size`** = 1000<br>:material-square-small:**`chunk_overlap`** = 25<br>:material-square-small:**`embedding_model`** = "huggingface:thellert/physbert_cased"
+    :material-square-small:**`source_list`** = ["elsevier", "wiley", "iop", "springer"]<br>:material-square-small:**`folder_path`** = None<br>:material-square-small:**`doi_list`** = None<br>:material-square-small:**`is_sql_db`** = False<br>:material-square-small:**`is_save_xml`** = False<br>:material-square-small:**`is_save_pdf`** = False<br>:material-square-small:**`rag_db_path`** = "db"<br>:material-square-small:**`chunk_size`** = 1000<br>:material-square-small:**`chunk_overlap`** = 25<br>:material-square-small:**`embedding_model`** = "huggingface:thellert/physbert_cased"<br>:material-square-small:**`caption_keywords`** = `property_keywords`
 
 ## Processing Workflow
 
@@ -117,6 +128,9 @@ graph TB
     D --> E
     E --> F{Is Keyword Present?}
     F --> |Yes| G[Save Article's<br>Full Text to CSV<br>and Vector DB]
+    F --> |Yes| I{Caption Keywords<br>Provided?}
+    I --> |Yes| J[Extract & Save<br>Matching Figures]
+    I --> |No| K[Skip Figure Extraction]
     F --> |No| H[Skip Article]
 ```
 
@@ -204,6 +218,25 @@ scanner.process_articles(
 )
 ```
 
+### Figure Extraction for VLM-Based Graph Analysis
+
+When `caption_keywords` are provided, figures whose captions match those keywords are automatically extracted and saved during article processing. These saved figures are later used by the `GraphExtractorTool` during data extraction to read composition-property values directly from graphs and charts using a vision LLM.
+
+```python
+caption_keywords = {
+    "exact_keywords": ["d33"],
+    "substring_keywords": [" d 33 "]
+}
+
+scanner.process_articles(
+    property_keywords=property_keywords,
+    caption_keywords=caption_keywords,
+    source_list=["elsevier", "springer", "wiley", "iop", "pdfs"]
+)
+```
+
+Saved figures are stored under `results/extracted_data/{main_property_keyword}/related_figures/{doi}/` alongside an `info.json` file that maps each figure to its caption text.
+
 ### RAG Vector Database
 
 ```python
````
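The per-DOI layout with its `info.json` caption map lends itself to a small reader helper. A hedged sketch — the flat `{filename: caption}` schema and the `/` → `_` DOI sanitization are assumptions for illustration, not taken from the package:

```python
import json
import tempfile
from pathlib import Path

def load_figure_captions(base_path: str, doi: str) -> dict:
    """Read the info.json caption map saved for one DOI.

    Assumptions: the DOI's '/' is sanitized to '_' in the directory name,
    and info.json is a flat {figure_filename: caption} mapping.
    """
    info_file = Path(base_path) / doi.replace("/", "_") / "info.json"
    if not info_file.exists():
        return {}
    return json.loads(info_file.read_text())

# Demonstration against a throwaway layout mirroring the documented path scheme
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / "results/extracted_data/d33/related_figures"
    doi_dir = base / "10.1000_example.doi"
    doi_dir.mkdir(parents=True)
    (doi_dir / "info.json").write_text(
        json.dumps({"figure_3.png": "d33 as a function of x"})
    )
    captions = load_figure_captions(str(base), "10.1000/example.doi")
    print(captions)  # {'figure_3.png': 'd33 as a function of x'}
```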

docs/usage/data-extraction.md

Lines changed: 13 additions & 1 deletion
```diff
@@ -152,13 +152,21 @@ Number of top relevant documents to retrieve from the vector database for RAG.
 
 Base URL for the RAG model service, used for custom or local model deployments.
 
+#### :material-square-medium:`vlm_model` _(str)_
+
+Name of the vision LLM model used by `GraphExtractorTool` to read composition-property values from saved figures. Supports any provider prefix supported by [LiteLLM](https://docs.litellm.ai/docs/providers) (e.g., `gemini/...`, `openai/...`, `anthropic/...`).
+
+#### :material-square-medium:`related_figures_base_path` _(str)_
+
+Path to the directory where figures were saved during article processing. Defaults to `results/extracted_data/{main_property_keyword}/related_figures`.
+
 #### :material-square-medium:`**flow_optional_args` _(dict)_
 
 Optional arguments for the MaterialsFlow class to customize extraction behavior by giving additional notes, examples, and allowed methods/techniques.
 
 !!! info "Default Values"
 
-    :material-square-small:**`start_row`** = 0<br>:material-square-small:**`num_rows`** = All rows<br>:material-square-small:**`is_test_data_preparation`** = False<br>:material-square-small:**`test_doi_list_file`** = None<br>:material-square-small:**`total_test_data`** = 50<br>:material-square-small:**`is_only_consider_test_doi_list`** = False<br>:material-square-small:**`test_random_seed`** = 42<br>:material-square-small:**`checked_doi_list_file`** = "checked_dois.txt"<br>:material-square-small:**`json_results_file`** = "results.json"<br>:material-square-small:**`csv_results_file`** = "results.csv"<br>:material-square-small:**`is_extract_synthesis_data`** = True<br>:material-square-small:**`is_save_csv`** = False<br>:material-square-small:**`is_save_relevant`** = True<br>:material-square-small:**`materials_data_identifier_query`** = "Is there any material chemical composition and corresponding {main_property_keyword} value mentioned in the paper? Give one word answer. Either yes or no."<br>:material-square-small:**`model`** = "gpt-4o-mini"<br>:material-square-small:**`api_base`** = None<br>:material-square-small:**`base_url`** = None<br>:material-square-small:**`api_key`** = None<br>:material-square-small:**`output_log_folder`** = None<br>:material-square-small:**`is_log_json`** = False<br>:material-square-small:**`task_output_folder`** = None<br>:material-square-small:**`verbose`** = True<br>:material-square-small:**`temperature`** = 0.1<br>:material-square-small:**`top_p`** = 0.9<br>:material-square-small:**`timeout`** = 60<br>:material-square-small:**`frequency_penalty`** = None<br>:material-square-small:**`max_tokens`** = 2048<br>:material-square-small:**`rag_db_path`** = "db"<br>:material-square-small:**`embedding_model`** = "huggingface:thellert/physbert_cased"<br>:material-square-small:**`rag_chat_model`** = "gpt-4o-mini"<br>:material-square-small:**`rag_max_tokens`** = 512<br>:material-square-small:**`rag_top_k`** = 3<br>:material-square-small:**`rag_base_url`** = None<br>:material-square-small:**`flow_optional_args`** = {}
+    :material-square-small:**`start_row`** = 0<br>:material-square-small:**`num_rows`** = All rows<br>:material-square-small:**`is_test_data_preparation`** = False<br>:material-square-small:**`test_doi_list_file`** = None<br>:material-square-small:**`total_test_data`** = 50<br>:material-square-small:**`is_only_consider_test_doi_list`** = False<br>:material-square-small:**`test_random_seed`** = 42<br>:material-square-small:**`checked_doi_list_file`** = "checked_dois.txt"<br>:material-square-small:**`json_results_file`** = "results.json"<br>:material-square-small:**`csv_results_file`** = "results.csv"<br>:material-square-small:**`is_extract_synthesis_data`** = True<br>:material-square-small:**`is_save_csv`** = False<br>:material-square-small:**`is_save_relevant`** = True<br>:material-square-small:**`materials_data_identifier_query`** = "Is there any material chemical composition and corresponding {main_property_keyword} value mentioned in the paper? Give one word answer. Either yes or no."<br>:material-square-small:**`model`** = "gpt-4o-mini"<br>:material-square-small:**`api_base`** = None<br>:material-square-small:**`base_url`** = None<br>:material-square-small:**`api_key`** = None<br>:material-square-small:**`output_log_folder`** = None<br>:material-square-small:**`is_log_json`** = False<br>:material-square-small:**`task_output_folder`** = None<br>:material-square-small:**`verbose`** = True<br>:material-square-small:**`temperature`** = 0.1<br>:material-square-small:**`top_p`** = 0.9<br>:material-square-small:**`timeout`** = 60<br>:material-square-small:**`frequency_penalty`** = None<br>:material-square-small:**`max_tokens`** = 2048<br>:material-square-small:**`rag_db_path`** = "db"<br>:material-square-small:**`embedding_model`** = "huggingface:thellert/physbert_cased"<br>:material-square-small:**`rag_chat_model`** = "gpt-4o-mini"<br>:material-square-small:**`rag_max_tokens`** = 512<br>:material-square-small:**`rag_top_k`** = 3<br>:material-square-small:**`rag_base_url`** = None<br>:material-square-small:**`vlm_model`** = "gemini/gemini-3-flash-preview"<br>:material-square-small:**`related_figures_base_path`** = "results/extracted_data/{main_property_keyword}/related_figures"<br>:material-square-small:**`flow_optional_args`** = {}
 
 ## Extraction Agents
 
@@ -207,6 +215,10 @@ Is there any material chemical composition and corresponding {main_property_keyword} value mentioned in the paper? Give one word answer. Either yes or no.
 
 MaterialParser Tool is used by the `Composition-Property Data Formatter` agent. Material-parser is a deep learning model, developed by [Foppiano et al.](https://doi.org/10.1080/27660400.2022.2153633), specifically designed for parsing chemical compositions with multiple fractions denoted as variables e.g., $Na_{(1-x)}Li_xTiO_3$ where x = 0.1, 0.3, and 0.4. This tool incorporates the material-parser model to accurately extract and standardize complex chemical compositions with variable fractions into the final compositions. E.g., the previous example would be parsed into three distinct compositions: **Na(0.9)Li(0.1)TiO3**, **Na(0.7)Li(0.3)TiO3**, and **Na(0.6)Li(0.4)TiO3**.
 
+!!! example "Graph Extractor Tool"
+
+    Graph Extractor Tool is used by the `Composition-Property Data Extractor` agent when figures have been saved during article processing. It scans the saved figure directory for the given DOI, sends each image to a configurable vision LLM (default: `gemini/gemini-3-flash-preview`), and extracts composition-property value pairs directly from graphs and charts. The extracted data is returned as structured JSON and used alongside the text-based extraction to improve coverage of graphical data. To enable graph extraction, save the figures during article processing (using `caption_keywords` in `process_articles`) and specify the VLM model at extraction time.
+
 ### 3. Synthesis Data Extractor (4️⃣) & Synthesis Data Formatter (5️⃣)
 
 **Purpose**: `Synthesis Data Extractor` extracts synthesis related data including method, precursors, steps, and characterization techniques from the article text and finally `Synthesis Data Formatter` formats the extracted data into structured JSON similar to the following example.
```
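Under the hood, sending a saved figure to a LiteLLM-routed vision model uses the standard base64 image content part. A minimal sketch of that packaging step — this mirrors the generic LiteLLM/OpenAI vision message shape, not the tool's actual code, and `figure_to_content_part` is a hypothetical helper:

```python
import base64
import tempfile
from pathlib import Path

def figure_to_content_part(path: str, mime: str = "image/png") -> dict:
    """Package a saved figure as an OpenAI/LiteLLM-style image content part.

    LiteLLM accepts this shape for vision-capable providers such as
    gemini/... and openai/... models.
    """
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime};base64,{encoded}"},
    }

# Demonstration with a throwaway file standing in for a saved figure
with tempfile.TemporaryDirectory() as tmp:
    fig = Path(tmp) / "figure_1.png"
    fig.write_bytes(b"fake")
    part = figure_to_content_part(str(fig))
```

A call like `litellm.completion(model=vlm_model, messages=[{"role": "user", "content": [text_part, image_part]}])` would then carry one text prompt plus one such image part per figure.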

src/comproscanner/article_processors/pdfs_processor.py

Lines changed: 5 additions & 1 deletion
```diff
@@ -342,7 +342,11 @@ def process_pdfs(self):
 
         # Extract and save figures matching caption_keywords
         if self.caption_keywords:
-            pdf_to_md.extract_and_save_figures(self.doi, self.caption_keywords)
+            pdf_to_md.extract_and_save_figures(
+                self.doi,
+                self.caption_keywords,
+                base_path=f"results/extracted_data/{self.keyword}/related_figures",
+            )
 
         # Process sections
         all_sections = pdf_to_md.clean_text(md_text)
```

src/comproscanner/article_processors/wiley_processor.py

Lines changed: 5 additions & 1 deletion
```diff
@@ -563,7 +563,11 @@ def _process_articles(self):
 
         # Extract and save figures matching caption_keywords
         if self.caption_keywords:
-            pdf_to_md.extract_and_save_figures(row["doi"], self.caption_keywords)
+            pdf_to_md.extract_and_save_figures(
+                row["doi"],
+                self.caption_keywords,
+                base_path=f"results/extracted_data/{self.keyword}/related_figures",
+            )
 
         # Process the markdown text
         all_sections = pdf_to_md.clean_text(md_text)
```

src/comproscanner/comproscanner.py

Lines changed: 0 additions & 3 deletions
```diff
@@ -291,7 +291,6 @@ def extract_composition_property_data(
     rag_base_url: Optional[str] = None,
     vlm_model: str = "gemini/gemini-3-flash-preview",
     related_figures_base_path: Optional[str] = None,
-    caption_keywords: Optional[Dict] = None,
     **flow_optional_args,
 ):
     """Extract the composition-property data and synthesis data if the property is present in the article.
@@ -334,7 +333,6 @@
         vlm_model (str, optional): Vision LLM model for graph data extraction from saved figures. Defaults to "gemini/gemini-3-flash-preview".
         related_figures_base_path (str, optional): Base path where saved figures are stored. Defaults to
             "results/extracted_data/{main_property_keyword}/related_figures".
-        caption_keywords (dict, optional): Keywords used for caption matching (propagated to GraphExtractorTool). Defaults to None.
         **flow_optional_args (Any): Optional keyword arguments for the MaterialsFlow class.
 
     Raises:
@@ -468,7 +466,6 @@ def _has_composition_data(comp_data):
         is_extract_synthesis_data=is_extract_synthesis_data,
         vlm_model=vlm_model,
         related_figures_base_path=related_figures_base_path,
-        caption_keywords=caption_keywords,
         rag_config=rag_config,
         output_log_folder=output_log_folder,
         task_output_folder=task_output_folder,
```

src/comproscanner/extract_flow/crews/composition_crew/composition_extraction_crew/composition_extraction_crew.py

Lines changed: 4 additions & 4 deletions
```diff
@@ -50,7 +50,7 @@ def __init__(
         verbose: Optional[bool] = True,
         vlm_model: str = "gemini/gemini-3-flash-preview",
         related_figures_base_path: str = "results/related_figures",
-        caption_keywords: Optional[Dict] = None,
+        main_extraction_keyword: str = "",
     ):
         """
         Initialize the MaterialsDataIdentifierCrew.
@@ -64,7 +64,7 @@ def __init__(
         - verbose: Optional boolean for verbosity. Default is True.
         - vlm_model (str, optional): Vision LLM model for graph extraction. Defaults to "gemini/gemini-3-flash-preview".
         - related_figures_base_path (str, optional): Base path for saved figures. Defaults to "results/related_figures".
-        - caption_keywords (dict, optional): Keywords used for caption matching (propagated to GraphExtractorTool). Defaults to None.
+        - main_extraction_keyword (str, optional): Property keyword used to label axes in the VLM extraction prompt. Defaults to "".
         """
         if doi is None:
             raise ValueError("DOI must be provided")
@@ -77,7 +77,7 @@ def __init__(
         self.verbose = verbose
         self.vlm_model = vlm_model
         self.related_figures_base_path = related_figures_base_path
-        self.caption_keywords = caption_keywords or {}
+        self.main_extraction_keyword = main_extraction_keyword
 
         # Initialize output file paths as None
         self.output_log_file = None
@@ -112,7 +112,7 @@ def composition_property_extractor(self) -> Agent:
         graph_tool = GraphExtractorTool(
             vlm_model=self.vlm_model,
             related_figures_base_path=self.related_figures_base_path,
-            caption_keywords=self.caption_keywords,
+            vlm_property_name=self.main_extraction_keyword,
         )
         return Agent(
             config=self.agents_config["composition_property_extractor"],
```

src/comproscanner/extract_flow/main_extraction_flow.py

Lines changed: 2 additions & 5 deletions
```diff
@@ -58,7 +58,6 @@ class MaterialsState(BaseModel):
     is_extract_synthesis_data: bool = True
     vlm_model: str = "gemini/gemini-3-flash-preview"
     related_figures_base_path: str = "results/related_figures"
-    caption_keywords: Dict = {}
    llm: Optional[LLM] = None
     rag_config: Optional[RAGConfig] = None
     output_log_folder: Optional[str] = None
@@ -124,7 +123,6 @@ def __init__(
         is_extract_synthesis_data: bool = True,
         vlm_model: str = "gemini/gemini-3-flash-preview",
         related_figures_base_path: str = "results/related_figures",
-        caption_keywords: Optional[Dict] = None,
         rag_config: Optional[RAGConfig] = None,
         output_log_folder: Optional[str] = None,
         task_output_folder: Optional[str] = None,
@@ -156,7 +154,6 @@ def __init__(
         self.state.is_extract_synthesis_data = is_extract_synthesis_data
         self.state.vlm_model = vlm_model
         self.state.related_figures_base_path = related_figures_base_path
-        self.state.caption_keywords = caption_keywords or {}
         self.state.rag_config = rag_config
         self.state.output_log_folder = output_log_folder
         self.state.task_output_folder = task_output_folder
@@ -591,7 +588,7 @@ def extract_composition_property_data(self):
                 verbose=self.state.verbose,
                 vlm_model=self.state.vlm_model,
                 related_figures_base_path=self.state.related_figures_base_path,
-                caption_keywords=self.state.caption_keywords,
+                main_extraction_keyword=self.state.main_extraction_keyword,
             ).crew()
         else:
             composition_property_crew = CompositionExtractionCrew(
@@ -602,7 +599,7 @@ def extract_composition_property_data(self):
                 verbose=self.state.verbose,
                 vlm_model=self.state.vlm_model,
                 related_figures_base_path=self.state.related_figures_base_path,
-                caption_keywords=self.state.caption_keywords,
+                main_extraction_keyword=self.state.main_extraction_keyword,
             ).crew()
 
         result = composition_property_crew.kickoff(
```

src/comproscanner/extract_flow/tools/graph_extractor_tool.py

Lines changed: 3 additions & 7 deletions
```diff
@@ -11,7 +11,7 @@
 import os
 import json
 import base64
-from typing import Type, Dict, Any
+from typing import Type, Dict
 
 # Third-party imports
 from crewai.tools import BaseTool
@@ -56,7 +56,7 @@ class GraphExtractorTool(BaseTool):
 
     vlm_model: str = "gemini/gemini-3-flash-preview"
     related_figures_base_path: str = "results/related_figures"
-    caption_keywords: Dict[str, Any] = Field(default_factory=dict)
+    vlm_property_name: str = "the target property"
 
     def _run(self, doi: str) -> str:
         """
@@ -99,11 +99,7 @@ def _run(self, doi: str) -> str:
                 "Captions available: " + json.dumps(captions)
             )
 
-        # Determine property name from caption_keywords for the prompt
-        property_name = "the target property"
-        exact_kws = self.caption_keywords.get("exact_keywords", [])
-        if exact_kws:
-            property_name = exact_kws[0]
+        property_name = self.vlm_property_name or "the target property"
 
         results: Dict[str, Any] = {}
 
```
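The removed derivation and its one-line replacement are easy to compare side by side; this standalone sketch (not the tool's code) checks that behavior is preserved when an exact keyword exists and that the `or` fallback covers the empty default:

```python
def old_property_name(caption_keywords: dict) -> str:
    # Pre-commit behavior: derive the label from the first exact keyword, if any
    name = "the target property"
    exact_kws = caption_keywords.get("exact_keywords", [])
    if exact_kws:
        name = exact_kws[0]
    return name

def new_property_name(vlm_property_name: str) -> str:
    # Post-commit behavior: explicit field with an `or` fallback for ""
    return vlm_property_name or "the target property"

print(old_property_name({"exact_keywords": ["d33"]}))  # d33
print(new_property_name("d33"))                        # d33
print(new_property_name(""))                           # the target property
```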
