
Commit aa83ec2

Thespica and imbajin authored
fix(llm): align regex extraction of json to json format of prompt (#211)
See #210

Main change of regex: matching `(\[.*])` -> matching `({.*})`.

Tested models:
- qwen-max
- qwen-plus
- deepseek-v3

Co-authored-by: imbajin <jin@apache.org>
1 parent 36d5444 commit aa83ec2

4 files changed: 49 additions & 41 deletions

File tree

.asf.yaml

Lines changed: 1 addition & 0 deletions
```diff
@@ -58,6 +58,7 @@ github:
     - HJ-Young
     - afterimagex
     - returnToInnocence
+    - Thespica

 # refer https://cwiki.apache.org/confluence/display/INFRA/Git+-+.asf.yaml+features#Git.asf.yamlfeatures-Notificationsettingsforrepositories
 notifications:
```

.github/workflows/hugegraph-python-client.yml

Lines changed: 1 addition & 1 deletion
```diff
@@ -20,7 +20,7 @@ jobs:
       - name: Prepare HugeGraph Server Environment
         run: |
           docker run -d --name=graph -p 8080:8080 -e PASSWORD=admin hugegraph/hugegraph:1.3.0
-          sleep 5
+          sleep 10

       - uses: actions/checkout@v4
```

hugegraph-llm/README.md

Lines changed: 13 additions & 13 deletions
````diff
@@ -8,12 +8,12 @@ This project includes runnable demos, it can also be used as a third-party libra
 As we know, graph systems can help large models address challenges like timeliness and hallucination,
 while large models can help graph systems with cost-related issues.

-With this project, we aim to reduce the cost of using graph systems, and decrease the complexity of
+With this project, we aim to reduce the cost of using graph systems and decrease the complexity of
 building knowledge graphs. This project will offer more applications and integration solutions for
 graph systems and large language models.
 1. Construct knowledge graph by LLM + HugeGraph
 2. Use natural language to operate graph databases (Gremlin/Cypher)
-3. Knowledge graph supplements answer context (GraphRAG -> Graph Agent)
+3. Knowledge graph supplements answer context (GraphRAG → Graph Agent)

 ## 2. Environment Requirements
 > [!IMPORTANT]
@@ -24,19 +24,19 @@ graph systems and large language models.
 ## 3. Preparation

 1. Start the HugeGraph database, you can run it via [Docker](https://hub.docker.com/r/hugegraph/hugegraph)/[Binary Package](https://hugegraph.apache.org/docs/download/download/).
-   Refer to detailed [doc](https://hugegraph.apache.org/docs/quickstart/hugegraph-server/#31-use-docker-container-convenient-for-testdev) for more guidance
+   Refer to a detailed [doc](https://hugegraph.apache.org/docs/quickstart/hugegraph-server/#31-use-docker-container-convenient-for-testdev) for more guidance

 2. Configuring the poetry environment, Use the official installer to install Poetry, See the [poetry documentation](https://poetry.pythonlang.cn/docs/#installing-with-pipx) for other installation methods
    ```bash
    # You could try pipx or pip to install poetry when meet network issues, refer the poetry doc for more details
    curl -sSL https://install.python-poetry.org | python3 - # install the latest version like 2.0+
    ```

-2. Clone this project
+3. Clone this project
    ```bash
    git clone https://github.com/apache/incubator-hugegraph-ai.git
    ```
-3. Install [hugegraph-python-client](../hugegraph-python-client) and [hugegraph_llm](src/hugegraph_llm), poetry officially recommends using virtual environments
+4. Install [hugegraph-python-client](../hugegraph-python-client) and [hugegraph_llm](src/hugegraph_llm), poetry officially recommends using virtual environments
    ```bash
    cd ./incubator-hugegraph-ai/hugegraph-llm
    poetry config --list # List/check the current configuration (Optional)
@@ -48,11 +48,11 @@ graph systems and large language models.
    poetry shell # use 'exit' to leave the shell
    ```
    If `poetry install` fails or too slow due to network issues, it is recommended to modify `tool.poetry.source` of `hugegraph-llm/pyproject.toml`
-4. Enter the project directory(`./incubator-hugegraph-ai/hugegraph-llm/src`)
+5. Enter the project directory(`./incubator-hugegraph-ai/hugegraph-llm/src`)
    ```bash
    cd ./src
    ```
-5. Start the gradio interactive demo of **Graph RAG**, you can run with the following command, and open http://127.0.0.1:8001 after starting
+6. Start the gradio interactive demo of **Graph RAG**, you can run with the following command and open http://127.0.0.1:8001 after starting
    ```bash
    python -m hugegraph_llm.demo.rag_demo.app # same as "poetry run xxx"
    ```
@@ -61,23 +61,23 @@ graph systems and large language models.
    python -m hugegraph_llm.demo.rag_demo.app --host 127.0.0.1 --port 18001
    ```

-6. After running the web demo, the config file `.env` will be automatically generated at the path `hugegraph-llm/.env`. Additionally, a prompt-related configuration file `config_prompt.yaml` will also be generated at the path `hugegraph-llm/src/hugegraph_llm/resources/demo/config_prompt.yaml`.
+7. After running the web demo, the config file `.env` will be automatically generated at the path `hugegraph-llm/.env`. Additionally, a prompt-related configuration file `config_prompt.yaml` will also be generated at the path `hugegraph-llm/src/hugegraph_llm/resources/demo/config_prompt.yaml`.
    You can modify the content on the web page, and it will be automatically saved to the configuration file after the corresponding feature is triggered. You can also modify the file directly without restarting the web application; refresh the page to load your latest changes.
    (Optional)To regenerate the config file, you can use `config.generate` with `-u` or `--update`.
    ```bash
    python -m hugegraph_llm.config.generate --update
    ```
    Note: `Litellm` support multi-LLM provider, refer [litellm.ai](https://docs.litellm.ai/docs/providers) to config it
-7. (__Optional__) You could use
+8. (__Optional__) You could use
    [hugegraph-hubble](https://hugegraph.apache.org/docs/quickstart/hugegraph-hubble/#21-use-docker-convenient-for-testdev)
    to visit the graph data, could run it via [Docker/Docker-Compose](https://hub.docker.com/r/hugegraph/hubble)
-   for guidance. (Hubble is a graph-analysis dashboard include data loading/schema management/graph traverser/display).
-8. (__Optional__) offline download NLTK stopwords
+   for guidance. (Hubble is a graph-analysis dashboard that includes data loading/schema management/graph traverser/display).
+9. (__Optional__) offline download NLTK stopwords
    ```bash
    python ./hugegraph_llm/operators/common_op/nltk_helper.py
    ```
    > [!TIP]
-   > You can also refer our [quick-start](./quick_start.md) doc to understand how to use it & the basic query logic 🚧
+   > You can also refer to our [quick-start](./quick_start.md) doc to understand how to use it & the basic query logic 🚧

 ## 4 Examples

@@ -124,7 +124,7 @@ This can be obtained from the `LLMs` class.
    )
    ```
    ![gradio-config](https://hugegraph.apache.org/docs/images/kg-uml.png)
-2. **Import Schema**: The `import_schema` method is used to import a schema from a source. The source can be a HugeGraph instance, a user-defined schema or an extraction result. The method `print_result` can be chained to print the result.
+2. **Import Schema**: The `import_schema` method is used to import a schema from a source. The source can be a HugeGraph instance, a user-defined schema, or an extraction result. The method `print_result` can be chained to print the result.
    ```python
    # Import schema from a HugeGraph instance
    builder.import_schema(from_hugegraph="xxx").print_result()
````

hugegraph-llm/src/hugegraph_llm/operators/llm_op/property_graph_extract.py

Lines changed: 34 additions & 27 deletions
```diff
@@ -26,7 +26,6 @@
 from hugegraph_llm.models.llms.base import BaseLLM
 from hugegraph_llm.utils.log import log

-
 """
 TODO: It is not clear whether there is any other dependence on the SCHEMA_EXAMPLE_PROMPT variable.
 Because the SCHEMA_EXAMPLE_PROMPT variable will no longer change based on
@@ -88,9 +87,9 @@ def filter_item(schema, items) -> List[Dict[str, Any]]:

 class PropertyGraphExtract:
     def __init__(
-        self,
-        llm: BaseLLM,
-        example_prompt: str = prompt.extract_graph_prompt
+            self,
+            llm: BaseLLM,
+            example_prompt: str = prompt.extract_graph_prompt
     ) -> None:
         self.llm = llm
         self.example_prompt = example_prompt
@@ -125,33 +124,41 @@ def extract_property_graph_by_llm(self, schema, chunk):
         return self.llm.generate(prompt=prompt)

     def _extract_and_filter_label(self, schema, text) -> List[Dict[str, Any]]:
-        # analyze llm generated text to JSON
-        json_strings = re.findall(r'(\[.*?])', text, re.DOTALL)
-        longest_json = max(json_strings, key=lambda x: len(''.join(x)), default=('', ''))
-
-        longest_json_str = ''.join(longest_json).strip()
+        # Use regex to extract a JSON object with curly braces
+        json_match = re.search(r'({.*})', text, re.DOTALL)
+        if not json_match:
+            log.critical("Invalid property graph! No JSON object found, "
+                         "please check the output format example in prompt.")
+            return []
+        json_str = json_match.group(1).strip()

         items = []
         try:
-            property_graph = json.loads(longest_json_str)
+            property_graph = json.loads(json_str)
+            # Expect property_graph to be a dict with keys "vertices" and "edges"
+            if not (isinstance(property_graph, dict) and "vertices" in property_graph and "edges" in property_graph):
+                log.critical("Invalid property graph format; expecting 'vertices' and 'edges'.")
+                return items
+
+            # Create sets for valid vertex and edge labels based on the schema
             vertex_label_set = {vertex["name"] for vertex in schema["vertexlabels"]}
             edge_label_set = {edge["name"] for edge in schema["edgelabels"]}
-            for item in property_graph:
-                if not isinstance(item, dict):
-                    log.warning("Invalid property graph item type '%s'.", type(item))
-                    continue
-                if not self.NECESSARY_ITEM_KEYS.issubset(item.keys()):
-                    log.warning("Invalid item keys '%s'.", item.keys())
-                    continue
-                if item["type"] == "vertex" or item["type"] == "edge":
-                    if (item["label"] not in vertex_label_set
-                            and item["label"] not in edge_label_set):
-                        log.warning("Invalid '%s' label '%s' has been ignored.", item["type"], item["label"])
-                    else:
-                        items.append(item)
-                else:
-                    log.warning("Invalid item type '%s' has been ignored.", item["type"])
-        except json.JSONDecodeError:
-            log.critical("Invalid property graph! Please check the extracted JSON data carefully")

+            def process_items(item_list, valid_labels, item_type):
+                for item in item_list:
+                    if not isinstance(item, dict):
+                        log.warning("Invalid property graph item type '%s'.", type(item))
+                        continue
+                    if not self.NECESSARY_ITEM_KEYS.issubset(item.keys()):
+                        log.warning("Invalid item keys '%s'.", item.keys())
+                        continue
+                    if item["label"] not in valid_labels:
+                        log.warning("Invalid %s label '%s' has been ignored.", item_type, item["label"])
+                        continue
+                    items.append(item)
+
+            process_items(property_graph["vertices"], vertex_label_set, "vertex")
+            process_items(property_graph["edges"], edge_label_set, "edge")
+        except json.JSONDecodeError:
+            log.critical("Invalid property graph JSON! Please check the extracted JSON data carefully")
         return items
```
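The heart of this fix is the regex swap: the prompt now asks the LLM for a single JSON object with `vertices` and `edges` keys, so extraction matches `({.*})` instead of hunting for JSON arrays. A minimal standalone sketch of the difference (the sample LLM reply below is hypothetical, not from the repository):

```python
import json
import re

# Hypothetical LLM reply: the JSON object the prompt asks for,
# wrapped in conversational text.
llm_output = (
    'Sure! Here is the extracted property graph:\n'
    '{"vertices": [{"type": "vertex", "label": "person",'
    ' "properties": {"name": "Alice"}}],\n'
    ' "edges": []}\n'
    'Let me know if you need anything else.'
)

# Old pattern: collects JSON arrays, so it can only capture the inner
# "vertices" list (and the empty "edges" list), losing the enclosing object.
old_matches = re.findall(r'(\[.*?])', llm_output, re.DOTALL)

# New pattern: greedily matches from the first '{' to the last '}',
# recovering the whole object in the prompt's format.
match = re.search(r'({.*})', llm_output, re.DOTALL)
graph = json.loads(match.group(1).strip()) if match else {}

print(sorted(graph))     # ['edges', 'vertices']
print(len(old_matches))  # 2
```

Note that the greedy `({.*})` assumes one JSON object per response: if the model emits several objects, everything between the first `{` and the last `}` is captured and `json.loads` raises `JSONDecodeError`, which the patched code reports via `log.critical`.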
