
Commit aa83ec2

Thespica and imbajin authored
fix(llm): align regex extraction of json to json format of prompt (#211)
See #210

Main change of regex: matching `(\[.*])` -> matching `({.*})`.

Tested models:
- qwen-max
- qwen-plus
- deepseek-v3

Co-authored-by: imbajin <jin@apache.org>
1 parent 36d5444 commit aa83ec2

4 files changed: 49 additions & 41 deletions

File tree

.asf.yaml

Lines changed: 1 addition & 0 deletions
```diff
@@ -58,6 +58,7 @@ github:
     - HJ-Young
     - afterimagex
     - returnToInnocence
+    - Thespica

 # refer https://cwiki.apache.org/confluence/display/INFRA/Git+-+.asf.yaml+features#Git.asf.yamlfeatures-Notificationsettingsforrepositories
 notifications:
```

.github/workflows/hugegraph-python-client.yml

Lines changed: 1 addition & 1 deletion
```diff
@@ -20,7 +20,7 @@ jobs:
       - name: Prepare HugeGraph Server Environment
         run: |
           docker run -d --name=graph -p 8080:8080 -e PASSWORD=admin hugegraph/hugegraph:1.3.0
-          sleep 5
+          sleep 10

       - uses: actions/checkout@v4
```

hugegraph-llm/README.md

Lines changed: 13 additions & 13 deletions
````diff
@@ -8,12 +8,12 @@ This project includes runnable demos, it can also be used as a third-party libra
 As we know, graph systems can help large models address challenges like timeliness and hallucination,
 while large models can help graph systems with cost-related issues.

-With this project, we aim to reduce the cost of using graph systems, and decrease the complexity of
+With this project, we aim to reduce the cost of using graph systems and decrease the complexity of
 building knowledge graphs. This project will offer more applications and integration solutions for
 graph systems and large language models.
 1. Construct knowledge graph by LLM + HugeGraph
 2. Use natural language to operate graph databases (Gremlin/Cypher)
-3. Knowledge graph supplements answer context (GraphRAG -> Graph Agent)
+3. Knowledge graph supplements answer context (GraphRAG → Graph Agent)

 ## 2. Environment Requirements
 > [!IMPORTANT]
@@ -24,19 +24,19 @@ graph systems and large language models.
 ## 3. Preparation

 1. Start the HugeGraph database, you can run it via [Docker](https://hub.docker.com/r/hugegraph/hugegraph)/[Binary Package](https://hugegraph.apache.org/docs/download/download/).
-   Refer to detailed [doc](https://hugegraph.apache.org/docs/quickstart/hugegraph-server/#31-use-docker-container-convenient-for-testdev) for more guidance
+   Refer to a detailed [doc](https://hugegraph.apache.org/docs/quickstart/hugegraph-server/#31-use-docker-container-convenient-for-testdev) for more guidance

 2. Configuring the poetry environment, Use the official installer to install Poetry, See the [poetry documentation](https://poetry.pythonlang.cn/docs/#installing-with-pipx) for other installation methods
    ```bash
    # You could try pipx or pip to install poetry when meet network issues, refer the poetry doc for more details
    curl -sSL https://install.python-poetry.org | python3 - # install the latest version like 2.0+
    ```

-2. Clone this project
+3. Clone this project
    ```bash
    git clone https://github.com/apache/incubator-hugegraph-ai.git
    ```
-3. Install [hugegraph-python-client](../hugegraph-python-client) and [hugegraph_llm](src/hugegraph_llm), poetry officially recommends using virtual environments
+4. Install [hugegraph-python-client](../hugegraph-python-client) and [hugegraph_llm](src/hugegraph_llm), poetry officially recommends using virtual environments
    ```bash
    cd ./incubator-hugegraph-ai/hugegraph-llm
    poetry config --list # List/check the current configuration (Optional)
@@ -48,11 +48,11 @@ graph systems and large language models.
    poetry shell # use 'exit' to leave the shell
    ```
    If `poetry install` fails or too slow due to network issues, it is recommended to modify `tool.poetry.source` of `hugegraph-llm/pyproject.toml`
-4. Enter the project directory(`./incubator-hugegraph-ai/hugegraph-llm/src`)
+5. Enter the project directory(`./incubator-hugegraph-ai/hugegraph-llm/src`)
    ```bash
    cd ./src
    ```
-5. Start the gradio interactive demo of **Graph RAG**, you can run with the following command, and open http://127.0.0.1:8001 after starting
+6. Start the gradio interactive demo of **Graph RAG**, you can run with the following command and open http://127.0.0.1:8001 after starting
    ```bash
    python -m hugegraph_llm.demo.rag_demo.app # same as "poetry run xxx"
    ```
@@ -61,23 +61,23 @@ graph systems and large language models.
    python -m hugegraph_llm.demo.rag_demo.app --host 127.0.0.1 --port 18001
    ```

-6. After running the web demo, the config file `.env` will be automatically generated at the path `hugegraph-llm/.env`. Additionally, a prompt-related configuration file `config_prompt.yaml` will also be generated at the path `hugegraph-llm/src/hugegraph_llm/resources/demo/config_prompt.yaml`.
+7. After running the web demo, the config file `.env` will be automatically generated at the path `hugegraph-llm/.env`. Additionally, a prompt-related configuration file `config_prompt.yaml` will also be generated at the path `hugegraph-llm/src/hugegraph_llm/resources/demo/config_prompt.yaml`.
    You can modify the content on the web page, and it will be automatically saved to the configuration file after the corresponding feature is triggered. You can also modify the file directly without restarting the web application; refresh the page to load your latest changes.
    (Optional)To regenerate the config file, you can use `config.generate` with `-u` or `--update`.
    ```bash
    python -m hugegraph_llm.config.generate --update
    ```
    Note: `Litellm` support multi-LLM provider, refer [litellm.ai](https://docs.litellm.ai/docs/providers) to config it
-7. (__Optional__) You could use
+8. (__Optional__) You could use
    [hugegraph-hubble](https://hugegraph.apache.org/docs/quickstart/hugegraph-hubble/#21-use-docker-convenient-for-testdev)
    to visit the graph data, could run it via [Docker/Docker-Compose](https://hub.docker.com/r/hugegraph/hubble)
-   for guidance. (Hubble is a graph-analysis dashboard include data loading/schema management/graph traverser/display).
-8. (__Optional__) offline download NLTK stopwords
+   for guidance. (Hubble is a graph-analysis dashboard that includes data loading/schema management/graph traverser/display).
+9. (__Optional__) offline download NLTK stopwords
    ```bash
    python ./hugegraph_llm/operators/common_op/nltk_helper.py
    ```
    > [!TIP]
-   > You can also refer our [quick-start](./quick_start.md) doc to understand how to use it & the basic query logic 🚧
+   > You can also refer to our [quick-start](./quick_start.md) doc to understand how to use it & the basic query logic 🚧

 ## 4 Examples

@@ -124,7 +124,7 @@ This can be obtained from the `LLMs` class.
    )
    ```
    ![gradio-config](https://hugegraph.apache.org/docs/images/kg-uml.png)
-2. **Import Schema**: The `import_schema` method is used to import a schema from a source. The source can be a HugeGraph instance, a user-defined schema or an extraction result. The method `print_result` can be chained to print the result.
+2. **Import Schema**: The `import_schema` method is used to import a schema from a source. The source can be a HugeGraph instance, a user-defined schema, or an extraction result. The method `print_result` can be chained to print the result.
    ```python
    # Import schema from a HugeGraph instance
    builder.import_schema(from_hugegraph="xxx").print_result()
````

hugegraph-llm/src/hugegraph_llm/operators/llm_op/property_graph_extract.py

Lines changed: 34 additions & 27 deletions
```diff
@@ -26,7 +26,6 @@
 from hugegraph_llm.models.llms.base import BaseLLM
 from hugegraph_llm.utils.log import log

-
 """
 TODO: It is not clear whether there is any other dependence on the SCHEMA_EXAMPLE_PROMPT variable.
 Because the SCHEMA_EXAMPLE_PROMPT variable will no longer change based on
@@ -88,9 +87,9 @@ def filter_item(schema, items) -> List[Dict[str, Any]]:

 class PropertyGraphExtract:
     def __init__(
-        self,
-        llm: BaseLLM,
-        example_prompt: str = prompt.extract_graph_prompt
+            self,
+            llm: BaseLLM,
+            example_prompt: str = prompt.extract_graph_prompt
     ) -> None:
         self.llm = llm
         self.example_prompt = example_prompt
@@ -125,33 +124,41 @@ def extract_property_graph_by_llm(self, schema, chunk):
         return self.llm.generate(prompt=prompt)

     def _extract_and_filter_label(self, schema, text) -> List[Dict[str, Any]]:
-        # analyze llm generated text to JSON
-        json_strings = re.findall(r'(\[.*?])', text, re.DOTALL)
-        longest_json = max(json_strings, key=lambda x: len(''.join(x)), default=('', ''))
-
-        longest_json_str = ''.join(longest_json).strip()
+        # Use regex to extract a JSON object with curly braces
+        json_match = re.search(r'({.*})', text, re.DOTALL)
+        if not json_match:
+            log.critical("Invalid property graph! No JSON object found, "
+                         "please check the output format example in prompt.")
+            return []
+        json_str = json_match.group(1).strip()

         items = []
         try:
-            property_graph = json.loads(longest_json_str)
+            property_graph = json.loads(json_str)
+            # Expect property_graph to be a dict with keys "vertices" and "edges"
+            if not (isinstance(property_graph, dict) and "vertices" in property_graph and "edges" in property_graph):
+                log.critical("Invalid property graph format; expecting 'vertices' and 'edges'.")
+                return items
+
+            # Create sets for valid vertex and edge labels based on the schema
             vertex_label_set = {vertex["name"] for vertex in schema["vertexlabels"]}
             edge_label_set = {edge["name"] for edge in schema["edgelabels"]}
-            for item in property_graph:
-                if not isinstance(item, dict):
-                    log.warning("Invalid property graph item type '%s'.", type(item))
-                    continue
-                if not self.NECESSARY_ITEM_KEYS.issubset(item.keys()):
-                    log.warning("Invalid item keys '%s'.", item.keys())
-                    continue
-                if item["type"] == "vertex" or item["type"] == "edge":
-                    if (item["label"] not in vertex_label_set
-                            and item["label"] not in edge_label_set):
-                        log.warning("Invalid '%s' label '%s' has been ignored.", item["type"], item["label"])
-                    else:
-                        items.append(item)
-                else:
-                    log.warning("Invalid item type '%s' has been ignored.", item["type"])
-        except json.JSONDecodeError:
-            log.critical("Invalid property graph! Please check the extracted JSON data carefully")

+            def process_items(item_list, valid_labels, item_type):
+                for item in item_list:
+                    if not isinstance(item, dict):
+                        log.warning("Invalid property graph item type '%s'.", type(item))
+                        continue
+                    if not self.NECESSARY_ITEM_KEYS.issubset(item.keys()):
+                        log.warning("Invalid item keys '%s'.", item.keys())
+                        continue
+                    if item["label"] not in valid_labels:
+                        log.warning("Invalid %s label '%s' has been ignored.", item_type, item["label"])
+                        continue
+                    items.append(item)
+
+            process_items(property_graph["vertices"], vertex_label_set, "vertex")
+            process_items(property_graph["edges"], edge_label_set, "edge")
+        except json.JSONDecodeError:
+            log.critical("Invalid property graph JSON! Please check the extracted JSON data carefully")
         return items
```
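The heart of this fix is the regex swap: the prompt now asks the LLM for a single JSON object with `vertices` and `edges` keys, so extraction matches `({.*})` instead of hunting for JSON arrays. A minimal standalone sketch of the difference (the sample LLM reply below is hypothetical, not from the repository):

```python
import json
import re

# Hypothetical LLM reply: the JSON object the prompt asks for,
# wrapped in conversational text.
llm_output = (
    'Sure! Here is the extracted property graph:\n'
    '{"vertices": [{"type": "vertex", "label": "person",'
    ' "properties": {"name": "Alice"}}],\n'
    ' "edges": []}\n'
    'Let me know if you need anything else.'
)

# Old pattern: collects JSON arrays, so it can only capture the inner
# "vertices" list (and the empty "edges" list), losing the enclosing object.
old_matches = re.findall(r'(\[.*?])', llm_output, re.DOTALL)

# New pattern: greedily matches from the first '{' to the last '}',
# recovering the whole object in the prompt's format.
match = re.search(r'({.*})', llm_output, re.DOTALL)
graph = json.loads(match.group(1).strip()) if match else {}

print(sorted(graph))     # ['edges', 'vertices']
print(len(old_matches))  # 2
```

Note that the greedy `({.*})` assumes one JSON object per response: if the model emits several objects, everything between the first `{` and the last `}` is captured and `json.loads` raises `JSONDecodeError`, which the patched code reports via `log.critical`.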
