Commit 777cb25

refactor: enhance KG evaluator to use llm-as judge; remove evaluate_kg_config

Parent: 93abd00

10 files changed: 998 additions & 428 deletions

examples/evaluate_kg/evaluate_kg.sh

Lines changed: 0 additions & 1 deletion
@@ -2,5 +2,4 @@ python3 -m graphgen.operators.evaluate_kg.evaluate_kg \
   --working_dir cache \
   --graph_backend kuzu \
   --kv_backend rocksdb \
-  --sample_size 100 \
   --max_concurrent 10

examples/evaluate_kg/evaluate_kg_config.yaml

Lines changed: 0 additions & 12 deletions
This file was deleted.

graphgen/models/evaluator/kg/README.md

Lines changed: 140 additions & 21 deletions
@@ -6,24 +6,33 @@ This module provides comprehensive quality evaluation for knowledge graphs built
 
 The evaluation functionality has been split into modular components:
 
-- **`accuracy_evaluator.py`**: Entity/relation/triple accuracy evaluation using LLM-as-judge
+- **`accuracy_evaluator.py`**: Entity/relation extraction quality evaluation using LLM-as-a-Judge
 - **`consistency_evaluator.py`**: Attribute value conflict detection
 - **`structure_evaluator.py`**: Graph structural robustness metrics
-- **`utils.py`**: Utility functions (NetworkX conversion, text retrieval, sampling)
 - **`kg_quality_evaluator.py`**: Main evaluator class that integrates all modules
 
 ## Features
 
 ### 1. Accuracy Assessment
-- **Entity Recognition Accuracy**: Samples entities and validates them using LLM
-- **Relation Extraction Accuracy**: Samples relations and validates them using LLM
-- **Triple Validation (RLC)**: Samples triples and validates them using LLM
-- Calculates Precision, Recall, and F1 scores for each metric
+- **Entity Extraction Quality**: Uses LLM-as-a-Judge to evaluate the quality of entity extraction from chunks
+  - Evaluates accuracy (correctness of extracted entities)
+  - Evaluates completeness (whether important entities are missed)
+  - Evaluates precision (naming accuracy and specificity)
+- **Relation Extraction Quality**: Uses LLM-as-a-Judge to evaluate the quality of relation extraction from chunks
+  - Evaluates accuracy (correctness of extracted relations)
+  - Evaluates completeness (whether important relations are missed)
+  - Evaluates precision (relation description accuracy)
+- Provides multi-dimensional quality scores (0-1 scale) with detailed reasoning for each chunk
 
 ### 2. Consistency Assessment
-- Detects attribute value conflicts (same entity, same attribute, different values)
+- **Semantic Conflict Detection**: Uses LLM-as-a-Judge to detect semantic conflicts in entity attributes
+  - **Entity Type Conflicts**: Detects when the same entity is extracted with different types across chunks
+  - **Entity Description Conflicts**: Detects when entity descriptions from different chunks are semantically inconsistent
+  - **Relation Conflicts**: Detects when the same entity pair has conflicting relation descriptions
+- Only evaluates entities with multiple source chunks (entities appearing in multiple chunks)
+- Uses LLM to extract entity attributes from each chunk and compare them semantically
 - Calculates conflict rate: `conflict_entities_count / total_entities`
-- Returns detailed conflict information
+- Returns detailed conflict information including conflict severity and reasoning
 
 ### 3. Structural Robustness Assessment
 - **Noise Ratio**: Isolated nodes / total nodes (threshold: < 15%)
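For reference, the noise-ratio and largest-connected-component metrics named in the hunk above can be sketched with NetworkX. This is an illustrative sketch, not the module's actual implementation; the function name and return keys are hypothetical:

```python
import networkx as nx

def structural_metrics(g: nx.Graph) -> dict:
    """Illustrative sketch of the structural robustness metrics described above."""
    total = g.number_of_nodes()
    if total == 0:
        return {"noise_ratio": 0.0, "largest_cc_ratio": 0.0, "avg_degree": 0.0}
    # Noise ratio: isolated nodes / total nodes
    isolated = sum(1 for n in g.nodes if g.degree(n) == 0)
    # Largest connected component ratio
    largest_cc = max(len(c) for c in nx.connected_components(g))
    # Average degree of an undirected graph: 2|E| / |V|
    avg_degree = 2 * g.number_of_edges() / total
    return {
        "noise_ratio": isolated / total,        # target: < 0.15
        "largest_cc_ratio": largest_cc / total, # target: > 0.90
        "avg_degree": avg_degree,               # target: 2.0-5.0
    }

g = nx.Graph()
g.add_edges_from([("a", "b"), ("b", "c")])
g.add_node("d")  # isolated node
print(structural_metrics(g))
# noise_ratio 0.25, largest_cc_ratio 0.75, avg_degree 1.0
```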
@@ -42,10 +51,9 @@ python -m graphgen.operators.evaluate_kg.evaluate_kg --working_dir cache
 # Run specific evaluation
 python -m graphgen.operators.evaluate_kg.evaluate_kg --working_dir cache --accuracy_only
 
-# Custom configuration
+# Specify backends
 python -m graphgen.operators.evaluate_kg.evaluate_kg \
   --working_dir cache \
-  --sample_size 200 \
   --graph_backend networkx \
   --kv_backend json_kv
 ```
@@ -59,10 +67,22 @@ bash examples/evaluate_kg/evaluate_kg.sh
 # With custom options
 bash examples/evaluate_kg/evaluate_kg.sh \
   --working_dir cache \
-  --sample_size 200 \
   --accuracy_only
 ```
 
+## Configuration
+
+All evaluation thresholds use default values defined in the evaluator classes:
+
+- **Structure thresholds**: Defined in `StructureEvaluator` with defaults:
+  - `noise_ratio_threshold`: 0.15
+  - `largest_cc_ratio_threshold`: 0.90
+  - `avg_degree_min`: 2.0
+  - `avg_degree_max`: 5.0
+  - `powerlaw_r2_threshold`: 0.75
+
+**Note**: Accuracy evaluation automatically loads chunks from the chunk storage and evaluates the quality of entity/relation extraction using LLM-as-a-Judge. No configuration file is needed.
+
 ## Requirements
 
 - **NetworkX**: Required for structural evaluation
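The threshold defaults listed in the Configuration section above could be held in a simple dataclass. A hypothetical sketch (the real `StructureEvaluator` may store them differently):

```python
from dataclasses import dataclass

@dataclass
class StructureThresholds:
    """Hypothetical container for the StructureEvaluator defaults listed above."""
    noise_ratio_threshold: float = 0.15
    largest_cc_ratio_threshold: float = 0.90
    avg_degree_min: float = 2.0
    avg_degree_max: float = 5.0
    powerlaw_r2_threshold: float = 0.75

    def degree_in_range(self, avg_degree: float) -> bool:
        # Healthy graphs are expected to fall inside [avg_degree_min, avg_degree_max]
        return self.avg_degree_min <= avg_degree <= self.avg_degree_max

t = StructureThresholds()
print(t.degree_in_range(3.0))  # True
```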
@@ -78,21 +98,117 @@ The evaluation returns a dictionary with the following structure:
 {
   "accuracy": {
     "entity_accuracy": {
-      "precision": float,
-      "recall": float,
-      "f1": float,
-      "true_positives": int,
-      "false_positives": int,
-      "sample_size": int
+      "overall_score": {
+        "mean": float,
+        "median": float,
+        "min": float,
+        "max": float,
+        "std": float
+      },
+      "accuracy": {
+        "mean": float,
+        "median": float,
+        "min": float,
+        "max": float,
+        "std": float
+      },
+      "completeness": {
+        "mean": float,
+        "median": float,
+        "min": float,
+        "max": float,
+        "std": float
+      },
+      "precision": {
+        "mean": float,
+        "median": float,
+        "min": float,
+        "max": float,
+        "std": float
+      },
+      "total_chunks": int,
+      "detailed_results": [
+        {
+          "chunk_id": str,
+          "chunk_content": str,
+          "extracted_entities_count": int,
+          "accuracy": float,
+          "completeness": float,
+          "precision": float,
+          "overall_score": float,
+          "accuracy_reasoning": str,
+          "completeness_reasoning": str,
+          "precision_reasoning": str,
+          "issues": [str]
+        },
+        ...
+      ]
     },
-    "relation_accuracy": { ... },
-    "triple_accuracy": { ... }
+    "relation_accuracy": {
+      "overall_score": {
+        "mean": float,
+        "median": float,
+        "min": float,
+        "max": float,
+        "std": float
+      },
+      "accuracy": {
+        "mean": float,
+        "median": float,
+        "min": float,
+        "max": float,
+        "std": float
+      },
+      "completeness": {
+        "mean": float,
+        "median": float,
+        "min": float,
+        "max": float,
+        "std": float
+      },
+      "precision": {
+        "mean": float,
+        "median": float,
+        "min": float,
+        "max": float,
+        "std": float
+      },
+      "total_chunks": int,
+      "detailed_results": [
+        {
+          "chunk_id": str,
+          "chunk_content": str,
+          "extracted_relations_count": int,
+          "accuracy": float,
+          "completeness": float,
+          "precision": float,
+          "overall_score": float,
+          "accuracy_reasoning": str,
+          "completeness_reasoning": str,
+          "precision_reasoning": str,
+          "issues": [str]
+        },
+        ...
+      ]
+    }
   },
   "consistency": {
     "conflict_rate": float,
     "conflict_entities_count": int,
     "total_entities": int,
-    "conflicts": [ ... ]
+    "entities_checked": int,
+    "conflicts": [
+      {
+        "entity_id": str,
+        "conflict_type": str,  # "entity_type" or "description"
+        "conflict_severity": float,  # 0-1, severity of the conflict
+        "conflict_reasoning": str,
+        "conflicting_values": [str],
+        "recommended_value": str,  # for entity_type conflicts
+        "conflict_details": str  # for description conflicts
+      },
+      ...
+    ]
   },
   "structure": {
     "total_nodes": int,
@@ -111,7 +227,10 @@ The evaluation returns a dictionary with the following structure:
 
 ## Notes
 
-- Accuracy evaluation requires LLM API access and may be slow for large sample sizes
+- Accuracy evaluation uses LLM-as-a-Judge to evaluate extraction quality from chunks
+- Accuracy evaluation automatically loads chunks from chunk storage (no need for source_text_paths)
+- The evaluator associates extracted entities/relations with their source chunks using the `source_id` field
 - Structural evaluation automatically converts Kuzu storage to NetworkX for analysis
 - All evaluations include error handling and will return error messages if something fails
 - The evaluator automatically loads graph and chunk storage from the working directory
+- LLM evaluation may take time for large numbers of chunks (controlled by `max_concurrent` parameter)
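The mean/median/min/max/std blocks in the schema above can be produced from per-chunk judge scores with the standard library. A minimal sketch; whether the evaluator uses population or sample standard deviation is an assumption here (population shown), and the function name is illustrative:

```python
from statistics import mean, median, pstdev

def aggregate(scores: list[float]) -> dict:
    """Collapse per-chunk LLM-as-a-Judge scores into the summary stats shown above."""
    return {
        "mean": mean(scores),
        "median": median(scores),
        "min": min(scores),
        "max": max(scores),
        "std": pstdev(scores),  # population std; the real evaluator may use sample std
    }

print(aggregate([0.8, 0.9, 1.0]))
```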
Lines changed: 0 additions & 5 deletions
@@ -1,14 +1,9 @@
 from .accuracy_evaluator import AccuracyEvaluator
 from .consistency_evaluator import ConsistencyEvaluator
 from .structure_evaluator import StructureEvaluator
-from .utils import convert_to_networkx, get_relevant_text, get_source_text, sample_items
 
 __all__ = [
     "AccuracyEvaluator",
     "ConsistencyEvaluator",
     "StructureEvaluator",
-    "convert_to_networkx",
-    "get_relevant_text",
-    "get_source_text",
-    "sample_items",
 ]
