All evaluation thresholds use default values defined in the evaluator classes:
- **Structure thresholds**: Defined in `StructureEvaluator` with defaults:
  - `noise_ratio_threshold`: 0.15
  - `largest_cc_ratio_threshold`: 0.90
  - `avg_degree_min`: 2.0
  - `avg_degree_max`: 5.0
  - `powerlaw_r2_threshold`: 0.75
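To make the thresholds concrete, here is a minimal, hypothetical sketch of how such defaults could be applied. It uses a plain adjacency-dict stand-in rather than the actual `StructureEvaluator`/NetworkX internals, and the metric definitions (isolated nodes as noise, component sizes via BFS) are assumptions for illustration:

```python
from collections import deque

# Default thresholds as listed above (power-law fit omitted in this sketch).
THRESHOLDS = {
    "noise_ratio_threshold": 0.15,
    "largest_cc_ratio_threshold": 0.90,
    "avg_degree_min": 2.0,
    "avg_degree_max": 5.0,
}

def structure_metrics(adj):
    """adj: {node: set(neighbors)} undirected adjacency map (hypothetical input format)."""
    n = len(adj)
    isolated = sum(1 for nbrs in adj.values() if not nbrs)
    # Largest connected component via BFS.
    seen, largest = set(), 0
    for start in adj:
        if start in seen:
            continue
        comp, queue = 0, deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            comp += 1
            for nbr in adj[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
        largest = max(largest, comp)
    return {
        "noise_ratio": isolated / n,
        "largest_cc_ratio": largest / n,
        "avg_degree": sum(len(nbrs) for nbrs in adj.values()) / n,
    }

def passes_thresholds(m):
    # A graph passes when noise is low, one component dominates,
    # and the average degree falls inside the expected band.
    return (
        m["noise_ratio"] <= THRESHOLDS["noise_ratio_threshold"]
        and m["largest_cc_ratio"] >= THRESHOLDS["largest_cc_ratio_threshold"]
        and THRESHOLDS["avg_degree_min"] <= m["avg_degree"] <= THRESHOLDS["avg_degree_max"]
    )
```

A 10-node cycle, for example, has no isolates, a single component, and average degree 2.0, so it lands exactly at the healthy boundary.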
**Note**: Accuracy evaluation automatically loads chunks from the chunk storage and evaluates the quality of entity/relation extraction using LLM-as-a-Judge. No configuration file is needed.
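Since extractions are judged per chunk, the evaluator needs a way to match each extracted entity or relation back to its originating chunk; this is done through the `source_id` field. A small illustrative sketch of that grouping step (the item shapes here are hypothetical, not the library's actual records):

```python
from collections import defaultdict

def group_by_source(items):
    # Map chunk_id -> list of extracted items, keyed on each item's
    # `source_id` field, which links an extraction to its source chunk.
    grouped = defaultdict(list)
    for item in items:
        grouped[item["source_id"]].append(item)
    return grouped

entities = [
    {"name": "Ada Lovelace", "source_id": "chunk-1"},
    {"name": "Analytical Engine", "source_id": "chunk-1"},
    {"name": "Charles Babbage", "source_id": "chunk-2"},
]
by_chunk = group_by_source(entities)  # each chunk judged against its own extractions
```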
## Requirements
- **NetworkX**: Required for structural evaluation
The evaluation returns a dictionary with the following structure:

```python
{
    "accuracy": {
        "entity_accuracy": {
            "overall_score": {"mean": float, "median": float, "min": float, "max": float, "std": float},
            "accuracy": {"mean": float, "median": float, "min": float, "max": float, "std": float},
            "completeness": {"mean": float, "median": float, "min": float, "max": float, "std": float},
            "precision": {"mean": float, "median": float, "min": float, "max": float, "std": float},
            "total_chunks": int,
            "detailed_results": [
                {
                    "chunk_id": str,
                    "chunk_content": str,
                    "extracted_entities_count": int,
                    "accuracy": float,
                    "completeness": float,
                    "precision": float,
                    "overall_score": float,
                    "accuracy_reasoning": str,
                    "completeness_reasoning": str,
                    "precision_reasoning": str,
                    "issues": [str]
                },
                ...
            ]
        },
        "relation_accuracy": {
            "overall_score": {"mean": float, "median": float, "min": float, "max": float, "std": float},
            "accuracy": {"mean": float, "median": float, "min": float, "max": float, "std": float},
            "completeness": {"mean": float, "median": float, "min": float, "max": float, "std": float},
            "precision": {"mean": float, "median": float, "min": float, "max": float, "std": float},
            "total_chunks": int,
            "detailed_results": [
                {
                    "chunk_id": str,
                    "chunk_content": str,
                    "extracted_relations_count": int,
                    "accuracy": float,
                    "completeness": float,
                    "precision": float,
                    "overall_score": float,
                    "accuracy_reasoning": str,
                    "completeness_reasoning": str,
                    "precision_reasoning": str,
                    "issues": [str]
                },
                ...
            ]
        }
    },
    "consistency": {
        "conflict_rate": float,
        "conflict_entities_count": int,
        "total_entities": int,
        "entities_checked": int,
        "conflicts": [
            {
                "entity_id": str,
                "conflict_type": str,        # "entity_type" or "description"
                "conflict_severity": float,  # 0-1, severity of the conflict
                "conflict_reasoning": str,
                "conflicting_values": [str],
                "recommended_value": str,    # for entity_type conflicts
                "conflict_details": str      # for description conflicts
            },
            ...
        ]
    },
    "structure": {
        "total_nodes": int,
        ...
    }
}
```
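A typical follow-up is to drill into `detailed_results` and flag the chunks the LLM judge scored poorly. The sketch below is hypothetical post-processing (the 0.7 cutoff and sample data are illustrative, not library defaults), but the dictionary shape matches the structure above:

```python
def low_scoring_chunks(accuracy_result, min_score=0.7):
    # Collect chunk_ids whose LLM-judged overall_score falls below the cutoff.
    return [
        r["chunk_id"]
        for r in accuracy_result["detailed_results"]
        if r["overall_score"] < min_score
    ]

# Minimal stand-in for results["accuracy"]["entity_accuracy"]:
entity_accuracy = {
    "overall_score": {"mean": 0.77, "median": 0.80, "min": 0.55, "max": 0.95, "std": 0.16},
    "total_chunks": 3,
    "detailed_results": [
        {"chunk_id": "chunk-1", "overall_score": 0.95},
        {"chunk_id": "chunk-2", "overall_score": 0.55},
        {"chunk_id": "chunk-3", "overall_score": 0.80},
    ],
}
flagged = low_scoring_chunks(entity_accuracy)  # chunks worth re-extracting or inspecting
```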
## Notes

- Accuracy evaluation uses LLM-as-a-Judge to evaluate extraction quality from chunks
- Accuracy evaluation automatically loads chunks from chunk storage (no need for `source_text_paths`)
- The evaluator associates extracted entities/relations with their source chunks using the `source_id` field
- Structural evaluation automatically converts Kuzu storage to NetworkX for analysis
- All evaluations include error handling and will return error messages if something fails
- The evaluator automatically loads graph and chunk storage from the working directory
- LLM evaluation may take time for large numbers of chunks (controlled by the `max_concurrent` parameter)