Commit 777cb25

refactor: enhance KG evaluator to use llm-as judge; remove evaluate_kg_config

Parent: 93abd00

10 files changed: 998 additions & 428 deletions

examples/evaluate_kg/evaluate_kg.sh

Lines changed: 0 additions & 1 deletion
@@ -2,5 +2,4 @@ python3 -m graphgen.operators.evaluate_kg.evaluate_kg \
   --working_dir cache \
   --graph_backend kuzu \
   --kv_backend rocksdb \
-  --sample_size 100 \
   --max_concurrent 10

examples/evaluate_kg/evaluate_kg_config.yaml

Lines changed: 0 additions & 12 deletions
This file was deleted.

graphgen/models/evaluator/kg/README.md

Lines changed: 140 additions & 21 deletions
@@ -6,24 +6,33 @@ This module provides comprehensive quality evaluation for knowledge graphs built
 
 The evaluation functionality has been split into modular components:
 
-- **`accuracy_evaluator.py`**: Entity/relation/triple accuracy evaluation using LLM-as-judge
+- **`accuracy_evaluator.py`**: Entity/relation extraction quality evaluation using LLM-as-a-Judge
 - **`consistency_evaluator.py`**: Attribute value conflict detection
 - **`structure_evaluator.py`**: Graph structural robustness metrics
-- **`utils.py`**: Utility functions (NetworkX conversion, text retrieval, sampling)
 - **`kg_quality_evaluator.py`**: Main evaluator class that integrates all modules
 
 ## Features
 
 ### 1. Accuracy Assessment
-- **Entity Recognition Accuracy**: Samples entities and validates them using LLM
-- **Relation Extraction Accuracy**: Samples relations and validates them using LLM
-- **Triple Validation (RLC)**: Samples triples and validates them using LLM
-- Calculates Precision, Recall, and F1 scores for each metric
+- **Entity Extraction Quality**: Uses LLM-as-a-Judge to evaluate the quality of entity extraction from chunks
+  - Evaluates accuracy (correctness of extracted entities)
+  - Evaluates completeness (whether important entities are missed)
+  - Evaluates precision (naming accuracy and specificity)
+- **Relation Extraction Quality**: Uses LLM-as-a-Judge to evaluate the quality of relation extraction from chunks
+  - Evaluates accuracy (correctness of extracted relations)
+  - Evaluates completeness (whether important relations are missed)
+  - Evaluates precision (relation description accuracy)
+- Provides multi-dimensional quality scores (0-1 scale) with detailed reasoning for each chunk
 
 ### 2. Consistency Assessment
-- Detects attribute value conflicts (same entity, same attribute, different values)
+- **Semantic Conflict Detection**: Uses LLM-as-a-Judge to detect semantic conflicts in entity attributes
+  - **Entity Type Conflicts**: Detects when the same entity is extracted with different types across chunks
+  - **Entity Description Conflicts**: Detects when entity descriptions from different chunks are semantically inconsistent
+  - **Relation Conflicts**: Detects when the same entity pair has conflicting relation descriptions
+- Only evaluates entities with multiple source chunks (entities appearing in multiple chunks)
+- Uses LLM to extract entity attributes from each chunk and compare them semantically
 - Calculates conflict rate: `conflict_entities_count / total_entities`
-- Returns detailed conflict information
+- Returns detailed conflict information including conflict severity and reasoning
 
 ### 3. Structural Robustness Assessment
 - **Noise Ratio**: Isolated nodes / total nodes (threshold: < 15%)
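For reference, the noise-ratio and largest-connected-component metrics named in the hunk above can be sketched with NetworkX. This is an illustrative sketch, not the module's actual implementation; the function name and return keys are hypothetical:

```python
import networkx as nx

def structural_metrics(g: nx.Graph) -> dict:
    """Illustrative sketch of the structural robustness metrics described above."""
    total = g.number_of_nodes()
    if total == 0:
        return {"noise_ratio": 0.0, "largest_cc_ratio": 0.0, "avg_degree": 0.0}
    # Noise ratio: isolated nodes / total nodes
    isolated = sum(1 for n in g.nodes if g.degree(n) == 0)
    # Largest connected component ratio
    largest_cc = max(len(c) for c in nx.connected_components(g))
    # Average degree of an undirected graph: 2|E| / |V|
    avg_degree = 2 * g.number_of_edges() / total
    return {
        "noise_ratio": isolated / total,        # target: < 0.15
        "largest_cc_ratio": largest_cc / total, # target: > 0.90
        "avg_degree": avg_degree,               # target: 2.0-5.0
    }

g = nx.Graph()
g.add_edges_from([("a", "b"), ("b", "c")])
g.add_node("d")  # isolated node
print(structural_metrics(g))
# noise_ratio 0.25, largest_cc_ratio 0.75, avg_degree 1.0
```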
@@ -42,10 +51,9 @@ python -m graphgen.operators.evaluate_kg.evaluate_kg --working_dir cache
 # Run specific evaluation
 python -m graphgen.operators.evaluate_kg.evaluate_kg --working_dir cache --accuracy_only
 
-# Custom configuration
+# Specify backends
 python -m graphgen.operators.evaluate_kg.evaluate_kg \
   --working_dir cache \
-  --sample_size 200 \
   --graph_backend networkx \
   --kv_backend json_kv
 ```
@@ -59,10 +67,22 @@ bash examples/evaluate_kg/evaluate_kg.sh
 # With custom options
 bash examples/evaluate_kg/evaluate_kg.sh \
   --working_dir cache \
-  --sample_size 200 \
   --accuracy_only
 ```
 
+## Configuration
+
+All evaluation thresholds use default values defined in the evaluator classes:
+
+- **Structure thresholds**: Defined in `StructureEvaluator` with defaults:
+  - `noise_ratio_threshold`: 0.15
+  - `largest_cc_ratio_threshold`: 0.90
+  - `avg_degree_min`: 2.0
+  - `avg_degree_max`: 5.0
+  - `powerlaw_r2_threshold`: 0.75
+
+**Note**: Accuracy evaluation automatically loads chunks from the chunk storage and evaluates the quality of entity/relation extraction using LLM-as-a-Judge. No configuration file is needed.
+
 ## Requirements
 
 - **NetworkX**: Required for structural evaluation
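The threshold defaults listed in the Configuration section above could be held in a simple dataclass. A hypothetical sketch (the real `StructureEvaluator` may store them differently):

```python
from dataclasses import dataclass

@dataclass
class StructureThresholds:
    """Hypothetical container for the StructureEvaluator defaults listed above."""
    noise_ratio_threshold: float = 0.15
    largest_cc_ratio_threshold: float = 0.90
    avg_degree_min: float = 2.0
    avg_degree_max: float = 5.0
    powerlaw_r2_threshold: float = 0.75

    def degree_in_range(self, avg_degree: float) -> bool:
        # Healthy graphs are expected to fall inside [avg_degree_min, avg_degree_max]
        return self.avg_degree_min <= avg_degree <= self.avg_degree_max

t = StructureThresholds()
print(t.degree_in_range(3.0))  # True
```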
@@ -78,21 +98,117 @@ The evaluation returns a dictionary with the following structure:
 {
   "accuracy": {
     "entity_accuracy": {
-      "precision": float,
-      "recall": float,
-      "f1": float,
-      "true_positives": int,
-      "false_positives": int,
-      "sample_size": int
+      "overall_score": {
+        "mean": float,
+        "median": float,
+        "min": float,
+        "max": float,
+        "std": float
+      },
+      "accuracy": {
+        "mean": float,
+        "median": float,
+        "min": float,
+        "max": float,
+        "std": float
+      },
+      "completeness": {
+        "mean": float,
+        "median": float,
+        "min": float,
+        "max": float,
+        "std": float
+      },
+      "precision": {
+        "mean": float,
+        "median": float,
+        "min": float,
+        "max": float,
+        "std": float
+      },
+      "total_chunks": int,
+      "detailed_results": [
+        {
+          "chunk_id": str,
+          "chunk_content": str,
+          "extracted_entities_count": int,
+          "accuracy": float,
+          "completeness": float,
+          "precision": float,
+          "overall_score": float,
+          "accuracy_reasoning": str,
+          "completeness_reasoning": str,
+          "precision_reasoning": str,
+          "issues": [str]
+        },
+        ...
+      ]
     },
-    "relation_accuracy": { ... },
-    "triple_accuracy": { ... }
+    "relation_accuracy": {
+      "overall_score": {
+        "mean": float,
+        "median": float,
+        "min": float,
+        "max": float,
+        "std": float
+      },
+      "accuracy": {
+        "mean": float,
+        "median": float,
+        "min": float,
+        "max": float,
+        "std": float
+      },
+      "completeness": {
+        "mean": float,
+        "median": float,
+        "min": float,
+        "max": float,
+        "std": float
+      },
+      "precision": {
+        "mean": float,
+        "median": float,
+        "min": float,
+        "max": float,
+        "std": float
+      },
+      "total_chunks": int,
+      "detailed_results": [
+        {
+          "chunk_id": str,
+          "chunk_content": str,
+          "extracted_relations_count": int,
+          "accuracy": float,
+          "completeness": float,
+          "precision": float,
+          "overall_score": float,
+          "accuracy_reasoning": str,
+          "completeness_reasoning": str,
+          "precision_reasoning": str,
+          "issues": [str]
+        },
+        ...
+      ]
+    }
   },
   "consistency": {
     "conflict_rate": float,
     "conflict_entities_count": int,
     "total_entities": int,
-    "conflicts": [ ... ]
+    "entities_checked": int,
+    "conflicts": [
+      {
+        "entity_id": str,
+        "conflict_type": str,  # "entity_type" or "description"
+        "conflict_severity": float,  # 0-1, severity of the conflict
+        "conflict_reasoning": str,
+        "conflicting_values": [str],
+        "recommended_value": str,  # for entity_type conflicts
+        "conflict_details": str  # for description conflicts
+      },
+      ...
+    ]
   },
   "structure": {
     "total_nodes": int,
@@ -111,7 +227,10 @@ The evaluation returns a dictionary with the following structure:
 
 ## Notes
 
-- Accuracy evaluation requires LLM API access and may be slow for large sample sizes
+- Accuracy evaluation uses LLM-as-a-Judge to evaluate extraction quality from chunks
+- Accuracy evaluation automatically loads chunks from chunk storage (no need for source_text_paths)
+- The evaluator associates extracted entities/relations with their source chunks using the `source_id` field
 - Structural evaluation automatically converts Kuzu storage to NetworkX for analysis
 - All evaluations include error handling and will return error messages if something fails
 - The evaluator automatically loads graph and chunk storage from the working directory
+- LLM evaluation may take time for large numbers of chunks (controlled by `max_concurrent` parameter)
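The mean/median/min/max/std blocks in the schema above can be produced from per-chunk judge scores with the standard library. A minimal sketch; whether the evaluator uses population or sample standard deviation is an assumption here (population shown), and the function name is illustrative:

```python
from statistics import mean, median, pstdev

def aggregate(scores: list[float]) -> dict:
    """Collapse per-chunk LLM-as-a-Judge scores into the summary stats shown above."""
    return {
        "mean": mean(scores),
        "median": median(scores),
        "min": min(scores),
        "max": max(scores),
        "std": pstdev(scores),  # population std; the real evaluator may use sample std
    }

print(aggregate([0.8, 0.9, 1.0]))
```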
Lines changed: 0 additions & 5 deletions
@@ -1,14 +1,9 @@
 from .accuracy_evaluator import AccuracyEvaluator
 from .consistency_evaluator import ConsistencyEvaluator
 from .structure_evaluator import StructureEvaluator
-from .utils import convert_to_networkx, get_relevant_text, get_source_text, sample_items
 
 __all__ = [
     "AccuracyEvaluator",
     "ConsistencyEvaluator",
     "StructureEvaluator",
-    "convert_to_networkx",
-    "get_relevant_text",
-    "get_source_text",
-    "sample_items",
 ]
