| title | Entity Normalization Deep-Dive |
|---|
Deduplication and Merging for Clean Knowledge Graphs
Entity normalization is a critical step in building quality knowledge graphs. This guide explains how EdgeQuake transforms raw entity names into canonical forms and merges duplicate entities into unified nodes.
- The Problem
- The Solution
- Normalization Algorithm
- Merge Strategy
- Configuration
- Edge Cases
- Quality Metrics
- Best Practices
Without normalization, the same real-world entity appears as multiple disconnected nodes in the knowledge graph.
Consider a document mentioning "Sarah Chen" in different ways:
┌─────────────────────────────────────────────────────────────────┐
│ WITHOUT NORMALIZATION (Fragmented Graph) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌────────────────┐ │
│ │ Sarah Chen │ ← From chunk 1 │
│ └───────┬────────┘ │
│ │ WORKS_AT │
│ ▼ │
│ ┌────────────────┐ │
│ │ MIT │ │
│ └────────────────┘ │
│ │
│ ┌────────────────┐ │
│ │ sarah chen │ ← From chunk 2 (different node!) │
│ └───────┬────────┘ │
│ │ AUTHORED │
│ ▼ │
│ ┌────────────────┐ │
│ │ Climate Paper │ │
│ └────────────────┘ │
│ │
│ ┌────────────────┐ │
│ │ Dr. S. Chen │ ← From chunk 3 (yet another node!) │
│ └───────┬────────┘ │
│ │ RESEARCHES │
│ ▼ │
│ ┌────────────────┐ │
│ │ Machine Learning │ │
│ └────────────────┘ │
│ │
│ PROBLEM: 3 nodes for the same person! │
│ Relationships are disconnected. │
│ Query "Sarah Chen at MIT" misses paper authorship. │
│ │
└─────────────────────────────────────────────────────────────────┘
| Issue | Without Normalization |
|---|---|
| Missing relationships | WORKS_AT and AUTHORED never connect |
| Incomplete answers | "What does Sarah Chen research?" misses ML |
| Inflated counts | 3 person nodes instead of 1 |
| Failed lookups | Search "Sarah Chen" doesn't find "sarah chen" |
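EdgeQuake normalizes all entity names to a canonical format before storage, so variants that differ only in case, surrounding whitespace, a leading article, or a possessive suffix resolve to the same node key. Abbreviated or titled forms such as "Dr. S. Chen" still produce a different key (see the table below) and need alias mapping before they merge.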
┌─────────────────────────────────────────────────────────────────┐
│ WITH NORMALIZATION (Unified Graph) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌────────────────┐ │
│ ┌─────── │ SARAH_CHEN │ ───────┐ │
│ │ └───────┬────────┘ │ │
│ │ WORKS_AT │ AUTHORED │ RESEARCHES │
│ ▼ ▼ ▼ │
│ ┌──────────┐ ┌──────────────┐ ┌─────────────────┐ │
│ │ MIT │ │CLIMATE_PAPER │ │MACHINE_LEARNING │ │
│ └──────────┘ └──────────────┘ └─────────────────┘ │
│ │
│ RESULT: Single node with all relationships! │
│ "Sarah Chen at MIT" now finds paper AND ML research │
│ │
└─────────────────────────────────────────────────────────────────┘
| Raw Input | Normalized Output |
|---|---|
| "Sarah Chen" | SARAH_CHEN |
| "sarah chen" | SARAH_CHEN |
| "Dr. S. Chen" | DR._S._CHEN |
| "The Company" | COMPANY |
| "John's Research" | JOHN_RESEARCH |
The normalize_entity_name() function applies these transformations in order:
┌─────────────────────────────────────────────────────────────────┐
│ NORMALIZATION PIPELINE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Input: " The John Doe's Company " │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────┐ │
│ │ Step 1: TRIM WHITESPACE │ │
│ │ " The John Doe's Company " │ │
│ │ → "The John Doe's Company" │ │
│ └─────────────────┬───────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────┐ │
│ │ Step 2: REMOVE PREFIXES │ │
│ │ Removes: "The ", "A ", "An " │ │
│ │ "The John Doe's Company" │ │
│ │ → "John Doe's Company" │ │
│ └─────────────────┬───────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────┐ │
│ │ Step 3: SPLIT BY WHITESPACE │ │
│ │ "John Doe's Company" │ │
│ │ → ["John", "Doe's", "Company"] │ │
│ └─────────────────┬───────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────┐ │
│ │ Step 4: REMOVE POSSESSIVES │ │
│ │ Each word: strip "'s" suffix │ │
│ │ ["John", "Doe's", "Company"] │ │
│ │ → ["John", "Doe", "Company"] │ │
│ └─────────────────┬───────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────┐ │
│ │ Step 5: TITLE CASE EACH WORD │ │
│ │ First letter upper, rest lower │ │
│ │ ["John", "Doe", "Company"] │ │
│ │ → ["John", "Doe", "Company"] │ │
│ └─────────────────┬───────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────┐ │
│ │ Step 6: JOIN WITH UNDERSCORES │ │
│ │ ["John", "Doe", "Company"] │ │
│ │ → "John_Doe_Company" │ │
│ └─────────────────┬───────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────┐ │
│ │ Step 7: UPPERCASE │ │
│ │ "John_Doe_Company" │ │
│ │ → "JOHN_DOE_COMPANY" │ │
│ └─────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Output: "JOHN_DOE_COMPANY" │
│ │
└─────────────────────────────────────────────────────────────────┘
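Step 5 depends on a per-word title-casing helper, which the listing that follows calls as `to_title_case` without showing its body. A minimal sketch of such a helper, assuming it simply uppercases the first character and lowercases the rest (the actual EdgeQuake helper may differ):

```rust
/// Hypothetical helper: uppercase the first character of a word and
/// lowercase the remainder. Operates on Unicode characters, not bytes.
fn to_title_case(word: &str) -> String {
    let mut chars = word.chars();
    match chars.next() {
        // `to_uppercase` returns an iterator because one character can
        // expand to several when its case is changed.
        Some(first) => first.to_uppercase().collect::<String>() + &chars.as_str().to_lowercase(),
        None => String::new(),
    }
}
```

Because Step 7 uppercases the joined string anyway, this step does not change the final key for plain ASCII input; it only affects the intermediate form.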
pub fn normalize_entity_name(raw_name: &str) -> String {
    let trimmed = raw_name.trim();
    // Remove common prefixes (only the first match is stripped)
    let without_prefix = trimmed
        .strip_prefix("The ")
        .or_else(|| trimmed.strip_prefix("the "))
        .or_else(|| trimmed.strip_prefix("A "))
        .or_else(|| trimmed.strip_prefix("An "))
        .unwrap_or(trimmed);
    // Split, normalize each word, rejoin
    without_prefix
        .split_whitespace()
        .filter(|w| !w.is_empty())
        .map(|word| {
            // Strip a possessive written with a straight or typographic apostrophe
            let without_possessive = word
                .strip_suffix("'s")
                .or_else(|| word.strip_suffix("’s"))
                .unwrap_or(word);
            // to_title_case is a helper that is not shown in this guide
            to_title_case(without_possessive)
        })
        .collect::<Vec<_>>()
        .join("_")
        .to_uppercase()
}

When the same entity appears in multiple documents, EdgeQuake merges the new occurrence into the existing node instead of creating a duplicate.
┌─────────────────────────────────────────────────────────────────┐
│ ENTITY MERGE FLOW │
├─────────────────────────────────────────────────────────────────┤
│ │
│ New Entity: "SARAH_CHEN" │
│ Description: "A climate scientist at MIT" │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────┐ │
│ │ Query: Does SARAH_CHEN exist? │ │
│ └─────────────────┬───────────────────┘ │
│ │ │
│ ┌──────────┴──────────┐ │
│ │ │ │
│ NO YES │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────────────────────┐ │
│ │ CREATE │ │ MERGE │ │
│ │ new node │ │ │ │
│ │ │ │ 1. Combine descriptions │ │
│ │ properties: │ │ 2. Max(importance) │ │
│ │ - desc │ │ 3. Append source_ids │ │
│ │ - type │ │ 4. Update timestamp │ │
│ │ - source │ │ │ │
│ └─────────────┘ └─────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
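The create-or-merge decision in the diagram can be sketched with an in-memory map. Everything here is illustrative: `EntityRecord`, `upsert_entity`, and the use of a `HashMap` in place of the real graph store are assumptions, and the timestamp is passed in by the caller to keep the sketch deterministic.

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

/// Hypothetical in-memory record; fields mirror the merge steps in the diagram.
#[derive(Debug, Clone, PartialEq)]
struct EntityRecord {
    description: String,
    importance: f32,
    source_ids: Vec<String>,
    last_updated: u64, // Unix seconds, supplied by the caller
}

/// Create the node when the key is new, otherwise merge into the existing node.
fn upsert_entity(graph: &mut HashMap<String, EntityRecord>, key: &str, incoming: EntityRecord) {
    match graph.entry(key.to_string()) {
        Entry::Vacant(slot) => {
            slot.insert(incoming);
        }
        Entry::Occupied(mut slot) => {
            let existing = slot.get_mut();
            // 1. Combine descriptions, skipping text that is already present.
            if !existing.description.contains(incoming.description.as_str()) {
                existing.description.push(' ');
                existing.description.push_str(&incoming.description);
            }
            // 2. Keep the larger importance score.
            existing.importance = existing.importance.max(incoming.importance);
            // 3. Append source ids that have not been recorded yet.
            for id in incoming.source_ids {
                if !existing.source_ids.contains(&id) {
                    existing.source_ids.push(id);
                }
            }
            // 4. Update the timestamp.
            existing.last_updated = existing.last_updated.max(incoming.last_updated);
        }
    }
}
```

Calling `upsert_entity` twice with the key `SARAH_CHEN` leaves a single record whose description, importance, and source list reflect both occurrences.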
EdgeQuake supports two strategies for combining entity descriptions:
When the same entity is described differently in multiple chunks, the LLM synthesizes a unified description:
┌─────────────────────────────────────────────────────────────────┐
│ LLM DESCRIPTION MERGE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Existing: "Sarah Chen is a professor at MIT" │
│ New: "Dr. Chen researches climate modeling" │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ LLM Summarizer │ │
│ │ │ │
│ │ Prompt: │ │
│ │ "Merge these │ │
│ │ descriptions │ │
│ │ for SARAH_CHEN│ │
│ │ into a single │ │
│ │ coherent │ │
│ │ description" │ │
│ └────────┬────────┘ │
│ │ │
│ ▼ │
│ Merged: "Sarah Chen is a professor at MIT who researches │
│ climate modeling" │
│ │
└─────────────────────────────────────────────────────────────────┘
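A rough sketch of how the LLM merge step might be wired up. The `Summarizer` trait, `build_merge_prompt`, and `llm_merge` are hypothetical names introduced for illustration, and the prompt wording is taken from the diagram rather than from EdgeQuake's actual prompt template.

```rust
/// Hypothetical abstraction over whichever LLM client is in use.
trait Summarizer {
    fn summarize(&self, prompt: &str) -> Result<String, String>;
}

/// Build a merge prompt for one entity, listing each description on its own line.
fn build_merge_prompt(entity: &str, descriptions: &[&str]) -> String {
    let mut prompt = format!(
        "Merge these descriptions for {} into a single coherent description:\n",
        entity
    );
    for description in descriptions {
        prompt.push_str("- ");
        prompt.push_str(description);
        prompt.push('\n');
    }
    prompt
}

/// Ask the summarizer for a merged description.
/// Returns None on an error or an empty reply so the caller can fall back
/// to a non-LLM strategy.
fn llm_merge(llm: &dyn Summarizer, entity: &str, existing: &str, new: &str) -> Option<String> {
    let prompt = build_merge_prompt(entity, &[existing, new]);
    llm.summarize(&prompt)
        .ok()
        .filter(|reply| !reply.trim().is_empty())
}
```

Returning `Option` rather than an error keeps the fallback decision with the caller, which is one reasonable design among several.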
If the LLM is unavailable or the call fails, descriptions are concatenated, skipping text that is already present:
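fn merge_descriptions(old: &str, new: &str, max_len: usize) -> String {
    if old.contains(new) {
        return old.to_string(); // Avoid duplicates
    }
    let merged = format!("{} {}", old, new);
    if merged.len() > max_len {
        // max_len is measured in bytes; back up to a char boundary so the
        // slice never panics in the middle of a multi-byte UTF-8 character
        let mut end = max_len;
        while !merged.is_char_boundary(end) {
            end -= 1;
        }
        merged[..end].to_string()
    } else {
        merged
    }
}

EdgeQuake maintains provenance for all entity occurrences: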
{
  "id": "SARAH_CHEN",
  "entity_type": "PERSON",
  "description": "Professor at MIT researching climate modeling",
  "source_ids": "chunk_001|chunk_042|chunk_089",
  "source_document_ids": ["doc_1", "doc_2"],
  "importance": 0.85,
  "first_seen": "2024-01-15T10:30:00Z",
  "last_updated": "2024-01-15T11:45:00Z"
}

This enables:
- Citation tracking: Link answers back to source documents
- Cascade delete: Remove an entity when its source documents are deleted
- Confidence scoring: More sources = higher confidence
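Since `source_ids` is stored as a pipe-delimited string, appending a new chunk id means splitting, de-duplicating, and re-joining. A minimal sketch, assuming a hypothetical helper `append_source_id` and assuming the oldest ids are dropped first once the cap is reached (the real eviction policy is not stated in this guide):

```rust
/// Append `new_id` to a pipe-delimited id string.
/// Duplicates are skipped, and only the most recent `max_sources` ids are kept.
fn append_source_id(existing: &str, new_id: &str, max_sources: usize) -> String {
    // Ignore empty segments so an empty input string yields an empty list.
    let mut ids: Vec<&str> = existing.split('|').filter(|s| !s.is_empty()).collect();
    if !ids.contains(&new_id) {
        ids.push(new_id);
    }
    // Assumed policy: drop from the front (oldest) when over the cap.
    let overflow = ids.len().saturating_sub(max_sources);
    ids[overflow..].join("|")
}
```

For example, appending `chunk_089` to `chunk_001|chunk_042` with a cap of 10 produces the `source_ids` value shown in the record above.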
The MergerConfig struct controls merging behavior:
pub struct MergerConfig {
    pub max_description_length: usize, // Default: 4096
    pub description_decay: f32,        // Default: 0.9
    pub min_importance: f32,           // Default: 0.1
    pub max_sources: usize,            // Default: 10
    pub use_llm_summarization: bool,   // Default: true
}

| Parameter | Default | Description | Tuning Recommendation |
|---|---|---|---|
| max_description_length | 4096 | Max chars in merged description | Increase for detailed entities |
| description_decay | 0.9 | Weight decay for older descriptions | Lower = newer descriptions preferred |
| min_importance | 0.1 | Entities below this are pruned | Raise to reduce noise |
| max_sources | 10 | Max source_ids tracked per entity | Increase for better lineage |
| use_llm_summarization | true | Use LLM for description merging | Disable for faster, cheaper merging |
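The documented defaults map naturally onto a `Default` implementation, which lets callers override only the fields they care about. The sketch below redeclares the struct so it compiles on its own; whether the real crate already provides `Default` is an assumption, and `fast_ingest_config` with its values is a hypothetical preset.

```rust
/// Local mirror of the struct shown above, so the sketch is self-contained.
#[derive(Debug, Clone, PartialEq)]
pub struct MergerConfig {
    pub max_description_length: usize,
    pub description_decay: f32,
    pub min_importance: f32,
    pub max_sources: usize,
    pub use_llm_summarization: bool,
}

impl Default for MergerConfig {
    /// Defaults taken from the parameter table.
    fn default() -> Self {
        Self {
            max_description_length: 4096,
            description_decay: 0.9,
            min_importance: 0.1,
            max_sources: 10,
            use_llm_summarization: true,
        }
    }
}

/// Hypothetical preset for bulk ingestion: skip the LLM and prune more noise.
pub fn fast_ingest_config() -> MergerConfig {
    MergerConfig {
        use_llm_summarization: false,
        min_importance: 0.2,
        ..MergerConfig::default()
    }
}
```

Struct update syntax (`..MergerConfig::default()`) keeps such presets short and means a newly added field picks up its default automatically.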
Some characters are preserved to maintain meaning:
| Input | Output | Note |
|---|---|---|
| "New-York" | NEW-YORK | Hyphens preserved |
| "C++" | C++ | Programming language syntax |
| "O'Brien" | O'BRIEN | Irish names |
| "AT&T" | AT&T | Ampersand preserved |
Acronyms normalize to uppercase naturally:
| Input | Output |
|---|---|
| "MIT" | MIT |
| "N.A.S.A." | N.A.S.A. |
| "NATO" | NATO |
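The rows in the tables above can be checked against a compact, self-contained version of the pipeline. This is a sketch rather than the EdgeQuake source: `normalize_sketch` is a hypothetical name, and the title-casing step is left out because the final uppercase pass overrides it for the ASCII inputs used here.

```rust
/// Compact sketch of the normalization steps: trim, strip one leading
/// article, split on whitespace, drop possessives, join, uppercase.
fn normalize_sketch(raw: &str) -> String {
    let trimmed = raw.trim();
    // Strip the first matching article prefix, if any.
    let without_prefix = ["The ", "the ", "A ", "An "]
        .iter()
        .find_map(|prefix| trimmed.strip_prefix(*prefix))
        .unwrap_or(trimmed);
    without_prefix
        .split_whitespace()
        // Drop a trailing possessive from each word.
        .map(|word| word.strip_suffix("'s").unwrap_or(word))
        .collect::<Vec<_>>()
        .join("_")
        .to_uppercase()
}
```

Only whitespace is treated as a separator, which is why hyphens, periods, ampersands, and apostrophes inside a word pass through unchanged.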
normalize_entity_name("") // → ""
normalize_entity_name(" ") // → ""
normalize_entity_name("The") // → "THE" (single word kept)
normalize_entity_name("A") // → "A"

From production benchmarks:
| Scenario | Raw Entities | After Normalization | Dedup Rate |
|---|---|---|---|
| Scientific papers | 50 | 32 | 36% |
| News articles | 80 | 48 | 40% |
| Legal documents | 120 | 85 | 29% |
| Mixed corpus | 200 | 128 | 36% |
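The Dedup Rate column is the share of raw entities that disappear after normalization, that is, (raw - normalized) / raw. A small sketch of that arithmetic, with `dedup_rate` as a hypothetical helper name:

```rust
/// Percentage of raw entities removed by normalization.
/// Returns 0.0 for an empty corpus to avoid dividing by zero.
fn dedup_rate(raw: usize, normalized: usize) -> f64 {
    if raw == 0 {
        return 0.0;
    }
    raw.saturating_sub(normalized) as f64 / raw as f64 * 100.0
}
```

For the scientific papers row, (50 - 32) / 50 = 0.36, which matches the 36% in the table; the legal documents row gives 35 / 120, about 29.2%, shown rounded to 29%.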
┌─────────────────────────────────────────────────────────────────┐
│ QUALITY IMPROVEMENT │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Query: "What did Sarah Chen publish?" │
│ │
│ Without Normalization: │
│ ───────────────────── │
│ Found: 1 paper (from "Sarah Chen" node only) │
│ Missed: 2 papers (from "sarah chen" and "S. Chen" nodes) │
│ Recall: 33% │
│ │
│ With Normalization: │
│ ────────────────── │
│ Found: 3 papers (all linked to SARAH_CHEN) │
│ Missed: 0 │
│ Recall: 100% │
│ │
│ IMPROVEMENT: 3x better recall │
│ │
└─────────────────────────────────────────────────────────────────┘
Clean input text before extraction to improve entity quality:
// Before extraction
let text = text
    .replace("Dr. ", "")   // Remove titles
    .replace("Prof. ", "") // Remove titles
    .replace("Mr. ", "")   // Remove honorifics
    .replace("Mrs. ", "");

Use consistent entity types across documents:
| Good | Bad |
|---|---|
| PERSON | Person, person, HUMAN |
| ORGANIZATION | Org, Company, COMPANY |
| LOCATION | Place, Location, GEO |
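One way to enforce this is to map extracted type labels onto a small canonical set before storage. The sketch below is illustrative only: `canonical_entity_type` is a hypothetical function, and the synonym lists are taken from the "Bad" column above rather than from any EdgeQuake configuration.

```rust
/// Map a raw type label to a canonical entity type.
/// Returns None for labels outside the known set so callers can decide
/// whether to reject, log, or store them as-is.
fn canonical_entity_type(raw: &str) -> Option<&'static str> {
    match raw.trim().to_uppercase().as_str() {
        "PERSON" | "HUMAN" => Some("PERSON"),
        "ORGANIZATION" | "ORG" | "COMPANY" => Some("ORGANIZATION"),
        "LOCATION" | "PLACE" | "GEO" => Some("LOCATION"),
        _ => None,
    }
}
```

Returning `None` for unknown labels, instead of guessing, makes unexpected types visible during ingestion.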
For known aliases, consider pre-normalization mapping:
fn apply_aliases(name: &str) -> String {
    // Every arm must return String, so the literal aliases are converted
    match name.to_uppercase().as_str() {
        "USA" | "US" | "AMERICA" => "UNITED_STATES".to_string(),
        "NYC" | "NEW YORK CITY" => "NEW_YORK".to_string(),
        _ => normalize_entity_name(name),
    }
}

Track these metrics in production:
- Dedup rate: Should be 25-50% for typical corpora
- False merges: Manually review a sample of merged entities for incorrect merges
- Description quality: Check that merged descriptions are coherent
- LightRAG Algorithm - The full extraction pipeline
- Entity Extraction - How entities are identified
- Knowledge Graph - Graph structure
- Query Modes - How normalized entities are queried