- Source Repository: BUPT-GAMMA/PathRAG
- Analyzed Commit:
32567bfc(2025-12-17) - Initial Document Date: 2026-03-16
- Last Updated: 2026-03-19
PathRAG is a Graph-based Retrieval Augmented Generation (RAG) system that enhances traditional RAG by introducing relational path pruning through knowledge graphs. Unlike standard RAG which retrieves isolated text chunks, PathRAG discovers and leverages multi-hop semantic paths between entities to provide richer, more contextually connected answers.
This document provides a technical blueprint for understanding, modifying, or reimplementing PathRAG's core concepts.
- Architecture Overview
- Core Data Structures
- System Components
- Indexing Pipeline (Document Ingestion)
- Query Pipeline (The PathRAG Algorithm)
- Path Finding & Pruning Algorithm
- Prompt Engineering
- Configuration Parameters
- Adaptation Guide
┌─────────────────────────────────────────────────────────────────────────────┐
│ PATHRAG SYSTEM │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ INDEXING PIPELINE │ │
│ │ │ │
│ │ Documents → Chunking → Entity Extraction → Graph Construction │ │
│ │ ↓ ↓ │ │
│ │ Entity Vector DB Relationship Vector DB │ │
│ │ ↓ ↓ │ │
│ │ Knowledge Graph (NetworkX DiGraph) │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ QUERY PIPELINE │ │
│ │ │ │
│ │ Query → Keyword Extraction → Dual Context Retrieval │ │
│ │ ↓ │ │
│ │ ┌──────────────────┴──────────────────┐ │ │
│ │ ↓ ↓ │ │
│ │ Low-Level Context High-Level Context │ │
│ │ (Entity-based retrieval) (Relationship-based) │ │
│ │ ↓ ↓ │ │
│ │ Direct relationships PATH FINDING ALGORITHM │ │
│ │ ↓ (1-hop, 2-hop, 3-hop) │ │
│ │ └──────────────────┬──────────────────┘ │ │
│ │ ↓ │ │
│ │ Context Combination │ │
│ │ ↓ │ │
│ │ LLM Response Generation │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────┐
│ STORAGE LAYER │
├──────────────────┬──────────────────────┬───────────────────────────────┤
│ KV Storage │ Vector Storage │ Graph Storage │
├──────────────────┼──────────────────────┼───────────────────────────────┤
│ • full_docs │ • entities_vdb │ • chunk_entity_relation_graph │
│ • text_chunks │ • relationships_vdb │ │
│ • llm_cache │ • chunks_vdb │ │
├──────────────────┼──────────────────────┼───────────────────────────────┤
│ JsonKVStorage │ NanoVectorDBStorage │ NetworkXStorage │
│ (file-based) │ (cosine similarity) │ (DiGraph + GraphML) │
└──────────────────┴──────────────────────┴───────────────────────────────┘
TextChunkSchema = TypedDict(
"TextChunkSchema",
{
"tokens": int, # Number of tokens in chunk
"content": str, # The text content
"full_doc_id": str, # Reference to parent document
"chunk_order_index": int # Position in original document
}
)EntityNode = {
"entity_type": str, # e.g., "person", "organization", "geo", "event", "category"
"description": str, # LLM-generated description of the entity
"source_id": str # Chunk IDs where this entity was found (separated by "<SEP>")
}RelationshipEdge = {
"weight": float, # Relationship strength (summed across occurrences)
"description": str, # LLM-generated relationship description
"keywords": str, # High-level keywords summarizing the relationship
"source_id": str # Chunk IDs where this relationship was found
}@dataclass
class QueryParam:
mode: Literal["hybrid"] = "hybrid" # Query mode
only_need_context: bool = False # Return only context, no LLM call
only_need_prompt: bool = False # Return the final prompt
response_type: str = "Multiple Paragraphs" # Expected response format
stream: bool = False # Stream response
top_k: int = 40 # Number of results to retrieve
max_token_for_text_unit: int = 4000 # Token budget for source texts
max_token_for_global_context: int = 3000 # Token budget for high-level context
max_token_for_local_context: int = 5000 # Token budget for low-level context| Component | File | Purpose |
|---|---|---|
| PathRAG | PathRAG.py |
Main orchestrator class, manages all storage and coordinates pipelines |
| Chunking | operate.py: chunking_by_token_size() |
Splits documents into overlapping token-sized chunks |
| Entity Extraction | operate.py: extract_entities() |
LLM-based extraction of entities and relationships |
| KG Query | operate.py: kg_query() |
Main query handler with keyword extraction and context building |
| Path Finding | operate.py: find_paths_and_edges_with_stats() |
Core PathRAG algorithm - finds multi-hop paths |
| Storage Classes | storage.py |
KV, Vector, and Graph storage implementations |
| LLM Integration | llm.py |
Provider-agnostic LLM/embedding support via LiteLLM (OpenAI, Gemini, Bedrock, Anthropic, Ollama, etc.) |
| Prompts | prompt.py |
All LLM prompt templates |
BaseKVStorage (Abstract)
├── JsonKVStorage # File-based JSON storage
├── MongoKVStorage # MongoDB backend
├── OracleKVStorage # Oracle DB backend
└── TiDBKVStorage # TiDB backend
BaseVectorStorage (Abstract)
├── NanoVectorDBStorage # Lightweight local vector DB
├── MilvusVectorDBStorge # Milvus backend
├── ChromaVectorDBStorage # ChromaDB backend
└── OracleVectorDBStorage # Oracle backend
BaseGraphStorage (Abstract)
├── NetworkXStorage # In-memory NetworkX DiGraph
├── Neo4JStorage # Neo4j graph database
├── OracleGraphStorage # Oracle graph backend
└── AGEStorage # Apache AGE (PostgreSQL)
Input Documents
│
▼
┌─────────────────────────────────────┐
│ Step 1: Document Deduplication │
│ - MD5 hash each document │
│ - Skip already indexed docs │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ Step 2: Chunking │
│ - Token-based chunking (1200 tok) │
│ - Overlap (100 tokens) │
│ - Track chunk order │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ Step 3: Entity Extraction │
│ - LLM extracts entities │
│ - LLM extracts relationships │
│ - Multi-gleaning for completeness │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ Step 4: Knowledge Graph Update │
│ - Merge duplicate entities │
│ - Merge duplicate relationships │
│ - Summarize long descriptions │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ Step 5: Vector DB Indexing │
│ - Embed entities (name + desc) │
│ - Embed relationships (keywords) │
│ - Embed chunks (content) │
└─────────────────────────────────────┘
def chunking_by_token_size(
content: str,
overlap_token_size: int = 128,
max_token_size: int = 1024,
tiktoken_model: str = "gpt-4o"
) -> list[dict]:
"""
Split document into overlapping chunks by token count.
Algorithm:
1. Tokenize entire content using tiktoken
2. Create sliding window with step = (max_token_size - overlap_token_size)
3. Decode each window back to text
4. Return list of {tokens, content, chunk_order_index}
"""
tokens = encode_string_by_tiktoken(content, model_name=tiktoken_model)
results = []
step = max_token_size - overlap_token_size
for index, start in enumerate(range(0, len(tokens), step)):
chunk_tokens = tokens[start : start + max_token_size]
chunk_content = decode_tokens_by_tiktoken(chunk_tokens, model_name=tiktoken_model)
results.append({
"tokens": len(chunk_tokens),
"content": chunk_content.strip(),
"chunk_order_index": index,
})
return resultsasync def extract_entities(chunks, knowledge_graph, entity_vdb, relationships_vdb):
"""
Multi-gleaning entity extraction from text chunks.
Process:
1. For each chunk, send to LLM with entity_extraction prompt
2. Parse structured output (entities and relationships)
3. Optionally "glean" again (ask LLM if more entities exist)
4. Merge duplicate entities across chunks
5. Merge duplicate relationships
6. Upsert to graph and vector databases
"""
for chunk in chunks:
# Initial extraction
result = await llm(entity_extraction_prompt.format(chunk.content))
# Multi-gleaning loop (default: 1 iteration)
for i in range(max_gleaning):
glean_result = await llm(continue_extraction_prompt)
result += glean_result
# Ask if more entities exist
if await llm(if_loop_prompt) != "yes":
break
# Parse entities: ("entity"<|>NAME<|>TYPE<|>DESCRIPTION)
# Parse relationships: ("relationship"<|>SRC<|>TGT<|>DESC<|>KEYWORDS<|>WEIGHT)
entities, relationships = parse_extraction_result(result)
# Merge and upsert
for entity_name, entity_occurrences in all_entities.items():
merged_entity = merge_entity_descriptions(entity_occurrences)
await knowledge_graph.upsert_node(entity_name, merged_entity)
for (src, tgt), edge_occurrences in all_relationships.items():
merged_edge = merge_edge_descriptions(edge_occurrences)
await knowledge_graph.upsert_edge(src, tgt, merged_edge)When the same entity or relationship is found in multiple chunks:
Entity Merging:
entity_type: Most frequent type across occurrencesdescription: Concatenate all descriptions, then summarize if too longsource_id: Union of all source chunk IDs
Relationship Merging:
weight: Sum of all weights (stronger evidence = higher weight)description: Concatenate and summarizekeywords: Union of all keywordssource_id: Union of all source chunk IDs
User Query: "How does A relate to B?"
│
▼
┌──────────────────────────────────────────┐
│ Step 1: Keyword Extraction │
│ LLM splits query into: │
│ - high_level_keywords: themes, concepts │
│ - low_level_keywords: specific entities │
└──────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────┐
│ Step 2: Dual Context Retrieval │
│ │
│ LOW-LEVEL (Local): │
│ - Search entity_vdb for low_keywords │
│ - Get entity nodes and their edges │
│ - Find connected text chunks │
│ - **Run PATH FINDING algorithm** │
│ │
│ HIGH-LEVEL (Global): │
│ - Search relationship_vdb for keywords │
│ - Get relationship edges │
│ - Find connected entities │
│ - Find source text chunks │
└──────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────┐
│ Step 3: Context Combination │
│ - Merge HL and LL entities │
│ - Merge HL and LL relationships │
│ - Merge source texts │
│ - Format as CSV tables │
└──────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────┐
│ Step 4: LLM Response Generation │
│ - System prompt with context tables │
│ - User query as input │
│ - Generate markdown response │
└──────────────────────────────────────────┘
The query is first decomposed into two types of keywords:
# Example Query: "What is the relationship between Apple and Steve Jobs?"
high_level_keywords = ["Corporate leadership", "Company founding", "Technology industry"]
low_level_keywords = ["Apple", "Steve Jobs", "CEO", "founder"]High-level keywords drive relationship-based (global) retrieval. Low-level keywords drive entity-based (local) retrieval.
The final context provided to the LLM is structured as CSV tables:
-----global-information-----
-----high-level entity information-----
id,entity,type,description,rank
0,"APPLE","organization","Apple Inc. is a technology company...",45
1,"STEVE JOBS","person","Steve Jobs was co-founder and CEO...",38
-----high-level relationship information-----
id,source,target,description,keywords,weight,rank
0,"STEVE JOBS","APPLE","Founded Apple Computer in 1976","founding, leadership",3.5,83
-----Sources-----
id,content
0,"Steve Jobs and Steve Wozniak founded Apple Computer..."
-----local-information-----
-----low-level entity information-----
id,entity,type,description,rank
...
-----low-level relationship information-----
id,context
0,"The entity STEVE JOBS is a person with description(...)through edge(founded) to connect to STEVE JOBS and APPLE."
This is what makes PathRAG different from standard graph-based RAG.
Instead of just returning direct entity relationships, PathRAG:
- Finds ALL paths (1-hop, 2-hop, 3-hop) between retrieved entities
- Weights paths by traversal frequency using BFS
- Prunes low-confidence paths using a threshold
- Generates natural language descriptions of multi-hop connections
async def find_paths_and_edges_with_stats(graph, target_nodes):
"""
Find all paths (up to 3 hops) between target nodes using DFS.
Args:
graph: NetworkX graph
target_nodes: List of entity names from vector search results
Returns:
result: Dict mapping (source, target) -> {paths: [...], edges: [...]}
path_stats: {"1-hop": count, "2-hop": count, "3-hop": count}
one_hop_paths, two_hop_paths, three_hop_paths: Separate path lists
"""
result = defaultdict(lambda: {"paths": [], "edges": set()})
path_stats = {"1-hop": 0, "2-hop": 0, "3-hop": 0}
async def dfs(current, target, path, depth):
# Stop at 3 hops maximum
if depth > 3:
return
# Found target - record the path
if current == target:
result[(path[0], target)]["paths"].append(list(path))
# Record all edges in path
for u, v in zip(path[:-1], path[1:]):
result[(path[0], target)]["edges"].add(tuple(sorted((u, v))))
# Categorize by hop count
if depth == 1:
path_stats["1-hop"] += 1
one_hop_paths.append(path)
elif depth == 2:
path_stats["2-hop"] += 1
two_hop_paths.append(path)
elif depth == 3:
path_stats["3-hop"] += 1
three_hop_paths.append(path)
return
# Continue DFS to neighbors
for neighbor in graph.neighbors(current):
if neighbor not in path: # Avoid cycles
await dfs(neighbor, target, path + [neighbor], depth + 1)
# Find paths between all pairs of target nodes
for node1 in target_nodes:
for node2 in target_nodes:
if node1 != node2:
await dfs(node1, node2, [node1], 0)
return result, path_stats, one_hop_paths, two_hop_paths, three_hop_pathsdef bfs_weighted_paths(G, paths, source, target, threshold=0.3, alpha=0.8):
"""
Weight paths based on traversal frequency using BFS.
Key insight: Paths that are traversed more frequently (appear in more
DFS results) are considered more important/confident.
Parameters:
threshold: Minimum weight to continue path exploration (0.3)
alpha: Decay factor for each hop (0.8)
Algorithm:
1. Build follow_dict: for each node, which nodes follow it in paths
2. Starting from source, assign weight 1/num_followers to each edge
3. Only continue to next hop if weight > threshold
4. Apply alpha decay at each hop
5. Return paths with their accumulated weights
"""
edge_weights = defaultdict(float)
follow_dict = {}
# Build follower relationships from all paths
for p in paths:
for i in range(len(p) - 1):
current, next_node = p[i], p[i + 1]
if current not in follow_dict:
follow_dict[current] = set()
follow_dict[current].add(next_node)
results = []
# First hop: weight = 1 / number_of_followers
for neighbor in follow_dict[source]:
edge_weights[(source, neighbor)] += 1 / len(follow_dict[source])
if neighbor == target:
results.append([source, neighbor])
continue
# Only continue if edge weight > threshold
if edge_weights[(source, neighbor)] > threshold:
# Second hop: apply alpha decay
for second_neighbor in follow_dict.get(neighbor, []):
weight = edge_weights[(source, neighbor)] * alpha / len(follow_dict[neighbor])
edge_weights[(neighbor, second_neighbor)] += weight
if second_neighbor == target:
results.append([source, neighbor, second_neighbor])
continue
# Third hop (if weight still above threshold)
if edge_weights[(neighbor, second_neighbor)] > threshold:
for third_neighbor in follow_dict.get(second_neighbor, []):
weight = edge_weights[(neighbor, second_neighbor)] * alpha / len(follow_dict[second_neighbor])
edge_weights[(second_neighbor, third_neighbor)] += weight
if third_neighbor == target:
results.append([source, neighbor, second_neighbor, third_neighbor])
# Calculate path weights (average edge weight)
path_weights = []
for p in paths:
total_weight = sum(edge_weights.get((p[i], p[i+1]), 0) for i in range(len(p)-1))
path_weights.append(total_weight / (len(p) - 1))
return list(zip(paths, path_weights))async def _find_most_related_edges_from_entities3(node_datas, query_param, knowledge_graph_inst):
"""
Convert graph paths to natural language relationship descriptions.
This is the key innovation: instead of returning raw graph data,
PathRAG generates human-readable path descriptions.
"""
# 1. Build NetworkX graph from storage
G = nx.Graph()
edges = await knowledge_graph_inst.edges()
nodes = await knowledge_graph_inst.nodes()
for u, v in edges:
G.add_edge(u, v)
G.add_nodes_from(nodes)
# 2. Find all paths between retrieved entities
source_nodes = [dp["entity_name"] for dp in node_datas]
result, path_stats, one_hop_paths, two_hop_paths, three_hop_paths = (
await find_paths_and_edges_with_stats(G, source_nodes)
)
# 3. Weight and prune paths
threshold = 0.3 # Pruning threshold
alpha = 0.8 # Hop decay factor
all_results = []
for node1 in source_nodes:
for node2 in source_nodes:
if node1 != node2 and (node1, node2) in result:
paths = result[(node1, node2)]["paths"]
weighted = bfs_weighted_paths(G, paths, node1, node2, threshold, alpha)
all_results += weighted
# 4. Sort by weight and deduplicate
all_results = sorted(all_results, key=lambda x: x[1], reverse=True)
seen = set()
result_edge = []
for edge, weight in all_results:
sorted_edge = tuple(sorted(edge))
if sorted_edge not in seen:
seen.add(sorted_edge)
result_edge.append((edge, weight))
# 5. Select top paths (max 15)
total_edges = min(15, len(results))
sort_result = result_edge[:total_edges] if result_edge else []
final_result = [edge for edge, weight in sort_result]
# 6. Convert to natural language using f-strings
relationship = []
for path in final_result:
if len(path) == 2:
# 1-hop: A --[relation]--> B
s_name, t_name = path[0], path[1]
edge0 = (
await knowledge_graph_inst.get_edge(path[0], path[1])
or await knowledge_graph_inst.get_edge(path[1], path[0])
)
if edge0 is None:
continue
s = await knowledge_graph_inst.get_node(s_name)
t = await knowledge_graph_inst.get_node(t_name)
desc = (
f"The entity {s_name} is a {s['entity_type']} "
f"with description({s['description']}) "
f"through edge({edge0['keywords']}) to connect to {s_name} and {t_name}. "
f"The entity {t_name} is a {t['entity_type']} "
f"with description({t['description']})"
)
relationship.append([desc])
elif len(path) == 3:
# 2-hop: A --[rel1]--> B --[rel2]--> C
s_name, b_name, t_name = path[0], path[1], path[2]
# ... similar construction with bridge entity
elif len(path) == 4:
# 3-hop: A --[rel1]--> B1 --[rel2]--> B2 --[rel3]--> C
s_name, b1_name, b2_name, t_name = path[0], path[1], path[2], path[3]
# ... similar construction with two bridge entities
return relationship[::-1]| Parameter | Value | Purpose |
|---|---|---|
threshold |
0.3 | Minimum edge weight to continue path exploration |
alpha |
0.8 | Decay factor applied at each hop |
max_hops |
3 | Maximum path length to consider |
max_paths |
15 | Maximum number of paths to return |
Effect of threshold (0.3):
- Paths with edge weight < 0.3 are pruned
- This eliminates weak/spurious connections
- Higher threshold = stricter pruning, fewer paths
Effect of alpha (0.8):
- Each hop reduces confidence by 20%
- 1-hop: weight * 1.0
- 2-hop: weight * 0.8
- 3-hop: weight * 0.64
- Prefers shorter paths over longer ones
-Goal-
Given a text document and entity types, identify all entities and relationships.
-Steps-
1. Identify entities with: name, type, description
Format: ("entity"<|>NAME<|>TYPE<|>DESCRIPTION)
2. Identify relationships with: source, target, description, keywords, strength
Format: ("relationship"<|>SRC<|>TGT<|>DESC<|>KEYWORDS<|>STRENGTH)
3. Identify content-level keywords
Format: ("content_keywords"<|>KEYWORDS)
Entity types: [organization, person, geo, event, category]
-Role-
Identify high-level and low-level keywords in user queries.
-Goal-
- high_level_keywords: overarching concepts, themes
- low_level_keywords: specific entities, details
Output JSON format:
{
"high_level_keywords": [...],
"low_level_keywords": [...]
}
-Role-
Answer questions using data from provided tables.
-Goal-
Generate response summarizing information from tables.
Don't make anything up. Say "I don't know" if unsure.
-Data tables-
{context_data} // CSV formatted context
Style response in markdown.
| Delimiter | Value | Purpose |
|---|---|---|
| Tuple delimiter | <|> |
Separates fields within a record |
| Record delimiter | ## |
Separates records in output |
| Completion delimiter | <|COMPLETE|> |
Signals end of extraction |
| Field separator | <SEP> |
Separates multiple source IDs |
chunk_token_size: int = 1200 # Maximum tokens per chunk
chunk_overlap_token_size: int = 100 # Overlap between consecutive chunks
tiktoken_model_name: str = "gpt-4o-mini" # Model for tokenization
# Note: tiktoken falls back to cl100k_base encoding for non-OpenAI models
# (e.g. Gemini, Bedrock), providing a reasonable token-count approximation.entity_extract_max_gleaning: int = 1 # Number of "continue extraction" iterations
entity_summary_to_max_tokens: int = 500 # Max tokens for entity descriptions# Embedding model is configured via embedding_model_name and embedding_dim.
# Uses LiteLLM, so any supported provider works.
# Examples:
# OpenAI : embedding_model_name="text-embedding-3-small", embedding_dim=1536
# Gemini : embedding_model_name="gemini/gemini-embedding-001", embedding_dim=3072
# Bedrock : embedding_model_name="bedrock/amazon.titan-embed-text-v2:0", embedding_dim=1024
embedding_model_name: str = "text-embedding-3-small" # LiteLLM model name
embedding_dim: int = 1536 # Must match the chosen model
embedding_func: EmbeddingFunc = None # Auto-created from above; or pass a custom one
embedding_batch_num: int = 32 # Batch size for embedding
embedding_func_max_async: int = 16 # Max concurrent embedding calls# LLM is configured via llm_model_func and llm_model_name.
# Default uses LiteLLM (litellm_complete), supporting any provider.
# Examples:
# OpenAI : llm_model_name="gpt-4o"
# Gemini : llm_model_name="gemini/gemini-2.5-flash"
# Anthropic : llm_model_name="anthropic/claude-sonnet-4-20250514"
# Bedrock : llm_model_name="bedrock/anthropic.claude-3-haiku-20240307-v1:0"
llm_model_func: callable = litellm_complete
llm_model_name: str = "gpt-4o"
llm_model_max_token_size: int = 32768
llm_model_max_async: int = 16top_k: int = 40 # Results from vector search
max_token_for_text_unit: int = 4000 # Token budget for sources
max_token_for_global_context: int = 3000 # Token budget for HL context
max_token_for_local_context: int = 5000 # Token budget for LL contextcosine_better_than_threshold: float = 0.2 # Minimum similarity for retrieval-
Dual Vector Databases
- One for entities (searchable by entity name + description)
- One for relationships (searchable by keywords + src + tgt + description)
-
Knowledge Graph with Provenance
- Every node and edge tracks
source_id(which chunks it came from) - This enables source attribution in answers
- Every node and edge tracks
-
Hybrid Query Strategy
- Low-level: Entity-centric retrieval → direct relationships
- High-level: Relationship-centric retrieval → connected entities
-
Multi-hop Path Finding
- Don't just return direct neighbors
- Find paths up to 3 hops connecting query-relevant entities
- Weight by frequency, prune by threshold
-
Natural Language Path Descriptions
- Convert graph paths to human-readable text
- Include entity types and descriptions for context
class MinimalPathRAG:
def __init__(self):
self.entity_vdb = VectorDB() # For entity search
self.relation_vdb = VectorDB() # For relationship search
self.graph = nx.DiGraph() # Knowledge graph
self.chunks = {} # Source text storage
def index(self, documents):
# 1. Chunk documents
chunks = chunk_documents(documents)
# 2. Extract entities and relationships (LLM)
for chunk in chunks:
entities, relations = extract_kg(chunk)
# 3. Update graph
for entity in entities:
self.graph.add_node(entity.name, **entity.data)
for relation in relations:
self.graph.add_edge(relation.src, relation.tgt, **relation.data)
# 4. Index in vector DBs
self.entity_vdb.add(entities)
self.relation_vdb.add(relations)
def query(self, question):
# 1. Extract keywords
hl_keywords, ll_keywords = extract_keywords(question)
# 2. Entity retrieval (low-level)
entities = self.entity_vdb.search(ll_keywords, top_k=40)
# 3. Relationship retrieval (high-level)
relations = self.relation_vdb.search(hl_keywords, top_k=40)
# 4. PATH FINDING (the key innovation)
entity_names = [e.name for e in entities]
paths = find_all_paths(self.graph, entity_names, max_hops=3)
weighted_paths = weight_paths(paths, threshold=0.3, alpha=0.8)
path_descriptions = paths_to_text(weighted_paths, self.graph)
# 5. Gather source texts
sources = get_source_texts(entities, relations, self.chunks)
# 6. Generate response
context = format_context(entities, relations, path_descriptions, sources)
return llm_generate(question, context)| Component | How to Customize |
|---|---|
| Entity Types | Modify PROMPTS["DEFAULT_ENTITY_TYPES"] |
| Extraction Prompt | Edit PROMPTS["entity_extraction"] |
| Path Parameters | Adjust threshold, alpha, max_hops |
| Storage Backend | Implement BaseKVStorage, BaseVectorStorage, BaseGraphStorage |
| LLM Provider | Set llm_model_name to any LiteLLM-supported model (e.g. gemini/gemini-2.5-flash), or add a custom function in llm.py |
| Embedding Model | Set embedding_model_name and embedding_dim (e.g. gemini/gemini-embedding-001, 3072), or provide a custom embedding_func |
-
Path finding is O(n^k) where n=entities and k=max_hops
- Limit
top_kretrieval to control n - Cap at 3 hops (exponential growth)
- Limit
-
Vector DB choice matters
- NanoVectorDB: Good for <100k entities
- Milvus/Chroma: Better for large scale
-
LLM caching
- PathRAG caches LLM responses by query hash
- Implement
embedding_cache_configfor similar query detection - Remote backend caveat: The cache stores all responses as a nested JSON dict keyed by
mode(seehandle_cache/save_to_cacheinutils.py). On every cache read or write the entire dict must be fetched from the KV store, updated in memory, and written back. This read-modify-write pattern is efficient for the localJsonKVStorage(in-memory dict backed by a single JSON file), but causes unnecessary network round-trips and growing payload sizes with remote backends such as Spanner. When using a remote KV backend, setenable_llm_cache=Falseto avoid this overhead.
-
Async everywhere
- All operations are async for parallelization
- Use
limit_async_func_call()to prevent rate limiting
PathRAG's key innovation is the relational path pruning algorithm that:
- Retrieves entities and relationships via dual vector search
- Finds multi-hop paths (1-3 hops) between relevant entities
- Weights paths by traversal frequency using BFS
- Prunes weak connections using threshold filtering
- Converts surviving paths to natural language descriptions
This provides richer context than standard RAG by surfacing indirect semantic connections that would otherwise be missed.