- Source Repository: HKUDS/LightRAG
- Analyzed Commit:
c884d7dd(2026-02-09) - Initial Document Date: 2026-02-11
- Last Updated: 2026-03-20
LightRAG is a specialized Retrieval-Augmented Generation (RAG) framework designed to be simple, fast, and cost-effective. Unlike traditional RAG systems that rely solely on vector similarity, LightRAG integrates Graph-based and Vector-based retrieval strategies. It builds a Knowledge Graph (KG) from documents to capture complex relationships and entities, enabling "Global" (high-level summary) and "Local" (specific detail) query capabilities. The system is built with an "Async-First" architecture to handle high-throughput ingestion and retrieval.
The system follows a modular architecture where the LightRAG orchestrator coordinates interactions between the user, the LLM service, and four distinct storage backends.
graph TD
User["User / Application"] -->|Query/Insert| LightRAG["LightRAG Orchestrator"]
subgraph Core [Core Modules]
LightRAG --> Operate["operate.py\n(Workflow Logic)"]
LightRAG --> LLM_Mod["llm module\n(Model Interface)"]
end
subgraph Storage [Storage Layer]
Operate --> KV["KV Storage\n(Docs/Chunks)"]
Operate --> VDB_E["Vector Storage\n(Entities)"]
Operate --> VDB_R["Vector Storage\n(Relations)"]
Operate --> VDB_C["Vector Storage\n(Chunks)"]
Operate --> Graph["Graph Storage\n(Nodes/Edges)"]
end
LLM_Mod -->|API Calls| ExtLLM["External LLM Providers\nOpenAI/Azure/Ollama etc."]
LightRAG(Orchestraator): The main entry point. It initializes storage, manages configuration (usinglightrag.py), and exposesinsertandquerymethods.operate.py(The Engine): Contains the business logic for:- Chunking: Splitting text by token size.
- Extraction: Prompting LLMs to identify Entities and Relationships.
- Retrieval: Implementing the Logic for Local, Global, and Hybrid search.
- Consolidation: Merging duplicate entities and updating the graph.
base.py(Interfaces): Defines abstract base classes (BaseGraphStorage,BaseKVStorage,BaseVectorStorage) to ensure backend agnosticism.- Storage Implementations:
- Graph: NetworkX (local), Neo4j, ArangoDB, PostgreSQL (Apache AGE).
- Vector: NanoVectorDB (local), Milvus, Chroma, Qdrant, PostgreSQL (pgvector).
- KV: JSON (local), Redis, MongoDB, PostgreSQL.
LightRAG uses a hybrid data model combining structured graph data with unstructured vector embeddings.
- Structure:
{"entity_name": "...", "entity_type": "...", "description": "...", "source_id": "..."} - Storage:
- Graph: Stored as Nodes with properties.
- Vector: Description + Name embedded for similarity search.
- Structure:
{"src_id": "A", "tgt_id": "B", "keywords": [...], "description": "...", "weight": float} - Storage:
- Graph: Stored as Edges with properties.
- Vector: Description + Keywords embedded for global (relationship-centric) search.
- Document: Raw input text.
- Chunk: Token-limited segment of a document.
- Storage: KV Store maps
id -> content. Chunks can also be embedded (Naive RAG).
The ingestion process transforms raw text into a queryable Knowledge Graph through a multi-step pipeline.
sequenceDiagram
participant U as User
participant L as LightRAG
participant O as Operate
participant LLM as LLM Service
participant DB as Storage
U->>L: insert(text)
L->>O: chunking_by_token_size(text)
O->>L: chunks
par Parallel Extraction
L->>LLM: extract_entity_relationship(chunk)
LLM-->>L: entities, relations
end
L->>O: merge_nodes_and_edges(new_data)
O->>DB: upsert_node(entities)
O->>DB: upsert_edge(relations)
L->>LLM: embed(entities_desc, relations_desc)
L->>DB: upsert_vectors(embeddings)
L->>DB: upsert_kv(chunks)
-
Document Deduplication:
- Calculates MD5 hash of input documents.
- Skips documents that have already been processed to avoid redundant computation.
-
Chunking (
chunking_by_token_size):- Splits documents into manageable segments based on token count (default: 1200 tokens).
- Maintains an overlap (default: 100 tokens) to preserve context across boundaries.
- Uses
tiktokenfor accurate tokenization.
-
Entity & Relationship Extraction:
- Prompting: Uses
entity_extraction_system_promptto instruct the LLM to identify entities (e.g., Person, Org, Geo) and their relationships. - Multi-Gleaning: Optionally performs multiple extraction passes to catch missed entities.
- Structured Parsing: Converts LLM output (often in specific delimiter-separated formats like
("entity"<|>NAME<|>TYPE<|>DESC)) into structured objects.
- Prompting: Uses
-
Graph Construction & Update:
- Entity Merging: Combines descriptions for the same entity from different chunks.
- Relationship Merging: Aggregates weights and combines descriptions for identical edges.
- Storage: Persists Nodes and Edges to the Graph Storage (e.g., NetworkX, Neo4j, AGE).
-
Vector Indexing:
- Entity Embedding: Embeds
Name + Descriptionfor entity-centric search. - Relationship Embedding: Embeds
Keywords + Descriptionfor relationship-centric (global) search. - Chunk Embedding: Embeds raw chunk text for fallback or hybrid retrieval.
- Entity Embedding: Embeds
LightRAG implements a sophisticated query pipeline that adapts to the nature of the user's question.
graph TD
Q[User Query] --> KW[Keyword Extraction]
KW -->|Low-Level KW| Local[Local Search]
KW -->|High-Level KW| Global[Global Search]
subgraph LocalSearch [Local Search]
Local --> E_Search[Vector Search Entities]
E_Search --> E_Graph[Get Entity Degrees & Neighbors]
E_Graph --> E_Context[Build Entity Context]
end
subgraph GlobalSearch [Global Search]
Global --> R_Search[Vector Search Relations]
R_Search --> R_Graph[Get Relation Data]
R_Graph --> R_Context[Build Relation Context]
end
E_Context --> Merge[Merge & Deduplicate]
R_Context --> Merge
Merge --> Trunc[Token Truncation]
Trunc --> FinalContext[Final Prompt Context]
FinalContext --> LLM[LLM Generation]
LLM --> Answer
-
Local Query (Entity-Centric):
- Goal: Answer specific questions about details (e.g., "What is the battery life of X?").
- Mechanism:
- Extracts specific entity names (keywords).
- Performs Vector Search on the Entity Index.
- Retrieves 1-hop neighbors from the Knowledge Graph.
- Constructs context from Entity Descriptions + Neighbor Relations.
-
Global Query (Theme-Centric):
- Goal: Answer broad, summary questions (e.g., "What are the main themes in this dataset?").
- Mechanism:
- Extracts high-level concepts/keywords.
- Performs Vector Search on the Relationship Index.
- Retrieves top-ranked edges (relationships) based on semantic similarity.
- Constructs context primarily from Relationship Descriptions.
-
Hybrid Query (Best of Both):
- Goal: Provide a comprehensive answer covering both specific details and broad context.
- Mechanism: Executes both Local and Global strategies, merges the distinct contexts (deduplicating overlapped chunks), and prompts the LLM to synthesize the information.
- Keyword Extraction: Uses LLM to separate query into
high_level_keywords(for global) andlow_level_keywords(for local). - Dual Context Retrieval:
- Queries
entities_vdbwith low-level keywords. - Queries
relationships_vdbwith high-level keywords.
- Queries
- Context Combination & Pruning:
- Merges results from both streams.
- Token Truncation: Strictly enforces token limits (e.g., 5000 tokens) by prioritizing the most relevant chunks/entities.
- Generation: Sends the finalized, token-bounded context to the LLM with a specific generation prompt.
The kg_query function (in operate.py) is the central entry point for processing a user's natural language query against the knowledge graph. It orchestrates an end-to-end pipeline that transforms raw text into a grounded, context-rich LLM response. Below is a detailed, step-by-step breakdown of this pipeline, including which storage backend is involved at each stage.
The function begins by checking for an empty query (returning a fail response immediately if so). It then resolves which LLM function to use: if query_param.model_func is provided it takes precedence, otherwise the global llm_model_func is used with an elevated priority (_priority=5) to ensure query-time calls are scheduled ahead of lower-priority background tasks.
Storage used: None.
Before any retrieval can happen, the query must be decomposed into two classes of keywords:
| Keyword Type | Purpose | Used By |
|---|---|---|
| High-level (hl_keywords) | Broad themes, abstract concepts | Global / Hybrid / Mix search |
| Low-level (ll_keywords) | Specific entities, concrete details | Local / Hybrid / Mix search |
If keywords are pre-supplied via query_param, they are used directly. Otherwise, the function calls extract_keywords_only, which prompts the LLM with a keyword-extraction template and parses the structured JSON response. A fallback mechanism exists: when both keyword lists are empty and the query is short (under 50 characters), the raw query text itself is used as a low-level keyword.
Storage used: hashing_kv (BaseKVStorage) — caches keyword extraction results to avoid redundant LLM calls on repeated or similar queries.
This is the core retrieval stage. It operates differently depending on the selected query mode.
Before dispatching mode-specific searches, the function batches all required embedding calls into a single request. Depending on the mode, it may embed the raw query, the low-level keywords, and/or the high-level keywords. Batching avoids 2–3 sequential API round-trips and significantly reduces latency.
Storage used: Embedding function (external API call).
| Mode | Search Path | Description |
|---|---|---|
| local | _get_node_data(ll_keywords) |
Finds entities via vector similarity, then traverses their graph neighborhood |
| global | _get_edge_data(hl_keywords) |
Finds relationships via vector similarity, then retrieves associated entities |
| hybrid | Both local + global in sequence | Executes both paths and merges results |
| mix | hybrid + _get_vector_context(query) |
Adds a direct vector search over document chunks on top of hybrid |
Local path (_get_node_data):
- Queries
entities_vdbwith the low-level keywords to find semantically similar entities. - Retrieves full node data and degree counts from
knowledge_graph_instviaget_nodes_batch()andnode_degrees_batch(). - Calls
_find_most_related_edges_from_entitiesto discover edges connected to the matched entities viaget_nodes_edges_batch(), then fetches edge details in batch.
Global path (_get_edge_data):
- Queries
relationships_vdbwith the high-level keywords to find semantically similar relationships. - Retrieves full edge data and associated source/target entities from
knowledge_graph_inst.
Mix path (additional):
- Queries
chunks_vdbwith the original query text to retrieve raw document chunks ranked by vector similarity.
Storage used:
entities_vdb(BaseVectorStorage) — cosine similarity search for entities.relationships_vdb(BaseVectorStorage) — cosine similarity search for relationships.knowledge_graph_inst(BaseGraphStorage) — node/edge data retrieval, degree computation, neighborhood traversal.chunks_vdb(BaseVectorStorage) — direct document chunk retrieval (mix mode only).
Results from the local and global paths are interleaved in round-robin fashion (one local, one global, alternating) and deduplicated by entity name or relationship key. This ensures balanced representation from both retrieval strategies.
The raw search results typically exceed the LLM's context budget. This stage enforces strict token limits (max_entity_tokens, max_relation_tokens, max_total_tokens) by scoring and ranking the results and discarding lower-priority items. When reranking is enabled (enable_rerank=True), a reranker model re-scores the candidates before truncation for improved relevance.
Storage used: None (in-memory processing).
After truncation, the system collects the original text chunks that are linked to the surviving entities and relationships. These are merged with any vector chunks retrieved in mix mode. The merging process deduplicates chunks by ID, tracks provenance (whether a chunk was found via KG traversal or direct vector search), and applies frequency-based scoring.
Storage used:
text_chunks_db(BaseKVStorage) — resolves chunk IDs to their original text content.knowledge_graph_inst(BaseGraphStorage) — traces entity/relationship source IDs back to chunks.chunks_vdb(BaseVectorStorage) — used during chunk reranking when applicable.
The filtered entities, relationships, and text chunks are formatted into a structured text block that will serve as the LLM's context window. This context is then injected into the rag_response prompt template alongside the user's query, response type preferences, and any custom user_prompt instructions.
Storage used: None (string formatting only).
Before calling the LLM, the function computes a deterministic hash from the query parameters (mode, query text, top_k, token limits, keywords, rerank flag, etc.) and checks the cache. On a cache hit, the stored response is returned immediately without an LLM call. On a cache miss, the LLM is invoked (supporting both streaming and non-streaming modes), and the response is written back to the cache for future reuse.
Storage used: hashing_kv (BaseKVStorage) — query response caching (read and write).
The final response is wrapped in a QueryResult object:
- Non-streaming:
QueryResult(content=response_text, raw_data=context_metadata) - Streaming:
QueryResult(response_iterator=async_iterator, raw_data=context_metadata, is_streaming=True)
The raw_data field carries structured metadata including the retrieved entities, relationships, chunks, keyword breakdowns, and processing statistics (counts before/after truncation, etc.).
Special return modes are also supported:
only_need_context=True: Returns just the assembled context string (no LLM call).only_need_prompt=True: Returns the full prompt that would be sent to the LLM (no LLM call).
| Storage Instance | Type | Steps Used | Role |
|---|---|---|---|
hashing_kv |
BaseKVStorage | 2, 7 | Caches keyword extraction and LLM query responses |
entities_vdb |
BaseVectorStorage | 3 | Cosine similarity search for entity discovery |
relationships_vdb |
BaseVectorStorage | 3 | Cosine similarity search for relationship discovery |
chunks_vdb |
BaseVectorStorage | 3, 5 | Direct document chunk retrieval (mix mode) and chunk reranking |
knowledge_graph_inst |
BaseGraphStorage | 3, 5 | Node/edge data, degree computation, neighborhood traversal, source chunk tracing |
text_chunks_db |
BaseKVStorage | 5 | Resolves chunk IDs to original text content |
LightRAG and PathRAG share a common ancestry but diverge significantly in how they process queries. This section provides a side-by-side comparison of their kg_query pipelines, highlighting the architectural differences and the unique path-based retrieval strategy that defines PathRAG.
graph TD
subgraph Common ["Shared Steps"]
Q["User Query"]
KW["Step 1: Keyword Extraction<br/>(LLM splits query into hl/ll keywords)"]
Q --> KW
end
subgraph LightRAG_Flow ["LightRAG Pipeline"]
direction TB
L_CACHE1["Check Cache<br/>(hashing_kv)"]
L_EMBED["Batch Embedding<br/>(query + ll + hl in one call)"]
L_SEARCH["Multi-Mode Search<br/>(local / global / hybrid / mix)"]
L_MERGE_RR["Round-Robin Merge<br/>(interleave local + global results)"]
L_TRUNC["Token Truncation<br/>(+ optional Reranking)"]
L_CHUNK["Chunk Merging<br/>(KG chunks + vector chunks)"]
L_CTX["Build LLM Context<br/>(structured text block)"]
L_LLM["LLM Call<br/>(streaming / non-streaming)"]
L_CACHE2["Save to Cache"]
L_RESULT["QueryResult<br/>(content + raw_data + metadata)"]
L_CACHE1 -->|miss| L_EMBED --> L_SEARCH --> L_MERGE_RR --> L_TRUNC --> L_CHUNK --> L_CTX --> L_LLM --> L_CACHE2 --> L_RESULT
L_CACHE1 -->|hit| L_RESULT
end
subgraph PathRAG_Flow ["PathRAG Pipeline"]
direction TB
P_CACHE1["Check Cache<br/>(hashing_kv)"]
P_LOCAL["Local: Entity Vector Search<br/>(entities_vdb)"]
P_GLOBAL["Global: Relationship Vector Search<br/>(relationships_vdb)"]
P_PATH["PATH FINDING<br/>(DFS up to 3 hops)"]
P_WEIGHT["BFS Path Weighting<br/>(alpha=0.8 decay)"]
P_PRUNE["Path Pruning<br/>(threshold=0.3)"]
P_NL["Path-to-Natural-Language<br/>(1-hop / 2-hop / 3-hop descriptions)"]
P_COMBINE["Combine Contexts<br/>(HL entities + relations + sources<br/> LL entities + path descriptions)"]
P_CTX["Build LLM Context<br/>(CSV-formatted tables)"]
P_LLM["LLM Call"]
P_CACHE2["Save to Cache"]
P_RESULT["Response String"]
P_CACHE1 -->|miss| P_LOCAL --> P_PATH --> P_WEIGHT --> P_PRUNE --> P_NL
P_CACHE1 -->|miss| P_GLOBAL
P_GLOBAL --> P_COMBINE
P_NL --> P_COMBINE --> P_CTX --> P_LLM --> P_CACHE2 --> P_RESULT
P_CACHE1 -->|hit| P_RESULT
end
KW --> L_CACHE1
KW --> P_CACHE1
graph LR
subgraph LightRAG_Retrieval ["LightRAG: Edge-Based Retrieval"]
LE1["Entity A"] --- LR1["Direct Edge"] --- LE2["Entity B"]
LE1 --- LR2["Direct Edge"] --- LE3["Entity C"]
end
subgraph PathRAG_Retrieval ["PathRAG: Path-Based Retrieval"]
PE1["Entity A"] ---|"hop 1"| PE2["Entity B"]
PE2 ---|"hop 2"| PE3["Entity C"]
PE3 ---|"hop 3"| PE4["Entity D"]
end
| Aspect | LightRAG | PathRAG |
|---|---|---|
| Query Modes | local, global, hybrid, mix, naive |
hybrid only |
| Cache Timing | After keyword extraction, before search | Before keyword extraction (early exit) |
| Embedding Strategy | Batch pre-computation (query + ll + hl in one API call) | Individual calls per search |
| Local Retrieval | Entity vector search → 1-hop neighbor edges | Entity vector search → multi-hop path finding (1-3 hops) |
| Path Finding | Not used (direct edge retrieval) | DFS to find all paths, BFS-weighted pruning (alpha=0.8, threshold=0.3) |
| Relation Context | Raw edge descriptions (CSV) | Natural language path descriptions connecting entities through intermediate nodes |
| Result Merging | Round-robin interleave with deduplication | CSV-based process_combine_contexts merging |
| Token Management | 4-stage pipeline with dedicated truncation step + reranker support | Per-context-type token budgets (max_token_for_text_unit, max_token_for_local_context, max_token_for_global_context) |
| Chunk Retrieval | mix mode adds direct vector chunk search (chunks_vdb) |
Source chunks traced from entity/edge source_id fields |
| Return Type | QueryResult object (content + raw_data + streaming support) |
Plain string |
| Streaming | Supported (returns AsyncIterator) |
Supported (pass-through from LLM) |
- LightRAG: Checks cache after keyword extraction (Step 2 completes first). The cache key includes mode, query, response_type, top_k, chunk_top_k, token limits, keywords, and rerank flag — a highly specific hash.
- PathRAG: Checks cache before keyword extraction (immediate early exit). The cache key is a simpler hash of
(mode, query)only.
Both systems use the same LLM-based approach to split the query into high_level_keywords and low_level_keywords. However:
- LightRAG: Supports pre-supplied keywords via
query_param.hl_keywords/query_param.ll_keywords, and caches keyword extraction results separately. Falls back to using the raw query as a keyword when both lists are empty and the query is short. - PathRAG: Always calls the LLM for extraction. Returns a fail response immediately if either keyword list is empty.
LightRAG performs mode-dependent retrieval through _perform_kg_search:
local → entities_vdb.query(ll) → get_nodes_batch → get_nodes_edges_batch → get_edges_batch
global → relationships_vdb.query(hl) → get_edges_batch → find related entities
hybrid → local + global in parallel
mix → hybrid + chunks_vdb.query(raw query)
PathRAG performs hybrid-only retrieval with an additional path-finding stage:
Local path:
entities_vdb.query(ll) → get_node → node_degree
→ find_paths_and_edges_with_stats (DFS, up to 3 hops)
→ bfs_weighted_paths (frequency-based weighting)
→ prune paths below threshold
→ convert surviving paths to natural language
Global path:
relationships_vdb.query(hl) → get_edge → edge_degree
→ find related entities → find related text chunks
- LightRAG: Uses a 4-stage architecture (search → truncate → merge chunks → build context). The final context is a structured text block with entities, relations, and chunks sections. Supports reranking at the truncation stage.
- PathRAG: Formats context as CSV tables organized into
global-information(HL entities, HL relations, sources) andlocal-information(LL entities, LL path-based relations) blocks. Token budgets are applied per context type during retrieval, not in a separate truncation stage.
Both systems inject the assembled context into a rag_response prompt template and call the LLM. Both apply a post-processing step to strip leaked system prompt fragments from the response. Both support caching the result.
- LightRAG additionally supports
conversation_historyfor multi-turn conversations,enable_cotfor chain-of-thought, and returns a structuredQueryResultwith metadata. - PathRAG returns a plain response string.
| Storage | LightRAG | PathRAG |
|---|---|---|
hashing_kv (BaseKVStorage) |
Keyword cache + query response cache | Query response cache only |
entities_vdb (BaseVectorStorage) |
Cosine search for entities | Cosine search for entities |
relationships_vdb (BaseVectorStorage) |
Cosine search for relationships | Cosine search for relationships |
chunks_vdb (BaseVectorStorage) |
Direct chunk retrieval (mix mode) | Not used in query pipeline |
knowledge_graph_inst (BaseGraphStorage) |
Batch node/edge retrieval, degree computation | Node/edge retrieval, degree computation, full graph export for path finding |
text_chunks_db (BaseKVStorage) |
Chunk ID → text resolution | Chunk ID → text resolution (traced from source_id) |
The core innovation of PathRAG is the _find_most_related_edges_from_entities3 function, which replaces LightRAG's direct edge lookup with a multi-hop path discovery process:
graph TD
A["Retrieved Entities<br/>(from vector search)"] --> B["Build NetworkX Graph<br/>(load ALL nodes + edges)"]
B --> C["DFS Path Finding<br/>(all pairs, max 3 hops)"]
C --> D["Categorize Paths<br/>(1-hop / 2-hop / 3-hop)"]
D --> E["BFS Weighted Pruning"]
E --> F{"Edge weight<br/>> threshold?"}
F -->|Yes| G["Keep path"]
F -->|No| H["Prune path"]
G --> I["Sort by weight, deduplicate"]
I --> J["Select top 15 paths"]
J --> K["Convert to Natural Language"]
K --> K1["1-hop: 'Entity A is a TYPE through edge(KEYWORD) to connect to A and B'"]
K --> K2["2-hop: A → bridge → B with intermediate descriptions"]
K --> K3["3-hop: A → bridge1 → bridge2 → B with full chain"]
subgraph Weighting ["BFS Weighting Parameters"]
W1["alpha = 0.8 (decay per hop)"]
W2["threshold = 0.3 (pruning cutoff)"]
W3["Weight = traversal frequency / num_followers"]
end
This approach enables PathRAG to surface indirect semantic connections — information that is multiple relationship hops away from the query entities — which LightRAG's direct edge retrieval would miss. The trade-off is computational cost: path finding is O(n^k) where n is the number of retrieved entities and k is the maximum hop count.
The _build_query_context function orchestrates the 4-stage retrieval:
- Search: Calls
_perform_kg_searchto get raw entities/relations/chunks. - Truncate: Calls
_apply_token_truncationto fit LLM context windows, prioritizing highly relevant/dense information. - Merge:
_merge_all_chunkscombines text chunks associated with the selected entities/relations. - Build:
_build_context_strassembles the final prompt.
LightRAG uses priority_limit_async_func_call (in utils.py) to manage LLM rate limits.
- It implements a Priority Queue to manage request ordering.
- It includes Health Checks to detect and recover from stuck tasks.
- It wraps execution in robust Timeout handling logic.
- LLM Cache: Responses are cached (hashed by prompt + args) to save costs and reduce latency on repeated queries.
- KV Storage: Used to store the cache, enabling persistence across runs if a persistent KV backend (like Redis/Disk) is used.
- Environment Variables: Heavily used for setup (
OPENAI_API_KEY,LIGHTRAG_LOG_DIR). - Storage Swapping: Users can mix and match storage backends (e.g., Neo4j for Graph + Milvus for Vector + Redis for KV) by passing different storage class instances to
LightRAG. - Custom Prompts: System prompts (in
prompt.py) can be overridden viaglobal_configto tailor extraction/generation styles (e.g., changing the language or detail level).
While LightRAG defaults to tiktoken (OpenAI's tokenizer), it supports custom tokenizers via the Tokenizer interface. This is useful when using models with different tokenization schemes, such as Google's Gemini.
To use a custom tokenizer (e.g., for Gemini):
- Implement a wrapper class adhering to the
Tokenizerprotocol (must implementencodeanddecode). - Pass an instance of this class to the
LightRAGconstructor via thetokenizerargument.
import os
from typing import List
from lightrag.utils import Tokenizer
from google import genai
class GeminiTokenizer(Tokenizer):
def __init__(self, model_name: str = "gemini-2.5-flash"):
self.model_name = model_name
self.token_map = {}
try:
project_id = os.environ.get("GOOGLE_CLOUD_PROJECT")
assert project_id is not None
location = os.environ.get("GOOGLE_CLOUD_LOCATION", "global")
self.client = genai.Client(vertexai=True, project=project_id, location=location)
except Exception as e:
# Fallback or error handling
print(f"Error initializing Gemini tokenizer: {e}")
raise
def encode(self, content: str) -> List[int]:
# returns the list of token IDs
response = self.client.models.compute_tokens(
model=self.model_name,
contents=content,
)
# Store token_ids -> tokens(bytes) mapping for decoding
if response.tokens_info:
info = response.tokens_info[0]
for t_id, t_bytes in zip(info.token_ids, info.tokens):
self.token_map[t_id] = t_bytes
return info.token_ids
return []
def decode(self, tokens: List[int]) -> str:
# returns the original string using stored token map
decoded_bytes = b""
for t_id in tokens:
if t_id in self.token_map:
decoded_bytes += self.token_map[t_id]
return decoded_bytes.decode("utf-8", errors="replace")
# Usage
# rag = LightRAG(..., tokenizer=GeminiTokenizer())The Tokenizer in LightRAG is far more than a simple text splitter. It serves as the core token budget manager that enforces LLM context window limits while maximizing the density of retrieved knowledge flowing through the pipeline. Its responsibilities fall into four categories:
1. Token-precise document chunking
The primary role is splitting source documents into chunks sized by actual token count rather than character count. The chunking_by_token_size function in operate.py encodes text into a token array via tokenizer.encode(), slices the array according to chunk_token_size (default: 1200) and chunk_overlap_token_size, then decodes each slice back to a string with tokenizer.decode(). This guarantees that no chunk exceeds the configured token limit and prevents ChunkTokenLimitExceededError from propagating through the pipeline.
2. Map-reduce control for entity/relation description merging
During knowledge graph construction, the same entity or relation may be described across multiple document chunks. The _handle_entity_relation_summary function in operate.py uses len(tokenizer.encode(desc)) to sum the total tokens across all descriptions and decide the merging strategy: if the total fits within summary_context_size, the descriptions are sent to the LLM in a single summarization call; if not, they are partitioned into token-budget-compliant groups (Map), each group is summarized individually, and the intermediate summaries are then re-summarized (Reduce).
3. Prompt context window and cost optimization
During RAG retrieval, the tokenizer enforces strict limits on the context data (node descriptions, edge descriptions, text units) injected alongside the user query. Functions such as truncate_list_by_token_size in utils.py iterate over retrieved items, accumulate their token counts via tokenizer.encode(), and truncate the list as soon as the cumulative count exceeds max_token_size. This prevents unnecessarily long inputs that would increase both latency and API billing.
4. Abstraction across LLM providers
The TokenizerInterface protocol (utils.py) and its implementations — TiktokenTokenizer for OpenAI models and GeminiTokenizer (utils_gemini.py) for Gemini — allow LightRAG to adapt to different tokenization schemes without changing any pipeline logic.
Yes — it works, and in practice it is often the preferred choice for performance reasons. However, vocabulary differences between tiktoken (OpenAI's BPE, cl100k_base/o200k_base) and Gemini's tokenizer (SentencePiece-based) introduce token count discrepancies that require attention.
Why TiktokenTokenizer is commonly used even with Gemini
The main reason is speed. The GeminiTokenizer implementation in utils_gemini.py calls client.models.compute_tokens — a network API request — for every encode() invocation. When processing large document corpora (hundreds of thousands to millions of characters), these network round-trips dramatically slow down the ingestion pipeline and risk hitting API rate limits. In contrast, TiktokenTokenizer runs entirely in local memory as a C/Rust-backed library, making it orders of magnitude faster.
Risks to be aware of
The two tokenizers can diverge significantly in token counts for the same input — especially for non-Latin scripts (e.g., Korean, Japanese, Chinese) and code snippets, where discrepancies of 20–30% or more are common. This creates two concrete risks:
-
Embedding model input limit overflow (most critical): Gemini LLMs have large context windows (up to 2M tokens), so slightly oversized chunks are harmless for LLM calls. However, embedding models such as
gemini-embedding-001enforce strict input token limits (e.g., 2048 tokens). If tiktoken estimates a chunk at 2000 tokens but Gemini's tokenizer counts it as 2200, the embedding API call will either fail with a 400 error or silently truncate the text, causing information loss in the vector index. -
Billing estimation inaccuracy: Internal token counts computed by tiktoken will not match the actual token counts reported by Google Cloud billing, making cost monitoring unreliable.
Recommendation: use TiktokenTokenizer with conservative limits
To retain the speed advantage of tiktoken while avoiding overflow errors, apply a safety margin by reducing the relevant configuration parameters:
chunk_token_size: Lower from the default 1200 to 800–1000. This buffer ensures that even if Gemini counts more tokens than tiktoken, chunks remain safely within the embedding model's input limit (2048).summary_max_tokensandsummary_context_size: Reduce by approximately 10–20% from their defaults to prevent oversized summarization prompts.
In summary, using TiktokenTokenizer with Gemini is a sound strategy for fast indexing — provided that chunk_token_size is tuned conservatively (e.g., 800) to absorb the token count variance, particularly for non-English content. Switch to GeminiTokenizer only when exact token-level precision is required and the added network latency is acceptable.
LightRAG allows users to override the default text chunking logic (chunking_by_token_size) by providing a custom chunking_func. This is useful for implementing domain-specific splitting (e.g., paragraph-based, sentence-based, or Markdown-aware chunking).
To use a custom chunking function:
- Define a synchronous or asynchronous function that matches the required signature.
- Pass this function to the
LightRAGconstructor via thechunking_funcargument.
The chunking_func can be implemented as either a synchronous or an asynchronous function. The LightRAG framework automatically detects the type of the function and handles it appropriately.
- Synchronous: If the function is a standard
def, it is called directly. - Asynchronous: If the function is an
async def, it is awaited automatically.
This flexibility allows you to perform basic text manipulation synchronously or more complex, IO-bound operations (like calling an external API for splitting) asynchronously.
The custom function must accept the following arguments and return a list of dictionaries:
from typing import Any, List, Dict
from lightrag.utils import Tokenizer
def my_custom_chunking(
tokenizer: Tokenizer,
content: str,
split_by_character: str | None = None,
split_by_character_only: bool = False,
chunk_overlap_token_size: int = 100,
chunk_token_size: int = 1200,
) -> List[Dict[str, Any]]:
# Implementation here...
return [
{
"tokens": token_count, # int
"content": chunk_text, # str
"chunk_order_index": index # int
},
# ...
]def paragraph_chunking(
tokenizer: Tokenizer,
content: str,
split_by_character: str | None = None,
split_by_character_only: bool = False,
chunk_overlap_token_size: int = 100,
chunk_token_size: int = 1200,
) -> List[Dict[str, Any]]:
"""Simple paragraph-based chunker example."""
chunks = []
# Simple split by double newline
paragraphs = content.split("\n\n")
for i, p in enumerate(paragraphs):
p = p.strip()
if not p:
continue
# Calculate tokens using the provided tokenizer
token_ids = tokenizer.encode(p)
token_count = len(token_ids)
chunks.append({
"tokens": token_count,
"content": p,
"chunk_order_index": i
})
return chunks
# Usage
# rag = LightRAG(..., chunking_func=paragraph_chunking)