@@ -10,6 +10,7 @@ Complete environment variable reference for Context Engine.
 - [Core Settings](#core-settings)
 - [Embedding Models](#embedding-models)
 - [Indexing & Micro-Chunks](#indexing--micro-chunks)
+- [Chunking Strategies](#chunking-strategies)
 - [Query Optimization](#query-optimization)
 - [Watcher Settings](#watcher-settings)
 - [Reranker](#reranker)
@@ -105,13 +106,45 @@ make reset-dev-dual # Recreates collection and reindexes
 | USE_TREE_SITTER | Enable tree-sitter parsing (py/js/ts) | 1 (on) |
 | INDEX_USE_ENHANCED_AST | Enable advanced AST-based semantic chunking | 1 (on) |
 | INDEX_SEMANTIC_CHUNKS | Enable semantic chunking (preserve function/class boundaries) | 1 (on) |
+| INDEX_SOSC_CHUNKS | Enable SOSC chunking (concept-aware, search-optimized) | 0 (off) |
+| INDEX_CAST_CHUNKS | Enable CAST+ chunking (hybrid merging with density scoring) | 0 (off) |
+| INDEX_SDC_CHUNKS | Enable SDC chunking (semantic density chunking) | 0 (off) |
 | INDEX_CHUNK_LINES | Lines per chunk (non-micro mode) | 120 |
 | INDEX_CHUNK_OVERLAP | Overlap lines between chunks | 20 |
 | INDEX_BATCH_SIZE | Upsert batch size | 64 |
 | INDEX_PROGRESS_EVERY | Log progress every N files | 200 |
 | SMART_SYMBOL_REINDEXING | Reuse embeddings when only symbols change | 1 (enabled) |
 | MAX_CHANGED_SYMBOLS_RATIO | Threshold for full reindex vs smart update | 0.6 |
 
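Reviewer note: the `INDEX_CHUNK_LINES` and `INDEX_CHUNK_OVERLAP` rows describe a window size and an overlap, which reads like a sliding window over source lines. The sketch below is one plausible interpretation, not the indexer's actual code. The function name `line_chunks` and the exact windowing behavior (step = chunk size minus overlap, last chunk truncated at end of file) are assumptions.

```python
def line_chunks(lines, chunk_lines=120, overlap=20):
    """Split `lines` into overlapping windows.

    Hypothetical helper: defaults mirror INDEX_CHUNK_LINES=120 and
    INDEX_CHUNK_OVERLAP=20 from the table; the windowing rule is assumed.
    Returns (start_line, end_line, chunk) tuples with 1-based inclusive lines.
    """
    if chunk_lines <= 0 or not 0 <= overlap < chunk_lines:
        raise ValueError("need chunk_lines > 0 and 0 <= overlap < chunk_lines")
    step = chunk_lines - overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(lines):
        end = min(start + chunk_lines, len(lines))
        chunks.append((start + 1, end, lines[start:end]))
        if end == len(lines):
            break  # last window reached the end of the file
        start += step
    return chunks
```

With the defaults, consecutive chunks share 20 lines, so a symbol that straddles a chunk boundary still appears whole in at least one chunk as long as it is no longer than the overlap.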
+### Chunking Strategies
+
+Context Engine supports multiple chunking strategies. Only one is applied at a time: if several flags are enabled, the first enabled strategy in the priority order below wins. Because `INDEX_SEMANTIC_CHUNKS` is on by default, enabling any higher-priority flag overrides it.
+
+1. **MICRO** (`INDEX_MICRO_CHUNKS=1`) - Token-based micro-chunking for ReFRAG. 16-token windows with 8-token stride.
+
+2. **SOSC** (`INDEX_SOSC_CHUNKS=1`) - Search-Optimized Semantic Chunking. Uses tree-sitter + language mappings to extract concept-aware chunks (DEFINITION, BLOCK, COMMENT, IMPORT, STRUCTURE). Best for search quality - clean boundaries, respects symbol structure.
+
+3. **CAST+** (`INDEX_CAST_CHUNKS=1`) - Hybrid chunking combining concept-aware grouping with density scoring. Merges compatible concepts aggressively (e.g., docstring + function). Best for token efficiency.
+
+4. **SDC** (`INDEX_SDC_CHUNKS=1`) - Semantic Density Chunking. Token-aware chunking with density scoring.
+
+5. **SEMANTIC** (`INDEX_SEMANTIC_CHUNKS=1`, default) - AST-aware chunking that preserves function/class boundaries.
+
+6. **LINE-BASED** (fallback) - Simple line-based chunking with overlap.
+
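Reviewer note: the "first enabled wins" rule in the list above can be sketched as a simple ordered lookup. This is an illustration of the documented priority order, not the project's implementation. The function names, the strategy labels, the set of accepted truthy values, and the assumption that `INDEX_MICRO_CHUNKS` defaults to off are all hypothetical.

```python
import os

# (label, env flag, default) in the documented priority order.
# Defaults for SOSC/CAST/SDC/SEMANTIC come from the reference table;
# the INDEX_MICRO_CHUNKS default is not shown here and is assumed "0".
PRIORITY = [
    ("micro", "INDEX_MICRO_CHUNKS", "0"),
    ("sosc", "INDEX_SOSC_CHUNKS", "0"),
    ("cast", "INDEX_CAST_CHUNKS", "0"),
    ("sdc", "INDEX_SDC_CHUNKS", "0"),
    ("semantic", "INDEX_SEMANTIC_CHUNKS", "1"),
]


def is_enabled(value):
    # Assumed parsing: the docs only show 1/0; other spellings are a guess.
    return value.strip().lower() in {"1", "true", "yes", "on"}


def resolve_chunking_strategy(env=None):
    """Return the label of the first enabled strategy, or "line" as fallback."""
    env = os.environ if env is None else env
    for label, flag, default in PRIORITY:
        if is_enabled(env.get(flag, default)):
            return label
    return "line"
```

Under these assumptions, an empty environment resolves to `semantic`, setting `INDEX_SOSC_CHUNKS=1` alone resolves to `sosc`, and line-based chunking is reached only when `INDEX_SEMANTIC_CHUNKS=0` and no other flag is set.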
+- **Recommended for search quality:** `INDEX_SOSC_CHUNKS=1`
+- **Recommended for token efficiency:** `INDEX_CAST_CHUNKS=1`
+
+| SOSC Config | Description | Default |
+|-------------|-------------|---------|
+| SOSC_MAX_CHARS | Max non-whitespace chars per chunk | 1200 |
+| SOSC_MIN_CHARS | Min chars to avoid tiny fragments | 50 |
+
+| CAST+ Config | Description | Default |
+|--------------|-------------|---------|
+| CAST_MAX_SIZE | Max non-whitespace chars per chunk | 1200 |
+| CAST_MIN_SIZE | Min chars to avoid tiny fragments | 50 |
+
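Reviewer note: the size limits in the two tables above are measured in non-whitespace characters, which is easy to misread as raw length. The sketch below shows how such limits might be read and applied. It is an assumption-laden illustration: `non_ws_chars`, `sosc_limits`, and `classify_chunk` are hypothetical names, the "split"/"merge"/"keep" outcomes are illustrative, and measuring the minimum in non-whitespace characters as well is a guess (the table only says "Min chars").

```python
import os


def non_ws_chars(text):
    """Count characters that are not whitespace (indentation is ignored)."""
    return sum(1 for ch in text if not ch.isspace())


def sosc_limits(env=None):
    """Read (min, max) from SOSC_MIN_CHARS / SOSC_MAX_CHARS with table defaults."""
    env = os.environ if env is None else env
    return int(env.get("SOSC_MIN_CHARS", "50")), int(env.get("SOSC_MAX_CHARS", "1200"))


def classify_chunk(text, min_chars, max_chars):
    """Illustrative decision: oversized chunks split, tiny ones merge."""
    size = non_ws_chars(text)
    if size > max_chars:
        return "split"
    if size < min_chars:
        return "merge"
    return "keep"
```

Because whitespace is excluded, a deeply indented block and its dedented equivalent count the same, so the limits track code content rather than formatting. The same shape would apply to `CAST_MIN_SIZE` and `CAST_MAX_SIZE`.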
 ## Query Optimization
 
 Dynamic HNSW_EF tuning and intelligent query routing for 2x faster simple queries.