Commit cecb223

Merge pull request Context-Engine-AI#194 from Context-Engine-AI/chunk
Add CAST+ and SOSC chunking strategies
2 parents ad566d4 + 821d8cf commit cecb223

46 files changed

Lines changed: 23282 additions & 4 deletions

.env.example

Lines changed: 13 additions & 0 deletions
@@ -183,6 +183,19 @@ USE_TREE_SITTER=1
 INDEX_USE_ENHANCED_AST=1
 INDEX_SEMANTIC_CHUNKS=1

+# Search-Optimized Semantic Chunking (SOSC) - concept-aware chunking for search
+# Combines cAST concept-awareness with SDC density scoring, optimized for MCP search.
+# Key features:
+# - 5 concept types: DEFINITION, BLOCK, COMMENT, IMPORT, STRUCTURE
+# - Concept-aware merging (docstring+function OK, block+definition NO)
+# - Deduplication (no duplicate chunks indexed)
+# - Emergency splitting for minified code
+# - Parent tracking for metadata (no context padding in chunk text)
+# Set to 1 to enable SOSC instead of SDC (INDEX_SEMANTIC_CHUNKS)
+# INDEX_SOSC_CHUNKS=0
+# SOSC_MAX_CHARS=1200
+# SOSC_MIN_CHARS=50
+
 # Pattern Search - structural code similarity across languages
 # Enables pattern_search MCP tool and indexes 64-dim pattern vectors
 # Uses WL graph kernel, CFG fingerprints, SimHash, spectral features
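
Reviewer note: the three commented-out settings in this hunk could be consumed roughly as follows. This is only a sketch under stated assumptions, not the code from this commit: `SoscConfig`, `load_sosc_config`, `non_ws_len`, and `size_class` are hypothetical names, and the "non-whitespace" size measure is taken from the description of `SOSC_MAX_CHARS` in docs/CONFIGURATION.md. The defaults (0, 1200, 50) are the values shown in the hunk.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class SoscConfig:
    """Hypothetical container for the SOSC settings; names are illustrative."""
    enabled: bool
    max_chars: int
    min_chars: int


def load_sosc_config(env=None) -> SoscConfig:
    """Read the SOSC settings, falling back to the defaults in .env.example."""
    env = os.environ if env is None else env
    return SoscConfig(
        enabled=env.get("INDEX_SOSC_CHUNKS", "0").strip() == "1",
        max_chars=int(env.get("SOSC_MAX_CHARS", "1200")),
        min_chars=int(env.get("SOSC_MIN_CHARS", "50")),
    )


def non_ws_len(text: str) -> int:
    """Count non-whitespace characters (assumed size measure for a chunk)."""
    return sum(1 for ch in text if not ch.isspace())


def size_class(text: str, cfg: SoscConfig) -> str:
    """Classify a candidate chunk against the configured bounds."""
    n = non_ws_len(text)
    if n > cfg.max_chars:
        return "too_large"   # assumed: candidate for splitting
    if n < cfg.min_chars:
        return "too_small"   # assumed: candidate for merging with a neighbour
    return "ok"
```

Passing a plain dict as `env` keeps the sketch testable without touching the process environment; what the indexer actually does with oversized or undersized chunks is not shown in this diff.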

docs/ARCHITECTURE.md

Lines changed: 2 additions & 0 deletions
@@ -171,6 +171,8 @@ Tree-sitter-based multi-language AST analysis for semantic code understanding:
 - **Call Graph Construction**: Maps caller → callee relationships with enclosing function context
 - **Dependency Tracking**: Extracts imports and module dependencies
 - **Semantic Chunking**: Splits code at function/class boundaries (not arbitrary line counts)
+- **SOSC**: Search-Optimized Semantic Chunking using 34 language mappings for concept-aware chunks
+- **CAST+**: Hybrid chunking with concept-aware merging and density scoring

 **Supported Languages:**
 | Language | Package |
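
Reviewer note: "concept-aware merging" is only described by example in this commit (the .env.example comment says docstring+function is OK and block+definition is not). The sketch below illustrates one way such a rule could look. It is an assumption throughout: the `Concept` enum mirrors the five concept types named in the commit, but treating a docstring as a `COMMENT` chunk, the `MERGEABLE` table, and the helper names are hypothetical, and every pair not listed is conservatively treated as non-mergeable.

```python
from enum import Enum


class Concept(Enum):
    DEFINITION = "definition"
    BLOCK = "block"
    COMMENT = "comment"
    IMPORT = "import"
    STRUCTURE = "structure"


# Ordered (earlier, later) pairs that may be merged.
# Only the pair the commit names as OK is listed; all others are assumed NOT mergeable.
MERGEABLE = {(Concept.COMMENT, Concept.DEFINITION)}


def can_merge(first: Concept, second: Concept) -> bool:
    """Return True when an earlier chunk may be folded into the chunk after it."""
    return (first, second) in MERGEABLE


def merge_adjacent(chunks):
    """Fold each chunk into its successor when the concept pair allows it.

    `chunks` is a list of (Concept, text) tuples in source order.
    The merged chunk takes the concept of the later chunk.
    """
    merged = []
    for concept, text in chunks:
        if merged and can_merge(merged[-1][0], concept):
            _, prev_text = merged.pop()
            merged.append((concept, prev_text + "\n" + text))
        else:
            merged.append((concept, text))
    return merged
```

A size cap (for example the `CAST_MAX_SIZE` setting documented in docs/CONFIGURATION.md) would presumably also gate merging; it is left out here to keep the rule itself visible.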

docs/CONFIGURATION.md

Lines changed: 33 additions & 0 deletions
@@ -10,6 +10,7 @@ Complete environment variable reference for Context Engine.
 - [Core Settings](#core-settings)
 - [Embedding Models](#embedding-models)
 - [Indexing & Micro-Chunks](#indexing--micro-chunks)
+- [Chunking Strategies](#chunking-strategies)
 - [Query Optimization](#query-optimization)
 - [Watcher Settings](#watcher-settings)
 - [Reranker](#reranker)
@@ -105,13 +106,45 @@ make reset-dev-dual # Recreates collection and reindexes
 | USE_TREE_SITTER | Enable tree-sitter parsing (py/js/ts) | 1 (on) |
 | INDEX_USE_ENHANCED_AST | Enable advanced AST-based semantic chunking | 1 (on) |
 | INDEX_SEMANTIC_CHUNKS | Enable semantic chunking (preserve function/class boundaries) | 1 (on) |
+| INDEX_SOSC_CHUNKS | Enable SOSC chunking (concept-aware, search-optimized) | 0 (off) |
+| INDEX_CAST_CHUNKS | Enable CAST+ chunking (hybrid merging with density scoring) | 0 (off) |
+| INDEX_SDC_CHUNKS | Enable SDC chunking (semantic density chunking) | 0 (off) |
 | INDEX_CHUNK_LINES | Lines per chunk (non-micro mode) | 120 |
 | INDEX_CHUNK_OVERLAP | Overlap lines between chunks | 20 |
 | INDEX_BATCH_SIZE | Upsert batch size | 64 |
 | INDEX_PROGRESS_EVERY | Log progress every N files | 200 |
 | SMART_SYMBOL_REINDEXING | Reuse embeddings when only symbols change | 1 (enabled) |
 | MAX_CHANGED_SYMBOLS_RATIO | Threshold for full reindex vs smart update | 0.6 |

+### Chunking Strategies
+
+Context Engine supports multiple chunking strategies. Only one can be active at a time. Priority order (first enabled wins):
+
+1. **MICRO** (`INDEX_MICRO_CHUNKS=1`) - Token-based micro-chunking for ReFRAG. 16-token windows with 8-token stride.
+
+2. **SOSC** (`INDEX_SOSC_CHUNKS=1`) - Search-Optimized Semantic Chunking. Uses tree-sitter + language mappings to extract concept-aware chunks (DEFINITION, BLOCK, COMMENT, IMPORT, STRUCTURE). Best for search quality - clean boundaries, respects symbol structure.
+
+3. **CAST+** (`INDEX_CAST_CHUNKS=1`) - Hybrid chunking combining concept-aware grouping with density scoring. Merges compatible concepts aggressively (e.g., docstring + function). Best for token efficiency.
+
+4. **SDC** (`INDEX_SDC_CHUNKS=1`) - Semantic Density Chunking. Token-aware chunking with density scoring.
+
+5. **SEMANTIC** (`INDEX_SEMANTIC_CHUNKS=1`, default) - AST-aware chunking that preserves function/class boundaries.
+
+6. **LINE-BASED** (fallback) - Simple line-based chunking with overlap.
+
+**Recommended for search quality:** `INDEX_SOSC_CHUNKS=1`
+**Recommended for token efficiency:** `INDEX_CAST_CHUNKS=1`
+
+| SOSC Config | Description | Default |
+|-------------|-------------|---------|
+| SOSC_MAX_CHARS | Max non-whitespace chars per chunk | 1200 |
+| SOSC_MIN_CHARS | Min chars to avoid tiny fragments | 50 |
+
+| CAST+ Config | Description | Default |
+|--------------|-------------|---------|
+| CAST_MAX_SIZE | Max non-whitespace chars per chunk | 1200 |
+| CAST_MIN_SIZE | Min chars to avoid tiny fragments | 50 |
+
 ## Query Optimization

 Dynamic HNSW_EF tuning and intelligent query routing for 2x faster simple queries.
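
Reviewer note: the "first enabled wins" priority order documented in the Chunking Strategies section of this hunk can be sketched as a simple ordered lookup. This is an illustration, not the selection code from the commit: `PRIORITY` and `select_strategy` are hypothetical names, and the default of `0` for `INDEX_MICRO_CHUNKS` is an assumption (this diff does not state it). The other defaults come from the settings table in the same hunk.

```python
import os

# (strategy name, environment variable, default value) in documented priority order.
PRIORITY = [
    ("MICRO", "INDEX_MICRO_CHUNKS", "0"),      # default assumed, not shown in this diff
    ("SOSC", "INDEX_SOSC_CHUNKS", "0"),
    ("CAST+", "INDEX_CAST_CHUNKS", "0"),
    ("SDC", "INDEX_SDC_CHUNKS", "0"),
    ("SEMANTIC", "INDEX_SEMANTIC_CHUNKS", "1"),
]


def select_strategy(env=None) -> str:
    """Return the first enabled strategy, or the line-based fallback."""
    env = os.environ if env is None else env
    for name, var, default in PRIORITY:
        if env.get(var, default).strip() == "1":
            return name
    return "LINE-BASED"
```

One consequence worth checking in review: because SEMANTIC defaults to on, LINE-BASED is reached only when `INDEX_SEMANTIC_CHUNKS=0` is set explicitly and no higher-priority flag is enabled, and setting both `INDEX_SOSC_CHUNKS=1` and `INDEX_CAST_CHUNKS=1` silently selects SOSC.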
