Commit ad566d4 (1 parent: cfa629f)
Add Semantic Density Chunker (SDC) for improved chunking
Introduces the Semantic Density Chunker (SDC), a token-aware, AST-driven chunking algorithm, in scripts/ingest/semantic_chunker.py. Integrates SDC into chunking.py and pipeline.py, where it is selected via the INDEX_SDC_CHUNKS environment variable. The chunker packs code under a token budget, respects semantic boundaries, merges small units, and scores each chunk by information density.
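The diff bodies are not reproduced here, so the following is only a minimal sketch of the approach the commit message describes, not the code in scripts/ingest/semantic_chunker.py. It assumes top-level AST statements as the semantic units, a regex token counter as a stand-in for a real tokenizer, and a distinct-token ratio as the density score. The names `sdc_chunks`, `Chunk`, `count_tokens`, `density`, and `use_sdc`, and the accepted values for INDEX_SDC_CHUNKS, are all hypothetical.

```python
from __future__ import annotations

import ast
import os
import re
from dataclasses import dataclass

# Assumption: a crude regex tokenizer stands in for the real token counter.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def count_tokens(text: str) -> int:
    return len(TOKEN_RE.findall(text))


def density(text: str) -> float:
    """Hypothetical density score: distinct tokens divided by total tokens."""
    tokens = TOKEN_RE.findall(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0


@dataclass
class Chunk:
    start_line: int  # 1-indexed, inclusive
    end_line: int    # 1-indexed, inclusive
    text: str
    tokens: int
    density: float


def sdc_chunks(source: str, max_tokens: int = 200) -> list[Chunk]:
    """Greedily merge adjacent top-level AST units while they fit the budget."""
    lines = source.splitlines()

    # Semantic boundaries: one span per top-level statement, decorators included.
    spans = []
    for node in ast.parse(source).body:
        decorators = getattr(node, "decorator_list", [])
        first = min([node.lineno] + [d.lineno for d in decorators])
        spans.append((first, node.end_lineno))

    def make(start: int, end: int) -> Chunk:
        text = "\n".join(lines[start - 1:end])
        return Chunk(start, end, text, count_tokens(text), density(text))

    chunks: list[Chunk] = []
    current = None  # (start, end) of the chunk being built
    for start, end in spans:
        if current is None:
            current = (start, end)
        elif make(current[0], end).tokens <= max_tokens:
            current = (current[0], end)  # small unit: merge into current chunk
        else:
            chunks.append(make(*current))  # budget exceeded: close the chunk
            current = (start, end)
    if current is not None:
        chunks.append(make(*current))
    return chunks


def use_sdc() -> bool:
    # Assumption: the set of truthy values is a guess, not taken from the commit.
    value = os.environ.get("INDEX_SDC_CHUNKS", "")
    return value.strip().lower() in {"1", "true", "yes"}
```

In this sketch a single unit larger than the budget is emitted as its own oversized chunk rather than split further; a fuller implementation would presumably descend into the unit's child nodes.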
3 files changed
Lines changed: 533 additions & 1 deletion
| Original file lines | Diff lines | Change |
|---|---|---|
| 21-26 | 21-33 | 7 lines added (24-30) |
| 284-286 | 291-337 | 44 lines added (294-337) |
| Original file lines | Diff lines | Change |
|---|---|---|
| 56-62 | 56-62 | 1 line replaced (59) |
| 802-807 | 802-809 | 2 lines added (805-806) |
| 824-829 | 826-834 | 3 lines added (829-831) |