fix: truncate embed text to prevent context length overflow (issue #53) #55
Conversation
Add buildEmbedText() helper that caps embed payload at 30 000 chars (~7 700 tokens, safe for 8 192-token models like nomic-embed-text). Metadata (docstring + signature) is always preserved in full; only the raw Code body is truncated when the total exceeds the limit. A [WARN] log is emitted when truncation occurs. Fixes #53
Code Review
This pull request introduces a truncation mechanism for code chunks to ensure they fit within the embedding model's context window, specifically adding the buildEmbedText function and associated unit tests. A review comment identifies a correctness issue where metadata exceeding the character limit would still cause an overflow and suggests a simplified, more robust truncation logic.
Pull request overview
Fixes embedding failures during indexing by introducing an embed-text builder that enforces a maximum payload size and logs when truncation occurs.
Changes:
- Added `buildEmbedText()` helper and `maxEmbedChars` limit to cap embedding payload size.
- Updated `Indexer.IndexPaths()` to use the capped embed text and emit a warning when truncation happens.
- Added unit tests validating truncation behavior and boundary conditions.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `internal/ragcode/indexer.go` | Builds embed text with a character limit and logs when truncation occurs before calling the embedder. |
| `internal/ragcode/indexer_embed_test.go` | Adds tests for `buildEmbedText()` to verify truncation and no-truncation scenarios. |
```go
func TestBuildEmbedText_WithDocstring_TruncatesCode(t *testing.T) {
	bigCode := strings.Repeat("y", 40_000)
	ch := codetypes.CodeChunk{
		Signature: "func Fn()",
		Code:      bigCode,
		Docstring: "This is a docstring.",
		FilePath:  "fn.go",
	}
	limit := 30_000
	text, truncated := buildEmbedText(ch, limit)
	if !truncated {
		t.Fatal("expected truncation")
	}
	if !strings.Contains(text, "This is a docstring.") {
		t.Errorf("docstring missing after truncation")
	}
	if len([]rune(text)) > limit {
		t.Errorf("text exceeds limit after truncation")
	}
}
```
Test coverage doesn’t currently include the case where metadata alone (very large Docstring or Signature) exceeds the limit. That’s the scenario where buildEmbedText should still guarantee the returned text is <= limit (or clearly define how it behaves). Adding a test for an oversized docstring/signature would prevent regressions and catch the current overflow behavior.
Address PR #55 review comments:

- Simplify `buildEmbedText()` to use a single `runes[:maxChars]` truncation (gemini-code-assist suggestion). This guarantees the limit is never exceeded, even when metadata alone is larger than `maxChars`.
- Remove redundant `[]rune` conversions in the truncation path (Copilot memory concern).
- Add tests for oversized metadata with and without a code body (Copilot test coverage request).
Avoid massive allocations by not concatenating large code chunks with metadata before truncation. Instead, compute the available space and extract only the necessary substring, iterating over the string to truncate safely at rune boundaries without a large `[]rune` conversion.
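The rune-boundary truncation described in this commit can be sketched as follows (`truncateRunes` is a hypothetical helper name; the PR's actual implementation may differ):

```go
package main

import "fmt"

// truncateRunes returns the longest prefix of s containing at most n runes.
// Ranging over a string yields the byte offset of each rune's first byte,
// so the cut always lands on a rune boundary and no []rune slice is
// allocated, even for very large inputs.
func truncateRunes(s string, n int) string {
	if n <= 0 {
		return ""
	}
	seen := 0
	for i := range s {
		if seen == n {
			return s[:i]
		}
		seen++
	}
	return s // s already has at most n runes
}

func main() {
	fmt.Println(truncateRunes("héllo", 3)) // hél
	fmt.Println(truncateRunes("abc", 10)) // abc
}
```

The caller would first subtract the metadata's rune count from `maxChars` and pass only the remaining budget for the code body, so the large code string is never copied in full.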
Description
Fixes the `"llm embedding error: the input length exceeds the context length"` error that occurs when indexing large codebases with Ollama (reported in #53).

**Root cause:** `IndexPaths()` was concatenating `docstring + signature + code` into a single embed payload with no size limit. For large symbols (big classes, generated files, long functions), this easily exceeds the context window of embedding models like `nomic-embed-text` (8,192 tokens).

**Fix:** Added a `buildEmbedText()` helper that hard-caps the embed payload at 30,000 characters (~7,700 tokens, giving ~6% safety headroom). Metadata (docstring + signature) has priority and is prepended to the text, but the entire payload (including metadata, if extremely large) is strictly truncated to the `maxChars` limit. A `[WARN]` log entry is emitted whenever truncation occurs. Added memory optimizations to avoid huge `[]rune` conversions when computing chunks.

Fixes #53
Type of change
Checklist:
- Code formatted with `go fmt ./...`
- Tests run with `go test ./...` and they pass