fix: truncate embed text to prevent context length overflow (issue #53) #55

Merged
doITmagic merged 3 commits into main from fix/issue-53-embed-truncation on Apr 20, 2026

@doITmagic (Owner) commented Apr 16, 2026

Description

Fixes the "llm embedding error: the input length exceeds the context length" error that occurs when indexing large codebases with Ollama (reported in #53).

Root cause: IndexPaths() was concatenating docstring + signature + code into a single embed payload with no size limit. For large symbols (big classes, generated files, long functions), this easily exceeds the context window of embedding models like nomic-embed-text (8 192 tokens).

Fix: Added a buildEmbedText() helper that hard-caps the embed payload at 30 000 characters (~7 700 tokens, leaving ~6% headroom below the 8 192-token window). Metadata (docstring + signature) takes priority and is prepended to the code, but the whole payload, metadata included, is strictly truncated to the maxChars limit, so oversized metadata cannot overflow either. A [WARN] log entry is emitted whenever truncation occurs. The truncation path also avoids large []rune conversions, which reduces memory use on big chunks.
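
The budget figures above can be checked with a short calculation. This is only a sketch: the characters-per-token ratio is the one implied by the PR's own numbers (30 000 characters for about 7 700 tokens), and real tokenizers will vary with the input, so the constant names other than maxEmbedChars and the helper functions are illustrative assumptions.

```go
package main

import "fmt"

const (
	maxEmbedChars = 30_000 // character cap stated in the PR description
	estTokens     = 7_700  // token estimate stated in the PR description (assumed average)
	contextTokens = 8_192  // nomic-embed-text context window, per the PR description
)

// charsPerToken returns the average characters-per-token ratio implied by
// the PR's figures. It is an estimate, not a property of any tokenizer.
func charsPerToken() float64 {
	return float64(maxEmbedChars) / float64(estTokens)
}

// headroomPercent returns how far the estimated token count sits below the
// context window, expressed as a percentage of that window.
func headroomPercent() float64 {
	return float64(contextTokens-estTokens) / float64(contextTokens) * 100
}

func main() {
	fmt.Printf("chars per token: %.2f\n", charsPerToken())
	fmt.Printf("headroom: %.1f%%\n", headroomPercent())
}
```

Under these assumptions the ratio comes out near 3.9 characters per token and the headroom near 6%, which matches the figures quoted in the description. Text that tokenizes more densely than this average (for example, heavily symbolic code) would eat into that margin.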

Fixes #53

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update

Checklist:

  • I have performed a self-review of my own code
  • I have formatted my code with go fmt ./...
  • I have run tests go test ./... and they pass
  • I have verified integration with Ollama/Qdrant (if applicable)
  • I have updated the documentation accordingly

Add buildEmbedText() helper that caps embed payload at 30 000 chars
(~7 700 tokens, safe for 8 192-token models like nomic-embed-text).

Metadata (docstring + signature) is always preserved in full; only
the raw Code body is truncated when the total exceeds the limit.
A [WARN] log is emitted when truncation occurs.

Fixes #53
Copilot AI review requested due to automatic review settings April 16, 2026 17:29
@doITmagic doITmagic self-assigned this Apr 16, 2026

@gemini-code-assist (Bot) left a comment

Code Review

This pull request introduces a truncation mechanism for code chunks to ensure they fit within the embedding model's context window, specifically adding the buildEmbedText function and associated unit tests. A review comment identifies a correctness issue where metadata exceeding the character limit would still cause an overflow and suggests a simplified, more robust truncation logic.

Comment thread internal/ragcode/indexer.go Outdated

Copilot AI left a comment

Pull request overview

Fixes embedding failures during indexing by introducing an embed-text builder that enforces a maximum payload size and logs when truncation occurs.

Changes:

  • Added buildEmbedText() helper and maxEmbedChars limit to cap embedding payload size.
  • Updated Indexer.IndexPaths() to use the capped embed text and emit a warning when truncation happens.
  • Added unit tests validating truncation behavior and boundary conditions.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| internal/ragcode/indexer.go | Builds embed text with a character limit and logs when truncation occurs before calling the embedder. |
| internal/ragcode/indexer_embed_test.go | Adds tests for buildEmbedText() to verify truncation and no-truncation scenarios. |

Comment thread internal/ragcode/indexer.go Outdated
Comment thread internal/ragcode/indexer.go Outdated
Comment on lines +57 to +76
func TestBuildEmbedText_WithDocstring_TruncatesCode(t *testing.T) {
	bigCode := strings.Repeat("y", 40_000)
	ch := codetypes.CodeChunk{
		Signature: "func Fn()",
		Code:      bigCode,
		Docstring: "This is a docstring.",
		FilePath:  "fn.go",
	}
	limit := 30_000
	text, truncated := buildEmbedText(ch, limit)
	if !truncated {
		t.Fatal("expected truncation")
	}
	if !strings.Contains(text, "This is a docstring.") {
		t.Errorf("docstring missing after truncation")
	}
	if len([]rune(text)) > limit {
		t.Errorf("text exceeds limit after truncation")
	}
}

Copilot AI Apr 16, 2026

Test coverage doesn’t currently include the case where metadata alone (very large Docstring or Signature) exceeds the limit. That’s the scenario where buildEmbedText should still guarantee the returned text is <= limit (or clearly define how it behaves). Adding a test for an oversized docstring/signature would prevent regressions and catch the current overflow behavior.

@doITmagic (Owner, Author)

Addressed in cc23127.

Address PR #55 review comments:
- Simplify buildEmbedText() to use single runes[:maxChars] truncation
  (gemini-code-assist suggestion). This guarantees the limit is never
  exceeded, even when metadata alone is larger than maxChars.
- Remove redundant []rune conversions in truncation path (Copilot
  memory concern).
- Add tests for oversized metadata with and without code body (Copilot
  test coverage request).
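
A minimal sketch of the single-slice truncation this commit message describes, together with the oversized-metadata cases it mentions. The chunk type, the newline separators, and the name buildEmbedTextSimple are assumptions made for illustration; they are not taken from the PR's code.

```go
package main

import (
	"fmt"
	"strings"
)

// chunk is a hypothetical stand-in for the project's code chunk type; only
// the three fields that feed the embed text are modelled here.
type chunk struct {
	Docstring string
	Signature string
	Code      string
}

// buildEmbedTextSimple joins metadata and code (newline-separated, an
// assumption) and cuts the result to at most maxChars runes. Because the cut
// is applied to the joined text, the cap holds even when the docstring or
// signature alone is larger than maxChars.
func buildEmbedTextSimple(ch chunk, maxChars int) (string, bool) {
	if maxChars < 0 {
		maxChars = 0
	}
	text := ch.Docstring + "\n" + ch.Signature + "\n" + ch.Code
	runes := []rune(text)
	if len(runes) <= maxChars {
		return text, false
	}
	return string(runes[:maxChars]), true
}

func main() {
	// Oversized docstring, with and without a code body.
	doc := strings.Repeat("d", 40_000)
	for _, ch := range []chunk{
		{Docstring: doc, Signature: "func Fn()", Code: "return"},
		{Docstring: doc, Signature: "func Fn()"},
	} {
		text, truncated := buildEmbedTextSimple(ch, 30_000)
		fmt.Println(len([]rune(text)), truncated) // prints "30000 true" for both
	}
}
```

The trade-off of this form is that it converts the whole joined string to []rune before cutting, which is simple and safe at rune boundaries but allocates in proportion to the full chunk size.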

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comment thread internal/ragcode/indexer.go
Comment thread internal/ragcode/indexer.go Outdated
Avoid massive allocations by not concatenating large code chunks with
metadata before truncation. Instead, compute available space and only
extract the necessary substring. Uses string iteration to safely
truncate at rune boundaries without large []rune casting.
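
The approach in this commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the merged code: the function names truncateRunes and buildEmbedTextSketch, the flat string parameters, and the newline separators are all hypothetical.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// truncateRunes returns the longest prefix of s holding at most n runes, and
// whether anything was cut. Ranging over a string yields the byte offset of
// each rune, so the cut lands on a rune boundary without a []rune copy of s.
func truncateRunes(s string, n int) (string, bool) {
	if n <= 0 {
		return "", s != ""
	}
	count := 0
	for i := range s {
		if count == n {
			return s[:i], true
		}
		count++
	}
	return s, false
}

// buildEmbedTextSketch places metadata first, then spends the remaining
// budget on the code body. The large code string is never concatenated or
// converted in full; only the prefix that fits is copied.
func buildEmbedTextSketch(docstring, signature, code string, maxChars int) (string, bool) {
	meta := docstring + "\n" + signature + "\n" // separator is an assumption
	meta, cut := truncateRunes(meta, maxChars)
	if cut {
		// Metadata alone fills the budget; no room is left for code.
		return meta, true
	}
	remaining := maxChars - utf8.RuneCountInString(meta)
	body, cut := truncateRunes(code, remaining)
	return meta + body, cut
}

func main() {
	code := strings.Repeat("y", 40_000)
	text, cut := buildEmbedTextSketch("This is a docstring.", "func Fn()", code, 30_000)
	fmt.Println(utf8.RuneCountInString(text), cut) // prints "30000 true"
}
```

The only allocations proportional to input size here are the metadata join and the final result, both bounded by the cap once truncation applies, whereas converting a multi-megabyte chunk to []rune would allocate four bytes per character up front.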
@doITmagic doITmagic merged commit 9f7c8ee into main Apr 20, 2026
6 checks passed
@doITmagic doITmagic deleted the fix/issue-53-embed-truncation branch April 20, 2026 02:27
Development

Successfully merging this pull request may close these issues.

"llm embedding error: the input length exceeds the context length"

2 participants