fix: prevent segfault when input exceeds batch size #21

Open
vlasky wants to merge 1 commit into asg017:main from vlasky:fix-batch-overflow-segfault

Conversation


@vlasky vlasky commented Dec 4, 2025

Summary

  • Fix buffer overflow segfault when input tokenizes to more than 512 tokens (the hardcoded batch capacity)
  • Size batch dynamically to actual token count instead of fixed 512
  • Add bounds check with actionable error message when token count exceeds model's context size
  • Fix memory leak: free tokens array after use

Details

The batch was initialized with a fixed capacity of 512 tokens, but the loop populating it had no bounds check. When an input tokenized to more than 512 tokens, this caused a buffer overflow and segmentation fault.

Resolves #20

Test plan

  • Verify extension builds successfully
  • Test with inputs that tokenize to more than 512 tokens; the call should return an error instead of crashing
  • Existing tests continue to pass

The batch was initialized with a fixed capacity of 512 tokens, but the
loop populating it had no bounds check. When processing documents with
more than 512 tokens, this caused a buffer overflow and segmentation
fault.

Changes:
- Size batch to actual token_count instead of fixed 512
- Add bounds check: error if token_count > context size
- Return actionable error message with token count and limit
- Fix memory leak: free tokens array after use
- Add specific error messages for decode/embedding failures

Fixes asg017#20

Co-Authored-By: Claude <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

Segmentation fault core dumped computing lembed for certain values
