Implement Sequence ID pooling in BatchedExecutor to prevent native SeqMax overflow and crashes#1386
Merged
Merged
Conversation
…qMax overflow and crashes
martindevans
approved these changes
May 14, 2026
martindevans
left a comment
Member
There was a problem hiding this comment.
Looks good, thanks for fixing this.
Contributor
Author
|
Martin, when you merge this could you please also check the "llama.cpp @ 73c9eb8" link in the master branch. I think you forgot to update it to the version applied with the former main update. Thanks a lot! |
Member
Oops, I keep forgetting to do that. I fixed it in #1387 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
The Problem
Under sustained traffic or continuous batching scenarios, the
BatchedExecutoreventually causes the nativellama.cppbackend to deadlock, reject tokens, or throw a segmentation fault.The Root Cause
Currently,
BatchedExecutorassigns sequence IDs using a strictly incrementing counter (_nextSequenceId++). When aConversationis disposed, its tokens are properly removed from the KV cache viaMemorySequenceRemove, but its Sequence ID is permanently abandoned.In
llama.cpp, theSeqMax(n_seq_max) parameter dictates the static allocation of arrays within the native KV cache. It expects sequence IDs to act as strictly bounded array indices (from0toSeqMax - 1). Once the C#_nextSequenceIdcounter exceedsSeqMax, passing that ID to the native backend results in an out-of-bounds memory access.The Solution
This PR replaces the strictly incrementing counter with a Sequence ID pool.
BatchedExecutor: Updated to track active IDs (e.g., using a thread-safeHashSetor similar structure) and assign the lowest available sequence ID. Added aReleaseSequenceIdmethod.Conversation: Updated theDispose()method to callExecutor.ReleaseSequenceId(ConversationId)after clearing the KV cache.Benefits
SeqMaxlimit as long as concurrent conversations stay within bounds.BatchedExecutorto run continuously under high-traffic workloads without requiring the host application to dangerously destroy and recreate the massive native context to reset the ID counter.