align prompt chunks for eagle3 #3812
Open
xufang-lisa wants to merge 16 commits into
Pull request overview
This PR improves speculative decoding behavior (notably Eagle3 + dynamic_split_fuse) by keeping main/draft pipelines token-aligned during prompt chunking, adding scheduler-side controls for externally enforced prompt chunk sizes, and expanding Python test coverage to multi-prompt batching scenarios.
Changes:
- Propagates “num processed tokens” from the main pipeline into speculative candidates and uses it to synchronize prompt chunk sizes between main and draft.
- Adds a per-request “expected number of scheduled tokens” mechanism to the continuous batching `Scheduler` to enforce alignment in `dynamic_split_fuse`.
- Extends continuous batching Python tests to cover both single- and multi-prompt cases with different `max_num_batched_tokens` values.
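The per-request override in the second bullet can be pictured with a small Python model. This is only a sketch under stated assumptions: `ExpectedChunkSizes` and `schedule_prompt_chunk` are hypothetical names, not the C++ `Scheduler` API, and capping the override by the free token budget is an assumption rather than something the PR states.

```python
from typing import Dict, Optional


class ExpectedChunkSizes:
    """Toy stand-in for the per-request override (hypothetical, not the real Scheduler)."""

    def __init__(self) -> None:
        self._expected: Dict[int, int] = {}

    def set(self, request_id: int, num_tokens: int) -> None:
        if num_tokens <= 0:
            raise ValueError("expected chunk size must be positive")
        self._expected[request_id] = num_tokens

    def get(self, request_id: int) -> Optional[int]:
        return self._expected.get(request_id)

    def clear(self, request_id: int) -> None:
        # Clearing an unknown request is a no-op.
        self._expected.pop(request_id, None)


def schedule_prompt_chunk(
    request_id: int,
    remaining_prompt: int,
    free_budget: int,
    overrides: ExpectedChunkSizes,
) -> int:
    """Return how many prompt tokens to schedule for one request in this step."""
    default_chunk = min(remaining_prompt, free_budget)
    expected = overrides.get(request_id)
    if expected is None:
        return default_chunk
    # Assumption: an externally enforced size never exceeds the default limits.
    return min(expected, default_chunk)
```

Without an override the request takes whatever budget is free; once a size is set, the same call returns the enforced size until it is cleared.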
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/python_tests/test_continuous_batching.py | Parametrizes dynamic-split-fuse tests for single vs multi-prompt batches. |
| src/cpp/src/speculative_decoding/continuous_batching/update_request_structs.hpp | Extends GeneratedSequence with num_processed_tokens for alignment logic. |
| src/cpp/src/speculative_decoding/continuous_batching/pipeline_impl.cpp | Sets/uses expected prompt chunk sizes and adds additional draft pausing safety checks. |
| src/cpp/src/continuous_batching/scheduler.hpp | Introduces per-request expected scheduling size overrides during prompt scheduling. |
| src/cpp/src/continuous_batching/pipeline.cpp | Removes outdated comment claiming Eagle3 can’t use dynamic_split_fuse. |
| src/cpp/src/continuous_batching/model_runner.hpp | Adjusts Eagle3 hidden-state import checks/handling for chunked prompt scheduling. |
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
songbell reviewed on May 15, 2026
songbell reviewed on May 18, 2026
songbell approved these changes on May 18, 2026
songbell reviewed on May 20, 2026
Co-authored-by: yanlan song <bell.song@intel.com>
Description
This pull request improves the speculative decoding pipeline in "dynamic_split_fuse" mode. The changes keep the main and draft models aligned during prompt processing, add scheduler controls for per-request prompt chunk sizes, and extend test coverage to multi-prompt batches.
Speculative decoding alignment and scheduler improvements:
- Adds per-request expected scheduled-token counts to the `Scheduler` class, with methods to set, get, and clear these expectations. This allows the pipeline to control precisely how many tokens each sequence processes, especially in dynamic speculative decoding scenarios.

Example (concise):
With `max_num_batched_tokens = 8` and `num_assistant_tokens = 4`:

At step t, `RA` (request A) is in the generate phase and `RB` (request B) is in the prompt phase.

Before this PR:
- Main: `RA` = 5 generation tokens, `RB` = 3 prompt tokens (8-5), so the output has `hidden_states.shape[0]=3`.
- Draft, multistep step1: `RA` = 2 generation tokens (after main's validation, `inserted=1` and `removed=4`), `RB` = 6 prompt tokens (8-2), so it needs to import `hidden_states.shape[0]=6`.

In this case, the shape of `hidden_states` expected by the draft model does not match the shape produced by the main model.

What this PR fixes:
- Main: `RA` = 5 generation tokens, `RB` = 3 prompt tokens (8-5). Main explicitly syncs the expected prompt chunk size for `RB` to draft via `set_expected_num_scheduled_tokens`.
- Draft, multistep step1: `RA` = 2 generation tokens, `RB` = 3 prompt tokens. Draft schedules `RB` with the same chunk size as main via `get_expected_num_scheduled_tokens`.

Main and draft stay token-aligned during prefill, avoiding the mismatch.
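The arithmetic in this example can be replayed in a few lines of Python. This is a toy model, not the scheduler code: `prompt_chunk` is a hypothetical helper, and `remaining_prompt = 100` is an assumed value chosen only so that `RB`'s prompt is longer than any single chunk.

```python
from typing import Optional

MAX_NUM_BATCHED_TOKENS = 8  # value used in the example


def prompt_chunk(
    generation_tokens: int,
    remaining_prompt: int,
    expected: Optional[int] = None,
) -> int:
    """Prompt tokens scheduled for RB after RA's generation tokens take their share."""
    free = MAX_NUM_BATCHED_TOKENS - generation_tokens
    chunk = min(free, remaining_prompt)
    # With an expected size, the chunk is pinned to it (assumed to fit the budget).
    return chunk if expected is None else min(chunk, expected)


remaining_prompt = 100  # assumption: RB still has a long prompt to process

# Main: RA contributes 5 generation tokens, so RB gets 8 - 5 = 3 prompt tokens.
main_chunk = prompt_chunk(generation_tokens=5, remaining_prompt=remaining_prompt)

# Draft before the fix: RA contributes 2 tokens, so RB gets 8 - 2 = 6 prompt tokens.
draft_before = prompt_chunk(generation_tokens=2, remaining_prompt=remaining_prompt)

# Draft after the fix: RB is pinned to the chunk size main used.
draft_after = prompt_chunk(
    generation_tokens=2, remaining_prompt=remaining_prompt, expected=main_chunk
)

print(main_chunk, draft_before, draft_after)  # 3 6 3
```

The mismatch is `draft_before != main_chunk`; pinning the draft to main's chunk size makes the number of hidden-state rows agree.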
CVS-179148
Checklist: