
align prompt chunks for eagle3#3812

Open
xufang-lisa wants to merge 16 commits into openvinotoolkit:master from xufang-lisa:xufang/align_prompt_chunks_for_eagle3

@xufang-lisa xufang-lisa commented May 7, 2026

Description

This pull request improves the speculative decoding pipeline in "dynamic_split_fuse" mode. It keeps the main and draft models aligned during prompt processing, adds scheduler controls for the number of tokens scheduled per request, and extends test coverage to multi-prompt and batch scenarios.

Speculative decoding alignment and scheduler improvements:

  • Added tracking of the expected number of scheduled tokens per request in the Scheduler class, with methods to set, get, and clear these expectations. This allows the pipeline to precisely control how many tokens each sequence should process, especially in dynamic speculative decoding scenarios. [1] [2] [3] [4]
  • During speculative decoding, the pipeline now explicitly sets the expected number of tokens the draft model should process to match the main model, ensuring proper alignment and preventing the draft model from getting ahead. [1] [2]
  • Modified the logic for pausing draft model generation to include additional safety checks and to handle edge cases where the main model hasn't started processing or when prompt processing is incomplete. [1] [2]
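
The expected-token bookkeeping described in the bullets above can be pictured with a toy model. This is a minimal Python sketch under stated assumptions, not the C++ Scheduler from this PR: the class name `ToyScheduler`, the dict-based storage, the `prompt_chunk_size` helper, and the method name `clear_expected_num_scheduled_tokens` are hypothetical (the description names the set and get methods; the clear name is chosen by analogy). It also assumes an override is capped by the tokens actually available.

```python
from typing import Dict, Optional


class ToyScheduler:
    """Toy stand-in for a scheduler that accepts per-request chunk-size overrides."""

    def __init__(self) -> None:
        # request_id -> number of tokens the request is expected to process next step
        self._expected: Dict[int, int] = {}

    def set_expected_num_scheduled_tokens(self, request_id: int, num_tokens: int) -> None:
        self._expected[request_id] = num_tokens

    def get_expected_num_scheduled_tokens(self, request_id: int) -> Optional[int]:
        return self._expected.get(request_id)

    def clear_expected_num_scheduled_tokens(self) -> None:  # hypothetical name
        self._expected.clear()

    def prompt_chunk_size(self, request_id: int, remaining_prompt: int, free_budget: int) -> int:
        """Chunk size for a request that is still in the prompt phase."""
        default = min(remaining_prompt, free_budget)
        expected = self.get_expected_num_scheduled_tokens(request_id)
        # Assumption of this sketch: an override never exceeds what is available.
        return default if expected is None else min(expected, default)


sched = ToyScheduler()
default_chunk = sched.prompt_chunk_size(request_id=1, remaining_prompt=20, free_budget=6)
sched.set_expected_num_scheduled_tokens(request_id=1, num_tokens=3)
aligned_chunk = sched.prompt_chunk_size(request_id=1, remaining_prompt=20, free_budget=6)
sched.clear_expected_num_scheduled_tokens()
```

Without an override the request takes the whole free budget; once an expectation is set, the chunk size follows it until the expectations are cleared.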

Example (concise):
With max_num_batched_tokens = 8 and num_assistant_tokens = 4:
At step t, RA (request A) is in the generate phase and RB (request B) is in the prompt phase.

Before this PR:

  • Main schedules:
    RA = 5 generation tokens
    RB = 3 prompt tokens (8-5) -------> output hidden_states.shape[0]=3
  • In draft multistep step1:
    RA = 2 generation tokens (after main's validation, inserted=1 and removed=4)
    RB = 6 prompt tokens (8-2) --------> needs to import hidden_states.shape[0]=6
    The draft model therefore expects hidden_states with shape[0]=6 for RB, but the main model produced only shape[0]=3.

What this PR fixes:

  • Main schedules:
    RA = 5 generation tokens
    RB = 3 prompt tokens (8-5) (main explicitly syncs the expected prompt chunk size for RB to draft via set_expected_num_scheduled_tokens)
  • In draft multistep step1:
    RA = 2 generation tokens
    RB = 3 prompt tokens (draft schedules RB with the same chunk size as main via get_expected_num_scheduled_tokens)
    Main and draft stay token-aligned during prefill, so the hidden_states shapes match.
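
The arithmetic in the example can be replayed with a few lines of Python. This is only a sketch of the token-budget reasoning, not the scheduler code: the function `schedule_step` and its parameters are hypothetical, and it assumes RB still has at least 6 prompt tokens left, so that any leftover budget goes to RB.

```python
from typing import Dict, Optional

MAX_NUM_BATCHED_TOKENS = 8  # value used in the example


def schedule_step(generation_tokens: int,
                  expected_prompt_chunk: Optional[int] = None) -> Dict[str, int]:
    """Tokens scheduled in one step for RA (generate phase) and RB (prompt phase)."""
    leftover = MAX_NUM_BATCHED_TOKENS - generation_tokens
    if expected_prompt_chunk is None:
        rb_tokens = leftover  # RB greedily fills the remaining budget
    else:
        rb_tokens = min(expected_prompt_chunk, leftover)  # RB follows the synced size
    return {"RA": generation_tokens, "RB": rb_tokens}


main_step = schedule_step(generation_tokens=5)
draft_before = schedule_step(generation_tokens=2)
draft_after = schedule_step(generation_tokens=2,
                            expected_prompt_chunk=main_step["RB"])

print("main:        ", main_step)
print("draft before:", draft_before)
print("draft after: ", draft_after)
```

In the "before" case the draft's RB chunk (6) differs from main's (3), which is the hidden_states shape mismatch; in the "after" case both are 3.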

CVS-179148

Checklist:

  • This PR follows GenAI Contributing guidelines.
  • Tests have been updated or added to cover the new code.
  • This PR fully addresses the ticket.
  • I have made corresponding changes to the documentation.

@github-actions github-actions Bot added the labels category: continuous batching, category: speculative decoding, and category: GGUF on May 7, 2026
Copilot AI review requested due to automatic review settings May 7, 2026 06:07

Copilot AI left a comment

Pull request overview

This PR improves speculative decoding behavior (notably Eagle3 + dynamic_split_fuse) by keeping main/draft pipelines token-aligned during prompt chunking, adding scheduler-side controls for externally enforced prompt chunk sizes, and expanding Python test coverage to multi-prompt batching scenarios.

Changes:

  • Propagates “num processed tokens” from the main pipeline into speculative candidates and uses it to synchronize prompt chunk sizes between main and draft.
  • Adds a per-request “expected number of scheduled tokens” mechanism to the continuous batching Scheduler to enforce alignment in dynamic_split_fuse.
  • Extends continuous batching Python tests to cover both single- and multi-prompt cases with different max_num_batched_tokens values.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| tests/python_tests/test_continuous_batching.py | Parametrizes dynamic-split-fuse tests for single vs multi-prompt batches. |
| src/cpp/src/speculative_decoding/continuous_batching/update_request_structs.hpp | Extends GeneratedSequence with num_processed_tokens for alignment logic. |
| src/cpp/src/speculative_decoding/continuous_batching/pipeline_impl.cpp | Sets/uses expected prompt chunk sizes and adds additional draft pausing safety checks. |
| src/cpp/src/continuous_batching/scheduler.hpp | Introduces per-request expected scheduling size overrides during prompt scheduling. |
| src/cpp/src/continuous_batching/pipeline.cpp | Removes outdated comment claiming Eagle3 can’t use dynamic_split_fuse. |
| src/cpp/src/continuous_batching/model_runner.hpp | Adjusts Eagle3 hidden-state import checks/handling for chunked prompt scheduling. |

Comment thread src/cpp/src/speculative_decoding/continuous_batching/pipeline_impl.cpp Outdated
Comment thread src/cpp/src/continuous_batching/scheduler.hpp Outdated
Comment thread tests/python_tests/test_continuous_batching.py
Copilot AI review requested due to automatic review settings May 7, 2026 08:48

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Comment thread src/cpp/src/speculative_decoding/continuous_batching/pipeline_impl.cpp Outdated
Comment thread src/cpp/src/continuous_batching/scheduler.hpp Outdated
Comment thread tests/python_tests/test_continuous_batching.py
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 8, 2026 01:21

Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Comment thread src/cpp/src/continuous_batching/scheduler.hpp Outdated
Comment thread src/cpp/src/continuous_batching/scheduler.hpp Outdated
Comment thread src/cpp/src/speculative_decoding/continuous_batching/pipeline_impl.cpp Outdated
Comment thread src/cpp/src/speculative_decoding/continuous_batching/pipeline_impl.cpp Outdated
@peterchen-intel peterchen-intel requested a review from songbell May 14, 2026 00:23
Copilot AI review requested due to automatic review settings May 14, 2026 05:12

Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.

Comment thread src/cpp/src/speculative_decoding/continuous_batching/pipeline_impl.cpp Outdated
Comment thread src/cpp/src/speculative_decoding/continuous_batching/pipeline_impl.cpp Outdated
Comment thread src/cpp/src/continuous_batching/scheduler.hpp Outdated
Copilot AI review requested due to automatic review settings May 15, 2026 03:20

Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Comment thread src/cpp/src/speculative_decoding/continuous_batching/pipeline_impl.cpp Outdated
Comment thread src/cpp/src/speculative_decoding/continuous_batching/pipeline_impl.cpp Outdated
Comment thread src/cpp/src/speculative_decoding/continuous_batching/pipeline_impl.cpp Outdated

@songbell songbell left a comment

Overall the solution looks good to me. The Copilot comments are worth considering to make the condition checks clearer.

Comment thread src/cpp/src/speculative_decoding/continuous_batching/pipeline_impl.cpp Outdated
Co-authored-by: yanlan song <bell.song@intel.com>
Copilot AI review requested due to automatic review settings May 20, 2026 01:45

Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Comment thread src/cpp/src/continuous_batching/scheduler.hpp
@peterchen-intel peterchen-intel marked this pull request as ready for review May 21, 2026 06:41
Copilot AI review requested due to automatic review settings May 21, 2026 06:41
@peterchen-intel peterchen-intel requested a review from popovaan as a code owner May 21, 2026 06:41

Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Comment thread src/cpp/src/continuous_batching/model_runner.hpp Outdated

5 participants