support continous batching for eagle3 by xufang-lisa · Pull Request #3321 · openvinotoolkit/openvino.genai

xufang-lisa · 2026-02-11T06:38:22Z

Description

Support continuous batching with multiple prompts for the EAGLE3 pipeline.

Checklist:

This PR follows GenAI Contributing guidelines.
Tests have been updated or added to cover the new code.
This PR fully addresses the ticket.
I have made corresponding changes to the documentation.

Copilot

Pull request overview

This PR adds support for continuous batching with multiple prompts to the EAGLE3 speculative decoding pipeline. The implementation ensures that the draft model and main model process the same chunks by coordinating prefill completion across all requests.

Changes:

Added synchronization logic to pause generation until all requests complete prefill in EAGLE3 mode
Modified hidden state handling to support partial tensor copying for mismatched sequence lengths
Extended test coverage to validate multiple prompt batching scenarios

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File	Description
tests/python_tests/test_continuous_batching.py	Updated tests to support parameterized multi-prompt batching scenarios for both speculative decoding and EAGLE3
src/cpp/src/speculative_decoding/continuous_batching/pipeline_impl.cpp	Added prefill synchronization logic and modified draft model execution control flow
src/cpp/src/sequence_group.hpp	Added `has_finished_prefill()` method to check if any sequence has begun generation
src/cpp/src/continuous_batching/model_runner.hpp	Enhanced hidden state handling with partial tensor copying for mismatched sequence lengths

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

Comments suppressed due to low confidence (1)

src/cpp/src/speculative_decoding/continuous_batching/pipeline_impl.cpp:1

Switching from to_generate |= ... to to_generate &= ... changes semantics from “any request can generate” to “all requests can generate”. With this change, to_generate must be initialized to true immediately before the loop; otherwise &= can preserve a stale value from previous iterations (or remain false), preventing the draft model from running when it should.

// Copyright (C) 2023-2026 Intel Corporation

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (4)

src/cpp/src/speculative_decoding/continuous_batching/pipeline_impl.cpp:1

to_generate was previously initialized for an OR-reduction; after switching to an AND-reduction (&=), it must be initialized to true before the loop. As written, to_generate may be uninitialized/stale (or remain false if previously set), which can incorrectly suppress draft-model execution or lead to inconsistent behavior. Initialize to_generate = true immediately before the loop (or refactor to a clearly-scoped reduction variable).

// Copyright (C) 2023-2026 Intel Corporation
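The initialization concern above can be illustrated with a minimal sketch. `RequestState` and both function names are hypothetical stand-ins and are not taken from `pipeline_impl.cpp`; the only point is that each reduction starts from its identity value (`true` for AND, `false` for OR) in a freshly scoped variable.

```cpp
#include <vector>

// Hypothetical stand-in for a per-request readiness flag; not the pipeline's real type.
struct RequestState {
    bool can_generate;
};

// AND-reduction: true only if every request can generate.
// The accumulator starts at true (the identity for AND) on every call,
// so no value from a previous step can leak in.
bool all_can_generate(const std::vector<RequestState>& requests) {
    bool to_generate = true;
    for (const auto& request : requests) {
        to_generate &= request.can_generate;
    }
    return to_generate;
}

// OR-reduction for comparison: the identity is false.
bool any_can_generate(const std::vector<RequestState>& requests) {
    bool to_generate = false;
    for (const auto& request : requests) {
        to_generate |= request.can_generate;
    }
    return to_generate;
}
```

With an empty request list the AND form returns true and the OR form returns false, so a caller may want to handle that case explicitly.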

src/cpp/src/sequence_group.hpp:466

has_finished_prefill() returns true only after at least one token has been generated (generated_len > 0), which is typically after prefill has already completed. Since pipeline synchronization relies on this to detect prefill completion, this can fail to pause the earliest-finished request at the right time. Consider basing this on prompt-consumption state (e.g., the same internal condition used to allow generation after prompt processing, without depending on generated_len).

    bool has_finished_prefill() const {
        for (auto& sequence : get_running_sequences()) {
            if (sequence->get_generated_len() > 0) {
                return true;
            }
        }

        return false;
    }
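To make the difference between the two conditions concrete, here is a small sketch. `MockSequenceGroup`, its fields, and `prompt_consumed()` are assumptions made for illustration; they do not mirror the real `SequenceGroup` class or its members.

```cpp
#include <cstddef>

// Hypothetical, simplified model of a sequence group; field names are
// illustrative and do not mirror the real SequenceGroup class.
struct MockSequenceGroup {
    std::size_t prompt_len;        // number of prompt tokens
    std::size_t processed_tokens;  // prompt tokens already run through the model
    std::size_t generated_len;     // tokens sampled so far

    // Variant quoted in the review: true only after a token was generated.
    bool has_generated_tokens() const {
        return generated_len > 0;
    }

    // Alternative: true as soon as the whole prompt has been consumed,
    // even if no token has been sampled yet.
    bool prompt_consumed() const {
        return processed_tokens >= prompt_len;
    }
};
```

For a group whose prompt is fully processed but which has not sampled a token yet, `prompt_consumed()` is already true while `has_generated_tokens()` is still false, which is the window the comment is worried about.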

tests/python_tests/test_continuous_batching.py:697

With the new @pytest.mark.parametrize cases, this helper will re-download/re-convert the same models multiple times per test function, which can significantly lengthen CI/runtime. Consider moving model download/convert into a cached fixture (e.g., scope='session') or passing precomputed main_model_path/draft_model_path into this helper so parametrization doesn’t multiply conversion cost.

def compare_results_for_dynamic_split_fuse_config(main_model_id, draft_model_id, prompts):
    main_model_path = download_and_convert_model(main_model_id).models_path
    draft_model_path = download_and_convert_model(draft_model_id).models_path

tests/python_tests/test_continuous_batching.py:728

This index-based loop can be simplified and made clearer by iterating over both lists directly (e.g., for reference, generated in zip(...)). That also avoids manual indexing and makes failures easier to read if you add pytest assertion messages later.

    for i in range(0, result_len_ref):
        reference = result_ref.texts[i]
        generated = result_gen.texts[i]
        assert generated == reference

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

songbell · 2026-03-18T06:18:27Z

                    size_t stored_hidden_size = stored_shape[stored_shape.size() - 1];

                    OPENVINO_ASSERT(stored_hidden_size == hidden_size, "Target state hidden size does not match the expected size for Eagle3 draft model inference.");
-                    OPENVINO_ASSERT(stored_seq_len == total_num_tokens, "Target state sequence length does not match the expected length for Eagle3 draft model inference.");


stored_seq_len == num_scheduled_tokens

songbell · 2026-03-18T06:21:41Z

+                    if (stored_seq_len == total_num_tokens) {
+                        hidden_state_input = stored_hidden_state;
+                    } else {
+                        size_t copy_length = std::min(stored_seq_len, num_scheduled_tokens);


Suggested change

size_t copy_length = std::min(stored_seq_len, num_scheduled_tokens);

const size_t copy_length = std::min(stored_seq_len, num_scheduled_tokens);

songbell · 2026-03-18T06:21:51Z

+                        hidden_state_input = stored_hidden_state;
+                    } else {
+                        size_t copy_length = std::min(stored_seq_len, num_scheduled_tokens);
+                        size_t source_start_idx = stored_seq_len - copy_length;


Suggested change

size_t source_start_idx = stored_seq_len - copy_length;

const size_t source_start_idx = stored_seq_len - copy_length;

songbell · 2026-03-18T07:03:37Z

        return m_max_content_len + m_num_validation_tokens >= get_prompt_len() && !m_is_gen_paused;
    }

+    bool has_generated_tokens() const {


looks like we don't need a specific API only for limited usage?

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

Copilot · 2026-03-23T01:11:22Z

+                    if (stored_seq_len == total_num_tokens) {
+                        hidden_state_input = stored_hidden_state;
+                    } else {
+                        size_t copy_length = std::min(stored_seq_len, num_scheduled_tokens);


In the Eagle3 hidden-state import path, the fallback branch copies only min(stored_seq_len, num_scheduled_tokens) rows when stored_seq_len != total_num_tokens. If stored_seq_len < num_scheduled_tokens, the remaining scheduled tokens keep zeroed hidden states, which can silently corrupt draft-model inference. Add an explicit precondition (e.g., OPENVINO_ASSERT(stored_seq_len >= num_scheduled_tokens, ...) or ==) and fail fast if the stored hidden-state slice is shorter than the scheduled chunk.

Suggested change

size_t copy_length = std::min(stored_seq_len, num_scheduled_tokens);

OPENVINO_ASSERT(stored_seq_len >= num_scheduled_tokens,

"Stored hidden state length is shorter than the scheduled token chunk "

"in Eagle3 draft model inference.");

size_t copy_length = num_scheduled_tokens;
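A minimal sketch of the fail-fast variant, using a flat `std::vector<float>` in place of the tensor type. The function name `take_last_rows`, its signature, and the use of exceptions instead of `OPENVINO_ASSERT` are assumptions made to keep the example self-contained; the start index follows the `stored_seq_len - copy_length` expression from the diff.

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Copies the last `num_scheduled_tokens` rows of a row-major
// [stored_seq_len x hidden_size] buffer. Plain std::vector stands in for the
// tensor type; the function name and signature are hypothetical.
std::vector<float> take_last_rows(const std::vector<float>& stored,
                                  std::size_t stored_seq_len,
                                  std::size_t hidden_size,
                                  std::size_t num_scheduled_tokens) {
    if (stored.size() != stored_seq_len * hidden_size) {
        throw std::invalid_argument("buffer size does not match its shape");
    }
    // Fail fast instead of leaving trailing rows zero-filled.
    if (stored_seq_len < num_scheduled_tokens) {
        throw std::invalid_argument("stored hidden state is shorter than the scheduled chunk");
    }
    const std::size_t copy_length = num_scheduled_tokens;
    const std::size_t source_start_idx = stored_seq_len - copy_length;

    std::vector<float> out(copy_length * hidden_size);
    std::copy_n(stored.begin() + static_cast<std::ptrdiff_t>(source_start_idx * hidden_size),
                copy_length * hidden_size,
                out.begin());
    return out;
}
```

Once the precondition holds, `std::min` is no longer needed because `copy_length` always equals `num_scheduled_tokens`.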

songbell · 2026-04-02T02:08:15Z

                                                                                         bool is_update_logit_processor) {
    UpdateRequestResult result{0, 0};
+    // Check if all requests have completed pre-filling
+    bool all_prefill_finished = true;


Suggested change

bool all_prefill_finished = true;

bool any_other_prefill_unfinished = false;

looks clearer?

songbell · 2026-04-02T02:08:51Z

+    for (auto& request : m_requests) {
+        const bool finished = request->has_generated_tokens();
+        if (!finished) {
+            all_prefill_finished = false;


can we break out of the loop early once it is set?
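Both review suggestions (the inverted flag name and the early exit) can be sketched together. `MockRequest` is a hypothetical type that models only the flag used by the check; the real request class and loop body in the PR may differ.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical request type; only the flag needed for the check is modelled.
struct MockRequest {
    bool generated_tokens;
    bool has_generated_tokens() const { return generated_tokens; }
};

// Explicit loop that stops at the first request still in prefill.
bool any_prefill_unfinished_loop(const std::vector<MockRequest>& requests) {
    bool any_prefill_unfinished = false;
    for (const auto& request : requests) {
        if (!request.has_generated_tokens()) {
            any_prefill_unfinished = true;
            break;  // one unfinished request is enough to decide
        }
    }
    return any_prefill_unfinished;
}

// Same check with std::any_of, which also stops at the first match.
bool any_prefill_unfinished(const std::vector<MockRequest>& requests) {
    return std::any_of(requests.begin(), requests.end(),
                       [](const MockRequest& r) { return !r.has_generated_tokens(); });
}
```

The `std::any_of` form keeps the flag out of the enclosing scope entirely, which may also address the naming question.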

github-actions · 2026-04-20T00:21:29Z

This PR will be closed in a week because of 2 weeks of no activity.

github-actions · 2026-04-27T00:23:48Z

Closing due to inactivity. Feel free to reopen if you plan to continue working on this.

peterchen-intel · 2026-05-15T07:57:25Z

Replaced by #3812

xufang-lisa added 2 commits February 11, 2026 13:51

support continous batching for eagle3

820717d

Merge branch 'master' into xufang/support_continous_batching_for_eagle3

ce82532

Copilot AI review requested due to automatic review settings February 11, 2026 06:38

github-actions Bot added the labels category: continuous batching, category: speculative decoding, no-match-files, and category: GGUF on Feb 11, 2026

Copilot AI reviewed Feb 11, 2026

Comment thread tests/python_tests/test_continuous_batching.py Outdated

Comment thread src/cpp/src/sequence_group.hpp Outdated

Comment thread src/cpp/src/continuous_batching/model_runner.hpp

Comment thread src/cpp/src/speculative_decoding/continuous_batching/pipeline_impl.cpp Outdated

xufang-lisa added 2 commits February 11, 2026 14:54

update comments

e70e3e4

fix format

7e47180

Copilot AI review requested due to automatic review settings February 11, 2026 07:06

Copilot AI reviewed Feb 11, 2026

xufang-lisa and others added 3 commits February 27, 2026 03:04

Merge branch 'master' into xufang/support_continous_batching_for_eagle3

d56301f

add an empty tensor check

7764ec0

Update comment

568765d

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings February 28, 2026 08:02

rename function name

61414ca

Copilot AI reviewed Feb 28, 2026

Copilot started reviewing on behalf of xufang-lisa February 28, 2026 08:26

xufang-lisa added 2 commits March 4, 2026 00:57

Merge branch 'master' into xufang/support_continous_batching_for_eagle3

04ad840

Merge branch 'master' into xufang/support_continous_batching_for_eagle3

162c6a6

Copilot AI review requested due to automatic review settings March 11, 2026 01:42

Copilot started reviewing on behalf of xufang-lisa March 11, 2026 01:43

Copilot AI reviewed Mar 11, 2026

Comment thread src/cpp/src/speculative_decoding/continuous_batching/pipeline_impl.cpp

Comment thread src/cpp/src/speculative_decoding/continuous_batching/pipeline_impl.cpp

Comment thread src/cpp/src/sequence_group.hpp Outdated

Comment thread tests/python_tests/test_continuous_batching.py

update comments

14e7d35

songbell reviewed Mar 18, 2026

Merge branch 'master' into xufang/support_continous_batching_for_eagle3

8ece2e9

Copilot AI review requested due to automatic review settings March 23, 2026 01:06

Copilot started reviewing on behalf of xufang-lisa March 23, 2026 01:06

Copilot AI reviewed Mar 23, 2026

Merge branch 'openvinotoolkit:master' into xufang/support_continous_batching_for_eagle3

5fa3bea

songbell reviewed Apr 2, 2026

github-actions Bot added the Stale label Apr 20, 2026

github-actions Bot closed this Apr 27, 2026
