[Dataflow Streaming] [Multi Key] Prepare StreamingModeExecutionContext and StreamingWorkScheduler for multi-key execution. by arunpandianp · Pull Request #38814 · apache/beam

arunpandianp · 2026-06-04T11:08:08Z

Adds interfaces and methods for advancing work items in a multi key bundle. Nothing functional changes yet. Following PRs will update failure handling logic and implement StreamingModeExecutionContext::advance to poll more items from BoundedQueueExecutor.

…oring for multi-key execution.

arunpandianp · 2026-06-04T11:09:16Z

R: @scwhittle

github-actions · 2026-06-04T11:10:33Z

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment `assign set of reviewers`.

arunpandianp · 2026-06-04T11:10:52Z

@scwhittle FYI on this, I'll update after validating presubmits and when it is ready for review.

gemini-code-assist · 2026-06-04T11:11:56Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces the foundational infrastructure required for multi-key execution in Dataflow streaming. By updating the execution context, scheduler, and reader iterators, the changes enable the system to process multiple keys within a single bundle. These modifications are primarily structural, setting the stage for future PRs that will implement the specific failure handling and queue polling logic required for full multi-key support.

Highlights

Refactored StreamingModeExecutionContext: Added support for multi-key execution by introducing methods like startForNewKey and advance, allowing the context to manage state transitions between keys in a single bundle.
Updated StreamingWorkScheduler: Refactored the scheduler to handle multi-key work batches, updating execution signatures and processing logic to support batch-level cleanup and statistics recording.
Enhanced Iterator Logic: Updated WindmillReaderIteratorBase and WindowingWindmillReader to support key chaining, enabling the iterator to transition seamlessly between keys during work advancement.
Failure Propagation: Added an onFailureListener to the Work class to improve failure propagation and cleanup across multi-key execution cycles.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request refactors the Dataflow streaming execution context and work processing to support batching and key chaining, allowing the worker to advance to the next key in a group when elements are exhausted. Key changes include updating StreamingModeExecutionContext to manage multiple executed works and output builders, refactoring StreamingWorkScheduler to handle work batches, and updating reader iterators to support key transitions. The review feedback highlights several critical improvement opportunities: a potential memory leak in StreamingModeExecutionContext.clear() from not nullifying key and work references, a deadlock risk in Work caused by executing listener callbacks inside synchronized blocks, a potential NullPointerException in exception logging when getWorkItem() is null, and unsafe mutation of a shared outputBuilder in StreamingWorkScheduler.

gemini-code-assist · 2026-06-04T11:13:43Z

+      } catch (IOException e) {
+        LOG.warn(
+            "Failed to close reader for {}-{}", computationId, getWorkItem().getShardingKey(), e);
      }


If getWorkItem() returns null (which is possible if the context is cleared or not yet initialized), calling getWorkItem().getShardingKey() inside the catch block will throw a NullPointerException. This will mask the original IOException thrown by activeReader.close(), making debugging very difficult.

Please add a null check for getWorkItem() before accessing the sharding key.

      } catch (IOException e) {
        Windmill.WorkItem workItem = getWorkItem();
        long shardingKey = workItem != null ? workItem.getShardingKey() : -1L;
        LOG.warn("Failed to close reader for {}-{}", computationId, shardingKey, e);
      }

arunpandianp · 2026-06-05T04:44:26Z

R: @scwhittle this is ready for review. Thanks!

scwhittle

Some initial comments

scwhittle · 2026-06-05T07:58:03Z

 *
 * @implNote Once closed, it cannot be reused.
 */
 // TODO(m-trieu): See if this can be combined/cleaned up with StreamingModeExecutionContext as the


Should we get rid of this and merge it into StreamingModeExecutionContext?

We should be able to merge things into StreamingModeExecutionContext. Today ComputationWorkExecutor owns both StreamingModeExecutionContext and MapTaskExecutor. The tricky bit seems to be that MapTaskExecutor's creation depends on StreamingModeExecutionContext. So we need to create StreamingModeExecutionContext before the MapTaskExecutor. We could store MapTaskExecutor in StreamingModeExecutionContext after constructing both.

I think we can do that separately, since it looks like it will pull in a bit of unrelated changes.

Sounds good, might be a nice simplification to follow up with.

scwhittle · 2026-06-05T08:56:17Z

+    Instant processingTime =
+        computeProcessingTime(newWork.getWorkItem().getTimers().getTimersList());
+    if (!getAllStepContexts().isEmpty()) {
+      // This must be only created once for a workItem as token validation will fail if the same


Not sure what this means; I don't see any guard against recreating it. I'm wondering if there are cases (retries etc.) where we could get the same key and possibly the same work token. Should we guard against that somehow?

It is an old comment that I moved here.

stateCache.forKey internally checks whether the input workToken is greater than the last seen workToken; if it is not, it invalidates the in-memory cache and returns a new ForKey cache. I think that is why the comment says to do this once per work item.

I'm wondering if there are cases with retries etc that we could get the same key and possibly same work token. Should we guard against that somehow?

IIUC, the worktoken check in stateCache.forKey guards against this.
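
The guard described in this reply can be illustrated with a minimal sketch. This is not Beam's state cache: `TokenGuardedCache`, `KeyCache`, and the `forKey(String, long)` signature are hypothetical stand-ins. The only behavior modeled is the rule stated above, that a work token which is not greater than the last seen token for a key yields a fresh cache.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, NOT Beam's state cache. It only models the rule
// described in the thread: the cache for a key survives only while work
// tokens keep increasing.
final class TokenGuardedCache {
  /** Cached state for one key plus the last work token that used it. */
  static final class KeyCache {
    long lastWorkToken;
    final Map<String, String> values = new HashMap<>();

    KeyCache(long workToken) {
      this.lastWorkToken = workToken;
    }
  }

  private final Map<String, KeyCache> byKey = new HashMap<>();

  /** Returns the cache for a key, discarding it if the token did not advance. */
  KeyCache forKey(String key, long workToken) {
    KeyCache existing = byKey.get(key);
    if (existing != null && workToken > existing.lastWorkToken) {
      // Token advanced: the cached state is still usable.
      existing.lastWorkToken = workToken;
      return existing;
    }
    // First use, a retry with the same token, or an older token: start fresh.
    KeyCache fresh = new KeyCache(workToken);
    byKey.put(key, fresh);
    return fresh;
  }
}
```

Under this rule, calling `forKey` twice with the same token returns two different caches, which is consistent with the "create once per work item" wording in the moved comment.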

Separate from the logic here, I was wondering if we want/need to try to prevent processing a batch with the same key multiple times

ActiveWorkState manages the retries and multiple work items for the same key. It makes sure there is only one work item active at a time for a key. I think we can rely on that and don't need to add deduping logic in the context.
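
As a rough illustration of the "one active work item per key" property, here is a minimal sketch. It is not Beam's ActiveWorkState: the class `PerKeyWorkQueue`, its method names, and the duplicate-token handling are all assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch, NOT Beam's ActiveWorkState. Each key has a queue of
// work tokens and only the head of the queue is considered active.
final class PerKeyWorkQueue {
  private final Map<String, Deque<Long>> queues = new HashMap<>();

  /** Returns true if the caller should execute this work token now. */
  synchronized boolean activate(String key, long workToken) {
    Deque<Long> queue = queues.computeIfAbsent(key, k -> new ArrayDeque<>());
    if (queue.contains(workToken)) {
      // Assumed behavior: a retry of work that is already active or queued is dropped.
      return false;
    }
    queue.addLast(workToken);
    // Only the first entry for a key runs; later entries wait their turn.
    return queue.size() == 1;
  }

  /** Completes the active work for a key and returns the next token to run, if any. */
  synchronized Optional<Long> complete(String key) {
    Deque<Long> queue = queues.get(key);
    if (queue == null || queue.isEmpty()) {
      return Optional.empty();
    }
    queue.removeFirst();
    if (queue.isEmpty()) {
      queues.remove(key);
      return Optional.empty();
    }
    return Optional.of(queue.peekFirst());
  }
}
```

With a structure like this upstream, the execution context never sees two work items for the same key at once, which is the argument for not adding deduplication in the context itself.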

scwhittle · 2026-06-08T10:47:24Z

Sounds good, might be a nice simplification to follow up with.

scwhittle · 2026-06-08T11:02:50Z

Separate from the logic here, I was wondering if we want/need to try to prevent processing a batch with the same key multiple times

scwhittle

some initial comments

arunpandianp · 2026-06-29T07:59:46Z

@scwhittle could you take another look?

scwhittle · 2026-06-29T08:41:33Z

    String computationId = computationState.getComputationId();
-    ByteString key = workItem.getKey();
    work.setProcessingThreadName(Thread.currentThread().getName());
    work.setState(Work.State.PROCESSING);


The first item is set to PROCESSING here, and the others are set to PROCESSING as they are added to the batch. But the first remains PROCESSING until it is transferred to QUEUED. This might confuse user worker latency analysis: we would attribute too much user processing time to that particular key, and if we have N items taking a second each we would have O(N^2) total processing seconds instead of N. Should we add a new POST_PROCESSING_QUEUED state or something?

I think that is a good idea. Will tackle it in a separate PR.
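
The O(N^2) concern can be made concrete with a small calculation. The sketch below assumes that every item enters PROCESSING when its own processing begins and stays in that state until the whole batch ends; the class and method names are hypothetical. Under that assumption, item i of N is attributed (N - i) item-durations, so the total is N(N+1)/2 item-durations rather than N.

```java
// Hypothetical model of how processing time would be attributed if each work
// item stays in PROCESSING from its own start until the end of the batch.
final class ProcessingTimeAttribution {
  /** Total seconds attributed across all items under the assumption above. */
  static long attributedSeconds(int items, long secondsPerItem) {
    long batchEnd = items * secondsPerItem;
    long total = 0;
    for (int i = 0; i < items; i++) {
      long enteredProcessing = i * secondsPerItem;
      total += batchEnd - enteredProcessing;
    }
    return total;
  }

  /** Seconds actually spent processing. */
  static long actualSeconds(int items, long secondsPerItem) {
    return items * secondsPerItem;
  }
}
```

For 4 items of one second each this attributes 4 + 3 + 2 + 1 = 10 seconds against 4 seconds of real work. A separate post-processing state would let each item stop accruing time once its own processing is done.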

scwhittle · 2026-06-29T08:42:40Z

+  private KeyTransitionListener createKeyTransitionListener() {
+    return (oldWork, newWork) -> {
+      setLoggingContextWorkId(newWork.getLatencyTrackingId());
+      newWork.setProcessingThreadName(oldWork.getProcessingThreadName());


See above; maybe we should set the oldWork state to something showing it is waiting for the batch.

Yes, would like to tackle it in a separate PR.

scwhittle · 2026-06-29T08:46:05Z

+      List<Work> workBatch,
+      List<Windmill.WorkItemCommitRequest> workItemCommits) {
+    Preconditions.checkState(
+        workBatch.size() == 1, "Expected single-key work batch, got: " + workBatch.size());


Check that commits and batch are the same size?
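
A minimal sketch of such a check, written in plain Java so it stands alone. `BatchChecks.checkSameSize` is a hypothetical helper; in the PR itself the equivalent would presumably be another `Preconditions.checkState` call next to the existing one.

```java
import java.util.List;

// Hypothetical helper illustrating the suggested size check.
final class BatchChecks {
  /** Throws IllegalStateException unless there is exactly one commit per work item. */
  static void checkSameSize(List<?> workBatch, List<?> workItemCommits) {
    if (workBatch.size() != workItemCommits.size()) {
      throw new IllegalStateException(
          "Expected one commit per work item, got "
              + workBatch.size()
              + " work items and "
              + workItemCommits.size()
              + " commits");
    }
  }
}
```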

scwhittle · 2026-06-29T08:50:32Z

-      // either here or in DFE.
-      if (work.getWorkItem().hasTimers()) {
-        stageInfo.timerProcessingMsecs().addValue(processingTimeMsecs);
+      recordProcessingTime(stageInfo, workBatch, work, processingStartTimeNanos);


Could we simplify here by just creating a single-element workBatch if workBatch is null, and then just using the batch for this method and below?

good idea, done.
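
The normalization suggested here can be sketched as follows. `WorkBatches.orSingleton` is a hypothetical name, and the generic element type stands in for `Work`.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical helper: treat "no batch" as a batch containing only the
// current work item, so downstream code has a single code path.
final class WorkBatches {
  static <T> List<T> orSingleton(List<T> workBatch, T work) {
    return workBatch != null ? workBatch : Collections.singletonList(work);
  }
}
```

After this step, statistics recording and cleanup can iterate over the batch without a separate single-item branch.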

scwhittle · 2026-06-29T08:51:36Z

  private @Nullable WorkExecutor workExecutor;
  private boolean finishKeyCalled = false;

+  @SuppressWarnings("UnusedVariable")


any idea why these suppressions are needed?

It is because we are not reading from the variables (e.g. workQueueExecutor) yet; future PRs have logic to read from these variables. Will remove the suppressions in future PRs.

The warning that shows up in the IDE is: Private field 'workQueueExecutor' is assigned but never accessed.

scwhittle · 2026-06-29T09:26:31Z

+      private @Nullable WindowedValue<KeyedWorkItem<K, T>> current = null;
+
+      @Override
+      public boolean start() throws IOException {


Can we just implement start() and advance() with a helper method taking a bool for whether or not to advance initially? That seems safer if more setup/checking is added before processing an item.
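
A minimal sketch of that shape, assuming a reader that walks a fixed list of keys. `KeyChainReader` and `load(boolean)` are hypothetical names, and the real iterator would ask the execution context for the next work item rather than index into a list.

```java
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical reader: start() and advance() share one helper, so any setup
// or validation added later applies to the first item and to every later one.
final class KeyChainReader {
  private final List<String> keys;
  private int index = 0;
  private boolean started = false;
  private String current = null;

  KeyChainReader(List<String> keys) {
    this.keys = keys;
  }

  boolean start() {
    return load(false);
  }

  boolean advance() {
    return load(true);
  }

  private boolean load(boolean advanceFirst) {
    if (advanceFirst) {
      if (!started) {
        throw new IllegalStateException("advance() called before start()");
      }
      if (index < keys.size()) {
        index++; // Move to the next key before loading.
      }
    } else {
      started = true;
    }
    if (index >= keys.size()) {
      current = null;
      return false;
    }
    // Shared per-item setup lives here, in one place.
    current = keys.get(index);
    return true;
  }

  String getCurrent() {
    if (current == null) {
      throw new NoSuchElementException();
    }
    return current;
  }
}
```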

scwhittle · 2026-06-29T09:29:02Z

    // Ensure that the invalidated dofn had tearDown called on them.
-    assertEquals(1, TestExceptionInvalidatesCacheFn.tearDownCallCount.get());
-    assertEquals(2, TestExceptionInvalidatesCacheFn.setupCallCount.get());
+    assertEquals(2, TestExceptionInvalidatesCacheFn.tearDownCallCount.get());


why did this change?

I should have explained this :(

The test makes UnboundedReader::getCheckpointMark throw. In the old code, UnboundedReader::getCheckpointMark was called after calling finishBundle, so teardown was not called by pardo abort().

beam/runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/SimpleParDoFnHelpers.java

Line 72 in 961ed7c

class SimpleParDoFnHelpers<InputT, OutputT, W extends BoundedWindow> {

In the new code, finishKey() calls UnboundedReader::getCheckpointMark before finishBundle, so the DoFn's teardown got called one more time.

So it seems now that we are not caching this DoFn for more things that might happen in flushInternal which was previously separate from user code execution?

For the multi-key case this seems needed because we haven't called finishBundle yet and thus have incomplete dofn lifecycle. However if it is the final key (or only key) within a batch we may be not caching dofns as aggressively as previously when they are still valid to use.

We could defer the final finishKey if advance will return false? I'm worried that there might be more effects for single-key processing than we realize if there are cases where checkpoints do have errors since dofn construction is expensive.

scwhittle · 2026-06-29T09:31:22Z

+    // First call
+    executionContext.finishKey();
+    // Second call - should not throw any Exception
+    executionContext.finishKey();


When does this happen? Add a comment here explaining why we want this to be the case.

The reader iterators call finishKey() in advance(). Initially I was throwing if finishKey got called twice. Gemini said iterator advance() needs to be reentrant, so I made finishKey ignore repeated calls.
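
The idempotent behavior described in this reply can be sketched with a flag. The class `KeyFinisher` and the flush counter are hypothetical; the `finishKeyCalled` field name comes from the diff, while resetting it in `startForNewKey()` is an assumption made for the example.

```java
// Hypothetical sketch of an idempotent finishKey(): repeated calls for the
// same key are ignored instead of throwing or flushing twice.
final class KeyFinisher {
  private boolean finishKeyCalled = false;
  private int flushes = 0;

  /** Assumed reset point: a new key makes finishKey() effective again. */
  void startForNewKey() {
    finishKeyCalled = false;
  }

  void finishKey() {
    if (finishKeyCalled) {
      return; // Already finished for this key; nothing to do.
    }
    finishKeyCalled = true;
    flushes++; // Stands in for the real per-key flush work.
  }

  int flushCount() {
    return flushes;
  }
}
```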

scwhittle · 2026-06-29T09:34:13Z

+                    .build())
+            .build();
+
+    // Set up context.advance() to mock transition


this seems a bit odd to test this way by just mocking out responses. It seems better suited for a test once support of multiple items via advance is actually within the context.

The tests helped validate the logic in the Iterator where it calls context.advance(). I mocked context.advance() to test the iterator logic in isolation.

scwhittle · 2026-06-29T09:35:50Z

+    when(mockContext.getWork()).thenReturn(work1);
+
+    // Mock transition behaviour of context.advance()
+    when(mockContext.advance())


Ditto, it seems a bit odd to mock this in here.

Same as above. These tests help validate the logic in the Iterator where it calls context.advance(). I mocked context.advance() to test the iterator logic in isolation. We could consider using the real context once it supports multi-key bundles.

scwhittle · 2026-06-30T09:28:18Z

+      try {
+        return keyCoder.decode(work.getWorkItem().getKey().newInput(), Coder.Context.OUTER);
+      } catch (IOException e) {
+        throw new RuntimeException("Failed to decode key during processing", e);


should we wrap as CoderException instead of RuntimeException?

scwhittle · 2026-06-30T09:30:52Z

+
+  // Returns the windmill WorkItem proto for the current Work
+  public Windmill.WorkItem getWorkItem() {
+    return checkStateNotNull(


How about moving this checkStateNotNull message to getWork() and then just having return getWork().getWorkItem(); here?
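
A minimal sketch of the suggested delegation. The nested `Work` and `WorkItem` types, the `setWork` method, and the error message are placeholders, and a plain null check stands in for `checkStateNotNull` so the example has no dependencies.

```java
// Hypothetical sketch: the null check lives in getWork(), and getWorkItem()
// simply delegates, so there is a single place that reports missing work.
final class ContextSketch {
  static final class WorkItem {
    final long shardingKey;

    WorkItem(long shardingKey) {
      this.shardingKey = shardingKey;
    }
  }

  static final class Work {
    private final WorkItem workItem;

    Work(WorkItem workItem) {
      this.workItem = workItem;
    }

    WorkItem getWorkItem() {
      return workItem;
    }
  }

  private Work work; // null until work has been set

  void setWork(Work work) {
    this.work = work;
  }

  Work getWork() {
    if (work == null) {
      throw new IllegalStateException("work is only available while processing");
    }
    return work;
  }

  WorkItem getWorkItem() {
    return getWork().getWorkItem();
  }
}
```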

scwhittle · 2026-06-30T09:51:05Z

So it seems now that we are not caching this DoFn for more things that might happen in flushInternal which was previously separate from user code execution?

For the multi-key case this seems needed because we haven't called finishBundle yet and thus have incomplete dofn lifecycle. However if it is the final key (or only key) within a batch we may be not caching dofns as aggressively as previously when they are still valid to use.

We could defer the final finishKey if advance will return false? I'm worried that there might be more effects for single-key processing than we realize if there are cases where checkpoints do have errors since dofn construction is expensive.

arunpandianp · 2026-06-30T10:16:17Z

should we wrap as CoderException instead of RuntimeException?

I think the comment is on a stale diff; the code is now rethrowing CoderExceptions and wrapping other IOExceptions.
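
The behavior described in this reply could look roughly like the sketch below. The nested `CoderException` is a local stand-in for Beam's class of the same name (which also extends IOException), `Decoder` is a hypothetical interface replacing the coder, and wrapping other IOExceptions in RuntimeException is an assumption carried over from the earlier diff.

```java
import java.io.IOException;

// Hypothetical sketch: rethrow coder failures unchanged, wrap other I/O failures.
final class KeyDecoding {
  /** Local stand-in for Beam's CoderException. */
  static final class CoderException extends IOException {
    CoderException(String message) {
      super(message);
    }
  }

  /** Hypothetical replacement for a key coder. */
  interface Decoder<K> {
    K decode(byte[] bytes) throws IOException;
  }

  static <K> K decodeKey(Decoder<K> decoder, byte[] bytes) throws CoderException {
    try {
      return decoder.decode(bytes);
    } catch (CoderException e) {
      throw e; // Keep the specific type so callers can handle decoding errors.
    } catch (IOException e) {
      // Assumed wrapping type; the thread does not say which exception is used.
      throw new RuntimeException("Failed to decode key during processing", e);
    }
  }
}
```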

arunpandianp · 2026-06-30T10:26:20Z

So it seems now that we are not caching this DoFn for more things that might happen in flushInternal which was previously separate from user code execution?

For the multi-key case this seems needed because we haven't called finishBundle yet and thus have incomplete dofn lifecycle. However if it is the final key (or only key) within a batch we may be not caching dofns as aggressively as previously when they are still valid to use.

We could defer the final finishKey if advance will return false? I'm worried that there might be more effects for single-key processing than we realize if there are cases where checkpoints do have errors since dofn construction is expensive.

Done. Moved last flushStateInternal outside the doFn lifecycle.

scwhittle · 2026-06-30T20:13:19Z

Internal tests passing, going to merge.

arunpandianp added 2 commits June 4, 2026 11:00

[Dataflow Streaming] [Multi Key] StreamingModeExecutionContext refact…

73faa68

…oring for multi-key execution.

trigger postsubmit tests

53bc9a6

github-actions Bot added runners dataflow labels Jun 4, 2026

gemini-code-assist Bot reviewed Jun 4, 2026

arunpandianp added 5 commits June 4, 2026 18:53

fix tests

7720cf8

fix tests

72740d2

improve work synchronization

0f96e72

cleanup logic

51b9257

cleanup logic

9a4e7be

scwhittle requested changes Jun 5, 2026

arunpandianp added 4 commits June 8, 2026 05:13

address comments

3f36afd

improve WindowingWindmillReader

f3cc628

spotless fix

58e0ef9

[Dataflow Streaming] Fix nullness supression in StreamingModeExecutio…

3dceab0

…nContext

arunpandianp requested a review from scwhittle June 8, 2026 08:01

arunpandianp added 5 commits June 8, 2026 08:19

make windmillTagEncoding final

e199438

address comments

700dfbc

Move SideInputStateFetcherFactory from start to constructor

bc5bee2

Merge branch 'contextnullness' into multikey_context_review

6d7f28e

# Conflicts: # runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/StreamingModeExecutionContext.java

Merge remote-tracking branch 'beam/master' into multikey_context_review

c4a52c1

scwhittle requested changes Jun 8, 2026

arunpandianp added 3 commits June 8, 2026 20:31

Merge beam/master into multikey_context_review

1107a60

Resolved conflicts in StreamingModeExecutionContext.java and StreamingModeExecutionContextTest.java. Fixed compilation error in Work.java by removing duplicate getComputationId() method. TAG=agy CONV=143daaa5-e902-4d26-820d-cf1af2babb84

Address comment

47eb7d6

Address comment

24505dd

Fix UnderInitialization

4e0d174

arunpandianp mentioned this pull request Jun 11, 2026

[Dataflow Streaming] [Multi Key] MultiKey failure handling + Integration #38919

Open

arunpandianp requested a review from scwhittle June 15, 2026 19:23

arunpandianp mentioned this pull request Jun 15, 2026

[Dataflow Streaming] [Multi Key] Drop failed work in BoundedQueueExecutor::pollWork #38920

Open

Merge remote-tracking branch 'beam/master' into multikey_context_review

96abcca

scwhittle reviewed Jun 17, 2026

address comments

15e8d0a

scwhittle requested changes Jun 29, 2026

arunpandianp added 3 commits June 30, 2026 08:59

address comments

996f561

Merge remote-tracking branch 'beam/master' into multikey_context_review

6b3218b

fix test

a51c2f7

scwhittle requested changes Jun 30, 2026

address comments

3036b54

scwhittle approved these changes Jun 30, 2026

Merge remote-tracking branch 'beam/master' into multikey_context_review

5557cbb

scwhittle merged commit 78fdc95 into apache:master Jun 30, 2026
20 of 22 checks passed