Watermark propagation stalls indefinitely in v1.7.5 Rust runtime due to frequent micro-regressions #3424

@yuanzaiyang

Describe the bug
Watermark propagation stalls indefinitely in Numaflow v1.7.5 (Rust runtime). Even with idleSource and maxDelay configured, the Source vertex frequently logs "Watermark regression detected, skipping publish", after which watermark updates stop completely for downstream vertices.

This issue is particularly severe for Reduce (Aggregator) vertices, which stay in the "STILL WAITING for window to close" state indefinitely. The Dashboard shows the watermark as "Not Available" on edges downstream of Aggregators, even while data is actively being processed. This suggests that the new Rust data plane mishandles micro-jitter in data timestamps and in gRPC signal delivery.

To Reproduce
Steps to reproduce the behavior:

  1. Deploy a pipeline with a Kafka Source and a Reduce/Aggregator vertex using Numaflow v1.7.5.
  2. Configure a fixed window (e.g., 60s), set maxDelay (e.g., 120s), and enable idleSource.
  3. Feed a Kafka topic with data across multiple partitions where slight timestamp jitter (sub-second regressions) occurs naturally.
  4. Observe the Source vertex logs: it frequently skips watermark publishing with "Watermark regression detected, skipping publish".
  5. Observe the Aggregator vertex: it logs "STILL WAITING for window to close" but never receives the None signal to close the window, even though wall-clock time is well past window_end.
  6. The UI shows Watermark: Not Available on the edge following the Aggregator.
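
To make step 5 concrete, here is a minimal sketch of why wall-clock time does not help once the watermark is stuck. This is not Numaflow's actual code: `FixedWindow`, `assign_window`, and `can_close` are hypothetical names, and the strict "greater than window end" boundary is an assumption based on the expected behavior described below.

```rust
/// Half-open fixed window [start_ms, end_ms), in epoch milliseconds.
/// Hypothetical type for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct FixedWindow {
    pub start_ms: i64,
    pub end_ms: i64,
}

/// Assign an event time to its fixed window of `length_ms` (must be > 0).
pub fn assign_window(event_time_ms: i64, length_ms: i64) -> FixedWindow {
    // rem_euclid keeps the alignment correct for negative timestamps too.
    let start_ms = event_time_ms - event_time_ms.rem_euclid(length_ms);
    FixedWindow {
        start_ms,
        end_ms: start_ms + length_ms,
    }
}

/// A window may close only when a watermark exists and, after subtracting the
/// allowed lateness, it has passed the window end. Wall-clock time plays no
/// part, so a missing or frozen watermark keeps the window open forever.
pub fn can_close(
    window: FixedWindow,
    watermark_ms: Option<i64>,
    allowed_lateness_ms: i64,
) -> bool {
    match watermark_ms {
        Some(wm) => wm - allowed_lateness_ms > window.end_ms,
        None => false,
    }
}
```

With a 60s window, an event at 125_000 ms lands in [120_000, 180_000). If the Source stops publishing at a watermark of 150_000, `can_close` stays false no matter how much real time passes, which matches the endless "STILL WAITING" logs.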

Expected behavior

  1. Watermarks should advance monotonically, utilizing maxDelay to absorb micro-jitters without skipping publishing entirely.
  2. idleSource should force watermark advancement even if some partitions are slow or have slight regressions.
  3. Reduce windows should close and propagate watermarks to downstream sinks once the watermark (adjusted by lateness) exceeds the window end time.
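
As a rough illustration of expectation 1, the sketch below holds the last published watermark when a candidate regresses, instead of skipping the publish. It is an assumption about the desired behavior, not the runtime's implementation: `WatermarkTracker` and `observe` are hypothetical names, and the candidate is computed as the slowest partition's highest event time minus maxDelay.

```rust
use std::collections::HashMap;

/// Hypothetical tracker: one high-water mark per partition, one published watermark.
pub struct WatermarkTracker {
    max_delay_ms: i64,
    max_event_time_ms: HashMap<u32, i64>,
    last_published_ms: Option<i64>,
}

impl WatermarkTracker {
    pub fn new(max_delay_ms: i64) -> Self {
        Self {
            max_delay_ms,
            max_event_time_ms: HashMap::new(),
            last_published_ms: None,
        }
    }

    /// Record one event and return the watermark to publish for it.
    /// A watermark is always returned; it simply never moves backwards.
    pub fn observe(&mut self, partition: u32, event_time_ms: i64) -> i64 {
        // Keep only the highest event time seen on this partition, so a
        // slightly older record (micro-jitter) does not lower it.
        let seen = self
            .max_event_time_ms
            .entry(partition)
            .or_insert(event_time_ms);
        if event_time_ms > *seen {
            *seen = event_time_ms;
        }

        // Candidate: the slowest partition's high-water mark minus maxDelay.
        let slowest = self
            .max_event_time_ms
            .values()
            .copied()
            .min()
            .expect("at least one partition has been observed");
        let candidate = slowest - self.max_delay_ms;

        // Clamp instead of skip: republish the previous value on regression.
        let published = match self.last_published_ms {
            Some(last) => last.max(candidate),
            None => candidate,
        };
        self.last_published_ms = Some(published);
        published
    }
}
```

The clamp matters mainly when a new, slower partition appears and drags the minimum down. In that case the tracker republishes the old value, so downstream vertices keep receiving a watermark rather than nothing.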

Environment (please complete the following information):

  • Kubernetes: v1.34
  • Numaflow: v1.7.5
  • Data Plane Runtime: Rust
  • SDK: Rust SDK v0.5.0

Additional context

  1. Micro-Regression Sensitivity: The system seems extremely sensitive to even small regressions (e.g., < 500ms), which causes the Source to stop publishing watermarks.
  2. Resource Impact: Increasing CPU/Memory resources and adjusting NATS maxackpending did not resolve the stall, suggesting a logic error in the Rust-based watermark progressor or gRPC stream handling.
  3. Consistency: The issue is reproducible across multiple independent pipelines, including those with only Map vertices, where downstream watermarks also intermittently show "Not Available".
  4. Window Stall: In Aggregator UDF logs, we see thousands of STILL WAITING entries, confirming that the window closure signal is never sent by the Sidecar when the watermark is stuck at the source.
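
To quantify point 1 on a captured sample of event timestamps, a small helper like the following could be used. It is a hypothetical diagnostic, not part of Numaflow or its SDK: it counts how many records arrive with a timestamp below the running maximum and reports the largest such gap.

```rust
/// Summary of timestamp regressions in an arrival-ordered sample.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct RegressionStats {
    /// Number of records older than the running maximum.
    pub count: usize,
    /// Largest gap (ms) between the running maximum and a regressed record.
    pub max_regression_ms: i64,
}

/// Scan timestamps in arrival order, comparing each to the running maximum.
pub fn regression_stats(event_times_ms: &[i64]) -> RegressionStats {
    let mut stats = RegressionStats::default();
    let mut high_water: Option<i64> = None;
    for &t in event_times_ms {
        match high_water {
            // Record is older than something already seen: a regression.
            Some(h) if t < h => {
                stats.count += 1;
                stats.max_regression_ms = stats.max_regression_ms.max(h - t);
            }
            // First record, or a new (or equal) maximum.
            _ => high_water = Some(t),
        }
    }
    stats
}
```

If `max_regression_ms` stays well under the configured maxDelay (120s in the reproduction above) while the Source still logs skipped publishes, that supports the suspicion that maxDelay is not absorbing the jitter.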

Labels: area/reduce, bug