Add Jitter to Task Polling Mechanism #7115

@mwielansky

Description

New feature

This proposal adds random jitter to Nextflow's task polling interval so that status requests are spread over time instead of firing in lockstep. This should make workflows more robust and efficient, especially when they interact with external services.

Use case

The main use case is to prevent the "thundering herd" problem when Nextflow tasks poll external services (e.g., cloud provider APIs like GCP, AWS, Azure, or other web services) for status updates. In scenarios where numerous tasks initiate polling at synchronized, fixed intervals, this can lead to sudden, high-volume bursts of requests. Such bursts can overwhelm external APIs, resulting in rate limiting, increased latency, or temporary service unavailability. Adding jitter will distribute these requests more evenly over time, making Nextflow a more resilient and "API-friendly" client.
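
To make the effect concrete, here is a toy simulation. It is only a sketch under stated assumptions: it models each task as an independent poller that starts at the same instant, which is a simplification and not how TaskPollingMonitor is actually structured. The class name, method name, and all numbers (200 tasks, 5 s interval, up to 1.25 s of jitter, 100 ms buckets) are hypothetical and chosen for illustration.

```java
import java.util.Random;

// Toy model, not Nextflow code: counts how many polls fall into the same time bucket.
final class HerdSimulation {

    /** Returns the largest number of polls that land in any single bucket of bucketMillis. */
    static int peakRequestsPerBucket(int tasks, long intervalMillis, long maxJitterMillis,
                                     int rounds, long bucketMillis, long seed) {
        Random random = new Random(seed);
        // Latest possible poll time: every round takes at most interval + maxJitter.
        long horizon = (long) rounds * (intervalMillis + maxJitterMillis);
        int[] buckets = new int[(int) (horizon / bucketMillis) + 1];

        for (int t = 0; t < tasks; t++) {
            long now = 0; // assumption: every task starts polling at the same instant
            for (int r = 0; r < rounds; r++) {
                long jitter = maxJitterMillis == 0
                        ? 0
                        : (long) (random.nextDouble() * maxJitterMillis);
                now += intervalMillis + jitter;
                buckets[(int) (now / bucketMillis)]++;
            }
        }

        int peak = 0;
        for (int count : buckets) {
            peak = Math.max(peak, count);
        }
        return peak;
    }

    public static void main(String[] args) {
        int fixed = peakRequestsPerBucket(200, 5_000, 0, 10, 100, 42L);
        int jittered = peakRequestsPerBucket(200, 5_000, 1_250, 10, 100, 42L);
        System.out.println("peak polls per 100 ms, fixed interval: " + fixed);
        System.out.println("peak polls per 100 ms, with jitter:    " + jittered);
    }
}
```

With a fixed interval, every task polls in the same bucket each round, so the peak equals the number of tasks. With jitter the polls drift apart and the peak drops; the exact figure depends on the seed and the parameters.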

Suggested implementation

The core polling logic resides in TaskPollingMonitor.groovy and ParallelPollingMonitor.groovy.

  1. Identify Polling Loop: The pollLoop() method in TaskPollingMonitor.groovy (around line 483) is the primary location where the polling interval is enforced via the await(time) call.
  2. Introduce Jitter Calculation:
    • Modify the await(long time) method (around line 578) or the pollLoop() to introduce a random delay (jitter) to the fixed pollIntervalMillis.
    • The jitter should be calculated such that there's an inverse relationship between the base polling interval and the jitter factor. For example:
      • For a shorter pollIntervalMillis, a larger percentage of that interval could be used for random jitter.
      • For a longer pollIntervalMillis, a smaller percentage of that interval would be used for jitter.
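    • A possible formula could be jitter = random_factor * (max_jitter_percentage / pollIntervalMillis_normalized) * pollIntervalMillis, so that the result is a duration in milliseconds rather than a bare percentage.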
    • The random_factor would be a random number between 0 and 1.
    • The max_jitter_percentage would be a configurable value (e.g., 25% of the pollIntervalMillis).
    • pollIntervalMillis_normalized could be the pollIntervalMillis divided by a base unit (e.g., 1000ms for 1 second) to ensure the inverse relationship scales appropriately.
    • The final delay would be pollIntervalMillis + jitter.
  3. Configuration: Consider adding a new configuration parameter (e.g., executor.<name>.pollJitterFactor) to ExecutorConfig to allow users to control the maximum jitter percentage, or to enable/disable the jitter.
  4. Impact: This change would be inherited by all executors that use TaskPollingMonitor and ParallelPollingMonitor, including cloud-specific plugins like nf-google.
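
The calculation in step 2 could be sketched roughly as follows. This is an illustration in Java, not the actual Nextflow code. It reads max_jitter_percentage as a fraction of the interval (0.25 for 25%), so the jitter in milliseconds is that fraction times pollIntervalMillis. The class PollJitter, its method names, and the 1000 ms base unit are hypothetical. Capping the fraction at the configured maximum for intervals shorter than the base unit is an added assumption, since the uncapped formula would exceed the maximum there.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper, not part of Nextflow: computes a jittered polling delay.
final class PollJitter {

    // Assumed base unit used to normalize the interval (1 second).
    static final long BASE_UNIT_MILLIS = 1_000L;

    /** Upper bound of the jitter as a fraction of the interval; shrinks as the interval grows. */
    static double jitterFraction(long pollIntervalMillis, double maxJitterFraction) {
        double normalized = (double) pollIntervalMillis / BASE_UNIT_MILLIS;
        // Assumption: never exceed the configured maximum, even for sub-second intervals.
        return Math.min(maxJitterFraction, maxJitterFraction / normalized);
    }

    /** Deterministic variant: randomFactor is expected to lie in [0, 1]. */
    static long jitteredDelay(long pollIntervalMillis, double maxJitterFraction, double randomFactor) {
        if (pollIntervalMillis <= 0 || maxJitterFraction <= 0) {
            return pollIntervalMillis; // jitter disabled or nothing to jitter
        }
        double fraction = jitterFraction(pollIntervalMillis, maxJitterFraction);
        long jitter = (long) (randomFactor * fraction * pollIntervalMillis);
        return pollIntervalMillis + jitter;
    }

    /** Convenience variant that draws the random factor itself. */
    static long jitteredDelay(long pollIntervalMillis, double maxJitterFraction) {
        return jitteredDelay(pollIntervalMillis, maxJitterFraction,
                ThreadLocalRandom.current().nextDouble());
    }

    public static void main(String[] args) {
        // A 5 s interval with a 25% maximum yields a delay somewhere above 5000 ms.
        System.out.println(PollJitter.jitteredDelay(5_000, 0.25));
    }
}
```

A side effect of this reading is that, above the base unit, the maximum jitter in milliseconds is the same for every interval (max fraction times base unit), which is what makes the percentage fall as the interval grows. A value like the proposed executor.<name>.pollJitterFactor would feed the maxJitterFraction argument, and a value of 0 would disable jitter.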
