Skip to content

Forced PK_NewTasks index hint in _LockNextTask causes high CPU under concurrency #312

@berndverst

Description

@berndverst

Summary

The dt._LockNextTask stored procedure contains a hard-coded index hint WITH (INDEX (PK_NewTasks)) that forces SQL Server to use the clustered primary key index for the activity task dequeue UPDATE. Under high concurrency and sustained task backlogs, this hint forces a suboptimal execution plan that causes disproportionate CPU consumption and query latency on Azure SQL, even after scaling compute.

Problem Details

The forced hint

In logic.sql, the _LockNextTask procedure contains:

UPDATE TOP (1) NewTasks WITH (READPAST)
SET ...
FROM NewTasks WITH (INDEX (PK_NewTasks))
WHERE
    [TaskHub] = @TaskHub AND
    ([LockExpiration] IS NULL OR [LockExpiration] < @now) AND
    ([VisibleTime] IS NULL OR [VisibleTime] < @now)

The comment says: "The PK_NewTasks hint is specified to help ensure in-order selection."

Why the hint is problematic at scale

PK_NewTasks is defined as PRIMARY KEY (TaskHub, SequenceNumber) -- a clustered index. The WHERE clause filters on LockExpiration and VisibleTime, which are not in this index. The forced hint makes SQL Server evaluate these as residual predicates by scanning through clustered index rows until it finds one matching TOP (1) with READPAST.

With a backlog of locked or invisible tasks, this means scanning past many ineligible rows on every call. The existing IX_NewTasks_InstanceID index already INCLUDEs both LockExpiration and VisibleTime, but SQL Server cannot use it because of the forced hint.

Why the hint does not achieve its goal

The hint intends to ensure FIFO (in-order) activity task selection by SequenceNumber. However, UPDATE TOP (1) with READPAST is inherently non-deterministic -- SQL Server returns the first unlocked row it encounters, which may not be the lowest SequenceNumber. The hint provides a soft tendency toward ordering but not a guarantee, while forcing a costly plan.

Polling amplification

Each worker runs LockNextTaskActivityWorkItem in a tight loop with default settings:

  • MaxConcurrentActivities = Environment.ProcessorCount
  • MinActivityPollingInterval = 50ms
  • Backoff resets immediately when a task is found

With a sustained backlog, backoff never engages. A deployment with N workers x M cores creates NxM concurrent polling threads, each calling _LockNextTask at up to 20 times/sec. This can easily produce hundreds of executions per second, all hitting the suboptimal forced plan.

Comparison with _LockNextOrchestration

The companion procedure _LockNextOrchestration uses UPDATE TOP (1) Instances WITH (READPAST) with a join to NewEvents -- without any forced index hint. This proves the pattern works correctly without the hint.

Proposed Fix

Option 1: Remove the hint (minimal change)

-- Before:
FROM NewTasks WITH (INDEX (PK_NewTasks))

-- After:
FROM NewTasks

This lets SQL Server choose the optimal plan based on current table statistics, consistent with _LockNextOrchestration.

Option 2: Add a covering index (performance optimization)

In addition to removing the hint, add a new index that covers the WHERE clause predicates:

CREATE NONCLUSTERED INDEX IX_NewTasks_LockNext
ON dt.NewTasks (TaskHub, LockExpiration, VisibleTime)
INCLUDE (SequenceNumber, LockedBy, DequeueCount)

This gives the optimizer an efficient path to find unlocked, visible tasks without scanning the clustered index.

Customer Workaround

Until a fix is released, customers can safely apply the following workaround:

  1. Retrieve the current procedure definition:

    SELECT OBJECT_DEFINITION(OBJECT_ID('dt._LockNextTask'))
  2. Re-create the procedure with the hint removed -- copy the full definition from step 1, replace FROM NewTasks WITH (INDEX (PK_NewTasks)) with FROM NewTasks, and execute the modified CREATE OR ALTER PROCEDURE.

  3. Monitor Azure SQL CPU, _LockNextTask duration, and logical reads via Query Store for 24-48 hours.

  4. Re-apply after NuGet upgrades -- the durabletask-mssql package uses CREATE OR ALTER PROCEDURE during schema create/upgrade, which will restore the original procedure. Re-apply the patch after any package upgrade.

Related Issues

Metadata

Metadata

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions