
Commit bed14fe

Merge pull request redis#3360 from redis/DOC-6652-rdi-azure-workaroung
DOC-6652 document CDC latency workaround for SQL Server on Azure
2 parents 0c9835c + d608636 commit bed14fe

1 file changed

Lines changed: 171 additions & 0 deletions


content/integrate/redis-data-integration/data-pipelines/prepare-dbs/sql-server.md

@@ -334,6 +334,177 @@ For more information about CDC on Azure SQL Database, see Microsoft's
[Change Data Capture with Azure SQL Database](https://learn.microsoft.com/en-us/azure/azure-sql/database/change-data-capture-overview?view=azuresql)
guide.

#### Reducing end-to-end latency on Azure SQL Database

Because the capture cadence on Azure SQL Database is fixed at ~20 seconds and
cannot be tuned, the CDC step alone can add up to that much latency to your
end-to-end change-propagation time. If your workload needs lower latency, you
can supplement the automatic Azure scheduler with an external worker that
periodically calls the `sys.sp_cdc_scan` stored procedure. The automatic
scheduler continues to run; each manual call adds an extra CDC log scan in
between, shortening the effective capture interval to roughly the worker's
polling interval.

Each call runs one CDC log scan, bounded by the `maxtrans` and `maxscans`
parameters covered in
[SQL Server capture job agent configuration parameters](#sql-server-capture-job-agent-configuration-parameters).
On Azure SQL Database, `pollinginterval` and `continuous` do not apply, but
`maxtrans` and `maxscans` remain tunable via `sp_cdc_change_job`. For low and
moderate change volumes the defaults are usually fine, because each call drains
the pending transactions. For high-volume workloads, raise `maxtrans` and
`maxscans` if a single call cannot keep up with the change rate.
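As a rough way to reason about whether a single call can keep up, the sketch below treats `maxtrans * maxscans` as the ceiling on transactions one call can process and compares it with the number of transactions committed during one worker interval. The default values of 500 and 10 are assumptions taken from Microsoft's documented capture-job defaults, the function names are hypothetical, and the estimate ignores the scans that the automatic scheduler contributes, so treat it as a conservative back-of-the-envelope check only.

```python
import math

# Assumed capture-job defaults (check sys.sp_cdc_help_jobs on your own
# database before relying on them).
DEFAULT_MAXTRANS = 500
DEFAULT_MAXSCANS = 10


def max_transactions_per_call(maxtrans: int = DEFAULT_MAXTRANS,
                              maxscans: int = DEFAULT_MAXSCANS) -> int:
    """Upper bound on the transactions one sys.sp_cdc_scan call can process."""
    return maxtrans * maxscans


def call_keeps_up(tx_per_second: float, interval_seconds: float,
                  maxtrans: int = DEFAULT_MAXTRANS,
                  maxscans: int = DEFAULT_MAXSCANS) -> bool:
    """True if the transactions committed in one interval fit in one call."""
    pending = math.ceil(tx_per_second * interval_seconds)
    return pending <= max_transactions_per_call(maxtrans, maxscans)
```

For example, under these assumptions a workload committing 3000 transactions per second with a 2s interval exceeds the default ceiling, which is the situation where raising `maxtrans` or `maxscans` is worth testing.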

This is a customer-operated workaround for an Azure platform limitation, not a
Redis-supplied component. It does not apply to Azure SQL Managed Instance,
SQL Server on Azure VM, or on-premises SQL Server. Those platforms use SQL
Server Agent and the tunable `pollinginterval` parameter described in
[SQL Server capture job agent configuration parameters](#sql-server-capture-job-agent-configuration-parameters).

{{< warning >}}Run **only one** instance of the scan worker per source database.
`sys.sp_cdc_scan` holds an exclusive log-reader lock for the duration of each
call; concurrent callers fail rather than running in parallel, so additional
replicas add no throughput and only generate error noise.{{< /warning >}}

##### Requirements

- A database identity with permission to execute `sys.sp_cdc_scan`. This
  requires `db_owner` and is **more privilege than the Debezium user needs**, so
  create a separate login dedicated to the scan worker rather than reusing the
  RDI source credentials.
- A single-replica runtime (a Kubernetes Deployment with `replicas: 1`, a
  systemd unit, a serverless cron with `maxConcurrency: 1`, or equivalent).
- Network access from the worker to the Azure SQL endpoint on TCP 1433.

##### Scan loop

The worker repeatedly opens a connection (or holds a long-lived one), runs
`EXEC sys.sp_cdc_scan;` with a bounded command timeout, sleeps for the
configured interval, and handles two expected error classes:

- **Scan already in progress**: `sys.sp_cdc_scan` cannot run while another
  CDC log scan is active, either the Azure-internal scheduler's scan or a
  previous call from this worker that has not yet returned. The procedure
  returns a SQL error in this state. The error message has been observed to
  contain `sp_replcmds` (the underlying log-reader procedure), but the exact
  wording is not contractual, so match by whatever signature your client
  surfaces, then log the occurrence and continue. Do not back off.
- **Connection or transport errors**: close and reopen the connection with
  exponential backoff before the next attempt.
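The two error classes can be separated with a small classifier plus a capped backoff schedule. The sketch below is illustrative only: the function names, the 0.5s base delay, and the 30s cap are assumptions, and the substring match on `sp_replcmds` relies on the observed (non-contractual) message text, so adjust the signature to whatever your SQL client actually surfaces.

```python
# Observed, non-contractual marker of the "scan already in progress" error.
SCAN_BUSY_SIGNATURE = "sp_replcmds"


def classify_scan_error(message: str) -> str:
    """Map a SQL error message to one of the two expected error classes."""
    if SCAN_BUSY_SIGNATURE in message.lower():
        # Another CDC log scan is active: log it and carry on, no backoff.
        return "scan_already_running"
    # Everything else is treated as a connection or transport problem.
    return "scan_error"


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential backoff in seconds for reconnect attempt 0, 1, 2, ..."""
    return min(cap, base * (2 ** attempt))
```

Treating every unrecognized error as a reconnect case is a deliberately cautious choice: a wrongly classified busy error only costs an unnecessary reconnect, whereas a missed transport error could leave the worker calling a dead connection.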
393+
394+
The example below shows the loop in pseudocode:
395+
396+
```text
397+
loop until shutdown:
398+
start = now()
399+
try:
400+
EXEC sys.sp_cdc_scan # command timeout: 30s
401+
log("scan_ok", now() - start)
402+
catch SqlException identifying "scan already active":
403+
log("scan_already_running", now() - start)
404+
catch any other exception as e:
405+
log("scan_error", e)
406+
reconnect with exponential backoff
407+
sleep(max(0, interval - (now() - start)))
408+
```
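One way to turn the pseudocode into a testable worker is to keep the loop free of any database driver and inject the pieces that touch SQL. In the Python sketch below, `run_scan`, `reconnect`, `should_stop`, and the `ScanBusyError` exception are all hypothetical names introduced for illustration; a real `run_scan` would execute `EXEC sys.sp_cdc_scan;` through your SQL client with a command timeout and raise `ScanBusyError` when it recognizes the "scan already in progress" error.

```python
import time
from typing import Callable


class ScanBusyError(Exception):
    """Hypothetical: raised by run_scan when another CDC log scan is active."""


def scan_loop(run_scan: Callable[[], None],
              reconnect: Callable[[int], None],
              should_stop: Callable[[], bool],
              interval_s: float,
              log: Callable[[str, object], None] = print,
              clock: Callable[[], float] = time.monotonic,
              sleep: Callable[[float], None] = time.sleep) -> None:
    """Call run_scan roughly every interval_s seconds until should_stop()."""
    failures = 0  # consecutive non-busy failures, passed to reconnect()
    while not should_stop():
        start = clock()
        try:
            run_scan()
            failures = 0
            log("scan_ok", clock() - start)
        except ScanBusyError:
            # Expected when a scan is already running: log and continue.
            log("scan_already_running", clock() - start)
        except Exception as exc:
            log("scan_error", exc)
            failures += 1
            reconnect(failures)  # caller applies exponential backoff
        # Sleep only for what is left of the interval after this attempt.
        sleep(max(0.0, interval_s - (clock() - start)))
```

Injecting the clock and sleep functions is a design choice that keeps the loop deterministic under test; in production the defaults (`time.monotonic` and `time.sleep`) apply.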
409+
410+
##### Choosing the interval
411+
412+
The scan interval directly trades end-to-end latency against source-database
413+
load — each call reads the transaction log. Pick the largest interval that
414+
meets your latency target:
415+
416+
| Interval | Approximate CDC-step latency | Typical use |
417+
| --- | --- | --- |
418+
| No worker | Up to ~20s | Azure SQL Database default; the automatic scheduler runs every ~20s. |
419+
| 5s | Around 5s | Workload tolerates ~5s end-to-end. |
420+
| 2s | Around 2s under low to moderate load; can be higher under heavy write volume | Latency-sensitive workloads. Confirm the achieved latency under your own workload before relying on it. |
421+
422+
Intervals below 1s are not recommended — each call has a fixed cost on the
423+
source database and the marginal latency improvement is small.
424+
425+
{{< warning >}}CDC scans consume regular database resources. Every call reads
426+
the transaction log, competing with the workload for CPU, memory, and log I/O.
427+
An aggressive interval can degrade the source database, especially on lower
428+
service tiers or under high write volume. Microsoft provides no SLA on CDC
429+
freshness on Azure SQL Database; treat measured end-to-end latency under your
430+
own workload as the source of truth, not the configured interval. If scans
431+
start falling behind, raise the service tier, raise `maxtrans` and `maxscans`,
432+
or relax the interval.{{< /warning >}}
433+
434+
##### Example Kubernetes deployment

A minimal single-replica deployment skeleton. Adapt the image, namespace, and
secret reference to your environment:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: azure-sql-cdc-scan-worker
spec:
  replicas: 1
  selector:
    matchLabels:
      app: azure-sql-cdc-scan-worker
  template:
    metadata:
      labels:
        app: azure-sql-cdc-scan-worker
    spec:
      containers:
        - name: worker
          image: <your-registry>/<your-scan-worker-image>:<tag>
          env:
            - name: SQL_HOST
              value: <server-name>.database.windows.net
            - name: SQL_DATABASE
              value: <database-name>
            - name: SCAN_INTERVAL_MS
              value: "2000"
          envFrom:
            - secretRef:
                name: <scan-worker-db-secret>
```

The secret referenced by `envFrom` must provide the credentials of the
`db_owner` identity created for the scan worker, not the RDI source
credentials.
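Inside the container, the worker has to read these environment variables at startup. The sketch below shows one possible way to do that in Python; `WorkerConfig` and `load_config` are hypothetical names, the 2000 ms fallback simply mirrors the example manifest, and rejecting intervals under 1000 ms is an assumption based on the guidance that sub-second intervals are not recommended. Credential variables are left out because their names depend on how you define the secret.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class WorkerConfig:
    sql_host: str
    sql_database: str
    scan_interval_s: float


def load_config(env: Mapping[str, str] = os.environ) -> WorkerConfig:
    """Build the worker configuration from environment variables."""
    interval_ms = int(env.get("SCAN_INTERVAL_MS", "2000"))
    if interval_ms < 1000:
        # Fail fast rather than hammer the source database.
        raise ValueError("SCAN_INTERVAL_MS below 1000 is not recommended")
    return WorkerConfig(
        sql_host=env["SQL_HOST"],          # KeyError if missing
        sql_database=env["SQL_DATABASE"],  # KeyError if missing
        scan_interval_s=interval_ms / 1000.0,
    )
```

Failing at startup on a missing or too-small value makes a misconfigured Deployment visible immediately instead of silently running with an unexpected interval.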
472+
473+
##### Verifying the workaround
474+
475+
After the worker has been running for a few minutes, confirm that the
476+
effective scan cadence has dropped to the worker's interval by querying the
477+
`sys.dm_cdc_log_scan_sessions` dynamic management view. This DMV records both
478+
the automatic scheduler's scans and the worker's manual scans, so the gap
479+
between successive `start_time` values should now match the worker's interval:
480+
481+
```sql
482+
-- Recent CDC log-scan sessions (manual and automatic combined)
483+
SELECT TOP (10)
484+
session_id, start_time, end_time, duration, scan_phase,
485+
latency, tran_count, last_commit_cdc_time
486+
FROM sys.dm_cdc_log_scan_sessions
487+
WHERE session_id > 0
488+
ORDER BY session_id DESC
489+
```
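If you export the `start_time` column from that query, the cadence check can be automated. The Python sketch below computes the gaps between successive scans and compares their median with the configured interval; the function names and the 1s tolerance are assumptions chosen for illustration, and the median is used so that an occasional long scan does not skew the result.

```python
from datetime import datetime
from statistics import median
from typing import Iterable, List


def scan_gaps_seconds(start_times: Iterable[datetime]) -> List[float]:
    """Seconds between successive scan start times, oldest first."""
    ordered = sorted(start_times)
    return [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]


def cadence_matches(start_times: Iterable[datetime], interval_s: float,
                    tolerance_s: float = 1.0) -> bool:
    """True if the median gap is within tolerance_s of interval_s."""
    gaps = scan_gaps_seconds(start_times)
    if not gaps:
        return False  # fewer than two sessions: nothing to compare
    return abs(median(gaps) - interval_s) <= tolerance_s
```

A result of `False` for your configured interval suggests the worker is not running, is being rejected as busy most of the time, or is taking longer per call than the interval allows.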

To check the commit time of the latest change captured for a specific table,
map the highest captured LSN back to a time using `sys.fn_cdc_map_lsn_to_time`:

```sql
-- Replace <capture-instance> with the capture instance name shown by
-- sys.sp_cdc_help_change_data_capture
SELECT sys.fn_cdc_map_lsn_to_time(MAX(__$start_lsn)) AS latest_captured_commit_time
FROM cdc.<capture-instance>_CT;
```

The difference between that value and the current time is an upper bound on
how stale the captured stream is for that table.

You can also confirm that end-to-end change propagation through RDI now meets
your latency target by measuring `<change committed in source> → <change visible in Redis>`
on a representative table.

#### Azure SQL Managed Instance

Follow the on-premises instructions for
