Skip to content

Sqlserver connector: Data loss with direct mode on update with fetch size > 0 #2012

@b-goyal

Description

@b-goyal

Bug report

What Debezium connector do you use and what version?

SQL Server connector. Reproduced on the latest main (3.6.0-SNAPSHOT), built and run locally. The behavior also affects earlier releases (the relevant code is unchanged).


What is the connector configuration?

Key settings that trigger it:

{
  "connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
  "data.query.mode": "direct",
  "streaming.fetch.size": "5",
  "snapshot.mode": "no_data",
  "database.names": "bgoyal_test",
  "table.include.list": "dbo.users"
}

streaming.fetch.size is set to 5 here to reproduce minimally; in production it is the default 10000 (any non-zero value reproduces it once the transaction exceeds the fetch boundary).


What is the captured database version and mode of deployment?

Microsoft SQL Server 2019 (RTM-CU32) 15.0.4430.1, self-managed (VM/EC2). CDC enabled via sys.sp_cdc_enable_table.


What behavior do you expect?

Every change captured by SQL Server CDC for a committed transaction should be emitted to Kafka exactly once, independent of data.query.mode and streaming.fetch.size. An update to a unique-index column should produce all of its delete/insert events.


What behavior do you see?

Data loss of update events when using data.query.mode=direct with streaming.fetch.size > 0. The loss happens only when the update is to a column that is part of the table's unique index — SQL Server records this as a "deferred update", i.e. a delete followed by an insert with the same __$seqval. If the transaction is a large bulk update that does not fully return in one fetch (because of streaming.fetch.size), the attempt to fetch the rest of the transaction returns only a subset of the rows, so some events are silently dropped.

Example. Table:

CREATE TABLE dbo.users (
    id         INT          NOT NULL PRIMARY KEY CLUSTERED,
    email      VARCHAR(100) NOT NULL,
    name       VARCHAR(100) NOT NULL,
    updated_at DATETIME     NOT NULL DEFAULT getutcdate()
);
CREATE UNIQUE NONCLUSTERED INDEX UX_users_email ON dbo.users(email);
EXEC sys.sp_cdc_enable_table @source_schema='dbo', @source_name='users',
     @role_name=NULL, @supports_net_changes=0;

Update to the unique-index column:

UPDATE dbo.users SET email = name + '_v0@example.com' WHERE id BETWEEN 1 AND 5;

On the CDC table this produces 10 rows (5 deletes + 5 inserts). Ordered as the direct-mode query orders them (ORDER BY __$start_lsn, __$command_id, __$seqval, __$operation):

commit_lsn            |seqval                |__$command_id|__$operation|name |email
----------------------+----------------------+-------------+------------+-----+--------------------
0x00000a24000018a80015|0x00000a24000018a80003|            1|           1|alice|alice_v1@example.com
0x00000a24000018a80015|0x00000a24000018a80004|            2|           1|bob  |bob_v1@example.com
0x00000a24000018a80015|0x00000a24000018a80005|            3|           1|carol|carol_v1@example.com
0x00000a24000018a80015|0x00000a24000018a80006|            4|           1|dave |dave_v1@example.com
0x00000a24000018a80015|0x00000a24000018a80007|            5|           1|eve  |eve_v1@example.com
0x00000a24000018a80015|0x00000a24000018a80003|            6|           2|alice|alice_v0@example.com
0x00000a24000018a80015|0x00000a24000018a80004|            7|           2|bob  |bob_v0@example.com
0x00000a24000018a80015|0x00000a24000018a80005|            8|           2|carol|carol_v0@example.com
0x00000a24000018a80015|0x00000a24000018a80006|            9|           2|dave |dave_v0@example.com
0x00000a24000018a80015|0x00000a24000018a80007|           10|           2|eve  |eve_v0@example.com

Ordered as the function-mode query orders them (ORDER BY __$start_lsn, __$seqval, __$operation — no __$command_id), the delete/insert of each row stay adjacent and __$seqval is monotonic:

commit_lsn            |seqval                |__$operation|name |email
----------------------+----------------------+------------+-----+--------------------
0x00000a24000018a80015|0x00000a24000018a80003|           1|alice|alice_v1@example.com
0x00000a24000018a80015|0x00000a24000018a80003|           2|alice|alice_v0@example.com
0x00000a24000018a80015|0x00000a24000018a80004|           1|bob  |bob_v1@example.com
0x00000a24000018a80015|0x00000a24000018a80004|           2|bob  |bob_v0@example.com
0x00000a24000018a80015|0x00000a24000018a80005|           1|carol|carol_v1@example.com
0x00000a24000018a80015|0x00000a24000018a80005|           2|carol|carol_v0@example.com
0x00000a24000018a80015|0x00000a24000018a80006|           1|dave |dave_v1@example.com
0x00000a24000018a80015|0x00000a24000018a80006|           2|dave |dave_v0@example.com
0x00000a24000018a80015|0x00000a24000018a80007|           1|eve  |eve_v1@example.com
0x00000a24000018a80015|0x00000a24000018a80007|           2|eve  |eve_v0@example.com

With streaming.fetch.size=5, the Kafka topic receives 5 deletes and only 1 create (the eve / command_id 10 row). The 4 inserts for alice/bob/carol/dave are missing:

op=d id=1 change_lsn=00000a24:000018a8:0003
op=d id=2 change_lsn=00000a24:000018a8:0004
op=d id=3 change_lsn=00000a24:000018a8:0005
op=d id=4 change_lsn=00000a24:000018a8:0006
op=d id=5 change_lsn=00000a24:000018a8:0007
op=c id=5 change_lsn=00000a24:000018a8:0007 event_serial_no=2     <-- only surviving insert

All inserts with seqval < 00000a24:000018a8:0007 are dropped. Only command_id 10 survives because its seqval (...0007) is >= the last-seen seqval and its operation is > 1. The issue reproduces whenever the number of (deferred) updates exceeds the streaming.fetch.size boundary.

Likely cause: the first fetch returns the 5 deletes (command_id 1–5, ordered by command_id), so the resume position becomes the highest delete seqval (...0007). The next fetch resumes with getNextResultSet() using a WHERE __$seqval > ? predicate that does not account for __$command_id, so the inserts whose seqval is below that high-water mark are skipped. In other words, in direct mode the query ORDER BY uses __$command_id but the pagination resume key uses only __$seqval, and the two disagree.


Do you see the same behaviour using the latest released Debezium version?

Yes — reproduced on the latest main (3.6.0-SNAPSHOT) built locally. It does not occur with data.query.mode=function (whose ORDER BY has no __$command_id, so __$seqval stays monotonic and delete/insert pairs remain adjacent).


Do you have the connector logs, ideally from start till finish?

TRACE  Getting top 5 changes for table Capture instance "dbo_orders" [...changeTableId=...cdc.dbo_orders_CT...] in range [00000a23:0001b318:0010-00000000:00000000:0000-0, 00000a23:0001b838:0015]   (io.debezium...SqlServerConnection:421)
TRACE  Processing change ChangeTableResultSet{... currentChangePosition=00000a23:0001b838:0015(00000a23:0001b838:0003,1)}   (...SqlServerStreamingChangeEventSource:314)
TRACE  Processing change ChangeTableResultSet{... currentChangePosition=00000a23:0001b838:0015(00000a23:0001b838:0008,1)}
TRACE  Processing change ChangeTableResultSet{... currentChangePosition=00000a23:0001b838:0015(00000a23:0001b838:000b,1)}
TRACE  Processing change ChangeTableResultSet{... currentChangePosition=00000a23:0001b838:0015(00000a23:0001b838:000e,1)}
TRACE  Processing change ChangeTableResultSet{... currentChangePosition=00000a23:0001b838:0015(00000a23:0001b838:0011,1)}
TRACE  Getting top 5 changes for table Capture instance "dbo_orders" [...] in range [00000a23:0001b838:0015-00000a23:0001b838:0011-1, 00000a23:0001b838:0015]   (io.debezium...SqlServerConnection:421)
INFO   11 records sent during previous ... last recorded offset ... is {event_serial_no=1, commit_lsn=00000a23:0001b838:0015, change_lsn=00000a23:0001b838:0011}   (io.debezium...BaseSourceTask:387)
TRACE  Processing change ChangeTableResultSet{... currentChangePosition=00000a23:0001b838:0015(00000a23:0001b838:0011,2)}   <-- only the op2 insert at the high-water seqval survives

How to reproduce the issue using our tutorial deployment?

  1. Create the dbo.users table and unique nonclustered index, enable CDC (DDL above).
  2. Insert 5 rows; start the SQL Server connector with data.query.mode=direct, streaming.fetch.size=5, snapshot.mode=no_data, capturing dbo.users.
  3. In a single transaction, update the unique-index column on all 5 rows:
    UPDATE dbo.users SET email = name + '_v0@example.com' WHERE id BETWEEN 1 AND 5;
  4. Consume the topic: expected 5 deletes + 5 creates; actual 5 deletes + 1 create (4 rows lost). Increase the row count / fetch.size to confirm it scales with the fetch boundary.

The potential cause seems to be the getNextResultSet() call not taking __$command_id into account in direct mode.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type
    No fields configured for issues without a type.

    Projects

    Status
    Ready

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions