feat: add optimized chunking strategies #933
| // MSSQLPhysLocSampleBoundaryQuery returns a query that uses TABLESAMPLE SYSTEM to | ||
| // sample a percentage of data pages and return sorted %%physloc%% binary values. | ||
| func MSSQLPhysLocSampleBoundaryQuery(stream types.StreamInterface, samplePercent float64) string { | ||
| quotedTable := QuoteTable(stream.Namespace(), stream.Name(), constants.MSSQL) | ||
| return fmt.Sprintf(` | ||
| SELECT %%%%physloc%%%% | ||
| FROM %s TABLESAMPLE SYSTEM (%.6f PERCENT) WITH (NOLOCK) | ||
| ORDER BY %%%%physloc%%%% | ||
| `, quotedTable, samplePercent) | ||
| } |
Do we need WITH (NOLOCK)? What if we include a row that is later rolled back?
| // hex literals (the shape produced by IAM walk and the iterative physloc | ||
| // planner). Either Min or Max is sufficient to identify the format because | ||
| // both planners always set at least one boundary. | ||
| func isPhysLocChunk(chunk types.Chunk) bool { |
Can we move this util to the bottom?
| } | ||
| defer rows.Close() | ||
| var sampledLocs [][]byte |
| return nil | ||
| } | ||
| // probeIAMWalkCapability checks if IAM walk |
| logger.Debugf("IAM walk probe: failed to read server properties: %s", err) | ||
| return false | ||
| } | ||
| if majorVersion < 11 { |
| // Permission probe: TOP 0 evaluates the DMF without returning any rows. | ||
| // Failure here means the current login lacks VIEW DATABASE STATE. | ||
| rows, err := m.client.QueryContext(ctx, jdbc.MSSQLIAMWalkPermissionQuery()) |
We already have this query to check the permission:
// MSSQLViewDatabaseStatePermissionQuery checks for VIEW DATABASE STATE (all versions) or
// VIEW DATABASE PERFORMANCE STATE (SQL Server 2022+). Either permission grants access to
// sys.dm_cdc_log_scan_sessions.
func MSSQLViewDatabaseStatePermissionQuery() string {
return `
SELECT CAST(CASE
WHEN HAS_PERMS_BY_NAME(NULL, 'DATABASE', 'VIEW DATABASE STATE') = 1 THEN 1
WHEN HAS_PERMS_BY_NAME(NULL, 'DATABASE', 'VIEW DATABASE PERFORMANCE STATE') = 1 THEN 1
ELSE 0
END AS BIT)
`
}
| } | ||
| defer rows.Close() | ||
| pages := make([]uint64, 0, 1024) |
| // allocating an 8-byte slice per page and sorting with bytes.Compare), | ||
| // halving memory on large tables and skipping the per-page encode step | ||
| // entirely; we encode only the few sampled boundaries. | ||
| func packPhysLoc(fileID, pageID int32) uint64 { |
func physlocPageBoundarySortKey(fileID, pageID int32) uint64 {
var b [8]byte
binary.LittleEndian.PutUint32(b[0:4], uint32(pageID))
binary.LittleEndian.PutUint16(b[4:6], uint16(fileID))
binary.LittleEndian.PutUint16(b[6:8], 0xFFFF)
return binary.BigEndian.Uint64(b[:])
}
Can you check this? Claude suggested it. If it's better, we can use it.
| // splitViaIAMWalk plans chunks for any heap or clustered table by reading | ||
| // only the table's Index Allocation Map pages via | ||
| // sys.dm_db_database_page_allocations. | ||
| func (m *MSSQL) splitViaIAMWalk(ctx context.Context, stream types.StreamInterface, chunks *types.Set[types.Chunk]) error { |
- The first chunk {nil, B1} is unbounded on the low side.
Scan: WHERE %%physloc%% <= B1. This includes rows on any page below B1, including pages that are not in the IAM walk list (e.g. pages deallocated since stats were gathered, or pages reached through forwarded-row pointers).
Can you verify this?
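To make the question concrete, here is a minimal sketch of how sorted boundaries turn into chunk predicates. The `chunk` type, `planChunks`, and `predicate` are hypothetical stand-ins, and the exclusive-lower / inclusive-upper comparison is an assumption rather than something confirmed from the PR.

```go
package main

import "fmt"

// chunk is a hypothetical stand-in for types.Chunk; an empty string means
// the bound is open.
type chunk struct{ Min, Max string }

// planChunks turns sorted boundaries B1..Bn into n+1 chunks with open ends.
func planChunks(boundaries []string) []chunk {
	chunks := make([]chunk, 0, len(boundaries)+1)
	prev := ""
	for _, b := range boundaries {
		chunks = append(chunks, chunk{Min: prev, Max: b})
		prev = b
	}
	return append(chunks, chunk{Min: prev, Max: ""})
}

// predicate renders the WHERE clause, assuming an exclusive lower bound and
// an inclusive upper bound.
func predicate(c chunk) string {
	switch {
	case c.Min == "" && c.Max == "":
		return "1 = 1"
	case c.Min == "":
		return fmt.Sprintf("%%%%physloc%%%% <= %s", c.Max)
	case c.Max == "":
		return fmt.Sprintf("%%%%physloc%%%% > %s", c.Min)
	default:
		return fmt.Sprintf("%%%%physloc%%%% > %s AND %%%%physloc%%%% <= %s", c.Min, c.Max)
	}
}

func main() {
	for _, c := range planChunks([]string{"B1", "B2"}) {
		fmt.Println(predicate(c))
	}
}
```

Under these assumptions every possible %%physloc%% value matches exactly one predicate, so a row on a page the walk did not list still lands in a single chunk. The open first chunk picks up anything below B1; what needs verifying is whether that is acceptable for the actual planner.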
| case chunkColumn != "": | ||
| stmt = jdbc.MSSQLChunkScanQuery(stream, []string{chunkColumn}, chunk, filter) | ||
| } else if len(pkColumns) > 0 { | ||
| case isPhysLocChunk(chunk): |
A table with a BINARY(8) primary key produces PK boundaries of the form "0x" + 16 hex chars = 18 characters, which is exactly the shape isPhysLocChunk uses to detect physloc boundaries. ChunkIterator would then scan with WHERE %%physloc%% > X instead of WHERE pk > X and return the wrong rows, causing data loss or incorrect results. This is not theoretical: BINARY(8) PKs exist in real workloads (hash keys, external IDs).
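The collision can be reproduced with a few lines. In this sketch, `looksLikePhysLoc` is a hypothetical shape check written from the description above (it is not the PR's isPhysLocChunk), and `taggedChunk` with its `Kind` field is one possible alternative, shown only as an assumption.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"strings"
)

// looksLikePhysLoc is a hypothetical shape check: "0x" followed by exactly
// 16 hex characters.
func looksLikePhysLoc(s string) bool {
	if len(s) != 18 || !strings.HasPrefix(s, "0x") {
		return false
	}
	_, err := hex.DecodeString(s[2:])
	return err == nil
}

// chunkKind is a hypothetical explicit tag set by the planner that created
// the chunk, so the scanner does not have to guess from the value's shape.
type chunkKind int

const (
	kindColumn chunkKind = iota
	kindPhysLoc
)

// taggedChunk is a hypothetical chunk that carries its kind with it.
type taggedChunk struct {
	Kind     chunkKind
	Min, Max string
}

func main() {
	// A boundary from a BINARY(8) primary key, rendered as a hex literal.
	pkBoundary := "0xDEADBEEF00000001"

	fmt.Println(looksLikePhysLoc(pkBoundary)) // prints "true": misdetected

	c := taggedChunk{Kind: kindColumn, Max: pkBoundary}
	fmt.Println(c.Kind == kindPhysLoc) // prints "false": the tag decides
}
```

Carrying the kind explicitly removes the ambiguity at the cost of changing the chunk representation, which may matter if chunks are persisted in state.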

Description
IAM walk chunk planning: Adds a fast physloc-boundary planner that uses sys.dm_db_database_page_allocations (LIMITED mode) to stream (file_id, page_id) pairs and generate page-aligned %%physloc%% chunks without scanning or sorting the table. It is used when supported (SQL Server 2012+ with VIEW DATABASE STATE; not Azure SQL DB/MI).
Sampling fallback + shared abstraction: When the IAM walk is unavailable or fails, falls back to TABLESAMPLE SYSTEM + %%physloc%% sampling to estimate evenly spaced boundaries using only SELECT permission; the sampling-percentage calculation and clamp constants are shared across MSSQL and Oracle.
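As an illustration of what a shared sampling-percentage helper might look like, here is a minimal sketch. The function name, the formula, and both clamp constants are assumptions; the PR's actual values and calculation are not shown in this description.

```go
package main

import "fmt"

// Hypothetical clamp constants; the PR's real values are not shown here.
const (
	minSamplePercent = 0.0001
	maxSamplePercent = 100.0
)

// samplePercent returns the percentage to sample so that roughly
// targetSamples rows come back from a table of approxRows rows, clamped to
// [minSamplePercent, maxSamplePercent].
func samplePercent(approxRows, targetSamples int64) float64 {
	if approxRows <= 0 {
		// Unknown or empty row count: sample everything.
		return maxSamplePercent
	}
	pct := float64(targetSamples) * 100 / float64(approxRows)
	if pct < minSamplePercent {
		return minSamplePercent
	}
	if pct > maxSamplePercent {
		return maxSamplePercent
	}
	return pct
}

func main() {
	// 10,000 samples out of 1,000,000 rows.
	fmt.Printf("%.6f\n", samplePercent(1_000_000, 10_000))
}
```

The result can be passed straight into a `TABLESAMPLE SYSTEM (%.6f PERCENT)` clause. The lower clamp keeps the percentage from formatting as zero on very large tables.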
Dependencies / requirements:
Fixes # (issue)
Type of change
How Has This Been Tested?
VIEW DATABASE STATE (or run on Azure SQL DB/MI) and re-run the same sync.
Screenshots or Recordings
Documentation
Related PR's (If Any):