Skip to content

feat: add optimized chunking strategies#933

Open
vaibhav-datazip wants to merge 4 commits into
stagingfrom
feat/optimized-mssql-chunking
Open

feat: add optimized chunking strategies#933
vaibhav-datazip wants to merge 4 commits into
stagingfrom
feat/optimized-mssql-chunking

Conversation

@vaibhav-datazip
Copy link
Copy Markdown
Collaborator

@vaibhav-datazip vaibhav-datazip commented Apr 30, 2026

Description

  • IAM walk chunk planning: Adds a fast physloc-boundary planner using sys.dm_db_database_page_allocations (LIMITED) to stream (file_id, page_id) and generate page-aligned %%physloc%% chunks without scanning/sorting the table; used when supported (SQL Server 2012+ + VIEW DATABASE STATE, not Azure SQL DB/MI).

  • Sampling fallback + shared abstraction: When IAM walk is unavailable/fails, adds TABLESAMPLE SYSTEM + %%physloc%% sampling to estimate evenly-spaced boundaries using only SELECT permission; the sampling % calculation and clamp constants are shared across MSSQL + Oracle.

Dependencies / requirements:

  • SQL Server 2012+, VIEW DATABASE STATE permission, and not Azure SQL DB/MI for IAM-walk planning.
  • Existing strategies remain as fallback when IAM-walk is unavailable.

Fixes # (issue)

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

  • Scenario A: MSSQL sync with IAM-walk enabled
  • Scenario B: IAM-walk unsupported (capability/permission/platform) falls back cleanly
    • Remove/deny VIEW DATABASE STATE (or run on Azure SQL DB/MI) and re-run the same sync.
    • Verify logs show fallback to existing strategy and sync proceeds.

Screenshots or Recordings

Documentation

  • Documentation Link: [link to README, olake.io/docs, or olake-docs]
  • N/A (bug fix, refactor, or test changes only)

Related PR's (If Any):

@vishalm0509 vishalm0509 requested a deployment to integration_tests May 6, 2026 11:33 — with GitHub Actions Waiting
Copy link
Copy Markdown
Collaborator

@vishalm0509 vishalm0509 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Image
  2. Please change PR title to that of MSSQL

Comment thread pkg/jdbc/jdbc.go
Comment on lines +1032 to +1041
// MSSQLPhysLocSampleBoundaryQuery returns a query that uses TABLESAMPLE SYSTEM to
// sample a percentage of data pages and return sorted %%physloc%% binary values.
func MSSQLPhysLocSampleBoundaryQuery(stream types.StreamInterface, samplePercent float64) string {
quotedTable := QuoteTable(stream.Namespace(), stream.Name(), constants.MSSQL)
return fmt.Sprintf(`
SELECT %%%%physloc%%%%
FROM %s TABLESAMPLE SYSTEM (%.6f PERCENT) WITH (NOLOCK)
ORDER BY %%%%physloc%%%%
`, quotedTable, samplePercent)
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need with NO LOCK ? What if we include a row which got rollbacked ?

// hex literals (the shape produced by IAM walk and the iterative physloc
// planner). Either Min or Max is sufficient to identify the format because
// both planners always set at least one boundary.
func isPhysLocChunk(chunk types.Chunk) bool {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move this util to the bottom ?

}
defer rows.Close()

var sampledLocs [][]byte
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sampledLocs ?

return nil
}

// probeIAMWalkCapability checks if IAM walk
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment incomplete ?

logger.Debugf("IAM walk probe: failed to read server properties: %s", err)
return false
}
if majorVersion < 11 {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

constants ?


// Permission probe: TOP 0 evaluates the DMF without returning any rows.
// Failure here means the current login lacks VIEW DATABASE STATE.
rows, err := m.client.QueryContext(ctx, jdbc.MSSQLIAMWalkPermissionQuery())
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have this query to check the permission

// MSSQLViewDatabaseStatePermissionQuery checks for VIEW DATABASE STATE (all versions) or
// VIEW DATABASE PERFORMANCE STATE (SQL Server 2022+). Either permission grants access to
// sys.dm_cdc_log_scan_sessions.
func MSSQLViewDatabaseStatePermissionQuery() string {
	return `
		SELECT CAST(CASE
			WHEN HAS_PERMS_BY_NAME(NULL, 'DATABASE', 'VIEW DATABASE STATE') = 1 THEN 1
			WHEN HAS_PERMS_BY_NAME(NULL, 'DATABASE', 'VIEW DATABASE PERFORMANCE STATE') = 1 THEN 1
			ELSE 0
		END AS BIT)
	`
}

}
defer rows.Close()

pages := make([]uint64, 0, 1024)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why 1024 ?

// allocating an 8-byte slice per page and sorting with bytes.Compare),
// halving memory on large tables and skipping the per-page encode step
// entirely; we encode only the few sampled boundaries.
func packPhysLoc(fileID, pageID int32) uint64 {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

func physlocPageBoundarySortKey(fileID, pageID int32) uint64 {
	var b [8]byte
	binary.LittleEndian.PutUint32(b[0:4], uint32(pageID))
	binary.LittleEndian.PutUint16(b[4:6], uint16(fileID))
	binary.LittleEndian.PutUint16(b[6:8], 0xFFFF)
	return binary.BigEndian.Uint64(b[:])
}

Can you check this, claude suggested. If it's better we can use it

// splitViaIAMWalk plans chunks for any heap or clustered table by reading
// only the table's Index Allocation Map pages via
// sys.dm_db_database_page_allocations.
func (m *MSSQL) splitViaIAMWalk(ctx context.Context, stream types.StreamInterface, chunks *types.Set[types.Chunk]) error {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. The first chunk {nil, B1} is unbounded on the low side.
    Scan: WHERE %%physloc%% <= B1. This includes rows on any page below B1, including pages not in the IAM walk list (e.g. pages deallocated since stats were gathered, pages from forwarded-row pointers).

Can you verify this

case chunkColumn != "":
stmt = jdbc.MSSQLChunkScanQuery(stream, []string{chunkColumn}, chunk, filter)
} else if len(pkColumns) > 0 {
case isPhysLocChunk(chunk):
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A table with a BINARY(8) primary key produces PK boundaries that are "0x" + 16 hex chars = 18 characters — the exact same shape isPhysLocChunk uses to detect physloc boundaries. ChunkIterator would scan with WHERE %%physloc%% > X instead of WHERE pk > X. Wrong rows returned. Data loss / incorrect results. This is not theoretical — BINARY(8) PKs exist in real workloads (hash keys, external IDs).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants