Commit 64b0e8f

anandgupta42 and claude committed
fix: pre-release review fixes for v0.6.0
Addresses P0/P1 concerns from the multi-persona release review:

- `sqlserver.ts` typecheck failures: add `// @ts-expect-error` for optional `@azure/identity` peer dep; pass `tokenResponse.token` directly to `parseTokenExpiry` to avoid `string | undefined` narrowing error.
- Wrap 7 upstream-shared edits with `altimate_change` markers so Marker Guard CI passes: providers.ts, app.tsx, dialog-provider.tsx, config.ts, provider.ts, anthropic.txt, tool/registry.ts.
- Databricks plugin hardening: add `isValidDatabricksHost` helper with explicit CRLF/whitespace check (JS regex `$` matches before `\n`); log silent JSON parse error at debug level instead of fully swallowing.
- Provider `databricks` loader validates host in env-fallback path too and uses `isValidDatabricksHost` instead of raw regex.
- `toolNamesFromMessages` validates tool name against `/^[a-zA-Z0-9_-]{1,64}$/` before registering a stub, preventing tainted session files from injecting shell metacharacters or ANSI escapes into API requests and TUI rendering.
- `sqlserver` driver: pass a restricted env to `az account get-access-token` so unrelated secrets (DATABRICKS_TOKEN, cloud provider keys) are not inherited by `az` or any `az` extension.
- `data_diff` tool description: add `cascade` algorithm (was missing from docs), partition-threshold hint, and PII/PHI compliance note.
- `data-parity` SKILL.md: add "Regulated / Sensitive Data" section at top asking agents to prefer `algorithm: "profile"` for tables that may contain PII/PHI/PCI data.
- Docs: add `## Databricks AI Gateway` section to `providers.md` with PAT format, env vars, supported domains, and model list.
- Docs: add `data-engineering/guides/data-parity.md` user guide covering supported warehouse pairs, algorithms, partition modes, MSSQL/Fabric Azure AD auth flows, and compliance guidance. Add nav entry.

Deferred to v0.6.1 (filed as issues):

- `data_diff` sample-row redaction / `include_values` opt-in (feature)
- Audit log for `data_diff` calls (feature)
- Row-count ceiling / `max_rows` guard (feature)
- Databricks model registry refresh mechanism (feature)
- Split data-diff.ts into dialect/cte/partitioning modules (refactor)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent ff0c60a commit 64b0e8f

15 files changed

Lines changed: 267 additions & 9 deletions
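The `toolNamesFromMessages` bullet above describes an allowlist check on tool names recovered from session files. A minimal sketch of that idea, with illustrative names (the actual helper is not shown in this diff):

```ts
// Only identifier-style names up to 64 characters may be registered as stubs;
// anything carrying shell metacharacters or ANSI escape bytes fails the test.
const TOOL_NAME_RE = /^[a-zA-Z0-9_-]{1,64}$/

// Illustrative filter over names recovered from an untrusted session file.
function safeStubNames(candidates: string[]): string[] {
  return candidates.filter((name) => TOOL_NAME_RE.test(name))
}
```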


.opencode/skills/data-parity/SKILL.md

Lines changed: 15 additions & 0 deletions
@@ -5,6 +5,21 @@ description: Validate that two tables or query results are identical — or diag
 
 # Data Parity (Table Diff)
 
+## CRITICAL: Regulated / Sensitive Data
+
+`data_diff` includes up to 5 **sample diff rows** in the tool output so you can see *which* values differ. Those rows are part of the conversation and are sent to the LLM provider you're using.
+
+Before running `data_diff` against a table that might contain PII, PHI, PCI, or other regulated data:
+
+1. **Ask the user** whether the target contains regulated columns.
+2. If yes, prefer `algorithm: "profile"` — it compares column-level statistics (count, nulls, min/max, distinct count) without any row values leaving the database.
+3. If a row-level diff is genuinely required, tell the user that up to 5 sample rows will be sent to the LLM and get explicit approval before calling the tool.
+4. Consider scoping with `where_clause` to exclude sensitive customers/accounts first.
+
+Default to profile mode whenever the table name suggests regulated data (`customers`, `patients`, `orders`, `payments`, `accounts`, `users`, etc.) unless the user explicitly requests row-level comparison.
+
+---
+
 ## CRITICAL: Always Start With a Plan
 
 **Before doing anything else**, generate a numbered TODO list for the user:
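For concreteness, a profile-mode call consistent with the guidance above might look like this; the parameter names match the `data_diff` signature documented later in this commit, while the table and warehouse names are placeholders:

```
data_diff(
  source = "patients",
  target = "patients",
  source_warehouse = "postgres_prod",
  target_warehouse = "snowflake_dw",
  key_columns = ["id"],
  algorithm = "profile",
)
```

Only column-level statistics (counts, nulls, min/max, distinct counts) are compared, so no row values enter the conversation.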

docs/docs/configure/providers.md

Lines changed: 44 additions & 0 deletions
@@ -289,6 +289,50 @@ Billing flows through your Snowflake credits — no per-token costs.
 !!! note
     Model availability depends on your Snowflake region. Enable cross-region inference with `ALTER ACCOUNT SET CORTEX_ENABLED_CROSS_REGION = 'ANY_REGION'` for full model access.
 
+## Databricks AI Gateway
+
+Connect to Databricks serving endpoints (Foundation Model APIs) via your workspace PAT. Use Databricks-hosted Llama, Claude, GPT, Gemini, DBRX, or Mixtral for agent reasoning — billing flows through your Databricks account.
+
+```json
+{
+  "provider": {
+    "databricks": {}
+  },
+  "model": "databricks/databricks-claude-sonnet-4-6"
+}
+```
+
+Authenticate with `altimate auth databricks` and enter credentials as `workspace-host::pat-token`:
+
+```text
+myworkspace.cloud.databricks.com::dapi1234567890abcdef
+```
+
+Or set environment variables:
+
+```bash
+export DATABRICKS_HOST=myworkspace.cloud.databricks.com
+export DATABRICKS_TOKEN=dapi1234567890abcdef
+```
+
+Create a PAT in Databricks: **Settings → Developer → Access Tokens → Generate New Token**.
+
+**Supported workspace domains:** `*.cloud.databricks.com` (AWS), `*.azuredatabricks.net` (Azure), `*.gcp.databricks.com` (GCP).
+
+**Available models:**
+
+| Provider | Models |
+|----------|--------|
+| Meta Llama | `databricks-meta-llama-3-1-405b-instruct`, `databricks-meta-llama-3-1-70b-instruct`, `databricks-meta-llama-3-1-8b-instruct` |
+| Anthropic via Databricks | `databricks-claude-sonnet-4-6`, `databricks-claude-opus-4-6` |
+| OpenAI via Databricks | `databricks-gpt-5-4`, `databricks-gpt-5-mini` |
+| Google via Databricks | `databricks-gemini-3-1-pro` |
+| Databricks native | `databricks-dbrx-instruct` |
+| Mistral (tool calls unsupported) | `databricks-mixtral-8x7b-instruct` |
+
+!!! note
+    Databricks bills directly for these models — altimate-code reports `$0` cost for Databricks-routed requests since pricing depends on your Databricks contract.
+
 ## Custom / OpenAI-Compatible
 
 Any OpenAI-compatible endpoint can be used as a provider:
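The `workspace-host::pat-token` format above is parsed by the `parseDatabricksPAT` helper shown later in this commit. A quick usage sketch (the import path is illustrative):

```ts
import { parseDatabricksPAT } from "./altimate/plugin/databricks" // illustrative path

// Host on a whitelisted Databricks domain: parsed into a credential pair.
parseDatabricksPAT("myworkspace.cloud.databricks.com::dapi1234567890abcdef")
// => { host: "myworkspace.cloud.databricks.com", token: "dapi1234567890abcdef" }

// Host outside the supported domains: rejected.
parseDatabricksPAT("evil.example.com::dapi1234567890abcdef")
// => null
```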
docs/docs/data-engineering/guides/data-parity.md

Lines changed: 126 additions & 0 deletions

@@ -0,0 +1,126 @@
+# Data Parity (Table Diff)
+
+Validate that two tables — or two query results — are identical across databases, or diagnose exactly how they differ. Use for **migration validation**, **ETL regression**, and **query refactor verification**.
+
+altimate-code ships a dedicated `data_diff` tool and a `data-parity` skill that orchestrates the full workflow: plan, inspect schema, confirm keys, profile, then diff.
+
+## Supported warehouse pairs
+
+Works across any combination of:
+
+- PostgreSQL
+- Snowflake
+- BigQuery
+- Databricks (SQL Warehouses)
+- ClickHouse
+- MySQL
+- Redshift
+- SQL Server
+- Microsoft Fabric
+- DuckDB
+- SQLite
+- Oracle
+
+Same-dialect comparisons use a fast FULL OUTER JOIN. Cross-database comparisons use a bisection hashing algorithm that streams checksums rather than raw rows — so you can diff a 100M-row Postgres table against its Snowflake replica without pulling the data out.
+
+## Quick start
+
+```bash
+altimate
+```
+
+In the TUI, just describe what you want to compare:
+
+```
+Compare orders in postgres_prod with orders in snowflake_dw using id as the primary key.
+```
+
+The agent will:
+
+1. List your warehouse connections.
+2. Inspect both schemas, propose primary keys, and flag audit/timestamp columns to exclude.
+3. Confirm your choices.
+4. Run a column profile first (cheap — no row scan).
+5. Run the row-level diff only on columns that diverged.
+
+## Algorithms
+
+| Algorithm | When to use | Cost |
+|-----------|-------------|------|
+| `auto` | Default. Picks JoinDiff for same-dialect, HashDiff for cross-database. | Cheapest valid choice |
+| `joindiff` | Same-database comparison. Fast. | One FULL OUTER JOIN |
+| `hashdiff` | Cross-database. Works at any scale. | Bisection over checksums |
+| `profile` | Compliance-safe. Column stats only — no row values leave the database. | Cheapest |
+| `cascade` | Profile first, then HashDiff on columns that diverged. Balanced default for exploratory diffs. | Column stats + targeted row diff |
+
+## Partitioning large tables
+
+For tables beyond ~10M rows, partition the diff into independent batches:
+
+```text
+Compare orders between postgres and snowflake, partitioned by order_date month.
+```
+
+Three partition modes:
+
+| Mode | How to trigger | Example |
+|------|----------------|---------|
+| **Date** | Set `partition_column` + `partition_granularity` | `l_shipdate` + `month` |
+| **Numeric** | Set `partition_column` + `partition_bucket_size` | `l_orderkey` + `100000` |
+| **Categorical** | Set `partition_column` alone (no granularity/bucket) | `region`, `status`, `country` |
+
+Each partition is diffed independently. Results are aggregated with a per-partition breakdown so you can see *which* groups have differences.
+
+## SQL Server and Microsoft Fabric
+
+Both `sqlserver` and `fabric` are supported. For Azure AD / Entra ID authentication, altimate-code recognizes all of the major flows through `tedious`:
+
+| `authentication` | Config fields | Use case |
+|------------------|---------------|----------|
+| `azure-active-directory-password` | `azure_client_id`, `azure_tenant_id`, `user`, `password` | User credentials |
+| `azure-active-directory-access-token` (or `access-token`) | `access_token` | Pre-fetched token |
+| `service-principal-secret` (`service-principal`) | `azure_tenant_id`, `azure_client_id`, `azure_client_secret` | Service principals |
+| `azure-active-directory-msi-vm` (`msi`) | `azure_client_id` (optional) | Azure VM managed identity |
+| `azure-active-directory-msi-app-service` | `azure_client_id` (optional) | App Service managed identity |
+| `azure-active-directory-default` (`default` / `CLI`) | *(none)* | DefaultAzureCredential chain (CLI, env, MSI) |
+
+All Azure AD connections force TLS encryption.
+
+## Compliance and sensitive data
+
+!!! warning "PII / PHI / PCI data"
+    `data_diff` prints up to 5 sample diff rows in tool output. Those rows become part of the conversation and are sent to your LLM provider.
+
+When comparing tables that might contain regulated data:
+
+- Start with `algorithm: "profile"` — column-level statistics only, no row values leave the database.
+- If a row-level diff is genuinely required, scope it with a `where_clause` that excludes sensitive customers / accounts.
+- The `data-parity` skill asks for confirmation before sending sample rows to the LLM when the table name matches common regulated patterns (`customers`, `patients`, `orders`, `payments`, `accounts`, `users`).
+
+## Column auto-discovery and audit exclusion
+
+When you omit `extra_columns` and the source is a plain table name, altimate-code:
+
+1. Queries `information_schema` (or the dialect-specific equivalent) on both sides.
+2. Excludes audit/timestamp columns by name pattern (`updated_at`, `created_at`, `_fivetran_synced`, `_airbyte_emitted_at`, etc.).
+3. Queries column defaults and excludes anything with an auto-generating timestamp default (`NOW()`, `CURRENT_TIMESTAMP`, `GETDATE()`, `SYSDATE`, `SYSTIMESTAMP`).
+4. Reports excluded columns so you can override if the timestamps are part of what you're validating.
+
+When the source is a SQL query, only the key columns are compared unless you explicitly list `extra_columns`. Always provide `extra_columns` for query-mode comparisons.
+
+## The `data_diff` tool
+
+Direct tool invocation (if you prefer not to use the skill):
+
+```
+data_diff(
+  source = "orders",
+  target = "orders",
+  source_warehouse = "postgres_prod",
+  target_warehouse = "snowflake_dw",
+  key_columns = ["id"],
+  algorithm = "auto",
+)
+```
+
+See the [tool reference](../tools/warehouse-tools.md) for the full parameter list.
docs/mkdocs.yml

Lines changed: 1 addition & 0 deletions
@@ -110,6 +110,7 @@ nav:
   - Guides:
     - Cost Optimization: data-engineering/guides/cost-optimization.md
     - Migration: data-engineering/guides/migration.md
+    - Data Parity: data-engineering/guides/data-parity.md
     - Using with Claude Code: data-engineering/guides/using-with-claude-code.md
     - Using with Codex: data-engineering/guides/using-with-codex.md
    - ClickHouse: data-engineering/guides/clickhouse.md

packages/drivers/src/sqlserver.ts

Lines changed: 19 additions & 2 deletions
@@ -164,6 +164,10 @@ export async function connect(config: ConnectionConfig): Promise<Connector> {
   let azCliStderr = ""
 
   try {
+    // @azure/identity is an optional peer dependency — dynamic import so users
+    // who don't use Azure AD don't need to install it. Types are resolved at
+    // runtime via the installed package.
+    // @ts-expect-error — optional peer; types only present when installed
     const azureIdentity = await import("@azure/identity")
     const credential = new azureIdentity.DefaultAzureCredential(
       config.azure_client_id
@@ -175,7 +179,7 @@ export async function connect(config: ConnectionConfig): Promise<Connector> {
       token = tokenResponse.token
       // @azure/identity provides expiresOnTimestamp (ms). Prefer it; fall
       // back to parsing the JWT exp claim so both paths share the cache.
-      expiresAt = tokenResponse.expiresOnTimestamp ?? parseTokenExpiry(token)
+      expiresAt = tokenResponse.expiresOnTimestamp ?? parseTokenExpiry(tokenResponse.token)
     }
   } catch (err) {
     azureIdentityError = err
@@ -193,6 +197,19 @@ export async function connect(config: ConnectionConfig): Promise<Connector> {
     const childProcess = await import("node:child_process")
     const { promisify } = await import("node:util")
     const execFileAsync = promisify(childProcess.execFile)
+    // Restrict the inherited environment so unrelated secrets in the caller's
+    // env (e.g. DATABRICKS_TOKEN, cloud provider keys) are NOT passed to `az`
+    // or any `az` extension. Pass through only the PATH/HOME essentials and
+    // Azure-specific variables `az` actually needs.
+    const restrictedEnv: NodeJS.ProcessEnv = {}
+    for (const k of [
+      "PATH", "HOME", "USER", "USERPROFILE", "LOCALAPPDATA", "APPDATA",
+      "AZURE_CONFIG_DIR", "AZURE_EXTENSION_DIR", "AZURE_CORE_NO_COLOR",
+      "SYSTEMROOT", "TEMP", "TMP", "LANG", "LC_ALL",
+    ]) {
+      const v = process.env[k]
+      if (v !== undefined) restrictedEnv[k] = v
+    }
     const { stdout } = await execFileAsync(
       "az",
       [
@@ -201,7 +218,7 @@ export async function connect(config: ConnectionConfig): Promise<Connector> {
         "--query", "accessToken",
         "-o", "tsv",
       ],
-      { encoding: "utf-8", timeout: 15000 },
+      { encoding: "utf-8", timeout: 15000, env: restrictedEnv },
     )
     const out = String(stdout).trim()
    if (out) {
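`parseTokenExpiry` itself is not part of this diff. Based on the inline comment (`expiresOnTimestamp` is in milliseconds; the fallback parses the JWT `exp` claim), a plausible shape is sketched below; this is an assumption, not the committed implementation:

```ts
// Hypothetical sketch: decode the JWT payload without signature verification
// and convert the `exp` claim (seconds since epoch) to the millisecond scale
// used by @azure/identity's expiresOnTimestamp.
function parseTokenExpiry(token: string): number | undefined {
  const parts = token.split(".")
  if (parts.length !== 3) return undefined
  try {
    const payload = JSON.parse(Buffer.from(parts[1], "base64url").toString("utf-8"))
    return typeof payload.exp === "number" ? payload.exp * 1000 : undefined
  } catch {
    return undefined
  }
}
```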

packages/opencode/src/altimate/plugin/databricks.ts

Lines changed: 27 additions & 3 deletions
@@ -7,14 +7,27 @@ import { Auth, OAUTH_DUMMY_KEY } from "@/auth"
  */
 export const VALID_HOST_RE = /^[a-zA-Z0-9._-]+\.(cloud\.databricks\.com|azuredatabricks\.net|gcp\.databricks\.com)$/
 
+/**
+ * Validate a Databricks workspace host. Returns true only when the host
+ * matches the whitelist regex AND contains no control/whitespace characters
+ * (CR/LF/tab/space) — JS regex `$` matches before a trailing `\n`, so the
+ * explicit check prevents CRLF-style injection if the value is ever spliced
+ * into a URL or header.
+ */
+export function isValidDatabricksHost(host: string): boolean {
+  if (!host) return false
+  if (/[\r\n\t\s]/.test(host)) return false
+  return VALID_HOST_RE.test(host)
+}
+
 /** Parse a `host::token` credential string for Databricks PAT auth. */
 export function parseDatabricksPAT(code: string): { host: string; token: string } | null {
   const sep = code.indexOf("::")
   if (sep === -1) return null
   const host = code.substring(0, sep).trim()
   const token = code.substring(sep + 2).trim()
   if (!host || !token) return null
-  if (!VALID_HOST_RE.test(host)) return null
+  if (!isValidDatabricksHost(host)) return null
   return { host, token }
 }
 
@@ -44,6 +57,11 @@ export async function DatabricksAuthPlugin(_input: PluginInput): Promise<Hooks>
   const auth = await getAuth()
   if (auth.type !== "oauth") return {}
 
+  // Host validation lives in the provider loader (see provider.ts) —
+  // the plugin auth type doesn't expose accountId. The provider loader
+  // re-validates with `isValidDatabricksHost` on every config load, so
+  // a tampered auth.json can't redirect `baseURL` to an unknown host.
+
   for (const model of Object.values(provider.models)) {
     model.cost = { input: 0, output: 0, cache: { read: 0, write: 0 } }
   }
@@ -87,8 +105,14 @@ export async function DatabricksAuthPlugin(_input: PluginInput): Promise<Hooks>
         body = result.body
         headers.delete("content-length")
       }
-    } catch {
-      // JSON parse error — pass original body through untransformed
+    } catch (err) {
+      // JSON parse error — pass original body through untransformed.
+      // Body transformation is best-effort; the request continues
+      // unchanged so the upstream endpoint can return its own error.
+      if (process.env["DEBUG"]) {
+        // eslint-disable-next-line no-console
+        console.debug("databricks: body transform skipped", err)
+      }
     }
   }
 
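One nuance when reading the new comment: in JavaScript, `$` without the `m` flag matches only at the true end of input (matching before a trailing `\n` is Python's behavior), so `VALID_HOST_RE` alone already rejects a host with a trailing newline; the explicit control/whitespace check is still useful defense-in-depth. Usage of the helper as committed:

```ts
isValidDatabricksHost("myworkspace.cloud.databricks.com")     // true  (AWS domain)
isValidDatabricksHost("myworkspace.azuredatabricks.net")      // true  (Azure domain)
isValidDatabricksHost("evil.example.com")                     // false (not a whitelisted domain)
isValidDatabricksHost("myworkspace.cloud.databricks.com\r\n") // false (whitespace check rejects control chars)
isValidDatabricksHost("")                                     // false (empty host)
```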

packages/opencode/src/altimate/tools/data-diff.ts

Lines changed: 8 additions & 1 deletion
@@ -16,7 +16,14 @@ export const DataDiffTool = Tool.define("data_diff", {
     "- auto: JoinDiff if same dialect, HashDiff if cross-database (default)",
     "- joindiff: FULL OUTER JOIN (fast, same-database only)",
     "- hashdiff: Bisection with checksums (cross-database, any scale)",
-    "- profile: Column-level statistics comparison",
+    "- profile: Column-level statistics comparison (no row-level diff)",
+    "- cascade: Profile first, then HashDiff on columns that diverged",
+    "",
+    "For very large tables (>10M rows), set partition_column to split work into smaller",
+    "independent diffs (see partition_column parameter for modes).",
+    "",
+    "⚠ Compliance note: sample diff rows (up to 5) appear in tool output and are sent to the",
+    "LLM provider. If comparing PII/PHI/PCI data, use algorithm='profile' (stats only, no values).",
   ].join("\n"),
   parameters: z.object({
    source: z.string().describe(
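As a sketch, a `cascade` run uses the same call shape as the other algorithms (names are placeholders, mirroring the guide added in this commit):

```
data_diff(
  source = "orders",
  target = "orders",
  source_warehouse = "postgres_prod",
  target_warehouse = "snowflake_dw",
  key_columns = ["id"],
  algorithm = "cascade",
)
```

Cascade profiles every column first, then runs HashDiff only on the columns whose statistics diverged.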

packages/opencode/src/cli/cmd/providers.ts

Lines changed: 2 additions & 0 deletions
@@ -431,7 +431,9 @@ export const ProvidersLoginCommand = cmd({
 
     if (["cloudflare", "cloudflare-ai-gateway"].includes(provider)) {
       prompts.log.info(
+        // altimate_change start — altimate docs URL
         "Cloudflare AI Gateway can be configured with CLOUDFLARE_GATEWAY_ID, CLOUDFLARE_ACCOUNT_ID, and CLOUDFLARE_API_TOKEN environment variables. Read more: https://docs.altimate.sh/configure/providers/",
+        // altimate_change end
       )
     }
 
packages/opencode/src/cli/cmd/tui/app.tsx

Lines changed: 2 additions & 0 deletions
@@ -689,7 +689,9 @@ function App() {
       title: "Open docs",
       value: "docs.open",
       onSelect: () => {
+        // altimate_change start — altimate docs URL
         open("https://docs.altimate.sh").catch(() => {})
+        // altimate_change end
         dialog.clear()
       },
      category: "System",

packages/opencode/src/cli/cmd/tui/component/dialog-provider.tsx

Lines changed: 2 additions & 0 deletions
@@ -224,7 +224,9 @@ function ApiMethod(props: ApiMethodProps) {
   return (
     <DialogPrompt
       title={props.title}
+      // altimate_change start — altimate-backend custom placeholder
       placeholder={placeholder}
+      // altimate_change end
       description={
         {
          opencode: (
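The three marker-wrapped hunks above all follow the `altimate_change start` / `altimate_change end` convention that the commit message says Marker Guard CI enforces. Purely as an illustration of that contract (the real CI job is not part of this commit), a toy checker might look like:

```ts
// Toy sketch: given the added lines of a diff against upstream, flag any
// line that falls outside an altimate_change start/end pair.
function unmarkedEdits(addedLines: string[]): string[] {
  const violations: string[] = []
  let depth = 0
  for (const line of addedLines) {
    if (line.includes("altimate_change start")) depth++
    else if (line.includes("altimate_change end")) depth = Math.max(0, depth - 1)
    else if (depth === 0) violations.push(line)
  }
  return violations
}
```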
