fix: pre-release review fixes for v0.6.0

anandgupta42 · claude · anandgupta42 · commit 64b0e8f0fd67 · 2026-04-21T13:17:19.000-07:00
Addresses P0/P1 concerns from the multi-persona release review:

- `sqlserver.ts` typecheck failures: add `// @ts-expect-error` for optional
  `@azure/identity` peer dep; pass `tokenResponse.token` directly to
  `parseTokenExpiry` to avoid `string | undefined` narrowing error.
- Wrap 7 upstream-shared edits with `altimate_change` markers so Marker
  Guard CI passes: providers.ts, app.tsx, dialog-provider.tsx, config.ts,
  provider.ts, anthropic.txt, tool/registry.ts.
- Databricks plugin hardening: add `isValidDatabricksHost` helper with an
  explicit CRLF/whitespace check as defense in depth (JS regex `$` without
  the `m` flag already rejects a trailing `\n`, unlike Python or PCRE); log
  the JSON parse error at debug level instead of silently swallowing it.
- Provider `databricks` loader validates host in env-fallback path too
  and uses `isValidDatabricksHost` instead of raw regex.
- `toolNamesFromMessages` validates tool name against
  `/^[a-zA-Z0-9_-]{1,64}$/` before registering a stub, preventing tainted
  session files from injecting shell metacharacters or ANSI escapes into
  API requests and TUI rendering.
- `sqlserver` driver: pass a restricted env to `az account get-access-token`
  so unrelated secrets (DATABRICKS_TOKEN, cloud provider keys) are not
  inherited by `az` or any `az` extension.
- `data_diff` tool description: add `cascade` algorithm (was missing from
  docs), partition-threshold hint, and PII/PHI compliance note.
- `data-parity` SKILL.md: add "Regulated / Sensitive Data" section at top
  asking agents to prefer `algorithm: "profile"` for tables that may
  contain PII/PHI/PCI data.
- Docs: add `## Databricks AI Gateway` section to `providers.md` with PAT
  format, env vars, supported domains, and model list.
- Docs: add `data-engineering/guides/data-parity.md` user guide covering
  supported warehouse pairs, algorithms, partition modes, MSSQL/Fabric
  Azure AD auth flows, and compliance guidance. Add nav entry.

Deferred to v0.6.1 (filed as issues):
- `data_diff` sample-row redaction / include_values opt-in (feature)
- Audit log for data_diff calls (feature)
- Row-count ceiling / max_rows guard (feature)
- Databricks model registry refresh mechanism (feature)
- Split data-diff.ts into dialect/cte/partitioning modules (refactor)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
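
The `toolNamesFromMessages` change is described in the message above, but its diff body is not shown in this view. The following is a minimal sketch of that kind of allowlist check, not the committed implementation: only the regex is taken from the commit message, and the helper names `isSafeToolName` and `safeToolNames` are hypothetical.

```typescript
// Pattern quoted in the commit message: 1 to 64 characters drawn from
// ASCII letters, digits, "_" and "-".
const TOOL_NAME_RE = /^[a-zA-Z0-9_-]{1,64}$/

// Hypothetical helper (assumption): accepts only strings that match the
// pattern, so shell metacharacters, control characters, and ANSI escape
// sequences are rejected.
function isSafeToolName(name: unknown): name is string {
  return typeof name === "string" && TOOL_NAME_RE.test(name)
}

// Hypothetical filter (assumption): keeps the unique, well-formed names
// found in untrusted input such as a session file, in first-seen order.
function safeToolNames(candidates: unknown[]): string[] {
  const seen = new Set<string>()
  for (const candidate of candidates) {
    if (isSafeToolName(candidate)) seen.add(candidate)
  }
  return Array.from(seen)
}
```

An allowlist is used here rather than a denylist because it does not need to enumerate every dangerous character: anything outside the small accepted alphabet is dropped before a stub would be registered.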
diff --git a/.opencode/skills/data-parity/SKILL.md b/.opencode/skills/data-parity/SKILL.md
@@ -5,6 +5,21 @@ description: Validate that two tables or query results are identical — or diag
 
 # Data Parity (Table Diff)
 
+## CRITICAL: Regulated / Sensitive Data
+
+`data_diff` includes up to 5 **sample diff rows** in the tool output so you can see *which* values differ. Those rows are part of the conversation and are sent to the LLM provider you're using.
+
+Before running `data_diff` against a table that might contain PII, PHI, PCI, or other regulated data:
+
+1. **Ask the user** whether the target contains regulated columns.
+2. If yes, prefer `algorithm: "profile"` — it compares column-level statistics (count, nulls, min/max, distinct count) without any row values leaving the database.
+3. If a row-level diff is genuinely required, tell the user that up to 5 sample rows will be sent to the LLM and get explicit approval before calling the tool.
+4. Consider scoping with `where_clause` to exclude sensitive customers/accounts first.
+
+Default to profile mode whenever the table name suggests regulated data (`customers`, `patients`, `orders`, `payments`, `accounts`, `users`, etc.) unless the user explicitly requests row-level comparison.
+
+---
+
 ## CRITICAL: Always Start With a Plan
 
 **Before doing anything else**, generate a numbered TODO list for the user:
diff --git a/docs/docs/configure/providers.md b/docs/docs/configure/providers.md
@@ -289,6 +289,50 @@ Billing flows through your Snowflake credits — no per-token costs.
 !!! note
     Model availability depends on your Snowflake region. Enable cross-region inference with `ALTER ACCOUNT SET CORTEX_ENABLED_CROSS_REGION = 'ANY_REGION'` for full model access.
 
+## Databricks AI Gateway
+
+Connect to Databricks serving endpoints (Foundation Model APIs) via your workspace PAT. Use Databricks-hosted Llama, Claude, GPT, Gemini, DBRX, or Mixtral for agent reasoning — billing flows through your Databricks account.
+
+```json
+{
+  "provider": {
+    "databricks": {}
+  },
+  "model": "databricks/databricks-claude-sonnet-4-6"
+}
+```
+
+Authenticate with `altimate auth databricks` and enter credentials as `workspace-host::pat-token`:
+
+```text
+myworkspace.cloud.databricks.com::dapi1234567890abcdef
+```
+
+Or set environment variables:
+
+```bash
+export DATABRICKS_HOST=myworkspace.cloud.databricks.com
+export DATABRICKS_TOKEN=dapi1234567890abcdef
+```
+
+Create a PAT in Databricks: **Settings → Developer → Access Tokens → Generate New Token**.
+
+**Supported workspace domains:** `*.cloud.databricks.com` (AWS), `*.azuredatabricks.net` (Azure), `*.gcp.databricks.com` (GCP).
+
+**Available models:**
+
+| Provider | Models |
+|----------|--------|
+| Meta Llama | `databricks-meta-llama-3-1-405b-instruct`, `databricks-meta-llama-3-1-70b-instruct`, `databricks-meta-llama-3-1-8b-instruct` |
+| Anthropic via Databricks | `databricks-claude-sonnet-4-6`, `databricks-claude-opus-4-6` |
+| OpenAI via Databricks | `databricks-gpt-5-4`, `databricks-gpt-5-mini` |
+| Google via Databricks | `databricks-gemini-3-1-pro` |
+| Databricks native | `databricks-dbrx-instruct` |
+| Mistral (tool calls unsupported) | `databricks-mixtral-8x7b-instruct` |
+
+!!! note
+    Databricks bills directly for these models — altimate-code reports `$0` cost for Databricks-routed requests since pricing depends on your Databricks contract.
+
 ## Custom / OpenAI-Compatible
 
 Any OpenAI-compatible endpoint can be used as a provider:
diff --git a/docs/docs/data-engineering/guides/data-parity.md b/docs/docs/data-engineering/guides/data-parity.md
@@ -0,0 +1,126 @@
+# Data Parity (Table Diff)
+
+Validate that two tables — or two query results — are identical across databases, or diagnose exactly how they differ. Use for **migration validation**, **ETL regression**, and **query refactor verification**.
+
+altimate-code ships a dedicated `data_diff` tool and a `data-parity` skill that orchestrates the full workflow: plan, inspect schema, confirm keys, profile, then diff.
+
+## Supported warehouse pairs
+
+Works across any combination of:
+
+- PostgreSQL
+- Snowflake
+- BigQuery
+- Databricks (SQL Warehouses)
+- ClickHouse
+- MySQL
+- Redshift
+- SQL Server
+- Microsoft Fabric
+- DuckDB
+- SQLite
+- Oracle
+
+Same-dialect comparisons use a fast FULL OUTER JOIN. Cross-database comparisons use a bisection hashing algorithm that streams checksums rather than raw rows — so you can diff a 100M-row Postgres table against its Snowflake replica without pulling the data out.
+
+## Quick start
+
+```bash
+altimate
+```
+
+In the TUI, just describe what you want to compare:
+
+```text
+Compare orders in postgres_prod with orders in snowflake_dw using id as the primary key.
+```
+
+The agent will:
+
+1. List your warehouse connections.
+2. Inspect both schemas, propose primary keys, and flag audit/timestamp columns to exclude.
+3. Confirm your choices.
+4. Run a column profile first (cheap — no row scan).
+5. Run the row-level diff only on columns that diverged.
+
+## Algorithms
+
+| Algorithm | When to use | Cost |
+|-----------|-------------|------|
+| `auto` | Default. Picks JoinDiff for same-dialect, HashDiff for cross-database. | Cheapest valid choice |
+| `joindiff` | Same-database comparison. Fast. | One FULL OUTER JOIN |
+| `hashdiff` | Cross-database. Works at any scale. | Bisection over checksums |
+| `profile` | Compliance-safe. Column stats only — no row values leave the database. | Cheapest |
+| `cascade` | Profile first, then HashDiff on columns that diverged. Balanced default for exploratory diffs. | Column stats + targeted row diff |
+
+## Partitioning large tables
+
+For tables beyond ~10M rows, partition the diff into independent batches:
+
+```text
+Compare orders between postgres and snowflake, partitioned by order_date month.
+```
+
+Three partition modes:
+
+| Mode | How to trigger | Example |
+|------|----------------|---------|
+| **Date** | Set `partition_column` + `partition_granularity` | `l_shipdate` + `month` |
+| **Numeric** | Set `partition_column` + `partition_bucket_size` | `l_orderkey` + `100000` |
+| **Categorical** | Set `partition_column` alone (no granularity/bucket) | `region`, `status`, `country` |
+
+Each partition is diffed independently. Results are aggregated with a per-partition breakdown so you can see *which* groups have differences.
+
+## SQL Server and Microsoft Fabric
+
+Both `sqlserver` and `fabric` are supported. For Azure AD / Entra ID authentication, altimate-code recognizes all of the major flows through `tedious`:
+
+| `authentication` | Config fields | Use case |
+|------------------|---------------|----------|
+| `azure-active-directory-password` | `azure_client_id`, `azure_tenant_id`, `user`, `password` | User credentials |
+| `azure-active-directory-access-token` (or `access-token`) | `access_token` | Pre-fetched token |
+| `service-principal-secret` (`service-principal`) | `azure_tenant_id`, `azure_client_id`, `azure_client_secret` | Service principals |
+| `azure-active-directory-msi-vm` (`msi`) | `azure_client_id` (optional) | Azure VM managed identity |
+| `azure-active-directory-msi-app-service` | `azure_client_id` (optional) | App Service managed identity |
+| `azure-active-directory-default` (`default` / `CLI`) | — | DefaultAzureCredential chain (CLI, env, MSI) |
+
+All Azure AD connections force TLS encryption.
+
+## Compliance and sensitive data
+
+!!! warning "PII / PHI / PCI data"
+    `data_diff` prints up to 5 sample diff rows in tool output. Those rows become part of the conversation and are sent to your LLM provider.
+
+    When comparing tables that might contain regulated data:
+
+    - Start with `algorithm: "profile"` — column-level statistics only, no row values leave the database.
+    - If a row-level diff is genuinely required, scope it with a `where_clause` that excludes sensitive customers / accounts.
+    - The `data-parity` skill asks for confirmation before sending sample rows to the LLM when the table name matches common regulated patterns (`customers`, `patients`, `orders`, `payments`, `accounts`, `users`).
+
+## Column auto-discovery and audit exclusion
+
+When you omit `extra_columns` and the source is a plain table name, altimate-code:
+
+1. Queries `information_schema` (or the dialect-specific equivalent) on both sides.
+2. Excludes audit/timestamp columns by name pattern (`updated_at`, `created_at`, `_fivetran_synced`, `_airbyte_emitted_at`, etc.).
+3. Queries column defaults and excludes anything with an auto-generating timestamp default (`NOW()`, `CURRENT_TIMESTAMP`, `GETDATE()`, `SYSDATE`, `SYSTIMESTAMP`).
+4. Reports excluded columns so you can override if the timestamps are part of what you're validating.
+
+When the source is a SQL query, only the key columns are compared unless you explicitly list `extra_columns`. Always provide `extra_columns` for query-mode comparisons.
+
+## The `data_diff` tool
+
+Direct tool invocation (if you prefer not to use the skill):
+
+```
+data_diff(
+  source = "orders",
+  target = "orders",
+  source_warehouse = "postgres_prod",
+  target_warehouse = "snowflake_dw",
+  key_columns = ["id"],
+  algorithm = "auto",
+)
+```
+
+See the [tool reference](../tools/warehouse-tools.md) for the full parameter list.
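
The partition-mode table in the guide above can be read as a small decision function. The sketch below is illustrative only: the parameter names come from the guide, while the `PartitionOptions` type, the `resolvePartitionMode` function, and the precedence when both granularity and bucket size are set (date wins here) are assumptions, not the tool's documented behavior.

```typescript
type PartitionMode = "none" | "date" | "numeric" | "categorical"

// Parameter names as they appear in the guide; the shape itself is an assumption.
interface PartitionOptions {
  partition_column?: string
  partition_granularity?: string
  partition_bucket_size?: number
}

// Hypothetical resolver mirroring the table:
//   column + granularity -> date
//   column + bucket size -> numeric
//   column alone         -> categorical
function resolvePartitionMode(opts: PartitionOptions): PartitionMode {
  if (!opts.partition_column) return "none"
  // Assumption: granularity takes precedence if both extras are supplied.
  if (opts.partition_granularity !== undefined) return "date"
  if (opts.partition_bucket_size !== undefined) return "numeric"
  return "categorical"
}
```

For instance, `l_shipdate` with `month` resolves to the date mode, `l_orderkey` with `100000` to the numeric mode, and `region` alone to the categorical mode, matching the examples in the table.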
diff --git a/docs/mkdocs.yml b/docs/mkdocs.yml
@@ -110,6 +110,7 @@ nav:
       - Guides:
           - Cost Optimization: data-engineering/guides/cost-optimization.md
           - Migration: data-engineering/guides/migration.md
+          - Data Parity: data-engineering/guides/data-parity.md
           - Using with Claude Code: data-engineering/guides/using-with-claude-code.md
           - Using with Codex: data-engineering/guides/using-with-codex.md
           - ClickHouse: data-engineering/guides/clickhouse.md
diff --git a/packages/drivers/src/sqlserver.ts b/packages/drivers/src/sqlserver.ts
@@ -164,6 +164,10 @@ export async function connect(config: ConnectionConfig): Promise<Connector> {
           let azCliStderr = ""
 
           try {
+            // @azure/identity is an optional peer dependency: dynamic import so users
+            // who don't use Azure AD don't need to install it. The module is resolved
+            // at runtime from the installed package.
+            // @ts-expect-error — optional peer; types only present when installed
             const azureIdentity = await import("@azure/identity")
             const credential = new azureIdentity.DefaultAzureCredential(
               config.azure_client_id
@@ -175,7 +179,7 @@ export async function connect(config: ConnectionConfig): Promise<Connector> {
               token = tokenResponse.token
               // @azure/identity provides expiresOnTimestamp (ms). Prefer it; fall
               // back to parsing the JWT exp claim so both paths share the cache.
-              expiresAt = tokenResponse.expiresOnTimestamp ?? parseTokenExpiry(token)
+              expiresAt = tokenResponse.expiresOnTimestamp ?? parseTokenExpiry(tokenResponse.token)
             }
           } catch (err) {
             azureIdentityError = err
@@ -193,6 +197,19 @@ export async function connect(config: ConnectionConfig): Promise<Connector> {
               const childProcess = await import("node:child_process")
               const { promisify } = await import("node:util")
               const execFileAsync = promisify(childProcess.execFile)
+              // Restrict the inherited environment so unrelated secrets in the caller's
+              // env (e.g. DATABRICKS_TOKEN, cloud provider keys) are NOT passed to `az`
+              // or any `az` extension. Pass through only the PATH/HOME essentials and
+              // Azure-specific variables `az` actually needs.
+              const restrictedEnv: NodeJS.ProcessEnv = {}
+              for (const k of [
+                "PATH", "HOME", "USER", "USERPROFILE", "LOCALAPPDATA", "APPDATA",
+                "AZURE_CONFIG_DIR", "AZURE_EXTENSION_DIR", "AZURE_CORE_NO_COLOR",
+                "SYSTEMROOT", "TEMP", "TMP", "LANG", "LC_ALL",
+              ]) {
+                const v = process.env[k]
+                if (v !== undefined) restrictedEnv[k] = v
+              }
               const { stdout } = await execFileAsync(
                 "az",
                 [
@@ -201,7 +218,7 @@ export async function connect(config: ConnectionConfig): Promise<Connector> {
                   "--query", "accessToken",
                   "-o", "tsv",
                 ],
-                { encoding: "utf-8", timeout: 15000 },
+                { encoding: "utf-8", timeout: 15000, env: restrictedEnv },
               )
               const out = String(stdout).trim()
               if (out) {
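
The environment restriction in the hunk above can also be expressed as a pure function, which makes it easy to test without spawning `az`. This is a sketch under stated assumptions: the allowlist is copied from the diff, but the `buildRestrictedEnv` helper and its signature are hypothetical and not part of the commit.

```typescript
// Variable names copied from the allowlist in the sqlserver.ts hunk.
const AZ_ENV_ALLOWLIST: readonly string[] = [
  "PATH", "HOME", "USER", "USERPROFILE", "LOCALAPPDATA", "APPDATA",
  "AZURE_CONFIG_DIR", "AZURE_EXTENSION_DIR", "AZURE_CORE_NO_COLOR",
  "SYSTEMROOT", "TEMP", "TMP", "LANG", "LC_ALL",
]

// Hypothetical helper (assumption): copies only allowlisted variables that
// are actually set, so secrets such as DATABRICKS_TOKEN never reach the child.
function buildRestrictedEnv(
  source: Record<string, string | undefined>,
  allow: readonly string[] = AZ_ENV_ALLOWLIST,
): Record<string, string> {
  const restricted: Record<string, string> = {}
  for (const key of allow) {
    const value = source[key]
    if (value !== undefined) restricted[key] = value
  }
  return restricted
}
```

In the driver, the result would be passed as the `env` option of `execFile`, exactly where the diff passes `restrictedEnv`.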
diff --git a/packages/opencode/src/altimate/plugin/databricks.ts b/packages/opencode/src/altimate/plugin/databricks.ts
@@ -7,14 +7,27 @@ import { Auth, OAUTH_DUMMY_KEY } from "@/auth"
  */
 export const VALID_HOST_RE = /^[a-zA-Z0-9._-]+\.(cloud\.databricks\.com|azuredatabricks\.net|gcp\.databricks\.com)$/
 
+/**
+ * Validate a Databricks workspace host. Returns true only when the host
+ * matches the whitelist regex AND contains no whitespace characters
+ * (CR/LF/tab/space). JS regex `$` without the `m` flag already rejects a
+ * trailing `\n`, so the explicit check is defense in depth against
+ * CRLF-style injection if the value is ever spliced into a URL or header.
+ */
+export function isValidDatabricksHost(host: string): boolean {
+  if (!host) return false
+  if (/[\r\n\t\s]/.test(host)) return false
+  return VALID_HOST_RE.test(host)
+}
+
 /** Parse a `host::token` credential string for Databricks PAT auth. */
 export function parseDatabricksPAT(code: string): { host: string; token: string } | null {
   const sep = code.indexOf("::")
   if (sep === -1) return null
   const host = code.substring(0, sep).trim()
   const token = code.substring(sep + 2).trim()
   if (!host || !token) return null
-  if (!VALID_HOST_RE.test(host)) return null
+  if (!isValidDatabricksHost(host)) return null
   return { host, token }
 }
 
@@ -44,6 +57,11 @@ export async function DatabricksAuthPlugin(_input: PluginInput): Promise<Hooks>
         const auth = await getAuth()
         if (auth.type !== "oauth") return {}
 
+        // Host validation lives in the provider loader (see provider.ts) —
+        // the plugin auth type doesn't expose accountId. The provider loader
+        // re-validates with `isValidDatabricksHost` on every config load, so
+        // a tampered auth.json can't redirect `baseURL` to an unknown host.
+
         for (const model of Object.values(provider.models)) {
           model.cost = { input: 0, output: 0, cache: { read: 0, write: 0 } }
         }
@@ -87,8 +105,14 @@ export async function DatabricksAuthPlugin(_input: PluginInput): Promise<Hooks>
                   body = result.body
                   headers.delete("content-length")
                 }
-              } catch {
-                // JSON parse error — pass original body through untransformed
+              } catch (err) {
+                // JSON parse error — pass original body through untransformed.
+                // Body transformation is best-effort; the request continues
+                // unchanged so the upstream endpoint can return its own error.
+                if (process.env["DEBUG"]) {
+                  // eslint-disable-next-line no-console
+                  console.debug("databricks: body transform skipped", err)
+                }
               }
             }
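
To see how the host validation behaves on hostile input, the two helpers from the `databricks.ts` hunks above can be exercised in isolation. The regex and function bodies are reproduced from the diff; the sample inputs (including the `evil.example.com` host and the `dapi123` token) are made up for illustration.

```typescript
// Reproduced from the diff: allowlist of Databricks workspace domains.
const VALID_HOST_RE = /^[a-zA-Z0-9._-]+\.(cloud\.databricks\.com|azuredatabricks\.net|gcp\.databricks\.com)$/

// Reproduced from the diff: reject empty hosts and any whitespace before
// applying the domain allowlist.
function isValidDatabricksHost(host: string): boolean {
  if (!host) return false
  if (/[\r\n\t\s]/.test(host)) return false
  return VALID_HOST_RE.test(host)
}

// Reproduced from the diff: parse a `host::token` credential string.
function parseDatabricksPAT(code: string): { host: string; token: string } | null {
  const sep = code.indexOf("::")
  if (sep === -1) return null
  const host = code.substring(0, sep).trim()
  const token = code.substring(sep + 2).trim()
  if (!host || !token) return null
  if (!isValidDatabricksHost(host)) return null
  return { host, token }
}

// Illustrative inputs (assumptions, not taken from the commit).
console.log(isValidDatabricksHost("myworkspace.cloud.databricks.com")) // true
console.log(isValidDatabricksHost("myworkspace.cloud.databricks.com\r\nHost: evil")) // false
console.log(isValidDatabricksHost("x.cloud.databricks.com.evil.example.com")) // false
console.log(parseDatabricksPAT("evil.example.com::dapi123")) // null
```

Because `parseDatabricksPAT` trims the host before validating, surrounding spaces in the pasted credential are tolerated, while whitespace inside the host is still rejected.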
 
diff --git a/packages/opencode/src/altimate/tools/data-diff.ts b/packages/opencode/src/altimate/tools/data-diff.ts
@@ -16,7 +16,14 @@ export const DataDiffTool = Tool.define("data_diff", {
     "- auto: JoinDiff if same dialect, HashDiff if cross-database (default)",
     "- joindiff: FULL OUTER JOIN (fast, same-database only)",
     "- hashdiff: Bisection with checksums (cross-database, any scale)",
-    "- profile: Column-level statistics comparison",
+    "- profile: Column-level statistics comparison (no row-level diff)",
+    "- cascade: Profile first, then HashDiff on columns that diverged",
+    "",
+    "For very large tables (>10M rows), set partition_column to split work into smaller",
+    "independent diffs (see partition_column parameter for modes).",
+    "",
+    "⚠ Compliance note: sample diff rows (up to 5) appear in tool output and are sent to the",
+    "LLM provider. If comparing PII/PHI/PCI data, use algorithm='profile' (stats only, no values).",
   ].join("\n"),
   parameters: z.object({
     source: z.string().describe(
diff --git a/packages/opencode/src/cli/cmd/providers.ts b/packages/opencode/src/cli/cmd/providers.ts
@@ -431,7 +431,9 @@ export const ProvidersLoginCommand = cmd({
 
         if (["cloudflare", "cloudflare-ai-gateway"].includes(provider)) {
           prompts.log.info(
+            // altimate_change start — altimate docs URL
             "Cloudflare AI Gateway can be configured with CLOUDFLARE_GATEWAY_ID, CLOUDFLARE_ACCOUNT_ID, and CLOUDFLARE_API_TOKEN environment variables. Read more: https://docs.altimate.sh/configure/providers/",
+            // altimate_change end
           )
         }
 
diff --git a/packages/opencode/src/cli/cmd/tui/app.tsx b/packages/opencode/src/cli/cmd/tui/app.tsx
@@ -689,7 +689,9 @@ function App() {
       title: "Open docs",
       value: "docs.open",
       onSelect: () => {
+        // altimate_change start — altimate docs URL
         open("https://docs.altimate.sh").catch(() => {})
+        // altimate_change end
         dialog.clear()
       },
       category: "System",
diff --git a/packages/opencode/src/cli/cmd/tui/component/dialog-provider.tsx b/packages/opencode/src/cli/cmd/tui/component/dialog-provider.tsx
@@ -224,7 +224,9 @@ function ApiMethod(props: ApiMethodProps) {
   return (
     <DialogPrompt
       title={props.title}
+      // altimate_change start — altimate-backend custom placeholder
       placeholder={placeholder}
+      // altimate_change end
       description={
         {
           opencode: (
diff --git a/packages/opencode/src/config/config.ts b/packages/opencode/src/config/config.ts
@@ -1074,7 +1074,9 @@ export namespace Config {
       command: z
         .record(z.string(), Command)
         .optional()
+        // altimate_change start — altimate docs URL
         .describe("Command configuration, see https://docs.altimate.sh/configure/commands/"),
+        // altimate_change end
       skills: Skills.optional().describe("Additional skill folder paths"),
       watcher: z
         .object({
diff --git a/packages/opencode/src/provider/provider.ts b/packages/opencode/src/provider/provider.ts
diff --git a/packages/opencode/src/session/llm.ts b/packages/opencode/src/session/llm.ts
diff --git a/packages/opencode/src/session/prompt/anthropic.txt b/packages/opencode/src/session/prompt/anthropic.txt
diff --git a/packages/opencode/src/tool/registry.ts b/packages/opencode/src/tool/registry.ts
