
Commit 5108d02

anandgupta42 and claude committed
fix: add retry limits and improve builder prompt for data engineering tasks
- Add `RETRY_MAX_ATTEMPTS` (10) and `RETRY_MAX_TOTAL_TIME_MS` (120s) constants to `retry.ts` to prevent infinite retry loops on persistent API failures
- Enforce retry limits in `processor.ts` — break out of retry loop when max attempts or total retry time exceeded, publish error and set session idle
- Expand `builder.txt` with 5 new sections for better SQL/dbt output quality:
  - Column and Schema Fidelity (order, count, names, data types must match schema.yml)
  - JOIN Type Selection (INNER vs LEFT JOIN guidance with row count verification)
  - Temporal Determinism (avoid `current_date()`/`now()` on fixed datasets)
  - Fivetran & dbt Package Metadata Columns (`_fivetran_synced`, `source_relation`)
  - Completeness Checks Before dbt Run (verify all models, refs, intermediates)
- Enhanced Self-Review with row count sanity checks and edge case validation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 7d02c3e commit 5108d02

3 files changed

Lines changed: 122 additions & 0 deletions


packages/opencode/src/altimate/prompts/builder.txt

Lines changed: 98 additions & 0 deletions
@@ -27,6 +27,81 @@ When creating dbt models:
- Update schema.yml files alongside model changes
- Run `lineage_check` to verify column-level data flow

## Column and Schema Fidelity

When schema.yml defines a model's columns, treat it as a contract:

1. **Column order matters**: List columns in your SELECT in the SAME order they appear in schema.yml. Many downstream tools and evaluations depend on positional column order. If schema.yml lists `customer_id`, `customer_name`, `total_orders` — your SELECT must output them in that exact sequence.

2. **Column count must match exactly**: Count the columns in schema.yml. Count the columns in your SELECT. They must be equal. Do not add extra columns (e.g., helper columns, intermediate calculations). Do not omit columns (e.g., metadata columns like `_dbt_source_relation` or `_fivetran_synced` if the schema defines them).

3. **Column names must match exactly**: Use the precise names from schema.yml. Do not rename, alias differently, or change casing unless the project convention requires it.

4. **Preserve data types**: If schema.yml describes a column as a string (e.g., "5 seasons, 54 episodes"), do NOT convert it to an integer. If a column contains raw text values, preserve them as-is unless the task explicitly asks for transformation. Over-processing data (extracting numbers from strings, remapping categories, normalizing encodings) when not requested is a common source of errors.
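One mechanical way to enforce this contract is to compare the query output's column metadata against the schema list. A minimal sketch using sqlite3 as a stand-in warehouse — the table, columns, and expected list are hypothetical; in a real project the expected list would be read from schema.yml:

```python
import sqlite3

# Hypothetical contract: in practice this list comes from schema.yml.
EXPECTED_COLUMNS = ["customer_id", "customer_name", "total_orders"]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dim_customers (customer_id INTEGER, customer_name TEXT, total_orders INTEGER)"
)

# Run the model's SELECT and read the output column names, in order.
cur = conn.execute("SELECT customer_id, customer_name, total_orders FROM dim_customers")
actual_columns = [d[0] for d in cur.description]

# Name, count, AND position must all match the schema contract.
assert actual_columns == EXPECTED_COLUMNS, f"column mismatch: {actual_columns}"
```

The same check catches extra helper columns, omitted metadata columns, and reordered output in one comparison.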
## JOIN Type Selection

Choosing the wrong JOIN type is one of the most common causes of wrong row counts:

- **INNER JOIN**: Use when you only want rows that exist in BOTH tables. This DROPS unmatched rows. If your output has fewer rows than expected, check if you used INNER JOIN where LEFT JOIN was needed.
- **LEFT JOIN**: Use when you want ALL rows from the left table, even if no match exists in the right table. Unmatched columns become NULL. If the task says "all customers" or "all records", you almost certainly need LEFT JOIN from the primary table.
- **After every JOIN, verify the row count**: Run `SELECT COUNT(*) FROM <your_model>` and compare against the source table count. If a LEFT JOIN from a 150K-row table produces 150K rows, that's expected. If an INNER JOIN produces 75K rows, ask yourself: should the other 75K be excluded?
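The row-count difference between the two JOIN types is easy to demonstrate. A small sqlite3 sketch — the tables and counts are illustrative, not from any real project:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER);
CREATE TABLE orders (customer_id INTEGER);
INSERT INTO customers VALUES (1), (2), (3), (4);
INSERT INTO orders VALUES (1), (2);  -- only half the customers ever ordered
""")

# LEFT JOIN keeps every customer, padding missing orders with NULL.
left_count = conn.execute(
    "SELECT COUNT(*) FROM customers c LEFT JOIN orders o ON c.customer_id = o.customer_id"
).fetchone()[0]

# INNER JOIN silently drops the two customers with no match.
inner_count = conn.execute(
    "SELECT COUNT(*) FROM customers c INNER JOIN orders o ON c.customer_id = o.customer_id"
).fetchone()[0]

print(left_count, inner_count)  # 4 2
```

If the task says "all customers", the INNER JOIN result here is wrong by construction — the count comparison is what reveals it.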
## Temporal Determinism

Never use `current_date()`, `current_timestamp()`, `now()`, or `getdate()` in dbt models unless the task explicitly requires "as of today" logic. These functions make models non-reproducible — the same model produces different results depending on when it runs.

Common mistakes:
- **Date spines**: `GENERATE_SERIES(start_date, current_date, INTERVAL 1 MONTH)` will produce more rows over time. Instead, derive the end date from the actual data: `SELECT MAX(date_column) FROM source_table`.
- **Age/duration calculations**: `DATEDIFF(month, start_date, current_date)` drifts over time. Use the max date from the dataset or a fixed reference date from the data itself.
- **Filtering**: `WHERE date <= current_date` is usually unnecessary if the source data doesn't contain future dates. If it does, use the dataset's own max date.

When you see `current_date` in existing project models, check whether the data is a fixed/historical dataset or a live feed. For fixed datasets, replace with a data-derived boundary.
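The fix in every case above is the same: anchor calculations to the dataset's own maximum date rather than the wall clock. A sqlite3 sketch with made-up dates:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE source_table (order_date TEXT);
INSERT INTO source_table VALUES ('2023-01-15'), ('2023-06-30'), ('2023-11-02');
""")

# Non-reproducible: date('now') gives a different boundary every day the model runs.
# Reproducible: the dataset's own max date is the same on every run.
anchor = conn.execute("SELECT MAX(order_date) FROM source_table").fetchone()[0]
print(anchor)  # 2023-11-02
```

Date spines, age calculations, and filters all stay stable when they are built from `anchor` instead of `current_date`.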
60+
61+
## Fivetran & dbt Package Metadata Columns
62+
63+
When working with Fivetran-sourced dbt packages (e.g., shopify, hubspot, jira, salesforce), be aware of metadata columns that these packages add automatically:
64+
65+
- **`_fivetran_synced`**: Timestamp added by Fivetran connectors. If schema.yml includes it, your model must pass it through.
66+
- **`_dbt_source_relation`**: Added by the `union_data` or `union_sources` macro when combining data from multiple connectors. If the schema defines it, include it in your SELECT.
67+
- **`source_relation`**: Similar to above, used by some Fivetran packages for multi-source tracking.
68+
69+
If schema.yml lists these columns, they are required output — do not omit them.
## Completeness Checks Before dbt Run

Before running `dbt run`, verify:

1. **All target models exist**: Cross-reference schema.yml — every model defined there should have a corresponding .sql file. If schema.yml defines 3 models and you only created 2, you are not done.
2. **All referenced models are accessible**: Every `ref()` and `source()` in your SQL must resolve. Read the dbt_project.yml and sources.yml to confirm.
3. **Intermediate models are complete**: If your target model depends on intermediate/staging models that don't exist yet, create them first.
## Project Context Loading (MANDATORY before writing any SQL or dbt model)

Before writing or modifying ANY SQL model, you MUST absorb the project context first. Do NOT start coding until you have completed these steps:

1. **Read schema.yml / sources.yml FIRST**: These are your specification. They define expected model names, column names, column descriptions, data types, and test constraints. The column descriptions tell you the INTENDED business logic — treat them as requirements, not suggestions.

2. **Read ALL existing SQL models in the same directory/domain**: If you are creating `client_purchase_status.sql` in the `FINANCE/` folder, read EVERY other `.sql` file in `FINANCE/` and its subdirectories first. Look for:
   - Consistent filtering patterns (e.g., if two models filter `WHERE status = 'R'` for returns, your model should too)
   - Column naming conventions and how values flow between models
   - How intermediate models transform raw data — this tells you what downstream models should expect

3. **Read intermediate/base models that your model will reference**: If your model uses `ref('order_line_items')`, read `order_line_items.sql` completely. Understand every column, especially flags and status fields that determine business logic.

4. **Explore actual data values**: Before writing SQL, query the database to understand what values exist in key columns:
   - `SELECT DISTINCT <flag_column> FROM <table>` to see all possible values
   - `SELECT <column>, COUNT(*) FROM <table> GROUP BY <column>` for distributions
   - This prevents guessing at business logic — you SEE the actual data

5. **State your understanding before coding**: Before writing the first line of SQL, explicitly state:
   - What columns the output should have (from schema.yml)
   - What business logic you inferred from existing models
   - What filtering/aggregation patterns you will follow
   - Any ambiguity you identified and how you resolved it

Skipping this step is the #1 cause of producing SQL that compiles but returns wrong data.
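The two exploration queries from step 4 look like this in practice. A sqlite3 sketch with a made-up status flag column:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE order_line_items (status TEXT);
INSERT INTO order_line_items VALUES ('C'), ('C'), ('R'), ('C'), ('P');
""")

# SELECT DISTINCT: see every value the flag column can actually take.
distinct_values = [row[0] for row in conn.execute(
    "SELECT DISTINCT status FROM order_line_items ORDER BY status"
)]

# GROUP BY + COUNT: see how rows are distributed across those values.
distribution = dict(conn.execute(
    "SELECT status, COUNT(*) FROM order_line_items GROUP BY status"
))

print(distinct_values)  # ['C', 'P', 'R']
print(distribution)     # {'C': 3, 'P': 1, 'R': 1}
```

Seeing that `'R'` exists (and how rare it is) is what lets you decide whether returns must be filtered — instead of guessing at the business logic.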
## Pre-Execution Protocol

Before executing ANY SQL via sql_execute, follow this mandatory sequence:
@@ -67,6 +142,29 @@ Before declaring any task complete, review your own work:
3. **Check lineage impact**: If you modified a model, run lineage_check to verify you didn't break downstream dependencies.

4. **Query and verify the data**: After a successful dbt run or SQL execution, query the output tables to sanity-check results. This step is MANDATORY — a model that compiles but produces wrong data is NOT done.

**Step 4a — Spot-check rows against source:**
Pick 2-3 specific rows from your output table. For each row, run separate queries against the source tables to manually reconstruct the expected values. If your output says customer X has purchase_total = 500, query the source and verify that the raw line items for customer X actually sum to 500. If they don't match, your logic is wrong — fix it.

**Step 4b — Row count sanity check:**
- Compare `COUNT(*)` of your output vs source tables. If your model JOINs customers (150K rows) with orders, the output should have at most 150K rows (LEFT JOIN) or fewer (INNER JOIN). If you get MORE rows than the largest source table, you likely have a fan-out from a bad JOIN (missing join key, duplicate keys).
- If the output has significantly FEWER rows than expected, check whether your JOINs or WHERE clauses are too restrictive. A common mistake: using INNER JOIN when you should use LEFT JOIN, silently dropping rows with no match.
- If you have aggregations: compare the total count and sum of key metrics against the source. For example, if source has 1000 orders totaling $50K, your aggregation should sum to $50K (not $25K because you accidentally filtered half the rows).
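The step 4b reconciliation can be expressed as a pair of assertions against the source. A minimal sqlite3 sketch — table names and amounts are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (customer_id INTEGER, amount REAL);
INSERT INTO orders VALUES (1, 100.0), (1, 150.0), (2, 250.0);
-- The model under review: total spend per customer.
CREATE TABLE customer_totals AS
  SELECT customer_id, SUM(amount) AS total FROM orders GROUP BY customer_id;
""")

source_sum = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
model_sum = conn.execute("SELECT SUM(total) FROM customer_totals").fetchone()[0]
source_rows = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
model_rows = conn.execute("SELECT COUNT(*) FROM customer_totals").fetchone()[0]

# Totals must reconcile: lost or duplicated rows show up here immediately.
assert model_sum == source_sum, f"sum mismatch: {model_sum} vs {source_sum}"
# A grouped output can never have more rows than its source.
assert model_rows <= source_rows, f"fan-out: {model_rows} > {source_rows}"
```

A silent filter or a fan-out JOIN fails one of these two assertions before the model ever reaches the user.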
**Step 4c — Check edge cases and boundaries:**
- If you computed a ratio or percentage: query for rows where it exceeds 100% or is negative. These often reveal a logic error (e.g., including returned items in both numerator and denominator).
- If you have status/category buckets: query the distribution (`GROUP BY status`). Do the proportions make sense? Are any categories empty that shouldn't be? Are there NULL categories the task might require?

**Step 4d — Re-read the task requirements:**
After seeing the actual data, re-read the original task instruction. Does your output match what was asked? Pay attention to:
- Exact column names and their definitions
- Whether the task distinguishes between gross vs net values (e.g., "purchases" might mean only non-returned items)
- Threshold values for categorization (e.g., "10%, 25%, 50%" vs "10%, 20%, 30%")
- Whether NULLs or special values are expected for edge cases

If any check fails, fix the SQL and re-run. Do not proceed until verification passes.
Only after self-review passes should you present the result to the user.

## Available Skills

packages/opencode/src/session/processor.ts

Lines changed: 22 additions & 0 deletions
@@ -496,6 +496,28 @@ export namespace SessionProcessor {
          }
          retryErrorType = e?.name ?? "UnknownError"
          attempt++

          // Give up after max attempts or total retry time exceeded
          const totalRetryTime = retryStartTime ? Date.now() - retryStartTime : 0
          if (
            attempt > SessionRetry.RETRY_MAX_ATTEMPTS ||
            totalRetryTime > SessionRetry.RETRY_MAX_TOTAL_TIME_MS
          ) {
            log.warn("retry limit reached", {
              attempt,
              totalRetryTime,
              maxAttempts: SessionRetry.RETRY_MAX_ATTEMPTS,
              maxTotalTime: SessionRetry.RETRY_MAX_TOTAL_TIME_MS,
            })
            input.assistantMessage.error = error
            Bus.publish(Session.Event.Error, {
              sessionID: input.assistantMessage.sessionID,
              error: input.assistantMessage.error,
            })
            SessionStatus.set(input.sessionID, { type: "idle" })
            break
          }

          const delay = SessionRetry.delay(attempt, error.name === "APIError" ? error : undefined)
          SessionStatus.set(input.sessionID, {
            type: "retry",

packages/opencode/src/session/retry.ts

Lines changed: 2 additions & 0 deletions
@@ -7,6 +7,8 @@ export namespace SessionRetry {
  export const RETRY_BACKOFF_FACTOR = 2
  export const RETRY_MAX_DELAY_NO_HEADERS = 30_000 // 30 seconds
  export const RETRY_MAX_DELAY = 2_147_483_647 // max 32-bit signed integer for setTimeout
  export const RETRY_MAX_ATTEMPTS = 10 // give up after this many retries
  export const RETRY_MAX_TOTAL_TIME_MS = 120_000 // give up after 2 minutes of total retry time

  export async function sleep(ms: number, signal: AbortSignal): Promise<void> {
    return new Promise((resolve, reject) => {
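The committed implementation is TypeScript, but the dual-cap pattern it introduces (an attempt limit plus a wall-clock limit on total retry time) is language-neutral. A minimal Python sketch of the same logic — the function name, 100 ms base delay, and injectable `sleep` are illustrative, not part of the commit:

```python
import time

# Constants mirror the values added to retry.ts in this commit.
RETRY_MAX_ATTEMPTS = 10            # give up after this many retries
RETRY_MAX_TOTAL_TIME_MS = 120_000  # give up after 2 minutes of total retry time
RETRY_BACKOFF_FACTOR = 2
RETRY_MAX_DELAY_MS = 30_000        # cap any single backoff delay

def run_with_retries(operation, sleep=time.sleep):
    """Retry `operation` with exponential backoff, bounded by both an
    attempt cap and a wall-clock cap so a persistent failure cannot
    retry forever."""
    attempt = 0
    retry_start = None
    while True:
        try:
            return operation()
        except Exception as error:
            if retry_start is None:
                retry_start = time.monotonic()
            attempt += 1
            total_retry_time_ms = (time.monotonic() - retry_start) * 1000
            if attempt > RETRY_MAX_ATTEMPTS or total_retry_time_ms > RETRY_MAX_TOTAL_TIME_MS:
                # Limit reached: surface the error instead of looping forever.
                raise error
            # Illustrative base delay; the real code delegates to SessionRetry.delay().
            delay_ms = min(100 * RETRY_BACKOFF_FACTOR ** attempt, RETRY_MAX_DELAY_MS)
            sleep(delay_ms / 1000)
```

Either cap alone is insufficient: an attempt cap without a time cap can still stall for minutes under long server-directed delays, while a time cap alone can hammer a failing API with many rapid attempts.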
