[revisit later] feat: improve builder agent prompt for resilience and correctness #90
Closed
anandgupta42 wants to merge 9 commits into main from
3fcde96 to 3e321f8
…lysis

Key improvements to builder agent system prompt:
- Add graceful degradation: skip `sql_analyze`, `sql_validate`, `lineage_check`, `schema_inspect` when unavailable instead of getting stuck retrying
- Add temporal determinism: avoid `current_date`/`now()`/`current_timestamp` on fixed/historical datasets
- Add output validation: query output database directly after dbt run to verify correctness, not just compilation success
- Add read-before-write: always read existing models before creating new ones
- Add reserved word quoting guidance for SQL column names
- Add self-review checks for JOIN correctness, aggregation completeness, non-deterministic functions
- Update agent-modes docs to reflect graceful degradation and output validation

Validated against Spider2-DBT benchmark: 42.65% pass rate (29/68 tasks), up from 39.71% baseline (27/68), 0 regressions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
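The graceful-degradation rule is prompt guidance for the model, not code in this PR. As a rough code-level analogue, the sketch below runs each optional tool once and records a skip when the tool is missing or fails, instead of retrying. The tool names come from the commit message; `ToolFn`, `StepResult`, `runOptionalTools`, and the registry shape are hypothetical names introduced only for illustration.

```typescript
// Hypothetical tool signature: the agent's real tool registry is not shown in this PR.
type ToolFn = (sql: string) => string;

// Tool names as listed in the commit message.
const OPTIONAL_TOOLS = ["sql_analyze", "sql_validate", "lineage_check", "schema_inspect"] as const;

interface StepResult {
  tool: string;
  status: "ok" | "skipped";
  output?: string;
}

// Try each optional tool exactly once. A missing tool or a thrown error is
// recorded as "skipped" and the loop moves on; there is no retry.
function runOptionalTools(registry: Map<string, ToolFn>, sql: string): StepResult[] {
  const results: StepResult[] = [];
  for (const tool of OPTIONAL_TOOLS) {
    const fn = registry.get(tool);
    if (!fn) {
      results.push({ tool, status: "skipped" });
      continue;
    }
    try {
      results.push({ tool, status: "ok", output: fn(sql) });
    } catch {
      results.push({ tool, status: "skipped" });
    }
  }
  return results;
}
```

A caller would treat `skipped` entries as missing information rather than as failures, which is the behavior the prompt asks of the agent.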
- Scope "Explore first" to relevant models in same layer/domain instead of reading ALL models (performance concern for large projects)
- Make reserved word quoting dialect-aware: double quotes for ANSI SQL, backticks for BigQuery/MySQL, brackets for SQL Server
- Add incremental model exception for temporal function guidance
- Add warehouse-agnostic fallback for output validation (not just DuckDB)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
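The dialect-aware quoting rule can be illustrated with a small helper. This is a minimal sketch, not part of the PR: the `Dialect` type, the `RESERVED` word list, and `quoteIdentifier` are assumptions made for the example, and the reserved-word list is deliberately tiny. Only the quote characters per dialect come from the commit message.

```typescript
// Illustrative dialect labels; the PR only names the dialect families.
type Dialect = "ansi" | "bigquery" | "mysql" | "sqlserver";

// A tiny, non-exhaustive sample of reserved words (assumption for the example).
const RESERVED = new Set(["order", "group", "select", "table"]);

// Quote a column name only when it collides with a reserved word.
// Note: escaping of quote characters inside the name is out of scope here.
function quoteIdentifier(name: string, dialect: Dialect): string {
  if (!RESERVED.has(name.toLowerCase())) return name;
  switch (dialect) {
    case "bigquery":
    case "mysql":
      return "`" + name + "`"; // backticks for BigQuery/MySQL
    case "sqlserver":
      return "[" + name + "]"; // brackets for SQL Server
    default:
      return '"' + name + '"'; // double quotes for ANSI SQL
  }
}
```

For instance, `quoteIdentifier("order", "bigquery")` wraps the name in backticks, while a non-reserved name such as `amount` is returned unchanged in every dialect.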
- 68-task benchmark for evaluating agent on dbt+DuckDB workflows
- Resumable runner with parallel execution (`--parallel N`)
- Official Spider2 evaluation bridge (`eval_utils`)
- Interactive single-file HTML report with leaderboard chart
- One-time setup script for Spider2 repo + DuckDB databases

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three generic context efficiency improvements:

1. Surface `observation_mask` for pruned tool outputs in `toModelMessages` instead of opaque "[Old tool result content cleared]". The mask was already computed by `SessionCompaction.prune()` but never used; surfacing it gives the model post-compaction awareness of what it previously read.
2. Remove dead `<directories>` block from `SystemPrompt.environment()`. The tree was permanently disabled via `&& false`, leaving an empty XML tag wasting ~30 tokens per API call.
3. Compact skill descriptions in tool schema from 4-line XML per skill to single-line `<skill name="...">description</skill>`. Drops unused `<location>` URLs. Cuts skill listing size by ~60%.

Includes 8 e2e tests validating all three changes without mocking.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
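To make the third change concrete, here is a minimal sketch of rendering a skill list in the single-line form quoted above. It is not the PR's implementation: the `Skill` interface, `escapeXml`, and `compactSkillListing` are hypothetical, and the whitespace collapsing and XML escaping are assumptions added so the example is safe to run on arbitrary descriptions.

```typescript
// Hypothetical skill shape; `location` is accepted but intentionally not emitted.
interface Skill {
  name: string;
  description: string;
  location?: string;
}

// Escape characters that would break the surrounding XML-like markup.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Render one line per skill: <skill name="...">description</skill>.
// Multi-line descriptions are collapsed to a single line first.
function compactSkillListing(skills: Skill[]): string {
  return skills
    .map((skill) => {
      const oneLine = skill.description.replace(/\s+/g, " ").trim();
      return `<skill name="${escapeXml(skill.name)}">${escapeXml(oneLine)}</skill>`;
    })
    .join("\n");
}
```

Dropping `location` and emitting one line per skill is where the size reduction comes from; the actual savings depend on the real skill descriptions.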
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add missing `id`, `sessionID`, `messageID` properties to `createObservationMask` test fixtures to satisfy `MessageV2.ToolPart` type.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
d05da0a to a2fd017
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This PR doesn't fully meet our contributing guidelines and PR template. What needs to be fixed:
Please edit this PR description to address the above within 2 hours, or it will be automatically closed. If you believe this was flagged incorrectly, please let a maintainer know.

Hey! Your PR title Please update it to start with one of:
Where See CONTRIBUTING.md for details.
Summary
Two sets of improvements based on Spider2-DBT benchmark analysis (68 real-world dbt+DuckDB tasks). All changes are generic improvements that benefit all workflows.
Builder Prompt Improvements
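- Graceful degradation: `sql_analyze`, `sql_validate`, `lineage_check`, `schema_inspect` skip gracefully when unavailable instead of the agent getting stuck retrying
- Temporal determinism: avoid `current_date`/`now()`/`current_timestamp` on fixed/historical datasets
- Output validation: query the output database directly after `dbt run` to verify correctness

Context Efficiency Improvements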
- Surface observation masks for pruned tool outputs: `[Tool output cleared — read(file_path: "...") returned 47 lines, 3.2 KB — "SELECT..."]` instead of opaque `[Old tool result content cleared]`. The mask was already computed by `SessionCompaction.prune()` but never surfaced in `toModelMessages`.
- Remove dead `<directories>` block: `SystemPrompt.environment()` emitted empty `<directories>\n \n</directories>` (permanently disabled via `&& false`). Removed ~30 wasted tokens per API call.
- Compact skill descriptions: single-line `<skill name="...">description</skill>` instead of 4-line XML per skill. Drops unused `<location>` URLs. ~60% size reduction in skill listings.

Benchmark Pipeline
- Resumable runner with parallel execution (`--parallel N`)

Benchmark Results
Test plan
- `tsgo --noEmit` in opencode package: 0 errors

🤖 Generated with Claude Code