
[revisit later] feat: improve builder agent prompt for resilience and correctness#90

Closed
anandgupta42 wants to merge 9 commits into main from feat/builder-prompt-improvements

Conversation

@anandgupta42
Contributor

@anandgupta42 anandgupta42 commented Mar 7, 2026

Summary

Two sets of improvements based on Spider2-DBT benchmark analysis (68 real-world dbt+DuckDB tasks). All changes are generic improvements that benefit all workflows.

Builder Prompt Improvements

  • Graceful degradation: sql_analyze, sql_validate, lineage_check, schema_inspect skip gracefully when unavailable instead of the agent getting stuck retrying
  • Temporal determinism: Avoid current_date/now()/current_timestamp on fixed/historical datasets
  • Output validation: Query output database directly after dbt run to verify correctness
  • Read before writing: Always read existing models before creating new ones
  • Reserved word quoting: Dialect-aware quoting for SQL reserved words
  • Self-review enhancements: JOIN correctness, aggregation completeness, non-deterministic function checks
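
The graceful-degradation rule can be pictured as a small planning step. The sketch below is only an illustration: the PR expresses this rule as prompt guidance, and `planChecks`, `CheckPlan`, and the idea of passing in a set of available tool names are hypothetical. Only the four tool names come from the PR.

```typescript
// Hypothetical illustration; the PR implements this as prompt text, not code.
const OPTIONAL_TOOLS = [
  "sql_analyze",
  "sql_validate",
  "lineage_check",
  "schema_inspect",
] as const;
type OptionalTool = (typeof OPTIONAL_TOOLS)[number];

interface CheckPlan {
  run: OptionalTool[];     // tools that are available and will be called
  skipped: OptionalTool[]; // tools that are unavailable and will not be retried
}

// Decide once, up front, which optional tools to call. An unavailable tool is
// recorded as skipped instead of being retried.
function planChecks(available: ReadonlySet<string>): CheckPlan {
  const plan: CheckPlan = { run: [], skipped: [] };
  for (const tool of OPTIONAL_TOOLS) {
    if (available.has(tool)) {
      plan.run.push(tool);
    } else {
      plan.skipped.push(tool);
    }
  }
  return plan;
}

const examplePlan = planChecks(new Set(["sql_validate", "schema_inspect"]));
console.log(examplePlan.run, examplePlan.skipped);
```

The point of the design is that availability is resolved a single time, so a missing tool cannot trap the agent in a retry loop.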

Context Efficiency Improvements

  • Surface observation masks: Pruned tool outputs now show [Tool output cleared — read(file_path: "...") returned 47 lines, 3.2 KB — "SELECT..."] instead of opaque [Old tool result content cleared]. The mask was already computed by SessionCompaction.prune() but never surfaced in toModelMessages.
  • Remove dead <directories> block: SystemPrompt.environment() emitted empty <directories>\n \n</directories> (permanently disabled via && false). Removed ~30 wasted tokens per API call.
  • Compact skill descriptions: Tool schema uses single-line <skill name="...">description</skill> instead of 4-line XML per skill. Drops unused <location> URLs. ~60% size reduction in skill listings.
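
As a rough picture of what a surfaced observation mask contains, the sketch below rebuilds the string format quoted above. It is an assumption-laden example: `formatObservationMask` and `PrunedToolCall` are hypothetical names, the real mask is computed by `SessionCompaction.prune()` (whose signature is not shown in this PR), and size is approximated by character count rather than bytes.

```typescript
// Hypothetical shape of a tool call whose output is about to be cleared.
interface PrunedToolCall {
  tool: string;                 // e.g. "read"
  args: Record<string, string>; // e.g. { file_path: "models/orders.sql" }
  output: string;               // the output being pruned
}

// Build a one-line summary that tells the model what it previously saw.
function formatObservationMask(call: PrunedToolCall, previewChars = 6): string {
  const args = Object.keys(call.args)
    .map((key) => `${key}: ${JSON.stringify(call.args[key])}`)
    .join(", ");
  const lines = call.output.split("\n").length;
  // Approximation: counts UTF-16 code units, not encoded bytes.
  const kb = (call.output.length / 1024).toFixed(1);
  const preview = call.output.slice(0, previewChars);
  const dash = "\u2014"; // the separator used in the format quoted in the PR
  return `[Tool output cleared ${dash} ${call.tool}(${args}) returned ${lines} lines, ${kb} KB ${dash} "${preview}..."]`;
}

console.log(
  formatObservationMask({
    tool: "read",
    args: { file_path: "a.sql" },
    output: "SELECT 1\nFROM t",
  }),
);
```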

Benchmark Pipeline

  • 68-task benchmark runner with parallel execution (--parallel N)
  • Official Spider2 evaluation bridge
  • Interactive single-file HTML report with leaderboard chart

Benchmark Results

| Metric | Before | After | Delta |
| --- | --- | --- | --- |
| Pass rate | 39.71% (27/68) | 42.65% (29/68) | +2.94% |
| Regressions | | 0 | |
| New passes | | divvy001, hive001 | +2 tasks |
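
For reference, the reported percentages follow directly from the task counts, and the delta is a difference in percentage points (2 of 68 tasks), not a relative improvement. A minimal check, with `passRate` as a hypothetical helper:

```typescript
// Pass rate as a percentage rounded to two decimals.
function passRate(passed: number, total: number): number {
  return Math.round((passed / total) * 10000) / 100;
}

const before = passRate(27, 68);      // 39.71
const after = passRate(29, 68);       // 42.65
const delta = passRate(29 - 27, 68);  // 2.94 percentage points

console.log(before, after, delta);
```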

Test plan

  • TypeScript typecheck passes (tsgo --noEmit in opencode package: 0 errors)
  • 8 new e2e tests for context efficiency (no mocking — real session state, prune, skill loading)
  • Existing skill test updated for compact format
  • Full test suite: 1505 pass
  • Full 68-task benchmark run validated with 0 regressions
  • Manual smoke test: run builder agent on a dbt project without warehouse connection

🤖 Generated with Claude Code

@anandgupta42 force-pushed the feat/builder-prompt-improvements branch from 3fcde96 to 3e321f8 on March 7, 2026 05:58
anandgupta42 and others added 6 commits March 6, 2026 22:02
…lysis

Key improvements to builder agent system prompt:
- Add graceful degradation: skip `sql_analyze`, `sql_validate`, `lineage_check`,
  `schema_inspect` when unavailable instead of getting stuck retrying
- Add temporal determinism: avoid `current_date`/`now()`/`current_timestamp` on
  fixed/historical datasets
- Add output validation: query output database directly after dbt run to verify
  correctness, not just compilation success
- Add read-before-write: always read existing models before creating new ones
- Add reserved word quoting guidance for SQL column names
- Add self-review checks for JOIN correctness, aggregation completeness,
  non-deterministic functions
- Update agent-modes docs to reflect graceful degradation and output validation

Validated against Spider2-DBT benchmark: 42.65% pass rate (29/68 tasks),
up from 39.71% baseline (27/68), 0 regressions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
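
The non-deterministic function check in the self-review list can be approximated with a simple scan. This is a rough sketch, not the PR's mechanism (which is prompt guidance): `findTemporalCalls` is a hypothetical helper, it only knows the three functions named above, it strips `--` comments naively, and it does not model the incremental-model exception mentioned in a later commit.

```typescript
// Matches current_date, current_timestamp, and now() case-insensitively.
const TEMPORAL_CALLS = /\b(?:current_date\b|current_timestamp\b|now\s*\(\s*\))/gi;

// Return the temporal calls found in a SQL string, lowercased, in order.
function findTemporalCalls(sql: string): string[] {
  // Naive comment stripping: also affects "--" inside string literals.
  const withoutComments = sql.replace(/--.*$/gm, "");
  const matches = withoutComments.match(TEMPORAL_CALLS);
  return matches === null ? [] : matches.map((m) => m.toLowerCase());
}

console.log(findTemporalCalls("select now(), CURRENT_DATE from t -- now()"));
```

A non-empty result on a model over a fixed or historical dataset is a hint to replace the call with a date derived from the data itself.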
- Scope "Explore first" to relevant models in same layer/domain instead
  of reading ALL models (performance concern for large projects)
- Make reserved word quoting dialect-aware: double quotes for ANSI SQL,
  backticks for BigQuery/MySQL, brackets for SQL Server
- Add incremental model exception for temporal function guidance
- Add warehouse-agnostic fallback for output validation (not just DuckDB)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
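
The dialect-aware quoting rule from this commit maps directly onto a lookup table. A minimal sketch, assuming a hypothetical `quoteIdentifier` helper and `Dialect` type; escaping of embedded quote characters differs between dialects, so the sketch rejects such names rather than guessing.

```typescript
type Dialect = "ansi" | "bigquery" | "mysql" | "sqlserver";

// Opening and closing quote characters per dialect, as listed in the commit.
const QUOTES: Record<Dialect, [string, string]> = {
  ansi: ['"', '"'],
  bigquery: ["`", "`"],
  mysql: ["`", "`"],
  sqlserver: ["[", "]"],
};

// Wrap a column name in the dialect's identifier quotes.
function quoteIdentifier(name: string, dialect: Dialect): string {
  const [open, close] = QUOTES[dialect];
  if (name.indexOf(open) !== -1 || name.indexOf(close) !== -1) {
    // Embedded quote characters need dialect-specific escaping, not covered here.
    throw new Error(`cannot safely quote identifier: ${name}`);
  }
  return open + name + close;
}

console.log(quoteIdentifier("order", "ansi"));
console.log(quoteIdentifier("order", "bigquery"));
console.log(quoteIdentifier("order", "sqlserver"));
```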
- 68-task benchmark for evaluating agent on dbt+DuckDB workflows
- Resumable runner with parallel execution (`--parallel N`)
- Official Spider2 evaluation bridge (`eval_utils`)
- Interactive single-file HTML report with leaderboard chart
- One-time setup script for Spider2 repo + DuckDB databases

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
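
A resumable runner with a `--parallel N` limit can be sketched as a fixed pool of workers pulling from a shared queue. This is an illustrative sketch only: the benchmark runner's language and internals are not shown in this PR, and `runTasks`, the `done` set, and `runOne` are hypothetical.

```typescript
// Run tasks with at most `parallel` in flight, skipping ids already in `done`.
async function runTasks<T>(
  taskIds: string[],
  parallel: number,
  done: Set<string>,
  runOne: (id: string) => Promise<T>,
): Promise<Map<string, T>> {
  const results = new Map<string, T>();
  // Resume support: tasks finished in an earlier run are not repeated.
  const pending = taskIds.filter((id) => !done.has(id));
  let next = 0;

  async function worker(): Promise<void> {
    while (next < pending.length) {
      const id = pending[next++];
      if (id === undefined) break;
      results.set(id, await runOne(id));
      done.add(id); // a real runner would persist this to disk
    }
  }

  const workerCount = Math.max(1, Math.min(parallel, pending.length));
  const workers: Promise<void>[] = [];
  for (let i = 0; i < workerCount; i++) workers.push(worker());
  await Promise.all(workers);
  return results;
}

runTasks(["a", "b", "c"], 2, new Set(["b"]), async (id) => id.toUpperCase())
  .then((results) => console.log(Array.from(results.entries())));
```

In this sketch a failing task rejects the whole run; a benchmark runner would more likely record the failure and continue.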
Three generic context efficiency improvements:

1. Surface `observation_mask` for pruned tool outputs in `toModelMessages`
   instead of opaque "[Old tool result content cleared]". The mask was
   already computed by `SessionCompaction.prune()` but never used —
   gives the model post-compaction awareness of what it previously read.

2. Remove dead `<directories>` block from `SystemPrompt.environment()`.
   The tree was permanently disabled via `&& false`, leaving an empty
   XML tag wasting ~30 tokens per API call.

3. Compact skill descriptions in tool schema from 4-line XML per skill
   to single-line `<skill name="...">description</skill>`. Drops unused
   `<location>` URLs. Cuts skill listing size by ~60%.

Includes 8 e2e tests validating all three changes without mocking.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
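
The compact skill format in item 3 amounts to one line per skill. A minimal sketch, assuming a hypothetical `Skill` shape and `compactSkillListing` helper; the XML escaping and whitespace collapsing are assumptions of this example, not behavior confirmed by the PR.

```typescript
interface Skill {
  name: string;
  description: string;
}

// Escape characters that would break the surrounding XML-like markup.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Render each skill as a single <skill name="...">description</skill> line.
function compactSkillListing(skills: Skill[]): string {
  return skills
    .map((skill) => {
      const description = skill.description.replace(/\s+/g, " ").trim();
      return `<skill name="${escapeXml(skill.name)}">${escapeXml(description)}</skill>`;
    })
    .join("\n");
}

console.log(
  compactSkillListing([{ name: "dbt-run", description: "Run dbt\n  models" }]),
);
```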
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add missing `id`, `sessionID`, `messageID` properties to
`createObservationMask` test fixtures to satisfy `MessageV2.ToolPart` type.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@anandgupta42 force-pushed the feat/builder-prompt-improvements branch from d05da0a to a2fd017 on March 7, 2026 06:03
anandgupta42 and others added 3 commits March 6, 2026 22:24
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@anandgupta42 changed the title from "feat: improve builder agent prompt for resilience and correctness" to "[revisit later] feat: improve builder agent prompt for resilience and correctness" on Mar 14, 2026
@github-actions

This PR doesn't fully meet our contributing guidelines and PR template.

What needs to be fixed:

  • PR description is missing required template sections. Please use the PR template.

Please edit this PR description to address the above within 2 hours, or it will be automatically closed.

If you believe this was flagged incorrectly, please let a maintainer know.

@github-actions

Hey! Your PR title "[revisit later] feat: improve builder agent prompt for resilience and correctness" doesn't follow conventional commit format.

Please update it to start with one of:

  • feat: or feat(scope): new feature
  • fix: or fix(scope): bug fix
  • docs: or docs(scope): documentation changes
  • chore: or chore(scope): maintenance tasks
  • refactor: or refactor(scope): code refactoring
  • test: or test(scope): adding or updating tests

Where scope is the package name (e.g., app, desktop, opencode).

See CONTRIBUTING.md for details.
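
The prefixes listed above suggest a check along the following lines. The bot's actual rule is not shown in this PR, so the pattern and the `isConventionalTitle` helper are assumptions; the sketch only illustrates why a title beginning with `[revisit later]` is rejected while the original title passes.

```typescript
// Assumed pattern: an allowed type, an optional (scope), then ": " and a subject.
const TITLE_PATTERN = /^(feat|fix|docs|chore|refactor|test)(\([^)]+\))?: \S/;

function isConventionalTitle(title: string): boolean {
  return TITLE_PATTERN.test(title);
}

console.log(isConventionalTitle("feat: improve builder agent prompt for resilience and correctness"));
console.log(isConventionalTitle("[revisit later] feat: improve builder agent prompt for resilience and correctness"));
```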

@anandgupta42 anandgupta42 deleted the feat/builder-prompt-improvements branch March 17, 2026 00:59