Commit c5e9cfa
feat: plan agent refinement, feature discovery (#556)
* feat: plan agent refinement, feature discovery, and telemetry instrumentation
## Plan Agent Improvements
- Two-step plan approach: outline first, confirm, then expand
- Plan refinement loop: edit instead of restart (capped at 5 revisions)
- Approval detection for common phrases ("looks good", "proceed", "lgtm")
- Bug fix: `agent_outcome` telemetry for plan sessions emitting null for
`duration_ms`, `tool_calls`, `generations` — `sessionAgentName` set too late
## Feature Discovery & Progressive Disclosure
- Post-warehouse-connect contextual suggestions (schema, SQL, lineage, PII)
- Progressive: `sql_execute` → `sql_analyze` → `schema_inspect` → `lineage_check`
- dbt auto-detection → recommends `/dbt-develop`, `/dbt-troubleshoot`
- All suggestions try/catch wrapped, never block core flow
## Telemetry
- New `plan_revision` event: revision count + action (refine/approve/reject)
- New `feature_suggestion` event: suggestions shown, type, warehouse
- `skill_used.trigger`: user_command | llm_selected | auto_suggested
- `classifySkillTrigger()` helper
## Tests
- 140 new tests across 4 files, all passing
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: address CodeRabbit review comments
- Validate `sessionId` before telemetry track in post-connect-suggestions
- Rename shadowed `ctx` to `suggestionCtx` in warehouse-add.ts
- Add rejection detection ("no", "stop", "reject", etc.) to plan revision
action — telemetry now emits "reject" in addition to "approve"/"refine"
- Add `warehouseType` to `trackSuggestions` in sql-analyze and schema-inspect
for telemetry payload consistency
- Use hyphenated skill names in telemetry (`dbt-develop` not `dbt_develop`)
to match user-facing `/dbt-develop` command names
- Derive `suggestionsShown` from `suggestionCtx` to prevent drift
- DRY up event types list in telemetry.test.ts (`ALL_EVENT_TYPES` constant)
- Remove `@ts-nocheck` from plan-skill-telemetry.test.ts
- Add runtime `Telemetry.track()` verification to type-only tests
- Use semantic regex in plan-refinement tests instead of brittle string matches
- Fix trackBlock extraction with generous region window instead of fixed slice
- Remove duplicate regression tests (covered in telemetry.test.ts)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: e2e tests, performance benchmarks, and UX gap fixes
## E2E Integration Tests (41 tests)
- `feature-discovery-e2e.test.ts` (29 tests): Full warehouse-add → suggestions
flow, progressive disclosure chain, plan refinement session, telemetry
event validation with mocked Dispatcher
- `performance-regression.test.ts` (12 tests): 1000x suggestion generation
< 50ms, 10000x progressive lookup < 50ms, 100k phrase detection < 200ms,
output determinism verification
## UX Gap Fixes
- **Suggestion deduplication**: Progressive hints shown at most once per
session per tool via `shownProgressiveSuggestions` Set. Running
`sql_execute` 10 times no longer repeats the same tip.
- **Approval false positives**: "yes, but change X" now correctly classified
as "refine" not "approve". Added `refinementQualifiers` (" but ",
" however ", " except ", " change ", etc.) that override approval detection.
"no" uses `\bno\b` word boundary to avoid matching "know", "notion", etc.
- **Revision cap communication**: When 5 revision cap is hit, a synthetic
message informs the LLM to tell the user. Telemetry tracks "cap_reached".
- **Warehouse add latency**: Suggestion gathering now uses `Promise.all` +
`Promise.race` with 1500ms timeout. Slow schema/dbt checks silently
skipped instead of blocking the response.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* test: plan layer safety e2e tests (68 tests)
Comprehensive user-perspective tests for the plan refinement changes:
## Phrase Classification (42 tests)
- 14 approval cases ("looks good", "proceed", "lgtm", etc.)
- 12 rejection cases ("no", "stop", "reject", etc.)
- 8 pure refinement cases (feedback without approval/rejection signals)
- 8 tricky edge cases:
- "yes, but change X" → refine (qualifier overrides approval)
- "no, I mean yes" → reject (rejection takes priority)
- "I know this looks good" → approve ("know" ≠ "no" via word boundary)
- Unicode input, multiline, special characters, empty string
## Non-Plan Agent Safety (3 tests)
- Plan variables initialized before loop
- Refinement block guarded by `agent.name === "plan"`
- Plan file detection guarded by agent check
## Session Agent Name Fix (2 tests)
- `sessionAgentName` set before early break conditions
- `agent_outcome` telemetry uses `sessionAgentName`
## Revision Cap (4 tests)
- Cap enforced at >= 5
- Synthetic message communicates limit to LLM
- Telemetry emits "cap_reached"
- Synthetic message doesn't persist to DB
## Adversarial (4 tests)
- 70k char input doesn't crash
- Unicode, special chars, multiline handled
## Import Safety (2 tests)
- post-connect-suggestions is self-contained (no heavy imports)
- Progressive suggestion dedup works correctly
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: add mongodb to devDependencies for typecheck resolution
The mongodb driver (#482) was added to optionalDependencies only,
so `tsgo --noEmit` failed with TS2307 when checking the drivers
package. Adding it to devDependencies ensures types are available
during CI typecheck without making it a runtime requirement.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: track suggestion failures in warehouse-add telemetry
Per Surya's review — the catch block in warehouse-add now emits a
`core_failure` event with `error_class: "internal"` and
`input_signature: "post_connect_suggestions"` so we can monitor
how often post-connect suggestions fail and why.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* test: 125 simulated user scenarios for plan + suggestions
Exercises the real code paths with realistic inputs:
- 84 phrase classification scenarios: approval (20), rejection (24),
refinement (16), qualifier overrides (12), word boundary (12)
- 13 warehouse suggestion configs: 8 warehouse types x indexed/dbt/multi
- 8 progressive disclosure chains: full progression, dedup, interleaved
- 6 revision cap simulations: mixed actions, all rejections, cap behavior
- 5 performance stress: 10k classifications < 500ms, determinism checks
- 10 adversarial: unicode lookalikes, 50KB, SQL injection, null bytes
All 125 pass in 255ms.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* test: 40 real tool execution simulations with mocked Dispatcher
Spawns actual tool execute() functions (not mocked) with registered
Dispatcher handlers to simulate real user tool invocations:
Warehouse Add (18 scenarios):
- 8 warehouse types with post-connect suggestions
- Schema indexed/not-indexed variations
- Multi-warehouse data_diff suggestion
- Failure modes: add fails, throws, missing type
- Resilience: schema.cache_status fails, warehouse.list fails
- Timeout: slow dispatcher (3s) races against 1.5s timeout
SQL Execute (6 scenarios):
- First call gets suggestion, subsequent calls deduped
- 10 consecutive calls — only first has hint
- Failure and empty result handling
- Blocked query (DROP DATABASE) throws
SQL Analyze (4 scenarios):
- First call suggests schema_inspect, second deduped
- Parse error and analyzer failure handling
Schema Inspect (3 scenarios):
- First call suggests lineage_check, second deduped
- Failure handling
Schema Index (3 scenarios):
- Lists all capabilities on first call
- Dedup on second, failure handling
Full User Journeys (4 scenarios):
- Complete 5-tool chain: warehouse_add → schema_index → sql_execute → sql_analyze → schema_inspect
- 20 repeated queries with dedup verification
- Interleaved tool calls with independent dedup
- All dispatchers failing — warehouse add still succeeds
Performance (2 scenarios):
- Warehouse add < 500ms with fast dispatchers
- 50 consecutive sql_execute < 2s
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: document plan refinement, feature discovery, and new telemetry events
- agent-modes.md: Expanded Plan section with two-step workflow (outline
then expand), refinement loop (approve/refine/reject), 5-revision cap,
and example conversation
- telemetry.md: Added plan_revision and feature_suggestion events,
updated skill_used with trigger field
- warehouses.md: Added "Post-Connection Suggestions" section covering
progressive disclosure chain and once-per-session dedup
- getting-started.md: Added feature suggestions mention after /discover
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: replace mock.module() with spyOn to prevent cross-file test pollution
Root cause of 112 CI failures: three test files used Bun's
mock.module() to replace the Telemetry module with a partial mock.
mock.module() is process-global in Bun — it persisted across ALL
test files, causing Telemetry functions (categorizeToolName,
classifyError, bucketCount, computeInputSignature, maskArgs, etc.)
to be undefined for any test file loaded after the mock.
Fix: replaced mock.module() with spyOn() + afterEach(mock.restore())
in post-connect-suggestions.test.ts, performance-regression.test.ts,
and feature-discovery-e2e.test.ts. spyOn is per-function and properly
cleaned up between tests.
Result: 5740 pass, 0 fail (was 112 fail) across 285 files.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: reset dedup state in tests + replace passwords with fake values
- post-connect-suggestions.test.ts: add resetShownSuggestions() in
beforeEach to prevent cross-file dedup state from causing the
"after sql_execute suggests sql_analyze" test to fail in CI
- connections.test.ts: replace "secret", "pw123", "ssh-pw",
"access-token-123" with obviously fake test values to resolve
GitGuardian false positives
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: replace all credential-like test values for GitGuardian
Replace remaining flagged values: "passphrase", "oauth-token",
"client-secret", "my_secret", "my-passphrase", dapi_secret with
obviously fake test-prefixed placeholders.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>1 parent ffe03b4 commit c5e9cfa
File tree
27 files changed
+3673
-115
lines changed- docs/docs
- configure
- data-engineering
- reference
- packages
- drivers
- opencode
- src
- altimate
- telemetry
- tools
- session
- prompt
- tool
- test
- altimate
- session
- telemetry
27 files changed
+3673
-115
lines changedSome generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
365 | 365 | | |
366 | 366 | | |
367 | 367 | | |
| 368 | + | |
| 369 | + | |
| 370 | + | |
| 371 | + | |
| 372 | + | |
| 373 | + | |
| 374 | + | |
| 375 | + | |
| 376 | + | |
| 377 | + | |
| 378 | + | |
| 379 | + | |
| 380 | + | |
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
142 | 142 | | |
143 | 143 | | |
144 | 144 | | |
| 145 | + | |
| 146 | + | |
| 147 | + | |
| 148 | + | |
| 149 | + | |
| 150 | + | |
| 151 | + | |
| 152 | + | |
| 153 | + | |
| 154 | + | |
| 155 | + | |
| 156 | + | |
| 157 | + | |
| 158 | + | |
| 159 | + | |
| 160 | + | |
| 161 | + | |
| 162 | + | |
| 163 | + | |
| 164 | + | |
| 165 | + | |
| 166 | + | |
| 167 | + | |
| 168 | + | |
| 169 | + | |
| 170 | + | |
| 171 | + | |
| 172 | + | |
| 173 | + | |
| 174 | + | |
| 175 | + | |
| 176 | + | |
| 177 | + | |
| 178 | + | |
| 179 | + | |
| 180 | + | |
| 181 | + | |
| 182 | + | |
| 183 | + | |
| 184 | + | |
| 185 | + | |
| 186 | + | |
| 187 | + | |
| 188 | + | |
| 189 | + | |
145 | 190 | | |
146 | 191 | | |
147 | 192 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
60 | 60 | | |
61 | 61 | | |
62 | 62 | | |
63 | | - | |
| 63 | + | |
64 | 64 | | |
65 | 65 | | |
66 | 66 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
33 | 33 | | |
34 | 34 | | |
35 | 35 | | |
36 | | - | |
| 36 | + | |
| 37 | + | |
| 38 | + | |
37 | 39 | | |
38 | 40 | | |
39 | 41 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
8 | 8 | | |
9 | 9 | | |
10 | 10 | | |
| 11 | + | |
| 12 | + | |
| 13 | + | |
11 | 14 | | |
12 | 15 | | |
13 | 16 | | |
| |||
0 commit comments