feat: /discover command — data stack setup with project_scan tool #30

anandgupta42 merged 7 commits into main
Conversation
feat: replace /init with data stack setup command

Replace the AGENTS.md-generating /init command with a comprehensive data stack scanner that detects dbt projects, warehouse connections, Docker databases, installed tools, and config files. The AI agent then walks the user through adding connections, testing them, and indexing schemas.

New project_scan tool with 5 exported detection functions:
- `detectGit`: branch, remote URL
- `detectDbtProject`: `dbt_project.yml`, manifest, packages
- `detectEnvVars`: Snowflake, BigQuery, Databricks, Postgres, MySQL, Redshift
- `detectDataTools`: dbt, sqlfluff, airflow, dagster, prefect, soda, sqlmesh, great_expectations, sqlfmt
- `detectConfigFiles`: `.altimate-code/`, `.sqlfluff`, `.pre-commit-config.yaml`

Tests: 71 TypeScript (bun:test) + 24 Python (pytest)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
refactor: restore /init, rename data stack setup to /discover

Restore the original /init command (creates AGENTS.md) and move the data stack setup functionality to /discover instead. Updates all docs to reference /discover as the recommended first-run command.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
fix: make detectGit test resilient to CI detached HEAD

GitHub Actions checks out in detached HEAD state, so `git branch --show-current` returns empty. The test now accepts an undefined branch in that case.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
fix: treat empty git branch as undefined in detectGit

In CI detached HEAD, `git branch --show-current` returns an empty string. Convert the empty string to undefined so callers get a clean undefined instead of "".

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
fix: remove branch type assertion from detectGit repo test

The "detects a git repository" test should only assert isRepo. Branch validation is handled by the dedicated branch test, which accounts for CI detached HEAD returning undefined.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
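The detached-HEAD fix described in these commits can be sketched as follows. This is an illustrative sketch, not the actual source: the real `detectGit` uses `Bun.spawnSync`, and the helper names here (`normalizeBranch`, `currentBranch`) are hypothetical.

```typescript
import { spawnSync } from "node:child_process";

// `git branch --show-current` prints an empty string in detached HEAD state
// (e.g. GitHub Actions checkouts), so normalize "" to undefined for callers.
export function normalizeBranch(raw: string): string | undefined {
  const branch = raw.trim();
  return branch === "" ? undefined : branch;
}

export function currentBranch(cwd: string): string | undefined {
  const result = spawnSync("git", ["branch", "--show-current"], {
    cwd,
    encoding: "utf8",
  });
  if (result.status !== 0) return undefined; // not a git repo
  return normalizeBranch(result.stdout);
}
```

Returning `undefined` rather than `""` lets callers use a plain truthiness check and keeps the branch test's "may be undefined in CI" assertion honest.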
```
    stderr: "pipe",
  })
  if (isRepoResult.exitCode !== 0) {
    return { isRepo: false }
```
thought: detectGit (and detectDataTools) use Bun.spawnSync directly — this couples the tool to the Bun runtime. If there's ever a need to run under Node.js (or test with a different runner), these would need to be swapped for child_process.spawnSync. Fine for now since the project is Bun-only, but worth noting.
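The swap the comment anticipates could look something like this runtime-neutral wrapper. This is a hypothetical sketch: `runCommand` is not in the PR, and it uses `node:child_process` (which Bun also implements) in place of `Bun.spawnSync`.

```typescript
import { spawnSync } from "node:child_process";

// Runtime-neutral command runner: works under Node.js and Bun, so detection
// code written against it avoids coupling to the Bun-specific API.
export function runCommand(
  cmd: string[],
  timeoutMs = 5000,
): { exitCode: number; stdout: string } {
  const result = spawnSync(cmd[0], cmd.slice(1), {
    encoding: "utf8",
    timeout: timeoutMs,
  });
  return {
    // A null status (missing binary, signal, or timeout) is treated as failure.
    exitCode: result.status ?? 1,
    stdout: result.stdout ?? "",
  };
}
```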
```
        type: wh.type,
        source: "env-var",
        signal: matchedSignal,
        config,
```
suggestion: The config object captures raw env var values including password, access_token, etc. These get returned to the LLM in the tool output. Consider either redacting sensitive fields (replace with "***") or only including non-secret fields in the scan output. The connection setup step (warehouse_add) can read the env vars directly when actually configuring.
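A minimal redaction pass along the lines of this suggestion might look like the sketch below. The key names match the `SENSITIVE_KEYS` set the PR adopted in a later commit; the function name `redactConfig` is illustrative.

```typescript
// Mask secret-looking values before the scan result is serialized into the
// tool output. warehouse_add can still read the raw env vars at setup time.
const SENSITIVE_KEYS = new Set([
  "password",
  "access_token",
  "connection_string",
  "private_key_path",
]);

export function redactConfig(
  config: Record<string, string>,
): Record<string, string> {
  return Object.fromEntries(
    Object.entries(config).map(([key, value]) => [
      key,
      SENSITIVE_KEYS.has(key) ? "***" : value,
    ]),
  );
}
```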
```
      database: "REDSHIFT_DATABASE",
      user: "REDSHIFT_USER",
      password: "REDSHIFT_PASSWORD",
    },
```
thought: DATABASE_URL is used as a Postgres signal, but many frameworks (Rails, Django, etc.) set DATABASE_URL for any database type including MySQL, SQLite, etc. This could produce false positives. Consider parsing the URL scheme (postgresql:// vs mysql://) before categorizing, or at least noting the assumption in the output.
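A sketch of the scheme-based detection this comment asks for. The scheme map mirrors the types the PR later adopted (postgresql/postgres, mysql/mysql2, redshift, sqlite); the function name `typeFromDatabaseUrl` is illustrative.

```typescript
// Map DATABASE_URL schemes to warehouse types instead of assuming Postgres.
const DATABASE_URL_SCHEME_MAP: Record<string, string> = {
  postgresql: "postgres",
  postgres: "postgres",
  mysql: "mysql",
  mysql2: "mysql",
  redshift: "redshift",
  sqlite: "sqlite",
};

export function typeFromDatabaseUrl(url: string): string | undefined {
  const idx = url.indexOf("://");
  if (idx === -1) return undefined; // not a URL-shaped value
  return DATABASE_URL_SCHEME_MAP[url.slice(0, idx).toLowerCase()];
}
```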
```
  for (const tool of DATA_TOOL_NAMES) {
    try {
      const result = Bun.spawnSync([tool, "--version"], {
        stdout: "pipe",
```
nit: The 9 tool version checks run sequentially. Since each has a 5s timeout, worst case is 45s. Could use Promise.all to run them in parallel and cut the scan time significantly.
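One way to parallelize, as the nit suggests. The tool list comes from the PR description; the probe shape (spawn with a 5s timeout) is an illustrative sketch using `node:child_process` rather than the actual `Bun.spawnSync` implementation.

```typescript
import { spawn } from "node:child_process";

const DATA_TOOL_NAMES = [
  "dbt", "sqlfluff", "airflow", "dagster", "prefect",
  "soda", "sqlmesh", "great_expectations", "sqlfmt",
];

// Probe one tool; resolves to undefined if it is missing or errors out.
function probeVersion(tool: string): Promise<string | undefined> {
  return new Promise((resolve) => {
    const child = spawn(tool, ["--version"], { timeout: 5_000 });
    let out = "";
    child.stdout?.on("data", (chunk) => (out += chunk));
    child.on("error", () => resolve(undefined)); // binary not found
    child.on("close", (code) => resolve(code === 0 ? out.trim() : undefined));
  });
}

// All nine probes run concurrently, so the worst case is one 5s timeout
// rather than nine in sequence.
export async function detectDataTools(): Promise<{ name: string; version: string }[]> {
  const versions = await Promise.all(DATA_TOOL_NAMES.map(probeVersion));
  return DATA_TOOL_NAMES.flatMap((name, i) =>
    versions[i] === undefined ? [] : [{ name, version: versions[i]! }],
  );
}
```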
jontsai left a comment
LGTM — solid feature addition with excellent test coverage (71 TS + 24 Python tests).
What's good:
- Clean separation of detection functions, each independently testable
- Thoughtful connection deduplication logic across sources
- Resilient CI handling (detached HEAD edge case)
- Well-structured `/discover` flow that guides users step by step
- Docs are comprehensive and match the implementation
Minor items (posted inline):
- Env var detection captures secrets (passwords, tokens) in scan output — consider redacting
- `DATABASE_URL` assumed to be Postgres but could be any DB type
- Tool version checks are sequential; parallelizing would speed up scans
- `Bun.spawnSync` coupling is fine for now but noted for awareness
None are blockers. Ship it! 🚀
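The "connection deduplication logic across sources" praised above could plausibly look like this. Everything here is a hypothetical sketch: the `Connection` shape, the key choice (type plus host/account), and the first-occurrence-wins policy are assumptions, not the PR's actual code.

```typescript
type Connection = {
  type: string;
  source: string; // "env-var" | "dbt-profile" | "docker", per the PR's sources
  config: Record<string, string>;
};

// Connections discovered via env vars, dbt profiles, and Docker may describe
// the same warehouse; key by (type, host-or-account) and keep the first hit.
export function dedupeConnections(conns: Connection[]): Connection[] {
  const seen = new Set<string>();
  const out: Connection[] = [];
  for (const c of conns) {
    const key = `${c.type}:${c.config.host ?? c.config.account ?? ""}`;
    if (!seen.has(key)) {
      seen.add(key);
      out.push(c);
    }
  }
  return out;
}
```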
Force-pushed from 8c8d769 to 64d311a
…arallelize tool checks

- Redact sensitive env var values (password, access_token, connection_string) in scan output with "***" so secrets are never sent to the LLM
- Parse DATABASE_URL scheme (postgresql://, mysql://, etc.) to detect the correct database type instead of assuming Postgres
- Parallelize tool version checks with Promise.all instead of a sequential loop

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
```
        "schema": "SNOWFLAKE_SCHEMA",
        "role": "SNOWFLAKE_ROLE",
    },
},
```
issue: This Python reference implementation is now out of sync with the TypeScript version in two ways:

- `DATABASE_URL` is still listed as a postgres signal (line 28), but the TS version removed it from postgres signals and handles it separately with scheme-based type detection (`postgresql://` → postgres, `mysql://` → mysql, etc.). This means the Python version will always classify `DATABASE_URL` as postgres regardless of the actual scheme.
- No secret redaction — the TS version now masks sensitive keys (`password`, `access_token`, `connection_string`) with `"***"`, but the Python `detect_env_connections()` returns raw values. If this Python implementation is used later, it would leak secrets.

Since the docstring says "mirrors TypeScript detectEnvVars", these should be kept in sync.
fix: sync Python reference implementation with TypeScript changes

- Remove DATABASE_URL from postgres signals, add scheme-based detection
- Add secret redaction (password, access_token, connection_string)
- Add tests for DATABASE_URL scheme parsing and deduplication

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
jontsai left a comment
All previous comments addressed ✅
- Secret redaction with `SENSITIVE_KEYS` set — passwords, tokens, connection strings all masked as `***`
- `DATABASE_URL` scheme parsing maps to correct DB types (postgresql, mysql, redshift, sqlite)
- Python test file now in sync with TS: includes `SENSITIVE_KEYS`, `DATABASE_URL_SCHEME_MAP`, and redaction tests
- Tool version checks parallelized with `Promise.all`
Clean PR, great test coverage (71 TS + 24 Python). LGTM 🚀
LGTM ✅
Final review complete. The previously flagged issue (Python test_env_detect.py out of sync with TS) is resolved:
- ✅ DATABASE_URL now uses scheme-based detection matching TS (postgresql, postgres, mysql, mysql2, redshift, sqlite)
- ✅ Sensitive values (password, access_token, connection_string, private_key_path) properly redacted with "***"
- ✅ Python reference implementation mirrors TS detectEnvVars faithfully
- ✅ Comprehensive test coverage across all warehouse types
- ✅ project-scan.ts is well-structured with proper deduplication logic
- ✅ Docs updated consistently
Ship it! 🚀
Squashed commits:

* feat: replace /init with data stack setup command
* refactor: restore /init, rename data stack setup to /discover
* fix: make detectGit test resilient to CI detached HEAD
* fix: treat empty git branch as undefined in detectGit
* fix: remove branch type assertion from detectGit repo test
* fix: address PR review — redact secrets, parse DATABASE_URL scheme, parallelize tool checks
* fix: sync Python reference implementation with TypeScript changes

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Summary
- Replaces the `/init` command with a comprehensive data stack scanner
- New `project_scan` tool detects dbt projects, warehouse connections (from dbt profiles, Docker, env vars), installed tools, and config files
- The `/init` template guides the AI agent through a 5-step setup flow: scan → review → add connections → index schemas → show next steps

New Files
- `src/tool/project-scan.ts` — `project_scan` tool with 5 exported detection functions + connection deduplication
- `test/tool/project-scan.test.ts`
- `altimate-engine/tests/test_env_detect.py`

Modified Files
- `src/tool/registry.ts` — registers `ProjectScanTool`
- `src/command/template/initialize.txt`
- `src/command/index.ts`
- `docs/docs/getting-started.md` — `/init` as first-run command
- `docs/docs/data-engineering/tools/warehouse-tools.md`
- `docs/docs/data-engineering/tools/index.md`
- `docs/docs/configure/commands.md` — `/init` and `/review` commands
- `docs/docs/usage/tui.md` — adds `/init` to slash command examples

What
`project_scan` detects:
- git — via `git` commands
- dbt — `dbt_project.yml`, parses manifest
- warehouse profiles — `~/.dbt/profiles.yml`

Test plan
- `bun test` — 1220 pass, 0 fail (full suite including 71 new tests)
- `pytest tests/test_env_detect.py` — 24 pass
- `mkdocs build --strict` — docs build cleanly
- `/init` in TUI from a dbt project directory
- `/init` in TUI from a bare directory (no dbt)

🤖 Generated with Claude Code