Commit f409f20

prosdev and claude committed
docs: write Core Phase 4 plan — Python language support
4-part plan for adding Python to dev-agent:
- 4.1: Bundle tree-sitter-python WASM (476KB) + define extraction queries
- 4.2: Implement PythonScanner (functions, classes, methods, imports, decorators, type hints, docstrings, __all__ exports, callees)
- 4.3: Add Python-specific pattern rules for dev_patterns (try/except, raise, imports, type coverage)
- 4.4: Test fixtures (FastAPI, pytest, utils), integration tests, docs

Key decisions: tree-sitter WASM (not Python subprocess), no cross-file import resolution (name-based callees only), no framework-specific logic (decorators extracted generically), __all__ overrides _ convention.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 12d4845 commit f409f20

7 files changed

Lines changed: 973 additions & 2 deletions

.claude/da-plans/README.md

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@ Implementation deviations are logged at the bottom of each plan file.
 
 | Track | Description | Status |
 |-------|-------------|--------|
-| [Core](core/) | Scanner, vector storage, services, indexer | Phase 1: Merged, Phase 2: Merged, Phase 3: Draft (graph cache) |
+| [Core](core/) | Scanner, vector storage, services, indexer | Phase 1-2: Merged, Phase 3: Draft (graph cache), Phase 4: Draft (Python) |
 | [CLI](cli/) | Command-line interface | Not started |
 | [MCP Server](mcp/) | Model Context Protocol server + adapters | Phase 1: Merged (tools improvement) |
 | [Subagents](subagents/) | Coordinator, explorer, planner, GitHub agents | Not started |
Lines changed: 131 additions & 0 deletions
@@ -0,0 +1,131 @@
# Part 4.1: Bundle Python WASM + Define Queries

See [overview.md](overview.md) for architecture context.

## Goal

Bundle `tree-sitter-python.wasm`, register the `'python'` language, and define
the S-expression queries that the PythonScanner will use.

## What changes

### `packages/dev-agent/scripts/copy-wasm.js`

Add `'python'` to `SUPPORTED_LANGUAGES`:

```javascript
const SUPPORTED_LANGUAGES = ['go', 'typescript', 'tsx', 'javascript', 'python'];
```

### `packages/core/src/scanner/tree-sitter.ts`

Add `'python'` to `TreeSitterLanguage`:

```typescript
export type TreeSitterLanguage = 'go' | 'typescript' | 'tsx' | 'javascript' | 'python';
```

### New file: `packages/core/src/scanner/python-queries.ts`

All queries validated against `tree-sitter-python` grammar via AST inspection.

```typescript
/**
 * Tree-sitter queries for Python code extraction.
 * Modeled after GO_QUERIES in go.ts.
 */
export const PYTHON_QUERIES = {
  // Top-level function definitions (not inside a class)
  functions: `
    (module
      (function_definition
        name: (identifier) @name) @definition)
  `,

  // Top-level decorated functions (e.g., @app.route, @pytest.fixture)
  decoratedFunctions: `
    (module
      (decorated_definition
        definition: (function_definition
          name: (identifier) @name)) @definition)
  `,

  // Class definitions
  classes: `
    (class_definition
      name: (identifier) @name) @definition
  `,

  // Method definitions (inside class body)
  methods: `
    (class_definition
      body: (block
        (function_definition
          name: (identifier) @name) @definition))
  `,

  // Decorated methods (inside class body)
  decoratedMethods: `
    (class_definition
      body: (block
        (decorated_definition
          definition: (function_definition
            name: (identifier) @name)) @definition))
  `,

  // Import statements
  imports: `
    (import_statement) @definition
  `,

  // From...import statements
  fromImports: `
    (import_from_statement) @definition
  `,

  // Module-level variable assignments (constants, config)
  moduleVariables: `
    (module
      (expression_statement
        (assignment
          left: (identifier) @name)) @definition)
  `,

  // Module-level type-annotated assignments (x: int = 3)
  annotatedVariables: `
    (module
      (expression_statement
        (assignment
          left: (identifier) @name
          type: (type) @type)) @definition)
  `,
};
```

### Step 1: Validate ALL queries against tree-sitter-python grammar

Before implementation, run each query against a real Python snippet (same approach
as the JS/TS query validation in Part 1.5). Specifically verify:

- `bare-except` negation syntax: parse `except:` and `except ValueError:`, confirm
  the field name used in the negation pattern `!name` is correct for tree-sitter-python
- `annotatedVariables`: parse `x: int = 3`, confirm the field name is `type` and
  the node structure matches the query
- All other queries: confirm node types match grammar (function_definition, class_definition, etc.)

Write a validation script at `/tmp/python-query-test.js`, run it, fix any broken queries.

### Tests

| Test | What it verifies |
|------|-----------------|
| `parseCode('def foo(): pass', 'python')` works | WASM loads |
| Each query matches expected Python source | Query correctness |
| Decorated function query matches `@app.route` pattern | Decorator handling |
| Method query matches method inside class | Class method detection |

### Commit

```
feat(core): bundle tree-sitter-python WASM and define extraction queries
```
Lines changed: 197 additions & 0 deletions
@@ -0,0 +1,197 @@
# Part 4.2: Implement PythonScanner

See [overview.md](overview.md) for architecture context.

## Goal

Implement a `PythonScanner` that extracts functions, classes, methods, imports,
and module variables from `.py` files. It outputs `Document[]` matching the existing
scanner interface and is registered in the scanner registry.

## What changes

### New file: `packages/core/src/scanner/python.ts`

Implements `Scanner` interface (same pattern as `go.ts`):

```typescript
export class PythonScanner implements Scanner {
  readonly language = 'python';
  readonly capabilities: ScannerCapabilities = {
    syntax: true,
    types: true, // type hints
    documentation: true, // docstrings
  };

  canHandle(filePath: string): boolean {
    return path.extname(filePath).toLowerCase() === '.py';
  }

  async scan(files, repoRoot, logger, onProgress): Promise<Document[]> {
    // For each .py file:
    // 1. Read file content
    // 2. Parse with tree-sitter (language: 'python')
    // 3. Run PYTHON_QUERIES
    // 4. For each match, create a Document with:
    //    - id: `${relativePath}:${name}:${startLine}`
    //    - type: 'function' | 'class' | 'method' | 'variable'
    //    - text: signature + docstring (for search quality)
    //    - metadata: name, signature, exported, docstring, callees, isAsync
  }
}
```

### Extraction logic per query type

**Functions:**
- Name from `@name` capture
- Signature: first line of node text, up to the `:` that closes the header (type hints such as `x: int` contain earlier colons, so do not cut at the first `:`)
- Return type: check for `return_type` field on `function_definition`
- isAsync: check if source text starts with `async`
- Docstring: first `expression_statement > string` child of body block
- Exported: name doesn't start with `_`
- Callees: scan body for `call` nodes, extract function names
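
The signature rule above can be sketched as a small text scan. This is only an illustration under stated assumptions: `extractSignature` is a hypothetical helper, not a name from the plan, and it goes slightly beyond "first line" by following a header that wraps across lines. It tracks bracket depth and string literals so that colons inside type hints or default values are ignored.

```typescript
// Hypothetical helper (not part of the plan): returns the definition header
// without the trailing ':' that opens the body.
function extractSignature(nodeText: string): string {
  let depth = 0; // nesting level of (), [], {}
  let quote: string | null = null; // active string delimiter, if any
  for (let i = 0; i < nodeText.length; i++) {
    const ch = nodeText[i];
    if (quote !== null) {
      if (ch === '\\') i++; // skip the escaped character
      else if (ch === quote) quote = null;
      continue;
    }
    if (ch === '"' || ch === "'") quote = ch;
    else if (ch === '(' || ch === '[' || ch === '{') depth++;
    else if (ch === ')' || ch === ']' || ch === '}') depth--;
    else if (ch === ':' && depth === 0) {
      // First colon outside brackets and strings closes the header.
      return nodeText.slice(0, i).replace(/\s+/g, ' ').trim();
    }
  }
  // No closing colon found: fall back to the first line as written.
  return nodeText.split('\n')[0].trim();
}

console.log(extractSignature('def foo(x: int, *args, **kwargs) -> str:\n    return ""'));
```

Collapsing whitespace keeps multi-line headers on one line in search results; whether the real scanner should do that is a design choice the plan leaves open.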

**Classes:**
- Name from `@name` capture
- Signature: `class Name(bases):` from first line
- Superclasses: from `superclasses` field (argument_list)
- Docstring: first string in body block
- Exported: name doesn't start with `_`

**Methods:**
- Same as functions but type is `'method'`
- Parent class name prepended to signature: `ClassName.method_name`

**Imports:**
- `import_statement`: extract module name
- `import_from_statement`: extract module + imported names
- Stored in file-level `metadata.imports` array

**Module variables:**
- `UPPER_CASE` assignments at module level → type `'variable'`
- Name from left-hand identifier
- Exported: name doesn't start with `_`
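
A minimal sketch of the module-variable filter, assuming that a leading underscore does not disqualify a name from being `UPPER_CASE` (so `_CACHE_TTL` is indexed but not exported). The helper name and return shape are hypothetical and only serve the illustration.

```typescript
// Hypothetical helper (not part of the plan).
// Assumption: optional leading underscores, then upper-case letters, digits, underscores.
const UPPER_CASE_NAME = /^_*[A-Z][A-Z0-9_]*$/;

function classifyModuleVariable(varName: string): { index: boolean; exported: boolean } {
  return {
    index: UPPER_CASE_NAME.test(varName), // only constants become 'variable' documents
    exported: !varName.startsWith('_'), // `_` convention from the list above
  };
}

console.log(classifyModuleVariable('MAX_RETRIES'));
console.log(classifyModuleVariable('_CACHE_TTL'));
console.log(classifyModuleVariable('logger'));
```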

**Parameters (`*args`, `**kwargs`):**
- Extract `*args` via tree-sitter `list_splat_pattern` node
- Extract `**kwargs` via `dictionary_splat_pattern` node
- Include in signature: `def foo(x: int, *args, **kwargs) -> str`
- These are extremely common in Python; the approach is validated by stack-graphs' parameter handling

**Async function detection:**
- `async def` is NOT a separate node type in tree-sitter-python
- It's a regular `function_definition` with an `async` keyword token as a child
- Detect by checking if source text of the node starts with `async`
- Confirmed by both AST inspection and stack-graphs (which also lacks `async_function_definition`)
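
As a sketch of the detection rule, the check below looks for the `async` keyword child first and falls back to the text prefix described above. `FnNodeLike` is a hypothetical structural stand-in for a tree-sitter node, and the assumption that the keyword child reports its type as `'async'` should be confirmed during the Step 1 AST inspection.

```typescript
// Hypothetical minimal node shape; a real tree-sitter node has many more members.
interface FnNodeLike {
  type: string;
  text: string;
  children: FnNodeLike[];
}

// Pass the function_definition itself: for decorated functions the outer
// decorated_definition text starts with '@', so the prefix check would miss.
function isAsyncFunction(fn: FnNodeLike): boolean {
  const hasAsyncToken = fn.children.some((child) => child.type === 'async');
  return hasAsyncToken || /^async\s/.test(fn.text);
}

const sample: FnNodeLike = {
  type: 'function_definition',
  text: 'async def fetch(url): ...',
  children: [
    { type: 'async', text: 'async', children: [] },
    { type: 'def', text: 'def', children: [] },
  ],
};
console.log(isAsyncFunction(sample));
```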

**Callees (extraction depth):**
- Walk ALL `call` nodes within the function body subtree (any depth)
- Matches TypeScript behavior: `getDescendantsOfKind(CallExpression)` walks recursively
- This means calls inside nested lambdas, comprehensions, and conditionals ARE included
- A function that uses `result = list(map(lambda x: db.query(x), items))` DOES
  list `db.query` as a callee, which is correct for dependency analysis
- Deduplicate by name+line (same pattern as TypeScript scanner)
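
The recursive walk and the name+line deduplication could look roughly like the sketch below. `NodeLike`, `Callee`, and `collectCallees` are hypothetical names; the interface only lists the node members the walk needs, and it assumes `call` nodes expose the callee through a `function` field, which the Step 1 validation should confirm.

```typescript
// Hypothetical minimal node shape for illustration.
interface NodeLike {
  type: string;
  text: string;
  startPosition: { row: number };
  children: NodeLike[];
  childForFieldName(fieldName: string): NodeLike | null;
}

interface Callee {
  name: string;
  line: number; // 1-based
}

function collectCallees(body: NodeLike): Callee[] {
  const seen = new Set<string>();
  const callees: Callee[] = [];
  const visit = (node: NodeLike): void => {
    if (node.type === 'call') {
      const fn = node.childForFieldName('function');
      if (fn) {
        const line = node.startPosition.row + 1;
        const key = `${fn.text}:${line}`; // dedupe by name+line
        if (!seen.has(key)) {
          seen.add(key);
          callees.push({ name: fn.text, line });
        }
      }
    }
    // Recurse at any depth: lambdas, comprehensions, conditionals.
    for (const child of node.children) visit(child);
  };
  visit(body);
  return callees;
}
```

Because the walk keeps descending after it records a call, chained and nested calls such as `list(map(...))` each produce their own entry.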

### `__all__` handling

If module contains `__all__ = [...]`:
1. Parse the list literal to extract names
2. Override exported flag: only names in `__all__` are `exported: true`
3. If `__all__` is computed (not a simple list), fall back to `_` convention
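
The three rules can be sketched as follows. Note the swapped-in technique: this illustration parses the right-hand side as text with regular expressions, whereas the scanner would more likely inspect the tree-sitter `list` node. `parseDunderAll` and `isExported` are hypothetical names.

```typescript
// Hypothetical helpers (not part of the plan).
// Returns the listed names, or undefined when the value is not a plain
// list of string literals (treated as "computed").
function parseDunderAll(rhs: string): string[] | undefined {
  const listMatch = /^\[([\s\S]*)\]$/.exec(rhs.trim());
  if (!listMatch) return undefined;
  const names: string[] = [];
  for (const part of listMatch[1].split(',')) {
    const item = part.trim();
    if (item === '') continue; // trailing comma or empty list
    const literal = /^(['"])([A-Za-z_][A-Za-z0-9_]*)\1$/.exec(item);
    if (!literal) return undefined; // anything else counts as computed
    names.push(literal[2]);
  }
  return names;
}

// __all__ wins when it could be parsed; otherwise use the `_` convention.
function isExported(symbolName: string, dunderAll?: string[]): boolean {
  return dunderAll ? dunderAll.includes(symbolName) : !symbolName.startsWith('_');
}

const listed = parseDunderAll('["connect", "_internal_hook"]');
console.log(isExported('_internal_hook', listed), isExported('helper', listed));
```

An empty `__all__ = []` parses to an empty array, so nothing in that module is marked exported, which matches rule 2.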

### Snippet extraction

Every Document must include `metadata.snippet`, the truncated source text used for search
result previews. Use the same pattern as GoScanner: extract node text, truncate at
50 lines. Without this, Python search results would lack the code previews that Go and
TypeScript results have.
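
A minimal sketch of the truncation step, assuming a plain line-count cut with no trailing marker; `truncateSnippet` is a hypothetical name, and whether GoScanner appends an ellipsis or similar is not stated here, so none is added.

```typescript
// Hypothetical helper (not part of the plan): keep at most `maxLines` lines.
function truncateSnippet(nodeText: string, maxLines = 50): string {
  const lines = nodeText.split('\n');
  return lines.length <= maxLines ? nodeText : lines.slice(0, maxLines).join('\n');
}

const longBody = Array.from({ length: 60 }, (_, i) => `    step_${i}()`).join('\n');
console.log(truncateSnippet(longBody).split('\n').length);
```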

### Generated file detection

Skip files matching common Python generated patterns:
- `_pb2.py`, `_pb2_grpc.py` (protobuf stubs)
- Files with `# Generated by` or `# DO NOT EDIT` in the first 3 lines
- Migration files: `*/migrations/*.py` (Django), `*/versions/*.py` (Alembic)
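
The three rules above could be combined into one predicate along these lines. `isGeneratedPythonFile` is a hypothetical name, and the sketch assumes the caller already has the file content in memory.

```typescript
// Hypothetical helper (not part of the plan).
function isGeneratedPythonFile(filePath: string, content: string): boolean {
  const normalized = filePath.replace(/\\/g, '/'); // tolerate Windows separators
  // Protobuf stubs: *_pb2.py and *_pb2_grpc.py
  if (/_pb2(_grpc)?\.py$/.test(normalized)) return true;
  // Django and Alembic migrations: a .py file directly inside migrations/ or versions/
  if (/(^|\/)(migrations|versions)\/[^/]+\.py$/.test(normalized)) return true;
  // Header markers in the first 3 lines only
  const head = content.split('\n', 3).join('\n');
  return head.includes('# Generated by') || head.includes('# DO NOT EDIT');
}

console.log(isGeneratedPythonFile('api/user_pb2.py', ''));
console.log(isGeneratedPythonFile('src/app.py', 'import os\n'));
```

The `versions/` rule is broad and would also skip hand-written modules in any directory with that name, so it may be worth scoping it once real repositories are indexed.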

### `packages/core/src/utils/test-utils.ts`: refactor to language-aware

Refactor both `isTestFile()` and `findTestFile()` from hardcoded JS/TS patterns
to a language-aware pattern map. This prevents if/else chain growth as we add
Rust, Java, C# etc.

```typescript
import * as path from 'node:path';

// Per-extension predicates; add an entry here when a new language is supported.
const TEST_PATTERNS: Record<string, (filePath: string) => boolean> = {
  ts: (f) => f.includes('.test.') || f.includes('.spec.'),
  tsx: (f) => f.includes('.test.') || f.includes('.spec.'),
  js: (f) => f.includes('.test.') || f.includes('.spec.'),
  jsx: (f) => f.includes('.test.') || f.includes('.spec.'),
  go: (f) => f.endsWith('_test.go'),
  py: (f) => {
    // pytest conventions: test_*.py, *_test.py, and conftest.py fixture files
    const name = path.basename(f);
    return name.startsWith('test_') || name.endsWith('_test.py') || name === 'conftest.py';
  },
};

export function isTestFile(filePath: string): boolean {
  const ext = path.extname(filePath).slice(1);
  const check = TEST_PATTERNS[ext];
  // Fall back to legacy JS/TS check for unknown extensions
  return check ? check(filePath) : filePath.includes('.test.') || filePath.includes('.spec.');
}
```

Similarly update `findTestFile()` to generate Python test path patterns
(`test_{name}.py`, `{name}_test.py`) alongside the existing `.test.`/`.spec.` patterns.
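
A sketch of the Python candidate generation, under the assumptions that candidates live next to the source file and that paths use forward slashes. `pythonTestCandidates` is a hypothetical helper; the existing `findTestFile()` signature is not shown in this plan, so how the candidates are wired in is left open.

```typescript
// Hypothetical helper (not part of the plan): sibling test paths for a .py source file.
function pythonTestCandidates(sourcePath: string): string[] {
  const slash = sourcePath.lastIndexOf('/');
  const dir = slash >= 0 ? sourcePath.slice(0, slash + 1) : '';
  const stem = sourcePath.slice(slash + 1).replace(/\.py$/, '');
  return [`${dir}test_${stem}.py`, `${dir}${stem}_test.py`];
}

console.log(pythonTestCandidates('src/utils/helpers.py'));
```

Many Python projects keep tests in a separate `tests/` tree, so a real implementation may also want to probe that location.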

### `packages/core/src/scanner/index.ts`

Register PythonScanner:

```typescript
import { PythonScanner } from './python';

export function createDefaultRegistry(): ScannerRegistry {
  const registry = new ScannerRegistry();
  registry.register(new TypeScriptScanner());
  registry.register(new MarkdownScanner());
  registry.register(new GoScanner());
  registry.register(new PythonScanner()); // NEW
  return registry;
}
```

### Tests

| Test | What it verifies |
|------|-----------------|
| Extract function with type hints | Signature includes types |
| Extract async function | isAsync = true |
| Extract class with methods | Class doc + method separate |
| Extract decorated function | Decorator preserved in context |
| Extract imports | Both `import` and `from...import` |
| Extract module-level constants | UPPER_CASE assignments |
| Docstring extraction | First string in function/class body |
| Public/private via `_` convention | exported flag correct |
| `__all__` overrides convention | Only listed names exported |
| Callees from function body | Call nodes extracted |
| Snippet field populated | Truncated source text on every Document |
| isTestFile recognizes test_*.py | Python test convention |
| isTestFile recognizes conftest.py | pytest fixture files |
| Skip _pb2.py generated files | Generated file detection |
| Callees inside nested lambda | Recursive depth extraction |
| isTestFile refactored to pattern map | Language-aware, extensible |
| findTestFile generates Python patterns | test_{name}.py, {name}_test.py |
| Scan multiple files | Progress callback, error handling |
| Empty file | No crash, empty results |
| Syntax error in file | Graceful handling, partial results |

### Commit

```
feat(core): implement PythonScanner with full extraction

Extracts functions, classes, methods, imports, decorators, and module
variables from Python files using tree-sitter. Handles type hints,
docstrings, async functions, and __all__ for export detection.
```
