| 1 | +# Part 4.2: Implement PythonScanner |
| 2 | + |
| 3 | +See [overview.md](overview.md) for architecture context. |
| 4 | + |
| 5 | +## Goal |
| 6 | + |
| 7 | +Implement `PythonScanner`, which extracts functions, classes, methods, imports, |
| 8 | +and module variables from `.py` files. It outputs `Document[]` matching the existing |
| 9 | +scanner interface and is registered in the scanner registry. |
| 10 | + |
| 11 | +## What changes |
| 12 | + |
| 13 | +### New file: `packages/core/src/scanner/python.ts` |
| 14 | + |
| 15 | +Implements `Scanner` interface (same pattern as `go.ts`): |
| 16 | + |
| 17 | +```typescript |
| 18 | +export class PythonScanner implements Scanner { |
| 19 | + readonly language = 'python'; |
| 20 | + readonly capabilities: ScannerCapabilities = { |
| 21 | + syntax: true, |
| 22 | + types: true, // type hints |
| 23 | + documentation: true, // docstrings |
| 24 | + }; |
| 25 | + |
| 26 | + canHandle(filePath: string): boolean { |
| 27 | + return path.extname(filePath).toLowerCase() === '.py'; |
| 28 | + } |
| 29 | + |
| 30 | + async scan(files, repoRoot, logger, onProgress): Promise<Document[]> { |
| 31 | + // For each .py file: |
| 32 | + // 1. Read file content |
| 33 | + // 2. Parse with tree-sitter (language: 'python') |
| 34 | + // 3. Run PYTHON_QUERIES |
| 35 | + // 4. For each match, create a Document with: |
| 36 | + // - id: `${relativePath}:${name}:${startLine}` |
| 37 | + // - type: 'function' | 'class' | 'method' | 'variable' |
| 38 | + // - text: signature + docstring (for search quality) |
| 39 | + // - metadata: name, signature, exported, docstring, callees, isAsync |
| 40 | + } |
| 41 | +} |
| 42 | +``` |
| 43 | + |
| 44 | +### Extraction logic per query type |
| 45 | + |
| 46 | +**Functions:** |
| 47 | +- Name from `@name` capture |
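| 48 | +- Signature: node text from its start up to the `:` that opens the body (colons inside type hints do not end the signature; multi-line parameter lists are joined into one line) |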
| 49 | +- Return type: check for `return_type` field on `function_definition` |
| 50 | +- isAsync: check if source text starts with `async` |
| 51 | +- Docstring: first `expression_statement > string` child of body block |
| 52 | +- Exported: name doesn't start with `_` |
| 53 | +- Callees: scan body for `call` nodes, extract function names |
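
The signature rule can be sketched on plain text. This is a minimal illustration, not the scanner's actual code: `extractSignature` and its `bodyOffset` parameter are hypothetical, and the sketch assumes the caller already knows where the body starts inside the node text (with tree-sitter, that would come from the body node's start index). It cuts at the body rather than at the first `:`, so colons inside type hints survive.

```typescript
// Hypothetical helper: `bodyOffset` is assumed to be the index, within
// `nodeText`, at which the function body begins.
function extractSignature(nodeText: string, bodyOffset: number): string {
  const header = nodeText.slice(0, bodyOffset).trimEnd();
  // Drop only the colon that introduces the body; type-hint colons stay.
  const withoutColon = header.endsWith(':') ? header.slice(0, -1) : header;
  // Collapse line breaks from multi-line parameter lists into single spaces.
  return withoutColon.replace(/\s+/g, ' ').trim();
}

const exampleSource =
  'async def fetch(url: str, *args, **kwargs) -> bytes:\n    return b""\n';
const exampleSignature = extractSignature(exampleSource, exampleSource.indexOf('return'));
// exampleSignature === 'async def fetch(url: str, *args, **kwargs) -> bytes'
```

A trailing comment between the header colon and the body would defeat the `endsWith(':')` check; a real implementation would more likely slice by node positions than by text.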
| 54 | + |
| 55 | +**Classes:** |
| 56 | +- Name from `@name` capture |
| 57 | +- Signature: `class Name(bases):` from first line |
| 58 | +- Superclasses: from `superclasses` field (argument_list) |
| 59 | +- Docstring: first string in body block |
| 60 | +- Exported: name doesn't start with `_` |
| 61 | + |
| 62 | +**Methods:** |
| 63 | +- Same as functions but type is `'method'` |
| 64 | +- Parent class name prepended to signature: `ClassName.method_name` |
| 65 | + |
| 66 | +**Imports:** |
| 67 | +- `import_statement`: extract module name |
| 68 | +- `import_from_statement`: extract module + imported names |
| 69 | +- Stored in file-level `metadata.imports` array |
| 70 | + |
| 71 | +**Module variables:** |
| 72 | +- `UPPER_CASE` assignments at module level → type `'variable'` |
| 73 | +- Name from left-hand identifier |
| 74 | +- Exported: name doesn't start with `_` |
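
As a rough sketch of the module-variable rule, the check below decides whether a module-level name counts as a constant and whether it is exported. The exact meaning of `UPPER_CASE` is not spelled out above, so the regular expression (uppercase letters, digits, underscores, optional leading underscores) is an assumption, and `classifyModuleVariable` is a hypothetical name.

```typescript
// Assumption: UPPER_CASE means uppercase letters, digits, and underscores,
// starting with a letter after any leading underscores.
const UPPER_CASE_NAME = /^_*[A-Z][A-Z0-9_]*$/;

function classifyModuleVariable(
  name: string,
): { type: 'variable'; exported: boolean } | null {
  // Names that are not UPPER_CASE are not indexed as module variables.
  if (!UPPER_CASE_NAME.test(name)) return null;
  return { type: 'variable', exported: !name.startsWith('_') };
}
```

Under this pattern `MAX_RETRIES` is an exported variable, `_CACHE_SIZE` is a private one, and `logger` or `__all__` are skipped.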
| 75 | + |
| 76 | +**Parameters (`*args`, `**kwargs`):** |
| 77 | +- Extract `*args` via tree-sitter `list_splat_pattern` node |
| 78 | +- Extract `**kwargs` via `dictionary_splat_pattern` node |
| 79 | +- Include in signature: `def foo(x: int, *args, **kwargs) -> str` |
| 80 | +- These are extremely common in Python — validated by stack-graphs' parameter handling |
| 81 | + |
| 82 | +**Async function detection:** |
| 83 | +- `async def` is NOT a separate node type in tree-sitter-python |
| 84 | +- It's a regular `function_definition` with an `async` keyword token as a child |
| 85 | +- Detect by checking whether the node's source text starts with `async` (equivalently, whether its first child is the `async` keyword token) |
| 86 | +- Confirmed by both AST inspection and stack-graphs (which also lacks `async_function_definition`) |
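
Since the `async` keyword appears as a child token, a token-based check is an alternative to the text-prefix check described above. The sketch below uses a minimal structural stand-in for a tree-sitter node (`AsyncCheckNode` is hypothetical and models only `type` and `children`); it relies on anonymous keyword tokens reporting their literal text as their `type`.

```typescript
// Minimal stand-in for a tree-sitter node; only the two properties used
// here are modeled.
interface AsyncCheckNode {
  type: string;
  children: AsyncCheckNode[];
}

function isAsyncFunction(node: AsyncCheckNode): boolean {
  if (node.type !== 'function_definition') return false;
  // Only direct children are inspected, so an `async def` nested inside the
  // body does not mark the outer function as async.
  return node.children.some((child) => child.type === 'async');
}
```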
| 87 | + |
| 88 | +**Callees — extraction depth:** |
| 89 | +- Walk ALL `call` nodes within the function body subtree (any depth) |
| 90 | +- Matches TypeScript behavior: `getDescendantsOfKind(SyntaxKind.CallExpression)` walks recursively |
| 91 | +- This means calls inside nested lambdas, comprehensions, and conditionals ARE included |
| 92 | +- A function that uses `result = list(map(lambda x: db.query(x), items))` DOES |
| 93 | + list `db.query` as a callee — correct for dependency analysis |
| 94 | +- Deduplicate by name+line (same pattern as TypeScript scanner) |
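
The recursive walk and the name+line deduplication can be sketched as follows. `CallWalkNode` is a hypothetical, simplified node shape: `startLine` stands in for the node's start row and `callee` stands in for the `function` field of a `call` node. The real scanner would read these from tree-sitter nodes.

```typescript
// Hypothetical, simplified node shape for illustration only.
interface CallWalkNode {
  type: string;
  text: string;
  startLine: number;
  children: CallWalkNode[];
  callee?: CallWalkNode; // set on `call` nodes: the expression being called
}

interface Callee {
  name: string;
  line: number;
}

function collectCallees(body: CallWalkNode): Callee[] {
  const seen = new Set<string>();
  const callees: Callee[] = [];
  const visit = (node: CallWalkNode): void => {
    if (node.type === 'call' && node.callee) {
      // Deduplicate by name + line.
      const key = `${node.callee.text}:${node.startLine}`;
      if (!seen.has(key)) {
        seen.add(key);
        callees.push({ name: node.callee.text, line: node.startLine });
      }
    }
    // No depth limit: lambdas, comprehensions, and conditionals are covered.
    for (const child of node.children) visit(child);
  };
  visit(body);
  return callees;
}
```

Because the walk has no depth limit, a tree for `list(map(lambda x: db.query(x), items))` yields `list`, `map`, and `db.query` in that order.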
| 95 | + |
| 96 | +### `__all__` handling |
| 97 | + |
| 98 | +If module contains `__all__ = [...]`: |
| 99 | +1. Parse the list literal to extract names |
| 100 | +2. Override exported flag: only names in `__all__` are `exported: true` |
| 101 | +3. If `__all__` is computed (not a simple list), fall back to `_` convention |
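
The three steps above can be sketched on the text of the right-hand side. This is a text-based stand-in under stated assumptions: `parseAllLiteral` and `isExported` are hypothetical names, and a real implementation would more likely walk the list node's string children than split text. Anything that is not a plain list of string literals returns `null`, which triggers the `_` fallback.

```typescript
// Returns the names in a literal list such as `["a", 'b']`, or null when the
// right-hand side is anything else (concatenation, call, comprehension, ...).
function parseAllLiteral(rhs: string): string[] | null {
  const text = rhs.trim();
  if (!text.startsWith('[') || !text.endsWith(']')) return null;
  const names: string[] = [];
  for (const raw of text.slice(1, -1).split(',')) {
    const item = raw.trim();
    if (item === '') continue; // trailing comma or empty list
    // Accept only a quoted identifier; anything else counts as computed.
    if (!/^(['"])[A-Za-z_][A-Za-z0-9_]*\1$/.test(item)) return null;
    names.push(item.slice(1, -1));
  }
  return names;
}

function isExported(name: string, allNames: string[] | null): boolean {
  // With a usable __all__, only listed names are exported; otherwise fall
  // back to the leading-underscore convention.
  return allNames ? allNames.includes(name) : !name.startsWith('_');
}
```

Note that an empty `__all__ = []` parses to an empty array, so nothing is exported, which differs from the `null` fallback case.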
| 102 | + |
| 103 | +### Snippet extraction |
| 104 | + |
| 105 | +Every Document must include `metadata.snippet` — truncated source text for search |
| 106 | +result previews. Use the same pattern as GoScanner: extract node text, truncate at |
| 107 | +50 lines. Without this, Python search results would lack code previews that Go and |
| 108 | +TypeScript results have. |
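
A minimal sketch of the truncation step, assuming a plain cut at 50 lines with no trailing marker. Whether GoScanner appends an ellipsis or similar is not stated here, so `buildSnippet` (a hypothetical name) leaves the text bare.

```typescript
const MAX_SNIPPET_LINES = 50; // limit stated in the plan

function buildSnippet(nodeText: string, maxLines: number = MAX_SNIPPET_LINES): string {
  const lines = nodeText.split('\n');
  // Short nodes are returned unchanged.
  if (lines.length <= maxLines) return nodeText;
  return lines.slice(0, maxLines).join('\n');
}
```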
| 109 | + |
| 110 | +### Generated file detection |
| 111 | + |
| 112 | +Skip files matching common Python generated patterns: |
| 113 | +- `_pb2.py`, `_pb2_grpc.py` (protobuf stubs) |
| 114 | +- Files with `# Generated by` or `# DO NOT EDIT` in the first 3 lines |
| 115 | +- Migration files: `*/migrations/*.py` (Django), `*/versions/*.py` (Alembic) |
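
The three skip rules above could be combined in a single predicate. The sketch below is one possible shape, not the scanner's actual code: `isGeneratedPythonFile` is a hypothetical name, and it interprets the migration globs as "the file's direct parent directory is named `migrations` or `versions`".

```typescript
const GENERATED_SUFFIXES = ['_pb2.py', '_pb2_grpc.py'];
const GENERATED_MARKERS = ['# Generated by', '# DO NOT EDIT'];
const MIGRATION_DIRS = ['migrations', 'versions'];

function isGeneratedPythonFile(filePath: string, content: string): boolean {
  const segments = filePath.split(/[\\/]/);
  const fileName = segments[segments.length - 1] ?? '';
  // Rule 1: protobuf stub suffixes.
  if (GENERATED_SUFFIXES.some((suffix) => fileName.endsWith(suffix))) return true;
  // Rule 3: Django / Alembic migration directories (direct parent only).
  const parent = segments.length >= 2 ? segments[segments.length - 2] : undefined;
  if (parent !== undefined && MIGRATION_DIRS.includes(parent) && fileName.endsWith('.py')) {
    return true;
  }
  // Rule 2: marker comment within the first 3 lines.
  const head = content.split('\n').slice(0, 3);
  return head.some((line) => GENERATED_MARKERS.some((marker) => line.includes(marker)));
}
```

Checking the path rules before the content rule means most generated files are rejected without inspecting their text.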
| 116 | + |
| 117 | +### `packages/core/src/utils/test-utils.ts` — refactor to language-aware |
| 118 | + |
| 119 | +Refactor both `isTestFile()` and `findTestFile()` from hardcoded JS/TS patterns |
| 120 | +to a language-aware pattern map. This prevents if/else chain growth as we add |
| 121 | +Rust, Java, C# etc. |
| 122 | + |
| 123 | +```typescript |
| 124 | +const TEST_PATTERNS: Record<string, (filePath: string) => boolean> = { |
| 125 | + ts: (f) => f.includes('.test.') || f.includes('.spec.'), |
| 126 | + tsx: (f) => f.includes('.test.') || f.includes('.spec.'), |
| 127 | + js: (f) => f.includes('.test.') || f.includes('.spec.'), |
| 128 | + jsx: (f) => f.includes('.test.') || f.includes('.spec.'), |
| 129 | + go: (f) => f.endsWith('_test.go'), |
| 130 | + py: (f) => { |
| 131 | + const name = path.basename(f); |
| 132 | + return name.startsWith('test_') || name.endsWith('_test.py') || name === 'conftest.py'; |
| 133 | + }, |
| 134 | +}; |
| 135 | + |
| 136 | +export function isTestFile(filePath: string): boolean { |
| 137 | + const ext = path.extname(filePath).slice(1).toLowerCase(); |
| 138 | + const check = TEST_PATTERNS[ext]; |
| 139 | + // Fall back to legacy JS/TS check for unknown extensions |
| 140 | + return check ? check(filePath) : (filePath.includes('.test.') || filePath.includes('.spec.')); |
| 141 | +} |
| 142 | +``` |
| 143 | + |
| 144 | +Similarly update `findTestFile()` to generate Python test path patterns |
| 145 | +(`test_{name}.py`, `{name}_test.py`) alongside the existing `.test.`/`.spec.` patterns. |
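
The Python side of `findTestFile()` might generate candidates along these lines. The existing function's signature is not shown here, so `pythonTestCandidates` is a hypothetical helper, and it assumes test files sit next to the source file; projects that keep tests in a separate `tests/` directory would need additional candidates.

```typescript
// Hypothetical helper: returns sibling test-file candidates for a .py source.
function pythonTestCandidates(sourcePath: string): string[] {
  const slash = sourcePath.lastIndexOf('/');
  const dir = slash === -1 ? '' : sourcePath.slice(0, slash + 1);
  const file = sourcePath.slice(slash + 1);
  if (!file.endsWith('.py')) return [];
  const name = file.slice(0, -3); // strip the ".py" extension
  return [`${dir}test_${name}.py`, `${dir}${name}_test.py`];
}
```

The caller would then return the first candidate that exists on disk, in the same way the existing `.test.`/`.spec.` candidates are presumably checked.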
| 146 | + |
| 147 | +### `packages/core/src/scanner/index.ts` |
| 148 | + |
| 149 | +Register PythonScanner: |
| 150 | + |
| 151 | +```typescript |
| 152 | +import { PythonScanner } from './python'; |
| 153 | + |
| 154 | +export function createDefaultRegistry(): ScannerRegistry { |
| 155 | + const registry = new ScannerRegistry(); |
| 156 | + registry.register(new TypeScriptScanner()); |
| 157 | + registry.register(new MarkdownScanner()); |
| 158 | + registry.register(new GoScanner()); |
| 159 | + registry.register(new PythonScanner()); // NEW |
| 160 | + return registry; |
| 161 | +} |
| 162 | +``` |
| 163 | + |
| 164 | +### Tests |
| 165 | + |
| 166 | +| Test | What it verifies | |
| 167 | +|------|-----------------| |
| 168 | +| Extract function with type hints | Signature includes types | |
| 169 | +| Extract async function | isAsync = true | |
| 170 | +| Extract class with methods | Class and each method emitted as separate Documents | |
| 171 | +| Extract decorated function | Decorator preserved in context | |
| 172 | +| Extract imports | Both `import` and `from...import` | |
| 173 | +| Extract module-level constants | UPPER_CASE assignments | |
| 174 | +| Docstring extraction | First string in function/class body | |
| 175 | +| Public/private via `_` convention | exported flag correct | |
| 176 | +| `__all__` overrides convention | Only listed names exported | |
| 177 | +| Callees from function body | Call nodes extracted | |
| 178 | +| Snippet field populated | Truncated source text on every Document | |
| 179 | +| isTestFile recognizes test_*.py | Python test convention | |
| 180 | +| isTestFile recognizes conftest.py | pytest fixture files | |
| 181 | +| Skip _pb2.py generated files | Generated file detection | |
| 182 | +| Callees inside nested lambda | Recursive depth extraction | |
| 183 | +| isTestFile refactored to pattern map | Language-aware, extensible | |
| 184 | +| findTestFile generates Python patterns | test_{name}.py, {name}_test.py | |
| 185 | +| Scan multiple files | Progress callback, error handling | |
| 186 | +| Empty file | No crash, empty results | |
| 187 | +| Syntax error in file | Graceful handling, partial results | |
| 188 | + |
| 189 | +### Commit |
| 190 | + |
| 191 | +``` |
| 192 | +feat(core): implement PythonScanner with full extraction |
| 193 | +
|
| 194 | +Extracts functions, classes, methods, imports, decorators, and module |
| 195 | +variables from Python files using tree-sitter. Handles type hints, |
| 196 | +docstrings, async functions, and __all__ for export detection. |
| 197 | +``` |