|
| 1 | +# CLAUDE.md |
| 2 | + |
| 3 | +This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. |
| 4 | + |
| 5 | +## Development Commands |
| 6 | + |
| 7 | +**Testing:** |
| 8 | +```bash |
| 9 | +# Run all tests using pytest |
| 10 | +uv run pytest |
| 11 | + |
| 12 | +# Run individual test files |
| 13 | +uv run pytest tests/test_extractor.py |
| 14 | +uv run pytest tests/test_languages.py |
| 15 | +uv run pytest tests/test_models.py |
| 16 | + |
| 17 | +# Quick functional test of core MCP server |
| 18 | +uv run python test_new_mcp.py |
| 19 | +``` |
| 20 | + |
| 21 | +**Running the MCP Server:** |
| 22 | +```bash |
| 23 | +# Run as package with uv (recommended) |
| 24 | +uv run mcp-server-code-extractor |
| 25 | + |
| 26 | +# Run as Python module |
| 27 | +uv run python -m code_extractor |
| 28 | + |
| 29 | +# Test with MCP Inspector |
| 30 | +npx @modelcontextprotocol/inspector uv run mcp-server-code-extractor |
| 31 | + |
| 32 | +# For uvx usage (after publishing) |
| 33 | +uvx mcp-server-code-extractor |
| 34 | +``` |
| 35 | + |
| 36 | +**Development Dependencies:** |
| 37 | +```bash |
| 38 | +# Install development dependencies (testing, formatting, linting) |
| 39 | +uv add --dev pytest black flake8 mypy |
| 40 | +``` |
| 41 | + |
| 42 | +**Code Quality:** |
| 43 | +```bash |
| 44 | +# Format code with Black |
| 45 | +uv run black . |
| 46 | + |
| 47 | +# Lint with flake8 |
| 48 | +uv run flake8 . |
| 49 | + |
| 50 | +# Type checking with mypy |
| 51 | +uv run mypy . |
| 52 | +``` |
| 53 | + |
| 54 | +## Architecture Overview |
| 55 | + |
| 56 | +This is an MCP (Model Context Protocol) server that provides precise code extraction using tree-sitter parsing. The server and library code live in one package: |
| 57 | + |
| 58 | +### Package Structure (`code_extractor/`) |
| 59 | +- **MCP Server** (`server.py`) - FastMCP server with 5 extraction tools |
| 60 | +- **Core Library** (`extractor.py`) - Query-driven extraction engine using tree-sitter |
| 61 | +- **Data Models** (`models.py`) - Rich symbol representations with hierarchical relationships |
| 62 | +- **Language Support** (`languages.py`) - Detection and mapping for 30+ programming languages |
| 63 | +- **Tree-sitter Queries** (`queries/`) - Language-specific syntax parsing patterns |
| 64 | +- **Entry Points** (`__main__.py`) - Module execution support |
| 65 | + |
| 66 | +### Entry Points |
| 67 | +- **Console Script**: `mcp-server-code-extractor` - Direct execution via uvx/pip |
| 68 | +- **Module Execution**: `python -m code_extractor` - Run as Python module |
| 69 | +- **Package Import**: `from code_extractor import CodeExtractor` - Library usage |
| 70 | + |
| 71 | +### Key Architectural Decisions |
| 72 | + |
| 73 | +**Method vs Function Classification:** |
| 74 | +The extractor distinguishes methods (functions defined inside a class body) from top-level functions using tree-sitter query patterns. A function node on its own does not reveal whether it is a method; that requires knowing its enclosing class, which the queries capture through the containment hierarchy. |
| 75 | + |
| 76 | +**Two-Layer Symbol Processing:** |
| 77 | +1. **Query capture phase**: Tree-sitter queries extract syntax nodes with semantic labels |
| 78 | +2. **Symbol building phase**: Raw captures are processed into rich `CodeSymbol` objects with hierarchical relationships |
| 79 | + |
| 80 | +**Clean MCP Interface:** |
| 81 | +The server uses FastMCP for simple tool registration and exposes 5 core extraction tools with consistent function signatures and error handling. |
| 82 | + |
| 83 | +## Working with Tree-Sitter Queries |
| 84 | + |
| 85 | +Tree-sitter queries are stored in `code_extractor/queries/` and use the S-expression format: |
| 86 | + |
| 87 | +```scheme |
| 88 | +; Extract methods inside classes |
| 89 | +(class_definition |
| 90 | + body: (block |
| 91 | + (function_definition |
| 92 | + name: (identifier) @method.name |
| 93 | + parameters: (parameters) @method.parameters) @method.definition)) |
| 94 | +``` |
| 95 | + |
| 96 | +**Query Structure:** |
| 97 | +- Capture names use `category.type` format (e.g., `method.name`, `function.definition`) |
| 98 | +- The extractor groups captures by their definition nodes to build complete symbols |
| 99 | +- Parent-child relationships are determined by byte range containment |
| 100 | + |
| 101 | +## Language Support |
| 102 | + |
| 103 | +New languages require: |
| 104 | +1. Adding language mapping in `code_extractor/languages.py` |
| 105 | +2. Creating tree-sitter query file in `code_extractor/queries/` |
| 106 | +3. Testing with language-specific syntax patterns |
| 107 | + |
| 108 | +The system automatically detects language from file extensions and falls back gracefully for unsupported languages. |
| 109 | + |
| 110 | +## MCP Tools Interface |
| 111 | + |
| 112 | +The server exposes 5 tools to AI assistants: |
| 113 | + |
| 114 | +1. **`get_symbols`** - Primary entry point for code discovery (uses the query-driven core library in `extractor.py`) |
| 115 | +2. **`get_function`** - Extract specific functions (legacy tree traversal) |
| 116 | +3. **`get_class`** - Extract specific classes (legacy tree traversal) |
| 117 | +4. **`get_lines`** - Extract line ranges by number |
| 118 | +5. **`get_signature`** - Get function signatures only |
| 119 | + |
| 120 | +**Best Practice**: Always use `get_symbols` first for code exploration, then use specific extraction tools for detailed analysis. |
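
The "Query Structure" notes in the diff say that parent-child relationships are determined by byte range containment. As a rough illustration only, that rule might look like the sketch below. The `Capture` class and `find_parent` function are hypothetical names invented for this example, not the actual types in `extractor.py`, and the byte offsets are made up.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Capture:
    """Hypothetical stand-in for one definition node captured by a query."""
    name: str
    kind: str  # "class", "function", or "method"
    start_byte: int
    end_byte: int


def find_parent(child: Capture, candidates: List[Capture]) -> Optional[Capture]:
    """Return the smallest capture whose byte range encloses `child`, if any."""
    best: Optional[Capture] = None
    for cand in candidates:
        if cand is child:
            continue
        encloses = (cand.start_byte <= child.start_byte
                    and child.end_byte <= cand.end_byte)
        identical = (cand.start_byte == child.start_byte
                     and cand.end_byte == child.end_byte)
        if not encloses or identical:
            continue
        # Prefer the tightest enclosing range so nested classes resolve correctly.
        if best is None or (cand.end_byte - cand.start_byte) < (best.end_byte - best.start_byte):
            best = cand
    return best


# Made-up offsets: a class spanning bytes 0-120 with one method inside it,
# followed by a top-level function.
captures = [
    Capture("Greeter", "class", 0, 120),
    Capture("greet", "method", 20, 80),
    Capture("helper", "function", 130, 170),
]

parents: Dict[str, Optional[str]] = {}
for cap in captures:
    parent = find_parent(cap, captures)
    parents[cap.name] = parent.name if parent else None
```

Choosing the tightest enclosing range is one possible design choice; it keeps a method attached to its innermost class when classes are nested.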
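
The "Language Support" section says the language is detected from the file extension, with a graceful fallback for unsupported languages. A minimal sketch of that idea follows. The `EXTENSION_TO_LANGUAGE` table and `detect_language` function are hypothetical names, the table is a small illustrative subset rather than the real mapping in `languages.py`, and returning `None` as the fallback signal is an assumption.

```python
from pathlib import Path
from typing import Optional

# Hypothetical subset for illustration; the real mapping is not shown in the diff.
EXTENSION_TO_LANGUAGE = {
    ".py": "python",
    ".js": "javascript",
    ".ts": "typescript",
    ".rs": "rust",
    ".go": "go",
}


def detect_language(path: str) -> Optional[str]:
    """Map a file path to a language name, or None when the extension is unknown."""
    suffix = Path(path).suffix.lower()  # normalise so ".PY" matches ".py"
    return EXTENSION_TO_LANGUAGE.get(suffix)


def choose_tool(path: str) -> str:
    """Assumed fallback policy: use line-based extraction when no grammar applies."""
    return "get_symbols" if detect_language(path) else "get_lines"
```

Under these assumptions, a caller could route `choose_tool("notes.xyz")` to `get_lines`, which needs no grammar, while supported files go through `get_symbols` first.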