| title | PythonCodeSplitter |
|---|---|
| id | pythoncodesplitter |
| slug | /pythoncodesplitter |
| description | Split Python source documents into syntax-aware chunks using Python's AST, with metadata for line ranges, classes, decorators, and docstrings. |
PythonCodeSplitter splits Python source code documents into syntax-aware chunks. It is designed for Python files and keeps code units such as imports, functions, classes, and methods together where possible.

| | |
|---|---|
| Most common position in a pipeline | In indexing pipelines after Converters, before Embedders or DocumentWriter |
| Mandatory run variables | documents: A list of Python source code documents |
| Output variables | documents: A list of Python source code documents split into syntax-aware chunks |
| API reference | PreProcessors |
| GitHub link | https://github.com/deepset-ai/haystack/blob/main/haystack/components/preprocessors/python_code_splitter.py |
| Package name | haystack-ai |

PythonCodeSplitter expects each input document's content to be valid Python source code. It parses the source with Python's ast module and creates ordered split units for:
- Module docstrings
- Consecutive import blocks
- Top-level functions
- Class headers
- Methods and nested classes
- Remaining top-level statements
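
To make the unit list above concrete, here is a minimal sketch that walks a module with Python's ast module and labels each top-level unit. It is an illustration only, not the component's implementation: the function list_units and the labels module_docstring and statement are hypothetical names, and nested classes are left out for brevity.

```python
import ast

SOURCE = '''"""Math utilities."""
import os
from math import pi

def helper():
    return 1

class Circle:
    def area(self):
        return pi

VALUE = helper()
'''


def first_line(node: ast.AST) -> int:
    """Line where a node starts, including any decorators above it."""
    decorators = getattr(node, "decorator_list", [])
    return min([node.lineno] + [d.lineno for d in decorators])


def list_units(source: str) -> list[tuple[str, int, int]]:
    """Return (kind, start_line, end_line) for each unit, in source order."""
    units: list[tuple[str, int, int]] = []
    for index, node in enumerate(ast.parse(source).body):
        is_docstring = (
            index == 0
            and isinstance(node, ast.Expr)
            and isinstance(node.value, ast.Constant)
            and isinstance(node.value.value, str)
        )
        if is_docstring:
            units.append(("module_docstring", node.lineno, node.end_lineno))
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            # Consecutive imports extend the previous import unit.
            if units and units[-1][0] == "imports":
                units[-1] = ("imports", units[-1][1], node.end_lineno)
            else:
                units.append(("imports", node.lineno, node.end_lineno))
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            units.append(("function", first_line(node), node.end_lineno))
        elif isinstance(node, ast.ClassDef):
            # The header ends on the line before the first body statement.
            header_end = max(node.lineno, first_line(node.body[0]) - 1)
            units.append(("class_header", first_line(node), header_end))
            for member in node.body:
                if isinstance(member, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    units.append(("method", first_line(member), member.end_lineno))
        else:
            units.append(("statement", node.lineno, node.end_lineno))
    return units


for unit in list_units(SOURCE):
    print(unit)
```

Each tuple carries a 1-based line range, which is the kind of information the splitter later exposes as start_line and end_line.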
The splitter merges these units in source order toward max_effective_lines. Effective lines are calculated from character length with ceil(len(source) / expected_chars_per_line), so long lines count as more than one line.
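
The effective-line formula and the greedy merge can be sketched in a few lines of plain Python. The effective_lines function follows the formula stated above. The greedy_merge function is an assumed reading of "merge toward max_effective_lines while respecting min_effective_lines" and may differ from the component's exact rules; both function names are hypothetical.

```python
import math


def effective_lines(source: str, expected_chars_per_line: int = 45) -> int:
    """Estimate the line count from character length."""
    return math.ceil(len(source) / expected_chars_per_line)


def greedy_merge(
    unit_sizes: list[int], min_lines: int = 20, max_lines: int = 100
) -> list[list[int]]:
    """Group unit sizes (in effective lines) into chunks, in source order."""
    chunks: list[list[int]] = []
    current: list[int] = []
    for size in unit_sizes:
        total = sum(current)
        # Close the chunk only when it already meets the minimum
        # and the next unit would push it past the target.
        if current and total >= min_lines and total + size > max_lines:
            chunks.append(current)
            current = []
        current.append(size)
    if current:
        chunks.append(current)
    return chunks


print(effective_lines("x" * 46))            # 46 characters count as 2 lines
print(greedy_merge([10, 30, 50, 40, 5]))    # [[10, 30, 50], [40, 5]]
```

Because the estimate is based on characters, a single 200-character line counts as 5 effective lines with the default of 45 characters per line.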
Functions and methods are kept whole by the primary AST split. If one syntactic unit is larger than oversized_factor * max_effective_lines, the splitter falls back to a line-based secondary split using DocumentSplitter. This oversized fallback is the only case where chunks can overlap; the primary AST split does not add overlap.
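
As a rough illustration of the fallback, the sketch below checks the oversized threshold and then cuts a list of lines into overlapping windows. It does not call DocumentSplitter, so the real chunk boundaries may differ; the helper names is_oversized and split_lines_with_overlap are hypothetical.

```python
def is_oversized(
    unit_effective_lines: int,
    max_effective_lines: int = 100,
    oversized_factor: float = 3,
) -> bool:
    """True when a single unit exceeds oversized_factor * max_effective_lines."""
    return unit_effective_lines > oversized_factor * max_effective_lines


def split_lines_with_overlap(
    lines: list[str], length: int, overlap: int
) -> list[list[str]]:
    """Cut lines into windows of `length` that share `overlap` lines."""
    if overlap >= length:
        raise ValueError("overlap must be smaller than length")
    step = length - overlap
    chunks: list[list[str]] = []
    for start in range(0, len(lines), step):
        chunks.append(lines[start:start + length])
        if start + length >= len(lines):
            break  # the last window already reaches the end
    return chunks


lines = [f"line {i}" for i in range(12)]
for chunk in split_lines_with_overlap(lines, length=5, overlap=2):
    print(chunk[0], "...", chunk[-1])
```

With the default values, the threshold is 3 * 100 = 300 effective lines, so a unit of exactly 300 effective lines is still kept whole.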
By default, preserve_class_definition=True. When a chunk contains class members without the original class header, the splitter prefixes the bare class signature so the chunk still carries the class context.
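
The idea of prefixing a class signature can be sketched with the ast module. This is an illustration under assumptions, not the component's code: class_signature and with_class_context are hypothetical helpers, and the sample class is made up.

```python
import ast

SOURCE = '''class Circle(Shape):
    """A circle."""

    def area(self):
        return 3.14
'''


def class_signature(source: str, class_name: str) -> str:
    """Return the header lines of a class, excluding its body."""
    lines = source.splitlines()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef) and node.name == class_name:
            first = node.body[0]
            # A decorated first member starts at its first decorator.
            decorators = getattr(first, "decorator_list", [])
            body_start = min([first.lineno] + [d.lineno for d in decorators])
            end = max(node.lineno, body_start - 1)
            return "\n".join(lines[node.lineno - 1:end])
    raise ValueError(f"class {class_name!r} not found")


def with_class_context(member_source: str, signature: str) -> str:
    """Prefix a chunk of class members with the bare class signature."""
    return signature + "\n" + member_source


method_only = "    def area(self):\n        return 3.14"
print(with_class_context(method_only, class_signature(SOURCE, "Circle")))
```

The prefixed chunk still parses as Python, and a reader or an embedding model can see which class the method belongs to.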
If strip_docstrings=True, function, method, and class docstrings are removed from chunk content and stored in meta["docstrings"]. Module docstrings stay in the chunk content because they are their own top-level unit.
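
A minimal sketch of this behavior, assuming docstrings are located with ast.get_docstring and removed by line range. The function strip_docstrings here is a hypothetical stand-in for the component's internal logic, and it does not handle a function whose body consists only of a docstring.

```python
import ast

SOURCE = '''class Circle:
    """A circle."""

    def area(self):
        """Return the area."""
        return 3.14
'''


def strip_docstrings(source: str) -> tuple[str, list[str]]:
    """Return the source without function/class docstrings, plus the docstrings."""
    lines = source.splitlines()
    docstrings: list[str] = []
    drop: set[int] = set()
    for node in ast.walk(ast.parse(source)):
        # ast.Module is excluded on purpose: module docstrings stay in place.
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            continue
        text = ast.get_docstring(node)
        if text is None:
            continue
        docstrings.append(text)
        expr = node.body[0]  # a docstring is always the first body statement
        drop.update(range(expr.lineno, expr.end_lineno + 1))
    kept = [line for number, line in enumerate(lines, start=1) if number not in drop]
    return "\n".join(kept), docstrings


stripped, found = strip_docstrings(SOURCE)
print(stripped)
print(found)
```

Keeping the removed text in a separate list mirrors how the splitter stores it under meta["docstrings"] instead of discarding it.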
Each output document includes the original document's metadata plus:
- source_id: ID of the original document
- split_id: Index of the chunk within the original document
- start_line and end_line: Source line range for the AST units in the chunk. Oversized secondary chunks keep the originating unit's range.
- unit_kinds: Split units included in the chunk, such as imports, function, class_header, or method
- include_classes: Class names included in the chunk, when applicable
- decorators: Decorators found on included functions, methods, or classes, when applicable
- docstrings: Stripped docstrings, when strip_docstrings=True
- secondary_split, secondary_split_index, and secondary_split_total: Metadata for oversized fallback chunks
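
One way these fields might be used downstream is to regroup chunks by their original document and restore source order. The sketch below works on plain dictionaries with made-up values shaped like the metadata described above; group_by_source is a hypothetical helper, not part of Haystack.

```python
from collections import defaultdict

# Hypothetical metadata values, for illustration only.
chunk_metas = [
    {"source_id": "doc-a", "split_id": 1, "start_line": 12, "end_line": 30,
     "unit_kinds": ["function"]},
    {"source_id": "doc-b", "split_id": 0, "start_line": 1, "end_line": 8,
     "unit_kinds": ["imports"]},
    {"source_id": "doc-a", "split_id": 0, "start_line": 1, "end_line": 10,
     "unit_kinds": ["imports", "class_header"]},
]


def group_by_source(metas: list[dict]) -> dict[str, list[dict]]:
    """Group chunk metadata by source_id and sort each group by split_id."""
    grouped: dict[str, list[dict]] = defaultdict(list)
    for meta in metas:
        grouped[meta["source_id"]].append(meta)
    for items in grouped.values():
        items.sort(key=lambda meta: meta["split_id"])
    return dict(grouped)


for source_id, items in group_by_source(chunk_metas).items():
    ranges = [(m["start_line"], m["end_line"]) for m in items]
    print(source_id, ranges)
```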
Documents with None content raise ValueError, documents with non-string content raise TypeError, and invalid Python source raises SyntaxError. Empty documents are skipped.
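
Since invalid source raises SyntaxError, it can help to screen files before they reach the splitter. The sketch below is one possible pre-check using ast.parse; partition_sources is a hypothetical helper and the file names are made up.

```python
import ast


def partition_sources(sources: dict[str, str]) -> tuple[list[str], list[str]]:
    """Split file names into those that parse as Python and those that do not."""
    valid: list[str] = []
    invalid: list[str] = []
    for name, content in sources.items():
        try:
            ast.parse(content)
        except SyntaxError:
            invalid.append(name)
        else:
            valid.append(name)
    return valid, invalid


sources = {
    "ok.py": "x = 1\n",
    "broken.py": "def broken(:\n",
}
valid, invalid = partition_sources(sources)
print("valid:", valid)
print("invalid:", invalid)
```

Only the files in the valid list would then be wrapped in Document objects and passed to the splitter, while the invalid ones can be logged or handled separately.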
| Parameter | Default | Description |
|---|---|---|
| min_effective_lines | 20 | Minimum effective lines per chunk. While a chunk is below this value, the splitter keeps merging in the next unit. |
| max_effective_lines | 100 | Target effective lines per chunk. Units are merged greedily toward this value. |
| expected_chars_per_line | 45 | Character count used to estimate effective lines. |
| oversized_factor | 3 | Multiplier that triggers secondary line-based splitting for oversized syntactic units. |
| strip_docstrings | False | Moves function, method, and class docstrings from content into metadata. |
| preserve_class_definition | True | Prefixes class signatures on chunks that contain class members without the class header. |
| secondary_split_overlap | 5 | Line overlap used only by the oversized secondary split. |
| secondary_split_length | None | Line length for the oversized secondary split. Defaults to max_effective_lines. |
```python
import textwrap

from haystack import Document
from haystack.components.preprocessors import PythonCodeSplitter

# Sample module: a docstring, one import, and a class with two methods.
source = textwrap.dedent(
    '''
    """Math utilities."""

    from math import pi


    class Circle:
        """A circle."""

        def __init__(self, radius: float) -> None:
            self.radius = radius

        def area(self) -> float:
            return pi * self.radius * self.radius
    '''
).lstrip()

# Small limits so that even this short module is split into several chunks.
splitter = PythonCodeSplitter(
    min_effective_lines=4,
    max_effective_lines=12,
    strip_docstrings=True,
)

result = splitter.run(
    documents=[Document(content=source, meta={"file_name": "geometry.py"})],
)

# Print the source line range and the classes covered by each chunk.
for chunk in result["documents"]:
    print(chunk.meta["start_line"], chunk.meta["end_line"], chunk.meta.get("include_classes"))
```

This pipeline converts Python files to documents, splits them with PythonCodeSplitter, and writes the chunks to an in-memory document store.
```python
from pathlib import Path

from haystack import Pipeline
from haystack.components.converters.txt import TextFileToDocument
from haystack.components.preprocessors import PythonCodeSplitter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

p = Pipeline()
p.add_component("converter", TextFileToDocument())
p.add_component("splitter", PythonCodeSplitter(max_effective_lines=80))
p.add_component("writer", DocumentWriter(document_store=document_store))

p.connect("converter.documents", "splitter.documents")
p.connect("splitter.documents", "writer.documents")

files = list(Path("path/to/your/project").glob("**/*.py"))
p.run({"converter": {"sources": files}})
```