1 change: 1 addition & 0 deletions docs-website/docs/pipeline-components/preprocessors.mdx
@@ -25,5 +25,6 @@ Use the PreProcessors to prepare your data normalize white spaces, remove header
| [MarkdownHeaderSplitter](preprocessors/markdownheadersplitter.mdx) | Splits documents at ATX-style Markdown headers (#), with optional secondary splitting. Preserves header hierarchy as metadata. |
| [PresidioDocumentCleaner](preprocessors/presidiodocumentcleaner.mdx) | Replaces PII in Document text with entity type placeholders using Microsoft Presidio. |
| [PresidioTextCleaner](preprocessors/presidiotextcleaner.mdx) | Replaces PII in plain strings — useful for sanitizing user queries before they reach an LLM. |
| [PythonCodeSplitter](preprocessors/pythoncodesplitter.mdx) | Splits Python source documents into syntax-aware chunks using AST units such as imports, functions, class headers, methods, and statements. |
| [RecursiveSplitter](preprocessors/recursivesplitter.mdx) | Splits text into smaller chunks, it does so by recursively applying a list of separators <br />to the text, applied in the order they are provided. |
| [TextCleaner](preprocessors/textcleaner.mdx) | Removes regexes, punctuation, and numbers, as well as converts text to lowercase. Useful to clean up text data before evaluation. |
@@ -0,0 +1,136 @@
---
title: "PythonCodeSplitter"
id: pythoncodesplitter
slug: "/pythoncodesplitter"
description: "Split Python source documents into syntax-aware chunks using Python's AST, with metadata for line ranges, classes, decorators, and docstrings."
---

# PythonCodeSplitter

`PythonCodeSplitter` splits Python source code documents into syntax-aware chunks. It is designed for Python files and keeps code units such as imports, functions, classes, and methods together where possible.

<div className="key-value-table">

| | |
| --- | --- |
| **Most common position in a pipeline** | In indexing pipelines after [Converters](../converters.mdx), before [Embedders](../embedders.mdx) or [`DocumentWriter`](../writers/documentwriter.mdx) |
| **Mandatory run variables** | `documents`: A list of Python source code documents |
| **Output variables** | `documents`: A list of Python source code documents split into syntax-aware chunks |
| **API reference** | [PreProcessors](/reference/preprocessors-api) |
| **GitHub link** | https://github.com/deepset-ai/haystack/blob/main/haystack/components/preprocessors/python_code_splitter.py |
| **Package name** | `haystack-ai` |

</div>

## Overview

`PythonCodeSplitter` expects each input document's `content` to be valid Python source code. It parses the source with Python's `ast` module and creates ordered split units for:

- Module docstrings
- Consecutive import blocks
- Top-level functions
- Class headers
- Methods and nested classes
- Remaining top-level statements
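
As a rough illustration of how units like these can be located, the sketch below walks the top-level nodes of a parsed module with the standard `ast` module and labels them. It is not the component's implementation: the `classify_top_level` helper and the `module_docstring`, `class`, and `statement` labels are hypothetical, and the sketch treats each class as a single unit instead of separating class headers from methods.

```python
import ast

# Small sample module, written line by line so the line numbers are easy to follow.
SOURCE = "\n".join(
    [
        '"""Math utilities."""',               # line 1
        "import math",                         # line 2
        "from math import pi",                 # line 3
        "",                                    # line 4
        "def area(radius):",                   # line 5
        "    return pi * radius * radius",     # line 6
        "",                                    # line 7
        "class Circle:",                       # line 8
        "    def __init__(self, radius):",     # line 9
        "        self.radius = radius",        # line 10
        "",                                    # line 11
        "DEFAULT = Circle(1.0)",               # line 12
    ]
) + "\n"


def classify_top_level(source: str) -> list[tuple[str, int, int]]:
    """Return (kind, start_line, end_line) for each top-level unit (hypothetical helper)."""
    units: list[tuple[str, int, int]] = []
    for index, node in enumerate(ast.parse(source).body):
        if (
            index == 0
            and isinstance(node, ast.Expr)
            and isinstance(node.value, ast.Constant)
            and isinstance(node.value.value, str)
        ):
            kind = "module_docstring"
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            kind = "imports"
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            kind = "function"
        elif isinstance(node, ast.ClassDef):
            kind = "class"
        else:
            kind = "statement"

        if kind == "imports" and units and units[-1][0] == "imports":
            # Extend the previous import block instead of starting a new unit.
            previous = units.pop()
            units.append(("imports", previous[1], node.end_lineno))
        else:
            units.append((kind, node.lineno, node.end_lineno))
    return units


units = classify_top_level(SOURCE)
for kind, start, end in units:
    print(f"{kind}: lines {start}-{end}")
```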

The splitter merges these units in source order toward `max_effective_lines`. Effective lines are calculated from character length with `ceil(len(source) / expected_chars_per_line)`, so long lines count as more than one line.
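
The effective-line formula can be tried out directly. This is a minimal sketch of the arithmetic only; the `effective_lines` function name is an assumption made for illustration, and the default of `45` mirrors `expected_chars_per_line`.

```python
from math import ceil


def effective_lines(source: str, expected_chars_per_line: int = 45) -> int:
    """Estimate effective lines from character length (illustrative helper)."""
    return ceil(len(source) / expected_chars_per_line)


short_statement = "x = 1\n"                  # 6 characters
long_statement = "value = " + "a" * 100      # 108 characters on one physical line

print(effective_lines(short_statement))  # 1
print(effective_lines(long_statement))   # 3
```

A single 108-character line therefore weighs as much as three short lines when chunks are sized.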

Functions and methods are kept whole by the primary AST split. If one syntactic unit is larger than `oversized_factor * max_effective_lines`, the splitter falls back to a line-based secondary split using [`DocumentSplitter`](documentsplitter.mdx). This oversized fallback is the only case where chunks can overlap; the primary AST split does not add overlap.
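
To make the threshold concrete: with the defaults, a unit is treated as oversized only above `3 * 100 = 300` effective lines. The sketch below shows that check and a simple overlapping line window. Both helpers (`is_oversized`, `split_lines_with_overlap`) are hypothetical, and the windowing is a generic illustration of line-based splitting with overlap, not a description of how `DocumentSplitter` is configured or called internally.

```python
def is_oversized(
    unit_effective_lines: int,
    max_effective_lines: int = 100,
    oversized_factor: int = 3,
) -> bool:
    """True when a single unit exceeds oversized_factor * max_effective_lines."""
    return unit_effective_lines > oversized_factor * max_effective_lines


def split_lines_with_overlap(lines: list[str], length: int, overlap: int) -> list[list[str]]:
    """Cut lines into windows of `length` lines that share `overlap` lines (illustrative)."""
    if overlap >= length:
        raise ValueError("overlap must be smaller than length")
    step = length - overlap
    windows = []
    for start in range(0, len(lines), step):
        windows.append(lines[start : start + length])
        if start + length >= len(lines):
            break  # the last window already reaches the end of the unit
    return windows


lines = [f"line {i}" for i in range(250)]
windows = split_lines_with_overlap(lines, length=100, overlap=5)

print(is_oversized(300), is_oversized(301))  # False True
print([len(window) for window in windows])   # [100, 100, 60]
```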

By default, `preserve_class_definition=True`. When a chunk contains class members without the original class header, the splitter prefixes the bare class signature so the chunk still carries the class context.

If `strip_docstrings=True`, function, method, and class docstrings are removed from chunk content and stored in `meta["docstrings"]`. Module docstrings stay in the chunk content because they are their own top-level unit.
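
For intuition about which docstrings are affected, the following sketch collects function, method, and class docstrings with the standard `ast` module while leaving the module docstring alone. It only gathers the docstrings and does not rewrite the source. The `collect_docstrings` helper and the name-keyed dictionary are assumptions for illustration; the actual layout of `meta["docstrings"]` may differ.

```python
import ast

SOURCE = "\n".join(
    [
        '"""Math utilities."""',
        "",
        "class Circle:",
        '    """A circle."""',
        "",
        "    def area(self):",
        '        """Return the area."""',
        "        return 0.0",
        "",
        "    def scale(self, factor):",
        "        return factor",
    ]
) + "\n"


def collect_docstrings(source: str) -> dict[str, str]:
    """Map function, method, and class names to their docstrings (illustrative helper)."""
    found: dict[str, str] = {}
    for node in ast.walk(ast.parse(source)):
        # ast.Module is skipped on purpose, so the module docstring is not collected.
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            docstring = ast.get_docstring(node)
            if docstring is not None:
                # Keying by bare name is naive: same-named members would collide.
                found[node.name] = docstring
    return found


docstrings = collect_docstrings(SOURCE)
print(docstrings)
```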

Each output document includes the original document's metadata plus:

- `source_id`: ID of the original document
- `split_id`: Index of the chunk within the original document
- `start_line` and `end_line`: Source line range for the AST units in the chunk. Oversized secondary chunks keep the originating unit's range.
- `unit_kinds`: Split units included in the chunk, such as `imports`, `function`, `class_header`, or `method`
- `include_classes`: Class names included in the chunk, when applicable
- `decorators`: Decorators found on included functions, methods, or classes, when applicable
- `docstrings`: Stripped docstrings, when `strip_docstrings=True`
- `secondary_split`, `secondary_split_index`, and `secondary_split_total`: Metadata for oversized fallback chunks
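
This metadata makes it possible to filter or reorder chunks after splitting. The sketch below works on plain dictionaries shaped like the `meta` of output documents. The sample values are invented, the two helpers are hypothetical, and treating `unit_kinds` and `include_classes` as lists of strings is an assumption.

```python
# Invented sample metadata, shaped like chunk.meta for three chunks of one file.
chunk_metas = [
    {"source_id": "doc-1", "split_id": 1, "start_line": 5, "end_line": 12,
     "unit_kinds": ["class_header", "method"], "include_classes": ["Circle"]},
    {"source_id": "doc-1", "split_id": 0, "start_line": 1, "end_line": 2,
     "unit_kinds": ["imports"]},
    {"source_id": "doc-1", "split_id": 2, "start_line": 15, "end_line": 20,
     "unit_kinds": ["function"]},
]


def chunks_for_class(metas: list[dict], class_name: str) -> list[dict]:
    """Keep only chunks whose include_classes mentions class_name."""
    return [meta for meta in metas if class_name in meta.get("include_classes", [])]


def in_source_order(metas: list[dict]) -> list[dict]:
    """Order chunks by original document, then by position within it."""
    return sorted(metas, key=lambda meta: (meta["source_id"], meta["split_id"]))


circle_chunks = chunks_for_class(chunk_metas, "Circle")
ordered_ids = [meta["split_id"] for meta in in_source_order(chunk_metas)]

print(len(circle_chunks), ordered_ids)  # 1 [0, 1, 2]
```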

Documents with `None` content raise `ValueError`, documents with non-string content raise `TypeError`, and invalid Python source raises `SyntaxError`. Empty documents are skipped.
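
If you want to pre-check content before it reaches the splitter, the same rules can be mirrored with the standard library. This is a sketch under assumptions: `check_content` is a hypothetical helper, the error messages are invented, and "empty" is interpreted here as the empty string only.

```python
import ast


def check_content(content: object) -> bool:
    """Return True if content should be split, False if it should be skipped.

    Raises the same exception types the documentation lists for invalid input.
    """
    if content is None:
        raise ValueError("document content is None")
    if not isinstance(content, str):
        raise TypeError(f"expected str content, got {type(content).__name__}")
    if content == "":
        return False  # assumption: only the empty string counts as an empty document
    ast.parse(content)  # raises SyntaxError for invalid Python source
    return True


print(check_content("x = 1\n"))  # True
print(check_content(""))         # False
```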

## Configuration

| Parameter | Default | Description |
| --- | --- | --- |
| `min_effective_lines` | `20` | Minimum effective lines per chunk. While a chunk is below this value, the splitter keeps merging in the next unit. |
| `max_effective_lines` | `100` | Target effective lines per chunk. Units are merged greedily toward this value. |
| `expected_chars_per_line` | `45` | Character count used to estimate effective lines. |
| `oversized_factor` | `3` | Multiplier that triggers secondary line-based splitting for oversized syntactic units. |
| `strip_docstrings` | `False` | Moves function, method, and class docstrings from content into metadata. |
| `preserve_class_definition` | `True` | Prefixes class signatures on chunks that contain class members without the class header. |
| `secondary_split_overlap` | `5` | Line overlap used only by the oversized secondary split. |
| `secondary_split_length` | `None` | Line length for the oversized secondary split. When `None`, the value of `max_effective_lines` is used. |

## Usage

### On its own

```python
import textwrap

from haystack import Document
from haystack.components.preprocessors import PythonCodeSplitter

source = textwrap.dedent(
    '''
    """Math utilities."""
    from math import pi


    class Circle:
        """A circle."""

        def __init__(self, radius: float) -> None:
            self.radius = radius

        def area(self) -> float:
            return pi * self.radius * self.radius
    '''
).lstrip()

splitter = PythonCodeSplitter(
    min_effective_lines=4,
    max_effective_lines=12,
    strip_docstrings=True,
)

result = splitter.run(
    documents=[Document(content=source, meta={"file_name": "geometry.py"})],
)

for chunk in result["documents"]:
    print(chunk.meta["start_line"], chunk.meta["end_line"], chunk.meta.get("include_classes"))
```

### In a pipeline

This pipeline converts Python files to documents, splits them with `PythonCodeSplitter`, and writes the chunks to an in-memory document store.

```python
from pathlib import Path

from haystack import Pipeline
from haystack.components.converters.txt import TextFileToDocument
from haystack.components.preprocessors import PythonCodeSplitter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

p = Pipeline()
p.add_component("converter", TextFileToDocument())
p.add_component("splitter", PythonCodeSplitter(max_effective_lines=80))
p.add_component("writer", DocumentWriter(document_store=document_store))

p.connect("converter.documents", "splitter.documents")
p.connect("splitter.documents", "writer.documents")

files = list(Path("path/to/your/project").glob("**/*.py"))
p.run({"converter": {"sources": files}})
```
1 change: 1 addition & 0 deletions docs-website/sidebars.js
@@ -481,6 +481,7 @@ export default {
'pipeline-components/preprocessors/documentsplitter',
'pipeline-components/preprocessors/embeddingbaseddocumentsplitter',
'pipeline-components/preprocessors/hierarchicaldocumentsplitter',
'pipeline-components/preprocessors/pythoncodesplitter',
'pipeline-components/preprocessors/recursivesplitter',
'pipeline-components/preprocessors/textcleaner',
'pipeline-components/preprocessors/presidiodocumentcleaner',