title: PythonCodeSplitter
id: pythoncodesplitter
slug: /pythoncodesplitter
description: Split Python source documents into syntax-aware chunks using Python's AST, with metadata for line ranges, classes, decorators, and docstrings.

PythonCodeSplitter

PythonCodeSplitter splits Python source code documents into syntax-aware chunks. It is designed for Python files and keeps code units such as imports, functions, classes, and methods together where possible.

| | |
| --- | --- |
| Most common position in a pipeline | In indexing pipelines after Converters, before Embedders or DocumentWriter |
| Mandatory run variables | documents: A list of Python source code documents |
| Output variables | documents: A list of Python source code documents split into syntax-aware chunks |
| API reference | PreProcessors |
| GitHub link | https://github.com/deepset-ai/haystack/blob/main/haystack/components/preprocessors/python_code_splitter.py |
| Package name | haystack-ai |

Overview

PythonCodeSplitter expects each input document's content to be valid Python source code. It parses the source with Python's ast module and creates ordered split units for:

  • Module docstrings
  • Consecutive import blocks
  • Top-level functions
  • Class headers
  • Methods and nested classes
  • Remaining top-level statements
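The unit list above can be approximated with the standard library alone. The following is a minimal sketch, not the component's implementation: it only looks at top-level nodes, and the labels module_docstring, class, and statement are hypothetical names chosen for illustration (only imports and function appear among the unit kinds named on this page).

```python
import ast


def top_level_unit_kinds(source: str) -> list[str]:
    """Label each top-level node, collapsing consecutive imports into one block."""
    tree = ast.parse(source)
    kinds: list[str] = []
    for index, node in enumerate(tree.body):
        is_docstring = (
            index == 0
            and isinstance(node, ast.Expr)
            and isinstance(node.value, ast.Constant)
            and isinstance(node.value.value, str)
        )
        if is_docstring:
            kind = "module_docstring"  # hypothetical label
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            kind = "imports"
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            kind = "function"
        elif isinstance(node, ast.ClassDef):
            kind = "class"  # hypothetical label; header and members are not separated here
        else:
            kind = "statement"  # hypothetical label
        # Consecutive import statements form a single unit.
        if kind == "imports" and kinds and kinds[-1] == "imports":
            continue
        kinds.append(kind)
    return kinds


example = '"""Doc."""\nimport os\nfrom math import pi\n\ndef f():\n    pass\n\nclass C:\n    pass\n\nX = 1\n'
print(top_level_unit_kinds(example))
```

A fuller version would descend into each class body to separate the header from its methods and nested classes.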

The splitter merges these units greedily in source order until a chunk approaches max_effective_lines. Effective lines are estimated from character length as ceil(len(source) / expected_chars_per_line), so a long line counts as more than one effective line.
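To make the effective-line formula concrete, here is a small sketch that applies it to a string. The helper name effective_lines is an assumption for illustration, and so is applying the formula to the whole string at once, since the page does not say at what granularity the component evaluates it.

```python
import math


def effective_lines(source: str, expected_chars_per_line: int = 45) -> int:
    """Estimate effective lines as ceil(len(source) / expected_chars_per_line)."""
    return math.ceil(len(source) / expected_chars_per_line)


short = "x = 1"
long_line = "value = " + "a" * 82  # 90 characters on one physical line

print(effective_lines(short))      # 1
print(effective_lines(long_line))  # 2
```

With the default of 45 characters per line, a single 90-character line weighs as much as two short ones, which keeps chunks with very long lines from growing past the intended size.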

Functions and methods are kept whole by the primary AST split. If one syntactic unit is larger than oversized_factor * max_effective_lines, the splitter falls back to a line-based secondary split using DocumentSplitter. This oversized fallback is the only case where chunks can overlap; the primary AST split does not add overlap.
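The oversized check and the overlapping line windows can be sketched as follows. This is an illustration under stated assumptions: is_oversized and split_lines_with_overlap are hypothetical helpers, and the windowing shown here is a plain sliding window, which may differ in detail from what DocumentSplitter produces.

```python
import math


def is_oversized(source: str, max_effective_lines: int = 100,
                 oversized_factor: int = 3, expected_chars_per_line: int = 45) -> bool:
    """True when a unit is larger than oversized_factor * max_effective_lines."""
    effective = math.ceil(len(source) / expected_chars_per_line)
    return effective > oversized_factor * max_effective_lines


def split_lines_with_overlap(source: str, split_length: int, overlap: int) -> list[str]:
    """Cut source into windows of split_length lines that share `overlap` lines."""
    if not 0 <= overlap < split_length:
        raise ValueError("overlap must be non-negative and smaller than split_length")
    lines = source.splitlines()
    step = split_length - overlap
    windows = []
    for start in range(0, len(lines), step):
        windows.append("\n".join(lines[start:start + split_length]))
        if start + split_length >= len(lines):
            break  # the last window already reaches the end of the unit
    return windows


body = "\n".join(f"line_{i} = {i}" for i in range(10))
for window in split_lines_with_overlap(body, split_length=4, overlap=1):
    print(window.splitlines()[0], "...", window.splitlines()[-1])
```

Each window repeats the last line of the previous one, so a statement that falls on a window boundary still appears with some surrounding context.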

By default, preserve_class_definition=True. When a chunk contains class members but not the original class header, the splitter prefixes the chunk with the bare class signature so that it still carries its class context.
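One way to picture the class-signature prefix is to rebuild the signature from the AST node and prepend it to a method's source. The sketch below is an assumption about the general idea, not the component's code; class_signature is a hypothetical helper, and ast.unparse requires Python 3.9 or later.

```python
import ast


def class_signature(node: ast.ClassDef) -> str:
    """Rebuild a bare `class Name(bases, keywords):` line from a ClassDef node."""
    args = [ast.unparse(base) for base in node.bases]
    for keyword in node.keywords:
        value = ast.unparse(keyword.value)
        args.append(f"{keyword.arg}={value}" if keyword.arg else f"**{value}")
    return f"class {node.name}({', '.join(args)}):" if args else f"class {node.name}:"


source = (
    "class Circle(Shape):\n"
    "    def area(self):\n"
    "        return 3.14 * self.radius ** 2\n"
)
class_node = ast.parse(source).body[0]
method_node = class_node.body[0]

# Take the method's own lines and put the bare signature in front of them.
method_lines = source.splitlines()[method_node.lineno - 1:method_node.end_lineno]
chunk = "\n".join([class_signature(class_node), *method_lines])
print(chunk)
```

The printed chunk starts with class Circle(Shape): followed by the indented method, so a reader or an embedding model can tell which class the method belongs to.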

If strip_docstrings=True, function, method, and class docstrings are removed from chunk content and stored in meta["docstrings"]. Module docstrings stay in the chunk content because they are their own top-level unit.
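The effect of moving docstrings out of the content can be sketched with ast.get_docstring. The helper strip_docstrings below is hypothetical and simplified: it drops the physical lines a docstring occupies, does not handle a docstring that shares a line with other code, and can leave an empty body when the docstring is the only statement. The module docstring is left alone, matching the behavior described above.

```python
import ast


def strip_docstrings(source: str) -> tuple[str, list[str]]:
    """Remove function, method, and class docstrings; return (code, docstrings)."""
    tree = ast.parse(source)
    docstrings: list[str] = []
    drop: set[int] = set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            doc = ast.get_docstring(node)
            if doc is None:
                continue
            first = node.body[0]  # the docstring expression statement
            docstrings.append(doc)
            drop.update(range(first.lineno, first.end_lineno + 1))
    kept = [line for number, line in enumerate(source.splitlines(), start=1)
            if number not in drop]
    return "\n".join(kept), docstrings


code, docs = strip_docstrings('def f():\n    """Return one."""\n    return 1\n')
print(code)
print(docs)
```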

Each output document includes the original document's metadata plus:

  • source_id: ID of the original document
  • split_id: Index of the chunk within the original document
  • start_line and end_line: Source line range for the AST units in the chunk. Oversized secondary chunks keep the originating unit's range.
  • unit_kinds: Split units included in the chunk, such as imports, function, class_header, or method
  • include_classes: Class names included in the chunk, when applicable
  • decorators: Decorators found on included functions, methods, or classes, when applicable
  • docstrings: Stripped docstrings, when strip_docstrings=True
  • secondary_split, secondary_split_index, and secondary_split_total: Metadata for oversized fallback chunks
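The metadata fields above lend themselves to simple post-filtering. The sketch below uses stand-in objects with made-up metadata values rather than real splitter output, and it assumes include_classes is a list of class names; chunks_for_class and in_source_order are hypothetical helpers.

```python
from types import SimpleNamespace


def chunks_for_class(chunks, class_name):
    """Select chunks whose include_classes metadata mentions class_name."""
    return [c for c in chunks if class_name in (c.meta.get("include_classes") or [])]


def in_source_order(chunks):
    """Order chunks by source document, split index, then secondary split index."""
    return sorted(
        chunks,
        key=lambda c: (
            c.meta["source_id"],
            c.meta["split_id"],
            c.meta.get("secondary_split_index", 0),
        ),
    )


# Stand-in chunks with illustrative metadata (not real output).
chunks = [
    SimpleNamespace(meta={"source_id": "doc-a", "split_id": 1, "include_classes": ["Circle"]}),
    SimpleNamespace(meta={"source_id": "doc-a", "split_id": 0}),
]

print([c.meta["split_id"] for c in in_source_order(chunks)])
print(len(chunks_for_class(chunks, "Circle")))
```

The same functions work on Haystack Document objects, since they only read the meta attribute.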

Documents with None content raise ValueError, documents with non-string content raise TypeError, and invalid Python source raises SyntaxError. Empty documents are skipped.
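Because one invalid file raises SyntaxError for the whole run, it can help to pre-validate sources before they reach the splitter. The following sketch is one possible approach, not part of the component; split_valid_sources is a hypothetical helper that works on a plain mapping of file names to source text.

```python
import ast


def split_valid_sources(sources: dict[str, str]) -> tuple[dict[str, str], dict[str, str]]:
    """Partition sources into parseable ones and rejected ones with the error text."""
    valid: dict[str, str] = {}
    rejected: dict[str, str] = {}
    for name, text in sources.items():
        if not text:
            continue  # empty sources would be skipped by the splitter anyway
        try:
            ast.parse(text)
        except (SyntaxError, ValueError) as exc:
            # ValueError covers sources containing null bytes on some Python versions.
            rejected[name] = str(exc)
        else:
            valid[name] = text
    return valid, rejected


valid, rejected = split_valid_sources({
    "ok.py": "x = 1\n",
    "bad.py": "def broken(:\n",
    "empty.py": "",
})
print(sorted(valid), sorted(rejected))
```

Only the sources in valid would then be wrapped in Document objects and passed to the splitter, while rejected can be logged for follow-up.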

Configuration

| Parameter | Default | Description |
| --- | --- | --- |
| min_effective_lines | 20 | Minimum effective lines per chunk. While a chunk is below this value, the splitter keeps merging in the next unit. |
| max_effective_lines | 100 | Target effective lines per chunk. Units are merged greedily toward this value. |
| expected_chars_per_line | 45 | Character count used to estimate effective lines. |
| oversized_factor | 3 | Multiplier that triggers secondary line-based splitting for oversized syntactic units. |
| strip_docstrings | False | Moves function, method, and class docstrings from content into metadata. |
| preserve_class_definition | True | Prefixes class signatures on chunks that contain class members without the class header. |
| secondary_split_overlap | 5 | Line overlap used only by the oversized secondary split. |
| secondary_split_length | None | Line length for the oversized secondary split. Defaults to max_effective_lines. |
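How min_effective_lines and max_effective_lines interact can be illustrated with a greedy merge over unit sizes. This is one reading of the two descriptions above, not the component's algorithm: merge_units is a hypothetical helper that takes the effective-line count of each unit and keeps adding units while the chunk is below the minimum, then closes the chunk once the next unit would push it past the maximum.

```python
def merge_units(unit_sizes: list[int], min_lines: int = 20, max_lines: int = 100) -> list[list[int]]:
    """Greedily group unit sizes (in effective lines) into chunks, in source order."""
    chunks: list[list[int]] = []
    current: list[int] = []
    total = 0
    for size in unit_sizes:
        # Close the chunk only if it already meets the minimum
        # and the next unit would overshoot the target.
        if current and total >= min_lines and total + size > max_lines:
            chunks.append(current)
            current, total = [], 0
        current.append(size)
        total += size
    if current:
        chunks.append(current)
    return chunks


print(merge_units([10, 5, 30, 70, 15]))  # [[10, 5, 30], [70, 15]]
```

In this reading a chunk can exceed the maximum when it is still below the minimum and the next unit is large, which is why the maximum acts as a target rather than a hard limit.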

Usage

On its own

import textwrap

from haystack import Document
from haystack.components.preprocessors import PythonCodeSplitter

source = textwrap.dedent(
    '''
    """Math utilities."""
    from math import pi


    class Circle:
        """A circle."""

        def __init__(self, radius: float) -> None:
            self.radius = radius

        def area(self) -> float:
            return pi * self.radius * self.radius
    '''
).lstrip()

splitter = PythonCodeSplitter(
    min_effective_lines=4,
    max_effective_lines=12,
    strip_docstrings=True,
)

result = splitter.run(
    documents=[Document(content=source, meta={"file_name": "geometry.py"})],
)

for chunk in result["documents"]:
    print(chunk.meta["start_line"], chunk.meta["end_line"], chunk.meta.get("include_classes"))

In a pipeline

This pipeline converts Python files to documents, splits them with PythonCodeSplitter, and writes the chunks to an in-memory document store.

from pathlib import Path

from haystack import Pipeline
from haystack.components.converters.txt import TextFileToDocument
from haystack.components.preprocessors import PythonCodeSplitter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

p = Pipeline()
p.add_component("converter", TextFileToDocument())
p.add_component("splitter", PythonCodeSplitter(max_effective_lines=80))
p.add_component("writer", DocumentWriter(document_store=document_store))

p.connect("converter.documents", "splitter.documents")
p.connect("splitter.documents", "writer.documents")

# Collect every .py file under the project root, then run the indexing pipeline.
files = list(Path("path/to/your/project").glob("**/*.py"))
p.run({"converter": {"sources": files}})