From ea314808d417861e85977992134a8f66def02d33 Mon Sep 17 00:00:00 2001
From: kota-wilson
Date: Fri, 29 May 2026 20:55:36 -0700
Subject: [PATCH] docs: add PythonCodeSplitter docs

---
 .../pipeline-components/preprocessors.mdx     |   1 +
 .../preprocessors/pythoncodesplitter.mdx      | 136 ++++++++++++++++++
 docs-website/sidebars.js                      |   1 +
 3 files changed, 138 insertions(+)
 create mode 100644 docs-website/docs/pipeline-components/preprocessors/pythoncodesplitter.mdx

diff --git a/docs-website/docs/pipeline-components/preprocessors.mdx b/docs-website/docs/pipeline-components/preprocessors.mdx
index 4da2019dff..89664210b5 100644
--- a/docs-website/docs/pipeline-components/preprocessors.mdx
+++ b/docs-website/docs/pipeline-components/preprocessors.mdx
@@ -25,5 +25,6 @@ Use the PreProcessors to prepare your data normalize white spaces, remove header
 | [MarkdownHeaderSplitter](preprocessors/markdownheadersplitter.mdx) | Splits documents at ATX-style Markdown headers (#), with optional secondary splitting. Preserves header hierarchy as metadata. |
 | [PresidioDocumentCleaner](preprocessors/presidiodocumentcleaner.mdx) | Replaces PII in Document text with entity type placeholders using Microsoft Presidio. |
 | [PresidioTextCleaner](preprocessors/presidiotextcleaner.mdx) | Replaces PII in plain strings — useful for sanitizing user queries before they reach an LLM. |
+| [PythonCodeSplitter](preprocessors/pythoncodesplitter.mdx) | Splits Python source documents into syntax-aware chunks using AST units such as imports, functions, class headers, methods, and statements. |
 | [RecursiveSplitter](preprocessors/recursivesplitter.mdx) | Splits text into smaller chunks, it does so by recursively applying a list of separators to the text, applied in the order they are provided. |
 | [TextCleaner](preprocessors/textcleaner.mdx) | Removes regexes, punctuation, and numbers, as well as converts text to lowercase. Useful to clean up text data before evaluation. |
diff --git a/docs-website/docs/pipeline-components/preprocessors/pythoncodesplitter.mdx b/docs-website/docs/pipeline-components/preprocessors/pythoncodesplitter.mdx
new file mode 100644
index 0000000000..f575da661d
--- /dev/null
+++ b/docs-website/docs/pipeline-components/preprocessors/pythoncodesplitter.mdx
@@ -0,0 +1,136 @@
+---
+title: "PythonCodeSplitter"
+id: pythoncodesplitter
+slug: "/pythoncodesplitter"
+description: "Split Python source documents into syntax-aware chunks using Python's AST, with metadata for line ranges, classes, decorators, and docstrings."
+---
+
+# PythonCodeSplitter
+
+`PythonCodeSplitter` splits Python source code documents into syntax-aware chunks. It is designed for Python files and keeps code units such as imports, functions, classes, and methods together where possible.
+
+
+
+| | |
+| --- | --- |
+| **Most common position in a pipeline** | In indexing pipelines after [Converters](../converters.mdx), before [Embedders](../embedders.mdx) or [`DocumentWriter`](../writers/documentwriter.mdx) |
+| **Mandatory run variables** | `documents`: A list of Python source code documents |
+| **Output variables** | `documents`: A list of Python source code documents split into syntax-aware chunks |
+| **API reference** | [PreProcessors](/reference/preprocessors-api) |
+| **GitHub link** | https://github.com/deepset-ai/haystack/blob/main/haystack/components/preprocessors/python_code_splitter.py |
+| **Package name** | `haystack-ai` |
+
+
+
+## Overview
+
+`PythonCodeSplitter` expects each input document's `content` to be valid Python source code. It parses the source with Python's `ast` module and creates ordered split units for:
+
+- Module docstrings
+- Consecutive import blocks
+- Top-level functions
+- Class headers
+- Methods and nested classes
+- Remaining top-level statements
+
+The splitter merges these units in source order toward `max_effective_lines`. Effective lines are calculated from character length with `ceil(len(source) / expected_chars_per_line)`, so long lines count as more than one line.
+
+Functions and methods are kept whole by the primary AST split. If one syntactic unit is larger than `oversized_factor * max_effective_lines`, the splitter falls back to a line-based secondary split using [`DocumentSplitter`](documentsplitter.mdx). This oversized fallback is the only case where chunks can overlap; the primary AST split does not add overlap.
+
+By default, `preserve_class_definition=True`. When a chunk contains class members without the original class header, the splitter prefixes the bare class signature so the chunk still carries the class context.
+
+If `strip_docstrings=True`, function, method, and class docstrings are removed from chunk content and stored in `meta["docstrings"]`. Module docstrings stay in the chunk content because they are their own top-level unit.
+
+Each output document includes the original document's metadata plus:
+
+- `source_id`: ID of the original document
+- `split_id`: Index of the chunk within the original document
+- `start_line` and `end_line`: Source line range for the AST units in the chunk. Oversized secondary chunks keep the originating unit's range.
+- `unit_kinds`: Split units included in the chunk, such as `imports`, `function`, `class_header`, or `method`
+- `include_classes`: Class names included in the chunk, when applicable
+- `decorators`: Decorators found on included functions, methods, or classes, when applicable
+- `docstrings`: Stripped docstrings, when `strip_docstrings=True`
+- `secondary_split`, `secondary_split_index`, and `secondary_split_total`: Metadata for oversized fallback chunks
+
+Documents with `None` content raise `ValueError`, documents with non-string content raise `TypeError`, and invalid Python source raises `SyntaxError`. Empty documents are skipped.
+
+## Configuration
+
+| Parameter | Default | Description |
+| --- | --- | --- |
+| `min_effective_lines` | `20` | Minimum effective lines per chunk. While a chunk is below this value, the splitter keeps merging in the next unit. |
+| `max_effective_lines` | `100` | Target effective lines per chunk. Units are merged greedily toward this value. |
+| `expected_chars_per_line` | `45` | Character count used to estimate effective lines. |
+| `oversized_factor` | `3` | Multiplier that triggers secondary line-based splitting for oversized syntactic units. |
+| `strip_docstrings` | `False` | Moves function, method, and class docstrings from content into metadata. |
+| `preserve_class_definition` | `True` | Prefixes class signatures on chunks that contain class members without the class header. |
+| `secondary_split_overlap` | `5` | Line overlap used only by the oversized secondary split. |
+| `secondary_split_length` | `None` | Line length for the oversized secondary split. Defaults to `max_effective_lines`. |
+
+## Usage
+
+### On its own
+
+```python
+import textwrap
+
+from haystack import Document
+from haystack.components.preprocessors import PythonCodeSplitter
+
+source = textwrap.dedent(
+    '''
+    """Math utilities."""
+    from math import pi
+
+
+    class Circle:
+        """A circle."""
+
+        def __init__(self, radius: float) -> None:
+            self.radius = radius
+
+        def area(self) -> float:
+            return pi * self.radius * self.radius
+    '''
+).lstrip()
+
+splitter = PythonCodeSplitter(
+    min_effective_lines=4,
+    max_effective_lines=12,
+    strip_docstrings=True,
+)
+
+result = splitter.run(
+    documents=[Document(content=source, meta={"file_name": "geometry.py"})],
+)
+
+for chunk in result["documents"]:
+    print(chunk.meta["start_line"], chunk.meta["end_line"], chunk.meta.get("include_classes"))
+```
+
+### In a pipeline
+
+This pipeline converts Python files to documents, splits them with `PythonCodeSplitter`, and writes the chunks to an in-memory document store.
+
+```python
+from pathlib import Path
+
+from haystack import Pipeline
+from haystack.components.converters.txt import TextFileToDocument
+from haystack.components.preprocessors import PythonCodeSplitter
+from haystack.components.writers import DocumentWriter
+from haystack.document_stores.in_memory import InMemoryDocumentStore
+
+document_store = InMemoryDocumentStore()
+
+p = Pipeline()
+p.add_component("converter", TextFileToDocument())
+p.add_component("splitter", PythonCodeSplitter(max_effective_lines=80))
+p.add_component("writer", DocumentWriter(document_store=document_store))
+
+p.connect("converter.documents", "splitter.documents")
+p.connect("splitter.documents", "writer.documents")
+
+files = list(Path("path/to/your/project").glob("**/*.py"))
+p.run({"converter": {"sources": files}})
+```
diff --git a/docs-website/sidebars.js b/docs-website/sidebars.js
index 85841717cd..ca80d62356 100644
--- a/docs-website/sidebars.js
+++ b/docs-website/sidebars.js
@@ -481,6 +481,7 @@ export default {
         'pipeline-components/preprocessors/documentsplitter',
         'pipeline-components/preprocessors/embeddingbaseddocumentsplitter',
         'pipeline-components/preprocessors/hierarchicaldocumentsplitter',
+        'pipeline-components/preprocessors/pythoncodesplitter',
         'pipeline-components/preprocessors/recursivesplitter',
         'pipeline-components/preprocessors/textcleaner',
         'pipeline-components/preprocessors/presidiodocumentcleaner',
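
Note on the effective-line arithmetic documented in the Overview of `pythoncodesplitter.mdx`: the sketch below works through the stated formula `ceil(len(source) / expected_chars_per_line)` and the `oversized_factor * max_effective_lines` threshold, using the defaults from the Configuration table (45, 100, 3). The helper names `effective_lines` and `needs_secondary_split` are hypothetical and written only for illustration; they are not the component's API, and the component's internals may apply the estimate differently.

```python
from math import ceil


def effective_lines(source: str, expected_chars_per_line: int = 45) -> int:
    """Hypothetical helper: estimate lines from character length, as the docs describe."""
    return ceil(len(source) / expected_chars_per_line)


def needs_secondary_split(
    source: str,
    max_effective_lines: int = 100,
    oversized_factor: int = 3,
    expected_chars_per_line: int = 45,
) -> bool:
    """Hypothetical helper: True when a unit exceeds oversized_factor * max_effective_lines."""
    threshold = oversized_factor * max_effective_lines
    return effective_lines(source, expected_chars_per_line) > threshold


short_unit = "x = 1\n" * 10            # 10 physical lines, 60 characters
long_unit = "y = 'a' * 10\n" * 1200    # 1200 physical lines, 15600 characters

print(effective_lines(short_unit))        # ceil(60 / 45) = 2
print(effective_lines(long_unit))         # ceil(15600 / 45) = 347
print(needs_secondary_split(short_unit))  # 2 > 300 is False
print(needs_secondary_split(long_unit))   # 347 > 300 is True
```

Because the estimate depends on characters rather than physical lines, ten very short statements count as only two effective lines, while a unit needs roughly 13,500 characters at the default settings before the line-based fallback would apply.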