---
title: "PreProcessors"
id: preprocessors
slug: "/preprocessors"
description: "Use PreProcessors to prepare your data by normalizing whitespace, removing headers and footers, cleaning empty lines in your Documents, or splitting them into smaller pieces. PreProcessors are useful in an indexing pipeline to prepare your files for search."
---

PreProcessors

Use PreProcessors to prepare your data by normalizing whitespace, removing headers and footers, cleaning empty lines in your Documents, or splitting them into smaller pieces. PreProcessors are useful in an indexing pipeline to prepare your files for search.
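As a rough illustration of what cleaning and splitting mean in practice, here is a minimal plain-Python sketch. It does not use or reproduce the Haystack components; `clean_text`, `split_by_words`, `split_length`, and `split_overlap` are hypothetical names chosen for this example only.

```python
import re


def clean_text(text: str) -> str:
    """Collapse runs of spaces and tabs, trim each line, and drop empty lines."""
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


def split_by_words(text: str, split_length: int = 5, split_overlap: int = 1) -> list[str]:
    """Split text into chunks of at most `split_length` words.

    Consecutive chunks share `split_overlap` words so that context is not
    lost at chunk boundaries.
    """
    if split_length <= 0 or not 0 <= split_overlap < split_length:
        raise ValueError("require split_length > 0 and 0 <= split_overlap < split_length")
    words = text.split()
    step = split_length - split_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        # Stop once the current window has reached the end of the text.
        if start + split_length >= len(words):
            break
    return chunks


if __name__ == "__main__":
    raw = "Page  one   text.\n\n\nMore   text here\ton the page."
    cleaned = clean_text(raw)
    for chunk in split_by_words(cleaned, split_length=4, split_overlap=1):
        print(repr(chunk))
```

In an indexing pipeline the same order usually applies: clean first, then split, so that chunk sizes are measured on the cleaned text.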

| PreProcessor | Description |
| --- | --- |
| ChineseDocumentSplitter | Divides Chinese text documents into smaller chunks, using HanLP for accurate Chinese word segmentation and sentence tokenization. |
| ChonkieRecursiveDocumentSplitter | Splits documents recursively using a hierarchy of rules via Chonkie's RecursiveChunker, applying progressively finer splits until all chunks satisfy the size constraints. |
| ChonkieSemanticDocumentSplitter | Splits documents at semantic topic boundaries using embedding similarity via Chonkie's SemanticChunker, keeping related sentences together. |
| ChonkieSentenceDocumentSplitter | Splits documents into chunks that respect sentence boundaries via Chonkie's SentenceChunker, avoiding mid-sentence cuts. |
| ChonkieTokenDocumentSplitter | Splits documents into fixed-size token-based chunks via Chonkie's TokenChunker, supporting multiple tokenizers. |
| CSVDocumentCleaner | Cleans CSV documents by removing empty rows and columns while preserving specific ignored rows and columns. |
| CSVDocumentSplitter | Divides CSV documents into smaller sub-tables based on empty rows and columns. |
| DocumentCleaner | Removes extra whitespace, empty lines, specified substrings, regex matches, page headers, and footers from documents. |
| DocumentPreprocessor | Divides a list of text documents into a list of shorter text documents and then cleans them to make them more readable. |
| DocumentSplitter | Splits a list of text documents into a list of text documents with shorter texts. |
| HierarchicalDocumentSplitter | Creates a multi-level document structure based on parent-children relationships between text segments. |
| MarkdownHeaderSplitter | Splits documents at ATX-style Markdown headers (`#`), with optional secondary splitting. Preserves header hierarchy as metadata. |
| PresidioDocumentCleaner | Replaces PII in Document text with entity type placeholders using Microsoft Presidio. |
| PresidioTextCleaner | Replaces PII in plain strings. Useful for sanitizing user queries before they reach an LLM. |
| RecursiveSplitter | Splits text into smaller chunks by recursively applying a list of separators to the text, in the order they are provided. |
| TextCleaner | Removes regex matches, punctuation, and numbers, and converts text to lowercase. Useful for cleaning up text data before evaluation. |
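
The idea of recursively applying an ordered list of separators, as described for RecursiveSplitter, can be sketched in plain Python. This is an illustration of the general technique under simplifying assumptions, not the Haystack implementation: `recursive_split` and `max_chars` are hypothetical names, size is measured in characters, and the separator is dropped at chunk boundaries.

```python
def recursive_split(text: str, separators: list[str], max_chars: int) -> list[str]:
    """Split `text` into chunks of at most `max_chars` characters.

    Separators are tried in the order given. Pieces that are still too long
    after splitting on one separator are split again with the remaining ones.
    """
    if len(text) <= max_chars:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to a hard cut every `max_chars` characters.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    sep, remaining = separators[0], separators[1:]
    parts = text.split(sep)
    if len(parts) == 1:
        # This separator does not occur in the text, so try the next one.
        return recursive_split(text, remaining, max_chars)

    chunks: list[str] = []
    current = ""
    for part in parts:
        candidate = part if not current else current + sep + part
        if len(candidate) <= max_chars:
            # Merge small neighbouring pieces while they still fit.
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(part) <= max_chars:
            current = part
        else:
            # The piece alone is too long: recurse with the finer separators.
            chunks.extend(recursive_split(part, remaining, max_chars))
            current = ""
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "Para one is short.\n\nPara two has two sentences. Here is the second one."
    for chunk in recursive_split(sample, ["\n\n", ". ", " "], max_chars=30):
        print(repr(chunk))
```

Ordering the separators from coarse (paragraph breaks) to fine (spaces) keeps chunks aligned with the largest natural boundary that still fits the size limit.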