title: DocumentSplitter
id: documentsplitter
slug: /documentsplitter
description: `DocumentSplitter` divides a list of text documents into a list of shorter text documents. This is useful for long texts that otherwise wouldn't fit into the maximum text length of language models and can also speed up question answering.

DocumentSplitter

DocumentSplitter divides a list of text documents into a list of shorter text documents. This is useful for long texts that otherwise wouldn't fit into the maximum text length of language models and can also speed up question answering.

Most common position in a pipeline: In indexing pipelines after Converters and DocumentCleaner, before Classifiers
Mandatory run variables: "documents": A list of documents
Output variables: "documents": A list of documents
API reference: PreProcessors
GitHub link: https://github.com/deepset-ai/haystack/blob/main/haystack/components/preprocessors/document_splitter.py

Overview

DocumentSplitter expects a list of documents as input and returns a list of documents with split texts. It splits each input document into chunks of split_length units, where the unit is set by split_by, and consecutive chunks share split_overlap units. You can set these parameters when you initialize the component:

  • split_by sets the unit to split on and can be "word", "sentence", "passage" (paragraph), "page", "line", or "function".
  • split_length is an integer indicating the chunk size, measured in the unit set by split_by (for example, the number of words, sentences, or passages).
  • split_overlap is an integer indicating how many units (for example, words, sentences, or passages) consecutive chunks share.
  • split_threshold is an integer indicating the minimum number of units (for example, words, sentences, or passages) that a document fragment should have. If a fragment is below the threshold, it is attached to the previous one.
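To make the interaction of these parameters concrete, here is a minimal pure-Python sketch of the windowing idea. It is a simplified illustration, not the library's implementation: the helper name `chunk_units` is hypothetical, units are assumed to be already split (here, on whitespace), and when a short trailing fragment is merged into the previous chunk this sketch skips the overlapping units so nothing is repeated.

```python
from typing import List, Sequence


def chunk_units(
    units: Sequence[str],
    split_length: int,
    split_overlap: int = 0,
    split_threshold: int = 0,
) -> List[List[str]]:
    """Group units into chunks (hypothetical helper, for illustration only)."""
    if split_length <= 0:
        raise ValueError("split_length must be positive")
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be >= 0 and smaller than split_length")

    # Each new chunk starts this many units after the previous one.
    step = split_length - split_overlap
    chunks: List[List[str]] = []
    for start in range(0, len(units), step):
        window = list(units[start : start + split_length])
        if chunks and len(window) < split_threshold:
            # Fragment is too short: attach only its new units to the previous chunk.
            chunks[-1].extend(window[split_overlap:])
        else:
            chunks.append(window)
        if start + split_length >= len(units):
            # This window already reached the end of the input.
            break
    return chunks


words = "Moonlight shimmered softly, wolves howled nearby, night enveloped".split()

plain = chunk_units(words, split_length=3)
overlapping = chunk_units(words, split_length=3, split_overlap=1)
merged = chunk_units(words, split_length=3, split_threshold=3)

for name, result in [("plain", plain), ("overlapping", overlapping), ("merged", merged)]:
    print(name, [" ".join(chunk) for chunk in result])
```

With eight words and split_length=3, the plain run ends with a two-word fragment, the overlapping run repeats the last word of each chunk at the start of the next, and the run with split_threshold=3 folds the two-word fragment into the previous chunk.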

A field "source_id" is added to each document's metadata to keep track of the original document that was split. Another meta field, "page_number", is added to each document to keep track of the page it belonged to in the original document. All other metadata is copied from the original document.
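The metadata handling described above can be pictured with a small pure-Python sketch. It rests on several assumptions that are not part of the component's API: plain dictionaries stand in for documents, the helper name `split_pages_with_meta` is hypothetical, and pages are assumed to be separated by a form feed character ("\f").

```python
import copy
from typing import Any, Dict, List


def split_pages_with_meta(
    doc_id: str, content: str, meta: Dict[str, Any]
) -> List[Dict[str, Any]]:
    """Split text into pages and attach tracking metadata (illustrative helper)."""
    chunks = []
    # Assumption: a form feed ("\f") marks a page break in the text.
    for page_number, page_text in enumerate(content.split("\f"), start=1):
        chunk_meta = copy.deepcopy(meta)          # other metadata is copied over
        chunk_meta["source_id"] = doc_id          # points back to the original document
        chunk_meta["page_number"] = page_number   # page the chunk came from
        chunks.append({"content": page_text, "meta": chunk_meta})
    return chunks


original_meta = {"author": "Jane Doe"}
pages = split_pages_with_meta("doc-1", "Page one text.\fPage two text.", original_meta)

for page in pages:
    print(page["meta"])
```

Copying the metadata with `copy.deepcopy` keeps the chunks independent of each other, so changing one chunk's metadata later does not affect the original or its siblings.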

The DocumentSplitter is compatible with the following DocumentStores:

Usage

On its own

You can use this component outside of a pipeline to shorten your documents like this:

from haystack import Document
from haystack.components.preprocessors import DocumentSplitter

doc = Document(
    content="Moonlight shimmered softly, wolves howled nearby, night enveloped everything.",
)

splitter = DocumentSplitter(split_by="word", split_length=3, split_overlap=0)
result = splitter.run(documents=[doc])

In a pipeline

Here's how you can use DocumentSplitter in an indexing pipeline:

from pathlib import Path

from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.converters.txt import TextFileToDocument
from haystack.components.preprocessors import DocumentCleaner
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter

document_store = InMemoryDocumentStore()
p = Pipeline()
p.add_component(instance=TextFileToDocument(), name="text_file_converter")
p.add_component(instance=DocumentCleaner(), name="cleaner")
p.add_component(
    instance=DocumentSplitter(split_by="sentence", split_length=1),
    name="splitter",
)
p.add_component(instance=DocumentWriter(document_store=document_store), name="writer")
p.connect("text_file_converter.documents", "cleaner.documents")
p.connect("cleaner.documents", "splitter.documents")
p.connect("splitter.documents", "writer.documents")

path = "path/to/your/files"
files = list(Path(path).glob("*.md"))
p.run({"text_file_converter": {"sources": files}})

In YAML

This is the YAML representation of the indexing pipeline shown above. It reads text files, cleans the text, splits it into individual sentences, and writes them to an in-memory document store.

components:
  cleaner:
    init_parameters:
      ascii_only: false
      keep_id: false
      remove_empty_lines: true
      remove_extra_whitespaces: true
      remove_regex: null
      remove_repeated_substrings: false
      remove_substrings: null
      replace_regexes: null
      strip_whitespaces: false
      unicode_normalization: null
    type: haystack.components.preprocessors.document_cleaner.DocumentCleaner
  splitter:
    init_parameters:
      extend_abbreviations: true
      language: en
      respect_sentence_boundary: false
      skip_empty_documents: true
      split_by: sentence
      split_length: 1
      split_overlap: 0
      split_threshold: 0
      use_split_rules: true
    type: haystack.components.preprocessors.document_splitter.DocumentSplitter
  text_file_converter:
    init_parameters:
      encoding: utf-8
      store_full_path: false
    type: haystack.components.converters.txt.TextFileToDocument
  writer:
    init_parameters:
      document_store:
        init_parameters:
          bm25_algorithm: BM25L
          bm25_parameters: {}
          bm25_tokenization_regex: (?u)\b\w+\b
          embedding_similarity_function: dot_product
          index: 64e4f9ab-87fb-47fd-b390-dabcfda61447
          return_embedding: true
        type: haystack.document_stores.in_memory.document_store.InMemoryDocumentStore
      policy: NONE
    type: haystack.components.writers.document_writer.DocumentWriter
connection_type_validation: true
connections:
- receiver: cleaner.documents
  sender: text_file_converter.documents
- receiver: splitter.documents
  sender: cleaner.documents
- receiver: writer.documents
  sender: splitter.documents
max_runs_per_component: 100
metadata: {}