
Bug: TextChunker.SplitPlainTextParagraphs sometimes overcounts the chunk sizes #13713

Description

@onyxmaster

Hi,

The TextChunker occasionally produces an oversized last paragraph. The orphan-chunk gluing logic in ProcessParagraphs counts words instead of tokens, so the last chunk can exceed the target length. As a result, TextChunker.SplitPlainTextParagraphs sometimes returns chunks that are too large, which causes (frequently silent) loss of information when generating embeddings or using a reranker.
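To illustrate the mismatch, here is a minimal, self-contained sketch. It is not the library's code: `OrphanGluingSketch`, `FitsAfterGluing`, and the 4-characters-per-token counter `CountTokensApprox` are hypothetical names I made up for the example, and a real tokenizer would give different numbers. It only shows how a limit check done in words can accept a merge that a check done in tokens would reject.

```csharp
using System;

public static class OrphanGluingSketch
{
    private static readonly char[] Whitespace = { ' ', '\t', '\r', '\n' };

    // Counts whitespace-separated words.
    public static int CountWords(string text) =>
        text.Split(Whitespace, StringSplitOptions.RemoveEmptyEntries).Length;

    // Hypothetical stand-in for a real tokenizer (assumption):
    // one token per 4 characters, rounded up.
    public static int CountTokensApprox(string text) => (text.Length + 3) / 4;

    // Returns true when gluing the orphan onto the previous chunk keeps the
    // merged text within the limit, as measured by the supplied counter.
    public static bool FitsAfterGluing(
        string previous, string orphan, int limit, Func<string, int> count) =>
        count(previous + " " + orphan) <= limit;

    public static void Main()
    {
        const int maxTokens = 8;
        string previous = "alpha beta gamma";
        string orphan = "internationalization";

        // Same limit, two different units of measurement.
        bool byWords = FitsAfterGluing(previous, orphan, maxTokens, CountWords);
        bool byTokens = FitsAfterGluing(previous, orphan, maxTokens, CountTokensApprox);

        Console.WriteLine($"fits by word count: {byWords}");
        Console.WriteLine($"fits by token count: {byTokens}");
    }
}
```

With these inputs the merged text has 4 words but 10 approximate tokens, so the word-based check accepts the merge while the token-based check rejects it against a limit of 8. A check like this, run over the returned chunks with the same token counter that was passed to the chunker, should also make the oversized chunks easy to spot.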

Platform

  • Language: C#
  • Source:

Labels: bug, stale
