fix(list): stop capturing nested list markdown in list item text #385
Merged
Track text_end_pos separately and only advance it past non-list children. Fixes content duplication in both Markdown output and the structure collector when list items contain nested ul/ol elements.
b3e2bdb to
3d7f669
kh3rld
added a commit
to kreuzberg-dev/kreuzberg
that referenced
this pull request
May 23, 2026
…cation Patch html-to-markdown-rs to the fix branch (3.5.0) which stops item.rs from capturing nested-list markdown in list item text. All 8 regression tests now pass including html_nested_list_no_content_duplication. Rename test file to issue_1004_nested_list_regression.rs to match the repo convention (issue_NNN_<description>.rs). Update module doc to accurately state both bugs are tracked here, not yet both fixed in a single atomic commit. Patch will be replaced by a version bump once kreuzberg-dev/html-to-markdown#385 merges and 3.5.0 is published to crates.io.
Goldziher
approved these changes
May 25, 2026
…-list-duplication # Conflicts: # CHANGELOG.md
Goldziher
added a commit
that referenced
this pull request
May 25, 2026
- Bump workspace version 3.5.0 -> 3.5.1 via `alef sync-versions --set 3.5.1`.
- Full `alef all` regen against alef 0.19.8 (unreleased; pre-commit pin held at v0.19.7).
- Visitor exposure end-to-end for Ruby and Elixir (closes #388).
- Nested-list duplication fix from PR #385.
- See CHANGELOG.md [3.5.1] for the full set of fixes, additions, and CI changes.
kh3rld
added a commit
to kreuzberg-dev/kreuzberg
that referenced
this pull request
May 25, 2026
…cation Patch html-to-markdown-rs to the fix branch (3.5.0) which stops item.rs from capturing nested-list markdown in list item text. All 8 regression tests now pass including html_nested_list_no_content_duplication. Rename test file to issue_1004_nested_list_regression.rs to match the repo convention (issue_NNN_<description>.rs). Update module doc to accurately state both bugs are tracked here, not yet both fixed in a single atomic commit. Patch will be replaced by a version bump once kreuzberg-dev/html-to-markdown#385 merges and 3.5.0 is published to crates.io.
HuachunSi
pushed a commit
to HuachunSi/kreuzberg
that referenced
this pull request
May 28, 2026
…sion tests
Adopts the upstream fix (kreuzberg-dev/html-to-markdown#385) for nested ul > li > ul > li > ol content duplication, plus the chunker no-panic regression coverage from kreuzberg-dev#1048.
The 3.5.0 release dropped the need for the [patch.crates-io] git-rev pin proposed in the original PR; bumping the workspace dep to 3.5.2 gives us the fix straight from crates.io.
Includes a new integration test crate (nested_list_duplication.rs) that asserts:
- Markdown chunker MUST NOT PANIC on any input (regression for the integer underflow in text_splitter/splitter.rs that hit when the pre-3.5.0 malformed output was passed to the chunker).
- HTML extraction of deeply nested mixed lists MUST NOT duplicate content (regression for issue kreuzberg-dev#1004).
Closes kreuzberg-dev#1004
Co-authored-by: kh3rld <kherld.hussein@gmail.com>
Fixes #1004: nested list content duplication in the structure collector and Markdown output.
Root cause: `push_list_item` in `item.rs` was called with `output[item_start_pos..]`, which included the full rendered Markdown of nested `ul`/`ol` children. Each inner list item's text was therefore triple-counted: once as its own item, once in the parent item, and once as free text from the walker.
Fix: Track `text_end_pos` separately and only advance it past non-list children. `ListItem.text` now contains only the immediate item text.
Tests: `nested_list_duplication.rs` covers both the Markdown output and the structure collector path (`include_document_structure: true`).
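The root cause and fix described above can be illustrated with a minimal, self-contained sketch. This is not the code in `item.rs`: the `Node` and `Collected` types, the `render_item` and `collect` functions, and the `track_text_end` flag are all hypothetical names invented for illustration. Only `item_start_pos` and `text_end_pos` come from the description. The sketch also assumes that an item's own text precedes any nested list.

```rust
// Hypothetical, simplified model of a list item; not the crate's real types.
pub enum Node {
    Text(String),
    // A nested list: each entry holds the children of one list item.
    List(Vec<Vec<Node>>),
}

// Hypothetical stand-in for a collected structure entry.
pub struct Collected {
    pub text: String,
    pub depth: usize,
}

// Renders one list item into `output` and records its text in `items`.
// With `track_text_end == false` the item text is sliced to the end of the
// buffer (the behaviour described as the root cause); with `true` it is
// sliced only up to the last non-list child (the described fix).
pub fn render_item(
    children: &[Node],
    depth: usize,
    track_text_end: bool,
    output: &mut String,
    items: &mut Vec<Collected>,
) {
    output.push_str(&"  ".repeat(depth));
    output.push_str("- ");
    let item_start_pos = output.len();
    let mut text_end_pos = item_start_pos;

    // Reserve a slot so the parent entry precedes its nested items.
    let slot = items.len();
    items.push(Collected { text: String::new(), depth });

    for child in children {
        match child {
            Node::Text(t) => {
                output.push_str(t);
                // Advance only past non-list children.
                text_end_pos = output.len();
            }
            Node::List(list_items) => {
                output.push('\n');
                for li in list_items {
                    render_item(li, depth + 1, track_text_end, output, items);
                }
                // text_end_pos is deliberately left untouched here.
            }
        }
    }

    let end = if track_text_end { text_end_pos } else { output.len() };
    items[slot].text = output[item_start_pos..end].trim().to_string();
    if !output.ends_with('\n') {
        output.push('\n');
    }
}

// Convenience wrapper: renders a single top-level item.
pub fn collect(root: &[Node], track_text_end: bool) -> (String, Vec<Collected>) {
    let mut output = String::new();
    let mut items = Vec::new();
    render_item(root, 0, track_text_end, &mut output, &mut items);
    (output, items)
}
```

Calling `collect` on a three-level list with `track_text_end` set to `false` leaves the nested Markdown inside the parent item's text, while `true` yields only the immediate text for each item. The rendered Markdown is identical in both modes; only the collected item text differs.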