fix(list): stop capturing nested list markdown in list item text #385

Merged
Goldziher merged 5 commits into main from fix/issue-1004-nested-list-duplication
May 25, 2026
Conversation

Contributor

@kh3rld kh3rld commented May 22, 2026

Fixes #1004: nested list content duplication in the structure collector and Markdown output.

Root cause: push_list_item in item.rs was called with output[item_start_pos..], which included the full rendered Markdown of nested ul/ol children. Each inner list item's text was therefore triple-counted: once as its own item, once in the parent item, and once as free text from the walker.

Fix: Track text_end_pos separately and only advance it past non-list children. ListItem.text now contains only the immediate item text.

Tests: nested_list_duplication.rs covers both the Markdown output and the structure collector path (include_document_structure: true).
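
The text_end_pos idea described above can be illustrated with a minimal, self-contained sketch. This is not the code from item.rs: the `Child` enum and the `render_item` function are hypothetical stand-ins, and the nested-list rendering is deliberately simplified. It only shows why slicing to the end of the output buffer pulls nested Markdown into the parent item's text, and how a separately tracked end position avoids that.

```rust
/// Hypothetical, simplified child of an <li> element (not the library's real types).
enum Child {
    /// Inline text that belongs to the list item itself.
    Text(String),
    /// A nested ul/ol, reduced here to the text of its items.
    List(Vec<String>),
}

/// Renders one list item into `output` and returns two candidate item texts:
/// `(old_text, new_text)`.
///
/// `old_text` mimics the behaviour described as the root cause: everything
/// from `item_start_pos` to the end of the buffer.
/// `new_text` mimics the fix: everything from `item_start_pos` to
/// `text_end_pos`, which only advances past non-list children.
///
/// Assumption of this sketch: nested lists come after the item's own text.
/// Text that follows a nested list would need extra handling.
fn render_item(children: &[Child], output: &mut String) -> (String, String) {
    let item_start_pos = output.len();
    let mut text_end_pos = item_start_pos;

    for child in children {
        match child {
            Child::Text(text) => {
                output.push_str(text);
                // Non-list child: the item's own text now extends to here.
                text_end_pos = output.len();
            }
            Child::List(items) => {
                // The nested list is still written to the output buffer,
                // but text_end_pos is intentionally left untouched.
                for item in items {
                    output.push_str("\n  - ");
                    output.push_str(item);
                }
            }
        }
    }

    let old_text = output[item_start_pos..].trim().to_string();
    let new_text = output[item_start_pos..text_end_pos].trim().to_string();
    (old_text, new_text)
}
```

For an item with the text "Parent" and a nested list of "Child A" and "Child B", the old slice contains all three strings, while the new slice contains only "Parent". The Markdown written to the buffer is the same in both cases; only the text recorded for the item changes.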

Track text_end_pos separately and only advance it past non-list children.
Fixes content duplication in both Markdown output and the structure
collector when list items contain nested ul/ol elements.
@kh3rld kh3rld force-pushed the fix/issue-1004-nested-list-duplication branch from b3e2bdb to 3d7f669 May 22, 2026 20:57
kh3rld added a commit to kreuzberg-dev/kreuzberg that referenced this pull request May 23, 2026
…cation

Patch html-to-markdown-rs to the fix branch (3.5.0) which stops item.rs
from capturing nested-list markdown in list item text. All 8 regression
tests now pass including html_nested_list_no_content_duplication.

Rename test file to issue_1004_nested_list_regression.rs to match the
repo convention (issue_NNN_<description>.rs). Update module doc to
accurately state both bugs are tracked here, not yet both fixed in a
single atomic commit.

Patch will be replaced by a version bump once kreuzberg-dev/html-to-markdown#385
merges and 3.5.0 is published to crates.io.
@kh3rld kh3rld marked this pull request as ready for review May 23, 2026 06:53
@kh3rld kh3rld requested a review from Goldziher as a code owner May 23, 2026 06:53
@kh3rld kh3rld requested a review from v-tan as a code owner May 23, 2026 07:07
@kh3rld kh3rld self-assigned this May 23, 2026
@kh3rld kh3rld added the enhancement New feature or request label May 23, 2026
@kh3rld kh3rld moved this from Todo to In Review in Kreuzberg.dev Kanban May 23, 2026
…-list-duplication

# Conflicts:
#	CHANGELOG.md
@Goldziher Goldziher merged commit ea29cb9 into main May 25, 2026
8 of 9 checks passed
@Goldziher Goldziher deleted the fix/issue-1004-nested-list-duplication branch May 25, 2026 15:02
@github-project-automation github-project-automation Bot moved this from In Review to Done in Kreuzberg.dev Kanban May 25, 2026
Goldziher added a commit that referenced this pull request May 25, 2026
- Bump workspace version 3.5.0 -> 3.5.1 via `alef sync-versions --set 3.5.1`.
- Full `alef all` regen against alef 0.19.8 (unreleased — pre-commit pin held at v0.19.7).
- Visitor exposure end-to-end for Ruby and Elixir (closes #388).
- Nested-list duplication fix from PR #385.
- See CHANGELOG.md [3.5.1] for the full set of fixes, additions, and CI changes.
kh3rld added a commit to kreuzberg-dev/kreuzberg that referenced this pull request May 25, 2026
HuachunSi pushed a commit to HuachunSi/kreuzberg that referenced this pull request May 28, 2026
…sion tests

Adopts the upstream fix (kreuzberg-dev/html-to-markdown#385) for nested
ul > li > ul > li > ol content duplication, plus the chunker no-panic
regression coverage from kreuzberg-dev#1048.

The 3.5.0 release dropped the need for the [patch.crates-io] git-rev
pin proposed in the original PR; bumping the workspace dep to 3.5.2
gives us the fix straight from crates.io.

Includes a new integration test crate (nested_list_duplication.rs) that
asserts:
- Markdown chunker MUST NOT PANIC on any input (regression for the
  integer underflow in text_splitter/splitter.rs that hit when the
  pre-3.5.0 malformed output was passed to the chunker).
- HTML extraction of deeply nested mixed lists MUST NOT duplicate
  content (regression for issue kreuzberg-dev#1004).

Closes kreuzberg-dev#1004

Co-authored-by: kh3rld <kherld.hussein@gmail.com>
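
The "MUST NOT duplicate content" assertion in the commit message above could be expressed with a small helper like the one below. This is a sketch under stated assumptions, not the contents of nested_list_duplication.rs: the function name `phrases_not_exactly_once` is hypothetical, and the Markdown strings used with it are made-up samples rather than real converter output.

```rust
/// Hypothetical helper: returns every phrase that does not occur exactly
/// once in `markdown`, in the order the phrases were given.
/// An empty result means no phrase is missing or duplicated.
fn phrases_not_exactly_once<'a>(markdown: &str, phrases: &[&'a str]) -> Vec<&'a str> {
    phrases
        .iter()
        .copied()
        // Skip empty phrases: an empty pattern matches at every position.
        .filter(|phrase| !phrase.is_empty())
        // Count non-overlapping occurrences and keep anything that is not 1.
        .filter(|phrase| markdown.matches(*phrase).count() != 1)
        .collect()
}
```

A regression test could convert a ul > li > ul > li > ol fixture, then assert that this helper returns an empty vector for the distinctive text of each list item. The phrases need to be unique and must not be substrings of one another, otherwise a correct output would be reported as a duplicate.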
Labels

enhancement New feature or request

Projects

Development

Successfully merging this pull request may close these issues.

bug: HTML-to-Markdown creates malformed nested list Markdown and Markdown chunker panics on it

2 participants