Skip to content

Commit ea29cb9

Browse files
kh3rldGoldziher
andauthored
fix(list): stop capturing nested list markdown in list item text (#385)
* fix(list): stop capturing nested list markdown in list item text Track text_end_pos separately and only advance it past non-list children. Fixes content duplication in both Markdown output and the structure collector when list items contain nested ul/ol elements. * docs(changelog): add entry for nested list duplication fix (#385) --------- Co-authored-by: Na'aman Hirschfeld <nhirschfeld@gmail.com>
1 parent 114d7ce commit ea29cb9

3 files changed

Lines changed: 75 additions & 3 deletions

File tree

CHANGELOG.md

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -45,6 +45,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
4545

4646
### Fixed
4747

48+
- **list**: Fix content duplication in Markdown output and the document structure collector when list items contain nested `ul` or `ol` children. `item.rs` previously captured the full rendered output of an `<li>` — including the rendered nested-list Markdown — as the item's text. A `text_end_pos` cursor now advances only past non-list children, so the structure collector records only the item's own text and the Markdown output for outer and mid items is not repeated. Affected: any `ul > li > ul` or `ol > li > ol` (arbitrarily nested) HTML structure. (#385)
49+
4850
- **ci(publish): skip `publish-hex` and `homebrew-bottles` on dry-run.** Both jobs need real GitHub Release assets (`generate-elixir-checksums` downloads NIF tarballs; `brew install --build-bottle` downloads CLI/FFI source tarballs), but `upload-release-assets@v1` only logs on dry-run — the release doesn't exist. Both jobs failed every dry-run after surviving every other stage. Gated their `if:` on `dry_run != 'true'`; real-release runs continue to exercise them.
4951

5052
- **ci(publish): replace inline Elixir Hex packaging with `kreuzberg-dev/actions/build-elixir-hex@v1`.** The `elixir-package` job in `.github/workflows/publish.yaml` previously ran `mix deps.get` + `mix hex.build` inline with no path-dep rewrite and no `Cargo.lock` generation. `Cargo.lock` is gitignored, so on a fresh CI checkout `mix hex.build` failed at the `files` list check with `Missing files: native/html_to_markdown_nif/Cargo.lock` — the proximate blocker on v3.5.0-rc.2 dry-run after the Go FFI fix landed. The new shared action wraps `rewrite-native-deps@v1` (default-on, dry-run-guarded) and falls back to `cargo generate-lockfile` on dry-run so the lockfile exists for `mix hex.build` even when the rewrite is skipped. Hex source-package builds now match the python-sdist / ruby-gem pattern (rewrite baked into the shared action; cannot be omitted).
@@ -58,7 +60,6 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
5860
- **ci(publish): build the Go FFI matrix on dry-runs too.** `.github/workflows/publish.yaml` job `go-ffi-libraries` gated only on `release_go == 'true'` and the registry existence check, so on `workflow_dispatch` dry-runs it skipped entirely. Its downstream sibling `upload-go-release` already gates on `is_tag || dry_run` but waits for `go-ffi-libraries.result == 'success'`, so no Go assets were ever uploaded to the dry-run release, and the terminal `verify-assets` gate failed with `✗ pattern NOT matched: html-to-markdown-rs-go-*.tar.gz` on every recent retry (the immediate blocker on v3.5.0-rc.1 publish). Added the `(is_tag == 'true' || dry_run == 'true')` clause to `go-ffi-libraries`'s `if:`, mirroring the `kotlin-android-natives` pattern (precedent: commit `e00a56e1` "ci(publish): build Kotlin Android natives on dry_run too").
5961

6062
- **ci(e2e/ruby): drop the explicit `python3 scripts/ci/ruby/vendor-core-crate.py` step from both ruby build jobs in `.github/workflows/ci-e2e.yaml` (committed earlier today as `f23e6d458`); the dead script itself is removed in `f440af4fa`.** The shared `kreuzberg-dev/actions/build-ruby-gem@v1` action now invokes `rewrite-native-deps@v1` internally and vendors the core crate into `packages/ruby/vendor/html-to-markdown/` (no `-rs` suffix). Running the local script first wrote `packages/ruby/vendor/Cargo.toml` with `members = ["html-to-markdown-rs"]` and copied the crate into `vendor/html-to-markdown-rs/`; the subsequent action then created a sibling `vendor/html-to-markdown/` outside that members list, and cargo refused to build it with `error inheriting 'lints' from workspace root manifest's 'workspace.lints'` / `'workspace.lints' was not defined`. The action is now the single source of truth for ruby vendoring.
61-
6263
## [3.5.0-rc.1] - 2026-05-24
6364

6465
### Changed

crates/html-to-markdown/src/converter/list/item.rs

Lines changed: 16 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -209,20 +209,34 @@ pub fn handle_li(
209209
}
210210

211211
let item_start_pos = output.len();
212+
let mut text_end_pos = output.len();
212213

213214
let children = tag.children();
214215
{
215216
for child_handle in children.top().iter() {
217+
let is_nested_list = if let Some(tl::Node::Tag(child_tag)) = child_handle.get(parser) {
218+
let n = normalized_tag_name(child_tag.name().as_utf8_str());
219+
matches!(n.as_ref(), "ul" | "ol")
220+
} else {
221+
false
222+
};
216223
walk_node(child_handle, parser, output, options, &li_ctx, depth + 1, dom_ctx);
224+
if !is_nested_list {
225+
text_end_pos = output.len();
226+
}
217227
}
218228
}
219229

220230
trim_trailing_whitespace(output);
221231

222232
if !ctx.in_table_cell {
223233
if let Some(ref sc) = ctx.structure_collector {
224-
if item_start_pos <= output.len() && output.is_char_boundary(item_start_pos) {
225-
let rendered = &output[item_start_pos..];
234+
let safe_end = text_end_pos.min(output.len());
235+
if item_start_pos <= safe_end
236+
&& output.is_char_boundary(item_start_pos)
237+
&& output.is_char_boundary(safe_end)
238+
{
239+
let rendered = &output[item_start_pos..safe_end];
226240
let content = rendered.trim();
227241
if !content.is_empty() {
228242
sc.borrow_mut().push_list_item(content);
Lines changed: 57 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,57 @@
1+
#![allow(missing_docs)]
2+
3+
//! Nested `ul > li > ul > li > ol` HTML must not produce duplicated content
4+
//! in the Markdown output or in the document structure collector.
5+
6+
use html_to_markdown_rs::{ConversionOptions, NodeContent};
7+
8+
const NESTED_LIST_HTML: &str = "<ul><li>outer<ul><li>mid<ol><li>inner1</li><li>inner2</li></ol></li></ul></li></ul>";
9+
10+
#[test]
11+
fn markdown_output_no_content_duplication() {
12+
let result = html_to_markdown_rs::convert(NESTED_LIST_HTML, None).unwrap();
13+
let content = result.content.unwrap_or_default();
14+
assert_eq!(
15+
content.matches("inner1").count(),
16+
1,
17+
"inner1 must appear exactly once:\n{content}"
18+
);
19+
assert_eq!(
20+
content.matches("inner2").count(),
21+
1,
22+
"inner2 must appear exactly once:\n{content}"
23+
);
24+
}
25+
26+
#[test]
27+
fn structure_collector_list_items_no_duplication() {
28+
let options = ConversionOptions {
29+
include_document_structure: true,
30+
..Default::default()
31+
};
32+
let result = html_to_markdown_rs::convert(NESTED_LIST_HTML, Some(options)).unwrap();
33+
let doc = result.document.expect("document structure must be populated");
34+
35+
let list_item_texts: Vec<&str> = doc
36+
.nodes
37+
.iter()
38+
.filter_map(|node| {
39+
if let NodeContent::ListItem { text } = &node.content {
40+
Some(text.as_str())
41+
} else {
42+
None
43+
}
44+
})
45+
.collect();
46+
47+
let inner1_count = list_item_texts.iter().filter(|t| t.contains("inner1")).count();
48+
let inner2_count = list_item_texts.iter().filter(|t| t.contains("inner2")).count();
49+
assert_eq!(
50+
inner1_count, 1,
51+
"inner1 must appear in exactly one ListItem text; got items: {list_item_texts:?}"
52+
);
53+
assert_eq!(
54+
inner2_count, 1,
55+
"inner2 must appear in exactly one ListItem text; got items: {list_item_texts:?}"
56+
);
57+
}

0 commit comments

Comments
 (0)