Add strikethrough text to HTML generator by LonelyMidoriya · Pull Request #379 · opendataloader-project/opendataloader-pdf

LonelyMidoriya · 2026-04-01T15:32:14Z

Update strikethrough text usage in Markdown

Checklist:

Documentation has been updated, if necessary.
Examples have been added, if necessary.
Tests have been added, if necessary.

Summary by CodeRabbit

Refactor
- Improved text extraction and formatting for lists, captions, paragraphs, and headings: content is now assembled from structured lines and columns, which gives accurate spacing and supports multi-line and multi-column text.
- Strikethrough handling moved to a detection flag: HTML renders deletion markup (…) and Markdown uses inline markers (…) based on detected formatting rather than pre-wrapped text.
Tests
- Updated tests to assert strikethrough detection via the formatting flag and removed obsolete marker-wrapping checks.

coderabbitai · 2026-04-01T15:32:30Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Walkthrough

Replaces ad-hoc string use with structured-text extraction: adds GeneratorUtils to build text from columns/lines/chunks and apply strikethrough markers at render time; generators (HTML/Markdown) use it to render text/strikethrough; StrikethroughProcessor now sets a boolean flag on chunks; tests updated accordingly.
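The flag-based approach in the walkthrough can be illustrated with a minimal sketch. `Chunk` and `render` below are hypothetical stand-ins, not the project's `TextChunk` or `GeneratorUtils` API; the point is only that the chunk value stays untouched and each generator chooses its own markers at render time.

```java
import java.util.List;

final class StrikethroughRenderSketch {
    // Hypothetical stand-in for TextChunk: only the value and the flag are modeled.
    static final class Chunk {
        final String value;
        final boolean strikethrough;

        Chunk(String value, boolean strikethrough) {
            this.value = value;
            this.strikethrough = strikethrough;
        }
    }

    // Joins chunk values, wrapping flagged chunks in the markers chosen by the caller.
    static String render(List<Chunk> chunks, String opening, String closing) {
        StringBuilder sb = new StringBuilder();
        for (Chunk chunk : chunks) {
            if (chunk.strikethrough) {
                sb.append(opening).append(chunk.value).append(closing);
            } else {
                sb.append(chunk.value);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Chunk> line = List.of(new Chunk("Price: ", false), new Chunk("10 USD", true));
        System.out.println(render(line, "<del>", "</del>")); // HTML-style markers
        System.out.println(render(line, "~~", "~~"));        // Markdown-style markers
    }
}
```

Because the markers are parameters, the same traversal serves both output formats without the processor mutating the text.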

Changes

Cohort / File(s)	Summary
Generator utilities `java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/GeneratorUtils.java`	New utility class that builds plain text from `SemanticTextNode`/`TextLine`/`TextChunk` structures, inserting configurable opening/closing strikethrough markers per chunk.
HTML generator `java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/html/HtmlGenerator.java`	Replaced direct `getValue()`/`toString()` usage with `GeneratorUtils` text extraction for lists, captions, paragraphs, and headings; renders strikethrough chunks as `<del>...</del>`; added strikethrough tag constants and broadened import.
Markdown generator `java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/markdown/MarkdownGenerator.java`	Introduced `strikethroughTextSyntax` constant and switched list and semantic-text rendering to use `GeneratorUtils` to produce text with `~~` markers before Markdown sanitization.
Strikethrough processing `java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/StrikethroughProcessor.java`	Removed mutation that wrapped chunk text with `~~`; processor now marks chunks via `setIsStrikethroughText(true)` only, leaving chunk `value` unchanged.
Tests `java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/StrikethroughProcessorTest.java`	Updated assertions to check `getIsStrikethroughText()` boolean flags instead of expecting `~~...~~`-wrapped `getValue()`; removed test for double-wrapping prevention.

Sequence Diagram(s)

sequenceDiagram
    participant P as StrikethroughProcessor
    participant M as TextModel\n(TextChunk/TextLine/TextColumn)
    participant G as Generator\n(HtmlGenerator/MarkdownGenerator)
    participant U as GeneratorUtils
    participant O as Output\n(HTML/Markdown)

    P->>M: detect strikethroughs\nsetIsStrikethroughText(true)
    Note right of M: chunk.value unchanged\nboolean flag set
    G->>M: request semantic nodes (columns/blocks/lines)
    G->>U: for each node, call getTextFromTextNode / getTextFromLines
    U->>M: traverse lines -> chunks, wrap chunk text when isStrikethroughText
    U-->>G: return assembled string with strikethrough markers
    G->>O: emit formatted output\n(<del>...</del> for HTML, ~~...~~ for MD)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Title check	⚠️ Warning	The title mentions only HTML generator strikethrough text, but the PR substantially modifies HTML, Markdown, and utility classes with broader refactoring of how strikethrough text is processed and rendered.	Revise the title to reflect the full scope, such as 'Refactor strikethrough text handling across HTML and Markdown generators' or 'Add strikethrough text support to HTML and Markdown generators with GeneratorUtils'.
Docstring Coverage	⚠️ Warning	Docstring coverage is 22.73% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/html/HtmlGenerator.java`:
- Around line 320-333: getTextFromLines writes raw chunk.getValue() into HTML
and can cause XSS; add a helper method (e.g., escapeHtmlContent) on
HtmlGenerator that returns "" for null and replaces &, <, > (and optionally "
and ') with their HTML entities, then call escapeHtmlContent(chunk.getValue())
wherever chunk.getValue() is appended (both inside the del branch and the plain
branch) so all TextChunk values are safely escaped before being appended to the
StringBuilder; reference methods: getTextFromLines(List<TextLine>,
StringBuilder) and create/use escapeHtmlContent(String).

In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/markdown/MarkdownGenerator.java`:
- Around line 251-258: Extract the column-joining code in MarkdownGenerator into
a reusable helper like HtmlGenerator's getTextFromColumns: create a private
method (e.g., getTextFromColumns(TextNode textNode, StringBuilder sb) or
getTextFromColumns(List<TextColumn> columns, StringBuilder sb)) that iterates
TextColumn objects, calls existing getTextFromLines(column.getLines(), sb), and
appends a space between non-final columns; then replace the current inline loop
in the method with a call to this new helper to keep behavior identical and
avoid duplication.
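The column-joining helper suggested in the second comment might look roughly like the sketch below. Columns are modeled as plain lists of strings, and `appendLines` is a hypothetical stand-in for the existing `getTextFromLines`; the real line-end handling may differ.

```java
import java.util.List;

final class ColumnJoinSketch {
    // Hypothetical stand-in for getTextFromLines: appends the lines of one column,
    // separated by single spaces (an assumption made for this sketch).
    static void appendLines(List<String> lines, StringBuilder sb) {
        for (int i = 0; i < lines.size(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(lines.get(i));
        }
    }

    // Appends every column's text, inserting one space between non-final columns.
    static void appendColumns(List<List<String>> columns, StringBuilder sb) {
        for (int i = 0; i < columns.size(); i++) {
            appendLines(columns.get(i), sb);
            if (i < columns.size() - 1) {
                sb.append(' ');
            }
        }
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        appendColumns(List.of(List.of("left one", "left two"), List.of("right one")), sb);
        System.out.println(sb);
    }
}
```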

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: e4f34aae-0bff-452b-8ff9-1dc4fb1d0d7f

📥 Commits

Reviewing files that changed from the base of the PR and between b31557d and 25c8872.

📒 Files selected for processing (3)

java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/html/HtmlGenerator.java
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/markdown/MarkdownGenerator.java
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/StrikethroughProcessor.java

💤 Files with no reviewable changes (1)

java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/StrikethroughProcessor.java

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/StrikethroughProcessorTest.java (1)
30-37: 🧹 Nitpick | 🔵 Trivial

Note: Related javadoc in production code appears outdated.

The javadoc at StrikethroughProcessor.java:29-35 still states the processor "wraps affected TextChunk values with ~~ markdown syntax," but per the refactoring, it now only sets a boolean flag. Consider updating that documentation in a follow-up.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/StrikethroughProcessorTest.java`
around lines 30 - 37, The javadoc for StrikethroughProcessor is outdated: update
the class/method Javadoc in StrikethroughProcessor (the comment around the class
header formerly saying it "wraps affected TextChunk values with ~~ markdown
syntax") to reflect the current behavior — that it no longer mutates text but
instead sets a boolean flag (e.g., isStrikethrough) on TextChunk objects; modify
the description to mention the flag-setting behavior and any effects on
downstream rendering and remove references to markdown wrapping so the docs
match the implementation.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In
`@java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/StrikethroughProcessorTest.java`:
- Around line 30-37: The javadoc for StrikethroughProcessor is outdated: update
the class/method Javadoc in StrikethroughProcessor (the comment around the class
header formerly saying it "wraps affected TextChunk values with ~~ markdown
syntax") to reflect the current behavior — that it no longer mutates text but
instead sets a boolean flag (e.g., isStrikethrough) on TextChunk objects; modify
the description to mention the flag-setting behavior and any effects on
downstream rendering and remove references to markdown wrapping so the docs
match the implementation.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 03a53da9-6e03-4bd7-bc73-b820dbaa7b26

📥 Commits

Reviewing files that changed from the base of the PR and between 25c8872 and 435af40.

📒 Files selected for processing (1)

java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/StrikethroughProcessorTest.java

codecov · 2026-04-01T15:48:53Z

Codecov Report

❌ Patch coverage is 71.69811% with 15 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
...a/org/opendataloader/pdf/utils/GeneratorUtils.java	70.83%	4 Missing and 3 partials ⚠️
...ava/org/opendataloader/pdf/html/HtmlGenerator.java	69.23%	2 Missing and 2 partials ⚠️
...opendataloader/pdf/markdown/MarkdownGenerator.java	60.00%	2 Missing and 2 partials ⚠️

📢 Thoughts on this report? Let us know!

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (2)

java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/html/HtmlGenerator.java (1)
327-335: ⚠️ Potential issue | 🟠 Major

Missing HTML entity escaping — potential XSS vulnerability.

chunk.getValue() is written directly into HTML output without escaping special characters (<, >, &). This was flagged in a previous review and remains unaddressed. PDF content containing characters like <script> or x < y will be rendered as HTML rather than displayed as text.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/html/HtmlGenerator.java`
around lines 327 - 335, The getTextFromLine method appends raw chunk.getValue()
into HTML, allowing unescaped characters (e.g., <, >, &) and enabling XSS; fix
by HTML-escaping the text before appending: replace direct uses of
chunk.getValue() in getTextFromLine with an escaped string (e.g., use a utility
like StringEscapeUtils.escapeHtml4 or a local escapeHtml method) and append the
escaped value to stringBuilder (still wrapping with "<del>...</del>" for
strikethroughs) so all TextChunk.getValue() output is safely encoded.
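A local escaping helper of the kind this comment asks for could be sketched as follows. The class and method names are hypothetical, and the null-to-empty behavior follows the earlier review suggestion rather than any confirmed project code.

```java
final class HtmlEscapeSketch {
    // Replaces the characters that are significant in HTML text and attribute
    // contexts with entities; null becomes the empty string.
    static String escapeHtml(String text) {
        if (text == null) {
            return "";
        }
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                case '\'': sb.append("&#39;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeHtml("<script>alert(1)</script>"));
        System.out.println(escapeHtml("x < y & z"));
    }
}
```

A single pass over the characters avoids the ordering pitfall of chained replacements, where escaping `&` after `<` would double-escape the entities already written.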
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/markdown/MarkdownGenerator.java (1)
252-258: 🧹 Nitpick | 🔵 Trivial

Consider extracting getTextFromColumns helper for consistency.

This inline traversal pattern (columns → blocks → lines) duplicates the logic in HtmlGenerator.getTextFromColumns(). A previous review suggested extracting this to a helper method, which remains applicable.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/markdown/MarkdownGenerator.java`
around lines 252 - 258, Extract the inline traversal (columns → blocks → lines)
in MarkdownGenerator into a helper named getTextFromColumns (or reuse
HtmlGenerator.getTextFromColumns if it is accessible) and replace the explicit
nested loops over textNode.getColumns() and block.getLines() with a single call
to that helper; the helper should accept the same input used here (e.g., the
TextNode or its columns and a StringBuilder) and call getTextFromLines for each
block's lines to produce the same String value.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/markdown/MarkdownGenerator.java`:
- Around line 273-280: The method getTextFromLines in MarkdownGenerator risks
IndexOutOfBoundsException when textLines is empty; add a guard at the start
(e.g., if (textLines == null || textLines.isEmpty()) return;) so you exit early
for empty input, then keep the existing loop that processes all but the last
line (calling getTextFromLine and TextChunkUtils.formatLineEnd) and handle the
final line only if present; this ensures safe access to
textLines.get(textLines.size() - 1) and preserves current formatting logic.
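The guard described above can be shown on a simplified version of the same loop shape. This is a sketch only: `joinLines` is a hypothetical name, lines are plain strings instead of `TextLine`, a single space stands in for `TextChunkUtils.formatLineEnd`, and the method returns a `String` rather than appending to a caller-supplied builder.

```java
import java.util.List;

final class EmptyLinesGuardSketch {
    // Processes all lines but the last with a separator, then the last line alone.
    static String joinLines(List<String> lines) {
        if (lines == null || lines.isEmpty()) {
            // Without this guard, lines.get(lines.size() - 1) would be get(-1).
            return "";
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines.size() - 1; i++) {
            sb.append(lines.get(i)).append(' ');
        }
        sb.append(lines.get(lines.size() - 1));
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(joinLines(List.of("first", "second")));
        System.out.println(joinLines(List.of()).isEmpty());
    }
}
```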

---

Duplicate comments:
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/html/HtmlGenerator.java`:
- Around line 327-335: The getTextFromLine method appends raw chunk.getValue()
into HTML, allowing unescaped characters (e.g., <, >, &) and enabling XSS; fix
by HTML-escaping the text before appending: replace direct uses of
chunk.getValue() in getTextFromLine with an escaped string (e.g., use a utility
like StringEscapeUtils.escapeHtml4 or a local escapeHtml method) and append the
escaped value to stringBuilder (still wrapping with "<del>...</del>" for
strikethroughs) so all TextChunk.getValue() output is safely encoded.

In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/markdown/MarkdownGenerator.java`:
- Around line 252-258: Extract the inline traversal (columns → blocks → lines)
in MarkdownGenerator into a helper named getTextFromColumns (or reuse
HtmlGenerator.getTextFromColumns if it is accessible) and replace the explicit
nested loops over textNode.getColumns() and block.getLines() with a single call
to that helper; the helper should accept the same input used here (e.g., the
TextNode or its columns and a StringBuilder) and call getTextFromLines for each
block's lines to produce the same String value.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 526d1d45-c268-49cf-a2e1-207d1c810b95

📥 Commits

Reviewing files that changed from the base of the PR and between e06f0fd and 07a41fb.

📒 Files selected for processing (2)

java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/html/HtmlGenerator.java
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/markdown/MarkdownGenerator.java

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/GeneratorUtils.java`:
- Around line 23-32: The getTextFromLines method can throw
IndexOutOfBoundsException when textLines is empty; add a guard at the start of
getTextFromLines to return an empty string (or handle null) if textLines is null
or textLines.isEmpty(), otherwise proceed with the existing loop that calls
getTextFromLine and TextChunkUtils.formatLineEnd; this prevents calling
textLines.get(textLines.size() - 1) on an empty list.
- Around line 34-44: The getTextFromLine method currently appends
chunk.getValue() directly (in GeneratorUtils.getTextFromLine), causing unescaped
HTML when HtmlGenerator supplies "<del>"/"</del>"; update the method signature
to accept an optional escaping callback (e.g., Function<String,String> escapeFn)
or a boolean escapeHtml flag and apply it to each chunk value before appending
(use escapeFn.apply(value) if provided, or escape when flag is true), while
still appending the strikethrough wrappers as-is; update callers (HtmlGenerator
should pass an HTML-escaping function, MarkdownGenerator can pass null/identity)
so that HTML-special characters in TextChunk.getValue() are properly escaped
without escaping the intentional wrapper tags.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9622c5c4-e1e0-4aca-9a38-1a7bebe696b5

📥 Commits

Reviewing files that changed from the base of the PR and between 07a41fb and c35bed4.

📒 Files selected for processing (3)

java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/html/HtmlGenerator.java
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/markdown/MarkdownGenerator.java
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/GeneratorUtils.java

coderabbitai · 2026-04-03T09:27:47Z

+    public static String getTextFromLines(List<TextLine> textLines, String strikethroughTextOpening, String strikethroughTextClosing) {
+        StringBuilder stringBuilder = new StringBuilder();
+        for (int i = 0; i < textLines.size() - 1; i++) {
+            TextLine line = textLines.get(i);
+            getTextFromLine(line, stringBuilder, strikethroughTextOpening, strikethroughTextClosing);
+            TextChunkUtils.formatLineEnd(stringBuilder);
+        }
+        getTextFromLine(textLines.get(textLines.size() - 1), stringBuilder, strikethroughTextOpening, strikethroughTextClosing);
+        return stringBuilder.toString();
+    }


⚠️ Potential issue | 🔴 Critical

IndexOutOfBoundsException when textLines is empty.

Line 30 accesses textLines.get(textLines.size() - 1) without checking if the list is empty. If an empty list is passed, textLines.size() - 1 equals -1, causing an exception.

🐛 Proposed fix to guard against empty list

     public static String getTextFromLines(List<TextLine> textLines, String strikethroughTextOpening, String strikethroughTextClosing) {
+        if (textLines == null || textLines.isEmpty()) {
+            return "";
+        }
         StringBuilder stringBuilder = new StringBuilder();
         for (int i = 0; i < textLines.size() - 1; i++) {
             TextLine line = textLines.get(i);
             getTextFromLine(line, stringBuilder, strikethroughTextOpening, strikethroughTextClosing);
             TextChunkUtils.formatLineEnd(stringBuilder);
         }
         getTextFromLine(textLines.get(textLines.size() - 1), stringBuilder, strikethroughTextOpening, strikethroughTextClosing);
         return stringBuilder.toString();
     }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-    public static String getTextFromLines(List<TextLine> textLines, String strikethroughTextOpening, String strikethroughTextClosing) {
-        StringBuilder stringBuilder = new StringBuilder();
-        for (int i = 0; i < textLines.size() - 1; i++) {
-            TextLine line = textLines.get(i);
-            getTextFromLine(line, stringBuilder, strikethroughTextOpening, strikethroughTextClosing);
-            TextChunkUtils.formatLineEnd(stringBuilder);
-        }
-        getTextFromLine(textLines.get(textLines.size() - 1), stringBuilder, strikethroughTextOpening, strikethroughTextClosing);
-        return stringBuilder.toString();
-    }
+    public static String getTextFromLines(List<TextLine> textLines, String strikethroughTextOpening, String strikethroughTextClosing) {
+        if (textLines == null || textLines.isEmpty()) {
+            return "";
+        }
+        StringBuilder stringBuilder = new StringBuilder();
+        for (int i = 0; i < textLines.size() - 1; i++) {
+            TextLine line = textLines.get(i);
+            getTextFromLine(line, stringBuilder, strikethroughTextOpening, strikethroughTextClosing);
+            TextChunkUtils.formatLineEnd(stringBuilder);
+        }
+        getTextFromLine(textLines.get(textLines.size() - 1), stringBuilder, strikethroughTextOpening, strikethroughTextClosing);
+        return stringBuilder.toString();
+    }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/GeneratorUtils.java` around lines 23 - 32, The getTextFromLines method can throw IndexOutOfBoundsException when textLines is empty; add a guard at the start of getTextFromLines to return an empty string (or handle null) if textLines is null or textLines.isEmpty(), otherwise proceed with the existing loop that calls getTextFromLine and TextChunkUtils.formatLineEnd; this prevents calling textLines.get(textLines.size() - 1) on an empty list.

coderabbitai · 2026-04-03T09:27:47Z

+    public static void getTextFromLine(TextLine line, StringBuilder stringBuilder, String strikethroughTextOpening, String strikethroughTextClosing) {
+        for (TextChunk chunk : line.getTextChunks()) {
+            if (chunk.getIsStrikethroughText()) {
+                stringBuilder.append(strikethroughTextOpening);
+            }
+            stringBuilder.append(chunk.getValue());
+            if (chunk.getIsStrikethroughText()) {
+                stringBuilder.append(strikethroughTextClosing);
+            }
+        }
+    }


⚠️ Potential issue | 🟠 Major

Chunk values are not escaped — XSS risk when used with HTML.

chunk.getValue() is appended directly without escaping. When HtmlGenerator passes "<del>"/"</del>" as wrappers, any special HTML characters in the PDF content (e.g., <script> or x < y) will be rendered as HTML rather than displayed as text.

Consider either:

- Having HtmlGenerator escape the final string (but this would also escape the intentional <del> tags), or
- Accepting an optional escaping function parameter, or
- Adding an escapeHtml parameter/flag to this method

🔒 Option: Add escaping callback parameter

+import java.util.function.Function;
+
 public class GeneratorUtils {
-    public static String getTextFromTextNode(SemanticTextNode textNode, String strikethroughTextOpening, String strikethroughTextClosing) {
+    public static String getTextFromTextNode(SemanticTextNode textNode, String strikethroughTextOpening, String strikethroughTextClosing, Function<String, String> escapeFunction) {
         StringBuilder stringBuilder = new StringBuilder();
         for (TextColumn column : textNode.getColumns()) {
             for (TextBlock block : column.getBlocks()) {
-                stringBuilder.append(getTextFromLines(block.getLines(), strikethroughTextOpening, strikethroughTextClosing));
+                stringBuilder.append(getTextFromLines(block.getLines(), strikethroughTextOpening, strikethroughTextClosing, escapeFunction));
             }
         }
         return stringBuilder.toString();
     }

-    public static void getTextFromLine(TextLine line, StringBuilder stringBuilder, String strikethroughTextOpening, String strikethroughTextClosing) {
+    public static void getTextFromLine(TextLine line, StringBuilder stringBuilder, String strikethroughTextOpening, String strikethroughTextClosing, Function<String, String> escapeFunction) {
         for (TextChunk chunk : line.getTextChunks()) {
             if (chunk.getIsStrikethroughText()) {
                 stringBuilder.append(strikethroughTextOpening);
             }
-            stringBuilder.append(chunk.getValue());
+            String value = escapeFunction != null ? escapeFunction.apply(chunk.getValue()) : chunk.getValue();
+            stringBuilder.append(value);
             if (chunk.getIsStrikethroughText()) {
                 stringBuilder.append(strikethroughTextClosing);
             }
         }
     }

Then HtmlGenerator would pass an escape function while MarkdownGenerator passes null or Function.identity().
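How the two callers might use such a callback is sketched below. Everything here is illustrative: `Chunk`, `render`, and `escapeHtml` are hypothetical names, and the escaper covers only the three characters named in the review.

```java
import java.util.List;
import java.util.function.Function;

final class EscapeCallbackSketch {
    // Hypothetical stand-in for TextChunk.
    static final class Chunk {
        final String value;
        final boolean strikethrough;

        Chunk(String value, boolean strikethrough) {
            this.value = value;
            this.strikethrough = strikethrough;
        }
    }

    // Escapes each chunk value through the callback, then adds the wrappers unescaped.
    static String render(List<Chunk> chunks, String opening, String closing,
                         Function<String, String> escapeFunction) {
        StringBuilder sb = new StringBuilder();
        for (Chunk chunk : chunks) {
            String value = escapeFunction != null ? escapeFunction.apply(chunk.value) : chunk.value;
            if (chunk.strikethrough) {
                sb.append(opening);
            }
            sb.append(value);
            if (chunk.strikethrough) {
                sb.append(closing);
            }
        }
        return sb.toString();
    }

    // Minimal escaper; '&' is replaced first so later entities are not double-escaped.
    static String escapeHtml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        List<Chunk> chunks = List.of(new Chunk("x < y", true), new Chunk(" & done", false));
        // HTML caller: content is escaped, the <del> wrappers are not.
        System.out.println(render(chunks, "<del>", "</del>", EscapeCallbackSketch::escapeHtml));
        // Markdown caller: identity leaves the content as-is for later sanitization.
        System.out.println(render(chunks, "~~", "~~", Function.identity()));
    }
}
```

Escaping per chunk, before the wrappers are appended, is what keeps the intentional tags intact while neutralizing markup that originates from the PDF content.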

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/GeneratorUtils.java` around lines 34 - 44, The getTextFromLine method currently appends chunk.getValue() directly (in GeneratorUtils.getTextFromLine), causing unescaped HTML when HtmlGenerator supplies "<del>"/"</del>"; update the method signature to accept an optional escaping callback (e.g., Function<String,String> escapeFn) or a boolean escapeHtml flag and apply it to each chunk value before appending (use escapeFn.apply(value) if provided, or escape when flag is true), while still appending the strikethrough wrappers as-is; update callers (HtmlGenerator should pass an HTML-escaping function, MarkdownGenerator can pass null/identity) so that HTML-special characters in TextChunk.getValue() are properly escaped without escaping the intentional wrapper tags.

LonelyMidoriya self-assigned this Apr 1, 2026

LonelyMidoriya requested review from MaximPlusov, bundolee, hnc-jglee and hyunhee-jo as code owners April 1, 2026 15:32

coderabbitai Bot reviewed Apr 1, 2026

View reviewed changes

Comment thread java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/html/HtmlGenerator.java Outdated

Comment thread ...opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/markdown/MarkdownGenerator.java Outdated

coderabbitai Bot reviewed Apr 1, 2026

View reviewed changes

coderabbitai Bot reviewed Apr 3, 2026

View reviewed changes

Comment thread ...opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/markdown/MarkdownGenerator.java Outdated

coderabbitai Bot reviewed Apr 3, 2026

View reviewed changes

LonelyMidoriya added 6 commits April 6, 2026 10:24

Add strikethrough text to HTML generator

b28993b

Update strikethrough text usage in Markdown

Update StrikethroughProcessorTest.java

abae772

Update documentation

99ec1af

Update getTextFromLines

79372ff

Add GeneratorUtils class

6c387c0

Update GeneratorUtils

fb28a31

LonelyMidoriya force-pushed the strikethrough-text branch from 461abbe to fb28a31 Compare April 6, 2026 07:26

Update documentation

73e28a0

MaximPlusov approved these changes Apr 6, 2026

View reviewed changes

MaximPlusov merged commit 071011e into main Apr 6, 2026
10 checks passed

MaximPlusov deleted the strikethrough-text branch April 6, 2026 07:38

coderabbitai Bot mentioned this pull request Apr 13, 2026

Fix table border processor #421

Merged

coderabbitai Bot mentioned this pull request Apr 28, 2026

fix: detect strikethroughs from line art in tables #459

Open

6 tasks

MaximPlusov merged 7 commits into main from strikethrough-text
