Skip to content

Add strikethrough text to HTML generator#379

Merged
MaximPlusov merged 7 commits intomainfrom
strikethrough-text
Apr 6, 2026
Merged

Add strikethrough text to HTML generator#379
MaximPlusov merged 7 commits intomainfrom
strikethrough-text

Conversation

@LonelyMidoriya
Copy link
Copy Markdown
Contributor

@LonelyMidoriya LonelyMidoriya commented Apr 1, 2026

Update strikethrough text usage in Markdown

Checklist:

  • Documentation has been updated, if necessary.
  • Examples have been added, if necessary.
  • Tests have been added, if necessary.

Summary by CodeRabbit

  • Refactor

    • Improved text extraction and formatting for lists, captions, paragraphs, and headings by assembling content from structured lines/columns for accurate spacing, multi-line and multi-column support.
    • Strikethrough handling moved to a detection flag: HTML renders deletion markup () and Markdown uses inline markers () based on detected formatting rather than pre-wrapped text.
  • Tests

    • Updated tests to assert strikethrough detection via the formatting flag and removed obsolete marker-wrapping checks.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 1, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Replaces ad-hoc string use with structured-text extraction: adds GeneratorUtils to build text from columns/lines/chunks and apply strikethrough markers at render time; generators (HTML/Markdown) use it to render text/strikethrough; StrikethroughProcessor now sets a boolean flag on chunks; tests updated accordingly.

Changes

Cohort / File(s) Summary
Generator utilities
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/GeneratorUtils.java
New utility class that builds plain text from SemanticTextNode/TextLine/TextChunk structures, inserting configurable opening/closing strikethrough markers per chunk.
HTML generator
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/html/HtmlGenerator.java
Replaced direct getValue()/toString() usage with GeneratorUtils text extraction for lists, captions, paragraphs, and headings; renders strikethrough chunks as <del>...</del>; added strikethrough tag constants and broadened import.
Markdown generator
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/markdown/MarkdownGenerator.java
Introduced strikethroughTextSyntax constant and switched list and semantic-text rendering to use GeneratorUtils to produce text with ~~ markers before Markdown sanitization.
Strikethrough processing
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/StrikethroughProcessor.java
Removed mutation that wrapped chunk text with ~~; processor now marks chunks via setIsStrikethroughText(true) only, leaving chunk value unchanged.
Tests
java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/StrikethroughProcessorTest.java
Updated assertions to check getIsStrikethroughText() boolean flags instead of expecting ~~...~~-wrapped getValue(); removed test for double-wrapping prevention.

Sequence Diagram(s)

sequenceDiagram
    participant P as StrikethroughProcessor
    participant M as TextModel\n(TextChunk/TextLine/TextColumn)
    participant G as Generator\n(HtmlGenerator/MarkdownGenerator)
    participant U as GeneratorUtils
    participant O as Output\n(HTML/Markdown)

    P->>M: detect strikethroughs\nsetIsStrikethroughText(true)
    Note right of M: chunk.value unchanged\nboolean flag set
    G->>M: request semantic nodes (columns/blocks/lines)
    G->>U: for each node, call getTextFromTextNode / getTextFromLines
    U->>M: traverse lines -> chunks, wrap chunk text when isStrikethroughText
    U-->>G: return assembled string with strikethrough markers
    G->>O: emit formatted output\n(<del>...</del> for HTML, ~~...~~ for MD)
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Title check ⚠️ Warning The title mentions only HTML generator strikethrough text, but the PR substantially modifies HTML, Markdown, and utility classes with broader refactoring of how strikethrough text is processed and rendered. Revise the title to reflect the full scope, such as 'Refactor strikethrough text handling across HTML and Markdown generators' or 'Add strikethrough text support to HTML and Markdown generators with GeneratorUtils'.
Docstring Coverage ⚠️ Warning Docstring coverage is 22.73% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (1 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/html/HtmlGenerator.java`:
- Around line 320-333: getTextFromLines writes raw chunk.getValue() into HTML
and can cause XSS; add a helper method (e.g., escapeHtmlContent) on
HtmlGenerator that returns "" for null and replaces &, <, > (and optionally "
and ') with their HTML entities, then call escapeHtmlContent(chunk.getValue())
wherever chunk.getValue() is appended (both inside the del branch and the plain
branch) so all TextChunk values are safely escaped before being appended to the
StringBuilder; reference methods: getTextFromLines(List<TextLine>,
StringBuilder) and create/use escapeHtmlContent(String).

In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/markdown/MarkdownGenerator.java`:
- Around line 251-258: Extract the column-joining code in MarkdownGenerator into
a reusable helper like HtmlGenerator's getTextFromColumns: create a private
method (e.g., getTextFromColumns(TextNode textNode, StringBuilder sb) or
getTextFromColumns(List<TextColumn> columns, StringBuilder sb)) that iterates
TextColumn objects, calls existing getTextFromLines(column.getLines(), sb), and
appends a space between non-final columns; then replace the current inline loop
in the method with a call to this new helper to keep behavior identical and
avoid duplication.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: e4f34aae-0bff-452b-8ff9-1dc4fb1d0d7f

📥 Commits

Reviewing files that changed from the base of the PR and between b31557d and 25c8872.

📒 Files selected for processing (3)
  • java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/html/HtmlGenerator.java
  • java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/markdown/MarkdownGenerator.java
  • java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/StrikethroughProcessor.java
💤 Files with no reviewable changes (1)
  • java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/StrikethroughProcessor.java

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/StrikethroughProcessorTest.java (1)

30-37: 🧹 Nitpick | 🔵 Trivial

Note: Related javadoc in production code appears outdated.

The javadoc at StrikethroughProcessor.java:29-35 still states the processor "wraps affected TextChunk values with ~~ markdown syntax," but per the refactoring, it now only sets a boolean flag. Consider updating that documentation in a follow-up.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/StrikethroughProcessorTest.java`
around lines 30 - 37, The javadoc for StrikethroughProcessor is outdated: update
the class/method Javadoc in StrikethroughProcessor (the comment around the class
header formerly saying it "wraps affected TextChunk values with ~~ markdown
syntax") to reflect the current behavior — that it no longer mutates text but
instead sets a boolean flag (e.g., isStrikethrough) on TextChunk objects; modify
the description to mention the flag-setting behavior and any effects on
downstream rendering and remove references to markdown wrapping so the docs
match the implementation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In
`@java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/StrikethroughProcessorTest.java`:
- Around line 30-37: The javadoc for StrikethroughProcessor is outdated: update
the class/method Javadoc in StrikethroughProcessor (the comment around the class
header formerly saying it "wraps affected TextChunk values with ~~ markdown
syntax") to reflect the current behavior — that it no longer mutates text but
instead sets a boolean flag (e.g., isStrikethrough) on TextChunk objects; modify
the description to mention the flag-setting behavior and any effects on
downstream rendering and remove references to markdown wrapping so the docs
match the implementation.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 03a53da9-6e03-4bd7-bc73-b820dbaa7b26

📥 Commits

Reviewing files that changed from the base of the PR and between 25c8872 and 435af40.

📒 Files selected for processing (1)
  • java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/StrikethroughProcessorTest.java

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 1, 2026

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (2)
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/html/HtmlGenerator.java (1)

327-335: ⚠️ Potential issue | 🟠 Major

Missing HTML entity escaping — potential XSS vulnerability.

chunk.getValue() is written directly into HTML output without escaping special characters (<, >, &). This was flagged in a previous review and remains unaddressed. PDF content containing characters like <script> or x < y will be rendered as HTML rather than displayed as text.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/html/HtmlGenerator.java`
around lines 327 - 335, The getTextFromLine method appends raw chunk.getValue()
into HTML, allowing unescaped characters (e.g., <, >, &) and enabling XSS; fix
by HTML-escaping the text before appending: replace direct uses of
chunk.getValue() in getTextFromLine with an escaped string (e.g., use a utility
like StringEscapeUtils.escapeHtml4 or a local escapeHtml method) and append the
escaped value to stringBuilder (still wrapping with "<del>...</del>" for
strikethroughs) so all TextChunk.getValue() output is safely encoded.
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/markdown/MarkdownGenerator.java (1)

252-258: 🧹 Nitpick | 🔵 Trivial

Consider extracting getTextFromColumns helper for consistency.

This inline traversal pattern (columns → blocks → lines) duplicates the logic in HtmlGenerator.getTextFromColumns(). A previous review suggested extracting this to a helper method, which remains applicable.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/markdown/MarkdownGenerator.java`
around lines 252 - 258, Extract the inline traversal (columns → blocks → lines)
in MarkdownGenerator into a helper named getTextFromColumns (or reuse
HtmlGenerator.getTextFromColumns if it is accessible) and replace the explicit
nested loops over textNode.getColumns() and block.getLines() with a single call
to that helper; the helper should accept the same input used here (e.g., the
TextNode or its columns and a StringBuilder) and call getTextFromLines for each
block's lines to produce the same String value.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/markdown/MarkdownGenerator.java`:
- Around line 273-280: The method getTextFromLines in MarkdownGenerator risks
IndexOutOfBoundsException when textLines is empty; add a guard at the start
(e.g., if (textLines == null || textLines.isEmpty()) return;) so you exit early
for empty input, then keep the existing loop that processes all but the last
line (calling getTextFromLine and TextChunkUtils.formatLineEnd) and handle the
final line only if present; this ensures safe access to
textLines.get(textLines.size() - 1) and preserves current formatting logic.

---

Duplicate comments:
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/html/HtmlGenerator.java`:
- Around line 327-335: The getTextFromLine method appends raw chunk.getValue()
into HTML, allowing unescaped characters (e.g., <, >, &) and enabling XSS; fix
by HTML-escaping the text before appending: replace direct uses of
chunk.getValue() in getTextFromLine with an escaped string (e.g., use a utility
like StringEscapeUtils.escapeHtml4 or a local escapeHtml method) and append the
escaped value to stringBuilder (still wrapping with "<del>...</del>" for
strikethroughs) so all TextChunk.getValue() output is safely encoded.

In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/markdown/MarkdownGenerator.java`:
- Around line 252-258: Extract the inline traversal (columns → blocks → lines)
in MarkdownGenerator into a helper named getTextFromColumns (or reuse
HtmlGenerator.getTextFromColumns if it is accessible) and replace the explicit
nested loops over textNode.getColumns() and block.getLines() with a single call
to that helper; the helper should accept the same input used here (e.g., the
TextNode or its columns and a StringBuilder) and call getTextFromLines for each
block's lines to produce the same String value.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 526d1d45-c268-49cf-a2e1-207d1c810b95

📥 Commits

Reviewing files that changed from the base of the PR and between e06f0fd and 07a41fb.

📒 Files selected for processing (2)
  • java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/html/HtmlGenerator.java
  • java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/markdown/MarkdownGenerator.java

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/GeneratorUtils.java`:
- Around line 23-32: The getTextFromLines method can throw
IndexOutOfBoundsException when textLines is empty; add a guard at the start of
getTextFromLines to return an empty string (or handle null) if textLines is null
or textLines.isEmpty(), otherwise proceed with the existing loop that calls
getTextFromLine and TextChunkUtils.formatLineEnd; this prevents calling
textLines.get(textLines.size() - 1) on an empty list.
- Around line 34-44: The getTextFromLine method currently appends
chunk.getValue() directly (in GeneratorUtils.getTextFromLine), causing unescaped
HTML when HtmlGenerator supplies "<del>"/"</del>"; update the method signature
to accept an optional escaping callback (e.g., Function<String,String> escapeFn)
or a boolean escapeHtml flag and apply it to each chunk value before appending
(use escapeFn.apply(value) if provided, or escape when flag is true), while
still appending the strikethrough wrappers as-is; update callers (HtmlGenerator
should pass an HTML-escaping function, MarkdownGenerator can pass null/identity)
so that HTML-special characters in TextChunk.getValue() are properly escaped
without escaping the intentional wrapper tags.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9622c5c4-e1e0-4aca-9a38-1a7bebe696b5

📥 Commits

Reviewing files that changed from the base of the PR and between 07a41fb and c35bed4.

📒 Files selected for processing (3)
  • java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/html/HtmlGenerator.java
  • java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/markdown/MarkdownGenerator.java
  • java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/GeneratorUtils.java

Comment on lines +23 to +32
public static String getTextFromLines(List<TextLine> textLines, String strikethroughTextOpening, String strikethroughTextClosing) {
StringBuilder stringBuilder = new StringBuilder();
for (int i = 0; i < textLines.size() - 1; i++) {
TextLine line = textLines.get(i);
getTextFromLine(line, stringBuilder, strikethroughTextOpening, strikethroughTextClosing);
TextChunkUtils.formatLineEnd(stringBuilder);
}
getTextFromLine(textLines.get(textLines.size() - 1), stringBuilder, strikethroughTextOpening, strikethroughTextClosing);
return stringBuilder.toString();
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

IndexOutOfBoundsException when textLines is empty.

Line 30 accesses textLines.get(textLines.size() - 1) without checking if the list is empty. If an empty list is passed, textLines.size() - 1 equals -1, causing an exception.

🐛 Proposed fix to guard against empty list
 public static String getTextFromLines(List<TextLine> textLines, String strikethroughTextOpening, String strikethroughTextClosing) {
+    if (textLines == null || textLines.isEmpty()) {
+        return "";
+    }
     StringBuilder stringBuilder = new StringBuilder();
     for (int i = 0; i < textLines.size() - 1; i++) {
         TextLine line = textLines.get(i);
         getTextFromLine(line, stringBuilder, strikethroughTextOpening, strikethroughTextClosing);
         TextChunkUtils.formatLineEnd(stringBuilder);
     }
     getTextFromLine(textLines.get(textLines.size() - 1), stringBuilder, strikethroughTextOpening, strikethroughTextClosing);
     return stringBuilder.toString();
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
public static String getTextFromLines(List<TextLine> textLines, String strikethroughTextOpening, String strikethroughTextClosing) {
StringBuilder stringBuilder = new StringBuilder();
for (int i = 0; i < textLines.size() - 1; i++) {
TextLine line = textLines.get(i);
getTextFromLine(line, stringBuilder, strikethroughTextOpening, strikethroughTextClosing);
TextChunkUtils.formatLineEnd(stringBuilder);
}
getTextFromLine(textLines.get(textLines.size() - 1), stringBuilder, strikethroughTextOpening, strikethroughTextClosing);
return stringBuilder.toString();
}
public static String getTextFromLines(List<TextLine> textLines, String strikethroughTextOpening, String strikethroughTextClosing) {
if (textLines == null || textLines.isEmpty()) {
return "";
}
StringBuilder stringBuilder = new StringBuilder();
for (int i = 0; i < textLines.size() - 1; i++) {
TextLine line = textLines.get(i);
getTextFromLine(line, stringBuilder, strikethroughTextOpening, strikethroughTextClosing);
TextChunkUtils.formatLineEnd(stringBuilder);
}
getTextFromLine(textLines.get(textLines.size() - 1), stringBuilder, strikethroughTextOpening, strikethroughTextClosing);
return stringBuilder.toString();
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/GeneratorUtils.java`
around lines 23 - 32, The getTextFromLines method can throw
IndexOutOfBoundsException when textLines is empty; add a guard at the start of
getTextFromLines to return an empty string (or handle null) if textLines is null
or textLines.isEmpty(), otherwise proceed with the existing loop that calls
getTextFromLine and TextChunkUtils.formatLineEnd; this prevents calling
textLines.get(textLines.size() - 1) on an empty list.

Comment on lines +34 to +44
public static void getTextFromLine(TextLine line, StringBuilder stringBuilder, String strikethroughTextOpening, String strikethroughTextClosing) {
for (TextChunk chunk : line.getTextChunks()) {
if (chunk.getIsStrikethroughText()) {
stringBuilder.append(strikethroughTextOpening);
}
stringBuilder.append(chunk.getValue());
if (chunk.getIsStrikethroughText()) {
stringBuilder.append(strikethroughTextClosing);
}
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Chunk values are not escaped — XSS risk when used with HTML.

chunk.getValue() is appended directly without escaping. When HtmlGenerator passes "<del>"/"</del>" as wrappers, any special HTML characters in the PDF content (e.g., <script> or x < y) will be rendered as HTML rather than displayed as text.

Consider either:

  1. Having HtmlGenerator escape the final string (but this would also escape the intentional <del> tags), or
  2. Accepting an optional escaping function parameter, or
  3. Adding an escapeHtml parameter/flag to this method
🔒 Option: Add escaping callback parameter
+import java.util.function.Function;
+
 public class GeneratorUtils {
-    public static String getTextFromTextNode(SemanticTextNode textNode, String strikethroughTextOpening, String strikethroughTextClosing) {
+    public static String getTextFromTextNode(SemanticTextNode textNode, String strikethroughTextOpening, String strikethroughTextClosing, Function<String, String> escapeFunction) {
         StringBuilder stringBuilder = new StringBuilder();
         for (TextColumn column : textNode.getColumns()) {
             for (TextBlock block : column.getBlocks()) {
-                stringBuilder.append(getTextFromLines(block.getLines(), strikethroughTextOpening, strikethroughTextClosing));
+                stringBuilder.append(getTextFromLines(block.getLines(), strikethroughTextOpening, strikethroughTextClosing, escapeFunction));
             }
         }
         return stringBuilder.toString();
     }
 
-    public static void getTextFromLine(TextLine line, StringBuilder stringBuilder, String strikethroughTextOpening, String strikethroughTextClosing) {
+    public static void getTextFromLine(TextLine line, StringBuilder stringBuilder, String strikethroughTextOpening, String strikethroughTextClosing, Function<String, String> escapeFunction) {
         for (TextChunk chunk : line.getTextChunks()) {
             if (chunk.getIsStrikethroughText()) {
                 stringBuilder.append(strikethroughTextOpening);
             }
-            stringBuilder.append(chunk.getValue());
+            String value = escapeFunction != null ? escapeFunction.apply(chunk.getValue()) : chunk.getValue();
+            stringBuilder.append(value);
             if (chunk.getIsStrikethroughText()) {
                 stringBuilder.append(strikethroughTextClosing);
             }
         }
     }

Then HtmlGenerator would pass an escape function while MarkdownGenerator passes null or Function.identity().

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/GeneratorUtils.java`
around lines 34 - 44, The getTextFromLine method currently appends
chunk.getValue() directly (in GeneratorUtils.getTextFromLine), causing unescaped
HTML when HtmlGenerator supplies "<del>"/"</del>"; update the method signature
to accept an optional escaping callback (e.g., Function<String,String> escapeFn)
or a boolean escapeHtml flag and apply it to each chunk value before appending
(use escapeFn.apply(value) if provided, or escape when flag is true), while
still appending the strikethrough wrappers as-is; update callers (HtmlGenerator
should pass an HTML-escaping function, MarkdownGenerator can pass null/identity)
so that HTML-special characters in TextChunk.getValue() are properly escaped
without escaping the intentional wrapper tags.

@MaximPlusov MaximPlusov merged commit 071011e into main Apr 6, 2026
10 checks passed
@MaximPlusov MaximPlusov deleted the strikethrough-text branch April 6, 2026 07:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants