Add strikethrough text to HTML generator#379
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughReplaces ad-hoc string use with structured-text extraction: adds GeneratorUtils to build text from columns/lines/chunks and apply strikethrough markers at render time; generators (HTML/Markdown) use it to render text/strikethrough; StrikethroughProcessor now sets a boolean flag on chunks; tests updated accordingly. Changes
Sequence Diagram(s)sequenceDiagram
participant P as StrikethroughProcessor
participant M as TextModel\n(TextChunk/TextLine/TextColumn)
participant G as Generator\n(HtmlGenerator/MarkdownGenerator)
participant U as GeneratorUtils
participant O as Output\n(HTML/Markdown)
P->>M: detect strikethroughs\nsetIsStrikethroughText(true)
Note right of M: chunk.value unchanged\nboolean flag set
G->>M: request semantic nodes (columns/blocks/lines)
G->>U: for each node, call getTextFromTextNode / getTextFromLines
U->>M: traverse lines -> chunks, wrap chunk text when isStrikethroughText
U-->>G: return assembled string with strikethrough markers
G->>O: emit formatted output\n(<del>...</del> for HTML, ~~...~~ for MD)
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 1 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/html/HtmlGenerator.java`:
- Around line 320-333: getTextFromLines writes raw chunk.getValue() into HTML
and can cause XSS; add a helper method (e.g., escapeHtmlContent) on
HtmlGenerator that returns "" for null and replaces &, <, > (and optionally "
and ') with their HTML entities, then call escapeHtmlContent(chunk.getValue())
wherever chunk.getValue() is appended (both inside the del branch and the plain
branch) so all TextChunk values are safely escaped before being appended to the
StringBuilder; reference methods: getTextFromLines(List<TextLine>,
StringBuilder) and create/use escapeHtmlContent(String).
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/markdown/MarkdownGenerator.java`:
- Around line 251-258: Extract the column-joining code in MarkdownGenerator into
a reusable helper like HtmlGenerator's getTextFromColumns: create a private
method (e.g., getTextFromColumns(TextNode textNode, StringBuilder sb) or
getTextFromColumns(List<TextColumn> columns, StringBuilder sb)) that iterates
TextColumn objects, calls existing getTextFromLines(column.getLines(), sb), and
appends a space between non-final columns; then replace the current inline loop
in the method with a call to this new helper to keep behavior identical and
avoid duplication.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: e4f34aae-0bff-452b-8ff9-1dc4fb1d0d7f
📒 Files selected for processing (3)
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/html/HtmlGenerator.javajava/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/markdown/MarkdownGenerator.javajava/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/StrikethroughProcessor.java
💤 Files with no reviewable changes (1)
- java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/StrikethroughProcessor.java
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/StrikethroughProcessorTest.java (1)
30-37: 🧹 Nitpick | 🔵 TrivialNote: Related javadoc in production code appears outdated.
The javadoc at
StrikethroughProcessor.java:29-35still states the processor "wraps affected TextChunk values with ~~ markdown syntax," but per the refactoring, it now only sets a boolean flag. Consider updating that documentation in a follow-up.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/StrikethroughProcessorTest.java` around lines 30 - 37, The javadoc for StrikethroughProcessor is outdated: update the class/method Javadoc in StrikethroughProcessor (the comment around the class header formerly saying it "wraps affected TextChunk values with ~~ markdown syntax") to reflect the current behavior — that it no longer mutates text but instead sets a boolean flag (e.g., isStrikethrough) on TextChunk objects; modify the description to mention the flag-setting behavior and any effects on downstream rendering and remove references to markdown wrapping so the docs match the implementation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In
`@java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/StrikethroughProcessorTest.java`:
- Around line 30-37: The javadoc for StrikethroughProcessor is outdated: update
the class/method Javadoc in StrikethroughProcessor (the comment around the class
header formerly saying it "wraps affected TextChunk values with ~~ markdown
syntax") to reflect the current behavior — that it no longer mutates text but
instead sets a boolean flag (e.g., isStrikethrough) on TextChunk objects; modify
the description to mention the flag-setting behavior and any effects on
downstream rendering and remove references to markdown wrapping so the docs
match the implementation.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 03a53da9-6e03-4bd7-bc73-b820dbaa7b26
📒 Files selected for processing (1)
java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/StrikethroughProcessorTest.java
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (2)
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/html/HtmlGenerator.java (1)
327-335:⚠️ Potential issue | 🟠 MajorMissing HTML entity escaping — potential XSS vulnerability.
chunk.getValue()is written directly into HTML output without escaping special characters (<,>,&). This was flagged in a previous review and remains unaddressed. PDF content containing characters like<script>orx < ywill be rendered as HTML rather than displayed as text.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/html/HtmlGenerator.java` around lines 327 - 335, The getTextFromLine method appends raw chunk.getValue() into HTML, allowing unescaped characters (e.g., <, >, &) and enabling XSS; fix by HTML-escaping the text before appending: replace direct uses of chunk.getValue() in getTextFromLine with an escaped string (e.g., use a utility like StringEscapeUtils.escapeHtml4 or a local escapeHtml method) and append the escaped value to stringBuilder (still wrapping with "<del>...</del>" for strikethroughs) so all TextChunk.getValue() output is safely encoded.java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/markdown/MarkdownGenerator.java (1)
252-258: 🧹 Nitpick | 🔵 TrivialConsider extracting
getTextFromColumnshelper for consistency.This inline traversal pattern (columns → blocks → lines) duplicates the logic in
HtmlGenerator.getTextFromColumns(). A previous review suggested extracting this to a helper method, which remains applicable.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/markdown/MarkdownGenerator.java` around lines 252 - 258, Extract the inline traversal (columns → blocks → lines) in MarkdownGenerator into a helper named getTextFromColumns (or reuse HtmlGenerator.getTextFromColumns if it is accessible) and replace the explicit nested loops over textNode.getColumns() and block.getLines() with a single call to that helper; the helper should accept the same input used here (e.g., the TextNode or its columns and a StringBuilder) and call getTextFromLines for each block's lines to produce the same String value.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/markdown/MarkdownGenerator.java`:
- Around line 273-280: The method getTextFromLines in MarkdownGenerator risks
IndexOutOfBoundsException when textLines is empty; add a guard at the start
(e.g., if (textLines == null || textLines.isEmpty()) return;) so you exit early
for empty input, then keep the existing loop that processes all but the last
line (calling getTextFromLine and TextChunkUtils.formatLineEnd) and handle the
final line only if present; this ensures safe access to
textLines.get(textLines.size() - 1) and preserves current formatting logic.
---
Duplicate comments:
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/html/HtmlGenerator.java`:
- Around line 327-335: The getTextFromLine method appends raw chunk.getValue()
into HTML, allowing unescaped characters (e.g., <, >, &) and enabling XSS; fix
by HTML-escaping the text before appending: replace direct uses of
chunk.getValue() in getTextFromLine with an escaped string (e.g., use a utility
like StringEscapeUtils.escapeHtml4 or a local escapeHtml method) and append the
escaped value to stringBuilder (still wrapping with "<del>...</del>" for
strikethroughs) so all TextChunk.getValue() output is safely encoded.
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/markdown/MarkdownGenerator.java`:
- Around line 252-258: Extract the inline traversal (columns → blocks → lines)
in MarkdownGenerator into a helper named getTextFromColumns (or reuse
HtmlGenerator.getTextFromColumns if it is accessible) and replace the explicit
nested loops over textNode.getColumns() and block.getLines() with a single call
to that helper; the helper should accept the same input used here (e.g., the
TextNode or its columns and a StringBuilder) and call getTextFromLines for each
block's lines to produce the same String value.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 526d1d45-c268-49cf-a2e1-207d1c810b95
📒 Files selected for processing (2)
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/html/HtmlGenerator.javajava/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/markdown/MarkdownGenerator.java
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/GeneratorUtils.java`:
- Around line 23-32: The getTextFromLines method can throw
IndexOutOfBoundsException when textLines is empty; add a guard at the start of
getTextFromLines to return an empty string (or handle null) if textLines is null
or textLines.isEmpty(), otherwise proceed with the existing loop that calls
getTextFromLine and TextChunkUtils.formatLineEnd; this prevents calling
textLines.get(textLines.size() - 1) on an empty list.
- Around line 34-44: The getTextFromLine method currently appends
chunk.getValue() directly (in GeneratorUtils.getTextFromLine), causing unescaped
HTML when HtmlGenerator supplies "<del>"/"</del>"; update the method signature
to accept an optional escaping callback (e.g., Function<String,String> escapeFn)
or a boolean escapeHtml flag and apply it to each chunk value before appending
(use escapeFn.apply(value) if provided, or escape when flag is true), while
still appending the strikethrough wrappers as-is; update callers (HtmlGenerator
should pass an HTML-escaping function, MarkdownGenerator can pass null/identity)
so that HTML-special characters in TextChunk.getValue() are properly escaped
without escaping the intentional wrapper tags.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 9622c5c4-e1e0-4aca-9a38-1a7bebe696b5
📒 Files selected for processing (3)
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/html/HtmlGenerator.javajava/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/markdown/MarkdownGenerator.javajava/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/GeneratorUtils.java
| public static String getTextFromLines(List<TextLine> textLines, String strikethroughTextOpening, String strikethroughTextClosing) { | ||
| StringBuilder stringBuilder = new StringBuilder(); | ||
| for (int i = 0; i < textLines.size() - 1; i++) { | ||
| TextLine line = textLines.get(i); | ||
| getTextFromLine(line, stringBuilder, strikethroughTextOpening, strikethroughTextClosing); | ||
| TextChunkUtils.formatLineEnd(stringBuilder); | ||
| } | ||
| getTextFromLine(textLines.get(textLines.size() - 1), stringBuilder, strikethroughTextOpening, strikethroughTextClosing); | ||
| return stringBuilder.toString(); | ||
| } |
There was a problem hiding this comment.
IndexOutOfBoundsException when textLines is empty.
Line 30 accesses textLines.get(textLines.size() - 1) without checking if the list is empty. If an empty list is passed, textLines.size() - 1 equals -1, causing an exception.
🐛 Proposed fix to guard against empty list
public static String getTextFromLines(List<TextLine> textLines, String strikethroughTextOpening, String strikethroughTextClosing) {
+ if (textLines == null || textLines.isEmpty()) {
+ return "";
+ }
StringBuilder stringBuilder = new StringBuilder();
for (int i = 0; i < textLines.size() - 1; i++) {
TextLine line = textLines.get(i);
getTextFromLine(line, stringBuilder, strikethroughTextOpening, strikethroughTextClosing);
TextChunkUtils.formatLineEnd(stringBuilder);
}
getTextFromLine(textLines.get(textLines.size() - 1), stringBuilder, strikethroughTextOpening, strikethroughTextClosing);
return stringBuilder.toString();
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| public static String getTextFromLines(List<TextLine> textLines, String strikethroughTextOpening, String strikethroughTextClosing) { | |
| StringBuilder stringBuilder = new StringBuilder(); | |
| for (int i = 0; i < textLines.size() - 1; i++) { | |
| TextLine line = textLines.get(i); | |
| getTextFromLine(line, stringBuilder, strikethroughTextOpening, strikethroughTextClosing); | |
| TextChunkUtils.formatLineEnd(stringBuilder); | |
| } | |
| getTextFromLine(textLines.get(textLines.size() - 1), stringBuilder, strikethroughTextOpening, strikethroughTextClosing); | |
| return stringBuilder.toString(); | |
| } | |
| public static String getTextFromLines(List<TextLine> textLines, String strikethroughTextOpening, String strikethroughTextClosing) { | |
| if (textLines == null || textLines.isEmpty()) { | |
| return ""; | |
| } | |
| StringBuilder stringBuilder = new StringBuilder(); | |
| for (int i = 0; i < textLines.size() - 1; i++) { | |
| TextLine line = textLines.get(i); | |
| getTextFromLine(line, stringBuilder, strikethroughTextOpening, strikethroughTextClosing); | |
| TextChunkUtils.formatLineEnd(stringBuilder); | |
| } | |
| getTextFromLine(textLines.get(textLines.size() - 1), stringBuilder, strikethroughTextOpening, strikethroughTextClosing); | |
| return stringBuilder.toString(); | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/GeneratorUtils.java`
around lines 23 - 32, The getTextFromLines method can throw
IndexOutOfBoundsException when textLines is empty; add a guard at the start of
getTextFromLines to return an empty string (or handle null) if textLines is null
or textLines.isEmpty(), otherwise proceed with the existing loop that calls
getTextFromLine and TextChunkUtils.formatLineEnd; this prevents calling
textLines.get(textLines.size() - 1) on an empty list.
| public static void getTextFromLine(TextLine line, StringBuilder stringBuilder, String strikethroughTextOpening, String strikethroughTextClosing) { | ||
| for (TextChunk chunk : line.getTextChunks()) { | ||
| if (chunk.getIsStrikethroughText()) { | ||
| stringBuilder.append(strikethroughTextOpening); | ||
| } | ||
| stringBuilder.append(chunk.getValue()); | ||
| if (chunk.getIsStrikethroughText()) { | ||
| stringBuilder.append(strikethroughTextClosing); | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Chunk values are not escaped — XSS risk when used with HTML.
chunk.getValue() is appended directly without escaping. When HtmlGenerator passes "<del>"/"</del>" as wrappers, any special HTML characters in the PDF content (e.g., <script> or x < y) will be rendered as HTML rather than displayed as text.
Consider either:
- Having
HtmlGeneratorescape the final string (but this would also escape the intentional<del>tags), or - Accepting an optional escaping function parameter, or
- Adding an
escapeHtmlparameter/flag to this method
🔒 Option: Add escaping callback parameter
+import java.util.function.Function;
+
public class GeneratorUtils {
- public static String getTextFromTextNode(SemanticTextNode textNode, String strikethroughTextOpening, String strikethroughTextClosing) {
+ public static String getTextFromTextNode(SemanticTextNode textNode, String strikethroughTextOpening, String strikethroughTextClosing, Function<String, String> escapeFunction) {
StringBuilder stringBuilder = new StringBuilder();
for (TextColumn column : textNode.getColumns()) {
for (TextBlock block : column.getBlocks()) {
- stringBuilder.append(getTextFromLines(block.getLines(), strikethroughTextOpening, strikethroughTextClosing));
+ stringBuilder.append(getTextFromLines(block.getLines(), strikethroughTextOpening, strikethroughTextClosing, escapeFunction));
}
}
return stringBuilder.toString();
}
- public static void getTextFromLine(TextLine line, StringBuilder stringBuilder, String strikethroughTextOpening, String strikethroughTextClosing) {
+ public static void getTextFromLine(TextLine line, StringBuilder stringBuilder, String strikethroughTextOpening, String strikethroughTextClosing, Function<String, String> escapeFunction) {
for (TextChunk chunk : line.getTextChunks()) {
if (chunk.getIsStrikethroughText()) {
stringBuilder.append(strikethroughTextOpening);
}
- stringBuilder.append(chunk.getValue());
+ String value = escapeFunction != null ? escapeFunction.apply(chunk.getValue()) : chunk.getValue();
+ stringBuilder.append(value);
if (chunk.getIsStrikethroughText()) {
stringBuilder.append(strikethroughTextClosing);
}
}
}Then HtmlGenerator would pass an escape function while MarkdownGenerator passes null or Function.identity().
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/utils/GeneratorUtils.java`
around lines 34 - 44, The getTextFromLine method currently appends
chunk.getValue() directly (in GeneratorUtils.getTextFromLine), causing unescaped
HTML when HtmlGenerator supplies "<del>"/"</del>"; update the method signature
to accept an optional escaping callback (e.g., Function<String,String> escapeFn)
or a boolean escapeHtml flag and apply it to each chunk value before appending
(use escapeFn.apply(value) if provided, or escape when flag is true), while
still appending the strikethrough wrappers as-is; update callers (HtmlGenerator
should pass an HTML-escaping function, MarkdownGenerator can pass null/identity)
so that HTML-special characters in TextChunk.getValue() are properly escaped
without escaping the intentional wrapper tags.
Update strikethrough text usage in Markdown
461abbe to
fb28a31
Compare
Update strikethrough text usage in Markdown
Checklist:
Summary by CodeRabbit
Refactor
…) and Markdown uses inline markers (…) based on detected formatting rather than pre-wrapped text.Tests