Skip to content

Fix image extraction newline bug in ImageExtractorNew#837

Merged
jimkont merged 1 commit into
dbpedia:masterfrom
anshuman9468:fix-image-extraction-newline
Mar 5, 2026
Merged

Fix image extraction newline bug in ImageExtractorNew#837
jimkont merged 1 commit into
dbpedia:masterfrom
anshuman9468:fix-image-extraction-newline

Conversation

@anshuman9468
Copy link
Copy Markdown
Contributor

@anshuman9468 anshuman9468 commented Mar 4, 2026

Description

The issue in ImageExtractorNew.scala caused image names extracted from sections to include a leading newline (\n). This happened because when iterator.hasNext() returned false at the last character, the code appended the character directly without checking if it was a newline.

Example input:

\nLRO_WAC_South_Pole_Mosaic.jpg|

Previously extracted result:
\nLRO_WAC_South_Pole_Mosaic.jpg

The fix adds a condition to skip newline characters before appending them to the builder. This ensures the correct extraction:

LRO_WAC_South_Pole_Mosaic.jpg

Testing

Due to Scala 2.11 compatibility issues with modern JDK (object java.lang.Object in compiler mirror not found), the full Maven test suite could not run.

Instead, a standalone test program replicating the extraction loop was used with the problematic input. The output correctly extracted:

Extracted Image: [LRO_WAC_South_Pole_Mosaic.jpg]
confirming the fix works as expected.

Issue Link-: #748

Summary by CodeRabbit

  • Bug Fixes
    • Refined image file name extraction to properly handle delimiter boundaries during the parsing process. The system now stops accumulating characters upon encountering colons, equal signs, pipes, and line breaks. This prevents unwanted trailing delimiters from being included in extracted file names, ensuring more accurate and consistent data extraction results overall.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Mar 4, 2026

📝 Walkthrough

Walkthrough

A minor logic change in image filename extraction that adds delimiter-based early termination. When accumulating image filename characters, the parser now stops appending upon encountering delimiters (':', '=', '|', or newline) instead of including them in the final substring.

Changes

Cohort / File(s) Summary
Image Extraction Logic
core/src/main/scala/org/dbpedia/extraction/mappings/ImageExtractorNew.scala
Modified character accumulation in totallyNotImageLinkRegex to stop appending when delimiter characters are encountered, reducing unwanted trailing delimiters in extracted image filenames.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main change: fixing a newline handling bug in image extraction within ImageExtractorNew.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Tip

Try Coding Plans. Let us write the prompt for your AI agent so you can ship faster (with fewer bugs).
Share your feedback on Discord.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@sonarqubecloud
Copy link
Copy Markdown

sonarqubecloud Bot commented Mar 4, 2026

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
core/src/main/scala/org/dbpedia/extraction/mappings/ImageExtractorNew.scala (1)

280-294: Consider extracting a shared delimiter predicate.

The delimiter check is duplicated in two branches. Centralizing it reduces the chance of future branch drift regressions.

♻️ Proposed refactor
 def totallyNotImageLinkRegex(section: String): ListBuffer[String] = {
   val charArray = section.reverse.toCharArray
   val iterator = charArray.iterator
   val sb = new mutable.StringBuilder()
   val images = ListBuffer[String]()
   var imageFound = false
+  `@inline` def isDelimiter(ch: Char): Boolean =
+    ch == ':' || ch == '=' || ch == '|' || ch == '\n'
   while (iterator.hasNext) {
     var c = iterator.next()
@@
-      } else if (iterator.hasNext) {
-        if (c != ':' && c != '=' && c != '|' && c != '\n') {
+      } else if (iterator.hasNext) {
+        if (!isDelimiter(c)) {
           sb.append(c)
         }
@@
       } else {
-        if (c != ':' && c != '=' && c != '|' && c != '\n') {
+        if (!isDelimiter(c)) {
           sb.append(c)
         }
         imageFound = false
         images += sb.reverse.toString()
         sb.clear()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/src/main/scala/org/dbpedia/extraction/mappings/ImageExtractorNew.scala`
around lines 280 - 294, The duplicated delimiter check (c != ':' && c != '=' &&
c != '|' && c != '\n') in ImageExtractorNew.scala should be extracted into a
single helper predicate to avoid drift; add a private def or local function like
isDelimiter(c: Char): Boolean (or isNotDelimiter if you prefer) near the
surrounding method in the ImageExtractorNew object/class and replace both
occurrences in the branch that appends to sb and the final-else branch with
calls to that predicate (e.g., if (isNotDelimiter(c)) sb.append(c) else { ...
}); update usages in the same method to use the helper and keep behavior
identical.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@core/src/main/scala/org/dbpedia/extraction/mappings/ImageExtractorNew.scala`:
- Around line 280-294: The duplicated delimiter check (c != ':' && c != '=' && c
!= '|' && c != '\n') in ImageExtractorNew.scala should be extracted into a
single helper predicate to avoid drift; add a private def or local function like
isDelimiter(c: Char): Boolean (or isNotDelimiter if you prefer) near the
surrounding method in the ImageExtractorNew object/class and replace both
occurrences in the branch that appends to sb and the final-else branch with
calls to that predicate (e.g., if (isNotDelimiter(c)) sb.append(c) else { ...
}); update usages in the same method to use the helper and keep behavior
identical.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7f7f8c3f-cd52-4883-aa50-d1bba5c1da2e

📥 Commits

Reviewing files that changed from the base of the PR and between 892ddfe and 4ce6f1d.

📒 Files selected for processing (1)
  • core/src/main/scala/org/dbpedia/extraction/mappings/ImageExtractorNew.scala

Copy link
Copy Markdown
Member

@jimkont jimkont left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @anshuman9468

@jimkont jimkont merged commit 907552e into dbpedia:master Mar 5, 2026
5 of 8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants