36003 dotai embed block editor story block fields as markdown instead of html stripped text #36366

Open
hassandotcms wants to merge 2 commits into main from 36003-dotai-embed-block-editor-story-block-fields-as-markdown-instead-of-html-stripped-text

Conversation

@hassandotcms

Member

What

dotAI now embeds Story Block (Block Editor) fields as Markdown instead of rendering them to HTML and stripping the markup with Tika.

ContentToStringUtil.parseBlockEditor returns StoryBlockMap.toMarkdown() directly — no Tika/HTML round-trip. Markdown is already plain text and preserves structure (tables, code blocks, lists, headings) that the Tika path flattened.

Closes #36003.

Why

Tika flattened tables, code, lists, and headings to plain text, giving the embedding model a worse representation. Markdown keeps that structure.
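To make the loss concrete, here is a minimal sketch of what flattening does to a table. `stripHtml` is a hypothetical regex stand-in for the old HTML path, not Tika itself, and the sample table is invented for illustration.

```java
public class FlattenVsMarkdown {

    // Hypothetical stand-in for the old path: drop tags, then collapse whitespace.
    // This is NOT Tika; it only mimics the structure loss described above.
    static String stripHtml(String html) {
        return html.replaceAll("<[^>]+>", " ").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // Invented sample content: the same two-column table in both forms.
        String html = "<table><tr><th>Name</th><th>Type</th></tr>"
                + "<tr><td>body</td><td>Story Block</td></tr></table>";
        String markdown = "| Name | Type |\n| --- | --- |\n| body | Story Block |";

        // Flattened: rows and columns are no longer distinguishable.
        System.out.println(stripHtml(html)); // prints: Name Type body Story Block

        // Markdown: still plain text, but the row and column boundaries survive.
        System.out.println(markdown);
    }
}
```

After stripping, nothing in the text says that `body` and `Story Block` belong to the same row, which is the kind of information the Markdown form keeps.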

Changes

  • ContentToStringUtil.parseBlockEditor → returns toMarkdown() raw (not via parseText/parseHTML, which would re-collapse/re-strip the structure).
  • New test ContentToStringUtilTest — asserts a Story Block with a table + fenced code block extracts with structure intact; registered in MainSuite3a.
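A rough sketch of why the Markdown is returned raw rather than passed through a whitespace-normalizing step. `collapseWhitespace` is a hypothetical approximation of the newline-collapsing behaviour attributed to parseText above, not the actual dotCMS method, and `tableRows` is just an illustrative way to measure surviving structure.

```java
import java.util.Arrays;

public class RawMarkdownSketch {

    // Hypothetical stand-in for a normalizer that collapses all whitespace,
    // including newlines, into single spaces.
    static String collapseWhitespace(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    // Counts lines that still look like Markdown table rows (start with '|').
    static long tableRows(String text) {
        return Arrays.stream(text.split("\n"))
                .filter(line -> line.startsWith("|"))
                .count();
    }

    public static void main(String[] args) {
        // Invented sample Markdown: a heading followed by a small table.
        String raw = "## Fields\n\n| Field | Type |\n| --- | --- |\n| body | Story Block |";

        System.out.println(tableRows(raw));                     // prints: 3
        System.out.println(tableRows(collapseWhitespace(raw))); // prints: 0
    }
}
```

Once the newlines are gone the table is a single line of pipes and dashes, so any downstream consumer that relies on line boundaries no longer sees rows.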

…ipped text (#36003)

parseBlockEditor now returns StoryBlockMap.toMarkdown() directly instead of rendering
to HTML and stripping it with Tika. Markdown is already plain text and preserves the
structure (tables, code blocks, lists, headings) that the Tika path flattened away.
The markdown is returned raw -- not routed through parseText (collapses newlines) or
parseHTML (Tika re-strips) -- so newline-delimited structure survives the extraction
layer.

Adds ContentToStringUtilTest asserting a Story Block with a table and a fenced code
block extracts with that structure intact, and registers it in MainSuite3a.
…i-embed-block-editor-story-block-fields-as-markdown-instead-of-html-stripped-text

# Conflicts:
#	dotcms-integration/src/test/java/com/dotcms/MainSuite3a.java
@claude

claude Bot commented Jun 30, 2026

Contributor

Claude finished @hassandotcms's task in 1m 24s


Rollback Safety Analysis

  • Read rollback-unsafe categories reference
  • Get full PR diff
  • Analyze diff against all unsafe categories
  • Apply label

Result: ✅ Safe To Rollback

Scope of changes:

| File | Nature |
| --- | --- |
| ContentToStringUtil.java | Private method parseBlockEditor: swaps storyBlockMap.toHtml() + Tika for storyBlockMap.toMarkdown() |
| MainSuite3a.java | Test suite registration only |
| ContentToStringUtilTest.java | New integration test only |

Analysis against every unsafe category:

| Category | Verdict |
| --- | --- |
| C-1 Structural Data Model | ✅ No DB schema touched |
| C-2 Elasticsearch Mapping | ✅ No ES mapping change; no ESMappingAPIImpl/putMapping() calls |
| C-3 Content JSON Model Version | CURRENT_MODEL_VERSION unchanged |
| C-4 DROP TABLE/COLUMN | ✅ No DDL of any kind |
| H-1 One-Way Data Migration | ✅ No runonce task, no UPDATE … SELECT |
| H-2 RENAME TABLE/COLUMN | ✅ None |
| H-3 PK Restructuring | ✅ None |
| H-4 New Field Type | ✅ No new field type registered |
| H-5 Storage Provider Change | ✅ None |
| H-6 DROP PROCEDURE/FUNCTION | ✅ None |
| H-7 NOT NULL Without Default | ✅ None |
| H-8 VTL Viewtool Contract | SearchTool is not registered in toolbox.xml; parseBlockEditor is private, so no VTL-accessible surface changes |
| M-1 Column Type Change | ✅ None |
| M-2 Push Publishing Bundle | ✅ None |
| M-3 REST/GraphQL Contract | ✅ No endpoint or response shape changed |
| M-4 OSGi Interface | ✅ No public interface modified |

Reasoning: parseBlockEditor is a private method deep inside the AI embeddings pipeline. Its output feeds into the vector store as embedding source text — format changes here affect only the quality of future embeddings for Story Block fields; they do not alter any database schema, Elasticsearch mapping, stored data contract, or API surface. Rolling back to N-1 reverts to the Tika path; any Markdown-format embeddings already written are semantically compatible (cosine similarity still works), and no data is lost.
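For reference, a small sketch of the cosine similarity mentioned above. The metric depends only on the two vectors, not on the text format that produced them. The class name and sample vectors are invented for illustration and are not taken from the dotAI code.

```java
public class CosineSketch {

    // cosine(a, b) = (a . b) / (|a| * |b|); returns 0 if either vector has zero length.
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same dimension");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Invented toy vectors; real embeddings have many more dimensions.
        double[] query = {1.0, 2.0, 3.0};
        double[] sameDirection = {2.0, 4.0, 6.0};
        double[] orthogonal = {3.0, 0.0, -1.0};

        System.out.println(cosine(query, sameDirection)); // close to 1.0
        System.out.println(cosine(query, orthogonal));    // 0.0
    }
}
```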

@hassandotcms hassandotcms marked this pull request as ready for review June 30, 2026 14:28
@github-actions

github-actions Bot commented Jun 30, 2026

Contributor

🤖 dotBot Review (Bedrock)

Reviewed 3 file(s); 1 candidate(s) → 1 confirmed, 0 uncertain (unverified, kept for review).

Confirmed findings

  • 🟡 Medium dotcms-integration/src/test/java/com/dotcms/ai/util/ContentToStringUtilTest.java:100 — Missing test coverage for empty/null Story Block field handling
    The test testParseBlockEditorWithTableAndCodeBlock validates correct Markdown conversion of valid Story Block JSON but lacks cases for empty/null inputs. While parseBlockEditor uses @NotNull, callers might still pass empty strings (e.g., from empty content fields). The test sets body to STORY_BLOCK_WITH_TABLE_AND_CODE (line 97) but never tests empty body values. Without tests for empty/invalid JSON inputs, potential issues in StoryBlockMap constructor (e.g., NPEs, JSON parsing errors) or empty return values remain unverified.
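One possible shape for the cases this finding asks about, sketched without the dotCMS classes. `extract` is a hypothetical wrapper and the converter lambda stands in for the StoryBlockMap-to-Markdown step; neither reflects the actual signature or behaviour of parseBlockEditor, which a real test would have to exercise through the public extraction API.

```java
import java.util.Optional;
import java.util.function.Function;

public class EmptyInputSketch {

    // Hypothetical wrapper: treats null, blank, unparseable, or empty-result
    // input as "nothing to embed" instead of letting an exception escape.
    static Optional<String> extract(String json, Function<String, String> toMarkdown) {
        if (json == null || json.trim().isEmpty()) {
            return Optional.empty();
        }
        try {
            String markdown = toMarkdown.apply(json);
            return (markdown == null || markdown.trim().isEmpty())
                    ? Optional.empty()
                    : Optional.of(markdown);
        } catch (RuntimeException e) {
            // Assumption: malformed input should yield no text rather than fail.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        Function<String, String> ok = json -> "# Title";
        Function<String, String> failing = json -> {
            throw new IllegalStateException("bad json");
        };

        System.out.println(extract(null, ok).isPresent());       // prints: false
        System.out.println(extract("   ", ok).isPresent());      // prints: false
        System.out.println(extract("{bad", failing).isPresent()); // prints: false
        System.out.println(extract("{}", ok).get());              // prints: # Title
    }
}
```

Whether empty input should produce an empty Optional or an error is a design decision for the PR author; the sketch only lists the inputs worth covering.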

us.deepseek.r1-v1:0 · Run: #28451883674 · tokens: in: 14379 · out: 3070 · total: 17449 · calls: 6 · est. ~$0.036

Labels

AI: Safe To Rollback, Area : Backend (PR changes Java/Maven backend code)

Development

Successfully merging this pull request may close these issues.

dotAI: embed Block Editor (Story Block) fields as Markdown instead of HTML-stripped text
