[Bug]: DOMParser produces different output for plain text input depending on libxml version #90

@GertjanRoke

PHP Version

8.4

Version

2.1.0

Bug Description

DOMParser::setDocument() calls DOMDocument::loadHTML() to parse HTML content. The behavior of loadHTML() differs between libxml versions when processing plain text (non-HTML) input:

  • libxml 2.9.x (Ubuntu 22.04 CI): wraps bare text in an implied <p> element (legacy HTML4/SGML-style behavior)
  • libxml 2.14+ (macOS, newer distros): leaves bare text as a text node with no <p> wrapper (HTML5-conformant tokenizer)

This means Editor::setContent('Hello world') produces different Tiptap JSON depending on the platform.

Steps to reproduce

$editor = new \Tiptap\Editor;
$editor->setContent('Hello world');
echo $editor->getJSON();

On libxml 2.9.x:

{"type":"doc","content":[{"type":"paragraph","content":[{"type":"text","text":"Hello world"}]}]}

On libxml 2.14+ (observed on 2.15.1):

{"type":"doc","content":[{"type":"text","text":"Hello world"}]}

The difference is that on older libxml, DOMDocument::loadHTML() wraps bare text inside <body> in a <p> element, while newer libxml (with its HTML5-conformant tokenizer) treats it as a raw text node.
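The difference can be checked without tiptap-php by calling DOMDocument::loadHTML() directly. The snippet below is a minimal sketch: firstBodyChildName() is a hypothetical helper written for this report, not part of the library. It prints the libxml version PHP was built against and the name of the first node inside <body>. Going by the behavior described above, this would be p on libxml 2.9.x and #text on newer versions.

```php
<?php
// Hypothetical helper (not part of tiptap-php): returns the node name of the
// first child of <body> after parsing $input with DOMDocument::loadHTML().
function firstBodyChildName(string $input): ?string
{
    $dom = new DOMDocument();
    // LIBXML_NOERROR keeps libxml parse warnings out of the output.
    $dom->loadHTML($input, LIBXML_NOERROR);

    $body = $dom->getElementsByTagName('body')->item(0);

    // Null if no <body> was created or it has no children.
    return $body?->firstChild?->nodeName;
}

// LIBXML_DOTTED_VERSION is the libxml version PHP was compiled against.
echo 'libxml ', LIBXML_DOTTED_VERSION, ': ',
    firstBodyChildName('Hello world') ?? '(no body)', PHP_EOL;
```

Running this on both the CI image and a local machine shows whether a given environment falls on the wrapping or the non-wrapping side.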

Root cause

In src/Core/DOMParser.php:45:

$this->DOM->loadHTML(
    $this->makeValidXMLDocument(
        $this->minify($value)
    )
);

loadHTML() delegates to libxml2's HTML parser, whose behavior changed in libxml 2.14.0 (release notes: "The HTML tokenizer now conforms fully to HTML5"). See also libxml2 GitLab issue #414.

Impact

This primarily affects tests that pass plain text to Tiptap-backed fields (e.g. Filament RichEditor's fillForm()). In production, the browser-based Tiptap editor always produces proper HTML with <p> tags, so the issue doesn't surface in normal usage.

However, it breaks the contract that setContent() should produce consistent output regardless of the server environment.

Possible solutions

  1. Detect plain text input (no HTML tags) and wrap it in <p> before calling loadHTML()
  2. Use PHP 8.4's \Dom\HTMLDocument (which has a built-in HTML5 parser) when available, with DOMDocument as fallback
  3. Document that setContent() expects HTML input, not plain text
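Option 2 could look roughly like the sketch below. parseHtml() is a hypothetical helper, not existing tiptap-php code. It assumes that \Dom\HTMLDocument::createFromString() (PHP 8.4+) is used when the class exists and that DOMDocument::loadHTML() is the fallback. Two caveats: an HTML5 parser does not wrap bare text in <p> either, so this alone would make the output consistent across libxml versions on PHP 8.4+ but not restore the paragraph node; and the new DOM classes may report element names in uppercase, hence the strtolower() call.

```php
<?php
// Hypothetical helper (not part of tiptap-php): prefer PHP 8.4's HTML5 parser
// and fall back to the libxml-based DOMDocument on older PHP versions.
function parseHtml(string $html): object
{
    if (class_exists('Dom\HTMLDocument')) {
        // LIBXML_NOERROR suppresses warnings for HTML5 parse errors,
        // such as a missing doctype.
        return \Dom\HTMLDocument::createFromString($html, LIBXML_NOERROR);
    }

    $dom = new DOMDocument();
    $dom->loadHTML($html, LIBXML_NOERROR);

    return $dom;
}

// Inspect what the parser made of bare text.
$body = parseHtml('Hello world')->getElementsByTagName('body')->item(0);
echo strtolower($body->firstChild->nodeName), PHP_EOL;
```

Because the two parsers return unrelated class hierarchies (Dom\HTMLDocument vs. DOMDocument), the rest of the parsing code would need to work against both, which makes this option more invasive than option 1.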

Environment

  • tiptap-php: 2.1.0
  • PHP: 8.4
  • libxml: 2.15.1 (macOS) / 2.9.14 (Ubuntu 22.04 CI)

Expected Behavior

Editor::setContent() should produce identical Tiptap JSON output regardless of the underlying libxml version. When plain text (without HTML tags) is passed as input, the output should be consistent across all platforms:

$editor = new \Tiptap\Editor;
$editor->setContent('Hello world');
echo $editor->getJSON();

// Should always produce the same result, e.g.:
// {"type":"doc","content":[{"type":"paragraph","content":[{"type":"text","text":"Hello world"}]}]}

The library should either normalize plain text input before passing it to DOMDocument::loadHTML(), or document that setContent() expects HTML input rather than plain text.
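A minimal sketch of the normalization approach, assuming a hypothetical helper wrapPlainText() that would run before the value reaches loadHTML(). The tag detection is a deliberately simple heuristic (a "<" followed by a letter, "!" or "/"), and the text is wrapped as-is so that entities such as &amp; keep being interpreted by the HTML parser as before.

```php
<?php
// Hypothetical helper (not part of tiptap-php): wraps input that contains no
// HTML markup in a <p> element so every libxml version sees a paragraph.
function wrapPlainText(string $value): string
{
    // Empty or whitespace-only input: nothing to wrap.
    if (trim($value) === '') {
        return $value;
    }

    // Anything that looks like the start of a tag, closing tag, comment or
    // doctype is treated as HTML and returned unchanged.
    if (preg_match('/<[a-zA-Z!\/]/', $value) === 1) {
        return $value;
    }

    return '<p>' . $value . '</p>';
}

echo wrapPlainText('Hello world'), PHP_EOL;
echo wrapPlainText('<h1>Title</h1>'), PHP_EOL;
```

With this in place, input that already contains markup passes through untouched, while bare text reaches the parser as an explicit paragraph on every platform.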

Additional Context (Optional)

No response

Dependency Updates

  • Yes, I've updated all my dependencies.
