Skip to content

HTML API: Preserve decoded carriage returns in serialization#42

Open
sirreal wants to merge 3 commits into
html-api-fuzz-fiz/decoded-cr-basefrom
html-api-fuzz-fiz/decoded-cr
Open

HTML API: Preserve decoded carriage returns in serialization#42
sirreal wants to merge 3 commits into
html-api-fuzz-fiz/decoded-cr-basefrom
html-api-fuzz-fiz/decoded-cr

Conversation

@sirreal

@sirreal sirreal commented Jun 10, 2026

Copy link
Copy Markdown
Owner

What changed

  • Added focused HTML Processor normalization regression tests for decoded carriage returns in text, RCDATA, attribute values, table text, and template text.
  • Added raw attribute CR/CRLF regression coverage, including class update serialization after add_class().
  • Updated serialization of decoded text-like content so literal U+000D is emitted as 
 after existing HTML escaping.
  • Normalized raw source attribute CR/CRLF to LF before serialization, while preserving decoded 
 values.
  • Left SCRIPT and STYLE raw-text serialization unchanged.

Why

A decoded carriage-return character reference such as 
 was serialized as a literal CR byte. Re-parsing normalized output applies HTML input-stream preprocessing, which turns raw CR into LF, so WP_HTML_Processor::normalize() was not byte-idempotent.

A follow-up review found the inverse risk for raw attribute CR/CRLF: raw source newlines must normalize to LF before character-reference decoding, otherwise they can be incorrectly preserved as 
.

Validation

  • WP_TESTS_SKIP_INSTALL=1 ./vendor/bin/phpunit --group html-api,html-api-html5lib-tests
  • ./vendor/bin/phpcs src/wp-includes/html-api/class-wp-html-processor.php src/wp-includes/html-api/class-wp-html-tag-processor.php tests/phpunit/tests/html-api/wpHtmlProcessor-serialize.php
  • codex review --base e1f1f5873a0d7c10d4e3d137184d7fb974525c9c
  • Direct replay of minimized fuzz signature 023771e2f7b2 against this branch: normalize output is byte-idempotent, contains no literal CR, and contains 
.

@sirreal sirreal marked this pull request as ready for review June 10, 2026 09:28
@github-actions

Copy link
Copy Markdown

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props jonsurrell.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

@sirreal sirreal changed the base branch from trunk to html-api-fuzz-fiz/decoded-cr-base June 10, 2026 09:55
@sirreal

sirreal commented Jun 10, 2026

Copy link
Copy Markdown
Owner Author

Code review

Found 1 issue:

  1. The new serialize_decoded_text() and get_attribute_for_serialization() methods are annotated @since 6.9.0, but trunk is at 7.1-alpha (src/wp-includes/version.php#L19) and Trac #65372 is milestoned 7.1. These should be @since 7.1.0 (same issue previously flagged on PR #44 and PR #46).

/**
* Serializes decoded text for use in text nodes and attribute values.
*
* @since 6.9.0
*
* @param string $text Decoded text to serialize.
* @return string Serialized text.
*/
private static function serialize_decoded_text( string $text ): string {

/**
* Returns the value of an attribute, applying HTML input stream preprocessing.
*
* This is intended for serialization, where source HTML values have already
* passed through preprocessing before character references decode. Enqueued
* attribute updates are plaintext API values, so they are returned unchanged.
*
* @since 6.9.0
* @ignore
*
* @param string $name Name of attribute whose value is requested.
* @return string|true|null Value of attribute or `null` if not available. Boolean attributes return `true`.
*/
protected function get_attribute_for_serialization( $name ) {

Also verified (no action needed): get_modifiable_text() already normalizes raw CR to LF before decoding, so a \r in decoded text can only originate from a character reference and the 
 replacement in serialize_decoded_text() is sound; the escape-then-replace order is correct (reversing it would double-escape the &); virtual nodes never reach the non-overridden get_attribute_for_serialization() because WP_HTML_Processor::get_attribute_names_with_prefix() returns null for them; and the raw CR/CRLF-to-LF normalization in class_name_updates_to_attributes_updates() matches the input-stream preprocessing already applied to modifiable text.

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

sirreal added a commit that referenced this pull request Jun 10, 2026
…ation

# Conflicts:
#	src/wp-includes/html-api/class-wp-html-processor.php
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant