HTML API: Ensure that code points always encode to UTF-8

dmsnell · dmsnell · commit 41c748932b65 · 2026-06-11T17:04:01.000Z
This was brought up during fuzz testing of the HTML API.

After `mb_chr()` was polyfilled and the HTML decoder started relying on it, sites with a non-UTF-8 charset selected could see corrupted text, or text encoded as non-UTF-8 bytes, when numeric character references were decoded into text from code points. These sites have broader issues with non-UTF-8 support, but this change ensures that code point encoding remains deterministic.

Developed in: WordPress#12155
Discussed in: https://core.trac.wordpress.org/ticket/65372

Follow-up to [62424].

Props dmsnell, jonsurrell.
See #65372.

git-svn-id: https://develop.svn.wordpress.org/trunk@62487 602fd350-edb4-49c9-b593-d223f7449a82
diff --git a/src/wp-includes/html-api/class-wp-html-decoder.php b/src/wp-includes/html-api/class-wp-html-decoder.php
@@ -424,7 +424,7 @@ public static function read_character_reference( $context, $text, $at = 0, &$mat
 	 * @return string Converted code point, or `�` if invalid.
 	 */
 	public static function code_point_to_utf8_bytes( $code_point ): string {
-		$string = mb_chr( $code_point );
+		$string = mb_chr( $code_point, 'UTF-8' );
 
 		return false !== $string ? $string : '�';
 	}
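
For illustration only, here is a minimal sketch of the behavior the change guards against. It is not part of the commit. It assumes the `mbstring` extension is loaded and uses ISO-8859-1 as a stand-in for whatever non-UTF-8 internal encoding a site might end up with; the variable names are hypothetical. When the encoding argument is omitted, `mb_chr()` falls back to the internal encoding, so the same code point can produce different bytes.

```php
<?php
// Remember the current internal encoding so it can be restored afterwards.
$previous = mb_internal_encoding();

// Assumption: simulate a site whose internal encoding is not UTF-8.
mb_internal_encoding( 'ISO-8859-1' );

// U+00E9 (LATIN SMALL LETTER E WITH ACUTE).
$implicit = mb_chr( 0xE9 );          // Follows the internal encoding.
$explicit = mb_chr( 0xE9, 'UTF-8' ); // Always encodes as UTF-8.

// Restore the original setting.
mb_internal_encoding( $previous );

echo bin2hex( $implicit ), "\n"; // e9   (a single ISO-8859-1 byte, invalid on its own as UTF-8)
echo bin2hex( $explicit ), "\n"; // c3a9 (the two-byte UTF-8 sequence)
```

Passing `'UTF-8'` explicitly makes the result independent of any global encoding setting, which appears to be the property the commit message calls deterministic.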
