Commit 41c7489

HTML API: Ensure that code points always encode to UTF-8
This was brought up during fuzz testing of the HTML API.

After polyfilling `mb_chr()` and relying on it in the HTML decoder, sites with a non-UTF-8 charset selected could produce corrupted text, or text encoded as non-UTF-8 bytes, when creating text from code points while decoding numeric character references. Although these sites have broader issues with non-UTF-8 support, this change ensures that code point encoding remains deterministic.

Developed in: WordPress#12155
Discussed in: https://core.trac.wordpress.org/ticket/65372

Follow-up to [62424].

Props dmsnell, jonsurrell.
See #65372.

git-svn-id: https://develop.svn.wordpress.org/trunk@62487 602fd350-edb4-49c9-b593-d223f7449a82
1 parent b5abaff commit 41c7489

1 file changed: 1 addition & 1 deletion

src/wp-includes/html-api/class-wp-html-decoder.php
@@ -424,7 +424,7 @@ public static function read_character_reference( $context, $text, $at = 0, &$mat
 	 * @return string Converted code point, or `�` if invalid.
 	 */
 	public static function code_point_to_utf8_bytes( $code_point ): string {
-		$string = mb_chr( $code_point );
+		$string = mb_chr( $code_point, 'UTF-8' );
 
 		return false !== $string ? $string : '';
 	}
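
The one-argument form of `mb_chr()` encodes to whatever `mb_internal_encoding()` currently reports, which is why the output could vary from site to site. The following is a minimal sketch of that difference, not code from the commit: it assumes the mbstring extension is loaded and simulates a non-UTF-8 site by setting the internal encoding to ISO-8859-1 by hand. The variable names are illustrative only.

```php
<?php
// Simulate a site whose configured charset is not UTF-8 (assumption for illustration).
$previous_encoding = mb_internal_encoding();
mb_internal_encoding( 'ISO-8859-1' );

$code_point = 0xE9; // U+00E9 LATIN SMALL LETTER E WITH ACUTE

// Without the second argument, mb_chr() follows the internal encoding.
$implicit = mb_chr( $code_point );

// With an explicit encoding, the result no longer depends on global state.
$explicit = mb_chr( $code_point, 'UTF-8' );

echo bin2hex( $implicit ), "\n"; // e9   (a single ISO-8859-1 byte)
echo bin2hex( $explicit ), "\n"; // c3a9 (the two-byte UTF-8 sequence)

// Restore the global setting so later code is unaffected.
mb_internal_encoding( $previous_encoding );
```

Passing `'UTF-8'` explicitly, as the diff does, keeps `code_point_to_utf8_bytes()` consistent with its name regardless of how the internal encoding has been configured.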
