Skip to content

Support rfc6530#5237

Open
arnt wants to merge 1 commit intoWordPress:trunkfrom
arnt:support-rfc6530
Open

Support rfc6530#5237
arnt wants to merge 1 commit intoWordPress:trunkfrom
arnt:support-rfc6530

Conversation

@arnt
Copy link
Copy Markdown

@arnt arnt commented Sep 18, 2023

This adds support for unicode email addresses in is_email and sanitise_email. It is intended to be sufficient to accept unicode email addresses in contact forms etc.,

it does not aspire to unicode support everywhere, in particular it doesn't deal with unicode in logins or passwords.

It also does not add a tests/formatting/sanitizeEmail.php; I wouldn't mind adding one, but don't quite understand your testing policies. It looks like you want to keep the tests fast by avoiding unnecessary tests, I'd say, but I don't understand what you consider necessary and unnecessary. Anyway, if you want a file I'm happy to add one.

Trac ticket: https://core.trac.wordpress.org/ticket/31992

@arnt arnt force-pushed the support-rfc6530 branch 2 times, most recently from 18ce341 to c8587f1 Compare September 18, 2023 12:13
@arnt
Copy link
Copy Markdown
Author

arnt commented Sep 22, 2023

I don't understand the test failures. They appear to be unrelated to these code changes. Could you take a look, @ironprogrammer ? Thanks.

@ironprogrammer
Copy link
Copy Markdown
Member

In response to @arnt:

It also does not add a tests/formatting/sanitizeEmail.php; I wouldn't mind adding one...

Absolutely, please do add unit tests (or updates to existing tests) for sanitize_email(). While Core test code coverage is pretty low today 😓, it is a project policy that new Core merges include appropriate tests.

I don't understand the test failures. They appear to be unrelated to these code changes. Could you take a look, @ironprogrammer ? Thanks.

It's not uncommon for E2E test timeouts to occur in this repo's automated jobs. These can be ignored.

@ironprogrammer
Copy link
Copy Markdown
Member

Thanks for the updated PR! The updates to sanitize_email() appear to have resolved the issue with storing the correct email in the database 🎉.

This isn't a formal test report (there are lots of scenarios needed), but I ran through a few smoke tests so as to get back to you quickly on what could be addressed next. It would be great to get additional reviewers/testers involved to cover what I've missed:

  • ✅ Add New User: email with unicode characters is accepted and saved to the database correctly.
  • ✅ Profile (edit user): email can be changed with unicode characters, and when saved, persists correctly.
  • ❓ Do new user creation, email change confirmation, or password change emails go to the correct address?
  • ❓ Do links in these emails work correctly?
  • ❓ Does the forgotten password workflow work correctly with a unicode address?
  • ✅ Login: logging in with a unicode email works as expected.
  • ✅ Add post comment: email with unicode is accepted for post comments.
  • ❌ Comments (approval list): emails with unicode are not displayed correctly (Figure 1).
  • ✅ In the Comments list, the Email field under Quick Edit does appear correctly.
  • ❓ Site setup wizard: can the initial admin use a unicode address, and do the emails work correctly?
  • ❓ Do various author dropdowns display unicode addresses as expected?
  • ❓ Are unicode emails in feeds presented accurately?

Figure 1: Display of commenter unicode email addresses.
comments approval screen with incorrectly displayed unicode email addresses

@arnt
Copy link
Copy Markdown
Author

arnt commented Sep 27, 2023

Sigh. I see what you mean.

My own focus is/was quite different: I wanted Contact Form 7 and similar to accept unicode email addresses, which requires some utility functions in Wordpress to do the right thing. I didn't aspire to complete support in Wordpress. The thinking behind your response seems to be that either Wordpress supports it properly or not at all, there's no inbetween.

And, sadly, I think you're right and I was wrong.

I'll extend the PR. It'll take a few weeks. I'll also find some testers.

@arnt
Copy link
Copy Markdown
Author

arnt commented Sep 16, 2024

Hi,

you asked for wider testing. Around 20 sites have now run the patch in production, the longest-running one for more than six months already. Nothing has broken. The sites are real (University of Somewhere, not someone's private five-visitor site).

If you want, I can send you the email addresses of some testers. When I pinged them last week, around half said it was okay to give you their addresses, I'm sure most will say the same if I ask a few times ;)

@github-actions
Copy link
Copy Markdown

github-actions bot commented Sep 16, 2024

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Unlinked Accounts

The following contributors have not linked their GitHub and WordPress.org accounts: @peteresnick.

Contributors, please read how to link your accounts to ensure your work is properly credited in WordPress releases.

Core Committers: Use this line as a base for the props when committing in SVN:

Props agulbra, sirlouen, mukesh27, justlevine, tusharbharti, dmsnell, ironprogrammer, sergeybiryukov, mdawaffe.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

@arnt
Copy link
Copy Markdown
Author

arnt commented Sep 16, 2024

I liked the accounts earlier today.

@github-actions
Copy link
Copy Markdown

Test using WordPress Playground

The changes in this pull request can previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

  • The Plugin and Theme Directories cannot be accessed within Playground.
  • All changes will be lost when closing a tab with a Playground instance.
  • All changes will be lost when refreshing the page.
  • A fresh instance is created each time the link below is clicked.
  • Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance,
    it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

@ironprogrammer
Copy link
Copy Markdown
Member

Thanks for the update, @arnt, and for seeking wider testing! There are still some outstanding questions and one previous failure to consider, but it's great to see this moving forward! 🙌🏻

A good next step would be to rebase this PR on the latest trunk, and to resolve the coding standards issues to tidy things up for an updated review and testing. When the CI checks are in a happy state, then it will help attract more attention.

@arnt
Copy link
Copy Markdown
Author

arnt commented Sep 23, 2024

Finding testers who were willing and able to run a patch on sites that mattered to them was hard work. That took a good long while. Do you want their names? If so, please send me mail.

Rebasing is not a problem, of course. The other stuff was a little difficult. I only recently found a good workaround for IntlChar's lack of support for UAX24.

Can you suggest a model end-to-end test that I could emulate and write automatic tests for the desired features? Manual testing is possible, but I'd much prefer to have complete automatic test coverage.

@arnt arnt force-pushed the support-rfc6530 branch 2 times, most recently from cef6fe4 to 28b954e Compare December 3, 2024 14:06
@arnt
Copy link
Copy Markdown
Author

arnt commented Dec 3, 2024

Hi,

the remaining problems, as far as I can tell:

  1. The code now requires PHP 7.4, which I thought was OK but now I see that 7.4 is merely recommended, not required. Oops. Not clear to me how to do mb_str_split in 7.2.
  2. I've clicked around, but don't feel sure that I have all of the "various dropdowns". The filters in default-filters.php are okay, but I don't feel confident that every third-party plugin will be okay, or that I've looked at everything in the UI. Wordpress is large, there are nooks and corners…

@arnt arnt force-pushed the support-rfc6530 branch 2 times, most recently from 79ee473 to 3b0d8e5 Compare December 4, 2024 08:50
@arnt
Copy link
Copy Markdown
Author

arnt commented Dec 4, 2024

I added some code to prevent PHP 7.2.24-7.3.x from choking on mb_str_split().

A bit on testing:

  • ✅ Add New User: email with unicode characters is accepted and saved to the database correctly.
  • ✅ Profile (edit user): email can be changed with unicode characters, and when saved, persists correctly.
  • ✅ New user creation, email change confirmation, and password change emails go to the correct address.
  • ✅ Links in all messages work correctly AFAICT (plugins can add new kinds of course…).
  • ✅ The forgotten password workflow works correctly with a unicode address.
  • ✅ Login: logging in with a unicode email works as expected.
  • ✅ Add post comment: email with unicode is accepted for post comments.
  • ✅ Comments (approval list): emails with unicode are displayed correctly.
  • ✅ In the Comments list, the Email field under Quick Edit does appear correctly.
  • ✅ WordPress works on an IPv6-only server (oh, you weren't wondering?)

And… oh drat… Must retest the site setup wizard, too much code has changed.

Two questions remain.

  1. What author dropdowns display email addresses? I can't find anything. Lots of user names displayed but not email addresses. Maybe I misunderstand. I checked for usage of email in the code that reads the database and use of that code in display code.
  2. How can email addresses ever appear in feeds? The feed generators don't seem to reveal authors etc.

@arnt
Copy link
Copy Markdown
Author

arnt commented Dec 4, 2024

Just set up a new site from scratch using 'rød@grå.org' as admin address; it works.

@arnt
Copy link
Copy Markdown
Author

arnt commented Jan 3, 2025

Hi @ironprogrammer,

would you mind having a look?

Also, on an unrelated matter, I'm with you. I've worked as a full-time maintainer on a widely used open source project and got a lot of flak. It hurt. Some people can be really shitty. Don't let it grind you down.

@ironprogrammer
Copy link
Copy Markdown
Member

Thanks a lot, @arnt, and apologies for not being involved with this the past several months 🙏🏻. I've got some other work priorities at this time, but hope to be able to dive back in soon!

In the meantime, for code review purposes, I would suggest raising this Trac ticket during #core Dev Chat, or dropping in a comment on the next agenda post to add it to open floor discussion. This will help spread some awareness of the initiative. If you're available on Slack and sync during that time to answer questions, that would be ideal.

Glad you could wrangle additional testing for the updates! If this info can be made public through test reports in Trac, that would be amazing -- and also prerequisite to merging as Core committers begin taking a look at the ticket.

You could get additional visibility by sharing the ticket in the #core-test channel. There may also be contributors willing to help put test instructions together based on what you've got so far. This update covers a lot of territory, so the more detailed the tests for each use case, the better.

@arnt arnt force-pushed the support-rfc6530 branch 6 times, most recently from fb001aa to d87d101 Compare June 23, 2025 14:20
Copy link
Copy Markdown
Member

@SirLouen SirLouen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some little code review

Comment thread src/wp-includes/formatting.php Outdated
Comment thread tests/phpunit/tests/formatting/sanitizeUser.php Outdated
Comment thread src/wp-includes/formatting.php Outdated
@arnt arnt force-pushed the support-rfc6530 branch 2 times, most recently from f0456b1 to 7f3ae71 Compare June 23, 2025 19:19
@SirLouen
Copy link
Copy Markdown
Member

SirLouen commented Nov 7, 2025

I decided not to rebase/push now. @SirLouen's comment change clearly should go in but it's minor, and I'd rather like to improve that cast but don't see how to write a unit test. I think we can continue talking without a new push.

Btw, also while I was reviewing this patch, I sent you a little PR with some little extra formatting tweaks for the tests part. Nothing big, just improving CS a bit and setting a wider group to test this ticket at once.

@arnt
Copy link
Copy Markdown
Author

arnt commented Nov 7, 2025

@SirLouen I'll go and have some food now, and merge your PR either when I return or tomorrow morning. I really don't want to break anything due to jetlag and lack of focus.

@dmsnell
Copy link
Copy Markdown
Member

dmsnell commented Nov 7, 2025

jetlag

having just jumped ahead by nine hours, I feel something similar @arnt — hope you recover quickly.


here are some things I reviewed yesterday:

  • WordPress appears to call sanitize_user() everywhere before detecting duplicate usernames, so changing that function should not accidentally allow duplicating user logins.
  • the schema.php indicates a 60-character login and 100-character email. for utf8mb4 configurations that means people will still be able to enter emails with “100 characters”.

I was left with a few more questions in general though:

  • for what reason are we adjusting sanitize_user() if this is about Unicode email addresses? I wouldn’t assume we have to do anything to the user objects.
  • what should we be doing if the database table cannot store the email addresses? for instance, if a table is configured as latin1 we have a problem. while that allows us to store the raw bytes just fine, unless we ensure that PHP and MySQL agree on what’s transferring through the query we will end up with corruption.
  • even if we ensure that we store non-ASCII characters properly, we end up with a length restriction for some email addresses based on the fact that VARCHAR(100) means 100 bytes for a latin1 table but up to 400 bytes for a utf8mb4 table. this is something to consider otherwise people might end up with some confusing error messages when submitting a valid email that looks like it’s shorter than WordPress and the database count it.
CREATE TABLE test_l1 (t varchar(100) character set latin1 collate latin1_swedish_ci) DEFAULT CHARSET=latin1;
$wpdb->query( "INSERT INTO test_latin1 (t) VALUES ('a\u{1f170}b\u{1f171}')" );
$wpdb->query( "INSERT INTO test_latin1 (t) VALUES (_latin1'a\u{1f170}b\u{1f171}')" );
MariaDB [wordpress]> SELECT * FROM test_l1;
+----------------------+
| t                    |
+----------------------+
| a?b?                 |
| a🅰b🅱           |
+----------------------+
2 rows in set (0.000 sec)

In the first case, we sent UTF-8 data to the database, which knows that the table is storing latin1, so it transparently re-encoded from UTF-8 to latin1, replacing unrepresentable characters as ?. In the second case we told the database that the UTF-8 bytes should be explicitly interpreted as latin1 and so it did store the raw bytes, however…

php > var_dump( $wpdb->get_col( 'SELECT t FROM test_l1' ) );
array(2) {
  [0]=>
  string(4) "a?b?"
  [1]=>
  string(20) "a🅰b🅱"
}
php > var_dump( $wpdb->get_col( 'SELECT CAST(t as binary) FROM test_l1' ) );
array(2) {
  [0]=>
  string(4) "a?b?"
  [1]=>
  string(10) "a🅰b🅱"
}

…unless we also do something on the read query side then the database will perform the same transparent re-encoding on read turning those UTF-8 bytes back into UTF-8 and thus double-encoding them.

This seems problematic to me in a new way because after this change we are allowing people to create accounts with Unicode login names and emails, but these database queries won’t fail; they will corrupt the data and someone might lose access to the account they just created, or to an account for which they just changed their email address. If we want to ensure this doesn’t happen we would need to trace the data through all places that access these and ensure we properly manage the character encoding in the SQL session, something that doesn’t tend to go well in WordPress due to the distributed nature of code contributions and plugins.

To me this highlights the need for deprecating non-UTF-8 support (Core-62172), though I guarantee we will be doing whatever we need to to support Unicode emails before we do that.


talk about growing/shrinking it

As named this PR seems very grand in purpose: what does it even mean to support RFC 6530? I feel like as a Core tracking ticket that is appropriate, but the work is likely to involve many independent PRs and changes, including some big semantic changes around usernames and emails, possibly involving some database schema changes.

For now, however, if we assume that this PR should move forward, here are some things I propose, pending further security review:

  • Drop the detection for cross-script spoofing. I would love to see some evidence that the risk is substantial enough that it’s worth the complexity, but I suspect that having rules which change based on the content of the input is going to be confusing for users and developers.
  • Missing from all of the changes are checks that the input is valid UTF-8. If we start by rejecting invalid UTF-8 then we can make a lot of assumptions and drop safety-checks without increasing risk. From my reading of the RFC I don’t see any allowance for addresses with invalid UTF-8.
  • Maintain existing behavior in the antispambot function. That is, let’s not skip non-ASCII characters just because we don’t see them as being present as threats. I think we can probably deprecate that function entirely and skip this, but we will create another discussion for that.

Further, I think we need to talk about what email characters are allowable. Our current set of rejected ASCII character includes octets which could be allowable email addresses. If our intention is to be restrictive, say to reject non-printable characters, or ambiguous whitespace, then we probably ought to see if we can replicate that experience for the upper Unicode planes too and not just turn away from those goals because it’s not ASCII.

Please enjoy your conference. There’s no rush; as this work will probably take a lot of cooperation. We’ll bring it up next week in the developer chat.

@SergeyBiryukov
Copy link
Copy Markdown
Member

@arnt

I added some code to prevent PHP 7.2.24-7.3.x from choking on mb_str_split().

At a glance, would a polyfill for mb_str_split() be helpful here, so that we could remove the function_exists() checks and make the behavior consistent across all supported PHP versions?

@arnt
Copy link
Copy Markdown
Author

arnt commented Nov 12, 2025

@SergeyBiryukov I would say that a polyfill makes sense if WP is to support PHP 7.2-3 for a while and a polyfill is simple. The support table makes it look as if support might be dropped in WP 7.0?

@dmsnell that's a long and very welcome message. Do you think I should attend to that now, or leave you in peace until 6.9 is released? I'd really like to merge this soon, for several reasons, but I imagine the 6.9 release needs developer focus.

@dmsnell
Copy link
Copy Markdown
Member

dmsnell commented Nov 12, 2025

Do you think I should attend to that now, or leave you in peace until 6.9 is released? I'd really like to merge this soon, for several reasons, but I imagine the 6.9 release needs developer focus.

@arnt you don’t need to worry about me or my schedule. just like anticipated objections, it will likely be more advantageous to just move at your own pace and let others work at theirs, be it fast or slow.


I am guessing the most important thing right now is figuring out the issue of database corruption since this creates a new avenue for that, and for potentially locking users out of their accounts because of it. we don’t necessarily have to solve it, but it’s a noteworthy impact from this change and creates a problem we didn’t previously have.

@arnt
Copy link
Copy Markdown
Author

arnt commented Nov 13, 2025

I'll write several separate comments here, on separate topics, and push a new revision at the end. First, cross-script usernames.

There's effectively no evidence of cross-script usernames being used, either for good or for bad. There's an exception and ha half for good: ① Q and other Latin letters appear to be used a little bit in Chinese, including the well-known name 阿Q, and ② Arabic digits (0-9) are used with many writing systems, even ones that have other digits. (Arabic itself is an example of that, in the areas east of Lebanon/Jordan.)

I know about web forums etc. that allow non-ASCII usernames. Unfortunately for this, the ones I know only allow one script.

There's domains, though. Domains in many TLDs are like self-service usernames, so that experience is instructive. ICANN receives abuse reports for domains and forwards them to registrars; I don't read that feed closely but I think I would hear about examples of cross-script similarity abuse. I haven't so far.

Visually similar letters have been used in real attacks, but I don't know about any cross-script use. Rather, i has been used for l (PaypaI a long time ago, phiIIips recently) and 0 for o, and one clever attack used a little-known diacritic and an all-Latin domain.

Impersonators mostly rely on non-letter similarity as far as I can tell. "Look, my email or web page has your bank's logo so I must be a legit representive of your bank." "My bluesky account has a photo of that person so you can trust that I really am that person." It's similarity but it's not letters.

However, I suspect that if it's possible to impersonate a wordpress user on the same site and use the impersonation for a swindle, someone will eventually do it. It's more likely to be done by copying a profile photo than by leveraging cross-script similarity.

So… I think it's reasonable to drop single_unicode_script, either nor or later. The only substantive reason to keep is that having it from the start is IMO easier than adding it later.

@arnt
Copy link
Copy Markdown
Author

arnt commented Nov 13, 2025

About UTF8 and other database encodings.

As I understand it, Wordpress 2.2 started using UTF8 for new databases and 4.2 converted old databases to UTF8, except when admins chose to override that. Overriding is possible to this day. There will be a few sites that do override.

I read ticket 62172 and it makes a great deal of sense to me. Agree about the brittleness. However, that change probably shouldn't be conflated with this.

It seems likely that a non-UTF8 database will suffer corruption in other tables, not just the usernames. That may be less serious. It won't lock anyone out. However, this problem could affect the core value of the site. When you publish a page, saving it to a non-UTF8 database can turn a dozen different currency symbols into question marks. Easy to overlook. Far from good. It seems… less urgent than usernames but worth solving.

Having written all this, I think I've formed an opion:

  • This PR should check roundtrip conversion via DB_CHARSET and refuse to store user names, passwords and email addresses if that fails.
  • If the check fails, the error message should say something like "This name/password/address cannot be stored in the database. Note: DB_CHARSET other than UTF8 is deprecated since 7.0 and will be removed in a future version of Wordpress" without naming a version number.
  • The difference between utf8 and utf8mb4 is best ignored.
  • In theory a string may be roundtrip convertible using PHP but not using MySQL. Best ignored.

Does that sound good?

@arnt
Copy link
Copy Markdown
Author

arnt commented Nov 13, 2025

Next up: The scope of this PR.

As @dmsnell asks:

For what reason are we adjusting sanitize_user() if this is about Unicode email addresses? I wouldn’t assume we have to do anything to the user objects."

The first version of the PR did indeed touch only addresses. I extended it by request because as contributor, I try to conform to whatever is expected. At this point I'd prefer to keep the agglomerated PR because:

  • The unit test and review issues are mostly similar.
  • There are test scripts, props and commits that would either be difficult to split, or that would be bulky repetition.

@arnt
Copy link
Copy Markdown
Author

arnt commented Nov 13, 2025

Email address length:

Wordpress currently supports 100 bytes reliably, more unreliably. With the change I suggest above it'll store 100 characters reliably or reject the address at entry time.

Email address length limits are a bit of a mess in general. This is partly due to the specifications. The length limit for an email address differs from the length limit for an email address on the public internet. Too subtle for most people. Then there's widespread disobedience. and even worse, there's software and SaaS vendors that judge email addresses and have their own unpublished length limits. If you enter a 70-byte email address in a newsletter signup form, you may well be rejected because a SaaS vendor classifies the address as a bot.

Both 100 bytes and 100 characters are enough that any human who tries to use an address near the limit will run into problems quickly. Much software has shorter limits (a well-known CRM has 80) and few people can spell reliably enough to type a really long address correctly from a business card.

It's sad, I wish the situation were cleaner, but 100 is a safe length in the world as it stands, and there isn't really any number that's obviously right.

FWIW, my name is in both RFC 5321 (length on net) and 5322 (length).

@dmsnell
Copy link
Copy Markdown
Member

dmsnell commented Nov 14, 2025

Does that sound good?

I’d like to hear the thoughts from some others who have more experience dealing with text encoding issues, particularly in places where we create new ones. Perhaps @mdawaffe has some or knows someone with thoughts.

A check for round-tripping seems reasonable and awkward at the same time, because making those checks everywhere would be excessive. For an email address though, maybe it’s worth the diversion.

It's sad, I wish the situation were cleaner, but 100 is a safe length in the world as it stands, and there isn't really any number that's obviously right.

We’re discussing a new reality which has no prior in WordPress and which can be dramatically different. The moment someone uses the flag of England in their email address, that’s 28 characters alone. Meaning that if someone uses the US flag in their email they are allowed 92 other ASCII characters but for the England flag it only leaves 72.

If we only allow storing Unicode emails on database tables with UTF-8 support then we could avoid this problem. Maybe that’s a reasonable feature-gate for this: don’t enable non-ASCII email address unless the {$wpdb->prefix}users table is stored as utf8mb4.

Were we to pursue that angle, it could be worth imagining a new global toggle or check like wp_supports_unicode_email_addresses() or something table-based. I know I have discussed the idea with @adamziel about checking database encodings at startup and caching them so we can reliably work with database tables regardless of the character set or collation. 🤔

I extended it #5237 (comment) because as contributor, I try to conform to whatever is expected.

Perhaps it’s my poor reading skills, but I re-read that comment multiple times and couldn’t find the request you’re referring to. Maybe @ironprogrammer left another comment on an older version of the code?

You are likely underselling your opinions here. This is a collaborative effort and we are all working together; I know from your previous comments that there are times you read more into things than people leave when they comment. It’s okay to ask questions and engage in the back and forth — if we all instantly did whatever anyone asked us then we’d have chaos, or LLM-level-slop.


I’d really love to consider this only in the context of sites with a utf8mb4 users table. I don’t know what to recommend as a way to do that, other than one obvious and necessary-but-insufficient task which is to verify that the emails are valid UTF-8.


There are a few places email addresses are read from query args: I think in the initial site installation and in some new user requests. This is not relatively important compared to the other feedback, but it will be something we want to make sure we check at some point, largely to ensure that when decoded they are valid UTF-8 and that they are properly percent-decoded. Perhaps it’s fine as-is and the email functions will verify that when passed the information (apart from the percent-encoding).

Either way, it’d be great to confirm what happens when some emails are sent with invalid query args.


If anyone would be interested in testing the same steps that others like @USERSATOSHI and @ironprogrammer have performed, but with a site setup with latin1 configured as the blog_charset and as the default database table charset, that would be helpful.

It would be valuable to test emails containing “normal” Unicode like non-latin scripts (which use only up to three bytes) as well as addresses containing characters from the supplemental planes, such as emoji (requiring four bytes).

@mdawaffe
Copy link
Copy Markdown
Contributor

This is a cool proposal. Note that I have not made the time to fully understand the conversation above. I don't think anything I've written below is particularly new: apologies if it is obviously repetitive :)

This PR should check roundtrip conversion via DB_CHARSET and refuse to store user names, passwords and email addresses if that fails.

wpdb->strip_invalid_text() (and friends) does something similar:

  • allows anything in latin1 or binary columns/tables.
  • allows 3-byte UTF-8 in utf8(mb3) columns/tables.
  • allows all UTF-8 in utf8mb4 columns/tables.
  • Does an actual DB query for other encodings.

As you'd imagine from the method name, it strips characters (and optionally truncates the string) rather than bailing. I think, though, you could do a clever SELECT statement that would check charsets against the correct columns/tables, check if the result is different than the input, then bail as necessary before doing the UPDATE/INSERT.

I don't think that's necessary, though, since I think we should instead only allow non-ASCII email addresses in utf8mb4 tables: no roundtrip necessary. Just reject non-valid UTF-8 byte sequences in PHP and check column(/connection?) charsets. Any other DB setups are stuck with ASCII email addresses.

The difference between utf8 and utf8mb4 is best ignored.

I don't think we can ignore the difference. We should require utf8mb4 (supposing we take the direction I mentioned in my previous paragraph).

I don't have a strong opinion about limiting strings to having characters from only one script (plus ASCII). On small sites, it likely doesn't matter at all. On large sites (like WordPress.com or sites where WordPress+bbPress is used for public forums) with user registration available to anyone, spoofing is likely a big concern. I agree that uses_single_unicode_script() is easy to add now, easy to stop using later, and hard to add later, so it might be worth just doing now and reevaluating later. Note that (as you all know) domains are a special case. Many TLDs do not allow IDNs, and some only allow certain scripts. Email addresses contain domains :) I'm not sure if we should care about individual TLD limits. We probably shouldn't care.

I do have one strong opinion: non-WordPress code should be able to make "raw" SELECTs to the database and retrieve the correct username/email address. That is, we should not write "corrupted" or otherwise transformed data to these columns even if WordPress correctly uncorrupts/untransforms the values when it read them. That means no charset shenanigans or hacky encodings (e.g., Base64).

For usernames, another (non-?)option would be to allow any characters, but to reject new usernames that are too "close" to existing ones. I don't know how to do that efficiently without adding a new (indexed) column to the users table, which I think is a bad idea.

Fly-by suggestions:

  • start with email addresses only. I'm reasonably confident that non-ASCII usernames will break something, but I have no clue what :)
  • consider making non-ASCII values opt-in (with a WP filter), so we can get real world experience with sites that want it without altering the behavior of sites that don't care. If it all works: make it opt-out in some future version. (I don't know if we ever do that sort of thing. Perhaps I'm being too conservative.)

@arnt
Copy link
Copy Markdown
Author

arnt commented Nov 18, 2025

I just pushed a new revision.

I discovered that wpdb already protects against not being able to store the data. The database may corrupt something, but wpdb already tries really hard to avoid getting that far. I added an extra unit test, since locking people out is worse than most failures.

Side note: That function uses latin1 in a way that looks incompatible with selecting using ilike at first glance. I didn't look closer into that, though.

FWIW I would be unhappy with a toggle for email address support. Those toggles are trouble magnets and I strongly advise against going that way. A toggle for username support is another thing.

I know a data set that can be used for username similarity detection… but that's an unpleasant slippery slope, with much scope for pedantry about details. Some of the decisions in the data set puzzled me. Best avoided, I'd say.

I've no opinion on restricting this to UTF8 databases. If WP were my baby, I'd announce that support for other databases will be sunset in the middlish future and thereby make the smaller question go away, along with the risk caused by brittle code.

I didn't have time to rebase, I'm afraid, but I wanted to get this pushed before I go on a trip tomorrow morning.

@arnt arnt force-pushed the support-rfc6530 branch 2 times, most recently from 4498e43 to 3dc24ca Compare November 19, 2025 08:30
Copy link
Copy Markdown
Member

@dmsnell dmsnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my comment is mostly irrelevant to the bigger picture in this PR, but I saw a thing so I figured I’d point it out to not lose it later

Comment thread src/wp-includes/formatting.php Outdated
Comment thread src/wp-includes/default-filters.php Outdated
Comment on lines +90 to +100
// Email addresses: Allow so long as the database can store them. This
// affects all addresses, including those entered into contact forms.
if ( 'utf8mb4' !== $wpdb->charset ) {
add_filter( 'sanitize_email', 'wp_ascii_without_controls' );
}

// Usernames: Allow if the database can store them. This might be a
// setting instead, so that a site can restrict its own users to ASCII.
if ( 'utf8mb4' !== $wpdb->charset ) {
add_filter( 'sanitize_user', 'wp_ascii_without_controls' );
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Email addresses: Allow so long as the database can store them. This
// affects all addresses, including those entered into contact forms.
if ( 'utf8mb4' !== $wpdb->charset ) {
add_filter( 'sanitize_email', 'wp_ascii_without_controls' );
}
// Usernames: Allow if the database can store them. This might be a
// setting instead, so that a site can restrict its own users to ASCII.
if ( 'utf8mb4' !== $wpdb->charset ) {
add_filter( 'sanitize_user', 'wp_ascii_without_controls' );
}
if ( 'utf8mb4' !== $wpdb->charset ) {
/*
* Email addresses: Allow so long as the database can store them. This
* affects all addresses, including those entered into contact forms.
*/
add_filter( 'sanitize_email', 'wp_ascii_without_controls' );
/*
* Usernames: Allow if the database can store them. This might be a
* setting instead, so that a site can restrict its own users to ASCII.
*/
add_filter( 'sanitize_user', 'wp_ascii_without_controls' );
}

Remove the duplicate conditions and correct the multi-line comment format.

Copy link
Copy Markdown
Member

@dmsnell dmsnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it’s been a while since I’ve reviewed this and I don’t have the time to give that it truly deserves, but I’ll try and rehash some of my thoughts the best I can.

first of all, this is great work from @arnt — someone who understands the scope of the problem and is actively championing for its resolution. I want to comment @arnt for that effort.


but I want to back up and reach for earlier comments I left, which I think are the most-crucial questions to answer here. thankfully, one of them is largely already answered.

I discovered that wpdb already protects against not being able to store the data.

but still it seems we are making a number of implicit changes that aren’t getting enough eyes on them:

  • this goes beyond allowing non-ASCII email addresses and imposes new and arbitrary or subjective rules about what kinds of emails are allowed. there is reference to various concepts like Unicode similarity and security concerns, but those policies aren’t technically enforced.
  • there are questions about support on database encodings which cannot store UTF-8 adequately. ironically, a latin1 table can store this stuff just fine but will not play well with collation, search, and equality.

it seems like this is stuck in the middle of being comprehensive and not enough, and I think that could be the result of very diligent and honest attempts to meet the review feedback.

for me personally I could imagine it being easier to analyze at either extreme:

  • do almost nothing other than open up valid-UTF-8 emails.
  • handle all of the easily-foreseeable UX issues created by non-ASCII emails.

I’ll leave a comment in the Trac ticket, but I think this could use some more discussion in the weekly Dev Chat and also a post on Make to gather wider input; there are likely many parties affected by this change.

Comment thread tests/phpunit/tests/user.php Outdated
* A metal umlaut fails because validate_username is
* strict and n̈ is unfamiliar in every language.
*/
$this->assertFalse( validate_username( 'spın̈altap' ) );
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this actually seems like the kind of username we would intentionally want to support, being a username someone might choose thinking they are clever, but not the kind I would expect from malicious attacks.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW it's been used in an attack. The attacker used an accent that's never used in the victims' language, and some victims overlooked the accent. Just a few pixels, not something the eye had practised parsing, easy to overlook.

However. I'm not married to this. I added it for a reason and now you know the reason, but that doesn't make the reason persuasive in Wordpress' context.

Comment thread tests/phpunit/tests/user.php Outdated
Comment thread tests/phpunit/tests/user.php Outdated
public function test_user_corrupted_login() {
global $wpdb;
$rows = $wpdb->update( $wpdb->users, array( 'user_login' => hex2bin( 'c0' ) ), array( 'ID' => $this->author->ID ) );
$this->assertFalse( $rows );
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while I appreciate the extremely-helpful comment on this test, the test itself doesn’t seem to assert anything about the usernames; it only seems to be asserting that $wpdb return a failure on an update that MySQL is presumed to reject based on assumptions about the testing environment.

can we find a way to setup the test into the expected failure mode and make assertions about the code under test?

'nonascii@nonascii' => array( 'grå@grå.org', true ),
'decomposed unicode' => array( 'gr\u{0061}\u{030a}blå@grå.org', true ),
'weird but legal dots' => array( '..@example.com', true ),
);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we seem to be making assertions about antispambot() here, but we’re also feeding it 100% valid UTF-80and then asserting that the result is valid UTF-8.

is there some case we would expect to fail? how would antispambot() be expected to return non-valid-UTF-8 given valid inputs?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, it broke before. The code looked a little adhoc, so I added a few moderately different test cases and made them pass, then left them in. An antispam function might be modified in a hurry... best to leave a few tests.

Comment thread src/wp-includes/formatting.php Outdated
* Test for invalid characters.
*/
$local = preg_replace( '/[^a-zA-Z0-9!#$%&\'*+\/=?^_`{|}~\.-]/', '', $local );
$local = preg_replace( '/[^a-zA-Z0-9!#$%&\'*+\/=?^_`{|}~\.\x80-\xff-]/', '', $local );
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems like the intent behind this change is to open up all bytes for a domain other than the C0 control characters, including space, and also adding backspace/delete.

rather than accept any invalid UTF-8, would it make sense to start by rejecting invalid UTF-8 strings, and then perform a check for the rejected controls characters?

if (
	! wp_is_valid_utf8( $local ) ||
	strcspn( $local, "\x00\x01\x02...\x20\x7F" ) !== strlen( $local )
) {
	return apply_filters( ... );
}

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this. I seem to remember considering something like this, but the function that was there at the time didn't charm me. wp_is_valid_utf8 is so much nicer.

@arnt arnt force-pushed the support-rfc6530 branch 4 times, most recently from 172067a to 1514f31 Compare February 27, 2026 07:53
@arnt arnt force-pushed the support-rfc6530 branch 3 times, most recently from 05af093 to 2874c47 Compare April 18, 2026 10:29
@arnt
Copy link
Copy Markdown
Author

arnt commented Apr 18, 2026

This refactors code into a new WP_Email_Address as we discussed in Vienna and leaves is_email and sanitize_email much shorter. Changes:

  • is_email now accepts everything that passes the checks used for <input type=email> in the top browsers.
  • It's now really easy to disable unicode address support, just change filters.
  • sanitize_email() is well-aligned with mistakes people actually make. (New, only tangentially mentioned in Vienna.)
  • sanitize_email() will not change an address if there's doubt about the intended result. For example, thorleif@ifi.uio.no could be a different student from thorleif@uio.no (ifi had a separate mail system, maybe still has) and the if there was a soft hyphen near the first dot, the previous sanitization could result in thorleif@ifi.no or thorleif@uio.no. The current one performs only changes that are safe and correspond to known common problems such as pasting in an end-of-sentence dot at the end of an address, or copying from a source that inserted soft hyphens.
  • some mostly unrelated tests expected unrealistic things, e.g. support for 190-character domains. This has now been changed to 60-character domains (still testing the same thing, in this case db column length).
  • is_email and sanitize_email keep the filters for things a site admin can override, which is not everything (noone can make the DNS support 190-character labels, it's beyond our power).
  • Pasting in info@xn--gr-zia.org may be transformed to the human-readable format. (When WordPress sends email to that address, the To: header will say grå.org and the RCPT TO probably xn--gr-zia.org.)

It would be possible to split this into three:

  1. Simplify is_email and add support for unicode addresses using the new WP_Email_Address.
  2. Simplify sanitize_email, align it with observed mistakes, get rid of unsafe changes.
  3. Enable unicode support.

I'm not sure, but perhaps a good idea is to split it like that and commit all three at the same time, for clarity in svn?

…tize_email

This adds support for the unicode address extensions in RFC 6530-3 and
refactors the code so there are fewer long regexes and less duplication
between sanitize_email and is_email. A new class, WP_Email_Address,
provides the shared parts.

Opting out of unicode support is easy, default-filters.php adds
unicode support by adding filters, which can be removed.

sanitize_email no longer does major changes like removing an entire
subdomain from someone's address, it only cleans up things like soft
hyphens and whitespace — changes that happen when coping an email
address from text.

During testing, it became clear that antispambot() worked only for strings
using a single-byte encoding, while this uses UTF8. Fixed.

Fixes #31992.

Props SirLouen, dmsnell, tusharbharti,  mukeshpanchal27, akirk.
@arnt arnt force-pushed the support-rfc6530 branch from 2874c47 to 183dcf9 Compare April 18, 2026 11:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants