Conversation
18ce341 to
c8587f1
Compare
|
I don't understand the test failures. They appear to be unrelated to these code changes. Could you take a look, @ironprogrammer ? Thanks. |
|
In response to @arnt:
Absolutely, please do add unit tests (or updates to existing tests) for
It's not uncommon for E2E test timeouts to occur in this repo's automated jobs. These can be ignored. |
|
Thanks for the updated PR! The updates to This isn't a formal test report (there are lots of scenarios needed), but I ran through a few smoke tests so as to get back to you quickly on what could be addressed next. It would be great to get additional reviewers/testers involved to cover what I've missed:
|
|
Sigh. I see what you mean. My own focus is/was quite different: I wanted Contact Form 7 and similar to accept unicode email addresses, which requires some utility functions in Wordpress to do the right thing. I didn't aspire to complete support in Wordpress. The thinking behind your response seems to be that either Wordpress supports it properly or not at all, there's no inbetween. And, sadly, I think you're right and I was wrong. I'll extend the PR. It'll take a few weeks. I'll also find some testers. |
|
Hi, you asked for wider testing. Around 20 sites have now run the patch in production, the longest-running one for more than six months already. Nothing has broken. The sites are real (University of Somewhere, not someone's private five-visitor site). If you want, I can send you the email addresses of some testers. When I pinged them last week, around half said it was okay to give you their addresses, I'm sure most will say the same if I ask a few times ;) |
|
The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the Unlinked AccountsThe following contributors have not linked their GitHub and WordPress.org accounts: @peteresnick. Contributors, please read how to link your accounts to ensure your work is properly credited in WordPress releases. Core Committers: Use this line as a base for the props when committing in SVN: To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook. |
|
I liked the accounts earlier today. |
Test using WordPress PlaygroundThe changes in this pull request can previewed and tested using a WordPress Playground instance. WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser. Some things to be aware of
For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation. |
|
Thanks for the update, @arnt, and for seeking wider testing! There are still some outstanding questions and one previous failure to consider, but it's great to see this moving forward! 🙌🏻 A good next step would be to rebase this PR on the latest |
|
Finding testers who were willing and able to run a patch on sites that mattered to them was hard work. That took a good long while. Do you want their names? If so, please send me mail. Rebasing is not a problem, of course. The other stuff was a little difficult. I only recently found a good workaround for IntlChar's lack of support for UAX24. Can you suggest a model end-to-end test that I could emulate and write automatic tests for the desired features? Manual testing is possible, but I'd much prefer to have complete automatic test coverage. |
cef6fe4 to
28b954e
Compare
|
Hi, the remaining problems, as far as I can tell:
|
79ee473 to
3b0d8e5
Compare
|
I added some code to prevent PHP 7.2.24-7.3.x from choking on mb_str_split(). A bit on testing:
And… oh drat… Must retest the site setup wizard, too much code has changed. Two questions remain.
|
|
Just set up a new site from scratch using 'rød@grå.org' as admin address; it works. |
|
Hi @ironprogrammer, would you mind having a look? Also, on an unrelated matter, I'm with you. I've worked as a full-time maintainer on a widely used open source project and got a lot of flak. It hurt. Some people can be really shitty. Don't let it grind you down. |
|
Thanks a lot, @arnt, and apologies for not being involved with this the past several months 🙏🏻. I've got some other work priorities at this time, but hope to be able to dive back in soon! In the meantime, for code review purposes, I would suggest raising this Trac ticket during #core Dev Chat, or dropping in a comment on the next agenda post to add it to open floor discussion. This will help spread some awareness of the initiative. If you're available on Slack and sync during that time to answer questions, that would be ideal. Glad you could wrangle additional testing for the updates! If this info can be made public through test reports in Trac, that would be amazing -- and also prerequisite to merging as Core committers begin taking a look at the ticket. You could get additional visibility by sharing the ticket in the #core-test channel. There may also be contributors willing to help put test instructions together based on what you've got so far. This update covers a lot of territory, so the more detailed the tests for each use case, the better. |
fb001aa to
d87d101
Compare
f0456b1 to
7f3ae71
Compare
Btw, also while I was reviewing this patch, I sent you a little PR with some little extra formatting tweaks for the tests part. Nothing big, just improving CS a bit and setting a wider group to test this ticket at once. |
|
@SirLouen I'll go and have some food now, and merge your PR either when I return or tomorrow morning. I really don't want to break anything due to jetlag and lack of focus. |
having just jumped ahead by nine hours, I feel something similar @arnt — hope you recover quickly. here are some things I reviewed yesterday:
I was left with a few more questions in general though:
CREATE TABLE test_l1 (t varchar(100) character set latin1 collate latin1_swedish_ci) DEFAULT CHARSET=latin1;$wpdb->query( "INSERT INTO test_latin1 (t) VALUES ('a\u{1f170}b\u{1f171}')" );
$wpdb->query( "INSERT INTO test_latin1 (t) VALUES (_latin1'a\u{1f170}b\u{1f171}')" );MariaDB [wordpress]> SELECT * FROM test_l1;
+----------------------+
| t |
+----------------------+
| a?b? |
| a🅰b🅱 |
+----------------------+
2 rows in set (0.000 sec)In the first case, we sent UTF-8 data to the database, which knows that the table is storing php > var_dump( $wpdb->get_col( 'SELECT t FROM test_l1' ) );
array(2) {
[0]=>
string(4) "a?b?"
[1]=>
string(20) "a🅰b🅱"
}
php > var_dump( $wpdb->get_col( 'SELECT CAST(t as binary) FROM test_l1' ) );
array(2) {
[0]=>
string(4) "a?b?"
[1]=>
string(10) "a🅰b🅱"
}…unless we also do something on the read query side then the database will perform the same transparent re-encoding on read turning those UTF-8 bytes back into UTF-8 and thus double-encoding them. This seems problematic to me in a new way because after this change we are allowing people to create accounts with Unicode login names and emails, but these database queries won’t fail; they will corrupt the data and someone might lose access to the account they just created, or to an account for which they just changed their email address. If we want to ensure this doesn’t happen we would need to trace the data through all places that access these and ensure we properly manage the character encoding in the SQL session, something that doesn’t tend to go well in WordPress due to the distributed nature of code contributions and plugins. To me this highlights the need for deprecating non-UTF-8 support (Core-62172), though I guarantee we will be doing whatever we need to to support Unicode emails before we do that.
As named this PR seems very grand in purpose: what does it even mean to support RFC 6530? I feel like as a Core tracking ticket that is appropriate, but the work is likely to involve many independent PRs and changes, including some big semantic changes around usernames and emails, possibly involving some database schema changes. For now, however, if we assume that this PR should move forward, here are some things I propose, pending further security review:
Further, I think we need to talk about what email characters are allowable. Our current set of rejected ASCII character includes octets which could be allowable email addresses. If our intention is to be restrictive, say to reject non-printable characters, or ambiguous whitespace, then we probably ought to see if we can replicate that experience for the upper Unicode planes too and not just turn away from those goals because it’s not ASCII. Please enjoy your conference. There’s no rush; as this work will probably take a lot of cooperation. We’ll bring it up next week in the developer chat. |
At a glance, would a polyfill for |
|
@SergeyBiryukov I would say that a polyfill makes sense if WP is to support PHP 7.2-3 for a while and a polyfill is simple. The support table makes it look as if support might be dropped in WP 7.0? @dmsnell that's a long and very welcome message. Do you think I should attend to that now, or leave you in peace until 6.9 is released? I'd really like to merge this soon, for several reasons, but I imagine the 6.9 release needs developer focus. |
@arnt you don’t need to worry about me or my schedule. just like anticipated objections, it will likely be more advantageous to just move at your own pace and let others work at theirs, be it fast or slow. I am guessing the most important thing right now is figuring out the issue of database corruption since this creates a new avenue for that, and for potentially locking users out of their accounts because of it. we don’t necessarily have to solve it, but it’s a noteworthy impact from this change and creates a problem we didn’t previously have. |
|
I'll write several separate comments here, on separate topics, and push a new revision at the end. First, cross-script usernames. There's effectively no evidence of cross-script usernames being used, either for good or for bad. There's an exception and ha half for good: ① Q and other Latin letters appear to be used a little bit in Chinese, including the well-known name 阿Q, and ② Arabic digits (0-9) are used with many writing systems, even ones that have other digits. (Arabic itself is an example of that, in the areas east of Lebanon/Jordan.) I know about web forums etc. that allow non-ASCII usernames. Unfortunately for this, the ones I know only allow one script. There's domains, though. Domains in many TLDs are like self-service usernames, so that experience is instructive. ICANN receives abuse reports for domains and forwards them to registrars; I don't read that feed closely but I think I would hear about examples of cross-script similarity abuse. I haven't so far. Visually similar letters have been used in real attacks, but I don't know about any cross-script use. Rather, i has been used for l (PaypaI a long time ago, phiIIips recently) and 0 for o, and one clever attack used a little-known diacritic and an all-Latin domain. Impersonators mostly rely on non-letter similarity as far as I can tell. "Look, my email or web page has your bank's logo so I must be a legit representive of your bank." "My bluesky account has a photo of that person so you can trust that I really am that person." It's similarity but it's not letters. However, I suspect that if it's possible to impersonate a wordpress user on the same site and use the impersonation for a swindle, someone will eventually do it. It's more likely to be done by copying a profile photo than by leveraging cross-script similarity. So… I think it's reasonable to drop single_unicode_script, either nor or later. The only substantive reason to keep is that having it from the start is IMO easier than adding it later. |
|
About UTF8 and other database encodings. As I understand it, Wordpress 2.2 started using UTF8 for new databases and 4.2 converted old databases to UTF8, except when admins chose to override that. Overriding is possible to this day. There will be a few sites that do override. I read ticket 62172 and it makes a great deal of sense to me. Agree about the brittleness. However, that change probably shouldn't be conflated with this. It seems likely that a non-UTF8 database will suffer corruption in other tables, not just the usernames. That may be less serious. It won't lock anyone out. However, this problem could affect the core value of the site. When you publish a page, saving it to a non-UTF8 database can turn a dozen different currency symbols into question marks. Easy to overlook. Far from good. It seems… less urgent than usernames but worth solving. Having written all this, I think I've formed an opion:
Does that sound good? |
|
Next up: The scope of this PR. As @dmsnell asks:
The first version of the PR did indeed touch only addresses. I extended it by request because as contributor, I try to conform to whatever is expected. At this point I'd prefer to keep the agglomerated PR because:
|
|
Email address length: Wordpress currently supports 100 bytes reliably, more unreliably. With the change I suggest above it'll store 100 characters reliably or reject the address at entry time. Email address length limits are a bit of a mess in general. This is partly due to the specifications. The length limit for an email address differs from the length limit for an email address on the public internet. Too subtle for most people. Then there's widespread disobedience. and even worse, there's software and SaaS vendors that judge email addresses and have their own unpublished length limits. If you enter a 70-byte email address in a newsletter signup form, you may well be rejected because a SaaS vendor classifies the address as a bot. Both 100 bytes and 100 characters are enough that any human who tries to use an address near the limit will run into problems quickly. Much software has shorter limits (a well-known CRM has 80) and few people can spell reliably enough to type a really long address correctly from a business card. It's sad, I wish the situation were cleaner, but 100 is a safe length in the world as it stands, and there isn't really any number that's obviously right. FWIW, my name is in both RFC 5321 (length on net) and 5322 (length). |
I’d like to hear the thoughts from some others who have more experience dealing with text encoding issues, particularly in places where we create new ones. Perhaps @mdawaffe has some or knows someone with thoughts. A check for round-tripping seems reasonable and awkward at the same time, because making those checks everywhere would be excessive. For an email address though, maybe it’s worth the diversion.
We’re discussing a new reality which has no prior in WordPress and which can be dramatically different. The moment someone uses the flag of England in their email address, that’s 28 characters alone. Meaning that if someone uses the US flag in their email they are allowed 92 other ASCII characters but for the England flag it only leaves 72. If we only allow storing Unicode emails on database tables with UTF-8 support then we could avoid this problem. Maybe that’s a reasonable feature-gate for this: don’t enable non-ASCII email address unless the Were we to pursue that angle, it could be worth imagining a new global toggle or check like
Perhaps it’s my poor reading skills, but I re-read that comment multiple times and couldn’t find the request you’re referring to. Maybe @ironprogrammer left another comment on an older version of the code? You are likely underselling your opinions here. This is a collaborative effort and we are all working together; I know from your previous comments that there are times you read more into things than people leave when they comment. It’s okay to ask questions and engage in the back and forth — if we all instantly did whatever anyone asked us then we’d have chaos, or LLM-level-slop. I’d really love to consider this only in the context of sites with a There are a few places email addresses are read from query args: I think in the initial site installation and in some new user requests. This is not relatively important compared to the other feedback, but it will be something we want to make sure we check at some point, largely to ensure that when decoded they are valid UTF-8 and that they are properly percent-decoded. Perhaps it’s fine as-is and the email functions will verify that when passed the information (apart from the percent-encoding). Either way, it’d be great to confirm what happens when some emails are sent with invalid query args. If anyone would be interested in testing the same steps that others like @USERSATOSHI and @ironprogrammer have performed, but with a site setup with It would be valuable to test emails containing “normal” Unicode like non-latin scripts (which use only up to three bytes) as well as addresses containing characters from the supplemental planes, such as emoji (requiring four bytes). |
|
This is a cool proposal. Note that I have not made the time to fully understand the conversation above. I don't think anything I've written below is particularly new: apologies if it is obviously repetitive :)
As you'd imagine from the method name, it strips characters (and optionally truncates the string) rather than bailing. I think, though, you could do a clever I don't think that's necessary, though, since I think we should instead only allow non-ASCII email addresses in
I don't think we can ignore the difference. We should require I don't have a strong opinion about limiting strings to having characters from only one script (plus ASCII). On small sites, it likely doesn't matter at all. On large sites (like WordPress.com or sites where WordPress+bbPress is used for public forums) with user registration available to anyone, spoofing is likely a big concern. I agree that I do have one strong opinion: non-WordPress code should be able to make "raw" SELECTs to the database and retrieve the correct username/email address. That is, we should not write "corrupted" or otherwise transformed data to these columns even if WordPress correctly uncorrupts/untransforms the values when it read them. That means no charset shenanigans or hacky encodings (e.g., Base64). For usernames, another (non-?)option would be to allow any characters, but to reject new usernames that are too "close" to existing ones. I don't know how to do that efficiently without adding a new (indexed) column to the users table, which I think is a bad idea. Fly-by suggestions:
|
|
I just pushed a new revision. I discovered that wpdb already protects against not being able to store the data. The database may corrupt something, but wpdb already tries really hard to avoid getting that far. I added an extra unit test, since locking people out is worse than most failures. Side note: That function uses FWIW I would be unhappy with a toggle for email address support. Those toggles are trouble magnets and I strongly advise against going that way. A toggle for username support is another thing. I know a data set that can be used for username similarity detection… but that's an unpleasant slippery slope, with much scope for pedantry about details. Some of the decisions in the data set puzzled me. Best avoided, I'd say. I've no opinion on restricting this to UTF8 databases. If WP were my baby, I'd announce that support for other databases will be sunset in the middlish future and thereby make the smaller question go away, along with the risk caused by brittle code. I didn't have time to rebase, I'm afraid, but I wanted to get this pushed before I go on a trip tomorrow morning. |
4498e43 to
3dc24ca
Compare
dmsnell
left a comment
There was a problem hiding this comment.
my comment is mostly irrelevant to the bigger picture in this PR, but I saw a thing so I figured I’d point it out to not lose it later
697aa9f to
7b7e00c
Compare
| // Email addresses: Allow so long as the database can store them. This | ||
| // affects all addresses, including those entered into contact forms. | ||
| if ( 'utf8mb4' !== $wpdb->charset ) { | ||
| add_filter( 'sanitize_email', 'wp_ascii_without_controls' ); | ||
| } | ||
|
|
||
| // Usernames: Allow if the database can store them. This might be a | ||
| // setting instead, so that a site can restrict its own users to ASCII. | ||
| if ( 'utf8mb4' !== $wpdb->charset ) { | ||
| add_filter( 'sanitize_user', 'wp_ascii_without_controls' ); | ||
| } |
There was a problem hiding this comment.
| // Email addresses: Allow so long as the database can store them. This | |
| // affects all addresses, including those entered into contact forms. | |
| if ( 'utf8mb4' !== $wpdb->charset ) { | |
| add_filter( 'sanitize_email', 'wp_ascii_without_controls' ); | |
| } | |
| // Usernames: Allow if the database can store them. This might be a | |
| // setting instead, so that a site can restrict its own users to ASCII. | |
| if ( 'utf8mb4' !== $wpdb->charset ) { | |
| add_filter( 'sanitize_user', 'wp_ascii_without_controls' ); | |
| } | |
| if ( 'utf8mb4' !== $wpdb->charset ) { | |
| /* | |
| * Email addresses: Allow so long as the database can store them. This | |
| * affects all addresses, including those entered into contact forms. | |
| */ | |
| add_filter( 'sanitize_email', 'wp_ascii_without_controls' ); | |
| /* | |
| * Usernames: Allow if the database can store them. This might be a | |
| * setting instead, so that a site can restrict its own users to ASCII. | |
| */ | |
| add_filter( 'sanitize_user', 'wp_ascii_without_controls' ); | |
| } |
Remove the duplicate conditions and correct the multi-line comment format.
dmsnell
left a comment
There was a problem hiding this comment.
it’s been a while since I’ve reviewed this and I don’t have the time to give that it truly deserves, but I’ll try and rehash some of my thoughts the best I can.
first of all, this is great work from @arnt — someone who understands the scope of the problem and is actively championing for its resolution. I want to comment @arnt for that effort.
but I want to back up and reach for earlier comments I left, which I think are the most-crucial questions to answer here. thankfully, one of them is largely already answered.
I discovered that wpdb already protects against not being able to store the data.
but still it seems we are making a number of implicit changes that aren’t getting enough eyes on them:
- this goes beyond allowing non-ASCII email addresses and imposes new and arbitrary or subjective rules about what kinds of emails are allowed. there is reference to various concepts like Unicode similarity and security concerns, but those policies aren’t technically enforced.
- there are questions about support on database encodings which cannot store UTF-8 adequately. ironically, a
latin1table can store this stuff just fine but will not play well with collation, search, and equality.
it seems like this is stuck in the middle of being comprehensive and not enough, and I think that could be the result of very diligent and honest attempts to meet the review feedback.
for me personally I could imagine it being easier to analyze at either extreme:
- do almost nothing other than open up valid-UTF-8 emails.
- handle all of the easily-foreseeable UX issues created by non-ASCII emails.
I’ll leave a comment in the Trac ticket, but I think this could use some more discussion in the weekly Dev Chat and also a post on Make to gather wider input; there are likely many parties affected by this change.
| * A metal umlaut fails because validate_username is | ||
| * strict and n̈ is unfamiliar in every language. | ||
| */ | ||
| $this->assertFalse( validate_username( 'spın̈altap' ) ); |
There was a problem hiding this comment.
this actually seems like the kind of username we would intentionally want to support, being a username someone might choose thinking they are clever, but not the kind I would expect from malicious attacks.
There was a problem hiding this comment.
FWIW it's been used in an attack. The attacker used an accent that's never used in the victims' language, and some victims overlooked the accent. Just a few pixels, not something the eye had practised parsing, easy to overlook.
However. I'm not married to this. I added it for a reason and now you know the reason, but that doesn't make the reason persuasive in Wordpress' context.
| public function test_user_corrupted_login() { | ||
| global $wpdb; | ||
| $rows = $wpdb->update( $wpdb->users, array( 'user_login' => hex2bin( 'c0' ) ), array( 'ID' => $this->author->ID ) ); | ||
| $this->assertFalse( $rows ); |
There was a problem hiding this comment.
while I appreciate the extremely-helpful comment on this test, the test itself doesn’t seem to assert anything about the usernames; it only seems to be asserting that $wpdb return a failure on an update that MySQL is presumed to reject based on assumptions about the testing environment.
can we find a way to setup the test into the expected failure mode and make assertions about the code under test?
| 'nonascii@nonascii' => array( 'grå@grå.org', true ), | ||
| 'decomposed unicode' => array( 'gr\u{0061}\u{030a}blå@grå.org', true ), | ||
| 'weird but legal dots' => array( '..@example.com', true ), | ||
| ); |
There was a problem hiding this comment.
we seem to be making assertions about antispambot() here, but we’re also feeding it 100% valid UTF-80and then asserting that the result is valid UTF-8.
is there some case we would expect to fail? how would antispambot() be expected to return non-valid-UTF-8 given valid inputs?
There was a problem hiding this comment.
Well, it broke before. The code looked a little adhoc, so I added a few moderately different test cases and made them pass, then left them in. An antispam function might be modified in a hurry... best to leave a few tests.
| * Test for invalid characters. | ||
| */ | ||
| $local = preg_replace( '/[^a-zA-Z0-9!#$%&\'*+\/=?^_`{|}~\.-]/', '', $local ); | ||
| $local = preg_replace( '/[^a-zA-Z0-9!#$%&\'*+\/=?^_`{|}~\.\x80-\xff-]/', '', $local ); |
There was a problem hiding this comment.
it seems like the intent behind this change is to open up all bytes for a domain other than the C0 control characters, including space, and also adding backspace/delete.
rather than accept any invalid UTF-8, would it make sense to start by rejecting invalid UTF-8 strings, and then perform a check for the rejected controls characters?
if (
! wp_is_valid_utf8( $local ) ||
strcspn( $local, "\x00\x01\x02...\x20\x7F" ) !== strlen( $local )
) {
return apply_filters( ... );
}There was a problem hiding this comment.
I like this. I seem to remember considering something like this, but the function that was there at the time didn't charm me. wp_is_valid_utf8 is so much nicer.
172067a to
1514f31
Compare
05af093 to
2874c47
Compare
|
This refactors code into a new WP_Email_Address as we discussed in Vienna and leaves is_email and sanitize_email much shorter. Changes:
It would be possible to split this into three:
I'm not sure, but perhaps a good idea is to split it like that and commit all three at the same time, for clarity in svn? |
…tize_email This adds support for the unicode address extensions in RFC 6530-3 and refactors the code so there are fewer long regexes and less duplication between sanitize_email and is_email. A new class, WP_Email_Address, provides the shared parts. Opting out of unicode support is easy, default-filters.php adds unicode support by adding filters, which can be removed. sanitize_email no longer does major changes like removing an entire subdomain from someone's address, it only cleans up things like soft hyphens and whitespace — changes that happen when coping an email address from text. During testing, it became clear that antispambot() worked only for strings using a single-byte encoding, while this uses UTF8. Fixed. Fixes #31992. Props SirLouen, dmsnell, tusharbharti, mukeshpanchal27, akirk.

This adds support for unicode email addresses in is_email and sanitise_email. It is intended to be sufficient to accept unicode email addresses in contact forms etc.,
it does not aspire to unicode support everywhere, in particular it doesn't deal with unicode in logins or passwords.
It also does not add a tests/formatting/sanitizeEmail.php; I wouldn't mind adding one, but don't quite understand your testing policies. It looks like you want to keep the tests fast by avoiding unnecessary tests, I'd say, but I don't understand what you consider necessary and unnecessary. Anyway, if you want a file I'm happy to add one.
Trac ticket: https://core.trac.wordpress.org/ticket/31992