Use fast path in more cases when doing case folding with mb_convert_case #20889
alexdowad wants to merge 1 commit into php:master
mbstring's Unicode case conversion is table-driven, using Minimal Perfect Hash tables. However, for small codepoint values, we bypass the hashtable lookup and just use hard-coded conversion logic (i.e. adding or subtracting 0x20 from the appropriate ASCII range). For upcasing and downcasing, we had already optimized the conditional which sends execution down this fast path, to use the fast path for as many codepoint values as possible. However, for case folding, this had not been done. This will give a small performance boost for case-folding Unicode text which includes non-breaking spaces, symbols like ¥ or ™, or accented Latin characters (used in many European languages).
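As a rough illustration of the idea described above (not the actual patch), the widened conditional could look something like the sketch below. The function names are hypothetical, the slow path is a stub standing in for the Minimal Perfect Hash lookup, and the `0xB5` threshold is an assumption taken from CaseFolding.txt, where U+00B5 MICRO SIGN is the first non-ASCII codepoint with a folding entry.

```c
#include <stdint.h>

/* Hypothetical stand-in for the Minimal Perfect Hash lookup (the slow path).
 * This stub only knows one mapping: U+00B5 MICRO SIGN folds to U+03BC. */
static uint32_t fold_slow_path(uint32_t cp)
{
    return cp == 0xB5 ? 0x3BC : cp;
}

/* Simple case folding with a widened fast path.
 * Assumption: no codepoint below 0xB5 other than A-Z has a folding entry,
 * so all of them can skip the table lookup. */
static uint32_t fold_codepoint(uint32_t cp)
{
    if (cp < 0xB5) {
        if (cp >= 'A' && cp <= 'Z') {
            return cp + 0x20; /* ASCII uppercase to lowercase */
        }
        return cp; /* everything else in this range is unchanged */
    }
    return fold_slow_path(cp);
}
```

With a narrower test such as `cp < 0x80`, a codepoint like U+00A0 (non-breaking space) or U+00A5 (¥) would go to the table only to come back unchanged; with the wider test it returns immediately.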
youkidearitai
left a comment
Confirmed from https://www.unicode.org/Public/17.0.0/ucd/CaseFolding.txt . Looks good to me.
However,
That's an interesting idea. It does seem that there are more ranges which could possibly be handled without doing a hashtable lookup. The problem is that as we keep adding more conditional tests to select different "fast paths", we make the "slow path" slower, because it has to go through all those tests before finally falling back to the hashtable. As with everything, if you are interested in adding another fast path, I think testing would be a good idea.
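To make the trade-off concrete, here is a purely illustrative sketch with two chained fast paths. The second range (Cyrillic U+0410 to U+042F, which folds by adding 0x20 per CaseFolding.txt) is an assumption chosen for the example and is not necessarily the range suggested in this thread; `table_lookup` and the `tests` counter are hypothetical.

```c
#include <stdint.h>

/* Hypothetical stand-in for the hashtable lookup; returns its input unchanged. */
static uint32_t table_lookup(uint32_t cp)
{
    return cp;
}

/* Folds cp and reports in *tests how many fast-path selection tests
 * were evaluated before the function returned. */
static uint32_t fold_two_fast_paths(uint32_t cp, int *tests)
{
    *tests = 1;
    if (cp < 0xB5) { /* first fast path: ASCII and low Latin-1 */
        return (cp >= 'A' && cp <= 'Z') ? cp + 0x20 : cp;
    }
    *tests = 2;
    if (cp >= 0x410 && cp <= 0x42F) { /* illustrative second fast path */
        return cp + 0x20;
    }
    /* Slow path: every selection test above was paid for before getting here. */
    return table_lookup(cp);
}
```

Any codepoint that misses both ranges (a CJK ideograph, for instance) pays for both tests and then still does the lookup, which is why benchmarking on representative text matters before adding another range.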
Thanks to @youkidearitai for review. I merged this tweak. I have another optimization for case folding, which I hope to open another PR for soon (maybe tomorrow?)
Pretty neat!
FYA @youkidearitai @ndossche @cmb69