Add grapheme_levenshtein function.#18087
Conversation
|
This RFC was accepted. I will go to implementation. |
|
Nice, could you possibly rebase with last master state to redo the CI ? cheers. |
dd9a015 to
5fc7475
Compare
|
I rebased and force pushed. I'm waiting for CI end. |
|
should be fine, thanks. Looking good but I ll go back at it in couple of hours. |
There was a problem hiding this comment.
Is there a reason that false is being returned on failure, rather than throwing an exception?
There was a problem hiding this comment.
For example, Userland can uses intl_get_error_message function after this function if this error. Therefore, this block is returns false.
|
Hmm... Windows CI is failed. |
There was a problem hiding this comment.
Where are the error checks for these calls and the next ones?
There was a problem hiding this comment.
The error checks seem to be still missing.
There was a problem hiding this comment.
You also don't need the void* cast, you can remove that.
There was a problem hiding this comment.
Where possible, I think it's better to merge the declaration and assignment as it reduces the area/scope where something is valid/can be used.
There was a problem hiding this comment.
The declarations and assignments are still not merged
ndossche
left a comment
There was a problem hiding this comment.
Second review round.
As a more general comment: the code can probably be simplified even more by using goto-style error handling.
There was a problem hiding this comment.
The entire comparison is too complex and inefficient.
You can just do something like this (I can't use a suggestion because the changes are not centralised):
diff --git a/ext/intl/grapheme/grapheme_string.c b/ext/intl/grapheme/grapheme_string.c
index c4d0c3987a8..c0f21f1ad7e 100644
--- a/ext/intl/grapheme/grapheme_string.c
+++ b/ext/intl/grapheme/grapheme_string.c
@@ -1041,6 +1041,9 @@ PHP_FUNCTION(grapheme_levenshtein)
RETURN_FALSE;
}
+ UCollator *collator = ucol_open("", &ustatus1);
+ // TODO: handle error
+
p1 = safe_emalloc(strlen_2 + 1, sizeof(zend_long), 0);
p2 = safe_emalloc(strlen_2 + 1, sizeof(zend_long), 0);
@@ -1052,7 +1055,6 @@ PHP_FUNCTION(grapheme_levenshtein)
int32_t current2 = 0;
int32_t pos1 = 0;
int32_t pos2 = 0;
- int32_t usrch_pos = 0;
while (true) {
current1 = ubrk_current(bi1);
@@ -1067,37 +1069,8 @@ PHP_FUNCTION(grapheme_levenshtein)
if (pos2 == UBRK_DONE) {
break;
}
- UStringSearch *srch = usearch_open(ustring1 + current1, pos1 - current1, ustring2 + current2, pos2 - current2, "", NULL, &ustatus2);
- if (U_FAILURE(ustatus2)) {
- intl_error_set_code(NULL, ustatus2);
- intl_error_set_custom_msg(NULL, "Error usearch_open", 0);
- ubrk_close(bi1);
- ubrk_close(bi2);
-
- efree(ustring1);
- efree(ustring2);
-
- efree(p1);
- efree(p2);
- RETURN_FALSE;
- }
- usrch_pos = usearch_first(srch, &ustatus2);
- if (U_FAILURE(ustatus2)) {
- intl_error_set_code(NULL, ustatus2);
- intl_error_set_custom_msg(NULL, "Error usearch_first", 0);
- ubrk_close(bi1);
- ubrk_close(bi2);
-
- efree(ustring1);
- efree(ustring2);
-
- efree(p1);
- efree(p2);
- RETURN_FALSE;
- }
- usearch_close(srch);
- if (usrch_pos != USEARCH_DONE) {
+ if (ucol_strcoll(collator, ustring1 + current1, pos1 - current1, ustring2 + current2, pos2 - current2) == UCOL_EQUAL) {
c0 = p1[i2];
} else {
c0 = p1[i2] + cost_rep;
@@ -1118,6 +1091,8 @@ PHP_FUNCTION(grapheme_levenshtein)
p2 = tmp;
}
+ ucol_close(collator);
+
ubrk_close(bi1);
ubrk_close(bi2);
There was a problem hiding this comment.
I didn't intend to use collation. Please give me some time to judge.
There was a problem hiding this comment.
Thanks for suggestion. I could not find problem. Use your code.
There was a problem hiding this comment.
The error checks seem to be still missing.
There was a problem hiding this comment.
The declarations and assignments are still not merged
There was a problem hiding this comment.
I think you can just use a single status variable that you reuse all the time, that makes things a bit less complex.
There was a problem hiding this comment.
You also don't need the void* cast, you can remove that.
There was a problem hiding this comment.
The error message could probably be improved (same below)
e.g. "Error on ubrk_setText for argument #1 ($string1)"
1da6e4b to
6f41f98
Compare
ndossche
left a comment
There was a problem hiding this comment.
The code looks almost fully good now.
I wonder if the repetition in the error recovery can be reduced by using an error label with gotos. Maybe that's something you can try
There was a problem hiding this comment.
Can't this include be deleted now that you no longer rely on usearch API?
There was a problem hiding this comment.
Please get rid of this void cast as well.
57c5b99 to
42e060e
Compare
ndossche
left a comment
There was a problem hiding this comment.
This is ok apart from the merging of error-handling code. This is something I might try later if you don't do it.
|
Please don't forget to add this to UPGRADING & NEWS upon merging. |
Measure levenshtein for grapheme cluster unit
42e060e to
ebfa683
Compare
|
Thanks for review, |
|
Feel free to merge |
|
Thank you very much. |
Measure levenshtein for grapheme cluster unit.
RFC: https://wiki.php.net/rfc/grapheme_levenshtein