Fix LT-22033: Improve algorithm for merging lift pronunciations#489
Conversation
60c8e8c to
b7630b7
Compare
imnasnainaec
left a comment
There was a problem hiding this comment.
@imnasnainaec reviewed all commit messages.
Reviewable status: 0 of 2 files reviewed, 3 unresolved discussions
Src/LexText/LexTextControls/LiftMerger.cs line 3332 at r2 (raw file):
{ var entryPronunciation = matchedEntry.Value; var liftPronunciation = matchedEntry.Key;
Thank you for not sticking with phon and pron!
Src/LexText/LexTextControls/LiftMerger.cs line 3526 at r2 (raw file):
/// <summary> /// Find the best matching pronunciation in the lex entry (if one exists) for the imported LiftPhonetic data.
-> "pronunciations in the lex entry (if they exist)"
Src/LexText/LexTextControls/LiftMerger.cs line 3531 at r2 (raw file):
/// has no media files the first matching form will be selected. /// </summary> /// <returns>best match, or null</returns>
The <returns> description needs updating.
imnasnainaec
left a comment
There was a problem hiding this comment.
Reviewable status: 0 of 2 files reviewed, 6 unresolved discussions (waiting on @jasonleenaylor)
Src/LexText/LexTextControls/LiftMerger.cs line 3562 at r2 (raw file):
Dictionary<int, string> forms = GetAllUnicodeAlternatives(entryPronunciation.Form); hasFormMatch = forms.Count == 0; formMatches = 1; // both have no forms - count it as 1 for the match
Shouldn't you only set formMatches = 1 when forms.Count == 0?
Src/LexText/LexTextControls/LiftMerger.cs line 3569 at r2 (raw file):
hasFormMatch = formMatches > 0; } if (hasFormMatch)
Why define hasFormMatch when you can just check if (formMatches > 0)?
Src/LexText/LexTextControls/LiftMerger.cs line 3580 at r2 (raw file):
if (entryPronunciation.MediaFilesOS.Count > 0) { int matchingMedia = 0;
If inside if (hasFormMatch) the first thing you do is set int matchingMedia = 0;, then your 3 return statements can combined into a single one (with if (entryPronunciation.MediaFilesOS.Count > 0) changed to an else if).
imnasnainaec
left a comment
There was a problem hiding this comment.
Reviewable status: 0 of 2 files reviewed, 9 unresolved discussions (waiting on @jasonleenaylor)
Src/LexText/LexTextControls/LexTextControlsTests/LiftMergerTests.cs line 3754 at r2 (raw file):
// Setup: Create entry with pronunciation that has one media file var entry = CreateSimpleStemEntry("503d3478-3545-4213-9f6b-1f087464e140", "test"); var misPronun = AddPronunciation(entry, "pronunciation", wsEs);
Odd to have variable misPronun with string "pronunciation".
Src/LexText/LexTextControls/LexTextControlsTests/LiftMergerTests.cs line 4170 at r2 (raw file):
// pronun2 should have been selected due to higher score (form + media match) Assert.That(pronun2.MediaFilesOS, Has.Count.EqualTo(1), "Pronunciation with media should be selected as best match");
Would it also be worth defining pronun1 and verifying that Assert.That(pronun1.MediaFilesOS, Has.Count.EqualTo(0)?
Src/LexText/LexTextControls/LexTextControlsTests/LiftMergerTests.cs line 4227 at r2 (raw file):
"Should merge when at least one media file matches"); // After merge, should have 3 media files (original 2 + 1 new from LIFT) Assert.That(pronun.MediaFilesOS, Has.Count.GreaterThanOrEqualTo(2),
Why .GreaterThanOrEqualTo(2) instead of .EqualTo(3)?
b7630b7 to
b7ff586
Compare
imnasnainaec
left a comment
There was a problem hiding this comment.
I think there's now an edge case that could cause pronunciation proliferation in repeated round-tripping:
var entry = CreateSimpleStemEntry("66EE6430-D40E-4BBF-8E17-0793E1176CF0", "test");
var pronun = AddPronunciation(entry, "pronunciation", Cache.DefaultVernWs);
var pronun = AddPronunciation(entry, "pronunciation", Cache.DefaultVernWs);
// LIFT has 2 pronunciations with same form and no media:
// Both would select the first entry pronunciation as best match
// but only the first merges; the second creates new,
// and then on the next round-trip, the 3 pronunciations will become 4.
var liftXml = @"<?xml version=""1.0"" encoding=""UTF-8""?>
<lift producer=""SIL.FLEx 8.0.3.41457"" version=""0.13"">
<header><ranges/><fields/></header>
<entry dateCreated=""2013-07-02T20:01:04Z"" dateModified=""2013-07-02T20:05:43Z"" id=""test_66EE6430-D40E-4BBF-8E17-0793E1176CF0"" guid=""66EE6430-D40E-4BBF-8E17-0793E1176CF0"">
<lexical-unit><form lang=""fr""><text>test</text></form></lexical-unit>
<trait name=""morph-type"" value=""stem""/>
<pronunciation>
<form lang=""fr""><text>pronunciation</text></form>
</pronunciation>
<pronunciation>
<form lang=""fr""><text>pronunciation</text></form>
</pronunciation>
</entry>
</lift>";
@imnasnainaec reviewed all commit messages.
Reviewable status: 0 of 2 files reviewed, all discussions resolved
b7ff586 to
38c057e
Compare
jasonleenaylor
left a comment
There was a problem hiding this comment.
Done.
Reviewable status: 0 of 2 files reviewed, all discussions resolved (waiting on @imnasnainaec)
Src/LexText/LexTextControls/LiftMerger.cs line 3562 at r2 (raw file):
Previously, imnasnainaec (D. Ror.) wrote…
Shouldn't you only set
formMatches = 1whenforms.Count == 0?
Done.
imnasnainaec
left a comment
There was a problem hiding this comment.
I like the refactor!
@imnasnainaec reviewed all commit messages.
Reviewable status: 0 of 2 files reviewed, 2 unresolved discussions
Src/LexText/LexTextControls/LexTextControlsTests/LiftMergerTests.cs line 4313 at r4 (raw file):
// Both would select the first entry pronunciation as best match // but only the first merges; the second creates new, // and then on the next round-trip, the 3 pronunciations will become 4.
This comment description matches an older algorithm description.
Src/LexText/LexTextControls/LiftMerger.cs line 3540 at r4 (raw file):
foreach (var liftPron in liftPronunciations) { ILexPronunciation bestMatch = null;
bestMatch is now unused.
eeea2e1 to
9dc1d47
Compare
imnasnainaec
left a comment
There was a problem hiding this comment.
@imnasnainaec reviewed all commit messages.
Reviewable status: 0 of 2 files reviewed, 1 unresolved discussion
Src/LexText/LexTextControls/LiftMerger.cs line 3587 at r5 (raw file):
// Any liftPron we saw but didn't match gets null foreach (var liftPron in processedLiftProns)
Shouldn't this foreach (var liftPron in processedLiftProns) be foreach (var liftPron in liftPronunciations)? If so, then you don't even need to define processedLiftProns.
I think you need to add a test case where the entry has no pronunciations but the lift entry has pronunciations.
9dc1d47 to
d04bd6c
Compare
imnasnainaec
left a comment
There was a problem hiding this comment.
@imnasnainaec reviewed all commit messages.
Reviewable status: 0 of 2 files reviewed, 2 unresolved discussions
Src/LexText/LexTextControls/LexTextControlsTests/LiftMergerTests.cs line 4381 at r6 (raw file):
"Second LIFT pronunciation should create new entry since best match was already claimed"); // The original pronunciation should have the matching media
The two "original pronunciation" comments are from a different test and don't match this test.
Src/LexText/LexTextControls/LiftMerger.cs line 3568 at r6 (raw file):
results.Add(liftPron, null); continue; }
Because the matches are sorted by descending score and all missed lift pronunciations are given null at the end
// If score is 0 assign null and move on
if (score <= 0)
{
results.Add(liftPron, null);
continue;
}
could be changed to
// If score is 0, no more matches
if (score <= 0)
{
break;
}
and moved up to right after var score = match.Item3;
Or, even better, you can get rid of it completely by putting matchScores.Add(Tuple.Create(liftPron, entryPron, score)); inside if (score <= 0) so the 0-scores don't have to be sorted.
* Consider form and media matches more systematically * Break up big test into smaller units * Add more test coverage
d04bd6c to
da2ec95
Compare
This change is