Skip to content

Fix LT-22033: Improve algorithm for merging lift pronunciations#489

Merged
jasonleenaylor merged 3 commits into
release/9.3from
LT-22033
Oct 24, 2025
Merged

Fix LT-22033: Improve algorithm for merging lift pronunciations#489
jasonleenaylor merged 3 commits into
release/9.3from
LT-22033

Conversation

@jasonleenaylor

@jasonleenaylor jasonleenaylor commented Oct 17, 2025

Copy link
Copy Markdown
Contributor
  • Consider form and media matches more systematically
  • Break up big test into smaller units
  • Add more test coverage

This change is Reviewable

@imnasnainaec imnasnainaec left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@imnasnainaec reviewed all commit messages.
Reviewable status: 0 of 2 files reviewed, 3 unresolved discussions


Src/LexText/LexTextControls/LiftMerger.cs line 3332 at r2 (raw file):

			{
				var entryPronunciation = matchedEntry.Value;
				var liftPronunciation = matchedEntry.Key;

Thank you for not sticking with phon and pron!


Src/LexText/LexTextControls/LiftMerger.cs line 3526 at r2 (raw file):

		/// <summary>
		/// Find the best matching pronunciation in the lex entry (if one exists) for the imported LiftPhonetic data.

-> "pronunciations in the lex entry (if they exist)"


Src/LexText/LexTextControls/LiftMerger.cs line 3531 at r2 (raw file):

		/// has no media files the first matching form will be selected.
		/// </summary>
		/// <returns>best match, or null</returns>

The <returns> description needs updating.

@imnasnainaec imnasnainaec left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 2 files reviewed, 6 unresolved discussions (waiting on @jasonleenaylor)


Src/LexText/LexTextControls/LiftMerger.cs line 3562 at r2 (raw file):

				Dictionary<int, string> forms = GetAllUnicodeAlternatives(entryPronunciation.Form);
				hasFormMatch = forms.Count == 0;
				formMatches = 1; // both have no forms - count it as 1 for the match

Shouldn't you only set formMatches = 1 when forms.Count == 0?


Src/LexText/LexTextControls/LiftMerger.cs line 3569 at r2 (raw file):

				hasFormMatch = formMatches > 0;
			}
			if (hasFormMatch)

Why define hasFormMatch when you can just check if (formMatches > 0)?


Src/LexText/LexTextControls/LiftMerger.cs line 3580 at r2 (raw file):

				if (entryPronunciation.MediaFilesOS.Count > 0)
				{
					int matchingMedia = 0;

If inside if (hasFormMatch) the first thing you do is set int matchingMedia = 0;, then your 3 return statements can combined into a single one (with if (entryPronunciation.MediaFilesOS.Count > 0) changed to an else if).

@imnasnainaec imnasnainaec left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 2 files reviewed, 9 unresolved discussions (waiting on @jasonleenaylor)


Src/LexText/LexTextControls/LexTextControlsTests/LiftMergerTests.cs line 3754 at r2 (raw file):

			// Setup: Create entry with pronunciation that has one media file
			var entry = CreateSimpleStemEntry("503d3478-3545-4213-9f6b-1f087464e140", "test");
			var misPronun = AddPronunciation(entry, "pronunciation", wsEs);

Odd to have variable misPronun with string "pronunciation".


Src/LexText/LexTextControls/LexTextControlsTests/LiftMergerTests.cs line 4170 at r2 (raw file):

			// pronun2 should have been selected due to higher score (form + media match)
			Assert.That(pronun2.MediaFilesOS, Has.Count.EqualTo(1),
				"Pronunciation with media should be selected as best match");

Would it also be worth defining pronun1 and verifying that Assert.That(pronun1.MediaFilesOS, Has.Count.EqualTo(0)?


Src/LexText/LexTextControls/LexTextControlsTests/LiftMergerTests.cs line 4227 at r2 (raw file):

				"Should merge when at least one media file matches");
			// After merge, should have 3 media files (original 2 + 1 new from LIFT)
			Assert.That(pronun.MediaFilesOS, Has.Count.GreaterThanOrEqualTo(2),

Why .GreaterThanOrEqualTo(2) instead of .EqualTo(3)?

@imnasnainaec imnasnainaec left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's now an edge case that could cause pronunciation proliferation in repeated round-tripping:

			var entry = CreateSimpleStemEntry("66EE6430-D40E-4BBF-8E17-0793E1176CF0", "test");
			var pronun = AddPronunciation(entry, "pronunciation", Cache.DefaultVernWs);
			var pronun = AddPronunciation(entry, "pronunciation", Cache.DefaultVernWs);
			
			// LIFT has 2 pronunciations with same form and no media:
			// Both would select the first entry pronunciation as best match
			// but only the first merges; the second creates new,
			// and then on the next round-trip, the 3 pronunciations will become 4.
			var liftXml = @"<?xml version=""1.0"" encoding=""UTF-8""?>
				<lift producer=""SIL.FLEx 8.0.3.41457"" version=""0.13"">
				<header><ranges/><fields/></header>
				<entry dateCreated=""2013-07-02T20:01:04Z"" dateModified=""2013-07-02T20:05:43Z"" id=""test_66EE6430-D40E-4BBF-8E17-0793E1176CF0"" guid=""66EE6430-D40E-4BBF-8E17-0793E1176CF0"">
				<lexical-unit><form lang=""fr""><text>test</text></form></lexical-unit>
				<trait name=""morph-type"" value=""stem""/>
				<pronunciation>
				<form lang=""fr""><text>pronunciation</text></form>
				</pronunciation>
				<pronunciation>
				<form lang=""fr""><text>pronunciation</text></form>
				</pronunciation>
				</entry>
				</lift>";

@imnasnainaec reviewed all commit messages.
Reviewable status: 0 of 2 files reviewed, all discussions resolved

@jasonleenaylor jasonleenaylor left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Reviewable status: 0 of 2 files reviewed, all discussions resolved (waiting on @imnasnainaec)


Src/LexText/LexTextControls/LiftMerger.cs line 3562 at r2 (raw file):

Previously, imnasnainaec (D. Ror.) wrote…

Shouldn't you only set formMatches = 1 when forms.Count == 0?

Done.

@jasonleenaylor jasonleenaylor enabled auto-merge (squash) October 21, 2025 23:38

@imnasnainaec imnasnainaec left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the refactor!

@imnasnainaec reviewed all commit messages.
Reviewable status: 0 of 2 files reviewed, 2 unresolved discussions


Src/LexText/LexTextControls/LexTextControlsTests/LiftMergerTests.cs line 4313 at r4 (raw file):

			// Both would select the first entry pronunciation as best match
			// but only the first merges; the second creates new,
			// and then on the next round-trip, the 3 pronunciations will become 4.

This comment description matches an older algorithm description.


Src/LexText/LexTextControls/LiftMerger.cs line 3540 at r4 (raw file):

			foreach (var liftPron in liftPronunciations)
			{
				ILexPronunciation bestMatch = null;

bestMatch is now unused.

@imnasnainaec imnasnainaec left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@imnasnainaec reviewed all commit messages.
Reviewable status: 0 of 2 files reviewed, 1 unresolved discussion


Src/LexText/LexTextControls/LiftMerger.cs line 3587 at r5 (raw file):

			// Any liftPron we saw but didn't match gets null
			foreach (var liftPron in processedLiftProns)

Shouldn't this foreach (var liftPron in processedLiftProns) be foreach (var liftPron in liftPronunciations)? If so, then you don't even need to define processedLiftProns.

I think you need to add a test case where the entry has no pronunciations but the lift entry has pronunciations.

@imnasnainaec imnasnainaec left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@imnasnainaec reviewed all commit messages.
Reviewable status: 0 of 2 files reviewed, 2 unresolved discussions


Src/LexText/LexTextControls/LexTextControlsTests/LiftMergerTests.cs line 4381 at r6 (raw file):

				"Second LIFT pronunciation should create new entry since best match was already claimed");

			// The original pronunciation should have the matching media

The two "original pronunciation" comments are from a different test and don't match this test.


Src/LexText/LexTextControls/LiftMerger.cs line 3568 at r6 (raw file):

					results.Add(liftPron, null);
					continue;
				}

Because the matches are sorted by descending score and all missed lift pronunciations are given null at the end

				// If score is 0 assign null and move on
				if (score <= 0)
				{
					results.Add(liftPron, null);
					continue;
				}

could be changed to

				// If score is 0, no more matches
				if (score <= 0)
				{
					break;
				}

and moved up to right after var score = match.Item3;

Or, even better, you can get rid of it completely by putting matchScores.Add(Tuple.Create(liftPron, entryPron, score)); inside if (score <= 0) so the 0-scores don't have to be sorted.

* Consider form and media matches more systematically
* Break up big test into smaller units
* Add more test coverage
@jasonleenaylor jasonleenaylor merged commit 20ca049 into release/9.3 Oct 24, 2025
4 of 5 checks passed
@jasonleenaylor jasonleenaylor deleted the LT-22033 branch October 24, 2025 18:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants