
fix: add is_base_form to FrenchLemmatizer to prevent over-lemmatizing infinitives #13965

Open
RudrenduPaul wants to merge 1 commit into explosion:master from RudrenduPaul:fix/french-lemmatizer-base-form

@RudrenduPaul

Description

Fixes #7320

French infinitives ending in -dre, -re, etc. (e.g. descendre, prendre) were incorrectly lemmatized to descendrer, prendrer: when no match was found in the lookup tables, suffix rules such as e → er were applied to them.

Root cause: FrenchLemmatizer.rule_lemmatize fully overrides the parent class method. The parent Lemmatizer.rule_lemmatize checks is_base_form before the rules loop (line 190), but the French override never makes that check, so suffix rules are applied even to tokens that are already in base form.
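
To make the failure concrete, here is a minimal sketch of how a suffix rule can over-lemmatize an infinitive. It is a simplified, hypothetical model and not spaCy's actual code: `apply_suffix_rules`, `RULES` and `KNOWN_LEMMAS` are names invented for illustration, and the single `e → er` rule is the one mentioned above.

```python
# Simplified, hypothetical model of suffix-rule lemmatization; not spaCy's actual code.
def apply_suffix_rules(string, rules, known_lemmas):
    """Apply (old_suffix, new_suffix) rules and return candidate lemmas."""
    known, unknown = [], []
    for old, new in rules:
        if not string.endswith(old):
            continue
        form = string[: len(string) - len(old)] + new
        # Prefer candidates found in the lemma index; keep the rest as fallbacks.
        (known if form in known_lemmas else unknown).append(form)
    return known or unknown or [string]


RULES = [("e", "er")]  # illustrative rule, taken from the description above
KNOWN_LEMMAS = set()   # assumption: no lookup table contains the candidate

print(apply_suffix_rules("descendre", RULES, KNOWN_LEMMAS))  # ['descendrer']
print(apply_suffix_rules("prendre", RULES, KNOWN_LEMMAS))    # ['prendrer']
```

With no base-form check in front of the loop, the rule fires on any string ending in -e, including words that are already lemmas.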

Fix:

  1. Added is_base_form() to FrenchLemmatizer: it returns True when the token's morphology contains VerbForm=Inf (same pattern as EnglishLemmatizer)
  2. Added a call to self.is_base_form(token) at the top of FrenchLemmatizer.rule_lemmatize (after the space/eol early return), mirroring the parent class pattern

French infinitives are now returned as-is (lowercased) without suffix rule application.
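
The shape of the fix can be sketched without spaCy as follows. This is an illustration under stated assumptions, not the code in this PR: `FakeToken` is a hypothetical stand-in for a spaCy token, its `morph` field is modelled as a plain dict, and the free functions `is_base_form` and `rule_lemmatize` only mimic the methods of the same name.

```python
from dataclasses import dataclass, field


@dataclass
class FakeToken:
    """Hypothetical stand-in for a spaCy Token; morph is modelled as a dict."""
    text: str
    pos: str
    morph: dict = field(default_factory=dict)


def is_base_form(token):
    # An infinitive is already the lemma, so no suffix rule should touch it.
    return token.morph.get("VerbForm") == "Inf"


def rule_lemmatize(token, rules):
    string = token.text.lower()
    # Guard added before the rules loop: return base forms as-is (lowercased).
    if is_base_form(token):
        return [string]
    for old, new in rules:
        if string.endswith(old):
            return [string[: len(string) - len(old)] + new]
    return [string]


RULES = [("e", "er")]  # illustrative rule only

infinitive = FakeToken("Descendre", "VERB", {"VerbForm": "Inf"})
finite = FakeToken("mange", "VERB", {"VerbForm": "Fin"})

print(rule_lemmatize(infinitive, RULES))  # ['descendre']
print(rule_lemmatize(finite, RULES))      # ['manger']
```

The guard depends on the morphology being set: if an upstream component does not tag an infinitive with VerbForm=Inf, this sketch still applies the suffix rules to it.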

Types of change

  • Bug fix

Checklist

  • I confirm that I have the right to submit this contribution under the project's open-source license
  • I ran the tests, and all new and existing tests passed
  • My changes don't require a change to the documentation, or if they do, I've added all required information

… infinitives

French infinitives ending in -dre, -re, etc. (e.g. 'descendre', 'prendre')
were incorrectly lemmatized to 'descendrer', 'prendrer' because suffix rules
like 'e -> er' were applied to them after no match was found in the lookup
tables.

Fix: add is_base_form() to FrenchLemmatizer that returns True when the token
morphology has VerbForm=Inf, and add a call to it at the top of rule_lemmatize
(mirroring the parent Lemmatizer.rule_lemmatize pattern and the EnglishLemmatizer
implementation). Infinitives are returned as-is without suffix rule application.

Fixes explosion#7320

Built by Rudrendu Paul, developed with Claude Code

Development

Successfully merging this pull request may close these issues.

Lemmatizer in French not getting the right lemma for some Verbs.