---
title: "Lemmatization: Context-Aware Normalization"
sidebar_label: Lemmatization
description: Understanding how to return words to their dictionary base forms using morphological analysis.
tags:
---
Lemmatization is the process of grouping together the inflected forms of a word so they can be analyzed as a single item, identified by the word's lemma.
Unlike stemming, which simply chops off suffixes, lemmatization uses a vocabulary and morphological analysis to return the dictionary form of a word.
The primary difference is contextual intelligence: a stemmer operates on a single word in isolation, while a lemmatizer considers the word's meaning and its part-of-speech (POS) tag.
| Word | Stemming (Porter) | Lemmatization (WordNet) |
|---|---|---|
| Studies | studi | study |
| Studying | studi | study |
| Was | wa | be |
| Mice | mice | mouse |
| Better | better | good (if context is adjective) |
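To see how these outputs arise, here is a minimal sketch that reproduces the table with NLTK's PorterStemmer and WordNetLemmatizer (assuming NLTK and its WordNet data are installed). Note that the lemmatizer only yields the dictionary forms above because we pass the appropriate POS tag by hand.

```python
# Sketch reproducing the comparison table above; assumes nltk is installed.
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # fetch the WordNet data if missing

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# (word, WordNet POS) pairs: "v" = verb, "n" = noun, "a" = adjective
examples = [("studies", "v"), ("studying", "v"), ("was", "v"),
            ("mice", "n"), ("better", "a")]

for word, pos in examples:
    print(f"{word:10} stem: {stemmer.stem(word):8} lemma: {lemmatizer.lemmatize(word, pos=pos)}")
```

Without an explicit POS tag, the WordNet lemmatizer assumes every word is a noun, which is exactly the manual bookkeeping that spaCy's pipeline (shown later) takes care of automatically.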
A lemmatizer’s behavior changes depending on whether a word is a noun, verb, or adjective.
For example, the word "saw":
- If Verb: Lemma is "see" (e.g., "I saw the movie").
- If Noun: Lemma is "saw" (e.g., "The carpenter used a saw").
Most modern lemmatizers (like those in spaCy) automatically detect the POS tag to provide the correct lemma.
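As a quick illustration, here is a minimal sketch using spaCy (assuming the small English model en_core_web_sm is installed); the lemma assigned to "saw" simply follows the POS tag the model predicts in each sentence.

```python
# Minimal sketch: the lemma of "saw" depends on the predicted POS tag.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

for sentence in ("I saw the movie.", "The carpenter used a saw."):
    for token in nlp(sentence):
        if token.text.lower() == "saw":
            print(f"{sentence!r:30} POS: {token.pos_:5} lemma: {token.lemma_}")
```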
The following diagram shows how a lemmatizer uses linguistic resources to find the base form.
```mermaid
graph TD
    Word[Input Token] --> POS[POS Tagging]
    POS --> Morph[Morphological Analysis]
    Morph --> Lookup{Dictionary Lookup}
    Lookup -- "Found" --> Lemma[Return Lemma]
    Lookup -- "Not Found" --> Identity[Return Original Word]
```
While NLTK's WordNetLemmatizer requires you to pass POS tags manually, spaCy performs lemmatization as part of its default pipeline, using its own POS predictions, which makes it far more accurate out of the box.
```python
import spacy

# 1. Load the English language model
nlp = spacy.load("en_core_web_sm")

text = "The mice were running better than the cats."

# 2. Process the text
doc = nlp(text)

# 3. Extract lemmas
lemmas = [token.lemma_ for token in doc]

print(f"Original: {[token.text for token in doc]}")
print(f"Lemmas: {lemmas}")
# Output: ['the', 'mouse', 'be', 'run', 'well', 'than', 'the', 'cat', '.']
```

Lemmatization is most valuable in scenarios such as:

- Chatbots & QA Systems: Where understanding the precise meaning and dictionary form is vital for retrieving information.
- Topic Modeling: To normalize related forms such as "organizing" and "organization" consistently, without over-stemming them to a root that has lost its meaning.
- High-Accuracy NLP: Whenever computational resources allow for a slightly slower preprocessing step in exchange for significantly better data quality.
Further reading:

- spaCy Documentation: Lemmatization and Morphology
- WordNet: A Lexical Database for English
- NLTK: WordNet Lemmatizer Tutorial
Lemmatization provides us with clean, dictionary-form words. But how do we turn these words into high-dimensional vectors that a model can actually "understand"?