---
title: "Stemming: Reducing Words to Roots"
sidebar_label: Stemming
description: Learn how to normalize text by stripping suffixes to find the base form of words.
tags:
  - nlp
  - preprocessing
  - stemming
  - text-normalization
  - python
---

Stemming is a text-normalization technique used in Natural Language Processing to reduce a word to its "stem" or root form. The goal is to ensure that different grammatical variations of the same word (like "running" and "runs") are treated as the same item by a search engine or machine learning model. Note that irregular forms like "ran" usually defeat rule-based stemmers, since no suffix rule connects "ran" to "run."

## 1. How Stemming Works

Stemming is primarily a heuristic-based process. It uses crude rule-based algorithms to chop off the ends of words (suffixes) in the hope of reaching the base form.

Unlike Lemmatization, stemming does not use a dictionary and does not care about the context or the part of speech (POS).

**Example:**

- Input: "Universal", "University", "Universe"
- Stem: "Univers"
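You can reproduce this classic example with NLTK's `PorterStemmer` (installing `nltk` is enough; stemming needs no corpus downloads):

```python
# All three words collapse to the same (non-word) stem.
from nltk.stem import PorterStemmer

porter = PorterStemmer()

for word in ["Universal", "University", "Universe"]:
    print(f"{word} -> {porter.stem(word)}")
# Each line prints the stem "univers"
```

Note that "univers" is not a dictionary word; the stemmer only guarantees that related forms map to the same string, not that the result is a real word.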

## 2. Popular Stemming Algorithms

There are several algorithms used to perform stemming, ranging from aggressive to conservative:

| Algorithm | Characteristics | Use Case |
|---|---|---|
| **Porter Stemmer** | The oldest and most common; applies 5 phases of suffix-reduction rules. | General-purpose NLP; fast and reliable. |
| **Snowball Stemmer** | An improvement over Porter (also called Porter2); supports multiple languages. | Multilingual applications. |
| **Lancaster Stemmer** | Very aggressive; often produces stems that are not real words. | When extreme compression/normalization is needed. |

## 3. The Pitfalls of Stemming

Because stemming follows rigid rules without "understanding" the language, it often makes two types of errors:

### A. Over-stemming

This occurs when two words are reduced to the same stem even though they have different meanings.

- **Example:** "Organization" and "Organs" are both reduced to "organ".

### B. Under-stemming

This occurs when two words that should result in the same stem do not.

- **Example:** "Alumnus" and "Alumni" might remain distinct because the rules don't recognize the Latin plural.
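Both failure modes can be demonstrated directly with NLTK's `PorterStemmer`:

```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()

# Over-stemming: two words with different meanings collapse to one stem.
over = (porter.stem("organization"), porter.stem("organs"))
print(over)  # ('organ', 'organ')

# Under-stemming: two related words keep different stems,
# because no suffix rule links the Latin singular and plural.
under = (porter.stem("alumnus"), porter.stem("alumni"))
print(under)  # the two stems remain different
```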

## 4. Logical Workflow (Mermaid)

The following diagram illustrates the decision-making process of a typical rule-based stemmer like the Porter Stemmer.

```mermaid
graph TD
    Word[Input Word] --> Rule1{Ends in 'ies'?}
    Rule1 -- Yes --> Replace1[Replace with 'i']
    Rule1 -- No --> Rule2{Ends in 'ing'?}

    Rule2 -- Yes --> CheckLen{Remaining Length > 1?}
    CheckLen -- Yes --> Strip1[Remove 'ing']
    CheckLen -- No --> Keep1[Keep Word]

    Rule2 -- No --> Rule3{Ends in 's'?}
    Rule3 -- Yes --> Strip2[Remove 's']

    Replace1 --> End[Output Stem]
    Strip1 --> End
    Strip2 --> End
    Keep1 --> End
```
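The cascade in the diagram can be sketched as a toy Python function. This is a hypothetical illustration of the rule-cascade idea only, far simpler than the real Porter algorithm:

```python
def toy_stem(word: str) -> str:
    """A toy rule cascade mirroring the diagram above (not a real stemmer)."""
    if word.endswith("ies"):
        return word[:-3] + "i"     # Rule 1: replace 'ies' with 'i'
    if word.endswith("ing"):
        if len(word) - 3 > 1:      # Only strip if enough of the word remains
            return word[:-3]       # Rule 2: remove 'ing'
        return word                # Too short: keep the word as-is
    if word.endswith("s"):
        return word[:-1]           # Rule 3: remove 's'
    return word                    # No rule matched

print(toy_stem("ponies"))   # -> poni
print(toy_stem("running"))  # -> runn
print(toy_stem("sing"))     # -> sing (too short to strip 'ing')
print(toy_stem("cats"))     # -> cat
```

The rules fire in a fixed priority order and exactly one applies per word, which is the key structural idea behind rule-based stemmers.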

## 5. Implementation with NLTK

The Natural Language Toolkit (NLTK) is the most popular library for stemming in Python.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

# 1. Initialize the Porter Stemmer
porter = PorterStemmer()

words = ["connection", "connected", "connecting", "connections"]

# 2. Apply stemming to each word
stemmed_words = [porter.stem(w) for w in words]

print(f"Original: {words}")
print(f"Stemmed:  {stemmed_words}")
# Stemmed: ['connect', 'connect', 'connect', 'connect']

# 3. Using Snowball (Porter2) for better results
snowball = SnowballStemmer(language='english')
print(snowball.stem("generously"))  # Output: generous
```

## 6. When to Use Stemming?

- **Information Retrieval:** Search engines use stemming to ensure that searching for "fishing" brings up results for "fish."
- **Sentiment Analysis:** When the specific tense of a verb doesn't change the underlying emotion.
- **Speed:** When you have a massive corpus and lemmatization is too computationally expensive.
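The retrieval use case above can be sketched in a few lines: stem both the query and the documents, then match on stems. The mini-corpus here is hypothetical, made up purely for illustration:

```python
# Minimal stem-based matching: "fishing" finds a document containing "fish".
from nltk.stem import PorterStemmer

porter = PorterStemmer()
documents = ["I fish every weekend", "The fisher sold his boat"]

def doc_stems(text: str) -> set:
    """Stem every whitespace-separated token in the text."""
    return {porter.stem(tok) for tok in text.lower().split()}

query = "fishing"
hits = [doc for doc in documents if porter.stem(query) in doc_stems(doc)]
print(hits)  # Only the first document matches: "fishing" and "fish" share the stem "fish"
```

Real search engines do this at index time rather than per query, but the principle is the same: comparison happens in stem space, not surface-form space.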

Stemming is fast but "dumb." If you need your base words to be actual dictionary words and you care about grammar, you need a more sophisticated approach such as lemmatization.