calebjubal/spaCy-text-cleaning-api

spaCy-tc-api

FastAPI Python spaCy

#fastapi #spacy #nlp #text-cleaning

Overview

This project exposes a minimal FastAPI service that cleans and tokenizes English sentences with spaCy. The API lowers the barrier for downstream services that need consistent preprocessing by wrapping a reusable text-cleaning pipeline behind a single endpoint.

Features

  • Removes punctuation, stop words, and extraneous whitespace from free-form text.
  • Returns both the original sentence and the filtered token list for traceability.
  • Loads the en_core_web_sm spaCy model once at startup for efficient reuse.
  • Organized into routers, models, and reusable functions for simple extension.
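
The cleaning behavior listed above can be sketched roughly as follows. This is an assumption about how functions/text_cleaner.py might be written, not the repository's actual code: the names `keep_token`, `filter_tokens`, `load_pipeline`, and `clean_text` are hypothetical, and the lowercasing is inferred from the sample response in the API section. The filter relies only on the `is_stop`, `is_punct`, `is_space`, and `text` attributes that spaCy tokens expose. The sketch also defers model loading behind a cache so the filter can be imported without spaCy, whereas the project itself is described as loading the model at import time.

```python
from functools import lru_cache
from typing import Iterable, List


def keep_token(token) -> bool:
    """Return True for tokens that are not stop words, punctuation, or whitespace."""
    return not (token.is_stop or token.is_punct or token.is_space)


def filter_tokens(tokens: Iterable) -> List[str]:
    """Lowercase and collect the tokens that survive the filter."""
    return [token.text.lower() for token in tokens if keep_token(token)]


@lru_cache(maxsize=1)
def load_pipeline():
    """Load en_core_web_sm once and reuse it on later calls."""
    import spacy  # requires: python -m spacy download en_core_web_sm

    return spacy.load("en_core_web_sm")


def clean_text(text: str) -> dict:
    """Hypothetical helper mirroring the response shape of the API."""
    doc = load_pipeline()(text)
    return {"original": text, "tokens": filter_tokens(doc)}
```

Keeping `filter_tokens` independent of the model makes it possible to unit-test the filtering rules with simple stand-in objects.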

Architecture

app.py                      # FastAPI application bootstrap
functions/text_cleaner.py   # spaCy-powered text cleaning utility
models/sentence_model.py    # Pydantic request/response schemas
routers/text_router.py      # API routes for text processing

Quickstart

  1. Clone the repository and create a virtual environment.
    git clone <repo-url>
    cd prj-1
    python -m venv .venv
    .\.venv\Scripts\Activate.ps1      # Windows PowerShell
    # macOS/Linux: source .venv/bin/activate
  2. Install dependencies and download the spaCy model.
    pip install fastapi uvicorn spacy
    python -m spacy download en_core_web_sm
  3. Start the API server.
    uvicorn app:app --reload
  4. Open the interactive docs at http://127.0.0.1:8000/docs.

API

  • POST /api/remove_stopwords
    • Body (application/json):
       {
         "text": "The quick brown fox jumps over the lazy dog!"
       }
    • Response (200 OK):
       {
         "original": "The quick brown fox jumps over the lazy dog!",
         "tokens": ["quick", "brown", "fox", "jumps", "lazy", "dog"]
       }
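
For callers that prefer a script over the interactive docs, a minimal client sketch using only the Python standard library might look like this. It assumes the server from the Quickstart is running locally on port 8000; the helper names `build_request` and `remove_stopwords` are illustrative and not part of the project.

```python
import json
import urllib.request

# Local development server from the Quickstart (an assumption for this sketch).
API_URL = "http://127.0.0.1:8000/api/remove_stopwords"


def build_request(text: str, url: str = API_URL) -> urllib.request.Request:
    """Build a JSON POST request for the remove_stopwords endpoint."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def remove_stopwords(text: str, url: str = API_URL, timeout: float = 5.0) -> dict:
    """Send the request and decode the JSON response (the server must be running)."""
    with urllib.request.urlopen(build_request(text, url), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server up, `remove_stopwords("The quick brown fox jumps over the lazy dog!")` should return a dictionary with the `original` and `tokens` keys shown above.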

Development Notes

  • The spaCy model loads at import time. If you change models, update functions/text_cleaner.py accordingly.
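  • Adjust stop-word behavior by changing the token.is_stop check in the filter, or by customizing spaCy's stop-word data (for example, setting the is_stop flag on entries in nlp.vocab). The token.is_stop attribute itself is read-only.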
  • Add more endpoints by creating new routers under routers/ and registering them in app.py.
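
As an illustration of the stop-word note above, the following sketch flips the `is_stop` flag on individual vocabulary entries. It uses `spacy.blank("en")` instead of en_core_web_sm so that no model download is needed; the helper names `set_stop_word` and `non_stop_tokens` are hypothetical and not taken from the repository.

```python
import spacy

# A blank English pipeline provides the tokenizer and the default stop-word
# flags, which is enough for this illustration.
nlp = spacy.blank("en")


def set_stop_word(word: str, is_stop: bool) -> None:
    """Hypothetical helper: flip the stop-word flag on a vocabulary entry."""
    nlp.vocab[word].is_stop = is_stop


def non_stop_tokens(text: str) -> list:
    """Return the tokens that are neither stop words nor punctuation."""
    return [t.text for t in nlp(text) if not (t.is_stop or t.is_punct)]


set_stop_word("fox", True)    # treat a domain-specific word as noise
set_stop_word("over", False)  # keep a word that is a stop word by default
```

The flag is stored per vocabulary entry, so differently cased variants (such as "Fox") may need to be set separately.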

Testing Ideas

  • Add unit tests for edge cases (empty strings, punctuation-only inputs, mixed languages).
  • Consider contract tests for the FastAPI router using TestClient from fastapi.testclient.
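
A contract check along these lines could back the ideas above. It is a sketch under assumptions: the function name `check_remove_stopwords_contract` is made up for illustration, and it verifies only the response shape documented in the API section. It accepts any client object with a `post(url, json=...)` method, so it can be run against `TestClient(app)` or a stub.

```python
from typing import Any, List

ENDPOINT = "/api/remove_stopwords"


def check_remove_stopwords_contract(client: Any, text: str) -> List[str]:
    """POST `text` and assert the response matches the documented shape."""
    response = client.post(ENDPOINT, json={"text": text})
    assert response.status_code == 200, response.status_code

    payload = response.json()
    assert set(payload) >= {"original", "tokens"}
    assert payload["original"] == text
    assert isinstance(payload["tokens"], list)
    assert all(isinstance(token, str) for token in payload["tokens"])
    return payload["tokens"]
```

In a real test module this might be called as `check_remove_stopwords_contract(TestClient(app), "The quick brown fox!")`, with `TestClient` imported from `fastapi.testclient` and `app` imported from app.py. How the service responds to empty or punctuation-only input is not documented here, so those cases are worth asserting explicitly once the intended behavior is decided.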
