#fastapi #spacy #nlp #text-cleaning
This project exposes a minimal FastAPI service that cleans and tokenizes English sentences with spaCy. The API lowers the barrier for downstream services that need consistent preprocessing by wrapping a reusable text-cleaning pipeline behind a single endpoint.
- Removes punctuation, stop words, and extraneous whitespace from free-form text.
- Returns both the original sentence and the filtered token list for traceability.
- Loads the `en_core_web_sm` spaCy model once at startup for efficient reuse.
- Organized into routers, models, and reusable functions for simple extension.
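The filtering listed above could be sketched roughly as follows. This is a minimal sketch, not the project's actual `functions/text_cleaner.py`: the names `filter_tokens`, `get_nlp`, and `clean_text` are hypothetical, and it assumes the filter relies on spaCy's `is_stop`, `is_punct`, and `is_space` token flags. The project loads the model at import time; the sketch instead defers loading to first use and caches it, which still loads the model only once.

```python
from functools import lru_cache
from typing import Any, Iterable, List


def filter_tokens(tokens: Iterable[Any]) -> List[str]:
    """Keep the text of tokens that are not stop words, punctuation, or whitespace.

    Accepts any objects exposing spaCy's Token flags (is_stop, is_punct,
    is_space) and a .text attribute, so a spaCy Doc can be passed directly.
    """
    return [
        tok.text
        for tok in tokens
        if not (tok.is_stop or tok.is_punct or tok.is_space)
    ]


@lru_cache(maxsize=1)
def get_nlp():
    """Load en_core_web_sm on first call and reuse it afterwards (assumed strategy)."""
    import spacy  # deferred so this module can be imported without spaCy installed

    return spacy.load("en_core_web_sm")


def clean_text(text: str) -> dict:
    """Return the original sentence alongside its filtered tokens."""
    doc = get_nlp()(text)
    return {"original": text, "tokens": filter_tokens(doc)}
```

With spaCy and the model installed, `clean_text(...)` should return a dictionary with the same `original` and `tokens` keys as the documented API response.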
app.py # FastAPI application bootstrap
functions/text_cleaner.py # spaCy-powered text cleaning utility
models/sentence_model.py # Pydantic request/response schemas
routers/text_router.py # API routes for text processing
- Clone the repository and create a virtual environment.
    git clone <repo-url>
    cd prj-1
    python -m venv .venv
    .\.venv\Scripts\Activate.ps1
- Install dependencies and download the spaCy model.
    pip install fastapi uvicorn spacy
    python -m spacy download en_core_web_sm
- Start the API server.
    uvicorn app:app --reload
- Open the interactive docs at http://127.0.0.1:8000/docs.
- POST `/api/remove_stopwords`
  - Body (`application/json`): { "text": "The quick brown fox jumps over the lazy dog!" }
  - Response (`200 OK`): { "original": "The quick brown fox jumps over the lazy dog!", "tokens": ["quick", "brown", "fox", "jumps", "lazy", "dog"] }
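To try the endpoint from another Python service, a client along these lines may help. It is a sketch using only the standard library; the helper names `build_request` and `remove_stopwords` are hypothetical, and `BASE_URL` assumes the server is running locally on uvicorn's default port, as in the setup steps.

```python
import json
import urllib.request

# Assumed local address; adjust if the server runs elsewhere.
BASE_URL = "http://127.0.0.1:8000"


def build_request(text: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST request for /api/remove_stopwords with a JSON body."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/remove_stopwords",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def remove_stopwords(text: str, base_url: str = BASE_URL) -> dict:
    """Send the request and decode the JSON response (requires a running server)."""
    with urllib.request.urlopen(build_request(text, base_url)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `remove_stopwords("The quick brown fox jumps over the lazy dog!")` against a running instance should return a dictionary with the `original` and `tokens` keys shown in the response above.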
- The spaCy model loads at import time. If you change models, update `functions/text_cleaner.py` accordingly.
- Adjust stop-word behavior by toggling `token.is_stop` or extending spaCy's vocabulary.
- Add more endpoints by creating new routers under `routers/` and registering them in `app.py`.
- Add unit tests for edge cases (empty strings, punctuation-only inputs, mixed languages).
- Consider contract tests for the FastAPI router using `TestClient` from `fastapi.testclient`.
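A contract test for the router might start from something like the sketch below. The helper names `assert_contract` and `check_remove_stopwords` are hypothetical, and the checks cover only the response shape documented above. The `client` argument is expected to behave like FastAPI's `TestClient` (a `post` method that returns an object with `status_code` and `json()`); that edge-case inputs such as empty strings also return `200` is an assumption to confirm against the real service.

```python
from typing import Any


def assert_contract(payload: Any, sent_text: str) -> None:
    """Check the documented shape: the original text plus a list of string tokens."""
    assert isinstance(payload, dict)
    assert payload.get("original") == sent_text
    tokens = payload.get("tokens")
    assert isinstance(tokens, list)
    assert all(isinstance(t, str) for t in tokens)


def check_remove_stopwords(client: Any, text: str) -> None:
    """Post text through a TestClient-like object and validate the response."""
    response = client.post("/api/remove_stopwords", json={"text": text})
    assert response.status_code == 200  # assumed for every input, including edge cases
    assert_contract(response.json(), text)
```

In a pytest module, the client would typically be built with `TestClient(app)`, importing `TestClient` from `fastapi.testclient` and `app` from `app.py`, then passed to `check_remove_stopwords` once per input under test.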