A production-ready, Dockerized semantic search engine built on the 20 Newsgroups dataset. It features GMM-based soft clustering, FAISS vector search, and a custom cluster-partitioned LRU cache.
Designed from first principles to handle the overlapping nature of natural language.
Documents are vectorized with the all-MiniLM-L6-v2 Sentence Transformers model.
We evaluate different numbers of clusters (K) using the Bayesian Information Criterion (BIC):
$$\mathrm{BIC} = p \ln(n) - 2 \ln(L)$$
Where:
- L = likelihood of the model
- p = number of parameters
- n = number of samples
Result:
- K = 10 produced the optimal (lowest) BIC.
- K = 20 (human label count) resulted in significant overfitting.
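The selection step above can be sketched in a few lines. This is a minimal illustration, not the project's code: it assumes full-covariance Gaussians, and the function names (`gmm_param_count`, `bic`, `select_k`) and the sample log-likelihoods are hypothetical.

```python
import math


def gmm_param_count(k: int, d: int) -> int:
    """Free parameters of a full-covariance GMM (an assumption) with k components in d dims."""
    means = k * d
    covariances = k * d * (d + 1) // 2  # one symmetric d x d matrix per component
    weights = k - 1                     # mixture weights sum to 1
    return means + covariances + weights


def bic(log_likelihood: float, n_params: int, n_samples: int) -> float:
    """BIC = p * ln(n) - 2 * ln(L); lower is better."""
    return n_params * math.log(n_samples) - 2.0 * log_likelihood


def select_k(log_likelihoods: dict[int, float], d: int, n_samples: int) -> int:
    """Return the candidate K whose fitted model has the lowest BIC."""
    scores = {
        k: bic(ll, gmm_param_count(k, d), n_samples)
        for k, ll in log_likelihoods.items()
    }
    return min(scores, key=scores.get)


# Illustrative total log-likelihoods only, not results from this project.
example = {1: -1500.0, 2: -1200.0, 3: -1195.0}
best_k = select_k(example, d=2, n_samples=500)
```

In the example, K = 3 fits slightly better than K = 2, but the extra parameters cost more than the likelihood gain, so K = 2 wins. With scikit-learn, `GaussianMixture.bic(X)` returns this score for a fitted model directly.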
Unlike K-Means, which assigns each document to exactly one cluster, Gaussian Mixture Models produce a membership probability for every cluster.
Example: A document about gun laws may simultaneously belong to:
- Politics cluster
- Firearms cluster
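To make the soft-assignment idea concrete, here is a small sketch that reads a per-cluster probability vector (the kind of row scikit-learn's `GaussianMixture.predict_proba` returns). The helper names and the `min_prob` cutoff are assumptions for illustration, not values taken from this project.

```python
def dominant_cluster(probs: list[float]) -> int:
    """Index of the most probable cluster."""
    return max(range(len(probs)), key=probs.__getitem__)


def soft_memberships(probs: list[float], min_prob: float = 0.2) -> list[int]:
    """All clusters whose probability reaches min_prob (a hypothetical cutoff)."""
    return [i for i, p in enumerate(probs) if p >= min_prob]


# Hypothetical probabilities for one document: [other, politics, firearms]
gun_law_doc = [0.05, 0.55, 0.40]
main = dominant_cluster(gun_law_doc)     # 1
members = soft_memberships(gun_law_doc)  # [1, 2]
```

A hard assignment would keep only cluster 1 and discard the 40% affinity to cluster 2.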
A native in-memory cache implementation designed to avoid external dependencies like Redis.
Built using collections.OrderedDict, guaranteeing:
- O(1) insertion
- O(1) eviction
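A minimal LRU cache on top of `collections.OrderedDict` might look like the sketch below. The class name and method names are hypothetical; the project's actual interface is not shown in this README.

```python
from collections import OrderedDict
from typing import Any, Hashable, Optional


class LRUCache:
    """Least-recently-used cache with O(1) get, put, and eviction."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get(self, key: Hashable) -> Optional[Any]:
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key: Hashable, value: Any) -> None:
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used entry


cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # touching "a" makes "b" the oldest entry
cache.put("c", 3)  # evicts "b"
```

`move_to_end` and `popitem(last=False)` are both constant-time, which is where the O(1) guarantees come from.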
Cache keys are partitioned using the dominant cluster ID. Search complexity reduces from:
$$O(N \cdot d)$$
to:
$$O\left(\frac{N}{K} \cdot d\right)$$
Where:
- N = total documents
- K = clusters
- d = embedding dimension
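The partitioning idea can be sketched as one LRU bucket per cluster, so a lookup only scans entries that share the query's dominant cluster. The class below is a hypothetical illustration (names and per-cluster capacity are assumptions), and the N/K saving holds only when clusters are roughly balanced.

```python
from collections import OrderedDict, defaultdict
from typing import Any


class ClusterPartitionedCache:
    """One LRU bucket per dominant cluster ID."""

    def __init__(self, capacity_per_cluster: int) -> None:
        self.capacity = capacity_per_cluster
        self._buckets: "defaultdict[int, OrderedDict[str, Any]]" = defaultdict(OrderedDict)

    def put(self, cluster_id: int, query: str, result: Any) -> None:
        bucket = self._buckets[cluster_id]
        if query in bucket:
            bucket.move_to_end(query)
        bucket[query] = result
        if len(bucket) > self.capacity:
            bucket.popitem(last=False)  # evict this bucket's oldest entry

    def candidates(self, cluster_id: int) -> list[tuple[str, Any]]:
        """Entries to compare against: only the bucket for this cluster."""
        return list(self._buckets.get(cluster_id, {}).items())


pcache = ClusterPartitionedCache(capacity_per_cluster=100)
pcache.put(0, "gun control debate", "result A")
pcache.put(0, "second amendment", "result B")
pcache.put(3, "upgrade computer RAM", "result C")
to_scan = pcache.candidates(3)  # one entry, not all three
```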
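Cache lookups compare the incoming query embedding with cached query embeddings using cosine similarity.
If the similarity meets the configured threshold, the lookup is treated as a cache hit.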
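A threshold-based lookup of this kind could be sketched as follows. The function names are hypothetical, and `threshold` is left as a required argument because this README does not state the value; the 0.9 used in the example is only a placeholder.

```python
import math
from typing import Any, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def find_cache_hit(
    query_vec: Sequence[float],
    cached: Sequence[tuple[Sequence[float], Any]],
    threshold: float,
) -> Optional[Any]:
    """Return the best cached result whose similarity is >= threshold, else None."""
    best_result, best_score = None, threshold
    for vec, result in cached:
        score = cosine_similarity(query_vec, vec)
        if score >= best_score:
            best_result, best_score = result, score
    return best_result


entries = [([1.0, 0.0], "cached A"), ([0.0, 1.0], "cached B")]
hit = find_cache_hit([0.99, 0.05], entries, threshold=0.9)   # close to A
miss = find_cache_hit([1.0, 1.0], entries, threshold=0.9)    # similarity ~0.71 to both
```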
To ensure embeddings represent topical semantics rather than noise, the preprocessing pipeline removes:
- Email headers
- Email footers
- Quoted replies
This prevents embeddings from overfitting to email domains, signatures, or message formatting artifacts.
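As a rough illustration of this cleanup, the sketch below applies three simple heuristics: headers end at the first blank line, quoted replies start with `>`, and the footer begins at the conventional `-- ` signature delimiter. These heuristics and the function name are assumptions, not the project's actual pipeline.

```python
def strip_newsgroup_noise(raw: str) -> str:
    """Drop headers, quoted replies, and the signature from one raw post."""
    # Headers: everything before the first blank line (assumes headers are present).
    _, sep, body = raw.partition("\n\n")
    if not sep:
        body = raw  # no blank line found, so treat the whole text as body

    kept = []
    for line in body.splitlines():
        if line.rstrip() == "--":          # signature delimiter: footer starts here
            break
        if line.lstrip().startswith(">"):  # quoted reply
            continue
        kept.append(line)
    return "\n".join(kept).strip()


post = (
    "From: someone@example.com\n"
    "Subject: RAM upgrade\n"
    "\n"
    "> What board do you have?\n"
    "I want to upgrade my RAM.\n"
    "-- \n"
    "Joe\n"
    "joe@example.com"
)
clean = strip_newsgroup_noise(post)
```

scikit-learn's loader offers the same three removals via `fetch_20newsgroups(remove=("headers", "footers", "quotes"))`; whether this project relies on it is not stated here.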
- FastAPI (Async)
- PyTorch
- Sentence Transformers
- FAISS
- Scikit-Learn
- Docker
- Docker Compose
docker-compose up --build
http://127.0.0.1:8000/docs
POST /query
Request body:
{
  "query": "How do I upgrade my computer RAM?"
}
Response:
{
  "query": "How do I upgrade my computer RAM?",
  "cache_hit": false,
  "result": "I am looking to upgrade the memory on my motherboard. What is the best...",
  "dominant_cluster": 3
}
- result: Actual text retrieved from the FAISS index.
- dominant_cluster: Cluster ID assigned by the BIC-tuned GMM model.
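For calling the endpoint from code rather than the docs page, a standard-library client could look like this sketch. It assumes the service is running locally on port 8000 as shown above; the helper names are hypothetical.

```python
import json
import urllib.request


def build_query_request(
    query: str, base_url: str = "http://127.0.0.1:8000"
) -> urllib.request.Request:
    """Build the POST /query request without sending it."""
    payload = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/query",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_query(query: str) -> dict:
    """Send the query and decode the JSON response (requires a running server)."""
    with urllib.request.urlopen(build_query_request(query)) as resp:
        return json.load(resp)


request = build_query_request("How do I upgrade my computer RAM?")
```

Sending the same query twice with `send_query` is a simple way to watch `cache_hit` in the response change once the first result has been cached.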
Embedding Model → GMM Soft Clustering → Cluster Partitioned FAISS Index → Cluster-Aware LRU Cache → FastAPI Query Layer
