Alien0427/semantic-search-api

Semantic Search API: GMM Clustering & Cluster-Aware LRU Cache

A production-ready, Dockerized semantic search engine built on the 20 Newsgroups dataset. It features GMM-based soft clustering, FAISS vector search, and a custom cluster-partitioned LRU cache.


[Figure: System architecture diagram]

1. GMM Clustering over K-Means

Clustering uses a Gaussian Mixture Model (GMM) rather than K-Means because topics in natural language overlap, and a single document can belong to more than one of them.

Embeddings

Documents are embedded with the Sentence Transformers model all-MiniLM-L6-v2.

Optimal Cluster Selection (BIC)

We evaluate several candidate cluster counts: $K \in \{10, 15, 20, 25\}$

Using the Bayesian Information Criterion (BIC):

$$BIC = -2 \log L + p \log n$$

Where:

  • L = maximized likelihood of the fitted model
  • p = number of free parameters
  • n = number of samples

Result:

  • K = 10 produced the optimal (lowest) BIC.
  • K = 20 (human label count) resulted in significant overfitting.
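
The selection rule above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the repository's code: the helper names (`gmm_param_count`, `bic`, `select_k`) are hypothetical, and the covariance type is an assumption because the README does not state which one the model uses. The parameter count follows the usual accounting for a GMM (covariances, means, and K − 1 free mixture weights).

```python
import math


def gmm_param_count(k: int, d: int, covariance_type: str = "diag") -> int:
    """Free parameters p of a k-component GMM in d dimensions."""
    if covariance_type == "full":
        cov = k * d * (d + 1) // 2  # one symmetric d x d matrix per component
    elif covariance_type == "diag":
        cov = k * d                 # one variance per dimension per component
    else:
        raise ValueError(f"unsupported covariance type: {covariance_type}")
    means = k * d
    weights = k - 1                 # mixture weights sum to 1
    return cov + means + weights


def bic(log_likelihood: float, p: int, n: int) -> float:
    """BIC = -2 log L + p log n; lower is better."""
    return -2.0 * log_likelihood + p * math.log(n)


def select_k(log_likelihoods: dict[int, float], d: int, n: int,
             covariance_type: str = "diag") -> int:
    """Return the candidate K with the lowest BIC.

    log_likelihoods maps each K to the total log-likelihood of its fitted model.
    """
    scores = {
        k: bic(ll, gmm_param_count(k, d, covariance_type), n)
        for k, ll in log_likelihoods.items()
    }
    return min(scores, key=scores.get)
```

Because p grows linearly with K, a larger K must improve the log-likelihood enough to pay for the extra `p log n` penalty, which is how a smaller K can win even when a larger K fits the training data more closely.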

Fuzzy Cluster Membership

Unlike K-Means, which assigns each document to exactly one cluster, a Gaussian Mixture Model produces a probability of membership for every cluster.

Example: A document about gun laws may simultaneously belong to the:

  • Politics cluster
  • Firearms cluster
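
To make soft membership concrete, here is a toy one-dimensional sketch of how a mixture model turns a point into per-cluster probabilities (often called responsibilities). The function names and the two-component setup are assumptions for illustration only; the real model operates on high-dimensional embeddings.

```python
import math


def gaussian_pdf(x: float, mean: float, std: float) -> float:
    """Density of a 1-D normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def responsibilities(x: float, weights: list[float],
                     means: list[float], stds: list[float]) -> list[float]:
    """Posterior probability of each component given x (sums to 1)."""
    joint = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(joint)
    return [j / total for j in joint]


# A point halfway between two equally weighted components belongs to both.
probs = responsibilities(0.0, weights=[0.5, 0.5], means=[-1.0, 1.0], stds=[1.0, 1.0])

# The dominant cluster is simply the component with the highest probability.
dominant = max(range(len(probs)), key=probs.__getitem__)
```

In scikit-learn, `GaussianMixture.predict_proba` returns this kind of per-component probability for each sample.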

2. Cluster-Aware Custom LRU Cache

A native in-memory cache implementation designed to avoid external dependencies like Redis.

Data Structure

Built using collections.OrderedDict, guaranteeing:

  • O(1) insertion
  • O(1) eviction
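
A minimal sketch of an LRU cache on top of `collections.OrderedDict` is shown below. The class name `LRUCache` and its methods are hypothetical, not taken from the repository; the sketch only illustrates how `move_to_end` and `popitem(last=False)` provide constant-time recency updates and eviction.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used entry."""

    def __init__(self, capacity: int):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)      # mark as most recently used
        return self._items[key]

    def put(self, key, value) -> None:
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the oldest entry

    def __len__(self) -> int:
        return len(self._items)
```

For example, with a capacity of 2, inserting `a` and `b`, reading `a`, and then inserting `c` evicts `b`, because `a` was touched more recently.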

Cluster Partitioning

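Cache keys are partitioned by the dominant cluster ID, so a lookup scans only the partition that matches the query's cluster. Assuming roughly balanced clusters, search complexity reduces from: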

$$O(N \cdot d)$$

to:

$$O((N/K) \cdot d)$$

Where:

  • N = total documents
  • K = clusters
  • d = embedding dimension
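
One way to picture the partitioning is a dictionary of per-cluster LRU buckets, where a lookup only touches the bucket for the query's dominant cluster. The sketch below is an assumption about the layout (the class `PartitionedCache` and its method names are hypothetical), and the N/K reduction only holds when entries are spread fairly evenly across clusters.

```python
from collections import OrderedDict, defaultdict


class PartitionedCache:
    """One LRU-ordered bucket per dominant cluster ID."""

    def __init__(self, capacity_per_cluster: int):
        self.capacity = capacity_per_cluster
        self._buckets = defaultdict(OrderedDict)

    def put(self, cluster_id: int, query: str, result: str) -> None:
        bucket = self._buckets[cluster_id]
        bucket[query] = result
        bucket.move_to_end(query)            # newest entry goes last
        if len(bucket) > self.capacity:
            bucket.popitem(last=False)       # evict within this cluster only

    def candidates(self, cluster_id: int) -> list:
        """Entries a lookup for this cluster has to scan."""
        return list(self._buckets.get(cluster_id, {}).items())
```

A query assigned to cluster 3 is compared only against `candidates(3)`; entries cached under other clusters are never scanned.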

Semantic Cache Hits

Cache lookups use cosine similarity with the following threshold:

$$\tau = 0.92$$

If similarity $\ge \tau$, the query is treated as a semantic cache hit.
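
The threshold check can be sketched as follows. This is a plain-Python illustration with hypothetical names (`cosine_similarity`, `semantic_lookup`); it assumes cached entries are stored as (embedding, result) pairs, which the README does not specify.

```python
import math

TAU = 0.92  # similarity threshold from the section above


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def semantic_lookup(query_vec: list[float], entries: list, tau: float = TAU):
    """Return the cached result most similar to the query, or None on a miss.

    entries is a list of (embedding, result) pairs from one cluster partition.
    """
    best_score, best_result = -1.0, None
    for vec, result in entries:
        score = cosine_similarity(query_vec, vec)
        if score > best_score:
            best_score, best_result = score, result
    return best_result if best_score >= tau else None
```

A high threshold such as 0.92 keeps paraphrases of a cached query as hits while sending merely related queries through to the full search path.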


3. Data Preprocessing

To ensure embeddings represent topical semantics rather than noise, the preprocessing pipeline removes:

  • Email headers
  • Email footers
  • Quoted replies

This prevents embeddings from overfitting to email domains, signatures, or message formatting artifacts.
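
As a rough illustration of this kind of cleanup, the sketch below strips a header block, quoted lines, and a trailing signature from a raw post. The heuristics (headers end at the first blank line, quotes start with `>`, signatures follow a `--` line) and the function name `strip_post` are assumptions, not the repository's actual rules. scikit-learn's `fetch_20newsgroups` loader also offers a `remove=("headers", "footers", "quotes")` option that performs comparable stripping.

```python
import re

# Attribution lines such as "Joe writes:" that introduce a quoted reply.
_ATTRIBUTION_RE = re.compile(r"(writes|wrote):\s*$")


def strip_post(raw: str) -> str:
    """Remove headers, quoted replies, and the signature from a raw post."""
    # Assumption: the header block ends at the first blank line.
    _, sep, body = raw.partition("\n\n")
    if not sep:
        body = raw  # no blank line found, so treat everything as body

    kept = []
    for line in body.splitlines():
        if line.rstrip() == "--":            # conventional signature delimiter
            break
        if line.lstrip().startswith(">"):    # quoted reply
            continue
        if _ATTRIBUTION_RE.search(line):     # "X writes:" attribution
            continue
        kept.append(line)
    return "\n".join(kept).strip()
```

Heuristics like these are lossy: a post without a header block but with a blank line would lose its first paragraph, so real pipelines usually need dataset-specific checks.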


4. Tech Stack

API

  • FastAPI (Async)

Machine Learning

  • PyTorch
  • Sentence Transformers
  • FAISS
  • Scikit-Learn

DevOps

  • Docker
  • Docker Compose

5. Quickstart

Run with Docker

docker-compose up --build

Interactive API Docs

http://127.0.0.1:8000/docs


6. API Example

Endpoint

POST /query

Request Payload

{
  "query": "How do I upgrade my computer RAM?"
}

Response

{
  "query": "How do I upgrade my computer RAM?",
  "cache_hit": false,
  "result": "I am looking to upgrade the memory on my motherboard. What is the best...",
  "dominant_cluster": 3
}

Response Fields

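  • query: The query string echoed back from the request.
  • cache_hit: Whether the response was served from the semantic cache.
  • result: Text of the best-matching document found through the FAISS index.
  • dominant_cluster: Cluster ID assigned by the BIC-tuned GMM model.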
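
A small client for this endpoint can be written with the Python standard library alone. The sketch assumes the service is running locally at the address from the Quickstart; the helper names `build_query_request` and `run_query` are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # address from the Quickstart section


def build_query_request(query: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST /query request without sending it."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(query: str, base_url: str = BASE_URL, timeout: float = 10.0) -> dict:
    """Send the query and return the decoded JSON response."""
    request = build_query_request(query, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the container running, `run_query("How do I upgrade my computer RAM?")` should return a dictionary with the fields listed above. Sending a close paraphrase afterwards is a simple way to check whether `cache_hit` flips to true.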

7. Architecture Summary

Embedding Model → GMM Soft Clustering → Cluster Partitioned FAISS Index → Cluster-Aware LRU Cache → FastAPI Query Layer
