A Python-based search engine that retrieves and ranks documents from a local dataset using the **Vector Space Model (VSM)** and **TF-IDF** ranking.
This project demonstrates core concepts of Information Retrieval, Search Algorithms, and Text Processing used in real-world systems.
- Tokenization (splitting text into words)
- Stopword Removal (removing common words like the, is, a)
- Stemming (reducing words to root form)
- Inverted Index (fast word → document lookup)
- TF-IDF Ranking (relevance-based scoring)
- Title Boost (title matches weighted higher)
- Category Filter (search within specific categories)
- Boolean Search (AND, OR, NOT)
- Autocomplete suggestions
- Search History tracking
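The preprocessing steps and the inverted index listed above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the stopword list, the `naive_stem` suffix stripper (a stand-in for a real stemmer such as Porter), and the function names are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption; a real list would be larger).
STOPWORDS = {"the", "is", "a", "an", "and", "of", "in", "to"}

def naive_stem(word):
    """Strip a few common suffixes. Hypothetical stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Tokenize, drop stopwords, then stem each remaining token."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]

def build_inverted_index(docs):
    """Map each processed term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in preprocess(text):
            index[term].add(doc_id)
    return index

docs = {
    1: "Machine learning is changing the stock market",
    2: "Learning about climate change",
}
index = build_inverted_index(docs)
print(sorted(index["learn"]))  # [1, 2]
```

Looking up a term is then a single dictionary access, which is why the inverted index makes word-to-document lookup fast.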
The Vector Space Model (VSM) represents documents and queries as vectors in a multi-dimensional space.
- Each document is converted into a vector
- Each word represents a dimension
- Importance of words is calculated using TF-IDF
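As a rough sketch of how documents could become TF-IDF vectors, the example below uses one common weighting variant: term frequency as count divided by document length, and inverse document frequency as `log(N / df)`. The exact formula used in the notebook may differ, and `tfidf_vectors` is a hypothetical name chosen for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(tokenized_docs):
    """Return one sparse {term: weight} vector per document.

    Assumed weighting: tf = count / doc_length, idf = log(N / df).
    """
    n_docs = len(tokenized_docs)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))

    vectors = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return vectors

docs = [["stock", "market", "news"], ["stock", "price"], ["climate", "change"]]
vecs = tfidf_vectors(docs)

# "stock" appears in two documents, "market" in one, so "market" weighs more.
print(vecs[0]["stock"] < vecs[0]["market"])  # True
```

Each distinct term acts as one dimension; storing the vector as a dictionary avoids keeping zeros for terms a document does not contain.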
- User enters a query
- Text is processed:
  - Tokenization
  - Stopword removal
  - Stemming
- Documents are converted into vectors using TF-IDF
- Query vector is compared with document vectors
- Similarity is calculated
- Results are ranked based on relevance
Cosine similarity is used to measure the similarity between the query and each document: cos(θ) = (A · B) / (||A|| × ||B||)
- A = Query vector
- B = Document vector
- Value ranges from 0 to 1, because TF-IDF weights are never negative (0 = no shared terms, 1 = same direction)
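The formula above can be written directly over sparse vectors. The sketch below assumes vectors are stored as `{term: weight}` dictionaries; the function name and the toy weights are illustrative only.

```python
import math

def cosine_similarity(a, b):
    """cos(theta) = (A . B) / (||A|| * ||B||) for sparse {term: weight} vectors."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector matches nothing
    return dot / (norm_a * norm_b)

query = {"stock": 1.0, "market": 1.0}
documents = {
    "doc_a": {"stock": 2.0, "market": 2.0},  # same direction as the query
    "doc_b": {"climate": 1.0},               # no terms in common
}

# Rank documents by descending similarity to the query.
ranked = sorted(
    documents,
    key=lambda name: cosine_similarity(query, documents[name]),
    reverse=True,
)
print(ranked)  # ['doc_a', 'doc_b']
print(round(cosine_similarity(query, documents["doc_a"]), 3))  # 1.0
print(cosine_similarity(query, documents["doc_b"]))            # 0.0
```

Because the score depends only on direction, a long document is not favored over a short one merely for repeating the same terms more often.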
- Python
- Jupyter Notebook
- JSON (for dataset storage)
search-engine-project/
│
├── main.ipynb # Main code
├── database.json # Dataset
└── README.md # Documentation
- Clone the repository: `git clone https://github.com/Swetalin26/search-engine-project.git`
- Go to the project folder: `cd search-engine-project`
- Open Jupyter Notebook: `jupyter notebook`
- Run `search_engine.ipynb`
- machine learning
- climate change
- blockchain AND cryptocurrency
- mental health
- stock market
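A query such as `blockchain AND cryptocurrency` can be answered with set operations on an inverted index. The sketch below is one simple way to do it, under stated assumptions: operators are written in upper case, evaluation is strictly left to right with no precedence or parentheses, adjacent terms default to AND, and the index keys are already normalized. The `boolean_search` function and the toy index are hypothetical.

```python
def boolean_search(query, index, all_docs):
    """Evaluate a flat AND / OR / NOT query left to right over an inverted index."""
    result = None
    op = "AND"       # operator to apply to the next term
    negate = False   # set when the next term is preceded by NOT
    for token in query.split():
        if token in ("AND", "OR"):
            op = token
        elif token == "NOT":
            negate = True
        else:
            postings = index.get(token.lower(), set())
            if negate:
                postings = all_docs - postings
                negate = False
            if result is None:
                result = set(postings)  # copy so the index is never mutated
            elif op == "AND":
                result &= postings
            else:
                result |= postings
            op = "AND"  # adjacent terms default to AND
    return result or set()

# Toy index: term -> ids of documents containing it.
index = {
    "blockchain": {1, 2},
    "cryptocurrency": {2, 3},
    "stock": {3},
}
all_docs = {1, 2, 3}

print(boolean_search("blockchain AND cryptocurrency", index, all_docs))  # {2}
print(boolean_search("cryptocurrency NOT stock", index, all_docs))       # {2}
print(sorted(boolean_search("blockchain OR stock", index, all_docs)))    # [1, 2, 3]
```

In a full pipeline, query terms would pass through the same tokenization and stemming as the documents before the lookup, so that both sides use identical index keys.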
- Understanding Vector Space Model
- Implementing TF-IDF
- Building search using inverted index
- Applying Boolean logic in search
- Improving user experience with enhancements such as autocomplete and search history
- Web interface (React / Next.js)
- Better ranking using NLP
- Voice-based search
- Real-time data integration
Swetalin Sahoo, B.Tech Student | Aspiring Developer
This is a beginner-friendly project that demonstrates how modern search engines rank and retrieve information.