🕷️ CrawlX - Advanced Web Scraping Platform

A powerful, production-ready web scraping platform with modern React frontend, FastAPI backend, and intelligent content extraction

Features • Quick Start • Documentation • API

🎯 What is CrawlX?

CrawlX is a full-stack web scraping platform combining powerful backend capabilities with a stunning modern UI. Built to handle everything from simple static pages to complex multi-source data aggregation.

Why CrawlX?

✅ Scrape ANY website - Windows-compatible HTTP-based scraper with smart extraction
✅ Beautiful Modern UI - Next.js 14 with Three.js 3D particle effects
✅ Pre-configured Scrapers - Built-in Scrapy spiders for news and jobs
✅ Dark/Light Theme - Seamless theme switching with smooth transitions
✅ Multiple Export Formats - CSV, PDF, and JSON
✅ Fuzzy Search - Find content even with typos
✅ Real-time Dashboard - Live stats and data visualization
✅ Production Ready - Battle-tested, Windows compatible, fully documented

✨ Features

🌐 Custom URL Scraper

Windows Compatible ✅ - No subprocess issues!
HTTP-based with Trafilatura - Intelligent content extraction
Smart Extraction - Auto-detects articles, tables, lists
Fast & Lightweight - No browser overhead
Works on ~80% of websites - Static sites, blogs, news, e-commerce

📰 Pre-configured Scrapers

News Spider - Tech news scraping (30 items per run)
Jobs Spider - Job listings scraper
Automated Scheduling - Runs every 6 hours
Built with Scrapy - Industrial-strength scraping

🎨 Modern Frontend

Next.js 14 - React with server components
TypeScript - Type-safe development
Three.js - Interactive 3D particle background
Tailwind CSS - Beautiful, responsive design
Dark Mode - Toggle between themes
Real-time Stats - Live dashboard updates

💾 Database & Search

PostgreSQL - Robust data storage
Fuzzy Search - PostgreSQL trigrams for similarity search
Full-text Search - Fast content search
Connection Pooling - Optimized performance

📊 Data Export

CSV Export - Spreadsheet-friendly format
PDF Export - Professional reports with ReportLab
JSON Export - API-ready data

🚀 Quick Start

Prerequisites

Python 3.12+
Node.js 18+
PostgreSQL 14+
Git

Installation (3 minutes)

Clone the repository

git clone https://github.com/juni2003/CrawlX-Data-Scrapping-Project.git
cd CrawlX-Data-Scrapping-Project

Setup Backend

cd backend

# Create virtual environment
python -m venv venv

# Activate (Windows)
venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Create database
createdb crawlx

# Enable fuzzy search
psql -d crawlx -f migrations/001_enable_fuzzy_search.sql

Setup Frontend

cd ../frontend
npm install

Start Application

Option 1: Quick Start (Windows)

# From project root
start-all.bat

Option 2: Manual Start

# Terminal 1 - Backend
cd backend
uvicorn main:app --reload --port 8000

# Terminal 2 - Frontend  
cd frontend
npm run dev

Access Application

🌐 Frontend: http://localhost:3000
📡 Backend API: http://localhost:8000
📚 API Docs: http://localhost:8000/docs

📖 Usage

Web Interface

Dashboard (/)

View total scraped items
Quick scraping controls
Real-time statistics

Custom Scraper (/scraper)

Enter any URL to scrape
Choose extraction type
View extracted content, tables, lists

Data Explorer (/data)

Search and filter items
Export to CSV/PDF/JSON
View detailed information

API Usage

import requests

# Scrape custom URL
response = requests.post("http://localhost:8000/scrape/url", json={
    "url": "https://books.toscrape.com",
    "extract_type": "auto",
    "wait_for": 2
})
print(response.json())

# Get all items
items = requests.get("http://localhost:8000/items?limit=100")

# Search with fuzzy matching
results = requests.get("http://localhost:8000/items/search?query=technology")

# Export data
csv_data = requests.get("http://localhost:8000/export/csv")
with open("data.csv", "wb") as f:
    f.write(csv_data.content)

See examples/api_usage.py for more examples.

🏗️ Architecture

┌─────────────────────────────────────────────────────────┐
│               Frontend (Next.js 14 + TypeScript)         │
│  ┌──────────┐  ┌───────────┐  ┌─────────────┐          │
│  │Dashboard │  │  Scraper  │  │Data Explorer│          │
│  │(3D UI)   │  │(Custom)   │  │(Search)     │          │
│  └──────────┘  └───────────┘  └─────────────┘          │
└─────────────────────┬───────────────────────────────────┘
                      │ REST API
┌─────────────────────┴───────────────────────────────────┐
│                Backend (FastAPI + Scrapy)                │
│  ┌──────────┐  ┌──────────┐  ┌────────────────────┐    │
│  │ REST API │  │Scheduler │  │  Scraping Engine   │    │
│  │          │  │(Every 6h)│  │  (httpx+trafilatura)│   │
│  └──────────┘  └──────────┘  └────────────────────┘    │
└─────────────────────┬───────────────────────────────────┘
                      │
┌─────────────────────┴───────────────────────────────────┐
│           PostgreSQL Database (with Fuzzy Search)        │
│  ┌─────────────────────────────────────────────────┐    │
│  │ scraped_items (full-text + trigram indexing)   │    │
│  └─────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────┘

🛠️ Tech Stack

Backend

Technology	Version	Purpose
FastAPI	0.115.8	Modern async web framework
SQLAlchemy	2.0.37	SQL toolkit and ORM
PostgreSQL	14+	Relational database
Scrapy	Latest	Industrial web scraping
httpx	0.28.0	Async HTTP client
Trafilatura	2.0.0	Content extraction
APScheduler	3.10.4	Job scheduling
NLTK	Latest	Text summarization

Frontend

Technology	Version	Purpose
Next.js	14	React framework
TypeScript	5.0	Type safety
Tailwind CSS	3.4	Styling
Three.js	Latest	3D graphics
React Three Fiber	Latest	React + Three.js
Axios	Latest	HTTP client

📂 Project Structure

CrawlX-Data-Scrapping-Project/
│
├── 📁 backend/
│   ├── main.py                      # FastAPI application
│   ├── models.py                    # SQLAlchemy models
│   ├── schemas.py                   # Pydantic schemas
│   ├── crud.py                      # Database operations
│   ├── config.py                    # Configuration
│   ├── scheduler.py                 # Job scheduling
│   ├── summarizer.py                # Content summarization
│   ├── pdf_export.py                # PDF generation
│   ├── scraper_engine/
│   │   ├── simple_scraper.py        # ✅ HTTP-based scraper (NEW!)
│   │   ├── extractors.py            # Content extraction
│   │   ├── browser_pool.py          # Browser management
│   │   └── stealth.py               # Anti-detection
│   └── migrations/
│       └── 001_enable_fuzzy_search.sql
│
├── 📁 frontend/
│   ├── app/
│   │   ├── page.tsx                 # Dashboard
│   │   ├── scraper/page.tsx         # Custom scraper UI
│   │   ├── data/page.tsx            # Data explorer
│   │   └── layout.tsx               # Root layout
│   ├── components/
│   │   ├── 3d/
│   │   │   └── ParticleBackground.tsx  # Three.js particles
│   │   ├── layout/
│   │   │   └── Navbar.tsx           # Navigation
│   │   └── providers/
│   │       └── ThemeProvider.tsx    # Dark/Light theme
│   ├── lib/
│   │   └── api.ts                   # API client
│   └── types/
│       └── index.ts                 # TypeScript types
│
├── 📁 scraper/                      # Scrapy spiders
│   └── scraper/spiders/
│       ├── news_spider.py           # ✅ News scraper
│       └── jobs_spider.py           # Jobs scraper
│
├── 📁 examples/
│   └── api_usage.py                 # API examples
│
├── 📁 Documentation/
│   ├── QUICKSTART.md
│   ├── ENHANCED_FEATURES.md
│   ├── TROUBLESHOOTING.md
│   ├── SECURITY.md
│   ├── FIX_COMPLETE.md              # ✅ Custom scraper fix details
│   ├── CUSTOM_SCRAPER_FIX.md
│   └── READY_TO_USE.md
│
├── start-all.bat                    # ✅ Start both services (Windows)
├── start-backend.bat                # Start backend only
├── start-frontend.bat               # Start frontend only
├── test_custom_scraper.py           # ✅ Scraper test script
└── test_comprehensive.py            # Full test suite

🎯 Key Improvements (Latest Updates)

✅ Custom URL Scraper Fix

Problem Solved: Windows + Playwright subprocess incompatibility

Solution: Replaced Playwright with httpx + Trafilatura

✅ Works on Windows without subprocess errors
✅ 3x faster than browser automation
✅ Covers ~80% of websites
✅ Intelligent content extraction

See FIX_COMPLETE.md for details.

✅ Complete Frontend Implementation

Modern Next.js 14 with TypeScript
3D particle background with Three.js
Dark/Light theme with smooth transitions
Real-time dashboard
Data explorer with search and export

✅ Production-Ready Features

Connection pooling for database
Scheduled scraping every 6 hours
Comprehensive error handling
Full API documentation
Test scripts included

🧪 Testing

# Quick test - Custom URL scraper
python test_custom_scraper.py

# Comprehensive test suite
python test_comprehensive.py

Expected Output:

🔄 Testing Custom URL Scraper...
   Target: https://books.toscrape.com

✅ SUCCESS!

📊 Results:
   - Success: True
   - Content Length: 354 characters
   - Lists Found: 5

📚 Documentation

Document	Description
QUICKSTART.md	5-minute setup guide
COMPLETE_SETUP_GUIDE.md	Detailed installation
ENHANCED_FEATURES.md	Feature documentation
TROUBLESHOOTING.md	Common issues & fixes
SECURITY.md	Security best practices
FIX_COMPLETE.md	Custom scraper fix
READY_TO_USE.md	Quick reference

🔒 Security

✅ SQL injection prevention with parameterized queries
✅ Input validation using Pydantic schemas
✅ CORS configuration for frontend
✅ Environment-based configuration
✅ Secure database connection handling
✅ Rate limiting ready (configurable)

🐛 Troubleshooting

Backend won't start

# Check PostgreSQL
pg_isready

# Check port availability
netstat -an | findstr :8000

# Reinstall dependencies
pip install -r backend/requirements.txt --force-reinstall

Frontend won't start

# Clear cache
rm -rf frontend/node_modules frontend/.next
cd frontend && npm install

Database errors

Verify PostgreSQL is running
Check connection string in backend/config.py
Ensure database exists: createdb crawlx

See TROUBLESHOOTING.md for complete guide.

📈 Performance Metrics

Metric	Performance
Scraping Speed	100-500 pages/minute
Database Capacity	100K+ items
API Response Time	<100ms average
Frontend Load Time	<2s initial load
Memory Usage	~200MB backend, ~150MB frontend

🗺️ Roadmap

🤝 Contributing

Contributions are welcome! Please follow these steps:

Fork the repository
Create feature branch (git checkout -b feature/AmazingFeature)
Commit changes (git commit -m 'Add AmazingFeature')
Push to branch (git push origin feature/AmazingFeature)
Open Pull Request

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

FastAPI - For the amazing async web framework
Trafilatura - For intelligent content extraction
Next.js - For the excellent React framework
PostgreSQL - For robust database capabilities
Scrapy - For industrial-strength web scraping

📧 Contact & Support

Author: Juni
GitHub: @juni2003
Repository: CrawlX-Data-Scrapping-Project

Get Help

⭐ Star this repo if you find it useful!

Made with ❤️ by Juni

CrawlX - Scrape Smarter, Not Harder

FilesExpand file tree

README.md

Latest commit

History