Skip to content

juni2003/CrawlX-Data-Scrapping-Project

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

30 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

πŸ•·οΈ CrawlX - Advanced Web Scraping Platform

License: MIT Python 3.12+ Next.js 14 FastAPI TypeScript

A powerful, production-ready web scraping platform with modern React frontend, FastAPI backend, and intelligent content extraction

Features β€’ Quick Start β€’ Documentation β€’ API


🎯 What is CrawlX?

CrawlX is a full-stack web scraping platform combining powerful backend capabilities with a stunning modern UI. Built to handle everything from simple static pages to complex multi-source data aggregation.

Why CrawlX?

  • βœ… Scrape ANY website - Windows-compatible HTTP-based scraper with smart extraction
  • βœ… Beautiful Modern UI - Next.js 14 with Three.js 3D particle effects
  • βœ… Pre-configured Scrapers - Built-in Scrapy spiders for news and jobs
  • βœ… Dark/Light Theme - Seamless theme switching with smooth transitions
  • βœ… Multiple Export Formats - CSV, PDF, and JSON
  • βœ… Fuzzy Search - Find content even with typos
  • βœ… Real-time Dashboard - Live stats and data visualization
  • βœ… Production Ready - Battle-tested, Windows compatible, fully documented

✨ Features

🌐 Custom URL Scraper

  • Windows Compatible βœ… - No subprocess issues!
  • HTTP-based with Trafilatura - Intelligent content extraction
  • Smart Extraction - Auto-detects articles, tables, lists
  • Fast & Lightweight - No browser overhead
  • Works on ~80% of websites - Static sites, blogs, news, e-commerce

πŸ“° Pre-configured Scrapers

  • News Spider - Tech news scraping (30 items per run)
  • Jobs Spider - Job listings scraper
  • Automated Scheduling - Runs every 6 hours
  • Built with Scrapy - Industrial-strength scraping

🎨 Modern Frontend

  • Next.js 14 - React with server components
  • TypeScript - Type-safe development
  • Three.js - Interactive 3D particle background
  • Tailwind CSS - Beautiful, responsive design
  • Dark Mode - Toggle between themes
  • Real-time Stats - Live dashboard updates

πŸ’Ύ Database & Search

  • PostgreSQL - Robust data storage
  • Fuzzy Search - PostgreSQL trigrams for similarity search
  • Full-text Search - Fast content search
  • Connection Pooling - Optimized performance

πŸ“Š Data Export

  • CSV Export - Spreadsheet-friendly format
  • PDF Export - Professional reports with ReportLab
  • JSON Export - API-ready data

πŸš€ Quick Start

Prerequisites

  • Python 3.12+
  • Node.js 18+
  • PostgreSQL 14+
  • Git

Installation (3 minutes)

  1. Clone the repository
git clone https://github.com/juni2003/CrawlX-Data-Scrapping-Project.git
cd CrawlX-Data-Scrapping-Project
  1. Setup Backend
cd backend

# Create virtual environment
python -m venv venv

# Activate (Windows)
venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Create database
createdb crawlx

# Enable fuzzy search
psql -d crawlx -f migrations/001_enable_fuzzy_search.sql
  1. Setup Frontend
cd ../frontend
npm install
  1. Start Application

Option 1: Quick Start (Windows)

# From project root
start-all.bat

Option 2: Manual Start

# Terminal 1 - Backend
cd backend
uvicorn main:app --reload --port 8000

# Terminal 2 - Frontend  
cd frontend
npm run dev
  1. Access Application

πŸ“– Usage

Web Interface

Dashboard (/)

  • View total scraped items
  • Quick scraping controls
  • Real-time statistics

Custom Scraper (/scraper)

  • Enter any URL to scrape
  • Choose extraction type
  • View extracted content, tables, lists

Data Explorer (/data)

  • Search and filter items
  • Export to CSV/PDF/JSON
  • View detailed information

API Usage

import requests

# Scrape custom URL
response = requests.post("http://localhost:8000/scrape/url", json={
    "url": "https://books.toscrape.com",
    "extract_type": "auto",
    "wait_for": 2
})
print(response.json())

# Get all items
items = requests.get("http://localhost:8000/items?limit=100")

# Search with fuzzy matching
results = requests.get("http://localhost:8000/items/search?query=technology")

# Export data
csv_data = requests.get("http://localhost:8000/export/csv")
with open("data.csv", "wb") as f:
    f.write(csv_data.content)

See examples/api_usage.py for more examples.


πŸ—οΈ Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚               Frontend (Next.js 14 + TypeScript)         β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”          β”‚
β”‚  β”‚Dashboard β”‚  β”‚  Scraper  β”‚  β”‚Data Explorerβ”‚          β”‚
β”‚  β”‚(3D UI)   β”‚  β”‚(Custom)   β”‚  β”‚(Search)     β”‚          β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                      β”‚ REST API
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                Backend (FastAPI + Scrapy)                β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚ REST API β”‚  β”‚Scheduler β”‚  β”‚  Scraping Engine   β”‚    β”‚
β”‚  β”‚          β”‚  β”‚(Every 6h)β”‚  β”‚  (httpx+trafilatura)β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                      β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚           PostgreSQL Database (with Fuzzy Search)        β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚ scraped_items (full-text + trigram indexing)   β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸ› οΈ Tech Stack

Backend

Technology Version Purpose
FastAPI 0.115.8 Modern async web framework
SQLAlchemy 2.0.37 SQL toolkit and ORM
PostgreSQL 14+ Relational database
Scrapy Latest Industrial web scraping
httpx 0.28.0 Async HTTP client
Trafilatura 2.0.0 Content extraction
APScheduler 3.10.4 Job scheduling
NLTK Latest Text summarization

Frontend

Technology Version Purpose
Next.js 14 React framework
TypeScript 5.0 Type safety
Tailwind CSS 3.4 Styling
Three.js Latest 3D graphics
React Three Fiber Latest React + Three.js
Axios Latest HTTP client

πŸ“‚ Project Structure

CrawlX-Data-Scrapping-Project/
β”‚
β”œβ”€β”€ πŸ“ backend/
β”‚   β”œβ”€β”€ main.py                      # FastAPI application
β”‚   β”œβ”€β”€ models.py                    # SQLAlchemy models
β”‚   β”œβ”€β”€ schemas.py                   # Pydantic schemas
β”‚   β”œβ”€β”€ crud.py                      # Database operations
β”‚   β”œβ”€β”€ config.py                    # Configuration
β”‚   β”œβ”€β”€ scheduler.py                 # Job scheduling
β”‚   β”œβ”€β”€ summarizer.py                # Content summarization
β”‚   β”œβ”€β”€ pdf_export.py                # PDF generation
β”‚   β”œβ”€β”€ scraper_engine/
β”‚   β”‚   β”œβ”€β”€ simple_scraper.py        # βœ… HTTP-based scraper (NEW!)
β”‚   β”‚   β”œβ”€β”€ extractors.py            # Content extraction
β”‚   β”‚   β”œβ”€β”€ browser_pool.py          # Browser management
β”‚   β”‚   └── stealth.py               # Anti-detection
β”‚   └── migrations/
β”‚       └── 001_enable_fuzzy_search.sql
β”‚
β”œβ”€β”€ πŸ“ frontend/
β”‚   β”œβ”€β”€ app/
β”‚   β”‚   β”œβ”€β”€ page.tsx                 # Dashboard
β”‚   β”‚   β”œβ”€β”€ scraper/page.tsx         # Custom scraper UI
β”‚   β”‚   β”œβ”€β”€ data/page.tsx            # Data explorer
β”‚   β”‚   └── layout.tsx               # Root layout
β”‚   β”œβ”€β”€ components/
β”‚   β”‚   β”œβ”€β”€ 3d/
β”‚   β”‚   β”‚   └── ParticleBackground.tsx  # Three.js particles
β”‚   β”‚   β”œβ”€β”€ layout/
β”‚   β”‚   β”‚   └── Navbar.tsx           # Navigation
β”‚   β”‚   └── providers/
β”‚   β”‚       └── ThemeProvider.tsx    # Dark/Light theme
β”‚   β”œβ”€β”€ lib/
β”‚   β”‚   └── api.ts                   # API client
β”‚   └── types/
β”‚       └── index.ts                 # TypeScript types
β”‚
β”œβ”€β”€ πŸ“ scraper/                      # Scrapy spiders
β”‚   └── scraper/spiders/
β”‚       β”œβ”€β”€ news_spider.py           # βœ… News scraper
β”‚       └── jobs_spider.py           # Jobs scraper
β”‚
β”œβ”€β”€ πŸ“ examples/
β”‚   └── api_usage.py                 # API examples
β”‚
β”œβ”€β”€ πŸ“ Documentation/
β”‚   β”œβ”€β”€ QUICKSTART.md
β”‚   β”œβ”€β”€ ENHANCED_FEATURES.md
β”‚   β”œβ”€β”€ TROUBLESHOOTING.md
β”‚   β”œβ”€β”€ SECURITY.md
β”‚   β”œβ”€β”€ FIX_COMPLETE.md              # βœ… Custom scraper fix details
β”‚   β”œβ”€β”€ CUSTOM_SCRAPER_FIX.md
β”‚   └── READY_TO_USE.md
β”‚
β”œβ”€β”€ start-all.bat                    # βœ… Start both services (Windows)
β”œβ”€β”€ start-backend.bat                # Start backend only
β”œβ”€β”€ start-frontend.bat               # Start frontend only
β”œβ”€β”€ test_custom_scraper.py           # βœ… Scraper test script
└── test_comprehensive.py            # Full test suite

🎯 Key Improvements (Latest Updates)

βœ… Custom URL Scraper Fix

Problem Solved: Windows + Playwright subprocess incompatibility

Solution: Replaced Playwright with httpx + Trafilatura

  • βœ… Works on Windows without subprocess errors
  • βœ… 3x faster than browser automation
  • βœ… Covers ~80% of websites
  • βœ… Intelligent content extraction

See FIX_COMPLETE.md for details.

βœ… Complete Frontend Implementation

  • Modern Next.js 14 with TypeScript
  • 3D particle background with Three.js
  • Dark/Light theme with smooth transitions
  • Real-time dashboard
  • Data explorer with search and export

βœ… Production-Ready Features

  • Connection pooling for database
  • Scheduled scraping every 6 hours
  • Comprehensive error handling
  • Full API documentation
  • Test scripts included

πŸ§ͺ Testing

# Quick test - Custom URL scraper
python test_custom_scraper.py

# Comprehensive test suite
python test_comprehensive.py

Expected Output:

πŸ”„ Testing Custom URL Scraper...
   Target: https://books.toscrape.com

βœ… SUCCESS!

πŸ“Š Results:
   - Success: True
   - Content Length: 354 characters
   - Lists Found: 5

πŸ“š Documentation

Document Description
QUICKSTART.md 5-minute setup guide
COMPLETE_SETUP_GUIDE.md Detailed installation
ENHANCED_FEATURES.md Feature documentation
TROUBLESHOOTING.md Common issues & fixes
SECURITY.md Security best practices
FIX_COMPLETE.md Custom scraper fix
READY_TO_USE.md Quick reference

πŸ”’ Security

  • βœ… SQL injection prevention with parameterized queries
  • βœ… Input validation using Pydantic schemas
  • βœ… CORS configuration for frontend
  • βœ… Environment-based configuration
  • βœ… Secure database connection handling
  • βœ… Rate limiting ready (configurable)

πŸ› Troubleshooting

Backend won't start

# Check PostgreSQL
pg_isready

# Check port availability
netstat -an | findstr :8000

# Reinstall dependencies
pip install -r backend/requirements.txt --force-reinstall

Frontend won't start

# Clear cache
rm -rf frontend/node_modules frontend/.next
cd frontend && npm install

Database errors

  • Verify PostgreSQL is running
  • Check connection string in backend/config.py
  • Ensure database exists: createdb crawlx

See TROUBLESHOOTING.md for complete guide.


πŸ“ˆ Performance Metrics

Metric Performance
Scraping Speed 100-500 pages/minute
Database Capacity 100K+ items
API Response Time <100ms average
Frontend Load Time <2s initial load
Memory Usage ~200MB backend, ~150MB frontend

πŸ—ΊοΈ Roadmap

  • Docker containerization
  • Cloud deployment guide (AWS, Azure, Heroku)
  • Proxy rotation for scalability
  • Real-time websocket updates
  • Mobile app (React Native)
  • API authentication (JWT)
  • Advanced analytics dashboard
  • Multi-user support

🀝 Contributing

Contributions are welcome! Please follow these steps:

  1. Fork the repository
  2. Create feature branch (git checkout -b feature/AmazingFeature)
  3. Commit changes (git commit -m 'Add AmazingFeature')
  4. Push to branch (git push origin feature/AmazingFeature)
  5. Open Pull Request

πŸ“ License

This project is licensed under the MIT License - see the LICENSE file for details.


πŸ™ Acknowledgments

  • FastAPI - For the amazing async web framework
  • Trafilatura - For intelligent content extraction
  • Next.js - For the excellent React framework
  • PostgreSQL - For robust database capabilities
  • Scrapy - For industrial-strength web scraping

πŸ“§ Contact & Support

Author: Juni
GitHub: @juni2003
Repository: CrawlX-Data-Scrapping-Project

Get Help


⭐ Star this repo if you find it useful!

Made with ❀️ by Juni

CrawlX - Scrape Smarter, Not Harder

About

CrawlX is a powerful data scraping engine built to extract, process, and structure web data using scalable, automated scraping pipelines.

Resources

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors