CrawlX is a powerful web scraping platform with:
- Pre-configured scrapers for Hacker News and RemoteOK jobs
- Custom URL scraper for arbitrary websites, with stealth measures to reduce anti-bot blocking
- Smart content extraction with multiple modes (auto, article, text, structured)
- Modern Next.js frontend with 3D effects and dark mode
- PostgreSQL database for data storage
- Export functionality (JSON, CSV, PDF)
Before running CrawlX, ensure you have:
- ✅ Python 3.11+ installed
- ✅ PostgreSQL 12+ installed and running
- ✅ Node.js 18+ and npm installed
- ✅ Git (optional, for version control)
# Navigate to backend directory
cd backend
# Install Python dependencies
pip install -r requirements.txt
# Install Playwright browser (required for custom URL scraper)
python -m playwright install chromium
# Create .env file
# Copy this content into backend/.env:
DATABASE_URL=postgresql://postgres:your_password@localhost:5432/crawlx
Important: Replace your_password with your actual PostgreSQL password!
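A forgotten placeholder password is an easy mistake to make here. The following is a minimal sketch, using only the Python standard library, of a check you could run against the value before starting the backend. The function name `check_database_url` is hypothetical and not part of CrawlX.

```python
from urllib.parse import urlsplit

# Placeholder text taken from the .env template above.
PLACEHOLDER_PASSWORD = "your_password"


def check_database_url(url: str) -> list[str]:
    """Return a list of human-readable problems found in a DATABASE_URL value."""
    parts = urlsplit(url)
    problems = []
    if parts.scheme not in ("postgresql", "postgres"):
        problems.append(f"unexpected scheme: {parts.scheme!r}")
    if not parts.hostname:
        problems.append("missing host")
    if not parts.path.strip("/"):
        problems.append("missing database name")
    if parts.password == PLACEHOLDER_PASSWORD:
        problems.append("password is still the placeholder")
    return problems


if __name__ == "__main__":
    template = "postgresql://postgres:your_password@localhost:5432/crawlx"
    for problem in check_database_url(template):
        print(problem)
```

An empty list means none of these basic problems were found; it does not prove the credentials are accepted by PostgreSQL.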
# Create database (run from your shell; createdb ships with PostgreSQL)
createdb crawlx
# OR use psql:
psql -U postgres -c "CREATE DATABASE crawlx;"
# In backend directory
uvicorn main:app --reload
# Should see:
# INFO: Uvicorn running on http://127.0.0.1:8000
Test it: Open http://localhost:8000/docs to see the interactive API documentation.
# Navigate to frontend directory (from project root)
cd frontend
# Install dependencies
npm install
# Start development server
npm run dev
# Should see:
# ▲ Next.js 14.x.x
# Local: http://localhost:3000
Open http://localhost:3000 in your browser!
Dashboard (http://localhost:3000)
- View statistics (total items, today's scrapes, etc.)
- Click "Run Scrapers" to scrape Hacker News and RemoteOK
- Navigate to Custom Scraper or Data Explorer
Custom URL Scraper (http://localhost:3000/scraper)
- Enter any website URL (e.g., https://books.toscrape.com)
- Choose extraction mode:
  - Auto: Smart detection
  - Article: For news/blogs
  - Text: All visible text
  - Structured: Tables & lists
- Adjust wait time (1-10 seconds; use higher values for slow sites)
- Click "Scrape URL"
- Copy or download results
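The same options can be expressed as a request payload. This is a minimal sketch that validates the four extraction modes and the 1-10 second wait range listed above before building the body. The field names `url`, `extract_type`, and `wait_seconds` follow the API example in this README; the helper `build_scrape_request` is hypothetical, and whether the backend enforces the same limits is an assumption.

```python
EXTRACT_TYPES = {"auto", "article", "text", "structured"}
MIN_WAIT_SECONDS, MAX_WAIT_SECONDS = 1, 10


def build_scrape_request(url: str, extract_type: str = "auto", wait_seconds: int = 2) -> dict:
    """Validate the form values and return a payload for the custom URL scraper."""
    if not url.startswith(("http://", "https://")):
        raise ValueError(f"URL must start with http:// or https://, got {url!r}")
    if extract_type not in EXTRACT_TYPES:
        raise ValueError(f"extract_type must be one of {sorted(EXTRACT_TYPES)}")
    if not MIN_WAIT_SECONDS <= wait_seconds <= MAX_WAIT_SECONDS:
        raise ValueError(
            f"wait_seconds must be between {MIN_WAIT_SECONDS} and {MAX_WAIT_SECONDS}"
        )
    return {"url": url, "extract_type": extract_type, "wait_seconds": wait_seconds}


if __name__ == "__main__":
    print(build_scrape_request("https://books.toscrape.com", "structured", 5))
```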
Data Explorer (http://localhost:3000/data)
- Search scraped content (with fuzzy search)
- Filter by tags (news, tech, jobs, remote)
- Export to CSV or PDF
- View full details with source links
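To illustrate what typo-tolerant search means in practice, here is a minimal sketch built on `difflib` from the Python standard library. It is not CrawlX's actual search implementation, which this README does not describe; the function `fuzzy_match` and the 0.75 threshold are assumptions chosen for the example.

```python
import string
from difflib import SequenceMatcher


def fuzzy_match(query: str, text: str, threshold: float = 0.75) -> bool:
    """Return True if any word in text is similar enough to the query."""
    query = query.lower()
    for word in text.lower().split():
        # Ignore punctuation attached to a word, e.g. "python," -> "python".
        word = word.strip(string.punctuation)
        if SequenceMatcher(None, query, word).ratio() >= threshold:
            return True
    return False


if __name__ == "__main__":
    titles = ["Learning Python, today", "Remote jobs in Berlin"]
    # The misspelled query still selects the first title.
    print([title for title in titles if fuzzy_match("pyhton", title)])
```

A production search would more likely run inside the database rather than in application code; this only shows the matching idea.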
Error: FATAL: database "crawlx" does not exist
createdb crawlx
Error: connection to server at "localhost" (::1), port 5432 failed
- PostgreSQL is not running
- Start it: sudo systemctl start postgresql (Linux) or pg_ctl start (Windows)
Error: DETAIL: role "postgres" does not exist
- Create PostgreSQL user:
psql -c "CREATE USER postgres WITH PASSWORD 'your_password' SUPERUSER;"
Error: Browser not installed
python -m playwright install chromium
Error: Scraping failed
- Check if the website blocks bots
- Try increasing wait time to 5-10 seconds
- Some sites use aggressive anti-bot protection
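The advice to increase the wait time can be automated by retrying with progressively longer waits. This is a minimal sketch: the `scrape` callable is a hypothetical stand-in for whatever function performs the request, and treating `RuntimeError` as the failure signal is an assumption.

```python
from typing import Callable


def scrape_with_escalating_wait(
    scrape: Callable[[str, int], str],
    url: str,
    waits: tuple[int, ...] = (2, 5, 10),
) -> str:
    """Call scrape(url, wait_seconds) with longer waits until one attempt succeeds."""
    last_error = None
    for wait_seconds in waits:
        try:
            return scrape(url, wait_seconds)
        except RuntimeError as error:  # assumed failure signal for a blocked or empty page
            last_error = error
    raise RuntimeError(f"all attempts failed for {url}") from last_error


if __name__ == "__main__":
    def fake_scrape(url: str, wait_seconds: int) -> str:
        # Simulates a slow site that only renders after 5 seconds.
        if wait_seconds < 5:
            raise RuntimeError("page not ready")
        return f"content of {url}"

    print(scrape_with_escalating_wait(fake_scrape, "https://example.com"))
```

If every attempt fails, the site is probably blocking automated access and longer waits will not help.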
Error: Network Error or ERR_CONNECTION_REFUSED
- Ensure backend is running on http://localhost:8000
- Check CORS settings in backend/main.py
- Verify firewall isn't blocking port 8000
Error: QueuePool limit exceeded
- Restart backend server
- Check for hanging database connections:
SELECT * FROM pg_stat_activity WHERE datname = 'crawlx';
GET http://localhost:8000/
Response: {"message": "CrawlX API is running"}
GET http://localhost:8000/items?limit=10&tag=news
GET http://localhost:8000/search?q=python&fuzzy=true
POST http://localhost:8000/scrape
Body: ["news", "jobs"]
POST http://localhost:8000/scrape/url
Body: {
  "url": "https://example.com",
  "extract_type": "auto",
  "wait_seconds": 2
}
GET http://localhost:8000/export/json
GET http://localhost:8000/export/csv
POST http://localhost:8000/export/pdf
Full interactive docs: http://localhost:8000/docs
CrawlX-Data-Scrapping-Project/
├── backend/                  # FastAPI backend
│   ├── main.py               # API endpoints
│   ├── models.py             # Database models
│   ├── crud.py               # Database operations
│   ├── scraper_engine/       # Custom URL scraper
│   │   ├── browser_pool.py
│   │   ├── stealth.py
│   │   └── extractors.py
│   ├── scraper/              # Scrapy project
│   │   └── spiders/
│   │       ├── news_spider.py
│   │       └── jobs_spider.py
│   ├── requirements.txt
│   └── .env                  # Database config
├── frontend/                 # Next.js frontend
│   ├── app/                  # Pages
│   │   ├── page.tsx          # Dashboard
│   │   ├── scraper/page.tsx  # Custom scraper
│   │   └── data/page.tsx     # Data explorer
│   ├── components/
│   │   ├── 3d/               # Three.js effects
│   │   ├── layout/           # Navbar, etc.
│   │   └── providers/        # Theme provider
│   ├── lib/api.ts            # API client
│   └── package.json
└── README.md                 # This file
- Go to Dashboard
- Click "Run Scrapers"
- Wait ~10 seconds
- Go to Data Explorer
- Filter by tag: "news"
- Go to Custom Scraper
- Enter URL: https://example.com
- Choose extraction mode
- Click "Scrape URL"
- View results
- Go to Data Explorer
- Click "Export CSV" or "Export PDF"
- File downloads automatically
- Go to Data Explorer
- Enter search term
- Enable "Fuzzy search" for typo tolerance
- Click "Search"
- Database: Change default PostgreSQL password
- API: Add authentication for production use
- CORS: Restrict allowed origins in production
- Environment: Never commit .env files
- Rate Limiting: Add rate limits for scraping endpoints
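As one way to approach the rate limiting item, here is a minimal sketch of an in-memory sliding-window limiter. It is not part of CrawlX; the class name `SlidingWindowRateLimiter` and the limits used are hypothetical, and a deployment with several worker processes would need shared state (for example in the database or a cache) instead.

```python
import time
from collections import deque
from typing import Callable


class SlidingWindowRateLimiter:
    """Allow at most max_calls within any window of period_seconds."""

    def __init__(
        self,
        max_calls: int,
        period_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_calls = max_calls
        self._period = period_seconds
        self._clock = clock  # injectable so the limiter can be tested without sleeping
        self._calls: deque[float] = deque()

    def allow(self) -> bool:
        """Record and permit the call if the window has room; otherwise refuse it."""
        now = self._clock()
        # Drop timestamps that have fallen out of the window.
        while self._calls and now - self._calls[0] >= self._period:
            self._calls.popleft()
        if len(self._calls) < self._max_calls:
            self._calls.append(now)
            return True
        return False


if __name__ == "__main__":
    limiter = SlidingWindowRateLimiter(max_calls=2, period_seconds=60)
    print([limiter.allow() for _ in range(3)])
```

A scraping endpoint could call `allow()` per client and respond with HTTP 429 when it returns `False`.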
- id: Primary key
- url: Source URL
- title: Item title
- content: Full content
- summary: AI-generated summary
- tags: Array of tags
- source: Scraper source
- metadata: JSON metadata
- scraped_at: Timestamp
# Use production ASGI server
gunicorn main:app -w 4 -k uvicorn.workers.UvicornWorker
# Build for production
npm run build
# Start production server
npm start
- Use PostgreSQL connection pooling
- Set up regular backups
- Configure proper indexes
- Fork the repository
- Create feature branch
- Make changes
- Test thoroughly
- Submit pull request
This project is for educational purposes.
- Backend API Docs: http://localhost:8000/docs
- Frontend Errors: Check browser console
- Database Issues: Check PostgreSQL logs
- Scraper Issues: Check CUSTOM_URL_SCRAPER.md
✅ PostgreSQL running
✅ Backend starts without errors
✅ Frontend loads at http://localhost:3000
✅ Dashboard shows stats
✅ Custom URL scraper works
✅ Data explorer displays items
✅ Export functionality works
Congratulations! CrawlX is fully operational! 🎊