| 1 | +# Plombery Scraper |
| 2 | + |
| 3 | +A scheduled data pipeline platform built on [Plombery](https://github.com/lucafaggianelli/plombery) for scraping, enriching, and publishing structured data. It exposes a web UI for monitoring and triggering pipelines and runs multiple independent scrapers on cron or interval schedules. |
| 4 | + |
| 5 | +## Quick Start |
| 6 | + |
| 7 | +```bash |
| 8 | +python -m venv venv |
| 9 | +source venv/bin/activate |
| 10 | +pip install -r requirements.txt |
| 11 | +``` |
| 12 | + |
| 13 | +Copy the config template and fill in your credentials: |
| 14 | + |
| 15 | +```bash |
| 16 | +cp src/config/config.ini.template src/config/config.ini |
| 17 | +``` |
| 18 | + |
| 19 | +Run the server: |
| 20 | + |
| 21 | +```bash |
| 22 | +python src/app.py |
| 23 | +``` |
| 24 | + |
| 25 | +The dashboard is available at **http://localhost:8080**. |
| 26 | + |
| 27 | +## Pipelines |
| 28 | + |
| 29 | +| Pipeline | File | Schedule | Description | |
| 30 | +|---|---|---|---| |
| 31 | +| **Jobs Scraper** | `src/jobs_scrape_pipeline.py` | Every 8 hours | Scrapes LinkedIn for Data Engineer / Data Architect roles in UAE, Saudi Arabia, and Qatar via `python-jobspy`. Enriches listings with Gemini AI (skills, company info, job metadata). Persists to PostgreSQL and Elasticsearch, exports JSON to Cloudflare R2. | |
| 32 | +| **Jobs Alerts** | `src/jobs_alerts_pipeline.py` | Daily at 07:45 GST (UTC+4) | Queries subscriber preferences and sends matching job alert emails via SMTP. |
| 33 | +| **Dimension Standardization** | `src/standardization_pipeline.py` | Every 24 hours | Uses Gemini AI to map raw inferred values (job titles, countries) to standardized reference entries in the database. | |
| 34 | +| **Dubizzle Cars** | `src/dubzl_crs.py` | Daily at 05:45 GST (UTC+4) | Scrapes used-car listings from Dubizzle UAE (paginated Next.js SSR). Stores in Elasticsearch and exports to R2. |
| 35 | +| **Carswitch Cars** | `src/crswth_crs.py` | Scheduled | Scrapes car listings from Carswitch. Parses streaming HTML payloads, enriches with detail-page data, indexes into Elasticsearch, and exports to R2. | |
| 36 | +| **Allsopp Property** | `src/allsopp_crs.py` | Scheduled | Scrapes residential sales listings from allsoppandallsopp.com (Dubai). Stores in Elasticsearch and exports to R2. | |
| 37 | +| **99acres Property** | `src/acres99_crs.py` | Scheduled | Scrapes property listings from 99acres (India). Stores in Elasticsearch and exports to R2. | |
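The daily schedules in the table are given in Gulf Standard Time. If the host clock or the scheduler runs in UTC, the trigger times need converting. The snippet below is a minimal sketch of that arithmetic using only the standard library; the helper name `gst_to_utc` is hypothetical and is not part of this repository.

```python
from datetime import datetime, time, timedelta, timezone

# Gulf Standard Time is a fixed UTC+4 offset with no daylight saving.
GST = timezone(timedelta(hours=4), name="GST")


def gst_to_utc(hour: int, minute: int) -> time:
    """Return the UTC wall-clock time for a daily HH:MM GST schedule."""
    # The date is arbitrary because the offset never changes.
    local = datetime(2024, 1, 1, hour, minute, tzinfo=GST)
    return local.astimezone(timezone.utc).time()


if __name__ == "__main__":
    print(gst_to_utc(7, 45))  # Jobs Alerts -> 03:45:00
    print(gst_to_utc(5, 45))  # Dubizzle Cars -> 01:45:00
```

Whether the pipelines pass a timezone to their triggers or rely on the server clock is not stated here, so treat the converted values as a cross-check rather than as the configured schedule.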
| 38 | + |
| 39 | +## Architecture |
| 40 | + |
| 41 | +``` |
| 42 | +┌─────────────────────────────────────────────────┐ |
| 43 | +│             Plombery Web UI (:8080)             │ |
| 44 | +│         (FastAPI + Uvicorn + WebSocket)         │ |
| 45 | +└──────────────────────┬──────────────────────────┘ |
| 46 | +                       │ |
| 47 | +          ┌────────────┼────────────────┐ |
| 48 | +          ▼            ▼                ▼ |
| 49 | +   ┌────────────┐ ┌──────────┐ ┌──────────────┐ |
| 50 | +   │    Jobs    │ │   Cars   │ │   Property   │ |
| 51 | +   │ Pipelines  │ │ Pipelines│ │  Pipelines   │ |
| 52 | +   └─────┬──────┘ └────┬─────┘ └──────┬───────┘ |
| 53 | +         │             │              │ |
| 54 | +         ▼             ▼              ▼ |
| 55 | +   ┌───────────┐  ┌───────────────────────────┐ |
| 56 | +   │ Gemini AI │  │        Data Stores        │ |
| 57 | +   │  (2.0 /   │  │  PostgreSQL (Neon DB)     │ |
| 58 | +   │   1.5)    │  │  Elasticsearch 7.x        │ |
| 59 | +   └───────────┘  │  Cloudflare R2 (exports)  │ |
| 60 | +                  └───────────────────────────┘ |
| 61 | +``` |
| 62 | + |
| 63 | +### Data Flow (Jobs Pipeline) |
| 64 | + |
| 65 | +1. **Scrape** - `python-jobspy` fetches LinkedIn job listings for configured locations. |
| 66 | +2. **Deduplicate** - New jobs are filtered against existing hashes in PostgreSQL. |
| 67 | +3. **Persist Raw** - Raw records are appended to PostgreSQL and bulk-indexed into Elasticsearch. |
| 68 | +4. **AI Enrichment** - Gemini extracts structured fields (skills, seniority, company info) from raw descriptions. |
| 69 | +5. **Persist Enriched** - Enriched records are saved to PostgreSQL and Elasticsearch. |
| 70 | +6. **Export** - A public JSON snapshot is uploaded to Cloudflare R2 for downstream dashboards. |
| 71 | +7. **Alert** - Subscribers receive matching job alerts via email. |
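The deduplication step (step 2) can be illustrated with a short sketch. This is not the pipeline's actual code: the function names `job_hash` and `filter_new`, and the choice of `title`, `company`, `location`, and `job_url` as identifying fields, are assumptions made for illustration. In the real pipeline the existing hashes would come from PostgreSQL; here they are passed in as a plain set.

```python
import hashlib
import json


def job_hash(job: dict) -> str:
    """Stable SHA-256 over the fields assumed to identify a listing."""
    # Hypothetical key fields; the real pipeline may hash different ones.
    key = {k: job.get(k) for k in ("title", "company", "location", "job_url")}
    payload = json.dumps(key, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def filter_new(jobs: list[dict], existing_hashes: set[str]) -> list[dict]:
    """Keep jobs whose hash is neither stored already nor repeated in this batch."""
    seen = set(existing_hashes)  # copy so the caller's set is not mutated
    fresh = []
    for job in jobs:
        h = job_hash(job)
        if h not in seen:
            seen.add(h)
            fresh.append(job)
    return fresh
```

Serializing with `sort_keys=True` keeps the hash independent of dictionary ordering, so the same listing scraped twice produces the same digest.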
| 72 | + |
| 73 | +## Configuration |
| 74 | + |
| 75 | +All credentials are managed through `src/config/config.ini` (INI format). See `src/config/config.ini.template` for the full list of sections: |
| 76 | + |
| 77 | +- **`[PostgresDB]`** - Neon PostgreSQL connection string |
| 78 | +- **`[GeminiPro]`** - Google Gemini API keys |
| 79 | +- **`[openrouter]`** - OpenRouter API key (optional) |
| 80 | +- **`[elasticsearch]`** - Elasticsearch hosts, auth, index names |
| 81 | +- **`[cloudflare]`** - Cloudflare R2 credentials and bucket config |
| 82 | +- **`[Sendgrid]`** - SMTP credentials for alert emails |
| 83 | +- **`[core]`** - Shared settings (base URLs, etc.) |
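Since the config reader is based on `configparser`, loading and validating the file might look like the sketch below. The option names (`connection_string`, `hosts`) and the `load_config` helper are assumptions for illustration; the authoritative option names are the ones in `src/config/config.ini.template`. The sample parses an inline string so it runs without a real config file.

```python
import configparser

# Placeholder contents; option names such as `connection_string` are assumptions.
SAMPLE_INI = """
[PostgresDB]
connection_string = postgresql://user:password@host/dbname

[elasticsearch]
hosts = http://localhost:9200
"""

REQUIRED_SECTIONS = ("PostgresDB", "elasticsearch")


def load_config(text: str) -> configparser.ConfigParser:
    """Parse INI text and fail early if a required section is absent."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    missing = [s for s in REQUIRED_SECTIONS if not parser.has_section(s)]
    if missing:
        raise KeyError(f"missing config sections: {', '.join(missing)}")
    return parser


config = load_config(SAMPLE_INI)
print(config["PostgresDB"]["connection_string"])
```

Two `configparser` behaviours are worth keeping in mind: section names are case-sensitive, so `[PostgresDB]` and `[postgresdb]` are different sections, and the default interpolation treats `%` specially, so a credential containing `%` needs to be written as `%%` or parsed with `ConfigParser(interpolation=None)`.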
| 84 | + |
| 85 | +## Project Structure |
| 86 | + |
| 87 | +``` |
| 88 | +plombery-scraper/ |
| 89 | +├── src/ |
| 90 | +│   ├── app.py                       # Entry point - starts Uvicorn server |
| 91 | +│   ├── jobs_scrape_pipeline.py      # Jobs scraping + AI enrichment pipeline |
| 92 | +│   ├── jobs_alerts_pipeline.py      # Job alert email pipeline |
| 93 | +│   ├── standardization_pipeline.py  # Dimension standardization pipeline |
| 94 | +│   ├── dubzl_crs.py                 # Dubizzle car listings pipeline |
| 95 | +│   ├── crswth_crs.py                # Carswitch car listings pipeline |
| 96 | +│   ├── allsopp_crs.py               # Allsopp property listings pipeline |
| 97 | +│   ├── acres99_crs.py               # 99acres property listings pipeline |
| 98 | +│   ├── config/ |
| 99 | +│   │   ├── __init__.py              # Config reader (configparser) |
| 100 | +│   │   ├── config.ini               # Local config (gitignored) |
| 101 | +│   │   └── config.ini.template      # Template with placeholder values |
| 102 | +│   ├── connections/ |
| 103 | +│   │   └── neondb_client.py         # Neon DB connection helper |
| 104 | +│   └── utils/ |
| 105 | +│       ├── ai_infer.py              # Gemini API inference helper |
| 106 | +│       └── rdbms_conn.py            # Database query executor |
| 107 | +├── deploy/ |
| 108 | +│   └── plomberly.service            # systemd unit file for production |
| 109 | +├── .github/workflows/ |
| 110 | +│   └── deploy.yml                   # CI/CD - rsync to Oracle Cloud |
| 111 | +├── requirements.txt |
| 112 | +└── .gitignore |
| 113 | +``` |
| 114 | + |
| 115 | +## Deployment |
| 116 | + |
| 117 | +### Production (Oracle Cloud) |
| 118 | + |
| 119 | +The GitHub Actions workflow (`.github/workflows/deploy.yml`) handles deployment: |
| 120 | + |
| 121 | +1. Triggers on push/merge to `main` or `master`, releases, and manual dispatch. |
| 122 | +2. Syncs files to the remote server via `rsync` over SSH. |
| 123 | +3. Installs dependencies if `requirements.txt` changed. |
| 124 | +4. The app runs as a systemd service (`plomberly.service`). |
| 125 | + |
| 126 | +Required GitHub secrets: `SSH_HOST`, `SSH_USER`, `SSH_PORT`, `REMOTE_PATH`, `SSH_PRIVATE_KEY`. |
| 127 | + |
| 128 | +The systemd service runs: |
| 129 | + |
| 130 | +``` |
| 131 | +/home/ubuntu/plombery/venv/bin/python /home/ubuntu/plombery/src/app.py |
| 132 | +``` |
| 133 | + |
| 134 | +To manage the service on the remote server: |
| 135 | + |
| 136 | +```bash |
| 137 | +sudo systemctl start plomberly |
| 138 | +sudo systemctl stop plomberly |
| 139 | +sudo systemctl restart plomberly |
| 140 | +sudo systemctl status plomberly |
| 141 | +``` |
| 142 | + |
| 143 | +### Local Development |
| 144 | + |
| 145 | +```bash |
| 146 | +source venv/bin/activate |
| 147 | +python src/app.py |
| 148 | +``` |
| 149 | + |
| 150 | +The server starts with hot-reload enabled (`--reload`), watching for changes in the parent directory. |
| 151 | + |
| 152 | +## Dependencies |
| 153 | + |
| 154 | +Key dependencies: |
| 155 | + |
| 156 | +- **plombery** - Pipeline orchestration with web UI |
| 157 | +- **python-jobspy** - LinkedIn / Indeed / Glassdoor job scraper |
| 158 | +- **elasticsearch 7.x** - Search and indexing |
| 159 | +- **sqlalchemy + psycopg2** - PostgreSQL access (Neon DB) |
| 160 | +- **boto3** - Cloudflare R2 (S3-compatible) uploads |
| 161 | +- **requests-html** - JavaScript-rendered page loading |
| 162 | +- **curl-cffi** - TLS-fingerprint-resistant HTTP client |
| 163 | +- **lxml[html_clean]** - Fast HTML parsing |
| 164 | +- **google-gemini** (via REST) - LLM-based data extraction |
| 165 | +- **json-repair / dirtyjson / python-rapidjson** - Robust JSON parsing for LLM outputs |