A scheduled data pipeline platform built on Plombery for scraping, enriching, and publishing structured data. It exposes a web UI for monitoring and triggering pipelines, and runs multiple independent scrapers on cron/interval schedules.
Create a virtual environment and install the dependencies:

    python -m venv venv
    source venv/bin/activate
    pip install -r requirements.txt

Copy the config template and fill in your credentials:

    cp src/config/config.ini.template src/config/config.ini

Run the server:

    python src/app.py

The dashboard is available at http://localhost:8080.

| Pipeline | File | Schedule | Description |
|---|---|---|---|
| Jobs Scraper | `src/jobs_scrape_pipeline.py` | Every 8 hours | Scrapes LinkedIn for Data Engineer / Data Architect roles in UAE, Saudi Arabia, and Qatar via python-jobspy. Enriches listings with Gemini AI (skills, company info, job metadata). Persists to PostgreSQL and Elasticsearch, exports JSON to Cloudflare R2. |
| Jobs Alerts | `src/jobs_alerts_pipeline.py` | Daily at 07:45 GST | Queries subscriber preferences and sends matching job alert emails via SMTP. |
| Dimension Standardization | `src/standardization_pipeline.py` | Every 24 hours | Uses Gemini AI to map raw inferred values (job titles, countries) to standardized reference entries in the database. |
| Dubizzle Cars | `src/dubzl_crs.py` | Daily at 05:45 GST | Scrapes used-car listings from Dubizzle UAE (paginated Next.js SSR). Stores in Elasticsearch and exports to R2. |
| Carswitch Cars | `src/crswth_crs.py` | Scheduled | Scrapes car listings from Carswitch. Parses streaming HTML payloads, enriches with detail-page data, indexes into Elasticsearch, and exports to R2. |
| Allsopp Property | `src/allsopp_crs.py` | Scheduled | Scrapes residential sales listings from allsoppandallsopp.com (Dubai). Stores in Elasticsearch and exports to R2. |
| 99acres Property | `src/acres99_crs.py` | Scheduled | Scrapes property listings from 99acres (India). Stores in Elasticsearch and exports to R2. |

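Two of the schedules above are given in GST (Gulf Standard Time, a fixed UTC+4 offset). If the server clock runs in UTC, those times need converting before they go into a cron-style trigger. The sketch below shows the arithmetic only; the helper names `gst_to_utc` and `utc_cron` are hypothetical, and how the pipelines in this repository actually declare their triggers is not shown here.

```python
from datetime import datetime, time, timedelta, timezone

# Gulf Standard Time is a fixed UTC+4 offset with no daylight saving.
GST = timezone(timedelta(hours=4), name="GST")


def gst_to_utc(hour: int, minute: int) -> time:
    """Convert a GST wall-clock time to the equivalent UTC time of day."""
    # The date is arbitrary: a fixed offset gives the same result on any day.
    local = datetime(2024, 1, 1, hour, minute, tzinfo=GST)
    return local.astimezone(timezone.utc).time()


def utc_cron(hour: int, minute: int) -> str:
    """Render a daily GST time as a five-field cron expression in UTC."""
    utc = gst_to_utc(hour, minute)
    return f"{utc.minute} {utc.hour} * * *"


if __name__ == "__main__":
    print("Jobs Alerts:  ", utc_cron(7, 45))
    print("Dubizzle Cars:", utc_cron(5, 45))
```

If the scheduler is configured with an explicit time zone instead, no conversion is needed and the GST times can be used directly.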

    ┌─────────────────────────────────────────────────┐
    │             Plombery Web UI (:8080)             │
    │         (FastAPI + Uvicorn + WebSocket)         │
    └──────────────────────┬──────────────────────────┘
                           │
              ┌────────────┼────────────────┐
              ▼            ▼                ▼
        ┌────────────┐ ┌──────────┐ ┌──────────────┐
        │    Jobs    │ │   Cars   │ │   Property   │
        │ Pipelines  │ │ Pipelines│ │  Pipelines   │
        └─────┬──────┘ └────┬─────┘ └──────┬───────┘
              │             │              │
              ▼             ▼              ▼
        ┌───────────┐  ┌───────────────────────────┐
        │ Gemini AI │  │        Data Stores        │
        │ (2.0 /    │  │  PostgreSQL (Neon DB)     │
        │  1.5)     │  │  Elasticsearch 7.x        │
        └───────────┘  │  Cloudflare R2 (exports)  │
                       └───────────────────────────┘

The jobs pipeline runs the following steps:
- Scrape - `python-jobspy` fetches LinkedIn job listings for configured locations.
- Deduplicate - New jobs are filtered against existing hashes in PostgreSQL.
- Persist Raw - Raw records are appended to PostgreSQL and bulk-indexed into Elasticsearch.
- AI Enrichment - Gemini extracts structured fields (skills, seniority, company info) from raw descriptions.
- Persist Enriched - Enriched records are saved to PostgreSQL and Elasticsearch.
- Export - A public JSON snapshot is uploaded to Cloudflare R2 for downstream dashboards.
- Alert - Subscribers receive matching job alerts via email.
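
The Deduplicate step can be illustrated with a minimal sketch. It assumes each job is a dict and that the hash is built from the title, company, and location; the field names and the functions `job_hash` and `filter_new` are hypothetical, and the real pipeline may hash different fields. The `existing` set stands in for the hashes already stored in PostgreSQL.

```python
import hashlib
import json


def job_hash(job: dict) -> str:
    """Build a stable hash from a normalized subset of fields (assumed keys)."""
    key = {
        field: str(job.get(field, "")).strip().lower()
        for field in ("title", "company", "location")
    }
    payload = json.dumps(key, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def filter_new(jobs: list[dict], existing: set[str]) -> list[dict]:
    """Return jobs whose hash is not in `existing`, dropping in-batch repeats."""
    seen = set(existing)  # copy so the caller's set is not mutated
    fresh = []
    for job in jobs:
        digest = job_hash(job)
        if digest not in seen:
            seen.add(digest)
            fresh.append({**job, "hash": digest})
    return fresh


if __name__ == "__main__":
    scraped = [
        {"title": "Data Engineer", "company": "Acme", "location": "Dubai"},
        {"title": " data engineer ", "company": "ACME", "location": "dubai"},
        {"title": "Data Architect", "company": "Acme", "location": "Riyadh"},
    ]
    for job in filter_new(scraped, existing=set()):
        print(job["title"], job["hash"][:12])
```

Normalizing case and whitespace before hashing keeps trivially different copies of the same listing from being stored twice.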
All credentials are managed through src/config/config.ini (INI format). See src/config/config.ini.template for the full list of sections:
- `[PostgresDB]` - Neon PostgreSQL connection string
- `[GeminiPro]` - Google Gemini API keys
- `[openrouter]` - OpenRouter API key (optional)
- `[elasticsearch]` - Elasticsearch hosts, auth, index names
- `[cloudflare]` - Cloudflare R2 credentials and bucket config
- `[Sendgrid]` - SMTP credentials for alert emails
- `[core]` - Shared settings (base URLs, etc.)
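
As a rough illustration of reading such a file with the standard-library `configparser`, the sketch below loads INI text and checks that the non-optional sections are present. The section names come from the list above; the option keys in `SAMPLE` (such as `connection_string`) and the function `load_config` are assumptions made for the example, not the actual contents of `config.ini.template` or of `src/config/__init__.py`.

```python
import configparser

# Sections listed in the README, excluding the optional [openrouter].
REQUIRED_SECTIONS = [
    "PostgresDB", "GeminiPro", "elasticsearch", "cloudflare", "Sendgrid", "core",
]

# Placeholder values only; the key names here are hypothetical.
SAMPLE = """
[PostgresDB]
connection_string = postgresql://user:pass@host/db
[GeminiPro]
api_key = changeme
[elasticsearch]
hosts = http://localhost:9200
[cloudflare]
bucket = changeme
[Sendgrid]
smtp_user = changeme
[core]
base_url = http://localhost:8080
"""


def load_config(text: str) -> configparser.ConfigParser:
    """Parse INI text and fail early if a required section is missing."""
    # interpolation=None keeps literal '%' characters in secrets intact.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    missing = [s for s in REQUIRED_SECTIONS if not parser.has_section(s)]
    if missing:
        raise ValueError(f"missing config sections: {', '.join(missing)}")
    return parser


if __name__ == "__main__":
    config = load_config(SAMPLE)
    print(config["PostgresDB"]["connection_string"])
```

Section names are case-sensitive in `configparser`, so `[Sendgrid]` and `[sendgrid]` are different sections; option keys, by contrast, are lowercased by default.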

    plombery-scraper/
    ├── src/
    │   ├── app.py                       # Entry point - starts Uvicorn server
    │   ├── jobs_scrape_pipeline.py      # Jobs scraping + AI enrichment pipeline
    │   ├── jobs_alerts_pipeline.py      # Job alert email pipeline
    │   ├── standardization_pipeline.py  # Dimension standardization pipeline
    │   ├── dubzl_crs.py                 # Dubizzle car listings pipeline
    │   ├── crswth_crs.py                # Carswitch car listings pipeline
    │   ├── allsopp_crs.py               # Allsopp property listings pipeline
    │   ├── acres99_crs.py               # 99acres property listings pipeline
    │   ├── config/
    │   │   ├── __init__.py              # Config reader (configparser)
    │   │   ├── config.ini               # Local config (gitignored)
    │   │   └── config.ini.template      # Template with placeholder values
    │   ├── connections/
    │   │   └── neondb_client.py         # Neon DB connection helper
    │   └── utils/
    │       ├── ai_infer.py              # Gemini API inference helper
    │       └── rdbms_conn.py            # Database query executor
    ├── deploy/
    │   └── plomberly.service            # systemd unit file for production
    ├── .github/workflows/
    │   └── deploy.yml                   # CI/CD - rsync to Oracle Cloud
    ├── requirements.txt
    └── .gitignore

The GitHub Actions workflow (.github/workflows/deploy.yml) handles deployment:
- Triggers on push/merge to `main` or `master`, releases, and manual dispatch.
- Syncs files to the remote server via `rsync` over SSH.
- Installs dependencies if `requirements.txt` changed.
- The app runs as a systemd service (`plomberly.service`).
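
One way to implement "install only if `requirements.txt` changed" is to compare a checksum of the file against one recorded after the last successful install. The sketch below shows that idea in Python; it is not taken from `deploy.yml`, which may detect changes differently (for example from rsync output or a git diff). The stamp file name and both function names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def needs_install(requirements: Path, stamp: Path) -> bool:
    """True if the requirements file differs from the last recorded digest."""
    if not stamp.exists():
        return True
    return stamp.read_text().strip() != file_digest(requirements)


def mark_installed(requirements: Path, stamp: Path) -> None:
    """Record the current digest; call this only after pip succeeds."""
    stamp.write_text(file_digest(requirements))


if __name__ == "__main__":
    req = Path("requirements.txt")
    stamp = Path(".requirements.sha256")  # hypothetical stamp file
    if req.exists() and needs_install(req, stamp):
        print("requirements changed: run pip install -r requirements.txt")
    else:
        print("requirements unchanged or file not found")
```

Writing the stamp only after a successful install means a failed `pip install` is retried on the next deploy.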
Go to your repository Settings → Secrets and variables → Actions and add these secrets:
| Secret | How to get it |
|---|---|
| `SSH_PRIVATE_KEY` | Generate a key pair locally: `ssh-keygen -t ed25519 -C "deploy"`. Copy the private key contents (`cat ~/.ssh/id_ed25519`). |
| `SSH_HOST` | The public IP or hostname of your Oracle Cloud instance. Find it in the Oracle Cloud Console under Compute → Instances → [your instance], or run `ssh <user>@<public-ip>` to test. You can also get it via OCI CLI: `oci compute instance list --compartment-id <ocid> --query "data[0].\"public-ips\"[0]"`. |
| `SSH_USER` | The SSH username for your instance. For Oracle Cloud Ubuntu images this is typically `ubuntu`. |
| `SSH_PORT` | The SSH port (default is `22`). Check your instance's security list / NSG if you customized it. |
| `REMOTE_PATH` | The absolute path on the remote server where the app lives, e.g. `/home/ubuntu/plombery`. |

After generating the key pair, add the public key to the remote server's ~/.ssh/authorized_keys:

    ssh-copy-id -i ~/.ssh/id_ed25519.pub ubuntu@<SSH_HOST>

The systemd service runs:

    /home/ubuntu/plombery/venv/bin/python /home/ubuntu/plombery/src/app.py

To manage the service on the remote server:

    sudo systemctl start plomberly
    sudo systemctl stop plomberly
    sudo systemctl restart plomberly
    sudo systemctl status plomberly

To run the server locally for development:

    source venv/bin/activate
    python src/app.py

The server starts with hot-reload enabled (`--reload`), watching for changes in the parent directory.

Key dependencies:
- plombery - Pipeline orchestration with web UI
- python-jobspy - LinkedIn / Indeed / Glassdoor job scraper
- elasticsearch 7.x - Search and indexing
- sqlalchemy + psycopg2 - PostgreSQL access (Neon DB)
- boto3 - Cloudflare R2 (S3-compatible) uploads
- requests-html - JavaScript-rendered page loading
- curl-cffi - TLS-fingerprint-resistant HTTP client
- lxml[html_clean] - Fast HTML parsing
- google-gemini (via REST) - LLM-based data extraction
- json-repair / dirtyjson / python-rapidjson - Robust JSON parsing for LLM outputs
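
LLM responses often wrap JSON in a Markdown code fence or surround it with chatty text, which is why tolerant parsers appear in the dependency list. The sketch below is a standard-library fallback that illustrates the problem; it is not how `json-repair`, `dirtyjson`, or `python-rapidjson` work, and it is not the logic in `ai_infer.py`. The function name `parse_llm_json` is hypothetical, and the approach only recovers well-formed JSON that is merely wrapped, not JSON that is itself malformed.

```python
import json
import re

# Build the fence marker programmatically so this snippet can itself
# live inside a fenced Markdown block.
FENCE = "`" * 3
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def parse_llm_json(text: str):
    """Try progressively looser ways of finding JSON in an LLM reply.

    Returns the parsed value, or None if nothing parses.
    """
    candidates = [text]

    # 1. Contents of the first fenced block, if any.
    match = FENCE_RE.search(text)
    if match:
        candidates.append(match.group(1))

    # 2. Span from the first opening brace/bracket to the last closing one.
    starts = [i for i in (text.find("{"), text.find("[")) if i != -1]
    end = max(text.rfind("}"), text.rfind("]"))
    if starts and end > min(starts):
        candidates.append(text[min(starts):end + 1])

    for candidate in candidates:
        try:
            return json.loads(candidate.strip())
        except json.JSONDecodeError:
            continue
    return None


if __name__ == "__main__":
    reply = "Here you go:\n" + FENCE + 'json\n{"skills": ["sql", "spark"]}\n' + FENCE
    print(parse_llm_json(reply))
```

A repair library goes further by fixing trailing commas, unquoted keys, and truncated output, which this sketch deliberately leaves to the caller.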