Plombery Scraper

A scheduled data pipeline platform built on Plombery for scraping, enriching, and publishing structured data. It exposes a web UI for monitoring and triggering pipelines, and runs multiple independent scrapers on cron/interval schedules.

Quick Start

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Copy the config template and fill in your credentials:

cp src/config/config.ini.template src/config/config.ini

Run the server:

python src/app.py

The dashboard is available at http://localhost:8080.

Pipelines

Pipeline	File	Schedule	Description
Jobs Scraper	`src/jobs_scrape_pipeline.py`	Every 8 hours	Scrapes LinkedIn for Data Engineer / Data Architect roles in UAE, Saudi Arabia, and Qatar via `python-jobspy`. Enriches listings with Gemini AI (skills, company info, job metadata). Persists to PostgreSQL and Elasticsearch, exports JSON to Cloudflare R2.
Jobs Alerts	`src/jobs_alerts_pipeline.py`	Daily at 07:45 GST	Queries subscriber preferences and sends matching job alert emails via SMTP.
Dimension Standardization	`src/standardization_pipeline.py`	Every 24 hours	Uses Gemini AI to map raw inferred values (job titles, countries) to standardized reference entries in the database.
Dubizzle Cars	`src/dubzl_crs.py`	Daily at 05:45 GST	Scrapes used-car listings from Dubizzle UAE (paginated Next.js SSR). Stores in Elasticsearch and exports to R2.
Carswitch Cars	`src/crswth_crs.py`	Scheduled	Scrapes car listings from Carswitch. Parses streaming HTML payloads, enriches with detail-page data, indexes into Elasticsearch, and exports to R2.
Allsopp Property	`src/allsopp_crs.py`	Scheduled	Scrapes residential sales listings from allsoppandallsopp.com (Dubai). Stores in Elasticsearch and exports to R2.
99acres Property	`src/acres99_crs.py`	Scheduled	Scrapes property listings from 99acres (India). Stores in Elasticsearch and exports to R2.
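
Two of the schedules above are given in GST (Gulf Standard Time, UTC+4), while servers are often configured in UTC. The following is a minimal sketch of how a "Daily at 07:45 GST" schedule maps to a concrete UTC run time. The function name `next_daily_run` is hypothetical and is not part of this repository or of Plombery; it only illustrates the time zone arithmetic.

```python
from datetime import datetime, time, timedelta, timezone

# Gulf Standard Time is a fixed UTC+4 offset with no daylight saving.
GST = timezone(timedelta(hours=4), name="GST")


def next_daily_run(now: datetime, at: time) -> datetime:
    """Return the next occurrence of `at` (a GST wall-clock time) as a UTC datetime.

    `now` must be timezone-aware. Hypothetical helper, for illustration only.
    """
    local_now = now.astimezone(GST)
    candidate = datetime.combine(local_now.date(), at, tzinfo=GST)
    # If today's slot has already passed (or is exactly now), use tomorrow's.
    if candidate <= local_now:
        candidate += timedelta(days=1)
    return candidate.astimezone(timezone.utc)
```

Under these assumptions, 07:45 GST corresponds to 03:45 UTC and 05:45 GST to 01:45 UTC, which is worth keeping in mind when reading run timestamps on a UTC server.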

Architecture

┌─────────────────────────────────────────────────┐
│              Plombery Web UI (:8080)             │
│          (FastAPI + Uvicorn + WebSocket)         │
└──────────────────────┬──────────────────────────┘
                       │
          ┌────────────┼────────────────┐
          ▼            ▼                ▼
   ┌────────────┐ ┌──────────┐  ┌──────────────┐
   │   Jobs     │ │  Cars    │  │  Property    │
   │ Pipelines  │ │ Pipelines│  │  Pipelines   │
   └─────┬──────┘ └────┬─────┘  └──────┬───────┘
         │              │               │
         ▼              ▼               ▼
   ┌───────────┐  ┌───────────────────────────┐
   │ Gemini AI │  │      Data Stores          │
   │ (2.0 /    │  │  PostgreSQL (Neon DB)     │
   │  1.5)     │  │  Elasticsearch 7.x        │
   └───────────┘  │  Cloudflare R2 (exports)  │
                  └───────────────────────────┘

Data Flow (Jobs Pipeline)

Scrape - python-jobspy fetches LinkedIn job listings for configured locations.
Deduplicate - New jobs are filtered against existing hashes in PostgreSQL.
Persist Raw - Raw records are appended to PostgreSQL and bulk-indexed into Elasticsearch.
AI Enrichment - Gemini extracts structured fields (skills, seniority, company info) from raw descriptions.
Persist Enriched - Enriched records are saved to PostgreSQL and Elasticsearch.
Export - A public JSON snapshot is uploaded to Cloudflare R2 for downstream dashboards.
Alert - Subscribers receive matching job alerts via email.
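
The Deduplicate step above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's actual code: the fields used to build the hash (`title`, `company`, `location`, `job_url`) and the function names are hypothetical, and in the real pipeline the existing hashes would come from PostgreSQL rather than an in-memory set.

```python
import hashlib
from typing import Iterable

# Hypothetical choice of identifying fields; the real pipeline may hash others.
HASH_FIELDS = ("title", "company", "location", "job_url")


def job_hash(job: dict) -> str:
    """Build a stable SHA-256 hash from normalized identifying fields."""
    key = "|".join(str(job.get(field, "")).strip().lower() for field in HASH_FIELDS)
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def filter_new_jobs(jobs: Iterable[dict], existing_hashes: Iterable[str]) -> list:
    """Keep only jobs whose hash is neither stored already nor seen in this batch."""
    seen = set(existing_hashes)
    fresh = []
    for job in jobs:
        digest = job_hash(job)
        if digest not in seen:
            seen.add(digest)
            fresh.append(job)
    return fresh
```

Normalizing case and whitespace before hashing means that cosmetic differences between two scrapes of the same listing do not produce a new record.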

Configuration

All credentials are managed through src/config/config.ini (INI format). See src/config/config.ini.template for the full list of sections:

[PostgresDB] - Neon PostgreSQL connection string
[GeminiPro] - Google Gemini API keys
[openrouter] - OpenRouter API key (optional)
[elasticsearch] - Elasticsearch hosts, auth, index names
[cloudflare] - Cloudflare R2 credentials and bucket config
[Sendgrid] - SMTP credentials for alert emails
[core] - Shared settings (base URLs, etc.)
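
As a rough sketch of how such a file can be read and checked with the standard-library configparser: the section names below come from the list above, but the key names (`connection_string`, `hosts`) and the helper functions are hypothetical. The real keys are defined in src/config/config.ini.template.

```python
import configparser

# Sections listed above, excluding [openrouter], which is optional.
REQUIRED_SECTIONS = ("PostgresDB", "GeminiPro", "elasticsearch", "cloudflare", "Sendgrid", "core")

# Hypothetical sample content; key names are placeholders, not the real template.
SAMPLE_CONFIG = """
[PostgresDB]
connection_string = postgresql://user:p%40ss@host/dbname

[elasticsearch]
hosts = http://localhost:9200
"""


def load_config(text: str) -> configparser.ConfigParser:
    """Parse INI text. Interpolation is disabled so '%' in secrets is kept literally."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return parser


def missing_sections(parser: configparser.ConfigParser) -> list:
    """Return the required sections that are absent from the parsed config."""
    return [name for name in REQUIRED_SECTIONS if not parser.has_section(name)]
```

To read the real file instead of a string, `parser.read("src/config/config.ini")` can be used in place of `read_string`. Checking for missing sections at startup gives a clearer error than a failure deep inside a pipeline run.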

Project Structure

plombery-scraper/
├── src/
│   ├── app.py                    # Entry point - starts Uvicorn server
│   ├── jobs_scrape_pipeline.py   # Jobs scraping + AI enrichment pipeline
│   ├── jobs_alerts_pipeline.py   # Job alert email pipeline
│   ├── standardization_pipeline.py # Dimension standardization pipeline
│   ├── dubzl_crs.py              # Dubizzle car listings pipeline
│   ├── crswth_crs.py             # Carswitch car listings pipeline
│   ├── allsopp_crs.py            # Allsopp property listings pipeline
│   ├── acres99_crs.py            # 99acres property listings pipeline
│   ├── config/
│   │   ├── __init__.py           # Config reader (configparser)
│   │   ├── config.ini            # Local config (gitignored)
│   │   └── config.ini.template   # Template with placeholder values
│   ├── connections/
│   │   └── neondb_client.py      # Neon DB connection helper
│   └── utils/
│       ├── ai_infer.py           # Gemini API inference helper
│       └── rdbms_conn.py         # Database query executor
├── deploy/
│   └── plomberly.service         # systemd unit file for production
├── .github/workflows/
│   └── deploy.yml                # CI/CD - rsync to Oracle Cloud
├── requirements.txt
└── .gitignore

Deployment

Production (Oracle Cloud)

The GitHub Actions workflow (.github/workflows/deploy.yml) handles deployment:

Runs on pushes or merges to main or master, on releases, and on manual dispatch.
Syncs files to the remote server via rsync over SSH.
Installs dependencies if requirements.txt changed.
The app runs as a systemd service (plomberly.service).

Setting up GitHub secrets

Go to your repository Settings → Secrets and variables → Actions and add these secrets:

Secret	How to get it
`SSH_PRIVATE_KEY`	Generate a key pair locally: `ssh-keygen -t ed25519 -C "deploy"`. Copy the private key contents (`cat ~/.ssh/id_ed25519`).
`SSH_HOST`	The public IP or hostname of your Oracle Cloud instance. Find it in the Oracle Cloud Console under Compute → Instances → [your instance], then run `ssh <user>@<public-ip>` to confirm that it is reachable. You can also get it via OCI CLI from the instance's VNIC: `oci compute instance list-vnics --instance-id <instance-ocid> --query "data[0].\"public-ip\""`.
`SSH_USER`	The SSH username for your instance. For Oracle Cloud Ubuntu images this is typically `ubuntu`.
`SSH_PORT`	The SSH port (default is `22`). Check your instance's security list / NSG if you customized it.
`REMOTE_PATH`	The absolute path on the remote server where the app lives, e.g. `/home/ubuntu/plombery`.

After generating the key pair, add the public key to the remote server's ~/.ssh/authorized_keys:

ssh-copy-id -i ~/.ssh/id_ed25519.pub ubuntu@<SSH_HOST>

The systemd service runs:

/home/ubuntu/plombery/venv/bin/python /home/ubuntu/plombery/src/app.py

To manage the service on the remote server:

sudo systemctl start plomberly
sudo systemctl stop plomberly
sudo systemctl restart plomberly
sudo systemctl status plomberly

Local Development

source venv/bin/activate
python src/app.py

The server starts with hot-reload enabled (--reload), watching for changes in the parent directory.

Dependencies

Key dependencies:

plombery - Pipeline orchestration with web UI
python-jobspy - LinkedIn / Indeed / Glassdoor job scraper
elasticsearch 7.x - Search and indexing
sqlalchemy + psycopg2 - PostgreSQL access (Neon DB)
boto3 - Cloudflare R2 (S3-compatible) uploads
requests-html - JavaScript-rendered page loading
curl-cffi - TLS-fingerprint-resistant HTTP client
lxml[html_clean] - Fast HTML parsing
Google Gemini API (called via REST, no SDK package) - LLM-based data extraction
json-repair / dirtyjson / python-rapidjson - Robust JSON parsing for LLM outputs
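
LLM responses often wrap JSON in a Markdown code fence or surround it with prose, which is why tolerant parsers are listed above. The following standard-library sketch shows the general idea only. It is not how json-repair, dirtyjson, or python-rapidjson work internally, and the function name `parse_llm_json` is hypothetical.

```python
import json
import re
from typing import Optional

# Build the three-backtick fence marker programmatically to keep this example readable.
FENCE = "`" * 3
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def parse_llm_json(text: str) -> Optional[dict]:
    """Extract the first JSON object from an LLM reply, or return None on failure."""
    match = FENCED_BLOCK.search(text)
    candidate = match.group(1) if match else text
    # Trim any prose before the first '{' and after the last '}'.
    start = candidate.find("{")
    end = candidate.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(candidate[start:end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None
```

A strict parser like this returns None for malformed output such as trailing commas or single quotes; those are the cases where a repairing parser is the better choice.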
