Commit b0ce93f

update doc
1 parent 757a611 commit b0ce93f

3 files changed

Lines changed: 194 additions & 3 deletions

README.md

Lines changed: 165 additions & 0 deletions
@@ -0,0 +1,165 @@
# Plombery Scraper

A scheduled data pipeline platform built on [Plombery](https://github.com/lucafaggianelli/plombery) for scraping, enriching, and publishing structured data. It exposes a web UI for monitoring and triggering pipelines, and runs multiple independent scrapers on cron/interval schedules.

## Quick Start

```bash
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

Copy the config template and fill in your credentials:

```bash
cp src/config/config.ini.template src/config/config.ini
```

Run the server:

```bash
python src/app.py
```

The dashboard is available at **http://localhost:8080**.

## Pipelines

| Pipeline | File | Schedule | Description |
|---|---|---|---|
| **Jobs Scraper** | `src/jobs_scrape_pipeline.py` | Every 8 hours | Scrapes LinkedIn for Data Engineer / Data Architect roles in UAE, Saudi Arabia, and Qatar via `python-jobspy`. Enriches listings with Gemini AI (skills, company info, job metadata). Persists to PostgreSQL and Elasticsearch, exports JSON to Cloudflare R2. |
| **Jobs Alerts** | `src/jobs_alerts_pipeline.py` | Daily at 07:45 GST | Queries subscriber preferences and sends matching job alert emails via SMTP. |
| **Dimension Standardization** | `src/standardization_pipeline.py` | Every 24 hours | Uses Gemini AI to map raw inferred values (job titles, countries) to standardized reference entries in the database. |
| **Dubizzle Cars** | `src/dubzl_crs.py` | Daily at 05:45 GST | Scrapes used-car listings from Dubizzle UAE (paginated Next.js SSR). Stores in Elasticsearch and exports to R2. |
| **Carswitch Cars** | `src/crswth_crs.py` | Scheduled | Scrapes car listings from Carswitch. Parses streaming HTML payloads, enriches with detail-page data, indexes into Elasticsearch, and exports to R2. |
| **Allsopp Property** | `src/allsopp_crs.py` | Scheduled | Scrapes residential sales listings from allsoppandallsopp.com (Dubai). Stores in Elasticsearch and exports to R2. |
| **99acres Property** | `src/acres99_crs.py` | Scheduled | Scrapes property listings from 99acres (India). Stores in Elasticsearch and exports to R2. |

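The daily schedules in the table are given in GST. As a minimal sketch, assuming GST here means Gulf Standard Time (UTC+4), the next run of a daily pipeline can be computed in UTC with the standard library alone. The helper name `next_daily_run` is hypothetical and is not taken from the pipeline code.

```python
from datetime import datetime, time, timedelta, timezone

# Assumption: "GST" in the schedule column is Gulf Standard Time, a fixed UTC+4 offset.
GST = timezone(timedelta(hours=4), name="GST")


def next_daily_run(now: datetime, at: time, tz: timezone = GST) -> datetime:
    """Return the next occurrence of local time `at` in `tz`, expressed in UTC.

    `now` must be timezone-aware.
    """
    local_now = now.astimezone(tz)
    candidate = datetime.combine(local_now.date(), at, tzinfo=tz)
    if candidate <= local_now:
        # Today's slot has already passed, so schedule for tomorrow.
        candidate += timedelta(days=1)
    return candidate.astimezone(timezone.utc)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print("Jobs Alerts next run (UTC):", next_daily_run(now, time(7, 45)))
    print("Dubizzle Cars next run (UTC):", next_daily_run(now, time(5, 45)))
```

Under that assumption, 07:45 GST corresponds to 03:45 UTC, which is worth keeping in mind when reading server logs on a host that runs in UTC.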
## Architecture

```
┌─────────────────────────────────────────────────┐
│             Plombery Web UI (:8080)             │
│         (FastAPI + Uvicorn + WebSocket)         │
└───────────────────────┬─────────────────────────┘

          ┌─────────────┼──────────────┐
          ▼             ▼              ▼
    ┌────────────┐ ┌──────────┐ ┌──────────────┐
    │ Jobs       │ │ Cars     │ │ Property     │
    │ Pipelines  │ │ Pipelines│ │ Pipelines    │
    └─────┬──────┘ └────┬─────┘ └──────┬───────┘
          │             │              │
          ▼             ▼              ▼
    ┌───────────┐  ┌───────────────────────────┐
    │ Gemini AI │  │ Data Stores               │
    │ (2.0 /    │  │ PostgreSQL (Neon DB)      │
    │ 1.5)      │  │ Elasticsearch 7.x         │
    └───────────┘  │ Cloudflare R2 (exports)   │
                   └───────────────────────────┘
```

### Data Flow (Jobs Pipeline)

1. **Scrape** - `python-jobspy` fetches LinkedIn job listings for configured locations.
2. **Deduplicate** - New jobs are filtered against existing hashes in PostgreSQL.
3. **Persist Raw** - Raw records are appended to PostgreSQL and bulk-indexed into Elasticsearch.
4. **AI Enrichment** - Gemini extracts structured fields (skills, seniority, company info) from raw descriptions.
5. **Persist Enriched** - Enriched records are saved to PostgreSQL and Elasticsearch.
6. **Export** - A public JSON snapshot is uploaded to Cloudflare R2 for downstream dashboards.
7. **Alert** - Subscribers receive matching job alerts via email.

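The deduplication step (step 2) can be illustrated with a small, self-contained sketch. It is not the pipeline's actual implementation: the field names in `HASH_FIELDS`, the choice of SHA-256, and the function names are all assumptions made for illustration, and the set of existing hashes stands in for whatever is read from PostgreSQL.

```python
import hashlib

# Hypothetical set of fields that identify a listing; the real pipeline may use others.
HASH_FIELDS = ("title", "company", "location", "job_url")


def job_hash(job: dict) -> str:
    """Build a stable fingerprint from normalized identifying fields."""
    normalized = "|".join(str(job.get(field, "")).strip().lower() for field in HASH_FIELDS)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def filter_new_jobs(jobs: list[dict], existing_hashes: set[str]) -> list[dict]:
    """Keep only jobs whose hash is neither stored already nor repeated in this batch."""
    seen = set(existing_hashes)
    fresh = []
    for job in jobs:
        digest = job_hash(job)
        if digest not in seen:
            seen.add(digest)
            fresh.append({**job, "hash": digest})
    return fresh


if __name__ == "__main__":
    scraped = [
        {"title": "Data Engineer", "company": "Acme", "location": "Dubai", "job_url": "https://example.com/1"},
        {"title": "data engineer ", "company": "ACME", "location": "Dubai", "job_url": "https://example.com/1"},
    ]
    print(len(filter_new_jobs(scraped, existing_hashes=set())))
```

Normalizing case and whitespace before hashing means cosmetic differences between scrapes do not produce duplicate rows, at the cost of treating listings that differ only in those respects as identical.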
## Configuration

All credentials are managed through `src/config/config.ini` (INI format). See `src/config/config.ini.template` for the full list of sections:

- **`[PostgresDB]`** - Neon PostgreSQL connection string
- **`[GeminiPro]`** - Google Gemini API keys
- **`[openrouter]`** - OpenRouter API key (optional)
- **`[elasticsearch]`** - Elasticsearch hosts, auth, index names
- **`[cloudflare]`** - Cloudflare R2 credentials and bucket config
- **`[Sendgrid]`** - SMTP credentials for alert emails
- **`[core]`** - Shared settings (base URLs, etc.)

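As a rough sketch of how such a file can be read, the snippet below uses the standard-library `configparser` and checks that the non-optional sections are present. The function name `load_config` and the `REQUIRED_SECTIONS` tuple are hypothetical; the project's own reader in `src/config/__init__.py` may be organized differently.

```python
import configparser

# Section names as listed in the README; [openrouter] is documented as optional.
REQUIRED_SECTIONS = ("PostgresDB", "GeminiPro", "elasticsearch", "cloudflare", "Sendgrid", "core")


def load_config(path: str = "src/config/config.ini",
                required: tuple[str, ...] = REQUIRED_SECTIONS) -> configparser.ConfigParser:
    """Load the INI file and fail early if a required section is missing."""
    # interpolation=None keeps literal "%" characters (common in URL-encoded
    # passwords) from being treated as interpolation syntax.
    parser = configparser.ConfigParser(interpolation=None)
    if not parser.read(path, encoding="utf-8"):
        raise FileNotFoundError(f"config file not found: {path}")
    missing = [name for name in required if not parser.has_section(name)]
    if missing:
        raise KeyError(f"missing config sections: {', '.join(missing)}")
    return parser


if __name__ == "__main__":
    config = load_config()
    print(config["elasticsearch"]["host"])
```

Section names are case-sensitive in `configparser`, while option names are lowercased by default, so `config["GeminiPro"]["API_KEY_RH"]` and `config["GeminiPro"]["api_key_rh"]` return the same value.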
## Project Structure

```
plombery-scraper/
├── src/
│   ├── app.py                          # Entry point - starts Uvicorn server
│   ├── jobs_scrape_pipeline.py         # Jobs scraping + AI enrichment pipeline
│   ├── jobs_alerts_pipeline.py         # Job alert email pipeline
│   ├── standardization_pipeline.py     # Dimension standardization pipeline
│   ├── dubzl_crs.py                    # Dubizzle car listings pipeline
│   ├── crswth_crs.py                   # Carswitch car listings pipeline
│   ├── allsopp_crs.py                  # Allsopp property listings pipeline
│   ├── acres99_crs.py                  # 99acres property listings pipeline
│   ├── config/
│   │   ├── __init__.py                 # Config reader (configparser)
│   │   ├── config.ini                  # Local config (gitignored)
│   │   └── config.ini.template         # Template with placeholder values
│   ├── connections/
│   │   └── neondb_client.py            # Neon DB connection helper
│   └── utils/
│       ├── ai_infer.py                 # Gemini API inference helper
│       └── rdbms_conn.py               # Database query executor
├── deploy/
│   └── plomberly.service               # systemd unit file for production
├── .github/workflows/
│   └── deploy.yml                      # CI/CD - rsync to Oracle Cloud
├── requirements.txt
└── .gitignore
```

## Deployment

### Production (Oracle Cloud)

The GitHub Actions workflow (`.github/workflows/deploy.yml`) handles deployment:

1. Triggers on push/merge to `main` or `master`, releases, and manual dispatch.
2. Syncs files to the remote server via `rsync` over SSH.
3. Installs dependencies if `requirements.txt` changed.
4. The app runs as a systemd service (`plomberly.service`).

Required GitHub secrets: `SSH_HOST`, `SSH_USER`, `SSH_PORT`, `REMOTE_PATH`, `SSH_PRIVATE_KEY`.

The systemd service runs:

```
/home/ubuntu/plombery/venv/bin/python /home/ubuntu/plombery/src/app.py
```

To manage the service on the remote server:

```bash
sudo systemctl start plomberly
sudo systemctl stop plomberly
sudo systemctl restart plomberly
sudo systemctl status plomberly
```

### Local Development

```bash
source venv/bin/activate
python src/app.py
```

The server starts with hot-reload enabled (`--reload`), watching for changes in the parent directory.

## Dependencies

Key dependencies:

- **plombery** - Pipeline orchestration with web UI
- **python-jobspy** - LinkedIn / Indeed / Glassdoor job scraper
- **elasticsearch 7.x** - Search and indexing
- **sqlalchemy + psycopg2** - PostgreSQL access (Neon DB)
- **boto3** - Cloudflare R2 (S3-compatible) uploads
- **requests-html** - JavaScript-rendered page loading
- **curl-cffi** - TLS-fingerprint-resistant HTTP client
- **lxml[html_clean]** - Fast HTML parsing
- **google-gemini** (via REST) - LLM-based data extraction
- **json-repair / dirtyjson / python-rapidjson** - Robust JSON parsing for LLM outputs
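
To show why tolerant JSON parsing matters for LLM output, here is a standard-library-only sketch that pulls a JSON object out of a reply wrapped in prose or a Markdown code fence. It is an illustration, not the project's code: the function name `parse_llm_json` is hypothetical, and the libraries listed above go further by repairing malformed JSON, which this sketch does not attempt.

```python
import json
import re

FENCE = "`" * 3  # Markdown code-fence delimiter, built indirectly to keep this block readable
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def parse_llm_json(text: str) -> dict:
    """Extract and parse the first JSON object found in an LLM reply."""
    match = FENCE_RE.search(text)
    candidate = match.group(1) if match else text
    # Trim any chatter before the first "{" and after the last "}".
    start = candidate.find("{")
    end = candidate.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    return json.loads(candidate[start:end + 1])


if __name__ == "__main__":
    reply = "Here is the result:\n" + FENCE + 'json\n{"skills": ["SQL", "Spark"]}\n' + FENCE
    print(parse_llm_json(reply))
```

A strict parser like this still raises `json.JSONDecodeError` on trailing commas or single-quoted keys, which is the gap a repair library is meant to cover.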

requirements.lock

Lines changed: 1 addition & 1 deletion
@@ -87,6 +87,6 @@ tzlocal==5.2
 urllib3==1.26.19
 uvicorn==0.40.0
 w3lib==2.2.1
-websockets==11.0.3
+websockets
 wsproto==1.2.0
 zipp==3.19.2

src/config/config.ini.template

Lines changed: 28 additions & 2 deletions
@@ -1,7 +1,33 @@
 [PostgresDB]
-connection_string = postgresql://********:*********@ep-****-****-****-pooler.eu-central-1.aws.neon.tech/*****?sslmode=require
+connection_string = postgresql://user:password@host/dbname?sslmode=require

 [GeminiPro]
-API_KEY = **********************************
+API_KEY_RH = your-gemini-api-key-1
+API_KEY_RHA = your-gemini-api-key-2

+[openrouter]
+API_KEY = your-openrouter-api-key

+[elasticsearch]
+host = https://your-es-host:9200
+username = elastic
+password = your-es-password
+index = cars_listings
+jobs_index = jobs_analyzer_jobs
+
+[cloudflare]
+ACCOUNT_ID = your-cloudflare-account-id
+R2_ENDPOINT = https://your-account-id.r2.cloudflarestorage.com
+ACCESS_KEY_ID = your-r2-access-key-id
+SECRET_ACCESS_KEY = your-r2-secret-access-key
+BUCKET = your-r2-bucket
+JOBS_BUCKET = me-data-jobs
+JOBS_EXPORT_KEY = jobs.json
+JOBS_CACHE_CONTROL = public, max-age=300
+
+[Sendgrid]
+Username = your-smtp-username
+Password = your-smtp-password
+
+[core]
+base_url = https://example.com/api/listings?page={0}&offset={1}
