Commit c0de3e2

feat(server): update server with pgvector

1 parent a6c95f5 · commit c0de3e2

11 files changed: 401 additions & 695 deletions

server/.env.example (6 additions, 0 deletions)

```diff
@@ -1,5 +1,11 @@
 # OpenAI API Configuration
 OPENAI_API_KEY=your_openai_api_key_here
 
+# PostgreSQL connection (used by DatabaseManager)
+DATABASE_URL=postgresql://postgres:postgres@localhost:5432/veille_technique
+
+# Optional: Override embedding model (OpenAI)
+EMBEDDING_MODEL=text-embedding-3-small
+
 # Optional: GitHub Token for higher rate limits
 GITHUB_TOKEN=your_github_token_here
```
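The two new variables can be consumed from Python with a small helper that falls back to the documented defaults when the environment is unset. `get_setting` and the `DEFAULTS` table are illustrative names, not part of the server code:

```python
import os

# Defaults mirror server/.env.example; the helper itself is a sketch.
DEFAULTS = {
    "DATABASE_URL": "postgresql://postgres:postgres@localhost:5432/veille_technique",
    "EMBEDDING_MODEL": "text-embedding-3-small",
}

def get_setting(name: str) -> str:
    """Return the environment value if set, otherwise the documented default."""
    return os.environ.get(name, DEFAULTS[name])
```

With no environment override, `get_setting("EMBEDDING_MODEL")` returns the `.env.example` default.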

server/ARCHITECTURE.md (88 additions, 130 deletions)

````diff
@@ -2,119 +2,117 @@
 
 ## 📋 Summary of Changes
 
-The server has been completely redesigned with a modular architecture and two clearly defined operating modes.
+The server has been redesigned with a modular architecture and two operating modes.
 
 ## 🏗️ Architecture
 
 ```
 server/
 ├── main.py                      # Main server with WatchServer (orchestrator)
-├── database.py                  # SQLite database manager
+├── database.py                  # PostgreSQL + pgvector database manager
 ├── embeddings.py                # Embeddings manager (vectors)
 ├── config.py                    # Centralized configuration
 ├── examples.py                  # Usage examples
 ├── requirements.txt             # Python dependencies
 ├── README.md                    # Complete documentation
 └── scrapers/
     ├── base.py                  # Abstract BaseScraper interface
     ├── arxiv_scraper.py         # Scraper for arXiv
     ├── github_scraper.py        # Scraper for GitHub
     ├── medium_scraper.py        # Scraper for Medium
     ├── lemonde_scraper.py       # Scraper for Le Monde
     └── huggingface_scraper.py   # Scraper for Hugging Face
 ```
 
-## 🎯 Two Operating Modes
+## 🎯 Operating Modes
 
 ### 1️⃣ Backfill Mode (History)
-**When:** At server startup
+**When:** At startup (optional)
 
-**What:** Scrapes all available history from each source
+**What:** Scrapes all available history from each source.
 
 **How:**
 ```bash
 python main.py backfill --limit 100
 ```
 
 **Flow:**
-1. Scraper calls `scrape_all()` for each source
-2. Articles are saved to DB (deduplicated by ID)
-3. Embeddings generated and stored for each article
-4. Sync recorded in `sync_history`
+1. Each scraper calls `scrape_all()`.
+2. Articles are saved (deduplicated by ID and hash).
+3. Embeddings are generated and stored.
+4. Sync history is recorded.
 
 ### 2️⃣ Watch Mode (Monitoring)
-**When:** After backfill or directly
-
-**What:** Continuously scrapes new articles
+**When:** Continuous monitoring after (or without) backfill.
 
 **How:**
 ```bash
 python main.py watch --interval 300
 ```
 
 **Flow:**
-1. Infinite loop (by default, checks every 5 min)
-2. Scraper calls `scrape_latest()` for each source
-3. New articles detected (ID comparison)
-4. Save and create embeddings
-5. Wait for next interval
+1. Infinite loop (default 5-minute interval).
+2. Each scraper calls `scrape_latest()`.
+3. New articles are saved; embeddings generated.
+4. Sync history is recorded.
 
 ## 🔧 Main Components
 
 ### BaseScraper (abstract interface)
-All scrapers inherit from this class and implement:
-- `scrape_latest(limit)` → for watch mode
-- `scrape_all(limit)` → for backfill mode
+- `scrape_latest(limit)` → watch mode
+- `scrape_all(limit)` → backfill mode
 - `normalize_item()` → unified format
 
 ### DatabaseManager
-- Manages SQLite persistence
-- Tables: articles, embeddings, sync_history
-- Automatic deduplication (INSERT OR IGNORE)
-- Batch operations for performance
+- PostgreSQL persistence with pgvector
+- Tables: articles, embeddings (vector), sync_history
+- Automatic deduplication via `ON CONFLICT`
+- Vector-ready queries and batch operations
 
 ### EmbeddingManager
-- Support for multiple providers (Dummy, SentenceTransformers)
-- Vector serialization/deserialization
-- Storage as BLOB in DB
+- Providers: Dummy, SentenceTransformers, OpenAI
+- Generates numpy vectors sized to the chosen model
+- Stores vectors directly in pgvector columns (no pickle)
 
 ### WatchServer (orchestrator)
 - Initializes all scrapers
 - Manages both modes
-- Detailed operation logging
-- Statistics and monitoring
+- Logging, statistics, and monitoring
 
 ## 💾 Database Structure
 
 ### Table `articles`
 ```
 id (TEXT PRIMARY KEY)          # Unique identifier per source
 source_site (TEXT)             # arxiv, github, medium, le_monde, huggingface
 title (TEXT)                   # Article title
 description (TEXT)             # Summary/content
 author_info (TEXT)             # Author(s)
 keywords (TEXT)                # Tags/categories
 content_url (TEXT)             # Link to source
-published_date (TEXT)          # Publication date
+published_date (TIMESTAMPTZ)   # Publication date
 item_type (TEXT)               # article, paper, repository, etc.
-created_at (TIMESTAMP)         # When we retrieved it
-updated_at (TIMESTAMP)         # Last update
+created_at (TIMESTAMPTZ)       # When retrieved
+updated_at (TIMESTAMPTZ)       # Last update
 ```
 
 ### Table `embeddings`
 ```
-article_id (TEXT UNIQUE)       # Link to articles.id
-embedding (BLOB)               # Serialized vector (pickle)
-embedding_model (TEXT)         # Which model generated the embedding
-created_at (TIMESTAMP)         # When created
+id (SERIAL PRIMARY KEY)        # Unique embedding row
+article_id (TEXT UNIQUE)       # Link to articles.id
+embedding vector(1536)         # pgvector column (dimension tied to embedding model)
+embedding_model (TEXT)         # Which model generated the embedding
+created_at (TIMESTAMPTZ)       # When created
 ```
 
 ### Table `sync_history`
 ```
-source_site (TEXT)             # Which source
-sync_mode (TEXT)               # "watch" or "backfill"
-last_sync_time (TIMESTAMP)     # When
-items_processed (INTEGER)      # How many articles
+id (SERIAL PRIMARY KEY)        # Unique sync row
+source_site (TEXT)             # Which source
+sync_mode (TEXT)               # "watch" or "backfill"
+last_sync_time (TIMESTAMPTZ)   # When
+items_processed (INTEGER)      # How many articles
+created_at (TIMESTAMPTZ)       # When recorded
 ```
 
 ## 🚀 Usage
````
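As a rough illustration of the pgvector schema this diff describes, the tables might be created with DDL like the following. The column lists follow the document; constraint details, defaults, and the `ON CONFLICT` target are assumptions, not taken from database.py:

```python
# Sketch of the DDL implied by the new schema; kept as strings so the
# statements can be executed by whichever Postgres driver the server uses.
DDL = """
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS articles (
    id             TEXT PRIMARY KEY,
    source_site    TEXT,
    title          TEXT,
    description    TEXT,
    author_info    TEXT,
    keywords       TEXT,
    content_url    TEXT,
    published_date TIMESTAMPTZ,
    item_type      TEXT,
    created_at     TIMESTAMPTZ DEFAULT now(),
    updated_at     TIMESTAMPTZ DEFAULT now()
);

CREATE TABLE IF NOT EXISTS embeddings (
    id              SERIAL PRIMARY KEY,
    article_id      TEXT UNIQUE REFERENCES articles(id),
    embedding       vector(1536),
    embedding_model TEXT,
    created_at      TIMESTAMPTZ DEFAULT now()
);

CREATE TABLE IF NOT EXISTS sync_history (
    id              SERIAL PRIMARY KEY,
    source_site     TEXT,
    sync_mode       TEXT,
    last_sync_time  TIMESTAMPTZ,
    items_processed INTEGER,
    created_at      TIMESTAMPTZ DEFAULT now()
);
"""

# One way to express the "deduplication via ON CONFLICT" behavior; the exact
# update clause used by DatabaseManager is not shown in the diff.
UPSERT_ARTICLE = """
INSERT INTO articles (id, source_site, title)
VALUES (%s, %s, %s)
ON CONFLICT (id) DO UPDATE SET updated_at = now();
"""
```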
```diff
@@ -124,7 +122,7 @@ items_processed (INTEGER)      # How many articles
 # 1. Fill DB with history
 python main.py backfill --limit 50
 
-# 2. Then monitor continuously
+# 2. Monitor continuously
 python main.py watch --interval 300
 
 # 3. Check stats
```
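The backfill/watch/stats commands suggest a subcommand-style CLI. A minimal argparse sketch of that surface (the wiring is assumed, not taken from main.py):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical reconstruction of main.py's CLI from the documented commands.
    parser = argparse.ArgumentParser(prog="main.py")
    sub = parser.add_subparsers(dest="command", required=True)

    backfill = sub.add_parser("backfill", help="scrape all available history")
    backfill.add_argument("--limit", type=int, default=100)

    watch = sub.add_parser("watch", help="poll sources continuously")
    watch.add_argument("--interval", type=int, default=300,
                       help="seconds between iterations")

    sub.add_parser("stats", help="print database statistics")
    return parser
```

`build_parser().parse_args(["backfill", "--limit", "50"])` then yields a namespace with `command="backfill"` and `limit=50`.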
````diff
@@ -134,13 +132,13 @@ python main.py stats
 ### With options
 ```bash
 # Custom backfill
-python main.py backfill --limit 200 --db custom.db
+python main.py backfill --limit 200 --db-url postgresql://user:pass@localhost:5432/veille_technique
 
-# Watch with 10 min interval
+# Watch with 10-minute interval
 python main.py watch --interval 600
 
 # Stats on specific DB
-python main.py stats --db custom.db
+python main.py stats --db-url postgresql://user:pass@localhost:5432/veille_technique
 ```
 
 ## 📊 Complete Flow Example
````
````diff
@@ -149,104 +147,64 @@ python main.py stats --db-url postgresql://user:pass@localhost:5432/veille_technique
 Server startup
 
 ├─→ BACKFILL Mode (optional)
-│   ├─→ ArXiv.scrape_all(100)   → 45 articles → DB
-│   ├─→ GitHub.scrape_all(100)  → 78 articles → DB
-│   ├─→ Medium.scrape_all(100)  → 23 articles → DB
-│   ├─→ LeMonde.scrape_all(100) → 67 articles → DB
-│   └─→ HF.scrape_all(100)      → 89 articles → DB
+│   ├─→ ArXiv.scrape_all(100)   → articles → DB
+│   ├─→ GitHub.scrape_all(100)  → articles → DB
+│   ├─→ Medium.scrape_all(100)  → articles → DB
+│   ├─→ LeMonde.scrape_all(100) → articles → DB
+│   └─→ HF.scrape_all(100)      → articles → DB
 │        ↓ All articles receive an embedding
-│        → 302 articles in DB with embeddings
 
 └─→ WATCH Mode (infinite loop)
-    ├─→ Iteration 1
-    │   ├─→ ArXiv.scrape_latest(20)   → 2 new
-    │   ├─→ GitHub.scrape_latest(20)  → 1 new
-    │   ├─→ Medium.scrape_latest(20)  → 0 new
-    │   ├─→ LeMonde.scrape_latest(20) → 1 new
-    │   └─→ HF.scrape_latest(20)      → 2 new
-    │   → 6 new articles added
-
-    ├─→ [Wait 5 min]
-
-    └─→ Iteration 2
-        └─→ ...
+    ├─→ Iteration 1 …
+    ├─→ [Wait interval]
+    └─→ Iteration 2 …
 ```
 
 ## 🔑 Key Design Points
 
 ### ✓ Modularity
-- Each scraper is independent
-- Easy to add/remove a source
+- Independent scrapers; easy to add/remove
 - Interchangeable embedding providers
 
 ### ✓ Robustness
-- Error handling per scraper
-- No interruption if a source fails
-- Automatic deduplication
+- Error isolation per scraper
+- Deduplication prevents duplicates
 
 ### ✓ Scalability
-- Batch operations for DB
-- Context manager for connections
-- Logging for monitoring
+- Batch DB operations
+- Vector-ready schema
+- Structured logging
 
 ### ✓ Maintainability
-- Clear and documented code
+- Clear code and docs
 - Centralized configuration
-- Usage examples
+- Usage examples included
 
 ## 💻 How to View the Database
 
-### Option 1: Export and View Locally
-```bash
-# Export database to local .db file
-python main.py export --db veille_technique.db --output veille_export.db
-
-# View with SQLite Browser or VSCode extension
-# Export creates a complete copy of the DB
-```
+Use PostgreSQL tooling (`psql`, `pgcli`, DBeaver, pgAdmin) with `DATABASE_URL`.
 
-### Option 2: Use sqlite3 from Command Line
 ```bash
-# Open the database
-sqlite3 veille_technique.db
-
-# Some useful queries
-sqlite> SELECT COUNT(*) FROM articles;                                      -- Total articles
-sqlite> SELECT source_site, COUNT(*) FROM articles GROUP BY source_site;    -- By source
-sqlite> SELECT * FROM articles LIMIT 5;                                     -- View first 5 articles
-sqlite> SELECT source_site, COUNT(*) FROM embeddings GROUP BY source_site;  -- Embeddings per source
-```
+# List tables
+psql "$DATABASE_URL" -c "\dt"
 
-### Option 3: Use a GUI
-- **SQLite Browser**: `brew install sqlitebrowser` (macOS) or `apt install sqlitebrowser` (Linux)
-- **VSCode Extension**: "SQLite" extension (officially supported)
-- **DBeaver Community**: Free multi-DB application
+# Check pgvector extension
+psql "$DATABASE_URL" -c "\dx vector"
 
-### Example: View Articles from One Source
-```bash
-sqlite3 veille_technique.db << EOF
-.headers on
-.mode column
-SELECT title, author_info, published_date FROM articles
-WHERE source_site = 'github'
-ORDER BY published_date DESC
-LIMIT 10;
-EOF
-```
+# Quick counts
+psql "$DATABASE_URL" -c "SELECT COUNT(*) FROM articles;"
+psql "$DATABASE_URL" -c "SELECT COUNT(*) FROM embeddings;"
 
-### Complete DB Structure
-```bash
-# View all tables
-sqlite3 veille_technique.db ".tables"
+# Example vector query (top 5 nearest)
+psql "$DATABASE_URL" -c "SELECT article_id, embedding <-> '[0.1,0.2,...]' AS distance FROM embeddings ORDER BY embedding <-> '[0.1,0.2,...]' LIMIT 5;"
 
-# View schema of a table
-sqlite3 veille_technique.db ".schema articles"
+# Last syncs
+psql "$DATABASE_URL" -c "SELECT * FROM sync_history ORDER BY created_at DESC LIMIT 5;"
 
-# View sync stats
-sqlite3 veille_technique.db "SELECT * FROM sync_history ORDER BY created_at DESC LIMIT 5;"
+# Export (custom format)
+pg_dump --dbname="$DATABASE_URL" --format=c --file=veille_technique.dump
 ```
 
 ## 📝 Migration from Old Server
 
-Old code in the `scrap/` folder remains untouched for reference.
-The new server reuses the scraping logic but with a completely restructured architecture.
+Legacy code in `scrap/` remains for reference; the new server reuses scraping logic with the updated architecture.
````
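The `'[0.1,0.2,...]'` placeholder in the psql example above uses pgvector's text input format: a bracketed, comma-separated list. A small helper can build such literals from Python floats; the helper name and the parameterized query below are illustrative, not part of the server code:

```python
from typing import Iterable

def to_pgvector_literal(vec: Iterable[float]) -> str:
    """Format a sequence of floats as a pgvector text literal, e.g. [0.25,0.5]."""
    return "[" + ",".join(repr(float(x)) for x in vec) + "]"

# <-> is pgvector's L2-distance operator; the query mirrors the psql example and
# would be passed to a Postgres driver together with (literal, limit) parameters.
NEAREST_SQL = (
    "SELECT article_id, embedding <-> %s::vector AS distance "
    "FROM embeddings ORDER BY distance LIMIT %s;"
)
```

Passing the literal as a bound parameter (rather than splicing it into the SQL string, as the one-line psql example does) avoids quoting issues for real query vectors.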

server/config.py (5 additions, 6 deletions)

```diff
@@ -1,7 +1,6 @@
-"""
-Configuration for the watch server.
-"""
+"""Configuration for the watch server."""
 
+import os
 from dataclasses import dataclass
 from typing import Dict
 
@@ -19,7 +18,7 @@ class ScraperConfig:
 class ServerConfig:
     """Global server configuration."""
 
-    db_path: str = "veille_technique.db"
+    db_url: str = os.getenv("DATABASE_URL", "postgresql://postgres:postgres@localhost:5432/veille_technique")
 
     watch_interval_seconds: int = 300
 
@@ -62,12 +61,12 @@ def from_file(cls, filepath: str) -> "ServerConfig":
 DEFAULT_CONFIG = ServerConfig()
 
 DEV_CONFIG = ServerConfig(
-    db_path="veille_technique_dev.db",
+    db_url=os.getenv("DATABASE_URL", "postgresql://postgres:postgres@localhost:5432/veille_technique_dev"),
     watch_interval_seconds=60,
 )
 
 PROD_CONFIG = ServerConfig(
-    db_path="veille_technique.db",
+    db_url=os.getenv("DATABASE_URL", "postgresql://postgres:postgres@localhost:5432/veille_technique"),
     watch_interval_seconds=600,
     log_level="WARNING",
 )
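The default/dev/prod pattern in this diff can be summarized in a self-contained sketch. The `_dsn` helper and `CONFIGS` mapping are hypothetical conveniences, and the `INFO` default log level is an assumption; the intervals match the diff (dev 60s, prod 600s):

```python
import os
from dataclasses import dataclass, field

def _dsn(default_db: str) -> str:
    # DATABASE_URL wins when set; otherwise fall back to the local default,
    # matching the os.getenv(...) pattern shown in the config.py diff.
    return os.getenv("DATABASE_URL",
                     f"postgresql://postgres:postgres@localhost:5432/{default_db}")

@dataclass
class ServerConfig:
    db_url: str = field(default_factory=lambda: _dsn("veille_technique"))
    watch_interval_seconds: int = 300
    log_level: str = "INFO"  # assumed default; only PROD's WARNING appears in the diff

CONFIGS = {
    "default": ServerConfig(),
    "dev": ServerConfig(db_url=_dsn("veille_technique_dev"), watch_interval_seconds=60),
    "prod": ServerConfig(watch_interval_seconds=600, log_level="WARNING"),
}
```

Using `default_factory` (rather than calling `os.getenv` in the field default) re-reads the environment each time a config object is created, which is slightly different from the import-time evaluation in the diff.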
