## 📋 Summary of Changes

The server has been redesigned with a modular architecture and two operating modes.

## 🏗️ Architecture

```
server/
├── main.py                     # Main server with WatchServer (orchestrator)
├── database.py                 # PostgreSQL + pgvector database manager
├── embeddings.py               # Embeddings manager (vectors)
├── config.py                   # Centralized configuration
├── examples.py                 # Usage examples
├── requirements.txt            # Python dependencies
├── README.md                   # Complete documentation
└── scrapers/
    ├── base.py                 # Abstract BaseScraper interface
    ├── arxiv_scraper.py        # Scraper for arXiv
    ├── github_scraper.py       # Scraper for GitHub
    ├── medium_scraper.py       # Scraper for Medium
    ├── lemonde_scraper.py      # Scraper for Le Monde
    └── huggingface_scraper.py  # Scraper for Hugging Face
```

## 🎯 Operating Modes

### 1️⃣ Backfill Mode (History)
**When:** At startup (optional)

**What:** Scrapes all available history from each source.

**How:**
```bash
python main.py backfill --limit 100
```

**Flow:**
1. Each scraper calls `scrape_all()`.
2. Articles are saved (deduplicated by ID and hash).
3. Embeddings are generated and stored.
4. Sync history is recorded.

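The backfill flow above can be sketched as a small loop. This is an illustrative sketch only: `run_backfill`, the dict-based `db`, and the scraper fields are assumptions, not the server's actual API.

```python
# Hypothetical sketch of the backfill flow; names are illustrative,
# not the real server implementation.
from datetime import datetime, timezone


def run_backfill(scrapers, db, embedder, limit=100):
    """Scrape full history from each source, dedupe, embed, record sync."""
    for scraper in scrapers:
        items = scraper.scrape_all(limit)                # 1. full history
        new_items = [i for i in items if i["id"] not in db["articles"]]
        for item in new_items:                           # 2. save, deduplicated by ID
            db["articles"][item["id"]] = item
            db["embeddings"][item["id"]] = embedder(item["description"])  # 3. embed
        db["sync_history"].append({                      # 4. record the sync
            "source_site": scraper.name,
            "sync_mode": "backfill",
            "last_sync_time": datetime.now(timezone.utc).isoformat(),
            "items_processed": len(new_items),
        })
    return db
```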
### 2️⃣ Watch Mode (Monitoring)
**When:** Continuous monitoring after (or without) backfill.

**How:**
```bash
python main.py watch --interval 300
```

**Flow:**
1. Infinite loop (default 5-minute interval).
2. Each scraper calls `scrape_latest()`.
3. New articles are saved; embeddings generated.
4. Sync history is recorded.

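The watch loop above can be sketched in the same style. Again a hedged sketch: `run_watch`, the injectable `wait` function (used instead of a hard-coded `time.sleep` so the loop stays testable), and the dict-based `db` are assumptions.

```python
# Illustrative watch-mode loop; names are hypothetical and the wait
# function is injectable so the sketch can be exercised without sleeping.
import time


def run_watch(scrapers, db, embedder, interval=300, iterations=None, wait=time.sleep):
    """Poll each source, keep only unseen IDs, embed, then wait."""
    n = 0
    while iterations is None or n < iterations:          # 1. infinite by default
        for scraper in scrapers:
            for item in scraper.scrape_latest(20):       # 2. latest items only
                if item["id"] in db["articles"]:         #    skip already-seen IDs
                    continue
                db["articles"][item["id"]] = item        # 3. save + embed
                db["embeddings"][item["id"]] = embedder(item["description"])
        n += 1
        if iterations is None or n < iterations:
            wait(interval)                               # default 5-minute interval
```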
## 🔧 Main Components

### BaseScraper (abstract interface)
All scrapers inherit from this class and implement:
- `scrape_latest(limit)` → watch mode
- `scrape_all(limit)` → backfill mode
- `normalize_item()` → unified format

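A minimal sketch of that abstract interface, assuming dict-based items; the real `base.py` may differ in signatures and normalization fields.

```python
# Sketch of the abstract scraper interface; field names in
# normalize_item() are illustrative assumptions.
from abc import ABC, abstractmethod


class BaseScraper(ABC):
    name = "base"

    @abstractmethod
    def scrape_latest(self, limit):
        """Return the most recent items (watch mode)."""

    @abstractmethod
    def scrape_all(self, limit):
        """Return all available history (backfill mode)."""

    def normalize_item(self, raw):
        """Map a raw scraped dict to the unified article format."""
        return {
            "id": f"{self.name}:{raw['id']}",
            "source_site": self.name,
            "title": raw.get("title", ""),
            "description": raw.get("description", ""),
            "content_url": raw.get("url", ""),
        }
```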
### DatabaseManager
- PostgreSQL persistence with pgvector
- Tables: `articles`, `embeddings` (vector), `sync_history`
- Automatic deduplication via `ON CONFLICT`
- Vector-ready queries and batch operations

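The `ON CONFLICT` deduplication and batching can be sketched as follows. The SQL statement is illustrative (executing it needs a live PostgreSQL connection, e.g. via psycopg), so only the statement and a chunking helper are shown.

```python
# Hedged sketch: an upsert statement using PostgreSQL's ON CONFLICT
# clause, plus a helper that splits rows into batches.
UPSERT_ARTICLE = """
INSERT INTO articles (id, source_site, title, description, content_url)
VALUES (%(id)s, %(source_site)s, %(title)s, %(description)s, %(content_url)s)
ON CONFLICT (id) DO UPDATE SET
    title = EXCLUDED.title,
    description = EXCLUDED.description,
    updated_at = now();
"""


def batched(rows, size=500):
    """Yield fixed-size chunks so inserts run as batch operations."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]
```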
### EmbeddingManager
- Providers: Dummy, SentenceTransformers, OpenAI
- Generates numpy vectors sized to the chosen model
- Stores vectors directly in pgvector columns (no pickle)

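A hypothetical "Dummy" provider in that spirit: deterministic fixed-dimension vectors (standing in for SentenceTransformers or OpenAI), plus the `[x,y,...]` text literal that pgvector accepts. The dimension must match the vector column declared in the schema; both function names are assumptions.

```python
# Illustrative dummy embedding provider: deterministic vectors derived
# from a SHA-256 hash, not a real model.
import hashlib


def dummy_embed(text, dim=8):
    """Return a deterministic vector in [0, 1] for `text`."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [digest[i % len(digest)] / 255.0 for i in range(dim)]


def to_pgvector_literal(vec):
    """Format a vector as the '[x,y,...]' literal pgvector accepts."""
    return "[" + ",".join(f"{v:.6f}" for v in vec) + "]"
```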
### WatchServer (orchestrator)
- Initializes all scrapers
- Manages both modes
- Logging, statistics, and monitoring

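A minimal orchestrator sketch with per-scraper error isolation; the class and method names are illustrative, not the real `WatchServer` API.

```python
# Sketch of an orchestrator: one failing source never stops the rest,
# and simple counters stand in for the real statistics/monitoring.
class WatchServer:
    def __init__(self, scrapers, db, embedder):
        self.scrapers = scrapers
        self.db = db
        self.embedder = embedder
        self.stats = {"runs": 0, "new_articles": 0}

    def run_once(self, limit=20):
        """One monitoring pass over every scraper, with error isolation."""
        for scraper in self.scrapers:
            try:
                for item in scraper.scrape_latest(limit):
                    if item["id"] not in self.db:
                        self.db[item["id"]] = self.embedder(item["description"])
                        self.stats["new_articles"] += 1
            except Exception as exc:  # isolate failures per scraper
                print(f"[{scraper.name}] skipped: {exc}")
        self.stats["runs"] += 1
```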
## 💾 Database Structure

### Table `articles`
```
id (TEXT PRIMARY KEY)        # Unique identifier per source
source_site (TEXT)           # arxiv, github, medium, le_monde, huggingface
title (TEXT)                 # Article title
description (TEXT)           # Summary/content
author_info (TEXT)           # Author(s)
keywords (TEXT)              # Tags/categories
content_url (TEXT)           # Link to source
published_date (TIMESTAMPTZ) # Publication date
item_type (TEXT)             # article, paper, repository, etc.
created_at (TIMESTAMPTZ)     # When retrieved
updated_at (TIMESTAMPTZ)     # Last update
```

### Table `embeddings`
```
id (SERIAL PRIMARY KEY)      # Unique embedding row
article_id (TEXT UNIQUE)     # Link to articles.id
embedding vector(1536)       # pgvector column (dimension tied to embedding model)
embedding_model (TEXT)       # Which model generated the embedding
created_at (TIMESTAMPTZ)     # When created
```

### Table `sync_history`
```
id (SERIAL PRIMARY KEY)      # Unique sync row
source_site (TEXT)           # Which source
sync_mode (TEXT)             # "watch" or "backfill"
last_sync_time (TIMESTAMPTZ) # When
items_processed (INTEGER)    # How many articles
created_at (TIMESTAMPTZ)     # When recorded
```

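The three tables above could be created with DDL along these lines, kept here as a Python constant. This is a hedged sketch: the `DEFAULT now()` clauses and the foreign key are assumptions, and the actual migrations may differ.

```python
# Illustrative schema DDL matching the tables documented above.
# Requires the pgvector extension for the vector(1536) column.
SCHEMA_DDL = """
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS articles (
    id TEXT PRIMARY KEY,
    source_site TEXT NOT NULL,
    title TEXT,
    description TEXT,
    author_info TEXT,
    keywords TEXT,
    content_url TEXT,
    published_date TIMESTAMPTZ,
    item_type TEXT,
    created_at TIMESTAMPTZ DEFAULT now(),
    updated_at TIMESTAMPTZ DEFAULT now()
);

CREATE TABLE IF NOT EXISTS embeddings (
    id SERIAL PRIMARY KEY,
    article_id TEXT UNIQUE REFERENCES articles(id),
    embedding vector(1536),
    embedding_model TEXT,
    created_at TIMESTAMPTZ DEFAULT now()
);

CREATE TABLE IF NOT EXISTS sync_history (
    id SERIAL PRIMARY KEY,
    source_site TEXT,
    sync_mode TEXT,
    last_sync_time TIMESTAMPTZ,
    items_processed INTEGER,
    created_at TIMESTAMPTZ DEFAULT now()
);
"""
```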
## 🚀 Usage
```bash
# 1. Fill DB with history
python main.py backfill --limit 50

# 2. Monitor continuously
python main.py watch --interval 300

# 3. Check stats
python main.py stats
```

### With options
```bash
# Custom backfill
python main.py backfill --limit 200 --db-url postgresql://user:pass@localhost:5432/veille_technique

# Watch with 10-minute interval
python main.py watch --interval 600

# Stats on a specific DB
python main.py stats --db-url postgresql://user:pass@localhost:5432/veille_technique
```

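The CLI shown above can be sketched with argparse; the real `main.py` may organize its subcommands and defaults differently, so treat the default `--db-url` value as a placeholder assumption.

```python
# Illustrative argparse layout for the backfill / watch / stats commands.
import argparse

DEFAULT_DB_URL = "postgresql://localhost:5432/veille_technique"  # placeholder


def build_parser():
    parser = argparse.ArgumentParser(prog="main.py")
    sub = parser.add_subparsers(dest="command", required=True)

    backfill = sub.add_parser("backfill", help="scrape full history")
    backfill.add_argument("--limit", type=int, default=100)
    backfill.add_argument("--db-url", default=DEFAULT_DB_URL)

    watch = sub.add_parser("watch", help="monitor continuously")
    watch.add_argument("--interval", type=int, default=300)
    watch.add_argument("--db-url", default=DEFAULT_DB_URL)

    stats = sub.add_parser("stats", help="show database statistics")
    stats.add_argument("--db-url", default=DEFAULT_DB_URL)
    return parser
```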
## 📊 Complete Flow Example

```
Server startup
│
├─→ BACKFILL Mode (optional)
│     ├─→ ArXiv.scrape_all(100)   → … articles → DB
│     ├─→ GitHub.scrape_all(100)  → … articles → DB
│     ├─→ Medium.scrape_all(100)  → … articles → DB
│     ├─→ LeMonde.scrape_all(100) → … articles → DB
│     └─→ HF.scrape_all(100)      → … articles → DB
│          ↓ All articles receive an embedding
│
└─→ WATCH Mode (infinite loop)
      ├─→ Iteration 1 …
      ├─→ [Wait interval]
      └─→ Iteration 2 …
```

## 🔑 Key Design Points

### ✓ Modularity
- Independent scrapers; easy to add or remove a source
- Interchangeable embedding providers

### ✓ Robustness
- Error isolation per scraper: one failing source does not stop the others
- Automatic deduplication

### ✓ Scalability
- Batch DB operations
- Vector-ready schema
- Structured logging

### ✓ Maintainability
- Clear code and documentation
- Centralized configuration
- Usage examples included

## 💻 How to View the Database

Use PostgreSQL tooling (`psql`, `pgcli`, DBeaver, pgAdmin) with `DATABASE_URL`.

```bash
# List tables
psql "$DATABASE_URL" -c "\dt"

# Check pgvector extension
psql "$DATABASE_URL" -c "\dx vector"

# Quick counts
psql "$DATABASE_URL" -c "SELECT COUNT(*) FROM articles;"
psql "$DATABASE_URL" -c "SELECT COUNT(*) FROM embeddings;"

# Example vector query (top 5 nearest)
psql "$DATABASE_URL" -c "SELECT article_id, embedding <-> '[0.1,0.2,...]' AS distance FROM embeddings ORDER BY embedding <-> '[0.1,0.2,...]' LIMIT 5;"

# Last syncs
psql "$DATABASE_URL" -c "SELECT * FROM sync_history ORDER BY created_at DESC LIMIT 5;"

# Export (custom format)
pg_dump --dbname="$DATABASE_URL" --format=c --file=veille_technique.dump
```

## 📝 Migration from Old Server

Legacy code in `scrap/` remains for reference; the new server reuses the scraping logic with the updated architecture.