Commit 05a1daf

feat(server): add complete server for data scraping

1 parent c4c5d94 commit 05a1daf
28 files changed, 2274 additions & 0 deletions

server/ARCHITECTURE.md

Lines changed: 252 additions & 0 deletions
@@ -0,0 +1,252 @@
# Watch Server - Complete Redesign

## 📋 Summary of Changes

The server has been completely redesigned around a modular architecture with two clearly defined operating modes.

## 🏗️ Architecture

```
server/
├── main.py                    # Main server with WatchServer (orchestrator)
├── database.py                # SQLite database manager
├── embeddings.py              # Embedding manager (vectors)
├── config.py                  # Centralized configuration
├── examples.py                # Usage examples
├── requirements.txt           # Python dependencies
├── README.md                  # Complete documentation
└── scrapers/
    ├── base.py                # Abstract BaseScraper interface
    ├── arxiv_scraper.py       # Scraper for arXiv
    ├── github_scraper.py      # Scraper for GitHub
    ├── medium_scraper.py      # Scraper for Medium
    ├── lemonde_scraper.py     # Scraper for Le Monde
    └── huggingface_scraper.py # Scraper for Hugging Face
```
## 🎯 Two Operating Modes

### 1️⃣ Backfill Mode (History)
**When:** At server startup

**What:** Scrapes all available history from each source

**How:**
```bash
python main.py backfill --limit 100
```

**Flow:**
1. The server calls `scrape_all()` on each scraper
2. Articles are saved to the DB (deduplicated by ID)
3. An embedding is generated and stored for each article
4. The sync is recorded in `sync_history`
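The four steps above can be condensed into a single function. This is a hypothetical sketch, not the real API: the names `save_articles`, `save_embedding`, `record_sync`, `embed`, and `scraper.name` are assumptions for illustration.

```python
# Hypothetical sketch of the backfill flow; db/embedder method names are assumed.
def run_backfill(scrapers, db, embedder, limit=100):
    for scraper in scrapers:
        articles = scraper.scrape_all(limit)            # 1. scrape full history
        saved = db.save_articles(articles)              # 2. save, deduplicated by ID
        for article in saved:
            vec = embedder.embed(article["description"])
            db.save_embedding(article["id"], vec)       # 3. store the vector
        db.record_sync(scraper.name, "backfill", len(saved))  # 4. log in sync_history
```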
### 2️⃣ Watch Mode (Monitoring)
**When:** After the backfill, or directly

**What:** Continuously scrapes new articles

**How:**
```bash
python main.py watch --interval 300
```

**Flow:**
1. Infinite loop (checks every 5 min by default)
2. The server calls `scrape_latest()` on each scraper
3. New articles are detected (ID comparison)
4. New articles are saved and their embeddings created
5. Wait for the next interval
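The watch flow can be sketched the same way. Again a hypothetical illustration: `has_article`, `save_articles`, and `save_embedding` are assumed method names, not the real API.

```python
import time

# Hypothetical sketch of the watch loop; db/embedder method names are assumed.
def watch_once(scrapers, db, embedder, limit=20):
    new_total = 0
    for scraper in scrapers:
        items = scraper.scrape_latest(limit)                       # 2. latest items
        fresh = [a for a in items if not db.has_article(a["id"])]  # 3. ID comparison
        db.save_articles(fresh)                                    # 4. save new articles
        for article in fresh:
            db.save_embedding(article["id"], embedder.embed(article["description"]))
        new_total += len(fresh)
    return new_total

def watch_forever(scrapers, db, embedder, interval=300):
    while True:                                        # 1. infinite loop
        watch_once(scrapers, db, embedder)
        time.sleep(interval)                           # 5. wait for the next interval
```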
## 🔧 Main Components

### BaseScraper (abstract interface)
All scrapers inherit from this class and implement:
- `scrape_latest(limit)` → for watch mode
- `scrape_all(limit)` → for backfill mode
- `normalize_item()` → unified format
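A minimal sketch of what this abstract interface might look like, built from the three methods listed above; the exact signatures and the dict-based return type are assumptions.

```python
from abc import ABC, abstractmethod

class BaseScraper(ABC):
    """Abstract interface every source scraper implements (sketch)."""

    @abstractmethod
    def scrape_latest(self, limit: int) -> list[dict]:
        """Fetch the most recent items (watch mode)."""

    @abstractmethod
    def scrape_all(self, limit: int) -> list[dict]:
        """Fetch all available history (backfill mode)."""

    @abstractmethod
    def normalize_item(self, raw: dict) -> dict:
        """Map a source-specific payload to the unified article format."""
```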
### DatabaseManager
- Manages SQLite persistence
- Tables: `articles`, `embeddings`, `sync_history`
- Automatic deduplication (`INSERT OR IGNORE`)
- Batch operations for performance
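The deduplication idea can be demonstrated in isolation: with `INSERT OR IGNORE`, a batch insert silently skips rows whose primary key already exists. A self-contained demo with a reduced `articles` table:

```python
import sqlite3

# INSERT OR IGNORE keeps the first row for a given primary key
# and silently drops later duplicates.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id TEXT PRIMARY KEY, title TEXT)")
rows = [("arxiv:1", "Paper A"), ("arxiv:1", "Paper A (dup)"), ("arxiv:2", "Paper B")]
conn.executemany("INSERT OR IGNORE INTO articles VALUES (?, ?)", rows)  # batch insert
count = conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
print(count)  # 2 — the duplicate id was ignored
```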
### EmbeddingManager
- Support for multiple providers (Dummy, SentenceTransformers)
- Vector serialization/deserialization
- Storage as a BLOB in the DB
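The serialization path follows from the schema below (`embedding (BLOB) # Serialized vector (pickle)`): pickle the vector on the way in, unpickle on the way out. A minimal round-trip, using a reduced table:

```python
import pickle
import sqlite3

# Store a vector as a pickled BLOB and read it back.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE embeddings (article_id TEXT UNIQUE, embedding BLOB)")
vector = [0.12, -0.34, 0.56]
blob = pickle.dumps(vector)                              # serialize
conn.execute("INSERT INTO embeddings VALUES (?, ?)", ("arxiv:1", blob))
stored = conn.execute("SELECT embedding FROM embeddings").fetchone()[0]
assert pickle.loads(stored) == vector                    # deserialize round-trip
```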
### WatchServer (orchestrator)
- Initializes all scrapers
- Manages both modes
- Detailed operation logging
- Statistics and monitoring
## 💾 Database Structure

### Table `articles`
```
id (TEXT PRIMARY KEY)      # Unique identifier per source
source_site (TEXT)         # arxiv, github, medium, le_monde, huggingface
title (TEXT)               # Article title
description (TEXT)         # Summary/content
author_info (TEXT)         # Author(s)
keywords (TEXT)            # Tags/categories
content_url (TEXT)         # Link to the source
published_date (TEXT)      # Publication date
item_type (TEXT)           # article, paper, repository, etc.
created_at (TIMESTAMP)     # When we retrieved it
updated_at (TIMESTAMP)     # Last update
```

### Table `embeddings`
```
article_id (TEXT UNIQUE)   # Link to articles.id
embedding (BLOB)           # Serialized vector (pickle)
embedding_model (TEXT)     # Which model generated the embedding
created_at (TIMESTAMP)     # When created
```

### Table `sync_history`
```
source_site (TEXT)         # Which source
sync_mode (TEXT)           # "watch" or "backfill"
last_sync_time (TIMESTAMP) # When
items_processed (INTEGER)  # How many articles
```
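The three tables could be created with DDL along these lines. This is a sketch inferred from the column lists above; the real schema may differ in constraints and defaults.

```python
import sqlite3

# DDL inferred from the documented columns; constraints/defaults are assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS articles (
    id             TEXT PRIMARY KEY,
    source_site    TEXT,
    title          TEXT,
    description    TEXT,
    author_info    TEXT,
    keywords       TEXT,
    content_url    TEXT,
    published_date TEXT,
    item_type      TEXT,
    created_at     TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at     TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS embeddings (
    article_id      TEXT UNIQUE,
    embedding       BLOB,
    embedding_model TEXT,
    created_at      TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS sync_history (
    source_site     TEXT,
    sync_mode       TEXT,
    last_sync_time  TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    items_processed INTEGER
);
"""
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
```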

## 🚀 Usage

### Simple startup
```bash
# 1. Fill the DB with history
python main.py backfill --limit 50

# 2. Then monitor continuously
python main.py watch --interval 300

# 3. Check stats
python main.py stats
```

### With options
```bash
# Custom backfill
python main.py backfill --limit 200 --db custom.db

# Watch with a 10 min interval
python main.py watch --interval 600

# Stats on a specific DB
python main.py stats --db custom.db
```
## 📊 Complete Flow Example

```
Server startup

├─→ BACKFILL Mode (optional)
│     ├─→ ArXiv.scrape_all(100)   → 45 articles → DB
│     ├─→ GitHub.scrape_all(100)  → 78 articles → DB
│     ├─→ Medium.scrape_all(100)  → 23 articles → DB
│     ├─→ LeMonde.scrape_all(100) → 67 articles → DB
│     └─→ HF.scrape_all(100)      → 89 articles → DB
│           ↓ All articles receive an embedding
│     → 302 articles in DB with embeddings

└─→ WATCH Mode (infinite loop)
      ├─→ Iteration 1
      │     ├─→ ArXiv.scrape_latest(20)   → 2 new
      │     ├─→ GitHub.scrape_latest(20)  → 1 new
      │     ├─→ Medium.scrape_latest(20)  → 0 new
      │     ├─→ LeMonde.scrape_latest(20) → 1 new
      │     └─→ HF.scrape_latest(20)      → 2 new
      │     → 6 new articles added
      │
      ├─→ [Wait 5 min]
      │
      └─→ Iteration 2
            └─→ ...
```
## 🔑 Key Design Points

### ✓ Modularity
- Each scraper is independent
- Easy to add or remove a source
- Interchangeable embedding providers

### ✓ Robustness
- Error handling per scraper
- No interruption if one source fails
- Automatic deduplication

### ✓ Scalability
- Batch operations for the DB
- Context manager for connections
- Logging for monitoring

### ✓ Maintainability
- Clear, documented code
- Centralized configuration
- Usage examples
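As one illustration of the scalability points, the "context manager for connections" idea might look like this. A sketch only: the real `DatabaseManager` implementation is not shown in this document.

```python
import sqlite3
from contextlib import contextmanager

# Sketch of a connection context manager: commit on success,
# roll back on error, always close the connection.
@contextmanager
def get_conn(db_path):
    conn = sqlite3.connect(db_path)
    try:
        yield conn
        conn.commit()      # commit on success
    except Exception:
        conn.rollback()    # roll back on error
        raise
    finally:
        conn.close()       # always release the connection
```

Usage: `with get_conn("veille_technique.db") as conn: conn.execute(...)`, with no manual commit/close at each call site.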
## 💻 How to View the Database

### Option 1: Export and View Locally
```bash
# Export the database to a local .db file
python main.py export --db veille_technique.db --output veille_export.db

# View with SQLite Browser or a VSCode extension
# The export creates a complete copy of the DB
```
### Option 2: Use sqlite3 from the Command Line
```bash
# Open the database
sqlite3 veille_technique.db

# Some useful queries
sqlite> SELECT COUNT(*) FROM articles;                                   -- Total articles
sqlite> SELECT source_site, COUNT(*) FROM articles GROUP BY source_site; -- By source
sqlite> SELECT * FROM articles LIMIT 5;                                  -- First 5 articles
sqlite> SELECT a.source_site, COUNT(*)
   ...>   FROM embeddings e JOIN articles a ON e.article_id = a.id
   ...>   GROUP BY a.source_site;                                        -- Embeddings per source
```

Note: the `embeddings` table has no `source_site` column, so counting embeddings per source requires a join with `articles`.
### Option 3: Use a GUI
- **SQLite Browser**: `brew install sqlitebrowser` (macOS) or `apt install sqlitebrowser` (Linux)
- **VSCode**: a SQLite viewer extension
- **DBeaver Community**: free multi-database tool
### Example: View Articles from One Source
```bash
sqlite3 veille_technique.db << EOF
.headers on
.mode column
SELECT title, author_info, published_date FROM articles
WHERE source_site = 'github'
ORDER BY published_date DESC
LIMIT 10;
EOF
```
### Complete DB Structure
```bash
# List all tables
sqlite3 veille_technique.db ".tables"

# View the schema of a table
sqlite3 veille_technique.db ".schema articles"

# View sync stats (sync_history has no created_at column; sort by last_sync_time)
sqlite3 veille_technique.db "SELECT * FROM sync_history ORDER BY last_sync_time DESC LIMIT 5;"
```

## 📝 Migration from the Old Server

The old code in the `scrap/` folder remains untouched, kept for reference.
The new server reuses its scraping logic, but with a completely restructured architecture.
