
Commit 2acf981

Merge pull request #6 from PoCInnovation/server_base
Server base
2 parents c4c5d94 + c0de3e2 commit 2acf981

16 files changed

Lines changed: 2060 additions & 0 deletions

server/.env.example

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
# OpenAI API Configuration
OPENAI_API_KEY=your_openai_api_key_here

# PostgreSQL connection (used by DatabaseManager)
DATABASE_URL=postgresql://postgres:postgres@localhost:5432/veille_technique

# Optional: Override embedding model (OpenAI)
EMBEDDING_MODEL=text-embedding-3-small

# Optional: GitHub Token for higher rate limits
GITHUB_TOKEN=your_github_token_here
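
For local development these values are typically loaded into the process environment before the server starts; a minimal sketch, assuming the `python-dotenv` package (a plain shell `export` works too):

```python
# Minimal sketch: load .env values into the environment
# (assumes the python-dotenv package is installed).
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the working directory

db_url = os.getenv("DATABASE_URL", "postgresql://postgres:postgres@localhost:5432/veille_technique")
openai_key = os.getenv("OPENAI_API_KEY")
```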

server/ARCHITECTURE.md

Lines changed: 210 additions & 0 deletions
@@ -0,0 +1,210 @@
# Watch Server - Complete Redesign

## 📋 Summary of Changes

The server has been redesigned with a modular architecture and two operating modes.

## 🏗️ Architecture

```
server/
├── main.py                    # Main server with WatchServer (orchestrator)
├── database.py                # PostgreSQL + pgvector database manager
├── embeddings.py              # Embeddings manager (vectors)
├── config.py                  # Centralized configuration
├── examples.py                # Usage examples
├── requirements.txt           # Python dependencies
├── README.md                  # Complete documentation
└── scrapers/
    ├── base.py                # Abstract BaseScraper interface
    ├── arxiv_scraper.py       # Scraper for arXiv
    ├── github_scraper.py      # Scraper for GitHub
    ├── medium_scraper.py      # Scraper for Medium
    ├── lemonde_scraper.py     # Scraper for Le Monde
    └── huggingface_scraper.py # Scraper for Hugging Face
```

## 🎯 Operating Modes

### 1️⃣ Backfill Mode (History)
**When:** At startup (optional)

**What:** Scrapes all available history from each source.

**How:**
```bash
python main.py backfill --limit 100
```

**Flow:**
1. Each scraper calls `scrape_all()`.
2. Articles are saved (deduplicated by ID and hash).
3. Embeddings are generated and stored.
4. Sync history is recorded.

### 2️⃣ Watch Mode (Monitoring)
**When:** Continuous monitoring after (or without) backfill.

**How:**
```bash
python main.py watch --interval 300
```

**Flow:**
1. Infinite loop (default 5-minute interval).
2. Each scraper calls `scrape_latest()`.
3. New articles are saved; embeddings are generated.
4. Sync history is recorded.
## 🔧 Main Components

### BaseScraper (abstract interface)
- `scrape_latest(limit)` → watch mode
- `scrape_all(limit)` → backfill mode
- `normalize_item()` → unified format
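A minimal sketch of this contract (method names come from the list above; the exact signatures, the `source_site` attribute, and the normalized keys are assumptions based on the `articles` schema below):

```python
# Hypothetical sketch of the BaseScraper contract described above.
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class BaseScraper(ABC):
    source_site: str = "base"  # e.g. "arxiv", "github", ...

    @abstractmethod
    def scrape_latest(self, limit: int = 20) -> List[Dict[str, Any]]:
        """Fetch only recent items (watch mode)."""

    @abstractmethod
    def scrape_all(self, limit: int = 100) -> List[Dict[str, Any]]:
        """Fetch all available history (backfill mode)."""

    def normalize_item(self, raw: Dict[str, Any]) -> Dict[str, Any]:
        """Map a raw item to the unified article format (keys assumed)."""
        return {
            "id": f"{self.source_site}:{raw.get('id')}",
            "source_site": self.source_site,
            "title": raw.get("title", ""),
            "description": raw.get("description", ""),
            "content_url": raw.get("url", ""),
        }
```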
### DatabaseManager
- PostgreSQL persistence with pgvector
- Tables: `articles`, `embeddings` (vector), `sync_history`
- Automatic deduplication via `ON CONFLICT`
- Vector-ready queries and batch operations
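The deduplication behaves like an upsert keyed on `articles.id`; a sketch of the idea, assuming `psycopg2` (the actual SQL lives in `database.py` and may differ):

```python
# Sketch of ON CONFLICT deduplication against the articles table
# (column names taken from the schema below; psycopg2 assumed).
import psycopg2


def upsert_article(conn, item: dict) -> None:
    with conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO articles (id, source_site, title, description, content_url)
            VALUES (%(id)s, %(source_site)s, %(title)s, %(description)s, %(content_url)s)
            ON CONFLICT (id) DO UPDATE
                SET title = EXCLUDED.title,
                    description = EXCLUDED.description,
                    updated_at = NOW()
            """,
            item,
        )
    conn.commit()
```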
### EmbeddingManager
- Providers: Dummy, SentenceTransformers, OpenAI
- Generates numpy vectors sized to the chosen model
- Stores vectors directly in pgvector columns (no pickle)
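The interchangeable-provider idea can be pictured as one shared `embed` method; an illustrative sketch (class and method names here are assumptions, not the actual API):

```python
# Illustrative provider interface; names are assumptions, not the real API.
from typing import List, Protocol

import numpy as np


class EmbeddingProvider(Protocol):
    dimension: int

    def embed(self, texts: List[str]) -> np.ndarray: ...


class DummyProvider:
    """Stand-in useful for tests: no network, no model download."""

    dimension = 1536  # matches the vector(1536) column below

    def embed(self, texts: List[str]) -> np.ndarray:
        rng = np.random.default_rng(0)
        return rng.standard_normal((len(texts), self.dimension)).astype(np.float32)
```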
### WatchServer (orchestrator)
- Initializes all scrapers
- Manages both modes
- Logging, statistics, and monitoring
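Putting the pieces together, watch mode boils down to a loop like this sketch (the `save_articles`, `embed_and_store`, and `record_sync` helpers are assumed names, not the actual API; real error handling lives in `main.py`):

```python
# Sketch of the watch-mode loop: per-scraper error isolation,
# save + embed, then sleep for the configured interval.
import logging
import time


def watch_loop(scrapers, db, embedder, interval_seconds=300):
    while True:
        for scraper in scrapers:
            try:
                items = scraper.scrape_latest(limit=20)
                new_ids = db.save_articles(items)      # deduplicated upsert
                embedder.embed_and_store(db, new_ids)  # vectors into pgvector
                db.record_sync(scraper.source_site, "watch", len(items))
            except Exception:
                # One failing source must not stop the others.
                logging.exception("scraper %s failed", scraper.source_site)
        time.sleep(interval_seconds)
```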
## 💾 Database Structure

### Table `articles`
```
id (TEXT PRIMARY KEY)         # Unique identifier per source
source_site (TEXT)            # arxiv, github, medium, le_monde, huggingface
title (TEXT)                  # Article title
description (TEXT)            # Summary/content
author_info (TEXT)            # Author(s)
keywords (TEXT)               # Tags/categories
content_url (TEXT)            # Link to source
published_date (TIMESTAMPTZ)  # Publication date
item_type (TEXT)              # article, paper, repository, etc.
created_at (TIMESTAMPTZ)      # When retrieved
updated_at (TIMESTAMPTZ)      # Last update
```

### Table `embeddings`
```
id (SERIAL PRIMARY KEY)       # Unique embedding row
article_id (TEXT UNIQUE)      # Link to articles.id
embedding vector(1536)        # pgvector column (dimension tied to embedding model)
embedding_model (TEXT)        # Which model generated the embedding
created_at (TIMESTAMPTZ)      # When created
```

### Table `sync_history`
```
id (SERIAL PRIMARY KEY)       # Unique sync row
source_site (TEXT)            # Which source
sync_mode (TEXT)              # "watch" or "backfill"
last_sync_time (TIMESTAMPTZ)  # When
items_processed (INTEGER)     # How many articles
created_at (TIMESTAMPTZ)      # When recorded
```
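In pgvector terms, the `embeddings` table corresponds to DDL along these lines; a sketch only (the real statements live in `database.py`, and the foreign key to `articles` is an assumption based on the `article_id` comment above):

```python
# Hedged DDL sketch for the embeddings table; vector(1536) matches
# text-embedding-3-small. Assumes psycopg2 and a pgvector-enabled server.
import psycopg2

DDL = """
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS embeddings (
    id              SERIAL PRIMARY KEY,
    article_id      TEXT UNIQUE REFERENCES articles(id),
    embedding       vector(1536),
    embedding_model TEXT,
    created_at      TIMESTAMPTZ DEFAULT NOW()
);
"""


def ensure_schema(db_url: str) -> None:
    with psycopg2.connect(db_url) as conn, conn.cursor() as cur:
        cur.execute(DDL)
```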
## 🚀 Usage

### Simple startup
```bash
# 1. Fill DB with history
python main.py backfill --limit 50

# 2. Monitor continuously
python main.py watch --interval 300

# 3. Check stats
python main.py stats
```

### With options
```bash
# Custom backfill
python main.py backfill --limit 200 --db-url postgresql://user:pass@localhost:5432/veille_technique

# Watch with 10-minute interval
python main.py watch --interval 600

# Stats on specific DB
python main.py stats --db-url postgresql://user:pass@localhost:5432/veille_technique
```
## 📊 Complete Flow Example

```
Server startup
│
├─→ BACKFILL Mode (optional)
│   ├─→ ArXiv.scrape_all(100)   → … articles → DB
│   ├─→ GitHub.scrape_all(100)  → … articles → DB
│   ├─→ Medium.scrape_all(100)  → … articles → DB
│   ├─→ LeMonde.scrape_all(100) → … articles → DB
│   └─→ HF.scrape_all(100)      → … articles → DB
│       ↓ All articles receive an embedding
│
└─→ WATCH Mode (infinite loop)
    ├─→ Iteration 1 …
    ├─→ [Wait interval]
    └─→ Iteration 2 …
```
## 🔑 Key Design Points

### ✓ Modularity
- Independent scrapers; easy to add/remove
- Interchangeable embedding providers

### ✓ Robustness
- Error isolation per scraper
- Deduplication prevents duplicates

### ✓ Scalability
- Batch DB operations
- Vector-ready schema
- Structured logging

### ✓ Maintainability
- Clear code and docs
- Centralized configuration
- Usage examples included
## 💻 How to View the Database

Use PostgreSQL tooling (`psql`, `pgcli`, DBeaver, pgAdmin) with `DATABASE_URL`.

```bash
# List tables
psql "$DATABASE_URL" -c "\dt"

# Check pgvector extension
psql "$DATABASE_URL" -c "\dx vector"

# Quick counts
psql "$DATABASE_URL" -c "SELECT COUNT(*) FROM articles;"
psql "$DATABASE_URL" -c "SELECT COUNT(*) FROM embeddings;"

# Example vector query (top 5 nearest)
psql "$DATABASE_URL" -c "SELECT article_id, embedding <-> '[0.1,0.2,...]' AS distance FROM embeddings ORDER BY embedding <-> '[0.1,0.2,...]' LIMIT 5;"

# Last syncs
psql "$DATABASE_URL" -c "SELECT * FROM sync_history ORDER BY created_at DESC LIMIT 5;"

# Export (custom format)
pg_dump --dbname="$DATABASE_URL" --format=c --file=veille_technique.dump
```
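To search by meaning instead of pasting a literal vector, embed the query text first; a sketch assuming the `openai` and `psycopg2` packages and the schema above:

```python
# Sketch: embed a query string, then rank articles by vector distance.
import psycopg2
from openai import OpenAI


def semantic_search(db_url: str, query: str, top_k: int = 5):
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    vec = client.embeddings.create(
        model="text-embedding-3-small", input=query
    ).data[0].embedding
    literal = "[" + ",".join(str(x) for x in vec) + "]"  # pgvector text format
    with psycopg2.connect(db_url) as conn, conn.cursor() as cur:
        cur.execute(
            """
            SELECT a.title, e.embedding <-> %s::vector AS distance
            FROM embeddings e
            JOIN articles a ON a.id = e.article_id
            ORDER BY distance
            LIMIT %s
            """,
            (literal, top_k),
        )
        return cur.fetchall()
```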
## 📝 Migration from Old Server

Legacy code in `scrap/` remains for reference; the new server reuses its scraping logic under the updated architecture.

server/config.py

Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
"""Configuration for the watch server."""

import os
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ScraperConfig:
    """Configuration for a specific scraper."""
    enabled: bool = True
    limit_latest: int = 20
    limit_all: int = 100


@dataclass
class ServerConfig:
    """Global server configuration."""

    db_url: str = os.getenv("DATABASE_URL", "postgresql://postgres:postgres@localhost:5432/veille_technique")
    watch_interval_seconds: int = 300
    log_level: str = "INFO"
    scrapers: Optional[Dict[str, ScraperConfig]] = None

    def __post_init__(self):
        """Initialize default scraper configuration."""
        if self.scrapers is None:
            self.scrapers = {
                "arxiv": ScraperConfig(enabled=True, limit_latest=20, limit_all=100),
                "github": ScraperConfig(enabled=True, limit_latest=20, limit_all=100),
                "medium": ScraperConfig(enabled=True, limit_latest=20, limit_all=100),
                "lemonde": ScraperConfig(enabled=True, limit_latest=20, limit_all=100),
                "huggingface": ScraperConfig(enabled=True, limit_latest=20, limit_all=100),
            }

    @classmethod
    def from_file(cls, filepath: str) -> "ServerConfig":
        """Load configuration from a JSON file."""
        import json

        try:
            with open(filepath, 'r') as f:
                data = json.load(f)

            if "scrapers" in data:
                data["scrapers"] = {
                    name: ScraperConfig(**cfg)
                    for name, cfg in data["scrapers"].items()
                }

            return cls(**data)
        except FileNotFoundError:
            print(f"Config file {filepath} not found, using default config")
            return cls()


DEFAULT_CONFIG = ServerConfig()

DEV_CONFIG = ServerConfig(
    db_url=os.getenv("DATABASE_URL", "postgresql://postgres:postgres@localhost:5432/veille_technique_dev"),
    watch_interval_seconds=60,
)

PROD_CONFIG = ServerConfig(
    db_url=os.getenv("DATABASE_URL", "postgresql://postgres:postgres@localhost:5432/veille_technique"),
    watch_interval_seconds=600,
    log_level="WARNING",
)
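
A matching JSON file for `from_file` might look like the one written in this sketch (only fields defined on `ServerConfig`/`ScraperConfig` are valid keys; the `config` module name is assumed):

```python
# Sketch: write a config file and load it through ServerConfig.from_file.
import json

from config import ServerConfig  # module name assumed

cfg = {
    "watch_interval_seconds": 120,
    "log_level": "DEBUG",
    "scrapers": {
        "arxiv": {"enabled": True, "limit_latest": 10, "limit_all": 50},
        "lemonde": {"enabled": False},
    },
}
with open("config.json", "w") as f:
    json.dump(cfg, f, indent=2)

config = ServerConfig.from_file("config.json")
assert config.scrapers["arxiv"].limit_latest == 10
assert config.watch_interval_seconds == 120  # db_url falls back to the default
```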
