11 changes: 11 additions & 0 deletions server/.env.example
@@ -0,0 +1,11 @@
# OpenAI API Configuration
OPENAI_API_KEY=your_openai_api_key_here

# PostgreSQL connection (used by DatabaseManager)
DATABASE_URL=postgresql://postgres:postgres@localhost:5432/veille_technique

# Optional: Override embedding model (OpenAI)
EMBEDDING_MODEL=text-embedding-3-small

# Optional: GitHub Token for higher rate limits
GITHUB_TOKEN=your_github_token_here
210 changes: 210 additions & 0 deletions server/ARCHITECTURE.md
@@ -0,0 +1,210 @@
# Watch Server - Complete Redesign

## 📋 Summary of Changes

The server has been redesigned with a modular architecture and two operating modes.

## 🏗️ Architecture

```
server/
├── main.py # Main server with WatchServer (orchestrator)
├── database.py # PostgreSQL + pgvector database manager
├── embeddings.py # Embeddings manager (vectors)
├── config.py # Centralized configuration
├── examples.py # Usage examples
├── requirements.txt # Python dependencies
├── README.md # Complete documentation
└── scrapers/
├── base.py # Abstract BaseScraper interface
├── arxiv_scraper.py # Scraper for arXiv
├── github_scraper.py # Scraper for GitHub
├── medium_scraper.py # Scraper for Medium
├── lemonde_scraper.py # Scraper for Le Monde
└── huggingface_scraper.py # Scraper for Hugging Face
```

## 🎯 Operating Modes

### 1️⃣ Backfill Mode (History)
**When:** At startup (optional)

**What:** Scrapes all available history from each source.

**How:**
```bash
python main.py backfill --limit 100
```

**Flow:**
1. Each scraper calls `scrape_all()`.
2. Articles are saved (deduplicated by ID and hash).
3. Embeddings are generated and stored.
4. Sync history is recorded.
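
In code, this flow amounts to a loop over the registered scrapers. A minimal sketch, assuming hypothetical helper names on the orchestrator and managers (`save_articles`, `embed_and_store`, `record_sync` are illustrative, not the exact API):

```python
def run_backfill(server, limit: int = 100) -> None:
    """Hypothetical backfill pass over every registered scraper."""
    for name, scraper in server.scrapers.items():
        items = scraper.scrape_all(limit=limit)          # 1. full history
        saved = server.db.save_articles(items)           # 2. dedup + insert
        server.embeddings.embed_and_store(saved)         # 3. vectors
        server.db.record_sync(name, mode="backfill",     # 4. sync history
                              items_processed=len(saved))
```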

### 2️⃣ Watch Mode (Monitoring)
**When:** Continuously, after (or without) an initial backfill.

**What:** Polls each source for new items at a fixed interval.

**How:**
```bash
python main.py watch --interval 300
```

**Flow:**
1. Infinite loop (default 5-minute interval).
2. Each scraper calls `scrape_latest()`.
3. New articles are saved; embeddings generated.
4. Sync history is recorded.

## 🔧 Main Components

### BaseScraper (abstract interface)
- `scrape_latest(limit)` → watch mode
- `scrape_all(limit)` → backfill mode
- `normalize_item()` → unified format
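
A minimal sketch of this contract, assuming it lives in `scrapers/base.py` as shown in the tree above (the exact signatures are assumptions):

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List

class BaseScraper(ABC):
    """Contract implemented by every scraper under scrapers/."""

    @abstractmethod
    def scrape_latest(self, limit: int = 20) -> List[Dict[str, Any]]:
        """Fetch only recent items (watch mode)."""

    @abstractmethod
    def scrape_all(self, limit: int = 100) -> List[Dict[str, Any]]:
        """Fetch as much history as the source exposes (backfill mode)."""

    @abstractmethod
    def normalize_item(self, raw: Dict[str, Any]) -> Dict[str, Any]:
        """Map a source-specific payload onto the unified article format."""
```

Concrete scrapers only have to implement these three methods, which is what keeps sources easy to add or remove.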

### DatabaseManager
- PostgreSQL persistence with pgvector
- Tables: articles, embeddings (vector), sync_history
- Automatic deduplication via `ON CONFLICT`
- Vector-ready queries and batch operations
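
The deduplication is a standard PostgreSQL upsert. A sketch with `psycopg2` (column list abridged, helper name illustrative):

```python
import psycopg2

UPSERT_SQL = """
    INSERT INTO articles (id, source_site, title, description, content_url, published_date)
    VALUES (%(id)s, %(source_site)s, %(title)s, %(description)s, %(content_url)s, %(published_date)s)
    ON CONFLICT (id) DO UPDATE
        SET title = EXCLUDED.title,
            description = EXCLUDED.description,
            updated_at = NOW()
"""

def save_articles(conn, articles: list[dict]) -> None:
    # executemany batches the statements; ON CONFLICT makes re-runs idempotent
    with conn.cursor() as cur:
        cur.executemany(UPSERT_SQL, articles)
    conn.commit()
```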

### EmbeddingManager
- Providers: Dummy, SentenceTransformers, OpenAI
- Generates numpy vectors sized to the chosen model
- Stores vectors directly in pgvector columns (no pickle)
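
A provider only needs an `embed` method and a known dimension. A sketch of the dummy provider (class name assumed; useful for tests without API keys):

```python
import hashlib
import numpy as np

class DummyEmbeddingProvider:
    """Deterministic fake vectors: the same text always yields the same vector."""
    dimension = 1536

    def embed(self, text: str) -> np.ndarray:
        # seed from a stable digest so results are reproducible across runs
        seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
        return np.random.default_rng(seed).random(self.dimension, dtype=np.float32)
```

Swapping in the SentenceTransformers or OpenAI provider changes only the object behind `embed`, not the callers.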

### WatchServer (orchestrator)
- Initializes all scrapers
- Manages both modes
- Logging, statistics, and monitoring
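
The watch loop itself is simple. A sketch, assuming a `process_items` helper on the orchestrator (illustrative name):

```python
import logging
import time

def run_watch(server, interval: int = 300) -> None:
    """Hypothetical polling loop; a failing scraper never stops the others."""
    while True:
        for name, scraper in server.scrapers.items():
            try:
                items = scraper.scrape_latest()
                server.process_items(name, items, mode="watch")
            except Exception:
                logging.exception("scraper %s failed; continuing", name)
        time.sleep(interval)
```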

## 💾 Database Structure

### Table `articles`
```
id (TEXT PRIMARY KEY) # Unique identifier per source
source_site (TEXT) # arxiv, github, medium, le_monde, huggingface
title (TEXT) # Article title
description (TEXT) # Summary/content
author_info (TEXT) # Author(s)
keywords (TEXT) # Tags/categories
content_url (TEXT) # Link to source
published_date (TIMESTAMPTZ) # Publication date
item_type (TEXT) # article, paper, repository, etc.
created_at (TIMESTAMPTZ) # When retrieved
updated_at (TIMESTAMPTZ) # Last update
```

### Table `embeddings`
```
id (SERIAL PRIMARY KEY) # Unique embedding row
article_id (TEXT UNIQUE) # Link to articles.id
embedding vector(1536) # pgvector column (dimension tied to embedding model)
embedding_model (TEXT) # Which model generated the embedding
created_at (TIMESTAMPTZ) # When created
```

### Table `sync_history`
```
id (SERIAL PRIMARY KEY) # Unique sync row
source_site (TEXT) # Which source
sync_mode (TEXT) # "watch" or "backfill"
last_sync_time (TIMESTAMPTZ) # When
items_processed (INTEGER) # How many articles
created_at (TIMESTAMPTZ) # When recorded
```
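
Bootstrapping this schema looks roughly like the sketch below (`articles` columns abridged; `sync_history` follows the same pattern):

```python
import psycopg2

SCHEMA_SQL = """
    CREATE EXTENSION IF NOT EXISTS vector;

    CREATE TABLE IF NOT EXISTS articles (
        id             TEXT PRIMARY KEY,
        source_site    TEXT NOT NULL,
        title          TEXT,
        description    TEXT,
        content_url    TEXT,
        published_date TIMESTAMPTZ,
        created_at     TIMESTAMPTZ DEFAULT NOW(),
        updated_at     TIMESTAMPTZ DEFAULT NOW()
    );

    CREATE TABLE IF NOT EXISTS embeddings (
        id              SERIAL PRIMARY KEY,
        article_id      TEXT UNIQUE REFERENCES articles(id),
        embedding       vector(1536),
        embedding_model TEXT,
        created_at      TIMESTAMPTZ DEFAULT NOW()
    );
"""

def init_schema(db_url: str) -> None:
    # psycopg2's connection context manager commits on clean exit
    with psycopg2.connect(db_url) as conn, conn.cursor() as cur:
        cur.execute(SCHEMA_SQL)
```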

## 🚀 Usage

### Simple startup
```bash
# 1. Fill DB with history
python main.py backfill --limit 50

# 2. Monitor continuously
python main.py watch --interval 300

# 3. Check stats
python main.py stats
```

### With options
```bash
# Custom backfill
python main.py backfill --limit 200 --db-url postgresql://user:pass@localhost:5432/veille_technique

# Watch with 10-minute interval
python main.py watch --interval 600

# Stats on specific DB
python main.py stats --db-url postgresql://user:pass@localhost:5432/veille_technique
```

## 📊 Complete Flow Example

```
Server startup
├─→ BACKFILL Mode (optional)
│ ├─→ ArXiv.scrape_all(100) → … articles → DB
│ ├─→ GitHub.scrape_all(100) → … articles → DB
│ ├─→ Medium.scrape_all(100) → … articles → DB
│ ├─→ LeMonde.scrape_all(100) → … articles → DB
│ └─→ HF.scrape_all(100) → … articles → DB
│ ↓ All articles receive an embedding
└─→ WATCH Mode (infinite loop)
├─→ Iteration 1 …
├─→ [Wait interval]
└─→ Iteration 2 …
```

## 🔑 Key Design Points

### ✓ Modularity
- Independent scrapers; easy to add/remove
- Interchangeable embedding providers

### ✓ Robustness
- Error isolation per scraper
- Deduplication prevents duplicates

### ✓ Scalability
- Batch DB operations
- Vector-ready schema
- Structured logging

### ✓ Maintainability
- Clear code and docs
- Centralized configuration
- Usage examples included

## 💻 How to View the Database

Use PostgreSQL tooling (`psql`, `pgcli`, DBeaver, pgAdmin) with `DATABASE_URL`.

```bash
# List tables
psql "$DATABASE_URL" -c "\dt"

# Check pgvector extension
psql "$DATABASE_URL" -c "\dx vector"

# Quick counts
psql "$DATABASE_URL" -c "SELECT COUNT(*) FROM articles;"
psql "$DATABASE_URL" -c "SELECT COUNT(*) FROM embeddings;"

# Example vector query (top 5 nearest)
psql "$DATABASE_URL" -c "SELECT article_id, embedding <-> '[0.1,0.2,...]' AS distance FROM embeddings ORDER BY embedding <-> '[0.1,0.2,...]' LIMIT 5;"

# Last syncs
psql "$DATABASE_URL" -c "SELECT * FROM sync_history ORDER BY created_at DESC LIMIT 5;"

# Export (custom format)
pg_dump --dbname="$DATABASE_URL" --format=c --file=veille_technique.dump
```
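
The same nearest-neighbor lookup from Python, assuming the OpenAI embedding setup from `.env.example` (query text and function name are illustrative):

```python
import os
import psycopg2
from openai import OpenAI

def nearest_articles(query: str, k: int = 5):
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.embeddings.create(
        model=os.getenv("EMBEDDING_MODEL", "text-embedding-3-small"),
        input=query,
    )
    # pgvector accepts the '[x1,x2,...]' text form, cast to vector
    vec = "[" + ",".join(str(x) for x in resp.data[0].embedding) + "]"
    with psycopg2.connect(os.environ["DATABASE_URL"]) as conn, conn.cursor() as cur:
        cur.execute(
            "SELECT article_id, embedding <-> %s::vector AS distance "
            "FROM embeddings ORDER BY distance LIMIT %s",
            (vec, k),
        )
        return cur.fetchall()
```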

## 📝 Migration from Old Server

Legacy code in `scrap/` remains for reference; the new server reuses scraping logic with the updated architecture.
72 changes: 72 additions & 0 deletions server/config.py
@@ -0,0 +1,72 @@
"""Configuration for the watch server."""

import os
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ScraperConfig:
"""Configuration for a specific scraper."""
enabled: bool = True
    limit_latest: int = 20
    limit_all: int = 100


@dataclass
class ServerConfig:
"""Global server configuration."""

db_url: str = os.getenv("DATABASE_URL", "postgresql://postgres:postgres@localhost:5432/veille_technique")

watch_interval_seconds: int = 300

log_level: str = "INFO"

    scrapers: Optional[Dict[str, ScraperConfig]] = None

def __post_init__(self):
"""Initialize default scraper configuration."""
if self.scrapers is None:
self.scrapers = {
"arxiv": ScraperConfig(enabled=True, limit_latest=20, limit_all=100),
"github": ScraperConfig(enabled=True, limit_latest=20, limit_all=100),
"medium": ScraperConfig(enabled=True, limit_latest=20, limit_all=100),
"lemonde": ScraperConfig(enabled=True, limit_latest=20, limit_all=100),
"huggingface": ScraperConfig(enabled=True, limit_latest=20, limit_all=100),
}

@classmethod
def from_file(cls, filepath: str) -> "ServerConfig":
"""Load configuration from JSON/YAML file."""
import json

try:
with open(filepath, 'r') as f:
data = json.load(f)

if "scrapers" in data:
data["scrapers"] = {
name: ScraperConfig(**cfg)
for name, cfg in data["scrapers"].items()
}

return cls(**data)
except FileNotFoundError:
print(f"Config file {filepath} not found, using default config")
return cls()


DEFAULT_CONFIG = ServerConfig()

DEV_CONFIG = ServerConfig(
db_url=os.getenv("DATABASE_URL", "postgresql://postgres:postgres@localhost:5432/veille_technique_dev"),
watch_interval_seconds=60,
)

PROD_CONFIG = ServerConfig(
db_url=os.getenv("DATABASE_URL", "postgresql://postgres:postgres@localhost:5432/veille_technique"),
watch_interval_seconds=600,
log_level="WARNING",
)
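
# Example usage (hypothetical config.json; omitted fields keep their defaults,
# and a "scrapers" mapping replaces the default dict wholesale):
#
#   {
#     "watch_interval_seconds": 120,
#     "log_level": "DEBUG",
#     "scrapers": {
#       "arxiv": {"enabled": true, "limit_latest": 10, "limit_all": 50}
#     }
#   }
#
#   config = ServerConfig.from_file("config.json")
#   config.watch_interval_seconds  # -> 120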