
Commit 9501b86

Merge pull request #9 from PoCInnovation/feat/server_route
Feat/server route
2 parents b76d6c3 + 607e47c commit 9501b86

15 files changed

Lines changed: 670 additions & 417 deletions

.gitignore

Lines changed: 5 additions & 1 deletion
@@ -1 +1,5 @@
-.env
+.env
+target/
+*.lock
+.venv/
+__pycache__/

scrapper/START.md

Lines changed: 231 additions & 0 deletions
@@ -0,0 +1,231 @@
# Scrapper Startup Guide

This guide explains how to launch the technical watch scrapper system.

## Prerequisites

- **Python 3.9+**
- **PostgreSQL** with the **pgvector** extension
- **OpenAI API Key** (for embeddings and entity extraction)
- **(Optional)** GitHub token for higher rate limits

## Installation

### 1. Create a Python Virtual Environment

```bash
cd scrapper
python3 -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
```

### 2. Install Dependencies

```bash
pip install -r requirements.txt
```

### 3. Set Up PostgreSQL with pgvector

Install PostgreSQL and the pgvector extension:

```bash
# On Ubuntu/Debian
sudo apt install postgresql postgresql-contrib
sudo -u postgres psql -c "CREATE EXTENSION vector;"

# Or using Docker
docker run -d \
  --name postgres-pgvector \
  -e POSTGRES_PASSWORD=postgres \
  -e POSTGRES_DB=veille_technique \
  -p 5432:5432 \
  pgvector/pgvector:pg16
```

### 4. Configure Environment Variables

Copy the example environment file and configure it:

```bash
cp .env.example .env
```

Edit `.env` and set your credentials:

```env
OPENAI_API_KEY=your_openai_api_key_here
DATABASE_URL=postgresql://postgres:postgres@localhost:5432/veille_technique
EMBEDDING_MODEL=text-embedding-3-small
GITHUB_TOKEN=your_github_token_here  # Optional
```

### 5. Initialize the Database

The database schema is created automatically on the first run.
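
No manual schema step is required, but if you want to confirm that the database is reachable and that pgvector is enabled before the first run, a quick check along these lines can help (the `psycopg` driver and the default credentials from `.env.example` are assumptions; this script is not part of the scrapper):

```python
# check_db.py - optional sanity check before the first run (assumes psycopg v3).
import os

import psycopg  # pip install "psycopg[binary]"

db_url = os.environ.get(
    "DATABASE_URL",
    "postgresql://postgres:postgres@localhost:5432/veille_technique",
)

with psycopg.connect(db_url) as conn:
    with conn.cursor() as cur:
        # Make sure the pgvector extension is available in this database.
        cur.execute("SELECT extversion FROM pg_extension WHERE extname = 'vector'")
        row = cur.fetchone()
        print("pgvector version:", row[0] if row else "NOT INSTALLED")
```
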
## Running the Scrapper

The scrapper has **3 modes**:

### 1. Backfill Mode (Historical Data)

Scrape the entire available history from all sources:

```bash
python main.py backfill
```

Options:
- `--limit N` - Maximum articles per source (default: 100)
- `--db-url URL` - Override the database URL
- `--embedding-model MODEL` - Override the embedding model
- `--llm-model MODEL` - Override the LLM model used for entity extraction

Example with a custom limit:
```bash
python main.py backfill --limit 200
```

### 2. Watch Mode (Continuous Monitoring)

Scrape new articles continuously at regular intervals:

```bash
python main.py watch
```

Options:
- `--interval SECONDS` - Scraping interval (default: 300 s = 5 minutes)
- `--db-url URL` - Override the database URL
- `--embedding-model MODEL` - Override the embedding model
- `--llm-model MODEL` - Override the LLM model used for entity extraction

Example with a 10-minute interval:
```bash
python main.py watch --interval 600
```

Press `Ctrl+C` to stop watch mode.
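
Conceptually, watch mode is a scrape-then-sleep loop around the same pipeline as backfill. A minimal sketch of the idea, with hypothetical names that do not necessarily match `main.py`:

```python
import time

def watch(scrapers, interval: int = 300) -> None:
    """Illustrative watch loop: scrape every source, then sleep `interval` seconds."""
    try:
        while True:
            for scraper in scrapers:
                scraper.scrape_new()  # hypothetical method name, for illustration only
            time.sleep(interval)
    except KeyboardInterrupt:
        print("Watch mode stopped.")  # Ctrl+C exits the loop cleanly
```
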
### 3. Stats Mode (View Statistics)

Display database statistics:

```bash
python main.py stats
```
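
The three modes and their flags map naturally onto a small subcommand-style CLI. A sketch of how the documented options could be wired up with `argparse`; the actual `main.py` may be organized differently:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Technical watch scrapper")
    sub = parser.add_subparsers(dest="mode", required=True)

    # Flags shared by every mode, as documented above.
    common = argparse.ArgumentParser(add_help=False)
    common.add_argument("--db-url", help="Override DATABASE_URL")
    common.add_argument("--embedding-model", help="Override EMBEDDING_MODEL")
    common.add_argument("--llm-model", help="Override LLM_MODEL")

    backfill = sub.add_parser("backfill", parents=[common])
    backfill.add_argument("--limit", type=int, default=100,
                          help="Maximum articles per source")

    watch = sub.add_parser("watch", parents=[common])
    watch.add_argument("--interval", type=int, default=300,
                       help="Scraping interval in seconds")

    sub.add_parser("stats", parents=[common])
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```
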
## Available Scrapers

The system includes scrapers for:

- **ArXiv** - Scientific papers (cs.LG category by default)
- **GitHub** - Trending repositories
- **Medium** - Technical articles
- **Le Monde** - News articles
- **Hugging Face** - ML models and papers
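
These sources plausibly share a common scraper interface, which is what allows a failing scraper to be skipped without stopping the others (see Troubleshooting below). The following is a hypothetical sketch of such an interface, not the actual classes in this repository:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Article:
    # Minimal fields matching the columns queried later in this guide.
    id: str
    title: str
    source: str
    content: str
    published_at: datetime

class BaseScraper(ABC):
    name: str = "base"

    @abstractmethod
    def fetch(self, limit: int = 100) -> list[Article]:
        """Return up to `limit` articles from this source."""
        ...
```
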
## Features

Each scraped article is automatically:
1. **Deduplicated** - By ID and content hash
2. **Embedded** - Using OpenAI embeddings (for similarity search)
3. **Analyzed** - Entities extracted via LLM (technologies, companies, people, etc.)
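
The exact deduplication scheme isn't documented here, but a content hash is typically just a digest of the normalized article text, so that near-identical reposts collide. An illustrative sketch:

```python
import hashlib

def content_hash(text: str) -> str:
    """Hash the normalized article body so reposts with cosmetic changes collide."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# An article would be inserted only if neither its source ID nor its hash is already stored.
seen_hashes = {content_hash("Example article body.")}
is_duplicate = content_hash("Example  ARTICLE body.") in seen_hashes  # True
```
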
## Configuration

### Database Connection

Set via environment variable or command line:
- Environment: `DATABASE_URL=postgresql://user:pass@host:port/dbname`
- CLI: `--db-url postgresql://user:pass@host:port/dbname`

### Embedding Model

Configure the OpenAI embedding model:
- Environment: `EMBEDDING_MODEL=text-embedding-3-small`
- CLI: `--embedding-model text-embedding-3-small`

Available models:
- `text-embedding-3-small` (1536 dimensions, faster)
- `text-embedding-3-large` (3072 dimensions, more accurate)
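
The two models produce vectors of different sizes, so the pgvector column dimension must match whichever model you configure. A minimal embedding call with the OpenAI Python SDK (how the scrapper itself invokes it is an assumption):

```python
import os

from openai import OpenAI  # pip install openai

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
model = os.environ.get("EMBEDDING_MODEL", "text-embedding-3-small")

response = client.embeddings.create(model=model, input="pgvector turns Postgres into a vector store")
vector = response.data[0].embedding
print(len(vector))  # 1536 for -small, 3072 for -large: the vector(N) column must match
```
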
### LLM Model

Configure the LLM for entity extraction:
- Environment: `LLM_MODEL=gpt-4o-mini`
- CLI: `--llm-model gpt-4o-mini`
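
Entity extraction is the step that pulls technologies, companies, and people out of each article. A hedged sketch of what such a call could look like with the Chat Completions API; the prompt and JSON shape are illustrative, not the repository's actual ones:

```python
import json
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
model = os.environ.get("LLM_MODEL", "gpt-4o-mini")

def extract_entities(text: str) -> dict:
    """Ask the LLM for entities as a JSON object (illustrative schema)."""
    response = client.chat.completions.create(
        model=model,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system",
             "content": "Extract entities from the article. Reply with JSON keys: "
                        "technologies, companies, people (each a list of strings)."},
            {"role": "user", "content": text},
        ],
    )
    return json.loads(response.choices[0].message.content)
```
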
## Troubleshooting

### Missing OpenAI API Key

```
Error: OpenAI API key not found
```

**Solution**: Set `OPENAI_API_KEY` in your `.env` file.

### PostgreSQL Connection Error

```
Error: could not connect to server
```

**Solution**:
1. Check that PostgreSQL is running: `sudo systemctl status postgresql`
2. Verify `DATABASE_URL` in `.env`
3. Ensure the pgvector extension is installed

### Scraper Initialization Failed

If a scraper fails to initialize, it is skipped automatically. Check the logs for details.

### Port Already in Use (PostgreSQL)

If port 5432 is already in use, either:
1. Stop the conflicting service
2. Use a different port in `DATABASE_URL`

## Development Tips

### Check the Architecture

See [ARCHITECTURE.md](ARCHITECTURE.md) for system design details.

### Database Management

View articles directly in PostgreSQL:
```sql
-- Connect first from your shell with: psql $DATABASE_URL

-- Count articles
SELECT COUNT(*) FROM articles;

-- View recent articles
SELECT title, source, published_at FROM articles
ORDER BY published_at DESC LIMIT 10;

-- Check embeddings
SELECT COUNT(*) FROM embeddings;
```
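
Because embeddings are stored for similarity search, pgvector's distance operator can rank articles against a query vector. A sketch using `psycopg`; the `embeddings.article_id` and `embeddings.embedding` columns are assumptions based on the tables queried above:

```python
import psycopg

QUERY = """
SELECT a.title, e.embedding <-> %s::vector AS distance
FROM articles AS a
JOIN embeddings AS e ON e.article_id = a.id
ORDER BY distance
LIMIT 5;
"""

def most_similar(db_url: str, query_vector: list[float]):
    # pgvector accepts a '[x, y, ...]' text literal cast to ::vector.
    literal = "[" + ", ".join(str(x) for x in query_vector) + "]"
    with psycopg.connect(db_url) as conn:
        return conn.execute(QUERY, (literal,)).fetchall()
```

The query vector must come from the same embedding model (and dimension) used at ingestion time.
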
## Recommended Workflow

1. **Initial setup**: Run `backfill` mode once to populate historical data
2. **Continuous monitoring**: Run `watch` mode to keep data up to date
3. **Check progress**: Use `stats` mode to monitor collection

Example:
```bash
# One-time: populate history
python main.py backfill --limit 50

# Continuous: monitor new content
python main.py watch --interval 600

# Anytime: check statistics
python main.py stats
```

scrapper/examples.py

Lines changed: 0 additions & 147 deletions
This file was deleted.
