Understanding the flow of data is crucial to comprehending how AlbertPlus works.

The primary data pipeline is responsible for collecting, storing, and serving course and program information.
1. **Scraping (Cloudflare Worker)**
    - **Admin Trigger**: Admin users initiate scraping by calling Convex actions (`api.scraper.triggerMajorsScraping` or `api.scraper.triggerCoursesScraping`).
    - **Authenticated Request**: The Convex action makes a POST request to the scraper's HTTP endpoints (`/api/trigger-majors` or `/api/trigger-courses`) with the `CONVEX_API_KEY` in the `X-API-KEY` header.
    - **API Key Validation**: The scraper validates the API key to ensure the request comes from the trusted Convex backend.
    - **Discovery**: The scraper discovers the URLs for all available programs or courses from NYU's public course catalog.
    - **Job Queuing**: For each discovered program or course, a new job is added to a Cloudflare Queue. This allows for resilient and distributed processing.
    - **Data Extraction**: Each job in the queue is processed by the worker, which scrapes the detailed information for a specific course or program.
    - **Upsert to Backend**: The scraped data is sent back to the Convex backend via authenticated HTTP endpoints.
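The trigger and validation steps above can be sketched in plain TypeScript. This is a minimal, illustrative model, not the project's actual code: the scraper host, the helper names, and the header handling are assumptions; in the real system the request is built inside a Convex action and checked inside the Cloudflare Worker.

```typescript
// Convex side (sketch): build the POST request to the scraper endpoint,
// carrying the shared secret in the X-API-KEY header.
function buildTriggerRequest(
  endpoint: "/api/trigger-majors" | "/api/trigger-courses",
  apiKey: string,
): { url: string; method: "POST"; headers: Record<string, string> } {
  return {
    // Hypothetical host; the real scraper URL lives in config.
    url: `https://scraper.example.workers.dev${endpoint}`,
    method: "POST",
    headers: { "X-API-KEY": apiKey },
  };
}

// Scraper side (sketch): accept the request only when the header
// matches the CONVEX_API_KEY the worker was deployed with.
function isAuthorized(
  headers: Record<string, string>,
  convexApiKey: string,
): boolean {
  return headers["X-API-KEY"] === convexApiKey;
}

const req = buildTriggerRequest("/api/trigger-majors", "secret-123");
console.log(isAuthorized(req.headers, "secret-123")); // true
console.log(isAuthorized(req.headers, "other-key")); // false
```

Keeping the check to a single shared-secret comparison mirrors the flow described above: only the Convex backend holds `CONVEX_API_KEY`, so a matching header is treated as proof the trigger came from it.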
2. **Backend Processing (Convex)**
    - **Data Reception**: The Convex backend receives the scraped data from the Cloudflare Worker.
2. **HTTP Endpoints**: These actions make authenticated POST requests to the scraper's HTTP endpoints:
    - `POST /api/trigger-majors` - Creates a major discovery job
    - `POST /api/trigger-courses` - Creates a course discovery job
3. **API Key Authentication**: The endpoints validate the `X-API-KEY` header against the `CONVEX_API_KEY` to ensure requests originate from the trusted Convex backend.
4. **Job Discovery**: The initial job discovers all the available programs and courses and creates individual jobs for each one.
5. **Queueing**: These individual jobs are added to a Cloudflare Queue and tracked in the D1 database.
6. **Job Processing**: The Cloudflare Worker processes jobs from the queue, scraping the data for each course or program.
7. **Data Upsert**: The scraped data is then sent to the Convex backend via authenticated HTTP requests to be stored in the main database.
8. **Error Handling**: The system includes error logging and a retry mechanism for failed jobs.
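Steps 4 through 8 can be modelled as a small fan-out-and-consume loop. This is a hedged sketch with in-memory stand-ins: the `Job` shape, `MAX_ATTEMPTS`, and the helper names are assumptions, and the real worker uses Cloudflare Queues with D1-tracked jobs rather than plain arrays.

```typescript
type Job = { kind: "program" | "course"; url: string; attempts: number };

const MAX_ATTEMPTS = 3; // assumed retry limit, not from the source

// Fan-out (step 4): one job per discovered URL.
function discover(urls: string[], kind: Job["kind"]): Job[] {
  return urls.map((url) => ({ kind, url, attempts: 0 }));
}

// Consumer (steps 6-8): process each job, re-queue on failure,
// and collect jobs that exhaust their retries for error logging.
function processQueue(
  queue: Job[],
  scrape: (job: Job) => boolean, // returns false when scraping fails
): { done: Job[]; failed: Job[] } {
  const done: Job[] = [];
  const failed: Job[] = [];
  while (queue.length > 0) {
    const job = queue.shift()!;
    job.attempts += 1;
    if (scrape(job)) {
      done.push(job); // step 7 would upsert this result to Convex
    } else if (job.attempts < MAX_ATTEMPTS) {
      queue.push(job); // step 8: retry failed jobs
    } else {
      failed.push(job); // step 8: log jobs that permanently failed
    }
  }
  return { done, failed };
}

const jobs = discover(["catalog/a", "catalog/b"], "course");
const result = processQueue(jobs, (job) => job.url === "catalog/a");
console.log(result.done.length, result.failed.length); // 1 1
```

The separation mirrors the pipeline above: discovery only enqueues work, while the consumer owns retries, so a single flaky page never blocks the rest of the catalog.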