Commit 717f04a

feat: add on-demand scraper triggering via http handlers (#81)

1 parent 4061ea0 · commit 717f04a

12 files changed: 261 additions & 116 deletions

apps/docs/src/content/docs/architecture/data-flow.md
Lines changed: 6 additions & 4 deletions

@@ -9,11 +9,13 @@ Understanding the flow of data is crucial to comprehending how AlbertPlus works.
 The primary data pipeline is responsible for collecting, storing, and serving course and program information.

 1. **Scraping (Cloudflare Worker)**
-   - **Trigger**: The process begins with a scheduled job (cron trigger) or http handlers in the Cloudflare Worker.
-   - **Discovery**: The scraper first discovers the URLs for all available programs and courses from NYU's public course catalog.
-   - **Job Queuing**: For each discovered program and course, a new job is added to a Cloudflare D1 queue. This allows for resilient and distributed processing.
+   - **Admin Trigger**: Admin users initiate scraping by calling Convex actions (`api.scraper.triggerMajorsScraping` or `api.scraper.triggerCoursesScraping`).
+   - **Authenticated Request**: The Convex action makes a POST request to the scraper's HTTP endpoints (`/api/trigger-majors` or `/api/trigger-courses`) with the `CONVEX_API_KEY` in the `X-API-KEY` header.
+   - **API Key Validation**: The scraper validates the API key to ensure the request is from the trusted Convex backend.
+   - **Discovery**: The scraper discovers the URLs for all available programs or courses from NYU's public course catalog.
+   - **Job Queuing**: For each discovered program or course, a new job is added to a Cloudflare Queue. This allows for resilient and distributed processing.
    - **Data Extraction**: Each job in the queue is processed by the worker, which scrapes the detailed information for a specific course or program.
-   - **Upsert to Backend**: The scraped data is then sent to the Convex backend via an HTTP endpoint.
+   - **Upsert to Backend**: The scraped data is sent back to the Convex backend via authenticated HTTP endpoints.

 2. **Backend Processing (Convex)**
    - **Data Reception**: The Convex backend receives the scraped data from the Cloudflare Worker.
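
The trigger-and-authenticate steps in the updated doc can be sketched as one request helper. This is a hypothetical standalone version for illustration only: `triggerScraping`, the injectable `fetchImpl`, and the exact response shape are assumptions drawn from the endpoints and header named in the diff, not the actual Convex action code.

```typescript
// Hypothetical sketch of the Convex-side trigger request: POST to the
// scraper with the shared key in the X-API-KEY header.
type TriggerResult = { success: boolean; jobId: number; jobType: string };

async function triggerScraping(
  scraperUrl: string,
  apiKey: string,
  kind: "majors" | "courses",
  fetchImpl: typeof fetch = fetch, // injectable so the helper is testable
): Promise<TriggerResult> {
  const res = await fetchImpl(new URL(`/api/${kind}`, scraperUrl).toString(), {
    method: "POST",
    headers: { "X-API-KEY": apiKey },
  });
  if (!res.ok) {
    // 401 = missing key, 403 = invalid key, per the scraper's middleware
    throw new Error(`Scraper returned ${res.status}`);
  }
  return (await res.json()) as TriggerResult;
}
```

Note the path shape (`/api/majors` / `/api/courses`) follows the worker code in this commit; the doc text names `/api/trigger-majors` / `/api/trigger-courses`.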

apps/docs/src/content/docs/getting-started/environment-variables.md
Lines changed: 5 additions & 4 deletions

@@ -39,10 +39,11 @@ These variables are required for the Cloudflare Worker scraper.

 These variables are configured in your Convex deployment environment.

-| Variable                  | Description                                                         |
-| ------------------------- | ------------------------------------------------------------------- |
-| `CLERK_JWT_ISSUER_DOMAIN` | The JWT issuer domain from your Clerk account for token validation. |
-| `CONVEX_API_KEY`          | An API key for authenticating with the Convex backend.              |
+| Variable                  | Description                                                                         |
+| ------------------------- | ----------------------------------------------------------------------------------- |
+| `CLERK_JWT_ISSUER_DOMAIN` | The JWT issuer domain from your Clerk account for token validation.                 |
+| `CONVEX_API_KEY`          | A shared API key for authenticating requests between Convex and the scraper worker. |
+| `SCRAPER_URL`             | The URL of the deployed scraper worker (e.g., `https://scraper.albertplus.com`).    |

 ## Cloudflare Worker Bindings
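
Since the new `SCRAPER_URL` and the shared `CONVEX_API_KEY` must both be present before the trigger actions can work, a startup guard is a cheap safety net. The helper below is hypothetical (the repository does not necessarily contain one); it only illustrates failing fast on the variables in the table above.

```typescript
// Hypothetical startup guard for the deployment variables listed above.
function requireEnv(
  env: Record<string, string | undefined>,
  keys: readonly string[],
): Record<string, string> {
  const missing = keys.filter((k) => !env[k]);
  if (missing.length > 0) {
    // Report every missing variable at once instead of one per restart.
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
  return Object.fromEntries(keys.map((k) => [k, env[k] as string]));
}
```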

apps/docs/src/content/docs/modules/scraper.md
Lines changed: 12 additions & 6 deletions

@@ -16,12 +16,18 @@ The scraper, located in the `apps/scraper` directory, is a critical component of

 The scraping process is designed to be robust and resilient:

-1. **Scheduled Trigger**: A cron job defined in `wrangler.toml` triggers the scraper to run on a regular schedule.
-2. **Job Discovery**: The initial job discovers all the available programs and courses and creates individual jobs for each one.
-3. **Queueing**: These individual jobs are added to a queue in the D1 database.
-4. **Job Processing**: The Cloudflare Worker processes jobs from the queue, scraping the data for each course or program.
-5. **Data Upsert**: The scraped data is then sent to the Convex backend via an HTTP request to be stored in the main database.
-6. **Error Handling**: The system includes error logging and a retry mechanism for failed jobs.
+1. **Admin Trigger**: Admin users can trigger scraping through the Convex backend by calling dedicated actions:
+   - `api.scraper.triggerMajorsScraping` - Initiates major (program) discovery
+   - `api.scraper.triggerCoursesScraping` - Initiates course discovery
+2. **HTTP Endpoints**: These actions make authenticated POST requests to the scraper's HTTP endpoints:
+   - `POST /api/trigger-majors` - Creates a major discovery job
+   - `POST /api/trigger-courses` - Creates a course discovery job
+3. **API Key Authentication**: The endpoints validate the `X-API-KEY` header against the `CONVEX_API_KEY` to ensure requests originate from the trusted Convex backend.
+4. **Job Discovery**: The initial job discovers all the available programs and courses and creates individual jobs for each one.
+5. **Queueing**: These individual jobs are added to a Cloudflare Queue and tracked in the D1 database.
+6. **Job Processing**: The Cloudflare Worker processes jobs from the queue, scraping the data for each course or program.
+7. **Data Upsert**: The scraped data is then sent to the Convex backend via authenticated HTTP requests to be stored in the main database.
+8. **Error Handling**: The system includes error logging and a retry mechanism for failed jobs.

 ## Project Structure

apps/scraper/src/index.ts
Lines changed: 60 additions & 84 deletions

@@ -1,6 +1,6 @@
 import { eq } from "drizzle-orm";
+import type { Context, Next } from "hono";
 import { Hono } from "hono";
-import * as z from "zod/mini";
 import getDB from "./drizzle";
 import { errorLogs, jobs } from "./drizzle/schema";
 import { ConvexApi } from "./lib/convex";
@@ -10,108 +10,84 @@ import { discoverPrograms, scrapeProgram } from "./modules/programs";

 const app = new Hono<{ Bindings: CloudflareBindings }>();

+const validateApiKey = async (
+  c: Context<{ Bindings: CloudflareBindings }>,
+  next: Next,
+) => {
+  const apiKey = c.req.header("X-API-KEY");
+
+  if (!apiKey) {
+    return c.json({ error: "Missing API key" }, 401);
+  }
+
+  if (apiKey !== c.env.CONVEX_API_KEY) {
+    return c.json({ error: "Invalid API key" }, 403);
+  }
+
+  await next();
+};
+
 app.get("/", async (c) => {
   // const db = await getDB(c.env);
   // TODO: use hono to render a dashboard to monitor the scraping status
   return c.json({ status: "ok" });
 });

-const ZCacheData = z.object({
-  isMajorsEnabled: z.transform((val) => val === "true"),
-  isCoursesEnabled: z.transform((val) => val === "true"),
-});
+// Endpoint to trigger major discovery scraping
+app.post("/api/majors", validateApiKey, async (c) => {
+  const db = getDB(c.env);

-export default {
-  fetch: app.fetch,
+  const programsUrl = new URL("/programs", c.env.SCRAPING_BASE_URL).toString();

-  async scheduled(_event: ScheduledEvent, env: CloudflareBindings) {
-    const db = getDB(env);
-    const convex = new ConvexApi({
-      baseUrl: env.CONVEX_SITE_URL,
-      apiKey: env.CONVEX_API_KEY,
-    });
+  const [createdJob] = await db
+    .insert(jobs)
+    .values({ url: programsUrl, jobType: "discover-programs" })
+    .returning();

-    const cache = caches.default;
-    const cacheKey = `${env.CONVEX_SITE_URL}/app-configs`;
+  await c.env.SCRAPING_QUEUE.send({ jobId: createdJob.id });

-    let isMajorsEnabled = false;
-    let isCoursesEnabled = false;
+  console.log(`Created major discovery job [id: ${createdJob.id}]`);

-    // Check to see if app configs are cached
-    const cached = await cache.match(cacheKey);
-    if (cached) {
-      const { data, success } = ZCacheData.safeParse(await cached.json());
+  return c.json({
+    success: true,
+    jobId: createdJob.id,
+    jobType: createdJob.jobType,
+  });
+});

-      if (!success) {
-        throw new JobError("Failed to parse cache data", "validation");
-      }
+// Endpoint to trigger course discovery scraping
+app.post("/api/courses", validateApiKey, async (c) => {
+  const db = getDB(c.env);

-      isMajorsEnabled = data.isMajorsEnabled;
-      isCoursesEnabled = data.isCoursesEnabled;
-    } else {
-      const [isScrapingMajors, isScrapingCourses] = await Promise.all([
-        convex.getAppConfig({ key: "is_scraping_majors" }),
-        convex.getAppConfig({ key: "is_scraping_courses" }),
-      ]);
-
-      isMajorsEnabled = isScrapingMajors === "true";
-      isCoursesEnabled = isScrapingCourses === "true";
-
-      await cache.put(
-        cacheKey,
-        new Response(
-          JSON.stringify({
-            isScrapingMajors,
-            isScrapingCourses,
-          }),
-          {
-            headers: { "Cache-Control": "max-age=3600" },
-          },
-        ),
-      );
-    }
+  const coursesUrl = new URL("/courses", c.env.SCRAPING_BASE_URL).toString();

-    const jobsToCreate: Array<{
-      url: string;
-      jobType: "discover-programs" | "discover-courses";
-    }> = [];
-    const flagsToDisable: string[] = [];
-
-    // add major discovery job to the queue
-    if (isMajorsEnabled) {
-      const programsUrl = new URL(
-        "/programs",
-        env.SCRAPING_BASE_URL,
-      ).toString();
-      jobsToCreate.push({ url: programsUrl, jobType: "discover-programs" });
-      flagsToDisable.push("is_scraping_majors");
-    }
+  const [createdJob] = await db
+    .insert(jobs)
+    .values({ url: coursesUrl, jobType: "discover-courses" })
+    .returning();

-    // add course discovery job to the queue
-    if (isCoursesEnabled) {
-      const coursesUrl = new URL("/courses", env.SCRAPING_BASE_URL).toString();
-      jobsToCreate.push({ url: coursesUrl, jobType: "discover-courses" });
-      flagsToDisable.push("is_scraping_courses");
-    }
+  await c.env.SCRAPING_QUEUE.send({ jobId: createdJob.id });

-    if (jobsToCreate.length === 0) {
-      console.log("No scraping jobs enabled, skipping");
-      return;
-    }
+  console.log(`Created course discovery job [id: ${createdJob.id}]`);

-    const createdJobs = await db.insert(jobs).values(jobsToCreate).returning();
+  return c.json({
+    success: true,
+    jobId: createdJob.id,
+    jobType: createdJob.jobType,
+  });
+});

-    await Promise.all([
-      ...createdJobs.map((job) => env.SCRAPING_QUEUE.send({ jobId: job.id })),
-      ...flagsToDisable.map((flag) =>
-        convex.setAppConfig({ key: flag, value: "false" }),
-      ),
-      cache.delete(cacheKey),
-    ]);
+export default {
+  fetch: app.fetch,

-    console.log(
-      `Created ${createdJobs.length} jobs [${createdJobs.map((j) => j.jobType).join(", ")}], disabled flags: ${flagsToDisable.join(", ")}`,
-    );
+  async scheduled(_event: ScheduledEvent, _env: CloudflareBindings) {
+    // const db = getDB(env);
+    // const convex = new ConvexApi({
+    //   baseUrl: env.CONVEX_SITE_URL,
+    //   apiKey: env.CONVEX_API_KEY,
+    // });
+    // TODO: add albert public search
+    return;
   },

   async queue(
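
Both new endpoints share one insert-then-enqueue shape: write the job row to D1 first, then publish only the row id to the queue so the consumer re-reads authoritative state when processing. The in-memory sketch below illustrates that ordering; the `Job` type and the array stand-ins for D1 and the Cloudflare Queue are hypothetical.

```typescript
// In-memory stand-ins for the D1 `jobs` table and the Cloudflare Queue,
// illustrating the insert-then-enqueue ordering used by both endpoints.
type Job = { id: number; url: string; jobType: string };

function createDiscoveryJob(
  db: Job[],
  queue: Array<{ jobId: number }>,
  url: string,
  jobType: "discover-programs" | "discover-courses",
): Job {
  // 1) Insert the row first so the id exists before any consumer runs.
  const job: Job = { id: db.length + 1, url, jobType };
  db.push(job);
  // 2) Enqueue only the id; the worker re-reads the row when processing.
  queue.push({ jobId: job.id });
  return job;
}
```

Sending only `{ jobId }` keeps queue messages small and avoids the row and the message drifting out of sync.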

apps/scraper/wrangler.jsonc
Lines changed: 1 addition & 1 deletion

@@ -1,5 +1,5 @@
 {
-  "$schema": "../../node_modules/wrangler/config-schema.json",
+  "$schema": "./node_modules/wrangler/config-schema.json",
   "name": "albert-plus-scraper",
   "routes": [{ "pattern": "scraper.albertplus.com", "custom_domain": true }],
   "main": "src/index.ts",

apps/web/src/app/dashboard/admin/page.tsx
Lines changed: 80 additions & 7 deletions

@@ -3,10 +3,11 @@
 import { api } from "@albert-plus/server/convex/_generated/api";
 import type { Doc } from "@albert-plus/server/convex/_generated/dataModel";
 import { useUser } from "@clerk/nextjs";
-import { useConvexAuth, useMutation, useQuery } from "convex/react";
-import { Plus } from "lucide-react";
+import { useAction, useConvexAuth, useMutation, useQuery } from "convex/react";
+import { BookOpen, GraduationCap, Plus } from "lucide-react";
 import { useRouter } from "next/navigation";
 import { useState } from "react";
+import { toast } from "sonner";
 import { Button } from "@/components/ui/button";
 import {
   Dialog,
@@ -32,6 +33,8 @@ export default function AdminPage() {
   );
   const setConfig = useMutation(api.appConfigs.setAppConfig);
   const removeConfig = useMutation(api.appConfigs.removeAppConfig);
+  const triggerMajorsScraping = useAction(api.scraper.triggerMajorsScraping);
+  const triggerCoursesScraping = useAction(api.scraper.triggerCoursesScraping);

   const [isDialogOpen, setIsDialogOpen] = useState(false);
   const [mode, setMode] = useState<"add" | "edit">("add");
@@ -44,6 +47,9 @@ export default function AdminPage() {
     Doc<"appConfigs"> | undefined
   >(undefined);

+  const [isTriggeringMajors, setIsTriggeringMajors] = useState(false);
+  const [isTriggeringCourses, setIsTriggeringCourses] = useState(false);
+
   if (isAuthenticated && !isAdmin) {
     router.push("/dashboard");
     return null;
@@ -86,12 +92,79 @@ export default function AdminPage() {
     }
   };

+  const handleTriggerMajors = async () => {
+    setIsTriggeringMajors(true);
+    try {
+      const result = await triggerMajorsScraping({});
+      toast.success("Majors scraping triggered successfully", {
+        description: `Job ID: ${result.jobId}`,
+      });
+    } catch (error) {
+      toast.error("Failed to trigger majors scraping", {
+        description: error instanceof Error ? error.message : "Unknown error",
+      });
+    } finally {
+      setIsTriggeringMajors(false);
+    }
+  };
+
+  const handleTriggerCourses = async () => {
+    setIsTriggeringCourses(true);
+    try {
+      const result = await triggerCoursesScraping({});
+      toast.success("Courses scraping triggered successfully", {
+        description: `Job ID: ${result.jobId}`,
+      });
+    } catch (error) {
+      toast.error("Failed to trigger courses scraping", {
+        description: error instanceof Error ? error.message : "Unknown error",
+      });
+    } finally {
+      setIsTriggeringCourses(false);
+    }
+  };
+
   return (
-    <div className="space-y-4">
-      <Button onClick={handleAdd} size="sm">
-        <Plus className="size-4" />
-        Add
-      </Button>
+    <div className="space-y-6">
+      <div>
+        <h2 className="text-lg font-semibold mb-3">Scraper Controls</h2>
+        <div className="flex gap-3">
+          <Button
+            onClick={handleTriggerMajors}
+            disabled={isTriggeringMajors}
+            size="sm"
+            variant="outline"
+          >
+            {isTriggeringMajors ? (
+              <Spinner className="size-4" />
+            ) : (
+              <GraduationCap className="size-4" />
+            )}
+            Trigger Majors Scraping
+          </Button>
+          <Button
+            onClick={handleTriggerCourses}
+            disabled={isTriggeringCourses}
+            size="sm"
+            variant="outline"
+          >
+            {isTriggeringCourses ? (
+              <Spinner className="size-4" />
+            ) : (
+              <BookOpen className="size-4" />
+            )}
+            Trigger Courses Scraping
+          </Button>
+        </div>
+      </div>
+
+      <div>
+        <h2 className="text-lg font-semibold mb-3">App Configuration</h2>
+        <Button onClick={handleAdd} size="sm">
+          <Plus className="size-4" />
+          Add
+        </Button>
+      </div>

       <ConfigTable data={configs} onEdit={handleEdit} onDelete={handleDelete} />
packages/server/convex/_generated/api.d.ts

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -23,6 +23,7 @@ import type * as schemas_programs from "../schemas/programs.js";
2323
import type * as schemas_schools from "../schemas/schools.js";
2424
import type * as schemas_students from "../schemas/students.js";
2525
import type * as schools from "../schools.js";
26+
import type * as scraper from "../scraper.js";
2627
import type * as seed from "../seed.js";
2728
import type * as students from "../students.js";
2829
import type * as userCourseOfferings from "../userCourseOfferings.js";
@@ -58,6 +59,7 @@ declare const fullApi: ApiFromModules<{
5859
"schemas/schools": typeof schemas_schools;
5960
"schemas/students": typeof schemas_students;
6061
schools: typeof schools;
62+
scraper: typeof scraper;
6163
seed: typeof seed;
6264
students: typeof students;
6365
userCourseOfferings: typeof userCourseOfferings;

packages/server/convex/appConfigs.ts
Lines changed: 2 additions & 0 deletions

@@ -74,6 +74,8 @@ export const removeAppConfig = protectedAdminMutation({
   },
 });

+// TODO: might be able to remove this function
+
 export const setAppConfigInternal = internalMutation({
   args: {
     key: v.string(),

packages/server/convex/schemas/appConfigs.ts
Lines changed: 0 additions & 2 deletions

@@ -6,8 +6,6 @@ const appConfigOptions = [
   "current_year",
   "next_term",
   "next_year",
-  "is_scraping_majors",
-  "is_scraping_courses",
 ] as const;

 const AppConfigKey = z.string() as z.ZodMiniType<
