Commit 717f04a

feat: add on-demand scraper triggering via http handlers (#81)

1 parent 4061ea0 · commit 717f04a

12 files changed: 261 additions & 116 deletions

apps/docs/src/content/docs/architecture/data-flow.md
Lines changed: 6 additions & 4 deletions

@@ -9,11 +9,13 @@ Understanding the flow of data is crucial to comprehending how AlbertPlus works.
 The primary data pipeline is responsible for collecting, storing, and serving course and program information.

 1. **Scraping (Cloudflare Worker)**
-   - **Trigger**: The process begins with a scheduled job (cron trigger) or http handlers in the Cloudflare Worker.
-   - **Discovery**: The scraper first discovers the URLs for all available programs and courses from NYU's public course catalog.
-   - **Job Queuing**: For each discovered program and course, a new job is added to a Cloudflare D1 queue. This allows for resilient and distributed processing.
+   - **Admin Trigger**: Admin users initiate scraping by calling Convex actions (`api.scraper.triggerMajorsScraping` or `api.scraper.triggerCoursesScraping`).
+   - **Authenticated Request**: The Convex action makes a POST request to the scraper's HTTP endpoints (`/api/trigger-majors` or `/api/trigger-courses`) with the `CONVEX_API_KEY` in the `X-API-KEY` header.
+   - **API Key Validation**: The scraper validates the API key to ensure the request is from the trusted Convex backend.
+   - **Discovery**: The scraper discovers the URLs for all available programs or courses from NYU's public course catalog.
+   - **Job Queuing**: For each discovered program or course, a new job is added to a Cloudflare Queue. This allows for resilient and distributed processing.
    - **Data Extraction**: Each job in the queue is processed by the worker, which scrapes the detailed information for a specific course or program.
-   - **Upsert to Backend**: The scraped data is then sent to the Convex backend via an HTTP endpoint.
+   - **Upsert to Backend**: The scraped data is sent back to the Convex backend via authenticated HTTP endpoints.

 2. **Backend Processing (Convex)**
    - **Data Reception**: The Convex backend receives the scraped data from the Cloudflare Worker.
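
The trigger-and-authenticate steps in the updated doc can be sketched as one request helper. This is a hypothetical standalone version for illustration only: `triggerScraping`, the injectable `fetchImpl`, and the exact response shape are assumptions drawn from the endpoints and header named in the diff, not the actual Convex action code.

```typescript
// Hypothetical sketch of the Convex-side trigger request: POST to the
// scraper with the shared key in the X-API-KEY header.
type TriggerResult = { success: boolean; jobId: number; jobType: string };

async function triggerScraping(
  scraperUrl: string,
  apiKey: string,
  kind: "majors" | "courses",
  fetchImpl: typeof fetch = fetch, // injectable so the helper is testable
): Promise<TriggerResult> {
  const res = await fetchImpl(new URL(`/api/${kind}`, scraperUrl).toString(), {
    method: "POST",
    headers: { "X-API-KEY": apiKey },
  });
  if (!res.ok) {
    // 401 = missing key, 403 = invalid key, per the scraper's middleware
    throw new Error(`Scraper returned ${res.status}`);
  }
  return (await res.json()) as TriggerResult;
}
```

Note the path shape (`/api/majors` / `/api/courses`) follows the worker code in this commit; the doc text names `/api/trigger-majors` / `/api/trigger-courses`.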

apps/docs/src/content/docs/getting-started/environment-variables.md
Lines changed: 5 additions & 4 deletions

@@ -39,10 +39,11 @@ These variables are required for the Cloudflare Worker scraper.

 These variables are configured in your Convex deployment environment.

-| Variable                  | Description                                                         |
-| ------------------------- | ------------------------------------------------------------------- |
-| `CLERK_JWT_ISSUER_DOMAIN` | The JWT issuer domain from your Clerk account for token validation. |
-| `CONVEX_API_KEY`          | An API key for authenticating with the Convex backend.              |
+| Variable                  | Description                                                                         |
+| ------------------------- | ----------------------------------------------------------------------------------- |
+| `CLERK_JWT_ISSUER_DOMAIN` | The JWT issuer domain from your Clerk account for token validation.                 |
+| `CONVEX_API_KEY`          | A shared API key for authenticating requests between Convex and the scraper worker. |
+| `SCRAPER_URL`             | The URL of the deployed scraper worker (e.g., `https://scraper.albertplus.com`).    |

 ## Cloudflare Worker Bindings
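
Since the new `SCRAPER_URL` and the shared `CONVEX_API_KEY` must both be present before the trigger actions can work, a startup guard is a cheap safety net. The helper below is hypothetical (the repository does not necessarily contain one); it only illustrates failing fast on the variables in the table above.

```typescript
// Hypothetical startup guard for the deployment variables listed above.
function requireEnv(
  env: Record<string, string | undefined>,
  keys: readonly string[],
): Record<string, string> {
  const missing = keys.filter((k) => !env[k]);
  if (missing.length > 0) {
    // Report every missing variable at once instead of one per restart.
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
  return Object.fromEntries(keys.map((k) => [k, env[k] as string]));
}
```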

apps/docs/src/content/docs/modules/scraper.md
Lines changed: 12 additions & 6 deletions

@@ -16,12 +16,18 @@ The scraper, located in the `apps/scraper` directory, is a critical component of

 The scraping process is designed to be robust and resilient:

-1. **Scheduled Trigger**: A cron job defined in `wrangler.toml` triggers the scraper to run on a regular schedule.
-2. **Job Discovery**: The initial job discovers all the available programs and courses and creates individual jobs for each one.
-3. **Queueing**: These individual jobs are added to a queue in the D1 database.
-4. **Job Processing**: The Cloudflare Worker processes jobs from the queue, scraping the data for each course or program.
-5. **Data Upsert**: The scraped data is then sent to the Convex backend via an HTTP request to be stored in the main database.
-6. **Error Handling**: The system includes error logging and a retry mechanism for failed jobs.
+1. **Admin Trigger**: Admin users can trigger scraping through the Convex backend by calling dedicated actions:
+   - `api.scraper.triggerMajorsScraping` - Initiates major (program) discovery
+   - `api.scraper.triggerCoursesScraping` - Initiates course discovery
+2. **HTTP Endpoints**: These actions make authenticated POST requests to the scraper's HTTP endpoints:
+   - `POST /api/trigger-majors` - Creates a major discovery job
+   - `POST /api/trigger-courses` - Creates a course discovery job
+3. **API Key Authentication**: The endpoints validate the `X-API-KEY` header against the `CONVEX_API_KEY` to ensure requests originate from the trusted Convex backend.
+4. **Job Discovery**: The initial job discovers all the available programs and courses and creates individual jobs for each one.
+5. **Queueing**: These individual jobs are added to a Cloudflare Queue and tracked in the D1 database.
+6. **Job Processing**: The Cloudflare Worker processes jobs from the queue, scraping the data for each course or program.
+7. **Data Upsert**: The scraped data is then sent to the Convex backend via authenticated HTTP requests to be stored in the main database.
+8. **Error Handling**: The system includes error logging and a retry mechanism for failed jobs.

 ## Project Structure

apps/scraper/src/index.ts
Lines changed: 60 additions & 84 deletions

@@ -1,6 +1,6 @@
 import { eq } from "drizzle-orm";
+import type { Context, Next } from "hono";
 import { Hono } from "hono";
-import * as z from "zod/mini";
 import getDB from "./drizzle";
 import { errorLogs, jobs } from "./drizzle/schema";
 import { ConvexApi } from "./lib/convex";
@@ -10,108 +10,84 @@ import { discoverPrograms, scrapeProgram } from "./modules/programs";

 const app = new Hono<{ Bindings: CloudflareBindings }>();

+const validateApiKey = async (
+  c: Context<{ Bindings: CloudflareBindings }>,
+  next: Next,
+) => {
+  const apiKey = c.req.header("X-API-KEY");
+
+  if (!apiKey) {
+    return c.json({ error: "Missing API key" }, 401);
+  }
+
+  if (apiKey !== c.env.CONVEX_API_KEY) {
+    return c.json({ error: "Invalid API key" }, 403);
+  }
+
+  await next();
+};
+
 app.get("/", async (c) => {
   // const db = await getDB(c.env);
   // TODO: use hono to render a dashboard to monitor the scraping status
   return c.json({ status: "ok" });
 });

-const ZCacheData = z.object({
-  isMajorsEnabled: z.transform((val) => val === "true"),
-  isCoursesEnabled: z.transform((val) => val === "true"),
-});
+// Endpoint to trigger major discovery scraping
+app.post("/api/majors", validateApiKey, async (c) => {
+  const db = getDB(c.env);

-export default {
-  fetch: app.fetch,
+  const programsUrl = new URL("/programs", c.env.SCRAPING_BASE_URL).toString();

-  async scheduled(_event: ScheduledEvent, env: CloudflareBindings) {
-    const db = getDB(env);
-    const convex = new ConvexApi({
-      baseUrl: env.CONVEX_SITE_URL,
-      apiKey: env.CONVEX_API_KEY,
-    });
+  const [createdJob] = await db
+    .insert(jobs)
+    .values({ url: programsUrl, jobType: "discover-programs" })
+    .returning();

-    const cache = caches.default;
-    const cacheKey = `${env.CONVEX_SITE_URL}/app-configs`;
+  await c.env.SCRAPING_QUEUE.send({ jobId: createdJob.id });

-    let isMajorsEnabled = false;
-    let isCoursesEnabled = false;
+  console.log(`Created major discovery job [id: ${createdJob.id}]`);

-    // Check to see if app configs are cached
-    const cached = await cache.match(cacheKey);
-    if (cached) {
-      const { data, success } = ZCacheData.safeParse(await cached.json());
+  return c.json({
+    success: true,
+    jobId: createdJob.id,
+    jobType: createdJob.jobType,
+  });
+});

-      if (!success) {
-        throw new JobError("Failed to parse cache data", "validation");
-      }
+// Endpoint to trigger course discovery scraping
+app.post("/api/courses", validateApiKey, async (c) => {
+  const db = getDB(c.env);

-      isMajorsEnabled = data.isMajorsEnabled;
-      isCoursesEnabled = data.isCoursesEnabled;
-    } else {
-      const [isScrapingMajors, isScrapingCourses] = await Promise.all([
-        convex.getAppConfig({ key: "is_scraping_majors" }),
-        convex.getAppConfig({ key: "is_scraping_courses" }),
-      ]);
-
-      isMajorsEnabled = isScrapingMajors === "true";
-      isCoursesEnabled = isScrapingCourses === "true";
-
-      await cache.put(
-        cacheKey,
-        new Response(
-          JSON.stringify({
-            isScrapingMajors,
-            isScrapingCourses,
-          }),
-          {
-            headers: { "Cache-Control": "max-age=3600" },
-          },
-        ),
-      );
-    }
+  const coursesUrl = new URL("/courses", c.env.SCRAPING_BASE_URL).toString();

-    const jobsToCreate: Array<{
-      url: string;
-      jobType: "discover-programs" | "discover-courses";
-    }> = [];
-    const flagsToDisable: string[] = [];
-
-    // add major discovery job to the queue
-    if (isMajorsEnabled) {
-      const programsUrl = new URL(
-        "/programs",
-        env.SCRAPING_BASE_URL,
-      ).toString();
-      jobsToCreate.push({ url: programsUrl, jobType: "discover-programs" });
-      flagsToDisable.push("is_scraping_majors");
-    }
+  const [createdJob] = await db
+    .insert(jobs)
+    .values({ url: coursesUrl, jobType: "discover-courses" })
+    .returning();

-    // add course discovery job to the queue
-    if (isCoursesEnabled) {
-      const coursesUrl = new URL("/courses", env.SCRAPING_BASE_URL).toString();
-      jobsToCreate.push({ url: coursesUrl, jobType: "discover-courses" });
-      flagsToDisable.push("is_scraping_courses");
-    }
+  await c.env.SCRAPING_QUEUE.send({ jobId: createdJob.id });

-    if (jobsToCreate.length === 0) {
-      console.log("No scraping jobs enabled, skipping");
-      return;
-    }
+  console.log(`Created course discovery job [id: ${createdJob.id}]`);

-    const createdJobs = await db.insert(jobs).values(jobsToCreate).returning();
+  return c.json({
+    success: true,
+    jobId: createdJob.id,
+    jobType: createdJob.jobType,
+  });
+});

-    await Promise.all([
-      ...createdJobs.map((job) => env.SCRAPING_QUEUE.send({ jobId: job.id })),
-      ...flagsToDisable.map((flag) =>
-        convex.setAppConfig({ key: flag, value: "false" }),
-      ),
-      cache.delete(cacheKey),
-    ]);
+export default {
+  fetch: app.fetch,

-    console.log(
-      `Created ${createdJobs.length} jobs [${createdJobs.map((j) => j.jobType).join(", ")}], disabled flags: ${flagsToDisable.join(", ")}`,
-    );
+  async scheduled(_event: ScheduledEvent, _env: CloudflareBindings) {
+    // const db = getDB(env);
+    // const convex = new ConvexApi({
+    //   baseUrl: env.CONVEX_SITE_URL,
+    //   apiKey: env.CONVEX_API_KEY,
+    // });
+    // TODO: add albert public search
+    return;
   },

   async queue(
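
Both new endpoints share one insert-then-enqueue shape: write the job row to D1 first, then publish only the row id to the queue so the consumer re-reads authoritative state when processing. The in-memory sketch below illustrates that ordering; the `Job` type and the array stand-ins for D1 and the Cloudflare Queue are hypothetical.

```typescript
// In-memory stand-ins for the D1 `jobs` table and the Cloudflare Queue,
// illustrating the insert-then-enqueue ordering used by both endpoints.
type Job = { id: number; url: string; jobType: string };

function createDiscoveryJob(
  db: Job[],
  queue: Array<{ jobId: number }>,
  url: string,
  jobType: "discover-programs" | "discover-courses",
): Job {
  // 1) Insert the row first so the id exists before any consumer runs.
  const job: Job = { id: db.length + 1, url, jobType };
  db.push(job);
  // 2) Enqueue only the id; the worker re-reads the row when processing.
  queue.push({ jobId: job.id });
  return job;
}
```

Sending only `{ jobId }` keeps queue messages small and avoids the row and the message drifting out of sync.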

apps/scraper/wrangler.jsonc
Lines changed: 1 addition & 1 deletion

@@ -1,5 +1,5 @@
 {
-  "$schema": "../../node_modules/wrangler/config-schema.json",
+  "$schema": "./node_modules/wrangler/config-schema.json",
   "name": "albert-plus-scraper",
   "routes": [{ "pattern": "scraper.albertplus.com", "custom_domain": true }],
   "main": "src/index.ts",

apps/web/src/app/dashboard/admin/page.tsx
Lines changed: 80 additions & 7 deletions

@@ -3,10 +3,11 @@
 import { api } from "@albert-plus/server/convex/_generated/api";
 import type { Doc } from "@albert-plus/server/convex/_generated/dataModel";
 import { useUser } from "@clerk/nextjs";
-import { useConvexAuth, useMutation, useQuery } from "convex/react";
-import { Plus } from "lucide-react";
+import { useAction, useConvexAuth, useMutation, useQuery } from "convex/react";
+import { BookOpen, GraduationCap, Plus } from "lucide-react";
 import { useRouter } from "next/navigation";
 import { useState } from "react";
+import { toast } from "sonner";
 import { Button } from "@/components/ui/button";
 import {
   Dialog,
@@ -32,6 +33,8 @@ export default function AdminPage() {
   );
   const setConfig = useMutation(api.appConfigs.setAppConfig);
   const removeConfig = useMutation(api.appConfigs.removeAppConfig);
+  const triggerMajorsScraping = useAction(api.scraper.triggerMajorsScraping);
+  const triggerCoursesScraping = useAction(api.scraper.triggerCoursesScraping);

   const [isDialogOpen, setIsDialogOpen] = useState(false);
   const [mode, setMode] = useState<"add" | "edit">("add");
@@ -44,6 +47,9 @@ export default function AdminPage() {
     Doc<"appConfigs"> | undefined
   >(undefined);

+  const [isTriggeringMajors, setIsTriggeringMajors] = useState(false);
+  const [isTriggeringCourses, setIsTriggeringCourses] = useState(false);
+
   if (isAuthenticated && !isAdmin) {
     router.push("/dashboard");
     return null;
@@ -86,12 +92,79 @@ export default function AdminPage() {
     }
   };

+  const handleTriggerMajors = async () => {
+    setIsTriggeringMajors(true);
+    try {
+      const result = await triggerMajorsScraping({});
+      toast.success("Majors scraping triggered successfully", {
+        description: `Job ID: ${result.jobId}`,
+      });
+    } catch (error) {
+      toast.error("Failed to trigger majors scraping", {
+        description: error instanceof Error ? error.message : "Unknown error",
+      });
+    } finally {
+      setIsTriggeringMajors(false);
+    }
+  };
+
+  const handleTriggerCourses = async () => {
+    setIsTriggeringCourses(true);
+    try {
+      const result = await triggerCoursesScraping({});
+      toast.success("Courses scraping triggered successfully", {
+        description: `Job ID: ${result.jobId}`,
+      });
+    } catch (error) {
+      toast.error("Failed to trigger courses scraping", {
+        description: error instanceof Error ? error.message : "Unknown error",
+      });
+    } finally {
+      setIsTriggeringCourses(false);
+    }
+  };
+
   return (
-    <div className="space-y-4">
-      <Button onClick={handleAdd} size="sm">
-        <Plus className="size-4" />
-        Add
-      </Button>
+    <div className="space-y-6">
+      <div>
+        <h2 className="text-lg font-semibold mb-3">Scraper Controls</h2>
+        <div className="flex gap-3">
+          <Button
+            onClick={handleTriggerMajors}
+            disabled={isTriggeringMajors}
+            size="sm"
+            variant="outline"
+          >
+            {isTriggeringMajors ? (
+              <Spinner className="size-4" />
+            ) : (
+              <GraduationCap className="size-4" />
+            )}
+            Trigger Majors Scraping
+          </Button>
+          <Button
+            onClick={handleTriggerCourses}
+            disabled={isTriggeringCourses}
+            size="sm"
+            variant="outline"
+          >
+            {isTriggeringCourses ? (
+              <Spinner className="size-4" />
+            ) : (
+              <BookOpen className="size-4" />
+            )}
+            Trigger Courses Scraping
+          </Button>
+        </div>
+      </div>
+
+      <div>
+        <h2 className="text-lg font-semibold mb-3">App Configuration</h2>
+        <Button onClick={handleAdd} size="sm">
+          <Plus className="size-4" />
+          Add
+        </Button>
+      </div>

       <ConfigTable data={configs} onEdit={handleEdit} onDelete={handleDelete} />
packages/server/convex/_generated/api.d.ts

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -23,6 +23,7 @@ import type * as schemas_programs from "../schemas/programs.js";
2323
import type * as schemas_schools from "../schemas/schools.js";
2424
import type * as schemas_students from "../schemas/students.js";
2525
import type * as schools from "../schools.js";
26+
import type * as scraper from "../scraper.js";
2627
import type * as seed from "../seed.js";
2728
import type * as students from "../students.js";
2829
import type * as userCourseOfferings from "../userCourseOfferings.js";
@@ -58,6 +59,7 @@ declare const fullApi: ApiFromModules<{
5859
"schemas/schools": typeof schemas_schools;
5960
"schemas/students": typeof schemas_students;
6061
schools: typeof schools;
62+
scraper: typeof scraper;
6163
seed: typeof seed;
6264
students: typeof students;
6365
userCourseOfferings: typeof userCourseOfferings;

packages/server/convex/appConfigs.ts
Lines changed: 2 additions & 0 deletions

@@ -74,6 +74,8 @@ export const removeAppConfig = protectedAdminMutation({
   },
 });

+// TODO: might be able to remove this function
+
 export const setAppConfigInternal = internalMutation({
   args: {
     key: v.string(),

packages/server/convex/schemas/appConfigs.ts
Lines changed: 0 additions & 2 deletions

@@ -6,8 +6,6 @@ const appConfigOptions = [
   "current_year",
   "next_term",
   "next_year",
-  "is_scraping_majors",
-  "is_scraping_courses",
 ] as const;

 const AppConfigKey = z.string() as z.ZodMiniType<
