fix(docs): restore Typesense search index after #22861 (#23042)

critesjosh · web-flow · commit 81e4f157eec2 · 2026-05-07T15:00:55.000Z
## Summary

Search on docs.aztec.network has been broken since #22861 was merged. The nightly Typesense docsearch-scraper run dropped from indexing **~12,457 records to 48 records** and has stayed there.

### Root cause

Two compounding regressions from #22861:

1. **`augment_sitemap.js` blasted the scraper.** It appends every `aztec-nr-api/mainnet/**/*.html` URL to the published `sitemap.xml`, which the scraper then queues for crawling via `sitemap_urls`. The previous-day baseline `sitemap.xml` had hundreds of URLs; post-PR it had thousands. The resulting request volume tripped Netlify's rate limiter, which started returning HTTP 403 on ~36% of responses, including every `/developers/tags/*` page and many content pages that worked the day before.
2. **The `api-nr` `text` selector matched nothing.** It targeted `main .comments p, main .comments li, main .item-description` on nargo-doc pages. `.item-description` is empty on most auto-generated index pages, so the scraper produced **`0 records`** for every `aztec-nr-api/mainnet/*` URL it managed to crawl.

Evidence from the most recent nightly run: `request_count=1677, 200=814, 403=609`, `Nb hits: 48`. The previous-day baseline run was `Nb hits: 12457`. The workflow exited 0 in both cases because the docker container exits 0 regardless.

## Fix

`docs/typesense.config.json`:

- Remove `sitemap_urls`. Keep `augment_sitemap.js` and the augmented sitemap in place for SEO; rely on link traversal from the two `start_urls` for indexing. This shrinks the scraper's request volume back toward baseline.
- Drop `sitemap_alternate_links: true` (it only affects sitemap-driven crawling, which we no longer do).
- Broaden the `api-nr` `text` selector to `main .comments p, main .comments li, main .padded-description, main .item-description, main .struct-field, main li`. Verified against the checked-in nargo-doc HTML in `docs/static/aztec-nr-api/mainnet/`: 465 files use `.comments`, struct/fn pages use `.padded-description`, and module-index pages need `main li` to surface the names of nested items.

`.github/workflows/docs-typesense.yml`:

- Capture the scraper output and fail the run if fewer than 5,000 records are indexed. The container exits 0 even when the config is broken, which let the 48-record regression land silently and stay broken across many nightly runs. The threshold catches the failure mode while leaving plenty of headroom below the 12k baseline.

## Test plan

- [ ] Manually dispatch the `Docs Scraper` workflow on this branch via `workflow_dispatch` and confirm `Nb hits` returns to baseline (well above 5,000) and the run logs no longer report a flood of 403s.
- [ ] After merge, confirm site search on https://docs.aztec.network/ returns results for common queries (e.g. `PXE`, `deploy`, `account`, `ContractClassId`).
- [ ] Confirm Aztec.nr API entries (e.g. searching for `ContractClassId`, `protocol_types`) now appear in search results.
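
For reviewers who want to reason about the record-count guard outside of CI, here is a minimal Python sketch of the same logic: take the last `Nb hits: N` line from the scraper output and fail if it is missing or below the threshold. The function names and the sample log strings are hypothetical; only the `Nb hits: N` shape and the figures (48, 12457, 5000, 609 of 1677) come from the description above. The workflow itself does this in shell.

```python
import re
from typing import Optional


def parse_nb_hits(log_text: str) -> Optional[int]:
    """Return the last 'Nb hits: N' value found in the scraper output, or None."""
    matches = re.findall(r"Nb hits: *([0-9]+)", log_text)
    return int(matches[-1]) if matches else None


def index_is_healthy(log_text: str, min_hits: int = 5000) -> bool:
    """Mirror the guard: an unparseable log and a low record count both fail."""
    nb_hits = parse_nb_hits(log_text)
    return nb_hits is not None and nb_hits >= min_hits


# Hypothetical log excerpts built from the figures quoted in the summary.
broken_run = "request_count=1677, 200=814, 403=609\nNb hits: 48\n"
baseline_run = "Nb hits: 12457\n"

# Share of responses that were HTTP 403 in the broken run (the "~36%" above).
share_403 = 609 / 1677

print(index_is_healthy(broken_run))         # False
print(index_is_healthy(baseline_run))       # True
print(index_is_healthy("no summary line"))  # False
print(f"{share_403:.1%}")                   # 36.3%
```

Treating a missing `Nb hits` line as a failure matters here: a scraper that crashes before printing its summary should not pass the check.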
diff --git a/.github/workflows/docs-typesense.yml b/.github/workflows/docs-typesense.yml
@@ -27,11 +27,30 @@ jobs:
           fetch-depth: 0
 
       - name: Reindex with Typesense docsearch-scraper
+        env:
+          # Fail the run if the scraper indexes fewer than this many records.
+          # The docsearch-scraper container exits 0 even when its config is broken
+          # and the index ends up nearly empty, so this guard turns a silent
+          # regression (which happened with #22861 dropping the index from
+          # ~12k to 48 records) into a loud CI failure.
+          MIN_HITS: "5000"
         run: |
+          set -o pipefail
           docker run \
             -e "TYPESENSE_API_KEY=${{ secrets.TYPESENSE_API_KEY }}" \
             -e "TYPESENSE_HOST=${{ secrets.TYPESENSE_HOST }}" \
             -e "TYPESENSE_PORT=443" \
             -e "TYPESENSE_PROTOCOL=https" \
             -e "CONFIG=$(cat docs/typesense.config.json | jq -r tostring)" \
-            typesense/docsearch-scraper:0.11.0
+            typesense/docsearch-scraper:0.11.0 2>&1 | tee scraper.log
+
+          nb_hits=$(grep -oE 'Nb hits: *[0-9]+' scraper.log | tail -1 | grep -oE '[0-9]+' || true)
+          if [ -z "$nb_hits" ]; then
+            echo "::error::Could not parse 'Nb hits' from scraper output — assuming index is broken."
+            exit 1
+          fi
+          echo "Indexed $nb_hits records (threshold: $MIN_HITS)"
+          if [ "$nb_hits" -lt "$MIN_HITS" ]; then
+            echo "::error::Indexed only $nb_hits records (expected at least $MIN_HITS). Search index is likely broken."
+            exit 1
+          fi
diff --git a/docs/typesense.config.json b/docs/typesense.config.json
@@ -1,25 +1,21 @@
 {
   "index_name": "aztec-docs",
   "start_urls": [
-      {
-        "url": "https://docs.aztec.network/",
-        "page_rank": 10
-      },
-      {
-        "url": "https://docs.aztec.network/aztec-nr-api/mainnet/",
-        "selectors_key": "api-nr",
-        "page_rank": 2
-      }
-    ],
-    "sitemap_urls": [
-      "https://docs.aztec.network/sitemap.xml"
-    ],
-    "stop_urls": [
-      "https://docs.aztec.network/aztec-nr-api/mainnet/std/",
-      "https://docs.aztec.network/aztec-nr-api/mainnet/all.html",
-      "aztec-nr-api/.*/global\\.[^/]+\\.html$"
-    ],
-  "sitemap_alternate_links": true,
+    {
+      "url": "https://docs.aztec.network/",
+      "page_rank": 10
+    },
+    {
+      "url": "https://docs.aztec.network/aztec-nr-api/mainnet/",
+      "selectors_key": "api-nr",
+      "page_rank": 2
+    }
+  ],
+  "stop_urls": [
+    "https://docs.aztec.network/aztec-nr-api/mainnet/std/",
+    "https://docs.aztec.network/aztec-nr-api/mainnet/all.html",
+    "aztec-nr-api/.*/global\\.[^/]+\\.html$"
+  ],
   "selectors": {
     "default": {
       "lvl0": {
@@ -45,18 +41,13 @@
       "lvl2": "main h2",
       "lvl3": "main h3",
       "lvl4": "main h4",
-      "text": "main .comments p, main .comments li, main .item-description"
+      "text": "main .comments p, main .comments li, main .padded-description, main .item-description, main .struct-field, main li"
     }
   },
   "strip_chars": " .,;:#",
   "custom_settings": {
     "separatorsToIndex": "_",
-    "attributesForFaceting": [
-      "language",
-      "version",
-      "type",
-      "docusaurus_tag"
-    ],
+    "attributesForFaceting": ["language", "version", "type", "docusaurus_tag"],
     "attributesToRetrieve": [
       "hierarchy",
       "content",
@@ -66,8 +57,6 @@
       "type"
     ]
   },
-  "conversation_id": [
-    "833762294"
-  ],
+  "conversation_id": ["833762294"],
   "nb_hits": 46250
-}
+}
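
The config changes above could also be checked mechanically before the nightly run. The following is a rough Python sketch, not part of this change. It works on an abbreviated, hypothetical copy of `docs/typesense.config.json` and assumes the `text` selector shown in the diff lives under `selectors["api-nr"]`, which the hunk does not show directly. `config_problems` is an illustrative helper name.

```python
import json

# Abbreviated, hypothetical copy of the config after this change.
# Assumption: the broadened "text" selector sits under selectors["api-nr"].
CONFIG_JSON = """
{
  "index_name": "aztec-docs",
  "start_urls": [
    {"url": "https://docs.aztec.network/", "page_rank": 10},
    {"url": "https://docs.aztec.network/aztec-nr-api/mainnet/",
     "selectors_key": "api-nr", "page_rank": 2}
  ],
  "selectors": {
    "default": {},
    "api-nr": {
      "text": "main .comments p, main .comments li, main .padded-description, main .item-description, main .struct-field, main li"
    }
  }
}
"""


def config_problems(config):
    """Return human-readable problems; an empty list means the shape looks right."""
    problems = []
    # Sitemap-driven crawling flooded the scraper, so neither key should remain.
    for key in ("sitemap_urls", "sitemap_alternate_links"):
        if key in config:
            problems.append(f"unexpected key: {key}")
    selectors = config.get("selectors", {})
    # Every selectors_key referenced by a start URL needs a matching selector set.
    for entry in config.get("start_urls", []):
        key = entry.get("selectors_key", "default")
        if key not in selectors:
            problems.append(f"missing selector set: {key}")
    # The broadened text selector should include the parts named in the description.
    text = selectors.get("api-nr", {}).get("text", "")
    parts = {part.strip() for part in text.split(",")}
    for needed in ("main .comments p", "main .padded-description", "main li"):
        if needed not in parts:
            problems.append(f"text selector lacks: {needed}")
    return problems


config = json.loads(CONFIG_JSON)
print(config_problems(config))  # []

# Re-adding the sitemap key, as in the pre-fix config, is reported.
old_style = dict(config, sitemap_urls=["https://docs.aztec.network/sitemap.xml"])
print(config_problems(old_style))  # ['unexpected key: sitemap_urls']
```

A check like this only validates the shape of the config. It cannot tell whether the selectors match real page content, which is what the `MIN_HITS` guard in the workflow covers.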