Commit 81e4f15

fix(docs): restore Typesense search index after #22861 (#23042)
## Summary

Search on docs.aztec.network has been broken since #22861 was merged. The nightly Typesense docsearch-scraper run dropped from indexing **~12,457 records to 48 records** and has stayed there.

### Root cause

Two compounding regressions from #22861:

1. **`augment_sitemap.js` blasted the scraper.** It appends every `aztec-nr-api/mainnet/**/*.html` URL into the published `sitemap.xml`, which the scraper then queues for crawling via `sitemap_urls`. The previous-day baseline `sitemap.xml` had hundreds of URLs; post-PR it had thousands. The resulting request volume tripped Netlify's rate limiter, which started returning HTTP 403 on ~36% of responses, including every `/developers/tags/*` page and many content pages that worked the day before.
2. **The `api-nr` `text` selector matched nothing.** It targeted `.comments p, .comments li, .item-description` on nargo-doc pages. `.item-description` is empty on most auto-generated index pages, so the scraper produced **`0 records`** for every `aztec-nr-api/mainnet/*` URL it managed to crawl.

Evidence from the most recent nightly run: `request_count=1677, 200=814, 403=609`, `Nb hits: 48`. The previous-day baseline run was `Nb hits: 12457`. Workflow exited 0 in both cases because the docker container exits 0 regardless.

## Fix

`docs/typesense.config.json`:

- Remove `sitemap_urls`. Keep `augment_sitemap.js` and the augmented sitemap in place for SEO; rely on link traversal from the two `start_urls` for indexing. This shrinks the scraper's request volume back toward baseline.
- Drop `sitemap_alternate_links: true` (only affects sitemap-driven crawling, which we no longer do).
- Broaden the `api-nr` `text` selector to `main .comments p, main .comments li, main .padded-description, main .item-description, main .struct-field, main li`. Verified against the checked-in nargo-doc HTML in `docs/static/aztec-nr-api/mainnet/`: 465 files use `.comments`, struct/fn pages use `.padded-description`, and module-index pages need `main li` to surface the names of nested items.

`.github/workflows/docs-typesense.yml`:

- Capture the scraper output and fail the run if fewer than 5,000 records are indexed. The container exits 0 even when the config is broken, which let the 48-record regression land silently and stay broken across many nightly runs. The threshold catches the failure mode while leaving plenty of headroom below the 12k baseline.

## Test plan

- [ ] Manually dispatch the `Docs Scraper` workflow on this branch via `workflow_dispatch` and confirm `Nb hits` returns to baseline (>>5,000) and the run logs no longer report a flood of 403s.
- [ ] After merge, confirm site search on https://docs.aztec.network/ returns results for common queries (e.g. `PXE`, `deploy`, `account`, `ContractClassId`).
- [ ] Confirm Aztec.nr API entries (e.g. searching for `ContractClassId`, `protocol_types`) now appear in search results.
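The selector check described under "Fix" (counting how many checked-in nargo-doc pages carry a given class) could be reproduced locally with something like the sketch below. The helper name `count_files_with_class` is hypothetical and not part of this commit, and the sketch assumes classes appear in plain double-quoted `class="..."` attributes.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the commit): count the HTML files under a
# directory that contain a given CSS class inside a class="..." attribute.
count_files_with_class() {
  local dir="$1" class="$2"
  # -r: recurse, -l: print each matching file once, -E: extended regex.
  # The pattern requires the class to be a whole space-separated token,
  # so "comments" does not match "comments-wide".
  grep -rlE --include='*.html' \
    "class=\"([^\"]* )?${class}( [^\"]*)?\"" "$dir" 2>/dev/null \
    | wc -l | tr -d ' '
}
```

For example, `count_files_with_class docs/static/aztec-nr-api/mainnet comments` would print the number of pages a `.comments` selector can match at all, which is a quick way to spot a selector that matches nothing before the nightly run does.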
2 parents 23a580e + 9a8afa8 commit 81e4f15

2 files changed

Lines changed: 39 additions & 31 deletions

.github/workflows/docs-typesense.yml

Lines changed: 20 additions & 1 deletion
@@ -27,11 +27,30 @@ jobs:
           fetch-depth: 0
 
       - name: Reindex with Typesense docsearch-scraper
+        env:
+          # Fail the run if the scraper indexes fewer than this many records.
+          # The docsearch-scraper container exits 0 even when its config is broken
+          # and the index ends up nearly empty, so this guard turns a silent
+          # regression (which happened with #22861 dropping the index from
+          # ~12k to 48 records) into a loud CI failure.
+          MIN_HITS: "5000"
         run: |
+          set -o pipefail
           docker run \
             -e "TYPESENSE_API_KEY=${{ secrets.TYPESENSE_API_KEY }}" \
             -e "TYPESENSE_HOST=${{ secrets.TYPESENSE_HOST }}" \
             -e "TYPESENSE_PORT=443" \
             -e "TYPESENSE_PROTOCOL=https" \
             -e "CONFIG=$(cat docs/typesense.config.json | jq -r tostring)" \
-            typesense/docsearch-scraper:0.11.0
+            typesense/docsearch-scraper:0.11.0 2>&1 | tee scraper.log
+
+          nb_hits=$(grep -oE 'Nb hits: *[0-9]+' scraper.log | tail -1 | grep -oE '[0-9]+' || true)
+          if [ -z "$nb_hits" ]; then
+            echo "::error::Could not parse 'Nb hits' from scraper output — assuming index is broken."
+            exit 1
+          fi
+          echo "Indexed $nb_hits records (threshold: $MIN_HITS)"
+          if [ "$nb_hits" -lt "$MIN_HITS" ]; then
+            echo "::error::Indexed only $nb_hits records (expected at least $MIN_HITS). Search index is likely broken."
+            exit 1
+          fi
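To try the record-count guard outside CI, the parsing and threshold logic from the workflow step can be pulled into functions and run against a saved log. This is only a sketch: the function names `parse_nb_hits` and `check_nb_hits` are hypothetical, and it assumes the log contains the same `Nb hits: N` summary line the workflow greps for.

```shell
#!/usr/bin/env bash
# Hypothetical helpers mirroring the guard in the workflow step above.

# Print the last "Nb hits: N" value found in a saved scraper log.
# Prints nothing (and still succeeds) when the line is absent.
parse_nb_hits() {
  grep -oE 'Nb hits: *[0-9]+' "$1" | tail -1 | grep -oE '[0-9]+' || true
}

# Return 0 when the log reports at least $2 records, 1 otherwise.
check_nb_hits() {
  local log="$1" min="$2" nb_hits
  nb_hits=$(parse_nb_hits "$log")
  if [ -z "$nb_hits" ]; then
    echo "could not parse 'Nb hits' from $log" >&2
    return 1
  fi
  if [ "$nb_hits" -lt "$min" ]; then
    echo "indexed only $nb_hits records (expected at least $min)" >&2
    return 1
  fi
  echo "indexed $nb_hits records (threshold: $min)"
}
```

Running `check_nb_hits scraper.log 5000` after a local scrape would then fail in the same two situations the workflow treats as errors: a missing summary line, or a count below the threshold.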

docs/typesense.config.json

Lines changed: 19 additions & 30 deletions
@@ -1,25 +1,21 @@
 {
   "index_name": "aztec-docs",
   "start_urls": [
-    {
-      "url": "https://docs.aztec.network/",
-      "page_rank": 10
-    },
-    {
-      "url": "https://docs.aztec.network/aztec-nr-api/mainnet/",
-      "selectors_key": "api-nr",
-      "page_rank": 2
-    }
-  ],
-  "sitemap_urls": [
-    "https://docs.aztec.network/sitemap.xml"
-  ],
-  "stop_urls": [
-    "https://docs.aztec.network/aztec-nr-api/mainnet/std/",
-    "https://docs.aztec.network/aztec-nr-api/mainnet/all.html",
-    "aztec-nr-api/.*/global\\.[^/]+\\.html$"
-  ],
-  "sitemap_alternate_links": true,
+    {
+      "url": "https://docs.aztec.network/",
+      "page_rank": 10
+    },
+    {
+      "url": "https://docs.aztec.network/aztec-nr-api/mainnet/",
+      "selectors_key": "api-nr",
+      "page_rank": 2
+    }
+  ],
+  "stop_urls": [
+    "https://docs.aztec.network/aztec-nr-api/mainnet/std/",
+    "https://docs.aztec.network/aztec-nr-api/mainnet/all.html",
+    "aztec-nr-api/.*/global\\.[^/]+\\.html$"
+  ],
   "selectors": {
     "default": {
       "lvl0": {
@@ -45,18 +41,13 @@
       "lvl2": "main h2",
       "lvl3": "main h3",
       "lvl4": "main h4",
-      "text": "main .comments p, main .comments li, main .item-description"
+      "text": "main .comments p, main .comments li, main .padded-description, main .item-description, main .struct-field, main li"
     }
   },
   "strip_chars": " .,;:#",
   "custom_settings": {
     "separatorsToIndex": "_",
-    "attributesForFaceting": [
-      "language",
-      "version",
-      "type",
-      "docusaurus_tag"
-    ],
+    "attributesForFaceting": ["language", "version", "type", "docusaurus_tag"],
     "attributesToRetrieve": [
       "hierarchy",
       "content",
@@ -66,8 +57,6 @@
       "type"
     ]
   },
-  "conversation_id": [
-    "833762294"
-  ],
+  "conversation_id": ["833762294"],
   "nb_hits": 46250
-}
+}
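With sitemap-driven crawling removed, `stop_urls` is what keeps link traversal out of unwanted pages, so it can help to sanity-check the regex entry against a few URLs. The sketch below assumes the scraper applies the pattern as an unanchored regular-expression search, and the helper name `is_stopped` and the sample URLs are hypothetical.

```shell
#!/usr/bin/env bash
# Regex entry from stop_urls, with the JSON string escaping (\\ -> \) removed.
STOP_PATTERN='aztec-nr-api/.*/global\.[^/]+\.html$'

# Hypothetical helper: return 0 if the URL matches the stop pattern anywhere
# in the string (unanchored search, an assumption about the scraper), else 1.
is_stopped() {
  printf '%s\n' "$1" | grep -qE "$STOP_PATTERN"
}

# Illustrative URLs only; the crate and item names are made up.
for url in \
  "https://docs.aztec.network/aztec-nr-api/mainnet/example_crate/global.EXAMPLE.html" \
  "https://docs.aztec.network/aztec-nr-api/mainnet/example_crate/struct.Example.html"
do
  if is_stopped "$url"; then
    echo "skip  $url"
  else
    echo "crawl $url"
  fi
done
```

Under that assumption, the first URL is skipped because its final path segment has the `global.<name>.html` shape, while the struct page is still crawled.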
