Merged
22 changes: 20 additions & 2 deletions .github/workflows/deploy-site.yml
@@ -41,13 +41,30 @@ jobs:
 run: |
   set -euo pipefail

+  # HTML pages and the three metadata files share a 5-minute edge TTL
+  # with stale-while-revalidate for 1 hour. A Cloudflare Cache Rule on
+  # the zone enables caching for these specific paths (extensionless
+  # HTML isn't cached by Cloudflare default); see infrastructure notes
+  # in apps/web/README.md. Static assets (favicons, og-image, logo)
+  # keep Cloudflare's default static-asset cache behavior - no header.
+  CACHE_HTML="public, max-age=300, stale-while-revalidate=3600"
+
   # Non-HTML assets: copy the finite Astro output directly. Do not use
   # `aws s3 sync` at the bucket root because the same bucket also stores
   # the million-plus scraper-managed /documents/ and /extracted/ objects.
   find apps/web/dist -type f ! -name "*.html" -print0 | while IFS= read -r -d '' f; do
     rel="${f#apps/web/dist/}"
-    aws s3 cp "$f" "$R2_BUCKET/$rel" \
-      --endpoint-url "$R2_ENDPOINT"
+    case "$rel" in
+      sitemap.xml|robots.txt|llms.txt)
+        aws s3 cp "$f" "$R2_BUCKET/$rel" \
+          --cache-control "$CACHE_HTML" \
+          --endpoint-url "$R2_ENDPOINT"
+        ;;
+      *)
+        aws s3 cp "$f" "$R2_BUCKET/$rel" \
+          --endpoint-url "$R2_ENDPOINT"
+        ;;
+    esac
   done

   # HTML files: upload to extensionless keys to match canonical URLs.
@@ -62,5 +79,6 @@ jobs:
     fi
     aws s3 cp "$f" "$R2_BUCKET/$key" \
       --content-type "text/html; charset=utf-8" \
+      --cache-control "$CACHE_HTML" \
       --endpoint-url "$R2_ENDPOINT"
   done
25 changes: 25 additions & 0 deletions apps/web/README.md
@@ -22,3 +22,28 @@ Verify before changing `robots.txt`:
curl -I https://docxcorp.us/documents/000014a959f5225c658740fd7915cd50c5728c9cbe06c7d72d79a9708244ec1f.docx
curl -I https://docxcorp.us/extracted/000014a959f5225c658740fd7915cd50c5728c9cbe06c7d72d79a9708244ec1f.txt
```

## HTML Edge Caching

Cloudflare does not cache HTML by default. A zone-level Cache Rule enables caching
for HTML pages, the homepage, and the three metadata files. Static assets
(favicons, og-image, logo) keep Cloudflare's default static-asset cache behavior.

The Cache Rule is set outside this repo:

- Phase: `http_request_cache_settings`
- Expression: `http.host eq "docxcorp.us" and (http.request.uri.path eq "/" or http.request.uri.path in {"/dataset" "/classification" "/quality" "/download" "/types" "/topics" "/sitemap.xml" "/robots.txt" "/llms.txt"} or starts_with(http.request.uri.path, "/types/") or starts_with(http.request.uri.path, "/topics/"))`
- Action: `cache: true`, `edge_ttl.mode: bypass_by_default` (TTL driven by origin `Cache-Control`)
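
The path portion of the expression above can be approximated locally with a shell `case` statement, which is handy for checking whether a new route would be covered. This is only a sketch: `matches_cache_rule` is a hypothetical helper, the `http.host` check is not modelled, and the example paths under `/types/` are made up for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: returns 0 if a URL path would match the path portion
# of the Cache Rule expression, 1 otherwise. The host check is not modelled.
matches_cache_rule() {
  case "$1" in
    /) return 0 ;;                                                  # homepage
    /dataset|/classification|/quality|/download|/types|/topics) return 0 ;;
    /sitemap.xml|/robots.txt|/llms.txt) return 0 ;;                 # metadata files
    /types/*|/topics/*) return 0 ;;                                 # starts_with(...)
    *) return 1 ;;
  esac
}

# Illustrative paths only; "/types/legal" is an assumed route name.
for p in / /dataset /types/legal /documents/abc.docx /favicon.ico; do
  if matches_cache_rule "$p"; then
    echo "$p: matched"
  else
    echo "$p: not matched"
  fi
done
```

If a route prints `not matched` here but is uploaded with the HTML `Cache-Control` header, the expression probably needs updating.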

The matching upload-side `Cache-Control` header (`public, max-age=300, stale-while-revalidate=3600`)
is set by `deploy-site.yml` on HTML uploads and on `sitemap.xml`/`robots.txt`/`llms.txt`. Keep the
two sides in sync: if you add a new cacheable HTML route, add it to the expression above.
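
The upload-side decision can be summarised as a small function, which makes it easier to compare against the Cache Rule. This is a sketch under assumptions: `cache_control_for` is a hypothetical name (the workflow inlines this logic across two loops), and the file names in the loop are examples rather than a listing of the real build output.

```bash
#!/usr/bin/env bash
CACHE_HTML="public, max-age=300, stale-while-revalidate=3600"

# Hypothetical helper: prints the Cache-Control value for a path relative to
# apps/web/dist, or nothing when no header is set and Cloudflare's default
# static-asset behavior applies.
cache_control_for() {
  case "$1" in
    *.html|sitemap.xml|robots.txt|llms.txt) printf '%s\n' "$CACHE_HTML" ;;
    *) ;;  # static assets: no Cache-Control header
  esac
}

# Example file names are assumptions for illustration.
for rel in index.html dataset.html sitemap.xml favicon.ico; do
  printf '%-14s -> %s\n' "$rel" "$(cache_control_for "$rel")"
done
```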

Verify after deploy:

```bash
# Two requests to the same URL in a row: expect MISS (or EXPIRED) on a cold cache, then HIT.
URL="https://docxcorp.us/dataset"
curl -sI "$URL" | grep -i "cf-cache-status\|cache-control"
curl -sI "$URL" | grep -i "cf-cache-status"
```
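
To script the check rather than read the output by eye, the status value can be pulled out of the response headers. The following is a minimal sketch: `cache_status` is a hypothetical helper, and the headers fed to it are canned so the example runs without network access.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads raw response headers on stdin and prints the
# cf-cache-status value (header name matched case-insensitively, CRs stripped).
cache_status() {
  tr -d '\r' | awk -F': *' 'tolower($1) == "cf-cache-status" { print $2 }'
}

# Canned headers standing in for `curl -sI "$URL"` output.
printf 'HTTP/2 200\r\ncache-control: public, max-age=300, stale-while-revalidate=3600\r\ncf-cache-status: HIT\r\n\r\n' | cache_status
```

Against the live site, `curl -sI "$URL" | cache_status` should print `HIT` for the second of two back-to-back requests, provided both land within the 5-minute TTL.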