chore: edge cache headers + agent-aware robots + bot-block middleware (#119)
Conversation
Cuts Vercel hobby edge-request burn on both skillkit (Vite static) and skillkit-docs (Next.js fumadocs) projects. The site stays human-first AND agent-first: Firecrawl, Context7, Crawl4AI, OpenClaw, Hermes, ChatGPT-User, Claude-User, and PerplexityBot are explicitly allowed; training crawlers and SEO scrapers are blocked.

skillkit (Vite, docs/skillkit):
- vercel.json: Cache-Control on assets (1d browser, 7d edge, 30d SWR; immutable for hashed /assets/*), HTML (5min browser, 1d edge, 7d SWR), and /api JSON
- redirects with has user-agent: deflect known training+SEO bots to /robots.txt; a negative lookahead on source prevents a /robots.txt redirect loop
- public/robots.txt: explicit allow + deny lists

skillkit-docs (Next.js, docs/fumadocs):
- next.config.mjs: headers() block for /_next/static (1y immutable), assets (1d/7d/30d), /docs/* and / (5min/1d/7d)
- src/middleware.ts: UA-aware allow/deny pipeline; allowed agents pass through, blocked bots get a 403 with a cache header so the Vercel edge serves the rejection cheaply
- public/robots.txt: same allow/deny list as the marketing site
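For readers skimming the PR, the cache policy described above translates to roughly this shape in vercel.json (the source matchers here are illustrative, not the exact ones from the diff):

```json
{
  "headers": [
    {
      "source": "/assets/(.*)",
      "headers": [
        { "key": "Cache-Control", "value": "public, max-age=31536000, immutable" }
      ]
    },
    {
      "source": "/(.*)",
      "headers": [
        { "key": "Cache-Control", "value": "public, max-age=300, s-maxage=86400, stale-while-revalidate=604800" }
      ]
    }
  ]
}
```

Here max-age=300 / s-maxage=86400 / stale-while-revalidate=604800 corresponds to the 5min browser / 1d edge / 7d SWR figures in the description.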
📝 Walkthrough: Bot and crawler management policies are implemented across two documentation projects.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
import { NextResponse } from 'next/server';
import type { NextRequest } from 'next/server';

const BLOCK = /GPTBot|ClaudeBot|anthropic-ai|CCBot|Google-Extended|Applebot-Extended|Bytespider|Amazonbot|Meta-ExternalAgent|cohere-ai|Diffbot|ImagesiftBot|Omgilibot|peer39_crawler|YouBot|Timpibot|ICC-Crawler|AhrefsBot|SemrushBot|MJ12bot|DotBot|PetalBot|BLEXBot|MegaIndex|SeznamBot|DataForSeoBot/i;
🟡 FacebookBot missing from middleware BLOCK regex despite being in robots.txt Disallow list
The robots.txt at docs/fumadocs/public/robots.txt:78-79 explicitly disallows FacebookBot, but the BLOCK regex in the middleware omits it. This means FacebookBot will pass through the middleware (falling through to the default NextResponse.next() at line 17) and serve content normally, undermining the intended bot-blocking enforcement.
- const BLOCK = /GPTBot|ClaudeBot|anthropic-ai|CCBot|Google-Extended|Applebot-Extended|Bytespider|Amazonbot|Meta-ExternalAgent|cohere-ai|Diffbot|ImagesiftBot|Omgilibot|peer39_crawler|YouBot|Timpibot|ICC-Crawler|AhrefsBot|SemrushBot|MJ12bot|DotBot|PetalBot|BLEXBot|MegaIndex|SeznamBot|DataForSeoBot/i;
+ const BLOCK = /GPTBot|ClaudeBot|anthropic-ai|CCBot|Google-Extended|Applebot-Extended|Bytespider|Amazonbot|FacebookBot|Meta-ExternalAgent|cohere-ai|Diffbot|ImagesiftBot|Omgilibot|peer39_crawler|YouBot|Timpibot|ICC-Crawler|AhrefsBot|SemrushBot|MJ12bot|DotBot|PetalBot|BLEXBot|MegaIndex|SeznamBot|DataForSeoBot/i;
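To make the gap concrete, here is a quick standalone check (a hypothetical harness, not part of the PR; the regexes are abbreviated from the middleware's full lists):

```typescript
// Abbreviated version of the middleware's BLOCK regex, without FacebookBot.
const BLOCK = /GPTBot|ClaudeBot|CCBot|Amazonbot|Meta-ExternalAgent/i;

// The same regex with the missing token added, as the suggested fix does.
const FIXED = /GPTBot|ClaudeBot|CCBot|Amazonbot|FacebookBot|Meta-ExternalAgent/i;

// A representative FacebookBot user-agent string (illustrative).
const ua = 'FacebookBot/1.0';

console.log(BLOCK.test(ua)); // false: slips through to the default NextResponse.next()
console.log(FIXED.test(ua)); // true: caught and rejected
```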
{
  "source": "/((?!robots\\.txt$).*)",
  "has": [
    { "type": "header", "key": "user-agent", "value": "(?i).*(GPTBot|ClaudeBot|anthropic-ai|CCBot|Google-Extended|Applebot-Extended|Bytespider|Amazonbot|Meta-ExternalAgent|cohere-ai|Diffbot|ImagesiftBot|Omgilibot|peer39_crawler|YouBot|Timpibot|ICC-Crawler|AhrefsBot|SemrushBot|MJ12bot|DotBot|PetalBot|BLEXBot|MegaIndex|SeznamBot|DataForSeoBot).*" }
🟡 FacebookBot missing from vercel.json redirect user-agent pattern despite being in robots.txt Disallow list
The robots.txt at docs/skillkit/public/robots.txt:78-79 explicitly disallows FacebookBot, but the user-agent regex in the vercel.json redirect rule omits it. This means FacebookBot requests will not be redirected to /robots.txt and will be served content normally, undermining the intended bot-blocking enforcement for the skillkit docs site.
- { "type": "header", "key": "user-agent", "value": "(?i).*(GPTBot|ClaudeBot|anthropic-ai|CCBot|Google-Extended|Applebot-Extended|Bytespider|Amazonbot|Meta-ExternalAgent|cohere-ai|Diffbot|ImagesiftBot|Omgilibot|peer39_crawler|YouBot|Timpibot|ICC-Crawler|AhrefsBot|SemrushBot|MJ12bot|DotBot|PetalBot|BLEXBot|MegaIndex|SeznamBot|DataForSeoBot).*" }
+ { "type": "header", "key": "user-agent", "value": "(?i).*(GPTBot|ClaudeBot|anthropic-ai|CCBot|Google-Extended|Applebot-Extended|Bytespider|Amazonbot|FacebookBot|Meta-ExternalAgent|cohere-ai|Diffbot|ImagesiftBot|Omgilibot|peer39_crawler|YouBot|Timpibot|ICC-Crawler|AhrefsBot|SemrushBot|MJ12bot|DotBot|PetalBot|BLEXBot|MegaIndex|SeznamBot|DataForSeoBot).*" }
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/fumadocs/src/middleware.ts`:
- Around line 4-11: The middleware currently tests ALLOW before BLOCK which lets
mixed UAs bypass blocking and also omits FacebookBot from BLOCK; update the
middleware function to first test BLOCK (e.g., if (BLOCK.test(ua)) return
NextResponse.rewrite(.../NextResponse.redirect/NextResponse.next with block
code) so block precedence wins, then test ALLOW afterwards, and add the missing
"FacebookBot" token to the BLOCK RegExp constant so explicit disallowed agents
are caught; keep using the existing symbols BLOCK, ALLOW, middleware,
NextRequest and NextResponse to locate and modify the code.
In `@docs/skillkit/vercel.json`:
- Around line 57-65: The redirect blocklist in the "redirects" array (the object
that has the header with "key": "user-agent" and the "value" regex) is missing
FacebookBot whereas robots.txt disallows it; update the user-agent regex value
to include FacebookBot (add the token "FacebookBot" into the alternation list)
so the redirect that maps to "/robots.txt" will also match and block
FacebookBot, ensuring the header-based redirect and
docs/skillkit/public/robots.txt remain consistent.
📒 Files selected for processing (5)
- docs/fumadocs/next.config.mjs
- docs/fumadocs/public/robots.txt
- docs/fumadocs/src/middleware.ts
- docs/skillkit/public/robots.txt
- docs/skillkit/vercel.json
const BLOCK = /GPTBot|ClaudeBot|anthropic-ai|CCBot|Google-Extended|Applebot-Extended|Bytespider|Amazonbot|Meta-ExternalAgent|cohere-ai|Diffbot|ImagesiftBot|Omgilibot|peer39_crawler|YouBot|Timpibot|ICC-Crawler|AhrefsBot|SemrushBot|MJ12bot|DotBot|PetalBot|BLEXBot|MegaIndex|SeznamBot|DataForSeoBot/i;

const ALLOW = /Googlebot|Bingbot|DuckDuckBot|Applebot(?!-Extended)|ChatGPT-User|OAI-SearchBot|PerplexityBot|Perplexity-User|Claude-User|Claude-SearchBot|FirecrawlAgent|firecrawl|Context7Bot|Crawl4AI|Clawdbot|OpenClaw|Hermes/i;

export function middleware(req: NextRequest) {
  const ua = req.headers.get('user-agent') || '';
  if (ALLOW.test(ua)) return NextResponse.next();
  if (BLOCK.test(ua)) {
Blocklist/enforcement mismatch and precedence bug in UA checks.
FacebookBot is disallowed in docs/fumadocs/public/robots.txt (Line 78) but missing from BLOCK (Line 4). Also, Line 10 checks ALLOW before BLOCK, so a mixed UA containing both patterns can bypass blocking.
Suggested fix
-const BLOCK = /GPTBot|ClaudeBot|anthropic-ai|CCBot|Google-Extended|Applebot-Extended|Bytespider|Amazonbot|Meta-ExternalAgent|cohere-ai|Diffbot|ImagesiftBot|Omgilibot|peer39_crawler|YouBot|Timpibot|ICC-Crawler|AhrefsBot|SemrushBot|MJ12bot|DotBot|PetalBot|BLEXBot|MegaIndex|SeznamBot|DataForSeoBot/i;
+const BLOCK = /GPTBot|ClaudeBot|anthropic-ai|CCBot|Google-Extended|Applebot-Extended|Bytespider|Amazonbot|FacebookBot|Meta-ExternalAgent|cohere-ai|Diffbot|ImagesiftBot|Omgilibot|peer39_crawler|YouBot|Timpibot|ICC-Crawler|AhrefsBot|SemrushBot|MJ12bot|DotBot|PetalBot|BLEXBot|MegaIndex|SeznamBot|DataForSeoBot/i;
export function middleware(req: NextRequest) {
const ua = req.headers.get('user-agent') || '';
- if (ALLOW.test(ua)) return NextResponse.next();
if (BLOCK.test(ua)) {
return new NextResponse('disallowed by robots.txt', {
status: 403,
headers: { 'Cache-Control': 'public, max-age=86400' },
});
}
+ if (ALLOW.test(ua)) return NextResponse.next();
return NextResponse.next();
}
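The precedence issue can be demonstrated in isolation (abbreviated regexes; the combined user-agent string is a hypothetical example, not an observed one):

```typescript
// Abbreviated from the middleware's BLOCK and ALLOW lists.
const BLOCK = /GPTBot|ClaudeBot|CCBot/i;
const ALLOW = /ChatGPT-User|PerplexityBot|Claude-User/i;

// Hypothetical UA containing both an allowed and a blocked token.
const ua = 'Mozilla/5.0 (compatible; GPTBot/1.2; ChatGPT-User)';

// Current order: ALLOW is tested first, so the blocked token never matters.
function currentOrder(ua: string): number {
  if (ALLOW.test(ua)) return 200;
  if (BLOCK.test(ua)) return 403;
  return 200;
}

// Suggested order: BLOCK takes precedence.
function fixedOrder(ua: string): number {
  if (BLOCK.test(ua)) return 403;
  if (ALLOW.test(ua)) return 200;
  return 200;
}

console.log(currentOrder(ua)); // 200: mixed UA bypasses the block
console.log(fixedOrder(ua));   // 403: block precedence wins
```

Note that the BLOCK-first order still passes a plain ChatGPT-User UA, since "GPTBot" does not substring-match "ChatGPT-User".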
"redirects": [
  {
    "source": "/((?!robots\\.txt$).*)",
    "has": [
      { "type": "header", "key": "user-agent", "value": "(?i).*(GPTBot|ClaudeBot|anthropic-ai|CCBot|Google-Extended|Applebot-Extended|Bytespider|Amazonbot|Meta-ExternalAgent|cohere-ai|Diffbot|ImagesiftBot|Omgilibot|peer39_crawler|YouBot|Timpibot|ICC-Crawler|AhrefsBot|SemrushBot|MJ12bot|DotBot|PetalBot|BLEXBot|MegaIndex|SeznamBot|DataForSeoBot).*" }
    ],
    "destination": "/robots.txt",
    "permanent": false
  }
FacebookBot is disallowed in robots but not matched in redirect blocklist.
Line 61 omits FacebookBot, while docs/skillkit/public/robots.txt (Line 78) disallows it. This creates policy drift and allows that crawler to bypass this redirect control.
Suggested fix
- { "type": "header", "key": "user-agent", "value": "(?i).*(GPTBot|ClaudeBot|anthropic-ai|CCBot|Google-Extended|Applebot-Extended|Bytespider|Amazonbot|Meta-ExternalAgent|cohere-ai|Diffbot|ImagesiftBot|Omgilibot|peer39_crawler|YouBot|Timpibot|ICC-Crawler|AhrefsBot|SemrushBot|MJ12bot|DotBot|PetalBot|BLEXBot|MegaIndex|SeznamBot|DataForSeoBot).*" }
+ { "type": "header", "key": "user-agent", "value": "(?i).*(GPTBot|ClaudeBot|anthropic-ai|CCBot|Google-Extended|Applebot-Extended|Bytespider|Amazonbot|FacebookBot|Meta-ExternalAgent|cohere-ai|Diffbot|ImagesiftBot|Omgilibot|peer39_crawler|YouBot|Timpibot|ICC-Crawler|AhrefsBot|SemrushBot|MJ12bot|DotBot|PetalBot|BLEXBot|MegaIndex|SeznamBot|DataForSeoBot).*" }
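The negative-lookahead source pattern that prevents the redirect loop can be sanity-checked by modeling it as an anchored JavaScript regex (an approximation of Vercel's path matching, not the exact engine):

```typescript
// Model of the vercel.json source "/((?!robots\\.txt$).*)" with explicit anchors.
const source = /^\/((?!robots\.txt$).*)$/;

console.log(source.test('/robots.txt'));    // false: robots.txt never matches, so no loop
console.log(source.test('/docs/intro'));    // true: other paths are deflected
console.log(source.test('/assets/app.js')); // true
```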
Summary
Two Vercel projects covered: skillkit (Vite marketing site, docs/skillkit/) and skillkit-docs (Next.js fumadocs, docs/fumadocs/).

skillkit (Vite, docs/skillkit/):
- vercel.json headers: Cache-Control on assets (1d browser, 7d edge, 30d SWR), /assets/* immutable for 1y (Vite hashes filenames), HTML (5min/1d/7d), /api JSON (5min/1d/7d)
- vercel.json redirects: deflect known training+SEO bots to /robots.txt via a has user-agent matcher; the source uses a negative lookahead so /robots.txt itself never matches (no redirect loop)
- public/robots.txt: explicit allow + deny lists

skillkit-docs (Next.js fumadocs, docs/fumadocs/):
- next.config.mjs headers(): /_next/static/* immutable for 1y, assets 1d/7d/30d, /docs/* and / 5min/1d/7d
- src/middleware.ts: UA allow-list passes through, deny-list returns 403 with Cache-Control: max-age=86400 so the rejection itself is edge-cached
- public/robots.txt: same agent-aware allow + deny lists

Why
Vercel hobby plan hit edge-request and Web Analytics caps. Per Vercel usage chart for Apr:
docs/skillkit/vercel.json rewrites /_next/* and /docs/* to https://skillkit-docs.vercel.app/..., so every docs pageview fires on both projects. Compounding both with cache headers + bot deflection is needed.

Bot policy (both projects)
Allowed: Googlebot, Bingbot, DuckDuckBot, Applebot, ChatGPT-User, OAI-SearchBot, PerplexityBot, Perplexity-User, Claude-User, Claude-SearchBot, FirecrawlAgent, Context7Bot, Crawl4AI, Clawdbot, OpenClaw, Hermes, plus the default User-agent: * allow.

Disallowed: GPTBot, ClaudeBot, anthropic-ai, CCBot, Google-Extended, Applebot-Extended, Bytespider, Amazonbot, Meta-ExternalAgent, cohere-ai, Diffbot, ImagesiftBot, Omgilibot, peer39_crawler, YouBot, Timpibot, ICC-Crawler, AhrefsBot, SemrushBot, MJ12bot, DotBot, PetalBot, BLEXBot, MegaIndex, SeznamBot, DataForSeoBot.
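A minimal sketch of the shared robots.txt shape described above (agent names come from the lists in this PR; the grouping and ordering here are illustrative, not the exact file contents):

```txt
# Explicitly allowed agents
User-agent: ChatGPT-User
User-agent: PerplexityBot
User-agent: Claude-User
Allow: /

# Training crawlers and SEO scrapers
User-agent: GPTBot
User-agent: CCBot
User-agent: SemrushBot
Disallow: /

# Default: allow
User-agent: *
Allow: /
```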
Test plan
- https://<preview-skillkit>.vercel.app/robots.txt returns expected content
- https://<preview-skillkit-docs>.vercel.app/robots.txt returns expected content
- HTML responses include Cache-Control: public, max-age=86400, ...; /_next/static/* includes Cache-Control: public, max-age=31536000, immutable
- curl -A 'SemrushBot/7.0' https://<preview-skillkit> returns 307 to /robots.txt
- curl -A 'SemrushBot/7.0' https://<preview-skillkit>/robots.txt returns 200 (no loop)
- curl -A 'SemrushBot/7.0' https://<preview-docs>/docs/... returns 403
- curl -A 'ChatGPT-User' https://<preview-docs>/docs/... returns 200
- curl -A 'FirecrawlAgent/1.0' https://<preview-docs>/docs/... returns 200