Commit e01ca74

fix(seo): add robots.txt and llms.txt, deploy them to R2

- Create robots.txt allowing all crawlers with sitemap reference, blocking AI training crawlers (CCBot, Google-Extended, Bytespider, Applebot-Extended, meta-externalagent) while keeping retrieval bots
- Create llms.txt describing the dataset for LLM-friendly discovery
- Update deploy-site workflow to upload both files to R2

1 parent 6b6983d commit e01ca74

3 files changed

Lines changed: 70 additions & 0 deletions

.github/workflows/deploy-site.yml

Lines changed: 2 additions & 0 deletions
@@ -32,6 +32,8 @@ jobs:
 # Static files
 aws s3 cp apps/web/index.html "$R2_BUCKET/index.html" --endpoint-url "$R2_ENDPOINT"
 aws s3 cp apps/web/sitemap.xml "$R2_BUCKET/sitemap.xml" --endpoint-url "$R2_ENDPOINT"
+aws s3 cp apps/web/robots.txt "$R2_BUCKET/robots.txt" --endpoint-url "$R2_ENDPOINT"
+aws s3 cp apps/web/llms.txt "$R2_BUCKET/llms.txt" --endpoint-url "$R2_ENDPOINT"
 aws s3 cp apps/web/public/ "$R2_BUCKET/public/" --recursive --endpoint-url "$R2_ENDPOINT"

 # Generated type pages (clean URLs: /types/legal, not /types/legal.html)

apps/web/llms.txt

Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
+# docx-corpus
+
+> The largest classified corpus of Word documents on the public web.
+
+docx-corpus is an open dataset of 736K+ .docx files collected from the public web, classified into 10 document types and 9 topics across 46+ languages. It is built for document processing research, NLP benchmarking, and training models that work with real-world Word documents.
+
+Documents are classified using ModernBERT with an average confidence of 82%.
+
+Built by [SuperDoc](https://www.superdoc.dev), the document rendering engine.
+
+## Document Types
+
+- [Legal](https://docxcorp.us/types/legal): Contracts, agreements, legal notices, court filings
+- [Forms](https://docxcorp.us/types/forms): Application forms, surveys, questionnaires, fillable templates
+- [Educational](https://docxcorp.us/types/educational): Course materials, syllabi, assignments, lecture notes
+- [Administrative](https://docxcorp.us/types/administrative): Meeting minutes, agendas, organizational documents
+- [Policies](https://docxcorp.us/types/policies): Policy documents, procedures, guidelines, handbooks
+- [Correspondence](https://docxcorp.us/types/correspondence): Letters, memos, formal communications
+- [Reports](https://docxcorp.us/types/reports): Annual reports, research reports, financial reports
+- [Reference](https://docxcorp.us/types/reference): Reference materials, glossaries, directories, catalogs
+- [Technical](https://docxcorp.us/types/technical): Technical documentation, specifications, user manuals
+- [Creative](https://docxcorp.us/types/creative): Creative writing, marketing materials, newsletters
+
+## Topics
+
+- [Government](https://docxcorp.us/topics/government): Public administration and civic organizations
+- [Education](https://docxcorp.us/topics/education): Schools, universities, research institutions
+- [Healthcare](https://docxcorp.us/topics/healthcare): Hospitals, clinics, health organizations
+- [General](https://docxcorp.us/topics/general): Cross-sector documents
+- [Legal / Judicial](https://docxcorp.us/topics/legal_judicial): Law firms, courts, regulatory bodies
+- [Finance](https://docxcorp.us/topics/finance): Banks, investment firms, insurance
+- [Environment](https://docxcorp.us/topics/environment): Environmental agencies, sustainability
+- [Nonprofit](https://docxcorp.us/topics/nonprofit): NGOs, charities, foundations
+- [Technology](https://docxcorp.us/topics/technology): Tech companies, software, IT
+
+## Links
+
+- Homepage: https://docxcorp.us
+- Browse all types: https://docxcorp.us/types
+- Browse all topics: https://docxcorp.us/topics
+- GitHub: https://github.com/superdoc-dev/docx-corpus
+- HuggingFace: https://huggingface.co/datasets/superdoc-dev/docx-corpus
+- API: https://api.docxcorp.us
+- Takedown requests: help@docxcorp.us
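
The llms.txt above groups markdown link items under `##` headings. As a rough illustration of how a consumer might read that layout, here is a minimal Python sketch that collects the `- [title](url): description` items per heading. The function name `parse_llms_txt`, the regular expression, and the shortened `SAMPLE` text are assumptions made for this example; they are not part of the commit or of any official llms.txt tooling.

```python
import re

# Matches list items of the form "- [Title](https://example): optional description".
LINK_RE = re.compile(
    r"^- \[(?P<title>[^\]]+)\]\((?P<url>[^)]+)\)(?:: (?P<desc>.*))?$"
)


def parse_llms_txt(text):
    """Group markdown link items under the H2 heading they appear beneath."""
    sections = {}
    current = None
    for line in text.splitlines():
        if line.startswith("## "):
            current = line[3:].strip()
            sections[current] = []
        elif current is not None:
            match = LINK_RE.match(line)
            if match:
                sections[current].append(
                    (match["title"], match["url"], match["desc"])
                )
    return sections


# Shortened excerpt of the committed file, used only for illustration.
SAMPLE = """\
# docx-corpus

> The largest classified corpus of Word documents on the public web.

## Document Types

- [Legal](https://docxcorp.us/types/legal): Contracts, agreements, legal notices, court filings
- [Forms](https://docxcorp.us/types/forms): Application forms, surveys, questionnaires, fillable templates

## Topics

- [Legal / Judicial](https://docxcorp.us/topics/legal_judicial): Law firms, courts, regulatory bodies

## Links

- Homepage: https://docxcorp.us
"""

if __name__ == "__main__":
    for heading, links in parse_llms_txt(SAMPLE).items():
        print(heading, len(links))
```

Note that this sketch only recognizes bracketed markdown links, so the bare-URL items under `## Links` produce an empty list. A stricter or looser parser is equally plausible; the choice here is just to keep the example short.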

apps/web/robots.txt

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
+# https://docxcorp.us/robots.txt
+
+User-agent: *
+Allow: /
+Sitemap: https://docxcorp.us/sitemap.xml
+
+# AI-friendly content available at /llms.txt
+# See https://llmstxt.org for the specification
+
+# Block AI training crawlers (we allow retrieval bots like GPTBot, ClaudeBot, Amazonbot)
+User-agent: CCBot
+Disallow: /
+
+User-agent: Google-Extended
+Disallow: /
+
+User-agent: Bytespider
+Disallow: /
+
+User-agent: Applebot-Extended
+Disallow: /
+
+User-agent: meta-externalagent
+Disallow: /
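
One way to sanity-check the rules above before deploying is to load them into Python's standard-library `urllib.robotparser` and ask which user agents may fetch a page. The sketch below is only an illustration: the helper names `build_parser` and `access_table` are made up for this example, the two informational comment lines are omitted from the embedded copy, and the standard-library parser matches user agents by case-insensitive substring, which may differ from how a particular crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Copy of the committed rules (informational comments omitted).
ROBOTS_TXT = """\
# https://docxcorp.us/robots.txt

User-agent: *
Allow: /
Sitemap: https://docxcorp.us/sitemap.xml

# Block AI training crawlers (we allow retrieval bots like GPTBot, ClaudeBot, Amazonbot)
User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: meta-externalagent
Disallow: /
"""


def build_parser(text):
    """Parse robots.txt text without making a network request."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def access_table(parser, agents, url):
    """Map each user agent to whether the rules let it fetch the URL."""
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    agents = ["CCBot", "Google-Extended", "GPTBot", "ClaudeBot", "Applebot"]
    for agent, allowed in access_table(
        parser, agents, "https://docxcorp.us/types/legal"
    ).items():
        print(f"{agent}: {'allowed' if allowed else 'blocked'}")
    print(parser.site_maps())
```

This only reports what the file permits for each name. Whether a given bot is used for training or for retrieval is defined by its operator's documentation, not by robots.txt itself.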
