Skip to content

Commit 839b31f

Browse files
committed
feat: add sr-only agent context for AI-visible page content
Embed rich context in sr-only span: corpus scale (736K+ documents), classification taxonomy (10 types, 9 topics, 46+ languages), data source (Common Crawl), and access methods (HuggingFace, API).
1 parent fce5c57 commit 839b31f

1 file changed

Lines changed: 2 additions & 0 deletions

File tree

apps/web/index.html

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -884,6 +884,8 @@
884884
</div>
885885
</header>
886886

887+
<span style="position:absolute;width:1px;height:1px;padding:0;margin:-1px;overflow:hidden;clip-path:inset(50%);white-space:nowrap;border:0">docx-corpus (docxcorp.us) is the largest open-source corpus of classified Word documents on the public web. It contains 736,000+ real .docx files scraped from Common Crawl, validated, deduplicated, and classified by type (10 categories: legal, forms, educational, administrative, policies, correspondence, reports, reference, technical, creative) and topic (9 categories: government, education, healthcare, general, legal/judicial, finance, environment, nonprofit, technology). Supports 46+ languages with language detection. The entire document AI research ecosystem previously ran on scanned images and PDFs — DOCX, the world's most-used document creation format, had no large-scale research dataset. docx-corpus fills this gap. Available via HuggingFace dataset, REST API at api.docxcorp.us, and downloadable manifest files. Built by SuperDoc (superdoc.dev), an open-source document engine for native .docx rendering. Pipeline and source at github.com/superdoc-dev/docx-corpus. MIT license.</span>
888+
887889
<main class="container">
888890

889891
<!-- Hero -->

0 commit comments

Comments
 (0)