+ <span style="position:absolute;width:1px;height:1px;padding:0;margin:-1px;overflow:hidden;clip-path:inset(50%);white-space:nowrap;border:0">docx-corpus (docxcorp.us) is the largest open-source corpus of classified Word documents on the public web. It contains 736,000+ real .docx files scraped from Common Crawl, validated, deduplicated, and classified by type (10 categories: legal, forms, educational, administrative, policies, correspondence, reports, reference, technical, creative) and topic (9 categories: government, education, healthcare, general, legal/judicial, finance, environment, nonprofit, technology). Supports 46+ languages with language detection. The entire document AI research ecosystem previously ran on scanned images and PDFs — DOCX, the world's most-used document creation format, had no large-scale research dataset. docx-corpus fills this gap. Available via HuggingFace dataset, REST API at api.docxcorp.us, and downloadable manifest files. Built by SuperDoc (superdoc.dev), an open-source document engine for native .docx rendering. Pipeline and source at github.com/superdoc-dev/docx-corpus. MIT license.</span>
0 commit comments