| 1 | +# docx-corpus |
| 2 | + |
| 3 | +> The largest classified corpus of Word documents on the public web. |
| 4 | + |
| 5 | +docx-corpus is an open dataset of 736K+ .docx files collected from the public web, classified into 10 document types and 9 topics across 46+ languages. It is built for document processing research, NLP benchmarking, and training models that work with real-world Word documents. |
| 6 | + |
| 7 | +Documents are classified with ModernBERT; the average classification confidence is 82%. |
| 8 | + |
| 9 | +Built by [SuperDoc](https://www.superdoc.dev), the document rendering engine. |
| 10 | + |
| 11 | +## Document Types |
| 12 | + |
| 13 | +- [Legal](https://docxcorp.us/types/legal): Contracts, agreements, legal notices, court filings |
| 14 | +- [Forms](https://docxcorp.us/types/forms): Application forms, surveys, questionnaires, fillable templates |
| 15 | +- [Educational](https://docxcorp.us/types/educational): Course materials, syllabi, assignments, lecture notes |
| 16 | +- [Administrative](https://docxcorp.us/types/administrative): Meeting minutes, agendas, organizational documents |
| 17 | +- [Policies](https://docxcorp.us/types/policies): Policy documents, procedures, guidelines, handbooks |
| 18 | +- [Correspondence](https://docxcorp.us/types/correspondence): Letters, memos, formal communications |
| 19 | +- [Reports](https://docxcorp.us/types/reports): Annual reports, research reports, financial reports |
| 20 | +- [Reference](https://docxcorp.us/types/reference): Reference materials, glossaries, directories, catalogs |
| 21 | +- [Technical](https://docxcorp.us/types/technical): Technical documentation, specifications, user manuals |
| 22 | +- [Creative](https://docxcorp.us/types/creative): Creative writing, marketing materials, newsletters |
| 23 | + |
| 24 | +## Topics |
| 25 | + |
| 26 | +- [Government](https://docxcorp.us/topics/government): Public administration and civic organizations |
| 27 | +- [Education](https://docxcorp.us/topics/education): Schools, universities, research institutions |
| 28 | +- [Healthcare](https://docxcorp.us/topics/healthcare): Hospitals, clinics, health organizations |
| 29 | +- [General](https://docxcorp.us/topics/general): Cross-sector documents |
| 30 | +- [Legal / Judicial](https://docxcorp.us/topics/legal_judicial): Law firms, courts, regulatory bodies |
| 31 | +- [Finance](https://docxcorp.us/topics/finance): Banks, investment firms, insurance |
| 32 | +- [Environment](https://docxcorp.us/topics/environment): Environmental agencies, sustainability |
| 33 | +- [Nonprofit](https://docxcorp.us/topics/nonprofit): NGOs, charities, foundations |
| 34 | +- [Technology](https://docxcorp.us/topics/technology): Tech companies, software, IT |
| 35 | + |
| 36 | +## Links |
| 37 | + |
| 38 | +- Homepage: https://docxcorp.us |
| 39 | +- Browse all types: https://docxcorp.us/types |
| 40 | +- Browse all topics: https://docxcorp.us/topics |
| 41 | +- GitHub: https://github.com/superdoc-dev/docx-corpus |
| 42 | +- HuggingFace: https://huggingface.co/datasets/superdoc-dev/docx-corpus |
| 43 | +- API: https://api.docxcorp.us |
| 44 | +- Takedown requests: help@docxcorp.us |
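
The type and topic links listed above appear to follow one pattern: `https://docxcorp.us/types/<slug>` and `https://docxcorp.us/topics/<slug>`, where the slug is the lowercased label with separators turned into underscores (for example, "Legal / Judicial" becomes `legal_judicial`). The following is a minimal sketch of a helper that rebuilds those links. The names `slugify` and `browse_url` are hypothetical, and the slug rule is inferred only from the links shown here, so it may not hold for labels outside this list.

```python
import re

# Homepage listed under "Links"; everything else here is illustrative.
BASE_URL = "https://docxcorp.us"


def slugify(label: str) -> str:
    """Lowercase a label and collapse non-alphanumeric runs into underscores.

    Assumption: this rule is inferred from the listed links
    (e.g. "Legal / Judicial" -> "legal_judicial"), not from a published spec.
    """
    return re.sub(r"[^a-z0-9]+", "_", label.lower()).strip("_")


def browse_url(kind: str, label: str) -> str:
    """Build a browse link for a document type or topic label."""
    if kind not in ("types", "topics"):
        raise ValueError("kind must be 'types' or 'topics'")
    return f"{BASE_URL}/{kind}/{slugify(label)}"


if __name__ == "__main__":
    print(browse_url("types", "Legal"))
    print(browse_url("topics", "Legal / Judicial"))
```

The helper only constructs strings and does not contact the site, so it cannot confirm that a given page exists.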