This document provides licensing information for all external data sources used in the WorldAlphabets project.
WorldAlphabets aggregates data from multiple open-source and openly-licensed sources. All data sources used in this project permit redistribution, though some require attribution or have share-alike provisions.
The unified frequency data pipeline (scripts/build_top200_unified.py) uses the following sources in priority order:
Priority Order:
- Leipzig Corpora Collection (77 languages)
- HermitDave FrequencyWords (48 languages)
- Mozilla CommonVoice (130+ languages)
- Tatoeba Sentences (73 languages)
- Existing alphabet frequency data (character-level fallback)
- Simia unigrams (CJK character data)
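The priority order above amounts to a first-match lookup: the pipeline takes data from the highest-ranked source that has any. The following is a minimal sketch of that idea, not the actual logic of scripts/build_top200_unified.py. The short source labels and the `pick_source` helper are hypothetical names introduced for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Labels for the sources in the priority order listed above.
# These names are illustrative, not identifiers from the real pipeline.
SOURCE_PRIORITY: List[str] = [
    "leipzig",
    "hermitdave",
    "commonvoice",
    "tatoeba",
    "alphabet_fallback",
    "simia",
]


def pick_source(available: Dict[str, List[str]]) -> Optional[Tuple[str, List[str]]]:
    """Return (source, words) for the highest-priority source that has data."""
    for source in SOURCE_PRIORITY:
        words = available.get(source)
        if words:  # skip sources that are missing or returned nothing
            return source, words
    return None
```

With this sketch, a language that has both Tatoeba and HermitDave data would resolve to HermitDave, and a language with no data at all would yield `None`.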
Source: Wortschatz Leipzig, University of Leipzig
Website: https://wortschatz.uni-leipzig.de/
Download Portal: https://downloads.wortschatz-leipzig.de/
License: Creative Commons Attribution (CC BY)
Languages Covered: 77 languages (dynamically discovered via catalogue)
License Details:
- The text corpora offered for download are made available under the Creative Commons licence CC BY
- Attribution required when using the data
- Commercial use is permitted
- Modifications are permitted
- Redistribution: ✅ Allowed with attribution
Terms of Use: https://wortschatz.uni-leipzig.de/en/usage
Usage in WorldAlphabets:
- Primary source for high-quality frequency data
- Accessed via dynamic catalogue scraping from https://corpora.wortschatz-leipzig.de/
- Downloads corpus archives in .tar.gz format
- Extracts word frequency lists from -words.txt files within archives
- Prioritizes corpus types: community > news > mixed > web > newscrawl > wikipedia
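The corpus-type preference can be expressed as a simple ranking over candidate corpus names. The sketch below assumes names follow a `<language>_<type>_<year>_<size>` layout (for example `eng_news_2020_1M`), possibly with a hyphenated suffix on the type. Both that layout and the helper names `corpus_type` and `best_corpus` are assumptions for illustration, not taken from the project code.

```python
from typing import List, Optional

# Corpus types in the preference order stated above.
CORPUS_TYPE_PRIORITY = ["community", "news", "mixed", "web", "newscrawl", "wikipedia"]


def corpus_type(name: str) -> Optional[str]:
    """Extract the corpus type from a name, or None if it is not recognised."""
    # Assumed layout: <language>_<type>_<year>_<size>; a type such as
    # "newscrawl-public" is reduced to its part before the hyphen.
    parts = name.split("_")
    if len(parts) < 2:
        return None
    token = parts[1].split("-")[0]
    return token if token in CORPUS_TYPE_PRIORITY else None


def best_corpus(names: List[str]) -> Optional[str]:
    """Pick the corpus whose type ranks highest; ties fall back to name order."""
    ranked = []
    for name in names:
        ctype = corpus_type(name)
        if ctype is not None:
            ranked.append((CORPUS_TYPE_PRIORITY.index(ctype), name))
    return min(ranked)[1] if ranked else None
```

Matching on the whole type token rather than a substring keeps `news` and `newscrawl` from being confused with each other.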
Attribution: Data from Leipzig Corpora Collection, Wortschatz Leipzig, University of Leipzig
Source: Hermit Dave
Repository: https://github.com/hermitdave/FrequencyWords
License:
- Code: MIT License
- Content: Creative Commons Attribution-ShareAlike 4.0 (CC-BY-SA-4.0)
Languages Covered: 48 languages
License Details:
- Content is derived from OpenSubtitles and Wikipedia
- Attribution required
- Share-alike: Derivative works must use the same license
- Commercial use is permitted
- Redistribution: ✅ Allowed with attribution and share-alike
Data Sources:
- OpenSubtitles 2016: http://opus.lingfil.uu.se/OpenSubtitles2016.php
- OpenSubtitles 2018: http://opus.nlpl.eu/OpenSubtitles2018.php
Usage in WorldAlphabets:
- Secondary source for frequency data
- Accessed via GitHub raw content URLs
- Provides word frequency lists in format:
{word} {frequency}
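A list in this format can be read by splitting each line on its last space. The sketch below is one possible way to do that. The function name `parse_frequency_lines` is hypothetical, and silently skipping malformed lines is a design assumption rather than documented pipeline behaviour.

```python
from typing import Dict, Iterable


def parse_frequency_lines(lines: Iterable[str]) -> Dict[str, int]:
    """Parse '{word} {frequency}' lines into a word -> count mapping."""
    freqs: Dict[str, int] = {}
    for line in lines:
        # Split on the last space so the count is always the final field.
        parts = line.strip().rsplit(" ", 1)
        if len(parts) != 2:
            continue  # blank line or a line with no count
        word, count = parts
        try:
            freqs[word] = int(count)
        except ValueError:
            continue  # final field is not an integer; skip the line
    return freqs
```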
Attribution: Data from FrequencyWords by Hermit Dave (https://github.com/hermitdave/FrequencyWords), licensed under CC-BY-SA-4.0
Source: Mozilla Foundation
Website: https://commonvoice.mozilla.org/
Data Portal: https://datacollective.mozillafoundation.org/
License: Creative Commons Zero (CC0) - Public Domain Dedication
Languages Covered: 130+ languages (Spontaneous Speech and regular datasets)
License Details:
- CC0 1.0 Universal Public Domain Dedication
- No attribution required (though appreciated)
- Commercial use is permitted
- Modifications are permitted
- Redistribution: ✅ Allowed without restrictions
Dataset Types:
- Regular CommonVoice: Validated speech recordings with transcriptions (cv-corpus-*/validated.tsv)
- Spontaneous Speech: Natural speech recordings with transcriptions (sps-corpus-*/ss-corpus-{lang}.tsv)
API Access: Mozilla Data Collective API
- Base URL: https://datacollective.mozillafoundation.org/api
- Requires API key authentication
- Two-step download process (create session, then download)
- Terms acceptance required per dataset via web interface
Usage in WorldAlphabets:
- Third-priority source for frequency data
- Accessed via Mozilla Data Collective API (scripts/fetch_commonvoice.py)
- Downloads complete dataset archives (.tar.gz format)
- Extracts transcriptions from TSV files
- Calculates word frequencies from speech transcriptions
- Particularly valuable for under-resourced and indigenous languages
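The archive-to-transcription step can be sketched with the standard library alone. The code below is an illustration, not the contents of scripts/fetch_commonvoice.py. It assumes the transcription sits in a TSV column named `sentence`. The function name and the `member_suffix` and `column` parameters are hypothetical, and the Spontaneous Speech files may need different values for both.

```python
import csv
import io
import tarfile
from typing import IO, Iterator


def iter_transcriptions(
    archive: IO[bytes],
    member_suffix: str = "validated.tsv",  # assumed file name inside the archive
    column: str = "sentence",              # assumed transcription column
) -> Iterator[str]:
    """Yield non-empty transcriptions from matching TSV files in a .tar.gz."""
    with tarfile.open(fileobj=archive, mode="r:gz") as tar:
        for member in tar:
            if not member.isfile() or not member.name.endswith(member_suffix):
                continue
            raw = tar.extractfile(member)
            if raw is None:
                continue
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            # QUOTE_NONE keeps quotation marks inside sentences untouched.
            reader = csv.DictReader(text, delimiter="\t", quoting=csv.QUOTE_NONE)
            for row in reader:
                sentence = (row.get(column) or "").strip()
                if sentence:
                    yield sentence
```

Reading members directly from the archive avoids unpacking audio files to disk when only the transcriptions are needed.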
Attribution: Speech transcription data from Mozilla CommonVoice (https://commonvoice.mozilla.org), licensed under CC0
Configured Languages (as of 2025-11):
- ady (Adyghe), ttj (Rutoro), tob (Toba Qom), meh (Southwestern Tlaxiaco Mixtec)
- top (Papantla Totonac), ukv (Kuku), seh (Sena), mel (Central Melanau)
- xkl (Kenyah), ruc (Ruuli), mmc (Michoacán Mazahua), msi (Sabah Malay)
Source: Tatoeba Association
Website: https://tatoeba.org/
Downloads: https://tatoeba.org/en/downloads
License: Creative Commons Attribution 2.0 France (CC-BY 2.0 FR)
Languages Covered: 73 languages
License Details:
- Textual sentences are under CC-BY 2.0 FR
- Attribution required
- Commercial use is permitted
- Modifications are permitted
- Redistribution: ✅ Allowed with attribution
Terms of Use: https://tatoeba.org/en/terms_of_use
Usage in WorldAlphabets:
- Fourth-priority source for frequency data, especially for under-resourced languages
- Sentences are tokenized and word frequencies are calculated
- Particularly valuable for languages lacking other frequency data sources
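Tokenizing sentences and counting words can be sketched as follows. The regular expression, the case-folding step, and the `top_words` helper are assumptions for illustration, and the project's actual tokenizer may differ. This approach only suits scripts that separate words with spaces and do not rely on combining marks inside words.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Runs of Unicode letters, optionally joined by an apostrophe (e.g. "don't").
_WORD = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")


def top_words(sentences: Iterable[str], n: int = 200) -> List[Tuple[str, int]]:
    """Return the n most frequent case-folded words across the sentences."""
    counts: Counter = Counter()
    for sentence in sentences:
        counts.update(_WORD.findall(sentence.casefold()))
    return counts.most_common(n)
```

The default of 200 is a guess based on the name of the pipeline script, not a documented setting.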
Attribution: Sentence data from Tatoeba (https://tatoeba.org), licensed under CC-BY 2.0 FR
Source: Denny Vrandečić (Simia.net)
Website: http://simia.net/letters/
Original Data Source: Wiktionary
License: Creative Commons Attribution-ShareAlike (CC-BY-SA)
Languages Covered: CJK character data and 262 language editions
License Details:
- Data extracted from Wikipedia using WikiExtractor
- Wiktionary content is under CC-BY-SA
- Attribution required
- Share-alike: Derivative works must use the same license
- Redistribution: ✅ Allowed with attribution and share-alike
Usage in WorldAlphabets:
- CJK character frequency data
- Fallback for character-level frequency information
- Stored in data/sources/unigrams/
Attribution: Character frequency data from Simia unigrams dataset (http://simia.net/letters/), derived from Wiktionary, licensed under CC-BY-SA
Source: Unicode Consortium
Website: https://cldr.unicode.org/
License: Unicode License Agreement
Usage in WorldAlphabets:
- Primary source for alphabet exemplar characters
- Locale-specific character sets
- Script information
Redistribution: ✅ Allowed under Unicode License
Source: Kalenchukov
Repository: https://github.com/kalenchukov/Alphabet
License: Apache License 2.0
Usage in WorldAlphabets:
- Supplementary alphabet data
- Character set definitions
Redistribution: ✅ Allowed under Apache 2.0
Source: Wikidata
Website: https://www.wikidata.org/
License: Creative Commons CC0 (Public Domain)
Usage in WorldAlphabets:
- Language-to-script mappings
- Stored in data/sources/wikidata_language_scripts.json
Redistribution: ✅ Allowed (Public Domain)
Source: SIL International
Website: https://iso639-3.sil.org/
License: Open Data Commons Attribution License (ODC-By)
Usage in WorldAlphabets:
- Language code mappings
- Language registry data
- Stored in data/sources/iso-639-3.tab
Redistribution: ✅ Allowed with attribution
Source: Unicode Consortium
Website: https://unicode.org/
License: Unicode License Agreement
Usage in WorldAlphabets:
- Character properties
- Script definitions
- Normalization data
Redistribution: ✅ Allowed under Unicode License
Source: Kbdlayout.info
Website: https://kbdlayout.info/
License: Various (per-layout, generally permissive)
Usage in WorldAlphabets:
- Keyboard layout definitions
- Key mappings for different languages
Source: Wikimedia Foundation
Website: https://wikipedia.org/
License: Creative Commons Attribution-ShareAlike 3.0 (CC-BY-SA-3.0)
Usage in WorldAlphabets:
- Some supplementary language data
- Referenced in various data sources
Redistribution: ✅ Allowed with attribution and share-alike
All data sources used in WorldAlphabets are compatible with the project's MIT License for the following reasons:
- CC-BY (Leipzig): Requires attribution only - compatible with MIT
- CC-BY-SA (HermitDave, Simia, Wikipedia): Requires attribution and share-alike for derivatives - our redistribution maintains original licenses
- CC-BY 2.0 FR (Tatoeba): Requires attribution - compatible with MIT
- Apache 2.0 (Kalenchukov): Compatible with MIT
- CC0 (CommonVoice, Wikidata): Public domain - no restrictions
- Unicode License: Permissive, allows redistribution
- ODC-By (ISO 639-3): Requires attribution - compatible with MIT
All data sources explicitly permit redistribution under their respective licenses. When redistributing:
- Maintain attribution for all sources (see attribution text above)
- Preserve license information for CC-BY-SA content
- Include this documentation or equivalent attribution in distributions
- Do not remove copyright notices or license information from source files
When using WorldAlphabets data, please include:
This project uses data from:
- Leipzig Corpora Collection (CC-BY) - Wortschatz Leipzig, University of Leipzig
- FrequencyWords by Hermit Dave (CC-BY-SA-4.0) - https://github.com/hermitdave/FrequencyWords
- Tatoeba (CC-BY 2.0 FR) - https://tatoeba.org
- Simia unigrams dataset (CC-BY-SA) - http://simia.net/letters/ (derived from Wiktionary)
- Kalenchukov/Alphabet (Apache 2.0) - https://github.com/kalenchukov/Alphabet
- Unicode CLDR and Character Database (Unicode License)
- ISO 639-3 (ODC-By) - SIL International
- Wikidata (CC0)
This document reflects the licensing status as of the date of last update. License terms may change over time. Always verify current license terms at the source websites.
Last Updated: 2025-11-11
Maintained By: WorldAlphabets Project
Contact: https://github.com/willwade/WorldAlphabets
- Project License: MIT License (see LICENSE file)
- Data Pipeline Documentation: docs/DATA_PIPELINE.md
- Main README: README.md