All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- with_decision option for the score function
- columns option for the unzip function
- support for Python 3.14
- modernize packaging and CLI scripts
- removed support for Python 3.8
3.3.1 - 2025-12-10
- changed HtmlTagFilter to accept broken tags in order to avoid inconsistent results
- update required opus-fast-mosestokenizer version for better platform support
- broken citations and links in the documentation
3.3.0 - 2025-07-02
- letter_words_only option for duplicate detection
- support for selection of possible languages for lingua language detection
- support for language detection based on heliport (a port of HeLI-OTS)
- update library requirements and include Python 3.13 tests
- removed pycld2 and fasttext from the [all] extras
- refactored all language identification filters to have their own classes and marked LanguageIDFilter as deprecated
- fix broken import in opusfilter-scores
3.2.0 - 2024-08-14
- make pycld2 and fasttext libraries optional
- replace langid.py library with py3langid
- update github workflows and include Python 3.12 tests
- OpusRead interface using moses format (requires opustools >= 1.6.2)
3.1.0 - 2024-06-05
- support for lingua-based language detection (#65)
- removed Python 3.7 support
- fix score method in SentenceEmbeddingFilter (#71)
- fix filter and filterfalse methods in SentenceEmbeddingFilter
3.0.0 - 2023-10-11
- opusfilter-autogen script for automatic filter config generation
- score_direction, accept_threshold, and reject_threshold properties for filters
- refactor code and move auxiliary methods to opusfilter.util
- update varikn installation instructions (installable from PyPI)
- update github workflows and include Python 3.11 tests
- update library version requirements to support Python 3.11
- use xxhash instead of pyhash for hash functions
- use opus-fast-mosestokenizer instead of fast-mosestokenizer
- install eflomal from PyPI and use the new interface in WordAlignFilter
- removed Python 3.6 support
- catch NotImplementedError from beautifulsoup 4.11.2
- catch ParserRejectedMarkup from beautifulsoup 4.12.0
2.6.0 - 2022-11-30
- add slice missing from the enabled steps
- improve documentation
- import slow libraries only when needed
- use chunks for the filter method of SentenceEmbeddingFilter
- change RepetitionFilter to use single score for consistency with the threshold
- allow float thresholds for AverageWordLengthFilter
- remove unnecessary code from RegExpSub
- add setuptools version requirement
2.5.1 - 2022-09-28
- add missing document file
2.5.0 - 2022-09-28
- map_space_to option for Jieba and MeCab tokenizers to preserve existing space characters in input
- parallel processing options for filter, score, and preprocess steps
- re-organize documentation and support building it with sphinx
- catch TypeError exceptions from BeautifulSoup in HtmlTagFilter
2.4.0 - 2022-04-05
- an option to write filter scores to a file with opusfilter-test
- new filters: AlphabetRatioFilter, RegExpFilter, SimilarityFilter, SentenceEmbeddingFilter
- support for Japanese word segmentation using MeCab as a tokenizer
- preprocessing methods for subword segmentation (BPESegmentation, MorfessorSegmentation)
- subword segmentation support for the n-gram language models and language model filters
- allow per-language parameters for LengthFilter, LengthRatioFilter, LongWordFilter, and AverageWordLengthFilter
- fix documentation for train_alignment parameters
2.3.1 - 2022-01-28
- fix bug in classifier training without development set
2.3.0 - 2022-01-18
- new OpusFilterRuntimeError exception for having e.g. empty training data
- option to save scores from the training data when creating word alignment priors
- RepetitionFilter for filtering segments with repeated substrings
- new preprocessor for sentence splitting monolingual data
- method-specific options for LanguageIDFilter
- chunksize option to the common section
- LMClassifierFilter for classification based on n-gram language models
- add workdir attribute to the FilterABC base class and require filters to use it for any file parameters
- increase default chunksize in FilterPipeline from 10000 to 100000
- refactor and clean up code
2.2.0 - 2021-11-23
- support for Chinese word segmentation using jieba as a tokenizer (#27)
2.1.2 - 2021-11-11
- fix wrong keyword argument name in opusfilter-duplicates
2.1.1 - 2021-10-19
- move "How to contribute" to docs/CONTRIBUTING.md
- fix setuptools requirement (#21)
- fix version requirement for pandas (>=1.0.0)
2.1.0 - 2021-08-31
- replace PyYAML with ruamel.yaml
- support for variables in the YAML configuration (#13)
- support for fasttext-based language detection (#20)
- suppress_prompts parameter for opus_read (#19)
- download and write steps
- "How to contribute" section to README.md
- changelog
- bibliography and improved references
2.0.0 - 2021-06-01
- extend to n-lingual parallel data instead of just bilingual data
- switch tokenizer to fast-mosestokenizer
- new commands: opusfilter-diagram, opusfilter-duplicates, opusfilter-test
- new filters: LongestCommonSubstringFilter, AverageWordLengthFilter
- new steps: preprocess
- set "latest" as the default corpus release for opus_read (#5)
- overlap option for remove_duplicates
- lower threshold option for CrossEntropyFilter
- github CI workflow for flake8 and unittests
- behaviour of simple filters on empty segments
1.0.1 - 2020-05-25
- improved logging, documentation, and project files
- prevent UnboundLocalError for empty output after filter
1.0.0 - 2020-04-10
First tagged version.