This guide contains essential information for agents working on the OpusFilter codebase.
OpusFilter is a tool for filtering and processing parallel corpora. It's written in Python and supports filtering through various methods including language identification, language models, word alignment, and custom filters.
- Run all tests: `pytest tests/`
- Run a single test file: `pytest tests/test_filename.py`
- Run a specific test: `pytest tests/test_filename.py::ClassName::test_method`
- Run with verbose output: `pytest -v tests/`
- Check style: `flake8 opusfilter/`
- Run flake8 with statistics: `flake8 opusfilter --count --exit-zero --statistics`
- Run full linting (recommended for contributions): `pylint opusfilter/`
- Format code: `black opusfilter/ opusfilter/cli/`
- Sort imports: `isort opusfilter/ opusfilter/cli/`
- Type checking: `mypy opusfilter/ opusfilter/cli/`
- Install with test dependencies: `pip install .[test]`
- Install optional dependencies for full functionality: `pip install .[test,eflomal,jieba,mecab,varikn]`
- Follow PEP 8 with a maximum line length of 127 characters (not 79)
- Support Python 3.9 to 3.14
- Use type hints where appropriate (see existing code for patterns)
- Maintain backward compatibility when possible
- Do not add trailing whitespace

```python
# Standard library imports
import os
import logging
from typing import Iterator, List, Tuple

# Third-party imports
import numpy as np
import pandas as pd
import regex

# Local imports
from . import FilterABC, ConfigurationError
from .util import file_open, count_lines
```

- Classes: PascalCase (e.g., `LengthFilter`, `CrossEntropyFilter`)
- Functions/Methods: snake_case (e.g., `get_length`, `score_pairs`)
- Variables: snake_case (e.g., `min_length`, `accept_threshold`)
- Constants: UPPER_SNAKE_CASE (e.g., `CLEAN_LOW`, `CLEAN_HIGH`)
- Private members: Leading underscore (e.g., `_internal_method`)
- Use custom exception classes from `opusfilter.__init__`:
  - `OpusFilterError`: Base exception
  - `ConfigurationError`: For configuration-related errors
  - `OpusFilterRuntimeError`: For runtime errors
- Always include descriptive error messages
- Use logging for warnings and debug information
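
A minimal sketch of how these error-handling points might combine. The exception classes below are local stand-ins that mirror the names listed above, defined only so the snippet runs without OpusFilter installed; `check_threshold` is a hypothetical helper, not part of the codebase.

```python
import logging

logger = logging.getLogger(__name__)


# Stand-ins for the classes listed above, defined here only so the sketch
# runs on its own; real code would import them from the opusfilter package.
class OpusFilterError(Exception):
    """Base exception."""


class ConfigurationError(OpusFilterError):
    """Configuration-related error."""


def check_threshold(name, value, minimum=0.0, maximum=1.0):
    """Hypothetical helper: validate a numeric threshold parameter."""
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ConfigurationError(
            f"Parameter '{name}' must be a number, got {value!r}")
    if not minimum <= value <= maximum:
        raise ConfigurationError(
            f"Parameter '{name}' must be between {minimum} and {maximum}, got {value}")
    if value in (minimum, maximum):
        # Legal but suspicious: warn instead of failing.
        logger.warning("Parameter '%s' is at the edge of its range: %s", name, value)
    return value
```
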
When creating new filters, follow this pattern:

```python
class NewFilter(FilterABC):
    """Brief description of the filter"""

    score_direction = CLEAN_LOW  # or CLEAN_HIGH, CLEAN_BETWEEN, etc.
    accept_threshold = <value>
    reject_threshold = <value>

    def __init__(self, required_param, optional_param=default, **kwargs):
        # Validate parameters
        self.required_param = required_param
        self.optional_param = optional_param
        super().__init__(**kwargs)

    def score(self, pairs):
        """Yield scores for each sentence pair"""
        for pair in pairs:
            # Calculate score
            yield score

    def accept(self, score):
        """Return True if score passes the filter"""
        return <condition>
```

- Write unit tests for all new functionality
- Use descriptive test method names
- Test both success and failure cases
- Use `unittest.TestCase` as the base class
- Mark tests that require optional dependencies with appropriate skips
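
The testing points above could look roughly like this in practice. `word_count` is a hypothetical function under test, and the skip condition assumes the optional dependency is `jieba` (one of the extras named in the install command); `importlib.util.find_spec` is used to check whether it is available.

```python
import importlib.util
import unittest


def word_count(sentence):
    """Hypothetical function under test: count whitespace-separated tokens."""
    if not isinstance(sentence, str):
        raise TypeError(f"Expected a string, got {type(sentence).__name__}")
    return len(sentence.split())


class TestWordCount(unittest.TestCase):

    def test_counts_whitespace_separated_tokens(self):
        self.assertEqual(word_count("a short sentence"), 3)

    def test_empty_string_has_zero_tokens(self):
        self.assertEqual(word_count(""), 0)

    def test_non_string_input_raises_type_error(self):
        # Failure case: invalid input should raise, not return a number.
        with self.assertRaises(TypeError):
            word_count(None)

    @unittest.skipIf(importlib.util.find_spec("jieba") is None,
                     "jieba is not installed")
    def test_segmented_chinese_is_counted(self):
        import jieba
        tokens = jieba.lcut("我爱北京")
        self.assertEqual(word_count(" ".join(tokens)), len(tokens))
```
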
- Use a module-level logger: `logger = logging.getLogger(__name__)`
- Log important operations and warnings
- Avoid excessive debug logging in production code
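
As a rough illustration of these logging points, the sketch below emits one summary warning rather than a log line per pair. `drop_empty_pairs` is a hypothetical helper invented for this example.

```python
import logging

logger = logging.getLogger(__name__)


def drop_empty_pairs(pairs):
    """Hypothetical helper: yield only pairs whose segments are all non-empty."""
    dropped = 0
    for pair in pairs:
        if all(segment.strip() for segment in pair):
            yield pair
        else:
            dropped += 1
    if dropped:
        # One summary warning instead of a debug line per pair.
        logger.warning("Dropped %d pairs with empty segments", dropped)
```
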
- Use `file_open()` from `opusfilter.util` for handling compressed files
- Support common formats: plain text, gzip, bzip2, lzma
- Use context managers for file operations
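
To make the idea concrete without depending on OpusFilter, here is a sketch built on the standard library only. `open_by_extension` is a hypothetical stand-in that picks an opener from the file extension; it is not the implementation or signature of `file_open()`, which should be checked in `opusfilter.util`.

```python
import bz2
import gzip
import lzma
import os
import tempfile

# Openers keyed by file extension; anything else is treated as plain text.
_OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}


def open_by_extension(path, mode="rt", encoding="utf-8"):
    """Hypothetical stand-in: pick an opener from the file extension."""
    opener = _OPENERS.get(os.path.splitext(path)[1], open)
    return opener(path, mode, encoding=encoding)


def write_then_count(path, lines):
    """Write lines to path, then read them back and return the line count."""
    # Context managers close the files even if an error occurs.
    with open_by_extension(path, "wt") as output:
        output.writelines(line + "\n" for line in lines)
    with open_by_extension(path) as source:
        return sum(1 for _ in source)


with tempfile.TemporaryDirectory() as tmpdir:
    for name in ("corpus.txt", "corpus.gz", "corpus.bz2", "corpus.xz"):
        count = write_then_count(os.path.join(tmpdir, name), ["first", "second"])
        print(name, count)
```
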
- Filters receive configuration through `__init__` parameters
- Use `**kwargs` to capture extra parameters and warn about them
- Validate parameters and raise `ConfigurationError` for invalid values
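
The three points above might fit together as in the following sketch. `FilterBase` and `ConfigurationError` are local stand-ins so the snippet runs on its own, and the warning issued by the base class is an assumption about how leftover keyword arguments are handled; `MaxLengthFilter` is a hypothetical filter, not one shipped with OpusFilter.

```python
import logging

logger = logging.getLogger(__name__)


class ConfigurationError(Exception):
    """Stand-in for the ConfigurationError class of the package."""


class FilterBase:
    """Stand-in base class; assumed to warn about leftover keyword arguments."""

    def __init__(self, **kwargs):
        if kwargs:
            logger.warning("Ignoring extra keyword arguments: %s", sorted(kwargs))


class MaxLengthFilter(FilterBase):
    """Hypothetical filter: accept pairs whose segments have at most max_length words."""

    def __init__(self, max_length=100, **kwargs):
        # Validate before storing, so a bad configuration fails early.
        if isinstance(max_length, bool) or not isinstance(max_length, int) or max_length < 1:
            raise ConfigurationError(
                f"max_length must be a positive integer, got {max_length!r}")
        self.max_length = max_length
        super().__init__(**kwargs)

    def score(self, pairs):
        """Yield a list of word counts for each sentence pair."""
        for pair in pairs:
            yield [len(segment.split()) for segment in pair]

    def accept(self, score):
        """Return True if every segment is within the length limit."""
        return all(length <= self.max_length for length in score)
```

With this sketch, `MaxLengthFilter(max_length=2, typo_param=1)` still constructs the filter but logs a warning naming `typo_param`, while `MaxLengthFilter(max_length=0)` raises `ConfigurationError`.
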
- Use type hints for public APIs
- Import common types from the `typing` module
- Follow existing patterns for complex types
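
A small sketch of type-hinted public functions using names from `typing`. The `Pair` alias and both functions are hypothetical and only illustrate the style; the existing code remains the reference for actual patterns.

```python
from typing import Iterable, Iterator, List, Tuple

Pair = Tuple[str, ...]  # hypothetical alias for one aligned sentence pair


def pair_lengths(pairs: Iterable[Pair]) -> Iterator[List[int]]:
    """Yield the character length of every segment in each pair."""
    for pair in pairs:
        yield [len(segment) for segment in pair]


def longest_segment(pairs: Iterable[Pair]) -> int:
    """Return the length of the longest segment, or 0 for empty input."""
    return max((max(lengths, default=0) for lengths in pair_lengths(pairs)), default=0)
```
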
- Add docstrings to all public classes and methods
- Use Google-style or NumPy-style docstrings
- Include examples in docstrings when helpful
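
For reference, a Google-style docstring with an embedded example could look like the sketch below. `length_ratio` is a hypothetical function written for this illustration.

```python
def length_ratio(source: str, target: str) -> float:
    """Return the ratio of the longer segment's word count to the shorter one's.

    Args:
        source: Source-language segment.
        target: Target-language segment.

    Returns:
        The ratio as a float, always >= 1.0. If either segment is empty,
        ``float("inf")`` is returned.

    Example:
        >>> length_ratio("one two three four", "uno dos")
        2.0
    """
    lengths = sorted((len(source.split()), len(target.split())))
    if lengths[0] == 0:
        return float("inf")
    return lengths[1] / lengths[0]
```

Because the example uses doctest syntax, it can be checked by running `python -m doctest` on the file that contains it.
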
- `opusfilter/`: Main package directory
  - `filters.py`: Core filter implementations
  - `preprocessors.py`: Text preprocessing functions
  - `lm.py`: Language model filters
  - `lid.py`: Language identification filters
  - `embeddings.py`: Sentence embedding filters
  - `word_alignment.py`: Word alignment filters
  - `util.py`: Utility functions and helpers
  - `pipeline.py`: Processing pipeline implementation
  - `opusfilter.py`: Main application entry point
- `tests/`: Unit tests
- `docs/`: Documentation source files
- `work/`: Example configurations and working files

Iterating over sentence pairs:

```python
for pair in pairs:
    # pair is a list/tuple of strings
    # pair[0] is source, pair[1] is target, etc.
    ...
```

Yielding one score, or a list of scores, per pair:

```python
def score(self, pairs):
    for pair in pairs:
        # Calculate score(s)
        yield score_value  # or yield [score1, score2, ...]
```

Accepting or rejecting based on the score(s):

```python
def accept(self, score):
    if isinstance(score, list):
        # Multiple scores
        return all(condition(s) for s in score)
    else:
        # Single score
        return condition(score)
```

- Make pull requests to the `develop` branch (not `master`)
- Include tests for new features
- Run linting and tests before submitting
- Follow conventional commit messages if possible