Performs a character replacement in strings according to the configuration key1=value1,key2=value2,...
- Python >= 3.7 version (only stdlib used).
- https://pypi.org/project/pyahocorasick (Optional, if use aho_corasick_c)
python3 -m main datasets/config_simple.txt datasets/sample_simple.txt --method aho_corasick- config – file with key=value pairs (each pair on a new line, order determines priority).
- sample – file with replacement text (line-by-line processing).
- --method – replacement method:
- cascading – cascading replacement (each key is applied to the result of the previous one). Defaults.
- single_pass – single-pass search of keys for each position.
- regexp – single regular expression with alternatives.
- aho_corasick – Aho-Corasick algorithm (recommended for large numbers of keys).
- aho_corasick_c – Aho-Corasick pyahocorasick (C) wrapper (the fastest implementation for large numbers of keys).
The result is output to stdout: first the lines in which substitutions occurred (sorted by descending number of substitutions), then the unchanged lines.
from replacers import AhoCorasickReplacer, Replacements
replacements = Replacements([("ab", "X"), ("bc", "Y")])
replacer = AhoCorasickReplacer(replacements)
new_line, replaced_count = replacer.apply("abc")
print(new_line) # Xc
print(replaced_count) # 2| Method | Complexity | Description |
|---|---|---|
| cascading | O(K * L) | Sequential str.replace. May produce false replacements (cascade). Fast, but not correct for disjoint requirements. |
| single_pass | O(L * K) | Enumerate all keys for each position. Correct, but slow for large K. |
| regexp | O(L) | One pass using re.sub. Takes into account the order of alternatives. |
| aho_corasick | O(L + M) | Building of an automaton (M is the total length of keys) and linear search. Optimal for a large number of keys. |
| aho_corasick_с | O(L + M) | AhoCorasick C wrapper (https://pypi.org/project/pyahocorasick). Optimal for a large number of keys. |
| L - length of the string, K - number of keys | ||
Make datasets:
python3 -m tests.make_datasets datasets/config.txt datasets/sample.txt --total_pairs 1000 --total_lines 1000Launch all tests:
python3 -m unittest discover testsThe tests check:
- Correctness of substitution (comparison of results between modes).
- Non-overlapping and priority keys.
- Handling of empty strings, special characters, and edge cases.
- Working with the configuration file parser
- Performance benchmark and comparison results between replacers.
| Replacer | Time (s) | Ratio(x) | Replaced chars | Replacements length | Text length |
| aho_corasick_c | 0.004124 | x1.00 | 4891 | 1000 | 1000 |
| aho_corasick | 0.017762 | x4.31 | 4891 | 1000 | 1000 |
| cascading | 0.112779 | x27.35 | 4885 | 1000 | 1000 |
| regexp | 0.229525 | x55.66 | 4891 | 1000 | 1000 |
| single_pass | 6.491263 | x1574.18 | 4891 | 1000 | 1000 |
main.py # Entry point.
parsers.py # Config parsing.
replacers/
├── __init__.py
├── aho_corasick.py # Aho‑Corasick algorithms.
├── cascading.py # Cascade replacing.
├── regexp.py # Regexp sub replacing.
├── single_pass.py # Single-pass replacing.
├── type_defs.py # Common types.
tests/
├── make_datasets.py # Make datasets files.
├── test_benchmark.py # Benchmark comparison.
├── test_parsers.py # Test parsing.
└── test_replacers.py # Test replacers.
datasets/ # Sample files.