Skip to content

Uter88/replacer

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Substrings replacer

Performs a character replacement in strings according to the configuration key1=value1,key2=value2,...

Requirements

Launch

From console

python3 -m main datasets/config_simple.txt datasets/sample_simple.txt --method aho_corasick
  • config – file with key=value pairs (each pair on a new line, order determines priority).
  • sample – file with replacement text (line-by-line processing).
  • --method – replacement method:
    • cascading – cascading replacement (each key is applied to the result of the previous one). Defaults.
    • single_pass – single-pass search of keys for each position.
    • regexp – single regular expression with alternatives.
    • aho_corasick – Aho-Corasick algorithm (recommended for large numbers of keys).
    • aho_corasick_c – Aho-Corasick pyahocorasick (C) wrapper (the fastest implementation for large numbers of keys).

The result is output to stdout: first the lines in which substitutions occurred (sorted by descending number of substitutions), then the unchanged lines.

From code

from replacers import AhoCorasickReplacer, Replacements

replacements = Replacements([("ab", "X"), ("bc", "Y")])
replacer = AhoCorasickReplacer(replacements)

new_line, replaced_count = replacer.apply("abc")
print(new_line)        # Xc
print(replaced_count)  # 2

Algorithms

Method Complexity Description
cascading O(K * L) Sequential str.replace. May produce false replacements (cascade). Fast, but not correct for disjoint requirements.
single_pass O(L * K) Enumerate all keys for each position. Correct, but slow for large K.
regexp O(L) One pass using re.sub. Takes into account the order of alternatives.
aho_corasick O(L + M) Building of an automaton (M is the total length of keys) and linear search. Optimal for a large number of keys.
aho_corasick_с O(L + M) AhoCorasick C wrapper (https://pypi.org/project/pyahocorasick). Optimal for a large number of keys.
L - length of the string, K - number of keys

Testing

Make datasets:

python3 -m tests.make_datasets datasets/config.txt datasets/sample.txt --total_pairs 1000 --total_lines 1000

Launch all tests:

python3 -m unittest discover tests

The tests check:

  • Correctness of substitution (comparison of results between modes).
  • Non-overlapping and priority keys.
  • Handling of empty strings, special characters, and edge cases.
  • Working with the configuration file parser
  • Performance benchmark and comparison results between replacers.

Performance

1000 lines and 1000 replacements pairs
Replacer Time (s) Ratio(x) Replaced chars Replacements length Text length
aho_corasick_c 0.004124 x1.00 4891 1000 1000
aho_corasick 0.017762 x4.31 4891 1000 1000
cascading 0.112779 x27.35 4885 1000 1000
regexp 0.229525 x55.66 4891 1000 1000
single_pass 6.491263 x1574.18 4891 1000 1000

Project structure

main.py                   # Entry point.
parsers.py                # Config parsing.
replacers/
├── __init__.py
├── aho_corasick.py       # Aho‑Corasick algorithms.
├── cascading.py          # Cascade replacing.
├── regexp.py             # Regexp sub replacing.
├── single_pass.py        # Single-pass replacing.
├── type_defs.py          # Common types.
tests/
├── make_datasets.py      # Make datasets files.
├── test_benchmark.py     # Benchmark comparison.
├── test_parsers.py       # Test parsing.
└── test_replacers.py     # Test replacers.
datasets/                 # Sample files.

About

Substring replacer

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages