Commit 9cb11e8

Merge pull request #12 from lupodevelop/v1.2.2-fixes-and-corrections
V1.2.2 fixes and corrections
2 parents 498451a + a00a52f commit 9cb11e8

19 files changed

Lines changed: 1576 additions & 156 deletions

CHANGELOG.md

Lines changed: 29 additions & 0 deletions
@@ -2,6 +2,35 @@
 
 All notable changes to this project are documented in this file.
 
+## [1.2.2] - 2026-01-05
+### Added
+- Added internal helper `grapheme_len/1` to centralize grapheme cluster length computation and avoid repetitive `string.to_graphemes |> list.length` patterns.
+- **Experimental:** Implemented two substring search strategies optimized for different workloads:
+  - **KMP (Knuth–Morris–Pratt)**: prefix-table based search with both full-search (`kmp_search_all`) and early-exit index (`kmp_index_of`) variants. Good performance on long or highly repetitive patterns.
+  - **Sliding-match**: a non-allocating sliding-window matcher (`sliding_search_all`) with an early-exit index variant (`sliding_index_of`). Often fastest for short, non-repetitive patterns.
+- **Experimental, opt-in APIs:** `index_of_auto` and `count_auto`: heuristic-based automatic selection between KMP and Sliding (experimental; disabled by default). Added explicit APIs for deterministic control: `index_of_strategy` and `count_strategy` (accept `core.Kmp` or `core.Sliding`).
+- Added `src/str/config.gleam` with tunable thresholds (`kmp_min_pattern_len`, `kmp_large_text_threshold`, `kmp_large_text_min_pat`, `kmp_border_multiplier`) to allow projects to control heuristic behavior.
+
+### Style
+- Replaced direct grapheme-length patterns with `grapheme_len/1` where appropriate to improve readability and maintainability.
+
+### Tests
+- Added tests verifying grapheme-aware length behavior (ASCII, combining marks, ZWJ emoji sequences, regional flags, and long ASCII strings).
+- Added unit tests for KMP, Sliding, heuristic chooser, and explicit strategy APIs (`test/str_kmp_test.gleam`, `test/str_sliding_test.gleam`, `test/str_strategy_test.gleam`, `test/str_auto_test.gleam`, `test/str_strategy_explicit_test.gleam`). All tests pass locally (355 passed at time of change).
+
+### Performance & Benchmarking
+- Added BEAM-native benchmark harness (`scripts/bench_beam.erl`) and Python micro-benchmark (`scripts/bench_kmp.py`) to evaluate algorithmic trade-offs on the VM. The BEAM harness now records `max_border` (prefix-table max) in CSV output to help heuristic tuning.
+- Micro-optimizations: converted `prefix_eq_list` and `sliding_index_loop` to iterative implementations and removed redundant list reversals in KMP prefix table construction (also reduced `list.last` usage by tracking `k`), lowering per-iteration overhead and improving BEAM measurements.
+- Observed behavior from benchmarks:
+  - KMP performs very well on long and highly repetitive patterns (it no longer exhibits the prior pathological slowdown after optimizations).
+  - Sliding wins on short, largely random patterns.
+  - `index_of_auto` remains heuristic and may choose a non-optimal algorithm for some inputs; explicit strategy APIs are recommended for performance-critical code.
+
+### Fixed
+- Made `remove_prefix`, `remove_suffix`, `ensure_prefix` and `ensure_suffix` grapheme-aware to avoid splitting multi-codepoint graphemes (emoji, combining sequences); added tests to cover these cases.
+
+Contributed by: Daniele (`lupodevelop`)
+
 ## [1.2.1] - 2026-01-02
 ### Fixed
 - Made `repeat_str/2` iterative to avoid deep recursion and improve performance on large repetition counts.
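
The KMP entries in the 1.2.2 changelog describe a prefix-table search with a full-search variant and an early-exit variant. The Python sketch below (the language `scripts/bench_kmp.py` uses) illustrates that general technique. It is not the library's Gleam code. The function names mirror the changelog, but the signatures, the `None` result for a missing match, the empty-pattern behaviour, and the use of code-point rather than grapheme indices are assumptions made for illustration.

```python
def build_prefix_table(pattern):
    """Return the KMP prefix (border) table for `pattern`.

    table[i] is the length of the longest proper prefix of
    pattern[:i + 1] that is also a suffix of it.
    """
    table = [0] * len(pattern)
    k = 0  # length of the border currently being extended
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]  # fall back to the next shorter border
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table


def _kmp_matches(text, pattern):
    """Lazily yield the start index of each match, left to right."""
    if not pattern:
        return  # assumption: an empty pattern yields no matches
    table = build_prefix_table(pattern)
    k = 0  # number of pattern symbols matched so far
    for i, symbol in enumerate(text):
        while k > 0 and symbol != pattern[k]:
            k = table[k - 1]
        if symbol == pattern[k]:
            k += 1
        if k == len(pattern):
            yield i - k + 1
            k = table[k - 1]  # continue so overlapping matches are found


def kmp_search_all(text, pattern):
    """Full search: every match position, overlapping ones included."""
    return list(_kmp_matches(text, pattern))


def kmp_index_of(text, pattern):
    """Early exit: stop at the first match; None when there is none."""
    return next(_kmp_matches(text, pattern), None)
```

Because the code only indexes and compares elements, passing lists of grapheme clusters instead of `str` values would make the returned indices grapheme-based.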

EXAMPLES.md

Lines changed: 38 additions & 1 deletion
@@ -30,6 +30,43 @@ pub fn search_examples() {
 }
 ```
 
+### Experimental Search Strategies & Caching (1.2.2)
+
+```gleam
+import str/core
+
+pub fn search_strategy_examples() {
+  // 1) Use the automatic heuristic (experimental)
+  // The heuristic chooses between a sliding matcher and KMP based on
+  // pattern/text characteristics. It is opt-in and may choose a
+  // non-optimal strategy in some cases.
+  let auto = core.index_of_auto("some long text...", "pat")
+
+  // 2) Force a specific strategy: use this when performance is critical
+  // and you know which algorithm is better for your input shape.
+  let forced_kmp = core.index_of_strategy("long text...", "pattern", core.Kmp)
+  let forced_sliding = core.index_of_strategy("short text", "pat", core.Sliding)
+
+  // 3) Caching KMP maps: precompute pattern maps once and reuse them
+  // across multiple searches to avoid rebuilding prefix tables.
+  let pattern = "abababab..."
+  let maps = core.build_kmp_maps(pattern)
+  let pmap = maps.0
+  let pimap = maps.1
+
+  // Reuse maps across many texts
+  let idx1 = core.kmp_index_of_with_maps("first long text...", pattern, pmap, pimap)
+  let occurrences = core.kmp_search_all_with_maps("another text...", pmap, pimap)
+
+  // Guidance: prefer explicit strategy or caching in hot loops; use
+  // `index_of_auto` for convenience and exploratory testing.
+}
+```
+
+> Note: `index_of_auto` is experimental and its behavior depends on tunable
+> thresholds in `src/str/config.gleam`. For production-critical paths,
+> prefer `index_of_strategy` or precomputing maps via `build_kmp_maps`.
+
 ### Grapheme-Aware Length and String Checks (NEW in 1.1.0)
 
 ```gleam
@@ -394,7 +431,7 @@ This gives you full control over decomposition/normalization order.
 The project uses Gleam's test runner. Example commands:
 
 ```sh
-# run all tests (325 tests)
+# run all tests
 gleam test
 
 # run a single test file (shell navigation)

README.md

Lines changed: 27 additions & 0 deletions
@@ -114,6 +114,33 @@ pub fn main() {
 | `replace_first(text, old, new)` | `"aaa", "a", "b"` | `"baa"` |
 | `replace_last(text, old, new)` | `"aaa", "a", "b"` | `"aab"` |
 
+### ⚠️ Experimental: Search Strategies
+
+**Algorithms:**
+- **KMP**: optimized for long/repetitive patterns
+- **Sliding**: fast for short patterns, zero allocations
+
+**APIs:**
+
+| Function | Description |
+|----------|-------------|
+| `index_of_auto(text, pattern)` | Auto-select algorithm (heuristic) |
+| `index_of_strategy(text, pattern, Kmp\|Sliding)` | Explicit algorithm choice |
+| `count_auto(text, pattern, overlapping)` | Auto-select for counting |
+| `count_strategy(text, pattern, overlapping, Kmp\|Sliding)` | Explicit count algorithm |
+
+**Examples:**
+
+```gleam
+// Force KMP explicitly
+core.index_of_strategy("long text...", "pattern", core.Kmp)
+
+// Let heuristic decide (experimental)
+core.index_of_auto("some text", "pat")
+```
+
+> **Note:** `_auto` variants use heuristics and may not always choose optimally. For performance-critical code, use `_strategy` variants. Configure thresholds in `src/str/config.gleam`.
+
 ### 🧩 Splitting & Partitioning
 
 | Function | Example | Result |
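
The README table above lists explicit-strategy and auto-selecting counting functions with an `overlapping` flag. The Python sketch below shows one way such a dispatcher could fit together. Only the names `count_strategy`, `count_auto` and the two strategy labels come from the README. The `Strategy` enum, `choose_strategy`, its `min_pattern_len=64` threshold and the greedy non-overlapping rule are hypothetical, and the real heuristic in `src/str/config.gleam` may combine its thresholds quite differently.

```python
from enum import Enum


class Strategy(Enum):
    KMP = "kmp"
    SLIDING = "sliding"


def sliding_search_all(text, pattern):
    """Test the pattern at every window start without building slices."""
    n, m = len(text), len(pattern)
    if m == 0:
        return []
    return [i for i in range(n - m + 1) if text.startswith(pattern, i)]


def kmp_search_all(text, pattern):
    """Prefix-table (KMP) search returning all overlapping match starts."""
    m = len(pattern)
    if m == 0:
        return []
    table, k = [0] * m, 0
    for i in range(1, m):
        while k and pattern[i] != pattern[k]:
            k = table[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    matches, k = [], 0
    for i, ch in enumerate(text):
        while k and ch != pattern[k]:
            k = table[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == m:
            matches.append(i - m + 1)
            k = table[k - 1]
    return matches


def count_strategy(text, pattern, overlapping, strategy):
    """Count matches with an explicitly chosen algorithm."""
    search = kmp_search_all if strategy is Strategy.KMP else sliding_search_all
    positions = search(text, pattern)
    if overlapping:
        return len(positions)
    # Non-overlapping: greedily keep matches that start after the last kept one ends.
    count, next_free = 0, 0
    for pos in positions:
        if pos >= next_free:
            count += 1
            next_free = pos + len(pattern)
    return count


def choose_strategy(text, pattern, min_pattern_len=64):
    """Hypothetical chooser: send long patterns to KMP, the rest to sliding."""
    return Strategy.KMP if len(pattern) >= min_pattern_len else Strategy.SLIDING


def count_auto(text, pattern, overlapping):
    """Count matches, letting the hypothetical chooser pick the algorithm."""
    return count_strategy(text, pattern, overlapping, choose_strategy(text, pattern))
```

Both strategies return the same positions, so the choice only affects running time. That is why an explicit strategy is a safe option when the input shape is known.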

gleam.toml

Lines changed: 2 additions & 2 deletions
@@ -1,9 +1,9 @@
 name = "str"
-version = "1.2.1"
+version = "1.2.2"
 
 # Project metadata (fill or replace placeholders before publishing)
 description = "Unicode-aware string utilities for Gleam: grapheme-safe operations, pragmatic ASCII transliteration, and slug generation."
-licences = ["MIT"]
+licenses = ["MIT"]
 repository = { type = "github", user = "lupodevelop", repo = "str" }
 links = [{ title = "Repository", href = "https://github.com/lupodevelop/str" }]
 
99

scripts/README.md

Lines changed: 36 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,36 @@
1+
Benchmarks
2+
==========
3+
4+
This folder contains micro-benchmarks used to evaluate substring search
5+
heuristics (sliding-match vs KMP) on representative inputs.
6+
7+
Usage:
8+
9+
python3 scripts/bench_kmp.py --repeat 5
10+
11+
Notes:
12+
- These benchmarks are implementation-agnostic: they exercise algorithmic
13+
characteristics (O(nm) vs O(n+m)) using Python implementations.
14+
- They are intended to help choose heuristics (pattern length thresholds,
15+
repetitiveness heuristics) without modifying library code.
16+
17+
BEAM-native benchmark
18+
---------------------
19+
20+
You can also run a BEAM-native micro-benchmark that invokes the compiled
21+
Gleam modules directly on the Erlang VM. This is useful to measure the
22+
actual runtime performance on the target platform.
23+
24+
Example invocation (from repository root):
25+
26+
```bash
27+
erl -noshell \
28+
-pa build/dev/erlang/gleam_stdlib/ebin \
29+
-pa build/dev/erlang/str/ebin \
30+
-eval "bench_beam:run(), halt()."
31+
```
32+
33+
This will write a CSV file under `scripts/bench_results/` with timing
34+
results for several scenarios (repetitive, random, emoji). The script is
35+
`scripts/bench_beam.erl` and does not modify the repository source; it
36+
only reads the compiled `.beam` files.
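
The O(nm) vs O(n+m) distinction mentioned in the benchmark notes can be made concrete by counting symbol comparisons instead of measuring time. The Python sketch below is an illustration written for this purpose and is not taken from `scripts/bench_kmp.py`. Both function names are hypothetical. It uses a scaled-down version of the "repetitive, no match" scenario, where the window-by-window matcher rescans almost the whole pattern at every position.

```python
def sliding_comparisons(text, pattern):
    """Count symbol comparisons made by a naive window-by-window matcher."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for start in range(n - m + 1):
        for offset in range(m):
            comparisons += 1
            if text[start + offset] != pattern[offset]:
                break  # mismatch: move on to the next window
    return comparisons


def kmp_comparisons(text, pattern):
    """Count symbol comparisons made by KMP, table construction included."""
    m = len(pattern)
    if m == 0:
        return 0
    comparisons = 0
    table, k = [0] * m, 0
    for i in range(1, m):
        while True:
            comparisons += 1
            if pattern[i] == pattern[k]:
                k += 1
                break
            if k == 0:
                break
            k = table[k - 1]  # retry against a shorter border
        table[i] = k
    k = 0
    for ch in text:
        while True:
            comparisons += 1
            if ch == pattern[k]:
                k += 1
                break
            if k == 0:
                break
            k = table[k - 1]
        if k == m:
            k = table[k - 1]  # full match: continue from the longest border
    return comparisons


if __name__ == "__main__":
    text = "a" * 200
    pattern = "a" * 20 + "b"  # never matches, but almost matches everywhere
    print("sliding:", sliding_comparisons(text, pattern))
    print("kmp:    ", kmp_comparisons(text, pattern))
```

On this input the naive count grows with the product of the text and pattern lengths, while the KMP count stays within twice their sum. On short random patterns the naive matcher usually breaks out after one or two comparisons, which is consistent with the changelog's observation that Sliding wins there.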

scripts/bench_beam.erl

Lines changed: 114 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,114 @@
1+
%% Lightweight BEAM-native benchmark harness for `str` functions.
2+
%%
3+
%% Usage (from repo root):
4+
%% erl -noshell \
5+
%% -pa build/dev/erlang/gleam_stdlib/ebin \
6+
%% -pa build/dev/erlang/str/ebin \
7+
%% -eval "bench_beam:run(), halt()."
8+
%%
9+
%% The script writes CSV output to `scripts/bench_results/bench_beam_<ts>.csv`.
10+
11+
-module(bench_beam).
12+
-export([run/0]).
13+
14+
%% Simple helpers
15+
ensure_dir(Path) ->
16+
case filelib:is_dir(Path) of
17+
true -> ok;
18+
false -> file:make_dir(Path)
19+
end.
20+
21+
timestamp() ->
22+
{{Y,Mo,D},{H,Mi,S}} = calendar:universal_time(),
23+
lists:flatten(io_lib:format("~4..0B~2..0B~2..0B_~2..0B~2..0B~2..0B", [Y,Mo,D,H,Mi,S])).
24+
25+
do_warmup(_M,_F,_A,0) -> ok;
26+
do_warmup(M,F,A,N) when N > 0 ->
27+
_ = apply(M,F,A),
28+
do_warmup(M,F,A,N-1).
29+
30+
time_fun(M,F,A,Iter) ->
31+
%% Warm-up
32+
do_warmup(M,F,A,5),
33+
{MicroSecs, _} = timer:tc(fun() -> lists:foreach(fun(_) -> _ = apply(M,F,A) end, lists:seq(1,Iter)) end),
34+
MicroSecs div Iter.
35+
36+
gen_repetitive(Bin, N) when is_binary(Bin) ->
37+
iolist_to_binary(lists:duplicate(N, Bin)).
38+
39+
gen_random(Alphabet, N) when is_list(Alphabet) ->
40+
%% Alphabet is a list of integers (string). Build list of N random elements and convert to binary.
41+
Len = length(Alphabet),
42+
Fun = fun(_) -> lists:nth(rand:uniform(Len), Alphabet) end,
43+
Chars = [Fun(Arg) || Arg <- lists:seq(1,N)],
44+
list_to_binary(Chars).
45+
46+
write_csv_header(File) ->
47+
io:format(File, "case,scenario_type,text_len,pat_len,max_border,matches,index_of_us,index_of_auto_us,kmp_us,sliding_us,count_us,count_auto_us,iter~n", []).
48+
49+
measure_case(File, Name, Type, Text, Pat, Iter) ->
50+
%% Compute matches using sliding_search_all for consistency
51+
MatchesList = catch 'str@core':sliding_search_all(Text, Pat),
52+
Matches = case MatchesList of
53+
{'EXIT', _} -> -1;
54+
L -> length(L)
55+
end,
56+
%% Compute prefix table max border for the pattern (0 if failure)
57+
Pi = case catch 'str@core':build_prefix_table(Pat) of
58+
{'EXIT', _} -> [];
59+
R -> R
60+
end,
61+
MaxBorder = case Pi of
62+
[] -> 0;
63+
_ -> lists:max(Pi)
64+
end,
65+
Iof = time_fun('str@core', index_of, [Text, Pat], Iter),
66+
Iaof = time_fun('str@core', index_of_auto, [Text, Pat], Iter),
67+
Kmp = time_fun('str@core', kmp_search_all, [Text, Pat], Iter),
68+
Slide = time_fun('str@core', sliding_search_all, [Text, Pat], Iter),
69+
Cnt = time_fun('str@core', count, [Text, Pat, true], Iter),
70+
Ca = time_fun('str@core', count_auto, [Text, Pat, true], Iter),
71+
io:format(File, "~s,~s,~p,~p,~p,~p,~p,~p,~p,~p,~p,~p,~p~n",
72+
[Name, Type, byte_size(Text), byte_size(Pat), MaxBorder, Matches, Iof, Iaof, Kmp, Slide, Cnt, Ca, Iter]).
73+
74+
run() ->
75+
rand:seed(exsplus, {erlang:monotonic_time(), erlang:unique_integer([positive]), erlang:phash2(node())}),
76+
ensure_dir("scripts/bench_results"),
77+
Ts = timestamp(),
78+
Path = filename:join("scripts/bench_results", "bench_beam_" ++ Ts ++ ".csv"),
79+
{ok, File} = file:open(Path, [write, {encoding, utf8}]),
80+
write_csv_header(File),
81+
io:format("Starting BEAM benchmarks...~n"),
82+
Iter = 200,
83+
84+
%% Scenarios
85+
%% 1) repetitive no match
86+
Text1 = gen_repetitive(<<$a>>, 20000),
87+
Bin1 = gen_repetitive(<<$a>>, 1000),
88+
Pat1 = <<Bin1/binary, $b>>,
89+
io:format("Running repetitive_nomatch (~p bytes text, ~p bytes pat)...~n", [byte_size(Text1), byte_size(Pat1)]),
90+
measure_case(File, "repetitive_nomatch", "repetitive_nomatch", Text1, Pat1, Iter),
91+
92+
%% 2) repetitive many matches
93+
Text2 = gen_repetitive(<<$a>>, 20000),
94+
Pat2 = gen_repetitive(<<$a>>, 50),
95+
io:format("Running repetitive_many (~p bytes text, ~p bytes pat)...~n", [byte_size(Text2), byte_size(Pat2)]),
96+
measure_case(File, "repetitive_many", "repetitive_many", Text2, Pat2, Iter),
97+
98+
%% 3) random small pat
99+
Text3 = gen_random("abcd", 20000),
100+
Pat3 = gen_random("abcd", 20),
101+
io:format("Running random_small_pat (~p bytes text, ~p bytes pat)...~n", [byte_size(Text3), byte_size(Pat3)]),
102+
measure_case(File, "random_small_pat", "random", Text3, Pat3, Iter),
103+
104+
%% 4) large text small pat
105+
Text4 = gen_random("abcd", 200000),
106+
Pat4 = <<"abcdab">>,
107+
io:format("Running large_text_small_pat (~p bytes text, ~p bytes pat)...~n", [byte_size(Text4), byte_size(Pat4)]),
108+
measure_case(File, "large_text_small_pat", "random", Text4, Pat4, Iter div 4),
109+
110+
%% (emoji case omitted in this BEAM harness to avoid encoding edge-cases)
111+
112+
file:close(File),
113+
io:format("Wrote results to ~s~n", [Path]),
114+
ok.
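
The measurement pattern in `bench_beam.erl` is a handful of untimed warm-up calls, one timed loop, and an integer mean in microseconds written as a CSV row. The Python sketch below mirrors that pattern for readers less familiar with Erlang. It is an analogue, not a port. It times Python's built-in `str.find` as a stand-in for the library functions, writes to an in-memory buffer rather than `scripts/bench_results/`, records fewer columns, and uses a fixed random seed where the Erlang harness seeds from the clock. All of these are assumptions of the sketch.

```python
import csv
import io
import random
import time


def time_fun(func, args, iterations, warmup=5):
    """Mean wall-clock microseconds per call, after `warmup` untimed calls."""
    for _ in range(warmup):
        func(*args)
    start = time.perf_counter_ns()
    for _ in range(iterations):
        func(*args)
    elapsed_us = (time.perf_counter_ns() - start) // 1000
    return elapsed_us // iterations


def gen_repetitive(unit, count):
    """Repeat `unit` `count` times, e.g. 20000 copies of "a"."""
    return unit * count


def gen_random(alphabet, length, rng):
    """Draw `length` symbols uniformly from `alphabet`."""
    return "".join(rng.choice(alphabet) for _ in range(length))


def measure_case(writer, name, text, pattern, iterations):
    """Time one scenario and append its row to the CSV writer."""
    find_us = time_fun(text.find, (pattern,), iterations)
    writer.writerow([name, len(text), len(pattern), find_us, iterations])


def run():
    """Run two of the harness scenarios and return the CSV text."""
    rng = random.Random(42)  # fixed seed so the random inputs are reproducible
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["case", "text_len", "pat_len", "find_us", "iter"])
    measure_case(writer, "repetitive_nomatch",
                 gen_repetitive("a", 20000), gen_repetitive("a", 1000) + "b", 200)
    measure_case(writer, "random_small_pat",
                 gen_random("abcd", 20000, rng), gen_random("abcd", 20, rng), 200)
    return out.getvalue()


if __name__ == "__main__":
    print(run())
```

Integer division truncates per-call times below one microsecond to 0, which is one reason to time many iterations of a sufficiently large input, as the Erlang harness does.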
