lupodevelop
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 29 additions & 0 deletions
diff --git a/EXAMPLES.md b/EXAMPLES.md
Lines changed: 38 additions & 1 deletion
diff --git a/README.md b/README.md
Lines changed: 27 additions & 0 deletions
diff --git a/gleam.toml b/gleam.toml
Lines changed: 2 additions & 2 deletions
diff --git a/scripts/README.md b/scripts/README.md
Lines changed: 36 additions & 0 deletions
diff --git a/scripts/bench_beam.erl b/scripts/bench_beam.erl
Lines changed: 114 additions & 0 deletions
@@ -2,6 +2,35 @@
 
 All notable changes to this project are documented in this file.
 
+## [1.2.2] - 2026-01-05
+### Added
+- Added internal helper `grapheme_len/1` to centralize grapheme cluster length computation and avoid repetitive `string.to_graphemes |> list.length` patterns.
+- **Experimental:** Implemented two substring search strategies optimized for different workloads:
+  - **KMP (Knuth–Morris–Pratt)**: prefix-table based search with both full-search (`kmp_search_all`) and early-exit index (`kmp_index_of`) variants. Good performance on long or highly repetitive patterns.
+  - **Sliding-match**: a non-allocating sliding-window matcher (`sliding_search_all`) with an early-exit index variant (`sliding_index_of`). Often fastest for short, non-repetitive patterns.
+- **Experimental, opt-in APIs:** `index_of_auto` and `count_auto` select between KMP and Sliding using a heuristic (disabled by default). Added explicit APIs for deterministic control: `index_of_strategy` and `count_strategy` (accept `core.Kmp` or `core.Sliding`).
+- Added `src/str/config.gleam` with tunable thresholds (`kmp_min_pattern_len`, `kmp_large_text_threshold`, `kmp_large_text_min_pat`, `kmp_border_multiplier`) to allow projects to control heuristic behavior.
+
+### Style
+- Replaced direct grapheme-length patterns with `grapheme_len/1` where appropriate to improve readability and maintainability.
+
+### Tests
+- Added tests verifying grapheme-aware length behavior (ASCII, combining marks, ZWJ emoji sequences, regional flags, and long ASCII strings).
+- Added unit tests for KMP, Sliding, heuristic chooser, and explicit strategy APIs (`test/str_kmp_test.gleam`, `test/str_sliding_test.gleam`, `test/str_strategy_test.gleam`, `test/str_auto_test.gleam`, `test/str_strategy_explicit_test.gleam`). All tests pass locally (355 passed at time of change).
+
+### Performance & Benchmarking
+- Added BEAM-native benchmark harness (`scripts/bench_beam.erl`) and Python micro-benchmark (`scripts/bench_kmp.py`) to evaluate algorithmic trade-offs on the VM. The BEAM harness now records `max_border` (prefix-table max) in CSV output to help heuristic tuning.
+- Micro-optimizations: converted `prefix_eq_list` and `sliding_index_loop` to iterative implementations and removed redundant list reversals in KMP prefix table construction (also reduced `list.last` usage by tracking `k`), lowering per-iteration overhead and improving BEAM measurements.
+- Observed behavior from benchmarks:
+  - KMP performs very well on long and highly repetitive patterns (it no longer exhibits the prior pathological slowdown after optimizations).
+  - Sliding wins on short, largely random patterns.
+  - `index_of_auto` remains heuristic and may choose a non-optimal algorithm for some inputs — explicit strategy APIs are recommended for performance-critical code.
+
+### Fixed
+- Made `remove_prefix`, `remove_suffix`, `ensure_prefix` and `ensure_suffix` grapheme-aware to avoid splitting multi-codepoint graphemes (emoji, combining sequences); added tests to cover these cases.
+
+Contributed by: Daniele (`lupodevelop`)
+
 ## [1.2.1] - 2026-01-02
 ### Fixed
 - Made `repeat_str/2` iterative to avoid deep recursion and improve performance on large repetition counts.
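
The 1.2.2 entry names the tunable thresholds, but this diff does not show how `index_of_auto` combines them. The Python sketch below is one hypothetical way such a chooser could work: the field names follow the changelog, while the decision rules and every default value are assumptions made for illustration, not the library's implementation. Python is used only because the repository's micro-benchmark (`scripts/bench_kmp.py`) is written in it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    # Field names come from the changelog; the values are placeholders,
    # NOT the defaults in src/str/config.gleam.
    kmp_min_pattern_len: int = 64
    kmp_large_text_threshold: int = 100_000
    kmp_large_text_min_pat: int = 8
    kmp_border_multiplier: int = 2


def max_border(pattern):
    """Largest entry of the KMP prefix table (0 for an empty pattern)."""
    table = [0] * len(pattern)
    k = 0  # length of the current longest proper border
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return max(table, default=0)


def choose_strategy(text_len, pattern, cfg=Config()):
    """Hypothetical chooser: return "kmp" or "sliding"."""
    pat_len = len(pattern)
    # Assumed rule 1: long patterns go to KMP.
    if pat_len >= cfg.kmp_min_pattern_len:
        return "kmp"
    # Assumed rule 2: large texts with a non-trivial pattern go to KMP.
    if (text_len >= cfg.kmp_large_text_threshold
            and pat_len >= cfg.kmp_large_text_min_pat):
        return "kmp"
    # Assumed rule 3: highly self-overlapping (repetitive) patterns go to KMP.
    if pat_len > 0 and max_border(pattern) * cfg.kmp_border_multiplier >= pat_len:
        return "kmp"
    return "sliding"
```

Whatever the real rules are, the trade-off is the one the changelog describes: a chooser like this can misjudge some inputs, which is why the explicit `index_of_strategy` and `count_strategy` APIs exist.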
 
@@ -30,6 +30,43 @@ pub fn search_examples() {
 }
 ```
 
+### Experimental Search Strategies & Caching (1.2.2)
+
+```gleam
+import str/core
+
+pub fn search_strategy_examples() {
+  // 1) Use the automatic heuristic (experimental)
+  // The heuristic chooses between a sliding matcher and KMP based on
+  // pattern/text characteristics. It is opt-in and may choose a
+  // non-optimal strategy in some cases.
+  let auto = core.index_of_auto("some long text...", "pat")
+
+  // 2) Force a specific strategy: use this when performance is critical
+  // and you know which algorithm is better for your input shape.
+  let forced_kmp = core.index_of_strategy("long text...", "pattern", core.Kmp)
+  let forced_sliding = core.index_of_strategy("short text", "pat", core.Sliding)
+
+  // 3) Caching KMP maps: precompute pattern maps once and reuse them
+  // across multiple searches to avoid rebuilding prefix tables.
+  let pattern = "abababab..."
+  let maps = core.build_kmp_maps(pattern)
+  let pmap = maps.0
+  let pimap = maps.1
+
+  // Reuse maps across many texts
+  let idx1 = core.kmp_index_of_with_maps("first long text...", pattern, pmap, pimap)
+  let occurrences = core.kmp_search_all_with_maps("another text...", pmap, pimap)
+
+  // Guidance: prefer explicit strategy or caching in hot loops; use
+  // `index_of_auto` for convenience and exploratory testing.
+}
+```
+
+> Note: `index_of_auto` is experimental and its behavior depends on tunable
+> thresholds in `src/str/config.gleam`. For production-critical paths,
+> prefer `index_of_strategy` or precomputing maps via `build_kmp_maps`.
+
 ### Grapheme-Aware Length and String Checks (NEW in 1.1.0)
 
 ```gleam
@@ -394,7 +431,7 @@ This gives you full control over decomposition/normalization order.
 The project uses Gleam's test runner. Example commands:
 
 ```sh
-# run all tests (325 tests)
+# run all tests
 gleam test
 
 # run a single test file (shell navigation)
 
@@ -114,6 +114,33 @@ pub fn main() {
 | `replace_first(text, old, new)` | `"aaa", "a", "b"` | `"baa"` |
 | `replace_last(text, old, new)` | `"aaa", "a", "b"` | `"aab"` |
 
+### ⚠️ Experimental: Search Strategies
+
+**Algorithms:**
+- **KMP**: optimized for long/repetitive patterns
+- **Sliding**: fast for short patterns, zero allocations
+
+**APIs:**
+
+| Function | Description |
+|----------|-------------|
+| `index_of_auto(text, pattern)` | Auto-select algorithm (heuristic) |
+| `index_of_strategy(text, pattern, Kmp\|Sliding)` | Explicit algorithm choice |
+| `count_auto(text, pattern, overlapping)` | Auto-select for counting |
+| `count_strategy(text, pattern, overlapping, Kmp\|Sliding)` | Explicit count algorithm |
+
+**Examples:**
+
+```gleam
+// Force KMP explicitly
+core.index_of_strategy("long text...", "pattern", core.Kmp)
+
+// Let heuristic decide (experimental)
+core.index_of_auto("some text", "pat")
+```
+
+> **Note:** `_auto` variants use heuristics and may not always choose optimally. For performance-critical code, use `_strategy` variants. Configure thresholds in `src/str/config.gleam`.
+
 ### 🧩 Splitting & Partitioning
 
 | Function | Example | Result |
 
@@ -1,9 +1,9 @@
 name = "str"
-version = "1.2.1"
+version = "1.2.2"
 
 # Project metadata (fill or replace placeholders before publishing)
 description = "Unicode-aware string utilities for Gleam: grapheme-safe operations, pragmatic ASCII transliteration, and slug generation."
-licences = ["MIT"]
+licenses = ["MIT"]
 repository = { type = "github", user = "lupodevelop", repo = "str" }
 links = [{ title = "Repository", href = "https://github.com/lupodevelop/str" }]
 
 
@@ -0,0 +1,36 @@
+Benchmarks
+==========
+
+This folder contains micro-benchmarks used to evaluate substring search
+heuristics (sliding-match vs KMP) on representative inputs.
+
+Usage:
+
+    python3 scripts/bench_kmp.py --repeat 5
+
+Notes:
+- These benchmarks are implementation-agnostic: they exercise algorithmic
+  characteristics (O(nm) vs O(n+m)) using Python implementations.
+- They are intended to help choose heuristics (pattern length thresholds,
+  repetitiveness heuristics) without modifying library code.
+
+BEAM-native benchmark
+---------------------
+
+You can also run a BEAM-native micro-benchmark that invokes the compiled
+Gleam modules directly on the Erlang VM. This is useful to measure the
+actual runtime performance on the target platform.
+
+Example invocation (from repository root):
+
+```bash
+erl -noshell \
+  -pa build/dev/erlang/gleam_stdlib/ebin \
+  -pa build/dev/erlang/str/ebin \
+  -eval "bench_beam:run(), halt()."
+```
+
+This writes a CSV file under `scripts/bench_results/` with timing
+results for several scenarios (repetitive and random inputs; the emoji
+case is currently omitted from the harness). The script is
+`scripts/bench_beam.erl`; it does not modify the repository source and
+only reads the compiled `.beam` files.
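
For readers who want to see the O(nm) versus O(n+m) contrast mentioned in the notes above, here is a minimal Python sketch of the two strategies. It is not the contents of `scripts/bench_kmp.py`. The function names mirror those in the changelog, but these are stand-ins that index Python code points rather than grapheme clusters, and returning an empty list for an empty pattern is an assumption.

```python
def build_prefix_table(pattern):
    """KMP prefix (border) table: table[i] is the length of the longest
    proper prefix of pattern[:i + 1] that is also a suffix of it."""
    table = [0] * len(pattern)
    k = 0  # length of the current longest proper border
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table


def kmp_search_all(text, pattern):
    """Start indices of all (overlapping) matches, O(n + m)."""
    if not pattern:
        return []  # assumption: empty pattern yields no matches
    table = build_prefix_table(pattern)
    matches, k = [], 0
    for i, ch in enumerate(text):
        # On mismatch, fall back along the border chain instead of
        # re-reading text characters.
        while k > 0 and ch != pattern[k]:
            k = table[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = table[k - 1]
    return matches


def sliding_search_all(text, pattern):
    """Start indices of all (overlapping) matches by testing every
    window; O(n * m) in the worst case, cheap for short patterns."""
    m = len(pattern)
    if m == 0:
        return []  # assumption: empty pattern yields no matches
    return [i for i in range(len(text) - m + 1) if text.startswith(pattern, i)]
```

The maximum of `build_prefix_table(pattern)` corresponds to the `max_border` column that the BEAM harness records: a large value relative to the pattern length indicates a repetitive pattern, where the window-by-window approach re-compares the most characters.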
@@ -0,0 +1,114 @@
+%% Lightweight BEAM-native benchmark harness for `str` functions.
+%%
+%% Usage (from repo root):
+%%   erl -noshell \
+%%     -pa build/dev/erlang/gleam_stdlib/ebin \
+%%     -pa build/dev/erlang/str/ebin \
+%%     -eval "bench_beam:run(), halt()."
+%%
+%% The script writes CSV output to `scripts/bench_results/bench_beam_<ts>.csv`.
+
+-module(bench_beam).
+-export([run/0]).
+
+%% Simple helpers
+ensure_dir(Path) ->
+  case filelib:is_dir(Path) of
+    true -> ok;
+    false -> file:make_dir(Path)
+  end.
+
+timestamp() ->
+  {{Y,Mo,D},{H,Mi,S}} = calendar:universal_time(),
+  lists:flatten(io_lib:format("~4..0B~2..0B~2..0B_~2..0B~2..0B~2..0B", [Y,Mo,D,H,Mi,S])).
+
+do_warmup(_M,_F,_A,0) -> ok;
+do_warmup(M,F,A,N) when N > 0 ->
+  _ = apply(M,F,A),
+  do_warmup(M,F,A,N-1).
+
+time_fun(M,F,A,Iter) ->
+  %% Warm-up
+  do_warmup(M,F,A,5),
+  {MicroSecs, _} = timer:tc(fun() -> lists:foreach(fun(_) -> _ = apply(M,F,A) end, lists:seq(1,Iter)) end),
+  MicroSecs div Iter.
+
+gen_repetitive(Bin, N) when is_binary(Bin) ->
+  iolist_to_binary(lists:duplicate(N, Bin)).
+
+gen_random(Alphabet, N) when is_list(Alphabet) ->
+  %% Alphabet is a list of integers (string). Build list of N random elements and convert to binary.
+  Len = length(Alphabet),
+  Fun = fun(_) -> lists:nth(rand:uniform(Len), Alphabet) end,
+  Chars = [Fun(Arg) || Arg <- lists:seq(1,N)],
+  list_to_binary(Chars).
+
+write_csv_header(File) ->
+  io:format(File, "case,scenario_type,text_len,pat_len,max_border,matches,index_of_us,index_of_auto_us,kmp_us,sliding_us,count_us,count_auto_us,iter~n", []).
+
+measure_case(File, Name, Type, Text, Pat, Iter) ->
+  %% Compute matches using sliding_search_all for consistency
+  MatchesList = (catch 'str@core':sliding_search_all(Text, Pat)),
+  Matches = case MatchesList of
+    {'EXIT', _} -> -1;
+    L -> length(L)
+  end,
+  %% Compute prefix table max border for the pattern (0 if failure)
+  Pi = case catch 'str@core':build_prefix_table(Pat) of
+    {'EXIT', _} -> [];
+    R -> R
+  end,
+  MaxBorder = case Pi of
+    [] -> 0;
+    _ -> lists:max(Pi)
+  end,
+  Iof = time_fun('str@core', index_of, [Text, Pat], Iter),
+  Iaof = time_fun('str@core', index_of_auto, [Text, Pat], Iter),
+  Kmp = time_fun('str@core', kmp_search_all, [Text, Pat], Iter),
+  Slide = time_fun('str@core', sliding_search_all, [Text, Pat], Iter),
+  Cnt = time_fun('str@core', count, [Text, Pat, true], Iter),
+  Ca = time_fun('str@core', count_auto, [Text, Pat, true], Iter),
+  io:format(File, "~s,~s,~p,~p,~p,~p,~p,~p,~p,~p,~p,~p,~p~n",
+    [Name, Type, byte_size(Text), byte_size(Pat), MaxBorder, Matches, Iof, Iaof, Kmp, Slide, Cnt, Ca, Iter]).
+
+run() ->
+  rand:seed(exsplus, {erlang:monotonic_time(), erlang:unique_integer([positive]), erlang:phash2(node())}),
+  ensure_dir("scripts/bench_results"),
+  Ts = timestamp(),
+  Path = filename:join("scripts/bench_results", "bench_beam_" ++ Ts ++ ".csv"),
+  {ok, File} = file:open(Path, [write, {encoding, utf8}]),
+  write_csv_header(File),
+  io:format("Starting BEAM benchmarks...~n"),
+  Iter = 200,
+
+  %% Scenarios
+  %% 1) repetitive no match
+  Text1 = gen_repetitive(<<$a>>, 20000),
+  Bin1 = gen_repetitive(<<$a>>, 1000),
+  Pat1 = <<Bin1/binary, $b>>,
+  io:format("Running repetitive_nomatch (~p bytes text, ~p bytes pat)...~n", [byte_size(Text1), byte_size(Pat1)]),
+  measure_case(File, "repetitive_nomatch", "repetitive_nomatch", Text1, Pat1, Iter),
+
+  %% 2) repetitive many matches
+  Text2 = gen_repetitive(<<$a>>, 20000),
+  Pat2 = gen_repetitive(<<$a>>, 50),
+  io:format("Running repetitive_many (~p bytes text, ~p bytes pat)...~n", [byte_size(Text2), byte_size(Pat2)]),
+  measure_case(File, "repetitive_many", "repetitive_many", Text2, Pat2, Iter),
+
+  %% 3) random small pat
+  Text3 = gen_random("abcd", 20000),
+  Pat3 = gen_random("abcd", 20),
+  io:format("Running random_small_pat (~p bytes text, ~p bytes pat)...~n", [byte_size(Text3), byte_size(Pat3)]),
+  measure_case(File, "random_small_pat", "random", Text3, Pat3, Iter),
+
+  %% 4) large text small pat
+  Text4 = gen_random("abcd", 200000),
+  Pat4 = <<"abcdab">>,
+  io:format("Running large_text_small_pat (~p bytes text, ~p bytes pat)...~n", [byte_size(Text4), byte_size(Pat4)]),
+  measure_case(File, "large_text_small_pat", "random", Text4, Pat4, Iter div 4),
+
+  %% (emoji case omitted in this BEAM harness to avoid encoding edge-cases)
+
+  file:close(File),
+  io:format("Wrote results to ~s~n", [Path]),
+  ok.
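
To use the harness output for heuristic tuning, the CSV can be summarized per case. The Python sketch below relies only on the column names written by `write_csv_header/1` (`case`, `kmp_us`, `sliding_us`). The helper name `faster_strategy` is hypothetical and is not part of the repository.

```python
import csv
import io


def faster_strategy(csv_text):
    """Map each benchmark case to "kmp", "sliding" or "tie", comparing
    the per-iteration averages in the kmp_us and sliding_us columns."""
    result = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        kmp_us = int(row["kmp_us"])
        sliding_us = int(row["sliding_us"])
        if kmp_us < sliding_us:
            winner = "kmp"
        elif sliding_us < kmp_us:
            winner = "sliding"
        else:
            winner = "tie"
        result[row["case"]] = winner
    return result
```

Pass it the text of a file from `scripts/bench_results/`. Because `time_fun/4` reports integer microseconds per iteration (`MicroSecs div Iter`), very fast cases can round to the same value and show up as a tie.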