Describe what you are looking for
Problem
Case-insensitive substring search on ASCII text is one of the most common operations in
databases (SQL CONTAINS, POSITION, LIKE '%pattern%' with case-insensitive collation).
StringZilla currently offers sz_utf8_case_insensitive_find which handles the full Unicode
Case Folding specification (1M+ codepoints, one-to-many expansions like ß → ss). This is
correct and complete, but carries overhead for the common case of pure ASCII input:
- Per-character Unicode case folding lookup tables
- Needle metadata computation for safe-slice extraction
- Multi-codepoint expansion handling
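For ASCII, case folding is trivial: c | 0x20 converts A-Z to a-z and leaves a-z and the digits unchanged.
It does alter a few non-letter bytes (for example '@' becomes '`' and '[' becomes '{'), so an exact
implementation either range-checks for A-Z or verifies candidates after the fast comparison. Either way,
the fold can be fused directly into the SIMD comparison loop: no lookup tables, no metadata, no allocations.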
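As a minimal sketch of the folding step (the helper names `ascii_fold` and `ascii_or20` are hypothetical and not part of StringZilla), the following contrasts a bare OR with a branch-free variant that sets bit 0x20 only for A-Z:

```c
#include <stdint.h>

/* Hypothetical helper: the bare OR. Cheap, but it also rewrites some
 * non-letters, e.g. '@' (0x40) becomes '`' (0x60). */
static inline uint8_t ascii_or20(uint8_t c) {
    return (uint8_t)(c | 0x20);
}

/* Hypothetical helper: branch-free fold that touches only 'A'..'Z'.
 * (uint8_t)(c - 'A') wraps for bytes below 'A', so one unsigned compare
 * covers both ends of the range. */
static inline uint8_t ascii_fold(uint8_t c) {
    uint8_t is_upper = (uint8_t)((uint8_t)(c - 'A') < 26u); /* 1 for A-Z, else 0 */
    return (uint8_t)(c | (uint8_t)(is_upper << 5));         /* OR in 0x20 or 0x00 */
}
```

The range-checked form costs one subtract, one compare, and one shift more than the bare OR, and both map directly onto byte-wise SIMD instructions.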
Benchmark evidence
We benchmarked three approaches on Graviton ARM for a case-insensitive search ("XyZ" in a
haystack of varying lengths, lower collation):
| Size | sz_utf8_case_insensitive_find (ns) | ASCII NEON prototype (ns) |
| --- | --- | --- |
| 128 B | 111 | 26 |
| 256 B | 118 | 32 |
| 1 KB | 152 | 69 |
| 4 KB | 295 | 236 |
| 16 KB | 844 | 917 |
| 64 KB | 3,086 | 3,646 |
| 256 KB | 14,973 | 14,489 |
| 1 MB | 54,574 | 58,895 |
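The ASCII NEON prototype fuses OR 0x20 into the NEON comparison.
sz_utf8_case_insensitive_find is about 4x slower at 128 B and about 2x slower at 1 KB, which we
attribute to the Unicode lookup overhead. From 16 KB upward the two are within roughly 20% of each other.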
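To illustrate what fusing OR 0x20 into a NEON comparison can look like, here is a hedged sketch of a first-byte candidate scan. It is not the benchmarked prototype: the function name `scan_first_byte_or20` is hypothetical, the NEON path is compiled only on AArch64, and other targets use the scalar loop alone. A caller would still verify the rest of the needle at each candidate offset.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: returns the offset of the first byte b in h[0..n)
 * with (b | 0x20) == (first | 0x20), or n if there is none. */
size_t scan_first_byte_or20(const uint8_t *h, size_t n, uint8_t first) {
    const uint8_t target = (uint8_t)(first | 0x20);
    size_t i = 0;
#if defined(__aarch64__) && defined(__ARM_NEON)
    const uint8x16_t bit = vdupq_n_u8(0x20);
    const uint8x16_t want = vdupq_n_u8(target);
    for (; i + 16 <= n; i += 16) {
        uint8x16_t folded = vorrq_u8(vld1q_u8(h + i), bit); /* fold 16 bytes at once */
        uint8x16_t eq = vceqq_u8(folded, want);             /* 0xFF in matching lanes */
        if (vmaxvq_u8(eq) != 0) break;                      /* a hit lies in this block */
    }
#endif
    /* Scalar loop: handles the tail and pins down the exact lane after a hit. */
    for (; i < n; ++i)
        if ((uint8_t)(h[i] | 0x20) == target) return i;
    return n;
}
```

Because a bare OR can only produce false positives, never false negatives, it is safe as a prefilter as long as candidates are confirmed with an exact comparison afterwards.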
Proposed API
/**
 * @brief ASCII case-insensitive substring search.
 *
 * Equivalent to sz_find but with OR 0x20 applied to both haystack and needle
 * bytes during comparison. Only correct for ASCII input (bytes 0x00-0x7F).
 * For Unicode input, use sz_utf8_case_insensitive_find instead.
 *
 * @param haystack The string to search in.
 * @param h_length Length of haystack in bytes.
 * @param needle The substring to search for.
 * @param n_length Length of needle in bytes.
 * @return Pointer to first case-insensitive match, or SZ_NULL_CHAR if not found.
 */
SZ_DYNAMIC sz_cptr_t sz_find_ascii_case_insensitive(
    sz_cptr_t haystack, sz_size_t h_length,
    sz_cptr_t needle, sz_size_t n_length);
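The intended behaviour could be pinned down with a scalar reference like the sketch below, which a SIMD backend might be tested against. Everything here is an assumption for illustration: the name `find_ascii_ci_reference` is hypothetical, plain `const char *` and `size_t` stand in for `sz_cptr_t` and `sz_size_t`, `NULL` stands in for `SZ_NULL_CHAR`, an empty needle is treated as "not found", and folding is range-checked to A-Z rather than a bare OR.

```c
#include <stddef.h>

/* Fold only 'A'..'Z' so that non-letters such as '@' compare exactly. */
static unsigned char fold_ascii(unsigned char c) {
    return (c >= 'A' && c <= 'Z') ? (unsigned char)(c | 0x20) : c;
}

/* Hypothetical scalar reference, O(h_length * n_length) in the worst case. */
const char *find_ascii_ci_reference(const char *haystack, size_t h_length,
                                    const char *needle, size_t n_length) {
    /* Assumption of this sketch: an empty or oversized needle never matches. */
    if (n_length == 0 || n_length > h_length) return NULL;
    for (size_t i = 0; i + n_length <= h_length; ++i) {
        size_t j = 0;
        while (j < n_length &&
               fold_ascii((unsigned char)haystack[i + j]) ==
                   fold_ascii((unsigned char)needle[j]))
            ++j;
        if (j == n_length) return haystack + i; /* first match wins */
    }
    return NULL;
}
```

For example, searching for "XyZ" in "select * from XYZ_table" returns a pointer to offset 14, while searching for "`" in "a@b" returns NULL because only letters are folded.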
Why not just use sz_utf8_case_insensitive_find?
- Performance: roughly 2-4x slower than the dedicated ASCII prototype for short strings (128 B to 1 KB in the table above), the common case in RDBMS
- An existing legacy code base already has a fast-path for ASCII
Can you contribute to the implementation?
Is your feature request specific to a certain interface?
C implementation