Feature: sz_find_ascii_case_insensitive — ASCII case-insensitive substring search #311

@belugabehr

Describe what you are looking for

Problem

Case-insensitive substring search on ASCII text is one of the most common operations in
databases (SQL CONTAINS, POSITION, LIKE '%pattern%' with case-insensitive collation).

StringZilla currently offers sz_utf8_case_insensitive_find which handles the full Unicode
Case Folding specification (1M+ codepoints, one-to-many expansions like ß → ss). This is
correct and complete, but carries overhead for the common case of pure ASCII input:

  • Per-character Unicode case folding lookup tables
  • Needle metadata computation for safe-slice extraction
  • Multi-codepoint expansion handling

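For ASCII, case folding is trivial: c | 0x20 converts A-Z to a-z. Plain OR 0x20 also changes
some non-letter bytes (for example '@' 0x40 becomes '`' 0x60, and '[' becomes '{'), so an exact
fold applies it only to bytes in the A-Z range. Either way, the fold can be fused directly into
the SIMD comparison loop: no lookup tables, no metadata, no allocations.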
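
To make the folding rule concrete, here is a minimal scalar sketch. Both helper names are hypothetical and not part of StringZilla; the second one adds the range check that keeps non-letter bytes intact.

```c
#include <stdint.h>

/* Hypothetical helper: unconditional OR 0x20, as described above. */
static inline uint8_t ascii_or_0x20(uint8_t c) { return (uint8_t)(c | 0x20u); }

/* Hypothetical helper: fold only 'A'..'Z' and leave every other byte unchanged. */
static inline uint8_t ascii_fold(uint8_t c) {
    /* For bytes below 'A' the subtraction wraps to a large unsigned value,
       so one comparison covers both ends of the range. */
    uint8_t is_upper = (uint8_t)((unsigned)(c - 'A') < 26u);
    return (uint8_t)(c | (is_upper << 5)); /* 1 << 5 == 0x20 */
}
```

With these helpers, ascii_or_0x20('@') yields '`', while ascii_fold('@') leaves '@' unchanged. Both map 'A' to 'a'.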

Benchmark evidence

We benchmarked the approaches below on Graviton ARM for a case-insensitive search ("XyZ" in a
haystack of varying lengths, lower collation):

Size      sz_utf8_case_insensitive_find (ns)    ASCII NEON prototype (ns)
128 B                                    111                           26
256 B                                    118                           32
1 KB                                     152                           69
4 KB                                     295                          236
16 KB                                    844                          917
64 KB                                  3,086                        3,646
256 KB                                14,973                       14,489
1 MB                                  54,574                       58,895

The ASCII NEON prototype fuses OR 0x20 into the NEON comparison.
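
The NEON prototype itself is not shown in this issue. As a portable stand-in for the same idea (fold inside the comparison, no tables), here is a SWAR sketch that folds 8 bytes at once in a 64-bit word. The function names are hypothetical, and this is not the benchmarked code.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: lowercase 'A'..'Z' in each of the 8 bytes of x,
   leaving all other bytes (including non-ASCII) unchanged. */
static inline uint64_t swar_fold8(uint64_t x) {
    const uint64_t ones = 0x0101010101010101ull;
    uint64_t low7 = x & (0x7Full * ones);           /* clear each byte's high bit */
    uint64_t ge_A = low7 + (0x80ull - 'A') * ones;  /* high bit set if byte >= 'A' */
    uint64_t gt_Z = low7 + (0x7Full - 'Z') * ones;  /* high bit set if byte >  'Z' */
    uint64_t ascii = ~x & (0x80ull * ones);         /* high bit set if byte <  0x80 */
    uint64_t upper = ascii & (ge_A ^ gt_Z);         /* 0x80 in bytes that are 'A'..'Z' */
    return x | (upper >> 2);                        /* 0x80 >> 2 == 0x20 */
}

/* Hypothetical helper: compare two 8-byte windows case-insensitively. */
static inline int fold8_equal(const char *a, const char *b) {
    uint64_t wa, wb;
    memcpy(&wa, a, 8); /* memcpy avoids unaligned and aliasing issues */
    memcpy(&wb, b, 8);
    return swar_fold8(wa) == swar_fold8(wb);
}
```

A vectorized version would do the same range test and OR on 16-byte registers; the arithmetic above only illustrates that the fold needs no lookup table.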

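In this benchmark sz_utf8_case_insensitive_find is about 4x slower at 128 B and about 2x slower
at 1 KB, which we attribute to the Unicode lookup overhead. From 16 KB upward the two are within
roughly 20% of each other, and the prototype is slower at 16 KB, 64 KB, and 1 MB.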

Proposed API

/**
 * @brief ASCII case-insensitive substring search.
 *
 * Equivalent to sz_find, but ASCII letters A-Z in both haystack and needle are
 * folded to a-z (OR 0x20 on letter bytes) during comparison. Only correct for
 * ASCII input (bytes 0x00-0x7F).
 * For Unicode input, use sz_utf8_case_insensitive_find instead.
 *
 * @param haystack The string to search in.
 * @param h_length Length of haystack in bytes.
 * @param needle The substring to search for.
 * @param n_length Length of needle in bytes.
 * @return Pointer to first case-insensitive match, or SZ_NULL_CHAR if not found.
 */
SZ_DYNAMIC sz_cptr_t sz_find_ascii_case_insensitive(
    sz_cptr_t haystack, sz_size_t h_length,
    sz_cptr_t needle, sz_size_t n_length);

Why not just use sz_utf8_case_insensitive_find?

  1. Performance: about 2-4x slower than the dedicated ASCII prototype for haystacks up to 1 KB
     in the benchmark above (short strings are the common case in RDBMS)
  2. An existing legacy code base already has a fast-path for ASCII

Can you contribute to the implementation?

  • I can contribute

Is your feature request specific to a certain interface?

C implementation

Contact Details

No response

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct
