Skip to content

Maybe a way to reduce binary size cost by CASE_FOLDING_SIMPLE #1347

@flyingcat

Description

@flyingcat

Since the max length of &[char] is 3, we could simplify the table to:

pub const CASE_FOLDING_SIMPLE: [[char; 4]; 2938] = [
    ['A', 'a', '\0', '\0'],
    ['B', 'b', '\0', '\0'],
    ['C', 'c', '\0', '\0'],
    ['D', 'd', '\0', '\0'],
    ['E', 'e', '\0', '\0'],
    ['F', 'f', '\0', '\0'],
    ['G', 'g', '\0', '\0'],
    ['H', 'h', '\0', '\0'],
    ['I', 'i', '\0', '\0'],
    ['J', 'j', '\0', '\0'],
    ['K', 'k', 'K', '\0'],
    // other items
];

A simple test shows a drop from 220.5kb to 179.5kb on Windows:

use std::env;
mod case_folding_simple;

#[allow(unused)]
fn search_old_version<'a>(data: &'a [(char, &[char])], c: char) -> Result<&'a [char], usize> {
    match data.binary_search_by_key(&c, |&(c1, _)| c1) {
        Ok(i) => Ok(data[i].1),
        Err(i) => Err(i)
    }
}

fn search(data: &[[char; 4]], c: char) -> Result<&[char], usize> {
    match data.binary_search_by_key(&c, |row| row[0]) {
        Ok(i) => {
            let row = &data[i];
            let variants_part = &row[1..];
            let actual_len = variants_part
                .iter()
                .position(|&x| x == '\0')
                .unwrap_or(3);
            Ok(&variants_part[..actual_len])
        }
        Err(i) => Err(i),
    }
}

fn main() {
    let data = case_folding_simple::CASE_FOLDING_SIMPLE;

    let mut c = 'S';
    let args: Vec<String> = env::args().collect();
    if args.len() > 1 {
        if let Some(ch) = args[1].chars().next() {
            c = ch;
        }
    }
    
    match search(&data, c) {
        Ok(v) => println!("found: {:?}", v),
        Err(_) => println!("not found.")
    }
}

p.s. Some code was written with AI assistance. I'm not familiar with Rust, although the idea should be clear enough.


After another digging, this could go further if not mind modifying more code , given that chars in every mapping row

  1. <= u16::MAX, or
  2. > u16::MAX and 1:1 mapping

Therefore, it can be divided into two tables: [[u16; 4]; N] and [[char; 2]; M]

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions