tokenizers/
  tokenizers/        # Core Rust library (the main crate)
    src/             # Library source code
    benches/         # Criterion benchmarks
    Makefile         # Build, test, bench, lint targets (with auto data download)
  bindings/
    python/          # PyO3 Python bindings
      Makefile       # Python-specific test and lint targets
    node/            # Node.js bindings
  docs/              # Documentation

The core Rust crate lives in tokenizers/ (not the repo root). Most make and
cargo commands need to be run from that subdirectory.
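
Because the crate directory and the repository share the name tokenizers/, it is easy to run a command one level too high. The sketch below shows one way to locate the crate directory from anywhere inside a checkout. The function name find_crate_dir is hypothetical, and the check assumes the crate's Cargo.toml sits directly in tokenizers/.

```shell
# Hypothetical helper: walk up from a starting directory (default: the current
# one) until a child named tokenizers/ containing Cargo.toml is found, then
# print that child's path. Assumes the crate manifest is tokenizers/Cargo.toml.
find_crate_dir() {
    _fcd_dir=$(cd "${1:-.}" && pwd -P) || return 1
    while :; do
        if [ -f "$_fcd_dir/tokenizers/Cargo.toml" ]; then
            printf '%s\n' "$_fcd_dir/tokenizers"
            return 0
        fi
        # Stop at the filesystem root without finding the crate.
        if [ "$_fcd_dir" = "/" ]; then
            return 1
        fi
        _fcd_dir=$(dirname "$_fcd_dir")
    done
}
```

With this defined, `cd "$(find_crate_dir)"` followed by make test should work from the repo root, from bindings/python, or from inside tokenizers/src.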
- Rust (stable): install via rustup
- Python 3.9+: for the Python bindings
- wget: used by the Makefile to download test/benchmark data (brew install wget on macOS)
- maturin: for building the Python bindings (pip install maturin)

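
Before the first build, it can help to confirm that the tools listed above are actually on PATH. The following is a minimal sketch, not part of the repository: the function name missing_tools is hypothetical, and it only checks that each command exists, not which version is installed.

```shell
# Hypothetical helper: print each named tool that is not found on PATH.
# Returns non-zero if at least one tool is missing.
missing_tools() {
    _mt_status=0
    for _mt_tool in "$@"; do
        if ! command -v "$_mt_tool" >/dev/null 2>&1; then
            printf '%s\n' "$_mt_tool"
            _mt_status=1
        fi
    done
    return "$_mt_status"
}

# Example usage with the tool names from the list above:
#   missing_tools rustup cargo python3 wget maturin
# The Python version requirement can be checked separately:
#   python3 -c 'import sys; sys.exit(sys.version_info < (3, 9))'
```

An empty output and a zero exit status mean every named tool was found. The Python one-liner exits non-zero when the interpreter is older than 3.9.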
git clone https://github.com/huggingface/tokenizers.git
cd tokenizers
# Create a virtualenv (using uv, venv, or your preferred tool)
python -m venv .venv
source .venv/bin/activate

cd tokenizers
make test # downloads test data automatically via wget, then runs cargo test

The first run downloads model files and corpora into tokenizers/data/. These
files are gitignored.

cd bindings/python
pip install -e ".[dev]" # install in editable mode with test deps (builds via maturin)
make test # run pytest, then cargo test

If you need to rebuild after Rust changes without reinstalling:

pip install maturin # if not already installed
maturin develop # fast rebuild of the extension module

cd tokenizers
make bench # downloads benchmark data if needed, then runs cargo bench

Benchmark results are stored in target/criterion/ for comparison across runs.
To run a specific benchmark:
cargo bench --bench bpe_benchmark
cargo bench --bench bert_benchmark
cargo bench --bench llama3_benchmark
cargo bench --bench layout_benchmark

If you use uv to manage Python, cargo test for the Python bindings may fail with:

Library not loaded: /install/lib/libpython3.X.dylib

This is a known issue with uv's prebuilt Python distributions: the shared
library has a broken install name on macOS. The bindings/python/Makefile
detects uv and applies the workaround automatically when you use make test.
If you run cargo test directly, set these environment variables first:

export DYLD_FALLBACK_LIBRARY_PATH="$(python3 -c 'import sysconfig; print(sysconfig.get_config_var("LIBDIR"))')"
export PYTHONHOME="$(python3 -c 'import sys; print(sys.base_prefix)')"
cargo test --no-default-features

The benchmarks expect data files in tokenizers/data/. These are not checked
into the repository. Running make bench or make test from the tokenizers/
directory will download them automatically. If you prefer to run cargo bench
directly, download the data first:
cd tokenizers
make data/big.txt data/gpt2-vocab.json data/gpt2-merges.txt # etc.

Or download all benchmark data at once:

make bench # will fetch everything before running benchmarks

# Rust core
cd tokenizers
make lint # rustfmt --check + clippy
# Python bindings
cd bindings/python
make style # auto-format
make check-style # check formatting

# Rust core: specific test
cd tokenizers
cargo test test_name
# Python bindings: specific test file
cd bindings/python
python -m pytest tests/bindings/test_tokenizer.py -v -k "test_name"
# Python bindings: Rust-side tests only
cargo test --no-default-features

For performance work, samply is useful for generating CPU profiles:

cd tokenizers
cargo build --release --example my_bench
samply record ./target/release/examples/my_bench

Profile from Python to see the full stack including PyO3 overhead:

samply record python my_script.py
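
Cargo's release profile omits debug info by default, so profiles of a release build may lack inlined frames and source line information. The sketch below wraps the two commands above and turns debug info on for a single build through Cargo's CARGO_PROFILE_RELEASE_DEBUG environment variable. The function name profile_example and the DRY_RUN switch are hypothetical conveniences, and my_bench is the same placeholder example name used above.

```shell
# Hypothetical helper: build an example in release mode with debug info,
# then record a CPU profile of it with samply.
# Set DRY_RUN=1 to print the commands instead of running them.
profile_example() {
    _pe_example=$1
    _pe_run=
    if [ "${DRY_RUN:-0}" = "1" ]; then
        _pe_run=echo
    fi
    # CARGO_PROFILE_RELEASE_DEBUG overrides the release profile's debug
    # setting for this invocation only; Cargo.toml is left unchanged.
    $_pe_run env CARGO_PROFILE_RELEASE_DEBUG=true \
        cargo build --release --example "$_pe_example" &&
    $_pe_run samply record "./target/release/examples/$_pe_example"
}

# Example usage, run from the tokenizers/ crate directory:
#   profile_example my_bench
```

Running with DRY_RUN=1 first is a cheap way to confirm the example name and paths before starting a full release build.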