This page tracks practical runtime numbers for cellm CLI tools.
All numbers below are from local runs on March 29, 2026 and are for reference only.
## tools/bench/run_llm_backend_matrix.sh

Outputs:
- Markdown summary table: `docs/benchmarks/runs/llm_backend_matrix_<timestamp>_summary.md`
- Raw run CSV: `docs/benchmarks/runs/llm_backend_matrix_<timestamp>.csv`

Useful overrides:

```sh
PASSES=3 GEN_TOKENS=8 PROMPT_TEXT="hi" tools/bench/run_llm_backend_matrix.sh

# Skip rebuild if infer is already built
BUILD_INFER=0 tools/bench/run_llm_backend_matrix.sh

# Restrict to one backend
BACKENDS="cpu" tools/bench/run_llm_backend_matrix.sh
```

Note: in restricted/sandboxed shells, Metal may be unavailable and report n/a for Metal rows.
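Since each run writes a timestamped summary file, picking out the newest one by hand gets tedious. A small helper sketch, assuming the default output directory above (the function name is hypothetical, not part of the bench scripts):

```sh
# Print the path of the most recently written matrix summary.
# Relies only on file mtime ordering via `ls -t`.
latest_matrix_summary() {
  ls -t docs/benchmarks/runs/llm_backend_matrix_*_summary.md 2>/dev/null | head -n 1
}
```

Usage: `cat "$(latest_matrix_summary)"` to view the latest summary table.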
## tools/bench/run_gemma4_mobile_profile.sh

Outputs:
- Markdown summary table: `docs/benchmarks/runs/gemma4_mobile_profile_<timestamp>_summary.md`
- Raw run CSV: `docs/benchmarks/runs/gemma4_mobile_profile_<timestamp>.csv`

Useful overrides:

```sh
# Fast local smoke (CPU only)
PASSES=1 BACKENDS="cpu" GEN_TOKENS=16 tools/bench/run_gemma4_mobile_profile.sh

# Real Metal run (outside restricted sandbox)
PASSES=1 BACKENDS="metal" GEN_TOKENS=16 tools/bench/run_gemma4_mobile_profile.sh

# Custom model/tokenizer
MODEL_PATH=models/gemma-4-E2B-it-int4-aggr-v5.cellmd \
TOKENIZER_PATH=models/gemma-4-E2B-it/tokenizer.json \
tools/bench/run_gemma4_mobile_profile.sh
```

Direct `infer` run:

```sh
./target/release/infer \
  --model models/smollm2-135m-int8.cellm \
  --tokenizer models/hf/smollm2-135m/tokenizer.json \
  --prompt "Hello, how are you?" \
  --chat \
  --gen 16 \
  --backend cpu
```

Direct `vlm-infer` run:

```sh
./target/release/vlm-infer \
  --model-dir models/hf/smolvlm-256m-instruct \
  --onnx-variant fp16 \
  --image models/test_images/rococo.jpg \
  --prompt "Describe this image." \
  --max-new-tokens 16 \
  --backend cpu
```

| Tool | Run | Vision Time | Prefill | Decode |
|---|---|---|---|---|
| infer | smollm2-135m.cellm, --chat, --gen 8 | N/A | 12 toks in 2.35s | 8 toks in 1.62s |
| infer | smollm2-135m-int8.cellm, --chat, --gen 16 | N/A | 12 toks in 2.38s | 16 toks in 3.36s |
| vlm-infer ONNX fp16 | rococo.jpg, --max-new-tokens 16 | [64,576] in 0.99s | N/A | 16 toks in 1.54s |
| vlm-infer ONNX quantized | rococo.jpg, --max-new-tokens 24 | [64,576] in 1.44s | N/A | 18 toks in 0.35s |
| vlm-infer native vision + ONNX decoder | --vision-backend cellm --decoder-backend onnx | [64,576] in 5.96s | N/A | 16 toks in 4.15s |
| vlm-infer native vision + native decoder | --vision-backend cellm --decoder-backend cellm | [64,576] in 5.65s | N/A | 24 toks in 18.39s |
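The table reports raw "N toks in S s" pairs; to compare rows as throughput, a small sketch (the `toks_per_sec` helper is hypothetical, not part of the bench scripts):

```sh
# Hypothetical helper: convert a "N toks in S s" pair to tokens/sec.
toks_per_sec() {
  awk -v t="$1" -v s="$2" 'BEGIN { printf "%.2f\n", t / s }'
}

toks_per_sec 16 3.36   # decode rate of the int8 --gen 16 row: prints 4.76
```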
| Tool | Backend Arg | Host Log | Vision Time | Prefill | Decode |
|---|---|---|---|---|---|
| infer (smollm2-135m-int8, --gen 16) | --backend cpu | Backend: cpu (macos/aarch64) | N/A | 12 toks in 1.77s | 16 toks in 2.07s |
| infer (smollm2-135m-int8, --gen 16) | --backend metal | Backend: metal (smoke ok) | N/A | 12 toks in 1.75s | 16 toks in 2.07s |
| vlm-infer (fp16, rococo.jpg, --max-new-tokens 16) | --backend cpu | Backend: cpu (macos/aarch64) | [64,576] in 2.28s | N/A | 16 toks in 2.37s |
| vlm-infer (fp16, rococo.jpg, --max-new-tokens 16) | --backend metal | Backend: metal (smoke ok) | [64,576] in 1.85s | N/A | 16 toks in 2.56s |
- Metal support is currently validated with smoke + backend selection.
- Full forward kernels are still being expanded, so CPU paths remain the main execution path for several operators.
- For VLM quality, `--split-image` helps caption relevance but increases latency.