Benchmarks

This page tracks practical runtime numbers for cellm CLI tools.

All numbers below come from local runs on March 29, 2026 and are for reference only.

How to run

CPU/Metal LLM matrix (automated)

tools/bench/run_llm_backend_matrix.sh

Outputs:

  • Markdown summary table: docs/benchmarks/runs/llm_backend_matrix_<timestamp>_summary.md
  • Raw run CSV: docs/benchmarks/runs/llm_backend_matrix_<timestamp>.csv
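Because each run writes a new timestamped file, it can be handy to pick up the newest summary automatically. A minimal sketch, assuming the output paths listed above and selecting by file modification time rather than by parsing the timestamp; `latest_summary` is a hypothetical helper, not part of the cellm tooling:

```shell
# latest_summary PREFIX [RUNS_DIR]
# Prints the most recently modified "<PREFIX>_<timestamp>_summary.md".
# Hypothetical helper; RUNS_DIR defaults to the output directory listed above.
latest_summary() {
  prefix="$1"
  runs_dir="${2:-docs/benchmarks/runs}"
  # Expand the glob into the positional parameters.
  set -- "$runs_dir"/"$prefix"_*_summary.md
  # If nothing matched, the pattern stays literal: print nothing and return.
  [ -e "$1" ] || return 0
  # ls -t sorts newest first.
  ls -t "$@" | head -n 1
}

latest_summary llm_backend_matrix
```

The same call with a different prefix should work for any script that follows this naming scheme.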

Useful overrides:

PASSES=3 GEN_TOKENS=8 PROMPT_TEXT="hi" tools/bench/run_llm_backend_matrix.sh
# Skip rebuild if infer is already built
BUILD_INFER=0 tools/bench/run_llm_backend_matrix.sh
# Restrict to one backend
BACKENDS="cpu" tools/bench/run_llm_backend_matrix.sh

Note: in restricted or sandboxed shells, Metal may be unavailable, in which case the Metal rows report n/a.

Gemma4 mobile production profile (automated)

tools/bench/run_gemma4_mobile_profile.sh

Outputs:

  • Markdown summary table: docs/benchmarks/runs/gemma4_mobile_profile_<timestamp>_summary.md
  • Raw run CSV: docs/benchmarks/runs/gemma4_mobile_profile_<timestamp>.csv

Useful overrides:

# Fast local smoke (CPU only)
PASSES=1 BACKENDS="cpu" GEN_TOKENS=16 tools/bench/run_gemma4_mobile_profile.sh
# Real Metal run (outside restricted sandbox)
PASSES=1 BACKENDS="metal" GEN_TOKENS=16 tools/bench/run_gemma4_mobile_profile.sh
# Custom model/tokenizer
MODEL_PATH=models/gemma-4-E2B-it-int4-aggr-v5.cellmd \
TOKENIZER_PATH=models/gemma-4-E2B-it/tokenizer.json \
tools/bench/run_gemma4_mobile_profile.sh

Text benchmark (infer)

./target/release/infer \
  --model models/smollm2-135m-int8.cellm \
  --tokenizer models/hf/smollm2-135m/tokenizer.json \
  --prompt "Hello, how are you?" \
  --chat \
  --gen 16 \
  --backend cpu

VLM benchmark (vlm-infer)

./target/release/vlm-infer \
  --model-dir models/hf/smolvlm-256m-instruct \
  --onnx-variant fp16 \
  --image models/test_images/rococo.jpg \
  --prompt "Describe this image." \
  --max-new-tokens 16 \
  --backend cpu

Baseline runs

| Tool | Run | Vision Time | Prefill | Decode |
| --- | --- | --- | --- | --- |
| infer | smollm2-135m.cellm, --chat, --gen 8 | N/A | 12 toks in 2.35s | 8 toks in 1.62s |
| infer | smollm2-135m-int8.cellm, --chat, --gen 16 | N/A | 12 toks in 2.38s | 16 toks in 3.36s |
| vlm-infer ONNX fp16 | rococo.jpg, --max-new-tokens 16 | [64,576] in 0.99s | N/A | 16 toks in 1.54s |
| vlm-infer ONNX quantized | rococo.jpg, --max-new-tokens 24 | [64,576] in 1.44s | N/A | 18 toks in 0.35s |
| vlm-infer native vision + ONNX decoder | --vision-backend cellm --decoder-backend onnx | [64,576] in 5.96s | N/A | 16 toks in 4.15s |
| vlm-infer native vision + native decoder | --vision-backend cellm --decoder-backend cellm | [64,576] in 5.65s | N/A | 24 toks in 18.39s |
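
The Prefill and Decode cells give a token count and a wall-clock duration, so throughput is simply tokens divided by seconds. A minimal sketch, assuming the cell text keeps the `<N> toks in <S>s` shape used in the table; `toks_per_sec` is a hypothetical helper, not a cellm command:

```shell
# toks_per_sec "<N> toks in <S>s"
# Prints tokens per second with two decimals (hypothetical helper).
toks_per_sec() {
  printf '%s\n' "$1" | awk '{
    s = $4               # fourth field, e.g. "3.36s"
    sub(/s$/, "", s)     # strip the trailing unit
    s += 0               # force numeric so the guard below compares numbers
    if (s > 0) printf "%.2f\n", $1 / s
  }'
}

toks_per_sec "8 toks in 1.62s"    # smollm2-135m decode
toks_per_sec "16 toks in 3.36s"   # smollm2-135m-int8 decode
```

The rows use different token counts and come from single local runs, so the resulting rates are only a rough basis for comparison.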

CPU vs Metal request runs

| Tool | Backend Arg | Host Log | Vision Time | Prefill | Decode |
| --- | --- | --- | --- | --- | --- |
| infer (smollm2-135m-int8, --gen 16) | --backend cpu | Backend: cpu (macos/aarch64) | N/A | 12 toks in 1.77s | 16 toks in 2.07s |
| infer (smollm2-135m-int8, --gen 16) | --backend metal | Backend: metal (smoke ok) | N/A | 12 toks in 1.75s | 16 toks in 2.07s |
| vlm-infer (fp16, rococo.jpg, --max-new-tokens 16) | --backend cpu | Backend: cpu (macos/aarch64) | [64,576] in 2.28s | N/A | 16 toks in 2.37s |
| vlm-infer (fp16, rococo.jpg, --max-new-tokens 16) | --backend metal | Backend: metal (smoke ok) | [64,576] in 1.85s | N/A | 16 toks in 2.56s |
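
To compare the two backends, one option is to express each pair of durations as a ratio of CPU time to Metal time, where values above 1 mean Metal was faster. A minimal sketch using the vlm-infer durations from the table; `speedup` is a hypothetical helper, not part of cellm:

```shell
# speedup BASE_SECONDS CANDIDATE_SECONDS
# Prints BASE / CANDIDATE; a value above 1 means the candidate was faster.
speedup() {
  awk -v base="$1" -v cand="$2" 'BEGIN {
    if (cand + 0 > 0) printf "%.2fx\n", base / cand
  }'
}

speedup 2.28 1.85   # vlm-infer vision encode: CPU vs Metal
speedup 2.37 2.56   # vlm-infer decode: CPU vs Metal
```

With these inputs the vision encode ratio comes out at about 1.23x and the decode ratio at about 0.93x. Both come from single runs, so treat them as indicative rather than conclusive.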

Notes

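  • Metal support is currently validated by a smoke test plus backend selection.
  • Full forward kernels are still being expanded, so CPU paths remain the main execution path for several operators. This is consistent with the near-identical CPU and Metal timings for infer in the table above.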
  • For VLM quality, --split-image helps caption relevance but increases latency.