This file tracks the sequence for moving from ONNX validation to native .cellm VLM execution.
Implemented:
- `.cellm` header now stores:
  - `source_model_type`
  - `text_tensor_prefix`
  - `vision_tensor_prefix`
  - `projector_tensor_prefix`
  - `source_text_config`
  - `source_vision_config`
  - `source_projector_config`
- Converter fills these fields from HF `config.json` and tensor-name inspection.
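
As a rough illustration of the header fields listed above, the sketch below models them as a plain Rust struct and shows one way a prefix could be inferred by tensor-name inspection. The struct name `VlmHeaderMeta`, the field types, the helper `detect_prefix`, and the idea of keeping sub-configs as raw JSON text are all assumptions made for this example, not the actual `.cellm` header layout or converter code.

```rust
/// Hypothetical in-memory view of the VLM metadata in a `.cellm` header.
/// Field names follow the list above; the types are assumptions.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct VlmHeaderMeta {
    pub source_model_type: Option<String>,
    pub text_tensor_prefix: Option<String>,
    pub vision_tensor_prefix: Option<String>,
    pub projector_tensor_prefix: Option<String>,
    // Sub-configs are kept as raw JSON text in this sketch.
    pub source_text_config: Option<String>,
    pub source_vision_config: Option<String>,
    pub source_projector_config: Option<String>,
}

impl VlmHeaderMeta {
    /// True when the header names both a text and a vision tensor prefix.
    pub fn is_multimodal(&self) -> bool {
        self.text_tensor_prefix.is_some() && self.vision_tensor_prefix.is_some()
    }
}

/// Infer a prefix by tensor-name inspection: return the first candidate
/// that at least one tensor name starts with. Candidates are tried in
/// order, so list the most specific prefix first.
pub fn detect_prefix<'a>(names: &[&str], candidates: &[&'a str]) -> Option<&'a str> {
    candidates
        .iter()
        .copied()
        .find(|c| names.iter().any(|n| n.starts_with(*c)))
}
```

Ordering the candidates from most to least specific keeps a generic prefix such as `model.` from shadowing a multimodal one such as `model.text_model.`.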
Implemented:
- `convert` now routes model type from `text_config.model_type` for multimodal wrappers.
- SmolVLM conversion now succeeds end-to-end with:

      cargo run --release --bin convert -- \
        --input models/hf/smolvlm-256m-instruct \
        --output models/smolvlm-256m.cellm \
        --dtype f16

Implemented:
- `infer` and SDK choose text runner from `source_text_config.model_type` when present.
- Llama runner now supports multimodal text tensor naming (`model.text_model.*`) and prefixed layouts.
- `vlm-infer` now supports `--decoder-backend cellm --cellm-model <path>`:
  - vision encoder stays ONNX
  - decoder runs through native `.cellm` Llama path (including int8 quantized `.cellm`)
- `vlm-infer` now also supports `--vision-backend cellm` (experimental):
  - loads vision + projector tensors from `.cellm`
  - runs patch embedding + full ViT encoder blocks + post layernorm + connector projection in Rust
  - computes image features (`[64, hidden]`) without ONNX model execution
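
The runner selection and prefixed tensor naming described above might look roughly like the following. This is a minimal sketch under stated assumptions: the `TextRunner` enum, the three function names, and the fallback to the top-level model type when no text config is present are hypothetical and are not taken from the `infer` or SDK sources.

```rust
/// Hypothetical set of text runners; only the Llama runner is named in
/// this document, so everything else maps to `Unsupported` here.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TextRunner {
    Llama,
    Unsupported,
}

/// Prefer the text config's model type when present. Falling back to the
/// top-level model type otherwise is an assumption of this sketch.
pub fn effective_model_type<'a>(
    source_model_type: &'a str,
    text_model_type: Option<&'a str>,
) -> &'a str {
    text_model_type.unwrap_or(source_model_type)
}

/// Map a model type string to a runner.
pub fn select_text_runner(model_type: &str) -> TextRunner {
    match model_type {
        "llama" => TextRunner::Llama,
        _ => TextRunner::Unsupported,
    }
}

/// Build the stored tensor name for a prefixed layout, for example
/// `model.text_model.` + `layers.0.mlp.up_proj.weight`.
pub fn prefixed_tensor_name(prefix: Option<&str>, canonical: &str) -> String {
    match prefix {
        Some(p) => format!("{}{}", p, canonical),
        None => canonical.to_string(),
    }
}
```

Resolving the model type once and passing the prefix explicitly keeps the runner itself unaware of whether the weights came from a plain text checkpoint or a multimodal wrapper.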
Current limitation:
- Native vision runtime now uses SIMD-optimized BLAS matmuls on macOS and is much faster than the earlier scalar implementation.
- ONNX Runtime remains faster; Metal-native fused kernels are still pending.
Implemented (working):
- `--quantize-int8-symmetric` in `convert` for Llama text stacks.
- Weight-only per-row symmetric int8 + f16 row scales.
- Runtime dequant support in Llama linear layers and logits source.
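
To make the per-row symmetric scheme above concrete, here is a minimal sketch of quantizing one weight row and dequantizing it on the fly in a dot product. The names `QuantRow`, `quantize_row`, `dequantize_row`, and `dot_dequant` are hypothetical, and the scale is held as `f32` so the example needs only the standard library, whereas the converter stores f16 row scales.

```rust
/// One quantized weight row: int8 values plus a single scale.
/// Assumption: the scale is f32 here; the document stores it as f16.
pub struct QuantRow {
    pub q: Vec<i8>,
    pub scale: f32,
}

/// Symmetric quantization: scale = max|w| / 127, q = round(w / scale).
/// An all-zero row gets scale 1.0 to avoid dividing by zero.
pub fn quantize_row(row: &[f32]) -> QuantRow {
    let max_abs = row.iter().fold(0.0f32, |m, &x| m.max(x.abs()));
    let scale = if max_abs > 0.0 { max_abs / 127.0 } else { 1.0 };
    let q = row
        .iter()
        .map(|&x| (x / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    QuantRow { q, scale }
}

/// Recover approximate f32 weights: w ~= q * scale.
pub fn dequantize_row(row: &QuantRow) -> Vec<f32> {
    row.q.iter().map(|&v| v as f32 * row.scale).collect()
}

/// Dot product of a quantized row with f32 activations. The scale is
/// applied once after accumulation instead of once per element.
pub fn dot_dequant(row: &QuantRow, x: &[f32]) -> f32 {
    let acc: f32 = row.q.iter().zip(x).map(|(&w, &xi)| w as f32 * xi).sum();
    acc * row.scale
}
```

Because the scheme is symmetric there is no zero point to store, and each output row of a linear layer needs only one extra multiply at the end of its accumulation.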

Next steps:
- Move native vision kernels from `vlm-infer` into reusable backend/kernel crates.
- Add optimized kernels (SIMD/Metal) for ViT attention + MLP.
- Add additional source formats:
  - PyTorch `.bin`/`.pt`
  - Flax/JAX
  - ONNX -> `.cellm`