feat: Add native FP8 model support with scale_inv dequantization
Add support for FP8-quantized models such as Qwen3-FP8. This enables loading and running FP8 models whose weights carry per-block scale factors.
Changes:
bumblebee.ex:
- Add :preserve_source_types option to load_model/2 to keep FP8 types
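A minimal usage sketch (the repository name is a placeholder, not something pinned down by this change):

```elixir
# Keep FP8 parameter types instead of casting them up at load time.
{:ok, model_info} =
  Bumblebee.load_model({:hf, "Qwen/Qwen3-8B-FP8"},
    preserve_source_types: true
  )
```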
pytorch_params.ex:
- Pass preserve_source_types through param loading pipeline
- Modify ensure_type/3 to preserve FP8 types when option is set
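Roughly, the check amounts to the following sketch (the {:f, 8} literal is an assumption about how Nx represents FP8; the real clause may differ):

```elixir
# Sketch: skip the cast when source types should be preserved and the tensor is
# FP8 (assuming Nx represents FP8 as {:f, 8}); otherwise cast as before.
defp ensure_type(tensor, type, opts) do
  if opts[:preserve_source_types] && Nx.type(tensor) == {:f, 8} do
    tensor
  else
    Nx.as_type(tensor, type)
  end
end
```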
layers.ex:
- Add fp8_aware_dense/3 layer that handles FP8 quantized weights
- Implement block-wise dequantization using the scale_inv parameter
- Fall back to identity scaling automatically when the weights are not FP8
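The dequantization itself is a tile-wise multiply; a minimal Nx sketch (module name, function name, and the 128 block size are illustrative assumptions, not the actual implementation):

```elixir
defmodule FP8Dequant do
  # Sketch: dequantize an FP8 weight matrix whose inverse scales are stored one
  # per (block x block) tile, as in block-wise FP8 checkpoints.
  def dequantize(weight, scale_inv, block \\ 128) do
    {rows, cols} = Nx.shape(weight)
    {row_blocks, col_blocks} = {div(rows, block), div(cols, block)}

    weight
    # upcast FP8 -> f32 before applying the scales
    |> Nx.as_type(:f32)
    # view the matrix as a grid of block x block tiles
    |> Nx.reshape({row_blocks, block, col_blocks, block})
    # broadcast one inverse scale over each tile
    |> Nx.multiply(Nx.reshape(scale_inv, {row_blocks, 1, col_blocks, 1}))
    |> Nx.reshape({rows, cols})
  end
end
```

In fp8_aware_dense/3 a step like this would run on the kernel before the matmul; with non-FP8 weights the identity-scaling fallback makes the layer behave like a plain dense.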
layers/transformer.ex:
- Add :attention_dense option to blocks/2, block/2, multi_head_attention/4
- Allows custom dense function for Q, K, V, and output projections
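A sketch of the option's contract (module and function names are hypothetical; the fallback pattern is the point):

```elixir
defmodule AttentionDenseSketch do
  # Sketch: each projection is built through the configured :attention_dense
  # function, falling back to a plain Axon.dense/3 when the option is not set,
  # so models that do not pass it keep their current behavior.
  def projection(hidden_state, units, opts) do
    dense_fun = opts[:attention_dense] || (&Axon.dense/3)
    dense_fun.(hidden_state, units, name: opts[:name])
  end
end
```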
text/qwen3.ex:
- Update decoder to use fp8_aware_dense for attention via attention_dense
- Update gated_ffn to use fp8_aware_dense for FFN layers
- Add scale_inv to params_mapping for all attention and FFN layers
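An illustrative params_mapping excerpt for a single projection (the scale_inv key shape and the weight_scale_inv name follow common block-wise FP8 checkpoints and may not match the final implementation exactly):

```elixir
# Illustrative excerpt for the query projection; the same pattern repeats for
# key, value, output, and the gated FFN layers.
%{
  "decoder.blocks.{n}.self_attention.query" => "model.layers.{n}.self_attn.q_proj",
  "decoder.blocks.{n}.self_attention.query_scale_inv" =>
    "model.layers.{n}.self_attn.q_proj.weight_scale_inv"
}
```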
The implementation supports both:
- Pre-dequantization: Convert FP8->F32 before loading
- Native FP8: Load FP8 weights directly, apply scale_inv at runtime
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>