feat: FP8 ViT Attention w/ FlashInfer #1660

Merged: AlpinDale merged 1 commit into main from feat/fp8-vit-attn on Apr 28, 2026

Conversation

@AlpinDale

Collaborator

Signed-off-by: AlpinDale <alpindale@gmail.com>
@AlpinDale AlpinDale merged commit 1241957 into main Apr 28, 2026
1 check failed
@AlpinDale AlpinDale deleted the feat/fp8-vit-attn branch April 28, 2026 09:05

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7cec407bab

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +417 to +419
mm_cfg = get_multimodal_config()
scale_path = mm_cfg.mm_encoder_fp8_scale_path if mm_cfg is not None else None
if scale_path is None:

P1: Load static FP8 scales without relying on global config context

process_weights_after_loading fetches mm_encoder_fp8_scale_path via get_multimodal_config(), but this method is invoked after model init, when set_current_aphrodite_config is no longer active (see BaseModelLoader.load_model, which calls process_weights_after_loading after the init context has exited). On that normal path mm_cfg is None, so scale_path becomes None even when --mm-encoder-fp8-scale-path is set. As a result, static scales are never loaded and the scale buffers stay at 1.0. Because _fp8_dynamic_scale is already False, the scales are never updated dynamically either.
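One possible way to address this is to read the scale path while the config context is still active (in the constructor) and store it on the module, so the post-load hook no longer depends on global state. The sketch below only illustrates that pattern. `MultiModalConfig`, `set_current_config`, and the two attention classes are hypothetical stand-ins written for this example, not the project's actual classes, and the real fix in the repository may look different.

```python
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Optional


@dataclass
class MultiModalConfig:
    # Hypothetical stand-in for the real multimodal config object.
    mm_encoder_fp8_scale_path: Optional[str] = None


_current_config: Optional[MultiModalConfig] = None


@contextmanager
def set_current_config(cfg: MultiModalConfig):
    # Hypothetical stand-in for a "current config" context that is only
    # active while the model is being constructed.
    global _current_config
    previous = _current_config
    _current_config = cfg
    try:
        yield
    finally:
        _current_config = previous


def get_multimodal_config() -> Optional[MultiModalConfig]:
    return _current_config


class FragileAttention:
    """Looks up the config inside the post-load hook (the reported pattern)."""

    def process_weights_after_loading(self) -> Optional[str]:
        mm_cfg = get_multimodal_config()
        return mm_cfg.mm_encoder_fp8_scale_path if mm_cfg is not None else None


class RobustAttention:
    """Captures the scale path at init time, while the context is active."""

    def __init__(self) -> None:
        mm_cfg = get_multimodal_config()
        self._fp8_scale_path = (
            mm_cfg.mm_encoder_fp8_scale_path if mm_cfg is not None else None
        )

    def process_weights_after_loading(self) -> Optional[str]:
        return self._fp8_scale_path


cfg = MultiModalConfig(mm_encoder_fp8_scale_path="scales.json")
with set_current_config(cfg):
    fragile = FragileAttention()
    robust = RobustAttention()

# The context has exited here, mirroring a loader that runs the hook after init.
fragile_path = fragile.process_weights_after_loading()
robust_path = robust.process_weights_after_loading()
```

In this sketch `fragile_path` ends up as `None` because the global lookup happens too late, while `robust_path` still holds the configured path.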

Comment on lines +382 to +384
self.fp8_enabled = True
self._fp8_dynamic_scale = mm_cfg.mm_encoder_fp8_scale_path is None
self.fp8_quant = QuantFP8(static=True, group_shape=GroupShape.PER_TENSOR)

P2: Require FlashInfer backend when enabling FP8 ViT attention

This branch enables FP8 whenever mm_encoder_attn_dtype == "fp8", but it does not ensure that self.attn_backend is FLASHINFER. With default CUDA backend selection, ViT prefers FLASH_ATTN before FLASHINFER, so users can set --mm-encoder-attn-dtype fp8 and still execute the _forward_fa/_forward_sdpa paths with no FP8 quantization at all. The result is a silent config mismatch: the FP8 option appears enabled but is effectively ignored unless the backend is separately forced to FLASHINFER.
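A guard along these lines could make the mismatch loud instead of silent. This is a minimal sketch that assumes failing fast is the desired behavior; the project might instead prefer to force the backend or log a warning. The `Backend` enum and `resolve_fp8_enabled` helper are hypothetical names introduced for illustration, not part of the codebase.

```python
from enum import Enum


class Backend(Enum):
    # Hypothetical stand-in for the project's attention backend enum.
    FLASH_ATTN = "FLASH_ATTN"
    FLASHINFER = "FLASHINFER"
    SDPA = "SDPA"


def resolve_fp8_enabled(attn_dtype: str, backend: Backend) -> bool:
    """Return True only when FP8 is requested and the backend can honor it.

    Raises ValueError when FP8 is requested on a backend other than
    FLASHINFER, so the option is never silently ignored.
    """
    if attn_dtype != "fp8":
        return False
    if backend is not Backend.FLASHINFER:
        raise ValueError(
            "FP8 ViT attention was requested, but the selected backend is "
            f"{backend.value}; FP8 is only applied on the FLASHINFER path."
        )
    return True
```

Called with `("fp8", Backend.FLASHINFER)` the helper returns `True`; called with `("fp8", Backend.FLASH_ATTN)` it raises, which surfaces the misconfiguration at construction time rather than at inference time.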
