Add mixed-attention Core ML mask support for stateful generation #331
Skyline-23 wants to merge 3 commits into
- add support for fullAttentionMask and slidingAttentionMask model inputs in the stateful generation path
- derive sliding window masks from model metadata or config when needed
- add regression tests for additive full and sliding attention mask construction

- add fullAttentionMask and slidingAttentionMask handling to the stateful generation path
- resolve sliding window size from model metadata or config for mixed-attention models
- add regression tests for additive full and sliding attention mask construction

- factor stateful generation input assembly into a reusable helper
- verify full and sliding attention mask keys, shapes, and additive values
- keep single-mask generation behavior unchanged while covering mixed-attention inputs
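The "additive full and sliding attention mask construction" mentioned in these commits can be illustrated with a minimal sketch in plain Swift. This is not the PR's code: the function name `additiveMask`, the `[[Float]]` representation, the use of `-Float.infinity` for blocked positions, and the convention that a window of size `w` covers the current token plus the `w - 1` tokens before it are all assumptions made for illustration.

```swift
/// Builds an additive attention mask of shape [queryLength, keyLength].
/// Allowed positions hold 0; blocked positions hold `maskedValue`
/// (assumed here to be -infinity; the real runtime may use another value).
/// `window == nil` yields a full causal mask. Otherwise each query may only
/// attend to the `window` most recent keys, including its own position.
func additiveMask(
    queryLength: Int,
    keyLength: Int,
    window: Int? = nil,
    maskedValue: Float = -Float.infinity
) -> [[Float]] {
    precondition(queryLength > 0 && keyLength >= queryLength)
    // The queries are assumed to be the last `queryLength` positions
    // of the key sequence (prefill: equal lengths; decode: queryLength == 1).
    let offset = keyLength - queryLength
    return (0..<queryLength).map { q -> [Float] in
        let position = offset + q
        return (0..<keyLength).map { k -> Float in
            let causal = k <= position
            let inWindow = window.map { position - k < $0 } ?? true
            return (causal && inWindow) ? 0 : maskedValue
        }
    }
}

// One decode step with 4 cached positions.
let full = additiveMask(queryLength: 1, keyLength: 4)
let sliding = additiveMask(queryLength: 1, keyLength: 4, window: 2)
```

With these assumptions, `full` leaves all four keys visible, while `sliding` blocks the two oldest keys and keeps the two most recent ones.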
pcuenca left a comment
Very interesting and cool PR @Skyline-23! I won't be able to properly test and review it until the end of the week. Meanwhile, a couple of questions:
- The converted example model seems to be using float32 instead of float16 (because of this line, and because the repo takes ~16 GB). Did you try to convert to float16? Did you try any quantization options?
- Are you using or planning to use this Core ML model in a downstream app?
Thanks a lot for the contribution!
@pcuenca Sorry for the late reply! That's fine, please take your time with the review.
I gave this a local try on macOS 26 / M-series. PR builds cleanly against current […]. I also pulled […]. A greedy decode of 8 tokens after a 12-token prefill produces coherent output (≈ 2.3 tok/s, […]).

Two small observations from testing:

1. The convert script doesn't write […]

       "co.huggingface.exporters.sliding_window": str(text_config.sliding_window),

2. Inside […]

       if includeSlidingAttentionMask {
           guard let slidingWindow else {
               fatalError(...)
           }

   combined with […]

If a real-model integration fixture would be useful later, I'd be happy to convert and publish a small mixed-attention variant under my own namespace as a possible CI fixture; no need to block this PR on it.
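For readers following the sliding-window discussion, here is a minimal sketch of resolving the window size from model metadata first and a config second. Only the metadata key `co.huggingface.exporters.sliding_window` comes from the comment above. The function name `resolveSlidingWindow`, the `SlidingWindowError` type, the `sliding_window` config key, and the choice to throw instead of calling `fatalError` are assumptions for illustration, not the PR's implementation and not necessarily what the review proposed.

```swift
/// Hypothetical error type for a missing or invalid sliding window size.
enum SlidingWindowError: Error {
    case missing
}

/// Looks up the sliding window size, preferring exporter metadata
/// (stored as a string) and falling back to an integer config entry.
/// The config key "sliding_window" is an assumption for this sketch.
func resolveSlidingWindow(
    metadata: [String: String],
    config: [String: Any]
) throws -> Int {
    if let raw = metadata["co.huggingface.exporters.sliding_window"],
       let value = Int(raw), value > 0 {
        return value
    }
    if let value = config["sliding_window"] as? Int, value > 0 {
        return value
    }
    // Surfacing an error lets the caller decide how to fail,
    // rather than trapping inside the mask-building code.
    throw SlidingWindowError.missing
}

let fromMetadata = try? resolveSlidingWindow(
    metadata: ["co.huggingface.exporters.sliding_window": "512"],
    config: [:]
)
let fromConfig = try? resolveSlidingWindow(metadata: [:], config: ["sliding_window": 1024])
let unresolved = try? resolveSlidingWindow(metadata: [:], config: [:])
```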
What
Add support for stateful Core ML language models that require multiple attention masks during generation.
Why
The current runtime only handles attentionMask / causalMask, which is not sufficient for mixed-attention Core ML exports that need separate masks for different layer types.
This change allows the stateful generation path to populate:
- fullAttentionMask
- slidingAttentionMask

when those inputs are present in the Core ML model description.
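The presence check can be sketched as follows, assuming the model's input names have already been collected into a set (with Core ML these could come from the keys of `modelDescription.inputDescriptionsByName`). The `MaskPlan` type and `planMasks` function are hypothetical names for illustration. The four input names are the ones mentioned in this PR.

```swift
/// Hypothetical summary of which mask inputs a generation step must fill.
struct MaskPlan: Equatable {
    var inputs: [String]
    var needsSlidingWindow: Bool
}

/// Mask input names mentioned in this PR, in a fixed order.
let knownMaskInputs = [
    "attentionMask",
    "causalMask",
    "fullAttentionMask",
    "slidingAttentionMask",
]

/// Selects only the mask inputs the model actually declares, so that
/// single-mask models keep receiving exactly what they received before.
func planMasks(modelInputNames: Set<String>) -> MaskPlan {
    let inputs = knownMaskInputs.filter { modelInputNames.contains($0) }
    return MaskPlan(
        inputs: inputs,
        needsSlidingWindow: inputs.contains("slidingAttentionMask")
    )
}

let singleMask = planMasks(modelInputNames: ["inputIds", "causalMask"])
let mixed = planMasks(
    modelInputNames: ["inputIds", "fullAttentionMask", "slidingAttentionMask"]
)
```

In this sketch a sliding window size only needs to be resolved when `needsSlidingWindow` is true, which leaves the single-mask path untouched.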
Implementation
Tests
swift test --filter LanguageModelCoreMLMaskTests
Scope clarification
This PR is intended to support explicit multi-mask Core ML generation contracts in the runtime.
It does not attempt to fix exporter-side approaches that reconstruct multiple masks inside a Core ML graph from a single causalMask input.
Additional context
Closes #330
Example converted Core ML repo using the explicit multi-mask contract:
https://huggingface.co/Skyline23/translategemma-4b-it-coreml