You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Covers TQ-023 (Metal/Apple Silicon backend) and TQ-024 (Fused dequant+attention kernels)
Status: TODO — requires Metal Shading Language + Kotlin/Native interop
Objective
Implement TurboQuant KV-cache compression and decompression as Metal
compute shaders for Apple Silicon, enabling zero-copy unified-memory
KV cache and fused dequant+attention execution.
Why Metal
Apple Silicon unified memory eliminates CPU↔GPU copies for KV cache
Metal Performance Shaders (MPS) provides optimized SDPA primitives
Most on-device inference for SKaiNET targets macOS/iOS (Apple Silicon)
TurboQuant decode is embarrassingly parallel — ideal for GPU compute
GPU-side append (encode on GPU, no CPU round-trip)
shader + Kotlin
M-022
GPU-side read (decode on GPU for raw access)
shader + Kotlin
M-023
Integration with CompressedKvAttention.RAW_STORAGE
bridge code
Phase 4: Fused dequant+SDPA
Task
Description
Files
M-030
turboquant_fused_sdpa shader
turboquant.metal
M-031
Tiled attention with on-the-fly dequant
shader optimization
M-032
Causal mask support in fused kernel
shader
M-033
GQA (grouped-query attention) support
shader
M-034
End-to-end benchmark vs CPU decode+SDPA
benchmark suite
Phase 5: Integration & optimization
Task
Description
Files
M-040
Wire Metal backend into PlatformCpuOpsFactory for macOS/iOS
factory impl
M-041
Fallback to CPU when Metal unavailable
graceful degradation
M-042
Unified-memory placement resolution in MemoryPlanner
planner update
M-043
@KvCache(device = GPU) annotation handling
annotation processor
M-044
Performance tuning: threadgroup sizes, occupancy
shader tuning
Kotlin/Native Metal Interop
cinterop definition (metal.def)
language = Objective-C
headers = Metal/Metal.h MetalPerformanceShaders/MetalPerformanceShaders.h
compilerOpts = -framework Metal -framework MetalPerformanceShaders
linkerOpts = -framework Metal -framework MetalPerformanceShaders -framework Foundation
Key ObjC types to bridge
Metal Type
Kotlin Usage
MTLDevice
GPU device handle
MTLCommandQueue
Serial command submission
MTLCommandBuffer
Batch of GPU commands
MTLComputeCommandEncoder
Dispatch compute kernels
MTLBuffer
GPU/unified memory buffer
MTLComputePipelineState
Compiled shader pipeline
MTLLibrary
Compiled shader library
Unified memory pattern
// Allocate in unified memory — accessible from both CPU and GPUval buffer = device.newBuffer(
length = sizeInBytes,
options =MTLResourceStorageModeShared// unified memory
)
// CPU can read/write directly (no copy needed)val ptr = buffer.contents()
// GPU kernel reads/writes same memory
encoder.setBuffer(buffer, offset =0, index =0)
encoder.dispatchThreads(...)
Performance Targets
Metric
CPU Reference
Metal Target
TurboQuant encode (128d, 4-bit)
~10 μs
< 2 μs
TurboQuant decode (128d, 4-bit)
~8 μs
< 1 μs
Fused dequant+SDPA (8 heads, 128d, 1024 KV)
N/A (separate)
< 100 μs
KV cache memory (4-bit vs FP32)
8x compression
8x compression
CPU↔GPU copies for KV cache
N/A
0 (unified memory)
Acceptance Criteria
Metal shaders compile and run on Apple Silicon (M1+)
Encode/decode correctness matches CPU reference within tolerance