S2LC – 100 LoRA adapters in 3.59ms by reconstructing weights in GPU registers, never writing to HBM #8712

QQQTech · 2026-03-22T20:30:47Z

S2LC (Shared Spectral Low-Rank Compression) exploits the spectral structure shared by neural network modules derived from the same base model. A shared basis matrix V_common (shape D×R, FP16) is computed once per layer via truncated SVD across the module population. Each module’s unique contribution U_k (shape D×R) is projected onto V_common and encoded in two compact codebooks at approximately 3 bits per element.

At inference, a fused Triton kernel computes y = x × V_common × U_kᵀ, reconstructing U_k values directly in the GPU register file during the tiled GEMM. No intermediate results are written to HBM; the only write is the final output tensor. CUDA Graph capture eliminates CPU-side kernel launch overhead.

Results: 10.1× memory compression over standard LoRA, 3.59 ms forward-pass latency for K=100 concurrent adapters, and zero intermediate HBM writes verified by NVIDIA Nsight Compute.

Extensions to MoE expert compression, KV cache compression, and variable-depth serving are described in Sections 5–7 and are currently theoretical: the algorithm is specified but not yet benchmarked.
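For readers who want to see the data flow concretely, here is a minimal NumPy sketch of the pipeline described above: shared basis via truncated SVD, per-module projection, 3-bit encoding, and the factored forward pass. It is an illustration under stated assumptions, not the S2LC implementation. The function names (`shared_basis`, `project`, `quantize`, `forward`) are hypothetical, a single uniform 8-entry codebook stands in for the two codebooks mentioned in the post (whose construction is not described here), and the decode step runs as an ordinary array lookup rather than inside a fused Triton kernel.

```python
import numpy as np

def shared_basis(weights, rank):
    """Return V_common (D x R): top-`rank` left singular vectors of [W_1 | ... | W_K]."""
    stacked = np.concatenate(weights, axis=1)              # (D, K*D)
    u, _, _ = np.linalg.svd(stacked, full_matrices=False)  # truncated below
    return u[:, :rank]

def project(weight, v_common):
    """Return U_k (D x R) such that W_k ~= V_common @ U_k.T."""
    return weight.T @ v_common

def quantize(u_k, bits=3):
    """Assumption: one uniform scalar codebook with 2**bits entries (a stand-in)."""
    levels = 2 ** bits
    codebook = np.linspace(u_k.min(), u_k.max(), levels)
    # Nearest-codeword index for every element of U_k.
    codes = np.abs(u_k[..., None] - codebook).argmin(axis=-1).astype(np.uint8)
    return codes, codebook

def forward(x, v_common, codes, codebook):
    """y = x @ V_common @ U_k.T with U_k decoded on the fly from its codes."""
    u_k = codebook[codes]            # a fused kernel would do this in registers
    return (x @ v_common) @ u_k.T    # low-rank order: (N,D)(D,R) then (N,R)(R,D)

# Synthetic population: K modules whose updates share one rank-R column space.
rng = np.random.default_rng(0)
D, R, K, N = 64, 8, 4, 5
base = rng.standard_normal((D, R))
weights = [base @ rng.standard_normal((R, D)) for _ in range(K)]

v_common = shared_basis(weights, R)
u_k = project(weights[0], v_common)
codes, codebook = quantize(u_k)
x = rng.standard_normal((N, D))
y = forward(x, v_common, codes, codebook)
```

Because the synthetic modules share an exact rank-R column space, the unquantized product `x @ v_common @ u_k.T` reproduces `x @ weights[0]` up to floating-point error; any remaining error in `y` comes from the 3-bit encoding alone. Real adapter populations will only share structure approximately, so the truncation itself would also contribute error.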
