feat(hslm): sparse ternary matmul — skip zero weights #316

@gHashTag

Task

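Exploit zero-weight sparsity in ternary matrices. Current sparsity is 25%; the target is 50%+ with a regularizer.
Skipping zero weights in matmul removes a proportional share of the accumulate operations, so the work per output element shrinks as sparsity grows.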

Scientific Background

SpTGEMM: Sparse Ternary GEMM (arxiv:2510.06957)

  • 5.98x speedup at 50% sparsity on Apple M1
  • 5.59x speedup at 25% sparsity with NEON vectorization
  • Blocked interleaved TCSC format > standard CSR/CSC for ternary
  • Performance stable across varying sparsity levels — critical for real networks

BitNet a4.8 Sparsification (arxiv:2411.04965)

  • Hybrid: 1-bit weights + top-k activation sparsification
  • 55% activated parameters through learned sparsification
  • L1 regularization during training explicitly encourages zero weights
  • Achieves competitive performance with 45% fewer compute ops

FATNN: Fast Ternary Neural Networks (ICCV 2021)

  • Ternary inner product: eliminate multiplication entirely
  • +1 → identity, -1 → negation, 0 → skip
  • 1.6-2.5x speedup in pure C++ (no specialized hardware)
  • 2x faster than conventional ternary across 6 convolution configs

Implementation Plan

Phase 1: Sparsity-aware matmul

// Ternary packed format: 2 bits per weight (+1=01, -1=10, 0=00);
// weights[m][k] denotes the decoded value in {-1, 0, +1}.
// Process only non-zero weights when accumulating output row m.
acc = 0;
for (k in 0..K) {
    const w = weights[m][k];  // ternary
    if (w == 0) continue;     // skip zero: 25-50% of iterations
    if (w == 1) acc += input[k];
    else acc -= input[k];     // w == -1
}
output[m] = acc;
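The loop above can be modelled end to end with a small Python reference, which may be useful as a correctness oracle for the Zig kernel. This is a minimal sketch, not the project's implementation: the 2-bit codes come from the comment above, but the bit order inside a byte (low bits first) and the names `pack_row`, `sparse_ternary_matvec` and `sparsity` are assumptions made for illustration.

```python
# 2-bit codes from the issue: 0 -> 00, +1 -> 01, -1 -> 10 (11 is unused).
ENCODE = {0: 0b00, 1: 0b01, -1: 0b10}


def pack_row(row):
    """Pack ternary weights into bytes: four weights per byte, low bits first
    (the bit order is an assumption, the issue does not specify it)."""
    out = bytearray((len(row) + 3) // 4)
    for k, w in enumerate(row):
        out[k // 4] |= ENCODE[w] << (2 * (k % 4))
    return bytes(out)


def sparse_ternary_matvec(packed_rows, x):
    """y[m] = sum_k W[m][k] * x[k], with no multiplies and zero weights skipped."""
    y = []
    for packed in packed_rows:
        acc = 0.0
        for k, xk in enumerate(x):
            code = (packed[k // 4] >> (2 * (k % 4))) & 0b11
            if code == 0b00:
                continue      # zero weight: nothing to accumulate
            if code == 0b01:
                acc += xk     # +1: add the input
            elif code == 0b10:
                acc -= xk     # -1: subtract the input
            else:
                raise ValueError("code 11 is unused in this encoding")
        y.append(acc)
    return y


def sparsity(rows):
    """Fraction of weights that are exactly zero."""
    total = sum(len(r) for r in rows)
    zeros = sum(1 for r in rows for w in r if w == 0)
    return zeros / total
```

With `W = [[1, 0, -1, 0, 1], [0, 0, 1, -1, 0]]` and `x = [2.0, 5.0, 3.0, 7.0, 1.5]`, packing each row and calling `sparse_ternary_matvec` gives `[0.5, -4.0]`, and `sparsity(W)` is `0.5`. Note that this loop still inspects every weight and branches on it; the saving is in accumulate work only, which is why Phase 2 moves to an index-based layout.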

Phase 2: Blocked interleaved format (from SpTGEMM paper)

  • 4x4 blocks, interleaved across matrix dimensions
  • Cache-line aligned storage: 64B = 256 ternary weights (2-bit packed)
  • Separate positive/negative indices for branch-free processing
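The third bullet can be illustrated with a simplified Python sketch: each row keeps one list of column indices for +1 weights and one for -1 weights, so the inner loops contain no per-weight branch and never visit a zero. This is not the blocked interleaved TCSC layout from the paper (the 4x4 blocking and cache-line alignment are omitted), and the names `build_index_lists`, `indexed_dot` and `indexed_matvec` are hypothetical.

```python
def build_index_lists(row):
    """Split a ternary row into the column indices of its +1 and -1 weights."""
    pos = [k for k, w in enumerate(row) if w == 1]
    neg = [k for k, w in enumerate(row) if w == -1]
    return pos, neg


def indexed_dot(pos, neg, x):
    """Add the inputs at +1 columns, subtract the inputs at -1 columns.
    Zero weights never appear in either list, so they cost nothing."""
    return sum((x[k] for k in pos), 0.0) - sum((x[k] for k in neg), 0.0)


def indexed_matvec(index_rows, x):
    """Apply indexed_dot to every (pos, neg) pair, one pair per matrix row."""
    return [indexed_dot(pos, neg, x) for pos, neg in index_rows]
```

For the row `[1, 0, -1, 0, 1]`, `build_index_lists` returns `([0, 4], [2])`. The index lists would be built once when weights are loaded, so the per-inference cost depends only on the number of non-zero weights.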

Phase 3: Sparsity regularizer (WDR)

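L_total = L_ce + λ * Σ_i |w_i| * (1 - |w_i|)

  • Weight Discretization Regularizer (WDR): for |w_i| ≤ 1 each term is zero at w_i ∈ {-1, 0, +1} and largest at |w_i| = 0.5, so it pushes weights toward {-1, 0, +1}; weights with |w_i| < 0.5 move toward 0, which is where the extra sparsity comes from
  • λ controls sparsity level: λ=0.01 → ~25%, λ=0.05 → ~50%
  • Progressive: start at λ=0 and ramp up to the target over the first 20K steps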
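A small Python sketch of the regularizer and its schedule, under stated assumptions: latent weights are taken to stay in [-1, 1] (outside that range the penalty term goes negative), the ramp is assumed to be linear because the issue does not give its shape, and the function names are hypothetical.

```python
def wdr_penalty(weights):
    """Sum of |w| * (1 - |w|); zero when every weight is in {-1, 0, +1}."""
    return sum(abs(w) * (1.0 - abs(w)) for w in weights)


def wdr_grad(w):
    """Derivative of |w| * (1 - |w|) with respect to w: sign(w) * (1 - 2|w|).
    The kink at w == 0 is given gradient 0 here (a common subgradient choice)."""
    if w == 0:
        return 0.0
    sign = 1.0 if w > 0 else -1.0
    return sign * (1.0 - 2.0 * abs(w))


def lambda_at(step, target, ramp_steps=20_000):
    """Linear ramp from 0 to target over the first ramp_steps steps (assumed shape)."""
    if step >= ramp_steps:
        return target
    return target * step / ramp_steps


def total_loss(ce_loss, weights, step, target_lambda):
    """L_total = L_ce + lambda(step) * sum_i |w_i| * (1 - |w_i|)."""
    return ce_loss + lambda_at(step, target_lambda) * wdr_penalty(weights)
```

The gradient shows the direction of the push: `wdr_grad(0.25)` is `0.5`, so a descent step shrinks the weight toward 0, while `wdr_grad(0.75)` is `-0.5`, so a descent step grows it toward +1.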

Changes

  • src/hslm/simd_ops.zig: sparse ternary matmul with zero-skip
  • src/hslm/trainer.zig: add WDR regularizer with --sparsity flag
  • Blocked interleaved storage format for weight matrices
  • Benchmark: tri test bench-sparse (varying sparsity 25-75%)

Expected

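  • ~25% fewer accumulate iterations at current sparsity (25%), i.e. at most ~1.33x from zero-skipping alone
  • 4-6x speedup at 50% sparsity (with regularizer), assuming the blocked interleaved format and SIMD gains reported for SpTGEMM carry over
  • Perplexity (PPL) impact: minimal if sparsity < 50%, ~2-5% degradation at 60%+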
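As a sanity check on these targets, the ideal gain from zero-skipping alone can be computed under a simple cost model: runtime proportional to the number of non-zero weights, with no indexing or branching overhead. Both the model and the function name below are assumptions for illustration.

```python
def zero_skip_speedup_bound(sparsity):
    """Upper bound on speedup when only zero-weight iterations are removed:
    work drops to (1 - sparsity) of the dense loop, so speedup <= 1 / (1 - sparsity)."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    return 1.0 / (1.0 - sparsity)
```

Under this model 25% sparsity gives at most about 1.33x, 50% gives 2x and 75% gives 4x. Any figure above those bounds would have to come from the storage format and vectorization as well, relative to whatever baseline is being measured, rather than from skipping zeros alone.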

Metadata

  • Assignees: none
  • Labels: agent:spawn (auto-spawn agent container)
  • Status: Done
  • Milestone: none