Task
Start training in full precision, then progressively anneal to ternary over the training schedule.
This lets the model establish good representations before the ternary constraint is applied.
Scientific Background
BitNet b1.58 Training Recipe (JMLR 2026)
- Two-stage: high LR (4-8×10⁻⁴) for 80% → cosine anneal for 20%
- Higher LR improves ternary convergence (counterintuitive!)
- Weight decay: 0.1 → near zero (decaying, not constant)
- Warmup: 375 steps linear
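The schedule in the list above can be sketched as a single function. This is a minimal illustration, not the recipe's reference code: the name `learningRate` and the parameters `peak_lr` and `min_lr` are hypothetical, and it assumes the 375 warmup steps are counted inside the first 80% stage.

```zig
const std = @import("std");

/// Hypothetical helper: warmup -> constant high LR -> cosine anneal.
/// `peak_lr` and `min_lr` are illustrative parameters, not recipe values.
pub fn learningRate(step: u64, total: u64, peak_lr: f32, min_lr: f32) f32 {
    const warmup_steps: u64 = 375; // linear warmup length from the recipe
    if (total == 0) return peak_lr;
    if (step < warmup_steps) {
        // Linear ramp from peak_lr / 375 up to peak_lr.
        const frac = @as(f32, @floatFromInt(step + 1)) / @as(f32, @floatFromInt(warmup_steps));
        return peak_lr * frac;
    }
    const ratio = @as(f32, @floatFromInt(step)) / @as(f32, @floatFromInt(total));
    if (ratio < 0.8) return peak_lr; // stage 1: hold the high LR for 80% of training
    // Stage 2: cosine anneal over the final 20% (assumed to end at min_lr).
    const raw = (ratio - 0.8) / 0.2;
    const t: f32 = if (raw > 1.0) 1.0 else raw;
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + @cos(std.math.pi * t));
}
```

With `peak_lr = 6e-4` and `min_lr = 0`, the value stays at the peak through step 80% of `total`, is about half the peak at 90%, and reaches roughly zero at the last step.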
Progressive Quantization (ACM 2024)
- Train 10% at FP32 → 20% at INT8 → 20% at INT4 → 50% at ternary
- Gradually increasing constraint preserves gradient flow
- Temperature-annealed STE: soft quantization → hard discrete
BitNet Shadow Weights
- Maintain FP16 shadow weights during training
- Shadow weights exist only for gradient flow
- Quantize to ternary for forward pass only
- Discard shadow weights after training
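A rough sketch of how the shadow-weight points above might fit together. It assumes the absmean scaling used by BitNet b1.58 (scale by the mean absolute weight, round, clamp to {-1, 0, +1}) and uses plain SGD as a stand-in for the real optimizer. The function names are hypothetical, and `f32` is used for the shadow buffer to match the FP32 buffer listed under Changes.

```zig
const std = @import("std");

/// Absmean ternary quantization for the forward pass.
/// Returns the scale so the caller can compute scale * q[i].
pub fn quantizeTernary(shadow: []const f32, q: []i8) f32 {
    std.debug.assert(shadow.len == q.len);
    if (shadow.len == 0) return 0.0;
    var sum_abs: f32 = 0.0;
    for (shadow) |w| sum_abs += if (w < 0) -w else w;
    // Small epsilon avoids division by zero for an all-zero tensor.
    const scale = sum_abs / @as(f32, @floatFromInt(shadow.len)) + 1e-8;
    for (shadow, q) |w, *out| {
        const r = std.math.clamp(@round(w / scale), -1.0, 1.0);
        out.* = @intFromFloat(r);
    }
    return scale;
}

/// Straight-through update: the gradient computed against the quantized
/// weights is applied unchanged to the high-precision shadow weights.
pub fn sgdStepShadow(shadow: []f32, grad: []const f32, lr: f32) void {
    std.debug.assert(shadow.len == grad.len);
    for (shadow, grad) |*w, g| w.* -= lr * g;
}
```

The shadow buffer accumulates updates that are too small to flip a ternary value in one step. After training, only `q` and the scale would need to be kept.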
Implementation
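const Precision = enum { fp32, int8, int4, ternary };

const QuantSchedule = struct {
    /// Precision stage for the current step, based on the fraction of training completed.
    fn getPrecision(step: u64, total: u64) Precision {
        const ratio = @as(f32, @floatFromInt(step)) / @as(f32, @floatFromInt(total));
        if (ratio < 0.1) return .fp32; // first 10%: warm start
        if (ratio < 0.3) return .int8; // next 20%: gentle quantization
        if (ratio < 0.5) return .int4; // next 20%: harder
        return .ternary; // final 50%: full constraint
    }
};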
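One way the returned stage might be applied to a weight buffer is sketched below. This is an assumption, not part of the described change: it uses a symmetric absmax grid for every stage (including ternary, where an absmean scale may be preferable), and `qmax` and `fakeQuantize` are hypothetical names.

```zig
const std = @import("std");

const Precision = enum { fp32, int8, int4, ternary };

/// Largest positive integer level per stage (assumed symmetric grid).
fn qmax(p: Precision) ?f32 {
    return switch (p) {
        .fp32 => null, // no quantization
        .int8 => 127.0,
        .int4 => 7.0,
        .ternary => 1.0,
    };
}

/// Fake-quantize in place: snap each weight to the grid, then map back to float.
pub fn fakeQuantize(weights: []f32, p: Precision) void {
    const levels = qmax(p) orelse return;
    var max_abs: f32 = 0.0;
    for (weights) |w| max_abs = @max(max_abs, if (w < 0) -w else w);
    if (max_abs == 0.0) return; // all-zero tensor: nothing to scale
    const scale = max_abs / levels;
    for (weights) |*w| {
        const q = std.math.clamp(@round(w.* / scale), -levels, levels);
        w.* = q * scale;
    }
}
```

Using one function for all stages keeps the transition between stages down to a change in the number of grid levels.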
Temperature annealing alternative:
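Q_soft(w, T) = tanh(w / T) // T→0: sign(w) (hard ±1), large T: ≈ w/T (near-linear, shrinks toward 0)
A ternary limit also needs a dead zone around 0, since tanh alone never settles on 0 for w ≠ 0.
Schedule: T = 10.0 → 0.01 over training (exponential decay)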
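The temperature schedule and a soft quantizer with a ternary limit can be sketched as follows. The two-tanh form with a dead-zone half-width `delta` is one possible construction rather than something taken from the cited work. `temperature`, `softTernary`, and `delta` are hypothetical names.

```zig
const std = @import("std");

/// Exponential decay from t_start to t_end over `total` steps.
pub fn temperature(step: u64, total: u64, t_start: f32, t_end: f32) f32 {
    if (total == 0 or step >= total) return t_end;
    const ratio = @as(f32, @floatFromInt(step)) / @as(f32, @floatFromInt(total));
    return t_start * std.math.pow(f32, t_end / t_start, ratio);
}

/// Soft ternary: as temp -> 0 this approaches -1 for w < -delta,
/// 0 for |w| < delta, and +1 for w > delta.
pub fn softTernary(w: f32, temp: f32, delta: f32) f32 {
    const upper = std.math.tanh((w - delta) / temp);
    const lower = std.math.tanh((w + delta) / temp);
    return 0.5 * (upper + lower);
}
```

With `t_start = 10.0` and `t_end = 0.01`, the temperature at the midpoint of training is about 0.316, the geometric mean of the two endpoints.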
Changes
- src/hslm/trainer.zig: QuantSchedule with step-dependent precision
- src/hslm/quantize.zig: temperature-annealed soft quantization
- Shadow weights buffer (FP32, same size as model)
- Flag: --progressive-quant to enable
Expected
- 10-15% PPL improvement from better initialization
- Most impactful for initial 50K steps
- Compound: progressive quant + TTQ + OHEM = potentially PPL < 60
References