feat(hslm): progressive quantization — FP32 warmup → ternary anneal

## Task
Start training in full precision, then progressively anneal to ternary over the training schedule.
This lets the model establish good representations before the ternary constraint is applied.

## Scientific Background

### BitNet b1.58 Training Recipe (JMLR 2025)
- Two-stage schedule: high LR (4-8×10⁻⁴) for the first 80% of training → cosine anneal for the final 20%
- **Higher LR improves ternary convergence** (counterintuitive!)
- Weight decay: starts at 0.1 and decays to near zero (not held constant)
- Warmup: linear over 375 steps
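
A minimal sketch of the learning-rate side of this recipe, to make the stage boundaries concrete. Only the 375-step warmup and the 80/20 split come from the list above; `learningRate`, `peak_lr`, and `min_lr` are hypothetical names, the peak is an assumed midpoint of the 4-8×10⁻⁴ range, the floor is an assumed value, and treating stage 1 as a constant LR is also an assumption.

```zig
const std = @import("std");

const peak_lr: f32 = 6.0e-4; // assumed: midpoint of the 4-8e-4 range
const min_lr: f32 = 6.0e-5; // assumed floor for the cosine stage
const warmup_steps: u64 = 375; // from the recipe above

/// Learning rate at `step` out of `total` steps (hypothetical helper).
pub fn learningRate(step: u64, total: u64) f32 {
    if (total == 0) return peak_lr;
    if (step < warmup_steps) {
        // Linear warmup: reaches peak_lr on the last warmup step.
        const num = @as(f32, @floatFromInt(step + 1));
        const den = @as(f32, @floatFromInt(warmup_steps));
        return peak_lr * num / den;
    }
    const ratio = @as(f32, @floatFromInt(step)) / @as(f32, @floatFromInt(total));
    if (ratio < 0.8) return peak_lr; // stage 1: hold the high LR (assumed constant)
    // Stage 2: cosine anneal from peak_lr down to min_lr over the last 20%.
    const t: f32 = @min((ratio - 0.8) / 0.2, @as(f32, 1.0));
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + @cos(std.math.pi * t));
}
```

The weight-decay schedule could follow the same shape, with 0.1 as the starting value and a near-zero floor.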

### Progressive Quantization (ACM 2024)
- Train the first 10% of steps at FP32 → next 20% at INT8 → next 20% at INT4 → final 50% at ternary
- Increasing the constraint gradually preserves gradient flow
- Temperature-annealed STE (straight-through estimator): soft quantization → hard discrete

### BitNet Shadow Weights
- Maintain FP16 shadow weights during training
- Shadow weights exist only to receive gradient updates
- Quantize them to ternary for the forward pass only
- Discard shadow weights after training
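
The shadow-weight idea can be sketched as follows, assuming the absmean scheme described for BitNet b1.58 (scale by the mean absolute weight, round, clamp to [-1, 1]). `quantizeTernary` is a hypothetical name, and the FP32 buffer type is chosen for simplicity rather than taken from any reference implementation.

```zig
const std = @import("std");

/// Quantize shadow weights to ternary codes {-1, 0, +1} and return the
/// absmean scale. The forward pass would use `code * scale`; gradients are
/// applied to `shadow`, never to `out`. (Hypothetical helper.)
pub fn quantizeTernary(shadow: []const f32, out: []i8) f32 {
    std.debug.assert(shadow.len == out.len);
    if (shadow.len == 0) return 0.0;
    // Scale = mean absolute value of the shadow weights.
    var sum: f32 = 0.0;
    for (shadow) |w| sum += if (w < 0.0) -w else w;
    const scale = sum / @as(f32, @floatFromInt(shadow.len)) + 1e-8;
    // Round w / scale to the nearest integer, then clamp to [-1, 1].
    for (shadow, out) |w, *q| {
        const r = @round(w / scale);
        q.* = @intFromFloat(std.math.clamp(r, -1.0, 1.0));
    }
    return scale;
}
```

After training, only the ternary codes and the scale need to be kept; the shadow buffer can be freed.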

## Implementation
```zig
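/// Precision stages, ordered from least to most constrained.
/// Assumed definition; omit if `Precision` already exists in the codebase.
const Precision = enum { fp32, int8, int4, ternary };

const QuantSchedule = struct {
    /// Map training progress (step / total) to a precision stage.
    pub fn getPrecision(step: u64, total: u64) Precision {
        if (total == 0) return .ternary; // avoid 0/0 when no schedule is given
        const ratio = @as(f32, @floatFromInt(step)) / @as(f32, @floatFromInt(total));
        if (ratio < 0.1) return .fp32; // first 10%: warm start
        if (ratio < 0.3) return .int8; // next 20%: gentle quantization
        if (ratio < 0.5) return .int4; // next 20%: harder constraint
        return .ternary; // final 50%: full constraint
    }
};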
```
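
For the intermediate INT8 and INT4 stages, one possible forward-pass treatment is symmetric uniform fake-quantization. This is a sketch under that assumption, not something the issue specifies: `fakeQuant` and its per-tensor `max_abs` scaling are hypothetical choices.

```zig
const std = @import("std");

/// Symmetric uniform fake-quantization of one weight to `bits` bits.
/// `max_abs` is the largest |w| in the tensor. (Hypothetical helper.)
pub fn fakeQuant(w: f32, max_abs: f32, bits: u5) f32 {
    std.debug.assert(bits >= 2);
    if (max_abs == 0.0) return 0.0;
    // Largest positive integer code: 127 for 8 bits, 7 for 4 bits.
    const qmax: f32 = @floatFromInt((@as(u32, 1) << (bits - 1)) - 1);
    const scale = max_abs / qmax;
    // Round to the nearest code and clamp to the representable range.
    const code = std.math.clamp(@round(w / scale), -qmax, qmax);
    return code * scale; // dequantized value used in the forward pass
}
```

A trainer could call this with `bits = 8` or `bits = 4` depending on the stage returned by the schedule, and switch to ternary quantization for the final stage.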

### Temperature annealing alternative
```
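Q_soft(w, T) = tanh(w / T)  // T→0: sign(w), i.e. hard but binary (±1); T→∞: ≈ w/T, shrinking toward 0
// A ternary limit needs a dead zone around 0. One option (Δ is an assumed threshold):
// Q_soft(w, T) = 0.5 * (tanh((w + Δ) / T) + tanh((w - Δ) / T))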
Schedule: T = 10.0 → 0.01 over training (exponential decay)
```
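
A minimal sketch of the temperature schedule and the soft quantizer, evaluating the `tanh(w / T)` formula exactly as written above. The 10.0 and 0.01 endpoints come from the schedule line; `temperature` and `softQuant` are hypothetical names, and the geometric interpolation is one assumed reading of "exponential decay".

```zig
const std = @import("std");

const t_start: f32 = 10.0; // from the schedule above
const t_end: f32 = 0.01; // from the schedule above

/// Temperature at `step` out of `total`, decaying exponentially from
/// t_start to t_end: T = t_start * (t_end / t_start)^(step / total).
pub fn temperature(step: u64, total: u64) f32 {
    if (total == 0 or step >= total) return t_end;
    const ratio = @as(f32, @floatFromInt(step)) / @as(f32, @floatFromInt(total));
    return t_start * std.math.pow(f32, t_end / t_start, ratio);
}

/// Soft quantizer: tanh(w / T), as in the formula above.
pub fn softQuant(w: f32, t: f32) f32 {
    return std.math.tanh(w / t);
}
```

Each training step would compute `temperature(step, total)` once and pass it to `softQuant` for every weight in the forward pass.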

## Changes
- `src/hslm/trainer.zig`: QuantSchedule with step-dependent precision
- `src/hslm/quantize.zig`: temperature-annealed soft quantization
- Shadow weights buffer (FP32, same size as model)
- Flag: `--progressive-quant` to enable

## Expected
- **10-15% PPL improvement** from better initialization
- Most impactful during the first 50K steps
- Combined with TTQ and OHEM: potentially PPL < 60

## References
- BitNet b1.58: https://www.jmlr.org/papers/volume26/24-2050/24-2050.pdf
- Progressive quantization: https://dl.acm.org/doi/10.1145/3701716.3717578
- BitNet 2B4T: https://arxiv.org/abs/2504.12285
- Shadow weights: https://arxiv.org/abs/2411.05882
