CHANGELOG.md: 20 additions & 0 deletions
@@ -5,6 +5,26 @@ Format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).

---

+## [0.2.0] - 2026-04-28
+
+### Added (SCAO v2)
+- **Adaptive Warmup (R1)**: Event-driven exit from Phase 1 (Adam) to Phase 2 (Kronecker) based on gradient stability, saving up to 30% of training time.
+- **Dynamic Sparsity (R2)**: Per-layer mask thresholds scaled by gradient energy to curvature ratio; preserves rank in embedding/attention layers while compressing MLPs.
+- **Lazy Preconditioning (R3)**: Event-driven factor updates (cosine-similarity trigger) to maintain throughput on H100/A100 clusters.
+- **gSNR Clipping (R4)**: Element-wise signal-to-noise ratio masks applied before updates to stabilize Foundation Model training.
+- **Adaptive Rank (R5)**: Dynamic `k` selection proportional to layer activity and spectral mass.
+- **Scale Presets**: New optimized configurations: `scao_3b`, `scao_7b`, `scao_40b`, and `scao_125b`.
+- **Int8 EMA (Stable)**: Production-ready 4x reduction in curvature buffer memory with zero convergence loss.
+- **Asynchronous Preconditioning**: Background CUDA compute for factor updates to hide second-order overhead.
+
+### Fixed
+- **BFloat16 Robustness**: Enhanced numerical stability via float32 accumulation for all curvature statistics.
+- **Memory Optimization**: Fixed VRAM spikes in block-diagonal preconditioning for massive (1024+) layers.
+- **T4 Stability**: Validated 3B-parameter training on 16GB T4 GPUs (QLoRA).
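
The cosine-similarity trigger mentioned under Lazy Preconditioning (R3) can be pictured with a small sketch. This is not SCAO's implementation: the class name `LazyUpdateTrigger`, the `threshold` parameter, and its default of 0.9 are hypothetical, and plain Python lists stand in for gradient tensors. The idea shown is only that the expensive factor update runs when the current gradient direction has drifted far enough from the direction seen at the last update.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class LazyUpdateTrigger:
    """Hypothetical helper: decides when a preconditioner refresh is worth its cost."""

    def __init__(self, threshold=0.9):
        # Assumed default; a real optimizer would tune or schedule this value.
        self.threshold = threshold
        self.reference = None  # gradient recorded at the last refresh

    def should_update(self, grad):
        # Refresh on the first call, or when the gradient direction has drifted
        # below the similarity threshold relative to the stored reference.
        if self.reference is None or cosine_similarity(self.reference, grad) < self.threshold:
            self.reference = list(grad)
            return True
        return False
```

In a training loop, `should_update` would be called once per step with a flattened gradient (or a cheap summary of it), and the factor update skipped whenever it returns `False`. A higher threshold refreshes more often; a lower one trades curvature freshness for throughput.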
-If you have endorsement rights on arXiv for **cs.LG** (Machine Learning), please consider endorsing our paper to help us share this work with the community:
-
-👉 **[Endorse SCAO on arXiv](https://arxiv.org/auth/endorse?x=X3VJ88)**
-
-
---

## 🧪 Tested on a Home GPU — Three Objections, Three Answers
@@ -48,7 +41,6 @@ If you have endorsement rights on arXiv for **cs.LG** (Machine Learning), please

## Table of Contents

-- [🚀 Support the Research](#-support-the-research)
1. [The Problem](#1-the-problem)
2. [SCAO's Solution](#2-scaos-solution)
3. [Algorithm](#3-algorithm)
@@ -117,6 +109,49 @@ The Kronecker curvature accumulators `L_ema` and `R_ema` are stored in **int8 wi
Enable with `SCAO(..., use_int8_ema=True)`. Eigendecomposition still runs in float32 (dequantized on-the-fly), so eigenvector precision is unchanged.
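
One way to picture the int8 storage described above is symmetric quantization with a single float scale per buffer. The sketch below is an illustration under stated assumptions, not SCAO's code: the hunk header truncates the actual storage description, so the per-buffer scale, the function names, and the `beta` default are all hypothetical. It shows the cycle of dequantizing, updating in float, and requantizing, which is why downstream float math such as the eigendecomposition sees ordinary floating-point values.

```python
def quantize_int8(values):
    """Symmetric quantization: list of floats -> (int8-range codes, float scale)."""
    max_abs = max((abs(v) for v in values), default=0.0)
    # Map the largest magnitude to 127; fall back to 1.0 for an all-zero buffer.
    scale = max_abs / 127.0 if max_abs > 0.0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate float values from the stored codes."""
    return [c * scale for c in codes]


def ema_update_int8(codes, scale, new_stat, beta=0.99):
    """Dequantize, apply the EMA in float, then requantize for storage."""
    old = dequantize_int8(codes, scale)
    updated = [beta * o + (1.0 - beta) * n for o, n in zip(old, new_stat)]
    return quantize_int8(updated)
```

Each stored element takes one byte instead of the four bytes of a float32, which is where a 4x reduction in buffer size comes from (ignoring the small cost of the scale). The rounding error per element is at most half of `scale`, so buffers with a few very large entries lose resolution on the small ones; finer-grained scales (per row or per block) are a common remedy.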

+---
+
+## 🌟 SCAO v2: Scale Presets
+
+Starting with v2, SCAO provides **Scale Presets** that automatically configure hyperparameters (k_max, sparsity, async updates) based on your model size. This is the recommended way to use SCAO:
+> **Note**: The 16GB VRAM (T4) benchmarks are provided as an **efficiency validation baseline**. SCAO is designed for massive-scale distributed training on A100/H100 clusters, where its asynchronous preconditioning and state compression deliver unmatched throughput-to-convergence ratios.
+
+**Example Usage:**
+```python
+from scao import scao_3b
+
+# Automatically sets k_max=32, int8_ema=True, async_precond=False for memory safety
+optimizer = scao_3b(model, lr=2e-4)
+```
+
+---
+
+## 📊 Unified T4 Benchmark
+
+We provide a comprehensive benchmark suite in `scao_benchmarks_t4/` to validate performance on Google Colab T4 GPUs.
+
+### Features:
+* **Synthetic Mode**: Test throughput on GPT-like models (125M to 760M).
+* **QLoRA Mode**: Test real LLMs (3B, 7B) using 4-bit quantization and PEFT.
+* **Competitors**: Head-to-head comparison against **AdamW, Shampoo, and Muon**.