Commit 7f2d5db

Author: SCAO Authors
feat: SCAO v2 (0.2.0) - adaptive warmup, dynamic sparsity, gSNR and int8 EMA
1 parent 95eb6d4 commit 7f2d5db

10 files changed

Lines changed: 205 additions & 84 deletions

.gitignore

Lines changed: 2 additions & 1 deletion
@@ -96,4 +96,5 @@ scao-*/
 
 # Examples — local training outputs (checkpoints, LoRA weights, model files)
 examples/resultado_scao_1m/
-examples/resultado_scao_local/
+# Local benchmarks
+scao_benchmarks_t4/

CHANGELOG.md

Lines changed: 20 additions & 0 deletions
@@ -5,6 +5,26 @@ Format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
 
 ---
 
+## [0.2.0] — 2026-04-28
+
+### Added (SCAO v2)
+- **Adaptive Warmup (R1)**: Event-driven exit from Phase 1 (Adam) to Phase 2 (Kronecker) based on gradient stability, saving up to 30% of training time.
+- **Dynamic Sparsity (R2)**: Per-layer mask thresholds scaled by gradient energy to curvature ratio; preserves rank in embedding/attention layers while compressing MLPs.
+- **Lazy Preconditioning (R3)**: Event-driven factor updates (cosine-similarity trigger) to maintain throughput on H100/A100 clusters.
+- **gSNR Clipping (R4)**: Element-wise signal-to-noise ratio masks applied before updates to stabilize Foundation Model training.
+- **Adaptive Rank (R5)**: Dynamic `k` selection proportional to layer activity and spectral mass.
+- **Scale Presets**: New optimized configurations: `scao_3b`, `scao_7b`, `scao_40b`, and `scao_125b`.
+- **Int8 EMA (Stable)**: Production-ready 4x reduction in curvature buffer memory with zero convergence loss.
+- **Asynchronous Preconditioning**: Background CUDA compute for factor updates to hide second-order overhead.
+
+### Fixed
+- **BFloat16 Robustness**: Enhanced numerical stability via float32 accumulation for all curvature statistics.
+- **Memory Optimization**: Fixed VRAM spikes in block-diagonal preconditioning for massive (1024+) layers.
+- **T4 Stability**: Validated 3B-parameter training on 16GB T4 GPUs (QLoRA).
+
+---
+
+
 ## [0.1.0] — 2026-04-20
 
 ### Initial open-source release
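
The Int8 EMA entry in the changelog hunk above reports a 4x reduction in curvature buffer memory, which matches storing one byte per element instead of a four-byte float32. The sketch below illustrates that idea in plain Python, assuming symmetric per-tensor quantization with a single float scale. The diff does not show SCAO's actual quantization scheme, and `quantize_int8`, `dequantize_int8`, and `ema_update_int8` are hypothetical names, not part of the `scao` API.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] plus one shared float scale."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Clamp after rounding so floating-point noise cannot leave the int8 range.
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate float values from the int8 codes."""
    return [q * scale for q in quantized]


def ema_update_int8(quantized, scale, new_stat, beta=0.99):
    """Dequantize, apply the EMA in floating point, then requantize."""
    current = dequantize_int8(quantized, scale)
    updated = [beta * c + (1.0 - beta) * n for c, n in zip(current, new_stat)]
    return quantize_int8(updated)


# Illustrative values only; a real buffer would hold a curvature factor.
codes, scale = quantize_int8([0.5, -1.0, 0.25])
restored = dequantize_int8(codes, scale)
```

With this scheme the per-element reconstruction error is at most half of `scale`, which is why the expensive step (the eigendecomposition, per the README) can still run on dequantized float32 values.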

LICENSE

Lines changed: 1 addition & 1 deletion
@@ -164,7 +164,7 @@
 
 END OF TERMS AND CONDITIONS
 
-Copyright 2026 SCAO Authors
+Copyright 2026 Danilo Souza
 
 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.

README.md

Lines changed: 54 additions & 30 deletions
@@ -3,7 +3,7 @@
 [![CI](https://github.com/whispering3/scao/actions/workflows/ci.yml/badge.svg)](https://github.com/whispering3/scao/actions)
 [![PyPI](https://img.shields.io/pypi/v/scao.svg)](https://pypi.org/project/scao)
 [![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](LICENSE)
-[![Paper](https://img.shields.io/badge/paper-NeurIPS%202026-red)](paper/scao.pdf)
+[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.19870556.svg)](https://doi.org/10.5281/zenodo.19870556)
 [![Python](https://img.shields.io/badge/python-3.10%2B-blue)](https://python.org)
 [![PyTorch](https://img.shields.io/badge/pytorch-2.0%2B-orange)](https://pytorch.org)
 
@@ -13,13 +13,6 @@
 
 ---
 
-## 🚀 Support the Research
-
-If you have endorsement rights on arXiv for **cs.LG** (Machine Learning), please consider endorsing our paper to help us share this work with the community:
-
-👉 **[Endorse SCAO on arXiv](https://arxiv.org/auth/endorse?x=X3VJ88)**
-
-
 ---
 
 ## 🧪 Tested on a Home GPU — Three Objections, Three Answers
@@ -48,7 +41,6 @@ If you have endorsement rights on arXiv for **cs.LG** (Machine Learning), please
 
 ## Table of Contents
 
-- [🚀 Support the Research](#-support-the-research)
 1. [The Problem](#1-the-problem)
 2. [SCAO's Solution](#2-scaos-solution)
 3. [Algorithm](#3-algorithm)
@@ -117,6 +109,49 @@ The Kronecker curvature accumulators `L_ema` and `R_ema` are stored in **int8 wi
 
 Enable with `SCAO(..., use_int8_ema=True)`. Eigendecomposition still runs in float32 (dequantized on-the-fly), so eigenvector precision is unchanged.
 
+---
+
+## 🌟 SCAO v2: Scale Presets
+
+Starting with v2, SCAO provides **Scale Presets** that automatically configure hyperparameters (k_max, sparsity, async updates) based on your model size. This is the recommended way to use SCAO:
+
+| Preset | Model Size | Best For | Key Settings |
+| :--- | :--- | :--- | :--- |
+| `scao_sub1b()` | < 1B | BERT, GPT-2, ViT | Balanced performance |
+| `scao_1b()` | 1B - 2B | TinyLlama, StableLM | Memory-efficient |
+| `scao_3b()` | 3B - 4B | Qwen-3B, Phi-3 | Aggressive compression |
+| `scao_7b()` | 7B - 14B | Llama-7B, Mistral | High stability |
+| `scao_40b()` | 30B - 70B | Mixtral, Llama-70B | Lazy updates & gSNR |
+| `scao_125b()` | 100B+ | GPT-3 scale, Llama-405B | Max throughput (Offload-ready) |
+
+> **Note**: The 16GB VRAM (T4) benchmarks are provided as an **efficiency validation baseline**. SCAO is designed for massive-scale distributed training on A100/H100 clusters, where its asynchronous preconditioning and state compression deliver unmatched throughput-to-convergence ratios.
+
+**Example Usage:**
+```python
+from scao import scao_3b
+
+# Automatically sets k_max=32, int8_ema=True, async_precond=False for memory safety
+optimizer = scao_3b(model, lr=2e-4)
+```
+
+---
+
+## 📊 Unified T4 Benchmark
+
+We provide a comprehensive benchmark suite in `scao_benchmarks_t4/` to validate performance on Google Colab T4 GPUs.
+
+### Features:
+* **Synthetic Mode**: Test throughput on GPT-like models (125M to 760M).
+* **QLoRA Mode**: Test real LLMs (3B, 7B) using 4-bit quantization and PEFT.
+* **Competitors**: Head-to-head comparison against **AdamW, Shampoo, and Muon**.
+
+**Run it on Colab:**
+```bash
+python scao_benchmarks_t4/benchmark_t4.py --mode qlora --model_id "Qwen/Qwen2.5-3B" --steps 100
+```
+
+---
+
 ### Innovation 5 — CUDA Fused Kernels
 
 Production-quality CUDA kernels for the Kronecker projection operations:
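
The Scale Presets table added in this hunk maps model size to a preset function. As a rough illustration of that mapping, the sketch below picks a preset name from a parameter count. `pick_preset_name` is a hypothetical helper, not part of the `scao` package. The table leaves some ranges uncovered (for example 14B to 30B), so assigning those sizes to the next larger preset is an assumption made here, not something the README states.

```python
# Hypothetical helper: `pick_preset_name` is not part of the scao package.
# Upper bounds (in parameters) are read off the "Model Size" column of the table.
# Assumption: sizes that fall in a gap between rows (2B-3B, 4B-7B, 14B-30B,
# 70B-100B) map to the next larger preset.
_UPPER_BOUNDS = [
    (2e9, "scao_1b"),
    (4e9, "scao_3b"),
    (14e9, "scao_7b"),
    (70e9, "scao_40b"),
]


def pick_preset_name(num_params: float) -> str:
    """Return the name of the preset whose size range covers num_params."""
    if num_params < 1e9:
        return "scao_sub1b"
    for upper, name in _UPPER_BOUNDS:
        if num_params <= upper:
            return name
    return "scao_125b"


# Example: a 3B-parameter model falls in the `scao_3b` row of the table.
preset_name = pick_preset_name(3e9)
```

With the package installed, the returned name could be resolved with `getattr(scao, preset_name)` and called the same way as in the README example, `scao_3b(model, lr=2e-4)`.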
@@ -668,6 +703,10 @@ scao/ # Core library
 ├── __init__.py # fused_kronecker_precond(), int8_ema_update(), truncated_eigh()
 └── setup.py # nvcc build (sm_70/75/80/86/89/90)
 
+scao_benchmarks_t4/ # Unified T4/Colab benchmark suite
+├── benchmark_t4.py # Main benchmark (Synthetic & QLoRA)
+└── results/ # Benchmark output logs
+
 benchmark/ # Self-contained runnable examples
 ├── train_local.py # Fine-tune GPT-2 125M with SCAO + LoRA (<8 GB VRAM)
 ├── train_1m.py # Full fine-tuning throughput benchmark on TinyStories-1M
@@ -685,7 +724,7 @@ scripts/
 └── scao_colab_benchmark.ipynb # Colab GPU benchmark (125M / 350M)
 
 paper/
-└── scao.tex # NeurIPS 2026 paper source (LaTeX)
+└── scao.tex # SCAO paper source (LaTeX)
 
 results_v11.csv # 200-step benchmark (primary ablation baseline)
 results_v11_500.csv # 500-step benchmark (primary paper result)
@@ -709,31 +748,16 @@ results_scao_vs_adamw.csv # Per-step training loss (Phase 1 analysis)
 If you use SCAO in your research, please cite:
 
 ```bibtex
-@inproceedings{scao2026,
+@software{scao2026,
 title = {SCAO: Sparse Curvature-Aware Adaptive Optimization for Large-Scale Models},
-author = {Anonymous},
-booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
+author = {Danilo Souza},
 year = {2026},
+publisher = {Zenodo},
+doi = {10.5281/zenodo.19870556},
+url = {https://github.com/whispering3/scao}
 }
 ```
 
-SCAO builds on and extends:
-
-```bibtex
-@article{vyas2024soap,
-title = {SOAP: Improving and Stabilizing Shampoo using Adam},
-author = {Vyas, Nikhil and Morwani, Depen and Zhao, Rosie and others},
-journal = {arXiv:2409.11321},
-year = {2024},
-}
-
-@inproceedings{gupta2018shampoo,
-title = {A Unified View of Adaptive Gradient Methods},
-author = {Gupta, Vineet and Koren, Tomer and Singer, Yoram},
-booktitle = {NeurIPS},
-year = {2018},
-}
-```
 
 ---
 
pyproject.toml

Lines changed: 9 additions & 3 deletions
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "scao"
-version = "0.1.1"
+version = "0.2.0"
 description = "Sparse Curvature-Aware Adaptive Optimizer — second-order training at near-AdamW cost"
 readme = "README.md"
 requires-python = ">=3.10"
@@ -15,7 +15,7 @@ keywords = [
 "kronecker", "shampoo", "fisher-information", "llm", "transformer",
 ]
 authors = [
-{ name = "SCAO Authors" },
+{ name = "Danilo Souza" },
 ]
 classifiers = [
 "Development Status :: 4 - Beta",
@@ -38,7 +38,7 @@ Homepage = "https://github.com/whispering3/scao"
 "Bug Tracker" = "https://github.com/whispering3/scao/issues"
 Documentation = "https://github.com/whispering3/scao#readme"
 Changelog = "https://github.com/whispering3/scao/blob/main/CHANGELOG.md"
-Paper = "https://arxiv.org/abs/XXXX.XXXXX"
+Paper = "https://doi.org/10.5281/zenodo.19839495"
 
 [project.optional-dependencies]
 dev = [
@@ -55,10 +55,16 @@ cuda = [
 hf = [
 "transformers>=4.30.0",
 "datasets>=2.0.0",
+"peft",
+"bitsandbytes",
+"accelerate",
 ]
 all = [
 "transformers>=4.30.0",
 "datasets>=2.0.0",
+"peft",
+"bitsandbytes",
+"accelerate",
 "mypy>=1.5",
 "ruff>=0.1",
 ]
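
The last hunk above adds `peft`, `bitsandbytes`, and `accelerate` to the `hf` and `all` extras. A quick way to see which of those optional packages are importable in the current environment is sketched below using only the standard library. `missing_hf_extras` is a hypothetical helper, and it assumes each package's import name matches the name listed in `pyproject.toml`.

```python
from importlib.util import find_spec

# Names copied from the `hf` extra in the diff above. Assumption: the import
# name of each package equals its distribution name.
HF_EXTRA_PACKAGES = ["transformers", "datasets", "peft", "bitsandbytes", "accelerate"]


def missing_hf_extras(packages=HF_EXTRA_PACKAGES):
    """Return the listed packages that cannot be found on the import path."""
    return [name for name in packages if find_spec(name) is None]


missing = missing_hf_extras()
if missing:
    print("Missing optional packages:", ", ".join(missing))
else:
    print("All packages from the hf extra are importable.")
```

`find_spec` only locates a top-level package without importing it, so the check stays cheap even when heavy libraries are installed.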

scao/__init__.py

Lines changed: 7 additions & 4 deletions
@@ -23,21 +23,24 @@
 Paper
 -----
 SCAO: Sparse Curvature-Aware Adaptive Optimization for Large-Scale Models
-NeurIPS 2026 (under review)
+Zenodo 2026
 """
 
-from .optimizer import SCAO
+from .optimizer import (
+SCAO, scao_sub1b, scao_1b, scao_3b, scao_7b, scao_40b, scao_125b
+)
 from .preconditioner import SparsePreconditioner
 from .utils import matrix_power_neg_quarter, adaptive_rank
 from . import logging as scao_logging
 
-__version__ = "0.1.1"
-__author__ = "SCAO Authors"
+__version__ = "0.2.0"
+__author__ = "Danilo Souza"
 __license__ = "Apache-2.0"
 
 __all__ = [
 # Main API — this is all most users need
 "SCAO",
+"scao_sub1b", "scao_1b", "scao_3b", "scao_7b", "scao_40b", "scao_125b",
 # Advanced / internals
 "SparsePreconditioner",
 "matrix_power_neg_quarter",

scao/logging.py

Lines changed: 2 additions & 2 deletions
@@ -23,7 +23,7 @@ def my_callback(metrics: dict):
 
 Metrics dict keys
 -----------------
-Standard (v1/v2):
+Standard (v1):
 step : int global optimizer step
 scao/rank_mean : float mean preconditioner rank across layers
 scao/rank_min : int minimum rank
@@ -36,7 +36,7 @@ def my_callback(metrics: dict):
 returning a stale dequantized estimate.
 scao/precond_freq : int configured precond_freq
 
-New in v3:
+New in v2:
 noise_std : float current gradient noise injection std (annealed)
 global_norm_ema : float slow EMA of mean per-layer gradient norm
 (used by R2 dynamic sparsity and R5 adaptive rank)
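
The docstring touched by these hunks documents the keys of the metrics dict passed to a callback such as `my_callback(metrics: dict)`. The sketch below shows one way such a callback might record a few of those keys. The key names come from the docstring; how the callback is registered with the optimizer is not visible in this diff, so the example simply calls it with a hand-written dict, and the `history` list is a name introduced here for illustration.

```python
from typing import Any

# Illustrative storage for recorded metrics; not part of the scao package.
history: list[dict[str, Any]] = []


def my_callback(metrics: dict) -> None:
    """Record selected SCAO metrics; keys absent from the dict are stored as None."""
    history.append({
        "step": metrics.get("step"),
        "rank_mean": metrics.get("scao/rank_mean"),
        "rank_min": metrics.get("scao/rank_min"),
        "noise_std": metrics.get("noise_std"),              # listed under "New in v2"
        "global_norm_ema": metrics.get("global_norm_ema"),  # listed under "New in v2"
    })


# Hand-written sample values; real values would come from the optimizer.
my_callback({"step": 10, "scao/rank_mean": 24.5, "scao/rank_min": 8, "global_norm_ema": 0.37})
```

Using `dict.get` keeps the callback tolerant of runs where the v2-only keys are not emitted.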
