Commit aa3e3a5

post: Clever Adam - Literature Review & Project Plan
1 parent 24d8986 commit aa3e3a5

2 files changed

Lines changed: 231 additions & 0 deletions

Lines changed: 152 additions & 0 deletions
@@ -0,0 +1,152 @@
1+
---
2+
title: "Clever Adam - Literature Review: Robust Optimization under Heavy-Tailed Gradient Noise"
3+
published: 2026-05-18
4+
description: "A survey of adaptive optimizers, heavy-tailed gradient noise theory, gradient clipping, and robust estimation methods — and where Clever Adam fits in."
5+
tags: ["Research", "Deep Learning", "Optimization", "Heavy-Tailed Noise"]
6+
category: projects
7+
draft: false
8+
pinned: false
9+
---
10+
11+
## Introduction
12+
13+
Stochastic gradient descent (SGD) and its adaptive variants are the workhorses of modern deep learning. A growing body of evidence shows that the stochastic gradients encountered in practice are far from well-behaved: their distribution is typically *heavy-tailed*, with very large or even infinite variance and frequent large outliers [1]. This violates the light-tailed (e.g., sub-Gaussian) assumptions underlying classical convergence theory and can degrade the performance of standard optimizers.
14+
15+
This review surveys three lines of research that address this challenge: (i) adaptive optimization methods, (ii) gradient clipping and normalization, and (iii) robust gradient estimation and sign-based approaches.
16+
17+
---
18+
19+
## Adaptive Optimization Methods
20+
21+
**Adam** [2] combines momentum with per-parameter adaptive learning rates via estimates of the first and second moments of the gradient. Despite its popularity, Adam has known convergence issues.
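
For reference, the update described above can be sketched for a single scalar parameter. This is a minimal pure-Python illustration of the rule in [2], not a production implementation; the function name `adam_step` and the toy quadratic objective are assumptions made for this example.

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter (t is the 1-based step count)."""
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment (momentum) estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias correction for zero initialization
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: minimize f(theta) = theta^2, whose gradient is 2 * theta.
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2.0 * theta, m, v, t, lr=1e-2)
```

Because the step is the ratio of the two moment estimates, a single extreme gradient inflates `v` for many subsequent steps, which is one reason heavy-tailed noise interacts badly with this rule.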
22+
23+
**AMSGrad** [3] grew out of the observation by Reddi et al. that the exponential moving average of squared gradients can cause the effective step size to increase from one step to the next, which breaks Adam's convergence proof. The fix, maintaining the maximum of past second-moment estimates, restores convergence in the convex setting but, in practice, does not consistently outperform Adam.
24+
25+
Three subsequent variants target different shortcomings:
26+
27+
- **RAdam** [4] observes that the variance of the adaptive learning rate is problematically large during the early training phase (when the second-moment estimate is built from only a few gradient samples) and introduces a variance-rectification term that acts as an automatic warm-up.
28+
- **AdaBelief** [5] replaces the second-moment denominator with the variance of the gradient relative to the momentum direction, i.e., it adapts step sizes based on how much the observed gradient *deviates* from its predicted value. This "belief" mechanism yields faster convergence and better generalization in many settings.
29+
- **BDS-Adam** [6] addresses biased gradient estimation and early-training instability through a dual-path framework combining nonlinear gradient mapping with adaptive variance rectification.
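
To make the contrast between Adam and AdaBelief concrete, the two denominators can be written side by side. The notation below ($g_t$ for the stochastic gradient, $m_t$ for its exponential moving average, $\beta_2$ for the decay rate, $\epsilon$ for a small constant) follows the usual conventions of [2] and [5] and is introduced here only for illustration.

$$
\begin{aligned}
\text{Adam:} \quad & v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2, \\
\text{AdaBelief:} \quad & s_t = \beta_2 s_{t-1} + (1 - \beta_2)\, (g_t - m_t)^2 + \epsilon .
\end{aligned}
$$

In both cases the step is proportional to $\hat{m}_t / (\sqrt{\cdot} + \epsilon)$ with the bias-corrected estimate in the denominator. When $g_t$ stays close to $m_t$, $s_t$ is small and AdaBelief takes a larger step; when $g_t$ deviates strongly from $m_t$, the step shrinks.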
30+
31+
A fundamentally different design philosophy emerges in sign-based optimizers. **Lion** [7], discovered via symbolic search, uses only the sign of the momentum to update parameters. Its simplicity and competitive performance suggest that discarding gradient magnitude entirely — rather than adapting it — can be a viable strategy in noisy settings.
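
A scalar sketch of the Lion update from [7] shows how the gradient magnitude drops out of the step. The function names and default hyperparameters are choices made for this illustration.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

def lion_step(theta, grad, m, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """One Lion update for a scalar parameter."""
    c = beta1 * m + (1.0 - beta1) * grad                    # interpolated update direction
    theta = theta - lr * (sign(c) + weight_decay * theta)   # only the sign of c is used
    m = beta2 * m + (1.0 - beta2) * grad                    # momentum tracked with a second decay rate
    return theta, m

# A tiny gradient and a huge outlier gradient produce the same parameter step.
small_step, _ = lion_step(1.0, 1e-3, 0.0)
outlier_step, _ = lion_step(1.0, 1e6, 0.0)
```

The outlier still enters the momentum buffer `m`, so it can influence the sign of later steps, but it cannot enlarge any single step.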
32+
33+
> **Key gap:** None of these methods explicitly detects or adapts to the *tail behavior* of the gradient noise. Their convergence guarantees either assume bounded second moments (Adam, AMSGrad, RAdam, AdaBelief) or do not distinguish between light-tailed and heavy-tailed regimes (Lion, BDS-Adam). In practice, the same update rule is applied regardless of whether the observed noise is light-tailed or heavy-tailed.
34+
35+
---
36+
37+
## Heavy-Tailed Gradient Noise: Theory and Evidence
38+
39+
The seminal work of Gürbüzbalaban et al. [1] established that SGD iterates can converge to a heavy-tailed stationary distribution even in simple settings such as linear regression with Gaussian data. The tail index $\alpha$ (governing the power-law decay of this stationary distribution) depends on the ratio of learning rate to batch size, as well as on curvature and dimension: smaller batch sizes and larger learning rates yield heavier tails.
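
A tail index can be estimated from samples with the Hill estimator, a standard tool from extreme-value statistics. The sketch below is an illustration only: neither this review nor [1] prescribes this estimator, and the function name, the choice of `k`, and the synthetic Pareto and Gaussian samples are assumptions.

```python
import math
import random

def hill_tail_index(samples, k):
    """Hill estimate of the tail index from the k largest absolute values."""
    xs = sorted((abs(x) for x in samples), reverse=True)
    if not 0 < k < len(xs) or xs[k] <= 0.0:
        raise ValueError("need 0 < k < n and a positive threshold value")
    threshold = xs[k]  # the (k+1)-th largest absolute value
    mean_log_excess = sum(math.log(x / threshold) for x in xs[:k]) / k
    return 1.0 / mean_log_excess

rng = random.Random(0)
# Pareto samples with true tail index 1.5, so the variance is infinite.
heavy = [rng.paretovariate(1.5) for _ in range(20000)]
# Gaussian samples: light tails, so the estimate should come out much larger.
light = [rng.gauss(0.0, 1.0) for _ in range(20000)]

alpha_heavy = hill_tail_index(heavy, k=1000)
alpha_light = hill_tail_index(light, k=1000)
```

An estimate below 2 is consistent with infinite variance. The result is sensitive to `k`, so in practice it is usually inspected over a range of values.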
40+
41+
Subsequent work has deepened this picture from multiple angles:
42+
43+
- **Raj et al.** [8] analyzed the algorithmic stability of heavy-tailed SGD, showing that heavy-tailed noise degrades generalization bounds.
44+
- **Zhang et al.** [9] proved that non-convex SGD with heavy-tailed noise can still converge, but the rate depends critically on the tail index and the clipping threshold.
45+
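- **Kunstner et al.** [10] challenged the simplistic narrative that noise alone explains the SGD–Adam gap on Transformers: the gap persists, and even widens, as the batch size grows toward full batch, where sign descent with momentum behaves much like Adam. This suggests that the *sign structure* of the update, rather than the gradient noise, may be the dominant factor.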
46+
- **Fatkhullin et al.** [11] recently proved that vanilla SGD can achieve minimax-optimal convergence under heavy-tailed noise, but only in probability (not in high probability), leaving room for methods that provide stronger guarantees.
47+
48+
These results collectively motivate the design of optimizers that are *aware* of the noise regime and adapt their update rule accordingly.
49+
50+
---
51+
52+
## Gradient Clipping and Normalization
53+
54+
Gradient clipping, which rescales the gradient whenever its norm exceeds a threshold, is the most widely adopted defense against heavy-tailed noise.
55+
56+
- **Sadiev et al.** [12] established high-probability convergence guarantees for clipped gradient methods under heavy-tailed noise, showing that a suitably chosen clipping threshold yields convergence even when the noise has only bounded $p$-th moments for $p \in (1,2]$, at a rate that depends on $p$ and matches the finite-variance rate at $p = 2$.
57+
- **Hübler et al.** [13] provided a unified analysis connecting gradient clipping to gradient normalization, demonstrating that both can be viewed as projections onto a bounded set, with the choice between them trading off bias and variance.
58+
- **Sun et al.** [14] revisited this comparison for non-convex SGD, establishing tighter convergence rates and clarifying when normalization is preferable to clipping. They further demonstrated that gradient normalization alone, without clipping, is sufficient to ensure SGD convergence under heavy-tailed noise.
59+
- **Chen et al.** [15] showed that clipping provably improves Adam-Norm and AdaGrad-Norm specifically when the noise is heavy-tailed, providing direct justification for integrating clipping into adaptive methods.
60+
- Clipped SGD has also been extended to the convex $(L_0, L_1)$-smooth setting [16], broadening its applicability beyond standard smoothness assumptions.
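
The difference between the two operations surveyed above is easiest to see on a small vector. This is a minimal sketch on plain Python lists; the helper names and the threshold value are assumptions for the example.

```python
import math

def l2_norm(g):
    """Euclidean norm of a gradient given as a list of floats."""
    return math.sqrt(sum(x * x for x in g))

def clip_by_norm(g, threshold):
    """Rescale g so its norm is at most threshold; small gradients pass through unchanged."""
    norm = l2_norm(g)
    scale = min(1.0, threshold / norm) if norm > 0.0 else 1.0
    return [scale * x for x in g]

def normalize(g, eps=1e-12):
    """Rescale g to (approximately) unit norm regardless of its original magnitude."""
    norm = l2_norm(g)
    return [x / (norm + eps) for x in g]

typical = [0.3, 0.4]       # norm 0.5
outlier = [300.0, 400.0]   # norm 500

clipped_typical = clip_by_norm(typical, threshold=1.0)   # unchanged
clipped_outlier = clip_by_norm(outlier, threshold=1.0)   # shrunk to norm 1
normalized_typical = normalize(typical)                  # stretched to norm 1
```

Clipping needs a threshold but leaves typical gradients untouched, whereas normalization needs no threshold but discards magnitude information on every step.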
61+
62+
> **Key limitation:** The clipping threshold remains a hyperparameter that must be tuned; an adaptive mechanism that sets this threshold based on observed noise characteristics would be substantially more practical — a direction we pursue in our work.
63+
64+
---
65+
66+
## Robust Gradient Estimation and Sign-Based Methods
67+
68+
A fundamentally different approach to heavy-tailed noise replaces the gradient mean with a more robust estimator.
69+
70+
### Sign-based methods
71+
72+
**SignSGD** [17] transmits only the sign of each gradient component, achieving compression and implicit robustness: the sign operation naturally clips extreme values. **Korotin et al.** [18] provided high-probability convergence bounds for the sign operator under heavy-tailed noise, demonstrating that sign-based updates provably mitigate the effect of heavy tails at the cost of slower convergence in the light-tailed regime. **Jiang et al.** [19] improved the convergence rate of signSGD through variance reduction via control variates, narrowing the gap with full-gradient methods.
73+
74+
### Median-based methods
75+
76+
**Schaipp et al.** [20] developed a stochastic proximal point method that tracks the median of gradient mini-batches rather than the mean, establishing a formal connection between clipping and median estimation that was not previously known. The **R-SGD-Mini** framework [21] generalizes this to medoid-based sampling, providing finite-sample guarantees under heavy-tailed noise without requiring gradient clipping. Robust estimation theory underpins these approaches: Prasad et al. [22] show that replacing the empirical mean of per-sample gradients with a robust mean estimator yields gradient descent with statistical guarantees under both Huber $\varepsilon$-contamination and heavy-tailed data.
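
To illustrate why replacing the mean helps, the sketch below applies the classical median-of-means estimator to one gradient coordinate. It is a generic textbook estimator, not the specific method of [20], [21], or [22]; the function name and the toy data are assumptions.

```python
import statistics

def median_of_means(values, num_blocks):
    """Split values into equal blocks, average each block, and return the median of the averages."""
    n = len(values)
    if not 0 < num_blocks <= n:
        raise ValueError("num_blocks must be between 1 and len(values)")
    block_size = n // num_blocks  # any leftover values at the end are ignored
    block_means = [
        statistics.fmean(values[i * block_size:(i + 1) * block_size])
        for i in range(num_blocks)
    ]
    return statistics.median(block_means)

# Nineteen well-behaved per-sample gradients and one extreme outlier.
per_sample_grads = [1.0] * 19 + [1000.0]

plain_mean = statistics.fmean(per_sample_grads)                 # dragged up by the outlier
robust_mean = median_of_means(per_sample_grads, num_blocks=5)   # outlier confined to one block
```

The outlier corrupts only one of the five block means, so the median of the block means stays at the typical value while the plain mean is pulled far away from it.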
77+
78+
### Adaptive heavy-tailed methods
79+
80+
The **AHTSGD** framework [23] is perhaps the closest prior work to our setting: it dynamically injects $\alpha$-stable noise with a tail index that adapts based on exponentially averaged log-sharpness, transitioning from heavy-tailed (small $\alpha$, for exploration) to lighter-tailed (large $\alpha \to 2$, for convergence) as training progresses.
81+
82+
However, AHTSGD operates on a fundamentally different principle from our work: it *injects* synthetic heavy-tailed noise to escape sharp minima, rather than *detecting and responding to* the inherent heavy-tailed noise in the gradient distribution. Moreover, AHTSGD's noise adaptation relies on log-sharpness of the loss landscape — an indirect geometric proxy — whereas our noise monitor directly estimates tail statistics of the gradient distribution itself (norm CV, top-$k$ extremal ratios, and moving-average deviation), providing a more direct and responsive signal.
83+
84+
---
85+
86+
## Research Gap and Motivation
87+
88+
The literature reveals a precise gap that no existing method fills.
89+
90+
**Adaptive optimizers** (Adam, AMSGrad, RAdam, AdaBelief, BDS-Adam) are the default in deep learning but apply a fixed update rule regardless of the noise regime. Their convergence proofs assume bounded second moments; when this assumption is violated — as it routinely is in practice [1] — they offer no guarantees and degrade empirically. Lion [7] sidesteps this by discarding magnitude entirely, but this is a blunt instrument: in light-tailed settings where Adam's adaptive scaling is beneficial, Lion leaves performance on the table.
91+
92+
**Robust methods** (clipping, normalization, sign-based, median-based) provide strong defenses against heavy-tailed noise but are typically applied as *fixed wrappers* around SGD. They require manual tuning of thresholds (clipping) or incur higher computational cost (median-based), and — critically — they do not integrate with Adam's momentum and curvature estimates. Chen et al. [15] proved that clipping helps Adam under heavy tails, but the clipping threshold was fixed a priori rather than adapted to the observed noise.
93+
94+
**Adaptive heavy-tailed methods** come closest to our vision, but with a fundamental mismatch. AHTSGD [23] adaptively controls the *injection* of synthetic noise for exploration; it does not detect or respond to the *inherent* heavy-tailed noise in the gradient distribution, nor does it integrate with Adam-family optimizers.
95+
96+
### The gap is threefold:
97+
98+
1. No existing optimizer **directly monitors the tail statistics** of the gradient distribution in real time (as opposed to indirect proxies such as loss landscape sharpness)
99+
2. No method **smoothly transitions** between standard and robust update modes within a single optimizer
100+
3. No approach **integrates** noise-aware adaptation with Adam's momentum and curvature estimates
101+
102+
This motivates **Clever Adam**, which aims to address all three deficiencies through a lightweight noise monitor (gradient norm CV, top-$k$ extremal ratios, and moving-average deviation), a dual-regime update that applies standard Adam under light-tailed noise and switches to clipped/sign-based updates under heavy-tailed noise, and a smooth blending gate $\alpha = \sigma(f(\text{tail\_metric}))$ that avoids abrupt mode transitions. Here $\sigma$ is the sigmoid function, and the gate value $\alpha \in (0, 1)$ is distinct from the tail index $\alpha$ used earlier in this review.
103+
104+
---
105+
106+
## References
107+
108+
[1] Gürbüzbalaban, M., Şimşekli, U., and Zhu, L. "The Heavy-Tail Phenomenon in SGD." *ICML*, 2021.
109+
110+
[2] Kingma, D. P. and Ba, J. "Adam: A Method for Stochastic Optimization." *ICLR*, 2015.
111+
112+
[3] Reddi, S. J., Kale, S., and Kumar, S. "On the Convergence of Adam and Beyond." *ICLR*, 2018.
113+
114+
[4] Liu, L., Jiang, H., He, P., et al. "On the Variance of the Adaptive Learning Rate and Beyond." *ICLR*, 2020.
115+
116+
[5] Zhuang, J., Tang, T., Ding, Y., et al. "AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients." *NeurIPS*, 2020.
117+
118+
[6] Shao, Y., Weng, S., Sun, H., et al. "BDS-Adam: Adaptive Variance Rectification with Semi-Adaptive Gradient Smoothing." *Scientific Reports*, 2025.
119+
120+
[7] Chen, X., Liang, C., Huang, D., et al. "Symbolic Discovery of Optimization Algorithms." *NeurIPS*, 2023.
121+
122+
[8] Raj, A., et al. "Algorithmic Stability of Heavy-Tailed SGD with General Loss Functions." *AISTATS*, 2023.
123+
124+
[9] Zhang, J., et al. "Robustness Analysis of Non-Convex Stochastic Gradient Descent with Heavy-Tailed Noise." *NeurIPS*, 2020.
125+
126+
[10] Kunstner, F., Chen, J., Lavington, J. W., and Schmidt, M. "Noise Is Not the Main Factor Behind the Gap Between SGD and Adam on Transformers, but Sign Descent Might Be." *ICLR*, 2023.
127+
128+
[11] Fatkhullin, I., Hübler, F., and Lan, G. "Can SGD Handle Heavy-Tailed Noise?" *NeurIPS Workshop*, 2025.
129+
130+
[12] Sadiev, A., et al. "Improved Convergence in High Probability of Clipped Gradient Methods with Heavy Tailed Distributed Noise." *NeurIPS*, 2023.
131+
132+
[13] Hübler, F., Fatkhullin, I., and He, N. "From Gradient Clipping to Normalization for Heavy Tailed SGD." *AISTATS*, 2025.
133+
134+
[14] Sun, T., Liu, X., and Yuan, K. "Revisiting Gradient Normalization and Clipping for Nonconvex SGD under Heavy-Tailed Noise." *JMLR*, 2025.
135+
136+
[15] Chen, X., Zhou, Y., and Wang, Z. "Clipping Improves Adam-Norm and AdaGrad-Norm when the Noise Is Heavy-Tailed." *arXiv:2406.07780*, 2024.
137+
138+
[16] Chezhegov, D., Beznosikov, A., et al. "Convergence of Clipped-SGD for Convex $(L_0, L_1)$-Smooth Optimization with Heavy-Tailed Noise." *arXiv:2505.20817*, 2025.
139+
140+
[17] Bernstein, J., Wang, Y.-X., Azizzadenesheli, K., and Anandkumar, A. "signSGD: Compressed Optimisation for Non-Convex Problems." *ICML*, 2018.
141+
142+
[18] Korotin, A., et al. "Sign Operator for Coping with Heavy-Tailed Noise: High Probability Convergence Bounds and Beyond." *arXiv:2502.07923*, 2025.
143+
144+
[19] Jiang, W., Yang, S., Yang, W., and Zhang, L. "Efficient Sign-Based Optimization: Accelerating Convergence via Variance Reduction." *NeurIPS*, 2024.
145+
146+
[20] Schaipp, F., Garrigos, G., Simsekli, U., and Gower, R. "Tracking the Median of Gradients with a Stochastic Proximal Point Method." *TMLR*, 2025.
147+
148+
[21] Vukovic, M. and Jakovetic, D. "Robust Stochastic First Order Methods in Heavy-Tailed Noise via Medoid Mini-Batch Gradient Sampling." *arXiv:2605.07634*, 2025.
149+
150+
[22] Prasad, A., Suggala, A. S., Balakrishnan, S., and Ravikumar, P. "Robust Estimation via Robust Gradient Estimation." *JRSS-B*, 2020.
151+
152+
[23] Gong, B., Batista, G., and Micheaux, P. L. D. "Adaptive Heavy-Tailed Stochastic Gradient Descent." *arXiv:2508.21353*, 2025.
Lines changed: 79 additions & 0 deletions
@@ -0,0 +1,79 @@
1+
---
2+
title: "Clever Adam - Project Plan: Adaptive Noise-Aware Optimization"
3+
published: 2026-05-18
4+
description: "Research plan for Clever Adam, an optimizer that detects and adapts to heavy-tailed gradient noise in deep learning."
5+
tags: ["Research", "Deep Learning", "Optimization", "Project Plan"]
6+
category: projects
7+
draft: false
8+
pinned: false
9+
---
10+
11+
## Project Summary
12+
13+
**Problem:** Gradient noise in deep learning is typically heavy-tailed, degrading the convergence of standard Adam.
14+
15+
**Method:** Clever Adam detects gradient tail behavior in real time and adaptively switches between standard Adam (light-tailed) and robust clipped/sign-based updates (heavy-tailed), with a smooth blending gate to avoid abrupt transitions.
16+
17+
---
18+
19+
## Timeline
20+
21+
| Phase | Dates | Tasks |
22+
|-------|-------|-------|
23+
| **Phase 0** | Apr 20 – May 21 | Literature review: Adam, AMSGrad, heavy-tail SGD, clipping theory, sign-based methods. Write literature review document. Environment setup: PyTorch, MLflow, Git. Baseline experiments with standard Adam. Build gradient noise analysis toolkit. |
24+
| **Phase 1a** | May 22 – May 26 | Implement Noise Monitor (norm CV, top-$k$ ratio, deviation). Implement Dual-Regime Update + Smooth Blending Gate. Smoke test on CIFAR-10. |
25+
| **Phase 1b** | May 27 – May 31 | Integrate baseline optimizers: AdamW, RAdam, AdaBelief, AMSGrad, SignSGD, Lion, SGD+Momentum. |
26+
| **Phase 2** | Jun 1 – Jun 9 | Large-scale experiments: optimizer × dataset × batch size × learning rate, 3 seeds each. |
27+
| **Phase 3** | Jun 10 – Jun 15 | Result analysis, failure cases, visualization, statistical tests. |
28+
29+
---
30+
31+
## Clever Adam: Three Core Components
32+
33+
### 1. Noise Monitor
34+
35+
A lightweight, real-time diagnostic that tracks three complementary signals:
36+
37+
- **Coefficient of variation (CV)** of gradient norms — high CV suggests heavy tails
38+
- **Top-$k$ extremal ratio** — frequent large outliers indicate heavy tails
39+
- **Moving-average deviation** — sudden spikes suggest tail events
40+
41+
Each signal is intended to cost $\mathcal{O}(1)$ extra work per step once the gradient norm is available. No kurtosis computation is needed.
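
One possible shape for such a monitor is sketched below. The plan does not yet fix the exact definitions, so everything here is an assumption made for illustration: the class name, the sliding window of recent gradient norms, the top-$k$ ratio defined as the share of total norm mass held by the $k$ largest norms in the window, and the deviation measured relative to an exponential moving average. This simple version also does work proportional to the window size on each step, so it trades some efficiency for readability.

```python
from collections import deque
import statistics

class NoiseMonitor:
    """Hypothetical monitor that tracks three tail signals from scalar gradient norms."""

    def __init__(self, window=100, k=5, ema_decay=0.99):
        self.norms = deque(maxlen=window)  # sliding window of recent gradient norms
        self.k = k
        self.ema_decay = ema_decay
        self.ema = None                    # exponential moving average of the norm

    def update(self, grad_norm):
        # Deviation is measured against the average *before* the new norm is folded in.
        if self.ema is None:
            deviation = 0.0
            self.ema = grad_norm
        else:
            deviation = abs(grad_norm - self.ema) / (self.ema + 1e-12)
            self.ema = self.ema_decay * self.ema + (1.0 - self.ema_decay) * grad_norm

        self.norms.append(grad_norm)
        mean = statistics.fmean(self.norms)
        cv = statistics.pstdev(self.norms) / (mean + 1e-12)
        top_k = sorted(self.norms, reverse=True)[: self.k]
        top_k_ratio = sum(top_k) / (sum(self.norms) + 1e-12)
        return {"cv": cv, "top_k_ratio": top_k_ratio, "ma_deviation": deviation}

monitor = NoiseMonitor()
for _ in range(50):
    calm = monitor.update(1.0)    # steady gradient norms
spike = monitor.update(100.0)     # a single tail event
```

In this toy run all three signals react to the spike. How they should be combined into a single tail metric is left open here.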
42+
43+
### 2. Dual-Regime Update
44+
45+
- **Light-tailed mode:** Standard Adam update for fast convergence
46+
- **Heavy-tailed mode:** Gradient clipping or sign-based projection for robustness
47+
48+
### 3. Smooth Blending Gate
49+
50+
A continuous gate $\alpha = \sigma(f(\text{tail\_metric}))$ interpolates between the two regimes, where $\sigma$ is the sigmoid function. This avoids abrupt mode transitions that could destabilize training.
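
A minimal sketch of such a gate follows, assuming a hypothetical affine choice $f(x) = \text{sharpness} \cdot (x - \text{threshold})$ and a convex combination of the two candidate update directions. The plan does not specify $f$ or the blending rule, so the function names, the `threshold` and `sharpness` parameters, and the non-negative tail metric are all assumptions.

```python
import math

def blending_gate(tail_metric, threshold=1.0, sharpness=4.0):
    """Sigmoid gate in (0, 1): near 0 for light tails, near 1 for heavy tails.

    Assumes a non-negative tail_metric; f is a hypothetical affine map.
    """
    return 1.0 / (1.0 + math.exp(-sharpness * (tail_metric - threshold)))

def blended_direction(adam_direction, robust_direction, gate):
    """Convex combination of the standard and robust update directions."""
    return [(1.0 - gate) * a + gate * r
            for a, r in zip(adam_direction, robust_direction)]

adam_dir = [0.5, -2.0]
robust_dir = [1.0, -1.0]   # e.g. the sign of the Adam direction

light_gate = blending_gate(0.0)   # well below the threshold
mid_gate = blending_gate(1.0)     # exactly at the threshold
heavy_gate = blending_gate(3.0)   # well above the threshold
update = blended_direction(adam_dir, robust_dir, mid_gate)
```

Because the gate is continuous in the tail metric, a small change in the measured noise produces a small change in the update, which is the property the design is after.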
51+
52+
---
53+
54+
## How It Differs from Existing Work
55+
56+
| | AHTSGD | Clever Adam |
57+
|---|--------|-------------|
58+
| **Goal** | Exploration (escape sharp minima) | Robustness (handle noisy gradients) |
59+
| **Action** | Injects synthetic heavy-tailed noise | Detects and responds to inherent heavy-tailed noise |
60+
| **Signal** | Loss landscape log-sharpness (indirect) | Gradient distribution tail statistics (direct) |
61+
| **Base optimizer** | SGD | Adam |
62+
63+
---
64+
65+
## Experiment Matrix
66+
67+
- **Optimizers:** Clever Adam, Adam, AdamW, RAdam, AdaBelief, AMSGrad, SignSGD, Lion, SGD+Momentum
68+
- **Datasets:** CIFAR-10, CIFAR-100, Tiny-ImageNet
69+
- **Batch size:** 32, 64, 128, 256, 512 (primary noise control variable)
70+
- **Learning rate:** $10^{-4}$, $3 \times 10^{-4}$, $10^{-3}$, $3 \times 10^{-3}$
71+
- **Seeds:** 3 per configuration
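
The size of the full grid follows directly from the lists above. The short script below enumerates it; the variable names are illustrative, and it assumes every optimizer is run at every learning rate, which the plan does not state explicitly.

```python
from itertools import product

optimizers = ["Clever Adam", "Adam", "AdamW", "RAdam", "AdaBelief",
              "AMSGrad", "SignSGD", "Lion", "SGD+Momentum"]
datasets = ["CIFAR-10", "CIFAR-100", "Tiny-ImageNet"]
batch_sizes = [32, 64, 128, 256, 512]
learning_rates = [1e-4, 3e-4, 1e-3, 3e-3]
seeds_per_config = 3

# Every (optimizer, dataset, batch size, learning rate) combination is one configuration.
configs = list(product(optimizers, datasets, batch_sizes, learning_rates))
num_configs = len(configs)                 # 9 * 3 * 5 * 4 = 540
num_runs = num_configs * seeds_per_config  # 540 * 3 = 1620
```

At 1620 training runs, the full grid is sizeable for the nine-day window allotted to Phase 2, so it may be worth deciding in advance which slices to prioritize.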
72+
73+
---
74+
75+
## Expected Contributions
76+
77+
1. A **lightweight noise monitor** for real-time detection of gradient tail behavior
78+
2. A **dual-regime optimization strategy** that integrates standard Adam with robust updates via a smooth blending gate
79+
3. An **empirical characterization** of when and why heavy-tailed noise degrades adaptive optimizers, including honest analysis of failure modes
