
Commit 4fea20c

Bring back chaos optimizer as an experimental completely optional optimizer, v3 yaaay!
1 parent 2aba971 commit 4fea20c

9 files changed

Lines changed: 1612 additions & 20 deletions


CHANGELOG.md

Lines changed: 14 additions & 0 deletions

@@ -4,6 +4,20 @@ All notable changes to OdyssNet will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).

## [2.5.0] — 2026-04-14

### Added
- **ChaosGrad v3** (`odyssnet/training/chaos_optimizer.py`): Zero-hyperparameter optimizer re-introduced as a fully optional, drop-in custom optimizer. Pass it via `OdyssNetTrainer(model, optimizer=ChaosGrad(...))`. Default optimizer selection (Prodigy / AdamW) is unchanged.
- **Second-moment adaptive normalisation** (`v2` EMA + bias correction + `denom`) — closes the AdamW performance gap by continuously re-calibrating gradient scale.
- **Bias-corrected momentum** (`v_hat = v / (1 - β^t)`) — eliminates cold-start understepping.
- **Grad EMA signal reference** — replaces single-step `prev_grad` with a slow EMA (`α = 0.6`) for more stable hypergradient signals in recurrent regimes.
- **Group-aware frustration bursts** — Hebbian logits (`hebb_factor`, `hebb_decay`) are unconditionally excluded from burst noise. `chaos_core`/`memory`/`projections` receive full bursts; all other groups receive half-scale noise with no meta-reset.
- **9-group parameter classification** (`classify_params`) — `bias`, `norm`, and `scales` promoted from `lightweight` into dedicated groups with appropriate beta equilibria (0.95 for `chaos_core`/`memory`, 0.85 for `gates`).
- `OdyssNetTrainer.trigger_plateau_escape()` re-introduced (no-op when non-ChaosGrad optimizer is active).
- `OdyssNetTrainer.get_diagnostics()` automatically includes `'optimizer'` key with ChaosGrad diagnostics when ChaosGrad is detected.
- `ChaosGrad` exported from `odyssnet` public API.
- Neurogenesis (`trainer.expand()`) handles ChaosGrad migration natively: classified param groups are rebuilt for the grown model and global frustration state is preserved.

## [2.4.0] — 2026-04-10

### Added

CONTRIBUTING.md

Lines changed: 10 additions & 0 deletions

@@ -200,8 +200,18 @@ trainer = OdyssNetTrainer(model)

# AdamW: pass an explicit learning rate
trainer = OdyssNetTrainer(model, lr=3e-4)

# ChaosGrad: optional zero-hyperparameter optimizer (pass as custom optimizer)
from odyssnet import ChaosGrad
opt = ChaosGrad(ChaosGrad.classify_params(model), lr=1e-3)
trainer = OdyssNetTrainer(model, optimizer=opt)
```

> **Optimizer selection guide:**
> - **Prodigy** (`lr=None`, default) — best for quick experiments; non-deterministic curves.
> - **AdamW** (explicit `lr`) — reproducible runs, benchmarks, production.
> - **ChaosGrad** (pass as `optimizer=`) — research into self-tuning dynamics; ideal when `hebb_type` is enabled (Hebbian parameters are unconditionally protected from weight decay and burst noise).
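
For the reproducible-runs case above (AdamW with an explicit `lr`), a minimal sketch; it assumes `set_seed`, exported from the package root, takes an integer seed:

```python
from odyssnet import OdyssNet, OdyssNetTrainer, set_seed

set_seed(42)                               # seed RNGs for repeatable curves
model = OdyssNet(num_neurons=16, input_ids=[0], output_ids=[15])
trainer = OdyssNetTrainer(model, lr=3e-4)  # explicit lr selects the AdamW path
```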

---

## ⚡ Hardware Optimization

docs/LIBRARY.md

Lines changed: 123 additions & 1 deletion

@@ -195,7 +195,7 @@ Runs the dynamic system.

## OdyssNet Trainer (`odyssnet.training.trainer`)

The `OdyssNetTrainer` handles the training loop, gradient accumulation, mixed precision (AMP), and experimental features like Ghost Gradients. **Prodigy** is the default optimizer (auto-calibrating, no LR tuning required). Pass an explicit `lr` to use AdamW instead, or supply any custom optimizer — including **ChaosGrad**.

### Initialization

@@ -220,6 +220,11 @@ trainer = OdyssNetTrainer(
# Custom optimizer (bypasses both Prodigy and AdamW)
import torch
trainer = OdyssNetTrainer(model, optimizer=torch.optim.AdamW(model.parameters(), lr=1e-4))

# ChaosGrad — zero-hyperparameter optimizer (optional, see ChaosGrad section below)
from odyssnet import ChaosGrad
opt = ChaosGrad(ChaosGrad.classify_params(model), lr=1e-3)
trainer = OdyssNetTrainer(model, optimizer=opt)
```

**Parameters:**
@@ -481,3 +486,120 @@ trainer.fit(X, Y, epochs=100, thinking_steps=5)
model = OdyssNet(num_neurons=10, input_ids=range(10), output_ids=range(10), vocab_size=[784, 10])
# Model handles projection and decoding automatically.
```

---

## ChaosGrad Optimizer (`odyssnet.training.chaos_optimizer`)

ChaosGrad is a **fully optional**, zero-hyperparameter optimizer designed specifically for OdyssNet. Pass it as a custom optimizer to bypass the default Prodigy / AdamW selection.

The trainer's default behavior is **unchanged** — Prodigy when `lr=None`, AdamW when `lr=float`.

### When to use ChaosGrad

| Situation | Recommendation |
|-----------|----------------|
| Quick prototyping, first run | Prodigy (default) |
| Reproducible benchmarks | AdamW with explicit `lr` |
| Research into self-tuning dynamics, OdyssNet-specific regularisation | **ChaosGrad** |
| Hebbian plasticity enabled (`hebb_type != None`) | ChaosGrad handles hebb params specially |

### Usage

```python
from odyssnet import OdyssNet, OdyssNetTrainer, ChaosGrad

model = OdyssNet(num_neurons=32, input_ids=[0], output_ids=[31], device='cuda')

# Classify parameters for group-specific meta-adaptation
opt = ChaosGrad(ChaosGrad.classify_params(model), lr=1e-3)
trainer = OdyssNetTrainer(model, optimizer=opt, device='cuda')

for epoch in range(100):
    loss = trainer.train_batch(x, y, thinking_steps=10)
    # No LR schedule needed — ChaosGrad adapts autonomously

# Optional: manual plateau escape
trainer.trigger_plateau_escape()

# Diagnostics
diag = trainer.get_diagnostics(debug=True)
opt_diag = diag['optimizer']
print(f"Frustration: {opt_diag['frustration']:.3f}")
print(f"Avg eff. LR: {opt_diag['avg_effective_lr']:.4f}")
```

You can also pass plain `model.parameters()` without classification — every parameter will use the `lightweight` group defaults.
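
A minimal sketch of that unclassified path (fine for quick tests; on models with Hebbian plasticity prefer `classify_params` so the bypass rule below applies):

```python
from odyssnet import OdyssNet, OdyssNetTrainer, ChaosGrad

model = OdyssNet(num_neurons=16, input_ids=[0], output_ids=[15])

# No classification: every parameter falls into the `lightweight` group,
# so there are no group-specific beta equilibria and no Hebbian bypass rule.
opt = ChaosGrad(model.parameters(), lr=1e-3)
trainer = OdyssNetTrainer(model, optimizer=opt)
```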

### Parameter Classification

`ChaosGrad.classify_params(model)` divides OdyssNet parameters into 9 semantic groups:

| Group | Detection | Init Decay | Beta Equil | Burst |
|-------|-----------|-----------|------------|-------|
| `chaos_core` | `W` | 0.01 | 0.95 | Full |
| `memory` | `memory_feedback` | 0.0 | 0.95 | Full |
| `projections` | `embed`/`proj`/`output_decoder` | 0.01 | 0.90 | Full |
| `gates` | `input_gate`, `output_gate`, `core_gate`, `memory_gate` | 0.0 | 0.85 | Half |
| `hebbian` | `hebb_factor`, `hebb_decay` | 0.0 | 0.90 | **None** |
| `norm` | `norm.*` | 0.0 | 0.90 | Half |
| `bias` | `B` | 0.0 | 0.90 | Half |
| `scales` | `input_scale`, `output_scale` | 0.0 | 0.90 | Half |
| `lightweight` | everything else | 0.0 | 0.90 | Half |

**Hebbian Bypass Rule:** `hebb_factor` and `hebb_decay` **never** receive weight decay regardless of any hypergradient signal. Frustration bursts also skip these parameters entirely.
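
To see how a particular model is carved up, the classified groups can be inspected before constructing the optimizer. The sketch below assumes only that each group dict carries the usual `params` list plus whatever metadata keys `classify_params` attaches (their exact names are not specified here):

```python
from odyssnet import OdyssNet, ChaosGrad

model = OdyssNet(num_neurons=16, input_ids=[0], output_ids=[15])

for group in ChaosGrad.classify_params(model):
    # 'params' is the standard torch param-group entry; the remaining keys are
    # group metadata (name, initial decay, beta equilibrium, ...).
    meta = {k: v for k, v in group.items() if k != 'params'}
    n_params = sum(p.numel() for p in group['params'])
    print(f"{len(group['params']):3d} tensors, {n_params:8d} params -> {meta}")
```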

### Public API

| Method | Signature | Description |
|--------|-----------|-------------|
| `classify_params` | `@staticmethod classify_params(model)` | Returns list of classified param-group dicts |
| `step` | `step(closure=None)` | One autonomous optimization step |
| `report_loss` | `report_loss(loss_value)` | Feed loss to the Frustration Accumulator (trainer does this automatically) |
| `trigger_plateau_escape` | `trigger_plateau_escape()` | Force a frustration burst on the next step |
| `get_diagnostics` | `get_diagnostics(debug=False)` | Optimizer health metrics |
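
`OdyssNetTrainer` normally drives these methods for you. The sketch below exercises the optimizer-level API in isolation on a toy `torch.nn.Linear` problem (classification is skipped, so everything lands in `lightweight`) and assumes ChaosGrad exposes the standard `torch.optim.Optimizer` surface such as `zero_grad()`:

```python
import torch
from odyssnet import ChaosGrad

torch.manual_seed(0)
model = torch.nn.Linear(8, 1)                 # toy model, not an OdyssNet
opt = ChaosGrad(model.parameters(), lr=1e-3)

x = torch.randn(256, 8)
y = x.sum(dim=1, keepdim=True)

for step in range(500):
    loss = torch.nn.functional.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()                                # one autonomous optimization step
    opt.report_loss(loss.item())              # feed the Frustration Accumulator by hand

print(opt.get_diagnostics(debug=True))
```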

### Frustration Accumulator

ChaosGrad tracks loss stagnation internally. When `frustration > 0.75` (or `trigger_plateau_escape()` is called), it injects noise into the momentum buffers and resets meta-parameters toward their calibrated defaults — providing an automatic escape from plateaus without user intervention.

The `OdyssNetTrainer` automatically calls `report_loss()` after every optimizer step when ChaosGrad is detected.
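
Manual intervention is rarely needed, but if you want to escape a stall earlier than the built-in `frustration > 0.75` trigger, a hedged sketch follows (the 25-epoch patience and the `1e-4` improvement tolerance are illustrative values only; `x`, `y` are the batch tensors from the Usage example above):

```python
# Manual plateau watchdog layered on top of ChaosGrad's own Frustration Accumulator.
best_loss, stalled = float('inf'), 0

for epoch in range(500):
    loss = trainer.train_batch(x, y, thinking_steps=10)

    if loss < best_loss - 1e-4:
        best_loss, stalled = loss, 0
    else:
        stalled += 1

    frustration = trainer.get_diagnostics(debug=True)['optimizer']['frustration']
    if stalled >= 25 and frustration < 0.75:
        trainer.trigger_plateau_escape()   # force a burst before ChaosGrad would fire
        stalled = 0
```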

### Neurogenesis Compatibility

`trainer.expand(amount=N)` works transparently with ChaosGrad. The optimizer state (momentum, meta-parameters, second moments) is migrated to the grown network — old neurons preserve their learned adaptation, new neurons start from cold-start calibration. The global frustration state (`_frustration`, `_best_loss`, `_global_step`) is also preserved across the expansion.
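
A short sketch of growing mid-run (the growth amount and the trigger epoch are arbitrary illustration values; `x`, `y` as in the Usage example above):

```python
# Neurogenesis mid-training: ChaosGrad's per-parameter and global state migrate.
for epoch in range(200):
    loss = trainer.train_batch(x, y, thinking_steps=10)

    if epoch == 99:
        trainer.expand(amount=8)   # grow by 8 neurons: old adaptation is kept,
                                   # new neurons cold-start, frustration carries over

print(trainer.get_diagnostics(debug=True)['optimizer'])
```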

### Checkpoint Save / Load

ChaosGrad's global state (`frustration`, `best_loss`, `global_step`) is included in `optimizer.state_dict()` under the key `'chaos_global'` and is restored by `optimizer.load_state_dict()`. This means `save_checkpoint` / `load_checkpoint` round-trips preserve the full optimizer state including frustration dynamics:

```python
from odyssnet import save_checkpoint, load_checkpoint

save_checkpoint(model, trainer, path="run.pt")
epoch, loss = load_checkpoint(model, trainer, path="run.pt")
# trainer.optimizer._frustration is restored
```

If you override the genesis learning rate at load time, ChaosGrad reads it from the param group (not from `defaults`), so the override takes effect on the next step:

```python
epoch, loss = load_checkpoint(model, trainer, path="run.pt", lr=5e-4)
# ChaosGrad now uses genesis_lr=5e-4 for weight decay and update scaling
```

### Interactions with Other Features

| Feature | Interaction | Notes |
|---------|-------------|-------|
| **Synaptic noise** (`synaptic_noise > 0`) | Noise is added to weights *before* the forward pass. ChaosGrad's `sig_wd = cos(g, W)` therefore measures alignment against the *noisy* weight. | Intentional — noisy W is what gradient was computed against. |
| **Gradient clipping** (applied inside trainer) | All three hypergradient signals are computed on clipped gradients. `grad_ema` also tracks clipped gradients. | Clipping reduces signal magnitude but doesn't break adaptation. |
| **Gradient persistence** | Persisted gradients from the previous step are injected *before* the ChaosGrad step. `sig_lr` therefore measures consistency of the *combined* (current + persisted) gradient vs `grad_ema`. | No issue; effectively a soft gradient accumulation. |
| **Gradient accumulation** | `report_loss` is called once per optimizer step (not per micro-batch), with the un-normalized loss value. `global_step` tracks optimizer steps. | Correct — frustration reflects true convergence, not accumulation count. |
| **Gradient checkpointing** | Recomputes activations during backward. Gradient values reaching ChaosGrad are identical whether or not checkpointing is active. | Fully compatible. |
| **AMP (mixed precision)** | ChaosGrad receives gradients after `scaler.unscale_()`, i.e. at true float32 scale. ChaosGrad internally casts gradients to float32 (`g_f = grad.float()`). | Fully compatible. |
| **`regenerate_synapses()`** | When weak entries of `W` are re-initialised, the trainer automatically clears ChaosGrad's per-parameter state for `W`. Cold-start recalibration happens on the next step, re-computing `init_lr` from the new gradient scale. | If `revived == 0` (no weights regenerated), state is preserved. |
| **`transplant_weights()`** | Weight transplantation does *not* transfer optimizer state (by design — cold restart after transplant). ChaosGrad cold-starts on all parameters after loading transplanted weights. | Same behaviour as AdamW / Prodigy after transplant. |
| **Neurogenesis (`trainer.expand()`)** | Per-parameter tensors (`momentum`, `grad_ema`) are zero-padded to the new size. Scalar state (`init_lr`, `per_param_lr`, etc.) is copied unchanged. New neurons start from cold-start calibration. Global frustration is preserved. | Fully compatible. |
| **`classify_params` (skipped)** | If you pass `model.parameters()` directly instead of `classify_params(model)`, all parameters — including Hebbian logits — are treated as `lightweight`. The Hebbian bypass rule (no decay, no burst) does NOT apply. Always use `classify_params` on models with `hebb_type != None`. | Documented limitation; no crash. |
| **Anomaly hook** | ChaosGrad has its own internal plateau escape (frustration burst). The trainer's anomaly hook fires independently based on loss statistics. The two mechanisms don't interfere. | Use both together if needed. |

odyssnet/__init__.py

Lines changed: 3 additions & 1 deletion

@@ -1,7 +1,8 @@
__version__ = "2.5.0"

from .core.network import OdyssNet
from .training.trainer import OdyssNetTrainer
from .training.chaos_optimizer import ChaosGrad
from .utils.odyssstore import save_checkpoint, load_checkpoint, transplant_weights, get_checkpoint_info
from .utils.neurogenesis import Neurogenesis
from .utils.data import set_seed

@@ -10,6 +11,7 @@
__all__ = [
    'OdyssNet',
    'OdyssNetTrainer',
    'ChaosGrad',
    'save_checkpoint',
    'load_checkpoint',
    'transplant_weights',
