Commit b188e6b
feat: implement GPT training loop with multi-GPU and memory optimizations
- Reduce the activation memory footprint by recomputing LayerNorm and GeLU forward activations instead of storing them.
- Lay out layer-wise activation buffers through a centralized `TensorSpec` registry to support scaling to large batch sizes.
- Integrate cuBLASLt matmul fusions, optional cuDNN attention layers, and stochastic rounding options.
- Fall back gracefully to `cudaMallocManaged` under heavy memory load to prevent out-of-memory (OOM) crashes.
1 parent 8149064 commit b188e6b
1 file changed
Lines changed: 2070 additions & 0 deletions
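
The commit message mentions stochastic rounding options without saying how they work. As a rough illustration only, here is a minimal host-side sketch of one common form: rounding an fp32 value to bf16 by adding random bits to the 16 mantissa bits that are about to be discarded, then truncating. The bf16 target, the function names `stochastic_round_bf16` and `bf16_to_float`, and the caller-supplied random word `rnd` are all assumptions for this sketch, not details taken from the commit.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from the commit): round an fp32 value to bf16,
 * returned as its raw 16-bit pattern. `rnd` is a caller-supplied random
 * word; only its low 16 bits are used. */
static uint16_t stochastic_round_bf16(float x, uint32_t rnd) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);           /* reinterpret without aliasing issues */

    if ((bits & 0x7F800000u) == 0x7F800000u) {
        /* Inf or NaN: truncate, and force a mantissa bit so NaN stays NaN. */
        uint16_t hi = (uint16_t)(bits >> 16);
        if ((bits & 0x007FFFFFu) != 0) hi |= 0x0040u;
        return hi;
    }

    /* Add random bits to the 16 bits that will be dropped. A carry into
     * bit 16 rounds up; otherwise the value is rounded toward zero. The
     * chance of rounding up equals the discarded fraction, so the result
     * is unbiased in expectation when `rnd` is uniform. */
    bits += rnd & 0xFFFFu;
    return (uint16_t)(bits >> 16);
}

/* Hypothetical helper: widen a raw bf16 bit pattern back to fp32. */
static float bf16_to_float(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float x;
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

With a uniform `rnd`, a value exactly halfway between two bf16 neighbours rounds up half of the time, while a value that bf16 can already represent is returned unchanged for every `rnd`. A GPU version would draw `rnd` from a per-thread random number generator, which this sketch leaves out.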