Commit b188e6b

Eamon2009

committed

feat: implement GPT training loop with multi-GPU and memory optimizations

- Reduce the activation memory footprint by recomputing LayerNorm and GELU outputs during the backward pass instead of storing them.
- Lay out per-layer activation buffers through a centralized `TensorSpec` registry to support scaling to large batch sizes.
- Integrate cuBLASLt matmul fusions, optional cuDNN attention layers, and stochastic rounding options.
- Fall back gracefully to `cudaMallocManaged` under heavy memory load to prevent out-of-memory (OOM) crashes.
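
To illustrate how the registry and recomputation points above interact, here is a minimal host-side sketch. It is not the code from `main.cu`: the `TensorSpec` fields, the tensor names, the helper functions, and the 256-byte alignment are all assumptions made for illustration. The idea shown is that every activation tensor is described once in a table, offsets into a single allocation are derived from that table, and enabling GELU recomputation shrinks the stored GELU output from one buffer per layer to one shared buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical description of one activation tensor.
// The real TensorSpec in main.cu may carry different fields.
struct TensorSpec {
    std::string name;
    std::size_t count;      // number of elements
    std::size_t elem_size;  // bytes per element, e.g. 2 for bf16
};

// Result of planning: where each tensor lives inside one big allocation.
struct Layout {
    std::vector<std::size_t> offsets;  // byte offset of each tensor
    std::size_t total_bytes = 0;       // size of the single allocation
};

// Round n up to a multiple of alignment (which must be a power of two).
inline std::size_t align_up(std::size_t n, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (n + alignment - 1) & ~(alignment - 1);
}

// Build the registry for a simplified GPT block stack.
// B = batch size, T = sequence length, C = channels, L = layers.
// With recompute_gelu set, only one layer's GELU output is kept,
// because it is recomputed from the pre-GELU tensor in the backward pass.
inline std::vector<TensorSpec> make_activation_specs(
        std::size_t B, std::size_t T, std::size_t C, std::size_t L,
        bool recompute_gelu, std::size_t elem_size = 2) {
    const std::size_t gelu_layers = recompute_gelu ? 1 : L;
    return {
        {"encoded",     B * T * C,                   elem_size},
        {"fc_pre_gelu", L * B * T * 4 * C,           elem_size},
        {"fc_gelu",     gelu_layers * B * T * 4 * C, elem_size},
        {"residual",    L * B * T * C,               elem_size},
    };
}

// Walk the registry once and assign aligned offsets.
// The 256-byte default is an assumed alignment, chosen for illustration.
inline Layout plan_layout(const std::vector<TensorSpec>& specs,
                          std::size_t alignment = 256) {
    Layout layout;
    std::size_t cursor = 0;
    for (const TensorSpec& spec : specs) {
        layout.offsets.push_back(cursor);
        cursor = align_up(cursor + spec.count * spec.elem_size, alignment);
    }
    layout.total_bytes = cursor;
    return layout;
}
```

In a training program, `total_bytes` would be passed to a single device allocation and each tensor pointer set to the base address plus its offset. Because the batch size appears in only one place, changing it resizes every buffer consistently.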

1 parent 8149064 commit b188e6b

1 file changed

CUDA
- main.cu
