Commit a7d4fd1

adds perf plots need conv plots
Signed-off-by: Jonathan Mitchell <jomitchell@nvidia.com>
1 parent eac12c3 commit a7d4fd1

2 files changed

Lines changed: 6 additions & 2 deletions

bionemo-recipes/recipes/llama3_native_te/README.md

@@ -67,8 +67,8 @@ def compute_model_pflops(seq_len, global_batch_size, step_time_s):

 ### Low precision performance benchmarks

-![Performance Benchmarks Low Precision](../../../docs/docs/assets/images/llama3/llama3_1b_fsdp2_tflops.png)
-In the above plot we can see the performance increases as we lower the precision of our transformer layers.
+![Performance Benchmarks Low Precision](../../../docs/docs/assets/images/llama3/llama3_8gpu_tflops.png)
+In the above plot we can see the performance increases as we lower the precision of our transformer layers across the 1B and 8B variant of LLAMA3.

 ### Convergence Benchmarks

@@ -94,6 +94,10 @@ are due checkpointing, further work will be done to improve training step time s
 Models were trained on 64 NVIDIA H100 GPUs with a micro batch size of 4 and a context length of 4096 for 60,000 steps.
 Training was performed with BF16 precision.

+### Low Precision convergence benchmarks
+
+<!-- ....TODO from WandB once ready. -->
+
 ### Distributed Training

 This recipe supports distributed training using DDP, FSDP2, and FSDP2 with Context Parallelism, shown in three separate training entrypoints:
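
The context lines of the second hunk give the convergence-run configuration: 64 GPUs, a micro batch size of 4, a context length of 4096, and 60,000 steps. As a rough sanity check, the sketch below works out the implied token budget. It assumes pure data parallelism across all 64 GPUs and no gradient accumulation; neither is stated in the diff, and the helper name `tokens_per_step` is hypothetical and not part of the recipe.

```python
def tokens_per_step(num_gpus, micro_batch_size, seq_len, grad_accum_steps=1):
    """Tokens consumed per optimizer step.

    Assumes every GPU is a data-parallel rank (no context or tensor
    parallelism); grad_accum_steps=1 is an assumption, not a value
    taken from the README.
    """
    return num_gpus * micro_batch_size * grad_accum_steps * seq_len


# Values quoted in the README context lines of the diff.
NUM_GPUS = 64
MICRO_BATCH_SIZE = 4
SEQ_LEN = 4096
NUM_STEPS = 60_000

per_step = tokens_per_step(NUM_GPUS, MICRO_BATCH_SIZE, SEQ_LEN)
total_tokens = per_step * NUM_STEPS

print(f"tokens per step: {per_step:,}")
print(f"total tokens:    {total_tokens:,}")
```

Under these assumptions each step processes about 1.05M tokens, for roughly 62.9B tokens over the full run. If the run used gradient accumulation or context parallelism, the figures would scale accordingly.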