Commit f6e3139

Refactor accelerate test suite, add amplify tests
Signed-off-by: Peter St. John <pstjohn@nvidia.com>
1 parent 223b211 commit f6e3139

11 files changed

Lines changed: 490 additions & 308 deletions

Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
+compute_environment: LOCAL_MACHINE
+debug: false
+distributed_type: MULTI_GPU
+downcast_bf16: 'no'
+enable_cpu_affinity: false
+machine_rank: 0
+main_training_function: main
+mixed_precision: bf16
+num_machines: 1
+num_processes: 1
+rdzv_backend: c10d
+same_network: true
+tpu_env: []
+tpu_use_cluster: false
+tpu_use_sudo: false
+use_cpu: false
+dynamo_config:
+  dynamo_backend: INDUCTOR
Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
+compute_environment: LOCAL_MACHINE
+debug: false
+distributed_type: MULTI_GPU
+downcast_bf16: 'no'
+enable_cpu_affinity: false
+machine_rank: 0
+main_training_function: main
+mixed_precision: fp8
+fp8_config:
+  amax_compute_algorithm: max
+  amax_history_length: 1024
+  backend: TE
+  fp8_format: HYBRID
+  interval: 1
+  margin: 0
+  override_linear_precision:
+  - false
+  - false
+  - false
+  use_autocast_during_eval: false
+num_machines: 1
+num_processes: 1
+rdzv_backend: c10d
+same_network: true
+tpu_env: []
+tpu_use_cluster: false
+tpu_use_sudo: false
+use_cpu: false

recipes/esm2_accelerate/hydra_config/L0_sanity.yaml

Lines changed: 1 addition & 1 deletion
@@ -13,5 +13,5 @@ trainer:
   eval_steps: 1000
   logging_steps: 10
   report_to: "none"
-  dataloader_num_workers: 0
+  dataloader_num_workers: 4
   warmup_steps: 0
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+defaults:
+  - defaults
+  - _self_
+
+model_tag: "nvidia/AMPLIFY_120M"
+stop_after_n_steps: 250
+
+trainer:
+  run_name: "amplify_120M_sanity"
+  per_device_train_batch_size: 2
+  per_device_eval_batch_size: 2
+  save_steps: 1000
+  eval_steps: 1000
+  logging_steps: 10
+  report_to: "none"
+  dataloader_num_workers: 4
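
Review note: the `defaults` list in the file above lists `defaults` first and `_self_` last, so under Hydra's ordering rules the keys in this file override those from the shared `defaults` config. The sketch below imitates that merge with plain dictionaries. It is an illustration only and does not call Hydra. The contents of `base` are hypothetical, since the shared `defaults` config is not part of this diff, and `deep_merge` is a helper written for this note.

```python
from copy import deepcopy


def deep_merge(base, override):
    """Recursively merge `override` into a copy of `base`; override wins on conflicts."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical stand-in for the shared `defaults` config (not shown in this commit).
base = {
    "model_tag": "placeholder/model",
    "trainer": {"logging_steps": 50, "warmup_steps": 0},
}

# Values taken from the AMPLIFY 120M sanity config in this diff.
amplify_sanity = {
    "model_tag": "nvidia/AMPLIFY_120M",
    "stop_after_n_steps": 250,
    "trainer": {
        "run_name": "amplify_120M_sanity",
        "per_device_train_batch_size": 2,
        "per_device_eval_batch_size": 2,
        "save_steps": 1000,
        "eval_steps": 1000,
        "logging_steps": 10,
        "report_to": "none",
        "dataloader_num_workers": 4,
    },
}

# `_self_` comes last in the defaults list, so this file is merged on top of the base.
resolved = deep_merge(base, amplify_sanity)
print(resolved["model_tag"], resolved["trainer"]["logging_steps"])
```

Keys that only appear in the base (here the hypothetical `warmup_steps`) survive the merge, while keys set in both places take the value from the sanity config.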
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+defaults:
+  - defaults
+  - _self_
+
+model_tag: "nvidia/AMPLIFY_350M"
+stop_after_n_steps: 1_000_000
+max_seq_length: 512
+
+trainer:
+  learning_rate: 1e-3
+  adam_beta2: 0.95
+  lr_scheduler_type: "cosine_with_min_lr"
+  lr_scheduler_kwargs:
+    min_lr: 1e-4
+  warmup_steps: 1_000
+  max_steps: 1_000_000
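
Review note: to see what learning rates the scheduler settings above imply, the sketch below evaluates a cosine decay with linear warmup and a floor at `min_lr`, using the values from this config. It is a standalone approximation, assuming a single half-cosine cycle and linear warmup from zero. It does not call the `transformers` scheduler, whose exact behavior may differ in details. The function name `cosine_with_min_lr` is chosen for this note.

```python
import math


def cosine_with_min_lr(step, base_lr=1e-3, min_lr=1e-4,
                       warmup_steps=1_000, max_steps=1_000_000):
    """Approximate learning rate at `step`: linear warmup, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Linear warmup from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup schedule completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    # Rescale the cosine factor so it decays from 1 to min_lr / base_lr instead of 0.
    min_ratio = min_lr / base_lr
    return base_lr * (cosine * (1.0 - min_ratio) + min_ratio)


for step in (0, 500, 1_000, 500_500, 1_000_000):
    print(step, cosine_with_min_lr(step))
```

Under these assumptions the rate peaks at `learning_rate` (1e-3) when warmup ends at step 1,000 and bottoms out at `min_lr` (1e-4) at `max_steps`, a tenfold decay over the run.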

recipes/esm2_accelerate/test_train.py

Lines changed: 0 additions & 307 deletions
This file was deleted.
