Commit 8a770fa

Use generic bucket names
1 parent b117f50 commit 8a770fa

12 files changed

Lines changed: 77 additions & 69 deletions

docs/guides/checkpointing_solutions/gcs_checkpointing.md

Lines changed: 21 additions & 21 deletions
Original file line numberDiff line numberDiff line change
@@ -28,30 +28,30 @@ startup. The first valid condition met is the one executed:
2828

2929
### MaxText configuration
3030

31-
Flag | Description | Type | Default
32-
:------------------------------------- | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | :-------- | :------
33-
`enable_checkpointing` | A master switch to enable (`True`) or disable (`False`) saving checkpoints during the training run. | `boolean` | `False`
34-
`async_checkpointing` | When set to (`True`), this flag makes checkpoint saving asynchronous. The training step is only blocked for the minimal time needed to capture the model's state, and the actual writing to storage happens in a background thread. This is highly recommended for performance. It's enabled by default. | `boolean` | `True`
35-
`checkpoint_period` | The interval, in training steps, for how often a checkpoint is saved. | `integer` | `10000`
36-
`enable_single_replica_ckpt_restoring` | If `True`, one replica reads the checkpoint from storage and then broadcasts it to all other replicas. This can significantly speed up restoration on multi-host systems by reducing redundant reads from storage.<br>**Note**: This feature is only compatible with training jobs that utilize a Distributed Data Parallel (DDP) strategy. | `boolean` | `False`
37-
`checkpoint_todelete_subdir` | Subdirectory to move checkpoints to before deletion. For example: `".todelete"` (Ignored if directory is prefixed with gs://) | `string` | `""`
38-
`checkpoint_todelete_full_path` | Full path to move checkpoints to before deletion. | `string` | `""`
39-
`load_parameters_path` | Specifies a path to a checkpoint directory to load a parameter only checkpoint.<br>**Example**: `"gs://my-bucket/my-previous-run/checkpoints/items/1000"` | `string` | `""` (disabled)
40-
`load_full_state_path` | Specifies a path to a checkpoint directory to load a full checkpoint including optimizer state and step count from a specific directory.<br>**Example**: `"gs://my-bucket/my-interrupted-run/checkpoints/items/500"` | `string` | `""` (disabled)
41-
`lora_input_adapters_path` | Specifies a parent directory containing LoRA (Low-Rank Adaptation) adapters. | `string` | `""` (disabled)
42-
`force_unroll` | If `True`, unrolls the loop when generating a parameter-only checkpoint. | `boolean` | `False`
31+
| Flag | Description | Type | Default |
32+
| :------------------------------------- | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | :-------- | :-------------- |
33+
| `enable_checkpointing` | A master switch to enable (`True`) or disable (`False`) saving checkpoints during the training run. | `boolean` | `False` |
34+
| `async_checkpointing` | When set to (`True`), this flag makes checkpoint saving asynchronous. The training step is only blocked for the minimal time needed to capture the model's state, and the actual writing to storage happens in a background thread. This is highly recommended for performance. It's enabled by default. | `boolean` | `True` |
35+
| `checkpoint_period` | The interval, in training steps, for how often a checkpoint is saved. | `integer` | `10000` |
36+
| `enable_single_replica_ckpt_restoring` | If `True`, one replica reads the checkpoint from storage and then broadcasts it to all other replicas. This can significantly speed up restoration on multi-host systems by reducing redundant reads from storage.<br>**Note**: This feature is only compatible with training jobs that utilize a Distributed Data Parallel (DDP) strategy. | `boolean` | `False` |
37+
| `checkpoint_todelete_subdir` | Subdirectory to move checkpoints to before deletion. For example: `".todelete"` (Ignored if directory is prefixed with `gs://`) | `string` | `""` |
38+
| `checkpoint_todelete_full_path` | Full path to move checkpoints to before deletion. | `string` | `""` |
39+
| `load_parameters_path` | Specifies a path to a checkpoint directory to load a parameter only checkpoint.<br>**Example**: `"gs://my-bucket/my-previous-run/checkpoints/items/1000"` | `string` | `""` (disabled) |
40+
| `load_full_state_path` | Specifies a path to a checkpoint directory to load a full checkpoint including optimizer state and step count from a specific directory.<br>**Example**: `"gs://my-bucket/my-interrupted-run/checkpoints/items/500"` | `string` | `""` (disabled) |
41+
| `lora_input_adapters_path` | Specifies a parent directory containing LoRA (Low-Rank Adaptation) adapters. | `string` | `""` (disabled) |
42+
| `force_unroll` | If `True`, unrolls the loop when generating a parameter-only checkpoint. | `boolean` | `False` |
4343

4444
## Storage and format configuration
4545

4646
These settings control the underlying storage mechanism
4747
([Orbax](https://orbax.readthedocs.io)) for performance and compatibility.
4848

49-
Flag | Description | Type | Default
50-
:----------------------------------------------- | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :------------------- | :------
51-
`checkpoint_storage_target_data_file_size_bytes` | Sets a target file size for Orbax to chunk large arrays into smaller physical files. This can dramatically speed up loading over a network and in distributed environments. | `integer` | `2147483648` (2 GB)
52-
`checkpoint_storage_use_ocdbt` | If `True`, uses the TensorStore **OCDBT** (Optionally-Cooperative Distributed B+ Tree)) key-value store as the underlying storage format for checkpointing. Set to `0` for Pathways. | `boolean` | `True`
53-
`checkpoint_storage_use_zarr3` | If `True`, uses the Zarr v3 storage format within Orbax, which is optimized for chunked, compressed, N-dimensional arrays. Set to `0` for Pathways. | `boolean` | `True`
54-
`checkpoint_storage_concurrent_gb` | Controls the concurrent I/O limit in gigabytes for the checkpointer. Larger models may require increasing this value to avoid I/O bottlenecks. | `integer` | `96`
55-
`enable_orbax_v1` | A boolean flag to explicitly enable features and behaviors from Orbax version 1. | `boolean` | `False`
56-
`source_checkpoint_layout` | Specifies the format of the checkpoint being **loaded**. This tells the system how to interpret the files at the source path.<br>**Options**: `"orbax"`, `"safetensors"` | `string` | `"orbax"`
57-
`checkpoint_conversion_fn` | A user-defined function to process a loaded checkpoint dictionary into a format that the model can understand. This is essential for loading checkpoints from different frameworks or formats (e.g., converting keys from a Hugging Face SafeTensors file). | `function` or `None` | `None`
49+
| Flag | Description | Type | Default |
50+
| :----------------------------------------------- | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :------------------- | :------------------ |
51+
| `checkpoint_storage_target_data_file_size_bytes` | Sets a target file size for Orbax to chunk large arrays into smaller physical files. This can dramatically speed up loading over a network and in distributed environments. | `integer` | `2147483648` (2 GB) |
52+
| `checkpoint_storage_use_ocdbt` | If `True`, uses the TensorStore **OCDBT** (Optionally-Cooperative Distributed B+ Tree)) key-value store as the underlying storage format for checkpointing. Set to `0` for Pathways. | `boolean` | `True` |
53+
| `checkpoint_storage_use_zarr3` | If `True`, uses the Zarr v3 storage format within Orbax, which is optimized for chunked, compressed, N-dimensional arrays. Set to `0` for Pathways. | `boolean` | `True` |
54+
| `checkpoint_storage_concurrent_gb` | Controls the concurrent I/O limit in gigabytes for the checkpointer. Larger models may require increasing this value to avoid I/O bottlenecks. | `integer` | `96` |
55+
| `enable_orbax_v1` | A boolean flag to explicitly enable features and behaviors from Orbax version 1. | `boolean` | `False` |
56+
| `source_checkpoint_layout` | Specifies the format of the checkpoint being **loaded**. This tells the system how to interpret the files at the source path.<br>**Options**: `"orbax"`, `"safetensors"` | `string` | `"orbax"` |
57+
| `checkpoint_conversion_fn` | A user-defined function to process a loaded checkpoint dictionary into a format that the model can understand. This is essential for loading checkpoints from different frameworks or formats (e.g., converting keys from a Hugging Face SafeTensors file). | `function` or `None` | `None` |
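
Note on the `checkpoint_storage_target_data_file_size_bytes` default in this file: `2147483648` is 2 × 1024³ bytes, so the "2 GB" in the table is a binary gigabyte (2 GiB). The sketch below shows the arithmetic. The helpers `gib_to_bytes` and `estimated_file_count` are hypothetical names introduced only for illustration and are not part of MaxText or Orbax. The file-count estimate assumes an array is split into pieces of roughly the target size, which is a simplification of what the storage layer may actually do.

```python
import math

GIB = 1024 ** 3  # bytes in one binary gigabyte


def gib_to_bytes(gib: float) -> int:
    """Convert a size in GiB to a whole number of bytes."""
    return int(gib * GIB)


def estimated_file_count(array_bytes: int, target_file_bytes: int) -> int:
    """Rough count of files if data is split into target-sized pieces.

    Assumption for illustration only: every file is at most the target size.
    """
    return max(1, math.ceil(array_bytes / target_file_bytes))


default_target = gib_to_bytes(2)
print(default_target)  # 2147483648, the documented default

# A 7 GiB array split at the default target needs about 4 files.
print(estimated_file_count(gib_to_bytes(7), default_target))  # 4
```

A value computed this way could be passed as the flag's integer argument. Whether a smaller or larger target helps depends on the storage backend and network.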

docs/guides/data_input_pipeline/data_input_grain.md

Lines changed: 2 additions & 2 deletions
@@ -110,10 +110,10 @@ Note that `FILE_PATH` is optional; when provided, the script runs `ls -R` for pr

 ```sh
 bash src/dependencies/scripts/setup_gcsfuse.sh \
-DATASET_GCS_BUCKET=maxtext-dataset \
+DATASET_GCS_BUCKET=gs://<your-dataset-bucket> \
 MOUNT_PATH=/tmp/gcsfuse && \
 python3 -m maxtext.trainers.pre_train.train \
-run_name=<RUN_NAME> base_output_directory=gs://<MY_BUCKET> \
+run_name=<run-name> base_output_directory=gs://<your-bucket> \
 dataset_type=grain \
 grain_file_type=arrayrecord # or parquet \
 grain_train_files=/tmp/gcsfuse/array-record/c4/en/3.0.1/c4-train.array_record* \

docs/guides/monitoring_and_debugging/features_and_diagnostics.md

Lines changed: 2 additions & 2 deletions
@@ -87,7 +87,7 @@ export LIBTPU_INIT_ARGS="--xla_enable_async_all_gather=true"
 python3 -m maxtext.trainers.pre_train.train run_name=example_load_compile \
 compiled_trainstep_file=my_compiled_train.pickle \
 global_parameter_scale=16 per_device_batch_size=4 steps=10000 learning_rate=1e-3 \
-base_output_directory=gs://my-output-bucket dataset_path=gs://my-dataset-bucket
+base_output_directory=gs://<your-output-bucket> dataset_path=gs://<your-dataset-bucket>
 ```

 In the save step of example 2 above we included exporting the compiler flag `LIBTPU_INIT_ARGS` and `learning_rate` because those affect the compiled object `my_compiled_train.pickle.` The sizes of the model (e.g. `global_parameter_scale`, `max_sequence_length` and `per_device_batch`) are fixed when you initially compile via `compile_train.py`, you will see a size error if you try to run the saved compiled object with different sizes than you compiled with. However a subtle note is that the **learning rate schedule** is also fixed when you run `compile_train` - which is determined by both `steps` and `learning_rate`. The optimizer parameters such as `adam_b1` are passed only as shaped objects to the compiler - thus their real values are determined when you run `train.py`, not during the compilation. If you do pass in different shapes (e.g. `per_device_batch`), you will get a clear error message reporting that the compiled signature has different expected shapes than what was input. If you attempt to run on different hardware than the compilation targets requested via `compile_topology`, you will get an error saying there is a failure to map the devices from the compiled to your real devices. Using different XLA flags or a LIBTPU than what was compiled will probably run silently with the environment you compiled in without error. However there is no guaranteed behavior in this case; you should run in the same environment you compiled in.
@@ -125,7 +125,7 @@ export XLA_FLAGS="--xla_gpu_enable_async_collectives=true"
 python3 -m maxtext.trainers.pre_train.train run_name=example_load_compile \
 compiled_trainstep_file=my_compiled_train.pickle \
 attention=dot_product global_parameter_scale=16 per_device_batch_size=4 steps=10000 learning_rate=1e-3 \
-base_output_directory=gs://my-output-bucket dataset_path=gs://my-dataset-bucket
+base_output_directory=gs://<your-output-bucket> dataset_path=gs://<your-dataset-bucket>
 ```

 As in the TPU case, note that the compilation environment must match the execution environment, in this case by setting the same `XLA_FLAGS`.
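
Note on the `<your-...>` placeholders this commit introduces: they are meant to be replaced before the command is run. In an unquoted shell command, `<` and `>` are redirection operators, so a command pasted verbatim will not behave as a plain argument. The sketch below shows one way to detect and fill such placeholders before launching a job. The names `unfilled_placeholders` and `fill_placeholders`, and the bucket name `acme-out`, are hypothetical and introduced only for illustration. They are not part of MaxText.

```python
import re

# Matches angle-bracket placeholders such as <your-output-bucket> or <run-name>.
PLACEHOLDER = re.compile(r"<([A-Za-z][A-Za-z0-9_-]*)>")


def unfilled_placeholders(command: str) -> list:
    """Return the distinct placeholders still present in a command string."""
    return sorted({match.group(0) for match in PLACEHOLDER.finditer(command)})


def fill_placeholders(command: str, values: dict) -> str:
    """Substitute known placeholders; leave unknown ones untouched."""
    def substitute(match):
        return values.get(match.group(1), match.group(0))
    return PLACEHOLDER.sub(substitute, command)


command = (
    "base_output_directory=gs://<your-output-bucket> "
    "dataset_path=gs://<your-dataset-bucket>"
)

# Fill only one of the two placeholders (bucket name is made up).
partly_filled = fill_placeholders(command, {"your-output-bucket": "acme-out"})
print(partly_filled)
print(unfilled_placeholders(partly_filled))  # ['<your-dataset-bucket>']
```

Checking that `unfilled_placeholders` returns an empty list before invoking the trainer is a cheap guard against running with a literal placeholder.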

docs/guides/monitoring_and_debugging/ml_workload_diagnostics.md

Lines changed: 6 additions & 6 deletions
@@ -37,8 +37,8 @@ MaxText has integrated the ML Diagnostics [SDK](https://github.com/AI-Hypercompu
 ```
 python3 -m maxtext.trainers.pre_train.train \
 run_name=${USER}-tpu-job \
-base_output_directory="gs://your-output-bucket/" \
-dataset_path="gs://your-dataset-bucket/" \
+base_output_directory="gs://<your-output-bucket>/" \
+dataset_path="gs://<your-dataset-bucket>/" \
 steps=100 \
 log_period=10 \
 managed_mldiagnostics=True
@@ -49,8 +49,8 @@ MaxText has integrated the ML Diagnostics [SDK](https://github.com/AI-Hypercompu
 ```
 python3 -m maxtext.trainers.pre_train.train \
 run_name=${USER}-tpu-job \
-base_output_directory="gs://your-output-bucket/" \
-dataset_path="gs://your-dataset-bucket/" \
+base_output_directory="gs://<your-output-bucket>/" \
+dataset_path="gs://<your-dataset-bucket>/" \
 steps=100 \
 log_period=10 \
 profiler=xplane \
@@ -62,8 +62,8 @@ MaxText has integrated the ML Diagnostics [SDK](https://github.com/AI-Hypercompu
 ```
 python3 -m maxtext.trainers.pre_train.train \
 run_name=${USER}-tpu-job \
-base_output_directory="gs://your-output-bucket/" \
-dataset_path="gs://your-dataset-bucket/" \
+base_output_directory="gs://<your-output-bucket>/" \
+dataset_path="gs://<your-dataset-bucket>/" \
 steps=100 \
 log_period=10 \
 profiler=xplane \