Commit f52d9c0

Merge pull request #3458 from AI-Hypercomputer:docker_build_fix
PiperOrigin-RevId: 886375125
2 parents 487bb6f + a9da79a commit f52d9c0

39 files changed

Lines changed: 54 additions & 42 deletions

.github/workflows/build_and_push_docker_image.yml

Lines changed: 1 addition & 0 deletions
@@ -123,6 +123,7 @@ jobs:
 MODE=${{ inputs.build_mode }}
 WORKFLOW=${{ inputs.workflow }}
 PACKAGE_DIR=./src
+TESTS_DIR=./tests
 JAX_VERSION=NONE
 LIBTPU_VERSION=NONE
 INCLUDE_TEST_ASSETS=true

PREFLIGHT.md

Lines changed: 7 additions & 7 deletions
@@ -1,35 +1,35 @@
 # Optimization 1: Multihost recommended network settings
-We included all the recommended network settings in [rto_setup.sh](https://github.com/google/maxtext/blob/main/rto_setup.sh).
+We included all the recommended network settings in [rto_setup.sh](https://github.com/google/maxtext/blob/main/src/dependencies/scripts/rto_setup.sh).
 
-[preflight.sh](https://github.com/google/maxtext/blob/main/preflight.sh) will help you apply them based on GCE or GKE platform.
+[preflight.sh](https://github.com/google/maxtext/blob/main/src/dependencies/scripts/preflight.sh) will help you apply them based on GCE or GKE platform.
 
 Before you run ML workload on Multihost with GCE or GKE, simply apply `bash preflight.sh PLATFORM=[GCE or GKE]` to leverage the best DCN network performance.
 
 Here is an example for GCE:
 ```
-bash preflight.sh PLATFORM=GCE && python3 -m maxtext.trainers.pre_train.train run_name=${YOUR_JOB_NAME?}
+bash src/dependencies/scripts/preflight.sh PLATFORM=GCE && python3 -m maxtext.trainers.pre_train.train run_name=${YOUR_JOB_NAME?}
 ```
 
 Here is an example for GKE:
 ```
-bash preflight.sh PLATFORM=GKE && python3 -m maxtext.trainers.pre_train.train run_name=${YOUR_JOB_NAME?}
+bash src/dependencies/scripts/preflight.sh PLATFORM=GKE && python3 -m maxtext.trainers.pre_train.train run_name=${YOUR_JOB_NAME?}
 ```
 
 # Optimization 2: Numa binding (You can only apply this to v4 and v5p)
 NUMA binding is recommended for enhanced performance, as it reduces memory latency and maximizes data throughput, ensuring that your high-performance applications operate more efficiently and effectively.
 
 For GCE,
-[preflight.sh](https://github.com/google/maxtext/blob/main/preflight.sh) will help you install `numactl` dependency, so you can use it directly, here is an example:
+[preflight.sh](https://github.com/google/maxtext/blob/main/src/dependencies/scripts/preflight.sh) will help you install `numactl` dependency, so you can use it directly, here is an example:
 
 ```
-bash preflight.sh PLATFORM=GCE && numactl --membind 0 --cpunodebind=0 python3 -m maxtext.trainers.pre_train.train run_name=${YOUR_JOB_NAME?}
+bash src/dependencies/scripts/preflight.sh PLATFORM=GCE && numactl --membind 0 --cpunodebind=0 python3 -m maxtext.trainers.pre_train.train run_name=${YOUR_JOB_NAME?}
 ```
 
 For GKE,
 `numactl` should be built into your docker image from [maxtext_tpu_dependencies.Dockerfile](https://github.com/google/maxtext/blob/main/src/dependencies/dockerfiles/maxtext_tpu_dependencies.Dockerfile), so you can use it directly if you built the maxtext docker image. Here is an example
 
 ```
-bash preflight.sh PLATFORM=GKE && numactl --membind 0 --cpunodebind=0 python3 -m maxtext.trainers.pre_train.train run_name=${YOUR_JOB_NAME?}
+bash src/dependencies/scripts/preflight.sh PLATFORM=GKE && numactl --membind 0 --cpunodebind=0 python3 -m maxtext.trainers.pre_train.train run_name=${YOUR_JOB_NAME?}
 ```
 
 1. `numactl`: This is the command-line tool used for controlling NUMA policy for processes or shared memory. It's particularly useful on multi-socket systems where memory locality can impact performance.

src/dependencies/dockerfiles/maxtext_gpu_dependencies.Dockerfile

Lines changed: 4 additions & 0 deletions
@@ -41,6 +41,9 @@ ENV ENV_DEVICE=$DEVICE
 ARG PACKAGE_DIR
 ENV PACKAGE_DIR=$PACKAGE_DIR
 
+ARG TESTS_DIR
+ENV TESTS_DIR=$TESTS_DIR
+
 ENV MAXTEXT_ASSETS_ROOT=/deps/src/maxtext/assets
 ENV MAXTEXT_TEST_ASSETS_ROOT=/deps/tests/assets
 ENV MAXTEXT_PKG_DIR=/deps/src/MaxText
@@ -63,6 +66,7 @@ RUN --mount=type=cache,target=/root/.cache/uv \
 
 # Now copy the remaining code (source files that may change frequently)
 COPY ${PACKAGE_DIR}/maxtext/ src/MaxText/
+COPY ${TESTS_DIR}*/ tests/
 
 # Download test assets from GCS if building image with test assets
 ARG INCLUDE_TEST_ASSETS=false

src/dependencies/dockerfiles/maxtext_tpu_dependencies.Dockerfile

Lines changed: 4 additions & 0 deletions
@@ -38,6 +38,9 @@ ENV ENV_DEVICE=$DEVICE
 ARG PACKAGE_DIR
 ENV PACKAGE_DIR=$PACKAGE_DIR
 
+ARG TESTS_DIR
+ENV TESTS_DIR=$TESTS_DIR
+
 ENV MAXTEXT_ASSETS_ROOT=/deps/src/maxtext/assets
 ENV MAXTEXT_TEST_ASSETS_ROOT=/deps/tests/assets
 ENV MAXTEXT_PKG_DIR=/deps/src/maxtext
@@ -63,6 +66,7 @@ RUN --mount=type=cache,target=/root/.cache/uv \
 
 # Now copy the remaining code (source files that may change frequently)
 COPY ${PACKAGE_DIR}/maxtext/ src/maxtext/
+COPY ${TESTS_DIR}*/ tests/
 
 # Download test assets from GCS if building image with test assets
 ARG INCLUDE_TEST_ASSETS=false

src/dependencies/scripts/docker_build_dependency_image.sh

Lines changed: 3 additions & 0 deletions
@@ -22,6 +22,8 @@
 
 PACKAGE_DIR="${PACKAGE_DIR:-src}"
 echo "PACKAGE_DIR: $PACKAGE_DIR"
+TESTS_DIR="${TESTS_DIR:-tests}"
+echo "TESTS_DIR: $TESTS_DIR"
 
 # Enable "exit immediately if any command fails" option
 set -e
@@ -71,6 +73,7 @@ docker_build_args=(
 "MODE=${MODE}"
 "JAX_VERSION=${JAX_VERSION}"
 "PACKAGE_DIR=${PACKAGE_DIR}"
+"TESTS_DIR=${TESTS_DIR}"
 )
 
 run_docker_build() {
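
The hunk above gives `TESTS_DIR` a default and adds it to `docker_build_args`. A minimal sketch of that default-and-forward pattern follows. It assumes each array entry ends up as a `--build-arg KEY=VALUE` flag for `docker build`; the body of `run_docker_build` is not shown in this diff, and `build_arg_flags` is a hypothetical helper, not part of the script.

```shell
#!/usr/bin/env bash
# Sketch only: mirrors the default-and-forward pattern from the hunk above.
# build_arg_flags is a hypothetical helper, not part of the MaxText script.

# Use the caller's value if it is set and non-empty, otherwise fall back to a default.
PACKAGE_DIR="${PACKAGE_DIR:-src}"
TESTS_DIR="${TESTS_DIR:-tests}"

docker_build_args=(
  "PACKAGE_DIR=${PACKAGE_DIR}"
  "TESTS_DIR=${TESTS_DIR}"
)

# Print one "--build-arg KEY=VALUE" line per argument.
build_arg_flags() {
  local arg
  for arg in "$@"; do
    printf -- '--build-arg %s\n' "$arg"
  done
}

build_arg_flags "${docker_build_args[@]}"
```

Under this assumption, calling the script as `TESTS_DIR=./tests bash docker_build_dependency_image.sh` would forward the override, and the Dockerfiles' `ARG TESTS_DIR` would receive it.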
Lines changed: 3 additions & 3 deletions
@@ -3,7 +3,7 @@ echo "Running preflight.sh"
 # Command Flags:
 #
 # Example to invoke this script:
-# bash preflight.sh
+# bash src/dependencies/scripts/preflight.sh
 
 # Warning:
 # For any dependencies, please add them into `setup.sh` or `maxtext_tpu_dependencies.Dockerfile`.
@@ -24,11 +24,11 @@ if command -v sudo >/dev/null 2>&1; then
 echo "running rto_setup.sh with sudo"
 
 # apply network settings.
-sudo bash rto_setup.sh
+sudo bash src/dependencies/scripts/rto_setup.sh
 else
 # sudo is not available, run the script without sudo
 echo "running rto_setup.sh without sudo"
 
 # apply network settings.
-bash rto_setup.sh
+bash src/dependencies/scripts/rto_setup.sh
 fi
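
The new `rto_setup.sh` path in the hunk above is relative to the working directory, so the script presumably has to be launched from the repository root. One possible alternative, which this commit does not use, is to resolve the sibling script relative to the calling script's own location. The sketch below assumes both scripts live in the same directory; `sibling_script_path` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch only: locate a sibling script relative to a script file, not the caller's cwd.
# sibling_script_path is a hypothetical helper, not part of preflight.sh.

sibling_script_path() {
  # $1: path of the calling script (e.g. "${BASH_SOURCE[0]}")
  # $2: file name of the sibling script
  local script_dir
  script_dir="$(cd "$(dirname "$1")" && pwd)"
  printf '%s/%s\n' "$script_dir" "$2"
}

# Example: where rto_setup.sh would be looked up if it sits next to this file.
sibling_script_path "${BASH_SOURCE[0]}" "rto_setup.sh"
```

With this approach the call would read `bash "$(sibling_script_path "${BASH_SOURCE[0]}" rto_setup.sh)"` and would no longer depend on where the user invokes `preflight.sh` from.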

src/maxtext/configs/README.md

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@ This directory contains high performance model configurations for different gene
 
 These configurations do 3 things:
 * Sets various XLA compiler flags (see [below](/src/maxtext/configs#xla-flags-used-by-maxtext)) as `LIBTPU_INIT_ARGS` to optimize runtime performance.
-* Runs [rto_setup.sh](https://github.com/google/maxtext/blob/main/rto_setup.sh) to optimize communication protocols for network performance.
+* Runs [rto_setup.sh](https://github.com/google/maxtext/blob/main/src/dependencies/scripts/rto_setup.sh) to optimize communication protocols for network performance.
 (This only needs to be run once on each worker)
 * Runs [train.py](https://github.com/google/maxtext/blob/main/src/maxtext/trainers/pre_train/train.py) with specific hyper-parameters (batch size, etc.)

src/maxtext/configs/experimental/1024b.sh

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@ for ARGUMENT in "$@"; do
 done
 
 # Use preflight.sh to set up env based on platform
-bash preflight.sh PLATFORM=$PLATFORM
+bash src/dependencies/scripts/preflight.sh PLATFORM=$PLATFORM
 
 # Train
 export LIBTPU_INIT_ARGS="--xla_tpu_megacore_fusion_allow_ags=false --xla_enable_async_collective_permute=true --xla_tpu_enable_ag_backward_pipelining=true --xla_tpu_enable_data_parallel_all_reduce_opt=true --xla_tpu_data_parallel_opt_different_sized_ops=true --xla_tpu_enable_async_collective_fusion=true --xla_tpu_enable_async_collective_fusion_multiple_steps=true --xla_tpu_overlap_compute_collective_tc=true --xla_enable_async_all_gather=true"

src/maxtext/configs/experimental/128b.sh

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@ for ARGUMENT in "$@"; do
 done
 
 # Use preflight.sh to set up env based on platform
-bash preflight.sh PLATFORM=$PLATFORM
+bash src/dependencies/scripts/preflight.sh PLATFORM=$PLATFORM
 
 # Train
 export LIBTPU_INIT_ARGS="--xla_tpu_megacore_fusion_allow_ags=false --xla_enable_async_collective_permute=true --xla_tpu_enable_ag_backward_pipelining=true --xla_tpu_enable_data_parallel_all_reduce_opt=true --xla_tpu_data_parallel_opt_different_sized_ops=true --xla_tpu_enable_async_collective_fusion=true --xla_tpu_enable_async_collective_fusion_multiple_steps=true --xla_tpu_overlap_compute_collective_tc=true --xla_enable_async_all_gather=true"
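
Both experimental scripts above pass `PLATFORM=$PLATFORM` to `preflight.sh` after a `for ARGUMENT in "$@"; do` loop whose body is outside the diff. A minimal sketch of how such `KEY=VALUE` arguments might be parsed is given below. It is an assumption about the pattern, not the scripts' actual code, and `parse_kv_args` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Sketch only: parse KEY=VALUE command-line arguments such as PLATFORM=GCE.
# parse_kv_args is a hypothetical helper; the real loop body is not shown in the diff.

parse_kv_args() {
  local argument key value
  for argument in "$@"; do
    # Skip anything that is not of the form KEY=VALUE.
    [[ "$argument" == *=* ]] || continue
    key="${argument%%=*}"    # text before the first '='
    value="${argument#*=}"   # text after the first '='
    case "$key" in
      PLATFORM) PLATFORM="$value" ;;
    esac
  done
}

parse_kv_args "PLATFORM=GCE"
echo "PLATFORM: $PLATFORM"
```

Splitting on the first `=` only keeps values that themselves contain `=` intact.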
