pathwaysutils/elastic/README.md: 13 additions & 16 deletions
@@ -19,11 +19,10 @@ The `pathwaysutils.elastic` primitives provide building blocks to integrate this
 * A [Pathways compatible GKE cluster](https://cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/create-gke-cluster) with TPU and CPU nodepools.
 * `kubectl` configured to interact with your cluster.
 * Access to a container image containing JAX, your model code (e.g., MaxText), and the `pathwaysutils` library with elasticity features integrated.
-* A `PathwaysJob` Custom Resource Definition (CRD) installed on the GKE cluster.

-## Example: Elastic MaxText Training on Kubernetes
+## Elastic MaxText Training with Pathways on GKE

-This example demonstrates running an elastic MaxText job on 3 x v5e-32 slices, simulating a worker failure, and observing the job's recovery and continuation.
+This example demonstrates running an elastic MaxText job on 3 x v5e-32 slices using Pathways. See the [PathwaysJob docs](https://cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/pathways-intro#pathwaysjob_api) for more details about the various attributes set in the YAML below.
-The MaxText elastic training [script](https://github.com/AI-Hypercomputer/maxtext/blob/main/MaxText/elastic_train.py) invoked by the main container uses `pathwaysutils.elastic` primitives.
+The MaxText elastic training [script](https://github.com/AI-Hypercomputer/maxtext/blob/main/MaxText/elastic_train.py) invoked by the `main` container above is integrated with `pathwaysutils.elastic` primitives.

 ### 2. Running the Elastic Training Loop and Simulating Hardware Failures

-The following bash code snippets demonstrates launching the job, monitoring its progress, simulating a worker failure by draining a Kubernetes node, and observing the recovery. Please set the variables before executing this script. At the end of the script, we verify elasticity worked as expected.
+The following bash script demonstrates launching the above elastic MaxText job with Pathways, monitoring its progress, simulating a worker failure by issuing a `SIGILL` to a Pathways worker pod, and observing the recovery. Please set the variables marked as `<>` below before executing the script. At the end of the script, we verify elasticity worked as expected.
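
The monitoring step described above can be sketched as a small polling helper. This is a minimal sketch, not the script's actual logic: `wait_for_log_pattern`, its arguments, and the `completed step` pattern are assumptions for illustration, so substitute whatever line your training job actually logs.

```bash
#!/bin/bash
# Hypothetical helper (not part of the README's script): poll a log file until
# a pattern appears or a timeout expires.
# Returns 0 if the pattern was found, non-zero otherwise.
wait_for_log_pattern() {
  local log_file="$1"          # file that the streamed pod logs are appended to
  local pattern="$2"           # extended regex to look for (assumed log format)
  local timeout_s="${3:-300}"  # total time to wait, in seconds
  local interval_s="${4:-5}"   # pause between checks; must be at least 1
  local waited=0

  while (( waited < timeout_s )); do
    if [[ -f "$log_file" ]] && grep -Eq -- "$pattern" "$log_file"; then
      return 0
    fi
    sleep "$interval_s"
    waited=$(( waited + interval_s ))
  done

  # One last check after the final sleep.
  [[ -f "$log_file" ]] && grep -Eq -- "$pattern" "$log_file"
}

# Example usage (the pattern is an assumption about the job's log output):
#   wait_for_log_pattern "${log_file}" "completed step: [0-9]+" 600 10 \
#     || { echo "Job made no progress in time" >&2; exit 1; }
```

Polling the log for a progress line, rather than sleeping for a fixed time, lets the script continue as soon as the job is running and fail with a clear message if it never starts.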

 ```bash
 #!/bin/bash
-WORKING_DIR=/path/to/working_dir
+WORKING_DIR="</LOCAL/DIRECTORY/PATH>"
 USER_LABEL_SELECTOR="<USER>"
 LOG_DIR="${WORKING_DIR}/logs"
 JOB_DEFINITION_FILE="${WORKING_DIR}/pathwaysjob-elastic.yaml" # Copy the above YAML into this file
@@ -105,7 +103,7 @@ echo "Streaming logs from $head_pod to $log_file"
 kubectl logs -f "$head_pod" >> "${log_file}" &
 logs_pid=$!
 echo "Waiting for job to start making progress..."
-sleep 60s # Wait for sometime till the job makes some progress