Commit 22dba1c

Update README.md
1 parent b37dd94 commit 22dba1c

1 file changed

Lines changed: 13 additions & 16 deletions

pathwaysutils/elastic/README.md

Original file line number | Diff line number | Diff line change
@@ -19,11 +19,10 @@ The `pathwaysutils.elastic` primitives provide building blocks to integrate this
1919
* A [Pathways compatible GKE cluster](https://cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/create-gke-cluster) with TPU and CPU nodepools.
2020
* `kubectl` configured to interact with your cluster.
2121
* Access to a container image containing JAX, your model code (e.g., MaxText), and the `pathwaysutils` library with elasticity features integrated.
22-
* A `PathwaysJob` Custom Resource Definition (CRD) installed on the GKE cluster.
2322

24-
## Example: Elastic MaxText Training on Kubernetes
23+
## Elastic MaxText Training with Pathways on GKE
2524

26-
This example demonstrates running an elastic MaxText job on 3 x v5e-32 slices, simulating a worker failure, and observing the job's recovery and continuation.
25+
This example demonstrates running an elastic MaxText job on 3 x v5e-32 slices using Pathways. See the [PathwaysJob docs](https://cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/pathways-intro#pathwaysjob_api) for more details about the various attributes set in the YAML below.
2726

2827
### 1. Elastic PathwaysJob Definition (`pathwaysjob-elastic.yaml`)
2928
```yaml
@@ -38,11 +37,10 @@ spec:
3837
topology: 4x8
3938
numSlices: 3
4039
maxSliceRestarts: 2
41-
terminationGracePeriodSeconds: 0
4240
pathwaysDir: "gs://<BUCKET>" # Pre-create this bucket.
4341
controller:
4442
deploymentMode: default
45-
elasticSlices: 2
43+
elasticSlices: 1
4644
template:
4745
spec:
4846
containers:
@@ -55,15 +53,15 @@ spec:
5553
- |
5654
python3 -m MaxText.elastic_train MaxText/configs/base.yml base_output_directory=gs://<BUCKET> per_device_batch_size=4 enable_checkpointing=false remat_policy=full global_parameter_scale=8 steps=50 max_target_length=2048 use_iota_embed=true reuse_example_batch=1 dataset_type=synthetic attention=flash gcs_metrics=True run_name=pathways-<USER> enable_pathways_goodput=True
5755
```
58-
The MaxText elastic training [script](https://github.com/AI-Hypercomputer/maxtext/blob/main/MaxText/elastic_train.py) invoked by the main container uses `pathwaysutils.elastic` primitives.
56+
The MaxText elastic training [script](https://github.com/AI-Hypercomputer/maxtext/blob/main/MaxText/elastic_train.py) invoked by the `main` container above is integrated with `pathwaysutils.elastic` primitives.
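
The YAML above leaves `<BUCKET>` and `<USER>` as placeholders. A minimal sketch of one way to fill them in before applying the job follows; the `render_job` helper and the example values `my-bucket` and `alice` are hypothetical and not part of `pathwaysutils` or MaxText.

```bash
#!/bin/bash
# Hypothetical helper (an illustration, not part of pathwaysutils or MaxText):
# fill the <BUCKET> and <USER> placeholders in the job definition.
# Usage: render_job <template.yaml> <bucket> <user>
# The rendered YAML is written to stdout; the template file is left unchanged.
render_job() {
  local template="$1" bucket="$2" user="$3"
  sed -e "s|<BUCKET>|${bucket}|g" \
      -e "s|<USER>|${user}|g" \
      "$template"
}

# Example, assuming kubectl is already configured for the cluster:
#   render_job pathwaysjob-elastic.yaml my-bucket alice | kubectl apply -f -
```

Writing to stdout keeps the placeholder template reusable across users and buckets, and piping into `kubectl apply -f -` avoids leaving a rendered copy on disk.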
5957

6058
### 2. Running the Elastic Training Loop and Simulating Hardware Failures
6159

62-
The following bash code snippets demonstrates launching the job, monitoring its progress, simulating a worker failure by draining a Kubernetes node, and observing the recovery. Please set the variables before executing this script. At the end of the script, we verify elasticity worked as expected.
60+
The following bash script demonstrates launching the above elastic MaxText job with Pathways, monitoring its progress, simulating a worker failure by sending `SIGILL` to a Pathways worker pod, and observing the recovery. Set the variables marked with `<>` below before executing the script. At the end of the script, we verify that elasticity worked as expected.
6361

6462
```bash
6563
#!/bin/bash
66-
WORKING_DIR=/path/to/working_dir
64+
WORKING_DIR=</LOCAL/DIRECTORY/PATH>
6765
USER_LABEL_SELECTOR="<USER>"
6866
LOG_DIR="${WORKING_DIR}/logs"
6967
JOB_DEFINITION_FILE="${WORKING_DIR}/pathwaysjob-elastic.yaml" # Copy the above yaml into this file
@@ -105,7 +103,7 @@ echo "Streaming logs from $head_pod to $log_file"
105103
kubectl logs -f "$head_pod" >> "${log_file}" &
106104
logs_pid=$!
107105
echo "Waiting for job to start making progress..."
108-
sleep 60s # Wait for sometime till the job makes some progress
106+
sleep 90s
109107
110108
# 3. Simulate Failure: Evict a Worker Pod
111109
echo "Randomly select a worker pod to disrupt..."
@@ -114,20 +112,19 @@ read -r node_name pod_name <<<$(kubectl get pods -o wide | grep "$USER_LABEL_SEL
114112
if [ -z "$pod_name" ] || [ -z "$node_name" ]; then
115113
echo "Warning: Could not find a running worker pod to disrupt. Skipping disruption."
116114
else
117-
echo "Attempting to drain node '$node_name' to evict pod '$pod_name'..."
118-
# Drain the node - this evicts all the pods from the node
119-
kubectl drain "$node_name" --ignore-daemonsets
120-
121-
echo "Node drained. Waiting briefly for training to reconfigure to N-1 slices..."
122-
sleep 60s
115+
echo "Attempting to cordon '$node_name' and kill pod '$pod_name'..."
116+
kubectl cordon "$node_name"
117+
kubectl exec -it "$pod_name" -c pathways-worker -- /bin/sh -c "kill -s SIGILL 1"
118+
echo "Node cordoned. Waiting briefly for training to reconfigure to N-1 slices..."
119+
sleep 90s
123120
124121
# 4. Allow Recovery: Uncordon the Node
125122
echo "Uncordoning node '$node_name' to allow scheduling again."
126123
kubectl uncordon "$node_name"
127124
fi
128125
129126
# 5. Wait for Training to resume on all slices
130-
sleep 60s
127+
sleep 90s
131128
132129
# 6. Terminate the Job and Cleanup
133130
echo "Terminating Run ID $run_id"