Skip to content

Commit b37dd94

Browse files
authored
Create README.md
1 parent 42640da commit b37dd94

1 file changed

Lines changed: 163 additions & 0 deletions

File tree

pathwaysutils/elastic/README.md

Lines changed: 163 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,163 @@
1+
# Elastic Training with Pathways
2+
3+
This document demonstrates how to leverage the elasticity primitives within `manager.py` to create resilient JAX training loop that can handle hardware failures gracefully. We illustrate this using an example based on the MaxText training loop running on TPUs provisioned by GKE via `PathwaysJob` API.
4+
5+
## Overview
6+
7+
Distributed training jobs, especially long-running ones, are susceptible to various failures, such as machine preemptions or hardware issues. Elasticity allows a training job to adapt to changes in the number of available accelerators without crashing. It typically involves:
8+
9+
1. **Training State Management:** Regularly snapshotting the training state (model params, optimizer state, data iterator state).
10+
2. **Failure Detection:** Pathways Resource Manager detecting when workers join or leave.
11+
3. **Failure Propogation:** Pathways runtime propogates the error to JAX client.
12+
4. **Training Reconfiguration:** Adapting the training computation distribution to the current set of healthy workers.
13+
5. **Resumption:** Continuing training from the last valid snapshot with the new configuration.
14+
15+
The `pathwaysutils.elastic` primitives provide building blocks to integrate this logic into JAX training loops run using the Pathways' `Proxy` JAX backend.
16+
17+
## Prerequisites
18+
19+
* A [Pathways compatible GKE cluster](https://cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/create-gke-cluster) with TPU and CPU nodepools.
20+
* `kubectl` configured to interact with your cluster.
21+
* Access to a container image containing JAX, your model code (e.g., MaxText), and the `pathwaysutils` library with elasticity features integrated.
22+
* A `PathwaysJob` Custom Resource Definition (CRD) installed on the GKE cluster.
23+
24+
## Example: Elastic MaxText Training on Kubernetes
25+
26+
This example demonstrates running an elastic MaxText job on 3 x v5e-32 slices, simulating a worker failure, and observing the job's recovery and continuation.
27+
28+
### 1. Elastic PathwaysJob Definition (`pathwaysjob-elastic.py`)
29+
```yaml
30+
apiVersion: pathways-job.pathways.domain/v1
31+
kind: PathwaysJob
32+
metadata:
33+
name: pathways-<USER>
34+
spec:
35+
maxRestarts: 0
36+
workers:
37+
- type: ct5lp-hightpu-4t
38+
topology: 4x8
39+
numSlices: 3
40+
maxSliceRestarts: 2
41+
terminationGracePeriodSeconds: 0
42+
pathwaysDir: "gs://<BUCKET>" # Pre-create this bucket.
43+
controller:
44+
deploymentMode: default
45+
elasticSlices: 2
46+
template:
47+
spec:
48+
containers:
49+
- name: main
50+
image: <MAXTEXT_IMAGE>
51+
imagePullPolicy: Always
52+
command:
53+
- bash
54+
- -c
55+
- |
56+
python3 -m MaxText.elastic_train MaxText/configs/base.yml base_output_directory=gs://<BUCKET> per_device_batch_size=4 enable_checkpointing=false remat_policy=full global_parameter_scale=8 steps=50 max_target_length=2048 use_iota_embed=true reuse_example_batch=1 dataset_type=synthetic attention=flash gcs_metrics=True run_name=pathways-<USER> enable_pathways_goodput=True
57+
```
58+
The MaxText elastic training [script](https://github.com/AI-Hypercomputer/maxtext/blob/main/MaxText/elastic_train.py) invoked by the main container uses `pathwaysutils.elastic` primitives.
59+
60+
### 2. Running the Elastic Training Loop and Simulating hardware failures
61+
62+
The following bash code snippets demonstrates launching the job, monitoring its progress, simulating a worker failure by draining a Kubernetes node, and observing the recovery. Please set the variables before executing this script. At the end of the script, we verify elasticity worked as expected.
63+
64+
```bash
65+
#!/bin/bash
66+
WORKING_DIR=/path/to/working_dir
67+
USER_LABEL_SELECTOR="<USER>"
68+
LOG_DIR="${WORKING_DIR}/logs"
69+
JOB_DEFINITION_FILE="${WORKING_DIR}/pathwaysjob-elastic.yaml" # Copy the above yaml into this file
70+
71+
mkdir -p ${LOG_DIR}
72+
73+
run_id=$(date +"%s")
74+
echo "Running Elastic MaxText with Run ID: $run_id"
75+
76+
# 1. Launch the PathwaysJob
77+
kubectl apply -f "$JOB_DEFINITION_FILE"
78+
if [ $? -ne 0 ]; then
79+
echo "Error: Failed to apply job definition."
80+
exit 1
81+
fi
82+
83+
# 2. Monitor the PathwaysJob
84+
echo "Waiting for pods to start..."
85+
head_pod=""
86+
for i in $(seq 1 10)
87+
do
88+
head_pod=$(kubectl get pods | grep "$USER_LABEL_SELECTOR" | grep 'head' | grep 'Running' | awk '{print $1}' | head -n 1)
89+
if [ -n "$head_pod" ]; then
90+
echo "Found head pod: $head_pod"
91+
break
92+
fi
93+
echo "Head pod not found yet, retrying..."
94+
sleep 10s
95+
done
96+
97+
if [ -z "$head_pod" ]; then
98+
echo "Error: Could not find running head pod after multiple attempts. Cleaning up..."
99+
kubectl delete -f "$JOB_DEFINITION_FILE"
100+
exit 1
101+
fi
102+
103+
log_file="${LOG_DIR}/logs_${run_id}.log"
104+
echo "Streaming logs from $head_pod to $log_file"
105+
kubectl logs -f "$head_pod" >> "${log_file}" &
106+
logs_pid=$!
107+
echo "Waiting for job to start making progress..."
108+
sleep 60s # Wait for sometime till the job makes some progress
109+
110+
# 3. Simulate Failure: Evict a Worker Pod
111+
echo "Randomly select a worker pod to disrupt..."
112+
read -r node_name pod_name <<<$(kubectl get pods -o wide | grep "$USER_LABEL_SELECTOR" | grep 'worker-[0-9]-0-' | grep 'Running' | shuf | head -n 1 | awk '{print $7, $1}')
113+
114+
if [ -z "$pod_name" ] || [ -z "$node_name" ]; then
115+
echo "Warning: Could not find a running worker pod to disrupt. Skipping disruption."
116+
else
117+
echo "Attempting to drain node '$node_name' to evict pod '$pod_name'..."
118+
# Drain the node - this evicts all the pods from the node
119+
kubectl drain "$node_name" --ignore-daemonsets
120+
121+
echo "Node drained. Waiting briefly for training to reconfigure to N-1 slices..."
122+
sleep 60s
123+
124+
# 4. Allow Recovery: Uncordon the Node
125+
echo "Uncordoning node '$node_name' to allow scheduling again."
126+
kubectl uncordon "$node_name"
127+
fi
128+
129+
# 5. Wait for Training to resume on all slices
130+
sleep 60s
131+
132+
# 6. Terminate the Job and Cleanup
133+
echo "Terminating Run ID $run_id"
134+
kubectl delete -f "$JOB_DEFINITION_FILE"
135+
# Ensure log streaming process is killed
136+
kill "$logs_pid" 2>/dev/null
137+
echo "Completed Run ID $run_id."
138+
139+
# 6. Verify by printing steps where training reconfigured from N to N-1 slices and later back to N slices
140+
# Expect output like:
141+
# Step: 5, Old Slice Count: 3, New Slice Count: 2 (3 -> 2 slices)
142+
# Step: 17, Old Slice Count: 2, New Slice Count: 3 (2 -> 3 slices)
143+
awk '
144+
/step=/ && /elastic_manager\.good_slice_count=/ {
145+
split($0, fields, " ")
146+
step = ""
147+
good_slice_count = ""
148+
for (i in fields) {
149+
split(fields[i], kv, "=")
150+
if (kv[1] == "step") {
151+
step = kv[2]
152+
} else if (kv[1] == "elastic_manager.good_slice_count") {
153+
good_slice_count = kv[2]
154+
}
155+
}
156+
if (prev_good_slice_count != "" && prev_good_slice_count != good_slice_count) {
157+
print "Step: " step ", Old Slice Count: " prev_good_slice_count ", New Slice Count: " good_slice_count
158+
}
159+
prev_step = step
160+
prev_good_slice_count = good_slice_count
161+
}
162+
' "$log_file"
163+
```

0 commit comments

Comments
 (0)