Create README.md

shauryagup · web-flow · commit b37dd94ed6e3 · 2025-04-14T21:31:02.000-07:00
diff --git a/pathwaysutils/elastic/README.md b/pathwaysutils/elastic/README.md
@@ -0,0 +1,163 @@
+# Elastic Training with Pathways
+
+This document demonstrates how to use the elasticity primitives in `manager.py` to create a resilient JAX training loop that handles hardware failures gracefully. We illustrate this with an example based on the MaxText training loop running on TPUs provisioned by GKE via the `PathwaysJob` API.
+
+## Overview
+
+Distributed training jobs, especially long-running ones, are susceptible to various failures, such as machine preemptions or hardware issues. Elasticity allows a training job to adapt to changes in the number of available accelerators without crashing. It typically involves:
+
+1.  **Training State Management:** Regularly snapshotting the training state (model params, optimizer state, data iterator state).
+2.  **Failure Detection:** The Pathways Resource Manager detects when workers join or leave.
+3.  **Failure Propagation:** The Pathways runtime propagates the error to the JAX client.
+4.  **Training Reconfiguration:** Adapting the training computation distribution to the current set of healthy workers.
+5.  **Resumption:** Continuing training from the last valid snapshot with the new configuration.
+
+The `pathwaysutils.elastic` primitives provide building blocks for integrating this logic into JAX training loops that run on the Pathways `Proxy` JAX backend.
+
+## Prerequisites
+
+* A [Pathways-compatible GKE cluster](https://cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/create-gke-cluster) with TPU and CPU nodepools.
+* `kubectl` configured to interact with your cluster.
+* Access to a container image containing JAX, your model code (e.g., MaxText), and the `pathwaysutils` library with elasticity features integrated.
+* A `PathwaysJob` Custom Resource Definition (CRD) installed on the GKE cluster.
+
+## Example: Elastic MaxText Training on Kubernetes
+
+This example demonstrates running an elastic MaxText job on 3 x v5e-32 slices, simulating a worker failure, and observing the job's recovery and continuation. 
+
+### 1. Elastic PathwaysJob Definition (`pathwaysjob-elastic.yaml`)
+```yaml
+apiVersion: pathways-job.pathways.domain/v1
+kind: PathwaysJob
+metadata:
+  name: pathways-<USER>
+spec:
+  maxRestarts: 0
+  workers:
+  - type: ct5lp-hightpu-4t
+    topology: 4x8 
+    numSlices: 3
+    maxSliceRestarts: 2
+    terminationGracePeriodSeconds: 0
+  pathwaysDir: "gs://<BUCKET>" # Pre-create this bucket.
+  controller:
+    deploymentMode: default
+    elasticSlices: 2
+    template:
+      spec:
+        containers:
+        - name: main
+          image: <MAXTEXT_IMAGE>
+          imagePullPolicy: Always
+          command:
+          - bash
+          - -c
+          - |
+            python3 -m MaxText.elastic_train MaxText/configs/base.yml base_output_directory=gs://<BUCKET> per_device_batch_size=4 enable_checkpointing=false remat_policy=full global_parameter_scale=8 steps=50 max_target_length=2048 use_iota_embed=true reuse_example_batch=1 dataset_type=synthetic attention=flash gcs_metrics=True run_name=pathways-<USER> enable_pathways_goodput=True
+```
+The MaxText elastic training [script](https://github.com/AI-Hypercomputer/maxtext/blob/main/MaxText/elastic_train.py) invoked by the main container uses `pathwaysutils.elastic` primitives. 
+
+### 2. Running the Elastic Training Loop and Simulating Hardware Failures
+
+The following bash script launches the job, monitors its progress, simulates a worker failure by draining a Kubernetes node, and observes the recovery. Set the variables at the top before running it. The final step of the script verifies that elasticity worked as expected.
+
+```bash
+#!/bin/bash
+WORKING_DIR=/path/to/working_dir
+USER_LABEL_SELECTOR="<USER>"
+LOG_DIR="${WORKING_DIR}/logs"
+JOB_DEFINITION_FILE="${WORKING_DIR}/pathwaysjob-elastic.yaml" # Copy the above yaml into this file
+
+mkdir -p "${LOG_DIR}"
+
+run_id=$(date +"%s")
+echo "Running Elastic MaxText with Run ID: $run_id"
+
+# 1. Launch the PathwaysJob
+kubectl apply -f "$JOB_DEFINITION_FILE"
+if [ $? -ne 0 ]; then
+  echo "Error: Failed to apply job definition."
+  exit 1
+fi
+
+# 2. Monitor the PathwaysJob
+echo "Waiting for pods to start..."
+head_pod=""
+for i in $(seq 1 10)
+do
+  head_pod=$(kubectl get pods | grep "$USER_LABEL_SELECTOR" | grep 'head' | grep 'Running' | awk '{print $1}' | head -n 1)
+  if [ -n "$head_pod" ]; then
+    echo "Found head pod: $head_pod"
+    break
+  fi
+  echo "Head pod not found yet, retrying..."
+  sleep 10s
+done
+
+if [ -z "$head_pod" ]; then
+  echo "Error: Could not find running head pod after multiple attempts. Cleaning up..."
+  kubectl delete -f "$JOB_DEFINITION_FILE"
+  exit 1
+fi
+
+log_file="${LOG_DIR}/logs_${run_id}.log"
+echo "Streaming logs from $head_pod to $log_file"
+kubectl logs -f "$head_pod" >> "${log_file}" &
+logs_pid=$!
+echo "Waiting for job to start making progress..."
+sleep 60s # Wait for some time so the job can make progress
+
+# 3. Simulate Failure: Evict a Worker Pod
+echo "Randomly select a worker pod to disrupt..."
+read -r node_name pod_name <<<$(kubectl get pods -o wide | grep "$USER_LABEL_SELECTOR" | grep 'worker-[0-9]-0-' | grep 'Running' | shuf | head -n 1 | awk '{print $7, $1}')
+
+if [ -z "$pod_name" ] || [ -z "$node_name" ]; then
+  echo "Warning: Could not find a running worker pod to disrupt. Skipping disruption."
+else
+  echo "Attempting to drain node '$node_name' to evict pod '$pod_name'..."
+  # Drain the node - this evicts all the pods from the node
+  kubectl drain "$node_name" --ignore-daemonsets
+  
+  echo "Node drained. Waiting briefly for training to reconfigure to N-1 slices..."
+  sleep 60s
+
+  # 4. Allow Recovery: Uncordon the Node
+  echo "Uncordoning node '$node_name' to allow scheduling again."
+  kubectl uncordon "$node_name"
+fi
+
+# 5. Wait for Training to resume on all slices
+sleep 60s
+
+# 6. Terminate the Job and Cleanup
+echo "Terminating Run ID $run_id"
+kubectl delete -f "$JOB_DEFINITION_FILE"
+# Ensure log streaming process is killed
+kill "$logs_pid" 2>/dev/null 
+echo "Completed Run ID $run_id."
+
+# 7. Verify by printing the steps where training reconfigured from N to N-1 slices and later back to N slices
+# Expect output like:
+# Step: 5, Old Slice Count: 3, New Slice Count: 2
+# Step: 17, Old Slice Count: 2, New Slice Count: 3
+awk '
+  /step=/ && /elastic_manager\.good_slice_count=/ {
+    split($0, fields, " ")
+    step = ""
+    good_slice_count = ""
+    for (i in fields) {
+      split(fields[i], kv, "=")
+      if (kv[1] == "step") {
+        step = kv[2]
+      } else if (kv[1] == "elastic_manager.good_slice_count") {
+        good_slice_count = kv[2]
+      }
+    }
+    if (prev_good_slice_count != "" && prev_good_slice_count != good_slice_count) {
+      print "Step: " step ", Old Slice Count: " prev_good_slice_count ", New Slice Count: " good_slice_count
+    }
+    prev_step = step
+    prev_good_slice_count = good_slice_count
+  }
+' "$log_file"
+```
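
The verification step at the end of the script can also be sketched in Python, which may be easier to extend than the inline `awk` program. This is a minimal sketch, not part of `pathwaysutils`: the function names are hypothetical, and the log format (whitespace-separated `key=value` tokens that include `step=` and `elastic_manager.good_slice_count=`) is assumed from the `awk` script above rather than from a documented log schema.

```python
from typing import Iterable, Iterator, Tuple

# Keys assumed from the awk script above; adjust if the log format differs.
STEP_KEY = "step"
COUNT_KEY = "elastic_manager.good_slice_count"


def slice_count_changes(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (step, old_count, new_count) each time the good slice count changes."""
    prev_count = None
    for line in lines:
        # Parse whitespace-separated tokens, splitting each on its first "=".
        pairs = dict(tok.split("=", 1) for tok in line.split() if "=" in tok)
        if STEP_KEY not in pairs or COUNT_KEY not in pairs:
            continue  # Skip lines that do not report both values.
        count = pairs[COUNT_KEY]
        if prev_count is not None and count != prev_count:
            yield pairs[STEP_KEY], prev_count, count
        prev_count = count


def print_slice_count_changes(log_path: str) -> None:
    """Print one line per reconfiguration found in the log file at log_path."""
    with open(log_path) as log:
        for step, old, new in slice_count_changes(log):
            print(f"Step: {step}, Old Slice Count: {old}, New Slice Count: {new}")
```

Calling `print_slice_count_changes` with the path stored in `$log_file` should report the same transitions as the `awk` program. One difference: this sketch only considers lines that contain a token whose key is exactly `step`, whereas the `awk` pattern matches any line containing the substring `step=`.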