Update README.md

shauryagup · web-flow · commit 22dba1ced1a8 · 2025-04-15T15:36:34.000-07:00
diff --git a/pathwaysutils/elastic/README.md b/pathwaysutils/elastic/README.md
@@ -19,11 +19,10 @@ The `pathwaysutils.elastic` primitives provide building blocks to integrate this
 * A [Pathways compatible GKE cluster](https://cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/create-gke-cluster) with TPU and CPU nodepools.
 * `kubectl` configured to interact with your cluster.
 * Access to a container image containing JAX, your model code (e.g., MaxText), and the `pathwaysutils` library with elasticity features integrated.
-* A `PathwaysJob` Custom Resource Definition (CRD) installed on the GKE cluster.
 
-## Example: Elastic MaxText Training on Kubernetes
+## Elastic MaxText Training with Pathways on GKE
 
-This example demonstrates running an elastic MaxText job on 3 x v5e-32 slices, simulating a worker failure, and observing the job's recovery and continuation. 
+This example demonstrates running an elastic MaxText job on 3 x v5e-32 slices using Pathways. See the [PathwaysJob docs](https://cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/pathways-intro#pathwaysjob_api) for more details about the various attributes set in the YAML below.
 
 ### 1. Elastic PathwaysJob Definition (`pathwaysjob-elastic.py`)
 ```yaml
@@ -38,11 +37,10 @@ spec:
     topology: 4x8 
     numSlices: 3
     maxSliceRestarts: 2
-    terminationGracePeriodSeconds: 0
   pathwaysDir: "gs://<BUCKET>" # Pre-create this bucket.
   controller:
     deploymentMode: default
-    elasticSlices: 2
+    elasticSlices: 1
     template:
       spec:
         containers:
@@ -55,15 +53,15 @@ spec:
           - |
             python3 -m MaxText.elastic_train MaxText/configs/base.yml base_output_directory=gs://<BUCKET> per_device_batch_size=4 enable_checkpointing=false remat_policy=full global_parameter_scale=8 steps=50 max_target_length=2048 use_iota_embed=true reuse_example_batch=1 dataset_type=synthetic attention=flash gcs_metrics=True run_name=pathways-<USER> enable_pathways_goodput=True
 ```
-The MaxText elastic training [script](https://github.com/AI-Hypercomputer/maxtext/blob/main/MaxText/elastic_train.py) invoked by the main container uses `pathwaysutils.elastic` primitives. 
+The MaxText elastic training [script](https://github.com/AI-Hypercomputer/maxtext/blob/main/MaxText/elastic_train.py) invoked by the `main` container above is integrated with `pathwaysutils.elastic` primitives. 
 
 ### 2. Running the Elastic Training Loop and Simulating hardware failures
 
-The following bash code snippets demonstrates launching the job, monitoring its progress, simulating a worker failure by draining a Kubernetes node, and observing the recovery. Please set the variables before executing this script. At the end of the script, we verify elasticity worked as expected.
+The following bash script demonstrates launching the above elastic MaxText job with Pathways, monitoring its progress, simulating a worker failure by sending `SIGILL` to a Pathways worker pod, and observing the recovery. Set the variables marked with `<>` below before executing the script. At the end of the script, we verify that elasticity worked as expected.
 
 ```bash
 #!/bin/bash
-WORKING_DIR=/path/to/working_dir
+WORKING_DIR=</LOCAL/DIRECTORY/PATH>
 USER_LABEL_SELECTOR="<USER>"
 LOG_DIR="${WORKING_DIR}/logs"
 JOB_DEFINITION_FILE="${WORKING_DIR}/pathwaysjob-elastic.yaml" # Copy the above yaml into this file
@@ -105,7 +103,7 @@ echo "Streaming logs from $head_pod to $log_file"
 kubectl logs -f "$head_pod" >> "${log_file}" &
 logs_pid=$!
 echo "Waiting for job to start making progress..."
-sleep 60s # Wait for sometime till the job makes some progress
+sleep 90s
 
 # 3. Simulate Failure: Evict a Worker Pod
 echo "Randomly select a worker pod to disrupt..."
@@ -114,20 +112,19 @@ read -r node_name pod_name <<<$(kubectl get pods -o wide | grep "$USER_LABEL_SEL
 if [ -z "$pod_name" ] || [ -z "$node_name" ]; then
   echo "Warning: Could not find a running worker pod to disrupt. Skipping disruption."
 else
-  echo "Attempting to drain node '$node_name' to evict pod '$pod_name'..."
-  # Drain the node - this evicts all the pods from the node
-  kubectl drain "$node_name" --ignore-daemonsets
-  
-  echo "Node drained. Waiting briefly for training to reconfigure to N-1 slices..."
-  sleep 60s
+  echo "Attempting to cordon '$node_name' and kill pod '$pod_name'..."
+  kubectl cordon "$node_name"
+  kubectl exec "$pod_name" -c pathways-worker -- /bin/sh -c "kill -s SIGILL 1"
+  echo "Node cordoned. Waiting briefly for training to reconfigure to N-1 slices..."
+  sleep 90s
 
   # 4. Allow Recovery: Uncordon the Node
   echo "Uncordoning node '$node_name' to allow scheduling again."
   kubectl uncordon "$node_name"
 fi
 
 # 5. Wait for Training to resume on all slices
-sleep 60s
+sleep 90s
 
 # 6. Terminate the Job and Cleanup
 echo "Terminating Run ID $run_id"