Support grain data checkpoint for elastic training by aireenmei · Pull Request #3673 · AI-Hypercomputer/maxtext

aireenmei · 2026-04-15T20:51:07Z

Description

Migrate RemoteIterator to a colocated Python class.
Add checkpointing logic to RemoteIterator so that the data iterator in the colocated sidecar writes its checkpoint directly to the checkpoint path, which avoids sending data iterator state to the controller.
Add grain.ElasticIterator support, controlled by the flag grain_use_elastic_iterator. It can be used with or without elastic training, and with or without colocated_python. This class allows restoring a checkpoint at a different scale (up or down), with the following limitations, which future work will lift:
(1) Only ArrayRecord files are supported; Parquet and TFRecord are not.
(2) Many-to-one transformations, including packing and filtering, are not supported.
(3) Mixing datasets is not supported.
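
To illustrate why a single global index makes resuming at a different scale possible, here is a minimal sketch. It is not Grain's implementation: the function names, the even split across processes, and the assumption that records are read in one global order are all hypothetical and chosen only for illustration.

```python
def local_indices(global_next_index, global_batch_size, process_index, process_count):
    """Record indices one process would read for the next step (illustrative only)."""
    if global_batch_size % process_count != 0:
        raise ValueError("global batch must divide evenly across processes")
    per_process = global_batch_size // process_count
    start = global_next_index + process_index * per_process
    return list(range(start, start + per_process))


def next_global_batch(global_next_index, global_batch_size, process_count):
    """Union of all per-process reads for one step, in global order."""
    indices = []
    for p in range(process_count):
        indices.extend(local_indices(global_next_index, global_batch_size, p, process_count))
    return indices


# Resuming from the same saved index with 2 or 4 processes reads the same records.
with_two = next_global_batch(800, 64, process_count=2)
with_four = next_global_batch(800, 64, process_count=4)
print(with_two == with_four == list(range(800, 864)))
```

Under these assumptions, the saved state does not depend on how many processes existed when it was written, so the process count can change between save and restore.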

Tests

Tested on Pathways by saving and restoring data iterator checkpoints with different numbers of v5e-32 slices.
jobset.yaml

With colocated_python

colocated_python_data_input=true colocated_python_checkpointing=true grain_use_elastic_iterator=true, checkpoints in gs://aireenmei-multipod/pathways-v5e/20260424_grain_elastic_2/aireen-pathways-v5e/checkpoints:

Start with 1 slice, set steps=25, checkpoints at steps 0, 10, 20, 24: log (log confirms Num devices: 32).
Resume with 2 slices, set steps=45, checkpoints at steps 30, 40, 44: log (log confirms Num devices: 64).
Resume with 1 slice, set steps=65, checkpoints at steps 50, 60, 64: log (log confirms Num devices: 32).

Without colocated_python

colocated_python_data_input=false colocated_python_checkpointing=false grain_use_elastic_iterator=true, checkpoints in gs://aireenmei-multipod/pathways-v5e/20260424_grain_elastic_3/aireen-pathways-v5e/checkpoints:

Start with 1 slice, set steps=25, checkpoints at steps 0, 10, 20, 24: log (log confirms Num devices: 32).
Resume with 2 slices, set steps=45, checkpoints at steps 30, 40, 44: log (log confirms Num devices: 64).
Resume with 1 slice, set steps=65, checkpoints at steps 50, 60, 64: log (log confirms Num devices: 32).
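
The checkpoint steps listed for the three runs follow a simple pattern, sketched below. The checkpoint period of 10 and the rule that the final step is always saved are inferred from the listed steps, not stated in this PR, and `checkpoint_steps` is a hypothetical helper.

```python
def checkpoint_steps(start_step, total_steps, period):
    """Steps expected to produce a checkpoint in a run covering [start_step, total_steps)."""
    last_step = total_steps - 1
    steps = [s for s in range(start_step, total_steps) if s % period == 0]
    # Assumption: the final step of a run is always checkpointed.
    if last_step not in steps:
        steps.append(last_step)
    return steps


# Each run resumes at the step after the previous run's last checkpoint.
run_1_slice = checkpoint_steps(0, 25, period=10)
run_2_slices = checkpoint_steps(25, 45, period=10)
run_1_slice_again = checkpoint_steps(45, 65, period=10)
print(run_1_slice, run_2_slices, run_1_slice_again)
```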

Index verification

Inspecting the checkpoints in gs://aireenmei-multipod/pathways-v5e/20260424_grain_elastic_2/aireen-pathways-v5e/checkpoints/{step}/iter/process_0.json

Step 20: "global_next_index": 672
Step 24: "global_next_index": 800
Step 30: "global_next_index": 1184
Steps 20 to 24 ran on 1 slice (32 devices, batch_size=32), which matches (800 - 672) / 32 = 4 steps.
Steps 24 to 30 ran on 2 slices (64 devices, batch_size=64), which matches (1184 - 800) / 64 = 6 steps.
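
The same check can be written as a short script. The index values are copied from the checkpoint files quoted above; `steps_between` is a hypothetical helper, not part of MaxText.

```python
# "global_next_index" values read from process_0.json at each step.
global_next_index = {20: 672, 24: 800, 30: 1184}


def steps_between(start_step, end_step, global_batch_size):
    """Number of training steps implied by the change in global_next_index."""
    consumed = global_next_index[end_step] - global_next_index[start_step]
    if consumed % global_batch_size != 0:
        raise ValueError("index delta is not a whole number of batches")
    return consumed // global_batch_size


on_one_slice = steps_between(20, 24, global_batch_size=32)
on_two_slices = steps_between(24, 30, global_batch_size=64)
print(on_one_slice, on_two_slices)
```

Both results equal the difference between the step numbers, which is what a correct resume at a different slice count should produce.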

Checklist

Before submitting this PR, please make sure (put X in square brackets):

I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
I have necessary comments in my code, particularly in hard-to-understand areas.
I have run end-to-end tests and provided workload links above if applicable.
I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

codecov · 2026-04-15T20:56:29Z

Codecov Report

❌ Patch coverage is 22.28916% with 129 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
...rc/maxtext/input_pipeline/multihost_dataloading.py	20.28%	55 Missing ⚠️
src/maxtext/common/checkpointing.py	17.07%	25 Missing and 9 partials ⚠️
...rc/maxtext/input_pipeline/grain_data_processing.py	31.11%	26 Missing and 5 partials ⚠️
...rc/maxtext/input_pipeline/data_processing_utils.py	22.22%	4 Missing and 3 partials ⚠️
src/maxtext/input_pipeline/tfds_data_processing.py	0.00%	2 Missing ⚠️


github-actions · 2026-04-24T18:59:27Z

🤖 Hi @aireenmei, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

github-actions · 2026-04-24T18:59:59Z

🤖 I'm sorry @aireenmei, but I was unable to process your request. Please see the logs for more details.

github-actions · 2026-04-24T19:18:49Z

🤖 Hi @aireenmei, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

github-actions · 2026-04-24T19:19:21Z

🤖 I'm sorry @aireenmei, but I was unable to process your request. Please see the logs for more details.

lukebaumann · 2026-04-24T19:30:42Z

How are you confirming that the dataset is resumed from the correct index?

lukebaumann · 2026-04-24T19:33:22Z

+      return item
+
+    # ElasticIterator: every process reads the same shared `process_0.json`.
+    if isinstance(item, ElasticIterator):


We want ElasticIterator to also be a RemoteIterator. Is that happening?

We want to support ElasticIterator in both regular (Pathways, mcJAX) and colocated Python environments. Because ElasticIterator allows flexible chip counts, it is useful for regular customers who may change chip counts while keeping training progress. This specific path is for non-colocated cases (including Pathways and mcJAX); the Pathways + colocated case is handled by RemoteIteratorWrapper in the lines above. Let me improve the comments.
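
The branching described in this reply can be sketched as follows. The two classes are empty stand-ins for the real ones, `state_file_for` is a hypothetical helper, and the per-process file name used for ordinary iterators is an assumption; only the shared `process_0.json` for ElasticIterator comes from the code comment quoted above.

```python
class RemoteIteratorWrapper:
    """Stand-in for the colocated Python wrapper (not the real class)."""


class ElasticIterator:
    """Stand-in for grain's elastic iterator (not the real class)."""


def state_file_for(item, process_index):
    """Pick which iterator state file a process restores from (illustrative only)."""
    if isinstance(item, RemoteIteratorWrapper):
        # Pathways + colocated: the sidecar handles its own checkpoint.
        return None
    if isinstance(item, ElasticIterator):
        # Non-colocated: every process reads the same shared file.
        return "process_0.json"
    # Assumed naming for ordinary per-process iterator state.
    return f"process_{process_index}.json"


print(state_file_for(ElasticIterator(), process_index=3))
```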

SujeethJinesh · 2026-04-24T19:44:08Z

-      process_index = jax.process_index() + i * jax.process_count()
-      grain_iters_to_save.append((data_iter.local_iterator, process_index, process_count_total))
-    save_args_composite["iter"] = GrainCheckpointSave(item=grain_iters_to_save)
+    if isinstance(data_iterator[0], RemoteIteratorWrapper):


Can you pull out what this actually is just for easier readability?

I refactored and removed "[0]". Hopefully better readability now

SujeethJinesh · 2026-04-24T20:36:35Z

+
+  def save_state(self, step):
+    step_array = jnp.full(self.dummy_array.shape, step, dtype=jnp.int32)
+    step_array = jax.device_put(step_array, self.cpu_sharding)


Do we need any specialization?

Could you explain specialization?

aireenmei · 2026-04-24T21:19:42Z

How are you confirming that the dataset is resumed from the correct index?

Good question. I just added a section "Index verification" in PR description.

aireenmei force-pushed the aireen/elastic_data branch 2 times, most recently from 7bd4359 to 947c587 Compare April 16, 2026 19:35

aireenmei force-pushed the aireen/elastic_data branch 2 times, most recently from e8e9f36 to 5febf80 Compare April 24, 2026 18:55

aireenmei added the gemini-review label Apr 24, 2026

aireenmei added gemini-review and removed gemini-review labels Apr 24, 2026

aireenmei marked this pull request as ready for review April 24, 2026 19:18

aireenmei requested review from A9isha, NicoGrande, NuojCheng, RissyRan, SurbhiJainUSC, abhinavclemson, bvandermoon, dipannita08, gagika, gobbleturk, hengtaoguo, jesselu-google, jiangjy1982, khatwanimohit, richjames0, shralex, suexu1025 and vipannalla as code owners April 24, 2026 19:19

aireenmei requested a review from igorts-git as a code owner April 24, 2026 19:19

lukebaumann reviewed Apr 24, 2026

SujeethJinesh reviewed Apr 24, 2026

aireenmei force-pushed the aireen/elastic_data branch from 5febf80 to c8d0b77 Compare April 24, 2026 22:01

Support Grain elastic checkpointing

70e42c0

aireenmei force-pushed the aireen/elastic_data branch from c8d0b77 to 70e42c0 Compare April 24, 2026 22:26
