Speed up cpu-unit CI by gagika · Pull Request #3700 · AI-Hypercomputer/maxtext

gagika · 2026-04-19T23:12:06Z

Description

Start with a short description of what the PR does and how this is a change from
the past.

The rest of the description includes relevant details and context, examples:

why is this change being made,
the problem being solved and any relevant context,
why this is a good solution,
some information about the specific implementation,
shortcomings of the solution and possible future improvements.

If the change fixes a bug or a Github issue, please include a link, e.g.,:
FIXES: b/123456
FIXES: #123456

Notice 1: Once all tests pass, the "pull ready" label will automatically be assigned.
This label is used for administrative purposes. Please do not add it manually.

Notice 2: For external contributions, our settings currently require an approval from a MaxText maintainer to trigger CI tests.

Tests

Please describe how you tested this change, and include any instructions and/or
commands to reproduce.

Checklist

Before submitting this PR, please make sure (put X in square brackets):

I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
I have necessary comments in my code, particularly in hard-to-understand areas.
I have run end-to-end tests and provided workload links above if applicable.
I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

codecov · 2026-04-19T23:16:38Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

github-actions · 2026-04-20T03:22:19Z

🤖 Hi @gagika, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

github-actions

## 📋 Review Summary

This Pull Request successfully introduces several optimizations to speed up the cpu-unit CI. By increasing parallelism, introducing test duration recording for better load balancing, and reducing the complexity of MoE compilation tests, the CI efficiency is significantly improved without compromising the validity of the tests.

## 🔍 General Feedback

- **Excellent CI Optimizations:** The use of pytest-split with stored durations on scheduled runs is a very effective way to handle imbalanced test shards.
- **Smart Model Overrides:** Using override_model_config to reduce the number of layers in compilation tests is a high-impact change that drastically reduces compilation overhead.
- **Improved Maintainability:** The refactoring of MoE tests into helper methods makes the test suite cleaner and easier to maintain.

github-actions · 2026-04-20T03:23:27Z

            SPLIT_ARGS="--splits ${INPUTS_TOTAL_WORKERS} --group ${INPUTS_WORKER_GROUP} -n auto"
+            # On scheduled runs, record per-shard durations so future splits balance by time (LPT).
+            # Merge artifacts offline and commit as .test_durations at repo root.
+            if [ "${INPUTS_IS_SCHEDULED_RUN}" == "true" ]; then


🟢 Recording per-shard durations on scheduled runs is an excellent strategy for optimizing CI load balancing. Merging these artifacts into a central `.test_durations` file will allow `pytest-split` to balance future shards by measured run time, in the spirit of the Longest Processing Time (LPT) heuristic named in the code comment.
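To make the load-balancing idea concrete, here is a minimal sketch of merging per-shard timings and splitting them with a greedy LPT heuristic. It is illustrative only and is not `pytest-split`'s actual implementation; the function names `merge_durations` and `lpt_split`, and the assumption that each shard's timings are a mapping from test id to seconds, are hypothetical.

```python
import heapq


def merge_durations(shard_maps):
    """Merge per-shard {test_id: seconds} mappings into one mapping.

    Assumption: each shard ran a disjoint set of tests; on a collision
    the later shard's value wins.
    """
    merged = {}
    for shard in shard_maps:
        merged.update(shard)
    return merged


def lpt_split(durations, num_groups):
    """Greedy LPT: assign the slowest remaining test to the least-loaded group."""
    # Min-heap of (current load in seconds, group index).
    heap = [(0.0, g) for g in range(num_groups)]
    heapq.heapify(heap)
    groups = [[] for _ in range(num_groups)]
    # Slowest first; ties are broken by test id so the result is deterministic.
    ordered = sorted(durations.items(), key=lambda kv: (-kv[1], kv[0]))
    for test_id, seconds in ordered:
        load, g = heapq.heappop(heap)
        groups[g].append(test_id)
        heapq.heappush(heap, (load + seconds, g))
    return groups


def group_loads(durations, groups):
    """Total recorded seconds per group, useful for checking the balance."""
    return [sum(durations[t] for t in group) for group in groups]
```

In a workflow like the one in this hunk, the per-shard mappings would come from the duration artifacts uploaded by the scheduled run, and the merged result would be what gets committed as `.test_durations`.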

github-actions · 2026-04-20T03:23:27Z

-            "compile_topology=v5p-64",
-            "compile_topology_num_slices=8",
+            "compile_topology=v5p-8",
+            "compile_topology_num_slices=2",


🟢 Reducing the `compile_topology` and `compile_topology_num_slices` here is appropriate for a unit/compilation test. It speeds up the CI by requiring fewer resources and less time to verify that the pipeline parallelism logic compiles correctly.

github-actions · 2026-04-20T03:23:27Z

            "compile_topology_num_slices=1",
            "model_name=mixtral-8x7b",
+            "override_model_config=true",
+            "base_num_decoder_layers=8",


🟢 Using `override_model_config=true` and reducing `base_num_decoder_layers` to 8 is a great optimization for compilation tests. This significantly reduces the graph size and compilation time while still exercising the necessary code paths for MoE models.
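The override pattern in this hunk can be sketched as a small helper that appends the two flags to an existing `key=value` argument list. The helper name `with_reduced_layers` is hypothetical and is not taken from the PR; only the flag names and values come from the diff above.

```python
def with_reduced_layers(base_args, num_layers=8):
    """Return a copy of base_args with the layer-count override appended.

    Any existing occurrences of the two override keys are dropped first so
    the returned list never carries conflicting values.
    """
    override_keys = {"override_model_config", "base_num_decoder_layers"}
    kept = [arg for arg in base_args if arg.split("=", 1)[0] not in override_keys]
    return kept + [
        "override_model_config=true",
        f"base_num_decoder_layers={num_layers}",
    ]
```

Keeping the override in one place means each compilation test states only what is specific to it, such as the model name and topology.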

github-actions · 2026-04-20T03:23:30Z

          export MAXTEXT_PKG_DIR=$(pwd)/src/maxtext
          # TODO(b/454659463): Enable test_default_hlo_match after volume mount is supported.
-          .venv/bin/python3 -m pytest ${{ inputs.pytest_addopts }} -v -m "${FINAL_PYTEST_MARKER}" -k "not AotHloIdenticalTest and not CompileThenLoad" --durations=0
+          .venv/bin/python3 -m pytest ${{ inputs.pytest_addopts }} -v -m "${FINAL_PYTEST_MARKER}" -k "not AotHloIdenticalTest and not AotJaxprIdenticalTest and not CompileThenLoad and not test_diloco_two_slices" --durations=0


🟡 Skipping `test_diloco_two_slices` helps stabilize the CI, but it would be beneficial to link a tracking issue or add a TODO explaining why this test is being skipped and if it's intended to be re-enabled later.
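One way to act on this suggestion is to keep the deselected names next to a note for each and generate the `-k` expression from that list. The sketch below is hypothetical: the `DESELECTED` table and `build_k_expression` are not part of the PR, and the notes are placeholders because the PR does not state a reason for any of the skips.

```python
# Each entry pairs a deselected name with a note. The notes are placeholders:
# the PR does not say why these tests are skipped.
DESELECTED = [
    ("AotHloIdenticalTest", "TODO: link tracking issue"),
    ("AotJaxprIdenticalTest", "TODO: link tracking issue"),
    ("CompileThenLoad", "TODO: link tracking issue"),
    ("test_diloco_two_slices", "TODO: link tracking issue"),
]


def build_k_expression(entries):
    """Build a pytest -k expression that deselects every listed name."""
    return " and ".join(f"not {name}" for name, _note in entries)
```

Calling `build_k_expression(DESELECTED)` yields the same expression that is passed to `-k` in the updated command.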

gagika force-pushed the agagik-ci branch from 9f174a5 to 7f5aa98 Compare April 19, 2026 23:55

gagika added the gemini-review label Apr 20, 2026

github-actions Bot reviewed Apr 20, 2026


gagika force-pushed the agagik-ci branch 2 times, most recently from b43b796 to 8182ccd Compare April 21, 2026 01:28

Speed up cpu-unit CI

f9e4a70

gagika force-pushed the agagik-ci branch from 8182ccd to f9e4a70 Compare April 21, 2026 03:53

Speed up cpu-unit CI #3700
gagika wants to merge 1 commit into main from agagik-ci
