Skip to content

Speed up cpu-unit CI#3700

Draft
gagika wants to merge 1 commit intomainfrom
agagik-ci
Draft

Speed up cpu-unit CI#3700
gagika wants to merge 1 commit intomainfrom
agagik-ci

Conversation

@gagika
Copy link
Copy Markdown
Collaborator

@gagika gagika commented Apr 19, 2026

Description

Start with a short description of what the PR does and how this is a change from
the past.

The rest of the description includes relevant details and context, examples:

  • why is this change being made,
  • the problem being solved and any relevant context,
  • why this is a good solution,
  • some information about the specific implementation,
  • shortcomings of the solution and possible future improvements.

If the change fixes a bug or a Github issue, please include a link, e.g.,:
FIXES: b/123456
FIXES: #123456

Notice 1: Once all tests pass, the "pull ready" label will automatically be assigned.
This label is used for administrative purposes. Please do not add it manually.

Notice 2: For external contributions, our settings currently require an approval from a MaxText maintainer to trigger CI tests.

Tests

Please describe how you tested this change, and include any instructions and/or
commands to reproduce.

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 19, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@github-actions
Copy link
Copy Markdown

🤖 Hi @gagika, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

## 📋 Review Summary

This Pull Request successfully introduces several optimizations to speed up the cpu-unit CI. By increasing parallelism, introducing test duration recording for better load balancing, and reducing the complexity of MoE compilation tests, the CI efficiency is significantly improved without compromising the validity of the tests.

🔍 General Feedback

  • Excellent CI Optimizations: The use of pytest-split with stored durations on scheduled runs is a very effective way to handle imbalanced test shards.
  • Smart Model Overrides: Using override_model_config to reduce the number of layers in compilation tests is a high-impact change that drastically reduces compilation overhead.
  • Improved Maintainability: The refactoring of MoE tests into helper methods makes the test suite cleaner and easier to maintain.

SPLIT_ARGS="--splits ${INPUTS_TOTAL_WORKERS} --group ${INPUTS_WORKER_GROUP} -n auto"
# On scheduled runs, record per-shard durations so future splits balance by time (LPT).
# Merge artifacts offline and commit as .test_durations at repo root.
if [ "${INPUTS_IS_SCHEDULED_RUN}" == "true" ]; then
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Recording per-shard durations on scheduled runs is an excellent strategy for optimizing CI load balancing. Merging these artifacts into a central `.test_durations` file will allow `pytest-split` to balance future runs more effectively using the Least Progress Time (LPT) algorithm.

"compile_topology=v5p-64",
"compile_topology_num_slices=8",
"compile_topology=v5p-8",
"compile_topology_num_slices=2",
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Reducing the `compile_topology` and `compile_topology_num_slices` here is appropriate for a unit/compilation test. It speeds up the CI by requiring fewer resources and less time to verify that the pipeline parallelism logic compiles correctly.

"compile_topology_num_slices=1",
"model_name=mixtral-8x7b",
"override_model_config=true",
"base_num_decoder_layers=8",
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Using `override_model_config=true` and reducing `base_num_decoder_layers` to 8 is a great optimization for compilation tests. This significantly reduces the graph size and compilation time while still exercising the necessary code paths for MoE models.

export MAXTEXT_PKG_DIR=$(pwd)/src/maxtext
# TODO(b/454659463): Enable test_default_hlo_match after volume mount is supported.
.venv/bin/python3 -m pytest ${{ inputs.pytest_addopts }} -v -m "${FINAL_PYTEST_MARKER}" -k "not AotHloIdenticalTest and not CompileThenLoad" --durations=0
.venv/bin/python3 -m pytest ${{ inputs.pytest_addopts }} -v -m "${FINAL_PYTEST_MARKER}" -k "not AotHloIdenticalTest and not AotJaxprIdenticalTest and not CompileThenLoad and not test_diloco_two_slices" --durations=0
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Skipping `test_diloco_two_slices` helps stabilize the CI, but it would be beneficial to link a tracking issue or add a TODO explaining why this test is being skipped and if it's intended to be re-enabled later.

@gagika gagika force-pushed the agagik-ci branch 2 times, most recently from b43b796 to 8182ccd Compare April 21, 2026 01:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant