|
| 1 | +# HLO Graph Diff Verification Testing |
| 2 | + |
| 3 | +This document explains what HLO is, what the HLO Graph Diff tests check, and how to manage their reference baselines. |
| 4 | + |
| 5 | +## Related Files |
| 6 | + |
| 7 | +- **Test logic**: `tests/integration/hlo_diff_test.py` |
| 8 | +- **Reference baselines**: `tests/utils/reference_hlo_*.txt` |
| 9 | +- **Update helper script**: `tests/utils/update_hlo_references.py` |
| 10 | +- **GitHub Actions update workflow**: `.github/workflows/update_reference_hlo.yml` |
| 11 | + |
| 12 | +## What is HLO? |
| 13 | + |
| 14 | +**HLO (High-Level Optimizer)** is the intermediate representation that XLA (Accelerated Linear Algebra) uses to describe a program's computation graph as it is lowered and compiled. |
| 15 | + |
| 16 | +An HLO module records: |
| 17 | + |
| 18 | +- The sequence of low-level math operations (dot products, convolutions, additions). |
| 19 | +- Tensor shapes and numerical precisions. |
| 20 | +- How arrays are partitioned and sharded across devices, including multi-pod TPU clusters. |
| 21 | + |
| 22 | +## Purpose of HloDiffTest |
| 23 | + |
| 24 | +The primary purpose of the `TestHloDiff` validation checks is to ensure that **refactoring PRs only refactor code** and do not unintentionally change how the graph is lowered or how it performs. |
| 25 | + |
| 26 | +- **For pure refactors:** The HLO graph should remain *strictly identical*. Any deviation signals that execution boundaries or operation pipelines may have changed under the hood. |
| 27 | +- **For dependency updates:** Changes to framework dependencies (such as a new JAX or XLA version) *are expected* to alter the compiled HLO slightly, so updating the baselines is appropriate in those cases. |
| 28 | + |
| 29 | +______________________________________________________________________ |
| 30 | + |
| 31 | +## How the Test Works |
| 32 | + |
| 33 | +This test runs automatically as part of the [`tpu-integration`](https://github.com/AI-Hypercomputer/maxtext/actions/workflows/build_and_test_maxtext.yml) CI test suite on every Pull Request. |
| 34 | + |
| 35 | +When the test method executes, it performs the following steps: |
| 36 | + |
| 37 | +1. **Triggers compilation**: It runs only the compilation phase of model training (by invoking `train_compile.main()`), without allocating hardware compute nodes or running optimization passes. |
| 38 | +2. **Dumps HLO modules**: It instructs the XLA compiler backend to dump the lowered HLO graphs to text files. |
| 39 | +3. **Compares strictly**: It compares the structural lines of the generated HLO directly against the baseline `.txt` files stored under `tests/utils/`. |
| 40 | + |
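The comparison in step 3 can be pictured with a minimal sketch. This is not the actual logic in `hlo_diff_test.py`; the helper names (`structural_lines`, `diff_hlo`, `assert_hlo_matches`) are hypothetical, and treating blank lines and indentation as non-structural is an assumption made only for illustration.

```python
import difflib
from pathlib import Path


def structural_lines(hlo_text: str) -> list[str]:
    """Return non-empty lines with surrounding whitespace stripped.

    Assumption: blank lines and indentation are not significant.
    """
    return [line.strip() for line in hlo_text.splitlines() if line.strip()]


def diff_hlo(reference_text: str, generated_text: str) -> list[str]:
    """Return unified-diff lines; an empty list means the dumps match."""
    return list(
        difflib.unified_diff(
            structural_lines(reference_text),
            structural_lines(generated_text),
            fromfile="reference",
            tofile="generated",
            lineterm="",
        )
    )


def assert_hlo_matches(reference_path: Path, generated_path: Path) -> None:
    """Raise AssertionError with a readable diff if the two files differ."""
    diff = diff_hlo(reference_path.read_text(), generated_path.read_text())
    if diff:
        raise AssertionError("HLO differs from baseline:\n" + "\n".join(diff))
```

A strict comparison like this fails on any changed line, which is why a dependency bump that alters the lowered graph requires regenerating the baselines.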
| 41 | +______________________________________________________________________ |
| 42 | + |
| 43 | +## Updating HLO reference files |
| 44 | + |
| 45 | +When an intentional architecture change alters graph lowering, the reference baselines must be updated. |
| 46 | + |
| 47 | +> [!IMPORTANT] |
| 48 | +> Running the update script locally is acceptable, but **baselines generated locally can cause remote CI tests to fail.** |
| 49 | +> The PR verification pipelines run the tests in a strictly locked GitHub Actions environment. Even small differences in locally installed library versions can change the lowered graph. If a locally generated baseline leads to a remote CI check failure, use the GitHub Action trigger described below to generate environment-matching baselines. |
| 50 | + |
| 51 | +### Method 1: Run the manual GitHub Action Workflow (Highly Recommended) |
| 52 | + |
| 53 | +Triggering the workflow ensures that the baselines are generated in the same environment that CI uses. |
| 54 | + |
| 55 | +#### Option A: Using the GitHub UI |
| 56 | + |
| 57 | +1. Open the **Actions** tab of the repository. |
| 58 | +2. Find the manual workflow: `Update HLO References (for hlo_diff_test.py)`. |
| 59 | +3. Run it against your PR branch. It compiles the graph and automatically commits the updated baseline files back to the branch. |
| 60 | + |
| 61 | +#### Option B: Using the GitHub CLI (`gh`) |
| 62 | + |
| 63 | +Alternatively, trigger the workflow from a terminal: |
| 64 | + |
| 65 | +```bash |
| 66 | +gh workflow run update_reference_hlo.yml --ref <branch> |
| 67 | +``` |
| 68 | + |
| 69 | +> [!NOTE] |
| 70 | +> A successful run of the manual update workflow will add a new commit to your Pull Request branch. Once complete, you must: |
| 71 | +> |
| 72 | +> 1. Pull the new commit from remote. |
| 73 | +> 2. Squash the commits in your branch once again to keep your PR history clean. |
| 74 | +> 3. Push the squashed commit to remote. |
| 75 | +> 4. Retry the `tpu-integration` workflow to verify tests pass on your PR. |
| 76 | + |
| 77 | +### Method 2: Local Execution |
| 78 | + |
| 79 | +To run the test locally during development: |
| 80 | + |
| 81 | +```bash |
| 82 | +source .venv/bin/activate |
| 83 | +pytest tests/integration/hlo_diff_test.py -v |
| 84 | +``` |
| 85 | + |
| 86 | +To regenerate the local baselines: |
| 87 | + |
| 88 | +```bash |
| 89 | +python3 tests/utils/update_hlo_references.py |
| 90 | +``` |