|
| 1 | +# Automated `config_machines.xml` updates |
| 2 | + |
| 3 | +This page describes the automation that watches for upstream changes to |
| 4 | +E3SM's `config_machines.xml`, opens or refreshes a Copilot task when drift is |
| 5 | +detected, and explains how maintainers are expected to review the resulting |
| 6 | +pull request. |
| 7 | + |
| 8 | +## Goal |
| 9 | + |
| 10 | +`mache` keeps a repository-local copy of the upstream E3SM machine list in |
| 11 | +`mache/cime_machine_config/config_machines.xml`. |
| 12 | + |
| 13 | +The automation added here does **not** edit that file directly. Instead, it: |
| 14 | + |
| 15 | +1. Compares the copy in `mache` against the current upstream E3SM source. |
| 16 | +2. Produces a structured report describing any drift for supported machines. |
| 17 | +3. Creates or updates one GitHub issue that assigns the work to Copilot. |
| 18 | +4. Lets Copilot open a PR that updates `config_machines.xml` and any related |
| 19 | + Spack configuration. |
| 20 | + |
| 21 | +This keeps the source-of-truth update in a reviewed pull request rather than a |
| 22 | +silent CI-side commit. |
| 23 | + |
| 24 | +## Pieces of the automation |
| 25 | + |
| 26 | +### Daily workflow |
| 27 | + |
| 28 | +`.github/workflows/cime_machine_config_update.yml` |
| 29 | +: Runs once a day on the cron schedule `0 8 * * *` (08:00 UTC) and can also be |
| 30 | + started manually with `workflow_dispatch`. |
| 31 | + |
| 32 | +The job: |
| 33 | + |
| 34 | +1. Checks out `main`. |
| 35 | +2. Sets up the `py314` Pixi environment. |
| 36 | +3. Installs `mache` from the checked-out repository. |
| 37 | +4. Runs `utils/update_cime_machine_config.py`. |
| 38 | +5. Uploads the generated JSON and Markdown report artifacts. |
| 39 | +6. Runs `utils/manage_cime_machine_config_issue.py` when `GH_CLI_TOKEN` is |
| 40 | + configured. |
| 41 | + |
| 42 | +### Copilot environment workflow |
| 43 | + |
| 44 | +`.github/workflows/copilot-setup-steps.yml` |
| 45 | +: Defines the setup steps the Copilot cloud agent can use on the default |
| 46 | + branch so it starts from a working Pixi environment with `mache` installed. |
| 47 | + |
| 48 | +### Drift report builder |
| 49 | + |
| 50 | +`utils/update_cime_machine_config.py` |
| 51 | +: Downloads the current upstream E3SM `config_machines.xml`, compares it with |
| 52 | + `mache/cime_machine_config/config_machines.xml`, prints a short console |
| 53 | + summary, and optionally writes: |
| 54 | + |
| 55 | + - a JSON report for machine-readable automation, |
| 56 | + - a Markdown issue body for Copilot and human reviewers. |
| 57 | + |
| 58 | +`mache/cime_machine_config/report.py` |
| 59 | +: Contains the structured comparison logic. It determines which supported |
| 60 | + machines changed, identifies module and environment-variable drift, infers |
| 61 | + related package groups, and lists candidate Spack template files to review. |
| 62 | + |
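The comparison step described above can be sketched roughly as follows. This is a minimal illustration, not the logic in `mache/cime_machine_config/report.py`: the helper names `machine_fingerprint`, `machines_by_name`, and `changed_machines` are hypothetical, and the sketch assumes each machine is a `<machine MACH="...">` element in the XML.

```python
import xml.etree.ElementTree as ET


def machine_fingerprint(machine):
    """Return a comparable summary of one <machine> element (hypothetical helper)."""
    return [
        (el.tag, sorted(el.attrib.items()), (el.text or "").strip())
        for el in machine.iter()
    ]


def machines_by_name(xml_text):
    """Map each machine name (assumed to be the MACH attribute) to its fingerprint."""
    root = ET.fromstring(xml_text)
    return {
        machine.get("MACH"): machine_fingerprint(machine)
        for machine in root.iter("machine")
        if machine.get("MACH") is not None
    }


def changed_machines(local_xml, upstream_xml, supported):
    """List supported machines whose upstream definition differs from the local copy.

    Machines missing upstream are ignored in this sketch.
    """
    local = machines_by_name(local_xml)
    upstream = machines_by_name(upstream_xml)
    return sorted(
        name
        for name in supported
        if name in upstream and local.get(name) != upstream[name]
    )
```

Restricting the result to a `supported` list mirrors the idea that only drift for supported machines is reported; how the real code decides which machines are supported is not shown here.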
| 63 | +### Issue synchronization |
| 64 | + |
| 65 | +`utils/manage_cime_machine_config_issue.py` |
| 66 | +: Owns the GitHub-side lifecycle for the automation issue. |
| 67 | + |
| 68 | +If drift exists, it creates or updates the issue. |
| 69 | + |
| 70 | +If no drift exists, it closes the existing issue, if one is open. |
| 71 | + |
| 72 | +If Copilot assignment fails, it falls back to creating or updating the same |
| 73 | +issue without Copilot assignment so the report is still visible. |
| 74 | + |
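The fallback behavior can be pictured with a small sketch. The function `sync_issue` and the `upsert` callable are hypothetical stand-ins, and catching `RuntimeError` is an assumption; the actual script may detect assignment failures differently.

```python
def sync_issue(upsert, body, assign_copilot=True):
    """Create or update the issue, retrying without Copilot if assignment fails.

    `upsert` is a hypothetical callable taking (body, assign_copilot) and
    returning the issue number; it is assumed to raise RuntimeError when the
    Copilot assignment is rejected.
    """
    if assign_copilot:
        try:
            return upsert(body, True)
        except RuntimeError:
            # Fall through so the report is still published on the issue.
            pass
    return upsert(body, False)
```

The point of the design is that a failed assignment degrades to a plain issue rather than losing the drift report.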
| 75 | +### Tests |
| 76 | + |
| 77 | +`tests/test_cime_machine_config_report.py` |
| 78 | +: Verifies that the report builder detects relevant drift and that the rendered |
| 79 | + issue body contains the required maintainer instructions. |
| 80 | + |
| 81 | +## How `config_machines.xml` gets updated |
| 82 | + |
| 83 | +The important point is that the scheduled workflow never edits |
| 84 | +`mache/cime_machine_config/config_machines.xml` itself. |
| 85 | + |
| 86 | +The update path is: |
| 87 | + |
| 88 | +1. The workflow detects drift between the `mache` copy and upstream E3SM. |
| 89 | +2. The workflow creates or refreshes a GitHub issue. |
| 90 | +3. Copilot is assigned to that issue. |
| 91 | +4. Copilot opens a pull request against `main`. |
| 92 | +5. That PR updates `mache/cime_machine_config/config_machines.xml` first, then |
| 93 | + any related Spack templates or version strings that the report indicates |
| 94 | + should be reviewed. |
| 95 | +6. A maintainer reviews and merges the PR. |
| 96 | +7. The next daily run compares the merged repository state against upstream |
| 97 | + again. |
| 98 | + |
| 99 | +If the PR fully resolved the drift, the issue is closed automatically on the |
| 100 | +next run. |
| 101 | + |
| 102 | +If only part of the drift was resolved, the issue stays open and its body is |
| 103 | +updated to reflect the remaining work. |
| 104 | + |
| 105 | +## What Copilot is told to do |
| 106 | + |
| 107 | +Copilot receives instructions from two places. |
| 108 | + |
| 109 | +### Fixed API-level instructions |
| 110 | + |
| 111 | +`utils/manage_cime_machine_config_issue.py` adds the following guidance in the |
| 112 | +`agent_assignment` payload: |
| 113 | + |
| 114 | +- Use the issue body as the task definition. |
| 115 | +- Update `config_machines.xml` first. |
| 116 | +- Then update related Spack templates and version strings. |
| 117 | +- Add TODO comments in the PR when prefix or path changes need reviewer |
| 118 | + confirmation. |
| 119 | + |
| 120 | +### Generated issue-body instructions |
| 121 | + |
| 122 | +`mache/cime_machine_config/report.py` renders the issue body for the current |
| 123 | +drift and includes: |
| 124 | + |
| 125 | +- the timestamp and upstream source URL, |
| 126 | +- the workflow run URL, |
| 127 | +- the list of affected supported machines, |
| 128 | +- the required work list, |
| 129 | +- per-machine details such as package groups, prefix or path variables, and |
| 130 | + candidate Spack templates to inspect. |
| 131 | + |
| 132 | +The required work section tells Copilot to: |
| 133 | + |
| 134 | +- update `mache/cime_machine_config/config_machines.xml` for the affected |
| 135 | + supported machines, |
| 136 | +- update Spack templates and version strings when module or environment drift |
| 137 | + implies different package versions, |
| 138 | +- keep the PR focused when the change is only version or module drift, |
| 139 | +- add a TODO in the PR instead of guessing when a new prefix or path is not |
| 140 | + obvious. |
| 141 | + |
| 142 | +## Why this does not create a new issue every day |
| 143 | + |
| 144 | +The workflow is designed to reuse one open issue rather than create a new one |
| 145 | +for every scheduled run. |
| 146 | + |
| 147 | +`utils/manage_cime_machine_config_issue.py` looks for an existing open issue |
| 148 | +with the fixed title stored in the workflow environment: |
| 149 | + |
| 150 | +- `ISSUE_TITLE: Daily config_machines drift detected` |
| 151 | + |
| 152 | +The lifecycle is: |
| 153 | + |
| 154 | +1. If no matching open issue exists and drift is detected, create one. |
| 155 | +2. If a matching open issue already exists and drift is still present, update |
| 156 | + that same issue. |
| 157 | +3. If no drift remains and the issue exists, close it. |
| 158 | + |
| 159 | +That means drift that stays unresolved, for example while maintainers are |
| 160 | +away, does **not** produce a fresh issue every day. The same issue remains open and is refreshed in place. |
| 161 | + |
| 162 | +A new issue would only be created if one of these is true: |
| 163 | + |
| 164 | +- the existing automation issue was manually closed while drift still exists, |
| 165 | +- the issue title configured in the workflow was changed, |
| 166 | +- the existing issue was deleted or otherwise no longer appears as an open |
| 167 | + issue in the repository. |
| 168 | + |
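The lifecycle rules in this section reduce to a small decision table. The sketch below is illustrative only: `plan_issue_action` and its return values are hypothetical names, not the interface of `utils/manage_cime_machine_config_issue.py`.

```python
def plan_issue_action(drift_detected, open_issue_number):
    """Decide what to do with the single automation issue (hypothetical helper).

    `open_issue_number` is the number of the open issue matching the fixed
    title, or None if no such open issue exists.
    """
    if drift_detected:
        if open_issue_number is None:
            return "create"  # no matching open issue yet
        return "update"  # refresh the same issue in place
    if open_issue_number is not None:
        return "close"  # drift resolved, so the issue is no longer needed
    return "noop"  # nothing to report and nothing to close
```

Because the lookup only considers open issues with the fixed title, a manually closed or retitled issue looks the same as "no issue" and leads to `"create"` on the next run with drift.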
| 169 | +## Reviewer workflow |
| 170 | + |
| 171 | +When Copilot opens a PR from this issue, the reviewer should check the changes |
| 172 | +in this order. |
| 173 | + |
| 174 | +### 1. `config_machines.xml` changes |
| 175 | + |
| 176 | +Verify that the PR updates |
| 177 | +`mache/cime_machine_config/config_machines.xml` only for supported machines |
| 178 | +reported by the workflow, and that those changes match the current upstream |
| 179 | +E3SM machine definitions. |
| 180 | + |
| 181 | +In practice, the easiest cross-check is to compare the PR against the report |
| 182 | +artifact from the workflow run that opened or refreshed the issue. |
| 183 | + |
| 184 | +### 2. Related Spack updates |
| 185 | + |
| 186 | +If the report lists package groups or candidate Spack templates, check that the |
| 187 | +PR updated the relevant `mache/spack/*.yaml` inputs and any version strings |
| 188 | +that should track the new module or environment values. |
| 189 | + |
| 190 | +If the report does not indicate Spack-relevant drift, the PR should usually be |
| 191 | +limited to `config_machines.xml`. |
| 192 | + |
| 193 | +### 3. Ambiguous path or prefix changes |
| 194 | + |
| 195 | +When upstream changes a path-like variable such as `NETCDF_PATH`, the correct |
| 196 | +replacement in `mache` may not be obvious from the XML alone. |
| 197 | + |
| 198 | +In that case, the expected behavior is **not** to guess. The PR should leave a |
| 199 | +TODO note for the reviewer and explain what needs confirmation. |
| 200 | + |
| 201 | +### 4. Validation |
| 202 | + |
| 203 | +At minimum, reviewers or PR authors should run the same local checks that are |
| 204 | +used for development in this repository. |
| 205 | + |
| 206 | +Generate the current report locally: |
| 207 | + |
| 208 | +```bash |
| 209 | +pixi run -e py314 python utils/update_cime_machine_config.py \ |
| 210 | + --json-output /tmp/cime_machine_config_report.json \ |
| 211 | + --markdown-output /tmp/cime_machine_config_report.md |
| 212 | +``` |
| 213 | + |
| 214 | +Run the focused tests: |
| 215 | + |
| 216 | +```bash |
| 217 | +pixi run -e py314 pytest tests/test_cime_machine_config_report.py |
| 218 | +``` |
| 219 | + |
| 220 | +Run pre-commit on changed files before merging: |
| 221 | + |
| 222 | +```bash |
| 223 | +pixi run -e py314 pre-commit run --files <changed files> |
| 224 | +``` |
| 225 | + |
| 226 | +## Manual dry run for maintainers |
| 227 | + |
| 228 | +To exercise the detection path without waiting for the cron schedule: |
| 229 | + |
| 230 | +1. Trigger the workflow manually with `workflow_dispatch`, or |
| 231 | +2. Run `utils/update_cime_machine_config.py` locally in the Pixi environment. |
| 232 | + |
| 233 | +If `GH_CLI_TOKEN` is not configured, the workflow still generates and uploads |
| 234 | +the report artifacts but skips issue synchronization. |
| 235 | + |
| 236 | +That is a safe way to validate the comparison and report rendering logic |
| 237 | +without asking Copilot to act on the result. |
| 238 | + |
| 239 | +## Operational notes |
| 240 | + |
| 241 | +- `GH_CLI_TOKEN` should be a user token with access to create and update |
| 242 | + issues in the repository. A classic PAT with `repo` scope is sufficient. |
| 243 | +- Copilot assignment additionally depends on Copilot cloud agent being enabled |
| 244 | + for the repository. |
| 245 | +- The workflow uses the repository's current `main` branch as the comparison |
| 246 | + baseline and as the branch Copilot is asked to target. |