Commit d9e347a

avivkeller, MattIPv4, flakey5 authored
feat(incident): Add incident report 3/8/2026 (#110)
* feat(incident): Add incident report for macOS installer version mismatch

  Documented the incident regarding a macOS installer package version mismatch due to a Jenkins job failure. Included timeline, impact, root cause, fix, and follow-up work.
* Update incident report date to 2026-03-08
* Correct timeline dates for Node.js v22.22.1 incident

  Updated incident timeline to reflect correct dates for events related to the macOS installer package issue.
* Update incidents/2026-03-08.md

  Co-authored-by: Matt Cowley <me@mattcowley.co.uk>
* Update 2026-03-08.md
* Update 2026-03-08.md
* Update 2026-03-08.md
* Update 2026-03-08.md
* Rename 2026-03-08.md to 2026-03-04.md
* Update 2026-03-04.md
* Apply suggestions from code review

  Co-authored-by: Matt Cowley <me@mattcowley.co.uk>
* Update incident report for March 2026
* Update 2026-03-04.md
* Update 2026-03-04.md

  Co-authored-by: Matt Cowley <me@mattcowley.co.uk>
* Update 2026-03-04.md
* Apply suggestions from code review

  Co-authored-by: Matt Cowley <me@mattcowley.co.uk>
  Co-authored-by: flakey5 <73616808+flakey5@users.noreply.github.com>
* Update incidents/2026-03-04.md

  Co-authored-by: Matt Cowley <me@mattcowley.co.uk>
* Update wording on time period

---------

Co-authored-by: Matt Cowley <me@mattcowley.co.uk>
Co-authored-by: flakey5 <73616808+flakey5@users.noreply.github.com>
1 parent 4623617 commit d9e347a

1 file changed

+69
-0
lines changed

incidents/2026-03-04.md

Lines changed: 69 additions & 0 deletions
@@ -0,0 +1,69 @@
1+
# 2026-03-04 Incident Report
2+
3+
- Incident Commander: @ryanaslett
4+
- Severity Level: P1
5+
6+
For several days following the release announcement, the macOS installer package (`.pkg`) for Node.js v22.22.1 served from `nodejs.org` did not match its published SHA256 checksum, because the `rclone` upload step failed during a Jenkins job re-run. Although its hash differed from the published one, the served file was generated and signed legitimately by the Node.js CI and was safe to run.
7+
8+
## Timeline
9+
10+
- **2026-03-04 17:14 UTC**: First Jenkins build completed successfully, uploading `node-v22.22.1.pkg` (SHA256: `1fbe9cd7e9fdce6cf150bbe59cb97a426434f7fb217135d10124a62bfb697448`) to direct (backup origin server) and the R2 dist-staging bucket.
11+
12+
- **2026-03-04 21:00 UTC**: Second Jenkins build completed, uploading recreated `node-v22.22.1.pkg` (SHA256: `ac8cb570db59cb399be96978c194f6c4fc91ffcf11a197ebd5461083c0cf1dfd`) to direct, but failing to write to the R2 dist-staging bucket.
13+
14+
- **2026-03-05 14:30 UTC**: Start of impact. The promotion script ran, copying the release assets from the R2 dist-staging bucket to the R2 dist-prod bucket (which serves `nodejs.org`), including `node-v22.22.1.pkg` from the first Jenkins run. `SHASUMS256.txt` was generated from the assets on direct, which included `node-v22.22.1.pkg` from the second Jenkins run.
15+
16+
- **2026-03-08 10:04 UTC**: Initial report of incident [nodejs/release-cloudflare-worker#878](https://github.com/nodejs/release-cloudflare-worker/issues/878) created.
17+
18+
- **2026-03-08 12:12 UTC**: Initial report of incident [nodejs/release-cloudflare-worker#878](https://github.com/nodejs/release-cloudflare-worker/issues/878) acknowledged.
19+
20+
- **2026-03-08 23:52 UTC**: Initial report forwarded to [OpenJS Slack](https://openjs-foundation.slack.com/archives/C09EXEEHFKP/p1773013976217429), investigation began.
21+
22+
- **2026-03-09 00:33 UTC**: Team confirmed both files were legitimately signed by Apple at different times (17:14 and 21:00 UTC).
23+
24+
- **2026-03-09 00:41 UTC**: Root cause identified: the Jenkins job re-run uploaded to direct but failed to sync to R2, leaving the two origins with different builds of the same package.
25+
26+
- **2026-03-09 01:25 UTC**: Corrected macOS installer package (`.pkg`) promoted. Impact resolved shortly after.
27+
28+
## Impact
29+
30+
Users downloading the macOS installer package from `https://nodejs.org/dist/v22.22.1/node-v22.22.1.pkg` received a file whose SHA256 checksum (`1fbe9cd7e9fdce6cf150bbe59cb97a426434f7fb217135d10124a62bfb697448`) did not match the checksum published in [`SHASUMS256.txt`](https://nodejs.org/dist/latest-v22.x/SHASUMS256.txt) (`ac8cb570db59cb399be96978c194f6c4fc91ffcf11a197ebd5461083c0cf1dfd`).
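
The mismatch described above can be checked by hashing the downloaded file and comparing it with the entry in `SHASUMS256.txt`. The following is a minimal Python sketch, not part of the Node.js release tooling. It assumes `SHASUMS256.txt` uses the `sha256sum`-style `<hex digest>  <filename>` line layout, and the helper names `parse_shasums` and `matches_published` are hypothetical.

```python
import hashlib


def parse_shasums(text: str) -> dict[str, str]:
    """Map filename -> hex digest, assuming '<digest>  <filename>' lines."""
    entries = {}
    for line in text.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) == 2:
            digest, name = parts
            entries[name.strip()] = digest.lower()
    return entries


def matches_published(data: bytes, name: str, shasums_text: str) -> bool:
    """True only if the SHA256 of `data` equals the published digest for `name`."""
    published = parse_shasums(shasums_text).get(name)
    return published is not None and hashlib.sha256(data).hexdigest() == published
```

During the incident window, a check like this would have returned `False` for the package served from `nodejs.org` and `True` for the one served from direct.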
31+
32+
Both files were legitimately signed by the Node.js Foundation Apple Developer account, but represented different build artifacts from separate Jenkins runs. The file served from direct.nodejs.org matched the published checksum, but Cloudflare R2 (serving most users via the release worker) contained the package from the earlier run.
33+
34+
## Root Cause
35+
36+
A workflow issue in the Jenkins release process allowed files to become out of sync between direct.nodejs.org (www) and the R2 bucket.
37+
38+
The release process works as follows:
39+
1. Jenkins builds the macOS package and signs it
40+
2. The package is copied to direct via `scp`
41+
3. Jenkins SSHs into direct and uses `rclone` to copy the file to the R2 dist-staging bucket
42+
4. The releaser runs a script that SSHs into direct and copies files from the R2 dist-staging bucket to the R2 dist-prod bucket
43+
5. The script generates `SHASUMS256.txt` based on the files on direct, not R2, and writes it to the R2 dist-prod bucket
44+
45+
During the v22.22.1 release:
46+
1. The first Jenkins job (17:14 UTC) completed successfully, uploading the initial signed package to both direct and R2 staging
47+
2. The job was re-run, producing a new signed package at 21:00 UTC
48+
3. The second run successfully copied the new package to direct
49+
4. The `rclone` step to R2 staging failed with `kex_exchange_identification: Connection closed by remote host`
50+
5. The Jenkins job marked the build as failed but did not roll back the direct upload
51+
6. The releaser ran the promotion script, which promoted the original package from R2 staging to prod, but generated `SHASUMS256.txt` based on the regenerated package on direct
52+
53+
This left direct with matching package and `SHASUMS256.txt` files, but the R2 prod bucket with the outdated package file, creating a checksum mismatch for most users.
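
The sequence above can be reproduced with a toy model. This is only an illustrative Python sketch under simplifying assumptions: each store is a dictionary mapping a filename to a build label, and the function names `jenkins_run` and `promote` are hypothetical stand-ins for the real Jenkins job and promotion script.

```python
# Toy model of the release flow: each store maps filename -> build label.
PKG = "node-v22.22.1.pkg"


def jenkins_run(direct: dict, staging: dict, build: str, rclone_ok: bool) -> bool:
    """Copy to direct, then sync to R2 staging; nothing is rolled back on failure."""
    direct[PKG] = build
    if not rclone_ok:
        return False  # build marked failed, but direct already holds the new file
    staging[PKG] = build
    return True


def promote(direct: dict, staging: dict, prod: dict) -> str:
    """Promote staging to prod and return the build that the checksums describe."""
    prod.update(staging)
    return direct[PKG]  # checksums are derived from direct, not from R2


direct, staging, prod = {}, {}, {}
jenkins_run(direct, staging, "first-run", rclone_ok=True)
jenkins_run(direct, staging, "second-run", rclone_ok=False)
checksummed_build = promote(direct, staging, prod)
print(prod[PKG], checksummed_build)  # prints "first-run second-run"
```

The model shows that the mismatch needs two conditions together: a partial upload that is not rolled back, and checksums generated from a different store than the one that serves users.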
54+
55+
## Fix
56+
57+
The immediate fix was to manually sync the correct file from direct to the R2 dist-staging bucket using `rclone copyto`, and then to the R2 dist-prod bucket.
58+
59+
## Follow-up Work
60+
61+
- Improve Jenkins workflow to prevent partial uploads when rclone fails
62+
- Either roll back direct uploads if R2 sync fails, or upload to both destinations atomically
63+
- Add verification step to compare checksums between direct and R2 before marking build as complete
64+
- Add monitoring/alerting for checksum mismatches between distribution sources
65+
- Investigate why the rclone SSH connection failed mid-release
66+
- Consider adding checksum verification as part of the promotion workflow
67+
- Generate checksums based on R2 dist-prod contents rather than direct
68+
- Add better logging/auditing for release builds to track which artifacts were uploaded where and when
69+
- Create, or make better known, the documentation and sources of truth to consult during any further incidents like this
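
As one possible shape for the checksum verification items above, the sketch below compares per-file digests from two distribution sources. It is a hedged illustration only: the function name `find_mismatches` is hypothetical, and how the digests are collected from direct and R2 is deliberately left out.

```python
def find_mismatches(direct: dict[str, str], r2: dict[str, str]) -> dict[str, tuple]:
    """Return {filename: (direct_digest, r2_digest)} for files that differ or are missing on one side."""
    mismatches = {}
    for name in sorted(direct.keys() | r2.keys()):
        on_direct, on_r2 = direct.get(name), r2.get(name)
        if on_direct != on_r2:
            mismatches[name] = (on_direct, on_r2)
    return mismatches


# Digests taken from this incident: direct held the second build, R2 the first.
direct_digests = {
    "node-v22.22.1.pkg": "ac8cb570db59cb399be96978c194f6c4fc91ffcf11a197ebd5461083c0cf1dfd",
}
r2_digests = {
    "node-v22.22.1.pkg": "1fbe9cd7e9fdce6cf150bbe59cb97a426434f7fb217135d10124a62bfb697448",
}
problems = find_mismatches(direct_digests, r2_digests)
```

A build or promotion step could refuse to complete whenever `problems` is non-empty.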
