fix(cmt): add /cmt_dispatch and /cleanup_e2e endpoints#90
fix(cmt): add /cmt_dispatch and /cleanup_e2e endpoints#90yasserfaraazkhan wants to merge 3 commits into
Conversation
GitHub's workflow_run webhook payload does not carry workflow_dispatch
inputs (verified against the workflow_run schema), so CMT Provisioner
runs were never able to start provisioning -- the server_versions input
was always read as empty and the handler returned without dispatching
compatibility-matrix-testing.yml.
This change moves CMT trigger out of the webhook path and onto two new
authenticated HTTP endpoints, called directly by the workflows:
POST /cmt_dispatch -- cmt-provisioner.yml POSTs server_versions plus
full context here; matterwick provisions in a
goroutine and returns 202 immediately.
POST /cleanup_e2e -- compatibility-matrix-testing.yml POSTs the
CMT provisioner's run_id here on completion;
matterwick destroys tracked instances under
{repo}-cmt-{run_id}-* keys.
Auth: each endpoint takes a shared secret in a request header
(X-Trigger-Token, X-Cleanup-Token) compared with constant-time
comparison. Endpoints reject 503 if the corresponding secret is
unconfigured rather than running unauthenticated.
The broken "requested" branch in handleWorkflowRunEventWithInputs that
read payload.WorkflowRun.Inputs["server_versions"] is removed. The
"completed" branch is kept as a defensive fallback for stuck instances.
Co-Authored-By: Claude <noreply@anthropic.com>
Adds 18 unit tests covering the synchronous validation surface of both
new endpoints: missing/wrong tokens, malformed JSON, missing required
fields, edge inputs (",,," server_versions, unknown repo type, zero
run_id), and the cleanup map-deletion logic across matching,
non-matching, and multi-key scenarios.
Also drops the CheckLimitRateAndAbortRequest call from both handlers.
The /github_event handler keeps it because synchronous webhook
responses may need to make GitHub API calls before answering, but the
new endpoints only call GitHub from background goroutines (cmt_dispatch)
or talk to the cloud provisioner (cleanup_e2e). Keeping the rate check
would force every request to block on a live GitHub round-trip, which
makes the endpoints untestable in unit tests (no live token) for no
production benefit. Inline comments explain the rationale.
End-to-end coverage of the happy path (provisioning + workflow
dispatch) is deliberately left to e2e_dryrun_test.go, which already
mocks the cloud and github clients; duplicating that mocking here adds
noise without catching anything new. The validation rejection paths
above are the bug-prone surface.
Co-Authored-By: Claude <noreply@anthropic.com>
|
@yasserfaraazkhan: Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it. DetailsI understand the commands that are listed here |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughAdds token-protected ChangesCMT Provisioning and Cleanup Flow
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
server/cmt_dispatch_test.go (1)
47-57: 💤 Low valueConsider using
httptest.NewRequestWithContextto satisfy linter.Static analysis flags
httptest.NewRequestusage. While functionally equivalent in tests, usingNewRequestWithContextaligns with best practices and silences the linter.♻️ Suggested fix
func doCMTDispatchRequest(t *testing.T, s *Server, body, token string) *httptest.ResponseRecorder { t.Helper() - req := httptest.NewRequest(http.MethodPost, "/cmt_dispatch", bytes.NewBufferString(body)) + req := httptest.NewRequestWithContext(context.Background(), http.MethodPost, "/cmt_dispatch", bytes.NewBufferString(body)) req.Header.Set("Content-Type", "application/json") if token != "" { req.Header.Set("X-Trigger-Token", token) } w := httptest.NewRecorder() s.handleCMTDispatch(w, req) return w }You'll need to add
"context"to the imports.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@server/cmt_dispatch_test.go` around lines 47 - 57, The test helper doCMTDispatchRequest uses httptest.NewRequest which the linter flags; update doCMTDispatchRequest to use httptest.NewRequestWithContext with a background context (e.g., context.Background()) when creating the request, and add "context" to the imports; specifically modify the call in doCMTDispatchRequest (the NewRequest call that constructs req) to NewRequestWithContext and ensure the X-Trigger-Token header handling and subsequent call to s.handleCMTDispatch remain unchanged.server/cleanup_e2e_test.go (1)
35-45: 💤 Low valueConsider using
httptest.NewRequestWithContextto satisfy linter.Same as in
cmt_dispatch_test.go, static analysis flagshttptest.NewRequest. UsingNewRequestWithContextwould align with the linter expectations.♻️ Suggested fix
func doCleanupE2ERequest(t *testing.T, s *Server, body, token string) *httptest.ResponseRecorder { t.Helper() - req := httptest.NewRequest(http.MethodPost, "/cleanup_e2e", bytes.NewBufferString(body)) + req := httptest.NewRequestWithContext(context.Background(), http.MethodPost, "/cleanup_e2e", bytes.NewBufferString(body)) req.Header.Set("Content-Type", "application/json") if token != "" { req.Header.Set("X-Cleanup-Token", token) } w := httptest.NewRecorder() s.handleCleanupE2E(w, req) return w }You'll need to add
"context"to the imports.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@server/cleanup_e2e_test.go` around lines 35 - 45, Replace httptest.NewRequest with httptest.NewRequestWithContext in doCleanupE2ERequest and pass a context (e.g., context.Background()); also add "context" to the imports. Specifically, update the call in doCleanupE2ERequest (the helper that builds the POST to "/cleanup_e2e") to use httptest.NewRequestWithContext(context.Background(), http.MethodPost, ... ) and ensure the package imports include "context".
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@server/cleanup_e2e_test.go`:
- Around line 35-45: Replace httptest.NewRequest with
httptest.NewRequestWithContext in doCleanupE2ERequest and pass a context (e.g.,
context.Background()); also add "context" to the imports. Specifically, update
the call in doCleanupE2ERequest (the helper that builds the POST to
"/cleanup_e2e") to use httptest.NewRequestWithContext(context.Background(),
http.MethodPost, ... ) and ensure the package imports include "context".
In `@server/cmt_dispatch_test.go`:
- Around line 47-57: The test helper doCMTDispatchRequest uses
httptest.NewRequest which the linter flags; update doCMTDispatchRequest to use
httptest.NewRequestWithContext with a background context (e.g.,
context.Background()) when creating the request, and add "context" to the
imports; specifically modify the call in doCMTDispatchRequest (the NewRequest
call that constructs req) to NewRequestWithContext and ensure the
X-Trigger-Token header handling and subsequent call to s.handleCMTDispatch
remain unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 7b08b625-de55-47f7-b1f4-061630a8a168
📒 Files selected for processing (7)
server/cleanup_e2e.goserver/cleanup_e2e_test.goserver/cmt_dispatch.goserver/cmt_dispatch_test.goserver/config.goserver/server.goserver/workflow_run.go
The "Primary" lookup at handleWorkflowRunEventWithInputs read
payload.WorkflowRun.Inputs["mw_tracking_key"], which is the same field
that broke CMT provisioning -- GitHub's workflow_run webhook payload
does not include workflow_dispatch inputs (verified against the Octokit
schema and the REST API for a real workflow run). The field is always
empty in production, so the "primary" branch never fires; only the
SHA-based fallback runs. That fallback works because every tracking-key
format ends with -{sha}.
Removing the dead read eliminates the misleading "Primary"/"Fallback"
labels and makes the SHA-based scan the documented canonical mechanism.
The mw_tracking_key value is still written when dispatching test
workflows -- it remains queryable via the REST API for any future code
path that wants to do a key-based lookup -- but the webhook-driven
cleanup intentionally does not depend on it.
Co-Authored-By: Claude <noreply@anthropic.com>
Problem
CMT Provisioner workflow_dispatch runs on
mattermost/desktopandmattermost/mattermost-mobileget theworkflow_run.requestedwebhook delivered to matterwick (verified, 202 OK), butcompatibility-matrix-testing.ymlis never dispatched as a follow-up. Two recent CMT runs on desktop confirm the symptom: run 25040877733 (2026-04-28) and run 26026866029 (2026-05-18). Mobile has 0 lifetime successful CMT follow-ups.Root cause
server/workflow_run.go:93readspayload.WorkflowRun.Inputs[""server_versions""]from the webhook payload:GitHub does not include workflow_dispatch inputs on the
workflow_runobject in webhook payloads. Verified against three independent sources:inputs.inputs.inputsin the schema.So
payload.WorkflowRun.Inputsdeserializes to an empty map on every CMT delivery; the handler hits thereturnbranch and never reachesdispatchCMTWorkflow. The webhook still 202s (handler returns cleanly), which is why this looked healthy.The existing unit test workflow_run_test.go:88-105 constructed a fake payload with
inputs, masking the issue.Fix
Move CMT trigger off the webhook path and onto two new authenticated HTTP endpoints. The workflows call matterwick directly, carrying
run_id,sha,ref,server_versions, andrepo— everything the webhook can't.POST /cmt_dispatchX-Trigger-Tokencmt-provisioner.ymlin desktop + mobilePOST /cleanup_e2eX-Cleanup-Tokencompatibility-matrix-testing.ymlin desktop + mobileBoth endpoints:
subtle.ConstantTimeCompareThe existing
handleCMTWithServerVersionsis reused without modification.The broken
requestedbranch inhandleWorkflowRunEventWithInputsis removed. Thecompletedbranch is kept as a defensive fallback for stuck instances.What this does NOT touch
Those go through different code paths and (per the merged PR #89) should already be working post-deploy.
Companion PRs
/cmt_dispatch)CMTTriggerSecretandCleanupSecretto matterwicksecret.yamlRollout
This PR is safe to land before the workflow PRs:
/github_event//cloud_webhooks//shrug_wick//routes are untouched.Suggested order:
MATTERWICK_CMT_TRIGGER_SECRET, rotateMATTERWICK_CLEANUP_SECRET).v11.1.0, v11.2.0.Testing
go build ./...— cleango vet ./...— cleango test ./...) — passes, no regressionsrun_id=0, emptyserver_versions, emptyreposerver_versions="",,,""(parses to empty) (400)/cleanup_e2eremoves matching{repo}-cmt-{run_id}-*keys/cleanup_e2eleaves unrelated keys (e.g. PR instances) untouched/cleanup_e2ehandles multiple keys for the samerun_idThe provisioning-goroutine happy path is not unit-tested here; that path is already covered end-to-end by
e2e_dryrun_test.gowith mocked cloud + GitHub clients.Known limitations
cleanupStaleNonPRE2EInstancesticker reaps them by DNS pattern + age. Same behavior as today.CheckLimitRateAndAbortRequest()was deliberately not added to the new endpoints — they don't make GitHub API calls synchronously, so the rate check would block on a network round-trip with no benefit. Inline comment explains this on both handlers.Test plan
server/cmt_dispatch.go,server/cleanup_e2e.go) for auth + parsing correctnessworkflow_run.goblock and confirms removing therequestedbranch is safego test ./...locally → all passendpoint=/cmt_dispatchlog line + matrix workflow dispatched ~30 min later → confirmendpoint=/cleanup_e2elog line at completion → confirm cloud instances destroyed🤖 Generated with Claude Code