Skip to content

fix(cmt): add /cmt_dispatch and /cleanup_e2e endpoints#90

Open
yasserfaraazkhan wants to merge 3 commits into
masterfrom
fix/cmt-direct-dispatch-and-cleanup-endpoint
Open

fix(cmt): add /cmt_dispatch and /cleanup_e2e endpoints#90
yasserfaraazkhan wants to merge 3 commits into
masterfrom
fix/cmt-direct-dispatch-and-cleanup-endpoint

Conversation

@yasserfaraazkhan
Copy link
Copy Markdown
Contributor

Problem

CMT Provisioner workflow_dispatch runs on mattermost/desktop and mattermost/mattermost-mobile get the workflow_run.requested webhook delivered to matterwick (verified, 202 OK), but compatibility-matrix-testing.yml is never dispatched as a follow-up. Two recent CMT runs on desktop confirm the symptom: run 25040877733 (2026-04-28) and run 26026866029 (2026-05-18). Mobile has 0 lifetime successful CMT follow-ups.

Root cause

server/workflow_run.go:93 reads payload.WorkflowRun.Inputs[""server_versions""] from the webhook payload:

serverVersionsStr, ok := payload.WorkflowRun.Inputs[""server_versions""]
if !ok || serverVersionsStr == """" {
    logger.Error(""No server_versions found in workflow inputs"")
    return
}

GitHub does not include workflow_dispatch inputs on the workflow_run object in webhook payloads. Verified against three independent sources:

  1. Octokit webhook schema — lists 35 fields, none named inputs.
  2. The actual REST API representation of run 26026866029 — same 35 fields, no inputs.
  3. The official GitHub webhook docs — no inputs in the schema.

So payload.WorkflowRun.Inputs deserializes to an empty map on every CMT delivery; the handler hits the return branch and never reaches dispatchCMTWorkflow. The webhook still 202s (handler returns cleanly), which is why this looked healthy.

The existing unit test workflow_run_test.go:88-105 constructed a fake payload with inputs, masking the issue.

Fix

Move CMT trigger off the webhook path and onto two new authenticated HTTP endpoints. The workflows call matterwick directly, carrying run_id, sha, ref, server_versions, and repo — everything the webhook can't.

Endpoint Auth header Called by
POST /cmt_dispatch X-Trigger-Token cmt-provisioner.yml in desktop + mobile
POST /cleanup_e2e X-Cleanup-Token compatibility-matrix-testing.yml in desktop + mobile

Both endpoints:

  • Reject 503 if their secret isn't configured (fail closed)
  • Validate token with subtle.ConstantTimeCompare
  • Return 202/200 immediately and do the work in a goroutine

The existing handleCMTWithServerVersions is reused without modification.

The broken requested branch in handleWorkflowRunEventWithInputs is removed. The completed branch is kept as a defensive fallback for stuck instances.

What this does NOT touch

  • Nightly schedule path
  • Master / release-* push path
  • PR-label E2E path

Those go through different code paths and (per the merged PR #89) should already be working post-deploy.

Companion PRs

Rollout

This PR is safe to land before the workflow PRs:

  • Without the secrets in the deployed config, both endpoints return 503, so an accidental early caller fails closed.
  • Existing /github_event / /cloud_webhooks / /shrug_wick / / routes are untouched.

Suggested order:

  1. Land this PR → no behavior change yet (secrets not deployed).
  2. Land + deploy gitops-platform secrets PR with real tokens.
  3. Add GitHub repo secrets (MATTERWICK_CMT_TRIGGER_SECRET, rotate MATTERWICK_CLEANUP_SECRET).
  4. Roll out matterwick deploy to pick up the new config.
  5. Land the desktop workflow PR.
  6. Land the mobile workflow PR.
  7. End-to-end smoke test by dispatching CMT Provisioner with v11.1.0, v11.2.0.

Testing

  • go build ./... — clean
  • go vet ./... — clean
  • Full existing test suite (go test ./...) — passes, no regressions
  • 18 new tests for the two endpoints, all passing:
    • Rejects when secret unconfigured (503)
    • Rejects missing / wrong token (403)
    • Rejects malformed JSON (400)
    • Rejects each missing field (400) including run_id=0, empty server_versions, empty repo
    • Rejects server_versions="",,,"" (parses to empty) (400)
    • Rejects unknown repo type (400)
    • /cleanup_e2e removes matching {repo}-cmt-{run_id}-* keys
    • /cleanup_e2e leaves unrelated keys (e.g. PR instances) untouched
    • /cleanup_e2e handles multiple keys for the same run_id
    • Constant-time comparison rejects same-length wrong tokens

The provisioning-goroutine happy path is not unit-tested here; that path is already covered end-to-end by e2e_dryrun_test.go with mocked cloud + GitHub clients.

Known limitations

  1. If matterwick restarts during the ~30 min provisioning goroutine, instances become orphans. The existing periodic cleanupStaleNonPRE2EInstances ticker reaps them by DNS pattern + age. Same behavior as today.
  2. CheckLimitRateAndAbortRequest() was deliberately not added to the new endpoints — they don't make GitHub API calls synchronously, so the rate check would block on a network round-trip with no benefit. Inline comment explains this on both handlers.

Test plan

  • Reviewer reads new endpoint handlers (server/cmt_dispatch.go, server/cleanup_e2e.go) for auth + parsing correctness
  • Reviewer reads the deleted workflow_run.go block and confirms removing the requested branch is safe
  • Run go test ./... locally → all pass
  • Deploy after secrets land in gitops + repo settings
  • Smoke: dispatch CMT Provisioner on desktop → confirm endpoint=/cmt_dispatch log line + matrix workflow dispatched ~30 min later → confirm endpoint=/cleanup_e2e log line at completion → confirm cloud instances destroyed

🤖 Generated with Claude Code

Yasser Khan and others added 2 commits May 21, 2026 14:46
GitHub's workflow_run webhook payload does not carry workflow_dispatch
inputs (verified against the workflow_run schema), so CMT Provisioner
runs were never able to start provisioning -- the server_versions input
was always read as empty and the handler returned without dispatching
compatibility-matrix-testing.yml.

This change moves CMT trigger out of the webhook path and onto two new
authenticated HTTP endpoints, called directly by the workflows:

  POST /cmt_dispatch  -- cmt-provisioner.yml POSTs server_versions plus
                         full context here; matterwick provisions in a
                         goroutine and returns 202 immediately.
  POST /cleanup_e2e   -- compatibility-matrix-testing.yml POSTs the
                         CMT provisioner's run_id here on completion;
                         matterwick destroys tracked instances under
                         {repo}-cmt-{run_id}-* keys.

Auth: each endpoint takes a shared secret in a request header
(X-Trigger-Token, X-Cleanup-Token) compared with constant-time
comparison. Endpoints reject 503 if the corresponding secret is
unconfigured rather than running unauthenticated.

The broken "requested" branch in handleWorkflowRunEventWithInputs that
read payload.WorkflowRun.Inputs["server_versions"] is removed. The
"completed" branch is kept as a defensive fallback for stuck instances.

Co-Authored-By: Claude <noreply@anthropic.com>
Adds 18 unit tests covering the synchronous validation surface of both
new endpoints: missing/wrong tokens, malformed JSON, missing required
fields, edge inputs (",,," server_versions, unknown repo type, zero
run_id), and the cleanup map-deletion logic across matching,
non-matching, and multi-key scenarios.

Also drops the CheckLimitRateAndAbortRequest call from both handlers.
The /github_event handler keeps it because synchronous webhook
responses may need to make GitHub API calls before answering, but the
new endpoints only call GitHub from background goroutines (cmt_dispatch)
or talk to the cloud provisioner (cleanup_e2e). Keeping the rate check
would force every request to block on a live GitHub round-trip, which
makes the endpoints untestable in unit tests (no live token) for no
production benefit. Inline comments explain the rationale.

End-to-end coverage of the happy path (provisioning + workflow
dispatch) is deliberately left to e2e_dryrun_test.go, which already
mocks the cloud and github clients; duplicating that mocking here adds
noise without catching anything new. The validation rejection paths
above are the bug-prone surface.

Co-Authored-By: Claude <noreply@anthropic.com>
@mm-cloud-bot mm-cloud-bot added the kind/bug Categorizes issue or PR as related to a bug. label May 25, 2026
@mm-cloud-bot
Copy link
Copy Markdown

@yasserfaraazkhan: Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Details

I understand the commands that are listed here

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 25, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8d79ebee-a14c-4137-ba55-d5f692cb423f

📥 Commits

Reviewing files that changed from the base of the PR and between 1dd004a and d0e3d60.

📒 Files selected for processing (1)
  • server/workflow_run.go

📝 Walkthrough

Walkthrough

Adds token-protected /cmt_dispatch and /cleanup_e2e HTTP endpoints, two config secrets, tests for handler validation, and refactors workflow_run handling to remove provisioning triggers and rely on SHA-based cleanup.

Changes

CMT Provisioning and Cleanup Flow

Layer / File(s) Summary
Configuration and Route Registration
server/config.go, server/server.go
Configuration adds CMTTriggerSecret and CleanupSecret fields; router registers POST /cmt_dispatch and /cleanup_e2e with comments documenting webhook sources.
CMT Dispatch Endpoint
server/cmt_dispatch.go, server/cmt_dispatch_test.go
New /cmt_dispatch handler validates X-Trigger-Token, parses CMTDispatchRequest (owner/repo/sha/ref/run_id/server_versions), derives instance type from repo name, asynchronously provisions via handleCMTWithServerVersions, and returns HTTP 202. Tests cover authentication, JSON validation, field validation, server version parsing, and repo type detection.
E2E Cleanup Endpoint
server/cleanup_e2e.go, server/cleanup_e2e_test.go
New /cleanup_e2e handler validates X-Cleanup-Token, parses CleanupE2ERequest (repo/run_id), locates and removes matching CMT tracking keys from e2eInstances map under lock, asynchronously destroys instances, and returns HTTP 200 with status. Tests cover authentication, JSON validation, field validation, key removal semantics, and token comparison.
Workflow Handler Refactoring
server/workflow_run.go
Removes CMT provisioning from workflow_run "requested" path; now only uses completed action as a defensive cleanup fallback and always performs SHA-based cleanup for E2E workflows.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • mattermost/matterwick#89: Modifies handleWorkflowRunEventWithInputs cleanup behavior between mw_tracking_key-based cleanup and SHA-based destruction.

Suggested labels

release-note-none

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 8.70% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely describes the main change: adding two new HTTP endpoints (/cmt_dispatch and /cleanup_e2e) to fix the CMT provisioning workflow.
Description check ✅ Passed The description is comprehensive and directly related to the changeset, thoroughly explaining the problem, root cause, fix, testing, and deployment strategy.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/cmt-direct-dispatch-and-cleanup-endpoint

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
server/cmt_dispatch_test.go (1)

47-57: 💤 Low value

Consider using httptest.NewRequestWithContext to satisfy linter.

Static analysis flags httptest.NewRequest usage. While functionally equivalent in tests, using NewRequestWithContext aligns with best practices and silences the linter.

♻️ Suggested fix
 func doCMTDispatchRequest(t *testing.T, s *Server, body, token string) *httptest.ResponseRecorder {
 	t.Helper()
-	req := httptest.NewRequest(http.MethodPost, "/cmt_dispatch", bytes.NewBufferString(body))
+	req := httptest.NewRequestWithContext(context.Background(), http.MethodPost, "/cmt_dispatch", bytes.NewBufferString(body))
 	req.Header.Set("Content-Type", "application/json")
 	if token != "" {
 		req.Header.Set("X-Trigger-Token", token)
 	}
 	w := httptest.NewRecorder()
 	s.handleCMTDispatch(w, req)
 	return w
 }

You'll need to add "context" to the imports.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/cmt_dispatch_test.go` around lines 47 - 57, The test helper
doCMTDispatchRequest uses httptest.NewRequest which the linter flags; update
doCMTDispatchRequest to use httptest.NewRequestWithContext with a background
context (e.g., context.Background()) when creating the request, and add
"context" to the imports; specifically modify the call in doCMTDispatchRequest
(the NewRequest call that constructs req) to NewRequestWithContext and ensure
the X-Trigger-Token header handling and subsequent call to s.handleCMTDispatch
remain unchanged.
server/cleanup_e2e_test.go (1)

35-45: 💤 Low value

Consider using httptest.NewRequestWithContext to satisfy linter.

Same as in cmt_dispatch_test.go, static analysis flags httptest.NewRequest. Using NewRequestWithContext would align with the linter expectations.

♻️ Suggested fix
 func doCleanupE2ERequest(t *testing.T, s *Server, body, token string) *httptest.ResponseRecorder {
 	t.Helper()
-	req := httptest.NewRequest(http.MethodPost, "/cleanup_e2e", bytes.NewBufferString(body))
+	req := httptest.NewRequestWithContext(context.Background(), http.MethodPost, "/cleanup_e2e", bytes.NewBufferString(body))
 	req.Header.Set("Content-Type", "application/json")
 	if token != "" {
 		req.Header.Set("X-Cleanup-Token", token)
 	}
 	w := httptest.NewRecorder()
 	s.handleCleanupE2E(w, req)
 	return w
 }

You'll need to add "context" to the imports.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/cleanup_e2e_test.go` around lines 35 - 45, Replace httptest.NewRequest
with httptest.NewRequestWithContext in doCleanupE2ERequest and pass a context
(e.g., context.Background()); also add "context" to the imports. Specifically,
update the call in doCleanupE2ERequest (the helper that builds the POST to
"/cleanup_e2e") to use httptest.NewRequestWithContext(context.Background(),
http.MethodPost, ... ) and ensure the package imports include "context".
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@server/cleanup_e2e_test.go`:
- Around line 35-45: Replace httptest.NewRequest with
httptest.NewRequestWithContext in doCleanupE2ERequest and pass a context (e.g.,
context.Background()); also add "context" to the imports. Specifically, update
the call in doCleanupE2ERequest (the helper that builds the POST to
"/cleanup_e2e") to use httptest.NewRequestWithContext(context.Background(),
http.MethodPost, ... ) and ensure the package imports include "context".

In `@server/cmt_dispatch_test.go`:
- Around line 47-57: The test helper doCMTDispatchRequest uses
httptest.NewRequest which the linter flags; update doCMTDispatchRequest to use
httptest.NewRequestWithContext with a background context (e.g.,
context.Background()) when creating the request, and add "context" to the
imports; specifically modify the call in doCMTDispatchRequest (the NewRequest
call that constructs req) to NewRequestWithContext and ensure the
X-Trigger-Token header handling and subsequent call to s.handleCMTDispatch
remain unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7b08b625-de55-47f7-b1f4-061630a8a168

📥 Commits

Reviewing files that changed from the base of the PR and between 68ce3b7 and 1dd004a.

📒 Files selected for processing (7)
  • server/cleanup_e2e.go
  • server/cleanup_e2e_test.go
  • server/cmt_dispatch.go
  • server/cmt_dispatch_test.go
  • server/config.go
  • server/server.go
  • server/workflow_run.go

The "Primary" lookup at handleWorkflowRunEventWithInputs read
payload.WorkflowRun.Inputs["mw_tracking_key"], which is the same field
that broke CMT provisioning -- GitHub's workflow_run webhook payload
does not include workflow_dispatch inputs (verified against the Octokit
schema and the REST API for a real workflow run). The field is always
empty in production, so the "primary" branch never fires; only the
SHA-based fallback runs. That fallback works because every tracking-key
format ends with -{sha}.

Removing the dead read eliminates the misleading "Primary"/"Fallback"
labels and makes the SHA-based scan the documented canonical mechanism.
The mw_tracking_key value is still written when dispatching test
workflows -- it remains queryable via the REST API for any future code
path that wants to do a key-based lookup -- but the webhook-driven
cleanup intentionally does not depend on it.

Co-Authored-By: Claude <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do-not-merge/release-note-label-needed kind/bug Categorizes issue or PR as related to a bug.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants