Skip to content

fix(cicd): allow skipped Deploy-to-Morpheus-P-Node to satisfy C-Node deploy needs#716

Merged
nomadicrogue merged 2 commits into
testfrom
dev
Apr 22, 2026
Merged

fix(cicd): allow skipped Deploy-to-Morpheus-P-Node to satisfy C-Node deploy needs#716
nomadicrogue merged 2 commits into
testfrom
dev

Conversation

@abs2023
Copy link
Copy Markdown
Collaborator

@abs2023 abs2023 commented Apr 22, 2026

Summary

Promotes the #715 hotfix from devtest so we can retry the v7 test deploy and actually exercise the new drain / hold / re-register sequence end-to-end.

What's in this PR

Single-commit delta ahead of test:

Everything else in dev is already on test from #714.

Background

First v7 test run (Actions run #1447 / run id 24796856145) revealed a correctness gap in the workflow from #713/#714:

  • Drain-Morpheus-C-Node ran and successfully drained the dev C-Node targets from both NLB target groups.
  • Deploy-to-Morpheus-P-Node was correctly skipped (main-only).
  • Deploy-to-Morpheus-C-Node was silently skipped because the implicit success() guard propagated the skip from the P-Node job — GitHub Actions treats skipped needs as non-success, contrary to the inline comment in the previous PR.
  • Dev C-Node remained functional in ECS but without LB membership until manually re-registered.

Fix (this PR carries)

Explicit if guard on Deploy-to-Morpheus-C-Node that:

  • Requires real success from GHCR-Build-and-Push and Drain-Morpheus-C-Node.
  • Accepts either success or skipped as the outcome for Deploy-to-Morpheus-P-Node.
  • Wraps everything in !cancelled() to short-circuit on manual cancel.

The misleading comment is replaced with an accurate explanation referencing this incident.

Expected behavior after merge to test

On push to test the sequence becomes:

  1. Build + test + GHCR push (~10–20 min).
  2. Drain-Morpheus-C-Node ✅ deregisters the existing dev C-Node targets.
  3. Deploy-to-Morpheus-P-Node skipped (expected).
  4. Deploy-to-Morpheus-C-Noderuns this time because the new if accepts the P-Node skip:
    • update-service with --health-check-grace-period-seconds 600.
    • Poll new task ENI IP.
    • Deregister IP from TGs.
    • Sleep 90s (cnode_rehydration_wait_secs).
    • Re-register IP, wait target-in-service.
    • Public /healthcheck version match on router.dev.mor.org:8082.
  5. Deploy-to-Titan + Deploy-TEE-SecretVM-test run in parallel to the deploy sequence (unchanged).

On dev the rehydration hold is effectively a no-op because EFS-backed BadgerDB is persistent there — perfect for a first validation of the workflow plumbing without depending on rehydration correctness.

Review focus

  • The new if block on Deploy-to-Morpheus-C-Node (lines ~1298–1320 of build.yml).
  • Reasoning captured in the inline comment.
  • No changes to deploy logic, drain logic, or sequencing — only the guard.

Test plan (after you merge to test)

  1. Merge triggers the deploy workflow against dev.
  2. Confirm Deploy-to-Morpheus-Consumer enters in_progress (not skipped) after the drain completes.
  3. Watch the controlled-traffic sequence:
    • New task gets an ENI IP.
    • deregister-targets runs.
    • 90s hold.
    • register-targets + wait target-in-service succeed.
    • /healthcheck version-match passes on the new image tag.
  4. If green, proceed with testmain for the first prd exercise (existing open PR).

Related

Made with Cursor

abs2023 and others added 2 commits April 22, 2026 16:07
…deploy needs

Root cause of the first v7 test-branch run (GH Actions run 24796856145):
the drain job ran and deregistered the dev C-Node from both NLB target
groups, then Deploy-to-Morpheus-C-Node was silently skipped because its
implicit success() guard propagated the skip from Deploy-to-Morpheus-P-Node
(which is main-only and intentionally skips on test pushes). That left
dev with the old C-Node task running in ECS but with no load-balancer
membership until we manually re-registered it.

Fix:

- Add an explicit `if` to Deploy-to-Morpheus-C-Node that:
  - Requires GHCR build + drain success.
  - Accepts Deploy-to-Morpheus-P-Node.result in {success, skipped}.
  - Wraps in `!cancelled()` so manual cancels still short-circuit.
- Rewrite the inline comment that previously (incorrectly) claimed
  skipped jobs are treated as successful for dependency resolution;
  they are not.

No change to the deploy logic itself, the drain job, or any other
workflow sequencing. This is a pure `needs` / guard correctness fix.

Made-with: Cursor
…deploy needs (#715)

## Summary

Hotfix for the first v7 `test`-branch deployment attempt ([Actions run
#1447 / run id
24796856145](https://github.com/MorpheusAIs/Morpheus-Lumerin-Node/actions/runs/24796856145)).

The previous PR (#714#713) introduced a new job sequence:

```
Drain-Morpheus-C-Node → Deploy-to-Morpheus-P-Node → Deploy-to-Morpheus-C-Node
```

On push to `test`, `Deploy-to-Morpheus-P-Node` is correctly **skipped**
(its own `if` restricts it to `main` — no dev P-Node exists). That skip
then propagated onto `Deploy-to-Morpheus-C-Node` through the implicit
`success()` guard that every job has when no explicit `if` tolerates a
skipped dependency. Net effect:

- `Drain-Morpheus-C-Node` ✅ removed the running dev C-Node task from
both NLB target groups.
- `Deploy-to-Morpheus-C-Node` ❌ silently skipped — no new task
registered, no re-register of the existing task.
- Dev C-Node endpoints (`router.dev.mor.org:8082`/`:8545`) stopped
serving traffic until we manually re-registered the old task to the TGs.

## What this PR changes

`.github/workflows/build.yml`, `Deploy-to-Morpheus-C-Node` job only:

### Explicit `if` guard that tolerates the P-Node skip

```yaml
if: |
  !cancelled() &&
  github.repository == 'MorpheusAIs/Morpheus-Lumerin-Node' &&
  needs.GHCR-Build-and-Push.result == 'success' &&
  needs.Drain-Morpheus-C-Node.result == 'success' &&
  (needs.Deploy-to-Morpheus-P-Node.result == 'success' || needs.Deploy-to-Morpheus-P-Node.result == 'skipped') &&
  (
    (github.event_name == 'push' && (github.ref == 'refs/heads/main' || github.ref == 'refs/heads/test')) ||
    (github.event_name == 'workflow_dispatch' && github.event.inputs.build_all_os == 'true' && github.event.inputs.create_deployment == 'true')
  )
```

- Still requires real success from the drain + GHCR jobs.
- Explicitly accepts `success` OR `skipped` for the P-Node dependency.
- `!cancelled()` short-circuits the job if someone cancels the run, so
we don't try to redeploy a half-cancelled sequence.

### Comment rewrite

The previous inline comment claimed skipped jobs are treated as
successful for dependency resolution. That's the opposite of how GitHub
Actions actually behaves — the note is replaced with an accurate
explanation and a reference to this incident so future-us (or other
maintainers) won't step on the same rake.

## What this PR does NOT change

- No change to the drain job or the controlled-traffic C-Node deploy
sequence (register→dereg→hold→rereg→wait).
- No change to the P-Node job or its `if` gating.
- No change to any other workflow or deploy script.

This is a pure guard-correctness fix.

## Test plan

- [x] YAML syntax validated (`yaml.safe_load`).
- [x] Manually walked through `needs` outcomes:
- On push to `test`: P-Node `skipped`, drain + GHCR `success` → C-Node
**runs**. ✅
  - On push to `main`: all three `success` → C-Node **runs**. ✅
- Drain fails → C-Node **skipped** (won't redeploy while the TG state is
ambiguous). ✅
  - GHCR fails → C-Node **skipped**. ✅
- P-Node fails on main (not skipped) → C-Node **skipped** (we want to
bail rather than leave prd with a v-mismatch between providers and the
consumer). ✅
- [ ] Merge to `dev`, promote to `test`, confirm
`Deploy-to-Morpheus-Consumer` actually runs this time and completes the
drain → hold → re-register → /healthcheck cycle end-to-end.
- [ ] Once green on test, cut `test` → `main` for first prd exercise.

## Related

- Previous PR: #714 (dev → test) carrying #713 (initial
drain/sequence/hold).
- Companion IAM + planning doc already applied in `Morpheus-Infra`.

Made with [Cursor](https://cursor.com)
@nomadicrogue nomadicrogue merged commit 2ddc125 into test Apr 22, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants