Release v7.0.0: Full TEE Capability — Phase 2 Backend LLM Attestation#712
Merged
…mprove GC reclaim
#692) …mprove GC reclaim This MR fixes BadgerDB value-log growth caused by rewriting a single growing activity blob on every prompt. Activity records are now stored as individual TTL keys, enabling efficient expiration and much better garbage collection behavior. It also tunes Badger GC settings and adds focused tests to ensure capacity logic remains correct with the new storage layout.
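To see why rewriting one growing blob inflates the value log, the toy model below counts the bytes appended under each layout. It is a rough sketch in Python, not Badger itself: `RECORD_SIZE` and both function names are illustrative assumptions, and real value-log accounting includes per-entry overhead that is ignored here.

```python
RECORD_SIZE = 64  # assumed size of one activity record, in bytes


def bytes_written_blob(prompts: int, record_size: int = RECORD_SIZE) -> int:
    """Old layout: the whole (growing) blob is rewritten on every prompt."""
    total = 0
    for n in range(1, prompts + 1):
        total += n * record_size  # after n prompts the blob holds n records
    return total


def bytes_written_per_key(prompts: int, record_size: int = RECORD_SIZE) -> int:
    """New layout: one small TTL key is appended per prompt."""
    return prompts * record_size


if __name__ == "__main__":
    for prompts in (10, 100, 1000):
        print(prompts, bytes_written_blob(prompts), bytes_written_per_key(prompts))
```

Under this model the blob layout grows quadratically with the number of prompts while the per-key layout grows linearly, which is the gap the garbage collector previously had to reclaim.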
- Updated the ECS service stabilization wait time to approximately 12.5 minutes with a maximum of 50 attempts, enhancing reliability during deployments.
- Removed deprecated pause and resume steps for Active Models refresh, streamlining the deployment process.
- Added comments for clarity on the stabilization process and error handling in the script.
…ng (#694)

- Updated the ECS service stabilization wait time to approximately 12.5 minutes with a maximum of 50 attempts, enhancing reliability during deployments.
- Removed deprecated pause and resume steps for Active Models refresh, streamlining the deployment process.
- Added comments for clarity on the stabilization process and error handling in the script.
…RAS, and TLS pinning
## Summary

Extends TEE attestation from the provider node (Phase 1) to the **backend LLM endpoints**, creating a full trust chain: hardware -> firmware -> OS -> workload -> TLS connection.

### What it does

- **Backend attestation** (`AttestBackend`): at startup, verifies each TEE-marked model's backend via CPU quote (SecretAI portal), TLS cert binding, docker-compose workload verification, GPU attestation (CPU-GPU nonce binding), and NVIDIA NRAS.
- **Per-prompt fast verify** (`FastVerifyBackend`): on every inference request, re-fetches the CPU quote and compares its hash + TLS fingerprint against the cached snapshot. Full re-attestation only triggers when something changes.
- **Workload verification**: downloads the SecretVM TDX artifact registry from GitHub, parses TDX quotes to extract MRTD/RTMR0-3, and replays the RTMR3 measurement from the backend's docker-compose.yaml to prove the exact workload running inside the TEE.
- **NVIDIA NRAS**: sends GPU evidence to NVIDIA's Remote Attestation Service (v4 API) for independent hardware verification (non-fatal if unreachable).
- **Health endpoint**: `GET /v1/models/attestation` reports per-model attestation status including workload verification results.

### Configuration

New env vars: `ARTIFACT_REGISTRY_URL`, `ARTIFACT_REGISTRY_REFRESH_INTERVAL` (optional).
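The per-prompt fast-verify idea (compare a hash of the fresh quote plus the TLS fingerprint against a cached snapshot, and only fall back to full attestation on a mismatch) can be sketched as follows. This is a minimal illustration in Python, not the Go implementation: the `Snapshot` type, the function names, and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """What is cached after a successful full attestation (hypothetical shape)."""
    quote_sha256: str
    tls_fingerprint: str


def take_snapshot(quote: bytes, tls_cert_der: bytes) -> Snapshot:
    # SHA-256 is an assumption; the real code may use a different digest.
    return Snapshot(
        quote_sha256=hashlib.sha256(quote).hexdigest(),
        tls_fingerprint=hashlib.sha256(tls_cert_der).hexdigest(),
    )


def fast_verify(cached: Snapshot, quote: bytes, tls_cert_der: bytes) -> bool:
    """Return True when nothing changed, so full re-attestation can be skipped."""
    return take_snapshot(quote, tls_cert_der) == cached


if __name__ == "__main__":
    cached = take_snapshot(b"quote-v1", b"cert-v1")
    print(fast_verify(cached, b"quote-v1", b"cert-v1"))  # unchanged backend
    print(fast_verify(cached, b"quote-v2", b"cert-v1"))  # quote changed
```

A `False` result would be the trigger for the slower full attestation path rather than an immediate rejection.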
…ck failures

Three bugs caused the TEST environment C-Node deploy to fail after only ~70 seconds instead of waiting for the full ECS task lifecycle:

1. `--max-attempts 50` is not a valid AWS CLI option for `aws ecs wait services-stable`; the command errored immediately instead of polling.
2. Health checks started instantly after the broken waiter, while the old task (stopTimeout=120s) was still draining.
3. `set -e` was active during the health-check loop, so a curl timeout (exit 28) on attempt 5 killed the entire script.

Changes (applied to all three deploy jobs: LMN, C-Node, P-Node):

- Remove invalid `--max-attempts 50` from `aws ecs wait` (default 40×15s = 10 min is sufficient).
- Add a 180-second deployment wait floor based on the ECS task lifecycle: stopTimeout=120s + deregistration_delay≤30s + ALB health threshold≈90s. The waiter runs first; if it finishes early (success or error), the remaining time is filled with a sleep to prevent wasting retries on the old task.
- Wrap the health-check loop in `set +e` / `set -e` so curl timeouts don't abort the script.

Made-with: Cursor
…ck failures (#701)

## Summary

- **Removes invalid `--max-attempts 50`** from `aws ecs wait services-stable`. This flag is not recognized by the AWS CLI and caused the waiter to error immediately instead of polling (the bug that triggered the premature deploy failure in TEST on 2026-04-10).
- **Adds a 180-second minimum deployment wait floor** before health-check polling, based on the actual ECS task lifecycle timing from Terraform (`02_mor_router_svc.tf`): `stopTimeout=120s` + `deregistration_delay≤30s` + ALB `healthy_threshold×interval≈90s` = ~4 min worst case. The `aws ecs wait` still runs first (default 40×15s = 10 min); if it finishes early the remaining time is filled with a sleep.
- **Wraps the health-check loop in `set +e` / `set -e`** so curl timeouts (exit code 28) during version verification no longer kill the entire script under bash `-e`.

Applied to all three deploy jobs: LMN (Titan), C-Node (Morpheus Consumer), and P-Node (Morpheus Provider).

### Root cause analysis (TEST deploy 2026-04-10 14:37 UTC)

| Step | What happened | Time |
|------|--------------|------|
| ECS service update issued | `update-service --force-new-deployment` | 14:37:15 |
| `aws ecs wait` fails immediately | `Unknown options: --max-attempts, 50` | 14:37:15 |
| Health checks start (no wait) | Old task (v6.2.2-test, 57h uptime) still running | 14:37:16 |
| Attempts 1–4 | Version mismatch (old task responding) | 14:37:16 – 14:38:01 |
| Attempt 5 | `curl --max-time 10` times out (exit 28), `set -e` kills script | 14:38:16 |
| **Total wall time before "failure"** | **~71 seconds**; ECS hadn't even stopped the old task yet | |

## Test plan

- [ ] Trigger a TEST branch deploy and confirm the waiter polls correctly (no `Unknown options` error)
- [ ] Verify the 180s floor wait appears in logs before health-check polling begins
- [ ] Confirm curl timeout during health check doesn't abort the script (visible as `⚠️ Health check failed (curl status: 28)` instead of `##[error] Process completed with exit code 28`)
- [ ] Verify version verification succeeds after the new task is up

Made with [Cursor](https://cursor.com)
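The wait-floor arithmetic described in this fix can be checked with a few lines of Python. This is only a sketch of the timing logic, not the deploy script: the constant and function names are hypothetical, and the three lifecycle figures are the ones quoted in the PR text.

```python
# Lifecycle figures quoted in the PR description (seconds).
STOP_TIMEOUT_S = 120
DEREGISTRATION_DELAY_S = 30
ALB_HEALTHY_S = 90

DEPLOY_WAIT_FLOOR_S = 180  # minimum wait before health-check polling


def worst_case_lifecycle_s() -> int:
    """Upper bound on how long the old task can linger."""
    return STOP_TIMEOUT_S + DEREGISTRATION_DELAY_S + ALB_HEALTHY_S


def remaining_floor_s(waiter_elapsed_s: float,
                      floor_s: float = DEPLOY_WAIT_FLOOR_S) -> float:
    """Seconds still to sleep after the waiter returns (success or error)."""
    return max(0.0, floor_s - waiter_elapsed_s)


if __name__ == "__main__":
    print(worst_case_lifecycle_s())   # 240 s, i.e. about 4 minutes
    print(remaining_floor_s(0))       # waiter failed instantly: sleep the full floor
    print(remaining_floor_s(600))     # waiter used its full 10 minutes: no extra sleep
```

Taking the maximum with zero means a slow waiter never produces a negative sleep, while a waiter that errors immediately (as in the 2026-04-10 run) still yields the full floor.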
## Summary

- Merges `dev` into `test` to pick up the ECS deploy wait timing fix from PR #701.
- Fixes the premature deploy failure seen in TEST on 2026-04-10 (only ~71s before aborting instead of waiting for ECS task lifecycle).

See #701 for full details.

Made with [Cursor](https://cursor.com)
- Incremented the major version number from 6 to 7 in the build workflow.
- Updated the funding account address in the mainnet configuration file.
- Introduced full two-hop Trusted Execution Environment (TEE) verification: Phase 1 (consumer to P-Node) and Phase 2 (P-Node to backend LLM).
- Updated documentation to reflect new TEE features, including detailed descriptions of the verification processes and guarantees.
- Added new sections in the README and various documentation files to clarify the TEE model tagging and its implications for consumers and providers.
- Incremented version number to v7.0.0 in configuration files and updated relevant documentation links.
- Improved CI/CD pipeline documentation to outline the automated verification processes for TEE models.

This release ensures that consumers using v6.0.0+ can seamlessly interact with v7.0.0+ providers without requiring client-side upgrades, enhancing security and trust in the Morpheus network.
#710)

## Summary

This PR brings three things into `dev` to complete the **v7.0.0 (full TEE) release** prep:

1. **CI/CD fix (#701 equivalent)**: Fix for ECS deploy wait timing that was premature-failing the post-deploy healthcheck (`8b341f3`). `#701` on `dev` is the squashed version; this branch adds nothing new there.
2. **v7.0.0 version bump + funding-account update** (`71fff3d`): Bumps `VMAJ_NEW` to `7` in `.github/workflows/build.yml` and updates the funding account in `smart-contracts/deploy/data/config_base_mainnet.json`. This is the major-version bump that marks "full TEE capability".
3. **Docs update for v7.0.0** (`fec3f92`): Comprehensive documentation pass to accurately reflect the shipped Phase 1 + Phase 2 TEE trust chain.

## Doc changes (v7.0.0)

Consistently clarifies the **two-hop trust chain** across all public and internal docs:

```
C-Node (v6.0.0+) ──Phase 1──▶ P-Node -tee image (v7.0.0+) ──Phase 2──▶ Backend LLM
                 consumer                                 P-Node
                 verifies                                 verifies its
                 P-Node                                   own backend
```

**Key correctness fix:** earlier drafts described the consumer as performing Phase 2 backend verification. This is wrong: **Phase 2 runs entirely inside the P-Node**; the consumer never talks to the backend. This means **v6.0.0+ consumers are forward-compatible with v7.0.0+ providers** and get Phase 2 guarantees transparently, with no client-side upgrade needed. Every doc now emphasizes this.

### User-facing docs updated

- `readme.md`: v7.0.0 release callout with the two-hop flow and forward-compat note
- `docs/02.3-proxy-router-tee.md`: rewrote "What This Guarantees (and What It Doesn't)" with explicit Phase 1 / Phase 2 / remaining-gaps sections
- `docs/02.4-proxy-router-secretvm-quickstart.md`: rewrote "What Consumers See, and What Your P-Node Does" as two distinct hops + v7 troubleshooting rows
- `docs/03-provider-offer.md`: clarified the `tee` on-chain tag drives both hops
- `docs/models-config.json.md`: noted that `isTee` is no longer a local config field (tag-driven)
- `docs/proxy-router.all.env`: documented new TEE env vars: `TEE_PORTAL_URL`, `TEE_IMAGE_REPO`, `ARTIFACT_REGISTRY_URL`, `ARTIFACT_REGISTRY_REFRESH_INTERVAL`

### Developer reference updated

- `proxy-router/docs/tee-backend-verification.md`: added "Where this fits in the trust chain" section clarifying this doc is **Phase 2 only**, entirely inside the P-Node

### Internal `.ai-docs` updated (technical record of "how")

- `.ai-docs/TEE_Attestation_Architecture.md`: status bumped to v2.0, Phase 1/1c/2a marked DONE with real file paths and PR numbers, new §7.7 with the full Phase 2 technical write-up (attestation endpoints, full AttestBackend sequence, fast-verify semantics, reportData layout), open questions resolved
- `.ai-docs/TEE_CICD_Supply_Chain_Hardening.md`: added v7.0.0 banner, replaced placeholder diagram with two completed diagrams for Phase 1c and Phase 2, reorganized status tables

## Test plan

- [x] Clean `ort`-strategy merge from `origin/dev` into this branch (no conflicts, 11-file delta)
- [x] `GOOS=linux GOARCH=amd64 go build -ldflags='-w' -o /tmp/proxy-router-linux ./cmd/` succeeds (67 MB binary, exit 0)
- [x] No new `go vet` warnings introduced by the merge (all warnings are pre-existing on `dev`)
- [x] `grep` verification: no stale `isTee` references, no "consumer verifies backend" misstatements, no `v6.0.0+` references where `v7.0.0+` is meant
- [ ] CI/CD pipeline runs green (ECS deploy + post-deploy RTMR3 verification)
- [ ] Reviewer spot-check of the new `.ai-docs/TEE_Attestation_Architecture.md` §7.7 against the actual `proxy-router/internal/attestation/` code

## Release notes

This is the v7.0.0 release bump. Downstream merges:

1. This PR: `fix/cicd-ecs-deploy-wait-timing` → `dev`
2. Follow-up: `dev` → `test` (separate PR, opens immediately after this merges)
3. After test validation: `test` → `main` for the v7.0.0 tag

Made with [Cursor](https://cursor.com)
…711)

## Summary

Propagates the v7.0.0 release prep from `dev` to `test` for validation ahead of the `test → main` cut. Four commits from `dev` (including the freshly-merged #710):

- `0fb81ec` - merge commit for #710: fix(cicd) ECS deploy wait timing + v7.0.0 release docs + version bump
- `fec3f92` - Enhance TEE capabilities and documentation for v7.0.0 release
- `e81f41d` - Merge origin/dev into fix/cicd-ecs-deploy-wait-timing (brings in #701/#703/#705/#708 history via the merge)
- `71fff3d` - chore: update version number and funding account in configuration (VMAJ_NEW → 7, mainnet funding account)

## What reviewers should focus on

1. **`VMAJ_NEW=7` in `.github/workflows/build.yml`**: this is the major-version bump that will tag the next test-branch build as `v7.x.x-beta` and, on promotion to main, as `v7.0.0`.
2. **`config_base_mainnet.json` funding-account change**: please confirm the address is the intended one before this reaches main.
3. **`readme.md` v7.0.0 callout**: verifies the two-hop trust chain story reads correctly to a new reader (C-Node v6+ → P-Node v7+ → Backend LLM).
4. **`docs/02.3-proxy-router-tee.md`** and **`docs/02.4-proxy-router-secretvm-quickstart.md`**: the Phase 1 / Phase 2 split. Especially the "What Consumers See, and What Your P-Node Does" section in 02.4, which was the section we reworked most heavily to correct the earlier wording that implied the consumer runs Phase 2.
5. **`.ai-docs/TEE_Attestation_Architecture.md` §7.7**: new technical write-up of the actual Phase 2 implementation (file paths, attestation endpoints, full sequence). Please sanity-check against `proxy-router/internal/attestation/` code.

## Trust-chain clarification (for reviewers)

The key correctness fix in the docs:

```
C-Node (v6.0.0+) ──Phase 1──▶ P-Node -tee image (v7.0.0+) ──Phase 2──▶ Backend LLM
                 consumer                                 P-Node
                 verifies                                 verifies its
                 P-Node                                   own backend
```

- **Phase 1** (shipped in v6.x, unchanged in v7): consumer's proxy-router verifies the provider's P-Node attestation (CPU quote, TLS binding, RTMR3 of the `-tee` image) at session open and every prompt.
- **Phase 2** (new in v7.0.0, shipped in #699): the **P-Node itself** verifies the backend LLM (CPU quote, TLS pinning, RTMR3 replay of the backend's `docker-compose.yaml`, CPU-GPU nonce binding, NVIDIA NRAS) at startup and every prompt.
- The on-chain `tee` model tag is the single switch that enables both hops.
- **v6.0.0+ consumers are forward-compatible with v7.0.0+ providers** and get Phase 2 guarantees transparently (Phase 2 is entirely inside the P-Node binary they already attested in Phase 1). No client-side upgrade is required for Phase 2.

## Test plan

- [ ] CI/CD builds v7.x.x-beta tag for the test branch
- [ ] Post-deploy RTMR3 verification against the live SecretVM test instance passes
- [ ] `Deploy-SecretVM-Test` job ECS-wait timing no longer premature-fails
- [ ] Spot-check `GET /v1/models/attestation` on a deployed test P-Node shows per-model Phase 2 state
- [ ] Reviewer walks through the v7 docs cold (without this PR's context) and confirms the two-hop trust chain is clearly explained

## Downstream

After validation on `test`, plan to promote with a `test → main` PR that tags v7.0.0.

Made with [Cursor](https://cursor.com)
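For readers unfamiliar with "RTMR3 replay", the general shape of a measurement-register replay is sketched below in Python. It relies only on the standard extend rule for TDX runtime measurement registers (new value = SHA-384 of the old value concatenated with a 48-byte event digest). Which events SecretVM actually extends into RTMR3, their order, and how the `docker-compose.yaml` bytes are turned into an event are assumptions here, so treat this as an illustration of the technique rather than the project's verifier.

```python
import hashlib
from typing import Iterable

RTMR_SIZE = 48  # SHA-384 digest length in bytes


def extend(rtmr: bytes, event_data: bytes) -> bytes:
    """Extend a register: hash the event, then hash (old register || digest)."""
    digest = hashlib.sha384(event_data).digest()
    return hashlib.sha384(rtmr + digest).digest()


def replay(events: Iterable[bytes]) -> bytes:
    """Recompute the register from an all-zero start over an ordered event list."""
    rtmr = bytes(RTMR_SIZE)
    for event in events:
        rtmr = extend(rtmr, event)
    return rtmr


if __name__ == "__main__":
    # Hypothetical event: the raw bytes of a compose file.
    compose = b"services:\n  llm:\n    image: example/llm:latest\n"
    expected = replay([compose])
    # A verifier would compare this value with RTMR3 parsed from the TDX quote.
    print(expected.hex())
```

Because each step folds in the previous value, any change to the compose bytes or to the event order yields a different final register, which is what lets the replayed value stand in for "the exact workload".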
…ehydration hold
Prevents the mid-deploy "messy outage" we've seen when the C-Node and P-Node
restart simultaneously and the C-Node BadgerDB ends up referencing sessions
whose upstream providers have just cycled.
Changes to .github/workflows/build.yml:
- New job `Drain-Morpheus-C-Node`: discovers the C-Node service's target
groups and deregisters all current targets before any node restarts.
connection_termination=true on the router TGs closes live TCP sessions
promptly so upstream failures arrive as clean 503s instead of half-dead
mid-stream hangs.
- `Deploy-to-Morpheus-P-Node` now depends on `Drain-Morpheus-C-Node` so the
C-Node NLB is quiet before provider traffic is disrupted.
- `Deploy-to-Morpheus-C-Node` now depends on both the drain job and the
P-Node deploy (skipped on `test`, which is still treated as success by
GitHub Actions dependency resolution).
- The C-Node deploy step itself is rewritten to:
1. Register the new task def with `--health-check-grace-period-seconds 600`
so the ECS deployment circuit breaker tolerates the deregister window.
2. Poll until a task on the NEW task definition is RUNNING and has an ENI IP.
3. Deregister that IP from every target group.
4. Sleep `cnode_rehydration_wait_secs` (default 90s) so the proxy-router's
BadgerDB rehydration loop can catch up from on-chain state (ephemeral
BadgerDB on prd, EFS-backed on dev).
5. Re-register the IP to every TG and wait for `target-in-service`.
6. Fall through to the existing public `/healthcheck` version-match loop.
- New workflow_dispatch input `cnode_rehydration_wait_secs` (default "90")
lets operators tune the hold window per run.
Companion changes (separate PR in Morpheus-Infra):
- GitHub Actions IAM policy now grants ELBv2 Describe + Register/Deregister
on Target Groups so the drain/register steps have the permissions they need.
- Planning doc CICD_HA_IMPROVEMENTS_PLAN.md captures the deferred items
(P-Node HA, graceful shutdown, API GW retry tuning, persistent BadgerDB
promotion) that this change does NOT address.
Made-with: Cursor
…ehydration hold (#713)

## Summary

Rewrites the C-Node / P-Node deployment sequencing in `build.yml` so the C-Node NLB is fully drained **before** any provider restart, and so the new C-Node task is held out of the load balancer for a configurable rehydration window after it comes up.

Addresses the recurring "messy 5-minute" outage pattern where simultaneous C-Node and P-Node restarts leave the C-Node's BadgerDB with orphaned session state and require manual cleanup.

## Changes

### New job: `Drain-Morpheus-C-Node`

- Discovers the C-Node service's target groups dynamically (no hard-coded ARNs).
- Calls `elbv2 deregister-targets` on every current target.
- Waits 45s for deregistration to propagate across AZs. The router TGs have `connection_termination=true` and `deregistration_delay=0`, so live TCP sessions close promptly.
- Exports `tg_arns` + `drained_ips` as outputs for the C-Node deploy job.

### Reworked `Deploy-to-Morpheus-C-Node`

Now depends on both the drain job and the P-Node deploy, and the `update-service` call is followed by a controlled-traffic sequence:

1. `aws ecs update-service --health-check-grace-period-seconds 600` so the ECS deployment circuit breaker tolerates the deregister window.
2. Poll (up to 15 min) until a task running the **new** task definition is `RUNNING` with an ENI IP.
3. Deregister that IP from every target group.
4. Sleep `cnode_rehydration_wait_secs` (default 90s) to let the proxy-router's BadgerDB rehydration loop catch up from on-chain state. Important for prd where BadgerDB is ephemeral; harmless for dev where it's EFS-backed.
5. Re-register the IP to every TG, then `aws elbv2 wait target-in-service` per TG.
6. Fall through to the existing public `/healthcheck` version-match loop.

### `Deploy-to-Morpheus-P-Node`

Gated behind `Drain-Morpheus-C-Node` so the C-Node NLB is already quiet before the provider restarts and briefly takes hosted models offline.

### New workflow_dispatch input

- `cnode_rehydration_wait_secs` (default `"90"`): operator-tunable hold window.

### Titan deploy

Unchanged: still runs independently and in parallel with the P-Node after container build, as agreed.

## Expected behavior post-merge (prd)

| Phase | Duration | What the user sees |
|---|---|---|
| Drain | ~45s | API GW upstream errors (503), clean TCP close |
| P-Node restart | ~2-3 min | Hosted-model sessions briefly unavailable (single-provider SPOF, known) |
| C-Node old task stop | ~2 min | Same; NLB already empty |
| C-Node new task up | ~30s | Task running, still deregistered |
| Rehydration hold | 90s | Task rehydrating BadgerDB from chain, still deregistered |
| Re-register + healthy | ~90s | NLB targets come healthy, API GW begins forwarding again |

Total: ~6-8 min of deliberate, bounded outage instead of the previous 5 min of messy failure followed by 30 min of manual recovery.

## Companion changes (separate repo)

`Morpheus-Infra`:

- `01_iam_role_gh_actions.tf`: grants `ELBv2ReadOnly` + `ELBv2TargetRegistration` so the new drain/register steps can authenticate.
- `.ai-docs/CICD_HA_IMPROVEMENTS_PLAN.md`: captures deferred follow-ups (persistent BadgerDB promotion, regional P-Node HA, `/readyz` split, graceful shutdown, API GW retry tuning).

Both IAM policies have already been applied to dev + prd via terraform.

## Test plan

- [x] YAML syntax validated with `yaml.safe_load`
- [x] Go build of proxy-router on current HEAD passes (cross-compiled linux/amd64)
- [ ] Merge to `dev` → exercise `test` branch deploy (which runs only the C-Node path; P-Node job is `needs`-skipped for `test`) and confirm:
  - Drain job succeeds and lists target groups
  - New C-Node task registers a new task-def ENI IP
  - Deregister → 90s hold → re-register cycle completes without ECS rollback
  - `/healthcheck` version-match passes
- [ ] On prd (next release), confirm P-Node job runs **after** drain, and that the C-Node completes the rehydration-hold sequence before returning traffic.
- [ ] Observe v6.1.x → v7.0.x deploy cleanly with no manual BadgerDB intervention.

Made with [Cursor](https://cursor.com)
…ehydration hold (#714)

## Summary

Promotes #713 from `dev` → `test` so we can exercise the new deployment sequencing against the `dev` infrastructure before cutting `v7.0.0` to `main`.

## What's in this PR

Only one commit ahead of `test`:

- **`68e39db` - fix(cicd): sequence C-Node drain, P-Node, then C-Node redeploy with rehydration hold** (merged via #713)

Everything else that was previously in `dev` (TEE Phase 2, docs, version bump to 7, ECS wait-timing fix) is already live on `test` via #710.

## What changes on merge to test

The `test` branch only runs the C-Node deploy path (no P-Node exists in `dev` infra). The reworked workflow will still exercise:

1. `Drain-Morpheus-C-Node` job: discovers `test`/dev-env TGs and deregisters current targets.
2. `Deploy-to-Morpheus-P-Node` is `needs`-skipped; dependency resolution treats it as success.
3. `Deploy-to-Morpheus-C-Node`:
   - `update-service` with `--health-check-grace-period-seconds 600`
   - poll new task ENI IP
   - deregister IP from TGs
   - sleep 90s (default `cnode_rehydration_wait_secs`)
   - re-register + `wait target-in-service`
   - public `/healthcheck` version-match (existing logic)

**Note on dev env:** the dev C-Node currently has `switches.efs_storage = true` (persistent BadgerDB on EFS). This means the 90s hold is technically overkill for dev: rehydration has no work to do since state is persistent. That's fine; the hold is a no-op and lets us verify the new workflow paths behave correctly in dev before exercising them on prd (where BadgerDB is ephemeral and the hold matters).

## What to review

- [ ] Workflow YAML structure (parallelism, `needs`, `if` gates)
- [ ] The two shell blocks (`Drain-Morpheus-C-Node` + the controlled-redeploy section in `Deploy-to-Morpheus-C-Node`) for correctness and idempotency
- [ ] Confirm the `cnode_rehydration_wait_secs` default and the guard-against-non-numeric logic
- [ ] Verify IAM: dev IAM policy already includes `ELBv2ReadOnly` + `ELBv2TargetRegistration` (applied from `Morpheus-Infra`)

## Test plan (after you merge to test)

1. Merge triggers the deploy workflow against `dev` env.
2. Confirm `Drain-Morpheus-C-Node` job lists the correct TGs for `svc-dev-router` and successfully deregisters.
3. Confirm `Deploy-to-Morpheus-C-Node`:
   - finds the new task with a fresh task-definition ARN
   - resolves an ENI IP
   - deregister → 90s hold → re-register completes
   - `target-in-service` confirms healthy
   - public `/healthcheck` at `router.dev.mor.org:8082` returns the new version
4. Observe the dev C-Node's BadgerDB during the hold; it should be a no-op (EFS-persistent), confirming the workflow is benign when rehydration isn't needed.
5. If green, cut `v7.0.0` PR from `test` → `main` to exercise the full sequence on prd.

## Related

- Companion IAM + planning changes in `Morpheus-Infra` (already applied to dev + prd).
- Deferred follow-ups captured in `Morpheus-Infra/.ai-docs/CICD_HA_IMPROVEMENTS_PLAN.md`.

Made with [Cursor](https://cursor.com)
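As an aid for reviewing the `cnode_rehydration_wait_secs` default and its non-numeric guard, here is one way such a guard could behave, sketched in Python. The workflow's real guard lives in a shell block that is not shown on this page, so the function name and the choice to fall back to the default (rather than fail the job) are assumptions.

```python
import re
from typing import Optional

DEFAULT_WAIT_SECS = 90  # default quoted in the PR description


def parse_wait_secs(raw: Optional[str], default: int = DEFAULT_WAIT_SECS) -> int:
    """Return the hold window in seconds, falling back to the default
    when the input is missing, empty, or not a plain non-negative integer."""
    text = (raw or "").strip()
    if re.fullmatch(r"[0-9]+", text):
        return int(text)
    return default


if __name__ == "__main__":
    for value in ("120", "", "abc", "-5", " 45 ", None):
        print(repr(value), "->", parse_wait_secs(value))
```

Matching on ASCII digits only keeps signs, decimals and stray shell text from reaching the `sleep` call.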
…deploy needs
Root cause of the first v7 test-branch run (GH Actions run 24796856145):
the drain job ran and deregistered the dev C-Node from both NLB target
groups, then Deploy-to-Morpheus-C-Node was silently skipped because its
implicit success() guard propagated the skip from Deploy-to-Morpheus-P-Node
(which is main-only and intentionally skips on test pushes). That left
dev with the old C-Node task running in ECS but with no load-balancer
membership until we manually re-registered it.
Fix:
- Add an explicit `if` to Deploy-to-Morpheus-C-Node that:
- Requires GHCR build + drain success.
- Accepts Deploy-to-Morpheus-P-Node.result in {success, skipped}.
- Wraps in `!cancelled()` so manual cancels still short-circuit.
- Rewrite the inline comment that previously (incorrectly) claimed
skipped jobs are treated as successful for dependency resolution;
they are not.
No change to the deploy logic itself, the drain job, or any other
workflow sequencing. This is a pure `needs` / guard correctness fix.
Made-with: Cursor
…deploy needs (#715)

## Summary

Hotfix for the first v7 `test`-branch deployment attempt ([Actions run #1447 / run id 24796856145](https://github.com/MorpheusAIs/Morpheus-Lumerin-Node/actions/runs/24796856145)).

The previous PR (#714 → #713) introduced a new job sequence:

```
Drain-Morpheus-C-Node → Deploy-to-Morpheus-P-Node → Deploy-to-Morpheus-C-Node
```

On push to `test`, `Deploy-to-Morpheus-P-Node` is correctly **skipped** (its own `if` restricts it to `main`; no dev P-Node exists). That skip then propagated onto `Deploy-to-Morpheus-C-Node` through the implicit `success()` guard that every job has when no explicit `if` tolerates a skipped dependency.

Net effect:

- `Drain-Morpheus-C-Node` ✅ removed the running dev C-Node task from both NLB target groups.
- `Deploy-to-Morpheus-C-Node` ❌ silently skipped: no new task registered, no re-register of the existing task.
- Dev C-Node endpoints (`router.dev.mor.org:8082`/`:8545`) stopped serving traffic until we manually re-registered the old task to the TGs.

## What this PR changes

`.github/workflows/build.yml`, `Deploy-to-Morpheus-C-Node` job only:

### Explicit `if` guard that tolerates the P-Node skip

```yaml
if: |
  !cancelled() &&
  github.repository == 'MorpheusAIs/Morpheus-Lumerin-Node' &&
  needs.GHCR-Build-and-Push.result == 'success' &&
  needs.Drain-Morpheus-C-Node.result == 'success' &&
  (needs.Deploy-to-Morpheus-P-Node.result == 'success' || needs.Deploy-to-Morpheus-P-Node.result == 'skipped') &&
  (
    (github.event_name == 'push' && (github.ref == 'refs/heads/main' || github.ref == 'refs/heads/test')) ||
    (github.event_name == 'workflow_dispatch' && github.event.inputs.build_all_os == 'true' && github.event.inputs.create_deployment == 'true')
  )
```

- Still requires real success from the drain + GHCR jobs.
- Explicitly accepts `success` OR `skipped` for the P-Node dependency.
- `!cancelled()` short-circuits the job if someone cancels the run, so we don't try to redeploy a half-cancelled sequence.

### Comment rewrite

The previous inline comment claimed skipped jobs are treated as successful for dependency resolution. That's the opposite of how GitHub Actions actually behaves. The note is replaced with an accurate explanation and a reference to this incident so future-us (or other maintainers) won't step on the same rake.

## What this PR does NOT change

- No change to the drain job or the controlled-traffic C-Node deploy sequence (register→dereg→hold→rereg→wait).
- No change to the P-Node job or its `if` gating.
- No change to any other workflow or deploy script.

This is a pure guard-correctness fix.

## Test plan

- [x] YAML syntax validated (`yaml.safe_load`).
- [x] Manually walked through `needs` outcomes:
  - On push to `test`: P-Node `skipped`, drain + GHCR `success` → C-Node **runs**. ✅
  - On push to `main`: all three `success` → C-Node **runs**. ✅
  - Drain fails → C-Node **skipped** (won't redeploy while the TG state is ambiguous). ✅
  - GHCR fails → C-Node **skipped**. ✅
  - P-Node fails on main (not skipped) → C-Node **skipped** (we want to bail rather than leave prd with a v-mismatch between providers and the consumer). ✅
- [ ] Merge to `dev`, promote to `test`, confirm `Deploy-to-Morpheus-Consumer` actually runs this time and completes the drain → hold → re-register → /healthcheck cycle end-to-end.
- [ ] Once green on test, cut `test` → `main` for first prd exercise.

## Related

- Previous PR: #714 (dev → test) carrying #713 (initial drain/sequence/hold).
- Companion IAM + planning doc already applied in `Morpheus-Infra`.

Made with [Cursor](https://cursor.com)
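The `needs` outcome matrix from the test plan can be replayed mechanically. The Python sketch below models the explicit guard next to a simplified version of the implicit one; it is not GitHub Actions itself. The function names are hypothetical, the repository and event conditions are collapsed into a single `trigger_ok` flag, and `implicit_guard` only captures the behavior this PR reports (a skipped dependency does not count as success).

```python
def implicit_guard(*results: str) -> bool:
    """Simplified default: the job runs only if every needed job succeeded."""
    return all(result == "success" for result in results)


def cnode_deploy_runs(ghcr: str, drain: str, pnode: str,
                      cancelled: bool = False, trigger_ok: bool = True) -> bool:
    """Model of the explicit `if` guard on the C-Node deploy job."""
    return (
        not cancelled
        and trigger_ok
        and ghcr == "success"
        and drain == "success"
        and pnode in ("success", "skipped")
    )


if __name__ == "__main__":
    # Push to `test`: the P-Node job is skipped.
    print(implicit_guard("success", "success", "skipped"))     # old behavior: job is skipped
    print(cnode_deploy_runs("success", "success", "skipped"))  # new behavior: job runs
    # P-Node genuinely fails on `main`: the C-Node deploy must not run.
    print(cnode_deploy_runs("success", "success", "failure"))
```

Listing the accepted P-Node outcomes explicitly keeps a real P-Node failure fatal while letting the intentional skip on `test` pass through.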
…deploy needs (#716)

## Summary

Promotes the #715 hotfix from `dev` → `test` so we can retry the v7 test deploy and actually exercise the new drain / hold / re-register sequence end-to-end.

## What's in this PR

Single-commit delta ahead of `test`:

- **`2b14c58` - fix(cicd): allow skipped Deploy-to-Morpheus-P-Node to satisfy C-Node deploy needs** (merged via #715)

Everything else in `dev` is already on `test` from #714.

## Background

First v7 test run ([Actions run #1447 / run id 24796856145](https://github.com/MorpheusAIs/Morpheus-Lumerin-Node/actions/runs/24796856145)) revealed a correctness gap in the workflow from #713/#714:

- `Drain-Morpheus-C-Node` ran and successfully drained the dev C-Node targets from both NLB target groups.
- `Deploy-to-Morpheus-P-Node` was correctly skipped (main-only).
- `Deploy-to-Morpheus-C-Node` was **silently skipped** because the implicit `success()` guard propagated the skip from the P-Node job. GitHub Actions treats skipped `needs` as non-success, contrary to the inline comment in the previous PR.
- Dev C-Node remained functional in ECS but without LB membership until manually re-registered.

## Fix (this PR carries)

Explicit `if` guard on `Deploy-to-Morpheus-C-Node` that:

- Requires real success from `GHCR-Build-and-Push` and `Drain-Morpheus-C-Node`.
- Accepts either `success` or `skipped` as the outcome for `Deploy-to-Morpheus-P-Node`.
- Wraps everything in `!cancelled()` to short-circuit on manual cancel.

The misleading comment is replaced with an accurate explanation referencing this incident.

## Expected behavior after merge to test

On push to `test` the sequence becomes:

1. Build + test + GHCR push (~10–20 min).
2. `Drain-Morpheus-C-Node` ✅ deregisters the existing dev C-Node targets.
3. `Deploy-to-Morpheus-P-Node` skipped (expected).
4. `Deploy-to-Morpheus-C-Node` ✅ **runs this time** because the new `if` accepts the P-Node skip:
   - `update-service` with `--health-check-grace-period-seconds 600`.
   - Poll new task ENI IP.
   - Deregister IP from TGs.
   - Sleep 90s (`cnode_rehydration_wait_secs`).
   - Re-register IP, `wait target-in-service`.
   - Public `/healthcheck` version match on `router.dev.mor.org:8082`.
5. `Deploy-to-Titan` + `Deploy-TEE-SecretVM-test` run in parallel to the deploy sequence (unchanged).

On dev the rehydration hold is effectively a no-op because EFS-backed BadgerDB is persistent there: perfect for a first validation of the workflow plumbing without depending on rehydration correctness.

## Review focus

- The new `if` block on `Deploy-to-Morpheus-C-Node` (lines ~1298–1320 of `build.yml`).
- Reasoning captured in the inline comment.
- No changes to deploy logic, drain logic, or sequencing, only the guard.

## Test plan (after you merge to test)

1. Merge triggers the deploy workflow against dev.
2. Confirm `Deploy-to-Morpheus-Consumer` enters `in_progress` (not skipped) after the drain completes.
3. Watch the controlled-traffic sequence:
   - New task gets an ENI IP.
   - `deregister-targets` runs.
   - 90s hold.
   - `register-targets` + `wait target-in-service` succeed.
   - `/healthcheck` version-match passes on the new image tag.
4. If green, proceed with `test` → `main` for the first prd exercise (existing open PR).

## Related

- Previous merges on the new CICD flow: #713, #714, #715.
- Companion IAM + planning doc already applied in `Morpheus-Infra`.

Made with [Cursor](https://cursor.com)
nomadicrogue approved these changes Apr 23, 2026

morpheusrogue added a commit to absgrafx/Morpheus-Lumerin-Node that referenced this pull request Apr 23, 2026
…station

Brings in upstream's v7.0.0 release (MorpheusAIs#712):

- Phase 2 provider-side backend LLM attestation (TDX quote, TLS pinning, RTMR3 workload replay, CPU-GPU nonce binding, NVIDIA NRAS GPU attestation, per-prompt fast verify)
- Single on-chain "tee" tag drives both hops; local isTee flag retired
- request_id propagation across inference/attestation log paths
- Per-entry Badger activity keys for session storage GC reclaim
- ECS deploy / CI-CD wait-timing hardening, docs rewrite, swagger updates
- Major version bump to 7

Conflict resolved in proxy-router/internal/blockchainapi/service.go: OpenSession keeps the fork's nil-guard on authConfig (for mobile SDK use) alongside upstream's new log := s.requestLog(ctx) binding, and our execution-reverted retry loop was switched to use the request-scoped logger.

Verified: proxy-router builds cleanly and our touched packages (chatstorage, storages, proxyapi, mobile) pass their unit tests. The remaining pre-existing failures (attestation fixture/network tests, TestRating, vet warnings) are inherited unchanged from upstream v7.0.0.

Made-with: Cursor
Overview
v7.0.0 — Full TEE capability. This release completes the Morpheus TEE trust chain with Phase 2: Provider-side backend LLM attestation. v6.x closed the loop between the consumer and the provider's proxy-router (P-Node). v7.0.0 closes the next hop: the P-Node now cryptographically verifies its own backend LLM (CPU TDX quote, TLS pinning, workload RTMR3 replay, CPU-GPU nonce binding, NVIDIA NRAS GPU attestation) at startup and on every inference prompt.
The major version bump to 7.0.0 marks the completion of the full two-hop TEE verification chain. It is also designed for seamless forward-compatibility: v6.0.0+ consumers automatically benefit from Phase 2 guarantees when they connect to v7.0.0+ providers — no client-side upgrade is required.
The on-chain `tee` model tag is the single switch that turns on both hops.
Headline: Phase 2 TEE Backend Verification (#699, #700)
Every TEE-tagged model's backend LLM is now attested by the P-Node itself, not just assumed trustworthy. The P-Node refuses to forward any inference unless the backend passes all of the following on every request:
- `:29343/cpu` → `TEE_PORTAL_URL`
- `reportData[0:32] == SHA-256(TLS cert)`
- … (`MODELS=…` line in `docker-compose.yaml`) is what the operator declared
- `:29343/docker-compose`
- `ArtifactRegistry` CSV (auto-refreshed)
- `reportData[32:64] == GPU nonce`
- `PinnedHTTPClient.VerifyPeerCertificate`
Key files added (`proxy-router/internal/attestation/`):
- `backend_verifier.go`: `AttestBackend` (full, startup) + `FastVerifyBackend` (per-prompt hot path, no TTL)
- `workload_verifier.go`, `rtmr.go`, `tdx_quote.go`: RTMR3 replay using SHA-384 extend chain
- `artifacts_registry.go`: auto-refresh of the SecretVM TDX artifact registry
- `nras_verifier.go`: NVIDIA NRAS v4 API client with JWT EAT validation
- `backend_verifier_test.go`, `workload_verifier_test.go`, `workload_rytn_test.go`, `nras_verifier_test.go`, `golden_test.go` (~1,450 LOC of coverage)
Key integration points:
- `proxy-router/cmd/main.go`: wires `BackendVerifier` into startup and calls `AttestBackend` once per `tee`-tagged model
- `proxy-router/internal/proxyapi/proxy_receiver.go`: calls `FastVerifyBackend` on the `SessionPrompt` hot path before forwarding any inference
- `proxy-router/internal/aiengine/ai_engine.go`: returns a `PinnedHTTPClient` for TEE models
- `proxy-router/internal/proxyapi/controller_http.go`: new `GET /v1/models/attestation` health endpoint exposing per-model state (`verified` / `pending` / `failed` + last-success timestamp + workload match)
New environment variables (`TEE` config block):
- `TEE_PORTAL_URL`
- `TEE_IMAGE_REPO` (`ghcr.io/morpheusais/morpheus-lumerin-node-tee`)
- `ARTIFACT_REGISTRY_URL`
- `ARTIFACT_REGISTRY_REFRESH_INTERVAL`
The `tee` Tag: One Switch, Two Hops (#708, #709)
Prior to v7.0.0 there were transient plans for a separate `tee-gpu` tag. That's been consolidated: the single on-chain `tee` tag now drives the entire trust chain. It turns on both hops.
The local `isTee` field in `models-config.json` has been removed in favor of the blockchain tag as the single source of truth. `IsTeeModel(tags)` is the sole helper; `IsTeeGPUModel` is deleted.
Operational Robustness
P-Node TEE error wrapping (#703, #704)
When the P-Node's Phase 2 backend attestation fails, the error returned to callers is now wrapped in the correct error type so upstream logic (session open, prompt dispatch) handles it consistently and the consumer-visible failure is actionable rather than a generic 500.
`request_id` propagation in every log (#705)
Every log line emitted along an inference or attestation path now carries the `request_id` from its context, so operators can trace a single prompt end-to-end through consumer → P-Node → backend attestation → inference → response. This is critical for v7 operations, since Phase 2 failures can surface at any of several points.
Storage: per-entry Badger activity keys (#692, #693)
Session activity tracking moved from a single aggregate key to per-entry keys. This lets BadgerDB's GC reclaim disk space as sessions roll over, eliminating the slow storage growth seen on long-running providers.
CI/CD: ECS deploy wait-timing hardening (#694, #695, #701, #702, #710)
Multiple refinements to the ECS service stabilization and post-deploy attestation-verification window eliminate the intermittent, premature health-check failures that caused otherwise-successful deploys to fail.
Documentation — Full v7 Doc Pass (#710)
All public-facing and internal TEE documentation was audited and rewritten in this release to accurately describe the two-hop trust chain and the forward-compatibility story:
User-facing:
- `readme.md`: new v7.0.0 release callout with the two-hop diagram and forward-compat note
- `docs/02.3-proxy-router-tee.md`: rewrote "What This Guarantees (and What It Doesn't)" with explicit Phase 1 / Phase 2 / remaining-gaps sections
- `docs/02.4-proxy-router-secretvm-quickstart.md`: rewrote "What Consumers See, and What Your P-Node Does" as two distinct hops + v7 troubleshooting
- `docs/03-provider-offer.md`: clarified the `tee` on-chain tag drives both hops
- `docs/models-config.json.md`: noted `isTee` is no longer a local config field (tag-driven)
- `docs/proxy-router.all.env`: documented all new TEE env vars
Developer reference:
- `proxy-router/docs/tee-backend-verification.md`: new 286-line developer reference for Phase 2, with mermaid sequence + trust-chain diagrams
- `proxy-router/docs/docs.go`, `swagger.json`, `swagger.yaml`: auto-generated API docs include the new `GET /v1/models/attestation` endpoint
Internal (`.ai-docs/`):
- `TEE_Attestation_Architecture.md`: status bumped to v2.0; Phases 1 / 1c / 2a marked DONE with real file paths and PR numbers; new §7.7 full Phase 2 technical write-up
- `TEE_CICD_Supply_Chain_Hardening.md`: v7.0.0 banner; trust-chain diagram updated with completed Phase 1c and Phase 2 boxes
Configuration Updates
- `.github/workflows/build.yml`: `VMAJ_NEW=7` (major-version bump); builds from `main` will now tag as `v7.x.x`.
- `smart-contracts/deploy/data/config_base_mainnet.json`: `fundingAccount` rotated from `0x1FE04BC15Cf2c5A2d41a0b3a96725596676eBa1E` to `0x5160C0311A95E0A1072FA85Df23712A7BA1cD4b1`.
Consumer / Provider Compatibility Matrix
This forward-compatibility is the key design principle of the v7 release — upgrading providers instantly strengthens the network for all existing v6+ consumers.
PRs Included (main → test diff, 28 commits)
Phase 2 TEE Backend Verification (headline)
- … `"tee"` tag for everything (consolidate on single on-chain tag)
Storage & CI/CD hardening
Verification
All changes were validated on `test` through the automated pipeline:
- … `GET /v1/models/attestation` and verified live
Test Plan
- `VMAJ_NEW=7` produces a clean `v7.0.0` build tag on merge to main
- `docker-compose.tee.yml` deployed to a SecretVM instance boots cleanly and `GET /v1/models/attestation` returns `verified` per TEE-tagged model
- … (`request_id` should appear at both session-open and each prompt)
- … `docker-compose.yaml`) causes the P-Node to refuse the session
- `fundingAccount` rotation in `config_base_mainnet.json` is correct before promotion
- … `main` after merge
Blocked until review + branch protection approval.
Made with Cursor