Release v7.0.0: Full TEE Capability — Phase 2 Backend LLM Attestation #712

Merged: nomadicrogue merged 34 commits into main from test on Apr 23, 2026

abs2023 (Collaborator) commented Apr 22, 2026

Overview

v7.0.0 — Full TEE capability. This release completes the Morpheus TEE trust chain with Phase 2: Provider-side backend LLM attestation. v6.x closed the loop between the consumer and the provider's proxy-router (P-Node). v7.0.0 closes the next hop: the P-Node now cryptographically verifies its own backend LLM (CPU TDX quote, TLS pinning, workload RTMR3 replay, CPU-GPU nonce binding, NVIDIA NRAS GPU attestation) at startup and on every inference prompt.

The major version bump to 7.0.0 marks the completion of the full two-hop TEE verification chain. It is also designed for seamless forward-compatibility: v6.0.0+ consumers automatically benefit from Phase 2 guarantees when they connect to v7.0.0+ providers — no client-side upgrade is required.

C-Node (v6.0.0+)  ──Phase 1──▶  P-Node -tee image (v7.0.0+)  ──Phase 2──▶  Backend LLM
                  consumer                                     P-Node
                  verifies                                     verifies its
                  P-Node                                       own backend

The on-chain tee model tag is the single switch that turns on both hops.


Headline: Phase 2 TEE Backend Verification (#699, #700)

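Every TEE-tagged model's backend LLM is now attested by the P-Node itself, not just assumed trustworthy. The P-Node runs the full check set below at startup and a fast verify on every prompt, and it refuses to forward inference to a backend that fails. The one exception is NRAS: an unreachable NRAS service is non-fatal, while the CPU-GPU nonce binding is still enforced.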

| Check | What it proves | Where |
|-------|----------------|-------|
| Portal-verified CPU TDX quote | Backend runs on genuine Intel TDX hardware | `:29343/cpu`, `TEE_PORTAL_URL` |
| TLS certificate pinning | Inference TLS terminates inside the attested enclave, so no CDN/MITM can slip in | `reportData[0:32] == SHA-256(TLS cert)` |
| Workload RTMR3 replay | The exact set of loaded models (`MODELS=…` line in `docker-compose.yaml`) is what the operator declared | RTMR3 replay vs. backend's `:29343/docker-compose` |
| MRTD + RTMR0–2 artifact lookup | Firmware / VM config / kernel / initramfs all match a published SecretVM build | ArtifactRegistry CSV (auto-refreshed) |
| CPU-GPU nonce binding | GPU evidence cannot be replayed from another box | `reportData[32:64] == GPU nonce` |
| NVIDIA NRAS v4 attestation | Independent hardware-level validation of the GPU | NRAS REST + JWT EAT signature check |
| Per-prompt fast verify | Backend identity hasn't changed since initial attestation (runs on every prompt) | hash + TLS fingerprint compare (~50 ms) |
| Pinned-cert HTTP client | Onward inference connection refuses any TLS cert whose fingerprint doesn't match | `PinnedHTTPClient.VerifyPeerCertificate` |
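
The two `reportData` rows in the table can be illustrated with a minimal Go sketch. It assumes the hash is taken over the DER-encoded certificate and that the GPU nonce is 32 bytes; `checkReportData` and `buildReportData` are hypothetical names for illustration, not the functions in `backend_verifier.go`.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"errors"
	"fmt"
)

// checkReportData verifies the two bindings carried in a 64-byte reportData field:
// bytes [0:32] must equal SHA-256 of the TLS certificate (DER encoding assumed) and
// bytes [32:64] must equal the nonce taken from the GPU evidence.
func checkReportData(reportData, certDER, gpuNonce []byte) error {
	if len(reportData) != 64 {
		return fmt.Errorf("reportData must be 64 bytes, got %d", len(reportData))
	}
	certHash := sha256.Sum256(certDER)
	if !bytes.Equal(reportData[0:32], certHash[:]) {
		return errors.New("TLS binding mismatch: reportData[0:32] != SHA-256(cert)")
	}
	if !bytes.Equal(reportData[32:64], gpuNonce) {
		return errors.New("CPU-GPU binding mismatch: reportData[32:64] != GPU nonce")
	}
	return nil
}

// buildReportData is a demo helper that produces a well-formed reportData value.
func buildReportData(certDER, gpuNonce []byte) []byte {
	h := sha256.Sum256(certDER)
	out := make([]byte, 0, 64)
	out = append(out, h[:]...)
	return append(out, gpuNonce...)
}

func main() {
	cert := []byte("example certificate bytes")
	nonce := bytes.Repeat([]byte{0xAB}, 32)
	rd := buildReportData(cert, nonce)

	fmt.Println(checkReportData(rd, cert, nonce))                 // matching cert and nonce
	fmt.Println(checkReportData(rd, []byte("other cert"), nonce)) // different cert is rejected
}
```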

Key files added (proxy-router/internal/attestation/):

  • backend_verifier.go: AttestBackend (full, startup) + FastVerifyBackend (per-prompt hot path, no TTL)
  • workload_verifier.go, rtmr.go, tdx_quote.go — RTMR3 replay using SHA-384 extend chain
  • artifacts_registry.go — auto-refresh of the SecretVM TDX artifact registry
  • nras_verifier.go — NVIDIA NRAS v4 API client with JWT EAT validation
  • Backend verifier test suite: backend_verifier_test.go, workload_verifier_test.go, workload_rytn_test.go, nras_verifier_test.go, golden_test.go (~1,450 LOC of coverage)
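
For readers unfamiliar with the SHA-384 extend chain mentioned for `rtmr.go`, here is a minimal sketch of the general technique. A TDX RTMR is a 48-byte register, and each extend computes SHA-384 over the old value concatenated with a 48-byte digest. Which events SecretVM actually extends into RTMR3, and how each event is hashed, is not stated in this PR; hashing raw payloads such as the docker-compose contents is an assumption here, and `replayRTMR3` is a hypothetical name.

```go
package main

import (
	"crypto/sha512"
	"encoding/hex"
	"fmt"
)

const rtmrSize = sha512.Size384 // 48 bytes

// extend applies one RTMR extend step: new = SHA-384(old || digest).
func extend(rtmr, digest [rtmrSize]byte) [rtmrSize]byte {
	buf := make([]byte, 0, 2*rtmrSize)
	buf = append(buf, rtmr[:]...)
	buf = append(buf, digest[:]...)
	return sha512.Sum384(buf)
}

// replayRTMR3 starts from an all-zero register and extends it once per event,
// using the SHA-384 digest of each event payload (payload choice is assumed).
func replayRTMR3(events [][]byte) [rtmrSize]byte {
	var rtmr [rtmrSize]byte
	for _, ev := range events {
		rtmr = extend(rtmr, sha512.Sum384(ev))
	}
	return rtmr
}

func main() {
	compose := []byte("services:\n  llm:\n    environment:\n      - MODELS=example-model\n")
	got := replayRTMR3([][]byte{compose})
	// A verifier would compare this value with the RTMR3 field of the parsed quote.
	fmt.Println(hex.EncodeToString(got[:]))
}
```

Because every step hashes the previous register value, changing any event or the order of events yields a different final value, which is what makes the replay a workload check.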

Key integration points:

  • proxy-router/cmd/main.go — wires BackendVerifier into startup and calls AttestBackend once per tee-tagged model
  • proxy-router/internal/proxyapi/proxy_receiver.go — calls FastVerifyBackend on the SessionPrompt hot path before forwarding any inference
  • proxy-router/internal/aiengine/ai_engine.go — returns a PinnedHTTPClient for TEE models
  • proxy-router/internal/proxyapi/controller_http.go — new GET /v1/models/attestation health endpoint exposing per-model state (verified / pending / failed + last-success timestamp + workload match)
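
A pinned-cert client of the kind `ai_engine.go` returns can be sketched with the standard `tls.Config.VerifyPeerCertificate` hook. This is an illustration under assumptions, not the project's `PinnedHTTPClient`: it pins the SHA-256 of the leaf certificate and sets `InsecureSkipVerify` on the assumption that the enclave certificate is not CA-issued, so the pin is the only trust anchor. `pinnedVerifier`, `newPinnedClient`, and `fingerprint` are hypothetical names.

```go
package main

import (
	"crypto/sha256"
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
	"net/http"
)

// fingerprint returns the SHA-256 of a DER-encoded certificate.
func fingerprint(der []byte) [32]byte { return sha256.Sum256(der) }

// pinnedVerifier builds a VerifyPeerCertificate callback that accepts only a
// leaf certificate whose fingerprint equals the pinned value.
func pinnedVerifier(pinned [32]byte) func([][]byte, [][]*x509.Certificate) error {
	return func(rawCerts [][]byte, _ [][]*x509.Certificate) error {
		if len(rawCerts) == 0 {
			return errors.New("no peer certificate presented")
		}
		if got := fingerprint(rawCerts[0]); got != pinned {
			return fmt.Errorf("peer certificate fingerprint %x does not match pin", got[:8])
		}
		return nil
	}
}

// newPinnedClient returns an HTTP client that trusts exactly one certificate.
func newPinnedClient(pinned [32]byte) *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{
				// Assumption: chain validation is replaced by the attested pin.
				InsecureSkipVerify:    true,
				VerifyPeerCertificate: pinnedVerifier(pinned),
			},
		},
	}
}

func main() {
	pin := fingerprint([]byte("attested certificate bytes"))
	client := newPinnedClient(pin)
	fmt.Println(client.Transport != nil)
}
```

One design point worth checking in any real implementation: Go does not invoke `VerifyPeerCertificate` on resumed TLS sessions, so a pinning client may also want `VerifyConnection` or disabled session resumption.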

New environment variables (TEE config block):

| Variable | Default | Purpose |
|----------|---------|---------|
| `TEE_PORTAL_URL` | SecretAI Portal | CPU quote parse + verification endpoint |
| `TEE_IMAGE_REPO` | `ghcr.io/morpheusais/morpheus-lumerin-node-tee` | Image repo for cosign attestation verification |
| `ARTIFACT_REGISTRY_URL` | SecretVM TDX artifacts CSV | MRTD + RTMR0–2 lookup source |
| `ARTIFACT_REGISTRY_REFRESH_INTERVAL` | (configurable) | How often to re-fetch the registry |

The tee Tag — One Switch, Two Hops (#708, #709)

Prior to v7.0.0 there were transient plans for a separate tee-gpu tag. That's been consolidated: the single on-chain tee tag now drives the entire trust chain. It turns on both:

  • Phase 1 on the consumer: C-Node (v6.0.0+) verifies the P-Node's attestation
  • Phase 2 on the provider: P-Node (v7.0.0+) verifies its own backend LLM

The local isTee field in models-config.json has been removed in favor of the blockchain tag as the single source of truth. IsTeeModel(tags) is the sole helper; IsTeeGPUModel is deleted.
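
As a rough illustration of a tag-driven switch, a helper like `IsTeeModel(tags)` could look like the sketch below. The real matching rules are not shown in this PR; the case-insensitive, whitespace-trimmed, exact-match comparison and the lowercase name `isTeeModel` are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// isTeeModel reports whether the on-chain tag list contains the "tee" tag.
// Exact-match semantics are assumed, so "tee-gpu" does not count.
func isTeeModel(tags []string) bool {
	for _, t := range tags {
		if strings.EqualFold(strings.TrimSpace(t), "tee") {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isTeeModel([]string{"LLM", "tee"})) // true
	fmt.Println(isTeeModel([]string{"LLM"}))        // false
}
```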


Operational Robustness

P-Node TEE error wrapping (#703, #704)

When the P-Node's Phase 2 backend attestation fails, the error returned to callers is now wrapped in the correct error type so upstream logic (session open, prompt dispatch) handles it consistently and the consumer-visible failure is actionable rather than a generic 500.
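
The wrapping pattern described here is the standard Go sentinel-error idiom. A minimal sketch follows; `ErrBackendAttestation`, `forwardPrompt`, and `isAttestationFailure` are hypothetical names, and the actual error type used in #703 is not shown in this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrBackendAttestation is a hypothetical sentinel for Phase 2 failures.
var ErrBackendAttestation = errors.New("tee backend attestation failed")

// errQuoteChanged stands in for a low-level verifier error in the demo.
var errQuoteChanged = errors.New("cpu quote hash changed")

// forwardPrompt runs the verifier and wraps any failure in the sentinel so
// callers can classify it with errors.Is instead of parsing message strings.
func forwardPrompt(modelID string, verify func() error) error {
	if err := verify(); err != nil {
		return fmt.Errorf("model %s: %w: %v", modelID, ErrBackendAttestation, err)
	}
	return nil
}

// isAttestationFailure is what upstream logic (session open, prompt dispatch)
// would check before choosing a response.
func isAttestationFailure(err error) bool {
	return errors.Is(err, ErrBackendAttestation)
}

func main() {
	err := forwardPrompt("model-1", func() error { return errQuoteChanged })
	fmt.Println(err)
	fmt.Println(isAttestationFailure(err)) // true
}
```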

request_id propagation in every log (#705)

Every log line emitted along an inference or attestation path now carries the request_id from its context, so operators can trace a single prompt end-to-end through consumer → P-Node → backend attestation → inference → response. Critical for v7 operations since Phase 2 failures can surface at any of several points.
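
Carrying an ID through `context.Context` is the usual way to achieve this in Go. The sketch below assumes a private context key and a plain key=value log format; the helper names and the format are illustrative, not the project's logger.

```go
package main

import (
	"context"
	"fmt"
)

// ctxKey is an unexported key type, which avoids collisions with other packages.
type ctxKey struct{}

// withRequestID stores the request ID in the context.
func withRequestID(ctx context.Context, id string) context.Context {
	return context.WithValue(ctx, ctxKey{}, id)
}

// newRequestContext is a demo helper that starts a fresh context for one prompt.
func newRequestContext(id string) context.Context {
	return withRequestID(context.Background(), id)
}

// requestID reads the ID back, falling back to "unknown" when none was set.
func requestID(ctx context.Context) string {
	if id, ok := ctx.Value(ctxKey{}).(string); ok {
		return id
	}
	return "unknown"
}

// logLine formats a message with the request ID attached.
func logLine(ctx context.Context, msg string) string {
	return fmt.Sprintf("request_id=%s msg=%q", requestID(ctx), msg)
}

func main() {
	ctx := newRequestContext("req-123")
	fmt.Println(logLine(ctx, "fast verify ok"))
	fmt.Println(logLine(ctx, "forwarding inference"))
}
```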

Storage: per-entry Badger activity keys (#692, #693)

Session activity tracking moved from a single aggregate key to per-entry keys with TTLs. BadgerDB's GC can now reclaim disk space as sessions roll over, which eliminates the slow storage growth seen on long-running providers.
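
To make the layout change concrete, here is a small in-memory sketch of per-entry keys with expiry. It does not use Badger; the `activity:<model>:<timestamp>` key format, the `ttlStore` type, and the integer-second clock are all assumptions chosen to keep the example self-contained.

```go
package main

import (
	"fmt"
	"strings"
)

// entryKey builds one key per activity record instead of appending to a single blob.
func entryKey(modelID string, unixNano int64) string {
	return fmt.Sprintf("activity:%s:%020d", modelID, unixNano)
}

// ttlStore is a toy stand-in for a key-value store with per-key expiry.
type ttlStore struct {
	expiresAt map[string]int64 // key -> expiry time in seconds
}

func newTTLStore() *ttlStore {
	return &ttlStore{expiresAt: make(map[string]int64)}
}

// put records an entry that expires ttlSecs after nowSecs.
func (s *ttlStore) put(key string, nowSecs, ttlSecs int64) {
	s.expiresAt[key] = nowSecs + ttlSecs
}

// liveCount counts unexpired entries under a prefix, which is the kind of
// query a capacity check needs.
func (s *ttlStore) liveCount(prefix string, nowSecs int64) int {
	n := 0
	for k, exp := range s.expiresAt {
		if strings.HasPrefix(k, prefix) && nowSecs < exp {
			n++
		}
	}
	return n
}

func main() {
	s := newTTLStore()
	s.put(entryKey("m1", 100), 100, 60)
	s.put(entryKey("m1", 130), 130, 60)
	fmt.Println(s.liveCount("activity:m1:", 150)) // 2
	fmt.Println(s.liveCount("activity:m1:", 170)) // 1
}
```

With one key per record, old records expire individually instead of forcing a rewrite of an ever-growing value on every prompt.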

CI/CD: ECS deploy wait-timing hardening (#694, #695, #701, #702, #710)

Multiple refinements to the ECS service stabilization and post-deploy attestation-verification window eliminate the premature health-check failures that intermittently failed otherwise-successful deploys.


Documentation — Full v7 Doc Pass (#710)

All public-facing and internal TEE documentation was audited and rewritten in this release to accurately describe the two-hop trust chain and the forward-compatibility story:

User-facing:

  • readme.md — new v7.0.0 release callout with the two-hop diagram and forward-compat note
  • docs/02.3-proxy-router-tee.md — rewrote "What This Guarantees (and What It Doesn't)" with explicit Phase 1 / Phase 2 / remaining-gaps sections
  • docs/02.4-proxy-router-secretvm-quickstart.md — rewrote "What Consumers See, and What Your P-Node Does" as two distinct hops + v7 troubleshooting
  • docs/03-provider-offer.md — clarified the tee on-chain tag drives both hops
  • docs/models-config.json.md — noted isTee is no longer a local config field (tag-driven)
  • docs/proxy-router.all.env — documented all new TEE env vars

Developer reference:

  • proxy-router/docs/tee-backend-verification.md — new 286-line developer reference for Phase 2, with mermaid sequence + trust-chain diagrams
  • proxy-router/docs/docs.go, swagger.json, swagger.yaml — auto-generated API docs include the new GET /v1/models/attestation endpoint

Internal (.ai-docs/):

  • TEE_Attestation_Architecture.md — status bumped to v2.0; Phases 1 / 1c / 2a marked DONE with real file paths and PR numbers; new §7.7 full Phase 2 technical write-up
  • TEE_CICD_Supply_Chain_Hardening.md — v7.0.0 banner; trust-chain diagram updated with completed Phase 1c and Phase 2 boxes

Configuration Updates

  • .github/workflows/build.yml: VMAJ_NEW=7 (major-version bump); builds from main will now tag as v7.x.x.
  • smart-contracts/deploy/data/config_base_mainnet.json: fundingAccount rotated from 0x1FE04BC15Cf2c5A2d41a0b3a96725596676eBa1E to 0x5160C0311A95E0A1072FA85Df23712A7BA1cD4b1.

Consumer / Provider Compatibility Matrix

| Consumer | Provider | TEE behavior |
|----------|----------|--------------|
| Pre-v6 | any | No TEE verification |
| v6.0.0+ | v6.x | Phase 1 only (consumer verifies P-Node); backend LLM not attested |
| v6.0.0+ | v7.0.0+ | Full Phase 1 + Phase 2. The consumer transparently gains Phase 2 guarantees via the attested P-Node binary. No client-side upgrade required. |
| v7.0.0+ | v7.0.0+ | Full Phase 1 + Phase 2 |

This forward-compatibility is the key design principle of the v7 release — upgrading providers instantly strengthens the network for all existing v6+ consumers.
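
The matrix can be restated as a small lookup, shown here as a sketch for a `tee`-tagged model. `teeBehavior` is a hypothetical helper that takes major versions only; combinations the matrix does not list (a v6+ consumer with a pre-v6 provider) are reported as not covered rather than guessed.

```go
package main

import "fmt"

// teeBehavior maps consumer and provider major versions to the matrix rows above.
func teeBehavior(consumerMajor, providerMajor int) string {
	switch {
	case consumerMajor < 6:
		return "no TEE verification"
	case providerMajor >= 7:
		return "phase 1 + phase 2"
	case providerMajor == 6:
		return "phase 1 only"
	default:
		return "not covered by the matrix"
	}
}

func main() {
	fmt.Println(teeBehavior(6, 6)) // phase 1 only
	fmt.Println(teeBehavior(6, 7)) // phase 1 + phase 2
}
```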


PRs Included (main → test diff, 28 commits)

Phase 2 TEE Backend Verification (headline)

Storage & CI/CD hardening


Verification

All changes were validated on test through the automated pipeline:

  • Build → cosign sign → RTMR3 compute → deploy to SecretVM test VM → post-deploy attestation verification
  • Per-model Phase 2 attestation exposed at GET /v1/models/attestation and verified live
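
A post-deploy check against `GET /v1/models/attestation` might decode the response and list any model that is not verified. The response schema is not given in this PR, so the JSON field names below (`modelId`, `status`, `lastSuccess`, `workloadMatch`) and the array shape are assumptions; adjust them to the generated swagger before use.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// modelAttestation mirrors the per-model state described above.
// Field names are hypothetical.
type modelAttestation struct {
	ModelID       string `json:"modelId"`
	Status        string `json:"status"` // verified / pending / failed
	LastSuccess   string `json:"lastSuccess"`
	WorkloadMatch bool   `json:"workloadMatch"`
}

// unverifiedModels returns the IDs of models that are not in the verified
// state or whose workload measurement did not match.
func unverifiedModels(body []byte) ([]string, error) {
	var models []modelAttestation
	if err := json.Unmarshal(body, &models); err != nil {
		return nil, fmt.Errorf("decode attestation response: %w", err)
	}
	var bad []string
	for _, m := range models {
		if m.Status != "verified" || !m.WorkloadMatch {
			bad = append(bad, m.ModelID)
		}
	}
	return bad, nil
}

func main() {
	body := []byte(`[
		{"modelId":"m1","status":"verified","lastSuccess":"2026-04-22T13:00:00Z","workloadMatch":true},
		{"modelId":"m2","status":"pending","workloadMatch":false}
	]`)
	bad, err := unverifiedModels(body)
	fmt.Println(bad, err)
}
```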

Test Plan

  • Verify VMAJ_NEW=7 produces a clean v7.0.0 build tag on merge to main
  • Verify docker-compose.tee.yml deployed to a SecretVM instance boots cleanly and GET /v1/models/attestation returns verified per TEE-tagged model
  • Verify a v6.0.0+ consumer opens a session against a v7.0.0+ provider and transparently gets Phase 2 guarantees (no client upgrade)
  • Verify Phase 2 fast-verify fires on every prompt (log request_id should appear at both session-open and each prompt)
  • Verify workload RTMR3 mismatch (e.g. altered docker-compose.yaml) causes the P-Node to refuse the session
  • Verify TLS certificate change on the backend triggers a hard fail (MITM signal) and refused prompt
  • Verify CPU-GPU nonce mismatch causes attestation failure
  • Verify an NRAS outage degrades gracefully (does not block inference) while CPU-GPU binding is still enforced
  • Verify fundingAccount rotation in config_base_mainnet.json is correct before promotion
  • Verify existing non-TEE models are unaffected (zero overhead)
  • All CI checks pass on main after merge

Blocked until review + branch protection approval.

Made with Cursor

alex-sandrk and others added 28 commits March 27, 2026 12:00
#692)

…mprove GC reclaim

This MR fixes BadgerDB value-log growth caused by rewriting a single
growing activity blob on every prompt.
Activity records are now stored as individual TTL keys, enabling
efficient expiration and much better garbage collection behavior.

It also tunes Badger GC settings and adds focused tests to ensure
capacity logic remains correct with the new storage layout.
…ng (#694)

- Updated the ECS service stabilization wait time to approximately 12.5
minutes with a maximum of 50 attempts, enhancing reliability during
deployments.
- Removed deprecated pause and resume steps for Active Models refresh,
streamlining the deployment process.
- Added comments for clarity on the stabilization process and error
handling in the script.
## Summary

Extends TEE attestation from the provider node (Phase 1) to the
**backend LLM endpoints**, creating a full trust chain: hardware ->
firmware -> OS -> workload -> TLS connection.

### What it does

- **Backend attestation** (`AttestBackend`): at startup, verifies each
TEE-marked model's backend via CPU quote (SecretAI portal), TLS cert
binding, docker-compose workload verification, GPU attestation (CPU-GPU
nonce binding), and NVIDIA NRAS.
- **Per-prompt fast verify** (`FastVerifyBackend`): on every inference
request, re-fetches the CPU quote and compares its hash + TLS
fingerprint against the cached snapshot. Full re-attestation only
triggers when something changes.
- **Workload verification**: downloads the SecretVM TDX artifact
registry from GitHub, parses TDX quotes to extract MRTD/RTMR0-3, and
replays the RTMR3 measurement from the backend's docker-compose.yaml to
prove the exact workload running inside the TEE.
- **NVIDIA NRAS**: sends GPU evidence to NVIDIA's Remote Attestation
Service (v4 API) for independent hardware verification (non-fatal if
unreachable).
- **Health endpoint**: `GET /v1/models/attestation` reports per-model
attestation status including workload verification results.
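
The fast-verify behavior described above (compare against a cached snapshot, re-attest fully only on change) can be sketched as follows. This is a simplified reading of the description, not the code in `backend_verifier.go`: the `verifier` type, its fields, and the choice of SHA-256 for both hashes are assumptions.

```go
package main

import (
	"crypto/sha256"
	"errors"
	"fmt"
)

// snapshot is what the last successful full attestation left behind.
type snapshot struct {
	quoteHash      [32]byte
	tlsFingerprint [32]byte
}

// errRejected stands in for a failed full attestation in the demo.
var errRejected = errors.New("full attestation rejected backend")

type verifier struct {
	cached     snapshot
	fullAttest func(quote, certDER []byte) error
}

// newVerifier seeds the cache as if startup attestation had already passed.
func newVerifier(quote, certDER []byte, fullAttest func(quote, certDER []byte) error) *verifier {
	return &verifier{
		cached:     snapshot{sha256.Sum256(quote), sha256.Sum256(certDER)},
		fullAttest: fullAttest,
	}
}

// fastVerify is the hot path: an unchanged quote hash and TLS fingerprint pass
// immediately; any change triggers a full re-attestation before the prompt
// may be forwarded. The cache is updated only when re-attestation succeeds.
func (v *verifier) fastVerify(quote, certDER []byte) (reattested bool, err error) {
	cur := snapshot{sha256.Sum256(quote), sha256.Sum256(certDER)}
	if cur == v.cached {
		return false, nil
	}
	if err := v.fullAttest(quote, certDER); err != nil {
		return true, err
	}
	v.cached = cur
	return true, nil
}

func main() {
	v := newVerifier([]byte("quote-1"), []byte("cert-1"), func(_, _ []byte) error { return nil })
	fmt.Println(v.fastVerify([]byte("quote-1"), []byte("cert-1"))) // false <nil>
	fmt.Println(v.fastVerify([]byte("quote-2"), []byte("cert-1"))) // true <nil>
}
```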

### Configuration

New env vars: `ARTIFACT_REGISTRY_URL`,
`ARTIFACT_REGISTRY_REFRESH_INTERVAL` (optional).
…ck failures

Three bugs caused the TEST environment C-Node deploy to fail after only
~70 seconds instead of waiting for the full ECS task lifecycle:

1. `--max-attempts 50` is not a valid AWS CLI option for `aws ecs wait
   services-stable` — the command errored immediately instead of polling.

2. Health checks started instantly after the broken waiter, while the old
   task (stopTimeout=120s) was still draining.

3. `set -e` was active during the health-check loop, so a curl timeout
   (exit 28) on attempt 5 killed the entire script.

Changes (applied to all three deploy jobs — LMN, C-Node, P-Node):

- Remove invalid `--max-attempts 50` from `aws ecs wait` (default 40×15s
  = 10 min is sufficient).
- Add a 180-second deployment wait floor based on the ECS task lifecycle:
  stopTimeout=120s + deregistration_delay≤30s + ALB health threshold≈90s.
  The waiter runs first; if it finishes early (success or error), the
  remaining time is filled with a sleep to prevent wasting retries on the
  old task.
- Wrap the health-check loop in `set +e` / `set -e` so curl timeouts
  don't abort the script.

Made-with: Cursor
…ck failures (#701)

## Summary

- **Removes invalid `--max-attempts 50`** from `aws ecs wait
services-stable` — this flag is not recognized by the AWS CLI and caused
the waiter to error immediately instead of polling (the bug that
triggered the premature deploy failure in TEST on 2026-04-10).
- **Adds a 180-second minimum deployment wait floor** before
health-check polling, based on the actual ECS task lifecycle timing from
Terraform (`02_mor_router_svc.tf`): `stopTimeout=120s` +
`deregistration_delay≤30s` + ALB `healthy_threshold×interval≈90s` = ~4
min worst case. The `aws ecs wait` still runs first (default 40×15s = 10
min); if it finishes early the remaining time is filled with a sleep.
- **Wraps the health-check loop in `set +e` / `set -e`** so curl
timeouts (exit code 28) during version verification no longer kill the
entire script under bash `-e`.
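
The timing arithmetic in the bullets above can be worked through in a few lines. The deploy script itself is bash; this Go sketch only restates the numbers, and both function names are hypothetical.

```go
package main

import "fmt"

// worstCaseLifecycleSecs adds the ECS task lifecycle terms quoted above.
func worstCaseLifecycleSecs(stopTimeout, deregistrationDelay, albHealthy int) int {
	return stopTimeout + deregistrationDelay + albHealthy
}

// remainingFloorSecs returns how long to sleep after the waiter returns so
// that at least floorSecs have passed before health-check polling starts.
func remainingFloorSecs(floorSecs, waiterElapsedSecs int) int {
	if waiterElapsedSecs >= floorSecs {
		return 0
	}
	return floorSecs - waiterElapsedSecs
}

func main() {
	fmt.Println(worstCaseLifecycleSecs(120, 30, 90)) // 240 seconds, about 4 minutes
	fmt.Println(remainingFloorSecs(180, 1))          // 179: waiter returned almost immediately
	fmt.Println(remainingFloorSecs(180, 300))        // 0: waiter already outlasted the floor
}
```

Note that the 180-second floor is shorter than the 240-second worst case; the health-check retries that follow presumably cover the difference.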

Applied to all three deploy jobs: LMN (Titan), C-Node (Morpheus
Consumer), and P-Node (Morpheus Provider).

### Root cause analysis (TEST deploy 2026-04-10 14:37 UTC)

| Step | What happened | Time |
|------|--------------|------|
| ECS service update issued | `update-service --force-new-deployment` | 14:37:15 |
| `aws ecs wait` fails immediately | `Unknown options: --max-attempts, 50` | 14:37:15 |
| Health checks start (no wait) | Old task (v6.2.2-test, 57h uptime) still running | 14:37:16 |
| Attempts 1–4 | Version mismatch (old task responding) | 14:37:16 – 14:38:01 |
| Attempt 5 | `curl --max-time 10` times out (exit 28), `set -e` kills script | 14:38:16 |
| **Total wall time before "failure"** | **~71 seconds**; ECS hadn't even stopped the old task yet | |

## Test plan

- [ ] Trigger a TEST branch deploy and confirm the waiter polls
correctly (no `Unknown options` error)
- [ ] Verify the 180s floor wait appears in logs before health-check
polling begins
- [ ] Confirm curl timeout during health check doesn't abort the script
(visible as `⚠️ Health check failed (curl status: 28)` instead of
`##[error] Process completed with exit code 28`)
- [ ] Verify version verification succeeds after the new task is up


Made with [Cursor](https://cursor.com)
## Summary

- Merges `dev` into `test` to pick up the ECS deploy wait timing fix
from PR #701.
- Fixes the premature deploy failure seen in TEST on 2026-04-10 (only
~71s before aborting instead of waiting for ECS task lifecycle).

See #701 for full details.


Made with [Cursor](https://cursor.com)
- Incremented the major version number from 6 to 7 in the build workflow.
- Updated the funding account address in the mainnet configuration file.
Brings in:
- #701 fix(cicd): fix ECS deploy wait timing (squashed mirror of this branch's own commit)
- #703 wrap in correct error on P-node tee attestation fail
- #705 pass request_id in context in every log
- #708 feat: 'tee' tag for everything
- Introduced full two-hop Trusted Execution Environment (TEE) verification: Phase 1 (consumer to P-Node) and Phase 2 (P-Node to backend LLM).
- Updated documentation to reflect new TEE features, including detailed descriptions of the verification processes and guarantees.
- Added new sections in the README and various documentation files to clarify the TEE model tagging and its implications for consumers and providers.
- Incremented version number to v7.0.0 in configuration files and updated relevant documentation links.
- Improved CI/CD pipeline documentation to outline the automated verification processes for TEE models.

This release ensures that consumers using v6.0.0+ can seamlessly interact with v7.0.0+ providers without requiring client-side upgrades, enhancing security and trust in the Morpheus network.
#710)

## Summary

This PR brings three things into `dev` to complete the **v7.0.0 (full
TEE) release** prep:

1. **CI/CD fix (#701 equivalent)** — Fix for ECS deploy wait timing that
was premature-failing the post-deploy healthcheck (`8b341f3`). `#701` on
`dev` is the squashed version; this branch adds nothing new there.
2. **v7.0.0 version bump + funding-account update** (`71fff3d`) — Bumps
`VMAJ_NEW` to `7` in `.github/workflows/build.yml` and updates the
funding account in
`smart-contracts/deploy/data/config_base_mainnet.json`. This is the
major-version bump that marks "full TEE capability".
3. **Docs update for v7.0.0** (`fec3f92`) — Comprehensive documentation
pass to accurately reflect the shipped Phase 1 + Phase 2 TEE trust
chain.

## Doc changes (v7.0.0)

Consistently clarifies the **two-hop trust chain** across all public and
internal docs:

```
C-Node (v6.0.0+)  ──Phase 1──▶  P-Node -tee image (v7.0.0+)  ──Phase 2──▶  Backend LLM
                  consumer                                     P-Node
                  verifies                                     verifies its
                  P-Node                                       own backend
```

**Key correctness fix:** earlier drafts described the consumer as
performing Phase 2 backend verification. This is wrong — **Phase 2 runs
entirely inside the P-Node**; the consumer never talks to the backend.
This means **v6.0.0+ consumers are forward-compatible with v7.0.0+
providers** and get Phase 2 guarantees transparently, with no
client-side upgrade needed. Every doc now emphasizes this.

### User-facing docs updated
- `readme.md` — v7.0.0 release callout with the two-hop flow and
forward-compat note
- `docs/02.3-proxy-router-tee.md` — rewrote "What This Guarantees (and
What It Doesn't)" with explicit Phase 1 / Phase 2 / remaining-gaps
sections
- `docs/02.4-proxy-router-secretvm-quickstart.md` — rewrote "What
Consumers See, and What Your P-Node Does" as two distinct hops + v7
troubleshooting rows
- `docs/03-provider-offer.md` — clarified the `tee` on-chain tag drives
both hops
- `docs/models-config.json.md` — noted that `isTee` is no longer a local
config field (tag-driven)
- `docs/proxy-router.all.env` — documented new TEE env vars:
`TEE_PORTAL_URL`, `TEE_IMAGE_REPO`, `ARTIFACT_REGISTRY_URL`,
`ARTIFACT_REGISTRY_REFRESH_INTERVAL`

### Developer reference updated
- `proxy-router/docs/tee-backend-verification.md` — added "Where this
fits in the trust chain" section clarifying this doc is **Phase 2
only**, entirely inside the P-Node

### Internal `.ai-docs` updated (technical record of "how")
- `.ai-docs/TEE_Attestation_Architecture.md` — status bumped to v2.0,
Phase 1/1c/2a marked DONE with real file paths and PR numbers, new §7.7
with the full Phase 2 technical write-up (attestation endpoints, full
AttestBackend sequence, fast-verify semantics, reportData layout), open
questions resolved
- `.ai-docs/TEE_CICD_Supply_Chain_Hardening.md` — added v7.0.0 banner,
replaced placeholder diagram with two completed diagrams for Phase 1c
and Phase 2, reorganized status tables

## Test plan

- [x] Clean `ort`-strategy merge from `origin/dev` into this branch (no
conflicts, 11-file delta)
- [x] `GOOS=linux GOARCH=amd64 go build -ldflags='-w' -o
/tmp/proxy-router-linux ./cmd/` succeeds (67 MB binary, exit 0)
- [x] No new `go vet` warnings introduced by the merge (all warnings are
pre-existing on `dev`)
- [x] `grep` verification: no stale `isTee` references, no "consumer
verifies backend" misstatements, no `v6.0.0+` references where `v7.0.0+`
is meant
- [ ] CI/CD pipeline runs green (ECS deploy + post-deploy RTMR3
verification)
- [ ] Reviewer spot-check of the new
`.ai-docs/TEE_Attestation_Architecture.md` §7.7 against the actual
`proxy-router/internal/attestation/` code

## Release notes

This is the v7.0.0 release bump. Downstream merges:
1. This PR: `fix/cicd-ecs-deploy-wait-timing` → `dev`
2. Follow-up: `dev` → `test` (separate PR, opens immediately after this
merges)
3. After test validation: `test` → `main` for the v7.0.0 tag

Made with [Cursor](https://cursor.com)
…711)

## Summary

Propagates the v7.0.0 release prep from `dev` to `test` for validation
ahead of the `test → main` cut.

Four commits from `dev` (including the freshly-merged #710):

- `0fb81ec` — merge commit for #710: fix(cicd) ECS deploy wait timing +
v7.0.0 release docs + version bump
- `fec3f92` — Enhance TEE capabilities and documentation for v7.0.0
release
- `e81f41d` — Merge origin/dev into fix/cicd-ecs-deploy-wait-timing
(brings in #701/#703/#705/#708 history via the merge)
- `71fff3d` — chore: update version number and funding account in
configuration (VMAJ_NEW → 7, mainnet funding account)

## What reviewers should focus on

1. **`VMAJ_NEW=7` in `.github/workflows/build.yml`** — this is the
major-version bump that will tag the next test-branch build as
`v7.x.x-beta` and, on promotion to main, as `v7.0.0`.
2. **`config_base_mainnet.json` funding-account change** — please
confirm the address is the intended one before this reaches main.
3. **`readme.md` v7.0.0 callout** — verifies the two-hop trust chain
story reads correctly to a new reader (C-Node v6+ → P-Node v7+ → Backend
LLM).
4. **`docs/02.3-proxy-router-tee.md`** and
**`docs/02.4-proxy-router-secretvm-quickstart.md`** — the Phase 1 /
Phase 2 split. Especially the "What Consumers See, and What Your P-Node
Does" section in 02.4, which was the section we reworked most heavily to
correct the earlier wording that implied the consumer runs Phase 2.
5. **`.ai-docs/TEE_Attestation_Architecture.md` §7.7** — new technical
write-up of the actual Phase 2 implementation (file paths, attestation
endpoints, full sequence). Please sanity-check against
`proxy-router/internal/attestation/` code.

## Trust-chain clarification (for reviewers)

The key correctness fix in the docs:

```
C-Node (v6.0.0+)  ──Phase 1──▶  P-Node -tee image (v7.0.0+)  ──Phase 2──▶  Backend LLM
                  consumer                                     P-Node
                  verifies                                     verifies its
                  P-Node                                       own backend
```

- **Phase 1** (shipped in v6.x, unchanged in v7): consumer's
proxy-router verifies the provider's P-Node attestation (CPU quote, TLS
binding, RTMR3 of the `-tee` image) at session open and every prompt.
- **Phase 2** (new in v7.0.0, shipped in #699): the **P-Node itself**
verifies the backend LLM — CPU quote, TLS pinning, RTMR3 replay of the
backend's `docker-compose.yaml`, CPU-GPU nonce binding, NVIDIA NRAS — at
startup and every prompt.
- The on-chain `tee` model tag is the single switch that enables both
hops.
- **v6.0.0+ consumers are forward-compatible with v7.0.0+ providers**
and get Phase 2 guarantees transparently (Phase 2 is entirely inside the
P-Node binary they already attested in Phase 1). No client-side upgrade
is required for Phase 2.

## Test plan

- [ ] CI/CD builds v7.x.x-beta tag for the test branch
- [ ] Post-deploy RTMR3 verification against the live SecretVM test
instance passes
- [ ] `Deploy-SecretVM-Test` job ECS-wait timing no longer
premature-fails
- [ ] Spot-check `GET /v1/models/attestation` on a deployed test P-Node
shows per-model Phase 2 state
- [ ] Reviewer walks through the v7 docs cold (without this PR's
context) and confirms the two-hop trust chain is clearly explained

## Downstream

After validation on `test`, plan to promote with a `test → main` PR that
tags v7.0.0.

Made with [Cursor](https://cursor.com)
abs2023 and others added 3 commits April 22, 2026 13:53
…ehydration hold

Prevents the mid-deploy "messy outage" we've seen when the C-Node and P-Node
restart simultaneously and the C-Node BadgerDB ends up referencing sessions
whose upstream providers have just cycled.

Changes to .github/workflows/build.yml:

- New job `Drain-Morpheus-C-Node`: discovers the C-Node service's target
  groups and deregisters all current targets before any node restarts.
  connection_termination=true on the router TGs closes live TCP sessions
  promptly so upstream failures arrive as clean 503s instead of half-dead
  mid-stream hangs.

- `Deploy-to-Morpheus-P-Node` now depends on `Drain-Morpheus-C-Node` so the
  C-Node NLB is quiet before provider traffic is disrupted.

- `Deploy-to-Morpheus-C-Node` now depends on both the drain job and the
  P-Node deploy (skipped on `test`, which is still treated as success by
  GitHub Actions dependency resolution).

- The C-Node deploy step itself is rewritten to:
    1. Register the new task def with `--health-check-grace-period-seconds 600`
       so the ECS deployment circuit breaker tolerates the deregister window.
    2. Poll until a task on the NEW task definition is RUNNING and has an ENI IP.
    3. Deregister that IP from every target group.
    4. Sleep `cnode_rehydration_wait_secs` (default 90s) so the proxy-router's
       BadgerDB rehydration loop can catch up from on-chain state (ephemeral
       BadgerDB on prd, EFS-backed on dev).
    5. Re-register the IP to every TG and wait for `target-in-service`.
    6. Fall through to the existing public `/healthcheck` version-match loop.

- New workflow_dispatch input `cnode_rehydration_wait_secs` (default "90")
  lets operators tune the hold window per run.

Companion changes (separate PR in Morpheus-Infra):
- GitHub Actions IAM policy now grants ELBv2 Describe + Register/Deregister
  on Target Groups so the drain/register steps have the permissions they need.
- Planning doc CICD_HA_IMPROVEMENTS_PLAN.md captures the deferred items
  (P-Node HA, graceful shutdown, API GW retry tuning, persistent BadgerDB
  promotion) that this change does NOT address.

Made-with: Cursor
…ehydration hold (#713)

## Summary

Rewrites the C-Node / P-Node deployment sequencing in `build.yml` so the
C-Node NLB is fully drained **before** any provider restart, and so the
new C-Node task is held out of the load balancer for a configurable
rehydration window after it comes up.

Addresses the recurring "messy 5-minute" outage pattern where
simultaneous C-Node and P-Node restarts leave the C-Node's BadgerDB with
orphaned session state and require manual cleanup.

## Changes

### New job: `Drain-Morpheus-C-Node`
- Discovers the C-Node service's target groups dynamically (no
hard-coded ARNs).
- Calls `elbv2 deregister-targets` on every current target.
- Waits 45s for deregistration to propagate across AZs. The router TGs
have `connection_termination=true` and `deregistration_delay=0`, so live
TCP sessions close promptly.
- Exports `tg_arns` + `drained_ips` as outputs for the C-Node deploy
job.

### Reworked `Deploy-to-Morpheus-C-Node`
Now depends on both the drain job and the P-Node deploy, and the
`update-service` call is followed by a controlled-traffic sequence:

1. `aws ecs update-service --health-check-grace-period-seconds 600` so
the ECS deployment circuit breaker tolerates the deregister window.
2. Poll (up to 15 min) until a task running the **new** task definition
is `RUNNING` with an ENI IP.
3. Deregister that IP from every target group.
4. Sleep `cnode_rehydration_wait_secs` (default 90s) to let the
proxy-router's BadgerDB rehydration loop catch up from on-chain state.
Important for prd where BadgerDB is ephemeral; harmless for dev where
it's EFS-backed.
5. Re-register the IP to every TG, then `aws elbv2 wait
target-in-service` per TG.
6. Fall through to the existing public `/healthcheck` version-match
loop.
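
Steps 3 to 5 form a hold-out sequence whose ordering is the important part. The workflow implements it with AWS CLI calls in bash; the Go sketch below only models the ordering against a fake, and the `targetGroups` interface, `holdOutAndReturn`, and `recorder` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// targetGroups abstracts the load balancer operations used by the sequence.
type targetGroups interface {
	Deregister(ip string) error
	Register(ip string) error
	WaitInService(ip string) error
}

// holdOutAndReturn keeps a freshly started task out of the load balancer for
// holdSecs, then re-registers it and waits until it is in service.
func holdOutAndReturn(tg targetGroups, ip string, holdSecs int, sleep func(secs int)) error {
	if err := tg.Deregister(ip); err != nil {
		return fmt.Errorf("deregister %s: %w", ip, err)
	}
	sleep(holdSecs) // rehydration window
	if err := tg.Register(ip); err != nil {
		return fmt.Errorf("register %s: %w", ip, err)
	}
	if err := tg.WaitInService(ip); err != nil {
		return fmt.Errorf("wait in-service %s: %w", ip, err)
	}
	return nil
}

// recorder is a fake that records the order of calls.
type recorder struct{ steps []string }

func (r *recorder) Deregister(string) error    { r.steps = append(r.steps, "deregister"); return nil }
func (r *recorder) Register(string) error      { r.steps = append(r.steps, "register"); return nil }
func (r *recorder) WaitInService(string) error { r.steps = append(r.steps, "wait"); return nil }
func (r *recorder) trace() string              { return strings.Join(r.steps, ",") }

func main() {
	r := &recorder{}
	err := holdOutAndReturn(r, "10.0.0.5", 90, func(secs int) {
		r.steps = append(r.steps, fmt.Sprintf("hold %ds", secs))
	})
	fmt.Println(r.trace(), err)
}
```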

### `Deploy-to-Morpheus-P-Node`
Gated behind `Drain-Morpheus-C-Node` so the C-Node NLB is already quiet
before the provider restarts and briefly takes hosted models offline.

### New workflow_dispatch input
- `cnode_rehydration_wait_secs` (default `"90"`): operator-tunable hold
window.
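
Operator-supplied inputs like this one are typically validated before use. The sketch below shows one way to guard against non-numeric values; falling back to the default of 90 on bad input is an assumption about the workflow's behavior, and `rehydrationWaitSecs` is a hypothetical name.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// rehydrationWaitSecs parses the workflow input and falls back to the default
// when the value is empty, non-numeric, or negative.
func rehydrationWaitSecs(input string) int {
	const defaultSecs = 90
	n, err := strconv.Atoi(strings.TrimSpace(input))
	if err != nil || n < 0 {
		return defaultSecs
	}
	return n
}

func main() {
	fmt.Println(rehydrationWaitSecs("120")) // 120
	fmt.Println(rehydrationWaitSecs("abc")) // 90
}
```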

### Titan deploy
Unchanged — still runs independently and in parallel with the P-Node
after container build, as agreed.

## Expected behavior post-merge (prd)

| Phase | Duration | What the user sees |
|---|---|---|
| Drain | ~45s | API GW upstream errors (503), clean TCP close |
| P-Node restart | ~2-3 min | Hosted-model sessions briefly unavailable (single-provider SPOF, known) |
| C-Node old task stop | ~2 min | Same (NLB already empty) |
| C-Node new task up | ~30s | Task running, still deregistered |
| Rehydration hold | 90s | Task rehydrating BadgerDB from chain, still deregistered |
| Re-register + healthy | ~90s | NLB targets come healthy, API GW begins forwarding again |

Total: ~6-8 min of deliberate, bounded outage instead of the previous 5
min of messy failure followed by 30 min of manual recovery.

## Companion changes (separate repo)

`Morpheus-Infra`:
- `01_iam_role_gh_actions.tf`: grants `ELBv2ReadOnly` +
`ELBv2TargetRegistration` so the new drain/register steps can
authenticate.
- `.ai-docs/CICD_HA_IMPROVEMENTS_PLAN.md`: captures deferred follow-ups
(persistent BadgerDB promotion, regional P-Node HA, `/readyz` split,
graceful shutdown, API GW retry tuning).

Both IAM policies have already been applied to dev + prd via terraform.

## Test plan

- [x] YAML syntax validated with `yaml.safe_load`
- [x] Go build of proxy-router on current HEAD passes (cross-compiled
linux/amd64)
- [ ] Merge to `dev` → exercise `test` branch deploy (which runs only
the C-Node path — P-Node job is `needs`-skipped for `test`) and confirm:
  - Drain job succeeds and lists target groups
  - New C-Node task registers a new task-def ENI IP
  - Deregister → 90s hold → re-register cycle completes without ECS
    rollback
  - `/healthcheck` version-match passes
- [ ] On prd (next release), confirm P-Node job runs **after** drain,
and that the C-Node completes the rehydration-hold sequence before
returning traffic.
- [ ] Observe v6.1.x → v7.0.x deploy cleanly with no manual BadgerDB
intervention.

Made with [Cursor](https://cursor.com)
…ehydration hold (#714)

## Summary

Promotes #713 from `dev` → `test` so we can exercise the new deployment
sequencing against the `dev` infrastructure before cutting `v7.0.0` to
`main`.

## What's in this PR

Only one commit ahead of `test`:

- **`68e39db` — fix(cicd): sequence C-Node drain, P-Node, then C-Node
redeploy with rehydration hold** (merged via #713)

Everything else that was previously in `dev` (TEE Phase 2, docs, version
bump to 7, ECS wait-timing fix) is already live on `test` via #710.

## What changes on merge to test

The `test` branch only runs the C-Node deploy path (no P-Node exists in
`dev` infra). The reworked workflow will still exercise:

1. `Drain-Morpheus-C-Node` job — discovers `test`/dev-env TGs and
deregisters current targets.
2. `Deploy-to-Morpheus-P-Node` is `needs`-skipped; dependency resolution
treats it as success.
3. `Deploy-to-Morpheus-C-Node`:
   - `update-service` with `--health-check-grace-period-seconds 600`
   - poll new task ENI IP
   - deregister IP from TGs
   - sleep 90s (default `cnode_rehydration_wait_secs`)
   - re-register + `wait target-in-service`
   - public `/healthcheck` version-match (existing logic)
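
The deregister → hold → re-register part of step 3 can be sketched roughly as below. This is a minimal illustration, not the actual shell block in `build.yml`: the function names `each_target_group` and `controlled_reregister` and the argument layout (alternating target-group ARN and port) are assumptions made for the example. Only the `aws elbv2` subcommands themselves are real CLI calls.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical helper: run one `aws elbv2` subcommand for a target IP in
# every target group.
# Arguments: <subcommand> <task-ip>, then alternating <target-group-arn> <port>.
each_target_group() {
  local action="$1" ip="$2"
  shift 2
  while [ "$#" -ge 2 ]; do
    # $action is left unquoted on purpose so "wait target-in-service"
    # splits into two words.
    aws elbv2 $action --target-group-arn "$1" --targets "Id=${ip},Port=$2"
    shift 2
  done
}

# Hypothetical helper: keep a freshly started task out of rotation for
# <hold-seconds>, then return it to every target group and wait for health.
# Arguments: <task-ip> <hold-seconds>, then alternating <target-group-arn> <port>.
controlled_reregister() {
  local ip="$1" hold_secs="$2"
  shift 2
  each_target_group deregister-targets "$ip" "$@"
  sleep "$hold_secs"   # rehydration hold
  each_target_group register-targets "$ip" "$@"
  each_target_group "wait target-in-service" "$ip" "$@"
}
```

A call might look like `controlled_reregister "$TASK_IP" 90 "$TG_API_ARN" 8082 "$TG_RPC_ARN" 8545`, where the variable names and the port-to-target-group mapping are placeholders rather than values taken from the workflow.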

**Note on dev env:** the dev C-Node currently has `switches.efs_storage
= true` (persistent BadgerDB on EFS). This means the 90s hold is
technically overkill for dev — rehydration has no work to do since state
is persistent. That's fine; the hold is a no-op and lets us verify the
new workflow paths behave correctly in dev before exercising them on prd
(where BadgerDB is ephemeral and the hold matters).

## What to review

- [ ] Workflow YAML structure (parallelism, `needs`, `if` gates)
- [ ] The two shell blocks (`Drain-Morpheus-C-Node` + the
controlled-redeploy section in `Deploy-to-Morpheus-C-Node`) for
correctness and idempotency
- [ ] Confirm the `cnode_rehydration_wait_secs` default and the
guard-against-non-numeric logic
- [ ] Verify IAM: dev IAM policy already includes `ELBv2ReadOnly` +
`ELBv2TargetRegistration` (applied from `Morpheus-Infra`)
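
For reviewers of the non-numeric guard, one possible shape of such a check is sketched here. The function name `normalize_wait_secs` is hypothetical, and the fallback of 90 is taken from the default mentioned above; the real guard in the workflow may be written differently.

```shell
#!/usr/bin/env bash

# Hypothetical guard: print the hold duration in seconds, falling back to the
# default when the input is empty or is not a plain non-negative integer.
normalize_wait_secs() {
  local raw="${1:-}" default_secs=90
  case "$raw" in
    '' | *[!0-9]*) printf '%s\n' "$default_secs" ;;  # empty or contains a non-digit
    *) printf '%s\n' "$raw" ;;
  esac
}
```

With this shape, inputs such as `abc`, `-5`, or `1.5` fall back to the default, while `120` passes through unchanged.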

## Test plan (after you merge to test)

1. Merge triggers the deploy workflow against `dev` env.
2. Confirm `Drain-Morpheus-C-Node` job lists the correct TGs for
`svc-dev-router` and successfully deregisters.
3. Confirm `Deploy-to-Morpheus-C-Node`:
   - finds the new task with a fresh task-definition ARN
   - resolves an ENI IP
   - deregister → 90s hold → re-register completes
   - `target-in-service` confirms healthy
   - public `/healthcheck` at `router.dev.mor.org:8082` returns the new
     version
4. Observe the dev C-Node's BadgerDB during the hold — should be a no-op
(EFS-persistent), confirming the workflow is benign when rehydration
isn't needed.
5. If green, cut `v7.0.0` PR from `test` → `main` to exercise the full
sequence on prd.

## Related

- Companion IAM + planning changes in `Morpheus-Infra` (already applied
to dev + prd).
- Deferred follow-ups captured in
`Morpheus-Infra/.ai-docs/CICD_HA_IMPROVEMENTS_PLAN.md`.

Made with [Cursor](https://cursor.com)
abs2023 and others added 3 commits April 22, 2026 16:07
…deploy needs

Root cause of the first v7 test-branch run (GH Actions run 24796856145):
the drain job ran and deregistered the dev C-Node from both NLB target
groups, then Deploy-to-Morpheus-C-Node was silently skipped because its
implicit success() guard propagated the skip from Deploy-to-Morpheus-P-Node
(which is main-only and intentionally skips on test pushes). That left
dev with the old C-Node task running in ECS but with no load-balancer
membership until we manually re-registered it.

Fix:

- Add an explicit `if` to Deploy-to-Morpheus-C-Node that:
  - Requires GHCR build + drain success.
  - Accepts Deploy-to-Morpheus-P-Node.result in {success, skipped}.
  - Wraps in `!cancelled()` so manual cancels still short-circuit.
- Rewrite the inline comment that previously (incorrectly) claimed
  skipped jobs are treated as successful for dependency resolution;
  they are not.

No change to the deploy logic itself, the drain job, or any other
workflow sequencing. This is a pure `needs` / guard correctness fix.

Made-with: Cursor
…deploy needs (#715)

## Summary

Hotfix for the first v7 `test`-branch deployment attempt ([Actions run
#1447 / run id
24796856145](https://github.com/MorpheusAIs/Morpheus-Lumerin-Node/actions/runs/24796856145)).

The previous PR (#714, carrying #713) introduced a new job sequence:

```
Drain-Morpheus-C-Node → Deploy-to-Morpheus-P-Node → Deploy-to-Morpheus-C-Node
```

On push to `test`, `Deploy-to-Morpheus-P-Node` is correctly **skipped**
(its own `if` restricts it to `main` — no dev P-Node exists). That skip
then propagated onto `Deploy-to-Morpheus-C-Node` through the implicit
`success()` guard that every job has when no explicit `if` tolerates a
skipped dependency. Net effect:

- `Drain-Morpheus-C-Node` ✅ removed the running dev C-Node task from
both NLB target groups.
- `Deploy-to-Morpheus-C-Node` ❌ silently skipped — no new task
registered, no re-register of the existing task.
- Dev C-Node endpoints (`router.dev.mor.org:8082`/`:8545`) stopped
serving traffic until we manually re-registered the old task to the TGs.

## What this PR changes

`.github/workflows/build.yml`, `Deploy-to-Morpheus-C-Node` job only:

### Explicit `if` guard that tolerates the P-Node skip

```yaml
if: |
  !cancelled() &&
  github.repository == 'MorpheusAIs/Morpheus-Lumerin-Node' &&
  needs.GHCR-Build-and-Push.result == 'success' &&
  needs.Drain-Morpheus-C-Node.result == 'success' &&
  (needs.Deploy-to-Morpheus-P-Node.result == 'success' || needs.Deploy-to-Morpheus-P-Node.result == 'skipped') &&
  (
    (github.event_name == 'push' && (github.ref == 'refs/heads/main' || github.ref == 'refs/heads/test')) ||
    (github.event_name == 'workflow_dispatch' && github.event.inputs.build_all_os == 'true' && github.event.inputs.create_deployment == 'true')
  )
```

- Still requires real success from the drain + GHCR jobs.
- Explicitly accepts `success` OR `skipped` for the P-Node dependency.
- `!cancelled()` short-circuits the job if someone cancels the run, so
we don't try to redeploy a half-cancelled sequence.

### Comment rewrite

The previous inline comment claimed skipped jobs are treated as
successful for dependency resolution. That's the opposite of how GitHub
Actions actually behaves — the note is replaced with an accurate
explanation and a reference to this incident so future-us (or other
maintainers) won't step on the same rake.

## What this PR does NOT change

- No change to the drain job or the controlled-traffic C-Node deploy
sequence (register→dereg→hold→rereg→wait).
- No change to the P-Node job or its `if` gating.
- No change to any other workflow or deploy script.

This is a pure guard-correctness fix.

## Test plan

- [x] YAML syntax validated (`yaml.safe_load`).
- [x] Manually walked through `needs` outcomes:
  - On push to `test`: P-Node `skipped`, drain + GHCR `success` → C-Node
    **runs**. ✅
  - On push to `main`: all three `success` → C-Node **runs**. ✅
  - Drain fails → C-Node **skipped** (won't redeploy while the TG state is
    ambiguous). ✅
  - GHCR fails → C-Node **skipped**. ✅
  - P-Node fails on main (not skipped) → C-Node **skipped** (we want to
    bail rather than leave prd with a v-mismatch between providers and the
    consumer). ✅
- [ ] Merge to `dev`, promote to `test`, confirm
`Deploy-to-Morpheus-C-Node` actually runs this time and completes the
drain → hold → re-register → /healthcheck cycle end-to-end.
- [ ] Once green on test, cut `test` → `main` for first prd exercise.
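
The `needs` walk-through above can also be replayed locally with a small model of the guard. This is only a sketch: `cnode_should_run` is a hypothetical helper that mirrors the `needs.*.result` and `!cancelled()` terms of the `if` expression and deliberately leaves out the repository and event/branch checks.

```shell
#!/usr/bin/env bash

# Hypothetical model of the needs-related part of the C-Node deploy guard.
# Returns 0 when the job would run and 1 when it would be skipped.
# Arguments: <ghcr-result> <drain-result> <pnode-result> [cancelled: true|false]
cnode_should_run() {
  local ghcr="$1" drain="$2" pnode="$3" cancelled="${4:-false}"
  [ "$cancelled" = "false" ] || return 1   # !cancelled()
  [ "$ghcr" = "success" ] || return 1      # GHCR build must really succeed
  [ "$drain" = "success" ] || return 1     # drain must really succeed
  case "$pnode" in
    success | skipped) return 0 ;;         # P-Node may succeed or be skipped
    *) return 1 ;;
  esac
}
```

For example, `cnode_should_run success success skipped` returns 0 (the `test` push case), while `cnode_should_run success success failure` returns 1 (a P-Node failure on `main`).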

## Related

- Previous PR: #714 (dev → test) carrying #713 (initial
drain/sequence/hold).
- Companion IAM + planning doc already applied in `Morpheus-Infra`.

Made with [Cursor](https://cursor.com)
…deploy needs (#716)

## Summary

Promotes the #715 hotfix from `dev` → `test` so we can retry the v7 test
deploy and actually exercise the new drain / hold / re-register sequence
end-to-end.

## What's in this PR

Single-commit delta ahead of `test`:

- **`2b14c58` — fix(cicd): allow skipped Deploy-to-Morpheus-P-Node to
satisfy C-Node deploy needs** (merged via #715)

Everything else in `dev` is already on `test` from #714.

## Background

First v7 test run ([Actions run #1447 / run id
24796856145](https://github.com/MorpheusAIs/Morpheus-Lumerin-Node/actions/runs/24796856145))
revealed a correctness gap in the workflow from #713/#714:

- `Drain-Morpheus-C-Node` ran and successfully drained the dev C-Node
targets from both NLB target groups.
- `Deploy-to-Morpheus-P-Node` was correctly skipped (main-only).
- `Deploy-to-Morpheus-C-Node` was **silently skipped** because the
implicit `success()` guard propagated the skip from the P-Node job —
GitHub Actions treats skipped `needs` as non-success, contrary to the
inline comment in the previous PR.
- Dev C-Node remained functional in ECS but without LB membership until
manually re-registered.

## Fix (this PR carries)

Explicit `if` guard on `Deploy-to-Morpheus-C-Node` that:

- Requires real success from `GHCR-Build-and-Push` and
`Drain-Morpheus-C-Node`.
- Accepts either `success` or `skipped` as the outcome for
`Deploy-to-Morpheus-P-Node`.
- Wraps everything in `!cancelled()` to short-circuit on manual cancel.

The misleading comment is replaced with an accurate explanation
referencing this incident.

## Expected behavior after merge to test

On push to `test` the sequence becomes:

1. Build + test + GHCR push (~10–20 min).
2. `Drain-Morpheus-C-Node` ✅ deregisters the existing dev C-Node
targets.
3. `Deploy-to-Morpheus-P-Node` skipped (expected).
4. `Deploy-to-Morpheus-C-Node` ✅ **runs this time** because the new `if`
accepts the P-Node skip:
   - `update-service` with `--health-check-grace-period-seconds 600`.
   - Poll new task ENI IP.
   - Deregister IP from TGs.
   - Sleep 90s (`cnode_rehydration_wait_secs`).
   - Re-register IP, `wait target-in-service`.
   - Public `/healthcheck` version match on `router.dev.mor.org:8082`.
5. `Deploy-to-Titan` + `Deploy-TEE-SecretVM-test` run in parallel to the
deploy sequence (unchanged).

On dev the rehydration hold is effectively a no-op because EFS-backed
BadgerDB is persistent there — perfect for a first validation of the
workflow plumbing without depending on rehydration correctness.

## Review focus

- The new `if` block on `Deploy-to-Morpheus-C-Node` (lines ~1298–1320 of
`build.yml`).
- Reasoning captured in the inline comment.
- No changes to deploy logic, drain logic, or sequencing — only the
guard.

## Test plan (after you merge to test)

1. Merge triggers the deploy workflow against dev.
2. Confirm `Deploy-to-Morpheus-C-Node` enters `in_progress` (not
skipped) after the drain completes.
3. Watch the controlled-traffic sequence:
   - New task gets an ENI IP.
   - `deregister-targets` runs.
   - 90s hold.
   - `register-targets` + `wait target-in-service` succeed.
   - `/healthcheck` version-match passes on the new image tag.
4. If green, proceed with `test` → `main` for the first prd exercise
(existing open PR).

## Related

- Previous merges on the new CICD flow: #713, #714, #715.
- Companion IAM + planning doc already applied in `Morpheus-Infra`.

Made with [Cursor](https://cursor.com)
@nomadicrogue nomadicrogue merged commit 4c42883 into main Apr 23, 2026
34 checks passed
morpheusrogue added a commit to absgrafx/Morpheus-Lumerin-Node that referenced this pull request Apr 23, 2026
…station

Brings in upstream's v7.0.0 release (MorpheusAIs#712):
- Phase 2 provider-side backend LLM attestation (TDX quote, TLS pinning,
  RTMR3 workload replay, CPU-GPU nonce binding, NVIDIA NRAS GPU attestation,
  per-prompt fast verify)
- Single on-chain "tee" tag drives both hops; local isTee flag retired
- request_id propagation across inference/attestation log paths
- Per-entry Badger activity keys for session storage GC reclaim
- ECS deploy / CI-CD wait-timing hardening, docs rewrite, swagger updates
- Major version bump to 7

Conflict resolved in proxy-router/internal/blockchainapi/service.go:
OpenSession keeps the fork's nil-guard on authConfig (for mobile SDK use)
alongside upstream's new log := s.requestLog(ctx) binding, and our
execution-reverted retry loop was switched to use the request-scoped logger.

Verified: proxy-router builds cleanly and our touched packages
(chatstorage, storages, proxyapi, mobile) pass their unit tests. The
remaining pre-existing failures (attestation fixture/network tests,
TestRating, vet warnings) are inherited unchanged from upstream v7.0.0.

Made-with: Cursor