merge: dev to test — v7.0.0 release (ECS deploy fix + full TEE docs)#711
Merged
- Incremented the major version number from 6 to 7 in the build workflow.
- Updated the funding account address in the mainnet configuration file.
- Introduced full two-hop Trusted Execution Environment (TEE) verification: Phase 1 (consumer to P-Node) and Phase 2 (P-Node to backend LLM).
- Updated documentation to reflect new TEE features, including detailed descriptions of the verification processes and guarantees.
- Added new sections in the README and various documentation files to clarify the TEE model tagging and its implications for consumers and providers.
- Incremented version number to v7.0.0 in configuration files and updated relevant documentation links.
- Improved CI/CD pipeline documentation to outline the automated verification processes for TEE models.

This release ensures that consumers using v6.0.0+ can seamlessly interact with v7.0.0+ providers without requiring client-side upgrades, enhancing security and trust in the Morpheus network.
#710)

## Summary

This PR brings three things into `dev` to complete the **v7.0.0 (full TEE) release** prep:

1. **CI/CD fix (#701 equivalent)**: Fix for ECS deploy wait timing that was premature-failing the post-deploy healthcheck (`8b341f3`). `#701` on `dev` is the squashed version; this branch adds nothing new there.
2. **v7.0.0 version bump + funding-account update** (`71fff3d`): Bumps `VMAJ_NEW` to `7` in `.github/workflows/build.yml` and updates the funding account in `smart-contracts/deploy/data/config_base_mainnet.json`. This is the major-version bump that marks "full TEE capability".
3. **Docs update for v7.0.0** (`fec3f92`): Comprehensive documentation pass to accurately reflect the shipped Phase 1 + Phase 2 TEE trust chain.

## Doc changes (v7.0.0)

Consistently clarifies the **two-hop trust chain** across all public and internal docs:

```
C-Node (v6.0.0+) ──Phase 1──▶ P-Node -tee image (v7.0.0+) ──Phase 2──▶ Backend LLM
                   consumer                                 P-Node
                   verifies                                 verifies its
                   P-Node                                   own backend
```

**Key correctness fix:** earlier drafts described the consumer as performing Phase 2 backend verification. This is wrong: **Phase 2 runs entirely inside the P-Node**; the consumer never talks to the backend. This means **v6.0.0+ consumers are forward-compatible with v7.0.0+ providers** and get Phase 2 guarantees transparently, with no client-side upgrade needed. Every doc now emphasizes this.

### User-facing docs updated

- `readme.md`: v7.0.0 release callout with the two-hop flow and forward-compat note
- `docs/02.3-proxy-router-tee.md`: rewrote "What This Guarantees (and What It Doesn't)" with explicit Phase 1 / Phase 2 / remaining-gaps sections
- `docs/02.4-proxy-router-secretvm-quickstart.md`: rewrote "What Consumers See, and What Your P-Node Does" as two distinct hops + v7 troubleshooting rows
- `docs/03-provider-offer.md`: clarified the `tee` on-chain tag drives both hops
- `docs/models-config.json.md`: noted that `isTee` is no longer a local config field (tag-driven)
- `docs/proxy-router.all.env`: documented new TEE env vars: `TEE_PORTAL_URL`, `TEE_IMAGE_REPO`, `ARTIFACT_REGISTRY_URL`, `ARTIFACT_REGISTRY_REFRESH_INTERVAL`

### Developer reference updated

- `proxy-router/docs/tee-backend-verification.md`: added "Where this fits in the trust chain" section clarifying this doc is **Phase 2 only**, entirely inside the P-Node

### Internal `.ai-docs` updated (technical record of "how")

- `.ai-docs/TEE_Attestation_Architecture.md`: status bumped to v2.0, Phase 1/1c/2a marked DONE with real file paths and PR numbers, new §7.7 with the full Phase 2 technical write-up (attestation endpoints, full AttestBackend sequence, fast-verify semantics, reportData layout), open questions resolved
- `.ai-docs/TEE_CICD_Supply_Chain_Hardening.md`: added v7.0.0 banner, replaced placeholder diagram with two completed diagrams for Phase 1c and Phase 2, reorganized status tables

## Test plan

- [x] Clean `ort`-strategy merge from `origin/dev` into this branch (no conflicts, 11-file delta)
- [x] `GOOS=linux GOARCH=amd64 go build -ldflags='-w' -o /tmp/proxy-router-linux ./cmd/` succeeds (67 MB binary, exit 0)
- [x] No new `go vet` warnings introduced by the merge (all warnings are pre-existing on `dev`)
- [x] `grep` verification: no stale `isTee` references, no "consumer verifies backend" misstatements, no `v6.0.0+` references where `v7.0.0+` is meant
- [ ] CI/CD pipeline runs green (ECS deploy + post-deploy RTMR3 verification)
- [ ] Reviewer spot-check of the new `.ai-docs/TEE_Attestation_Architecture.md` §7.7 against the actual `proxy-router/internal/attestation/` code

## Release notes

This is the v7.0.0 release bump. Downstream merges:

1. This PR: `fix/cicd-ecs-deploy-wait-timing` → `dev`
2. Follow-up: `dev` → `test` (separate PR, opens immediately after this merges)
3. After test validation: `test` → `main` for the v7.0.0 tag

Made with [Cursor](https://cursor.com)
nomadicrogue approved these changes on Apr 22, 2026
nomadicrogue added a commit that referenced this pull request on Apr 23, 2026
…#712)

## Overview

**v7.0.0: Full TEE capability.** This release completes the Morpheus TEE trust chain with **Phase 2: Provider-side backend LLM attestation**.

v6.x closed the loop between the consumer and the provider's proxy-router (P-Node). v7.0.0 closes the next hop: the P-Node now cryptographically verifies its **own backend LLM** (CPU TDX quote, TLS pinning, workload RTMR3 replay, CPU-GPU nonce binding, NVIDIA NRAS GPU attestation) at startup and on **every inference prompt**.

The major version bump to **7.0.0** marks the completion of the full two-hop TEE verification chain. It is also designed for **seamless forward-compatibility**: v6.0.0+ consumers automatically benefit from Phase 2 guarantees when they connect to v7.0.0+ providers, with no client-side upgrade required.

```
C-Node (v6.0.0+) ──Phase 1──▶ P-Node -tee image (v7.0.0+) ──Phase 2──▶ Backend LLM
                   consumer                                 P-Node
                   verifies                                 verifies its
                   P-Node                                   own backend
```

The on-chain `tee` model tag is the single switch that turns on both hops.

---

## Headline: Phase 2 TEE Backend Verification (#699, #700)

Every TEE-tagged model's backend LLM is now attested **by the P-Node itself**, not just assumed trustworthy. The P-Node refuses to forward any inference unless the backend passes all of the following on every request:

| Check | What it proves | Where |
|---|---|---|
| **Portal-verified CPU TDX quote** | Backend runs on genuine Intel TDX hardware | `:29343/cpu` → `TEE_PORTAL_URL` |
| **TLS certificate pinning** | Inference TLS terminates inside the attested enclave; no CDN/MITM can slip in | `reportData[0:32] == SHA-256(TLS cert)` |
| **Workload RTMR3 replay** | The exact set of loaded models (`MODELS=…` line in `docker-compose.yaml`) is what the operator declared | RTMR3 replay vs. backend's `:29343/docker-compose` |
| **MRTD + RTMR0–2 artifact lookup** | Firmware / VM config / kernel / initramfs all match a published SecretVM build | `ArtifactRegistry` CSV (auto-refreshed) |
| **CPU-GPU nonce binding** | GPU evidence cannot be replayed from another box | `reportData[32:64] == GPU nonce` |
| **NVIDIA NRAS v4 attestation** | Independent hardware-level validation of the GPU | NRAS REST + JWT EAT signature check |
| **Per-prompt fast verify** | Backend identity hasn't changed since initial attestation; **runs on every prompt** | hash + TLS fingerprint compare (~50 ms) |
| **Pinned-cert HTTP client** | Onward inference connection refuses any TLS cert whose fingerprint doesn't match | `PinnedHTTPClient.VerifyPeerCertificate` |

**Key files added** (`proxy-router/internal/attestation/`):

- `backend_verifier.go`: `AttestBackend` (full, startup) + `FastVerifyBackend` (per-prompt hot path, no TTL)
- `workload_verifier.go`, `rtmr.go`, `tdx_quote.go`: RTMR3 replay using SHA-384 extend chain
- `artifacts_registry.go`: auto-refresh of the SecretVM TDX artifact registry
- `nras_verifier.go`: NVIDIA NRAS v4 API client with JWT EAT validation
- Backend verifier test suite: `backend_verifier_test.go`, `workload_verifier_test.go`, `workload_rytn_test.go`, `nras_verifier_test.go`, `golden_test.go` (~1,450 LOC of coverage)

**Key integration points:**

- `proxy-router/cmd/main.go`: wires `BackendVerifier` into startup and calls `AttestBackend` once per `tee`-tagged model
- `proxy-router/internal/proxyapi/proxy_receiver.go`: calls `FastVerifyBackend` on the `SessionPrompt` hot path before forwarding any inference
- `proxy-router/internal/aiengine/ai_engine.go`: returns a `PinnedHTTPClient` for TEE models
- `proxy-router/internal/proxyapi/controller_http.go`: new `GET /v1/models/attestation` health endpoint exposing per-model state (`verified` / `pending` / `failed` + last-success timestamp + workload match)

**New environment variables** (`TEE` config block):

| Variable | Default | Purpose |
|---|---|---|
| `TEE_PORTAL_URL` | SecretAI Portal | CPU quote parse + verification endpoint |
| `TEE_IMAGE_REPO` | `ghcr.io/morpheusais/morpheus-lumerin-node-tee` | Image repo for cosign attestation verification |
| `ARTIFACT_REGISTRY_URL` | SecretVM TDX artifacts CSV | MRTD + RTMR0–2 lookup source |
| `ARTIFACT_REGISTRY_REFRESH_INTERVAL` | (configurable) | How often to re-fetch the registry |

---

## The `tee` Tag: One Switch, Two Hops (#708, #709)

Prior to v7.0.0 there were transient plans for a separate `tee-gpu` tag. That's been consolidated: **the single on-chain `tee` tag** now drives the entire trust chain. It turns on **both**:

- **Phase 1** on the consumer: C-Node (v6.0.0+) verifies the P-Node's attestation
- **Phase 2** on the provider: P-Node (v7.0.0+) verifies its own backend LLM

The local `isTee` field in `models-config.json` has been removed in favor of the blockchain tag as the single source of truth. `IsTeeModel(tags)` is the sole helper; `IsTeeGPUModel` is deleted.

---

## Operational Robustness

### P-Node TEE error wrapping (#703, #704)

When the P-Node's Phase 2 backend attestation fails, the error returned to callers is now wrapped in the correct error type so upstream logic (session open, prompt dispatch) handles it consistently and the consumer-visible failure is actionable rather than a generic 500.

### `request_id` propagation in every log (#705)

Every log line emitted along an inference or attestation path now carries the `request_id` from its context, so operators can trace a single prompt end-to-end through consumer → P-Node → backend attestation → inference → response. This is critical for v7 operations, since Phase 2 failures can surface at any of several points.

### Storage: per-entry Badger activity keys (#692, #693)

Session activity tracking moved from a single aggregate key to per-entry keys. This lets BadgerDB's GC reclaim disk space properly as sessions roll over, eliminating a slow-growing storage-bloat issue seen in long-running providers.

### CI/CD: ECS deploy wait-timing hardening (#694, #695, #701, #702, #710)

Multiple refinements to the ECS service stabilization + post-deploy attestation-verification window, eliminating intermittent premature health-check failures on otherwise-successful deploys.

---

## Documentation: Full v7 Doc Pass (#710)

All public-facing and internal TEE documentation was audited and rewritten in this release to accurately describe the two-hop trust chain and the forward-compatibility story:

**User-facing:**

- `readme.md`: new v7.0.0 release callout with the two-hop diagram and forward-compat note
- `docs/02.3-proxy-router-tee.md`: rewrote "What This Guarantees (and What It Doesn't)" with explicit Phase 1 / Phase 2 / remaining-gaps sections
- `docs/02.4-proxy-router-secretvm-quickstart.md`: rewrote "What Consumers See, and What Your P-Node Does" as two distinct hops + v7 troubleshooting
- `docs/03-provider-offer.md`: clarified the `tee` on-chain tag drives both hops
- `docs/models-config.json.md`: noted `isTee` is no longer a local config field (tag-driven)
- `docs/proxy-router.all.env`: documented all new TEE env vars

**Developer reference:**

- `proxy-router/docs/tee-backend-verification.md`: new 286-line developer reference for Phase 2, with mermaid sequence + trust-chain diagrams
- `proxy-router/docs/docs.go`, `swagger.json`, `swagger.yaml`: auto-generated API docs include the new `GET /v1/models/attestation` endpoint

**Internal (`.ai-docs/`):**

- `TEE_Attestation_Architecture.md`: status bumped to v2.0; Phases 1 / 1c / 2a marked DONE with real file paths and PR numbers; new §7.7 full Phase 2 technical write-up
- `TEE_CICD_Supply_Chain_Hardening.md`: v7.0.0 banner; trust-chain diagram updated with completed Phase 1c and Phase 2 boxes

---

## Configuration Updates

- **`.github/workflows/build.yml`**: `VMAJ_NEW=7` (major-version bump); builds from `main` will now tag as `v7.x.x`.
- **`smart-contracts/deploy/data/config_base_mainnet.json`**: `fundingAccount` rotated from `0x1FE04BC15Cf2c5A2d41a0b3a96725596676eBa1E` to `0x5160C0311A95E0A1072FA85Df23712A7BA1cD4b1`.

---

## Consumer / Provider Compatibility Matrix

| Consumer | Provider | TEE behavior |
|---|---|---|
| Pre-v6 | any | No TEE verification |
| v6.0.0+ | v6.x | Phase 1 only (consumer verifies P-Node); backend LLM not attested |
| **v6.0.0+** | **v7.0.0+** | **Full Phase 1 + Phase 2: the consumer transparently gains Phase 2 guarantees via the attested P-Node binary. No client-side upgrade required.** |
| v7.0.0+ | v7.0.0+ | Full Phase 1 + Phase 2 |

This forward-compatibility is the key design principle of the v7 release: upgrading providers instantly strengthens the network for all existing v6+ consumers.

---

## PRs Included (main → test diff, 28 commits)

### Phase 2 TEE Backend Verification (headline)

- #699 - feat: Phase 2 TEE backend verification
- #700 - Phase 2 (dev → test merge)
- #703 / #704 - wrap in correct error on P-Node TEE attestation fail
- #705 - pass request_id in context in every log
- #708 / #709 - feat: `"tee"` tag for everything (consolidate on single on-chain tag)

### Storage & CI/CD hardening

- #692 / #693 - fix(storage): per-entry Badger activity keys + GC reclaim
- #694 / #695 - refactor(workflows): improve ECS service stabilization timeout handling
- #701 / #702 - fix(cicd): ECS deploy wait timing
- #710 - ECS deploy wait timing + v7.0.0 release docs + version bump
- #711 - merge: dev to test, v7.0.0 release

---

## Verification

All changes were validated on `test` through the automated pipeline:

- Build → cosign sign → RTMR3 compute → deploy to SecretVM test VM → post-deploy attestation verification
- Per-model Phase 2 attestation exposed at `GET /v1/models/attestation` and verified live

---

## Test Plan

- [ ] Verify `VMAJ_NEW=7` produces a clean `v7.0.0` build tag on merge to main
- [ ] Verify `docker-compose.tee.yml` deployed to a SecretVM instance boots cleanly and `GET /v1/models/attestation` returns `verified` per TEE-tagged model
- [ ] Verify a v6.0.0+ consumer opens a session against a v7.0.0+ provider and transparently gets Phase 2 guarantees (no client upgrade)
- [ ] Verify Phase 2 fast-verify fires on every prompt (log `request_id` should appear at both session-open and each prompt)
- [ ] Verify workload RTMR3 mismatch (e.g. altered `docker-compose.yaml`) causes the P-Node to refuse the session
- [ ] Verify TLS certificate change on the backend triggers a hard fail (MITM signal) and refused prompt
- [ ] Verify CPU-GPU nonce mismatch causes attestation failure
- [ ] Verify NRAS outage degrades gracefully (does not block inference) but CPU-GPU binding still enforced
- [ ] Verify `fundingAccount` rotation in `config_base_mainnet.json` is correct before promotion
- [ ] Verify existing non-TEE models are unaffected (zero overhead)
- [ ] All CI checks pass on `main` after merge

---

## Blocked until review + branch protection approval.

Made with [Cursor](https://cursor.com)
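For reviewers checking the "Workload RTMR3 replay" row, here is a minimal sketch of a SHA-384 extend chain in Go. It is not the code in `rtmr.go`. The function names (`extend`, `replayRTMR`, `matchesQuote`) are hypothetical, and treating the SHA-384 digest of the compose file as the single replayed event is an assumption made only for illustration.

```go
package main

import (
	"crypto/sha512"
	"encoding/hex"
	"fmt"
)

// rtmrSize is the width of a TDX runtime measurement register (one SHA-384 digest).
const rtmrSize = sha512.Size384

// extend returns SHA-384(current || digest), the usual register extend step.
func extend(current, digest [rtmrSize]byte) [rtmrSize]byte {
	buf := make([]byte, 0, 2*rtmrSize)
	buf = append(buf, current[:]...)
	buf = append(buf, digest[:]...)
	return sha512.Sum384(buf)
}

// replayRTMR folds the event digests, in order, starting from an all-zero register.
func replayRTMR(events [][rtmrSize]byte) [rtmrSize]byte {
	var reg [rtmrSize]byte
	for _, d := range events {
		reg = extend(reg, d)
	}
	return reg
}

// matchesQuote reports whether replaying the events reproduces the quoted register value.
func matchesQuote(events [][rtmrSize]byte, quoted [rtmrSize]byte) bool {
	return replayRTMR(events) == quoted
}

func main() {
	// Assumption: one event whose digest is the SHA-384 of the compose file bytes.
	compose := []byte("services:\n  llm:\n    environment:\n      - MODELS=example-model\n")
	events := [][rtmrSize]byte{sha512.Sum384(compose)}
	quoted := replayRTMR(events) // stands in for the RTMR3 value taken from a quote

	fmt.Println("replayed RTMR3:", hex.EncodeToString(quoted[:]))
	fmt.Println("declared workload matches:", matchesQuote(events, quoted))

	// Changing the MODELS line changes the event digest, so the replay no longer matches.
	tampered := []byte("services:\n  llm:\n    environment:\n      - MODELS=other-model\n")
	fmt.Println("tampered workload matches:",
		matchesQuote([][rtmrSize]byte{sha512.Sum384(tampered)}, quoted))
}
```

Because each step hashes the previous register value together with the new digest, the final value depends on both the content and the order of the events, which is why an altered `docker-compose.yaml` is expected to produce a mismatch.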
## Summary

Propagates the v7.0.0 release prep from `dev` to `test` for validation ahead of the `test → main` cut.

Four commits from `dev` (including the freshly-merged #710):

- `0fb81ec`: merge commit for #710: fix(cicd) ECS deploy wait timing + v7.0.0 release docs + version bump
- `fec3f92`: Enhance TEE capabilities and documentation for v7.0.0 release
- `e81f41d`: Merge origin/dev into fix/cicd-ecs-deploy-wait-timing (brings in #701/#703/#705/#708 history via the merge)
- `71fff3d`: chore: update version number and funding account in configuration (VMAJ_NEW → 7, mainnet funding account)

## What reviewers should focus on

- `VMAJ_NEW=7` in `.github/workflows/build.yml`: this is the major-version bump that will tag the next test-branch build as `v7.x.x-beta` and, on promotion to main, as `v7.0.0`.
- `config_base_mainnet.json` funding-account change: please confirm the address is the intended one before this reaches main.
- `readme.md` v7.0.0 callout: verify that the two-hop trust chain story reads correctly to a new reader (C-Node v6+ → P-Node v7+ → Backend LLM).
- `docs/02.3-proxy-router-tee.md` and `docs/02.4-proxy-router-secretvm-quickstart.md`: the Phase 1 / Phase 2 split. Especially the "What Consumers See, and What Your P-Node Does" section in 02.4, which was the section we reworked most heavily to correct the earlier wording that implied the consumer runs Phase 2.
- `.ai-docs/TEE_Attestation_Architecture.md` §7.7: new technical write-up of the actual Phase 2 implementation (file paths, attestation endpoints, full sequence). Please sanity-check against `proxy-router/internal/attestation/` code.

## Trust-chain clarification (for reviewers)

The key correctness fix in the docs:

- […] `-tee` image) at session open and every prompt.
- […] `docker-compose.yaml`, CPU-GPU nonce binding, NVIDIA NRAS, at startup and every prompt.
- […] `tee` model tag is the single switch that enables both hops.

## Test plan

- `Deploy-SecretVM-Test` job ECS-wait timing no longer premature-fails
- `GET /v1/models/attestation` on a deployed test P-Node shows per-model Phase 2 state

## Downstream

After validation on `test`, plan to promote with a `test → main` PR that tags v7.0.0.

Made with Cursor
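When spot-checking §7.7, it may help to have the two `reportData` bindings from the Phase 2 check table (`reportData[0:32] == SHA-256(TLS cert)` and `reportData[32:64] == GPU nonce`) written out as code. The sketch below is illustrative only and is not the code in `backend_verifier.go`. The names `checkReportData` and `buildReportData` are hypothetical, and hashing the DER-encoded certificate bytes is an assumption about what "SHA-256(TLS cert)" covers.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"errors"
	"fmt"
)

var (
	errTLSBinding = errors.New("reportData[0:32] does not match SHA-256 of the TLS certificate")
	errGPUNonce   = errors.New("reportData[32:64] does not match the GPU nonce")
)

// buildReportData lays out the 64-byte field the way the check table describes it:
// certificate hash in the first half, GPU nonce in the second half.
func buildReportData(certDER []byte, gpuNonce [32]byte) [64]byte {
	var rd [64]byte
	h := sha256.Sum256(certDER)
	copy(rd[0:32], h[:])
	copy(rd[32:64], gpuNonce[:])
	return rd
}

// checkReportData verifies both bindings using constant-time comparisons.
// Assumption: the certificate is hashed in DER form.
func checkReportData(reportData [64]byte, certDER []byte, gpuNonce [32]byte) error {
	certHash := sha256.Sum256(certDER)
	if subtle.ConstantTimeCompare(reportData[0:32], certHash[:]) != 1 {
		return errTLSBinding
	}
	if subtle.ConstantTimeCompare(reportData[32:64], gpuNonce[:]) != 1 {
		return errGPUNonce
	}
	return nil
}

func main() {
	cert := []byte("placeholder DER bytes of the backend TLS certificate")
	var nonce [32]byte
	copy(nonce[:], "placeholder-gpu-nonce")

	rd := buildReportData(cert, nonce)
	fmt.Println("matching inputs:", checkReportData(rd, cert, nonce))
	fmt.Println("different cert:", checkReportData(rd, []byte("another certificate"), nonce))

	var otherNonce [32]byte
	fmt.Println("different nonce:", checkReportData(rd, cert, otherNonce))
}
```

A certificate swap fails the first comparison and GPU evidence tied to a different nonce fails the second, which corresponds to the MITM and replay cases in the test plan.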