
merge: dev to test — v7.0.0 release (ECS deploy fix + full TEE docs)#711

Merged
nomadicrogue merged 4 commits into test from dev
Apr 22, 2026

abs2023 commented Apr 22, 2026

Summary

Propagates the v7.0.0 release prep from dev to test for validation ahead of the test → main cut.

Four commits from dev (including the freshly-merged #710):

What reviewers should focus on

  1. VMAJ_NEW=7 in .github/workflows/build.yml — this is the major-version bump that will tag the next test-branch build as v7.x.x-beta and, on promotion to main, as v7.0.0.
  2. config_base_mainnet.json funding-account change — please confirm the address is the intended one before this reaches main.
  3. readme.md v7.0.0 callout: please verify that the two-hop trust-chain story reads correctly to a new reader (C-Node v6+ → P-Node v7+ → Backend LLM).
  4. docs/02.3-proxy-router-tee.md and docs/02.4-proxy-router-secretvm-quickstart.md: the Phase 1 / Phase 2 split. Pay particular attention to the "What Consumers See, and What Your P-Node Does" section in 02.4, which we reworked most heavily to correct the earlier wording implying that the consumer runs Phase 2.
  5. .ai-docs/TEE_Attestation_Architecture.md §7.7 — new technical write-up of the actual Phase 2 implementation (file paths, attestation endpoints, full sequence). Please sanity-check against proxy-router/internal/attestation/ code.

Trust-chain clarification (for reviewers)

The key correctness fix in the docs:

C-Node (v6.0.0+)  ──Phase 1──▶  P-Node -tee image (v7.0.0+)  ──Phase 2──▶  Backend LLM
                  consumer                                     P-Node
                  verifies                                     verifies its
                  P-Node                                       own backend
  • Phase 1 (shipped in v6.x, unchanged in v7): consumer's proxy-router verifies the provider's P-Node attestation (CPU quote, TLS binding, RTMR3 of the -tee image) at session open and every prompt.
  • Phase 2 (new in v7.0.0, shipped in feat: Phase 2 TEE backend verification #699): the P-Node itself verifies the backend LLM — CPU quote, TLS pinning, RTMR3 replay of the backend's docker-compose.yaml, CPU-GPU nonce binding, NVIDIA NRAS — at startup and every prompt.
  • The on-chain tee model tag is the single switch that enables both hops.
  • v6.0.0+ consumers are forward-compatible with v7.0.0+ providers and get Phase 2 guarantees transparently (Phase 2 is entirely inside the P-Node binary they already attested in Phase 1). No client-side upgrade is required for Phase 2.

Test plan

  • CI/CD builds v7.x.x-beta tag for the test branch
  • Post-deploy RTMR3 verification against the live SecretVM test instance passes
  • The Deploy-SecretVM-Test job's ECS wait no longer fails prematurely
  • Spot-check GET /v1/models/attestation on a deployed test P-Node shows per-model Phase 2 state
  • Reviewer walks through the v7 docs cold (without this PR's context) and confirms the two-hop trust chain is clearly explained

Downstream

After validation on test, plan to promote with a test → main PR that tags v7.0.0.

Made with Cursor

abs2023 and others added 4 commits April 22, 2026 09:57
- Incremented the major version number from 6 to 7 in the build workflow.
- Updated the funding account address in the mainnet configuration file.
Brings in:
- #701 fix(cicd): fix ECS deploy wait timing (squashed mirror of this branch's own commit)
- #703 wrap in correct error on P-node tee attestation fail
- #705 pass request_id in context in every log
- #708 feat: 'tee' tag for everything
- Introduced full two-hop Trusted Execution Environment (TEE) verification: Phase 1 (consumer to P-Node) and Phase 2 (P-Node to backend LLM).
- Updated documentation to reflect new TEE features, including detailed descriptions of the verification processes and guarantees.
- Added new sections in the README and various documentation files to clarify the TEE model tagging and its implications for consumers and providers.
- Incremented version number to v7.0.0 in configuration files and updated relevant documentation links.
- Improved CI/CD pipeline documentation to outline the automated verification processes for TEE models.

This release ensures that consumers using v6.0.0+ can seamlessly interact with v7.0.0+ providers without requiring client-side upgrades, enhancing security and trust in the Morpheus network.
#710)

## Summary

This PR brings three things into `dev` to complete the **v7.0.0 (full
TEE) release** prep:

1. **CI/CD fix (#701 equivalent)** — Fix for ECS deploy wait timing that
was prematurely failing the post-deploy health check (`8b341f3`). `#701` on
`dev` is the squashed version; this branch adds nothing new there.
2. **v7.0.0 version bump + funding-account update** (`71fff3d`) — Bumps
`VMAJ_NEW` to `7` in `.github/workflows/build.yml` and updates the
funding account in
`smart-contracts/deploy/data/config_base_mainnet.json`. This is the
major-version bump that marks "full TEE capability".
3. **Docs update for v7.0.0** (`fec3f92`) — Comprehensive documentation
pass to accurately reflect the shipped Phase 1 + Phase 2 TEE trust
chain.

## Doc changes (v7.0.0)

Consistently clarifies the **two-hop trust chain** across all public and
internal docs:

```
C-Node (v6.0.0+)  ──Phase 1──▶  P-Node -tee image (v7.0.0+)  ──Phase 2──▶  Backend LLM
                  consumer                                     P-Node
                  verifies                                     verifies its
                  P-Node                                       own backend
```

**Key correctness fix:** earlier drafts described the consumer as
performing Phase 2 backend verification. This is wrong — **Phase 2 runs
entirely inside the P-Node**; the consumer never talks to the backend.
This means **v6.0.0+ consumers are forward-compatible with v7.0.0+
providers** and get Phase 2 guarantees transparently, with no
client-side upgrade needed. Every doc now emphasizes this.

### User-facing docs updated
- `readme.md` — v7.0.0 release callout with the two-hop flow and
forward-compat note
- `docs/02.3-proxy-router-tee.md` — rewrote "What This Guarantees (and
What It Doesn't)" with explicit Phase 1 / Phase 2 / remaining-gaps
sections
- `docs/02.4-proxy-router-secretvm-quickstart.md` — rewrote "What
Consumers See, and What Your P-Node Does" as two distinct hops + v7
troubleshooting rows
- `docs/03-provider-offer.md` — clarified the `tee` on-chain tag drives
both hops
- `docs/models-config.json.md` — noted that `isTee` is no longer a local
config field (tag-driven)
- `docs/proxy-router.all.env` — documented new TEE env vars:
`TEE_PORTAL_URL`, `TEE_IMAGE_REPO`, `ARTIFACT_REGISTRY_URL`,
`ARTIFACT_REGISTRY_REFRESH_INTERVAL`

### Developer reference updated
- `proxy-router/docs/tee-backend-verification.md` — added "Where this
fits in the trust chain" section clarifying this doc is **Phase 2
only**, entirely inside the P-Node

### Internal `.ai-docs` updated (technical record of "how")
- `.ai-docs/TEE_Attestation_Architecture.md` — status bumped to v2.0,
Phase 1/1c/2a marked DONE with real file paths and PR numbers, new §7.7
with the full Phase 2 technical write-up (attestation endpoints, full
AttestBackend sequence, fast-verify semantics, reportData layout), open
questions resolved
- `.ai-docs/TEE_CICD_Supply_Chain_Hardening.md` — added v7.0.0 banner,
replaced placeholder diagram with two completed diagrams for Phase 1c
and Phase 2, reorganized status tables

## Test plan

- [x] Clean `ort`-strategy merge from `origin/dev` into this branch (no
conflicts, 11-file delta)
- [x] `GOOS=linux GOARCH=amd64 go build -ldflags='-w' -o
/tmp/proxy-router-linux ./cmd/` succeeds (67 MB binary, exit 0)
- [x] No new `go vet` warnings introduced by the merge (all warnings are
pre-existing on `dev`)
- [x] `grep` verification: no stale `isTee` references, no "consumer
verifies backend" misstatements, no `v6.0.0+` references where `v7.0.0+`
is meant
- [ ] CI/CD pipeline runs green (ECS deploy + post-deploy RTMR3
verification)
- [ ] Reviewer spot-check of the new
`.ai-docs/TEE_Attestation_Architecture.md` §7.7 against the actual
`proxy-router/internal/attestation/` code

## Release notes

This is the v7.0.0 release bump. Downstream merges:
1. This PR: `fix/cicd-ecs-deploy-wait-timing` → `dev`
2. Follow-up: `dev` → `test` (separate PR, opens immediately after this
merges)
3. After test validation: `test` → `main` for the v7.0.0 tag

Made with [Cursor](https://cursor.com)
@nomadicrogue nomadicrogue merged commit 63db4f4 into test Apr 22, 2026
16 checks passed
nomadicrogue added a commit that referenced this pull request Apr 23, 2026
…#712)

## Overview

**v7.0.0 — Full TEE capability.** This release completes the Morpheus
TEE trust chain with **Phase 2: Provider-side backend LLM attestation**.
v6.x closed the loop between the consumer and the provider's
proxy-router (P-Node). v7.0.0 closes the next hop: the P-Node now
cryptographically verifies its **own backend LLM** (CPU TDX quote, TLS
pinning, workload RTMR3 replay, CPU-GPU nonce binding, NVIDIA NRAS GPU
attestation) at startup and on **every inference prompt**.

The major version bump to **7.0.0** marks the completion of the full
two-hop TEE verification chain. It is also designed for **seamless
forward-compatibility**: v6.0.0+ consumers automatically benefit from
Phase 2 guarantees when they connect to v7.0.0+ providers — no
client-side upgrade is required.

```
C-Node (v6.0.0+)  ──Phase 1──▶  P-Node -tee image (v7.0.0+)  ──Phase 2──▶  Backend LLM
                  consumer                                     P-Node
                  verifies                                     verifies its
                  P-Node                                       own backend
```

The on-chain `tee` model tag is the single switch that turns on both
hops.

---

## Headline: Phase 2 TEE Backend Verification (#699, #700)

Every TEE-tagged model's backend LLM is now attested **by the P-Node
itself**, not just assumed trustworthy. The P-Node refuses to forward
any inference unless the backend passes all of the following on every
request:

| Check | What it proves | Where |
|---|---|---|
| **Portal-verified CPU TDX quote** | Backend runs on genuine Intel TDX hardware | `:29343/cpu` → `TEE_PORTAL_URL` |
| **TLS certificate pinning** | Inference TLS terminates inside the attested enclave, so no CDN/MITM can slip in | `reportData[0:32] == SHA-256(TLS cert)` |
| **Workload RTMR3 replay** | The exact set of loaded models (`MODELS=…` line in `docker-compose.yaml`) is what the operator declared | RTMR3 replay vs. backend's `:29343/docker-compose` |
| **MRTD + RTMR0–2 artifact lookup** | Firmware / VM config / kernel / initramfs all match a published SecretVM build | `ArtifactRegistry` CSV (auto-refreshed) |
| **CPU-GPU nonce binding** | GPU evidence cannot be replayed from another box | `reportData[32:64] == GPU nonce` |
| **NVIDIA NRAS v4 attestation** | Independent hardware-level validation of the GPU | NRAS REST + JWT EAT signature check |
| **Per-prompt fast verify** | Backend identity hasn't changed since initial attestation; **runs on every prompt** | hash + TLS fingerprint compare (~50 ms) |
| **Pinned-cert HTTP client** | Onward inference connection refuses any TLS cert whose fingerprint doesn't match | `PinnedHTTPClient.VerifyPeerCertificate` |
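
To make the two `reportData` bindings in the table concrete, here is a minimal Go sketch of how the 64-byte field could be checked. It is not the code in `backend_verifier.go`: the function names `checkReportData` and `buildReportData` are hypothetical, and hashing the DER encoding of the certificate is an assumption (the table only says "SHA-256(TLS cert)").

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"errors"
)

// checkReportData verifies the layout described in the table:
//
//	reportData[0:32]  == SHA-256(TLS cert)
//	reportData[32:64] == GPU nonce
//
// Hypothetical helper; hashing the DER bytes is an assumption.
func checkReportData(reportData, tlsCertDER, gpuNonce []byte) error {
	if len(reportData) < 64 {
		return errors.New("reportData shorter than 64 bytes")
	}
	certHash := sha256.Sum256(tlsCertDER)
	if !bytes.Equal(reportData[0:32], certHash[:]) {
		return errors.New("TLS certificate hash does not match reportData[0:32]")
	}
	// bytes.Equal also fails when the nonce is not exactly 32 bytes long.
	if !bytes.Equal(reportData[32:64], gpuNonce) {
		return errors.New("GPU nonce does not match reportData[32:64]")
	}
	return nil
}

// buildReportData lays out 64 bytes the same way, for illustration and tests.
func buildReportData(tlsCertDER, gpuNonce []byte) []byte {
	certHash := sha256.Sum256(tlsCertDER)
	out := make([]byte, 64)
	copy(out[0:32], certHash[:])
	copy(out[32:64], gpuNonce)
	return out
}
```

Because both halves live in one signed quote, an attacker who swaps either the certificate or the GPU evidence invalidates the comparison.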

**Key files added** (`proxy-router/internal/attestation/`):
- `backend_verifier.go` — `AttestBackend` (full, startup) +
`FastVerifyBackend` (per-prompt hot path, no TTL)
- `workload_verifier.go`, `rtmr.go`, `tdx_quote.go` — RTMR3 replay using
SHA-384 extend chain
- `artifacts_registry.go` — auto-refresh of the SecretVM TDX artifact
registry
- `nras_verifier.go` — NVIDIA NRAS v4 API client with JWT EAT validation
- Backend verifier test suite: `backend_verifier_test.go`,
`workload_verifier_test.go`, `workload_rytn_test.go`,
`nras_verifier_test.go`, `golden_test.go` (~1,450 LOC of coverage)
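
For readers unfamiliar with the "SHA-384 extend chain" mentioned above, the following sketch shows the general extend operation used by TDX runtime measurement registers: each step replaces the register with SHA-384 of the old value concatenated with a new digest. The names `extendRTMR` and `replayRTMR` are hypothetical, and which payloads are measured into RTMR3 (and in what order) is an assumption here; the actual logic lives in `rtmr.go` and `workload_verifier.go`.

```go
package main

import "crypto/sha512"

// rtmrSize is the width of a TDX runtime measurement register (48 bytes).
const rtmrSize = sha512.Size384

// extendRTMR applies one extend step: new = SHA-384(old || digest).
func extendRTMR(current, digest [rtmrSize]byte) [rtmrSize]byte {
	buf := make([]byte, 0, 2*rtmrSize)
	buf = append(buf, current[:]...)
	buf = append(buf, digest[:]...)
	return sha512.Sum384(buf)
}

// replayRTMR starts from an all-zero register and extends it with the
// SHA-384 digest of each event payload, in order. Treating each payload
// as one event is an assumption made for illustration.
func replayRTMR(events [][]byte) [rtmrSize]byte {
	var reg [rtmrSize]byte
	for _, ev := range events {
		reg = extendRTMR(reg, sha512.Sum384(ev))
	}
	return reg
}
```

A verifier replays the declared workload through the same chain and compares the result with the RTMR3 value in the quote; any change to a payload or to the order of events yields a different register.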

**Key integration points:**
- `proxy-router/cmd/main.go` — wires `BackendVerifier` into startup and
calls `AttestBackend` once per `tee`-tagged model
- `proxy-router/internal/proxyapi/proxy_receiver.go` — calls
`FastVerifyBackend` on the `SessionPrompt` hot path before forwarding
any inference
- `proxy-router/internal/aiengine/ai_engine.go` — returns a
`PinnedHTTPClient` for TEE models
- `proxy-router/internal/proxyapi/controller_http.go` — new `GET
/v1/models/attestation` health endpoint exposing per-model state
(`verified` / `pending` / `failed` + last-success timestamp + workload
match)
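
As an illustration of the pinned-certificate idea behind `PinnedHTTPClient`, the sketch below builds a standard-library `http.Client` whose TLS handshake is rejected unless the leaf certificate's SHA-256 fingerprint matches an expected value. The helper names (`fingerprint`, `pinnedVerifier`, `newPinnedClient`) are hypothetical and the real type in `ai_engine.go` may be structured differently.

```go
package main

import (
	"crypto/sha256"
	"crypto/tls"
	"crypto/x509"
	"errors"
	"net/http"
)

// fingerprint returns the SHA-256 digest of a DER-encoded certificate.
func fingerprint(der []byte) [32]byte {
	return sha256.Sum256(der)
}

// pinnedVerifier returns a tls.Config.VerifyPeerCertificate callback that
// accepts the connection only if the leaf certificate matches expected.
func pinnedVerifier(expected [32]byte) func([][]byte, [][]*x509.Certificate) error {
	return func(rawCerts [][]byte, _ [][]*x509.Certificate) error {
		if len(rawCerts) == 0 {
			return errors.New("peer presented no certificate")
		}
		if fingerprint(rawCerts[0]) != expected {
			return errors.New("peer certificate fingerprint does not match pinned value")
		}
		return nil
	}
}

// newPinnedClient builds an HTTP client that enforces the pin on every
// full TLS handshake. Normal chain verification still runs first because
// InsecureSkipVerify is left at its default (false).
func newPinnedClient(expected [32]byte) *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{
				MinVersion:            tls.VersionTLS12,
				VerifyPeerCertificate: pinnedVerifier(expected),
			},
		},
	}
}
```

One caveat from the Go documentation: `VerifyPeerCertificate` is not invoked on resumed connections, so a design that must check the pin on every connection may prefer `VerifyConnection` or disable session resumption.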

**New environment variables** (`TEE` config block):

| Variable | Default | Purpose |
|---|---|---|
| `TEE_PORTAL_URL` | SecretAI Portal | CPU quote parse + verification endpoint |
| `TEE_IMAGE_REPO` | `ghcr.io/morpheusais/morpheus-lumerin-node-tee` | Image repo for cosign attestation verification |
| `ARTIFACT_REGISTRY_URL` | SecretVM TDX artifacts CSV | MRTD + RTMR0–2 lookup source |
| `ARTIFACT_REGISTRY_REFRESH_INTERVAL` | (configurable) | How often to re-fetch the registry |
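
A rough sketch of how these four variables might be read into a config struct follows. The struct and function names are hypothetical, and parsing `ARTIFACT_REGISTRY_REFRESH_INTERVAL` as a Go duration string (for example `15m`) is an assumption; the table does not state the format. Only the `TEE_IMAGE_REPO` default is filled in, since it is the only concrete default value listed.

```go
package main

import (
	"fmt"
	"time"
)

// teeConfig is a hypothetical holder for the TEE config block.
type teeConfig struct {
	PortalURL                       string
	ImageRepo                       string
	ArtifactRegistryURL             string
	ArtifactRegistryRefreshInterval time.Duration
}

// loadTEEConfig reads the variables through getenv (os.Getenv in a real
// binary; injectable here so the function is easy to test).
func loadTEEConfig(getenv func(string) string) (teeConfig, error) {
	cfg := teeConfig{
		PortalURL:           getenv("TEE_PORTAL_URL"),
		ImageRepo:           getenv("TEE_IMAGE_REPO"),
		ArtifactRegistryURL: getenv("ARTIFACT_REGISTRY_URL"),
	}
	if cfg.ImageRepo == "" {
		// Default taken from the table above.
		cfg.ImageRepo = "ghcr.io/morpheusais/morpheus-lumerin-node-tee"
	}
	// Assumption: the interval is a Go duration string such as "15m".
	if raw := getenv("ARTIFACT_REGISTRY_REFRESH_INTERVAL"); raw != "" {
		d, err := time.ParseDuration(raw)
		if err != nil {
			return teeConfig{}, fmt.Errorf("ARTIFACT_REGISTRY_REFRESH_INTERVAL: %w", err)
		}
		if d <= 0 {
			return teeConfig{}, fmt.Errorf("ARTIFACT_REGISTRY_REFRESH_INTERVAL must be positive, got %s", d)
		}
		cfg.ArtifactRegistryRefreshInterval = d
	}
	return cfg, nil
}
```

Empty `PortalURL` and `ArtifactRegistryURL` values are left for the caller to default, because the table names those defaults without giving their URLs.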

---

## The `tee` Tag — One Switch, Two Hops (#708, #709)

Prior to v7.0.0 there were transient plans for a separate `tee-gpu` tag.
That's been consolidated: **the single on-chain `tee` tag** now drives
the entire trust chain. It turns on **both**:
- **Phase 1** on the consumer: C-Node (v6.0.0+) verifies the P-Node's
attestation
- **Phase 2** on the provider: P-Node (v7.0.0+) verifies its own backend
LLM

The local `isTee` field in `models-config.json` has been removed in
favor of the blockchain tag as the single source of truth.
`IsTeeModel(tags)` is the sole helper; `IsTeeGPUModel` is deleted.
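
A tag-driven check of this kind can be very small. The sketch below is a stand-in for the idea, not the project's `IsTeeModel`: the lowercase name `isTeeModel`, the `[]string` parameter type, and the case-insensitive, whitespace-tolerant match are all assumptions.

```go
package main

import "strings"

// isTeeModel reports whether the on-chain tag list contains the "tee" tag.
// Hypothetical stand-in; case-insensitive matching is an assumption.
func isTeeModel(tags []string) bool {
	for _, tag := range tags {
		if strings.EqualFold(strings.TrimSpace(tag), "tee") {
			return true
		}
	}
	return false
}
```

Note that an exact match means a leftover `tee-gpu` tag would not enable TEE handling on its own.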

---

## Operational Robustness

### P-Node TEE error wrapping (#703, #704)
When the P-Node's Phase 2 backend attestation fails, the error returned
to callers is now wrapped in the correct error type so upstream logic
(session open, prompt dispatch) handles it consistently and the
consumer-visible failure is actionable rather than a generic 500.
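
The general Go pattern for this kind of fix is to wrap the underlying cause in a sentinel error so callers can classify it with `errors.Is`. The sketch below illustrates that pattern only; `ErrBackendAttestation`, `errTLSMismatch`, and both helper functions are hypothetical names, not the identifiers used in #703.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrBackendAttestation is a hypothetical sentinel for Phase 2 failures.
var ErrBackendAttestation = errors.New("tee backend attestation failed")

// errTLSMismatch is an example low-level cause, used for illustration.
var errTLSMismatch = errors.New("backend TLS fingerprint changed")

// wrapAttestationError tags a low-level cause with the sentinel so that
// upstream code can recognise it without parsing strings.
func wrapAttestationError(modelID string, cause error) error {
	return fmt.Errorf("model %s: %w: %v", modelID, ErrBackendAttestation, cause)
}

// isAttestationFailure is what session-open or prompt-dispatch code could
// call to decide how to report the failure.
func isAttestationFailure(err error) bool {
	return errors.Is(err, ErrBackendAttestation)
}
```

With this shape, `isAttestationFailure(wrapAttestationError("m1", errTLSMismatch))` is true, while the bare cause is not classified as an attestation failure.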

### `request_id` propagation in every log (#705)
Every log line emitted along an inference or attestation path now
carries the `request_id` from its context, so operators can trace a
single prompt end-to-end through consumer → P-Node → backend attestation
→ inference → response. Critical for v7 operations since Phase 2
failures can surface at any of several points.
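
A minimal sketch of context-based `request_id` propagation is shown below. It uses only the standard library and hypothetical helper names; the project's logger and key type are not shown in this PR, so treat this as an illustration of the technique rather than the change in #705.

```go
package main

import (
	"context"
	"fmt"
)

// requestIDKey is an unexported key type, which avoids collisions with
// context values set by other packages.
type requestIDKey struct{}

// withRequestID returns a child context carrying the request ID.
func withRequestID(ctx context.Context, id string) context.Context {
	return context.WithValue(ctx, requestIDKey{}, id)
}

// newRequestContext is a convenience wrapper for a fresh root context.
func newRequestContext(id string) context.Context {
	return withRequestID(context.Background(), id)
}

// requestIDFrom extracts the ID, falling back to "unknown" when absent.
func requestIDFrom(ctx context.Context) string {
	if id, ok := ctx.Value(requestIDKey{}).(string); ok && id != "" {
		return id
	}
	return "unknown"
}

// logLine formats a log entry that always includes the request ID.
func logLine(ctx context.Context, msg string) string {
	return fmt.Sprintf("request_id=%s msg=%q", requestIDFrom(ctx), msg)
}
```

Passing the same context through attestation and inference calls means every line for one prompt shares an ID that operators can search for.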

### Storage: per-entry Badger activity keys (#692, #693)
Session activity tracking moved from a single aggregate key to per-entry
keys. This makes BadgerDB's GC able to reclaim disk space properly as
sessions roll over, eliminating a slow-growing storage-bloat issue seen
in long-running providers.
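
To illustrate why per-entry keys help, the sketch below shows one possible key layout in which each activity record gets its own key with a big-endian timestamp suffix. The prefix, the notion of a "scope", and all function names are hypothetical; the actual key format from #692 is not described in this PR.

```go
package main

import (
	"bytes"
	"encoding/binary"
)

// activityPrefix is a hypothetical key namespace, not the project's real one.
const activityPrefix = "activity:"

// scopePrefix returns the prefix shared by every entry of one scope.
func scopePrefix(scope string) []byte {
	return []byte(activityPrefix + scope + ":")
}

// activityKey builds one key per entry: prefix + scope + ':' + timestamp.
// The big-endian timestamp makes keys sort chronologically within a scope.
func activityKey(scope string, unixNano uint64) []byte {
	var ts [8]byte
	binary.BigEndian.PutUint64(ts[:], unixNano)
	return append(scopePrefix(scope), ts[:]...)
}

// olderThan reports whether key belongs to scope and predates cutoff, which
// is the test a cleanup pass could apply before deleting the entry.
func olderThan(key []byte, scope string, cutoff uint64) bool {
	p := scopePrefix(scope)
	if !bytes.HasPrefix(key, p) || len(key) != len(p)+8 {
		return false
	}
	return binary.BigEndian.Uint64(key[len(p):]) < cutoff
}
```

With a single aggregate key, every update rewrites one ever-growing value; with per-entry keys, expired entries can be deleted individually, which gives the store's garbage collector discrete stale records to reclaim.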

### CI/CD: ECS deploy wait-timing hardening (#694, #695, #701, #702, #710)
Multiple refinements to the ECS service stabilization + post-deploy
attestation-verification window, eliminating the intermittent premature
health-check failures that were causing otherwise successful deploys to
fail.

---

## Documentation — Full v7 Doc Pass (#710)

All public-facing and internal TEE documentation was audited and
rewritten in this release to accurately describe the two-hop trust chain
and the forward-compatibility story:

**User-facing:**
- `readme.md` — new v7.0.0 release callout with the two-hop diagram and
forward-compat note
- `docs/02.3-proxy-router-tee.md` — rewrote "What This Guarantees (and
What It Doesn't)" with explicit Phase 1 / Phase 2 / remaining-gaps
sections
- `docs/02.4-proxy-router-secretvm-quickstart.md` — rewrote "What
Consumers See, and What Your P-Node Does" as two distinct hops + v7
troubleshooting
- `docs/03-provider-offer.md` — clarified the `tee` on-chain tag drives
both hops
- `docs/models-config.json.md` — noted `isTee` is no longer a local
config field (tag-driven)
- `docs/proxy-router.all.env` — documented all new TEE env vars

**Developer reference:**
- `proxy-router/docs/tee-backend-verification.md` — new 286-line
developer reference for Phase 2, with mermaid sequence + trust-chain
diagrams
- `proxy-router/docs/docs.go`, `swagger.json`, `swagger.yaml` —
auto-generated API docs include the new `GET /v1/models/attestation`
endpoint

**Internal (`.ai-docs/`):**
- `TEE_Attestation_Architecture.md` — status bumped to v2.0; Phases 1 /
1c / 2a marked DONE with real file paths and PR numbers; new §7.7 full
Phase 2 technical write-up
- `TEE_CICD_Supply_Chain_Hardening.md` — v7.0.0 banner; trust-chain
diagram updated with completed Phase 1c and Phase 2 boxes

---

## Configuration Updates

- **`.github/workflows/build.yml`** — `VMAJ_NEW=7` (major-version bump);
builds from `main` will now tag as `v7.x.x`.
- **`smart-contracts/deploy/data/config_base_mainnet.json`** —
`fundingAccount` rotated from
`0x1FE04BC15Cf2c5A2d41a0b3a96725596676eBa1E` to
`0x5160C0311A95E0A1072FA85Df23712A7BA1cD4b1`.

---

## Consumer / Provider Compatibility Matrix

| Consumer | Provider | TEE behavior |
|---|---|---|
| Pre-v6 | any | No TEE verification |
| v6.0.0+ | v6.x | Phase 1 only (consumer verifies P-Node); backend LLM not attested |
| **v6.0.0+** | **v7.0.0+** | **Full Phase 1 + Phase 2: the consumer transparently gains Phase 2 guarantees via the attested P-Node binary. No client-side upgrade required.** |
| v7.0.0+ | v7.0.0+ | Full Phase 1 + Phase 2 |

This forward-compatibility is the key design principle of the v7 release
— upgrading providers instantly strengthens the network for all existing
v6+ consumers.
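
The matrix can be restated as a small decision function, which may be handy in tests or dashboards. This is an illustrative sketch with hypothetical names, keyed on major version only. The table does not cover a v6+ consumer talking to a pre-v6 provider; the sketch treats that case as "no TEE verification", which is an assumption.

```go
package main

// teeBehavior describes which hops of the trust chain are in effect.
type teeBehavior string

const (
	noTEE      teeBehavior = "no TEE verification"
	phase1Only teeBehavior = "Phase 1 only"
	fullChain  teeBehavior = "Phase 1 + Phase 2"
)

// teeBehaviorFor maps consumer and provider major versions to the expected
// behavior for a tee-tagged model, following the compatibility matrix.
func teeBehaviorFor(consumerMajor, providerMajor int) teeBehavior {
	switch {
	case consumerMajor < 6:
		return noTEE // pre-v6 consumers never verify the P-Node
	case providerMajor >= 7:
		return fullChain // Phase 2 runs inside the attested P-Node
	case providerMajor == 6:
		return phase1Only // backend LLM is not attested
	default:
		return noTEE // assumption: pre-v6 providers are not in the table
	}
}
```

Note that the consumer's version only gates Phase 1; whether Phase 2 applies depends solely on the provider.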

---

## PRs Included (main → test diff, 28 commits)

### Phase 2 TEE Backend Verification (headline)
- #699 — feat: Phase 2 TEE backend verification
- #700 — Phase 2 (dev → test merge)
- #703 / #704 — wrap in correct error on P-Node TEE attestation fail
- #705 — pass request_id in context in every log
- #708 / #709 — feat: `"tee"` tag for everything (consolidate on single
on-chain tag)

### Storage & CI/CD hardening
- #692 / #693 — fix(storage): per-entry Badger activity keys + GC
reclaim
- #694 / #695 — refactor(workflows): improve ECS service stabilization
timeout handling
- #701 / #702 — fix(cicd): ECS deploy wait timing
- #710 — ECS deploy wait timing + v7.0.0 release docs + version bump
- #711 — merge: dev to test — v7.0.0 release

---

## Verification

All changes were validated on `test` through the automated pipeline:
- Build → cosign sign → RTMR3 compute → deploy to SecretVM test VM →
post-deploy attestation verification
- Per-model Phase 2 attestation exposed at `GET /v1/models/attestation`
and verified live

---

## Test Plan

- [ ] Verify `VMAJ_NEW=7` produces a clean `v7.0.0` build tag on merge
to main
- [ ] Verify `docker-compose.tee.yml` deployed to a SecretVM instance
boots cleanly and `GET /v1/models/attestation` returns `verified` per
TEE-tagged model
- [ ] Verify a v6.0.0+ consumer opens a session against a v7.0.0+
provider and transparently gets Phase 2 guarantees (no client upgrade)
- [ ] Verify Phase 2 fast-verify fires on every prompt (log `request_id`
should appear at both session-open and each prompt)
- [ ] Verify workload RTMR3 mismatch (e.g. altered
`docker-compose.yaml`) causes the P-Node to refuse the session
- [ ] Verify TLS certificate change on the backend triggers a hard fail
(MITM signal) and refused prompt
- [ ] Verify CPU-GPU nonce mismatch causes attestation failure
- [ ] Verify NRAS outage degrades gracefully (does not block inference)
but CPU-GPU binding still enforced
- [ ] Verify `fundingAccount` rotation in `config_base_mainnet.json` is
correct before promotion
- [ ] Verify existing non-TEE models are unaffected (zero overhead)
- [ ] All CI checks pass on `main` after merge

---

## Blocked until review + branch protection approval.

Made with [Cursor](https://cursor.com)