Commit 7186b35

butler54 and claude committed
docs: address PR #73 review comments and merge PR #75 documentation
This commit addresses all review comments from bpradipt and pawelpros on PR #73, merges documentation from PR #75, and updates container images.

Documentation changes:
- README: Replace "peer-pod infrastructure" wording to clarify Azure vs bare metal
- README: Update OCP version requirements from 4.17+ to 4.19.28+ (OSC 1.12 requirement)
- README: Clarify PCR collection differs for Azure (get-pcr.sh) vs bare metal (manual)
- README: Distinguish Azure (kata-remote) from bare metal (kata-cc) runtime classes
- values-secret.yaml.template: Add missing kbsPrivateKey secret
- values-secret.yaml.template: Reorganize with clear section headers and improved docs
- gen-secrets.sh: Add prominent alert when values-secret file is created
- Merge docs/nfd-matchall-bug.md from PR #75 (NFD matchAll bug report)
- Merge docs/pcr-reference-values-bare-metal.md from PR #75 (PCR collection guide)

Code cleanup:
- Delete obsolete qgs-config-cm.yaml (QGS args now inline)
- Delete obsolete qgs-sgx-cm.yaml (QCNL config via downwardAPI)
- Remove commented-out detect-runtime-class reference in values-baremetal.yaml

Image updates:
- intel-dpo-sgx.yaml: Update intel-sgx-plugin to sha256:4ac8769c (v0.35.0)
- pccs-deployment.yaml: Update osc-pccs to sha256:edf57087 (v1.12)
- qgs-ds.yaml: Update osc-tdx-qgs to sha256:308d66da (v1.12)

Resolves review comments from:
- bpradipt: peer-pod wording, OCP versions, PCR clarification
- pawelpros: obsolete ConfigMaps, image digests, PCR requirements

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
1 parent d61db58 commit 7186b35

11 files changed

Lines changed: 441 additions & 97 deletions

README.md

Lines changed: 7 additions & 7 deletions
Original file line number | Diff line number | Diff line change
@@ -2,7 +2,7 @@
22

33
Validated pattern for deploying confidential containers on OpenShift using the [Validated Patterns](https://validatedpatterns.io/) framework.
44

5-
Confidential containers use hardware-backed Trusted Execution Environments (TEEs) to isolate workloads from cluster and hypervisor administrators. This pattern deploys and configures the Red Hat CoCo stack — including the sandboxed containers operator, Trustee (Key Broker Service), and peer-pod infrastructure — on Azure and bare metal.
5+
Confidential containers use hardware-backed Trusted Execution Environments (TEEs) to isolate workloads from cluster and hypervisor administrators. This pattern deploys and configures the Red Hat CoCo stack — including the sandboxed containers operator, Trustee (Key Broker Service) operator, and Kata infrastructure — on Azure cloud instances and bare metal.
66

77
## Topologies
88

@@ -24,7 +24,7 @@ Azure deployments use peer-pods, which provision confidential VMs (`Standard_DCa
2424

2525
Breaking change from v3. This is the first version using GA (Generally Available) releases of the CoCo stack:
2626

27-
- **OpenShift Sandboxed Containers 1.12+** (requires OCP 4.17+)
27+
- **OpenShift Sandboxed Containers 1.12+** (requires OCP 4.19.28+)
2828
- **Red Hat Build of Trustee 1.1** (GA release; all versions prior to 1.0 were Technology Preview)
2929
- External chart repositories for [Trustee](https://github.com/validatedpatterns/trustee-chart), [sandboxed-containers](https://github.com/validatedpatterns/sandboxed-containers-chart), and [sandboxed-policies](https://github.com/validatedpatterns/sandboxed-policies-chart)
3030
- Self-signed certificates via cert-manager (Let's Encrypt no longer required)
@@ -46,13 +46,13 @@ All previous versions used pre-GA (Technology Preview) releases of Trustee:
4646

4747
**Azure deployments:**
4848

49-
- OpenShift 4.17+ cluster on Azure (self-managed via `openshift-install` or ARO)
49+
- OpenShift 4.19.28+ cluster on Azure (self-managed via `openshift-install` or ARO)
5050
- Azure `Standard_DCas_v5` VM quota in your target region (these are confidential computing VMs and are not available in all regions). See the note below for more details.
5151
- Azure DNS hosting the cluster's DNS zone
5252

5353
**Bare metal deployments:**
5454

55-
- OpenShift 4.17+ cluster on bare metal with Intel TDX or AMD SEV-SNP hardware
55+
- OpenShift 4.19.28+ cluster on bare metal with Intel TDX or AMD SEV-SNP hardware
5656
- BIOS/firmware configured to enable TDX or SEV-SNP
5757
- Available block devices for LVMS storage (auto-discovered)
5858
- For Intel TDX: an Intel PCS API key from [api.portal.trustedservices.intel.com](https://api.portal.trustedservices.intel.com/)
@@ -68,7 +68,7 @@ All previous versions used pre-GA (Technology Preview) releases of Trustee:
6868
These scripts generate the cryptographic material and attestation measurements needed by Trustee and the peer-pod VMs. Run them once before your first deployment.
6969

7070
1. `bash scripts/gen-secrets.sh` — generates KBS key pairs, PCCS certificates/tokens (for bare metal), and copies `values-secret.yaml.template` to `~/values-secret-coco-pattern.yaml`
71-
2. `bash scripts/get-pcr.sh` — retrieves PCR measurements from the peer-pod VM image and stores them at `~/.coco-pattern/measurements.json` (requires `podman`, `skopeo`, and `~/pull-secret.json`). **Not required for bare metal deployments.**
71+
2. `bash scripts/get-pcr.sh` — retrieves PCR measurements from the peer-pod VM image and stores them at `~/.coco-pattern/measurements.json` (requires `podman`, `skopeo`, and `~/pull-secret.json`). **Azure only.** Bare metal uses manual PCR collection — see [docs/pcr-reference-values-bare-metal.md](docs/pcr-reference-values-bare-metal.md) for the procedure. Store the measurements at `~/.coco-pattern/measurements.json`.
7272
3. Review and customise `~/values-secret-coco-pattern.yaml` — this file is loaded into Vault and provides secrets to the pattern. For bare metal, uncomment the PCCS secrets section and provide your Intel PCS API key.
7373

7474
> **Note:** `gen-secrets.sh` will not overwrite existing secrets. Delete `~/.coco-pattern/` if you need to regenerate.
@@ -85,7 +85,7 @@ These scripts generate the cryptographic material and attestation measurements n
8585
1. Set `main.clusterGroupName: trusted-hub` in `values-global.yaml`
8686
2. Deploy the hub cluster: `./pattern.sh make install`
8787
3. Wait for ACM (`MultiClusterHub`) to reach `Running` state on the hub
88-
4. Provision a second OpenShift 4.17+ cluster on Azure for the spoke
88+
4. Provision a second OpenShift 4.19.28+ cluster on Azure for the spoke
8989
5. Import the spoke into ACM with label `clusterGroup=spoke`
9090
(see [importing a cluster](https://validatedpatterns.io/learn/importing-a-cluster/))
9191
6. ACM will automatically deploy the `spoke` clusterGroup applications (sandboxed containers, workloads) to the imported cluster
@@ -118,7 +118,7 @@ Two sample applications are deployed on the cluster running confidential workloa
118118
- `secure` — a confidential container with a strict policy; `oc exec` is denied even for `kubeadmin`
119119
- `insecure-policy` — a confidential container with a relaxed policy allowing `oc exec` (useful for testing the Confidential Data Hub)
120120

121-
Each confidential pod runs on its own `Standard_DC2as_v5` Azure VM (visible in the Azure portal). Pods use `runtimeClassName: kata-remote`.
121+
On Azure, each confidential pod runs on its own `Standard_DC2as_v5` Azure VM (visible in the Azure portal) using `runtimeClassName: kata-remote`. On bare metal, pods use `runtimeClassName: kata-cc` and run directly on the underlying TDX or SEV-SNP hardware.
122122

123123
- **kbs-access**: A web service that retrieves and presents secrets obtained from the Trustee Key Broker Service (KBS) via the Confidential Data Hub (CDH). Useful for verifying end-to-end attestation and secret delivery in locked-down environments.
124124

charts/all/intel-dcap/templates/intel-dpo-sgx.yaml

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@ kind: SgxDevicePlugin
33
metadata:
44
name: sgxdeviceplugin-sample
55
spec:
6-
image: registry.connect.redhat.com/intel/intel-sgx-plugin@sha256:f2c77521c6dae6b4db1896a5784ba8b06a5ebb2a01684184fc90143cfcca7bf4
6+
image: registry.connect.redhat.com/intel/intel-sgx-plugin@sha256:4ac8769c4f0a82b3ea04cf1532f15e9935c71fe390ff5a9dc3ee57f970a65f0b
77
enclaveLimit: 110
88
provisionLimit: 110
99
logLevel: 4

charts/all/intel-dcap/templates/pccs-deployment.yaml

Lines changed: 1 addition & 1 deletion
@@ -37,7 +37,7 @@ spec:
3737
privileged: true # Required for chcon to work on host files
3838
containers:
3939
- name: pccs
40-
image: registry.redhat.io/openshift-sandboxed-containers/osc-pccs@sha256:de64fc7b13aaa7e466e825d62207f77e7c63a4f9da98663c3ab06abc45f2334d
40+
image: registry.redhat.io/openshift-sandboxed-containers/osc-pccs@sha256:edf57087220516115512dc6687b96045cbc69472a3d6d381619f66beae032ad6
4141
envFrom:
4242
- secretRef:
4343
name: pccs-secrets

charts/all/intel-dcap/templates/qgs-config-cm.yaml

Lines changed: 0 additions & 9 deletions
This file was deleted.

charts/all/intel-dcap/templates/qgs-ds.yaml

Lines changed: 2 additions & 2 deletions
@@ -22,7 +22,7 @@ spec:
2222
dnsPolicy: ClusterFirstWithHostNet
2323
initContainers:
2424
- name: platform-registration
25-
image: registry.redhat.io/openshift-sandboxed-containers/osc-tdx-qgs@sha256:86b23461c4eea073f4535a777374a54e934c37ac8c96c6180030f92ebf970524
25+
image: registry.redhat.io/openshift-sandboxed-containers/osc-tdx-qgs@sha256:308d66da94d3116cc16530a4a2cd60a995458f331e1cb3487c6ea3fdae60545b
2626
restartPolicy: Always
2727
command: [ '/usr/bin/dcap-registration-flow' ]
2828
env:
@@ -47,7 +47,7 @@ spec:
4747
mountPath: /sys/firmware/efi/efivars
4848
containers:
4949
- name: tdx-qgs
50-
image: registry.redhat.io/openshift-sandboxed-containers/osc-tdx-qgs@sha256:86b23461c4eea073f4535a777374a54e934c37ac8c96c6180030f92ebf970524
50+
image: registry.redhat.io/openshift-sandboxed-containers/osc-tdx-qgs@sha256:308d66da94d3116cc16530a4a2cd60a995458f331e1cb3487c6ea3fdae60545b
5151
args:
5252
- -p=4050
5353
- -n=4

charts/all/intel-dcap/templates/qgs-sgx-cm.yaml

Lines changed: 0 additions & 16 deletions
This file was deleted.

docs/nfd-matchall-bug.md

Lines changed: 122 additions & 0 deletions
@@ -0,0 +1,122 @@
1+
# Bug Report: NFD `matchAll` Field Silently Dropped, Causing False TEE Labels
2+
3+
## Summary
4+
5+
`matchAll` is not a valid field in the NFD `NodeFeatureRule.spec.rules[]` schema. The correct field for AND-logic matching is `matchFeatures` (top-level on each rule). When `matchAll` is used, the OpenShift NFD operator silently drops it during the `nfd.openshift.io/v1alpha1` to `nfd.k8s-sigs.io/v1alpha1` conversion, leaving rules with no match predicates. These empty rules match every node unconditionally, applying all TEE labels regardless of hardware.
6+
7+
## Impact
8+
9+
- Every node receives ALL TEE labels: `intel.feature.node.kubernetes.io/tdx`, `amd.feature.node.kubernetes.io/snp`, `ibm.feature.node.kubernetes.io/se`, and `intel.feature.node.kubernetes.io/sgx`
10+
- The OpenShift sandboxed containers operator fails with: **`"multiple TEE platforms detected; only one per cluster supported"`**
11+
- KataConfig cannot reconcile, so no kata runtime handler is installed
12+
- All confidential container pods fail with: `failed to find runtime handler kata-snp from runtime list`
13+
14+
## Root Cause
15+
16+
The NFD `Rule` schema supports two match fields:
17+
18+
| Field | Behavior | Valid? |
19+
|-------|----------|--------|
20+
| `matchFeatures` | Top-level list of feature matchers; ALL must match (AND) | Yes |
21+
| `matchAny` | List of match groups; ANY must match (OR) | Yes |
22+
| `matchAll` | **Does not exist in the NFD API** | No |
23+
24+
When the chart template uses `matchAll`:
25+
26+
```yaml
27+
# BROKEN - matchAll is not a valid field
28+
- name: "amd.sev-snp"
29+
labels:
30+
amd.feature.node.kubernetes.io/snp: "true"
31+
matchAll:
32+
- matchFeatures:
33+
- feature: cpu.security
34+
matchExpressions:
35+
sev.snp.enabled: { op: Exists }
36+
```
37+
38+
The OpenShift NFD operator creates a shadow resource under `nfd.k8s-sigs.io/v1alpha1`. During this conversion, `matchAll` is an unrecognized field and is silently stripped. The resulting live resource has:
39+
40+
```yaml
41+
# RESULT - no match conditions, matches every node
42+
- name: "amd.sev-snp"
43+
labels:
44+
amd.feature.node.kubernetes.io/snp: "true"
45+
labelsTemplate: ""
46+
varsTemplate: ""
47+
```
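A rule stripped this way can be spotted mechanically: it carries labels but has neither of the two valid match fields. The following is a minimal Python sketch, not part of the chart or of NFD. It assumes the rules have already been loaded into plain dictionaries (for example from the live resource's JSON), and the function name `unconditional_rules` is hypothetical.

```python
from typing import Any

# The two match fields the report lists as valid in the NFD Rule schema.
MATCH_FIELDS = ("matchFeatures", "matchAny")


def unconditional_rules(rules: list[dict[str, Any]]) -> list[str]:
    """Return the names of rules that have no non-empty match field.

    Such rules match every node, which is the failure mode described above.
    """
    flagged = []
    for rule in rules:
        has_predicate = any(rule.get(field) for field in MATCH_FIELDS)
        if not has_predicate:
            flagged.append(rule.get("name", "<unnamed>"))
    return flagged


# The stripped rule as it appears in the live resource (copied from the report).
stripped_rule = {
    "name": "amd.sev-snp",
    "labels": {"amd.feature.node.kubernetes.io/snp": "true"},
    "labelsTemplate": "",
    "varsTemplate": "",
}

# A rule that still has its predicates (shape follows the intel.tdx row of the fix table).
guarded_rule = {
    "name": "intel.tdx",
    "labels": {"intel.feature.node.kubernetes.io/tdx": "true"},
    "matchFeatures": [
        {"feature": "cpu.cpuid", "matchExpressions": {"VMX": {"op": "Exists"}}},
        {"feature": "cpu.security", "matchExpressions": {"tdx.enabled": {"op": "Exists"}}},
    ],
}

print(unconditional_rules([stripped_rule, guarded_rule]))
```

Checking the converted `nfd.k8s-sigs.io/v1alpha1` resource rather than the chart template is the useful case here, since the template still contains the predicates under `matchAll` and only the live resource shows that they were dropped.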
48+
49+
## Evidence
50+
51+
**Node:** `master-03` (Intel Xeon, model family 6, ID 207, vendor: Intel)
52+
53+
**NFD-reported hardware features (`cpu.security`):**
54+
55+
```bash
56+
sgx.enabled: "true"
57+
sgx.epc: "4257210368"
58+
```
59+
60+
Note: `sev.snp.enabled`, `tdx.enabled`, and `se.enabled` are **not present** in the node's feature data.
61+
62+
**Labels applied to the node (all false positives except sgx):**
63+
64+
```bash
65+
amd.feature.node.kubernetes.io/snp=true # FALSE - Intel CPU, no SEV-SNP
66+
intel.feature.node.kubernetes.io/tdx=true # FALSE - no tdx.enabled in cpu.security
67+
ibm.feature.node.kubernetes.io/se=true # FALSE - Intel CPU, no SE
68+
intel.feature.node.kubernetes.io/sgx=true # CORRECT - sgx.enabled is true
69+
feature.node.kubernetes.io/runtime.kata=true # CORRECT - matchAny works (valid field)
70+
```
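The comparison above between reported features and applied labels can be expressed as a small check. This is an illustrative Python sketch only: the label-to-feature mapping is taken from the labels and `cpu.security` keys quoted in this report, the function name `false_tee_labels` is hypothetical, and it does not reproduce the operator's actual detection logic.

```python
# Mapping from each TEE label to the cpu.security key that should justify it,
# as quoted in this report.
LABEL_TO_SECURITY_KEY = {
    "amd.feature.node.kubernetes.io/snp": "sev.snp.enabled",
    "intel.feature.node.kubernetes.io/tdx": "tdx.enabled",
    "ibm.feature.node.kubernetes.io/se": "se.enabled",
    "intel.feature.node.kubernetes.io/sgx": "sgx.enabled",
}


def false_tee_labels(node_labels: dict[str, str], cpu_security: dict[str, str]) -> list[str]:
    """Return TEE labels set to "true" whose backing cpu.security key is absent."""
    return sorted(
        label
        for label, key in LABEL_TO_SECURITY_KEY.items()
        if node_labels.get(label) == "true" and key not in cpu_security
    )


# Data for master-03, copied from the evidence above.
cpu_security = {"sgx.enabled": "true", "sgx.epc": "4257210368"}
node_labels = {
    "amd.feature.node.kubernetes.io/snp": "true",
    "intel.feature.node.kubernetes.io/tdx": "true",
    "ibm.feature.node.kubernetes.io/se": "true",
    "intel.feature.node.kubernetes.io/sgx": "true",
    "feature.node.kubernetes.io/runtime.kata": "true",
}

for label in false_tee_labels(node_labels, cpu_security):
    print(f"false positive: {label}")
```

On this data the check flags the `snp`, `se`, and `tdx` labels and leaves `sgx` alone, matching the annotations in the label listing.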
71+
72+
**Sandbox operator log:**
73+
74+
```text
75+
INFO controllers.KataConfig failed to detect TEE platform
76+
{"err": "multiple TEE platforms detected; only one per cluster supported"}
77+
```
78+
79+
## Fix
80+
81+
Replace `matchAll` with `matchFeatures` in each rule. The `matchFeatures` list at the rule level uses AND logic (all entries must match), which is the intended behavior.
82+
83+
Additionally, add vendor-discriminating CPUID checks to prevent cross-platform false positives:
84+
85+
```yaml
86+
# FIXED - uses matchFeatures (valid field) with vendor guard
87+
- name: "amd.sev-snp"
88+
labels:
89+
amd.feature.node.kubernetes.io/snp: "true"
90+
matchFeatures:
91+
- feature: cpu.cpuid
92+
matchExpressions:
93+
SVM: { op: Exists } # AMD-only CPUID flag
94+
- feature: cpu.security
95+
matchExpressions:
96+
sev.snp.enabled: { op: Exists }
97+
```
98+
99+
| Rule | `matchAll` (broken) | `matchFeatures` (fixed) | Vendor guard added |
100+
|------|---------------------|-------------------------|--------------------|
101+
| `amd.sev-snp` | Dropped silently | AND: SVM + sev.snp.enabled | `SVM` (AMD) |
102+
| `intel.sgx` | Dropped silently | AND: SGX + SGXLC + sgx.enabled + X86_SGX | `SGX`, `SGXLC` (Intel) |
103+
| `intel.tdx` | Dropped silently | AND: VMX + tdx.enabled | `VMX` (Intel) |
104+
| `ibm.se.enabled` | Dropped silently | AND: se.enabled | None (s390x only) |
105+
106+
## Affected Versions
107+
108+
Any deployment using the consolidated `NodeFeatureRule` (`consolidated-hardware-features`) introduced in commit `57ec5f4`. The original separate NFD rule files (`amd-nfd-rules.yaml`, `intel-nfd-rules.yaml`) used `matchFeatures` correctly but the consolidation mistakenly introduced `matchAll`.
109+
110+
## Remediation
111+
112+
After deploying the corrected chart:
113+
114+
```bash
115+
# Remove false labels so NFD can re-evaluate
116+
oc label node <node> amd.feature.node.kubernetes.io/snp- \
117+
intel.feature.node.kubernetes.io/tdx- \
118+
ibm.feature.node.kubernetes.io/se-
119+
120+
# Restart sandbox operator to re-evaluate KataConfig
121+
oc delete pod -n openshift-sandboxed-containers-operator -l app=controller-manager
122+
```
