Commit 857a9f0

JAORMX and claude committed
Add guest/harden package for VM kernel and capability hardening
Introduce a reusable guest-side hardening package that applies kernel sysctl defaults (kptr_restrict, dmesg_restrict, BPF) and drops unneeded capabilities from the bounding set via prctl. Document the new package in docs/SECURITY.md.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 4f9d0a5 commit 857a9f0

7 files changed

Lines changed: 466 additions & 0 deletions

docs/SECURITY.md

Lines changed: 54 additions & 0 deletions
@@ -10,6 +10,7 @@ hardening recommendations for propolis.
 - [Networking Trust Boundary](#networking-trust-boundary)
 - [Guest Escape Blast Radius](#guest-escape-blast-radius)
 - [Hardening Recommendations](#hardening-recommendations)
+- [Guest Hardening](#guest-hardening)
 - [Egress Policy Security Model](#egress-policy-security-model)
 - [Tar Extraction Defenses](#tar-extraction-defenses)
 - [Process Identity Verification](#process-identity-verification)
@@ -202,6 +203,59 @@ remain alive for networking to function. For the simplest deployments,
 the default runner-side networking ties the network stack to the VM's
 lifetime with no extra coordination.
 
+## Guest Hardening
+
+The `guest/harden` package provides reusable kernel and capability
+hardening for microVM init processes. It is guest-side code
+(`//go:build linux`) with no CGO or krun dependencies.
+
+### Recommended usage
+
+Call the hardening functions in your guest init boot sequence:
+
+1. Mount `/proc` and `/sys` first (sysctls need procfs).
+2. Call `harden.KernelDefaults(logger)` to apply sysctls.
+3. Perform all privileged operations (mounts, network config, chown).
+4. Call `harden.DropBoundingCaps(keep...)` as the last privileged step.
+
+### Kernel sysctls
+
+`KernelDefaults` applies the following sysctls. Each is set
+independently; individual failures are logged as warnings rather than
+aborting boot, because not all kernels support every sysctl.
+
+| Sysctl | Value | Purpose |
+|--------|-------|---------|
+| `kernel.kptr_restrict` | `2` | Hide kernel pointers from all users. Prevents information leaks that aid exploit development. |
+| `kernel.dmesg_restrict` | `1` | Restrict `dmesg` to privileged users. Prevents unprivileged processes from reading kernel log messages that may contain sensitive addresses or operations. |
+| `kernel.unprivileged_bpf_disabled` | `1` | Disable unprivileged BPF. Prevents unprivileged users from loading BPF programs, which have historically been a source of kernel privilege escalation vulnerabilities. |
+
+### Capability bounding set
+
+`DropBoundingCaps(keep...)` drops all Linux capabilities from the
+bounding set except those explicitly listed. This limits what
+capabilities child processes can acquire, even through setuid binaries
+or file capabilities.
+
+For a typical SSH-based guest, the minimal keep set is:
+
+| Capability | Number | Reason |
+|-----------|--------|--------|
+| `CAP_SETUID` | 7 | sshd credential switching to sandbox user |
+| `CAP_SETGID` | 6 | sshd group switching |
+| `CAP_NET_BIND_SERVICE` | 10 | Binding port 22 (privileged port) |
+
+### Threat model
+
+These hardening measures are defense-in-depth for the guest
+environment. An attacker who has compromised the guest workload would
+need a hypervisor escape to reach the host; however, guest hardening:
+
+- Raises the bar for local privilege escalation within the guest
+- Reduces information available for exploit development (kernel pointers, dmesg)
+- Limits the attack surface of dangerous subsystems (BPF)
+- Constrains what a compromised process can do even with root inside the guest
+
 ## Egress Policy Security Model
 
 The DNS-based egress policy (`WithEgressPolicy()`) restricts VM outbound
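The boot ordering described under "Recommended usage" can be sketched as a small sequencing helper. This is only an illustration under stated assumptions: `bootSteps`, `recordingSteps`, and `errSetup` are hypothetical names that do not exist in the package, and the function fields stand in for the real mount logic and the `harden.KernelDefaults` / `harden.DropBoundingCaps` calls.

```go
package main

import (
	"errors"
	"fmt"
)

// errSetup is a hypothetical failure used to show the early-exit path.
var errSetup = errors.New("chown failed")

// bootSteps holds the four phases from the "Recommended usage" list.
// Every field is a stand-in; a real init would plug in its own mount
// code and the guest/harden functions.
type bootSteps struct {
	mountProcSys     func() error // step 1: mount /proc and /sys
	kernelDefaults   func()       // step 2: harden.KernelDefaults(logger)
	privilegedSetup  func() error // step 3: mounts, network config, chown
	dropBoundingCaps func() error // step 4: harden.DropBoundingCaps(keep...)
}

// run executes the phases in order and stops at the first error, so
// the bounding set is only reduced after privileged setup succeeded.
func (b bootSteps) run() error {
	if err := b.mountProcSys(); err != nil {
		return fmt.Errorf("mounting /proc and /sys: %w", err)
	}
	b.kernelDefaults() // logs its own warnings and never aborts boot
	if err := b.privilegedSetup(); err != nil {
		return fmt.Errorf("privileged setup: %w", err)
	}
	if err := b.dropBoundingCaps(); err != nil {
		return fmt.Errorf("dropping bounding caps: %w", err)
	}
	return nil
}

// recordingSteps returns stubs that append their names to order, which
// makes the sequencing observable without root privileges.
func recordingSteps(order *[]string, setupErr error) bootSteps {
	note := func(name string) { *order = append(*order, name) }
	return bootSteps{
		mountProcSys:     func() error { note("mount"); return nil },
		kernelDefaults:   func() { note("sysctl") },
		privilegedSetup:  func() error { note("setup"); return setupErr },
		dropBoundingCaps: func() error { note("dropcaps"); return nil },
	}
}

func main() {
	var order []string
	err := recordingSteps(&order, nil).run()
	fmt.Println(order, err)

	order = nil
	err = recordingSteps(&order, errSetup).run()
	fmt.Println(order, err)
}
```

Keeping the capability drop as the final phase means a failed mount or chown is reported while the process can still retry or log with full privileges.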

guest/harden/capability.go

Lines changed: 103 additions & 0 deletions
@@ -0,0 +1,103 @@
+// SPDX-FileCopyrightText: Copyright 2025 Stacklok, Inc.
+// SPDX-License-Identifier: Apache-2.0
+
+//go:build linux
+
+package harden
+
+import (
+	"fmt"
+	"os"
+	"strconv"
+	"strings"
+	"syscall"
+)
+
+// Linux capability constants. Only the subset typically needed by guest
+// init processes is defined here.
+const (
+	CapChown          uintptr = 0
+	CapSetUID         uintptr = 7
+	CapSetGID         uintptr = 6
+	CapKill           uintptr = 5
+	CapNetBindService uintptr = 10
+)
+
+// prctl constants for capability bounding set manipulation.
+const (
+	prCapBSetDrop = 24 // PR_CAPBSET_DROP
+)
+
+// capLastCap reads the highest valid capability number from
+// /proc/sys/kernel/cap_last_cap. Falls back to 40 (CAP_CHECKPOINT_RESTORE,
+// the highest cap on Linux 6.x kernels) if the file is unreadable.
+func capLastCap() uintptr {
+	data, err := os.ReadFile("/proc/sys/kernel/cap_last_cap")
+	if err != nil {
+		return 40
+	}
+	n, err := parseCapLastCap(string(data))
+	if err != nil {
+		return 40
+	}
+	return n
+}
+
+// parseCapLastCap parses the content of /proc/sys/kernel/cap_last_cap.
+func parseCapLastCap(content string) (uintptr, error) {
+	n, err := strconv.Atoi(strings.TrimSpace(content))
+	if err != nil {
+		return 0, fmt.Errorf("parsing cap_last_cap: %w", err)
+	}
+	return uintptr(n), nil
+}
+
+// DropBoundingCaps drops all capabilities from the bounding set except
+// those listed in keep. This limits what capabilities child processes
+// can acquire even through setuid binaries or file capabilities.
+//
+// Call this as the last privileged operation before starting the
+// workload; all mounts, network config, and chown calls must be
+// complete before caps are dropped.
+func DropBoundingCaps(keep ...uintptr) error {
+	keepSet := make(map[uintptr]struct{}, len(keep))
+	for _, c := range keep {
+		keepSet[c] = struct{}{}
+	}
+
+	last := capLastCap()
+	for cap := uintptr(0); cap <= last; cap++ {
+		if _, ok := keepSet[cap]; ok {
+			continue
+		}
+		if err := capBSetDrop(cap); err != nil {
+			return fmt.Errorf("dropping cap %d: %w", cap, err)
+		}
+	}
+	return nil
+}
+
+// capBSetDrop calls prctl(PR_CAPBSET_DROP, cap) to remove a single
+// capability from the bounding set.
+func capBSetDrop(cap uintptr) error {
+	_, _, errno := syscall.Syscall(
+		syscall.SYS_PRCTL,
+		prCapBSetDrop,
+		cap,
+		0,
+	)
+	if errno != 0 {
+		return fmt.Errorf("prctl(PR_CAPBSET_DROP, %d): %w", cap, errno)
+	}
+	return nil
+}
+
+// keepSetContains reports whether cap is in the given keep set.
+func keepSetContains(keep []uintptr, cap uintptr) bool {
+	for _, k := range keep {
+		if k == cap {
+			return true
+		}
+	}
+	return false
+}
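One way to check the effect of `DropBoundingCaps` is to compare the `CapBnd` field of `/proc/self/status` (a hexadecimal bitmask) with the mask implied by the keep set. The sketch below assumes that approach; `keepMask` and `parseCapBnd` are hypothetical helpers that are not part of the commit, and the status text in `main` is a hand-written sample rather than captured output.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// keepMask builds the bounding-set bitmask expected after dropping
// everything except the given capability numbers.
func keepMask(keep ...uint) uint64 {
	var mask uint64
	for _, c := range keep {
		mask |= 1 << c
	}
	return mask
}

// parseCapBnd extracts the CapBnd bitmask from the text of
// /proc/<pid>/status, where it appears as a hexadecimal field.
func parseCapBnd(status string) (uint64, error) {
	sc := bufio.NewScanner(strings.NewReader(status))
	for sc.Scan() {
		line := sc.Text()
		if !strings.HasPrefix(line, "CapBnd:") {
			continue
		}
		field := strings.TrimSpace(strings.TrimPrefix(line, "CapBnd:"))
		return strconv.ParseUint(field, 16, 64)
	}
	return 0, fmt.Errorf("CapBnd line not found")
}

func main() {
	// Hand-written sample of the relevant status lines.
	sample := "Name:\tinit\nCapBnd:\t00000000000004c0\n"

	got, err := parseCapBnd(sample)
	if err != nil {
		panic(err)
	}
	// CAP_SETGID (6), CAP_SETUID (7), CAP_NET_BIND_SERVICE (10).
	want := keepMask(6, 7, 10)
	fmt.Printf("got=%#x want=%#x match=%v\n", got, want, got == want)
}
```

In a guest, the same comparison could run after the drop against the real contents of `/proc/self/status`, which would turn a silently incomplete drop into a visible boot error.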

guest/harden/capability_test.go

Lines changed: 124 additions & 0 deletions
@@ -0,0 +1,124 @@
+// SPDX-FileCopyrightText: Copyright 2025 Stacklok, Inc.
+// SPDX-License-Identifier: Apache-2.0
+
+//go:build linux
+
+package harden
+
+import (
+	"testing"
+
+	"github.com/stretchr/testify/assert"
+	"github.com/stretchr/testify/require"
+)
+
+func TestParseCapLastCap(t *testing.T) {
+	t.Parallel()
+
+	tests := []struct {
+		name    string
+		content string
+		want    uintptr
+		wantErr bool
+	}{
+		{
+			name:    "typical value",
+			content: "40\n",
+			want:    40,
+		},
+		{
+			name:    "higher kernel",
+			content: "41\n",
+			want:    41,
+		},
+		{
+			name:    "no trailing newline",
+			content: "40",
+			want:    40,
+		},
+		{
+			name:    "whitespace padding",
+			content: " 40 \n",
+			want:    40,
+		},
+		{
+			name:    "non-numeric",
+			content: "abc\n",
+			wantErr: true,
+		},
+		{
+			name:    "empty",
+			content: "",
+			wantErr: true,
+		},
+	}
+
+	for _, tt := range tests {
+		t.Run(tt.name, func(t *testing.T) {
+			t.Parallel()
+			got, err := ParseCapLastCapForTest(tt.content)
+			if tt.wantErr {
+				assert.Error(t, err)
+				return
+			}
+			require.NoError(t, err)
+			assert.Equal(t, tt.want, got)
+		})
+	}
+}
+
+func TestKeepSetContains(t *testing.T) {
+	t.Parallel()
+
+	keep := []uintptr{CapSetUID, CapSetGID, CapNetBindService}
+
+	tests := []struct {
+		name string
+		cap  uintptr
+		want bool
+	}{
+		{name: "CAP_SETUID in set", cap: CapSetUID, want: true},
+		{name: "CAP_SETGID in set", cap: CapSetGID, want: true},
+		{name: "CAP_NET_BIND_SERVICE in set", cap: CapNetBindService, want: true},
+		{name: "CAP_CHOWN not in set", cap: CapChown, want: false},
+		{name: "CAP_KILL not in set", cap: CapKill, want: false},
+		{name: "arbitrary cap not in set", cap: 99, want: false},
+	}
+
+	for _, tt := range tests {
+		t.Run(tt.name, func(t *testing.T) {
+			t.Parallel()
+			assert.Equal(t, tt.want, KeepSetContainsForTest(keep, tt.cap))
+		})
+	}
+}
+
+func TestKeepSetContains_EmptySet(t *testing.T) {
+	t.Parallel()
+
+	// With an empty keep set, nothing should be kept.
+	assert.False(t, KeepSetContainsForTest(nil, CapSetUID))
+	assert.False(t, KeepSetContainsForTest([]uintptr{}, CapSetGID))
+}
+
+func TestCapConstants(t *testing.T) {
+	t.Parallel()
+
+	// Verify the capability constants match Linux kernel values.
+	assert.Equal(t, uintptr(0), CapChown)
+	assert.Equal(t, uintptr(5), CapKill)
+	assert.Equal(t, uintptr(6), CapSetGID)
+	assert.Equal(t, uintptr(7), CapSetUID)
+	assert.Equal(t, uintptr(10), CapNetBindService)
+}
+
+func TestCapLastCap_ReadsProc(t *testing.T) {
+	t.Parallel()
+
+	// capLastCap should return a reasonable value from /proc or the
+	// fallback. On any Linux system the value should be >= 0.
+	got := capLastCap()
+	assert.GreaterOrEqual(t, got, uintptr(0))
+	// Modern kernels report cap_last_cap of at least 36 (CAP_BLOCK_SUSPEND).
+	assert.GreaterOrEqual(t, got, uintptr(36))
+}
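`TestCapLastCap_ReadsProc` depends on whatever the host's `/proc` reports, so the fallback path is never exercised deterministically. A minimal sketch of one alternative, assuming the file reader is passed in as a parameter: `capLastCapFrom`, `fixedReader`, `fallbackLastCap`, and `errNoProc` are hypothetical names, and the fallback of 40 reflects CAP_CHECKPOINT_RESTORE being capability number 40.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// fallbackLastCap is used when the kernel cannot be asked;
// 40 is CAP_CHECKPOINT_RESTORE.
const fallbackLastCap uintptr = 40

// errNoProc simulates a guest where procfs is not mounted yet.
var errNoProc = errors.New("procfs not mounted")

// capLastCapFrom mirrors capLastCap but takes the reader as a
// parameter (a hypothetical seam, not part of the package).
func capLastCapFrom(read func(path string) ([]byte, error)) uintptr {
	data, err := read("/proc/sys/kernel/cap_last_cap")
	if err != nil {
		return fallbackLastCap
	}
	n, err := strconv.Atoi(strings.TrimSpace(string(data)))
	if err != nil || n < 0 {
		// A negative value would wrap around when converted to uintptr.
		return fallbackLastCap
	}
	return uintptr(n)
}

// fixedReader returns a reader that ignores the path and always
// yields the given content and error.
func fixedReader(content string, err error) func(string) ([]byte, error) {
	return func(string) ([]byte, error) { return []byte(content), err }
}

func main() {
	fmt.Println(capLastCapFrom(fixedReader("37\n", nil)))
	fmt.Println(capLastCapFrom(fixedReader("", errNoProc)))
	fmt.Println(capLastCapFrom(fixedReader("abc\n", nil)))
}
```

With a seam like this, the readable, unreadable, and malformed cases can each be asserted without touching the real `/proc`.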

guest/harden/doc.go

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+// SPDX-FileCopyrightText: Copyright 2025 Stacklok, Inc.
+// SPDX-License-Identifier: Apache-2.0
+
+//go:build linux
+
+// Package harden provides guest-side kernel and capability hardening for
+// microVM init processes. It restricts kernel information leaks, limits
+// unprivileged access to dangerous subsystems, and drops unneeded
+// capabilities from the bounding set.
+//
+// Consumers (e.g. apiary-init) should call [KernelDefaults] early in the
+// boot sequence (after /proc is mounted) and [DropBoundingCaps] last,
+// just before starting the workload, so that all privileged operations
+// (mounts, network config, chown) are already complete.
+package harden

guest/harden/export_test.go

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
+// SPDX-FileCopyrightText: Copyright 2025 Stacklok, Inc.
+// SPDX-License-Identifier: Apache-2.0
+
+//go:build linux
+
+package harden
+
+// Test-only exports for verifying internal logic without root privileges.
+var (
+	ParseCapLastCapForTest = parseCapLastCap
+	KeepSetContainsForTest = keepSetContains
+	SysctlPathForTest      = sysctlPath
+	DefaultsForTest        = defaults
+)

guest/harden/sysctl.go

Lines changed: 69 additions & 0 deletions
@@ -0,0 +1,69 @@
+// SPDX-FileCopyrightText: Copyright 2025 Stacklok, Inc.
+// SPDX-License-Identifier: Apache-2.0
+
+//go:build linux
+
+package harden
+
+import (
+	"fmt"
+	"log/slog"
+	"os"
+	"strings"
+)
+
+// Set writes value to the sysctl identified by key. The key uses the
+// standard dotted notation (e.g. "kernel.kptr_restrict") which is
+// converted to the /proc/sys/ path (/proc/sys/kernel/kptr_restrict).
+func Set(key, value string) error {
+	path := sysctlPath(key)
+	if err := os.WriteFile(path, []byte(value), 0o644); err != nil {
+		return fmt.Errorf("sysctl %s=%s: %w", key, value, err)
+	}
+	return nil
+}
+
+// kernelDefault is a single sysctl key-value pair with a human-readable
+// reason for why it is set.
+type kernelDefault struct {
+	key    string
+	value  string
+	reason string
+}
+
+// defaults lists the recommended kernel sysctls for guest hardening.
+var defaults = []kernelDefault{
+	{
+		key:    "kernel.kptr_restrict",
+		value:  "2",
+		reason: "hide kernel pointers from all users",
+	},
+	{
+		key:    "kernel.dmesg_restrict",
+		value:  "1",
+		reason: "restrict dmesg to privileged users",
+	},
+	{
+		key:    "kernel.unprivileged_bpf_disabled",
+		value:  "1",
+		reason: "disable unprivileged BPF",
+	},
+}
+
+// KernelDefaults applies recommended kernel sysctl hardening. Each
+// setting is applied independently; failures are logged as warnings
+// rather than aborting boot, because individual sysctls may not be
+// available on all kernel versions.
+func KernelDefaults(logger *slog.Logger) {
+	for _, d := range defaults {
+		logger.Info("applying sysctl", "key", d.key, "value", d.value, "reason", d.reason)
+		if err := Set(d.key, d.value); err != nil {
+			logger.Warn("sysctl failed", "key", d.key, "error", err)
+		}
+	}
+}
+
+// sysctlPath converts a dotted sysctl key to its /proc/sys/ path.
+func sysctlPath(key string) string {
+	return "/proc/sys/" + strings.ReplaceAll(key, ".", "/")
+}
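Because `Set` hard-codes `/proc/sys/`, its write path cannot be exercised without touching the running kernel. The sketch below assumes a variant that takes the root directory as a parameter, so a temporary directory can stand in for procfs. `sysctlPathUnder`, `setUnder`, `readUnder`, and `newFakeRoot` are hypothetical helpers and are not part of the package.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// sysctlPathUnder converts a dotted key to a path below root.
// root would be "/proc/sys" in a real guest.
func sysctlPathUnder(root, key string) string {
	return filepath.Join(root, strings.ReplaceAll(key, ".", "/"))
}

// setUnder writes value to the sysctl file for key below root.
func setUnder(root, key, value string) error {
	path := sysctlPathUnder(root, key)
	if err := os.WriteFile(path, []byte(value), 0o644); err != nil {
		return fmt.Errorf("sysctl %s=%s: %w", key, value, err)
	}
	return nil
}

// readUnder reads the value back, trimming surrounding whitespace
// (procfs appends a newline to sysctl values).
func readUnder(root, key string) (string, error) {
	data, err := os.ReadFile(sysctlPathUnder(root, key))
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(data)), nil
}

// newFakeRoot creates a temporary directory with a "kernel"
// subdirectory, imitating the layout of /proc/sys. The caller is
// responsible for removing it.
func newFakeRoot() (string, error) {
	root, err := os.MkdirTemp("", "fake-proc-sys")
	if err != nil {
		return "", err
	}
	if err := os.MkdirAll(filepath.Join(root, "kernel"), 0o755); err != nil {
		return "", err
	}
	return root, nil
}

func main() {
	root, err := newFakeRoot()
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(root)

	if err := setUnder(root, "kernel.kptr_restrict", "2"); err != nil {
		panic(err)
	}
	got, err := readUnder(root, "kernel.kptr_restrict")
	fmt.Println(got, err)
}
```

The same read-back helper could be pointed at `/proc/sys` after `KernelDefaults` runs to confirm that each value took effect, since a successful write does not by itself show what the kernel now reports.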
