Commit 9ef5086

docs(k8s-proxy): add DaemonSet architecture + auto-replay environments guide (#842)
1 parent c352a42 commit 9ef5086

3 files changed

Lines changed: 316 additions & 12 deletions

vale_styles/config/vocabularies/Base/accept.txt

Lines changed: 48 additions & 0 deletions
@@ -132,3 +132,51 @@ keploy-daemonset
132132
keploy-agent
133133
recordingsessions
134134
replaysessions
135+
TGID[s]?
136+
[Rr]efcount[s]?
137+
GitOps
138+
envFrom
139+
valueFrom
140+
[Cc]onfigMap[s]?
141+
ServiceAccount[s]?
142+
imagePullSecret[s]?
143+
NetPolic(y|ies)
144+
NetworkPolic(y|ies)
145+
containerd
146+
launchd
147+
systemd
148+
pm2
149+
SPDY
150+
mTLS
151+
PodTemplate[Ss]pec
152+
podSelector
153+
matchLabels
154+
backoff
155+
[Aa]ir-?gap(?:ped|ping)?
156+
kubelet
157+
keployContext
158+
keploy-replay-runner
159+
ReplayJob[s]?
160+
CreateReplayJobRequest
161+
runner-mode
162+
cluster-mode
163+
crd
164+
runner
165+
sidecar
166+
[Kk]3s
167+
[Kk]0s
168+
kindNet
169+
randAlphaNum
170+
secretKeyRef
171+
HostPath
172+
PostStart
173+
[Cc]group[s]?
174+
[Uu]serspace
175+
[Tt]eardown
176+
[Rr]eplayer
177+
[Rr]ehydrate[ds]?
178+
[Rr]eachability
179+
[Ww]alkthrough
180+
[Dd]ev
181+
[Cc]Rs?
182+
[Ss]ubresource[s]?
Lines changed: 256 additions & 0 deletions
@@ -0,0 +1,256 @@
1+
---
2+
id: k8s-proxy-daemonset-architecture
3+
title: K8s Proxy DaemonSet Architecture & Auto-Replay Environments
4+
sidebar_label: DaemonSet & Auto-Replay
5+
description: How Keploy's DaemonSet recording works under the hood and the two environments where auto-replay can run (a Docker daemon runner and a separate replay cluster).
6+
tags:
7+
- kubernetes
8+
- k8s proxy
9+
- daemonset
10+
- architecture
11+
- auto-replay
12+
- enterprise
13+
keywords:
14+
- keploy daemonset
15+
- eBPF capture
16+
- RecordingSession CRD
17+
- auto-replay modes
18+
- cluster-mode replay
19+
- replay-runner
20+
- docker daemon replay
21+
---
22+
23+
import ProductTier from '@site/src/components/ProductTier';
24+
25+
<ProductTier tiers="Enterprise" offerings="Self-Hosted, Dedicated" />
26+
27+
The Keploy Kubernetes Proxy supports two recording modes—**Sidecar** and **DaemonSet**—and two independent **auto-replay environments** that the same proxy can dispatch to. This page explains the moving parts of DaemonSet recording and then walks through both replay environments end to end.
28+
29+
If you only want the install steps, see [the K8s Proxy quickstart](/docs/quickstart/k8s-proxy/) or [the customer cluster-mode setup guide](/docs/running-keploy/k8s-proxy-api/). This document is the "behind the scenes" reference.
30+
31+
---
32+
33+
## Part 1—DaemonSet recording architecture
34+
35+
### Why DaemonSet mode
36+
37+
Sidecar mode injects a `keploy-agent` container into your application Pod via a `MutatingAdmissionWebhook` and rolls the Deployment. That works, but it has two non-trivial requirements:
38+
39+
1. **Write RBAC on the application namespace.** The proxy needs `patch deployments` to add the sidecar.
40+
2. **An application restart at recording start.** The injected sidecar only takes effect on the next rollout.
41+
42+
Some production environments cannot meet these requirements: Keploy may be limited to read-only RBAC on the application namespace, or restarting the Pod may carry an unacceptable cost. DaemonSet mode removes both requirements.
43+
44+
### The three components
45+
46+
```
┌────────────── Source cluster ────────────────────────────────────────────┐
│                                                                          │
│  ┌───────────────┐             ┌─────────────────────────────────────┐   │
│  │  Application  │             │ k8s-proxy (Deployment)              │   │
│  │  Pods         │             │ - controller-runtime manager        │   │
│  │  (unchanged,  │             │ - REST API (/record/start, etc.)    │   │
│  │  no sidecar)  │             │ - persists to MinIO + MongoDB       │   │
│  └───────┬───────┘             └──────────────┬──────────────────────┘   │
│          │                                    │                          │
│          │ traffic captured by eBPF           │ creates RecordingSession │
│          │                                    ▼                          │
│          │                     ┌────────────────────────────┐            │
│          │                     │ kube-apiserver / etcd      │            │
│          │                     │  • RecordingSession CRD    │            │
│          │                     │  • ReplaySession CRD       │            │
│          │                     └──────────────┬─────────────┘            │
│          │                                    │ watch                    │
│  ┌───────▼─────────────────────────────────┐  │                          │
│  │ keploy-daemonset (per node)             │◀─┘                          │
│  │ - controller-runtime watches the CR     │                             │
│  │ - resolves matching Pods on this node   │                             │
│  │ - programs target_namespace_pids +      │                             │
│  │   target_cgroup_ids BPF maps            │                             │
│  │ - eBPF programs filter by those maps    │                             │
│  │ - uploads test cases + mocks back to    │                             │
│  │   k8s-proxy over HTTP                   │                             │
│  └─────────────────────────────────────────┘                             │
└──────────────────────────────────────────────────────────────────────────┘
```
76+
77+
The pieces:
78+
79+
1. **k8s-proxy Deployment.** Same single-replica controller you already run for Sidecar mode. It owns the REST API the Console calls (`/record/start`, `/record/stop`, `/test/start`, etc.), persists captured artifacts to MinIO + MongoDB, and dispatches auto-replay (see Part 2).
80+
2. **`recordingsessions.keploy.io` CRD.** A small Custom Resource the proxy creates at `/record/start`. Each CR is named after the target Deployment and carries a `podSelector`, the list of containers to trace, and the desired mock format. The CRD is the authoritative coordination object between the control plane (k8s-proxy) and the data plane (DaemonSet). Status flows back as a `perNode` array on the CR's `status` subresource.
81+
3. **`keploy-daemonset` DaemonSet.** One Pod per node, running the same enterprise binary you ship for Sidecar mode but in agent-only mode. Each Pod loads its eBPF programs, watches the RecordingSession CR via controller-runtime, and is responsible for capturing traffic from the application Pods that landed on its node.
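
To make the coordination object concrete, a RecordingSession might look roughly like the manifest below. This is a sketch, not the shipped schema: the `v1alpha1` version, the `containers` and `mockFormat` field names, the namespace, and the shape of each `perNode` entry are assumptions made for illustration. Only the `keploy.io` group, the Deployment-derived name, `podSelector`, and `status.perNode` come from the description above.

```yaml
# Illustrative only: field names marked "assumed" are not confirmed by the CRD.
apiVersion: keploy.io/v1alpha1   # group from the CRD name; version is assumed
kind: RecordingSession
metadata:
  name: orders-api               # named after the target Deployment
  namespace: shop                # assumed: the application namespace
spec:
  podSelector:
    matchLabels:
      app: orders-api            # hypothetical label on the application Pods
  containers:                    # assumed field: containers to trace
    - orders-api
  mockFormat: yaml               # assumed field and value
status:
  perNode:                       # written back by each DaemonSet Pod
    - node: worker-1             # assumed entry shape
      phase: Recording
```

Each DaemonSet Pod would act only on the Pods that match `podSelector` and are scheduled on its own node, then report its progress in its own `perNode` entry.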
82+
83+
A `replaysessions.keploy.io` CRD ships alongside RecordingSession but is not used by any current replay environment—it exists so the controller-runtime scheme registers cleanly when a future in-cluster served-replay path is wired up.
84+
85+
### What you don't get without the DaemonSet
86+
87+
If `daemonset.enabled=false` in the chart, `/record/start` falls back to the Sidecar path: the proxy injects the agent via the webhook and rolls the application Pod. Both modes drive the same REST API and persist to the same MongoDB schema, so the rest of the Console (Reports, Schema Coverage, Auto-Replay history) does not need to know which mode produced the data.
88+
89+
---
90+
91+
## Part 2—Auto-replay environments
92+
93+
When a recording session ends—either because the cooldown window expires or because `/record/stop` was called—the proxy fires an auto-replay against the freshly recorded test sets. Where that replay actually runs is controlled by `KEPLOY_AUTO_REPLAY_MODE`. Two values are supported, deliberately independent of each other:
94+
95+
| Mode      | Replay runs on…                           | Best for                                                                                                               |
| --------- | ----------------------------------------- | ---------------------------------------------------------------------------------------------------------------------- |
| `runner`  | a Docker daemon outside the cluster       | Customers who don't want any pod scheduling for replay; long-lived runners that pull work over HTTP.                   |
| `cluster` | a separate Kubernetes cluster you provide | Production with read-only RBAC on the source cluster; replay runs against an isolated Pod in a customer-owned cluster. |
99+
100+
`cluster` is the default in current builds. The mode is process-wide on each k8s-proxy Pod—flipping it requires a Helm upgrade or `kubectl set env` and a rollout.
101+
102+
### How dispatch works
103+
104+
`/record/stop` runs the recording teardown synchronously and then enters a dispatch branch in `pkg/http/handlers.go`. The branch reads `cfg.AutoReplayMode` and routes to the matching handler, which stands up a replay environment from the captured test cases. Both modes eventually drive the OSS replayer (`go.keploy.io/server/v3/pkg/service/replay`)—what differs is **where the application under test actually runs** during replay.
105+
106+
The default replay-start delay is **10 seconds** in both modes. This gives the replayed application time to bind its port before the OSS replayer fires the first test case. Callers can override it via `auto_replay_config.delay` in the `/record/start` body.
107+
108+
---
109+
110+
### Mode A—`runner` (Docker daemon)
111+
112+
```
[/record/stop]


k8s-proxy
  • POSTs a CreateReplayJobRequest to its own
    /replay-jobs endpoint, which puts a ReplayJob
    in an in-memory store with status=pending

(somewhere outside the cluster, on a host with Docker installed)
keploy-replay-runner ─poll──▶ k8s-proxy /replay-jobs/poll
binary                        (HTTPS, shared bearer token)

  │ receives a job:
  │   { record_id, test_set_ids[], image, env, app_port, ... }

docker run <image>            (the application container)
docker run keploy/enterprise  (the keploy agent, on the same
                               user-defined Docker network)

  │ keploy enterprise replay … --record-id=<id>
  │   downloads mocks + test cases from k8s-proxy via HTTP
  │   runs the OSS replayer

docker rm <both containers>

  │ POST /replay-jobs/{jobID}/complete

k8s-proxy
  • merges the report into Mongo
  • surfaces the run on the Console reports dashboard
```
144+
145+
The runner is a small standalone binary (`cmd/replay-runner` in the k8s-proxy repo). It is not deployed by the chart—operators install it on whichever machine has the Docker daemon, point it at the proxy with a shared token, and start it as a systemd unit / launchd service / pm2 job.
146+
147+
**Configuration on the k8s-proxy side:**
148+
149+
```yaml
env:
  KEPLOY_AUTO_REPLAY_MODE: runner
```
153+
154+
**Configuration on the runner side** (CLI flags or env):
155+
156+
| Flag             | Env                   | Description                                                                    |
| ---------------- | --------------------- | ------------------------------------------------------------------------------ |
| `--platform-url` | `KEPLOY_PLATFORM_URL` | k8s-proxy's externally reachable URL (the same `ingressUrl` the Console uses). |
| `--shared-token` | `KEPLOY_SHARED_TOKEN` | Bearer token. Read from the k8s-proxy `<release>-shared-token` Secret.         |
| `--runner-id`    | `KEPLOY_RUNNER_ID`    | Stable identifier for this runner; used for heartbeat + job assignment.        |
| `--keploy-bin`   | `KEPLOY_BIN`          | Path to the `keploy enterprise` binary that drives the replay.                 |
| `--work-dir`     | `KEPLOY_WORK_DIR`     | Scratch directory for downloaded mocks and reports.                            |
| `--cluster-name` | `KEPLOY_CLUSTER_NAME` | Optional. When set, the runner only picks up jobs scoped to this cluster.      |
164+
165+
The runner heartbeats while a job is in progress and POSTs the final report back to `/replay-jobs/{jobID}/complete`. The k8s-proxy never touches the runner's host—it just exposes the queue.
166+
167+
**When to use it:** customers who can't (or don't want to) run replay Pods inside a Kubernetes cluster at all—typically when the customer has a dedicated VM for test execution, or when air-gapping the replay environment from production is a hard requirement. The trade-off is one more piece of infrastructure to operate.
168+
169+
---
170+
171+
### Mode B—`cluster` (separate replay cluster)
172+
173+
This is the **recommended** production mode and is also the default. It keeps the source cluster strictly read-only and runs every replay in a customer-provided second cluster reached through a kubeconfig.
174+
175+
```
┌── Source cluster (read-only RBAC) ────────────────────────────────────┐
│                                                                       │
│  [/record/stop] ──▶ k8s-proxy                                         │
│                       │  reads source Deployment (image, ports, env,  │
│                       │  ConfigMap/Secret refs); read-only            │
│                       │  rehydrates referenced ConfigMaps + Secrets   │
│                       │  into the replay namespace                    │
│                       │                                               │
└───────────────────────┼───────────────────────────────────────────────┘
                        │ kubeconfig (mounted as a Secret)

┌── Replay cluster (customer-managed) ──────────────────────────────────┐
│                                                                       │
│    ┌─────────────────────────────────────────────────────────────┐    │
│    │ Replay namespace (e.g. keploy-replay)                       │    │
│    │                                                             │    │
│    │ Pod <app>-rpl-xxxxxx                                        │    │
│    │   ├─ application container (image from source Deployment)   │    │
│    │   └─ keploy-agent sidecar (replays mocks)                   │    │
│    │ Service <app>-rpl-xxxxxx-svc                                │    │
│    │ NetPolicy <app>-rpl-xxxxxx-deny-egress                      │    │
│    │ Rehydrated ConfigMaps + Secrets                             │    │
│    │                                                             │    │
│    │ Pod, Service, NetPolicy deleted when the session ends.      │    │
│    └─────────────────────────────────────────────────────────────┘    │
└───────────────────────────────────────────────────────────────────────┘
```
203+
204+
**Flow on `/record/stop`:**
205+
206+
1. k8s-proxy reads the source Deployment's `PodTemplateSpec` (read-only).
207+
2. It rehydrates every `envFrom` / `valueFrom` / volume `ConfigMap` and `Secret` referenced by the Pod template into the replay-cluster's namespace, using the mounted kubeconfig. ServiceAccount-token Secrets are intentionally skipped—they are cluster-bound.
208+
3. It creates a standalone Pod (`<app>-rpl-<random>`) plus a backing Service and a deny-all-egress NetworkPolicy in the replay cluster. The Pod runs the application image alongside the keploy-agent sidecar.
209+
4. It opens a SPDY port-forward through the replay cluster's API server to the agent port and the recorded application port. The OSS replayer drives test cases through that local forward—k8s-proxy never needs in-cluster network reachability into the replay cluster.
210+
5. When replay ends, the proxy deletes the Pod, Service, and NetworkPolicy. ConfigMaps and Secrets are left in place; they're rehydrated again next run if the source spec changed.
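
The deny-all-egress NetworkPolicy from step 3 can be sketched with the standard `networking.k8s.io/v1` API. The resource name follows the `<app>-rpl-<random>` pattern above; the `app` label and the concrete names are placeholders, and the manifest the proxy actually generates may differ.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: orders-api-rpl-abc123-deny-egress   # placeholder following <app>-rpl-<random>
  namespace: keploy-replay
spec:
  podSelector:
    matchLabels:
      app: orders-api-rpl-abc123            # hypothetical label on the replay Pod
  policyTypes:
    - Egress
  # No egress rules are listed, so all outbound traffic from the Pod is denied.
```

Because only `Egress` appears under `policyTypes`, this policy does not restrict ingress to the Pod. Enforcement also depends on the replay cluster's network plugin supporting NetworkPolicy.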
211+
212+
**What stays the same as `runner` mode:** the OSS replayer, the report shape, the Mongo collections (`testrunReports`, `testsetReports`, `testcaseReports`, `autoReplayMetrics`, `k8sSchemaCoverageReports`), and the Console UI.
213+
214+
**What's different:** every Pod / Service / NetworkPolicy write goes to the replay cluster. The source cluster never sees a write from Keploy.
215+
216+
**Configuration:**
217+
218+
```yaml
env:
  KEPLOY_AUTO_REPLAY_MODE: cluster
  KEPLOY_REPLAY_KUBECONFIG_PATH: /etc/replay/kubeconfig
  KEPLOY_REPLAY_NAMESPACE: keploy-replay
  # Optional: pre-existing imagePullSecret in the replay namespace
  # KEPLOY_REPLAY_IMAGE_PULL_SECRET: my-pull-secret

extraVolumes:
  - name: replay-kubeconfig
    secret:
      secretName: replay-kubeconfig

extraVolumeMounts:
  - name: replay-kubeconfig
    mountPath: /etc/replay
    readOnly: true
```
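
The `replay-kubeconfig` Secret referenced above could be created from a manifest like the following sketch. The key is named `kubeconfig` so that, with the `/etc/replay` mount, the file lands at the configured `KEPLOY_REPLAY_KUBECONFIG_PATH`. The `keploy` namespace, the server URL, and the user and context names are placeholders, and the angle-bracket values must be replaced with real credentials.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: replay-kubeconfig
  namespace: keploy                 # assumed: the namespace where k8s-proxy runs
type: Opaque
stringData:
  kubeconfig: |                     # key name becomes the file name under /etc/replay
    apiVersion: v1
    kind: Config
    clusters:
      - name: replay
        cluster:
          server: https://replay.example.internal:6443   # placeholder URL
          certificate-authority-data: <base64 CA bundle>
    users:
      - name: keploy-replay
        user:
          token: <service account token>
    contexts:
      - name: replay
        context:
          cluster: replay
          user: keploy-replay
          namespace: keploy-replay
    current-context: replay
```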
236+
237+
The kubeconfig in the Secret should grant the proxy `create / update / patch / delete` on Pods, Services, NetworkPolicies, ConfigMaps, and Secrets **in the replay namespace only**, plus `pods/portforward` and `pods/log`. See the customer setup guide for a copy-paste Role + RoleBinding template.
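
As a rough illustration of those permissions, a namespaced Role and RoleBinding in the replay cluster might look like this. The template in the customer setup guide is the authoritative version; the resource names and the ServiceAccount subject below are hypothetical, and read verbs such as `get` or `list` are left out because the text above does not list them.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: keploy-replay              # hypothetical name
  namespace: keploy-replay
rules:
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps", "secrets"]
    verbs: ["create", "update", "patch", "delete"]
  - apiGroups: ["networking.k8s.io"]
    resources: ["networkpolicies"]
    verbs: ["create", "update", "patch", "delete"]
  - apiGroups: [""]
    resources: ["pods/portforward"]
    verbs: ["create"]              # port-forward sessions are opened with create
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: keploy-replay
  namespace: keploy-replay
subjects:
  - kind: ServiceAccount
    name: keploy-replay            # hypothetical identity behind the kubeconfig
    namespace: keploy-replay
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: keploy-replay
```

Using a Role rather than a ClusterRole keeps every grant scoped to the replay namespace.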
238+
239+
**Graceful fallback:** if `KEPLOY_AUTO_REPLAY_MODE=cluster` is set but `KEPLOY_REPLAY_KUBECONFIG_PATH` is empty or the file is missing, k8s-proxy logs a warning and skips the trailing replay rather than failing the recording session.
240+
241+
**When to use it:** any production environment where the source cluster must remain untouched, or where you want hard isolation between recording and replay environments. The trade-off is operating a second Kubernetes cluster; for many teams a small managed cluster (1 or 2 small nodes) is sufficient since replays are short-lived and serialized per `(namespace, deployment)` pair.
242+
243+
---
244+
245+
## Picking a combination
246+
247+
Recording mode and replay environment are orthogonal, so every combination is valid. The table below lists common pairings:
248+
249+
| You want…                                                                                      | Recording mode | Replay environment |
| ---------------------------------------------------------------------------------------------- | -------------- | ------------------ |
| Fastest setup, you already have a Docker host outside the cluster                              | Sidecar        | `runner`           |
| No application restart, you already have a Docker host outside the cluster                     | DaemonSet      | `runner`           |
| Production with read-only RBAC on the source namespace, second K8s cluster available           | DaemonSet      | `cluster`          |
| Production with read-only RBAC on the source namespace, no spare K8s cluster but a Docker host | DaemonSet      | `runner`           |
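
As an example, the DaemonSet + `runner` pairing could be expressed in Helm values roughly as follows. This sketch assumes the chart exposes `daemonset.enabled` and a top-level `env` map in the way the earlier snippets on this page use them; check the chart's own values file for the exact layout.

```yaml
daemonset:
  enabled: true                      # DaemonSet recording; false falls back to Sidecar
env:
  KEPLOY_AUTO_REPLAY_MODE: runner    # replay on an external Docker host
```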
255+
256+
For the operational walkthrough of the cluster-mode setup, see the setup section of [the K8s Proxy REST API guide](/docs/running-keploy/k8s-proxy-api/).
