45 commits
c6515fd
adaptive_export: production AE — streaming export + write-integrity
Jun 8, 2026
390ee45
adaptive_export: ADAPTIVE_PASSTHROUGH firehose loop
entlein Jun 8, 2026
9535139
ci: dx-image workflow — build + publish dx-daemon to ghcr
entlein Jun 9, 2026
4ab9d07
Revert ci: dx-image workflow — wrong repo
entlein Jun 9, 2026
6139e3b
adaptive_export: unit-normalize trigger watermark cursor + load-test …
Jun 16, 2026
fd80e5f
e2e_test/adaptive_export_loadtest: AE fixture-isolation load-test har…
Jun 16, 2026
1001f6d
e2e_test/adaptive_export_loadtest: document AE implied contracts (C1-…
Jun 16, 2026
4e62453
adaptive_export_loadtest: C15 write-duration contract + DX-steering d…
Jun 16, 2026
746421f
adaptive_export/trigger: update test SQL substrings for multiIf norma…
entlein Jun 16, 2026
808a3ea
adaptive_export_loadtest: exp_control uses real now_s event_time (no …
Jun 16, 2026
9798a69
adaptive_export: ADAPTIVE_RECONCILE per-pull write-fidelity instrument
Jun 17, 2026
519bff5
harness: exp_pipeline_reconcile — skip empty-key rows (0 rows != LOSS 1)
Jun 17, 2026
3f6f409
harness: log4shell_fire.sh — reliably fire + restart the log4j-chain …
Jun 17, 2026
239e032
harness: log4shell_fire.sh — detection-signal framing (Cyber Verifica…
Jun 17, 2026
10ad397
adaptive_export(passthrough): precompiled + concurrent firehose, drop…
Jun 17, 2026
f4fe5c4
adaptive_export: bazel BUILD deps for internal/reconcile + pxl compil…
Jun 17, 2026
ee2dcfb
adaptive_export(pxl): raise Pixie 10k result cap via #px:set query flag
Jun 17, 2026
0fa7eb8
adaptive_export/sink: content_type silent-drop contract suite
entlein Jun 9, 2026
1102ce5
adaptive_export_loadtest: DX-steered-vs-ALL datavolume reduction harness
Jun 17, 2026
fed1c6e
adaptive_export_loadtest: deep AE NFR benchmark harness
Jun 18, 2026
7532c87
adaptive_export_loadtest: fix DX-reduction dead-arm (clear stale stee…
Jun 18, 2026
0094ec7
nfr harness: fix lag (dateDiff) + drop racy broker-pct completeness
Jun 18, 2026
e59180a
dx-reduction harness: report ROWS reduction (primary) + bytes (second…
Jun 18, 2026
84eac90
ae deployment: add memory limit (1Gi) + raise cpu limit to 1 core
Jun 18, 2026
00aeaf3
ae bootstrap: separate the secret from the re-applied infra bundle
Jun 18, 2026
d8192aa
dx-reduction harness: fire BOTH attack stages so DX steers the backend
Jun 18, 2026
61ef494
adaptive_export: rename whitelist→allowlist across streaming path + a…
Jun 18, 2026
9feda72
ae(clickhouse): create forensic_db.dx_attack_graph at boot
Jun 18, 2026
2c7cfcf
ae(clickhouse): dx_attack_graph numeric cols Int64/Float64 (px-readable)
Jun 18, 2026
48b100e
adaptive_export(streaming): add #px:set max_output_rows cap flag to s…
Jun 18, 2026
8e7f1c6
ae(clickhouse): create dx_attack_graph_malicious view at boot
Jun 19, 2026
39a2f12
ae(control): /dx/attack_graph ingest endpoint -> ClickHouse
Jun 19, 2026
3b1d484
adaptive_export: convention pass — consolidate CH HTTP, drop dead cod…
entlein Jun 20, 2026
e9c1331
adaptive_export: restore executable bits on harness scripts (post-hea…
entlein Jun 20, 2026
15ee7a3
adaptive_export: lint pass + restore @px load prefix in cmd/BUILD.bazel
entlein Jun 20, 2026
31febbe
adaptive_export: address user review #2 #4 #6 + 4 outstanding CodeRab…
entlein Jun 20, 2026
fdfb451
test(harness): consolidate to one run-picture + e2e CI workflow
Jun 21, 2026
dc644c3
revert(bazel): drop stray buildifier attribute-reorder in stirling co…
Jun 21, 2026
fcab916
ci: fix run-genfiles + run-container-lint on PR 53
entlein Jun 21, 2026
3094068
ci: apply gazelle's actual kwarg order to stirling container_images
entlein Jun 21, 2026
19e84ad
ci: fix container-lint Errors on e2e workflow + new harness scripts
entlein Jun 21, 2026
e0a19dd
ci: silence 6 SHELLCHECK Warnings to clear container-lint exit code
entlein Jun 21, 2026
e6a6237
feat(ae-control): bearer-JWT auth + input validation (CodeRabbit foll…
Jun 21, 2026
714c1fb
fix(ae): remaining CodeRabbit Majors (#53 followup)
Jun 21, 2026
dfdc465
fix(ae-trigger): escape a PollLimit-saturated watermark boundary (fro…
Jun 22, 2026
1 change: 1 addition & 0 deletions .arclint
@@ -20,6 +20,7 @@
"(^private\/credentials\/.*\\.yaml)",
"(^src/operator/client/versioned/)",
"(^src/operator/apis/px.dev/v1alpha1/zz_generated.deepcopy.go)",
"(^src/e2e_test/adaptive_export_loadtest/tools/loadgen/)",
"(^src/stirling/bpf_tools/bcc_bpf/system-headers)",
"(^src/stirling/mysql/testing/.*\\.json$)",
"(^src/stirling/obj_tools/testdata/go/test_go_binary.go)",
4 changes: 4 additions & 0 deletions .bazelignore
@@ -6,3 +6,7 @@ third_party/threadstacks
tools/chef/nodes
# To keep third party dependencies separate, privy is intentionally set up as a separate bazel workspace
src/datagen/pii/privy

# adaptive_export_loadtest generator is a docker-built test tool (see its README);
# build-agent to replace with a bazel target. Until then, keep it out of gazelle.
src/e2e_test/adaptive_export_loadtest/tools/loadgen
128 changes: 128 additions & 0 deletions .github/workflows/e2e_log4shell_soc.yaml
@@ -0,0 +1,128 @@
---
# e2e-log4shell-soc — stand up a real SOC stack on k3s, fire log4shell end-to-end,
# assert every canonical harness script actually runs, and profile dx in real life.
#
# Heavy: needs eBPF (Pixie PEM) + 16cpu/64gb → the oracle self-hosted runner, NOT
# ubuntu-latest. Deploy mirrors the sovereignsocdemo lab recipe (k8sstormcenter/soc
# make targets) — that kit is makefile-agent's; keep the deploy block in sync with it.
#
# Uses EXISTING k8sstormcenter/pixie repo secrets (no new ones): PX_DEPLOY_KEY,
# PX_API_KEY (Pixie enroll), DX_ENTLEIN_PAT (private entlein/dx image pull),
# CLICKHOUSE_*_PASSWORD, TAILSCALE_AUTH_KEY. Manual by default (it provisions a
# whole cluster); flip the schedule on once it's green.
name: e2e-log4shell-soc
on:
workflow_dispatch:
inputs:
dx_image:
description: dx-daemon image to test (default = .image-tags pin)
required: false
default: ""
soc_ref:
description: k8sstormcenter/soc branch
required: false
default: "218-clickhouse-schema"
permissions:
contents: read

jobs:
e2e:
runs-on: oracle-vm-16cpu-64gb-x86-64 # eBPF + 16cpu/64gb; ubuntu-latest cannot run Pixie
timeout-minutes: 90
env:
KUBECONFIG: /etc/rancher/k3s/k3s.yaml
HARNESS: src/e2e_test/adaptive_export_loadtest/harness
steps:
- name: Checkout pixie (harness scripts)
uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2

- name: Install k3s
run: |
curl -sfL https://get.k3s.io | sh -s - --write-kubeconfig-mode 644
for i in $(seq 1 60); do kubectl get nodes --no-headers 2>/dev/null | grep -q ' Ready' && break; sleep 5; done
kubectl get nodes

- name: Deploy the SOC stack (Pixie + kubescape + ClickHouse + AE + dx + log4j chain)
env:
PX_CLOUD_ADDR: pixie.austrianopencloudcommunity.org
PX_DEPLOY_KEY: ${{ secrets.PX_DEPLOY_KEY }}
PX_API_KEY: ${{ secrets.PX_API_KEY }}
TS_AUTHKEY: ${{ secrets.TAILSCALE_AUTH_KEY }}
DX_ENTLEIN_PAT: ${{ secrets.DX_ENTLEIN_PAT }} # private entlein/dx image pull
CLICKHOUSE_ANALYST_PASSWORD: ${{ secrets.CLICKHOUSE_ANALYST_PASSWORD }}
CLICKHOUSE_INGEST_PASSWORD: ${{ secrets.CLICKHOUSE_INGEST_PASSWORD }}
CLICKHOUSE_PIXIE_PASSWORD: ${{ secrets.CLICKHOUSE_PIXIE_PASSWORD }}
run: |
set -euo pipefail
sudo apt-get update -qq && sudo apt-get install -y python3-yaml
git clone --depth 1 -b "${{ inputs.soc_ref }}" https://github.com/k8sstormcenter/soc soc
cd soc
make pixie # vizier + AE
make kubescape || true # node-agent (netStreaming)
bash tree/clickhouse-lab/install.sh # forensic_db
make log4j # vulnerable backend + attacker + dx + SBoBs (managed-by=User)
if [ -n "${{ inputs.dx_image }}" ]; then
kubectl -n honey set image ds/dx-daemon dx-daemon="${{ inputs.dx_image }}" || true
fi
# optimal config + enable pprof for the real-life profile (DX_TELEMETRY_CACHE/DX_BENCH
# are defaults in main, set here too in case the kit's manifest predates them)
kubectl -n honey set env ds/dx-daemon DX_PPROF_ADDR=0.0.0.0:6060 DX_TELEMETRY_CACHE=1 DX_BENCH=pemdirect
kubectl -n honey rollout status ds/dx-daemon --timeout=120s

- name: Wait for stack healthy
run: |
set -euo pipefail
kubectl wait --for=condition=Ready pod -l name=adaptive-export -n pl --timeout=300s
kubectl wait --for=condition=Ready pod -l app=dx-daemon -n honey --timeout=300s
kubectl -n pl get pods; kubectl -n honey get pods; kubectl -n log4j-poc get pods
# dx must be non-blind on pemdirect (the optimal default from #29/#33)
kubectl -n honey logs ds/dx-daemon | grep -E "bench=pemdirect|telemetry cache ENABLED" | head

- name: Run canonical harness scripts — assert each actually runs
run: |
set -uo pipefail
mkdir -p /tmp/evidence; fail=0
for s in log4shell_fire exp_matrix nfr exp_row_reconcile; do
echo "::group::$s"
if bash "$HARNESS/$s.sh" > "/tmp/evidence/$s.log" 2>&1; then
echo "PASS $s"; tail -5 "/tmp/evidence/$s.log"
else
echo "FAIL $s (exit $?)"; tail -30 "/tmp/evidence/$s.log"; fail=1
fi
echo "::endgroup::"
done
# detection gate: dx must rule in the log4shell chain (not just run the script)
kubectl -n honey logs ds/dx-daemon | grep -iE "RULE IN|ruled_in" | tee /tmp/evidence/dx_ruleins.txt
if ! grep -qiE "log4shell|control-plane-credential-abuse|RULE IN" \
/tmp/evidence/dx_ruleins.txt; then
echo "NO dx rule-in — detection failed"; fail=1
fi
exit $fail

- name: Profile dx in real life (pprof + metrics)
if: always()
run: |
set -uo pipefail
POD=$(kubectl -n honey get pod -l app=dx-daemon -o jsonpath='{.items[0].metadata.name}')
kubectl -n honey port-forward "$POD" 6060:6060 9095:9095 & PF=$!; sleep 5
# 30s CPU profile under a fresh fire + heap, served by DX_PPROF_ADDR=:6060
( bash "$HARNESS/log4shell_fire.sh" >/dev/null 2>&1 || true ) &
curl -s --max-time 40 -o /tmp/evidence/dx_cpu.pprof \
"http://127.0.0.1:6060/debug/pprof/profile?seconds=30" || true
curl -s "http://127.0.0.1:6060/debug/pprof/heap" -o /tmp/evidence/dx_heap.pprof || true
curl -s "http://127.0.0.1:9095/metrics" -o /tmp/evidence/dx_metrics.txt || true
go tool pprof -top -nodecount=25 /tmp/evidence/dx_cpu.pprof > /tmp/evidence/dx_cpu_top.txt 2>&1 || true
kill $PF 2>/dev/null || true
echo "=== dx CPU top ==="; head -30 /tmp/evidence/dx_cpu_top.txt
echo "=== verdict latency ==="
grep -E \
"dx_(time_to_verdict|bench_query_duration)_seconds_(sum|count)" \
/tmp/evidence/dx_metrics.txt || true

- name: Upload evidence + profiles
if: always()
uses: actions/upload-artifact@b4b15b8c7c6ac21ea08fcf65892d2ee8f75cf882 # v4.4.3
with:
name: e2e-log4shell-evidence
path: /tmp/evidence/
retention-days: 14
2 changes: 1 addition & 1 deletion .github/workflows/vizier_release.yaml
@@ -140,7 +140,7 @@ jobs:
git commit -s -m "Release Helm chart Vizier ${VERSION}"
git push origin "gh-pages"
update-gh-artifacts-manifest:
- runs-on: oracle-8cpu-32gb-x86-64
+ runs-on: oracle-vm-16cpu-64gb-x86-64
needs: [get-dev-image, create-github-release]
container:
image: ${{ needs.get-dev-image.outputs.image-with-tag }}
11 changes: 11 additions & 0 deletions k8s/vizier/bootstrap/adaptive_export_deployment.yaml
@@ -31,6 +31,17 @@ spec:
containers:
- name: adaptive-export
image: vizier-adaptive_export_image:latest
# Bounded so AE can never memory-pressure a node (measured: AE uses
# only ~16-38Mi steady; passthrough with the raised 1M-row cap can
# spike, so 1Gi caps the worst case). CPU was pinned at the old 300m
# limit under concurrent passthrough → raised to 1 core.
resources:
requests:
cpu: 200m
memory: 128Mi
limits:
cpu: "1"
memory: 1Gi
env:
- name: PL_NAMESPACE
valueFrom:
7 changes: 7 additions & 0 deletions k8s/vizier/bootstrap/adaptive_export_secrets.yaml
@@ -1,3 +1,10 @@
# SEED-ONLY template — NOT in kustomization.yaml (separation of concerns).
# Real credentials are written by `make ae-auth` (pixie-api-key from keys.env,
# clickhouse-dsn = the fixed forensic-CH constant). Do NOT add this back to the
# bundle: a re-apply would clobber the real pixie-api-key with the placeholder
# (the recurring "AE unauthenticated / writes 0" bug). Apply this by hand ONLY
# to seed a brand-new cluster so the AE pod's secretKeyRef resolves before
# ae-auth runs.
---
apiVersion: v1
kind: Secret
7 changes: 6 additions & 1 deletion k8s/vizier/bootstrap/kustomization.yaml
@@ -16,5 +16,10 @@ resources:
- cert_provisioner_job.yaml
- vizier_crd_role.yaml
- adaptive_export_role.yaml
- - adaptive_export_secrets.yaml
+ # adaptive_export_secrets.yaml is intentionally NOT bundled here: it is a placeholder
+ # seed, while the live Secret holds real credentials (pixie-api-key, clickhouse-dsn)
+ # owned by `make ae-auth`. Bundling it meant every infra re-apply clobbered the real
+ # key with the placeholder. Separation of concerns: infra (role+deployment) is
+ # re-appliable; the secret is created ONCE by ae-auth and never touched by this
+ # kustomization. ponytail: apply adaptive_export_secrets.yaml manually only to seed a fresh cluster.
- adaptive_export_deployment.yaml
4 changes: 2 additions & 2 deletions skaffold/skaffold_vizier.yaml
@@ -36,8 +36,8 @@ build:
bazel:
target: //src/vizier/services/cloud_connector:cloud_connector_server_image.tar
args:
- - --config=x86_64_sysroot
- - --compilation_mode=opt
+ - --config=x86_64_sysroot
+ - --compilation_mode=opt
- image: vizier-cert_provisioner_image
context: .
bazel:
14 changes: 14 additions & 0 deletions src/api/go/pxapi/opts.go
@@ -82,3 +82,17 @@ func WithDirectCredsInsecure() ClientOption {
c.insecureDirect = true
}
}

// WithDirectTLSSkipVerify is the encrypted-but-unverified option for direct (standalone /
// node-local PEM) connections: the transport IS TLS-encrypted, but the server cert
// is not chain/hostname-verified. Use this instead of WithDirectCredsInsecure when
// the direct endpoint serves TLS with a self-signed / service cert whose SAN does
// not match the node IP (e.g. vizier-pem's direct-query port served with
// service-tls-certs, dialed at HOST_IP). Unlike WithDisableTLSVerification it does
// NOT require a "cluster.local" address, so it works for the node-IP direct dial.
// Bearer creds (the minted JWT) therefore ride an encrypted channel, never plaintext.
func WithDirectTLSSkipVerify() ClientOption {
return func(c *Client) {
c.disableTLSVerification = true
}
}
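
The doc comment above distinguishes "TLS-encrypted" from "chain/hostname-verified". As a rough illustration of that distinction (not the pxapi code path), the stdlib-only sketch below dials a self-signed test server twice. It assumes that `disableTLSVerification` ultimately maps to something like `tls.Config.InsecureSkipVerify`; the `probe` helper and the test server are hypothetical.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// probe starts a TLS server with a self-signed test certificate and dials it
// twice: once with default verification, once with verification skipped.
func probe() (strictErr error, encrypted bool, laxErr error) {
	srv := httptest.NewTLSServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprint(w, "ok")
	}))
	defer srv.Close()

	// Strict client: the test cert is not signed by a trusted root, so the
	// handshake is expected to fail.
	strict := &http.Client{Transport: &http.Transport{TLSClientConfig: &tls.Config{}}}
	if resp, err := strict.Get(srv.URL); err != nil {
		strictErr = err
	} else {
		resp.Body.Close()
	}

	// Skip-verify client: still performs a TLS handshake (encrypted channel),
	// but does not check the certificate chain or hostname.
	lax := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}}
	resp, err := lax.Get(srv.URL)
	if err != nil {
		return strictErr, false, err
	}
	defer resp.Body.Close()
	return strictErr, resp.TLS != nil, nil
}

func main() {
	strictErr, encrypted, laxErr := probe()
	fmt.Println("strict dial failed:", strictErr != nil)
	fmt.Println("skip-verify dial encrypted:", encrypted, "error:", laxErr)
}
```

The skip-verify dial keeps the bearer token off the wire in plaintext, but it does not authenticate the peer, so it is only a reasonable choice where the network path to the node-local endpoint is otherwise trusted.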
98 changes: 98 additions & 0 deletions src/e2e_test/adaptive_export_loadtest/CONTRACTS.md
@@ -0,0 +1,98 @@
# Adaptive Export (AE) — implied contracts

What AE *currently assumes but does not enforce*. Each ⚠️ is an **implied** contract
(a silent assumption); 🔴 marks ones we've observed violated, with the fix. Grounded
in `src/vizier/services/adaptive_export/` (trigger, controller, sink, config) + the
`forensic_db` DDL.

## End-to-end data flow + where each contract sits

```mermaid
flowchart TD
subgraph PROD["Producer (per node)"]
VEC["Vector kubescape_enrich sink<br/>(or load-test fixtures)"]
end
subgraph CH1["ClickHouse — input"]
KL["forensic_db.kubescape_logs<br/>MergeTree ORDER BY (event_time, hostname)<br/>TTL toDateTime(event_time)+30d"]
end
subgraph AE["adaptive_export (per node DaemonSet)"]
TRG["TRIGGER: poll 250ms<br/>WHERE hostname=NODE AND event_time>=watermark<br/>ORDER BY event_time LIMIT N"]
CTL["CONTROLLER: hash + active-set<br/>window [event_time-Before, now)"]
PXL["DATA-PLANE: PxL per (ns,pod)×table<br/>refresh every 30s while window open"]
end
subgraph VZ["Pixie"]
QB["vizier-query-broker → PEMs"]
end
subgraph CH2["ClickHouse — output (forensic_db)"]
ATTR["adaptive_attribution<br/>ReplacingMergeTree(t_end)<br/>ORDER BY (hostname, anomaly_hash)"]
WM["trigger_watermark<br/>ReplacingMergeTree(updated_at)"]
PROT["http/dns/pgsql/conn_stats/...<br/>plain MergeTree (NO dedup)"]
end

VEC -->|"C1 ⚠️ event_time UNIT = seconds<br/>C2 ⚠️ hostname = k8s node name"| KL
KL -->|"C3 🔴 event_time monotone ≥ watermark<br/>C4 ⚠️ boundary dedup by content fp"| TRG
TRG --> CTL
CTL -->|"C5 ⚠️ anomaly_hash = f(pid,comm,pod,ns) only"| ATTR
TRG -->|"C6 ⚠️ watermark persist throttled ~5s"| WM
CTL --> PXL
PXL -->|"C7 needs registered vizier"| QB
QB -->|"C8 🔴 plain MergeTree + 30s re-pull → dup"| PROT
PXL -->|"C9 ⚠️ write only if rows>0"| PROT
ATTR -. "C10 ⚠️ join: events.pod = ns/pod ↔ attribution.pod = bare" .- PROT
```
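
To illustrate why C1 and C3 interact, here is a minimal Go sketch of a strict high-watermark filter fed with mixed-unit `event_time` values, with and without magnitude-based normalization. It is not the `chNormEventTimeNanos` implementation: the function names and the unit thresholds are assumptions made for the example.

```go
package main

import "fmt"

// normalizeToNanos guesses a unix timestamp's unit from its magnitude and
// converts it to nanoseconds. The thresholds are illustrative assumptions and
// there is no overflow guard (int64 nanoseconds only reach the year 2262).
func normalizeToNanos(t int64) int64 {
	switch {
	case t < 100_000_000_000: // below 1e11: treat as seconds
		return t * 1_000_000_000
	case t < 100_000_000_000_000: // below 1e14: treat as milliseconds
		return t * 1_000_000
	case t < 100_000_000_000_000_000: // below 1e17: treat as microseconds
		return t * 1_000
	default: // otherwise assume the value is already nanoseconds
		return t
	}
}

// countKept replays rows through a strict high-watermark filter and returns
// how many are accepted. norm converts each raw event_time before comparison.
func countKept(rows []int64, norm func(int64) int64) int {
	var watermark int64
	kept := 0
	for _, raw := range rows {
		t := norm(raw)
		if t < watermark {
			continue // below the watermark: silently dropped
		}
		watermark = t
		kept++
	}
	return kept
}

func main() {
	// One millisecond-unit row in the middle of second-unit rows.
	rows := []int64{1_750_000_000, 1_750_000_001_000, 1_750_000_002, 1_750_000_003}
	raw := countKept(rows, func(t int64) int64 { return t })
	fmt.Println("raw cursor kept:", raw)                                     // 2
	fmt.Println("normalized cursor kept:", countKept(rows, normalizeToNanos)) // 4
}
```

With the raw cursor, the single millisecond row lifts the watermark by three orders of magnitude and every later second-unit row falls below it. Normalizing before comparison keeps all four rows in this example, though it does not address genuinely out-of-order or future-dated rows.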

## Boot / dependency contract

```mermaid
flowchart LR
ENV["ENV (all non-empty or FATAL):<br/>PIXIE_CLUSTER_ID · CLUSTER_NAME<br/>PIXIE_API_KEY · CLICKHOUSE_DSN"] --> BOOT
CM["cm/pl-cloud-config<br/>PL_CLOUD_ADDR=…:443"] -->|"C11 🔴 missing :443 → crashloop"| BOOT
BOOT["AE boot"] --> DDL["C12 self-applies forensic_db DDL<br/>(ADAPTIVE_SKIP_APPLY=false)"]
BOOT --> CTRLPLANE["control plane: CH only"]
BOOT --> DATAPLANE["data plane: needs query-broker<br/>(C7) + ADAPTIVE_PUSH_PIXIE_ROWS"]
```
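
The boot contract above could be checked up front rather than discovered through a crashloop. The following Go sketch shows one possible pre-flight check; `checkBoot` is a hypothetical helper, not part of AE, and pinning the port to exactly `443` is an assumption taken literally from C11.

```go
package main

import (
	"fmt"
	"net"
)

// requiredEnv lists the variables the boot diagram marks "non-empty or FATAL".
var requiredEnv = []string{"PIXIE_CLUSTER_ID", "CLUSTER_NAME", "PIXIE_API_KEY", "CLICKHOUSE_DSN"}

// checkBoot is a hypothetical pre-flight check: it rejects empty required
// variables and a PL_CLOUD_ADDR that does not carry port 443.
func checkBoot(env map[string]string) error {
	for _, key := range requiredEnv {
		if env[key] == "" {
			return fmt.Errorf("%s must be non-empty", key)
		}
	}
	addr := env["PL_CLOUD_ADDR"]
	_, port, err := net.SplitHostPort(addr)
	if err != nil {
		return fmt.Errorf("PL_CLOUD_ADDR %q: %w", addr, err)
	}
	if port != "443" {
		return fmt.Errorf("PL_CLOUD_ADDR %q: want port 443, got %q", addr, port)
	}
	return nil
}

func main() {
	env := map[string]string{
		"PIXIE_CLUSTER_ID": "demo-cluster-id",
		"CLUSTER_NAME":     "demo",
		"PIXIE_API_KEY":    "placeholder",
		"CLICKHOUSE_DSN":   "placeholder-dsn",
		"PL_CLOUD_ADDR":    "cloud.example.org", // port missing on purpose
	}
	fmt.Println("without port:", checkBoot(env))
	env["PL_CLOUD_ADDR"] = "cloud.example.org:443"
	fmt.Println("with port:", checkBoot(env))
}
```

A check like this would turn C11 from an observed violation into a clear startup error message; whether AE should refuse to start or only warn is a design choice this sketch does not settle.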

## Contract register

| # | Contract (implied) | Enforced? | Status / fix |
|---|---|---|---|
| C1 | `kubescape_logs.event_time` is unix **seconds** (one unit end-to-end) | ❌ trigger auto-detects s/ms/ns; DDL `toDateTime()` assumes seconds | 🔴 **F8 root** — see C3; AE-2 standardize+normalize |
| C2 | `hostname` = the k8s **node** name (AE polls `WHERE hostname=node`) | ❌ convention only | ⚠️ fixtures must use a real node, else no AE ever reads them |
| C3 | every new anomaly's `event_time` ≥ current watermark (monotone) | ❌ strict HWM filter | 🔴 **F8** — a larger-unit / out-of-order / future row poisons the HWM → all later rows silently dropped. **Fix (PR #53):** normalize cursor to nanos (`chNormEventTimeNanos`); AE-9: ingest-order cursor / bounded-lookback+dedup + below-watermark metric |
| C4 | rows sharing `event_time` at the boundary are deduped by content fingerprint | ✅ `seenAtBoundary` | ok |
| C5 | `anomaly_hash = SHA256(pid,comm,pod,ns)[:16]` — identity is the **workload**, independent of event_time/RuleID | ✅ | ok (N events for one target → 1 attribution row) |
| C6 | `trigger_watermark` persisted value tracks the live cursor | ❌ throttled ~5s | ⚠️ external readers and restarts can see a value up to ~5s stale; AE-7 flush-on-shutdown |
| C7 | data-plane requires a **registered** vizier query-broker | ❌ | ⚠️ control plane works without it; data plane silently does nothing |
| C8 | re-pulling a window is idempotent | ❌ protocol tables plain MergeTree (no dedup) + 30s re-pull | 🔴 duplicate inflation. **Fix:** single-shot (`ADAPTIVE_PUSH_REFRESH_SEC=-1`, or `AFTER<refresh`); AE-6 ReplacingMergeTree protocol tables |
| C9 | a protocol table row is written only if Pixie returned ≥1 row | ✅ `WritePixieRows len==0 → nil` | ok (empty workload → 0 rows, by design) |
| C10 | join key: `events.pod` = `"ns/pod"` (upid_to_pod_name) vs `adaptive_attribution.pod` = **bare** pod | ❌ asymmetric | ⚠️ consumers must `concat(namespace,'/',pod)` to join (burned the volume tool) |
| C11 | `PL_CLOUD_ADDR` carries `:443` | ❌ | 🔴 missing → AE crashloops / 0 writes (per-PG fix) |
| C12 | AE owns + self-applies the `forensic_db` DDL | ✅ when `ADAPTIVE_SKIP_APPLY=false` | ok; but DDL TTL/PARTITION assume seconds (C1) |
| C13 | `adaptive_attribution` / protocol writes are durable | ❌ best-effort: logged, non-fatal, **not retried** | 🔴 silent loss under CH hiccup; AE-4 retry+count |
| C14 | **DX⊇AE invariant**: AE write-set ⊇ DX read-set (AE persists everything dx queries) | ❌ by convention | ⚠️ validated per-table in the load-test, not enforced in code |
| C15 | **Write-duration (the one DX steers on):** once an anomaly opens a pod's window, AE **keeps re-pulling + writing that pod's forensic data continuously** until `t_end` expires OR DX explicitly stops it. `t_end = now + After`, extended by each new anomaly for the hash. | ❌ partial | 🔴 **the recent "wrote then stopped" bug.** Premature stop modes under investigation (E8-data RCA): (a) F8: extension anomalies dropped → `t_end` not extended → expires early; (b) EmptyResultSkip negative cache skips a (pod,table) mid-window after N empty pulls; (c) prune/in-flight race; (d) the `ADAPTIVE_PUSH_REFRESH_SEC=-1` single-shot setting (the C8 workaround) is a TEST affordance that *violates* this contract (writes once); production must re-pull. |
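
As a concrete reading of C5 and C10, the Go sketch below derives a workload-only hash and builds the `ns/pod` join key. The NUL-separated field encoding and the truncation to 16 hex characters are assumptions, since the register only states `SHA256(pid,comm,pod,ns)[:16]`; both function names are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// anomalyHash derives a workload identity from (pid, comm, pod, ns) only, so
// repeated events for one target map to one attribution row. The field
// encoding and the 16-hex-character truncation are assumptions of this sketch.
func anomalyHash(pid uint32, comm, pod, ns string) string {
	sum := sha256.Sum256([]byte(fmt.Sprintf("%d\x00%s\x00%s\x00%s", pid, comm, pod, ns)))
	return hex.EncodeToString(sum[:])[:16]
}

// eventsPodKey builds the "ns/pod" form used on the events side of the C10
// join from the bare pod name stored in adaptive_attribution.
func eventsPodKey(ns, pod string) string {
	return ns + "/" + pod
}

func main() {
	a := anomalyHash(4242, "java", "backend-0", "log4j-poc")
	b := anomalyHash(4242, "java", "backend-0", "log4j-poc")
	fmt.Println(a == b, len(a))                         // true 16
	fmt.Println(eventsPodKey("log4j-poc", "backend-0")) // log4j-poc/backend-0
}
```

Because neither `event_time` nor the rule ID enters the hash, two alerts on the same process collapse to one identity, which is what lets a later anomaly extend an existing window instead of opening a second one.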

## DX steering contract (what DX can rely on / control)

```mermaid
sequenceDiagram
participant DX
participant AE
participant Pixie
participant CH as forensic_db
Note over AE: anomaly (or DX referral) opens window [t_start, t_end=now+After]
loop every PushRefreshInterval until t_end OR DX stop (C15)
AE->>Pixie: PxL per table for (ns,pod), slice since last_upper
Pixie-->>AE: rows
AE->>CH: write rows (write ⊇ DX read, C14)
end
DX->>AE: StartExport / StopExport / extend t_end (control surface, CONTROL_ADDR)
Note over AE: stop ONLY on t_end or DX stop — never silently early (C15)
```

- **DX controls:** (1) open/extend a window (each referral/anomaly extends `t_end`), (2) explicit **StopExport** via the control surface (`CONTROL_ADDR`, design rev-3 — confirm wired), (3) the active set (which pods AE over-captures).
- **DX relies on:** C5 (stable hash identity), C14 (write ⊇ read), **C15 (no premature stop)**, C9 (0 rows only when the workload is genuinely silent), C10 (the `ns/pod` ↔ bare join). For DX to steer dependably, C3/C8/C13/C15 must move from 🔴 to ✅.
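
The window rule behind C15 (`t_end = now + After`, extended by each new anomaly, ended only by expiry or an explicit DX stop) can be sketched as a small state machine. This is an illustrative model, not the AE controller: the `activeSet` type, its method names, and the use of plain unix seconds are assumptions.

```go
package main

import "fmt"

// activeSet tracks the export window end (unix seconds) per anomaly hash.
type activeSet struct {
	after int64            // the "After" duration, in seconds
	tEnd  map[string]int64 // anomaly_hash -> window end
}

func newActiveSet(after int64) *activeSet {
	return &activeSet{after: after, tEnd: make(map[string]int64)}
}

// onAnomaly opens a window or extends it: t_end = now + After, never moved earlier.
func (a *activeSet) onAnomaly(hash string, now int64) {
	if end := now + a.after; end > a.tEnd[hash] {
		a.tEnd[hash] = end
	}
}

// stopExport models an explicit DX stop by dropping the window.
func (a *activeSet) stopExport(hash string) {
	delete(a.tEnd, hash)
}

// shouldPull reports whether a refresh tick at now must still re-pull this hash.
func (a *activeSet) shouldPull(hash string, now int64) bool {
	end, open := a.tEnd[hash]
	return open && now < end
}

func main() {
	s := newActiveSet(60)
	s.onAnomaly("h1", 1000)               // window until 1060
	s.onAnomaly("h1", 1030)               // extended until 1090
	fmt.Println(s.shouldPull("h1", 1075)) // true: inside the extended window
	fmt.Println(s.shouldPull("h1", 1090)) // false: t_end reached
	s.onAnomaly("h1", 1100)               // reopened until 1160
	s.stopExport("h1")
	fmt.Println(s.shouldPull("h1", 1101)) // false: explicit stop
}
```

In this model there are only two ways for `shouldPull` to turn false, which is the property C15 asks for; the premature-stop modes listed in the register would each show up as a third, unintended path.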

## Legend
✅ enforced in code · ⚠️ implied (assumed, not checked) · 🔴 observed violated (fix noted).
Full repro + backlog: `FINDINGS_AND_BACKLOG.md`. The fixes for C3/C1 are on PR #53 (`ae-prod`).