AE control auth + CodeRabbit #53 followup by entlein · Pull Request #68 · k8sstormcenter/pixie

entlein · 2026-06-21T21:01:17Z

AE control auth + CodeRabbit #53 followup

Followup to #53 (merge #53 first — this stacks on ae-prod; I'll retarget to main once #53 lands). Captured all 48 inline + 8 review + 12 issue CodeRabbit comments to biz/PoC/log4j/ae_pr53_coderabbit_feedback/ (JSON + DIGEST.md) so nothing is lost.

✅ Done in this PR

Bearer-JWT auth on the control surface — uses the same shared lib the broker/PEM use (px.dev/pixie/src/shared/services/utils); SetAuth verifies via jwtutils.ParseToken (signature + audience), middleware on Handler(), /healthz exempt. dx already mints this exact service JWT — it just attaches it. No new secret/crypto. Flag-gated (CONTROL_REQUIRE_AUTH, default off) for safe incremental rollout. Unit test mints a real JWT and asserts 401/pass/healthz. (CodeRabbit server.go: protect control endpoints.)
Input validation — reject invalid t_end (≤0) and query windows (start≥end, non-positive). (server.go: reject invalid t_end/window.)

Companion change (separate, entlein/dx)

dx aeclient.go must attach Authorization: Bearer <minted JWT> on its AE control POSTs. Then flip CONTROL_REQUIRE_AUTH=true + mount PL_JWT_SIGNING_KEY on AE. (Tracked; small.)

Already resolved on ae-prod (verified — no action)

http.Server timeouts (main.go) — already added (CodeRabbit r3379377432).
~9 harness Majors (ae_config/ae_vs_all/deploy_ae/exp_datavolume_extreme/exp_log4j_reps/… fail-fast) — those scripts were deleted in the harness consolidation; moot.
The http_events=0 "bug" — self-retracted by the author (misdiagnosis).

Remaining AE-code Majors — triaged (to address as commits here)

Safe / clear:

streaming/filter.go:196 — timer leak on deltaCh close.
passthrough/passthrough.go:147 — per-table query/write timeouts.
pixieapi/pixieapi.go:89 — direct-mode TLS-skip uses wrong option (→ WithDirectTLSSkipVerify).
stats.py:57 — guard false "EXACT" when zero PASS rows.
schema.sql:485 — retention on ae_reconcile (unbounded growth).
controller.go:347 — don't fan out Pixie rows when attribution persistence fails.

Risky / needs careful testing before a presentation — defer with care:

controller.go:255 — rehydrated active windows never restart fan-out (state-machine path).
trigger/clickhouse.go:303 — inclusive-watermark paging (this is the F8 watermark area — change only with the F8 fixtures).
schema.sql:430 — adaptive_attribution ORDER BY key width (schema change — migration risk).
sink/clickhouse.go:433 — Record blocks hot path on CH (async refactor — perf-sensitive).

Signed-off-by: entlein <einentlein@gmail.com>

@px

New env-gated background loop that runs the same PxL shape AE's anomaly-gated path uses, but with an empty Target (no ns/pod predicate) and over a configurable rolling window. Writes via the existing sink so the byte-shape of forensic_db rows is comparable between the PASSTHROUGH=1 phase (EVERYTHING) and the PASSTHROUGH=0 phase (AE-FILTER). One-shot A/B that yields the per-table capture fraction of the adaptive write path. - internal/passthrough/passthrough.go — Loop + Config; defaults to 30s window / 30s refresh / clickhouse.PixieTables() table list. - internal/passthrough/passthrough_test.go — 6 tests; the load-bearing one is TestLoop_EmitsEmptyTargetPxL (asserts neither df.namespace nor df.pod predicates appear in the emitted PxL). - cmd/main.go: ADAPTIVE_PASSTHROUGH + _WINDOW_SEC + _REFRESH_SEC env knobs. Adapter is constructed unconditionally when passthrough is on (joins the existing PushPixie / streaming construction path so the same pxapi grpc stream is reused). Loop is registered with the shutdown WaitGroup so SIGTERM waits for the in-flight tick. - cmd/BUILD.bazel: drop @px// load (other AE BUILD.bazel files use //bazel — sticking out as the only one with @px is a leftover from a prior gazelle run; align). Add passthrough dep. Signed-off-by: entlein <einentlein@gmail.com>

Stand-alone workflow that builds entlein/dx (private Active-Diagnosis Framework) into ghcr.io/k8sstormcenter/dx-daemon. Separates the dx image publish from the bazel-based vizier_release pipeline; the dx repo ships its own Dockerfile.dxd (Go cross-compile + distroless final stage) so it doesn't need to live as a submodule inside src/vizier/services/dx. Triggers: - tag push 'release/dx/v*' on this repo cuts a release build, image tag derived from the tag suffix (release/dx/v0.1.0 -> image tag 0.1.0). - workflow_dispatch lets us build any dx ref on demand with a custom tag (default: short sha of the resolved dx commit). Pulls dx via DX_ENTLEIN_PAT (already configured on the repo). Multi-arch build (linux/amd64, linux/arm64); Dockerfile.dxd cross-compiles in the native BUILDPLATFORM stage and the final stage is COPY-only, so target emulation isn't required. Signed-off-by: entlein <einentlein@gmail.com>

@px

The dx image build pipeline lives in entlein/dx itself (PR #53, branch feat/bazel-release): bazel-based with @px external pin to pixie's ae-prod tip, pushes to docker.io/entlein/dx-daemon on a release/dx/v* tag in the dx repo. The pixie-side buildx workflow this reverts duplicated that intent in the wrong repo + the wrong build system (docker buildx instead of bazel + pl_go_image macros) + the wrong registry (ghcr.io/k8sstormcenter instead of docker.io/entlein). Signed-off-by: entlein <einentlein@gmail.com>

…affordances Fixes the silent-halt bug: the trigger gated on a RAW event_time high-water-mark, so a single anomaly in a larger unit (ms/ns) drove the watermark past all real seconds rows and AE stopped processing forever (data still on Pixie). Normalize event_time to canonical nanoseconds in the poll SELECT filter+order and in the in-memory/persisted cursor, boundary-dedup, and maxSeen (normalizeEventTimeNanos + chNormEventTimeNanos). Validated at the data layer: vs a poisoned watermark the raw filter returns 0 rows, the normalized filter recovers all 60. Also adds ADAPTIVE_PUSH_REFRESH_SEC (negative = single-shot pull) for the reproducible load-test harness, an in-package trigger unit test, and an e2e hermetic load test (mock PixieQuerier, exact rows+bytes). Signed-off-by: entlein <einentlein@gmail.com>

…ness Consolidate the adaptive_export load-test harness under src/e2e_test/ (matching vzconn_loadtest / px_cluster conventions). Control-plane experiments (E1-E4, E6, E8 sustained) are proven exactly-reproducible on a live rig; the data-plane experiments (E5, E8 data-mode) are authored and pending live validation on a vizier-registered rig (status documented in README + FINDINGS_AND_BACKLOG). - harness/: shell + python (inject, exp_control, exp_e8, stats, ...) + lib helpers - fixtures/EXPERIMENTS.md: curated kubescape_logs data-set catalog + expected outputs - k8s/: isolated sinks + per-rep generator pod (no probes) - tools/loadgen/: cleanloadgen + httpsink (docker-built test tool; .bazelignore'd pending a bazel target — lib/pq is already vendored in the module) - FINDINGS_AND_BACKLOG.md: F8 watermark-poison bug + the fix + AE backlog The AE Go unit/e2e tests live with the service (internal/{trigger,e2e}). Signed-off-by: entlein <einentlein@gmail.com>

…C14) + diagrams Signed-off-by: entlein <einentlein@gmail.com>

…iagram; gen sustained-DNS mode C15 = AE must keep re-pulling+writing an active pod until t_end or DX stop (the contract DX steers on; last week's 'wrote then stopped' is its violation). Add DX-steering sequence diagram. Generator gains SUSTAIN_SEC (distinct-DNS trickle, a Pixie-traced protocol) + configurable SETTLE_PRE_MS warm-up for fresh-pod capture; harness wires GEN_SUSTAIN_SEC/GEN_SETTLE_MS. Signed-off-by: entlein <einentlein@gmail.com>

…lisation The 700821d trigger unit-normalisation wrapped event_time in a multiIf(...) inside both the WHERE filter and the ORDER BY. Three existing tests in watermark_test.go + one in clickhouse_test.go pinned the raw 'event_time >= N' substring and broke at HEAD. Update each test's expected substring to match the new normalized form (') >= <ns-scaled N>' — the closing paren of multiIf, then the value in canonical nanoseconds). Per-test ns-scaling: watermark_test.go:94 1744000000000000000 already ns -> unchanged watermark_test.go:125 InitialWatermark=42 < 1e10 sec -> * 1e9 watermark_test.go:156 InitialWatermark=7 < 1e10 sec -> * 1e9 watermark_test.go:297 event_time='5000' < 1e10 sec -> * 1e9 clickhouse_test.go:82 ref.T=1744..303e9 ns already ns -> unchanged go test ./src/vizier/services/adaptive_export/... all green. Signed-off-by: entlein <einentlein@gmail.com>

…future-stamp watermark poison) Signed-off-by: entlein <einentlein@gmail.com>

Records one forensic_db.ae_reconcile row per data-plane pull (read_count vs wrote_count, window, ns/pod) across ALL three write paths — controller fan-out (filter), passthrough firehose, and streaming scanner — so a reconcile run localizes loss to query (R5: read<PEM) vs sink (R6: wrote<read) and quantifies re-pull dup (C8). Counts alone (write >= read) were proven insufficient. - new internal/reconcile leaf package (Row, Recorder, Nop) — no import cycle - sink.Record: CH-backed recorder (INSERT INTO forensic_db.ae_reconcile) - ae_reconcile table: schema.sql + KnownTables + OperatorOwnedTables (synced); not a pixie table (absent from PixieTables, so VerifyPixieSchema ignores it) - wired: passthrough.tick, controller.pushPixieRows (deferred, all return paths), streaming.scanner.Run; gated by ADAPTIVE_RECONCILE=true, else Nop - unit test proves read/wrote capture incl. the sink-drop read>wrote shape - fixed apply_test trailing-tables guard for the new operator table - harness: exp_row_reconcile.sh (row-level PEM<->CH), ae_vs_all.sh, exp_datavolume_extreme.sh Signed-off-by: entlein <einentlein@gmail.com>

px -o json empty result previously printed one blank line → counted as a phantom LOSS=1. Guard: empty set → 0-byte keys file; drop all-empty-field keys. Confirmed against the controlled log4j run (backend http 14/14 exact, conn 66>=12, no loss). Signed-off-by: entlein <einentlein@gmail.com>

…log4shell Reliable BY CONSTRUCTION against bob#140 (stateful/unreliable exploit on re-fire): fresh-JVM backend (delete pod) + attacker-before-backend + the WORKING resolvable FQDN attacker.<ns>.svc.cluster.local:1389 (NOT the bare attacker-ns.svc which NXDOMAINs and gets dropped), then VERIFY the actual backend->:1389 LDAP egress in forensic_db.conn_stats and RETRY until confirmed (the validity gate). Never assumes the exploit fired. Node-side. Signed-off-by: entlein <einentlein@gmail.com>

…tion) Reword from offensive 'exploit' to detection-signal-generation language: this validates the kubescape->DX->AE detection chain. No logic change. Signed-off-by: entlein <einentlein@gmail.com>

… http2 - pxl.CompilePassthrough/Render: precompile per-table PxL once (fixed window => constant relative start_time), only the two time_ bounds are stamped per tick. Rendered output is byte-identical to QueryFor with an empty Target (TestCompilePassthrough_MatchesQueryFor), so this is a structural change, not a capture change. upid->pod/ns stays in PxL. - passthrough: tickConcurrent fans every table out at once (was a serial loop); shared pull() helper. Sink/recorder are pool/HTTP-backed and already called concurrently elsewhere. - drop http2_messages.beta from the firehose set (not materialised on every cluster => ""Table not found"" every tick); shared PixieTables/DDL lists untouched. - toggle ADAPTIVE_PASSTHROUGH_COMPILED (default on; =false reverts to the legacy serial QueryFor path). Signed-off-by: entlein <einentlein@gmail.com>

…e.go Fixes "missing strict dependencies: import of .../internal/reconcile" that broke the AE image build for passthrough, sink, streaming, controller, cmd (pre-existing since the ADAPTIVE_RECONCILE commit added the package + imports without bazel deps; never CI-built). Also wires the new pxl/compile.go srcs + passthrough/pxl test srcs (compiled_test.go, reconcile_test.go, compile_test.go). - new internal/reconcile/BUILD.bazel (go_library, stdlib-only) - +//internal/reconcile dep: passthrough, sink, streaming, controller, cmd - pxl go_library +compile.go; pxl_test +compile_test.go; passthrough_test +compiled_test.go,reconcile_test.go (+reconcile dep) Signed-off-by: entlein <einentlein@gmail.com>

F1 RCA: Pixie caps every px.display at max_output_rows_per_table (default 10000, query_flags.go) — the planner add_limit_to_batch_result_sink_rule silently truncates wide firehose windows / busy pods at the READ (write path is clean: ae_reconcile shows read==wrote). Fix uses Pixie own native knob — prepend `#px:set max_output_rows_per_table=1000000` to every generated PxL (QueryFor + CompilePassthrough) so all pull paths are uncapped. Validated on rig: a 14208-row window returned 10000 (capped) vs 14298 (with flag). No pagination loop, no extra round-trips. See memory project-ae-passthrough-10k-cap. Signed-off-by: entlein <einentlein@gmail.com>

Consolidates the recurring content_type silent-drop incident class into one default-suite test gate (6 tests, ~15ms): I1 TestContract_ContentTypeIsInt64InSchema I2 TestContract_FastEncodeContentTypeAsInt I3 TestContract_SilentDropDetected I3.b TestContract_SilentDropNotTriggeredOnSuccess I3.c TestContract_SilentDropToleratesMissingSummaryHeader I4 TestContract_HTTPEventsRoundTrip Top-of-file docstring chronicles the incident timeline so future operators can grep their way to the contract. Signed-off-by: entlein <einentlein@gmail.com>

Measures the AdaptiveExport value prop: datavolume REDUCTION of DX-steered AE (rev-3 streaming, AE writes only DX-steered activeSet pods over the control surface) vs saving ALL data (passthrough firehose). Two arms, same fixed load, forensic_db active-part deltas (rows+bytes) per table; reduction = 1 - DX/ALL. Uses the canonical resolvable JNDI FQDN (attacker.attacker-ns.svc.cluster.local) so the chain fires + DX can classify (a malformed host → NXDOMAIN → no steer). Successor to ae_vs_all.sh, whose AE arm used the rev-2 controller gate + stale JNDI. Signed-off-by: entlein <einentlein@gmail.com>

Measures all AE non-functional requirements under steady load on the rig: throughput (rows+bytes/sec), capture completeness (AE read vs broker count = F1 cap proof), write fidelity (read==wrote + write-error count), end-to-end freshness latency (now - max(time_) in CH), resource footprint (AE pod cpu/mem idle vs loaded), per-cycle cadence. Emits a structured report; companion to exp_dx_steering_reduction.sh. Real-data only. Signed-off-by: entlein <einentlein@gmail.com>

…ring + live-pod guard) Run-1 reported false 100% reduction because stale adaptive_attribution windows rehydrated DEAD pods (deleted loaders) into the activeSet → AE streamed dead pods → 0 rows. Clear adaptive_attribution before the DX arm so the activeSet only gets freshly-steered LIVE pods; add a guard that prints the steered pods + marshalsec fire count + live log4j-poc pods so a dead-arm result is caught, not reported as a reduction. Signed-off-by: entlein <einentlein@gmail.com>

lag query used now()-DateTime64 (type error -> na); use dateDiff(second,...). Capture-completeness vs broker was window-misaligned (623% artifact) -> drop it; report tot_read vs tot_wrote (read==wrote) + errs instead. The 10k-cap/completeness proof is the dedicated F1 test (max_read>10000 vs broker for the SAME window). Signed-off-by: entlein <einentlein@gmail.com>

…ary) Run-2 byte-delta reduction came out negative because system.parts byte delta is compaction-noisy (merges land mid-window). Report rows reduction as primary (actual captured-row count, noise-free); keep bytes as secondary context. Signed-off-by: entlein <einentlein@gmail.com>

Eviction-RCA finding (PR #63 NFR campaign): AE had NO memory limit (only cpu 300m) and was CPU-pinned at 300m under concurrent passthrough. AE measured tiny (16-38Mi steady), but the raised 1M-row passthrough cap can spike, so cap at 1Gi so AE can never memory-pressure a node; raise cpu limit 300m->1 core (was throttling). NOTE node evictions were NOT AE/OOM — node-01 went NotReady (network/heartbeat); the memory consumer is PEM (1365Mi, OOMs at the 2Gi default). Signed-off-by: entlein <einentlein@gmail.com>

Root cause of the recurring "AE unauthenticated / writes 0 / crashloop" reverts: kustomization.yaml bundled adaptive_export_secrets.yaml (placeholder pixie-api-key) with the role+deployment, so EVERY infra re-apply (make log4j) clobbered the real key that ae-auth had written. Separation of concerns: remove the secret from the kustomization — infra (role+deployment) stays re-appliable; the secret holds real creds and is owned solely by `make ae-auth`, created once, never touched by infra re-applies. Secret manifest kept as a hand-applied seed-only template (documented). Signed-off-by: entlein <einentlein@gmail.com>

The DX-steered arm was failing because the harness only fired stage-1 (JNDI/LDAP). That generates ldap-egress but NO kubescape R0001 → backend never flagged → DX no case → indeterminate → AE steers wrong/no pods. R0001 comes from stage-2 (post- exploitation exec). fire() now does stage-1 (JNDI) + stage-2 (whoami/shadow/token/ getent in the backend) → kubescape R0001+R0006 → DX rules backend MALIGNANT → backend enters AE activeSet → reduction is measurable. Verified live: DX evidence unexpected-spawn+sensitive-file-read → verdict ruled_in generic=MALIGNANT. Signed-off-by: entlein <einentlein@gmail.com>

…dd DX-steering diagnostics Standing terminology rule: allowlist/blocklist, never whitelist/blacklist. Pure rename (no behavior change) of the rev-3 streaming filter: FilterModeWhitelist → FilterModeAllowlist MaxWhitelistSize → MaxAllowlistSize ADAPTIVE_STREAM_MAX_WHITELIST → ADAPTIVE_STREAM_MAX_ALLOWLIST (env) mode=whitelist log string → mode=allowlist plus all comments/identifiers/tests in streaming, activeset, cmd/main. DX-steering diagnostics (the reason DX-arm-writes-0 has been hard to RCA — we could not tell "empty ActiveSet" from "broker returned 0 rows"): - scanner: log the empty-allowlist short-circuit (was silent) so an empty ActiveSet is visible in logs, distinct from "query completed rows=0". - FilterUpdater: emitted-filter log Debug→Info so the steered pod count per ActiveSet change is visible without debug logging. NOTE: ADAPTIVE_STREAM_MAX_WHITELIST env renamed → tooling that sets the old name must switch to ADAPTIVE_STREAM_MAX_ALLOWLIST. Signed-off-by: entlein <einentlein@gmail.com>

The Pixie dx_evidence_graph UI reads dx_attack_graph via px.DataFrame clickhouse_dsn, whose query template hardcodes event_time + hostname and ORDER BY event_time. A table without those columns fails 'Unknown identifier event_time'; a table created by hand (local, not via the operator) isn't globally registered. Fix: make AE own it like the other forensic tables. - schema.sql: dx_attack_graph DDL with event_time(UInt64 nanos) + hostname, edge columns, fromUnixTimestamp64Nano partition/TTL (nanos-correct). - KnownTables + OperatorOwnedTables: register it so Apply creates it at boot. - apply_test: assert last-applied DDL == last OperatorOwnedTables entry (robust to appended operator tables) instead of hardcoding trigger_watermark. go test ./.../clickhouse green. Signed-off-by: entlein <einentlein@gmail.com>

Pixie's clickhouse_dsn type mapper reads UInt8 as BOOLEAN and does not handle UInt16/UInt32/Float32 -> px fails with 'Column[N] given incorrect type' rendering the dx_evidence_graph. weight/max_severity/num_findings -> Int64, confidence -> Float64 (map cleanly to INT64/FLOAT64). Verified live: px run returns all 6 edges with every column. event_time stays UInt64 (matches kubescape_logs, which px reads). Signed-off-by: entlein <einentlein@gmail.com>

…canner buildPxL The DX/streaming arm silently capped each per-table pull at Pixie's default 10000-row limit while the passthrough/ALL arm (pxl.CompilePassthrough / QueryFor) already raises it to 1,000,000 via the broker's #px:set query flag. Validated live on 6a33dac0: a single streaming http_events pull returned exactly rows=10000 (the cap). Left unfixed this UNDER-counts the DX arm and OVERSTATES the DX-vs-ALL volume reduction. Prepend the same #px:set directive to the streaming scanner's PxL so both arms are uncapped and comparable. Signed-off-by: entlein <einentlein@gmail.com>

Adds the rule-ins-only view (condition != '') to the canonical schema.sql, registers it in KnownTables + OperatorOwnedTables (after dx_attack_graph), and teaches DDL() to extract CREATE VIEW headers. AE now creates it on boot so the dx_evidence_graph UI's default malicious-only read is standard, not a per-rig manual step. Tests updated. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: entlein <einentlein@gmail.com>

dx POSTs a JSON array of edges to /dx/attack_graph; AE writes them to forensic_db.dx_attack_graph via JSONEachRow (Applier.WriteAttackGraph). Wired in main.go when CONTROL_ADDR is set; 501 if the sink is unset. This is the AE half of the dx->AE->CH attack-graph write path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: entlein <einentlein@gmail.com>

…e, fix stale tests PR-53 review follow-ups (see review summary in conversation): 1. License headers added to 17 src/e2e_test/adaptive_export_loadtest/ harness/*.{sh,py} scripts. Matches the convention every other src/e2e_test/*/sh in pixie already follows. 2. Dead code removed (was unreachable per golang.org/x/tools/cmd/deadcode): - internal/script/script.go: IsClickHouseScript, IsScriptForCluster, GetActions, getScriptName, getInterval, templateScript, plus the ScriptConfig and ScriptActions types. The cron-script sync flow they served was replaced by the streaming model; only the Script and ScriptDefinition types remain. - internal/pixie/pixie.go: Client.GetPresetScripts (replaced by builtinPresetScripts() inline in cmd/main.go). - internal/streaming/{supervisor,writer}.go: SupervisorStats, TableStats, Stats types + Supervisor.Stats and BatchWriter.Stats methods (no production reader). Atomic counters dropped; the existing flush log preserves the per-flush summary. 3. Stale passthrough tests fixed. TestLoop_DefaultsTablesToPixieTables and TestNew_AppliesDefaults asserted len(clickhouse.PixieTables()) == 13 but passthrough.New strips excludedTables (http2_messages.beta), yielding 12. Tests now compare against filterExcluded(clickhouse.PixieTables()) and add an extra assertion that excluded tables were not written. 4. ClickHouse HTTP client consolidation: new internal/chhttp/ package collapses three near-identical HTTP CH clients (clickhouse.Applier, sink.ClickHouseHTTP, trigger.ClickHouseWatermarkStore) into one. Centralises endpoint validation, basic-auth header, 30s default timeout, fail-loud INSERT settings (4 CH input_format knobs), and the X-ClickHouse-Summary read path. The 4 INSERT call sites in the three callers all route through chhttp.Client.Insert now; SELECT through Query or QueryStream (the latter preserves the QueryActive streaming behaviour). Net code: -200 LOC across the three callers plus a 200-LOC chhttp package with its own tests. 5. Pixie service scaffold wired into cmd/main.go: services.SetupService("adaptive-export", 50900) + services.SetupSSLClientFlags() + services.PostFlagSetupAndParse() + services.SetupServiceLogging(). Matches the pattern every other pixie Go service uses. CheckServiceFlags() is deliberately skipped (AE does not run a TLS gRPC server). Existing AE env-var reads (ADAPTIVE_*) are untouched and still authoritative for tuning knobs. All 14 adaptive_export internal packages pass go test (1 new chhttp, 13 unchanged), including the 3 differential oracle tests in pxl/compile_test.go. arc lint OKAY for everything except 3 pre-existing findings unrelated to this commit (loadgen has its own go.mod; one QF1001 De Morgan's-law nit in reconcile_test.go from c9f19b6). Signed-off-by: entlein <einentlein@gmail.com>

…der) Signed-off-by: entlein <einentlein@gmail.com>

@px

User flagged on review 4536971862: - cmd/BUILD.bazel:18 dropped the '@px' external-workspace prefix on pl_build_system.bzl load. Restored — the standalone AE build pulls pixie as @px and needs the qualifier. Lint cleanup (PR-53 scope, no production code touched outside renames): - chhttp/chhttp_test.go: errcheck on two w.Write calls + gofumpt on the gotSettings declaration. - passthrough/reconcile_test.go: staticcheck QF1001 (De Morgan's law) on the conn_stats sink-drop assertion. - script/script.go: rename ScriptId -> ScriptID (ST1003). Propagated to pixie/pixie.go and cmd/main.go callsites. - internal/e2e/BUILD.bazel and internal/trigger/BUILD.bazel: gazelle drift — adding loadtest_test.go and clickhouse_internal_test.go to srcs. - k8s/00-sinks.yaml + gen-pod.tmpl.yaml: yamllint compliance — document-start marker, dedent sequence items per .yamllint indent-sequences=false, tighten flow-mapping spaces, collapse multi-space after commas. YAML semantics unchanged. - harness/stats.py: flake8 E501 — split a long line into two. Final arc lint state on PR-53 file scope: 0 Errors, 14 Warnings (SHELLCHECK SC2155/SC2086 in pre-existing harness scripts from ae7b86f, not introduced or modified by this commit). Pre-existing loadgen typecheck failures (separate go.mod) are unaffected. Signed-off-by: entlein <einentlein@gmail.com>

…bit items User review 4536971862: #2 passthrough/reconcile_test.go — strengthened with two new tests that exercise the FULL chain (loop → real sink.ClickHouseHTTP → httptest CH endpoint → reconcile recorder). TestTick_ReconcileCatchesCHSilentDrop mimics CH's X-ClickHouse-Summary silent-drop shape (200 OK, written_rows=0) and asserts the loop records WroteCount=0 with a silent-drop attribution — the exact R6 (sink-layer loss) regression the instrument exists to detect. TestTick_ReconcileAttributesCHFailureCorrectly covers the 500-response branch. The pre-existing in-process-fake test stays as the wiring check. #4 pixie/pixie.go — added a long comment justifying the API-key auth choice. pixie.Client targets the Pixie CLOUD (cloudpb's PluginService), whose auth interceptor accepts pixie-api-key and rejects JWT service tokens (those are for INSIDE-cluster vizier services). The pixieapi.Adapter that talks to vizier directly already uses JWT via jwtutils.GenerateJWTForService — same pattern as cloud_connector/vizhealth/checker.go:111. So the split is intentional; flipping pixie.go to JWT would break cloud auth, not improve it. #6 pxl/queryfor — hardened escapePxL so a raw \n, \r, \t or NUL byte in Target.Pod/Target.Namespace can't terminate the PxL string literal and inject a new statement. Added TestQueryFor_RejectsInjectionInTargetFields driving QueryFor with 7 adversarial pod/namespace shapes (newline, single-quote-only, CR, backslash-escape-of-escape, NUL, tab) plus the regex_match fallback path, asserting the output line count and every statement's leading token. Extends existing TestEscapePxL_TableDriven with the new escape mappings. CodeRabbit oversights (verified each against current code): CR r3379377432 (main.go) — wrapped the control-surface listener in http.Server with Read/ReadHeader/Write/Idle timeouts so a slow client can't pin a goroutine indefinitely. CR r3379377607 (pixieapi.go) — switched direct-mode dial from pxapi.WithDisableTLSVerification (env + addr-gated, brittle) to pxapi.WithDirectTLSSkipVerify (added in PR #49 b523ce3 for the same node-IP-dial scenario). Removed the cluster.local + PX_DISABLE_TLS precondition in NewDirect; the always-skip semantics match the AE deployment shape. Refactored the obsolete env-gate test. CR r3379377645 (streaming/filter.go) — the deltaCh-close path returned without calling disarm(), leaking the timer's goroutine on shutdown. Now calls disarm() before return on chan-close. CR r3426923299 (sink/clickhouse.go Record) — capped Record at a 2s per-call context timeout. The 30s chhttp default was too long for the scanner/passthrough/controller hot paths that call Record inline; reconcile is best-effort by contract, so a stalled CH must not pin the pull loop. All 14 AE packages pass go test. arc lint: 0 Errors on PR-53 file scope. Signed-off-by: entlein <einentlein@gmail.com>

Finish the harness consolidation #53 started: adopt exp_matrix.sh (canonical ALL-vs-DX reduction matrix) + nfr.sh (throughput/mem/verdict-latency) as the single runners; cut the overlapping variants (ae_vs_all, exp_datavolume_extreme, exp_dx_steering_reduction, exp_ae_nfr_benchmark, exp_pipeline_reconcile) and the superseded standalone setup (deploy_ae, build_gen_image). README rewritten as the single 'how to run' source: two families — fixture-isolation (run.sh + E-series) and live-attack e2e (log4shell_fire → exp_matrix → nfr → exp_row_reconcile). Add e2e_log4shell_soc.yaml: k3s + full SOC stack (Pixie/kubescape/ClickHouse/AE/dx via k8sstormcenter/soc) on the oracle runner; asserts every canonical harness script runs + dx rules in; profiles dx in real life (pprof CPU/heap + verdict latency, uploaded). Uses existing repo secrets. Signed-off-by: entlein <einentlein@gmail.com>

…ntainer_images BUILD This file is unrelated to adaptive_export — the only change was buildifier alphabetizing container_type/bazel_sdk_versions (no functional change). Restore main's version to keep #53's diff to real AE changes. Signed-off-by: entlein <einentlein@gmail.com>

run-genfiles: the PR had reordered the kwargs in stirling/.../container_images/ BUILD.bazel alphabetically (bazel_sdk_versions before container_type) — a local buildifier or gazelle drift from when the file was first committed. CI gazelle wants the original order (container_type first), so the diff loop fails. Restored to origin/main's ordering. Local gazelle disagrees with CI's expected output (tooling-version drift); CI is authoritative. run-container-lint: two findings. 1. staticcheck QF1001 (De Morgan's law) in pxl/queryfor_test.go:318. Rewrote the !(a || b || c || d) negated disjunction as the equivalent !a && !b && !c && !d. Test behaviour unchanged; verified locally with TestQueryFor_RejectsInjection. 2. golangci-lint typechecking failed on src/e2e_test/adaptive_export_loadtest/tools/loadgen/cmd/{cleanloadgen, httpsink}/main.go because that subtree carries its own go.mod and does not resolve as a package under the root px.dev/pixie module. Added the loadgen subtree to .arclint's exclude list. Same fix the existing entries for other-module subtrees apply. Local 'arc lint' on the PR-53 file scope: 0 Errors, 14 SHELLCHECK Warnings + 26 SHELLCHECK Advice (all pre-existing in ae7b86f harness scripts; not introduced by this PR). Signed-off-by: entlein <einentlein@gmail.com>

run-genfiles CI re-failed after f244ffc reverted this file to main's ordering — turns out gazelle on this repo IS alphabetizing the kwargs (bazel_sdk_versions before container_type), and CI runs 'gazelle fix' then 'git diff' to catch any drift. So main's ordering is no longer gazelle-stable; the file has to match gazelle's preference, not main's. Verified locally with 'bazel run //:gazelle -- fix'; the file is now idempotent (a second gazelle run produces no diff). Signed-off-by: entlein <einentlein@gmail.com>

run-container-lint re-failed after the run-genfiles fix because two files added on this branch (a03aa15) had unfixed lint errors: .github/workflows/e2e_log4shell_soc.yaml — 5 yamllint Errors: - 1 indentation: list items under steps: must be parent-aligned per the repo's .yamllint config (indent-sequences: false), not 2-indented. Dedented every step item + its run: block by 2 spaces. - 4 line-length (>120 chars): split the long kubectl-set-image, the long grep-detection gate, the curl pprof URL, and the verdict-latency grep across continuation lines. Semantics unchanged. src/e2e_test/adaptive_export_loadtest/harness/{exp_matrix.sh, nfr.sh} — missing Apache headers. Applied via arc lint --apply-patches; exec bits restored. Local arc lint on PR-53 file scope: 0 Errors, 14 SHELLCHECK Warnings + 26 Advice (all pre-existing in harness scripts, unchanged). Signed-off-by: entlein <einentlein@gmail.com>

arc lint --apply-patches exits non-zero on Warning level too. Resolved each: replaced unused 'for i/t in ...' loop vars with '_', split a SC2155 declare-and-assign, dropped two never-referenced hip/pip assignments. Signed-off-by: entlein <einentlein@gmail.com>

…owup) AE control endpoints had no auth. Add it using the SAME shared lib the vizier broker/PEM use (px.dev/pixie/src/shared/services/utils): SetAuth verifies a bearer JWT via jwtutils.ParseToken (signature + audience), middleware on Handler() with /healthz exempt. dx already mints this exact service JWT (GenerateJWTForService, PL_JWT_SIGNING_KEY) — it just attaches it. No new secret/crypto. Flag-gated (CONTROL_REQUIRE_AUTH + PL_JWT_SIGNING_KEY), default OFF so it merges before dx sends the bearer; flip on after dx is updated. Also: reject invalid t_end (<=0) and query windows (start>=end, non-positive). Test mints a real JWT via the shared lib; asserts 401 on missing/bad token, pass on valid, /healthz open.

coderabbitai · 2026-06-21T21:01:25Z

Warning

Review limit reached

@entlein, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 20 minutes and 50 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan refill rate.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, the refill rate gradually slows as usage increases. The highest same-day bursts are limited more strictly.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 06a889d9-f649-45ae-ae24-ef1deaa2c32a

📥 Commits

Reviewing files that changed from the base of the PR and between 714c1fb and dfdc465.

📒 Files selected for processing (2)

src/vizier/services/adaptive_export/internal/trigger/clickhouse.go
src/vizier/services/adaptive_export/internal/trigger/clickhouse_internal_test.go

📝 Walkthrough

Walkthrough

This PR adds an adaptive export operator with ClickHouse-backed trigger, sink, streaming, passthrough, and control packages, plus schema and Pixie integration changes, load-test tooling and harness scripts, a SOC GitHub workflow, and supporting bootstrap, packaging, and documentation updates.

Changes

Adaptive export operator and load-test rig

Layer / File(s)	Summary
Bootstrap and packaging baseline `.arclint`, `.bazelignore`, `k8s/vizier/bootstrap/*`, `src/vizier/services/adaptive_export/BUILD.bazel`, `src/stirling/.../BUILD.bazel`, `.github/workflows/vizier_release.yaml`	Adds repo exclusions for the Docker-built loadgen, updates adaptive-export bootstrap manifests, and introduces Bazel bundle/push targets plus small infrastructure adjustments.
Load-test contracts and fixtures `src/e2e_test/adaptive_export_loadtest/.md`, `.../fixtures/`, `.../k8s/`, `.../tools/loadgen/`	Adds adaptive-export contract and findings docs, experiment definitions, sink and generator manifests, and a standalone Go load generator and HTTP sink image.
Harness orchestration and SOC workflow `.github/workflows/e2e_log4shell_soc.yaml`, `src/e2e_test/adaptive_export_loadtest/harness/*`	Adds the shell/Python harness for control-plane, data-plane, matrix, reconciliation, profiling, and benchmark runs, and wires those scripts into a manual SOC workflow.
Schema, query, and event foundations `src/vizier/services/adaptive_export/internal/{anomaly,kubescape,chhttp,clickhouse,pxl,reconcile}/*`	Adds anomaly hashing, kubescape row extraction, ClickHouse HTTP helpers, embedded schema and DDL parsing, PxL generation utilities, and reconciliation record types with tests and integration coverage.
Pixie clients and plugin management `src/api/go/pxapi/opts.go`, `src/vizier/services/adaptive_export/internal/{pixie,pixieapi,script,config}/*`	Adds direct Pixie TLS-skip support, direct/query adapters, plugin enablement and script creation paths, and narrows the remaining script/config data model.
ClickHouse write and trigger path `src/vizier/services/adaptive_export/internal/{sink,trigger}/*`	Adds ClickHouse-backed sink and trigger implementations for writing attribution and Pixie rows, querying active state, polling kubescape logs, normalizing and persisting watermarks, and detecting silent writes.
Controller state and control endpoints `src/vizier/services/adaptive_export/internal/{activeset,controller,control}/*`	Adds the controller’s active-window state machine, rehydration and prune logic, and an authenticated HTTP control surface for export, query, and attack-graph submission.
Streaming, passthrough, and operator entrypoint `src/vizier/services/adaptive_export/cmd/`, `src/vizier/services/adaptive_export/internal/{streaming,passthrough,e2e}/`	Adds streaming and passthrough export loops, wires them from the new operator-style `main`, and adds hermetic and end-to-end tests for the runtime flow.

Sequence Diagram(s)

sequenceDiagram
  participant ClickHouse
  participant Trigger
  participant Controller
  participant Sink
  participant Pixie

  Trigger->>ClickHouse: poll kubescape_logs with watermark
  Trigger->>Controller: emit kubescape event
  Controller->>Sink: write adaptive_attribution
  Controller->>Pixie: query matching tables for active window
  Sink->>ClickHouse: insert attribution and Pixie rows

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

k8sstormcenter/pixie#53: Both PRs modify the adaptive-export trigger watermark handling and reconciliation-related ClickHouse write path contracts.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 59.39% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The pull request title 'AE control auth + CodeRabbit `#53` followup' clearly summarizes the main changes: implementing bearer-JWT authentication on the control surface and addressing CodeRabbit feedback from PR `#53`.
Description check	✅ Passed	The pull request description is comprehensive and directly related to the changeset, detailing the authentication implementation, input validation, companion changes, and remaining work items.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch ae-followup-auth

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

- controller: don't fan out Pixie rows when attribution Sink.Write fails (avoids orphaned rows; release in-flight slot, non-fatal). (controller.go:347) - controller.Rehydrate: re-arm rev-1 pushPixieRows for restored windows so a restart doesn't silently miss post-restart Pixie data. (controller.go:255) - passthrough.pull: per-table timeout context so a hung dependency can't stall the sweep / delay shutdown (covers serial+concurrent ticks). (passthrough.go:147) - schema: TTL (30d) on ae_reconcile to cap append-only growth. (schema.sql:485) Verified no-change-needed: adaptive_attribution ORDER BY (hostname, anomaly_hash) is safe — anomaly_hash already encodes namespace+pod (anomaly/hash.go), so rows never collapse across ns/pod. (schema.sql:430) Already on ae-prod (no-op): filter timer leak, stats.py EXACT guard, sink async, watermark paging, pixieapi WithDirectTLSSkipVerify, http.Server timeouts.

entlein · 2026-06-21T21:11:52Z

@dx-agent — cut you a release to regression-test this followup.

Release: release/vizier/v0.14.19-aeprod20 → image ghcr.io/k8sstormcenter/vizier-adaptive_export_image:0.14.19-aeprod20
Build: https://github.com/k8sstormcenter/pixie/actions/runs/27917599684 (oracle runner, ~32 min — image lands when green).
Tree: #53 (ae-prod) + this PR's control-auth + the CodeRabbit Major fixes.

What changed vs aeprod19 (what to exercise):

control-plane auth — bearer-JWT via the shared jwtutils lib. Flag-gated, DEFAULT OFF (CONTROL_REQUIRE_AUTH unset) → dx→AE works unchanged, no dx-side change needed for this test. (Auth e2e is a later step once dx attaches the bearer.)
controller: no fan-out when attribution Sink.Write fails (no orphaned Pixie rows); Rehydrate now re-arms rev-1 fan-out for restored windows.
passthrough: per-table timeout (hung dep can't stall the sweep).
schema: 30d TTL on ae_reconcile.

Ask: swap your rig's AE to 0.14.19-aeprod20 (keep dx as-is) and confirm the cast4/ExtremeC regression still green — reduction + no-loss + detection (log4shell/react2argo rule-in + dx_attack_graph writes), AE 2/2 capturing. The Rehydrate change is the one to watch (restart → does rev-1 fan-out resume?). Ping back with pass/fail and I'll flip auth on next.

Thanks!

coderabbitai

Actionable comments posted: 44

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/e2e_log4shell_soc.yaml:
- Around line 14-31: The e2e_log4shell_soc.yaml workflow lacks concurrency
guards, which allows multiple simultaneous runs to race and corrupt shared
self-hosted runner resources like k3s and /etc/rancher/k3s/. Add a concurrency
block at the top level of the workflow (alongside the on and jobs sections) that
specifies a concurrency group to ensure only one run executes at a time. The
concurrency group should cancel in-progress runs when a new run is triggered,
using a meaningful identifier like the workflow name or a fixed string to ensure
all runs of this workflow serialize properly on the oracle-vm-16cpu-64gb-x86-64
runner.
- Around line 58-65: The workflow is directly interpolating `inputs.soc_ref` and
`inputs.dx_image` into shell commands within the run step, creating a command
injection vulnerability. Move these inputs to an env section at the step level,
assign them to environment variables, and validate that they contain only
allowed characters (e.g., alphanumeric, hyphens, dots for versions). Replace the
inline `${{ inputs.soc_ref }}` and `${{ inputs.dx_image }}` references in the
git clone and kubectl commands with properly quoted environment variable
references like `"$SOC_REF"` and `"$DX_IMAGE"` respectively.

In @.github/workflows/vizier_release.yaml:
- Line 143: The runner label oracle-vm-16cpu-64gb-x86-64 in the runs-on field is
not recognized by actionlint, causing lint validation to fail. If this is an
intentional custom self-hosted runner label, register it in your actionlint
configuration file (typically .actionlintrc or actionlint config) to whitelist
this label and allow the workflow to pass validation. Alternatively, if this
label does not exist in your infrastructure, replace it with a valid standard
GitHub Actions runner label or your organization's properly configured
self-hosted runner label.

In `@src/e2e_test/adaptive_export_loadtest/CONTRACTS.md`:
- Line 96: The `## Legend` heading is missing required blank lines for markdown
linting compliance with rule MD022. Add a blank line before the `## Legend`
heading and a blank line after it to separate it from the surrounding content.
This ensures proper spacing around the heading section in the CONTRACTS.md file.

In `@src/e2e_test/adaptive_export_loadtest/FINDINGS_AND_BACKLOG.md`:
- Around line 50-59: The markdown file has formatting violations in fenced code
blocks at lines 50-59 and 95-97. Add a blank line before and after each fenced
code block to comply with MD031, and add language identifiers to the opening
fence markers (e.g., change ``` to ```sql for SQL code blocks and ```text or
appropriate language for other blocks). This applies to all fenced code blocks
that currently lack language tags or proper spacing.

In `@src/e2e_test/adaptive_export_loadtest/fixtures/EXPERIMENTS.md`:
- Around line 17-18: The documentation in EXPERIMENTS.md incorrectly states that
fixture timestamps are in unix nanoseconds, but the actual harness
implementation in lib.sh (lines 145-147) uses seconds for the event_time field.
Update the documentation at lines 17-18 to accurately reflect that timestamps
are in seconds rather than nanoseconds, and also correct the same misstatement
at lines 35-36 to ensure consistency with the actual harness behavior and
prevent users from undersizing ADAPTIVE_WINDOW_BEFORE_SEC.

In `@src/e2e_test/adaptive_export_loadtest/harness/exp_control.sh`:
- Around line 77-80: The first inj command in the E4 test case (line 79) uses ||
true which silently suppresses injection failures, masking when the first
boundary-collision row fails to insert. This weakens the test reproducibility
assertion. Apply the same error handling pattern used in the second inj command
(line 80) to the first inj command, replacing the || true with a proper error
handler that logs the INJECT_FAIL status and continues the experiment, ensuring
both fixture injections are properly validated before the test proceeds.

In `@src/e2e_test/adaptive_export_loadtest/harness/exp_e8.sh`:
- Around line 74-80: The INJECT command invocations in the stall-detection
experiment are using the || true pattern to suppress failures, which causes
failed anomaly inserts to be treated as success and can produce false STALL
flags. Remove the || true suffix from both INJECT command calls (the first one
with the --same-time flag and the second one without it) to allow injection
failures to properly propagate and be detected, ensuring accurate stall
detection in the experiment.

In `@src/e2e_test/adaptive_export_loadtest/harness/exp_matrix.sh`:
- Around line 80-85: The measure function accepts a valid parameter but does not
include it when writing results to the RES file at line 84. Add the valid
parameter to the printf statement that writes to RES so it captures whether each
measurement is valid. Additionally, update the summary computation at lines
121-124 to filter the results in RES by the validity column before computing the
aggregate statistics, ensuring the reported means only include measurements
marked as valid.
- Around line 114-116: The matrix arms currently have
ADAPTIVE_PUSH_PIXIE_ROWS=false set on both the ALL arm run_arm call at line 114
and the DX arm run_arm call at line 116, which disables the data-plane write
path and prevents proper measurement of data-volume reduction. To fix this,
change ADAPTIVE_PUSH_PIXIE_ROWS=false to ADAPTIVE_PUSH_PIXIE_ROWS=true in at
least one of these run_arm calls (or both as needed) to enable the data-plane
write path so that data-volume reduction can be properly measured during the
adaptive export load test.
- Line 78: The r0001_recent function has a timestamp unit mismatch where it
multiplies by 1000000000 to convert to nanoseconds, but the event_time field is
stored in seconds according to the harness documentation. This causes the WHERE
clause comparison to fail or undercount valid events. Fix this by removing the
multiplication factor of 1000000000 from the timestamp calculation, so that the
comparison is between seconds and seconds instead of seconds and nanoseconds.
Change the expression from toUInt64((now()-130))*1000000000 to
toUInt64(now()-130).

In `@src/e2e_test/adaptive_export_loadtest/harness/exp_row_reconcile.sh`:
- Around line 54-56: The reconciliation test setup on line 55 of
exp_row_reconcile.sh disables the write path by setting
ADAPTIVE_PUSH_PIXIE_ROWS=false, which prevents the CH protocol-row writes that
the test is meant to validate, causing false reconciliation failures. Remove the
ADAPTIVE_PUSH_PIXIE_ROWS=false flag from the kubectl set env command in the
adaptive-export deployment to ensure the write path remains enabled during the
reconciliation test.

In `@src/e2e_test/adaptive_export_loadtest/harness/inject.sh`:
- Around line 23-27: The row-shape header comment in the inject.sh file
incorrectly documents event_time as unix-NANOS on line 26, but the actual
contract documented in lines 32-37 and the script's usage shows it is
unix-SECONDS. Correct the comment to change "event_time (UInt64 unix-NANOS)" to
"event_time (UInt64 unix-SECONDS)" to accurately reflect the seconds unit that
the script actually uses and expects.
- Around line 96-108: Replace the predictable temporary filename
`/tmp/inject_resp.$$` with a securely generated temporary file using `mktemp`.
Create a new temporary file at the start of the code block using `mktemp` and
store it in a variable, then substitute all occurrences of `/tmp/inject_resp.$$`
with this variable throughout the curl command, the response file output
redirection, the cat command that displays the response, and the rm command that
cleans up the file. This eliminates the security risk of predictable temp file
names that could be subject to symlink attacks or collisions.

In `@src/e2e_test/adaptive_export_loadtest/harness/lib.sh`:
- Line 33: The bash script header marks KUBECONFIG as required using set -uo
pipefail, but the kubectl operations at lines 51-53 execute without verifying
that KUBECONFIG is actually set and points to a valid file. Add an explicit
guard before the kubectl operations to check that KUBECONFIG is set and
non-empty, ensuring that kubectl commands execute against the intended cluster
context and not the caller's default context. This guard should exit with an
error if KUBECONFIG is unset or empty to prevent unintended mutations.
- Around line 89-94: The chq function is missing timeout controls on the curl
command, which can cause the harness to hang indefinitely if the port-forward or
ClickHouse service becomes unresponsive. Add curl timeout options such as
--connect-timeout for the connection establishment phase and --max-time for the
overall request duration to the curl invocation in the chq function to ensure
the harness can proceed even if the backend stalls.

In `@src/e2e_test/adaptive_export_loadtest/harness/log4shell_fire.sh`:
- Around line 61-66: Add validation immediately after the kubectl commands that
set the BIP and BPORT variables to ensure both contain non-empty values before
proceeding to the curl loop. If either BIP or BPORT is empty, the script should
exit with a clear error message indicating backend service resolution failed.
This prevents malformed requests from being sent by the attacker deployment and
ensures the script fails fast with meaningful diagnostics instead of producing
misleading log4j vulnerability detection results.

In `@src/e2e_test/adaptive_export_loadtest/harness/nfr.sh`:
- Around line 49-56: The script detects no-loss proof mismatches by printing
MISMATCH or READ!=WROTE in the loop that iterates over http_events, dns_events,
and conn_stats, but never exits with a failure status, causing automated runs to
pass despite data loss. Add a flag variable initialized before the loop to track
whether any mismatches occur (checking both the wrote versus CH_rows comparison
and the read versus wrote comparison), then after the loop completes, check this
flag and exit the script with a non-zero status code if any mismatches were
detected.
- Around line 34-39: The setarm function on line 34 suppresses all output and
errors from the kubectl commands using `>/dev/null 2>&1`, which hides failures
from kubectl set env and kubectl rollout status. The run_phase function on line
38 calls setarm without checking its return status, allowing the test to proceed
even if the adaptive-export reconfiguration fails. Remove the output suppression
redirection from the setarm function to allow errors to be visible, and add a
return status check immediately after the setarm call in run_phase so that the
phase aborts with an appropriate error message if the kubectl reconfiguration
fails.

In `@src/e2e_test/adaptive_export_loadtest/harness/run.sh`:
- Around line 32-33: The EVID variable assignment on line 32 uses a hardcoded
user-specific path that is not portable across different environments. Replace
the hardcoded path with a portable default, such as using environment variables
or a system temporary directory. Additionally, the critical pipeline stages in
lines 40-56 lack explicit failure checks, allowing the script to continue even
if a stage fails and produces an invalid verdict set. Add explicit error
handling (such as set -e at the beginning of the script or individual error
checks) to ensure the script exits immediately when any critical stage fails.

In `@src/e2e_test/adaptive_export_loadtest/harness/stats.py`:
- Around line 57-76: The issue is that when there are no numeric PASS values for
a metric (the vals list is empty), the code continues without setting repro_ok
to False, allowing a false positive "EXACTLY REPRODUCIBLE" verdict. To fix this,
add repro_ok = False before the continue statement in the block where not vals
is True, so that any metric lacking valid PASS measurements will correctly mark
the reproducibility check as failed.

In `@src/e2e_test/adaptive_export_loadtest/k8s/00-sinks.yaml`:
- Around line 28-37: Add security hardening to the httpsink container
specification in the pod spec by setting automountServiceAccountToken to false
at the spec level to prevent service account token mounting, and within the
container specification add allowPrivilegeEscalation set to false, drop
unnecessary Linux capabilities, and configure a seccomp profile to restrict
system calls. Apply the same hardening pattern to the pg-sink pod specification
as well to ensure both long-lived sink pods follow the same security baseline
and reduce unnecessary privilege and credential exposure.

In `@src/e2e_test/adaptive_export_loadtest/k8s/gen-pod.tmpl.yaml`:
- Around line 18-38: The gen-pod.tmpl.yaml pod spec lacks essential security
hardening configurations that should be applied to reduce the attack surface.
Add a securityContext block within the pod spec to enforce non-root execution,
disable privilege escalation, and set a read-only root filesystem. Additionally,
add automountServiceAccountToken set to false at the pod spec level to prevent
unnecessary service account token mounting. These hardening measures should be
applied to the spec section containing the restartPolicy and containers
definitions to ensure the cleanloadgen workload runs with minimal privileges and
reduced security risks.

In `@src/e2e_test/adaptive_export_loadtest/README.md`:
- Around line 51-57: The markdown file has linting compliance issues with the
Layout heading and its associated code fence block. First, ensure there is a
blank line before the "## Layout" heading and a blank line after it to satisfy
MD022 heading spacing requirements. Second, add a language tag to the opening
fence (using triple backticks) to satisfy MD040 requirements. Third, ensure
there is a blank line before the opening fence and a blank line after the
closing fence to satisfy MD031 fenced block spacing requirements. These same
spacing rules also apply to the heading at line 71.

In
`@src/e2e_test/adaptive_export_loadtest/tools/loadgen/cmd/cleanloadgen/main.go`:
- Around line 64-71: The envInt function currently accepts negative integer
values from environment variables, which can cause unintended behavior in
generation loops. After successfully parsing the integer value with strconv.Atoi
in the envInt function, add a validation check to ensure the parsed value is
non-negative. If the value is negative, call fatalf with an appropriate error
message indicating that the environment variable must be a non-negative integer.

In `@src/e2e_test/adaptive_export_loadtest/tools/loadgen/go.mod`:
- Line 3: The Go version directive in the go.mod file is set to version 1.22,
which is no longer receiving active security updates. Update the go directive
from version 1.22 to version 1.26 (or 1.25 if necessary) to ensure the loadgen
binaries run on a currently supported Go runtime baseline and receive ongoing
security patches.

In `@src/vizier/services/adaptive_export/cmd/main.go`:
- Around line 799-807: The issue is that isOperatorManagedScript() function uses
a broad prefix check (ch-) to identify operator-managed scripts, which can
accidentally match user-authored scripts that start with ch-. This causes
unintended deletion of user scripts when INSTALL_PRESET_SCRIPTS=true. Replace
the prefix-based matching logic in isOperatorManagedScript() with an explicit
list of known operator-managed script names that the function should check
against. Similarly, apply the same fix to the duplicate logic mentioned at lines
839-847 to ensure both locations use the same explicit script name matching
approach.

In `@src/vizier/services/adaptive_export/internal/chhttp/chhttp.go`:
- Around line 36-38: The QueryStream method claims to support unbounded
incremental streaming but uses the same http.Client that has a timeout set via
DefaultTimeout. In Go, http.Client.Timeout applies to the entire
request-response cycle including body reads, which will cut off long-running
responses mid-stream. For QueryStream, create a separate http.Client instance
without a timeout constraint, or modify the client initialization to
conditionally apply the timeout only for non-streaming operations. Ensure
QueryStream uses the timeout-free client while other operations continue to use
the DefaultTimeout-constrained client.

In `@src/vizier/services/adaptive_export/internal/clickhouse/integration_test.go`:
- Around line 145-152: The test uses a single 30-second context timeout for both
the Apply and VerifyPixieSchema operations, which means if Apply consumes most
of the time, VerifyPixieSchema may fail with a deadline exceeded error even
though the schema is correct. Create separate context.WithTimeout calls for each
operation so that both Apply and VerifyPixieSchema each get their own
independent 30-second timeout deadline.

In `@src/vizier/services/adaptive_export/internal/clickhouse/schema.sql`:
- Around line 68-70: The kubescape_logs table has unsafe unit handling for the
event_time column in both the PARTITION BY and TTL expressions. The
toDateTime(event_time) calls assume event_time is in seconds, but if the actual
values are in milliseconds or nanoseconds, this will produce incorrect partition
keys and TTL horizons. Determine the actual unit of event_time (likely
milliseconds or nanoseconds), then update both the PARTITION BY
toYYYYMM(toDateTime(event_time)) expression on line 68 and the TTL
toDateTime(event_time) + INTERVAL 30 DAY DELETE expression on line 69 to apply
the appropriate unit conversion before passing to toDateTime(), such as dividing
by 1000 for milliseconds or by 1000000000 for nanoseconds.

In `@src/vizier/services/adaptive_export/internal/control/server_test.go`:
- Around line 155-169: The TestBadInputRejected test function is missing
explicit test cases for the new t_end and window validators. Add test cases that
verify the server returns http.StatusBadRequest for t_end <= 0 on the
/export/start endpoint, and for window[0] >= window[1] and non-positive window
values on the /query endpoint. Follow the existing pattern in
TestBadInputRejected by using the do() helper function with appropriate test
payloads for each validator and asserting the expected 400 status code with
descriptive failure messages.

In `@src/vizier/services/adaptive_export/internal/control/server.go`:
- Around line 120-143: The handleDXAttackGraph method is vulnerable to memory
DoS attacks because the decode function reads the entire unbounded request body
and then copies it into a second buffer. Wrap the request body with
http.MaxBytesReader to enforce a maximum size limit before passing the request
to the decode function. This will ensure oversized requests fail gracefully
instead of exhausting memory resources.

In `@src/vizier/services/adaptive_export/internal/controller/controller.go`:
- Around line 316-366: The code modifies c.active[hash] before calling
c.sink.Write(), but if the write fails, it returns without rolling back the
changes to c.active. This allows in-flight workers already processing the same
hash to continue using unpersisted state extensions. Save the prior state of
c.active[hash] before making any modifications (before the block that either
creates a new row or updates the existing row), and then add rollback logic in
the error handling block after the c.sink.Write() call fails to restore the
original state, ensuring that in-flight workers do not reference orphaned state.

In `@src/vizier/services/adaptive_export/internal/e2e/e2e_test.go`:
- Around line 146-167: The test currently validates only the payload shape
within the INSERT bodies, but does not verify that the INSERT statement targets
the correct table forensic_db.adaptive_attribution. Add an assertion that checks
the captured SQL statements (via insertedSQL) to ensure they contain the correct
table reference, preventing regressions where the wrong table could be inserted
into while still maintaining the correct payload structure. This check should be
added alongside or after the existing body content validation loop that checks
for matched payloads.

In `@src/vizier/services/adaptive_export/internal/passthrough/passthrough.go`:
- Around line 124-131: The compiled path silently skips invalid tables when
pxl.CompilePassthrough fails but does not record these failures using
l.rec(...), which changes the error recording semantics compared to legacy mode.
In the error handling block where pxl.CompilePassthrough(table, cfg.Window)
returns an error, add a call to l.rec(...) to record the failure before
continuing to skip the table. Apply this same fix to the similar error handling
code at lines 235-239 that also needs to record failures when tables cannot be
compiled.

In `@src/vizier/services/adaptive_export/internal/passthrough/reconcile_test.go`:
- Around line 34-37: The capRec struct's Record method appends to the rows slice
without synchronization, which causes race conditions when multiple goroutines
call Record concurrently. Add a sync.Mutex field to the capRec struct and use it
to protect the append operation in the Record method by locking before appending
to rows and unlocking after. Also update any code that reads the rows field to
acquire the same lock before accessing the slice to ensure consistent visibility
of all recorded rows.

In `@src/vizier/services/adaptive_export/internal/pixieapi/pixieapi.go`:
- Around line 112-160: Add a TODO comment to the Query method of the Adapter
struct that documents the need to implement a long-lived client with JWT-refresh
mechanism for high-throughput direct-mode scenarios. Place this TODO near the
existing comment about GC-time reclamation and connection-leak mitigation to
ensure future developers are aware of the technical debt and the conditions
under which it should be addressed.

In `@src/vizier/services/adaptive_export/internal/sink/clickhouse.go`:
- Around line 190-207: The unconditional Info log statement that logs "sink:
pixie write completed" with the WithFields call executes on every batch write in
the hot path, creating excessive log volume. Wrap the entire logging block (from
the summary := res.Summary assignment through the Info call) behind a
debug-level check using log.IsLevelEnabled(log.DebugLevel) or a temporary
feature flag so this diagnostic logging only occurs when explicitly enabled for
debugging rather than on every write.

In `@src/vizier/services/adaptive_export/internal/sink/fastencode_test.go`:
- Around line 36-58: The encodeViaJSON and parseNDJSON helper functions are
ignoring errors from enc.Encode and json.Unmarshal respectively, which can allow
tests to pass with invalid data. Modify both encodeViaJSON and parseNDJSON to
accept a *testing.T parameter as the first argument, then check the error
returns from enc.Encode and json.Unmarshal and call t.Fatalf to fail the test
immediately if either operation returns an error. Finally, update all call sites
throughout the test file where encodeViaJSON and parseNDJSON are invoked to pass
the test instance t as the first argument.

In `@src/vizier/services/adaptive_export/internal/sink/fastencode.go`:
- Around line 184-187: The appendFloat function is being called with float32 and
float64 values without first validating that they are finite numbers. Non-finite
values like NaN and Infinity will be emitted as invalid JSON, causing
whole-batch inserts to fail. Add a check before calling appendFloat in both the
float32 case (around line 184) and float64 case (around line 186) to verify the
value is finite using a function like math.IsNaN or math.IsInf. If a non-finite
value is detected, reject it or handle it appropriately instead of passing it to
appendFloat. Apply the same validation pattern to the similar float handling
code around lines 215-218.

In `@src/vizier/services/adaptive_export/internal/sink/integration_test.go`:
- Around line 158-164: The shared 60-second timeout context created before the
loop causes the test to be flaky because it applies to the entire table
iteration sequence rather than individual tables. Move the context.WithTimeout
call inside the for loop that iterates over chpkg.PixieTables() so that each
table gets its own independent 60-second timeout. This way, if one table's
WritePixieRows call takes time, it won't affect the timeout for subsequent
tables, ensuring failures are isolated to the specific table being tested.

In `@src/vizier/services/adaptive_export/internal/streaming/writer.go`:
- Around line 117-134: The flush() method clears buf unconditionally after
attempting WritePixieRows, which causes data loss when the write fails. Only
clear buf by setting it to buf[:0] when WritePixieRows succeeds (move this
operation into the else block), not in the error case. Additionally, in the
shutdown branch (lines 138-140), the timeout context is derived from an
already-canceled parent context, causing the final flush to fail fast. For the
shutdown case, create the timeout context using context.Background() instead of
the canceled parent context to give the final flush operation a proper chance to
complete.

In `@src/vizier/services/adaptive_export/internal/trigger/clickhouse.go`:
- Around line 73-78: The PollLimit field documentation incorrectly states that 0
means unlimited behavior, but the New() function actually rewrites 0 to 10000.
Update the comment for the PollLimit field to accurately reflect the actual
runtime behavior that 0 is rewritten to the default value of 10000, ensuring
operators understand the correct semantics when tuning this setting.

In
`@src/vizier/services/adaptive_export/internal/trigger/fingerprint_bench_test.go`:
- Around line 123-132: The function fingerprintNoFmt is intended to be a
reference implementation that avoids fmt.Fprintf overhead for benchmarking
purposes, but the code still contains a fmt.Fprintf call on the line formatting
r.EventTime with the "%d" format string. Replace this fmt.Fprintf call with a
direct string conversion using strconv (such as strconv.FormatInt) to convert
the EventTime value to a string, then write it directly to the strings.Builder
using WriteString method to truly eliminate the formatting overhead from the
benchmark baseline.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 0f1144a8-c2c2-4681-a318-0629757dcbdb

📥 Commits

Reviewing files that changed from the base of the PR and between 855db81 and 714c1fb.

📒 Files selected for processing (116)

.arclint
.bazelignore
.github/workflows/e2e_log4shell_soc.yaml
.github/workflows/vizier_release.yaml
k8s/vizier/bootstrap/adaptive_export_deployment.yaml
k8s/vizier/bootstrap/adaptive_export_secrets.yaml
k8s/vizier/bootstrap/kustomization.yaml
skaffold/skaffold_vizier.yaml
src/api/go/pxapi/opts.go
src/e2e_test/adaptive_export_loadtest/CONTRACTS.md
src/e2e_test/adaptive_export_loadtest/FINDINGS_AND_BACKLOG.md
src/e2e_test/adaptive_export_loadtest/README.md
src/e2e_test/adaptive_export_loadtest/fixtures/EXPERIMENTS.md
src/e2e_test/adaptive_export_loadtest/harness/ae_config.sh
src/e2e_test/adaptive_export_loadtest/harness/exp_control.sh
src/e2e_test/adaptive_export_loadtest/harness/exp_e5.sh
src/e2e_test/adaptive_export_loadtest/harness/exp_e8.sh
src/e2e_test/adaptive_export_loadtest/harness/exp_matrix.sh
src/e2e_test/adaptive_export_loadtest/harness/exp_row_reconcile.sh
src/e2e_test/adaptive_export_loadtest/harness/inject.sh
src/e2e_test/adaptive_export_loadtest/harness/lib.sh
src/e2e_test/adaptive_export_loadtest/harness/log4shell_fire.sh
src/e2e_test/adaptive_export_loadtest/harness/nfr.sh
src/e2e_test/adaptive_export_loadtest/harness/run.sh
src/e2e_test/adaptive_export_loadtest/harness/stats.py
src/e2e_test/adaptive_export_loadtest/k8s/00-sinks.yaml
src/e2e_test/adaptive_export_loadtest/k8s/gen-pod.tmpl.yaml
src/e2e_test/adaptive_export_loadtest/tools/loadgen/Dockerfile
src/e2e_test/adaptive_export_loadtest/tools/loadgen/cmd/cleanloadgen/main.go
src/e2e_test/adaptive_export_loadtest/tools/loadgen/cmd/httpsink/main.go
src/e2e_test/adaptive_export_loadtest/tools/loadgen/go.mod
src/stirling/source_connectors/socket_tracer/testing/container_images/BUILD.bazel
src/vizier/services/adaptive_export/BUILD.bazel
src/vizier/services/adaptive_export/cmd/BUILD.bazel
src/vizier/services/adaptive_export/cmd/main.go
src/vizier/services/adaptive_export/internal/activeset/BUILD.bazel
src/vizier/services/adaptive_export/internal/activeset/activeset.go
src/vizier/services/adaptive_export/internal/activeset/activeset_test.go
src/vizier/services/adaptive_export/internal/anomaly/BUILD.bazel
src/vizier/services/adaptive_export/internal/anomaly/hash.go
src/vizier/services/adaptive_export/internal/anomaly/hash_bench_test.go
src/vizier/services/adaptive_export/internal/anomaly/hash_test.go
src/vizier/services/adaptive_export/internal/chhttp/BUILD.bazel
src/vizier/services/adaptive_export/internal/chhttp/chhttp.go
src/vizier/services/adaptive_export/internal/chhttp/chhttp_test.go
src/vizier/services/adaptive_export/internal/clickhouse/BUILD.bazel
src/vizier/services/adaptive_export/internal/clickhouse/apply.go
src/vizier/services/adaptive_export/internal/clickhouse/apply_test.go
src/vizier/services/adaptive_export/internal/clickhouse/columns_test.go
src/vizier/services/adaptive_export/internal/clickhouse/ddl.go
src/vizier/services/adaptive_export/internal/clickhouse/ddl_test.go
src/vizier/services/adaptive_export/internal/clickhouse/insert.go
src/vizier/services/adaptive_export/internal/clickhouse/insert_test.go
src/vizier/services/adaptive_export/internal/clickhouse/integration_test.go
src/vizier/services/adaptive_export/internal/clickhouse/schema.sql
src/vizier/services/adaptive_export/internal/config/BUILD.bazel
src/vizier/services/adaptive_export/internal/config/definition.go
src/vizier/services/adaptive_export/internal/control/BUILD.bazel
src/vizier/services/adaptive_export/internal/control/server.go
src/vizier/services/adaptive_export/internal/control/server_test.go
src/vizier/services/adaptive_export/internal/controller/BUILD.bazel
src/vizier/services/adaptive_export/internal/controller/controller.go
src/vizier/services/adaptive_export/internal/controller/controller_test.go
src/vizier/services/adaptive_export/internal/e2e/BUILD.bazel
src/vizier/services/adaptive_export/internal/e2e/e2e_test.go
src/vizier/services/adaptive_export/internal/e2e/loadtest_test.go
src/vizier/services/adaptive_export/internal/kubescape/BUILD.bazel
src/vizier/services/adaptive_export/internal/kubescape/extract.go
src/vizier/services/adaptive_export/internal/kubescape/extract_test.go
src/vizier/services/adaptive_export/internal/passthrough/BUILD.bazel
src/vizier/services/adaptive_export/internal/passthrough/compiled_test.go
src/vizier/services/adaptive_export/internal/passthrough/passthrough.go
src/vizier/services/adaptive_export/internal/passthrough/passthrough_test.go
src/vizier/services/adaptive_export/internal/passthrough/reconcile_test.go
src/vizier/services/adaptive_export/internal/pixie/pixie.go
src/vizier/services/adaptive_export/internal/pixieapi/BUILD.bazel
src/vizier/services/adaptive_export/internal/pixieapi/pixieapi.go
src/vizier/services/adaptive_export/internal/pixieapi/pixieapi_test.go
src/vizier/services/adaptive_export/internal/pxl/BUILD.bazel
src/vizier/services/adaptive_export/internal/pxl/compile.go
src/vizier/services/adaptive_export/internal/pxl/compile_test.go
src/vizier/services/adaptive_export/internal/pxl/pxl.go
src/vizier/services/adaptive_export/internal/pxl/queryfor.go
src/vizier/services/adaptive_export/internal/pxl/queryfor_bench_test.go
src/vizier/services/adaptive_export/internal/pxl/queryfor_test.go
src/vizier/services/adaptive_export/internal/pxl/tables.go
src/vizier/services/adaptive_export/internal/pxl/tables_test.go
src/vizier/services/adaptive_export/internal/reconcile/BUILD.bazel
src/vizier/services/adaptive_export/internal/reconcile/reconcile.go
src/vizier/services/adaptive_export/internal/script/script.go
src/vizier/services/adaptive_export/internal/sink/BUILD.bazel
src/vizier/services/adaptive_export/internal/sink/clickhouse.go
src/vizier/services/adaptive_export/internal/sink/clickhouse_test.go
src/vizier/services/adaptive_export/internal/sink/content_type_contract_test.go
src/vizier/services/adaptive_export/internal/sink/encode_bench_test.go
src/vizier/services/adaptive_export/internal/sink/fastencode.go
src/vizier/services/adaptive_export/internal/sink/fastencode_test.go
src/vizier/services/adaptive_export/internal/sink/integration_test.go
src/vizier/services/adaptive_export/internal/streaming/BUILD.bazel
src/vizier/services/adaptive_export/internal/streaming/filter.go
src/vizier/services/adaptive_export/internal/streaming/filter_test.go
src/vizier/services/adaptive_export/internal/streaming/integration_test.go
src/vizier/services/adaptive_export/internal/streaming/notifier.go
src/vizier/services/adaptive_export/internal/streaming/notifier_test.go
src/vizier/services/adaptive_export/internal/streaming/scanner.go
src/vizier/services/adaptive_export/internal/streaming/scanner_test.go
src/vizier/services/adaptive_export/internal/streaming/supervisor.go
src/vizier/services/adaptive_export/internal/streaming/writer.go
src/vizier/services/adaptive_export/internal/trigger/BUILD.bazel
src/vizier/services/adaptive_export/internal/trigger/clickhouse.go
src/vizier/services/adaptive_export/internal/trigger/clickhouse_internal_test.go
src/vizier/services/adaptive_export/internal/trigger/clickhouse_test.go
src/vizier/services/adaptive_export/internal/trigger/fingerprint_bench_test.go
src/vizier/services/adaptive_export/internal/trigger/integration_test.go
src/vizier/services/adaptive_export/internal/trigger/watermark.go
src/vizier/services/adaptive_export/internal/trigger/watermark_test.go

💤 Files with no reviewable changes (2)

src/vizier/services/adaptive_export/internal/pxl/pxl.go
src/vizier/services/adaptive_export/internal/config/definition.go

coderabbitai · 2026-06-21T21:23:58Z

        git push origin "gh-pages"
  update-gh-artifacts-manifest:
-    runs-on: oracle-8cpu-32gb-x86-64
+    runs-on: oracle-vm-16cpu-64gb-x86-64


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Runner label is currently unresolved by lint/scheduling contracts.

runs-on: oracle-vm-16cpu-64gb-x86-64 is flagged as unknown, which can block release workflow validation or leave the job unschedulable unless this is explicitly configured as a self-hosted label.

🔧 Suggested fix

- runs-on: oracle-vm-16cpu-64gb-x86-64 + runs-on: + - self-hosted + - oracle-vm-16cpu-64gb-x86-64

If this is intentional custom-label usage, also register it in your actionlint config to keep CI green.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

runs-on: oracle-vm-16cpu-64gb-x86-64

runs-on:

- self-hosted

- oracle-vm-16cpu-64gb-x86-64

🧰 Tools

🪛 actionlint (1.7.12)

[error] 143-143: label "oracle-vm-16cpu-64gb-x86-64" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2025-vs2026", "windows-2022", "windows-11-arm", "ubuntu-slim", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xlarge", "macos-latest-large", "macos-26-intel", "macos-26-xlarge", "macos-26-large", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xlarge", "macos-14-large", "macos-14", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/vizier_release.yaml at line 143, The runner label oracle-vm-16cpu-64gb-x86-64 in the runs-on field is not recognized by actionlint, causing lint validation to fail. If this is an intentional custom self-hosted runner label, register it in your actionlint configuration file (typically .actionlintrc or actionlint config) to whitelist this label and allow the workflow to pass validation. Alternatively, if this label does not exist in your infrastructure, replace it with a valid standard GitHub Actions runner label or your organization's properly configured self-hosted runner label.

Source: Linters/SAST tools

coderabbitai · 2026-06-21T21:23:58Z

+- **DX controls:** (1) open/extend a window (each referral/anomaly extends `t_end`), (2) explicit **StopExport** via the control surface (`CONTROL_ADDR`, design rev-3 — confirm wired), (3) the active set (which pods AE over-captures).
+- **DX relies on:** C5 (stable hash identity), C14 (write ⊇ read), **C15 (no premature stop)**, C9 (0 rows only when the workload is genuinely silent), C10 (the `ns/pod` ↔ bare join). For DX to steer dependably, C3/C8/C13/C15 must move from 🔴 to ✅.
+
+## Legend


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix markdownlint heading spacing in the legend section.

Line 96 (## Legend) is missing the required surrounding blank line and triggers MD022.

🧰 Tools

🪛 markdownlint-cli2 (0.22.1)

[warning] 96-96: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/e2e_test/adaptive_export_loadtest/CONTRACTS.md` at line 96, The `## Legend` heading is missing required blank lines for markdown linting compliance with rule MD022. Add a blank line before the `## Legend` heading and a blank line after it to separate it from the surrounding content. This ensures proper spacing around the heading section in the CONTRACTS.md file.

Source: Linters/SAST tools

coderabbitai · 2026-06-21T21:23:59Z

+```sql
+PARTITION BY toYYYYMM(toDateTime(event_time))
+TTL toDateTime(event_time) + INTERVAL 30 DAY
+```
+`toDateTime()` interprets its arg as **seconds**. Reproduced on the rig:
+```
+toDateTime(1781559074162913804)            = 2106-02-07   (saturates at DateTime max)
+toDateTime(1781559074162913804)+30 DAY     = 1970-01-30   (OVERFLOWS past max → wraps to 1970)
+(... ) < now()                             = 1            (already_expired)
+```


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Normalize fenced-code formatting and add fence language tags.

Lines 50–59 and 95–97 include fenced blocks without language identifiers and without required surrounding blank lines, which triggers MD031/MD040.

Also applies to: 95-97

🧰 Tools

🪛 markdownlint-cli2 (0.22.1)

[warning] 50-50: Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)

[warning] 53-53: Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)

[warning] 55-55: Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)

[warning] 55-55: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

[warning] 59-59: Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/e2e_test/adaptive_export_loadtest/FINDINGS_AND_BACKLOG.md` around lines 50 - 59, The markdown file has formatting violations in fenced code blocks at lines 50-59 and 95-97. Add a blank line before and after each fenced code block to comply with MD031, and add language identifiers to the opening fence markers (e.g., change ``` to ```sql for SQL code blocks and ```text or appropriate language for other blocks). This applies to all fenced code blocks that currently lack language tags or proper spacing.

Source: Linters/SAST tools

coderabbitai · 2026-06-21T21:23:59Z

+isolates each rep even with overlapping windows). Timestamps are explicit unix
+nanos — fixtures NEVER use wall-clock `now()`.


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Timestamp unit contract is inconsistent with harness behavior.

Line 17 and Line 35 document nanosecond semantics, but src/e2e_test/adaptive_export_loadtest/harness/lib.sh (Line 145–147) states fixture event_time is in seconds. Following this doc as written can undersize ADAPTIVE_WINDOW_BEFORE_SEC and miss band rows.

Suggested doc correction

- Timestamps are explicit unix nanos — fixtures NEVER use wall-clock `now()`. + Timestamps are explicit unix seconds — fixtures NEVER use wall-clock `now()`. - 3. AE config: `ADAPTIVE_WINDOW_BEFORE_SEC ≥ (B1−B0)/1e9 + margin` so window start + 3. AE config: `ADAPTIVE_WINDOW_BEFORE_SEC ≥ (B1−B0) + margin` so window start

Also applies to: 35-36

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/e2e_test/adaptive_export_loadtest/fixtures/EXPERIMENTS.md` around lines 17 - 18, The documentation in EXPERIMENTS.md incorrectly states that fixture timestamps are in unix nanoseconds, but the actual harness implementation in lib.sh (lines 145-147) uses seconds for the event_time field. Update the documentation at lines 17-18 to accurately reflect that timestamps are in seconds rather than nanoseconds, and also correct the same misstatement at lines 35-36 to ensure consistency with the actual harness behavior and prevent users from undersizing ADAPTIVE_WINDOW_BEFORE_SEC.

coderabbitai · 2026-06-21T21:23:59Z

+    "$INJECT" --endpoint "$CH_HTTP" --user "$CH_RW_USER" --pass "$CH_RW_PASS" \
+      --hostname "$NODE" --ns "$AELOAD_NS" --pod "$POD" --rule R0001 --pid 1234 --comm java \
+      --event-time "$T" --count "$BURST" --same-time >&2 || true
+  else
+    "$INJECT" --endpoint "$CH_HTTP" --user "$CH_RW_USER" --pass "$CH_RW_PASS" \
+      --hostname "$NODE" --ns "$AELOAD_NS" --pod "$POD" --rule R0001 --pid 1234 --comm java \
+      --event-time "$T" >&2 || true


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Do not suppress injector failures in a stall-detection experiment.

Line 74/Line 78 use || true, so failed anomaly inserts are treated as success; this can produce false STALL flags at Line 90.

Suggested fix

- "$INJECT" --endpoint "$CH_HTTP" --user "$CH_RW_USER" --pass "$CH_RW_PASS" \ + "$INJECT" --endpoint "$CH_HTTP" --user "$CH_RW_USER" --pass "$CH_RW_PASS" \ --hostname "$NODE" --ns "$AELOAD_NS" --pod "$POD" --rule R0001 --pid 1234 --comm java \ - --event-time "$T" --count "$BURST" --same-time >&2 || true + --event-time "$T" --count "$BURST" --same-time >&2 || die "inject failed at tick=$tick" @@ - "$INJECT" --endpoint "$CH_HTTP" --user "$CH_RW_USER" --pass "$CH_RW_PASS" \ + "$INJECT" --endpoint "$CH_HTTP" --user "$CH_RW_USER" --pass "$CH_RW_PASS" \ --hostname "$NODE" --ns "$AELOAD_NS" --pod "$POD" --rule R0001 --pid 1234 --comm java \ - --event-time "$T" >&2 || true + --event-time "$T" >&2 || die "inject failed at tick=$tick"

Also applies to: 90-90

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/e2e_test/adaptive_export_loadtest/harness/exp_e8.sh` around lines 74 - 80, The INJECT command invocations in the stall-detection experiment are using the || true pattern to suppress failures, which causes failed anomaly inserts to be treated as success and can produce false STALL flags. Remove the || true suffix from both INJECT command calls (the first one with the --same-time flag and the second one without it) to allow injection failures to properly propagate and be detected, ensuring accurate stall detection in the experiment.

coderabbitai · 2026-06-21T21:24:00Z

+	case float32:
+		appendFloat(buf, float64(x))
+	case float64:
+		appendFloat(buf, x)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject non-finite floats before fast-path emission.

appendFloat will emit NaN/+Inf/-Inf, which is invalid JSON. This can break whole-batch inserts instead of falling back safely.

Proposed fix

import ( "bytes" "encoding/json" "errors" "fmt" + "math" "strconv" "sync" "time" "unicode/utf8" @@ case float32: - appendFloat(buf, float64(x)) + f := float64(x) + if math.IsNaN(f) || math.IsInf(f, 0) { + return errFastEncodeUnsupported + } + appendFloat(buf, f) case float64: + if math.IsNaN(x) || math.IsInf(x, 0) { + return errFastEncodeUnsupported + } appendFloat(buf, x)

Also applies to: 215-218

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/vizier/services/adaptive_export/internal/sink/fastencode.go` around lines 184 - 187, The appendFloat function is being called with float32 and float64 values without first validating that they are finite numbers. Non-finite values like NaN and Infinity will be emitted as invalid JSON, causing whole-batch inserts to fail. Add a check before calling appendFloat in both the float32 case (around line 184) and float64 case (around line 186) to verify the value is finite using a function like math.IsNaN or math.IsInf. If a non-finite value is detected, reject it or handle it appropriately instead of passing it to appendFloat. Apply the same validation pattern to the similar float handling code around lines 215-218.

coderabbitai · 2026-06-21T21:24:00Z

+	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
+	defer cancel()
+
+	for _, table := range chpkg.PixieTables() {
+		row := minimalRowFor(table, tag, now)
+		if err := s.WritePixieRows(ctx, table, []map[string]any{row}); err != nil {
+			t.Errorf("WritePixieRows(%s): %v", table, err)


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Shared timeout across all tables makes this live test unnecessarily flaky.

Using one 60s context for the entire table loop can cancel later table checks due to earlier latency. Prefer per-table timeouts so failures are isolated to the table under test.

🔧 Suggested fix

- ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second) - defer cancel() - for _, table := range chpkg.PixieTables() { + ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second) row := minimalRowFor(table, tag, now) if err := s.WritePixieRows(ctx, table, []map[string]any{row}); err != nil { + cancel() t.Errorf("WritePixieRows(%s): %v", table, err) continue } ident := table if strings.Contains(table, ".") { ident = "`" + table + "`" } got := chCount(t, endpoint, user, pass, fmt.Sprintf("SELECT count() FROM forensic_db.%s WHERE hostname='%s'", ident, tag)) + cancel() if got < 1 { t.Errorf("table %s after WritePixieRows: count=%d, want >=1", table, got) } }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/vizier/services/adaptive_export/internal/sink/integration_test.go` around lines 158 - 164, The shared 60-second timeout context created before the loop causes the test to be flaky because it applies to the entire table iteration sequence rather than individual tables. Move the context.WithTimeout call inside the for loop that iterates over chpkg.PixieTables() so that each table gets its own independent 60-second timeout. This way, if one table's WritePixieRows call takes time, it won't affect the timeout for subsequent tables, ensuring failures are isolated to the specific table being tested.

coderabbitai · 2026-06-21T21:24:01Z

+		fctx, cancel := context.WithTimeout(ctx, 60*time.Second)
+		err := w.sink.WritePixieRows(fctx, w.table, buf)
+		cancel()
+		if err != nil {
+			log.WithError(err).WithFields(log.Fields{
+				"table":  w.table,
+				"rows":   len(buf),
+				"reason": reason,
+			}).Warn("streaming.BatchWriter: flush failed")
+		} else {
+			log.WithFields(log.Fields{
+				"table":  w.table,
+				"rows":   len(buf),
+				"reason": reason,
+			}).Info("streaming.BatchWriter: flushed batch")
+		}
+		buf = buf[:0]
+	}


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Prevent row loss on failed/shutdown flushes.

flush() clears buf even when WritePixieRows fails, and in the shutdown branch the timeout context is derived from an already-canceled parent. That makes the final flush fail fast and still discard buffered rows.

Proposed fix

- flush := func(reason string) { + flush := func(reason string) { if len(buf) == 0 { return } - // Bound the CH write so a stalled CH HTTP doesn't pin us. - fctx, cancel := context.WithTimeout(ctx, 60*time.Second) + // Bound the CH write so a stalled CH HTTP doesn't pin us. + // On shutdown, use a fresh parent so we can attempt a real final flush. + baseCtx := ctx + if reason == "shutdown" { + baseCtx = context.Background() + } + fctx, cancel := context.WithTimeout(baseCtx, 60*time.Second) err := w.sink.WritePixieRows(fctx, w.table, buf) cancel() if err != nil { log.WithError(err).WithFields(log.Fields{ "table": w.table, "rows": len(buf), "reason": reason, }).Warn("streaming.BatchWriter: flush failed") - } else { + // Keep buffered rows for retry on the next timer/size flush. + // (shutdown still exits after this best-effort attempt) + if reason != "shutdown" { + return + } + } else { log.WithFields(log.Fields{ "table": w.table, "rows": len(buf), "reason": reason, }).Info("streaming.BatchWriter: flushed batch") } buf = buf[:0] }

Also applies to: 138-140

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/vizier/services/adaptive_export/internal/streaming/writer.go` around lines 117 - 134, The flush() method clears buf unconditionally after attempting WritePixieRows, which causes data loss when the write fails. Only clear buf by setting it to buf[:0] when WritePixieRows succeeds (move this operation into the else block), not in the error case. Additionally, in the shutdown branch (lines 138-140), the timeout context is derived from an already-canceled parent context, causing the final flush to fail fast. For the shutdown case, create the timeout context using context.Background() instead of the canceled parent context to give the final flush operation a proper chance to complete.

coderabbitai · 2026-06-21T21:24:01Z

+	// PollLimit caps rows returned per poll. Bounds catch-up work
+	// after a restart so a 10h backlog doesn't translate into a
+	// single multi-GiB SELECT the HTTP client times out on; instead
+	// it drains in N polls of PollLimit rows. Default 10000.
+	// 0 → unlimited (legacy behavior — NOT recommended in prod).
+	PollLimit int


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

PollLimit documentation contradicts runtime behavior.

The comment says 0 means unlimited, but New() rewrites 0 to 10000. Please align the comment (or behavior) so operators don’t tune based on incorrect semantics.

Suggested fix

- // PollLimit caps rows returned per poll. Bounds catch-up work - // after a restart so a 10h backlog doesn't translate into a - // single multi-GiB SELECT the HTTP client times out on; instead - // it drains in N polls of PollLimit rows. Default 10000. - // 0 → unlimited (legacy behavior — NOT recommended in prod). + // PollLimit caps rows returned per poll. Bounds catch-up work + // after a restart so a 10h backlog doesn't translate into a + // single multi-GiB SELECT the HTTP client times out on; instead + // it drains in N polls of PollLimit rows. 0 uses the default (10000).

Also applies to: 137-139

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/vizier/services/adaptive_export/internal/trigger/clickhouse.go` around lines 73 - 78, The PollLimit field documentation incorrectly states that 0 means unlimited behavior, but the New() function actually rewrites 0 to 10000. Update the comment for the PollLimit field to accurately reflect the actual runtime behavior that 0 is rewritten to the default value of 10000, ensuring operators understand the correct semantics when tuning this setting.

coderabbitai · 2026-06-21T21:24:01Z

+// fingerprintNoFmt is the Fprintf-free reference. Same output guarantee
+// is NOT asserted here — this is a perf-comparison anchor only. If the
+// numbers diverge by >2× from rowFingerprint, the fmt.Fprintf path is
+// a real cost.
+func fingerprintNoFmt(r kubescape.Row) string {
+	h := sha256.New()
+	var b strings.Builder
+	b.Grow(64 + len(r.RuleID) + len(r.Hostname) + len(r.K8sDetails) + len(r.ProcessDetails))
+	_, _ = fmt.Fprintf(&b, "%d", r.EventTime)
+	b.WriteByte(0)


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

fingerprintNoFmt still uses fmt.Fprintf, so the benchmark baseline is skewed.

Line 131 keeps formatting overhead in the “no fmt” path, which weakens the comparison.

Proposed fix

import ( "crypto/sha256" "encoding/hex" - "fmt" + "strconv" "strings" "testing" @@ - _, _ = fmt.Fprintf(&b, "%d", r.EventTime) + var nbuf [20]byte + b.Write(strconv.AppendUint(nbuf[:0], r.EventTime, 10))

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/vizier/services/adaptive_export/internal/trigger/fingerprint_bench_test.go` around lines 123 - 132, The function fingerprintNoFmt is intended to be a reference implementation that avoids fmt.Fprintf overhead for benchmarking purposes, but the code still contains a fmt.Fprintf call on the line formatting r.EventTime with the "%d" format string. Replace this fmt.Fprintf call with a direct string conversion using strconv (such as strconv.FormatInt) to convert the EventTime value to a string, then write it directly to the strings.Builder using WriteString method to truly eliminate the formatting overhead from the benchmark baseline.

coderabbitai

Review continued from previous batch...

coderabbitai · 2026-06-21T21:25:15Z

+on:
+  workflow_dispatch:
+    inputs:
+      dx_image:
+        description: dx-daemon image to test (default = .image-tags pin)
+        required: false
+        default: ""
+      soc_ref:
+        description: k8sstormcenter/soc branch
+        required: false
+        default: "218-clickhouse-schema"
+permissions:
+  contents: read
+
+jobs:
+  e2e:
+    runs-on: oracle-vm-16cpu-64gb-x86-64   # eBPF + 16cpu/64gb; ubuntu-latest cannot run Pixie
+    timeout-minutes: 90


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add workflow concurrency guards for the self-hosted rig.

This job provisions global host resources (k3s, /etc/rancher/k3s/...). Without concurrency, overlapping manual runs can race and corrupt each other.

Proposed fix

on: workflow_dispatch: @@ +concurrency: + group: e2e-log4shell-soc-${{ github.ref }} + cancel-in-progress: false + jobs: e2e:

🧰 Tools

🪛 actionlint (1.7.12)

[error] 30-30: label "oracle-vm-16cpu-64gb-x86-64" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2025-vs2026", "windows-2022", "windows-11-arm", "ubuntu-slim", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xlarge", "macos-latest-large", "macos-26-intel", "macos-26-xlarge", "macos-26-large", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xlarge", "macos-14-large", "macos-14", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)

🪛 zizmor (1.25.2)

[info] 29-29: workflow or action definition without a name (anonymous-definition): this job

(anonymous-definition)

[warning] 14-24: insufficient job-level concurrency limits (concurrency-limits): workflow is missing concurrency setting

(concurrency-limits)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/e2e_log4shell_soc.yaml around lines 14 - 31, The e2e_log4shell_soc.yaml workflow lacks concurrency guards, which allows multiple simultaneous runs to race and corrupt shared self-hosted runner resources like k3s and /etc/rancher/k3s/. Add a concurrency block at the top level of the workflow (alongside the on and jobs sections) that specifies a concurrency group to ensure only one run executes at a time. The concurrency group should cancel in-progress runs when a new run is triggered, using a meaningful identifier like the workflow name or a fixed string to ensure all runs of this workflow serialize properly on the oracle-vm-16cpu-64gb-x86-64 runner.

Source: Linters/SAST tools

coderabbitai · 2026-06-21T21:25:15Z

+        git clone --depth 1 -b "${{ inputs.soc_ref }}" https://github.com/k8sstormcenter/soc soc
+        cd soc
+        make pixie                              # vizier + AE
+        make kubescape || true                  # node-agent (netStreaming)
+        bash tree/clickhouse-lab/install.sh      # forensic_db
+        make log4j                               # vulnerable backend + attacker + dx + SBoBs (managed-by=User)
+        if [ -n "${{ inputs.dx_image }}" ]; then
+          kubectl -n honey set image ds/dx-daemon dx-daemon="${{ inputs.dx_image }}" || true


⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Do not inline-dispatch inputs directly into shell commands.

Lines 58/64/65 interpolate ${{ inputs.* }} inside run:; this is command-injection prone on a self-hosted runner. Move inputs to step env, validate allowed charset, then use quoted env vars.

Proposed hardening

- name: Deploy the SOC stack (Pixie + kubescape + ClickHouse + AE + dx + log4j chain) env: + SOC_REF: ${{ inputs.soc_ref }} + DX_IMAGE: ${{ inputs.dx_image }} PX_CLOUD_ADDR: pixie.austrianopencloudcommunity.org @@ run: | set -euo pipefail + [[ "$SOC_REF" =~ ^[A-Za-z0-9._/-]+$ ]] || { echo "invalid soc_ref"; exit 1; } + [[ -z "$DX_IMAGE" || "$DX_IMAGE" =~ ^[A-Za-z0-9._/:+-]+$ ]] || { echo "invalid dx_image"; exit 1; } sudo apt-get update -qq && sudo apt-get install -y python3-yaml - git clone --depth 1 -b "${{ inputs.soc_ref }}" https://github.com/k8sstormcenter/soc soc + git clone --depth 1 -b "$SOC_REF" https://github.com/k8sstormcenter/soc soc @@ - if [ -n "${{ inputs.dx_image }}" ]; then - kubectl -n honey set image ds/dx-daemon dx-daemon="${{ inputs.dx_image }}" || true + if [ -n "$DX_IMAGE" ]; then + kubectl -n honey set image ds/dx-daemon dx-daemon="$DX_IMAGE" || true fi

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

git clone --depth 1 -b "${{ inputs.soc_ref }}" https://github.com/k8sstormcenter/soc soc

cd soc

make pixie # vizier + AE

make kubescape || true # node-agent (netStreaming)

bash tree/clickhouse-lab/install.sh # forensic_db

make log4j # vulnerable backend + attacker + dx + SBoBs (managed-by=User)

if [ -n "${{ inputs.dx_image }}" ]; then

kubectl -n honey set image ds/dx-daemon dx-daemon="${{ inputs.dx_image }}" || true

git clone --depth 1 -b "$SOC_REF" https://github.com/k8sstormcenter/soc soc

cd soc

make pixie # vizier + AE

make kubescape || true # node-agent (netStreaming)

bash tree/clickhouse-lab/install.sh # forensic_db

make log4j # vulnerable backend + attacker + dx + SBoBs (managed-by=User)

if [ -n "$DX_IMAGE" ]; then

kubectl -n honey set image ds/dx-daemon dx-daemon="$DX_IMAGE" || true

🧰 Tools

🪛 zizmor (1.25.2)

[error] 58-58: code injection via template expansion (template-injection): may expand into attacker-controllable code

(template-injection)

[error] 64-64: code injection via template expansion (template-injection): may expand into attacker-controllable code

(template-injection)

[error] 65-65: code injection via template expansion (template-injection): may expand into attacker-controllable code

(template-injection)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/e2e_log4shell_soc.yaml around lines 58 - 65, The workflow is directly interpolating `inputs.soc_ref` and `inputs.dx_image` into shell commands within the run step, creating a command injection vulnerability. Move these inputs to an env section at the step level, assign them to environment variables, and validate that they contain only allowed characters (e.g., alphanumeric, hyphens, dots for versions). Replace the inline `${{ inputs.soc_ref }}` and `${{ inputs.dx_image }}` references in the git clone and kubectl commands with properly quoted environment variable references like `"$SOC_REF"` and `"$DX_IMAGE"` respectively.

Source: Linters/SAST tools

coderabbitai · 2026-06-21T21:25:15Z

+    E4) # boundary collision: 2 rows, same event_time, different RuleID, same target → 1 hash
+      filt="cp-e4-${R}"
+      inj --ns aeload --pod "$filt" --rule R0001 --pid 1234 --comm java --event-time "$T" --same-time || true
+      inj --ns aeload --pod "$filt" --rule R0010 --pid 1234 --comm java --event-time "$T" --same-time || { echo "$rep,$EXP,,,,,,,INJECT_FAIL"|tee -a "$OUT"; continue; }


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

E4 currently masks a failed first fixture insert.

Line 79 uses || true, so E4 can pass even when the first boundary-collision row was never inserted. This weakens the reproducibility assertion for that experiment.

Proposed fix

E4) # boundary collision: 2 rows, same event_time, different RuleID, same target → 1 hash filt="cp-e4-${R}" - inj --ns aeload --pod "$filt" --rule R0001 --pid 1234 --comm java --event-time "$T" --same-time || true + inj --ns aeload --pod "$filt" --rule R0001 --pid 1234 --comm java --event-time "$T" --same-time \ + || { echo "$rep,$EXP,,,,,,,INJECT_FAIL"|tee -a "$OUT"; continue; } inj --ns aeload --pod "$filt" --rule R0010 --pid 1234 --comm java --event-time "$T" --same-time || { echo "$rep,$EXP,,,,,,,INJECT_FAIL"|tee -a "$OUT"; continue; } ;;

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

E4) # boundary collision: 2 rows, same event_time, different RuleID, same target → 1 hash

filt="cp-e4-${R}"

inj --ns aeload --pod "$filt" --rule R0001 --pid 1234 --comm java --event-time "$T" --same-time || true

inj --ns aeload --pod "$filt" --rule R0010 --pid 1234 --comm java --event-time "$T" --same-time || { echo "$rep,$EXP,,,,,,,INJECT_FAIL"|tee -a "$OUT"; continue; }

E4) # boundary collision: 2 rows, same event_time, different RuleID, same target → 1 hash

filt="cp-e4-${R}"

inj --ns aeload --pod "$filt" --rule R0001 --pid 1234 --comm java --event-time "$T" --same-time \

|| { echo "$rep,$EXP,,,,,,,INJECT_FAIL"|tee -a "$OUT"; continue; }

inj --ns aeload --pod "$filt" --rule R0010 --pid 1234 --comm java --event-time "$T" --same-time || { echo "$rep,$EXP,,,,,,,INJECT_FAIL"|tee -a "$OUT"; continue; }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/e2e_test/adaptive_export_loadtest/harness/exp_control.sh` around lines 77 - 80, The first inj command in the E4 test case (line 79) uses || true which silently suppresses injection failures, masking when the first boundary-collision row fails to insert. This weakens the test reproducibility assertion. Apply the same error handling pattern used in the second inj command (line 80) to the first inj command, replacing the || true with a proper error handler that logs the INJECT_FAIL status and continues the experiment, ensuring both fixture injections are properly validated before the test proceeds.

coderabbitai · 2026-06-21T21:25:16Z

@@ -0,0 +1,5 @@
+module aeload
+
+go 1.22


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

What is the latest patched Go release in the 1.22 line, and which GO advisories are fixed there (including GO-2026-4870, GO-2026-4918, GO-2026-4980)?

💡 Result:

As of June 21, 2026, the latest release in the Go 1.22 line is Go 1.22.12 [1][2]. Regarding the specified Go security advisories: - GO-2026-4870: This advisory affects crypto/tls and is noted as affecting versions before go1.25.9, as well as versions from go1.26.0-0 before go1.26.2 [3]. It is not listed as affecting the 1.22 release line. - GO-2026-4918: This advisory concerns net/http and golang.org/x/net/http2 [4][5]. It affects net/http versions before go1.25.10 and versions from go1.26.0-0 before go1.26.3 [4]. It is not listed as affecting the 1.22 release line. - GO-2026-4980: This advisory concerns html/template and affects versions before go1.25.10, as well as versions from go1.26.0-0 before go1.26.3 [6]. It is not listed as affecting the 1.22 release line. Go 1.22 has reached a state where it is no longer receiving the same security updates as newer Go release lines (such as 1.25 and 1.26) [2][3][4][6]. Users are encouraged to upgrade to a supported release version [2].

Citations:

1: https://go.dev/doc/devel/release

2: https://endoflife.date/go

3: https://pkg.go.dev/vuln/GO-2026-4870

4: https://pkg.go.dev/vuln/GO-2026-4918

5: https://nvd.nist.gov/vuln/detail/CVE-2026-33814

6: https://pkg.go.dev/vuln/GO-2026-4980

Upgrade to a currently supported Go release.

Go 1.22 is no longer receiving active security updates. The specified Go advisories (GO-2026-4870, GO-2026-4918, GO-2026-4980) do not affect the 1.22 release line but affect newer versions. Upgrade to the latest 1.25 or 1.26 release to ensure the loadgen binaries run on a supported runtime baseline.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/e2e_test/adaptive_export_loadtest/tools/loadgen/go.mod` at line 3, The Go version directive in the go.mod file is set to version 1.22, which is no longer receiving active security updates. Update the go directive from version 1.22 to version 1.26 (or 1.25 if necessary) to ensure the loadgen binaries run on a currently supported Go runtime baseline and receive ongoing security patches.

Source: Linters/SAST tools

…#67) ae-prod's boundary handling accumulates the seen-fingerprint set on a no-progress tick, but if >PollLimit rows share one normalized event_time the SQL (>= watermark ORDER BY time LIMIT N, no secondary key) returns the same N boundary rows every poll → rows beyond N at that timestamp are never emitted (infinite boundary). Detect all-skipped-at-capacity and advance the watermark by 1ns to make forward progress (fingerprint dedup already tolerates the 1ns overlap). Cherry-picked the trigger fix + its test from the stale CodeRabbit-chat PR #67 (687851d); dropped that PR's unrelated gen-pod.tmpl.yaml churn. #67 itself is NOT mergeable (77 commits behind ae-prod, re-adds deleted terraform).

entlein · 2026-06-22T21:32:00Z

there are indications that it causes PEM to crash, needs RCA

Entlein and others added 30 commits June 21, 2026 22:49

adaptive_export: production AE — streaming export + write-integrity

c6515fd

Signed-off-by: entlein <einentlein@gmail.com>

e2e_test/adaptive_export_loadtest: document AE implied contracts (C1-…

1001f6d

…C14) + diagrams Signed-off-by: entlein <einentlein@gmail.com>

adaptive_export_loadtest: exp_control uses real now_s event_time (no …

808a3ea

…future-stamp watermark poison) Signed-off-by: entlein <einentlein@gmail.com>

harness: log4shell_fire.sh — detection-signal framing (Cyber Verifica…

239e032

…tion) Reword from offensive 'exploit' to detection-signal-generation language: this validates the kubescape->DX->AE detection chain. No logic change. Signed-off-by: entlein <einentlein@gmail.com>

Entlein and others added 13 commits June 21, 2026 22:49

adaptive_export: restore executable bits on harness scripts (post-hea…

e9c1331

…der) Signed-off-by: entlein <einentlein@gmail.com>

Base automatically changed from ae-prod to main June 21, 2026 21:03

coderabbitai Bot requested changes Jun 21, 2026

View reviewed changes

entlein mentioned this pull request Jun 22, 2026

📝 CodeRabbit Chat: Fix ClickHouse trigger in adaptive export service #67

Closed

entlein mentioned this pull request Jun 22, 2026

feat(ae-control): TLS on the control surface (CONTROL_TLS) #71

Open

		isolates each rep even with overlapping windows). Timestamps are explicit unix
		nanos — fixtures NEVER use wall-clock `now()`.

Conversation

entlein commented Jun 21, 2026