Skip to content

AE control auth + CodeRabbit #53 followup#68

Open
entlein wants to merge 45 commits into
mainfrom
ae-followup-auth
Open

AE control auth + CodeRabbit #53 followup#68
entlein wants to merge 45 commits into
mainfrom
ae-followup-auth

Conversation

@entlein

@entlein entlein commented Jun 21, 2026

Copy link
Copy Markdown

AE control auth + CodeRabbit #53 followup

Followup to #53 (merge #53 first — this stacks on ae-prod; I'll retarget to main once #53 lands). Captured all 48 inline + 8 review + 12 issue CodeRabbit comments to biz/PoC/log4j/ae_pr53_coderabbit_feedback/ (JSON + DIGEST.md) so nothing is lost.

✅ Done in this PR

  • Bearer-JWT auth on the control surface — uses the same shared lib the broker/PEM use (px.dev/pixie/src/shared/services/utils); SetAuth verifies via jwtutils.ParseToken (signature + audience), middleware on Handler(), /healthz exempt. dx already mints this exact service JWT — it just attaches it. No new secret/crypto. Flag-gated (CONTROL_REQUIRE_AUTH, default off) for safe incremental rollout. Unit test mints a real JWT and asserts 401/pass/healthz. (CodeRabbit server.go: protect control endpoints.)
  • Input validation — reject invalid t_end (≤0) and query windows (start≥end, non-positive). (server.go: reject invalid t_end/window.)

Companion change (separate, entlein/dx)

dx aeclient.go must attach Authorization: Bearer <minted JWT> on its AE control POSTs. Then flip CONTROL_REQUIRE_AUTH=true + mount PL_JWT_SIGNING_KEY on AE. (Tracked; small.)

Already resolved on ae-prod (verified — no action)

  • http.Server timeouts (main.go) — already added (CodeRabbit r3379377432).
  • ~9 harness Majors (ae_config/ae_vs_all/deploy_ae/exp_datavolume_extreme/exp_log4j_reps/… fail-fast) — those scripts were deleted in the harness consolidation; moot.
  • The http_events=0 "bug" — self-retracted by the author (misdiagnosis).

Remaining AE-code Majors — triaged (to address as commits here)

Safe / clear:

  • streaming/filter.go:196 — timer leak on deltaCh close.
  • passthrough/passthrough.go:147 — per-table query/write timeouts.
  • pixieapi/pixieapi.go:89 — direct-mode TLS-skip uses wrong option (→ WithDirectTLSSkipVerify).
  • stats.py:57 — guard false "EXACT" when zero PASS rows.
  • schema.sql:485 — retention on ae_reconcile (unbounded growth).
  • controller.go:347 — don't fan out Pixie rows when attribution persistence fails.

Risky / needs careful testing before a presentation — defer with care:

  • controller.go:255 — rehydrated active windows never restart fan-out (state-machine path).
  • trigger/clickhouse.go:303 — inclusive-watermark paging (this is the F8 watermark area — change only with the F8 fixtures).
  • schema.sql:430adaptive_attribution ORDER BY key width (schema change — migration risk).
  • sink/clickhouse.go:433Record blocks hot path on CH (async refactor — perf-sensitive).

Entlein and others added 30 commits June 21, 2026 22:49
Signed-off-by: entlein <einentlein@gmail.com>
New env-gated background loop that runs the same PxL shape AE's
anomaly-gated path uses, but with an empty Target (no ns/pod predicate)
and over a configurable rolling window. Writes via the existing sink so
the byte-shape of forensic_db rows is comparable between the
PASSTHROUGH=1 phase (EVERYTHING) and the PASSTHROUGH=0 phase
(AE-FILTER). One-shot A/B that yields the per-table capture fraction of
the adaptive write path.

- internal/passthrough/passthrough.go — Loop + Config; defaults to
  30s window / 30s refresh / clickhouse.PixieTables() table list.
- internal/passthrough/passthrough_test.go — 6 tests; the load-bearing
  one is TestLoop_EmitsEmptyTargetPxL (asserts neither df.namespace nor
  df.pod predicates appear in the emitted PxL).
- cmd/main.go: ADAPTIVE_PASSTHROUGH + _WINDOW_SEC + _REFRESH_SEC env
  knobs. Adapter is constructed unconditionally when passthrough is on
  (joins the existing PushPixie / streaming construction path so the
  same pxapi grpc stream is reused). Loop is registered with the
  shutdown WaitGroup so SIGTERM waits for the in-flight tick.
- cmd/BUILD.bazel: drop @px// load (other AE BUILD.bazel files use
  //bazel — sticking out as the only one with @px is a leftover from a
  prior gazelle run; align). Add passthrough dep.

Signed-off-by: entlein <einentlein@gmail.com>
Stand-alone workflow that builds entlein/dx (private Active-Diagnosis
Framework) into ghcr.io/k8sstormcenter/dx-daemon. Separates the dx image
publish from the bazel-based vizier_release pipeline; the dx repo ships
its own Dockerfile.dxd (Go cross-compile + distroless final stage) so it
doesn't need to live as a submodule inside src/vizier/services/dx.

Triggers:
- tag push 'release/dx/v*' on this repo cuts a release build, image tag
  derived from the tag suffix (release/dx/v0.1.0 -> image tag 0.1.0).
- workflow_dispatch lets us build any dx ref on demand with a custom tag
  (default: short sha of the resolved dx commit).

Pulls dx via DX_ENTLEIN_PAT (already configured on the repo). Multi-arch
build (linux/amd64, linux/arm64); Dockerfile.dxd cross-compiles in the
native BUILDPLATFORM stage and the final stage is COPY-only, so target
emulation isn't required.

Signed-off-by: entlein <einentlein@gmail.com>
The dx image build pipeline lives in entlein/dx itself (PR #53,
branch feat/bazel-release): bazel-based with @px external pin to
pixie's ae-prod tip, pushes to docker.io/entlein/dx-daemon on a
release/dx/v* tag in the dx repo. The pixie-side buildx workflow
this reverts duplicated that intent in the wrong repo + the wrong
build system (docker buildx instead of bazel + pl_go_image macros)
+ the wrong registry (ghcr.io/k8sstormcenter instead of
docker.io/entlein).

Signed-off-by: entlein <einentlein@gmail.com>
…affordances

Fixes the silent-halt bug: the trigger gated on a RAW event_time high-water-mark,
so a single anomaly in a larger unit (ms/ns) drove the watermark past all real
seconds rows and AE stopped processing forever (data still on Pixie). Normalize
event_time to canonical nanoseconds in the poll SELECT filter+order and in the
in-memory/persisted cursor, boundary-dedup, and maxSeen (normalizeEventTimeNanos
+ chNormEventTimeNanos). Validated at the data layer: vs a poisoned watermark the
raw filter returns 0 rows, the normalized filter recovers all 60.

Also adds ADAPTIVE_PUSH_REFRESH_SEC (negative = single-shot pull) for the
reproducible load-test harness, an in-package trigger unit test, and an e2e
hermetic load test (mock PixieQuerier, exact rows+bytes).

Signed-off-by: entlein <einentlein@gmail.com>
…ness

Consolidate the adaptive_export load-test harness under src/e2e_test/ (matching
vzconn_loadtest / px_cluster conventions). Control-plane experiments (E1-E4, E6,
E8 sustained) are proven exactly-reproducible on a live rig; the data-plane
experiments (E5, E8 data-mode) are authored and pending live validation on a
vizier-registered rig (status documented in README + FINDINGS_AND_BACKLOG).

- harness/: shell + python (inject, exp_control, exp_e8, stats, ...) + lib helpers
- fixtures/EXPERIMENTS.md: curated kubescape_logs data-set catalog + expected outputs
- k8s/: isolated sinks + per-rep generator pod (no probes)
- tools/loadgen/: cleanloadgen + httpsink (docker-built test tool; .bazelignore'd
  pending a bazel target — lib/pq is already vendored in the module)
- FINDINGS_AND_BACKLOG.md: F8 watermark-poison bug + the fix + AE backlog

The AE Go unit/e2e tests live with the service (internal/{trigger,e2e}).

Signed-off-by: entlein <einentlein@gmail.com>
…C14) + diagrams

Signed-off-by: entlein <einentlein@gmail.com>
…iagram; gen sustained-DNS mode

C15 = AE must keep re-pulling+writing an active pod until t_end or DX stop (the
contract DX steers on; last week's 'wrote then stopped' is its violation). Add
DX-steering sequence diagram. Generator gains SUSTAIN_SEC (distinct-DNS trickle,
a Pixie-traced protocol) + configurable SETTLE_PRE_MS warm-up for fresh-pod
capture; harness wires GEN_SUSTAIN_SEC/GEN_SETTLE_MS.

Signed-off-by: entlein <einentlein@gmail.com>
…lisation

The 700821d trigger unit-normalisation wrapped event_time in a
multiIf(...) inside both the WHERE filter and the ORDER BY. Three
existing tests in watermark_test.go + one in clickhouse_test.go pinned
the raw 'event_time >= N' substring and broke at HEAD.

Update each test's expected substring to match the new normalized form
(') >= <ns-scaled N>' — the closing paren of multiIf, then the value
in canonical nanoseconds). Per-test ns-scaling:

  watermark_test.go:94  1744000000000000000  already ns -> unchanged
  watermark_test.go:125 InitialWatermark=42  < 1e10 sec -> * 1e9
  watermark_test.go:156 InitialWatermark=7   < 1e10 sec -> * 1e9
  watermark_test.go:297 event_time='5000'    < 1e10 sec -> * 1e9
  clickhouse_test.go:82 ref.T=1744..303e9 ns already ns -> unchanged

go test ./src/vizier/services/adaptive_export/... all green.

Signed-off-by: entlein <einentlein@gmail.com>
…future-stamp watermark poison)

Signed-off-by: entlein <einentlein@gmail.com>
Records one forensic_db.ae_reconcile row per data-plane pull (read_count vs
wrote_count, window, ns/pod) across ALL three write paths — controller fan-out
(filter), passthrough firehose, and streaming scanner — so a reconcile run
localizes loss to query (R5: read<PEM) vs sink (R6: wrote<read) and quantifies
re-pull dup (C8). Counts alone (write >= read) were proven insufficient.

- new internal/reconcile leaf package (Row, Recorder, Nop) — no import cycle
- sink.Record: CH-backed recorder (INSERT INTO forensic_db.ae_reconcile)
- ae_reconcile table: schema.sql + KnownTables + OperatorOwnedTables (synced);
  not a pixie table (absent from PixieTables, so VerifyPixieSchema ignores it)
- wired: passthrough.tick, controller.pushPixieRows (deferred, all return
  paths), streaming.scanner.Run; gated by ADAPTIVE_RECONCILE=true, else Nop
- unit test proves read/wrote capture incl. the sink-drop read>wrote shape
- fixed apply_test trailing-tables guard for the new operator table
- harness: exp_row_reconcile.sh (row-level PEM<->CH), ae_vs_all.sh, exp_datavolume_extreme.sh

Signed-off-by: entlein <einentlein@gmail.com>
px -o json empty result previously printed one blank line → counted as a phantom
LOSS=1. Guard: empty set → 0-byte keys file; drop all-empty-field keys. Confirmed
against the controlled log4j run (backend http 14/14 exact, conn 66>=12, no loss).

Signed-off-by: entlein <einentlein@gmail.com>
…log4shell

Reliable BY CONSTRUCTION against bob#140 (stateful/unreliable exploit on re-fire):
fresh-JVM backend (delete pod) + attacker-before-backend + the WORKING resolvable FQDN
attacker.<ns>.svc.cluster.local:1389 (NOT the bare attacker-ns.svc which NXDOMAINs and
gets dropped), then VERIFY the actual backend->:1389 LDAP egress in forensic_db.conn_stats
and RETRY until confirmed (the validity gate). Never assumes the exploit fired. Node-side.

Signed-off-by: entlein <einentlein@gmail.com>
…tion)

Reword from offensive 'exploit' to detection-signal-generation language: this validates
the kubescape->DX->AE detection chain. No logic change.

Signed-off-by: entlein <einentlein@gmail.com>
… http2

- pxl.CompilePassthrough/Render: precompile per-table PxL once (fixed
  window => constant relative start_time), only the two time_ bounds are
  stamped per tick. Rendered output is byte-identical to QueryFor with an
  empty Target (TestCompilePassthrough_MatchesQueryFor), so this is a
  structural change, not a capture change. upid->pod/ns stays in PxL.
- passthrough: tickConcurrent fans every table out at once (was a serial
  loop); shared pull() helper. Sink/recorder are pool/HTTP-backed and
  already called concurrently elsewhere.
- drop http2_messages.beta from the firehose set (not materialised on
  every cluster => ""Table not found"" every tick); shared PixieTables/DDL
  lists untouched.
- toggle ADAPTIVE_PASSTHROUGH_COMPILED (default on; =false reverts to the
  legacy serial QueryFor path).

Signed-off-by: entlein <einentlein@gmail.com>
…e.go

Fixes "missing strict dependencies: import of .../internal/reconcile" that
broke the AE image build for passthrough, sink, streaming, controller, cmd
(pre-existing since the ADAPTIVE_RECONCILE commit added the package + imports
without bazel deps; never CI-built). Also wires the new pxl/compile.go srcs
+ passthrough/pxl test srcs (compiled_test.go, reconcile_test.go, compile_test.go).

- new internal/reconcile/BUILD.bazel (go_library, stdlib-only)
- +//internal/reconcile dep: passthrough, sink, streaming, controller, cmd
- pxl go_library +compile.go; pxl_test +compile_test.go; passthrough_test +compiled_test.go,reconcile_test.go (+reconcile dep)

Signed-off-by: entlein <einentlein@gmail.com>
F1 RCA: Pixie caps every px.display at max_output_rows_per_table (default
10000, query_flags.go) — the planner add_limit_to_batch_result_sink_rule
silently truncates wide firehose windows / busy pods at the READ (write path
is clean: ae_reconcile shows read==wrote). Fix uses Pixie own native knob —
prepend `#px:set max_output_rows_per_table=1000000` to every generated PxL
(QueryFor + CompilePassthrough) so all pull paths are uncapped. Validated on
rig: a 14208-row window returned 10000 (capped) vs 14298 (with flag). No
pagination loop, no extra round-trips. See memory project-ae-passthrough-10k-cap.

Signed-off-by: entlein <einentlein@gmail.com>
Consolidates the recurring content_type silent-drop incident class
into one default-suite test gate (6 tests, ~15ms):

  I1 TestContract_ContentTypeIsInt64InSchema
  I2 TestContract_FastEncodeContentTypeAsInt
  I3 TestContract_SilentDropDetected
  I3.b TestContract_SilentDropNotTriggeredOnSuccess
  I3.c TestContract_SilentDropToleratesMissingSummaryHeader
  I4 TestContract_HTTPEventsRoundTrip

Top-of-file docstring chronicles the incident timeline so future
operators can grep their way to the contract.

Signed-off-by: entlein <einentlein@gmail.com>
Measures the AdaptiveExport value prop: datavolume REDUCTION of DX-steered AE
(rev-3 streaming, AE writes only DX-steered activeSet pods over the control
surface) vs saving ALL data (passthrough firehose). Two arms, same fixed load,
forensic_db active-part deltas (rows+bytes) per table; reduction = 1 - DX/ALL.
Uses the canonical resolvable JNDI FQDN (attacker.attacker-ns.svc.cluster.local)
so the chain fires + DX can classify (a malformed host → NXDOMAIN → no steer).
Successor to ae_vs_all.sh, whose AE arm used the rev-2 controller gate + stale JNDI.

Signed-off-by: entlein <einentlein@gmail.com>
Measures all AE non-functional requirements under steady load on the rig:
throughput (rows+bytes/sec), capture completeness (AE read vs broker count =
F1 cap proof), write fidelity (read==wrote + write-error count), end-to-end
freshness latency (now - max(time_) in CH), resource footprint (AE pod cpu/mem
idle vs loaded), per-cycle cadence. Emits a structured report; companion to
exp_dx_steering_reduction.sh. Real-data only.

Signed-off-by: entlein <einentlein@gmail.com>
…ring + live-pod guard)

Run-1 reported false 100% reduction because stale adaptive_attribution windows
rehydrated DEAD pods (deleted loaders) into the activeSet → AE streamed dead pods
→ 0 rows. Clear adaptive_attribution before the DX arm so the activeSet only gets
freshly-steered LIVE pods; add a guard that prints the steered pods + marshalsec
fire count + live log4j-poc pods so a dead-arm result is caught, not reported as
a reduction.

Signed-off-by: entlein <einentlein@gmail.com>
lag query used now()-DateTime64 (type error -> na); use dateDiff(second,...).
Capture-completeness vs broker was window-misaligned (623% artifact) -> drop it;
report tot_read vs tot_wrote (read==wrote) + errs instead. The 10k-cap/completeness
proof is the dedicated F1 test (max_read>10000 vs broker for the SAME window).

Signed-off-by: entlein <einentlein@gmail.com>
…ary)

Run-2 byte-delta reduction came out negative because system.parts byte delta is
compaction-noisy (merges land mid-window). Report rows reduction as primary
(actual captured-row count, noise-free); keep bytes as secondary context.

Signed-off-by: entlein <einentlein@gmail.com>
Eviction-RCA finding (PR #63 NFR campaign): AE had NO memory limit (only cpu
300m) and was CPU-pinned at 300m under concurrent passthrough. AE measured tiny
(16-38Mi steady), but the raised 1M-row passthrough cap can spike, so cap at 1Gi
so AE can never memory-pressure a node; raise cpu limit 300m->1 core (was
throttling). NOTE node evictions were NOT AE/OOM — node-01 went NotReady
(network/heartbeat); the memory consumer is PEM (1365Mi, OOMs at the 2Gi default).

Signed-off-by: entlein <einentlein@gmail.com>
Root cause of the recurring "AE unauthenticated / writes 0 / crashloop" reverts:
kustomization.yaml bundled adaptive_export_secrets.yaml (placeholder pixie-api-key)
with the role+deployment, so EVERY infra re-apply (make log4j) clobbered the real
key that ae-auth had written. Separation of concerns: remove the secret from the
kustomization — infra (role+deployment) stays re-appliable; the secret holds real
creds and is owned solely by `make ae-auth`, created once, never touched by infra
re-applies. Secret manifest kept as a hand-applied seed-only template (documented).

Signed-off-by: entlein <einentlein@gmail.com>
The DX-steered arm was failing because the harness only fired stage-1 (JNDI/LDAP).
That generates ldap-egress but NO kubescape R0001 → backend never flagged → DX no
case → indeterminate → AE steers wrong/no pods. R0001 comes from stage-2 (post-
exploitation exec). fire() now does stage-1 (JNDI) + stage-2 (whoami/shadow/token/
getent in the backend) → kubescape R0001+R0006 → DX rules backend MALIGNANT →
backend enters AE activeSet → reduction is measurable. Verified live: DX evidence
unexpected-spawn+sensitive-file-read → verdict ruled_in generic=MALIGNANT.

Signed-off-by: entlein <einentlein@gmail.com>
…dd DX-steering diagnostics

Standing terminology rule: allowlist/blocklist, never whitelist/blacklist.
Pure rename (no behavior change) of the rev-3 streaming filter:
  FilterModeWhitelist  → FilterModeAllowlist
  MaxWhitelistSize     → MaxAllowlistSize
  ADAPTIVE_STREAM_MAX_WHITELIST → ADAPTIVE_STREAM_MAX_ALLOWLIST  (env)
  mode=whitelist log string → mode=allowlist
plus all comments/identifiers/tests in streaming, activeset, cmd/main.

DX-steering diagnostics (the reason DX-arm-writes-0 has been hard to RCA —
we could not tell "empty ActiveSet" from "broker returned 0 rows"):
  - scanner: log the empty-allowlist short-circuit (was silent) so an
    empty ActiveSet is visible in logs, distinct from "query completed rows=0".
  - FilterUpdater: emitted-filter log Debug→Info so the steered pod count
    per ActiveSet change is visible without debug logging.

NOTE: ADAPTIVE_STREAM_MAX_WHITELIST env renamed → tooling that sets the old
name must switch to ADAPTIVE_STREAM_MAX_ALLOWLIST.

Signed-off-by: entlein <einentlein@gmail.com>
The Pixie dx_evidence_graph UI reads dx_attack_graph via px.DataFrame
clickhouse_dsn, whose query template hardcodes event_time + hostname and
ORDER BY event_time. A table without those columns fails 'Unknown
identifier event_time'; a table created by hand (local, not via the
operator) isn't globally registered. Fix: make AE own it like the other
forensic tables.

- schema.sql: dx_attack_graph DDL with event_time(UInt64 nanos) + hostname,
  edge columns, fromUnixTimestamp64Nano partition/TTL (nanos-correct).
- KnownTables + OperatorOwnedTables: register it so Apply creates it at boot.
- apply_test: assert last-applied DDL == last OperatorOwnedTables entry
  (robust to appended operator tables) instead of hardcoding trigger_watermark.

go test ./.../clickhouse green.

Signed-off-by: entlein <einentlein@gmail.com>
Pixie's clickhouse_dsn type mapper reads UInt8 as BOOLEAN and does not
handle UInt16/UInt32/Float32 -> px fails with 'Column[N] given incorrect
type' rendering the dx_evidence_graph. weight/max_severity/num_findings
-> Int64, confidence -> Float64 (map cleanly to INT64/FLOAT64). Verified
live: px run returns all 6 edges with every column. event_time stays
UInt64 (matches kubescape_logs, which px reads).

Signed-off-by: entlein <einentlein@gmail.com>
…canner buildPxL

The DX/streaming arm silently capped each per-table pull at Pixie's default
10000-row limit while the passthrough/ALL arm (pxl.CompilePassthrough /
QueryFor) already raises it to 1,000,000 via the broker's #px:set query flag.
Validated live on 6a33dac0: a single streaming http_events pull returned
exactly rows=10000 (the cap). Left unfixed this UNDER-counts the DX arm and
OVERSTATES the DX-vs-ALL volume reduction. Prepend the same #px:set directive
to the streaming scanner's PxL so both arms are uncapped and comparable.

Signed-off-by: entlein <einentlein@gmail.com>
Entlein and others added 13 commits June 21, 2026 22:49
Adds the rule-ins-only view (condition != '') to the canonical schema.sql,
registers it in KnownTables + OperatorOwnedTables (after dx_attack_graph), and
teaches DDL() to extract CREATE VIEW headers. AE now creates it on boot so the
dx_evidence_graph UI's default malicious-only read is standard, not a per-rig
manual step. Tests updated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: entlein <einentlein@gmail.com>
dx POSTs a JSON array of edges to /dx/attack_graph; AE writes them to
forensic_db.dx_attack_graph via JSONEachRow (Applier.WriteAttackGraph). Wired in
main.go when CONTROL_ADDR is set; 501 if the sink is unset. This is the AE half
of the dx->AE->CH attack-graph write path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: entlein <einentlein@gmail.com>
…e, fix stale tests

PR-53 review follow-ups (see review summary in conversation):

1. License headers added to 17 src/e2e_test/adaptive_export_loadtest/
   harness/*.{sh,py} scripts. Matches the convention every other
   src/e2e_test/*/sh in pixie already follows.

2. Dead code removed (was unreachable per golang.org/x/tools/cmd/deadcode):
   - internal/script/script.go: IsClickHouseScript, IsScriptForCluster,
     GetActions, getScriptName, getInterval, templateScript, plus the
     ScriptConfig and ScriptActions types. The cron-script sync flow they
     served was replaced by the streaming model; only the Script and
     ScriptDefinition types remain.
   - internal/pixie/pixie.go: Client.GetPresetScripts (replaced by
     builtinPresetScripts() inline in cmd/main.go).
   - internal/streaming/{supervisor,writer}.go: SupervisorStats,
     TableStats, Stats types + Supervisor.Stats and BatchWriter.Stats
     methods (no production reader). Atomic counters dropped; the
     existing flush log preserves the per-flush summary.

3. Stale passthrough tests fixed.
   TestLoop_DefaultsTablesToPixieTables and TestNew_AppliesDefaults
   asserted len(clickhouse.PixieTables()) == 13 but passthrough.New
   strips excludedTables (http2_messages.beta), yielding 12. Tests now
   compare against filterExcluded(clickhouse.PixieTables()) and add an
   extra assertion that excluded tables were not written.

4. ClickHouse HTTP client consolidation: new internal/chhttp/ package
   collapses three near-identical HTTP CH clients
   (clickhouse.Applier, sink.ClickHouseHTTP, trigger.ClickHouseWatermarkStore)
   into one. Centralises endpoint validation, basic-auth header,
   30s default timeout, fail-loud INSERT settings (4 CH input_format
   knobs), and the X-ClickHouse-Summary read path. The 4 INSERT
   call sites in the three callers all route through chhttp.Client.Insert
   now; SELECT through Query or QueryStream (the latter preserves the
   QueryActive streaming behaviour). Net code: -200 LOC across the
   three callers plus a 200-LOC chhttp package with its own tests.

5. Pixie service scaffold wired into cmd/main.go:
   services.SetupService("adaptive-export", 50900) +
   services.SetupSSLClientFlags() + services.PostFlagSetupAndParse() +
   services.SetupServiceLogging(). Matches the pattern every other
   pixie Go service uses. CheckServiceFlags() is deliberately skipped
   (AE does not run a TLS gRPC server). Existing AE env-var reads
   (ADAPTIVE_*) are untouched and still authoritative for tuning knobs.

All 14 adaptive_export internal packages pass go test (1 new chhttp,
13 unchanged), including the 3 differential oracle tests in
pxl/compile_test.go. arc lint OKAY for everything except 3 pre-existing
findings unrelated to this commit (loadgen has its own go.mod; one
QF1001 De Morgan's-law nit in reconcile_test.go from c9f19b6).

Signed-off-by: entlein <einentlein@gmail.com>
…der)

Signed-off-by: entlein <einentlein@gmail.com>
User flagged on review 4536971862:
- cmd/BUILD.bazel:18 dropped the '@px' external-workspace prefix on
  pl_build_system.bzl load. Restored — the standalone AE build pulls
  pixie as @px and needs the qualifier.

Lint cleanup (PR-53 scope, no production code touched outside renames):
- chhttp/chhttp_test.go: errcheck on two w.Write calls + gofumpt on
  the gotSettings declaration.
- passthrough/reconcile_test.go: staticcheck QF1001 (De Morgan's law)
  on the conn_stats sink-drop assertion.
- script/script.go: rename ScriptId -> ScriptID (ST1003). Propagated
  to pixie/pixie.go and cmd/main.go callsites.
- internal/e2e/BUILD.bazel and internal/trigger/BUILD.bazel: gazelle
  drift — adding loadtest_test.go and clickhouse_internal_test.go to
  srcs.
- k8s/00-sinks.yaml + gen-pod.tmpl.yaml: yamllint compliance —
  document-start marker, dedent sequence items per .yamllint
  indent-sequences=false, tighten flow-mapping spaces, collapse
  multi-space after commas. YAML semantics unchanged.
- harness/stats.py: flake8 E501 — split a long line into two.

Final arc lint state on PR-53 file scope: 0 Errors, 14 Warnings
(SHELLCHECK SC2155/SC2086 in pre-existing harness scripts from
ae7b86f, not introduced or modified by this commit). Pre-existing
loadgen typecheck failures (separate go.mod) are unaffected.

Signed-off-by: entlein <einentlein@gmail.com>
…bit items

User review 4536971862:

#2 passthrough/reconcile_test.go — strengthened with two new tests that
   exercise the FULL chain (loop → real sink.ClickHouseHTTP → httptest
   CH endpoint → reconcile recorder). TestTick_ReconcileCatchesCHSilentDrop
   mimics CH's X-ClickHouse-Summary silent-drop shape (200 OK,
   written_rows=0) and asserts the loop records WroteCount=0 with a
   silent-drop attribution — the exact R6 (sink-layer loss) regression
   the instrument exists to detect. TestTick_ReconcileAttributesCHFailureCorrectly
   covers the 500-response branch. The pre-existing in-process-fake test
   stays as the wiring check.

#4 pixie/pixie.go — added a long comment justifying the API-key auth
   choice. pixie.Client targets the Pixie CLOUD (cloudpb's PluginService),
   whose auth interceptor accepts pixie-api-key and rejects JWT service
   tokens (those are for INSIDE-cluster vizier services). The
   pixieapi.Adapter that talks to vizier directly already uses JWT via
   jwtutils.GenerateJWTForService — same pattern as
   cloud_connector/vizhealth/checker.go:111. So the split is intentional;
   flipping pixie.go to JWT would break cloud auth, not improve it.

#6 pxl/queryfor — hardened escapePxL so a raw \n, \r, \t or NUL byte
   in Target.Pod/Target.Namespace can't terminate the PxL string literal
   and inject a new statement. Added TestQueryFor_RejectsInjectionInTargetFields
   driving QueryFor with 7 adversarial pod/namespace shapes (newline,
   single-quote-only, CR, backslash-escape-of-escape, NUL, tab) plus
   the regex_match fallback path, asserting the output line count and
   every statement's leading token. Extends existing
   TestEscapePxL_TableDriven with the new escape mappings.

CodeRabbit oversights (verified each against current code):

CR r3379377432 (main.go) — wrapped the control-surface listener in
   http.Server with Read/ReadHeader/Write/Idle timeouts so a slow
   client can't pin a goroutine indefinitely.

CR r3379377607 (pixieapi.go) — switched direct-mode dial from
   pxapi.WithDisableTLSVerification (env + addr-gated, brittle) to
   pxapi.WithDirectTLSSkipVerify (added in PR #49 b523ce3 for the
   same node-IP-dial scenario). Removed the cluster.local + PX_DISABLE_TLS
   precondition in NewDirect; the always-skip semantics match the AE
   deployment shape. Refactored the obsolete env-gate test.

CR r3379377645 (streaming/filter.go) — the deltaCh-close path returned
   without calling disarm(), leaking the timer's goroutine on shutdown.
   Now calls disarm() before return on chan-close.

CR r3426923299 (sink/clickhouse.go Record) — capped Record at a 2s
   per-call context timeout. The 30s chhttp default was too long for
   the scanner/passthrough/controller hot paths that call Record inline;
   reconcile is best-effort by contract, so a stalled CH must not pin
   the pull loop.

All 14 AE packages pass go test. arc lint: 0 Errors on PR-53 file
scope.

Signed-off-by: entlein <einentlein@gmail.com>
Finish the harness consolidation #53 started: adopt exp_matrix.sh (canonical
ALL-vs-DX reduction matrix) + nfr.sh (throughput/mem/verdict-latency) as the
single runners; cut the overlapping variants (ae_vs_all, exp_datavolume_extreme,
exp_dx_steering_reduction, exp_ae_nfr_benchmark, exp_pipeline_reconcile) and the
superseded standalone setup (deploy_ae, build_gen_image). README rewritten as the
single 'how to run' source: two families — fixture-isolation (run.sh + E-series)
and live-attack e2e (log4shell_fire → exp_matrix → nfr → exp_row_reconcile).

Add e2e_log4shell_soc.yaml: k3s + full SOC stack (Pixie/kubescape/ClickHouse/AE/dx
via k8sstormcenter/soc) on the oracle runner; asserts every canonical harness
script runs + dx rules in; profiles dx in real life (pprof CPU/heap + verdict
latency, uploaded). Uses existing repo secrets.

Signed-off-by: entlein <einentlein@gmail.com>
…ntainer_images BUILD

This file is unrelated to adaptive_export — the only change was buildifier
alphabetizing container_type/bazel_sdk_versions (no functional change). Restore
main's version to keep #53's diff to real AE changes.

Signed-off-by: entlein <einentlein@gmail.com>
run-genfiles: the PR had reordered the kwargs in stirling/.../container_images/
BUILD.bazel alphabetically (bazel_sdk_versions before container_type) — a
local buildifier or gazelle drift from when the file was first committed.
CI gazelle wants the original order (container_type first), so the diff
loop fails. Restored to origin/main's ordering. Local gazelle disagrees
with CI's expected output (tooling-version drift); CI is authoritative.

run-container-lint: two findings.

  1. staticcheck QF1001 (De Morgan's law) in
     pxl/queryfor_test.go:318. Rewrote the !(a || b || c || d) negated
     disjunction as the equivalent !a && !b && !c && !d. Test behaviour
     unchanged; verified locally with TestQueryFor_RejectsInjection.

  2. golangci-lint typechecking failed on
     src/e2e_test/adaptive_export_loadtest/tools/loadgen/cmd/{cleanloadgen,
     httpsink}/main.go because that subtree carries its own go.mod and
     does not resolve as a package under the root px.dev/pixie module.
     Added the loadgen subtree to .arclint's exclude list. Same fix the
     existing entries for other-module subtrees apply.

Local 'arc lint' on the PR-53 file scope: 0 Errors, 14 SHELLCHECK
Warnings + 26 SHELLCHECK Advice (all pre-existing in ae7b86f
harness scripts; not introduced by this PR).

Signed-off-by: entlein <einentlein@gmail.com>
run-genfiles CI re-failed after f244ffc reverted this file to main's
ordering — turns out gazelle on this repo IS alphabetizing the kwargs
(bazel_sdk_versions before container_type), and CI runs 'gazelle fix'
then 'git diff' to catch any drift. So main's ordering is no longer
gazelle-stable; the file has to match gazelle's preference, not main's.

Verified locally with 'bazel run //:gazelle -- fix'; the file is now
idempotent (a second gazelle run produces no diff).

Signed-off-by: entlein <einentlein@gmail.com>
run-container-lint re-failed after the run-genfiles fix because two
files added on this branch (a03aa15) had unfixed lint errors:

.github/workflows/e2e_log4shell_soc.yaml — 5 yamllint Errors:
- 1 indentation: list items under steps: must be parent-aligned per
  the repo's .yamllint config (indent-sequences: false), not 2-indented.
  Dedented every step item + its run: block by 2 spaces.
- 4 line-length (>120 chars): split the long kubectl-set-image, the
  long grep-detection gate, the curl pprof URL, and the verdict-latency
  grep across continuation lines. Semantics unchanged.

src/e2e_test/adaptive_export_loadtest/harness/{exp_matrix.sh, nfr.sh}
— missing Apache headers. Applied via arc lint --apply-patches; exec
bits restored.

Local arc lint on PR-53 file scope: 0 Errors, 14 SHELLCHECK Warnings
+ 26 Advice (all pre-existing in harness scripts, unchanged).

Signed-off-by: entlein <einentlein@gmail.com>
arc lint --apply-patches exits non-zero on Warning level too. Resolved
each: replaced unused 'for i/t in ...' loop vars with '_', split a
SC2155 declare-and-assign, dropped two never-referenced hip/pip
assignments.

Signed-off-by: entlein <einentlein@gmail.com>
…owup)

AE control endpoints had no auth. Add it using the SAME shared lib the vizier
broker/PEM use (px.dev/pixie/src/shared/services/utils): SetAuth verifies a
bearer JWT via jwtutils.ParseToken (signature + audience), middleware on
Handler() with /healthz exempt. dx already mints this exact service JWT
(GenerateJWTForService, PL_JWT_SIGNING_KEY) — it just attaches it. No new
secret/crypto. Flag-gated (CONTROL_REQUIRE_AUTH + PL_JWT_SIGNING_KEY), default
OFF so it merges before dx sends the bearer; flip on after dx is updated.

Also: reject invalid t_end (<=0) and query windows (start>=end, non-positive).
Test mints a real JWT via the shared lib; asserts 401 on missing/bad token,
pass on valid, /healthz open.
@coderabbitai

coderabbitai Bot commented Jun 21, 2026

Copy link
Copy Markdown

Review Change Stack

Warning

Review limit reached

@entlein, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 20 minutes and 50 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan refill rate.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, the refill rate gradually slows as usage increases. The highest same-day bursts are limited more strictly.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 06a889d9-f649-45ae-ae24-ef1deaa2c32a

📥 Commits

Reviewing files that changed from the base of the PR and between 714c1fb and dfdc465.

📒 Files selected for processing (2)
  • src/vizier/services/adaptive_export/internal/trigger/clickhouse.go
  • src/vizier/services/adaptive_export/internal/trigger/clickhouse_internal_test.go
📝 Walkthrough

Walkthrough

This PR adds an adaptive export operator with ClickHouse-backed trigger, sink, streaming, passthrough, and control packages, plus schema and Pixie integration changes, load-test tooling and harness scripts, a SOC GitHub workflow, and supporting bootstrap, packaging, and documentation updates.

Changes

Adaptive export operator and load-test rig

Layer / File(s) Summary
Bootstrap and packaging baseline
.arclint, .bazelignore, k8s/vizier/bootstrap/*, src/vizier/services/adaptive_export/BUILD.bazel, src/stirling/.../BUILD.bazel, .github/workflows/vizier_release.yaml
Adds repo exclusions for the Docker-built loadgen, updates adaptive-export bootstrap manifests, and introduces Bazel bundle/push targets plus small infrastructure adjustments.
Load-test contracts and fixtures
src/e2e_test/adaptive_export_loadtest/*.md, .../fixtures/*, .../k8s/*, .../tools/loadgen/*
Adds adaptive-export contract and findings docs, experiment definitions, sink and generator manifests, and a standalone Go load generator and HTTP sink image.
Harness orchestration and SOC workflow
.github/workflows/e2e_log4shell_soc.yaml, src/e2e_test/adaptive_export_loadtest/harness/*
Adds the shell/Python harness for control-plane, data-plane, matrix, reconciliation, profiling, and benchmark runs, and wires those scripts into a manual SOC workflow.
Schema, query, and event foundations
src/vizier/services/adaptive_export/internal/{anomaly,kubescape,chhttp,clickhouse,pxl,reconcile}/*
Adds anomaly hashing, kubescape row extraction, ClickHouse HTTP helpers, embedded schema and DDL parsing, PxL generation utilities, and reconciliation record types with tests and integration coverage.
Pixie clients and plugin management
src/api/go/pxapi/opts.go, src/vizier/services/adaptive_export/internal/{pixie,pixieapi,script,config}/*
Adds direct Pixie TLS-skip support, direct/query adapters, plugin enablement and script creation paths, and narrows the remaining script/config data model.
ClickHouse write and trigger path
src/vizier/services/adaptive_export/internal/{sink,trigger}/*
Adds ClickHouse-backed sink and trigger implementations for writing attribution and Pixie rows, querying active state, polling kubescape logs, normalizing and persisting watermarks, and detecting silent writes.
Controller state and control endpoints
src/vizier/services/adaptive_export/internal/{activeset,controller,control}/*
Adds the controller’s active-window state machine, rehydration and prune logic, and an authenticated HTTP control surface for export, query, and attack-graph submission.
Streaming, passthrough, and operator entrypoint
src/vizier/services/adaptive_export/cmd/*, src/vizier/services/adaptive_export/internal/{streaming,passthrough,e2e}/*
Adds streaming and passthrough export loops, wires them from the new operator-style main, and adds hermetic and end-to-end tests for the runtime flow.

Sequence Diagram(s)

sequenceDiagram
  participant ClickHouse
  participant Trigger
  participant Controller
  participant Sink
  participant Pixie

  Trigger->>ClickHouse: poll kubescape_logs with watermark
  Trigger->>Controller: emit kubescape event
  Controller->>Sink: write adaptive_attribution
  Controller->>Pixie: query matching tables for active window
  Sink->>ClickHouse: insert attribution and Pixie rows
Loading

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

  • k8sstormcenter/pixie#53: Both PRs modify the adaptive-export trigger watermark handling and reconciliation-related ClickHouse write path contracts.
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 59.39% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The pull request title 'AE control auth + CodeRabbit #53 followup' clearly summarizes the main changes: implementing bearer-JWT authentication on the control surface and addressing CodeRabbit feedback from PR #53.
Description check ✅ Passed The pull request description is comprehensive and directly related to the changeset, detailing the authentication implementation, input validation, companion changes, and remaining work items.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch ae-followup-auth

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Base automatically changed from ae-prod to main June 21, 2026 21:03
- controller: don't fan out Pixie rows when attribution Sink.Write fails
  (avoids orphaned rows; release in-flight slot, non-fatal). (controller.go:347)
- controller.Rehydrate: re-arm rev-1 pushPixieRows for restored windows so a
  restart doesn't silently miss post-restart Pixie data. (controller.go:255)
- passthrough.pull: per-table timeout context so a hung dependency can't stall
  the sweep / delay shutdown (covers serial+concurrent ticks). (passthrough.go:147)
- schema: TTL (30d) on ae_reconcile to cap append-only growth. (schema.sql:485)

Verified no-change-needed: adaptive_attribution ORDER BY (hostname, anomaly_hash)
is safe — anomaly_hash already encodes namespace+pod (anomaly/hash.go), so rows
never collapse across ns/pod. (schema.sql:430)
Already on ae-prod (no-op): filter timer leak, stats.py EXACT guard, sink async,
watermark paging, pixieapi WithDirectTLSSkipVerify, http.Server timeouts.
@entlein

entlein commented Jun 21, 2026

Copy link
Copy Markdown
Author

@dx-agent — cut you a release to regression-test this followup.

Release: release/vizier/v0.14.19-aeprod20 → image ghcr.io/k8sstormcenter/vizier-adaptive_export_image:0.14.19-aeprod20
Build: https://github.com/k8sstormcenter/pixie/actions/runs/27917599684 (oracle runner, ~32 min — image lands when green).
Tree: #53 (ae-prod) + this PR's control-auth + the CodeRabbit Major fixes.

What changed vs aeprod19 (what to exercise):

  • control-plane auth — bearer-JWT via the shared jwtutils lib. Flag-gated, DEFAULT OFF (CONTROL_REQUIRE_AUTH unset) → dx→AE works unchanged, no dx-side change needed for this test. (Auth e2e is a later step once dx attaches the bearer.)
  • controller: no fan-out when attribution Sink.Write fails (no orphaned Pixie rows); Rehydrate now re-arms rev-1 fan-out for restored windows.
  • passthrough: per-table timeout (hung dep can't stall the sweep).
  • schema: 30d TTL on ae_reconcile.

Ask: swap your rig's AE to 0.14.19-aeprod20 (keep dx as-is) and confirm the cast4/ExtremeC regression still green — reduction + no-loss + detection (log4shell/react2argo rule-in + dx_attack_graph writes), AE 2/2 capturing. The Rehydrate change is the one to watch (restart → does rev-1 fan-out resume?). Ping back with pass/fail and I'll flip auth on next.

Thanks!

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 44

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/e2e_log4shell_soc.yaml:
- Around line 14-31: The e2e_log4shell_soc.yaml workflow lacks concurrency
guards, which allows multiple simultaneous runs to race and corrupt shared
self-hosted runner resources like k3s and /etc/rancher/k3s/. Add a concurrency
block at the top level of the workflow (alongside the on and jobs sections) that
specifies a concurrency group to ensure only one run executes at a time. The
concurrency group should cancel in-progress runs when a new run is triggered,
using a meaningful identifier like the workflow name or a fixed string to ensure
all runs of this workflow serialize properly on the oracle-vm-16cpu-64gb-x86-64
runner.
- Around line 58-65: The workflow is directly interpolating `inputs.soc_ref` and
`inputs.dx_image` into shell commands within the run step, creating a command
injection vulnerability. Move these inputs to an env section at the step level,
assign them to environment variables, and validate that they contain only
allowed characters (e.g., alphanumeric, hyphens, dots for versions). Replace the
inline `${{ inputs.soc_ref }}` and `${{ inputs.dx_image }}` references in the
git clone and kubectl commands with properly quoted environment variable
references like `"$SOC_REF"` and `"$DX_IMAGE"` respectively.

In @.github/workflows/vizier_release.yaml:
- Line 143: The runner label oracle-vm-16cpu-64gb-x86-64 in the runs-on field is
not recognized by actionlint, causing lint validation to fail. If this is an
intentional custom self-hosted runner label, register it in your actionlint
configuration file (typically .actionlintrc or actionlint config) to whitelist
this label and allow the workflow to pass validation. Alternatively, if this
label does not exist in your infrastructure, replace it with a valid standard
GitHub Actions runner label or your organization's properly configured
self-hosted runner label.

In `@src/e2e_test/adaptive_export_loadtest/CONTRACTS.md`:
- Line 96: The `## Legend` heading is missing required blank lines for markdown
linting compliance with rule MD022. Add a blank line before the `## Legend`
heading and a blank line after it to separate it from the surrounding content.
This ensures proper spacing around the heading section in the CONTRACTS.md file.

In `@src/e2e_test/adaptive_export_loadtest/FINDINGS_AND_BACKLOG.md`:
- Around line 50-59: The markdown file has formatting violations in fenced code
blocks at lines 50-59 and 95-97. Add a blank line before and after each fenced
code block to comply with MD031, and add language identifiers to the opening
fence markers (e.g., change ``` to ```sql for SQL code blocks and ```text or
appropriate language for other blocks). This applies to all fenced code blocks
that currently lack language tags or proper spacing.

In `@src/e2e_test/adaptive_export_loadtest/fixtures/EXPERIMENTS.md`:
- Around line 17-18: The documentation in EXPERIMENTS.md incorrectly states that
fixture timestamps are in unix nanoseconds, but the actual harness
implementation in lib.sh (lines 145-147) uses seconds for the event_time field.
Update the documentation at lines 17-18 to accurately reflect that timestamps
are in seconds rather than nanoseconds, and also correct the same misstatement
at lines 35-36 to ensure consistency with the actual harness behavior and
prevent users from undersizing ADAPTIVE_WINDOW_BEFORE_SEC.

In `@src/e2e_test/adaptive_export_loadtest/harness/exp_control.sh`:
- Around line 77-80: The first inj command in the E4 test case (line 79) uses ||
true which silently suppresses injection failures, masking when the first
boundary-collision row fails to insert. This weakens the test reproducibility
assertion. Apply the same error handling pattern used in the second inj command
(line 80) to the first inj command, replacing the || true with a proper error
handler that logs the INJECT_FAIL status and continues the experiment, ensuring
both fixture injections are properly validated before the test proceeds.

In `@src/e2e_test/adaptive_export_loadtest/harness/exp_e8.sh`:
- Around line 74-80: The INJECT command invocations in the stall-detection
experiment are using the || true pattern to suppress failures, which causes
failed anomaly inserts to be treated as success and can produce false STALL
flags. Remove the || true suffix from both INJECT command calls (the first one
with the --same-time flag and the second one without it) to allow injection
failures to properly propagate and be detected, ensuring accurate stall
detection in the experiment.

In `@src/e2e_test/adaptive_export_loadtest/harness/exp_matrix.sh`:
- Around line 80-85: The measure function accepts a valid parameter but does not
include it when writing results to the RES file at line 84. Add the valid
parameter to the printf statement that writes to RES so it captures whether each
measurement is valid. Additionally, update the summary computation at lines
121-124 to filter the results in RES by the validity column before computing the
aggregate statistics, ensuring the reported means only include measurements
marked as valid.
- Around line 114-116: The matrix arms currently have
ADAPTIVE_PUSH_PIXIE_ROWS=false set on both the ALL arm run_arm call at line 114
and the DX arm run_arm call at line 116, which disables the data-plane write
path and prevents proper measurement of data-volume reduction. To fix this,
change ADAPTIVE_PUSH_PIXIE_ROWS=false to ADAPTIVE_PUSH_PIXIE_ROWS=true in at
least one of these run_arm calls (or both as needed) to enable the data-plane
write path so that data-volume reduction can be properly measured during the
adaptive export load test.
- Line 78: The r0001_recent function has a timestamp unit mismatch where it
multiplies by 1000000000 to convert to nanoseconds, but the event_time field is
stored in seconds according to the harness documentation. This causes the WHERE
clause comparison to fail or undercount valid events. Fix this by removing the
multiplication factor of 1000000000 from the timestamp calculation, so that the
comparison is between seconds and seconds instead of seconds and nanoseconds.
Change the expression from toUInt64((now()-130))*1000000000 to
toUInt64(now()-130).

In `@src/e2e_test/adaptive_export_loadtest/harness/exp_row_reconcile.sh`:
- Around line 54-56: The reconciliation test setup on line 55 of
exp_row_reconcile.sh disables the write path by setting
ADAPTIVE_PUSH_PIXIE_ROWS=false, which prevents the CH protocol-row writes that
the test is meant to validate, causing false reconciliation failures. Remove the
ADAPTIVE_PUSH_PIXIE_ROWS=false flag from the kubectl set env command in the
adaptive-export deployment to ensure the write path remains enabled during the
reconciliation test.

In `@src/e2e_test/adaptive_export_loadtest/harness/inject.sh`:
- Around line 23-27: The row-shape header comment in the inject.sh file
incorrectly documents event_time as unix-NANOS on line 26, but the actual
contract documented in lines 32-37 and the script's usage shows it is
unix-SECONDS. Correct the comment to change "event_time (UInt64 unix-NANOS)" to
"event_time (UInt64 unix-SECONDS)" to accurately reflect the seconds unit that
the script actually uses and expects.
- Around line 96-108: Replace the predictable temporary filename
`/tmp/inject_resp.$$` with a securely generated temporary file using `mktemp`.
Create a new temporary file at the start of the code block using `mktemp` and
store it in a variable, then substitute all occurrences of `/tmp/inject_resp.$$`
with this variable throughout the curl command, the response file output
redirection, the cat command that displays the response, and the rm command that
cleans up the file. This eliminates the security risk of predictable temp file
names that could be subject to symlink attacks or collisions.

In `@src/e2e_test/adaptive_export_loadtest/harness/lib.sh`:
- Line 33: The bash script header marks KUBECONFIG as required using set -uo
pipefail, but the kubectl operations at lines 51-53 execute without verifying
that KUBECONFIG is actually set and points to a valid file. Add an explicit
guard before the kubectl operations to check that KUBECONFIG is set and
non-empty, ensuring that kubectl commands execute against the intended cluster
context and not the caller's default context. This guard should exit with an
error if KUBECONFIG is unset or empty to prevent unintended mutations.
- Around line 89-94: The chq function is missing timeout controls on the curl
command, which can cause the harness to hang indefinitely if the port-forward or
ClickHouse service becomes unresponsive. Add curl timeout options such as
--connect-timeout for the connection establishment phase and --max-time for the
overall request duration to the curl invocation in the chq function to ensure
the harness can proceed even if the backend stalls.

In `@src/e2e_test/adaptive_export_loadtest/harness/log4shell_fire.sh`:
- Around line 61-66: Add validation immediately after the kubectl commands that
set the BIP and BPORT variables to ensure both contain non-empty values before
proceeding to the curl loop. If either BIP or BPORT is empty, the script should
exit with a clear error message indicating backend service resolution failed.
This prevents malformed requests from being sent by the attacker deployment and
ensures the script fails fast with meaningful diagnostics instead of producing
misleading log4j vulnerability detection results.

In `@src/e2e_test/adaptive_export_loadtest/harness/nfr.sh`:
- Around line 49-56: The script detects no-loss proof mismatches by printing
MISMATCH or READ!=WROTE in the loop that iterates over http_events, dns_events,
and conn_stats, but never exits with a failure status, causing automated runs to
pass despite data loss. Add a flag variable initialized before the loop to track
whether any mismatches occur (checking both the wrote versus CH_rows comparison
and the read versus wrote comparison), then after the loop completes, check this
flag and exit the script with a non-zero status code if any mismatches were
detected.
- Around line 34-39: The setarm function on line 34 suppresses all output and
errors from the kubectl commands using `>/dev/null 2>&1`, which hides failures
from kubectl set env and kubectl rollout status. The run_phase function on line
38 calls setarm without checking its return status, allowing the test to proceed
even if the adaptive-export reconfiguration fails. Remove the output suppression
redirection from the setarm function to allow errors to be visible, and add a
return status check immediately after the setarm call in run_phase so that the
phase aborts with an appropriate error message if the kubectl reconfiguration
fails.

In `@src/e2e_test/adaptive_export_loadtest/harness/run.sh`:
- Around line 32-33: The EVID variable assignment on line 32 uses a hardcoded
user-specific path that is not portable across different environments. Replace
the hardcoded path with a portable default, such as using environment variables
or a system temporary directory. Additionally, the critical pipeline stages in
lines 40-56 lack explicit failure checks, allowing the script to continue even
if a stage fails and produces an invalid verdict set. Add explicit error
handling (such as set -e at the beginning of the script or individual error
checks) to ensure the script exits immediately when any critical stage fails.

In `@src/e2e_test/adaptive_export_loadtest/harness/stats.py`:
- Around line 57-76: The issue is that when there are no numeric PASS values for
a metric (the vals list is empty), the code continues without setting repro_ok
to False, allowing a false positive "EXACTLY REPRODUCIBLE" verdict. To fix this,
add repro_ok = False before the continue statement in the block where not vals
is True, so that any metric lacking valid PASS measurements will correctly mark
the reproducibility check as failed.

In `@src/e2e_test/adaptive_export_loadtest/k8s/00-sinks.yaml`:
- Around line 28-37: Add security hardening to the httpsink container
specification in the pod spec by setting automountServiceAccountToken to false
at the spec level to prevent service account token mounting, and within the
container specification add allowPrivilegeEscalation set to false, drop
unnecessary Linux capabilities, and configure a seccomp profile to restrict
system calls. Apply the same hardening pattern to the pg-sink pod specification
as well to ensure both long-lived sink pods follow the same security baseline
and reduce unnecessary privilege and credential exposure.

In `@src/e2e_test/adaptive_export_loadtest/k8s/gen-pod.tmpl.yaml`:
- Around line 18-38: The gen-pod.tmpl.yaml pod spec lacks essential security
hardening configurations that should be applied to reduce the attack surface.
Add a securityContext block within the pod spec to enforce non-root execution,
disable privilege escalation, and set a read-only root filesystem. Additionally,
add automountServiceAccountToken set to false at the pod spec level to prevent
unnecessary service account token mounting. These hardening measures should be
applied to the spec section containing the restartPolicy and containers
definitions to ensure the cleanloadgen workload runs with minimal privileges and
reduced security risks.

In `@src/e2e_test/adaptive_export_loadtest/README.md`:
- Around line 51-57: The markdown file has linting compliance issues with the
Layout heading and its associated code fence block. First, ensure there is a
blank line before the "## Layout" heading and a blank line after it to satisfy
MD022 heading spacing requirements. Second, add a language tag to the opening
fence (using triple backticks) to satisfy MD040 requirements. Third, ensure
there is a blank line before the opening fence and a blank line after the
closing fence to satisfy MD031 fenced block spacing requirements. These same
spacing rules also apply to the heading at line 71.

In
`@src/e2e_test/adaptive_export_loadtest/tools/loadgen/cmd/cleanloadgen/main.go`:
- Around line 64-71: The envInt function currently accepts negative integer
values from environment variables, which can cause unintended behavior in
generation loops. After successfully parsing the integer value with strconv.Atoi
in the envInt function, add a validation check to ensure the parsed value is
non-negative. If the value is negative, call fatalf with an appropriate error
message indicating that the environment variable must be a non-negative integer.

In `@src/e2e_test/adaptive_export_loadtest/tools/loadgen/go.mod`:
- Line 3: The Go version directive in the go.mod file is set to version 1.22,
which is no longer receiving active security updates. Update the go directive
from version 1.22 to version 1.26 (or 1.25 if necessary) to ensure the loadgen
binaries run on a currently supported Go runtime baseline and receive ongoing
security patches.

In `@src/vizier/services/adaptive_export/cmd/main.go`:
- Around line 799-807: The issue is that isOperatorManagedScript() function uses
a broad prefix check (ch-) to identify operator-managed scripts, which can
accidentally match user-authored scripts that start with ch-. This causes
unintended deletion of user scripts when INSTALL_PRESET_SCRIPTS=true. Replace
the prefix-based matching logic in isOperatorManagedScript() with an explicit
list of known operator-managed script names that the function should check
against. Similarly, apply the same fix to the duplicate logic mentioned at lines
839-847 to ensure both locations use the same explicit script name matching
approach.

In `@src/vizier/services/adaptive_export/internal/chhttp/chhttp.go`:
- Around line 36-38: The QueryStream method claims to support unbounded
incremental streaming but uses the same http.Client that has a timeout set via
DefaultTimeout. In Go, http.Client.Timeout applies to the entire
request-response cycle including body reads, which will cut off long-running
responses mid-stream. For QueryStream, create a separate http.Client instance
without a timeout constraint, or modify the client initialization to
conditionally apply the timeout only for non-streaming operations. Ensure
QueryStream uses the timeout-free client while other operations continue to use
the DefaultTimeout-constrained client.

In `@src/vizier/services/adaptive_export/internal/clickhouse/integration_test.go`:
- Around line 145-152: The test uses a single 30-second context timeout for both
the Apply and VerifyPixieSchema operations, which means if Apply consumes most
of the time, VerifyPixieSchema may fail with a deadline exceeded error even
though the schema is correct. Create separate context.WithTimeout calls for each
operation so that both Apply and VerifyPixieSchema each get their own
independent 30-second timeout deadline.

In `@src/vizier/services/adaptive_export/internal/clickhouse/schema.sql`:
- Around line 68-70: The kubescape_logs table has unsafe unit handling for the
event_time column in both the PARTITION BY and TTL expressions. The
toDateTime(event_time) calls assume event_time is in seconds, but if the actual
values are in milliseconds or nanoseconds, this will produce incorrect partition
keys and TTL horizons. Determine the actual unit of event_time (likely
milliseconds or nanoseconds), then update both the PARTITION BY
toYYYYMM(toDateTime(event_time)) expression on line 68 and the TTL
toDateTime(event_time) + INTERVAL 30 DAY DELETE expression on line 69 to apply
the appropriate unit conversion before passing to toDateTime(), such as dividing
by 1000 for milliseconds or by 1000000000 for nanoseconds.

In `@src/vizier/services/adaptive_export/internal/control/server_test.go`:
- Around line 155-169: The TestBadInputRejected test function is missing
explicit test cases for the new t_end and window validators. Add test cases that
verify the server returns http.StatusBadRequest for t_end <= 0 on the
/export/start endpoint, and for window[0] >= window[1] and non-positive window
values on the /query endpoint. Follow the existing pattern in
TestBadInputRejected by using the do() helper function with appropriate test
payloads for each validator and asserting the expected 400 status code with
descriptive failure messages.

In `@src/vizier/services/adaptive_export/internal/control/server.go`:
- Around line 120-143: The handleDXAttackGraph method is vulnerable to memory
DoS attacks because the decode function reads the entire unbounded request body
and then copies it into a second buffer. Wrap the request body with
http.MaxBytesReader to enforce a maximum size limit before passing the request
to the decode function. This will ensure oversized requests fail gracefully
instead of exhausting memory resources.

In `@src/vizier/services/adaptive_export/internal/controller/controller.go`:
- Around line 316-366: The code modifies c.active[hash] before calling
c.sink.Write(), but if the write fails, it returns without rolling back the
changes to c.active. This allows in-flight workers already processing the same
hash to continue using unpersisted state extensions. Save the prior state of
c.active[hash] before making any modifications (before the block that either
creates a new row or updates the existing row), and then add rollback logic in
the error handling block after the c.sink.Write() call fails to restore the
original state, ensuring that in-flight workers do not reference orphaned state.

In `@src/vizier/services/adaptive_export/internal/e2e/e2e_test.go`:
- Around line 146-167: The test currently validates only the payload shape
within the INSERT bodies, but does not verify that the INSERT statement targets
the correct table forensic_db.adaptive_attribution. Add an assertion that checks
the captured SQL statements (via insertedSQL) to ensure they contain the correct
table reference, preventing regressions where the wrong table could be inserted
into while still maintaining the correct payload structure. This check should be
added alongside or after the existing body content validation loop that checks
for matched payloads.

In `@src/vizier/services/adaptive_export/internal/passthrough/passthrough.go`:
- Around line 124-131: The compiled path silently skips invalid tables when
pxl.CompilePassthrough fails but does not record these failures using
l.rec(...), which changes the error recording semantics compared to legacy mode.
In the error handling block where pxl.CompilePassthrough(table, cfg.Window)
returns an error, add a call to l.rec(...) to record the failure before
continuing to skip the table. Apply this same fix to the similar error handling
code at lines 235-239 that also needs to record failures when tables cannot be
compiled.

In `@src/vizier/services/adaptive_export/internal/passthrough/reconcile_test.go`:
- Around line 34-37: The capRec struct's Record method appends to the rows slice
without synchronization, which causes race conditions when multiple goroutines
call Record concurrently. Add a sync.Mutex field to the capRec struct and use it
to protect the append operation in the Record method by locking before appending
to rows and unlocking after. Also update any code that reads the rows field to
acquire the same lock before accessing the slice to ensure consistent visibility
of all recorded rows.

In `@src/vizier/services/adaptive_export/internal/pixieapi/pixieapi.go`:
- Around line 112-160: Add a TODO comment to the Query method of the Adapter
struct that documents the need to implement a long-lived client with JWT-refresh
mechanism for high-throughput direct-mode scenarios. Place this TODO near the
existing comment about GC-time reclamation and connection-leak mitigation to
ensure future developers are aware of the technical debt and the conditions
under which it should be addressed.

In `@src/vizier/services/adaptive_export/internal/sink/clickhouse.go`:
- Around line 190-207: The unconditional Info log statement that logs "sink:
pixie write completed" with the WithFields call executes on every batch write in
the hot path, creating excessive log volume. Wrap the entire logging block (from
the summary := res.Summary assignment through the Info call) behind a
debug-level check using log.IsLevelEnabled(log.DebugLevel) or a temporary
feature flag so this diagnostic logging only occurs when explicitly enabled for
debugging rather than on every write.

In `@src/vizier/services/adaptive_export/internal/sink/fastencode_test.go`:
- Around line 36-58: The encodeViaJSON and parseNDJSON helper functions are
ignoring errors from enc.Encode and json.Unmarshal respectively, which can allow
tests to pass with invalid data. Modify both encodeViaJSON and parseNDJSON to
accept a *testing.T parameter as the first argument, then check the error
returns from enc.Encode and json.Unmarshal and call t.Fatalf to fail the test
immediately if either operation returns an error. Finally, update all call sites
throughout the test file where encodeViaJSON and parseNDJSON are invoked to pass
the test instance t as the first argument.

In `@src/vizier/services/adaptive_export/internal/sink/fastencode.go`:
- Around line 184-187: The appendFloat function is being called with float32 and
float64 values without first validating that they are finite numbers. Non-finite
values like NaN and Infinity will be emitted as invalid JSON, causing
whole-batch inserts to fail. Add a check before calling appendFloat in both the
float32 case (around line 184) and float64 case (around line 186) to verify the
value is finite using a function like math.IsNaN or math.IsInf. If a non-finite
value is detected, reject it or handle it appropriately instead of passing it to
appendFloat. Apply the same validation pattern to the similar float handling
code around lines 215-218.

In `@src/vizier/services/adaptive_export/internal/sink/integration_test.go`:
- Around line 158-164: The shared 60-second timeout context created before the
loop causes the test to be flaky because it applies to the entire table
iteration sequence rather than individual tables. Move the context.WithTimeout
call inside the for loop that iterates over chpkg.PixieTables() so that each
table gets its own independent 60-second timeout. This way, if one table's
WritePixieRows call takes time, it won't affect the timeout for subsequent
tables, ensuring failures are isolated to the specific table being tested.

In `@src/vizier/services/adaptive_export/internal/streaming/writer.go`:
- Around line 117-134: The flush() method clears buf unconditionally after
attempting WritePixieRows, which causes data loss when the write fails. Only
clear buf by setting it to buf[:0] when WritePixieRows succeeds (move this
operation into the else block), not in the error case. Additionally, in the
shutdown branch (lines 138-140), the timeout context is derived from an
already-canceled parent context, causing the final flush to fail fast. For the
shutdown case, create the timeout context using context.Background() instead of
the canceled parent context to give the final flush operation a proper chance to
complete.

In `@src/vizier/services/adaptive_export/internal/trigger/clickhouse.go`:
- Around line 73-78: The PollLimit field documentation incorrectly states that 0
means unlimited behavior, but the New() function actually rewrites 0 to 10000.
Update the comment for the PollLimit field to accurately reflect the actual
runtime behavior that 0 is rewritten to the default value of 10000, ensuring
operators understand the correct semantics when tuning this setting.

In
`@src/vizier/services/adaptive_export/internal/trigger/fingerprint_bench_test.go`:
- Around line 123-132: The function fingerprintNoFmt is intended to be a
reference implementation that avoids fmt.Fprintf overhead for benchmarking
purposes, but the code still contains a fmt.Fprintf call on the line formatting
r.EventTime with the "%d" format string. Replace this fmt.Fprintf call with a
direct string conversion using strconv (such as strconv.FormatInt) to convert
the EventTime value to a string, then write it directly to the strings.Builder
using WriteString method to truly eliminate the formatting overhead from the
benchmark baseline.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 0f1144a8-c2c2-4681-a318-0629757dcbdb

📥 Commits

Reviewing files that changed from the base of the PR and between 855db81 and 714c1fb.

📒 Files selected for processing (116)
  • .arclint
  • .bazelignore
  • .github/workflows/e2e_log4shell_soc.yaml
  • .github/workflows/vizier_release.yaml
  • k8s/vizier/bootstrap/adaptive_export_deployment.yaml
  • k8s/vizier/bootstrap/adaptive_export_secrets.yaml
  • k8s/vizier/bootstrap/kustomization.yaml
  • skaffold/skaffold_vizier.yaml
  • src/api/go/pxapi/opts.go
  • src/e2e_test/adaptive_export_loadtest/CONTRACTS.md
  • src/e2e_test/adaptive_export_loadtest/FINDINGS_AND_BACKLOG.md
  • src/e2e_test/adaptive_export_loadtest/README.md
  • src/e2e_test/adaptive_export_loadtest/fixtures/EXPERIMENTS.md
  • src/e2e_test/adaptive_export_loadtest/harness/ae_config.sh
  • src/e2e_test/adaptive_export_loadtest/harness/exp_control.sh
  • src/e2e_test/adaptive_export_loadtest/harness/exp_e5.sh
  • src/e2e_test/adaptive_export_loadtest/harness/exp_e8.sh
  • src/e2e_test/adaptive_export_loadtest/harness/exp_matrix.sh
  • src/e2e_test/adaptive_export_loadtest/harness/exp_row_reconcile.sh
  • src/e2e_test/adaptive_export_loadtest/harness/inject.sh
  • src/e2e_test/adaptive_export_loadtest/harness/lib.sh
  • src/e2e_test/adaptive_export_loadtest/harness/log4shell_fire.sh
  • src/e2e_test/adaptive_export_loadtest/harness/nfr.sh
  • src/e2e_test/adaptive_export_loadtest/harness/run.sh
  • src/e2e_test/adaptive_export_loadtest/harness/stats.py
  • src/e2e_test/adaptive_export_loadtest/k8s/00-sinks.yaml
  • src/e2e_test/adaptive_export_loadtest/k8s/gen-pod.tmpl.yaml
  • src/e2e_test/adaptive_export_loadtest/tools/loadgen/Dockerfile
  • src/e2e_test/adaptive_export_loadtest/tools/loadgen/cmd/cleanloadgen/main.go
  • src/e2e_test/adaptive_export_loadtest/tools/loadgen/cmd/httpsink/main.go
  • src/e2e_test/adaptive_export_loadtest/tools/loadgen/go.mod
  • src/stirling/source_connectors/socket_tracer/testing/container_images/BUILD.bazel
  • src/vizier/services/adaptive_export/BUILD.bazel
  • src/vizier/services/adaptive_export/cmd/BUILD.bazel
  • src/vizier/services/adaptive_export/cmd/main.go
  • src/vizier/services/adaptive_export/internal/activeset/BUILD.bazel
  • src/vizier/services/adaptive_export/internal/activeset/activeset.go
  • src/vizier/services/adaptive_export/internal/activeset/activeset_test.go
  • src/vizier/services/adaptive_export/internal/anomaly/BUILD.bazel
  • src/vizier/services/adaptive_export/internal/anomaly/hash.go
  • src/vizier/services/adaptive_export/internal/anomaly/hash_bench_test.go
  • src/vizier/services/adaptive_export/internal/anomaly/hash_test.go
  • src/vizier/services/adaptive_export/internal/chhttp/BUILD.bazel
  • src/vizier/services/adaptive_export/internal/chhttp/chhttp.go
  • src/vizier/services/adaptive_export/internal/chhttp/chhttp_test.go
  • src/vizier/services/adaptive_export/internal/clickhouse/BUILD.bazel
  • src/vizier/services/adaptive_export/internal/clickhouse/apply.go
  • src/vizier/services/adaptive_export/internal/clickhouse/apply_test.go
  • src/vizier/services/adaptive_export/internal/clickhouse/columns_test.go
  • src/vizier/services/adaptive_export/internal/clickhouse/ddl.go
  • src/vizier/services/adaptive_export/internal/clickhouse/ddl_test.go
  • src/vizier/services/adaptive_export/internal/clickhouse/insert.go
  • src/vizier/services/adaptive_export/internal/clickhouse/insert_test.go
  • src/vizier/services/adaptive_export/internal/clickhouse/integration_test.go
  • src/vizier/services/adaptive_export/internal/clickhouse/schema.sql
  • src/vizier/services/adaptive_export/internal/config/BUILD.bazel
  • src/vizier/services/adaptive_export/internal/config/definition.go
  • src/vizier/services/adaptive_export/internal/control/BUILD.bazel
  • src/vizier/services/adaptive_export/internal/control/server.go
  • src/vizier/services/adaptive_export/internal/control/server_test.go
  • src/vizier/services/adaptive_export/internal/controller/BUILD.bazel
  • src/vizier/services/adaptive_export/internal/controller/controller.go
  • src/vizier/services/adaptive_export/internal/controller/controller_test.go
  • src/vizier/services/adaptive_export/internal/e2e/BUILD.bazel
  • src/vizier/services/adaptive_export/internal/e2e/e2e_test.go
  • src/vizier/services/adaptive_export/internal/e2e/loadtest_test.go
  • src/vizier/services/adaptive_export/internal/kubescape/BUILD.bazel
  • src/vizier/services/adaptive_export/internal/kubescape/extract.go
  • src/vizier/services/adaptive_export/internal/kubescape/extract_test.go
  • src/vizier/services/adaptive_export/internal/passthrough/BUILD.bazel
  • src/vizier/services/adaptive_export/internal/passthrough/compiled_test.go
  • src/vizier/services/adaptive_export/internal/passthrough/passthrough.go
  • src/vizier/services/adaptive_export/internal/passthrough/passthrough_test.go
  • src/vizier/services/adaptive_export/internal/passthrough/reconcile_test.go
  • src/vizier/services/adaptive_export/internal/pixie/pixie.go
  • src/vizier/services/adaptive_export/internal/pixieapi/BUILD.bazel
  • src/vizier/services/adaptive_export/internal/pixieapi/pixieapi.go
  • src/vizier/services/adaptive_export/internal/pixieapi/pixieapi_test.go
  • src/vizier/services/adaptive_export/internal/pxl/BUILD.bazel
  • src/vizier/services/adaptive_export/internal/pxl/compile.go
  • src/vizier/services/adaptive_export/internal/pxl/compile_test.go
  • src/vizier/services/adaptive_export/internal/pxl/pxl.go
  • src/vizier/services/adaptive_export/internal/pxl/queryfor.go
  • src/vizier/services/adaptive_export/internal/pxl/queryfor_bench_test.go
  • src/vizier/services/adaptive_export/internal/pxl/queryfor_test.go
  • src/vizier/services/adaptive_export/internal/pxl/tables.go
  • src/vizier/services/adaptive_export/internal/pxl/tables_test.go
  • src/vizier/services/adaptive_export/internal/reconcile/BUILD.bazel
  • src/vizier/services/adaptive_export/internal/reconcile/reconcile.go
  • src/vizier/services/adaptive_export/internal/script/script.go
  • src/vizier/services/adaptive_export/internal/sink/BUILD.bazel
  • src/vizier/services/adaptive_export/internal/sink/clickhouse.go
  • src/vizier/services/adaptive_export/internal/sink/clickhouse_test.go
  • src/vizier/services/adaptive_export/internal/sink/content_type_contract_test.go
  • src/vizier/services/adaptive_export/internal/sink/encode_bench_test.go
  • src/vizier/services/adaptive_export/internal/sink/fastencode.go
  • src/vizier/services/adaptive_export/internal/sink/fastencode_test.go
  • src/vizier/services/adaptive_export/internal/sink/integration_test.go
  • src/vizier/services/adaptive_export/internal/streaming/BUILD.bazel
  • src/vizier/services/adaptive_export/internal/streaming/filter.go
  • src/vizier/services/adaptive_export/internal/streaming/filter_test.go
  • src/vizier/services/adaptive_export/internal/streaming/integration_test.go
  • src/vizier/services/adaptive_export/internal/streaming/notifier.go
  • src/vizier/services/adaptive_export/internal/streaming/notifier_test.go
  • src/vizier/services/adaptive_export/internal/streaming/scanner.go
  • src/vizier/services/adaptive_export/internal/streaming/scanner_test.go
  • src/vizier/services/adaptive_export/internal/streaming/supervisor.go
  • src/vizier/services/adaptive_export/internal/streaming/writer.go
  • src/vizier/services/adaptive_export/internal/trigger/BUILD.bazel
  • src/vizier/services/adaptive_export/internal/trigger/clickhouse.go
  • src/vizier/services/adaptive_export/internal/trigger/clickhouse_internal_test.go
  • src/vizier/services/adaptive_export/internal/trigger/clickhouse_test.go
  • src/vizier/services/adaptive_export/internal/trigger/fingerprint_bench_test.go
  • src/vizier/services/adaptive_export/internal/trigger/integration_test.go
  • src/vizier/services/adaptive_export/internal/trigger/watermark.go
  • src/vizier/services/adaptive_export/internal/trigger/watermark_test.go
💤 Files with no reviewable changes (2)
  • src/vizier/services/adaptive_export/internal/pxl/pxl.go
  • src/vizier/services/adaptive_export/internal/config/definition.go

git push origin "gh-pages"
update-gh-artifacts-manifest:
runs-on: oracle-8cpu-32gb-x86-64
runs-on: oracle-vm-16cpu-64gb-x86-64

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Runner label is currently unresolved by lint/scheduling contracts.

runs-on: oracle-vm-16cpu-64gb-x86-64 is flagged as unknown, which can block release workflow validation or leave the job unschedulable unless this is explicitly configured as a self-hosted label.

🔧 Suggested fix
-    runs-on: oracle-vm-16cpu-64gb-x86-64
+    runs-on:
+      - self-hosted
+      - oracle-vm-16cpu-64gb-x86-64

If this is intentional custom-label usage, also register it in your actionlint config to keep CI green.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
runs-on: oracle-vm-16cpu-64gb-x86-64
runs-on:
- self-hosted
- oracle-vm-16cpu-64gb-x86-64
🧰 Tools
🪛 actionlint (1.7.12)

[error] 143-143: label "oracle-vm-16cpu-64gb-x86-64" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2025-vs2026", "windows-2022", "windows-11-arm", "ubuntu-slim", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xlarge", "macos-latest-large", "macos-26-intel", "macos-26-xlarge", "macos-26-large", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xlarge", "macos-14-large", "macos-14", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/vizier_release.yaml at line 143, The runner label
oracle-vm-16cpu-64gb-x86-64 in the runs-on field is not recognized by
actionlint, causing lint validation to fail. If this is an intentional custom
self-hosted runner label, register it in your actionlint configuration file
(typically .actionlintrc or actionlint config) to whitelist this label and allow
the workflow to pass validation. Alternatively, if this label does not exist in
your infrastructure, replace it with a valid standard GitHub Actions runner
label or your organization's properly configured self-hosted runner label.

Source: Linters/SAST tools

- **DX controls:** (1) open/extend a window (each referral/anomaly extends `t_end`), (2) explicit **StopExport** via the control surface (`CONTROL_ADDR`, design rev-3 — confirm wired), (3) the active set (which pods AE over-captures).
- **DX relies on:** C5 (stable hash identity), C14 (write ⊇ read), **C15 (no premature stop)**, C9 (0 rows only when the workload is genuinely silent), C10 (the `ns/pod` ↔ bare join). For DX to steer dependably, C3/C8/C13/C15 must move from 🔴 to ✅.

## Legend

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix markdownlint heading spacing in the legend section.

Line 96 (## Legend) is missing the required surrounding blank line and triggers MD022.

🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 96-96: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/e2e_test/adaptive_export_loadtest/CONTRACTS.md` at line 96, The `##
Legend` heading is missing required blank lines for markdown linting compliance
with rule MD022. Add a blank line before the `## Legend` heading and a blank
line after it to separate it from the surrounding content. This ensures proper
spacing around the heading section in the CONTRACTS.md file.

Source: Linters/SAST tools

Comment on lines +50 to +59
```sql
PARTITION BY toYYYYMM(toDateTime(event_time))
TTL toDateTime(event_time) + INTERVAL 30 DAY
```
`toDateTime()` interprets its arg as **seconds**. Reproduced on the rig:
```
toDateTime(1781559074162913804) = 2106-02-07 (saturates at DateTime max)
toDateTime(1781559074162913804)+30 DAY = 1970-01-30 (OVERFLOWS past max → wraps to 1970)
(... ) < now() = 1 (already_expired)
```

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Normalize fenced-code formatting and add fence language tags.

Lines 50–59 and 95–97 include fenced blocks without language identifiers and without required surrounding blank lines, which triggers MD031/MD040.

Also applies to: 95-97

🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 50-50: Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


[warning] 53-53: Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


[warning] 55-55: Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


[warning] 55-55: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


[warning] 59-59: Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/e2e_test/adaptive_export_loadtest/FINDINGS_AND_BACKLOG.md` around lines
50 - 59, The markdown file has formatting violations in fenced code blocks at
lines 50-59 and 95-97. Add a blank line before and after each fenced code block
to comply with MD031, and add language identifiers to the opening fence markers
(e.g., change ``` to ```sql for SQL code blocks and ```text or appropriate
language for other blocks). This applies to all fenced code blocks that
currently lack language tags or proper spacing.

Source: Linters/SAST tools

Comment on lines +17 to +18
isolates each rep even with overlapping windows). Timestamps are explicit unix
nanos — fixtures NEVER use wall-clock `now()`.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Timestamp unit contract is inconsistent with harness behavior.

Line 17 and Line 35 document nanosecond semantics, but src/e2e_test/adaptive_export_loadtest/harness/lib.sh (Line 145–147) states fixture event_time is in seconds. Following this doc as written can undersize ADAPTIVE_WINDOW_BEFORE_SEC and miss band rows.

Suggested doc correction
- Timestamps are explicit unix nanos — fixtures NEVER use wall-clock `now()`.
+ Timestamps are explicit unix seconds — fixtures NEVER use wall-clock `now()`.

- 3. AE config: `ADAPTIVE_WINDOW_BEFORE_SEC ≥ (B1−B0)/1e9 + margin` so window start
+ 3. AE config: `ADAPTIVE_WINDOW_BEFORE_SEC ≥ (B1−B0) + margin` so window start

Also applies to: 35-36

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/e2e_test/adaptive_export_loadtest/fixtures/EXPERIMENTS.md` around lines
17 - 18, The documentation in EXPERIMENTS.md incorrectly states that fixture
timestamps are in unix nanoseconds, but the actual harness implementation in
lib.sh (lines 145-147) uses seconds for the event_time field. Update the
documentation at lines 17-18 to accurately reflect that timestamps are in
seconds rather than nanoseconds, and also correct the same misstatement at lines
35-36 to ensure consistency with the actual harness behavior and prevent users
from undersizing ADAPTIVE_WINDOW_BEFORE_SEC.

Comment on lines +74 to +80
"$INJECT" --endpoint "$CH_HTTP" --user "$CH_RW_USER" --pass "$CH_RW_PASS" \
--hostname "$NODE" --ns "$AELOAD_NS" --pod "$POD" --rule R0001 --pid 1234 --comm java \
--event-time "$T" --count "$BURST" --same-time >&2 || true
else
"$INJECT" --endpoint "$CH_HTTP" --user "$CH_RW_USER" --pass "$CH_RW_PASS" \
--hostname "$NODE" --ns "$AELOAD_NS" --pod "$POD" --rule R0001 --pid 1234 --comm java \
--event-time "$T" >&2 || true

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Do not suppress injector failures in a stall-detection experiment.

Line 74/Line 78 use || true, so failed anomaly inserts are treated as success; this can produce false STALL flags at Line 90.

Suggested fix
-    "$INJECT" --endpoint "$CH_HTTP" --user "$CH_RW_USER" --pass "$CH_RW_PASS" \
+    "$INJECT" --endpoint "$CH_HTTP" --user "$CH_RW_USER" --pass "$CH_RW_PASS" \
       --hostname "$NODE" --ns "$AELOAD_NS" --pod "$POD" --rule R0001 --pid 1234 --comm java \
-      --event-time "$T" --count "$BURST" --same-time >&2 || true
+      --event-time "$T" --count "$BURST" --same-time >&2 || die "inject failed at tick=$tick"
@@
-    "$INJECT" --endpoint "$CH_HTTP" --user "$CH_RW_USER" --pass "$CH_RW_PASS" \
+    "$INJECT" --endpoint "$CH_HTTP" --user "$CH_RW_USER" --pass "$CH_RW_PASS" \
       --hostname "$NODE" --ns "$AELOAD_NS" --pod "$POD" --rule R0001 --pid 1234 --comm java \
-      --event-time "$T" >&2 || true
+      --event-time "$T" >&2 || die "inject failed at tick=$tick"

Also applies to: 90-90

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/e2e_test/adaptive_export_loadtest/harness/exp_e8.sh` around lines 74 -
80, The INJECT command invocations in the stall-detection experiment are using
the || true pattern to suppress failures, which causes failed anomaly inserts to
be treated as success and can produce false STALL flags. Remove the || true
suffix from both INJECT command calls (the first one with the --same-time flag
and the second one without it) to allow injection failures to properly propagate
and be detected, ensuring accurate stall detection in the experiment.

Comment on lines +184 to +187
case float32:
appendFloat(buf, float64(x))
case float64:
appendFloat(buf, x)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject non-finite floats before fast-path emission.

appendFloat will emit NaN/+Inf/-Inf, which is invalid JSON. This can break whole-batch inserts instead of falling back safely.

Proposed fix
 import (
 	"bytes"
 	"encoding/json"
 	"errors"
 	"fmt"
+	"math"
 	"strconv"
 	"sync"
 	"time"
 	"unicode/utf8"
@@
 	case float32:
-		appendFloat(buf, float64(x))
+		f := float64(x)
+		if math.IsNaN(f) || math.IsInf(f, 0) {
+			return errFastEncodeUnsupported
+		}
+		appendFloat(buf, f)
 	case float64:
+		if math.IsNaN(x) || math.IsInf(x, 0) {
+			return errFastEncodeUnsupported
+		}
 		appendFloat(buf, x)

Also applies to: 215-218

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/vizier/services/adaptive_export/internal/sink/fastencode.go` around lines
184 - 187, The appendFloat function is being called with float32 and float64
values without first validating that they are finite numbers. Non-finite values
like NaN and Infinity will be emitted as invalid JSON, causing whole-batch
inserts to fail. Add a check before calling appendFloat in both the float32 case
(around line 184) and float64 case (around line 186) to verify the value is
finite using a function like math.IsNaN or math.IsInf. If a non-finite value is
detected, reject it or handle it appropriately instead of passing it to
appendFloat. Apply the same validation pattern to the similar float handling
code around lines 215-218.

Comment on lines +158 to +164
ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
defer cancel()

for _, table := range chpkg.PixieTables() {
row := minimalRowFor(table, tag, now)
if err := s.WritePixieRows(ctx, table, []map[string]any{row}); err != nil {
t.Errorf("WritePixieRows(%s): %v", table, err)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Shared timeout across all tables makes this live test unnecessarily flaky.

Using one 60s context for the entire table loop can cancel later table checks due to earlier latency. Prefer per-table timeouts so failures are isolated to the table under test.

🔧 Suggested fix
-    ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
-    defer cancel()
-
     for _, table := range chpkg.PixieTables() {
+        ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
         row := minimalRowFor(table, tag, now)
         if err := s.WritePixieRows(ctx, table, []map[string]any{row}); err != nil {
+            cancel()
             t.Errorf("WritePixieRows(%s): %v", table, err)
             continue
         }
         ident := table
         if strings.Contains(table, ".") {
             ident = "`" + table + "`"
         }
         got := chCount(t, endpoint, user, pass,
             fmt.Sprintf("SELECT count() FROM forensic_db.%s WHERE hostname='%s'", ident, tag))
+        cancel()
         if got < 1 {
             t.Errorf("table %s after WritePixieRows: count=%d, want >=1", table, got)
         }
     }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/vizier/services/adaptive_export/internal/sink/integration_test.go` around
lines 158 - 164, The shared 60-second timeout context created before the loop
causes the test to be flaky because it applies to the entire table iteration
sequence rather than individual tables. Move the context.WithTimeout call inside
the for loop that iterates over chpkg.PixieTables() so that each table gets its
own independent 60-second timeout. This way, if one table's WritePixieRows call
takes time, it won't affect the timeout for subsequent tables, ensuring failures
are isolated to the specific table being tested.

Comment on lines +117 to +134
fctx, cancel := context.WithTimeout(ctx, 60*time.Second)
err := w.sink.WritePixieRows(fctx, w.table, buf)
cancel()
if err != nil {
log.WithError(err).WithFields(log.Fields{
"table": w.table,
"rows": len(buf),
"reason": reason,
}).Warn("streaming.BatchWriter: flush failed")
} else {
log.WithFields(log.Fields{
"table": w.table,
"rows": len(buf),
"reason": reason,
}).Info("streaming.BatchWriter: flushed batch")
}
buf = buf[:0]
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Prevent row loss on failed/shutdown flushes.

flush() clears buf even when WritePixieRows fails, and in the shutdown branch the timeout context is derived from an already-canceled parent. That makes the final flush fail fast and still discard buffered rows.

Proposed fix
-	flush := func(reason string) {
+	flush := func(reason string) {
 		if len(buf) == 0 {
 			return
 		}
-		// Bound the CH write so a stalled CH HTTP doesn't pin us.
-		fctx, cancel := context.WithTimeout(ctx, 60*time.Second)
+		// Bound the CH write so a stalled CH HTTP doesn't pin us.
+		// On shutdown, use a fresh parent so we can attempt a real final flush.
+		baseCtx := ctx
+		if reason == "shutdown" {
+			baseCtx = context.Background()
+		}
+		fctx, cancel := context.WithTimeout(baseCtx, 60*time.Second)
 		err := w.sink.WritePixieRows(fctx, w.table, buf)
 		cancel()
 		if err != nil {
 			log.WithError(err).WithFields(log.Fields{
 				"table":  w.table,
 				"rows":   len(buf),
 				"reason": reason,
 			}).Warn("streaming.BatchWriter: flush failed")
-		} else {
+			// Keep buffered rows for retry on the next timer/size flush.
+			// (shutdown still exits after this best-effort attempt)
+			if reason != "shutdown" {
+				return
+			}
+		} else {
 			log.WithFields(log.Fields{
 				"table":  w.table,
 				"rows":   len(buf),
 				"reason": reason,
 			}).Info("streaming.BatchWriter: flushed batch")
 		}
 		buf = buf[:0]
 	}

Also applies to: 138-140

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/vizier/services/adaptive_export/internal/streaming/writer.go` around
lines 117 - 134, The flush() method clears buf unconditionally after attempting
WritePixieRows, which causes data loss when the write fails. Only clear buf by
setting it to buf[:0] when WritePixieRows succeeds (move this operation into the
else block), not in the error case. Additionally, in the shutdown branch (lines
138-140), the timeout context is derived from an already-canceled parent
context, causing the final flush to fail fast. For the shutdown case, create the
timeout context using context.Background() instead of the canceled parent
context to give the final flush operation a proper chance to complete.

Comment on lines +73 to +78
// PollLimit caps rows returned per poll. Bounds catch-up work
// after a restart so a 10h backlog doesn't translate into a
// single multi-GiB SELECT the HTTP client times out on; instead
// it drains in N polls of PollLimit rows. Default 10000.
// 0 → unlimited (legacy behavior — NOT recommended in prod).
PollLimit int

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

PollLimit documentation contradicts runtime behavior.

The comment says 0 means unlimited, but New() rewrites 0 to 10000. Please align the comment (or behavior) so operators don’t tune based on incorrect semantics.

Suggested fix
-    // PollLimit caps rows returned per poll. Bounds catch-up work
-    // after a restart so a 10h backlog doesn't translate into a
-    // single multi-GiB SELECT the HTTP client times out on; instead
-    // it drains in N polls of PollLimit rows. Default 10000.
-    // 0 → unlimited (legacy behavior — NOT recommended in prod).
+    // PollLimit caps rows returned per poll. Bounds catch-up work
+    // after a restart so a 10h backlog doesn't translate into a
+    // single multi-GiB SELECT the HTTP client times out on; instead
+    // it drains in N polls of PollLimit rows. 0 uses the default (10000).

Also applies to: 137-139

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/vizier/services/adaptive_export/internal/trigger/clickhouse.go` around
lines 73 - 78, The PollLimit field documentation incorrectly states that 0 means
unlimited behavior, but the New() function actually rewrites 0 to 10000. Update
the comment for the PollLimit field to accurately reflect the actual runtime
behavior that 0 is rewritten to the default value of 10000, ensuring operators
understand the correct semantics when tuning this setting.

Comment on lines +123 to +132
// fingerprintNoFmt is the Fprintf-free reference. Same output guarantee
// is NOT asserted here — this is a perf-comparison anchor only. If the
// numbers diverge by >2× from rowFingerprint, the fmt.Fprintf path is
// a real cost.
func fingerprintNoFmt(r kubescape.Row) string {
h := sha256.New()
var b strings.Builder
b.Grow(64 + len(r.RuleID) + len(r.Hostname) + len(r.K8sDetails) + len(r.ProcessDetails))
_, _ = fmt.Fprintf(&b, "%d", r.EventTime)
b.WriteByte(0)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

fingerprintNoFmt still uses fmt.Fprintf, so the benchmark baseline is skewed.

Line 131 keeps formatting overhead in the “no fmt” path, which weakens the comparison.

Proposed fix
 import (
 	"crypto/sha256"
 	"encoding/hex"
-	"fmt"
+	"strconv"
 	"strings"
 	"testing"
@@
-	_, _ = fmt.Fprintf(&b, "%d", r.EventTime)
+	var nbuf [20]byte
+	b.Write(strconv.AppendUint(nbuf[:0], r.EventTime, 10))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@src/vizier/services/adaptive_export/internal/trigger/fingerprint_bench_test.go`
around lines 123 - 132, The function fingerprintNoFmt is intended to be a
reference implementation that avoids fmt.Fprintf overhead for benchmarking
purposes, but the code still contains a fmt.Fprintf call on the line formatting
r.EventTime with the "%d" format string. Replace this fmt.Fprintf call with a
direct string conversion using strconv (such as strconv.FormatInt) to convert
the EventTime value to a string, then write it directly to the strings.Builder
using WriteString method to truly eliminate the formatting overhead from the
benchmark baseline.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review continued from previous batch...

Comment on lines +14 to +31
on:
workflow_dispatch:
inputs:
dx_image:
description: dx-daemon image to test (default = .image-tags pin)
required: false
default: ""
soc_ref:
description: k8sstormcenter/soc branch
required: false
default: "218-clickhouse-schema"
permissions:
contents: read

jobs:
e2e:
runs-on: oracle-vm-16cpu-64gb-x86-64 # eBPF + 16cpu/64gb; ubuntu-latest cannot run Pixie
timeout-minutes: 90

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add workflow concurrency guards for the self-hosted rig.

This job provisions global host resources (k3s, /etc/rancher/k3s/...). Without concurrency, overlapping manual runs can race and corrupt each other.

Proposed fix
 on:
   workflow_dispatch:
@@
+concurrency:
+  group: e2e-log4shell-soc-${{ github.ref }}
+  cancel-in-progress: false
+
 jobs:
   e2e:
🧰 Tools
🪛 actionlint (1.7.12)

[error] 30-30: label "oracle-vm-16cpu-64gb-x86-64" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2025-vs2026", "windows-2022", "windows-11-arm", "ubuntu-slim", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xlarge", "macos-latest-large", "macos-26-intel", "macos-26-xlarge", "macos-26-large", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xlarge", "macos-14-large", "macos-14", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)

🪛 zizmor (1.25.2)

[info] 29-29: workflow or action definition without a name (anonymous-definition): this job

(anonymous-definition)


[warning] 14-24: insufficient job-level concurrency limits (concurrency-limits): workflow is missing concurrency setting

(concurrency-limits)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/e2e_log4shell_soc.yaml around lines 14 - 31, The
e2e_log4shell_soc.yaml workflow lacks concurrency guards, which allows multiple
simultaneous runs to race and corrupt shared self-hosted runner resources like
k3s and /etc/rancher/k3s/. Add a concurrency block at the top level of the
workflow (alongside the on and jobs sections) that specifies a concurrency group
to ensure only one run executes at a time. The concurrency group should cancel
in-progress runs when a new run is triggered, using a meaningful identifier like
the workflow name or a fixed string to ensure all runs of this workflow
serialize properly on the oracle-vm-16cpu-64gb-x86-64 runner.

Source: Linters/SAST tools

Comment on lines +58 to +65
git clone --depth 1 -b "${{ inputs.soc_ref }}" https://github.com/k8sstormcenter/soc soc
cd soc
make pixie # vizier + AE
make kubescape || true # node-agent (netStreaming)
bash tree/clickhouse-lab/install.sh # forensic_db
make log4j # vulnerable backend + attacker + dx + SBoBs (managed-by=User)
if [ -n "${{ inputs.dx_image }}" ]; then
kubectl -n honey set image ds/dx-daemon dx-daemon="${{ inputs.dx_image }}" || true

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Do not inline-dispatch inputs directly into shell commands.

Lines 58/64/65 interpolate ${{ inputs.* }} inside run:; this is command-injection prone on a self-hosted runner. Move inputs to step env, validate allowed charset, then use quoted env vars.

Proposed hardening
     - name: Deploy the SOC stack (Pixie + kubescape + ClickHouse + AE + dx + log4j chain)
       env:
+        SOC_REF: ${{ inputs.soc_ref }}
+        DX_IMAGE: ${{ inputs.dx_image }}
         PX_CLOUD_ADDR: pixie.austrianopencloudcommunity.org
@@
       run: |
         set -euo pipefail
+        [[ "$SOC_REF" =~ ^[A-Za-z0-9._/-]+$ ]] || { echo "invalid soc_ref"; exit 1; }
+        [[ -z "$DX_IMAGE" || "$DX_IMAGE" =~ ^[A-Za-z0-9._/:+-]+$ ]] || { echo "invalid dx_image"; exit 1; }
         sudo apt-get update -qq && sudo apt-get install -y python3-yaml
-        git clone --depth 1 -b "${{ inputs.soc_ref }}" https://github.com/k8sstormcenter/soc soc
+        git clone --depth 1 -b "$SOC_REF" https://github.com/k8sstormcenter/soc soc
@@
-        if [ -n "${{ inputs.dx_image }}" ]; then
-          kubectl -n honey set image ds/dx-daemon dx-daemon="${{ inputs.dx_image }}" || true
+        if [ -n "$DX_IMAGE" ]; then
+          kubectl -n honey set image ds/dx-daemon dx-daemon="$DX_IMAGE" || true
         fi
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
git clone --depth 1 -b "${{ inputs.soc_ref }}" https://github.com/k8sstormcenter/soc soc
cd soc
make pixie # vizier + AE
make kubescape || true # node-agent (netStreaming)
bash tree/clickhouse-lab/install.sh # forensic_db
make log4j # vulnerable backend + attacker + dx + SBoBs (managed-by=User)
if [ -n "${{ inputs.dx_image }}" ]; then
kubectl -n honey set image ds/dx-daemon dx-daemon="${{ inputs.dx_image }}" || true
git clone --depth 1 -b "$SOC_REF" https://github.com/k8sstormcenter/soc soc
cd soc
make pixie # vizier + AE
make kubescape || true # node-agent (netStreaming)
bash tree/clickhouse-lab/install.sh # forensic_db
make log4j # vulnerable backend + attacker + dx + SBoBs (managed-by=User)
if [ -n "$DX_IMAGE" ]; then
kubectl -n honey set image ds/dx-daemon dx-daemon="$DX_IMAGE" || true
🧰 Tools
🪛 zizmor (1.25.2)

[error] 58-58: code injection via template expansion (template-injection): may expand into attacker-controllable code

(template-injection)


[error] 64-64: code injection via template expansion (template-injection): may expand into attacker-controllable code

(template-injection)


[error] 65-65: code injection via template expansion (template-injection): may expand into attacker-controllable code

(template-injection)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/e2e_log4shell_soc.yaml around lines 58 - 65, The workflow
is directly interpolating `inputs.soc_ref` and `inputs.dx_image` into shell
commands within the run step, creating a command injection vulnerability. Move
these inputs to an env section at the step level, assign them to environment
variables, and validate that they contain only allowed characters (e.g.,
alphanumeric, hyphens, dots for versions). Replace the inline `${{
inputs.soc_ref }}` and `${{ inputs.dx_image }}` references in the git clone and
kubectl commands with properly quoted environment variable references like
`"$SOC_REF"` and `"$DX_IMAGE"` respectively.

Source: Linters/SAST tools

Comment on lines +77 to +80
E4) # boundary collision: 2 rows, same event_time, different RuleID, same target → 1 hash
filt="cp-e4-${R}"
inj --ns aeload --pod "$filt" --rule R0001 --pid 1234 --comm java --event-time "$T" --same-time || true
inj --ns aeload --pod "$filt" --rule R0010 --pid 1234 --comm java --event-time "$T" --same-time || { echo "$rep,$EXP,,,,,,,INJECT_FAIL"|tee -a "$OUT"; continue; }

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

E4 currently masks a failed first fixture insert.

Line 79 uses || true, so E4 can pass even when the first boundary-collision row was never inserted. This weakens the reproducibility assertion for that experiment.

Proposed fix
     E4) # boundary collision: 2 rows, same event_time, different RuleID, same target → 1 hash
       filt="cp-e4-${R}"
-      inj --ns aeload --pod "$filt" --rule R0001 --pid 1234 --comm java --event-time "$T" --same-time || true
+      inj --ns aeload --pod "$filt" --rule R0001 --pid 1234 --comm java --event-time "$T" --same-time \
+        || { echo "$rep,$EXP,,,,,,,INJECT_FAIL"|tee -a "$OUT"; continue; }
       inj --ns aeload --pod "$filt" --rule R0010 --pid 1234 --comm java --event-time "$T" --same-time || { echo "$rep,$EXP,,,,,,,INJECT_FAIL"|tee -a "$OUT"; continue; }
       ;;
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
E4) # boundary collision: 2 rows, same event_time, different RuleID, same target → 1 hash
filt="cp-e4-${R}"
inj --ns aeload --pod "$filt" --rule R0001 --pid 1234 --comm java --event-time "$T" --same-time || true
inj --ns aeload --pod "$filt" --rule R0010 --pid 1234 --comm java --event-time "$T" --same-time || { echo "$rep,$EXP,,,,,,,INJECT_FAIL"|tee -a "$OUT"; continue; }
E4) # boundary collision: 2 rows, same event_time, different RuleID, same target → 1 hash
filt="cp-e4-${R}"
inj --ns aeload --pod "$filt" --rule R0001 --pid 1234 --comm java --event-time "$T" --same-time \
|| { echo "$rep,$EXP,,,,,,,INJECT_FAIL"|tee -a "$OUT"; continue; }
inj --ns aeload --pod "$filt" --rule R0010 --pid 1234 --comm java --event-time "$T" --same-time || { echo "$rep,$EXP,,,,,,,INJECT_FAIL"|tee -a "$OUT"; continue; }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/e2e_test/adaptive_export_loadtest/harness/exp_control.sh` around lines 77
- 80, The first inj command in the E4 test case (line 79) uses || true which
silently suppresses injection failures, masking when the first
boundary-collision row fails to insert. This weakens the test reproducibility
assertion. Apply the same error handling pattern used in the second inj command
(line 80) to the first inj command, replacing the || true with a proper error
handler that logs the INJECT_FAIL status and continues the experiment, ensuring
both fixture injections are properly validated before the test proceeds.

@@ -0,0 +1,5 @@
module aeload

go 1.22

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

What is the latest patched Go release in the 1.22 line, and which GO advisories are fixed there (including GO-2026-4870, GO-2026-4918, GO-2026-4980)?

💡 Result:

As of June 21, 2026, the latest release in the Go 1.22 line is Go 1.22.12 [1][2]. Regarding the specified Go security advisories: - GO-2026-4870: This advisory affects crypto/tls and is noted as affecting versions before go1.25.9, as well as versions from go1.26.0-0 before go1.26.2 [3]. It is not listed as affecting the 1.22 release line. - GO-2026-4918: This advisory concerns net/http and golang.org/x/net/http2 [4][5]. It affects net/http versions before go1.25.10 and versions from go1.26.0-0 before go1.26.3 [4]. It is not listed as affecting the 1.22 release line. - GO-2026-4980: This advisory concerns html/template and affects versions before go1.25.10, as well as versions from go1.26.0-0 before go1.26.3 [6]. It is not listed as affecting the 1.22 release line. Go 1.22 has reached a state where it is no longer receiving the same security updates as newer Go release lines (such as 1.25 and 1.26) [2][3][4][6]. Users are encouraged to upgrade to a supported release version [2].

Citations:


Upgrade to a currently supported Go release.

Go 1.22 is no longer receiving active security updates. The specified Go advisories (GO-2026-4870, GO-2026-4918, GO-2026-4980) do not affect the 1.22 release line but affect newer versions. Upgrade to the latest 1.25 or 1.26 release to ensure the loadgen binaries run on a supported runtime baseline.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/e2e_test/adaptive_export_loadtest/tools/loadgen/go.mod` at line 3, The Go
version directive in the go.mod file is set to version 1.22, which is no longer
receiving active security updates. Update the go directive from version 1.22 to
version 1.26 (or 1.25 if necessary) to ensure the loadgen binaries run on a
currently supported Go runtime baseline and receive ongoing security patches.

Source: Linters/SAST tools

…#67)

ae-prod's boundary handling accumulates the seen-fingerprint set on a no-progress
tick, but if >PollLimit rows share one normalized event_time the SQL
(>= watermark ORDER BY time LIMIT N, no secondary key) returns the same N boundary
rows every poll → rows beyond N at that timestamp are never emitted (infinite
boundary). Detect all-skipped-at-capacity and advance the watermark by 1ns to make
forward progress (fingerprint dedup already tolerates the 1ns overlap).

Cherry-picked the trigger fix + its test from the stale CodeRabbit-chat PR #67
(687851d); dropped that PR's unrelated gen-pod.tmpl.yaml churn. #67 itself is
NOT mergeable (77 commits behind ae-prod, re-adds deleted terraform).
@entlein

entlein commented Jun 22, 2026

Copy link
Copy Markdown
Author

there are indications that it causes PEM to crash, needs RCA

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant