Document direct kubectl logs as fallback for /loki-query

jcschaff · claude · jcschaff · commit 6d1f44bd8d9e · 2026-05-06T17:02:17.000-04:00
Loki is good at multi-pod sweeps, structured filters, and historical
queries — but not every investigation fits that. Add a section to the
/loki-query slash command covering when and how to read pod logs
directly:

- After a pod crash/exit (--previous), e.g. when JmsFailoverWatchdog
  recycles a wedged pod and we need the prior container's stack trace.
- For real-time tail of a single pod, or windows that have aged out of
  Loki retention.
- For the broker pods (activemqint, activemqsim, artemismq): verified
  during the 2026-05-06 wedge investigation that these run supervisord
  as PID 1 and that the broker process writes its log to a file inside
  the pod (/var/log/activemq/activemq.log for ActiveMQ Classic).
  Neither `kubectl logs` nor Loki sees those events; the only access
  path is `kubectl exec ... cat`. Document this gap explicitly so
  future investigators don't mistake a `kubectl logs` that returns ten
  supervisord lifecycle lines for a dead end.

Mirrors the Loki kubeconfig convention (LOKI_KUBECONFIG →
~/.kube/kubeconfig_vxrails.yaml), provides a discovery command for
actual deployment/statefulset names, and notes the in-pod log rotation
horizon (~14h on activemqint) as a real operational gap worth fixing
by reconfiguring these brokers to log to stdout.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/.claude/commands/loki-query.md b/.claude/commands/loki-query.md
@@ -81,6 +81,82 @@ bash tools/loki/loki-query.sh --output=raw --limit=20000 --since=1h \
 
 For raw-output queries, log lines are JSON; useful fields include `["@timestamp"]`, `log_level`, `["log.logger"]`, `["process.thread.name"]`, `message`. Pipe through `jq -r '...'` to extract a clean digest.
 
+## Direct kubectl logs (when Loki isn't enough)
+
+Use `kubectl logs` directly when:
+- **A pod just crashed/restarted** — `--previous` shows the prior container's logs. Critical when a `JmsFailoverWatchdog` `System.exit` recycles a pod and you need to see why.
+- **Broker pods**: `activemqint` (ActiveMQ Classic, legacy daemon-to-daemon traffic; the counterpart to client-side JMS errors in `data`/`db`/`sched`/`submit`) and `activemqsim` (sim/solver traffic). Loki does not currently scrape these pods, and plain `kubectl logs` shows only supervisord lifecycle lines, so broker events need `kubectl exec` against the in-pod log file (see the broker section below).
+- **Real-time tail** of a single pod, or windows that have aged out of Loki retention.
+
+Setup (same kubeconfig as Loki):
+```bash
+KCFG="${LOKI_KUBECONFIG:-$HOME/.kube/kubeconfig_vxrails.yaml}"
+NS=prod   # or stage / dev
+```
+
+Discover the actual workload names — deployment vs statefulset, exact selectors:
+```bash
+kubectl --kubeconfig "$KCFG" -n "$NS" get deployments,statefulsets,pods \
+  | grep -iE "data|activemq|artemis|api|db|sched|submit"
+```
+
+Common log patterns:
+```bash
+# Last 200 lines from the data pod (the one that wedged on 2026-05-06)
+kubectl --kubeconfig "$KCFG" -n "$NS" logs deployment/data --tail=200
+
+# Real-time tail
+kubectl --kubeconfig "$KCFG" -n "$NS" logs -f deployment/data
+
+# Previous container instance — after a crash, OOM, or watchdog-driven exit
+kubectl --kubeconfig "$KCFG" -n "$NS" logs deployment/data --previous --tail=500
+
+# Time-bounded
+kubectl --kubeconfig "$KCFG" -n "$NS" logs deployment/data --since=10m
+kubectl --kubeconfig "$KCFG" -n "$NS" logs deployment/data --since-time="2026-05-06T06:30:00Z"
+
+```
+
+### Broker pods are special — supervisord hides the broker log
+
+`activemqint`, `activemqsim`, and `artemismq` all run as `Deployments` in `prod` (and the same naming applies in `stage`/`dev` if present), but their container PID 1 is **supervisord**. The broker process writes its log to a file inside the pod, not to stdout. Consequences:
+
+- `kubectl logs deployment/activemqint` returns ~10 lines of supervisord lifecycle, frozen at pod startup (44d old in prod). It does **not** show broker activity.
+- Loki's promtail isn't scraping these pods either (verified by `logcli series '{namespace="prod"}'` — no `activemq`/`artemis` containers appear).
+- The only path to broker events is `kubectl exec` against the in-pod log file.
+
+```bash
+# ActiveMQ Classic (activemqint, activemqsim) — log at /var/log/activemq/activemq.log
+kubectl --kubeconfig "$KCFG" -n "$NS" exec deployment/activemqint -- \
+  tail -n 200 /var/log/activemq/activemq.log
+
+# Around an incident window (in-pod awk filter — efficient on a multi-MB log)
+kubectl --kubeconfig "$KCFG" -n "$NS" exec deployment/activemqint -- \
+  bash -c 'awk "/2026-05-06 06:[34][0-9]/" /var/log/activemq/activemq.log'
+
+# Warnings and errors only
+kubectl --kubeconfig "$KCFG" -n "$NS" exec deployment/activemqint -- \
+  grep -E "WARN|ERROR" /var/log/activemq/activemq.log
+
+# Artemis (artemismq) — different layout; discover the log path
+kubectl --kubeconfig "$KCFG" -n "$NS" exec deployment/artemismq -- \
+  bash -c 'find / -name "*.log" -size +1k 2>/dev/null | head -n 10'
+```
+
+The in-pod ActiveMQ log file rotates after roughly 14h of activity. For older incidents, the broker side is not recoverable — a real operational gap when investigating wedges that took longer than that to manifest.
+
+If `deployment/<name>` doesn't resolve (e.g. a broker is later deployed as a StatefulSet), find the pod with `kubectl get pods` (optionally narrowed with a `-l` label selector) and pass the pod name in place of `deployment/<name>`.
+
+**Worth fixing**: reconfigure these brokers to log to stdout (or have supervisord forward child stdout/stderr). Once stdout flows, both `kubectl logs` and Loki's promtail pick them up automatically — no more `exec` archaeology and no more 14h horizon.
+
+### Putting it together for a JMS-wedge incident
+
+Pull both sides around the same window:
+- **Client side** (`data`/`db`/`sched`/`submit`) via Loki — the `IllegalStateException: Timer already cancelled.` WARNs surface here, but the *triggering* network event happened earlier and silently.
+- **Broker side** (`activemqint`/`activemqsim`) via `kubectl exec` — `Transport Connection to: tcp://<podIP>:<port> failed: java.io.EOFException` lines tell you which client connection actually dropped, and at what timestamp. Compare to client-side reconnect attempts to confirm the wedge was already latent before the trigger.
+
+When to switch back to Loki: multi-pod sweeps, time-range queries spanning multiple containers, structured filters (`|~ "ERROR"`, regex), and historical incidents older than the kubelet's local log retention.
+
 ## Workflow
 
 1. Read the user's request and identify: namespace (prod/stage/dev), time window, suspected container(s), keywords.