Commit 6d1f44b

jcschaff and claude committed
Document direct kubectl logs as fallback for /loki-query
Loki is good at multi-pod sweeps, structured filters, and historical queries, but not every investigation fits that. Add a section to the /loki-query slash command covering when and how to read pod logs directly:

- After a pod crash/exit (--previous), e.g. when JmsFailoverWatchdog recycles a wedged pod and we need the prior container's stack trace.
- For real-time tail of a single pod, or windows that have aged out of Loki retention.
- For the broker pods (activemqint, activemqsim, artemismq): verified during the 2026-05-06 wedge investigation that these run supervisord as PID 1 and the actual broker logs to a file inside the pod (/var/log/activemq/activemq.log for ActiveMQ Classic). Neither `kubectl logs` nor Loki sees those events; the only access path is `kubectl exec ... cat`. Document this gap explicitly so future investigators don't waste time on a `kubectl logs` that returns ten supervisord lifecycle lines and call it a dead end.

Mirrors the Loki kubeconfig convention (LOKI_KUBECONFIG → ~/.kube/kubeconfig_vxrails.yaml), provides a discovery command for actual deployment/statefulset names, and notes the in-pod log rotation horizon (~14h on activemqint) as a real operational gap worth fixing by reconfiguring these brokers to log to stdout.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 17b40d8 commit 6d1f44b

1 file changed

Lines changed: 76 additions & 0 deletions

.claude/commands/loki-query.md

@@ -81,6 +81,82 @@ bash tools/loki/loki-query.sh --output=raw --limit=20000 --since=1h \

For raw-output queries, log lines are JSON; useful fields include `["@timestamp"]`, `log_level`, `["log.logger"]`, `["process.thread.name"]`, `message`. Pipe through `jq -r '...'` to extract a clean digest.

## Direct kubectl logs (when Loki isn't enough)

Use `kubectl logs` directly when:
- **A pod just crashed/restarted**: `--previous` shows the prior container's logs. Critical when a `JmsFailoverWatchdog` `System.exit` recycles a pod and you need to see why.
- **Broker pods**: `activemqint` (ActiveMQ Classic, legacy daemon-to-daemon traffic; counterpart to client-side JMS errors in `data`/`db`/`sched`/`submit`) and `activemqsim` (sim/solver traffic). These may not be in Loki's collection set; check Loki first and fall back here. Plain `kubectl logs` shows only supervisord output for these pods, so use the `kubectl exec` path from the broker section below.
- **Real-time tail** of a single pod, or windows that have aged out of Loki retention.

Setup (same kubeconfig as Loki):
```bash
KCFG="${LOKI_KUBECONFIG:-$HOME/.kube/kubeconfig_vxrails.yaml}"
NS=prod   # or stage / dev
```
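
Every command in this section repeats the same `--kubeconfig` and `-n` flags, so a small wrapper can cut the noise. This is only a sketch and not part of the existing tooling: the function name `k` is hypothetical, and it falls back to the same defaults as the setup step when `KCFG` and `NS` are not already set.

```bash
# Hypothetical helper (not part of tools/loki): wraps kubectl with the
# kubeconfig and namespace from the setup step.
KCFG="${KCFG:-${LOKI_KUBECONFIG:-$HOME/.kube/kubeconfig_vxrails.yaml}}"
NS="${NS:-prod}"

k() {
  kubectl --kubeconfig "$KCFG" -n "$NS" "$@"
}

# Usage: k logs deployment/data --tail=200
```

The examples that follow keep the long form so each one can be copied on its own.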

Discover the actual workload names (deployment vs statefulset, exact selectors):
```bash
kubectl --kubeconfig "$KCFG" -n "$NS" get deployments,statefulsets,pods \
  | grep -iE "data|activemq|artemis|api|db|sched|submit"
```

Common log patterns:
```bash
# Last 200 lines from the data pod (the one that wedged on 2026-05-06)
kubectl --kubeconfig "$KCFG" -n "$NS" logs deployment/data --tail=200

# Real-time tail
kubectl --kubeconfig "$KCFG" -n "$NS" logs -f deployment/data

# Previous container instance: after a crash, OOM, or watchdog-driven exit
kubectl --kubeconfig "$KCFG" -n "$NS" logs deployment/data --previous --tail=500

# Time-bounded
kubectl --kubeconfig "$KCFG" -n "$NS" logs deployment/data --since=10m
kubectl --kubeconfig "$KCFG" -n "$NS" logs deployment/data --since-time="2026-05-06T06:30:00Z"
```

### Broker pods are special: supervisord hides the broker log

`activemqint`, `activemqsim`, and `artemismq` all run as `Deployments` in `prod` (and the same naming applies in `stage`/`dev` if present), but their container PID 1 is **supervisord**. The actual broker logs to a file inside the pod, not stdout. Consequences:

- `kubectl logs deployment/activemqint` returns ~10 lines of supervisord lifecycle, frozen at pod startup (44d old in prod at the time of writing). It does **not** show broker activity.
- Loki's promtail isn't scraping these pods either (verified by `logcli series '{namespace="prod"}'`: no `activemq`/`artemis` containers appear).
- The only path to broker events is `kubectl exec` against the in-pod log file.

```bash
# ActiveMQ Classic (activemqint, activemqsim): log at /var/log/activemq/activemq.log
kubectl --kubeconfig "$KCFG" -n "$NS" exec deployment/activemqint -- \
  tail -n 200 /var/log/activemq/activemq.log

# Around an incident window (in-pod awk filter; efficient on a multi-MB log)
kubectl --kubeconfig "$KCFG" -n "$NS" exec deployment/activemqint -- \
  bash -c 'awk "/2026-05-06 06:[34][0-9]/" /var/log/activemq/activemq.log'

# Errors only
kubectl --kubeconfig "$KCFG" -n "$NS" exec deployment/activemqint -- \
  grep -E "WARN|ERROR" /var/log/activemq/activemq.log

# Artemis (artemismq): different layout; discover the log path
kubectl --kubeconfig "$KCFG" -n "$NS" exec deployment/artemismq -- \
  bash -c 'find / -name "*.log" -size +1k 2>/dev/null | head -10'
```
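
Before assuming another workload follows this pattern, it can help to check what its container actually runs as PID 1. A minimal sketch, assuming a Linux container with `cat` available and `KCFG`/`NS` set as in the setup step; the function name `pid1_cmdline` is made up for illustration.

```bash
# Hypothetical helper: print the command line of PID 1 in a workload's
# container. /proc/1/cmdline is NUL-separated, so the NULs are turned
# into spaces locally (no shell is needed inside the pod).
pid1_cmdline() {
  kubectl --kubeconfig "$KCFG" -n "$NS" exec "$1" -- cat /proc/1/cmdline \
    | tr '\0' ' ' | sed 's/ *$//'
}

# Usage: pid1_cmdline deployment/activemqint
```

If the output mentions `supervisord`, expect `kubectl logs` to show only lifecycle lines and go straight to the in-pod log file.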

The in-pod ActiveMQ log file rotates after roughly 14h of activity. For older incidents, the broker side is not recoverable, which is a real operational gap when investigating wedges that took longer than that to manifest.
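
Given that horizon, it may be worth copying the broker log out of the pod as soon as an incident is suspected, before rotation discards it. A rough sketch using the same `kubectl exec ... cat` path; the function name `snapshot_broker_log` and the output file naming are illustrative assumptions, and `KCFG`/`NS` come from the setup step.

```bash
# Hypothetical helper: save a timestamped local copy of the in-pod
# ActiveMQ Classic log. Args: workload (e.g. deployment/activemqint),
# optional output directory (defaults to the current directory).
snapshot_broker_log() {
  local workload="$1" outdir="${2:-.}"
  local stamp out
  stamp="$(date -u +%Y%m%dT%H%M%SZ)"
  out="$outdir/${workload##*/}-$stamp.log"
  kubectl --kubeconfig "$KCFG" -n "$NS" exec "$workload" -- \
    cat /var/log/activemq/activemq.log > "$out" || return 1
  printf '%s\n' "$out"   # report where the copy landed
}

# Usage: snapshot_broker_log deployment/activemqint ~/incidents
```

The copy only covers what is still in the file at that moment, so it narrows the gap rather than closing it.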

If `deployment/<name>` doesn't resolve (e.g. a broker is later deployed as a StatefulSet), discover the pod by label or by listing pods.
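
One way to do the pod-listing variant is sketched below. `kubectl get pods -o name` prints `pod/<name>` entries; the function name `find_pod` is hypothetical, and matching on a name substring is an assumption about how these pods are named. A label selector (`-l <key>=<value>`) would be more precise once the label keys in use are known.

```bash
# Hypothetical helper: print the first pod whose name contains the
# given substring, in the pod/<name> form that kubectl exec/logs accept.
find_pod() {
  kubectl --kubeconfig "$KCFG" -n "$NS" get pods -o name | grep -m1 -- "$1"
}

# Usage:
#   pod="$(find_pod activemqint)"
#   kubectl --kubeconfig "$KCFG" -n "$NS" exec "$pod" -- \
#     tail -n 200 /var/log/activemq/activemq.log
```

The function returns a non-zero status when nothing matches, so a caller can stop before running `exec` against an empty name.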

**Worth fixing**: reconfigure these brokers to log to stdout (or have supervisord forward child stdout/stderr). Once stdout flows, both `kubectl logs` and Loki's promtail pick them up automatically, with no more `exec` archaeology and no more 14h horizon.

### Putting it together for a JMS-wedge incident

Pull both sides around the same window:
- **Client side** (`data`/`db`/`sched`/`submit`) via Loki: the `IllegalStateException: Timer already cancelled.` WARNs surface here, but the *triggering* network event happened earlier and silently.
- **Broker side** (`activemqint`/`activemqsim`) via `kubectl exec`: `Transport Connection to: tcp://<podIP>:<port> failed: java.io.EOFException` lines tell you which client connection actually dropped, and at what timestamp. Compare to client-side reconnect attempts to confirm the wedge was already latent before the trigger.
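
To line the two sides up, the broker-side drops can be reduced to a timestamp plus client endpoint. This is a sketch only: the function name `broker_drops` is made up, and the pattern assumes each log line starts with a `YYYY-MM-DD HH:MM:SS,mmm` style timestamp (consistent with the awk filter used earlier) and contains the `Transport Connection to: ... failed: java.io.EOFException` text quoted in the bullet. Adjust it if the broker's log layout differs.

```bash
# Hypothetical filter: read broker log lines on stdin and print
# "<timestamp> <clientIP:port>" for each dropped transport connection.
broker_drops() {
  sed -nE 's#^([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9:,.]+).*Transport Connection to: tcp://([^ ]+) failed: java\.io\.EOFException.*#\1 \2#p'
}

# Usage:
#   kubectl --kubeconfig "$KCFG" -n "$NS" exec deployment/activemqint -- \
#     cat /var/log/activemq/activemq.log | broker_drops
```

Each output line can then be matched by pod IP against the client pods and by timestamp against the client-side WARNs pulled from Loki.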

When to switch back to Loki: multi-pod sweeps, time-range queries spanning multiple containers, structured filters (`|~ "ERROR"`, regex), and historical incidents older than the kubelet's local log retention.

## Workflow

1. Read the user's request and identify: namespace (prod/stage/dev), time window, suspected container(s), keywords.
