Document direct kubectl logs as fallback for /loki-query
Loki is good at multi-pod sweeps, structured filters, and historical
queries — but not every investigation fits that. Add a section to the
/loki-query slash command covering when and how to read pod logs
directly:
- After a pod crash/exit (--previous), e.g. when JmsFailoverWatchdog
recycles a wedged pod and we need the prior container's stack trace.
- For real-time tail of a single pod, or windows that have aged out of
Loki retention.
- For the broker pods (activemqint, activemqsim, artemismq) — verified
during the 2026-05-06 wedge investigation that these run supervisord
as PID 1 and the actual broker logs to a file inside the pod
(/var/log/activemq/activemq.log for ActiveMQ Classic). Neither
`kubectl logs` nor Loki sees those events; the only access path is
`kubectl exec ... cat`. Document this gap explicitly so future
investigators don't waste time on a `kubectl logs` that returns ten
supervisord lifecycle lines and call it a dead end.
Mirrors the Loki kubeconfig convention (LOKI_KUBECONFIG →
~/.kube/kubeconfig_vxrails.yaml), provides a discovery command for
actual deployment/statefulset names, and notes the in-pod log rotation
horizon (~14h on activemqint) as a real operational gap worth fixing
by reconfiguring these brokers to log to stdout.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
For raw-output queries, log lines are JSON; useful fields include `["@timestamp"]`, `log_level`, `["log.logger"]`, `["process.thread.name"]`, `message`. Pipe through `jq -r '...'` to extract a clean digest.
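As a rough sketch of such a digest, the filter below prints one tab-separated line per log record using the fields named above. The function name `log_digest` is made up for illustration, and the sketch assumes every input line is a valid JSON object.

```bash
# Hypothetical helper: read JSON log lines on stdin, print a tab-separated digest.
# Assumption: every input line is a JSON object; missing fields come out empty.
log_digest() {
  jq -r '[.["@timestamp"], .log_level, .["log.logger"], .["process.thread.name"], .message] | @tsv'
}
```

Pipe raw query output into `log_digest` to get one line per event; swap `@tsv` for a custom string if a different layout reads better.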
## Direct kubectl logs (when Loki isn't enough)

Use `kubectl logs` directly when:
- **A pod just crashed or restarted**: `--previous` shows the prior container's logs. Critical when a `JmsFailoverWatchdog` `System.exit` recycles a pod and you need to see why.
- **Broker pods**: `activemqint` (ActiveMQ Classic, legacy daemon-to-daemon traffic; the counterpart to client-side JMS errors in `data`/`db`/`sched`/`submit`) and `activemqsim` (sim/solver traffic). These may not be in Loki's collection set; check Loki first and fall back here.
- **Real-time tail** of a single pod, or windows that have aged out of Loki retention.
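The first and last cases above might look like the following sketch. The function names and the `KUBECTL` override are hypothetical (the override only exists so the commands can be dry-run), and the kubeconfig default is one reading of the `LOKI_KUBECONFIG` convention, not a confirmed setting.

```bash
# Hypothetical helpers. KUBECTL can be overridden (e.g. KUBECTL=echo) for a dry run.
# Assumption: LOKI_KUBECONFIG falls back to ~/.kube/kubeconfig_vxrails.yaml.

# Logs of the previous (crashed or recycled) container of one pod.
previous_logs() {   # usage: previous_logs <namespace> <pod>
  "${KUBECTL:-kubectl}" \
    --kubeconfig "${LOKI_KUBECONFIG:-$HOME/.kube/kubeconfig_vxrails.yaml}" \
    -n "$1" logs "$2" --previous
}

# Follow one pod in real time, starting from a recent window (default 10m).
tail_pod() {        # usage: tail_pod <namespace> <pod> [since]
  "${KUBECTL:-kubectl}" \
    --kubeconfig "${LOKI_KUBECONFIG:-$HOME/.kube/kubeconfig_vxrails.yaml}" \
    -n "$1" logs "$2" -f --since="${3:-10m}"
}
```

For example, `previous_logs prod <pod>` right after a watchdog recycle, or `tail_pod prod <pod> 30m` while reproducing an issue. Multi-container pods would additionally need `-c <container>`.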
### Broker pods are special: supervisord hides the broker log

`activemqint`, `activemqsim`, and `artemismq` all run as `Deployments` in `prod` (and the same naming applies in `stage`/`dev` if present), but their container PID 1 is **supervisord**. The broker process itself writes its log to a file inside the pod, not to stdout. Consequences:
- `kubectl logs deployment/activemqint` returns ~10 lines of supervisord lifecycle, frozen at pod startup (44d old in prod). It does **not** show broker activity.
- Loki's promtail isn't scraping these pods either (verified by `logcli series '{namespace="prod"}'`; no `activemq`/`artemis` containers appear).
- The only path to broker events is `kubectl exec` against the in-pod log file.
```bash
# ActiveMQ Classic (activemqint, activemqsim): log at /var/log/activemq/activemq.log
kubectl -n prod exec deployment/activemqint -- cat /var/log/activemq/activemq.log
```
The in-pod ActiveMQ log file rotates after roughly 14h of activity. For older incidents, the broker side is not recoverable — a real operational gap when investigating wedges that took longer than that to manifest.
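Before digging into a broker log, it can help to check whether the incident window is still inside the file. The sketch below prints the first and last timestamps seen on stdin. `log_window` is a hypothetical name, and the `YYYY-MM-DD HH:MM:SS,mmm` line prefix is an assumption about the broker's log layout that should be checked against a real line.

```bash
# Hypothetical helper: print "<first timestamp> -> <last timestamp>" for a log on stdin.
# Assumption: log lines start with "YYYY-MM-DD HH:MM:SS,mmm"; other lines
# (stack-trace continuations, blank lines) are ignored.
log_window() {
  awk '/^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] / {
         ts = $1 " " $2
         if (first == "") first = ts
         last = ts
       }
       END { if (first != "") print first " -> " last }'
}
```

Feed it the output of the `kubectl exec ... cat` path described in this section; if the incident predates the first timestamp, the broker side has already rotated away.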
If `deployment/<name>` doesn't resolve (e.g. a broker is later deployed as a StatefulSet), discover the pod by label or by listing pods.
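One way to do that discovery, sketched under the assumption that broker workloads keep `activemq` or `artemis` at the start of their names: list everything with `-o name` and filter. `broker_workloads` is a hypothetical helper name.

```bash
# Hypothetical helper: keep only broker-looking entries from `kubectl get ... -o name`
# output on stdin (lines such as "deployment.apps/activemqint" or "pod/<name>").
# Assumption: broker names begin with "activemq" or "artemis".
broker_workloads() {
  grep -E '/(activemq|artemis)'
}
```

Typical use would be `kubectl -n prod get deployments,statefulsets,pods -o name | broker_workloads`, which shows at a glance whether a broker is a Deployment or a StatefulSet and what its pods are called.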
**Worth fixing**: reconfigure these brokers to log to stdout (or have supervisord forward child stdout/stderr). Once stdout flows, both `kubectl logs` and Loki's promtail pick them up automatically: no more `exec` archaeology and no more 14h horizon.

### Putting it together for a JMS-wedge incident

Pull both sides around the same window:
- **Client side** (`data`/`db`/`sched`/`submit`) via Loki: the `IllegalStateException: Timer already cancelled.` WARNs surface here, but the *triggering* network event happened earlier and silently.
- **Broker side** (`activemqint`/`activemqsim`) via `kubectl exec`: `Transport Connection to: tcp://<podIP>:<port> failed: java.io.EOFException` lines tell you which client connection actually dropped, and at what timestamp. Compare to client-side reconnect attempts to confirm the wedge was already latent before the trigger.
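To pull just the dropped-connection events out of a broker log, something like the sketch below may be enough. `dropped_connections` is a hypothetical name, and it assumes the timestamp occupies the first two whitespace-separated fields of each line, which should be confirmed against the actual log format.

```bash
# Hypothetical helper: from broker log text on stdin, print
# "<date> <time> <peer ip:port>" for every dropped transport connection.
# Assumption: each log line starts with "<date> <time>" as its first two fields.
dropped_connections() {
  awk 'match($0, /Transport Connection to: tcp:\/\/[^ ]+ failed/) {
         hit = substr($0, RSTART, RLENGTH)   # "Transport Connection to: tcp://<peer> failed"
         sub(/^.*tcp:\/\//, "", hit)         # drop everything up to the peer address
         sub(/ failed$/, "", hit)            # drop the trailing " failed"
         print $1, $2, hit
       }'
}
```

The resulting peer addresses can then be matched against client pod IPs, and the timestamps against the client-side WARNs found via Loki.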
When to switch back to Loki: multi-pod sweeps, time-range queries spanning multiple containers, structured filters (`|~ "ERROR"`, regex), and historical incidents older than the kubelet's local log retention.

## Workflow
1. Read the user's request and identify: namespace (prod/stage/dev), time window, suspected container(s), keywords.