Merged
99 changes: 99 additions & 0 deletions .claude/commands/loki-query.md
@@ -0,0 +1,99 @@
---
description: Query VCell logs from Loki via logcli, across prod/stage/dev (port-forwarded automatically)
---

Query the VCell Loki instance to investigate incidents, monitor failures, or trace user-reported errors. All three environments share one Loki instance and one tenant, so choose the namespace per query.

The wrapper script handles installation, port-forwarding, and tenant headers.

## When invoked

The argument `$ARGUMENTS` is a free-form description of what to look for. Translate it into one or more `logcli query` invocations using the wrapper script:

```
bash tools/loki/loki-query.sh [logcli-args...] '<LogQL selector>'
```

The wrapper auto-runs `tools/loki/setup.sh` (idempotent — installs logcli if missing, starts a port-forward to `loki-read` if not already running, exports `LOKI_ADDR` and `LOKI_ORG_ID`).

If the user hasn't given specific time bounds, default to `--since=15m`. For incidents, narrow the window to ±5–10 min around the reported timestamp using `--from=<RFC3339Z>` and `--to=<RFC3339Z>`. Better Stack timestamps are in HST (UTC-10), so add 10 hours to convert them to UTC.
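
As a rough illustration of the incident window described above, the sketch below turns a Unix epoch timestamp into a `--from`/`--to` pair. The names `epoch_to_rfc3339` and `incident_window` are hypothetical helpers, not part of `tools/loki/`. It assumes the reported time has already been converted to epoch seconds (a local time at UTC-10 is 10 hours behind UTC) and that either a GNU-style or BSD-style `date` is installed.

```bash
# Hypothetical helpers (not part of tools/loki/).

# Format epoch seconds as an RFC3339 UTC timestamp.
epoch_to_rfc3339() {
  if [ "$(date -u -d "@0" +%s 2>/dev/null)" = "0" ]; then
    date -u -d "@$1" +%Y-%m-%dT%H:%M:%SZ   # GNU-style date
  else
    date -u -r "$1" +%Y-%m-%dT%H:%M:%SZ    # BSD/macOS date
  fi
}

# Print --from/--to flags for a window of +/- N minutes (default 10)
# around an epoch timestamp.
incident_window() {
  local epoch="$1" minutes="${2:-10}"
  local from=$(( epoch - minutes * 60 ))
  local to=$(( epoch + minutes * 60 ))
  printf -- '--from=%s --to=%s\n' \
    "$(epoch_to_rfc3339 "$from")" "$(epoch_to_rfc3339 "$to")"
}
```

For example, `incident_window 1777990800 5` prints `--from=2026-05-05T14:15:00Z --to=2026-05-05T14:25:00Z`, a ±5 min window around 14:20 UTC. The output can be passed unquoted to the wrapper ahead of the selector.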

## Picking the namespace

Always include a `namespace=` label in every query. The same Loki instance covers all three VCell environments:

| `namespace=` | Site | URL |
| ------------ | ----------- | ---------------------------- |
| `prod` | Production | `vcell.cam.uchc.edu` |
| `stage` | Staging | `vcell-stage.cam.uchc.edu` |
| `dev` | Development | `vcell-dev.cam.uchc.edu` |

If the user mentions a specific URL or "Better Stack incident", pick `prod` unless they say otherwise. If unsure, ask — don't guess across environments.
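
The table above can be read as a lookup from hostname to namespace. A minimal sketch, assuming the user's request contains one of the three URLs; `namespace_for_host` is a hypothetical name, not something the wrapper provides. It deliberately fails on anything unrecognised so the caller asks instead of guessing.

```bash
# Hypothetical helper: map a URL or hostname to its Loki namespace label.
# Returns non-zero for unknown hosts so the caller can ask the user.
namespace_for_host() {
  case "$1" in
    *vcell-stage.cam.uchc.edu*) echo stage ;;
    *vcell-dev.cam.uchc.edu*)   echo dev ;;
    *vcell.cam.uchc.edu*)       echo prod ;;
    *) return 1 ;;
  esac
}
```

Usage: `ns="$(namespace_for_host "$url")" || echo "which environment?" >&2`.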

## VCell containers (per namespace)

Common to all three:

| `container=` | Service |
| ------------ | ------------------------------------------------ |
| `api` | Legacy Restlet `/api/v0/` (HealthService, BMDB) |
| `data` | SimDataServer - Data RPC consumer |
| `db` | Database RPC consumer |
| `sched` | Simulation dispatcher |
| `submit` | Sim submit / ServerJobDispatcher |
| `webapp` | Angular webapp |
| `export` | Export server |

Environment-specific:
- `prod` only: `rest` (Quarkus `/api/v1/`), `mongodb` (sidecar)
- `stage` only: `sif-prepull` (Apptainer SIF pre-build)

When in doubt, discover with `logcli series` (after `eval "$(bash tools/loki/setup.sh --quiet)"`):
```bash
logcli series --quiet --since=1h '{namespace="dev"}' | grep -oE 'container="[^"]+"' | sort -u
```
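
The discovered names can then be folded into a `container=~"..."` regex. The sketch below is one way to do that; `container_regex` is a hypothetical helper, and it assumes each line of `logcli series` output carries a `container="..."` label pair, as the `grep` above already does.

```bash
# Hypothetical helper: read `logcli series` output on stdin and print the
# distinct container names joined with "|", ready for container=~"...".
container_regex() {
  grep -oE 'container="[^"]+"' | cut -d'"' -f2 | sort -u | paste -sd'|' -
}
```

Usage: `logcli series --quiet --since=1h '{namespace="dev"}' | container_regex`.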

## Useful queries

```bash
# Recent ERRORs across the core VCell service containers in prod
bash tools/loki/loki-query.sh --since=1h \
'{namespace="prod", container=~"api|data|db|sched|submit|rest"} |~ "ERROR"'

# Same sweep, but in dev (note: no `rest` container in dev)
bash tools/loki/loki-query.sh --since=1h \
'{namespace="dev", container=~"api|data|db|sched|submit"} |~ "ERROR"'

# Around a specific incident, raw output for jq parsing
bash tools/loki/loki-query.sh --output=raw --limit=500 \
--from="2026-05-05T14:15:00Z" --to="2026-05-05T14:25:00Z" \
'{namespace="prod", container="data"} |~ "FailoverTransport|Exception"'

# Compare same container across two environments side-by-side
bash tools/loki/loki-query.sh --since=15m --output=raw \
'{namespace=~"prod|stage", container="data"} |~ "ERROR"'

# Activity volume by minute (sanity check that a container is logging at all)
bash tools/loki/loki-query.sh --output=raw --limit=20000 --since=1h \
'{namespace="prod", container="data"}' \
| jq -r '.["@timestamp"][:16] // empty' | sort | uniq -c
```

For raw-output queries, log lines are JSON; useful fields include `["@timestamp"]`, `log_level`, `["log.logger"]`, `["process.thread.name"]`, `message`. Pipe through `jq -r '...'` to extract a clean digest.
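
One possible digest over the fields listed above is sketched here. `log_digest` is a hypothetical helper name. It assumes `jq` is installed and that the field names are exactly those listed above. Lines that are not JSON objects are skipped, and missing fields are shown as `-`.

```bash
# Hypothetical helper: condense raw JSON log lines on stdin into
# tab-separated columns: timestamp, level, logger, message.
log_digest() {
  jq -rR 'fromjson? | objects
          | [.["@timestamp"], .log_level, .["log.logger"], .message]
          | map(. // "-") | @tsv'
}
```

Usage: `bash tools/loki/loki-query.sh --output=raw --since=15m '{namespace="prod", container="data"}' | log_digest`.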

## Workflow

1. Read the user's request and identify: namespace (prod/stage/dev), time window, suspected container(s), keywords.
2. If the request is broad (e.g., "what's wrong with prod"), start with an ERROR-only sweep across all containers in that namespace, then drill in.
3. If the volume is large, save to `/tmp/<topic>.txt` with `--output=raw` and parse with `jq`.
4. Report the finding crisply: timestamp, namespace, thread/logger, key log lines. Cite the file:line in vcell source code if the stack frame is visible.
5. When done, leave the port-forward running (subsequent calls reuse it). If you want to stop it, run `bash tools/loki/teardown.sh`.
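
Steps 1 and 2 amount to assembling a LogQL selector from a namespace, a set of containers, and a keyword. A minimal sketch, assuming the selector shape used in the queries above; `build_selector` is a hypothetical helper, not part of the wrapper. It rejects any namespace other than the three documented ones.

```bash
# Hypothetical helper: build a LogQL selector for a namespace, a
# pipe-separated container list, and a line-filter pattern (default ERROR).
build_selector() {
  local ns="$1" containers="$2" pattern="${3:-ERROR}"
  case "$ns" in
    prod|stage|dev) ;;
    *) echo "unknown namespace: $ns" >&2; return 2 ;;
  esac
  printf '{namespace="%s", container=~"%s"} |~ "%s"\n' "$ns" "$containers" "$pattern"
}
```

Usage: `bash tools/loki/loki-query.sh --since=1h "$(build_selector prod 'api|data|db')"`.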

## First-time setup hints (for the user)

- Requires `kubectl` and a kubeconfig with cluster access. Default path is `~/.kube/kubeconfig_vxrails.yaml`; override via `LOKI_KUBECONFIG`.
- `logcli` is auto-installed only on macOS (via `brew install logcli`). On Linux, install it manually before first use.
- Tenant is `uchc`; the script sets `X-Scope-OrgID` automatically.
- The script bypasses `loki-gateway`, whose ED25519 TLS certificate the LibreSSL-based macOS curl cannot handshake with, by port-forwarding `loki-read` over plain HTTP.

User request: $ARGUMENTS
20 changes: 20 additions & 0 deletions tools/loki/loki-query.sh
@@ -0,0 +1,20 @@
#!/usr/bin/env bash
# Query VCell prod Loki via logcli, after auto-starting the port-forward.
#
# Usage:
# bash tools/loki/loki-query.sh [logcli-args...] '<LogQL selector>'
# bash tools/loki/loki-query.sh --since=15m '{namespace="prod", container="data"} |~ "ERROR"'
# bash tools/loki/loki-query.sh --from="2026-05-05T14:15:00Z" --to="2026-05-05T14:25:00Z" \
# --output=raw --limit=500 '{namespace="prod", container="api"} |~ "HealthService"'
#
# All arguments are passed verbatim to `logcli query`.

set -euo pipefail

DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

# Ensure the port-forward is up; capture its env exports.
EXPORTS="$(bash "$DIR/setup.sh" --quiet)"
eval "$EXPORTS"

exec logcli query --quiet "$@"
101 changes: 101 additions & 0 deletions tools/loki/setup.sh
@@ -0,0 +1,101 @@
#!/usr/bin/env bash
# Configure logcli access to the VCell prod Loki instance.
#
# Idempotent: safe to run multiple times. Installs logcli if missing,
# starts a kubectl port-forward to loki-read in the background if one
# isn't already running on the chosen port, and prints the env vars
# required for logcli.
#
# Usage:
# bash tools/loki/setup.sh # start + print env (default port 3100)
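# bash tools/loki/setup.sh --port 3101 # use a different local port
# bash tools/loki/setup.sh --kubeconfig <path> # override the kubeconfig path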
# eval "$(bash tools/loki/setup.sh --quiet)" # set env in current shell
#
# Required:
# - kubectl in PATH
# - kubeconfig with access to the prod cluster (default: $LOKI_KUBECONFIG
# or ~/.kube/kubeconfig_vxrails.yaml)
# - brew (for first-time logcli install on macOS)

set -euo pipefail

PORT=3100
QUIET=0
KUBECONFIG_PATH="${LOKI_KUBECONFIG:-$HOME/.kube/kubeconfig_vxrails.yaml}"
LOKI_NS="logging"
LOKI_SVC="loki-read"
LOKI_TENANT="uchc"
PF_LABEL="vcell-loki-pf"
PIDFILE="/tmp/${PF_LABEL}.pid"
LOGFILE="/tmp/${PF_LABEL}.log"

while [[ $# -gt 0 ]]; do
  case "$1" in
    --port) PORT="${2:?--port requires a value}"; shift 2 ;;
    --quiet) QUIET=1; shift ;;
    --kubeconfig) KUBECONFIG_PATH="${2:?--kubeconfig requires a path}"; shift 2 ;;
    -h|--help)
      # Print the header comment block, skipping the shebang line.
      sed -n '2,/^$/s/^# \{0,1\}//p' "$0"
      exit 0 ;;
    *) echo "unknown arg: $1" >&2; exit 2 ;;
  esac
done

log() { [[ $QUIET -eq 1 ]] || echo "[loki-setup] $*" >&2; }
die() { echo "[loki-setup] ERROR: $*" >&2; exit 1; }

command -v kubectl >/dev/null || die "kubectl not found in PATH"
[[ -f "$KUBECONFIG_PATH" ]] || die "kubeconfig not found at $KUBECONFIG_PATH (set LOKI_KUBECONFIG)"

# Install logcli on macOS via brew (no-op if already installed).
if ! command -v logcli >/dev/null; then
  log "logcli not found"
  if [[ "$(uname)" == "Darwin" ]] && command -v brew >/dev/null; then
    log "installing logcli via brew..."
    brew install logcli >/dev/null
  else
    die "install logcli manually: https://grafana.com/docs/loki/latest/query/logcli/"
  fi
fi

# Reuse existing port-forward if it's still alive on the same port.
if [[ -f "$PIDFILE" ]]; then
  OLD_PID="$(cat "$PIDFILE")"
  if kill -0 "$OLD_PID" 2>/dev/null \
    && lsof -iTCP:"$PORT" -sTCP:LISTEN -P 2>/dev/null | grep -q "[ ]$OLD_PID[ ]"; then
    log "reusing existing port-forward (pid $OLD_PID, port $PORT)"
  else
    log "stale pidfile, cleaning up"
    rm -f "$PIDFILE"
  fi
fi

# Start a new port-forward if needed.
if [[ ! -f "$PIDFILE" ]]; then
  if lsof -iTCP:"$PORT" -sTCP:LISTEN -P >/dev/null 2>&1; then
    die "port $PORT already in use by another process; pass --port <other>"
  fi
  log "starting port-forward: svc/$LOKI_SVC -n $LOKI_NS :$PORT"
  nohup kubectl --kubeconfig "$KUBECONFIG_PATH" \
    -n "$LOKI_NS" port-forward "svc/$LOKI_SVC" "$PORT:3100" \
    > "$LOGFILE" 2>&1 &
  echo "$!" > "$PIDFILE"
  # Wait up to ~10 s for the local port to accept connections.
  for _ in $(seq 1 20); do
    if curl -fsS "http://localhost:$PORT/ready" >/dev/null 2>&1; then break; fi
    sleep 0.5
  done
  if ! curl -fsS "http://localhost:$PORT/ready" >/dev/null 2>&1; then
    # Don't leave a half-started port-forward or its pidfile behind.
    kill "$(cat "$PIDFILE")" 2>/dev/null || true
    rm -f "$PIDFILE"
    die "port-forward did not become ready; see $LOGFILE"
  fi
fi

# Verify Loki is reachable with the tenant header.
if ! curl -fsS -H "X-Scope-OrgID: $LOKI_TENANT" \
  "http://localhost:$PORT/loki/api/v1/labels" >/dev/null; then
  die "Loki not responding for tenant '$LOKI_TENANT'; port-forward may be stale"
fi

log "ready: tenant=$LOKI_TENANT, port=$PORT"
echo "export LOKI_ADDR=http://localhost:$PORT"
echo "export LOKI_ORG_ID=$LOKI_TENANT"
17 changes: 17 additions & 0 deletions tools/loki/teardown.sh
@@ -0,0 +1,17 @@
#!/usr/bin/env bash
# Stop the loki-read port-forward started by setup.sh.

set -euo pipefail
PF_LABEL="vcell-loki-pf"
PIDFILE="/tmp/${PF_LABEL}.pid"

if [[ -f "$PIDFILE" ]]; then
  PID="$(cat "$PIDFILE")"
  if kill -0 "$PID" 2>/dev/null; then
    kill "$PID"
    echo "[loki-teardown] killed port-forward pid $PID" >&2
  fi
  rm -f "$PIDFILE"
else
  echo "[loki-teardown] no port-forward pidfile; nothing to do" >&2
fi