Logs, metrics, runtime snapshots, backtraces.
The server uses tracing with JSON output by default per
config/dev.toml. Override at runtime via env:

    RUST_LOG=brain_server=debug cargo run --bin brain-server -- --config config/dev.toml

    RUST_LOG=brain_storage=trace,brain_server=info \
      cargo run --bin brain-server -- --config config/dev.toml

    BRAIN_LOG=info,brain_server::network=debug \
      cargo run --bin brain-server -- --config config/dev.toml

Filter precedence: BRAIN_LOG > RUST_LOG > [logging] level.
Valid levels: error, warn, info, debug, trace.
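The precedence rule can be sketched in a few lines of Python. This is an illustration only, not the server's actual resolution code: `effective_filter`, `directive_levels`, and `invalid_levels` are hypothetical helpers, and treating an empty variable as unset is an assumption. Only the two directive forms shown above (`level` and `target=level`) are handled.

```python
VALID_LEVELS = {"error", "warn", "info", "debug", "trace"}


def effective_filter(env, config_level):
    """Return (source, filter) following BRAIN_LOG > RUST_LOG > [logging] level.

    Assumption: an empty or whitespace-only variable counts as unset.
    """
    for var in ("BRAIN_LOG", "RUST_LOG"):
        value = env.get(var, "").strip()
        if value:
            return var, value
    return "[logging] level", config_level


def directive_levels(filter_string):
    """Extract the level from each comma-separated directive.

    "info" -> "info"; "brain_server::network=debug" -> "debug".
    Other directive forms are not handled by this sketch.
    """
    levels = []
    for directive in filter_string.split(","):
        directive = directive.strip()
        if directive:
            levels.append(directive.rsplit("=", 1)[-1].lower())
    return levels


def invalid_levels(filter_string):
    """Return the levels in the filter that are not in VALID_LEVELS."""
    return [lvl for lvl in directive_levels(filter_string) if lvl not in VALID_LEVELS]
```

Calling `effective_filter(os.environ, "info")` before starting the server shows which of the three sources would win, and `invalid_levels` catches a mistyped level name early.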
Verify the filter took effect:

    curl -s http://127.0.0.1:9091/metrics | grep "^# HELP" | head -3

If BRAIN_LOG=info,brain_server::network=debug is set, you should
see DEBUG lines from target = "brain_server::network::..." in
the server log while serving requests.
Log lines are newline-delimited JSON objects. Pipe through jq:

    cargo run --bin brain-server -- --config config/dev.toml 2>&1 \
      | jq 'select(.level == "ERROR")'

    cargo run ... 2>&1 \
      | jq 'select(.fields.shard != null) | {shard: .fields.shard, msg: .fields.message}'

    cargo run ... 2>&1 \
      | jq 'select(.span.name == "brain.request")'

(The third example surfaces every per-request span from the Phase 12.3 OTel instrumentation.)
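Where jq is not available, or when non-JSON lines may share the stream (merging stderr with `2>&1` can mix cargo's own messages into the output), the first two filters can be approximated in Python. This is a minimal sketch: the field paths `.level`, `.fields.shard`, and `.fields.message` come from the jq filters above, while the function names are hypothetical.

```python
import json


def parse_log_lines(lines):
    """Yield one dict per line that parses as a JSON object.

    Lines that are not JSON objects are skipped instead of raising.
    """
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict):
            yield record


def errors_only(records):
    """Rough equivalent of: jq 'select(.level == "ERROR")'."""
    return [r for r in records if r.get("level") == "ERROR"]


def shard_messages(records):
    """Rough equivalent of the second filter: shard and message of shard-tagged lines."""
    out = []
    for r in records:
        fields = r.get("fields") or {}
        if fields.get("shard") is not None:
            out.append({"shard": fields["shard"], "msg": fields.get("message")})
    return out
```

Passing `sys.stdin` to `parse_log_lines` lets the script sit at the end of the same pipeline as jq.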
debug-snapshot gives a point-in-time view of one shard's worker
state without stopping the server:

    cargo run --bin brain-cli -- --output json debug-snapshot --shard 0 \
      | jq '.workers[] | select(.errors > 0)'

If any worker has reported errors, this surfaces them.
Verify: a healthy 30-minute-old server should print nothing (all workers
have errors == 0).
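For scripted health checks, the same test can return an exit code instead of printed output. A minimal sketch, assuming only what the jq filter above implies about the snapshot (a top-level `workers` array whose entries carry an `errors` count); the function names are hypothetical and any other fields are passed through untouched.

```python
import json


def workers_with_errors(snapshot_json):
    """Return the worker entries whose `errors` count is greater than zero.

    Mirrors: jq '.workers[] | select(.errors > 0)'.
    A missing or null `errors` field is treated as 0.
    """
    snapshot = json.loads(snapshot_json)
    return [w for w in snapshot.get("workers", []) if (w.get("errors") or 0) > 0]


def exit_code(snapshot_json):
    """0 when every worker is clean, 1 when at least one has reported errors."""
    return 1 if workers_with_errors(snapshot_json) else 0
```

Piping the CLI's JSON output into a script that calls `sys.exit(exit_code(sys.stdin.read()))` makes the check usable from cron or CI.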
The /metrics endpoint exposes Prometheus text-format output:

    curl -s http://127.0.0.1:9091/metrics | head -40

Key metrics:

    brain_up                          1 when accepting requests
    brain_shards_total                configured shard count
    brain_connections_active          in-flight client connections
    brain_connections_total           total accepted since startup
    brain_connections_closed_total{reason="bye|protocol_error|timeout|eof|fatal"}
    brain_frame_send_total            outbound frames since startup
    brain_frame_recv_total            inbound frames since startup
    brain_request_total{op,status}    per-op + per-status counter
    brain_request_active{op}          per-op in-flight gauge
    brain_request_duration_ms_*       per-op latency histogram
    brain_worker_cycles_total         worker run count per worker per shard
    brain_worker_errors_total         worker error count per worker per shard
    brain_worker_last_run_unixtime    unix timestamp of last worker cycle
    brain_hnsw_node_count             active HNSW nodes
    brain_hnsw_tombstone_count        tombstoned HNSW nodes
    brain_hnsw_tombstone_ratio        tombstone / total
    process_cpu_seconds_total         cumulative CPU time
    process_memory_resident_bytes     resident set size
    process_open_fds                  open file descriptors
    process_uptime_seconds            server uptime
Full taxonomy in
docs/guides/observability.md.
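When grep is too coarse, the scrape can be parsed into structured samples. The sketch below is a simplified reader for the Prometheus text format, not a full implementation: escape sequences in label values are left as-is and exemplars are ignored. `parse_metrics` and `sum_family` are hypothetical names; the `op` and `status` labels come from the list above.

```python
import re

# name, optional {labels}, then the value (a trailing timestamp is ignored)
_SAMPLE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse Prometheus text-format samples into (name, labels, value) tuples.

    Comment lines (# HELP, # TYPE), blank lines, and lines that do not
    look like samples are skipped.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE.match(line)
        if not match:
            continue
        name, raw_labels, raw_value = match.groups()
        try:
            value = float(raw_value)
        except ValueError:
            continue
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, value))
    return samples


def sum_family(samples, name, **labels):
    """Sum every series of one metric, optionally restricted by label values."""
    return sum(
        value
        for sample_name, sample_labels, value in samples
        if sample_name == name
        and all(sample_labels.get(k) == v for k, v in labels.items())
    )
```

For example, `sum_family(samples, "brain_request_total", op="encode")` adds up the encode counters across every status.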
Verify metric counts after an ENCODE:

    # baseline
    curl -s http://127.0.0.1:9091/metrics | grep 'brain_request_total{op="encode",status="success"}'
    # run ENCODE via SDK (or example)
    cargo run --example store_and_recall -p brain-sdk-rust
    # after
    curl -s http://127.0.0.1:9091/metrics | grep 'brain_request_total{op="encode",status="success"}'

The second value should be higher than the first by exactly the number of successful encodes.
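The before/after comparison can also be computed rather than read by eye. A minimal sketch with hypothetical helper names: it takes two saved scrapes as text and matches the series literally, so it shares the grep commands' dependence on label order. Treating a series that is absent from the first scrape as 0 is an assumption.

```python
def series_value(metrics_text, series):
    """Return the value of the sample line that starts with `series`, or None.

    `series` is matched literally, including its label block.
    """
    for line in metrics_text.splitlines():
        if line.startswith(series + " "):
            return float(line[len(series):].split()[0])
    return None


def counter_delta(before_text, after_text, series):
    """Increase of a counter between two scrapes.

    Assumption: a series missing from a scrape counts as 0. A negative
    result means the counter went backwards, e.g. after a server restart.
    """
    before = series_value(before_text, series) or 0.0
    after = series_value(after_text, series) or 0.0
    return after - before
```

The two inputs can come from files saved with curl, or from `urllib.request.urlopen(...).read().decode()` against the /metrics URL.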
The container sets RUST_BACKTRACE=1 automatically. For a full
backtrace:

    RUST_BACKTRACE=full cargo run --bin brain-server -- --config config/dev.toml

The panic message and the full stack trace appear on stderr.
To run a single test with its output visible:

    cargo test -p brain-storage --lib -- arena::tests::crc_mismatch_halts --nocapture

--nocapture shows println! / eprintln! output as the test runs. Without it,
the test harness captures that output and prints it only for tests that fail.
brain-storage is the only crate allowed to use unsafe. Run its
tests under Miri to check the unsafe blocks for undefined behavior (UB).
Syscall-bound paths (mmap, pwritev2) are excluded through
#[cfg(miri)] gating; the ~47 pure-data tests run:

    cargo +nightly miri test -p brain-storage --lib

Failures here are real soundness bugs; surface them immediately.
If [tracing] enabled = true and the OTLP collector is reachable,
every request emits a brain.request span. To check that the collector
answers on its OTLP/HTTP endpoint:

    curl -s http://localhost:4318/v1/traces -X POST -H 'Content-Type: application/json' -d '{}'

In a Jaeger / Tempo UI, search by service name (brain-server) and
sort by duration to find slow operations. See
docs/guides/observability.md §4.
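The collector probe can be scripted with the Python standard library. This is a sketch of the same request the curl command sends, with hypothetical function names; the default URL is taken from that command. A status code only shows that the endpoint answers, not that spans from brain-server are arriving.

```python
import urllib.error
import urllib.request


def build_probe_request(base_url="http://localhost:4318"):
    """Build the same request as the curl command: POST {} to /v1/traces."""
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/traces",
        data=b"{}",
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def collector_status(base_url="http://localhost:4318", timeout=2.0):
    """Return the HTTP status code, or None if the collector is unreachable."""
    try:
        with urllib.request.urlopen(build_probe_request(base_url), timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The endpoint is reachable but rejected the payload.
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return None
```

A `None` result points at networking or a collector that is not running; any numeric status means the endpoint is at least listening.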
See 09-troubleshooting.md for common issues
and how to resolve them.