
Observability

Summary

This document covers how we monitor the Knowledge Graph services (Indexer, Dispatcher, and Web Service): metrics, logs, tracing, health checks, and what operators need to do for self-managed deployments.

How Observability Works Today (GitLab.com)

On GitLab.com, the Knowledge Graph services use the existing observability stack:

SLIs, SLOs, and Metrics

Alerting is based on SLOs and SLIs:

  • Availability SLO: We will adopt the Dedicated reference of ≥99.5% monthly SLO for the GKG API plane (excluding planned maintenance).
  • Availability SLIs: We will use error-rate and Apdex SLIs on request latency.
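As a minimal sketch of how these SLIs reduce to ratios over request counts (the production thresholds and error-budget math live in the SLO tooling, not here):

```python
def error_rate(errors: int, total: int) -> float:
    """Error-rate SLI: share of requests that failed."""
    return 0.0 if total == 0 else errors / total

def apdex(satisfied: int, tolerating: int, total: int) -> float:
    """Apdex SLI on request latency: satisfied requests count fully,
    tolerating requests count half, frustrated requests count zero."""
    if total == 0:
        return 1.0
    return (satisfied + tolerating / 2) / total
```

For example, 90 satisfied and 10 tolerating requests out of 100 yield an Apdex of 0.95.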

Each service exposes a Prometheus /metrics endpoint. We use LabKit for instrumentation where possible.

Reliability Signals:

  • Ingest Lag: The end-to-end delay from a row being written to the Postgres WAL, through Siphon processing and NATS delivery, to ingestion into ClickHouse.
  • Consumer Health: Monitoring NATS JetStream delivery/ack rates, dead-letter counts, and stream retention headroom.

KG Web Server (Indexer and Web Service):

  • HTTP error and success rate by HTTP method and path
  • HTTP latency (p50, p95, p99) by HTTP method and path
  • gRPC error and success rate by RPC method
  • gRPC latency (p50, p95, p99) by RPC method
  • gRPC bidi stream duration for ExecuteQuery RPCs
  • Redaction exchange latency (time spent waiting for Rails authorization responses)

KG Indexer Service:

The indexer emits metrics under five OpenTelemetry meters: gkg_etl for the core engine, gkg_scheduler for the scheduled task loop, gkg_indexer_sdlc for the SDLC module, gkg_indexer_code for the code indexing module, and gkg_indexer_namespace_deletion for the namespace deletion module. All duration histograms use OTel-recommended buckets (5 ms to 10 s).

Engine metrics (gkg_etl):

| Metric | Type | Unit | Labels | Description |
|---|---|---|---|---|
| gkg.etl.messages.processed | Counter | count | topic, outcome (ack/nack/term/dead_letter) | Total messages processed |
| gkg.etl.message.duration | Histogram | s | topic | End-to-end time per message through dispatch |
| gkg.etl.handler.duration | Histogram | s | handler | Time inside each handler's handle() call |
| gkg.etl.handler.errors | Counter | count | handler, error_kind | Handler errors at the engine dispatch level |
| gkg.etl.permit.wait.duration | Histogram | s | permit_kind (global/group), group | Time waiting for a worker pool permit |
| gkg.etl.permits.active | UpDownCounter | count | permit_kind | Worker permits currently held (global or per concurrency group) |
| gkg.etl.nats.fetch.duration | Histogram | s | outcome (success/error) | Time to fetch a batch from NATS |
| gkg.etl.destination.write.duration | Histogram | s | table | Time to write a batch to ClickHouse |
| gkg.etl.destination.rows.written | Counter | count | table | Total rows written to ClickHouse |
| gkg.etl.destination.bytes.written | Counter | bytes | table | Total bytes written to ClickHouse |
| gkg.etl.destination.write.errors | Counter | count | table | Total failed writes to ClickHouse |

Scheduled task metrics (gkg_scheduler):

| Metric | Type | Unit | Labels | Description |
|---|---|---|---|---|
| gkg.scheduler.task.runs | Counter | count | task, outcome (success/error) | Total scheduled task runs |
| gkg.scheduler.task.duration | Histogram | s | task | End-to-end duration of a scheduled task run |
| gkg.scheduler.task.requests.published | Counter | count | task | Requests successfully published |
| gkg.scheduler.task.requests.skipped | Counter | count | task | Requests skipped (already in-flight) |
| gkg.scheduler.task.query.duration | Histogram | s | query | Duration of a scheduled task ClickHouse query |
| gkg.scheduler.task.errors | Counter | count | task, stage (publish/query) | Scheduled task errors by stage |

SDLC module metrics (gkg_indexer_sdlc):

| Metric | Type | Unit | Labels | Description |
|---|---|---|---|---|
| gkg.indexer.sdlc.pipeline.duration | Histogram | s | entity | End-to-end duration of an entity or edge pipeline run |
| gkg.indexer.sdlc.pipeline.rows.processed | Counter | count | entity | Total rows extracted and written |
| gkg.indexer.sdlc.pipeline.errors | Counter | count | entity, error_kind | SDLC pipeline failures |
| gkg.indexer.sdlc.handler.duration | Histogram | s | handler | Duration of a full handler invocation |
| gkg.indexer.sdlc.datalake.query.duration | Histogram | s | entity | Duration of ClickHouse datalake extraction queries |
| gkg.indexer.sdlc.datalake.query.bytes | Counter | bytes | entity | Total bytes returned by ClickHouse datalake extraction queries |
| gkg.indexer.sdlc.transform.duration | Histogram | s | entity | Duration of DataFusion SQL transform per batch |
| gkg.indexer.sdlc.watermark.lag | Gauge | s | entity | Seconds between the current watermark and wall clock (data freshness) |

Code module metrics (gkg_indexer_code):

| Metric | Type | Unit | Labels | Description |
|---|---|---|---|---|
| gkg.indexer.code.events.processed | Counter | count | outcome (indexed, skipped_checkpoint, skipped_lock, error) | Total code indexing tasks processed |
| gkg.indexer.code.handler.duration | Histogram | s | | End-to-end duration of processing a single code indexing task |
| gkg.indexer.code.repository.fetch.duration | Histogram | s | | Duration of resolving a repository (cache check + optional download and extraction) |
| gkg.indexer.code.repository.resolution | Counter | count | strategy (cache_hit, incremental, full_download, full_download_fallback) | Repository resolution strategy used |
| gkg.indexer.code.indexing.duration | Histogram | s | | Duration of code-graph parsing and analysis |
| gkg.indexer.code.files.processed | Counter | count | outcome (parsed, skipped, errored) | Total files seen by the code-graph indexer |
| gkg.indexer.code.nodes.indexed | Counter | count | kind (directory, file, definition, imported_symbol, edge) | Total graph nodes and edges indexed |
| gkg.indexer.code.errors | Counter | count | stage (decode, repository_fetch, indexing, arrow_conversion, write, checkpoint) | Code indexing errors by pipeline stage |

Namespace deletion module metrics (gkg_indexer_namespace_deletion):

| Metric | Type | Unit | Labels | Description |
|---|---|---|---|---|
| gkg.indexer.namespace_deletion.table.duration | Histogram | s | table | Duration of a single table's soft-delete INSERT-SELECT |
| gkg.indexer.namespace_deletion.table.errors | Counter | count | table | Total per-table deletion failures |

KG Web Service:

  • Query Health: p50/p95 latency by tool (find_nodes, traverse, explore, aggregate), memory spikes, and rows/bytes read per query.
  • MCP tools latency (p50, p95, p99), usage and success rate

Query pipeline metrics (gkg_query_pipeline):

The query pipeline instruments end-to-end query execution from security check through formatted output. All duration histograms use seconds with OTel-recommended buckets. All histograms and counters carry a query_type label (for example, find_nodes, traverse, explore, aggregate).

| Metric | Type | Unit | Labels | Description |
|---|---|---|---|---|
| gkg.query.pipeline.queries | Counter | count | query_type, status (ok / error code) | Total queries processed through the pipeline |
| gkg.query.pipeline.duration | Histogram | s | query_type, status | End-to-end pipeline duration from security check to formatted output |
| gkg.query.pipeline.compile.duration | Histogram | s | query_type | Time spent compiling a query from JSON to parameterized SQL |
| gkg.query.pipeline.execute.duration | Histogram | s | query_type | Time spent executing the compiled query against ClickHouse |
| gkg.query.pipeline.authorization.duration | Histogram | s | query_type | Time spent on authorization exchange with Rails |
| gkg.query.pipeline.hydration.duration | Histogram | s | query_type | Time spent hydrating neighbor properties from ClickHouse |
| gkg.query.pipeline.result_set.size | Histogram | count | query_type | Number of rows returned after formatting |
| gkg.query.pipeline.batch.count | Histogram | count | query_type | Number of Arrow record batches returned from ClickHouse |
| gkg.query.pipeline.redacted.count | Histogram | count | query_type | Number of rows redacted per query |
| gkg.query.pipeline.ch.read_rows | Counter | count | query_type, label | ClickHouse rows read per query execution (from X-ClickHouse-Summary header) |
| gkg.query.pipeline.ch.read_bytes | Counter | bytes | query_type, label | ClickHouse bytes read per query execution (from X-ClickHouse-Summary header) |
| gkg.query.pipeline.ch.memory_usage | Histogram | bytes | query_type, label | ClickHouse peak memory usage per query execution (from X-ClickHouse-Summary header) |
| gkg.query.pipeline.error.security_rejected | Counter | count | reason (security) | Pipeline rejected due to invalid or missing security context |
| gkg.query.pipeline.error.execution_failed | Counter | count | reason (execution) | ClickHouse query execution failed |
| gkg.query.pipeline.error.authorization_failed | Counter | count | reason (authorization) | Authorization exchange with Rails failed |
| gkg.query.pipeline.error.content_resolution_failed | Counter | count | reason (content_resolution) | Virtual column resolution from remote service failed |
| gkg.query.pipeline.error.streaming_failed | Counter | count | reason (streaming) | Streaming channel unavailable during authorization |

Content resolution metrics (gkg_content_resolution):

The content resolution subsystem instruments Gitaly interactions during virtual column resolution. Duration histogram buckets target Gitaly call latency (1 ms to 5 s). Duration and total metrics carry an outcome label (gitaly_direct or error; Phase 2 will add cache_hit and cache_miss).

| Metric | Type | Unit | Labels | Description |
|---|---|---|---|---|
| gkg.content.resolve.duration | Histogram | s | outcome | Time spent resolving content from Gitaly |
| gkg.content.resolve | Counter | count | outcome | Total content resolution attempts |
| gkg.content.resolve.batch_size | Histogram | count | | Number of rows per content resolution batch |
| gkg.content.blob.size | Histogram | bytes | | Size of resolved blob content in bytes |
| gkg.content.gitaly.calls | Counter | count | | Total list_blobs RPCs issued to Gitaly |

Query engine metrics (gkg_query_engine):

The query engine fires counters during compilation to track security-relevant rejections. Each counter uses a reason label for low-cardinality breakdown. Counters marked "server layer" are exported for the gRPC/HTTP layer to increment.

| Metric | Type | Labels | Description |
|---|---|---|---|
| gkg.query.engine.threat.validation_failed | Counter | reason (parse/schema/reference/pagination) | Query rejected by structural validation |
| gkg.query.engine.threat.allowlist_rejected | Counter | reason (ontology/ontology_internal) | Entity, column, or relationship not in the ontology allowlist |
| gkg.query.engine.threat.auth_filter_missing | Counter | reason (security) | Security context invalid or absent (server layer) |
| gkg.query.engine.threat.timeout | Counter | reason | Query compilation or execution exceeded deadline (server layer) |
| gkg.query.engine.threat.rate_limited | Counter | reason | Caller throttled before compilation (server layer) |
| gkg.query.engine.threat.depth_exceeded | Counter | reason (depth) | Traversal depth or hop count exceeded the hard cap |
| gkg.query.engine.threat.limit_exceeded | Counter | reason (limit) | Array cardinality cap exceeded (node_ids count or IN filter value count) |
| gkg.query.engine.internal.pipeline_invariant_violated | Counter | reason (lowering/codegen) | Lowering or codegen hit a state upstream validation should have prevented |

Shared Infrastructure Metrics:

  • Disk and Memory usage per container
  • Network traffic between services

Prometheus scrapes these metrics into Grafana Mimir. We also maintain dashboards for the ClickHouse layer (queries, merges, background tasks).

Alert Rules

Alert rules are defined as PrometheusRule CRDs, automatically discovered by the Prometheus Operator. Thresholds are configurable via Helm values.

Metrics flow through Prometheus scraping PodMonitor endpoints exposed by the GKG chart. The OTel-to-Prometheus conversion replaces dots with underscores, appends unit suffixes (_seconds for "s", _bytes for "By"), and appends _total to counters (for example, gkg.query.engine.threat.auth_filter_missing becomes gkg_query_engine_threat_auth_filter_missing_total).
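The naming convention above can be sketched as a small helper; this is an illustration of the documented rules, not the exporter's actual implementation, which may handle additional units and edge cases:

```python
# Suffixes for the units this document uses ("s" and "By" per OTel notation).
UNIT_SUFFIXES = {"s": "_seconds", "By": "_bytes"}

def prometheus_name(otel_name: str, unit: str = "", is_counter: bool = False) -> str:
    """Convert an OTel metric name to its Prometheus-exposed form:
    dots become underscores, a unit suffix is appended, counters get _total."""
    name = otel_name.replace(".", "_")
    name += UNIT_SUFFIXES.get(unit, "")
    if is_counter:
        name += "_total"
    return name
```

For example, `prometheus_name("gkg.etl.message.duration", unit="s")` yields `gkg_etl_message_duration_seconds`.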

Security alerts (any non-zero count is anomalous):

| Alert | Metric | Default Threshold | Severity | For | Fires when |
|---|---|---|---|---|---|
| GKGAuthFilterMissing | gkg_query_engine_threat_auth_filter_missing_total | > 0 in 5m | critical | 1m | A query was processed without a valid security context, meaning authorization filtering was bypassed |
| GKGPipelineInvariantViolated | gkg_query_engine_internal_pipeline_invariant_violated_total | > 0 in 5m | critical | 1m | The query compiler reached a state that upstream validation should have prevented; may produce incorrect SQL |
| GKGSecurityRejected | gkg_query_pipeline_error_security_rejected_total | > 0 in 5m | warning | 5m | Pipeline rejected a request due to invalid or missing security context |
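As an illustration of the PrometheusRule shape these alerts take, the first security alert could look like the following sketch; the metadata, labels, and exact expression in the shipped chart may differ:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gkg-security-alerts   # illustrative name
spec:
  groups:
    - name: gkg.security
      rules:
        - alert: GKGAuthFilterMissing
          # Any non-zero count over the 5m window is anomalous.
          expr: increase(gkg_query_engine_threat_auth_filter_missing_total[5m]) > 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: A query was processed without a valid security context
```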

Query health alerts (sustained error rates or latency degradation):

| Alert | Metric | Default Threshold | Severity | For | Fires when |
|---|---|---|---|---|---|
| GKGQueryingErrorRateHigh | gkg_query_pipeline_queries_total{status!="ok"} / gkg_query_pipeline_queries_total | > 5% | warning | 5m | Aggregate error rate across all failure modes exceeds threshold (the availability SLI) |
| GKGQueryTimeoutRateHigh | gkg_query_engine_threat_timeout_total / gkg_query_pipeline_queries_total | > 5% | warning | 5m | More than 5% of queries time out, indicating ClickHouse saturation or pathological queries |
| GKGValidationFailedBurst | gkg_query_engine_threat_validation_failed_total | > 10/min | warning | 5m | Sustained burst of structural validation failures (broken client or probing) |
| GKGAllowlistRejectedBurst | gkg_query_engine_threat_allowlist_rejected_total | > 5/min | warning | 5m | Sustained ontology violations (schema drift or enumeration attempt) |
| GKGExecutionFailureRate | gkg_query_pipeline_error_execution_failed_total | > 1/min | warning | 5m | ClickHouse query execution is failing |
| GKGAuthorizationFailureRate | gkg_query_pipeline_error_authorization_failed_total | > 1/min | warning | 5m | Rails authorization exchange is failing |
| GKGPipelineLatencyP95High | gkg_query_pipeline_duration_seconds (histogram) | > 5s | warning | 10m | p95 end-to-end pipeline latency exceeds threshold |

Capacity alerts (traffic and limit pressure):

| Alert | Metric | Default Threshold | Severity | For | Fires when |
|---|---|---|---|---|---|
| GKGRateLimitedHigh | gkg_query_engine_threat_rate_limited_total | > 10/min | warning | 5m | High rate of throttled callers; may need capacity scaling |

Logging

All logs are structured JSON, shipped to Logstash and Elasticsearch. Every log entry includes a correlation ID so you can trace a request across services.

Logging Structure and Format

All log output is JSON. Each entry has standard fields plus context-specific data.

Standard Fields:

| Field | Type | Description |
|---|---|---|
| timestamp | String | ISO 8601 formatted timestamp (UTC) |
| level | String | Log level (e.g., info, warn, error) |
| service | String | Name of the service (e.g., gkg-indexer) |
| correlation_id | String | A unique ID for tracing a request |
| message | String | The log message |

Example Log Entry:

```json
{
  "timestamp": "2025-10-10T12:00:00.000Z",
  "level": "info",
  "service": "gkg-indexer",
  "correlation_id": "req-xyz-123",
  "message": "Indexing started for project"
}
```

Tracing

Services are instrumented with OpenTelemetry for distributed tracing. A single request can be followed across GKG and other GitLab services.

Health Checks

The Indexer and Dispatcher expose /live and /ready endpoints on dedicated health ports (default 4202 and 4203 respectively). The /ready probe checks downstream dependencies (NATS, ClickHouse graph, ClickHouse datalake) and returns HTTP 503 when any are unreachable. Traffic is only routed to healthy instances.
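In Kubernetes terms, these endpoints map directly onto standard probes. The snippet below is an illustrative sketch using the Indexer's default health port from this document; timings are placeholders, not shipped chart defaults:

```yaml
# Illustrative probe wiring for the Indexer (health port 4202 per this doc).
livenessProbe:
  httpGet:
    path: /live
    port: 4202
readinessProbe:
  httpGet:
    path: /ready        # returns 503 if NATS or either ClickHouse is unreachable
    port: 4202
  periodSeconds: 10     # placeholder timing
  failureThreshold: 3   # placeholder timing
```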

Self-Managed Instances

For self-managed deployments, we expose a stable integration surface so operators can integrate our metrics, logs, and tracing into their existing observability stacks.

Interface contracts (what we provide):

  • Metrics: Each service exposes a Prometheus-compatible /metrics endpoint for service-level KPIs; we also expose gauges for graph database disk usage where applicable. CPU and host/container resource utilization are expected to be collected via standard exporters alongside our service metrics.
  • Logs: All services emit structured JSON to stdout/stderr using the schema defined in Logging Structure and Format (including correlation_id).
  • Tracing: Services are instrumented with OpenTelemetry, allowing operators to configure an OTLP exporter (gRPC/HTTP) to a customer-managed collector or backend.
  • Health: Liveness (/live) and readiness (/ready) endpoints on dedicated health ports for orchestration and local SLOs.

Operator responsibilities:

  • Scrape /metrics with your Prometheus (or compatible) and manage storage, alerting, and retention.
  • Collect node/container resource metrics (CPU, memory, disk I/O, and usage) via standard exporters (e.g., cAdvisor, kube-state-metrics, node_exporter) and correlate with service metrics.
  • Collect and ship JSON logs (e.g., Fluentd/Vector/Filebeat) to your aggregator (e.g., Elasticsearch/Loki/Splunk) and manage parsing/retention.
  • Provide and operate an OpenTelemetry collector or tracing backend if traces are required.
  • Secure endpoints and govern egress in accordance with your environment.
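For operators running plain Prometheus rather than the Prometheus Operator, a minimal scrape job might look like the following sketch; the job name, target host, and port are placeholders for your deployment:

```yaml
scrape_configs:
  - job_name: gkg-indexer          # placeholder job name
    metrics_path: /metrics
    static_configs:
      - targets: ["gkg-indexer.example.internal:4202"]  # placeholder host:port
```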

Deployment to Omnibus-adjacent Kubernetes Environment

In Kubernetes, most of this is automatic:

  • Metrics: Prometheus Operator discovers and scrapes /metrics. Cluster exporters (cAdvisor, kube-state-metrics, node_exporter) handle CPU, memory, and disk.
  • Logging: Container logs go to stdout/stderr and get collected by the cluster's logging agent (Fluentd, Vector).
  • Health Checks: Kubernetes uses liveness and readiness probes to restart unhealthy pods and manage traffic during rollouts.

Ownership, On-call & Escalation

Who owns what, and who gets paged.

Service/Component Ownership

  • Siphon & NATS (development/bug fixes): Analytics stage (primarily Platform Insights), with collaboration from the Knowledge Graph team.
  • GKG Service (Indexer + API/MCP): Knowledge Graph team.

On-call

  • Tier 1: Production Engineering SRE (existing on-call rotation).
  • Tier 2: Analytics / Platform Insights.
  • Knowledge Graph Services: Dedicated on-call rotation (TBD). During initial launch the KG team will actively monitor the service.

Long-term Stewardship

Future ownership will be evaluated; for example, NATS may move under the Durability team, while Siphon is likely to remain with Data Engineering (Analytics).

Runbooks

Initial runbook procedures for GitLab.com. These will grow as we learn more in production.

Siphon

  • Monitoring: Regularly verify replication slots and monitor producer throughput and lag.
  • Snapshots: Be aware that snapshots can temporarily inflate JetStream storage. Plan for sufficient headroom per table, and use work-queue/limits retention settings during bulk loads.

NATS JetStream

  • Stream Policies: Enforce LimitsPolicy (size, age, and message caps) on streams.
  • Alerting: Configure alerts to trigger at 70%, 85%, and 95% of usage limits.

Database (ClickHouse)

  • Ingestion: Monitor background merge operations, as this is where data deduplication occurs.
  • ETL and Graph Ingestion: Establish a clear set of metrics for these processes.
  • Workload Isolation: Run GKG queries on a separate Warehouse to isolate them from ingestion workloads. Pin agent reads to read-only compute nodes.
  • Query Safety:
    • Limits: Set per-user quotas for max_memory_usage, max_rows_to_read, max_bytes_to_read, and timeouts on the GKG role.
    • Join & Scan Budgets: Enforce linting rules in the query planner to block unbounded joins, text-search filters in aggregates, or multi-hop traversals (>3) unless pre-materialized.
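The per-user limits above can be sketched as a ClickHouse settings profile. The profile name, role name, and values below are illustrative assumptions, not shipped defaults:

```sql
-- Illustrative resource limits for the GKG query role (all values are placeholders).
CREATE SETTINGS PROFILE IF NOT EXISTS gkg_query_limits SETTINGS
    max_memory_usage = 10000000000,   -- ~10 GB peak memory per query
    max_rows_to_read = 100000000,     -- hard cap on scanned rows
    max_bytes_to_read = 50000000000,  -- hard cap on scanned bytes
    max_execution_time = 30           -- seconds before the query is killed
TO gkg_reader;
```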