This document covers how we monitor the Knowledge Graph services (Indexer, Dispatcher, and Web Service): metrics, logs, tracing, health checks, and what operators need to do for self-managed deployments.
On GitLab.com, the Knowledge Graph services use the existing observability stack:
- Metrics: Grafana and Grafana Mimir
- Logs: Elasticsearch and Logstash
Alerting is based on SLOs and SLIs:
- Availability SLO: We will adopt the GitLab Dedicated reference target of ≥99.5% monthly availability for the GKG API plane (excluding planned maintenance).
- Availability SLIs: We will use error-rate and Apdex SLIs on request latency.
Each service exposes a Prometheus /metrics endpoint. We use LabKit for instrumentation where possible.
Reliability Signals:
- Ingest Lag: The end-to-end delay from data being written to the Postgres WAL, through Siphon processing and NATS delivery, to ingestion into the graph.
- Consumer Health: Monitoring NATS JetStream delivery/ack rates, dead-letter counts, and stream retention headroom.
KG Web Server (Indexer and Web Service):
- HTTP error and success rate by HTTP method and path
- HTTP latency (p50, p95, p99) by HTTP method and path
- gRPC error and success rate by RPC method
- gRPC latency (p50, p95, p99) by RPC method
- gRPC bidi stream duration for `ExecuteQuery` RPCs
- Redaction exchange latency (time spent waiting for Rails authorization responses)
KG Indexer Service:
The indexer emits metrics under five OpenTelemetry meters: `gkg_etl` for the core engine, `gkg_scheduler` for the scheduled task loop, `gkg_indexer_sdlc` for the SDLC module, `gkg_indexer_code` for the code indexing module, and `gkg_indexer_namespace_deletion` for the namespace deletion module. All duration histograms use OTel-recommended buckets (5 ms to 10 s).
Engine metrics (gkg_etl):
| Metric | Type | Unit | Labels | Description |
|---|---|---|---|---|
| `gkg.etl.messages.processed` | Counter | count | `topic`, `outcome` (ack/nack/term/dead_letter) | Total messages processed |
| `gkg.etl.message.duration` | Histogram | s | `topic` | End-to-end time per message through dispatch |
| `gkg.etl.handler.duration` | Histogram | s | `handler` | Time inside each handler's `handle()` call |
| `gkg.etl.handler.errors` | Counter | count | `handler`, `error_kind` | Handler errors at the engine dispatch level |
| `gkg.etl.permit.wait.duration` | Histogram | s | `permit_kind` (global/group), `group` | Time waiting for a worker pool permit |
| `gkg.etl.permits.active` | UpDownCounter | count | `permit_kind` | Worker permits currently held (global or per concurrency group) |
| `gkg.etl.nats.fetch.duration` | Histogram | s | `outcome` (success/error) | Time to fetch a batch from NATS |
| `gkg.etl.destination.write.duration` | Histogram | s | `table` | Time to write a batch to ClickHouse |
| `gkg.etl.destination.rows.written` | Counter | count | `table` | Total rows written to ClickHouse |
| `gkg.etl.destination.bytes.written` | Counter | bytes | `table` | Total bytes written to ClickHouse |
| `gkg.etl.destination.write.errors` | Counter | count | `table` | Total failed writes to ClickHouse |
Scheduled task metrics (gkg_scheduler):
| Metric | Type | Unit | Labels | Description |
|---|---|---|---|---|
| `gkg.scheduler.task.runs` | Counter | count | `task`, `outcome` (success/error) | Total scheduled task runs |
| `gkg.scheduler.task.duration` | Histogram | s | `task` | End-to-end duration of a scheduled task run |
| `gkg.scheduler.task.requests.published` | Counter | count | `task` | Requests successfully published |
| `gkg.scheduler.task.requests.skipped` | Counter | count | `task` | Requests skipped (already in-flight) |
| `gkg.scheduler.task.query.duration` | Histogram | s | `query` | Duration of a scheduled task ClickHouse query |
| `gkg.scheduler.task.errors` | Counter | count | `task`, `stage` (publish/query) | Scheduled task errors by stage |
SDLC module metrics (gkg_indexer_sdlc):
| Metric | Type | Unit | Labels | Description |
|---|---|---|---|---|
| `gkg.indexer.sdlc.pipeline.duration` | Histogram | s | `entity` | End-to-end duration of an entity or edge pipeline run |
| `gkg.indexer.sdlc.pipeline.rows.processed` | Counter | count | `entity` | Total rows extracted and written |
| `gkg.indexer.sdlc.pipeline.errors` | Counter | count | `entity`, `error_kind` | SDLC pipeline failures |
| `gkg.indexer.sdlc.handler.duration` | Histogram | s | `handler` | Duration of a full handler invocation |
| `gkg.indexer.sdlc.datalake.query.duration` | Histogram | s | `entity` | Duration of ClickHouse datalake extraction queries |
| `gkg.indexer.sdlc.datalake.query.bytes` | Counter | bytes | `entity` | Total bytes returned by ClickHouse datalake extraction queries |
| `gkg.indexer.sdlc.transform.duration` | Histogram | s | `entity` | Duration of DataFusion SQL transform per batch |
| `gkg.indexer.sdlc.watermark.lag` | Gauge | s | `entity` | Seconds between the current watermark and wall clock (data freshness) |
Code module metrics (gkg_indexer_code):
| Metric | Type | Unit | Labels | Description |
|---|---|---|---|---|
| `gkg.indexer.code.events.processed` | Counter | count | `outcome` (indexed, skipped_checkpoint, skipped_lock, error) | Total code indexing tasks processed |
| `gkg.indexer.code.handler.duration` | Histogram | s | | End-to-end duration of processing a single code indexing task |
| `gkg.indexer.code.repository.fetch.duration` | Histogram | s | | Duration of resolving a repository (cache check + optional download and extraction) |
| `gkg.indexer.code.repository.resolution` | Counter | count | `strategy` (cache_hit, incremental, full_download, full_download_fallback) | Repository resolution strategy used |
| `gkg.indexer.code.indexing.duration` | Histogram | s | | Duration of code-graph parsing and analysis |
| `gkg.indexer.code.files.processed` | Counter | count | `outcome` (parsed, skipped, errored) | Total files seen by the code-graph indexer |
| `gkg.indexer.code.nodes.indexed` | Counter | count | `kind` (directory, file, definition, imported_symbol, edge) | Total graph nodes and edges indexed |
| `gkg.indexer.code.errors` | Counter | count | `stage` (decode, repository_fetch, indexing, arrow_conversion, write, checkpoint) | Code indexing errors by pipeline stage |
Namespace deletion module metrics (gkg_indexer_namespace_deletion):
| Metric | Type | Unit | Labels | Description |
|---|---|---|---|---|
| `gkg.indexer.namespace_deletion.table.duration` | Histogram | s | `table` | Duration of a single table's soft-delete INSERT-SELECT |
| `gkg.indexer.namespace_deletion.table.errors` | Counter | count | `table` | Total per-table deletion failures |
KG Web Service:
- Query Health: p50/p95 latency by tool (`find_nodes`, `traverse`, `explore`, `aggregate`), memory spikes, and rows/bytes read per query
- MCP tools latency (p50, p95, p99), usage, and success rate
Query pipeline metrics (gkg_query_pipeline):
The query pipeline instruments end-to-end query execution from security check through formatted output. All duration histograms use seconds with OTel-recommended buckets. All histograms and counters carry a `query_type` label (for example, `find_nodes`, `traverse`, `explore`, `aggregate`).
| Metric | Type | Unit | Labels | Description |
|---|---|---|---|---|
| `gkg.query.pipeline.queries` | Counter | count | `query_type`, `status` (ok / error code) | Total queries processed through the pipeline |
| `gkg.query.pipeline.duration` | Histogram | s | `query_type`, `status` | End-to-end pipeline duration from security check to formatted output |
| `gkg.query.pipeline.compile.duration` | Histogram | s | `query_type` | Time spent compiling a query from JSON to parameterized SQL |
| `gkg.query.pipeline.execute.duration` | Histogram | s | `query_type` | Time spent executing the compiled query against ClickHouse |
| `gkg.query.pipeline.authorization.duration` | Histogram | s | `query_type` | Time spent on authorization exchange with Rails |
| `gkg.query.pipeline.hydration.duration` | Histogram | s | `query_type` | Time spent hydrating neighbor properties from ClickHouse |
| `gkg.query.pipeline.result_set.size` | Histogram | count | `query_type` | Number of rows returned after formatting |
| `gkg.query.pipeline.batch.count` | Histogram | count | `query_type` | Number of Arrow record batches returned from ClickHouse |
| `gkg.query.pipeline.redacted.count` | Histogram | count | `query_type` | Number of rows redacted per query |
| `gkg.query.pipeline.ch.read_rows` | Counter | count | `query_type`, `label` | ClickHouse rows read per query execution (from `X-ClickHouse-Summary` header) |
| `gkg.query.pipeline.ch.read_bytes` | Counter | bytes | `query_type`, `label` | ClickHouse bytes read per query execution (from `X-ClickHouse-Summary` header) |
| `gkg.query.pipeline.ch.memory_usage` | Histogram | bytes | `query_type`, `label` | ClickHouse peak memory usage per query execution (from `X-ClickHouse-Summary` header) |
| `gkg.query.pipeline.error.security_rejected` | Counter | count | `reason` (security) | Pipeline rejected due to invalid or missing security context |
| `gkg.query.pipeline.error.execution_failed` | Counter | count | `reason` (execution) | ClickHouse query execution failed |
| `gkg.query.pipeline.error.authorization_failed` | Counter | count | `reason` (authorization) | Authorization exchange with Rails failed |
| `gkg.query.pipeline.error.content_resolution_failed` | Counter | count | `reason` (content_resolution) | Virtual column resolution from remote service failed |
| `gkg.query.pipeline.error.streaming_failed` | Counter | count | `reason` (streaming) | Streaming channel unavailable during authorization |
Content resolution metrics (gkg_content_resolution):
The content resolution subsystem instruments Gitaly interactions during virtual column resolution. Duration histogram buckets target Gitaly call latency (1 ms to 5 s). Duration and total metrics carry an `outcome` label (`gitaly_direct` or `error`; Phase 2 will add `cache_hit` and `cache_miss`).
| Metric | Type | Unit | Labels | Description |
|---|---|---|---|---|
| `gkg.content.resolve.duration` | Histogram | s | `outcome` | Time spent resolving content from Gitaly |
| `gkg.content.resolve` | Counter | count | `outcome` | Total content resolution attempts |
| `gkg.content.resolve.batch_size` | Histogram | count | | Number of rows per content resolution batch |
| `gkg.content.blob.size` | Histogram | bytes | | Size of resolved blob content in bytes |
| `gkg.content.gitaly.calls` | Counter | count | | Total `list_blobs` RPCs issued to Gitaly |
Query engine metrics (gkg_query_engine):
The query engine fires counters during compilation to track security-relevant rejections. Each counter uses a reason label for low-cardinality breakdown. Counters marked "server layer" are exported for the gRPC/HTTP layer to increment.
| Metric | Type | Labels | Description |
|---|---|---|---|
| `gkg.query.engine.threat.validation_failed` | Counter | `reason` (parse/schema/reference/pagination) | Query rejected by structural validation |
| `gkg.query.engine.threat.allowlist_rejected` | Counter | `reason` (ontology/ontology_internal) | Entity, column, or relationship not in the ontology allowlist |
| `gkg.query.engine.threat.auth_filter_missing` | Counter | `reason` (security) | Security context invalid or absent (server layer) |
| `gkg.query.engine.threat.timeout` | Counter | `reason` | Query compilation or execution exceeded deadline (server layer) |
| `gkg.query.engine.threat.rate_limited` | Counter | `reason` | Caller throttled before compilation (server layer) |
| `gkg.query.engine.threat.depth_exceeded` | Counter | `reason` (depth) | Traversal depth or hop count exceeded the hard cap |
| `gkg.query.engine.threat.limit_exceeded` | Counter | `reason` (limit) | Array cardinality cap exceeded (`node_ids` count or IN filter value count) |
| `gkg.query.engine.internal.pipeline_invariant_violated` | Counter | `reason` (lowering/codegen) | Lowering or codegen hit a state upstream validation should have prevented |
Shared Infrastructure Metrics:
- Disk and Memory usage per container
- Network traffic between services
Prometheus scrapes these metrics into Grafana Mimir. We also maintain dashboards for the ClickHouse layer (queries, merges, background tasks).
Alert rules are defined as PrometheusRule CRDs, automatically discovered by the Prometheus Operator. Thresholds are configurable via Helm values.
Metrics flow through Prometheus scraping PodMonitor endpoints exposed by the GKG chart. The OTel-to-Prometheus conversion replaces dots with underscores, appends unit suffixes (_seconds for "s", _bytes for "By"), and appends _total for counters (e.g., gkg.query.engine.threat.auth_filter_missing → gkg_query_engine_threat_auth_filter_missing_total).
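As a rough illustration of that naming rule, the helper below mirrors the documented mapping (dots to underscores, unit suffix, `_total` for counters). It is a sketch, not the official converter; the authoritative behavior is defined by the OpenTelemetry Prometheus exporter.

```python
# Sketch of the OTel-to-Prometheus metric name mapping described above.
# Only the units this document uses ("s" and "By") are handled.
UNIT_SUFFIX = {"s": "_seconds", "By": "_bytes"}

def prometheus_name(otel_name: str, unit: str = "",
                    is_counter: bool = False) -> str:
    name = otel_name.replace(".", "_")          # dots -> underscores
    suffix = UNIT_SUFFIX.get(unit, "")
    if suffix and not name.endswith(suffix):    # append unit suffix
        name += suffix
    if is_counter and not name.endswith("_total"):
        name += "_total"                        # counters end in _total
    return name

print(prometheus_name("gkg.query.engine.threat.auth_filter_missing",
                      is_counter=True))
# gkg_query_engine_threat_auth_filter_missing_total
print(prometheus_name("gkg.query.pipeline.duration", unit="s"))
# gkg_query_pipeline_duration_seconds
```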
Security alerts (any non-zero count is anomalous):
| Alert | Metric | Default Threshold | Severity | for | Fires when |
|---|---|---|---|---|---|
| GKGAuthFilterMissing | `gkg_query_engine_threat_auth_filter_missing_total` | > 0 in 5m | critical | 1m | A query was processed without a valid security context, meaning authorization filtering was bypassed |
| GKGPipelineInvariantViolated | `gkg_query_engine_internal_pipeline_invariant_violated_total` | > 0 in 5m | critical | 1m | The query compiler reached a state that upstream validation should have prevented, and may produce incorrect SQL |
| GKGSecurityRejected | `gkg_query_pipeline_error_security_rejected_total` | > 0 in 5m | warning | 5m | Pipeline rejected a request due to invalid or missing security context |
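As a sketch of how one of these alerts could be wired up, the following PrometheusRule shows the general shape for GKGAuthFilterMissing. Resource names, annotations, and the exact expression here are illustrative; the shipped rules are templated through Helm values as described below.

```yaml
# Illustrative PrometheusRule for the GKGAuthFilterMissing security alert.
# Any non-zero count is anomalous, so the expression fires on any increase.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gkg-security-alerts        # hypothetical resource name
spec:
  groups:
    - name: gkg-security
      rules:
        - alert: GKGAuthFilterMissing
          expr: increase(gkg_query_engine_threat_auth_filter_missing_total[5m]) > 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: A query was processed without a valid security context
```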
Query health alerts (sustained error rates or latency degradation):
| Alert | Metric | Default Threshold | Severity | for | Fires when |
|---|---|---|---|---|---|
| GKGQueryingErrorRateHigh | `gkg_query_pipeline_queries_total{status!="ok"}` / `gkg_query_pipeline_queries_total` | > 5% | warning | 5m | Aggregate error rate across all failure modes exceeds the threshold (the availability SLI) |
| GKGQueryTimeoutRateHigh | `gkg_query_engine_threat_timeout_total` / `gkg_query_pipeline_queries_total` | > 5% | warning | 5m | More than 5% of queries time out, indicating ClickHouse saturation or pathological queries |
| GKGValidationFailedBurst | `gkg_query_engine_threat_validation_failed_total` | > 10/min | warning | 5m | Sustained burst of structural validation failures (broken client or probing) |
| GKGAllowlistRejectedBurst | `gkg_query_engine_threat_allowlist_rejected_total` | > 5/min | warning | 5m | Sustained ontology violations (schema drift or enumeration attempt) |
| GKGExecutionFailureRate | `gkg_query_pipeline_error_execution_failed_total` | > 1/min | warning | 5m | ClickHouse query execution is failing |
| GKGAuthorizationFailureRate | `gkg_query_pipeline_error_authorization_failed_total` | > 1/min | warning | 5m | Rails authorization exchange is failing |
| GKGPipelineLatencyP95High | `gkg_query_pipeline_duration_seconds` (histogram) | > 5s | warning | 10m | p95 end-to-end pipeline latency exceeds the threshold |
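For illustration, the error-rate SLI behind GKGQueryingErrorRateHigh can be computed as a ratio of rates. This is a sketch of the expression shape, not the exact shipped rule; label selectors depend on your scrape configuration.

```promql
# Fraction of queries with a non-ok status over the last 5 minutes,
# compared against the 5% default threshold.
sum(rate(gkg_query_pipeline_queries_total{status!="ok"}[5m]))
  /
sum(rate(gkg_query_pipeline_queries_total[5m]))
  > 0.05
```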
Capacity alerts (traffic and limit pressure):
| Alert | Metric | Default Threshold | Severity | for | Fires when |
|---|---|---|---|---|---|
| GKGRateLimitedHigh | `gkg_query_engine_threat_rate_limited_total` | > 10/min | warning | 5m | High rate of throttled callers; may need capacity scaling |
All logs are structured JSON, shipped to Logstash and Elasticsearch. Every log entry includes a correlation ID so you can trace a request across services.
Each entry has standard fields plus context-specific data.
Standard Fields:
| Field | Type | Description |
|---|---|---|
| `timestamp` | String | ISO 8601 formatted timestamp (UTC) |
| `level` | String | Log level (e.g., info, warn, error) |
| `service` | String | Name of the service (e.g., `gkg-indexer`) |
| `correlation_id` | String | A unique ID for tracing a request |
| `message` | String | The log message |
Example Log Entry:
```json
{
  "timestamp": "2025-10-10T12:00:00.000Z",
  "level": "info",
  "service": "gkg-indexer",
  "correlation_id": "req-xyz-123",
  "message": "Indexing started for project"
}
```

Services are instrumented with OpenTelemetry for distributed tracing. A single request can be followed across GKG and other GitLab services.
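To make the schema concrete, here is a minimal sketch of an emitter that produces log lines in this shape. The helper function is hypothetical (the services have their own logging code); it only illustrates how standard fields and context-specific data combine into one JSON object.

```python
import json
from datetime import datetime, timezone

def log_entry(level: str, service: str, correlation_id: str,
              message: str, **context) -> str:
    """Render one structured log line with the standard fields above.

    Hypothetical helper for illustration only; extra keyword arguments
    become context-specific fields alongside the standard ones.
    """
    entry = {
        # ISO 8601 UTC timestamp, millisecond precision, trailing "Z"
        "timestamp": datetime.now(timezone.utc)
                             .strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3] + "Z",
        "level": level,
        "service": service,
        "correlation_id": correlation_id,
        "message": message,
    }
    entry.update(context)  # context-specific data rides alongside
    return json.dumps(entry)

print(log_entry("info", "gkg-indexer", "req-xyz-123",
                "Indexing started for project", project_id=42))
```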
The Indexer and Dispatcher expose `/live` and `/ready` endpoints on dedicated health ports (default 4202 and 4203, respectively). The `/ready` probe checks downstream dependencies (NATS, ClickHouse graph, ClickHouse datalake) and returns HTTP 503 when any are unreachable. Traffic is only routed to healthy instances.
For self-managed deployments, we expose a stable integration surface so operators can integrate our metrics, logs, and tracing into their existing observability stacks.
Interface contracts (what we provide):
- Metrics: Each service exposes a Prometheus-compatible `/metrics` endpoint for service-level KPIs; we also expose gauges for graph database disk usage where applicable. CPU and host/container resource utilization are expected to be collected via standard exporters alongside our service metrics.
- Logs: All services emit structured JSON to `stdout`/`stderr` using the schema defined in Logging Structure and Format (including `correlation_id`).
- Tracing: Services are instrumented with OpenTelemetry, allowing operators to configure an OTLP exporter (gRPC/HTTP) to a customer-managed collector or backend.
- Health: Liveness (`/live`) and readiness (`/ready`) endpoints on dedicated health ports for orchestration and local SLOs.
Operator responsibilities:
- Scrape `/metrics` with your Prometheus (or compatible) and manage storage, alerting, and retention.
- Collect node/container resource metrics (CPU, memory, disk I/O, and usage) via standard exporters (e.g., cAdvisor, kube-state-metrics, node_exporter) and correlate with service metrics.
- Collect and ship JSON logs (e.g., Fluentd/Vector/Filebeat) to your aggregator (e.g., Elasticsearch/Loki/Splunk) and manage parsing/retention.
- Provide and operate an OpenTelemetry collector or tracing backend if traces are required.
- Secure endpoints and govern egress in accordance with your environment.
In Kubernetes, most of this is automatic:
- Metrics: Prometheus Operator discovers and scrapes `/metrics`. Cluster exporters (cAdvisor, kube-state-metrics, node_exporter) handle CPU, memory, and disk.
- Logging: Container logs go to `stdout`/`stderr` and get collected by the cluster's logging agent (Fluentd, Vector).
- Health Checks: Kubernetes uses liveness and readiness probes to restart unhealthy pods and manage traffic during rollouts.
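The health endpoints map directly onto these probes. As a sketch against the Indexer's default health port (container spec fields and timings are illustrative, not the chart's exact values):

```yaml
# Illustrative probe wiring for an Indexer container. /ready returns
# HTTP 503 while NATS or ClickHouse are unreachable, so Kubernetes
# pulls the pod out of rotation without restarting it; /live failures
# trigger a restart.
livenessProbe:
  httpGet:
    path: /live
    port: 4202
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready
    port: 4202
  periodSeconds: 5
  failureThreshold: 3
```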
Who owns what, and who gets paged.
- Siphon & NATS (development/bug fixes): Analytics stage (primarily Platform Insights), with collaboration from the Knowledge Graph team.
- GKG Service (Indexer + API/MCP): Knowledge Graph team.
- Tier 1: Production Engineering SRE (existing on-call rotation).
- Tier 2: Analytics / Platform Insights.
- Knowledge Graph Services: Dedicated on-call rotation (TBD). During initial launch the KG team will actively monitor the service.
Future ownership will be evaluated; for example, NATS may move under the Durability team, while Siphon is likely to remain with Data Engineering (Analytics).
Initial runbook procedures for GitLab.com. These will grow as we learn more in production.
- Monitoring: Regularly verify replication slots and monitor producer throughput and lag.
- Snapshots: Be aware that snapshots can temporarily inflate JetStream storage. Plan for sufficient headroom per table, and use work-queue/limits retention settings during bulk loads.
- Stream Policies: Enforce `LimitsPolicy` (size, age, and message caps) on streams.
- Alerting: Configure alerts to trigger at 70%, 85%, and 95% of usage limits.
- Ingestion: Monitor background merge operations, as this is where data deduplication occurs.
- ETL and Graph Ingestion: Establish a clear set of metrics for these processes.
- Workload Isolation: Run GKG queries on a separate Warehouse to isolate them from ingestion workloads. Pin agent reads to read-only compute nodes.
- Query Safety:
  - Limits: Set per-user quotas for `max_memory_usage`, `max_rows_to_read`, `max_bytes_to_read`, and timeouts on the GKG role.
  - Join & Scan Budgets: Enforce linting rules in the query planner to block unbounded joins, text-search filters in aggregates, or multi-hop traversals (>3) unless pre-materialized.
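These per-role limits can be applied as a ClickHouse settings profile. The following is a sketch only; the profile name, role name, and all values are placeholders to be tuned per deployment.

```sql
-- Illustrative: constrain the GKG query role's resource usage.
-- Profile/role names and all numeric values are hypothetical.
CREATE SETTINGS PROFILE IF NOT EXISTS gkg_query_profile SETTINGS
    max_memory_usage = 10000000000,   -- ~10 GB per query
    max_rows_to_read = 1000000000,    -- 1B-row scan budget
    max_bytes_to_read = 50000000000,  -- ~50 GB read budget
    max_execution_time = 30           -- per-query timeout, seconds
TO gkg_query_role;
```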