
Observability

Summary

This document covers how we monitor the Knowledge Graph services (Indexer, Dispatcher, and Web Service): metrics, logs, tracing, health checks, and what operators need to do for self-managed deployments.

How Observability Works Today (GitLab.com)

On GitLab.com, the Knowledge Graph services use the existing observability stack:

SLIs, SLOs, and Metrics

Alerting is based on SLOs and SLIs:

  • Availability SLO: We will adopt the Dedicated reference of ≥99.5% monthly SLO for the GKG API plane (excluding planned maintenance).
  • Availability SLIs: We will use error-rate and Apdex SLIs on request latency.
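As a minimal sketch of how these SLIs reduce to ratios over request counts (the production thresholds and error-budget math live in the SLO tooling, not here):

```python
def error_rate(errors: int, total: int) -> float:
    """Error-rate SLI: share of requests that failed."""
    return 0.0 if total == 0 else errors / total

def apdex(satisfied: int, tolerating: int, total: int) -> float:
    """Apdex SLI on request latency: satisfied requests count fully,
    tolerating requests count half, frustrated requests count zero."""
    if total == 0:
        return 1.0
    return (satisfied + tolerating / 2) / total
```

For example, 90 satisfied and 10 tolerating requests out of 100 yield an Apdex of 0.95.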

Each service exposes a Prometheus /metrics endpoint. We use LabKit for instrumentation where possible.

Reliability Signals:

  • Ingest Lag: The end-to-end delay from a row being written to the Postgres WAL, through Siphon processing and NATS delivery, to ingestion into ClickHouse.
  • Consumer Health: Monitoring NATS JetStream delivery/ack rates, dead-letter counts, and stream retention headroom.

KG Web Server (Indexer and Web Service):

  • HTTP error and success rate by HTTP method and path
  • HTTP latency (p50, p95, p99) by HTTP method and path
  • gRPC error and success rate by RPC method
  • gRPC latency (p50, p95, p99) by RPC method
  • gRPC bidi stream duration for ExecuteQuery RPCs
  • Redaction exchange latency (time spent waiting for Rails authorization responses)

KG Indexer Service:

The indexer emits metrics under five OpenTelemetry meters: gkg_etl for the core engine, gkg_scheduler for the scheduled task loop, gkg_indexer_sdlc for the SDLC module, gkg_indexer_code for the code indexing module, and gkg_indexer_namespace_deletion for the namespace deletion module. All duration histograms use OTel-recommended buckets (5 ms to 10 s).

Engine metrics (gkg_etl):

| Metric | Type | Unit | Labels | Description |
|---|---|---|---|---|
| gkg.etl.messages.processed | Counter | count | topic, outcome (ack/nack/term/dead_letter) | Total messages processed |
| gkg.etl.message.duration | Histogram | s | topic | End-to-end time per message through dispatch |
| gkg.etl.handler.duration | Histogram | s | handler | Time inside each handler's handle() call |
| gkg.etl.handler.errors | Counter | count | handler, error_kind | Handler errors at the engine dispatch level |
| gkg.etl.permit.wait.duration | Histogram | s | permit_kind (global/group), group | Time waiting for a worker pool permit |
| gkg.etl.permits.active | UpDownCounter | count | permit_kind | Worker permits currently held (global or per concurrency group) |
| gkg.etl.nats.fetch.duration | Histogram | s | outcome (success/error) | Time to fetch a batch from NATS |
| gkg.etl.destination.write.duration | Histogram | s | table | Time to write a batch to ClickHouse |
| gkg.etl.destination.rows.written | Counter | count | table | Total rows written to ClickHouse |
| gkg.etl.destination.bytes.written | Counter | bytes | table | Total bytes written to ClickHouse |
| gkg.etl.destination.write.errors | Counter | count | table | Total failed writes to ClickHouse |

Scheduled task metrics (gkg_scheduler):

| Metric | Type | Unit | Labels | Description |
|---|---|---|---|---|
| gkg.scheduler.task.runs | Counter | count | task, outcome (success/error) | Total scheduled task runs |
| gkg.scheduler.task.duration | Histogram | s | task | End-to-end duration of a scheduled task run |
| gkg.scheduler.task.requests.published | Counter | count | task | Requests successfully published |
| gkg.scheduler.task.requests.skipped | Counter | count | task | Requests skipped (already in-flight) |
| gkg.scheduler.task.query.duration | Histogram | s | query | Duration of a scheduled task ClickHouse query |
| gkg.scheduler.task.errors | Counter | count | task, stage (publish/query) | Scheduled task errors by stage |

SDLC module metrics (gkg_indexer_sdlc):

| Metric | Type | Unit | Labels | Description |
|---|---|---|---|---|
| gkg.indexer.sdlc.pipeline.duration | Histogram | s | entity | End-to-end duration of an entity or edge pipeline run |
| gkg.indexer.sdlc.pipeline.rows.processed | Counter | count | entity | Total rows extracted and written |
| gkg.indexer.sdlc.pipeline.errors | Counter | count | entity, error_kind | SDLC pipeline failures |
| gkg.indexer.sdlc.handler.duration | Histogram | s | handler | Duration of a full handler invocation |
| gkg.indexer.sdlc.datalake.query.duration | Histogram | s | entity | Duration of ClickHouse datalake extraction queries |
| gkg.indexer.sdlc.datalake.query.bytes | Counter | bytes | entity | Total bytes returned by ClickHouse datalake extraction queries |
| gkg.indexer.sdlc.transform.duration | Histogram | s | entity | Duration of DataFusion SQL transform per batch |
| gkg.indexer.sdlc.watermark.lag | Gauge | s | entity | Seconds between the current watermark and wall clock (data freshness) |

Code module metrics (gkg_indexer_code):

| Metric | Type | Unit | Labels | Description |
|---|---|---|---|---|
| gkg.indexer.code.events.processed | Counter | count | outcome (indexed, skipped_checkpoint, skipped_lock, error) | Total code indexing tasks processed |
| gkg.indexer.code.handler.duration | Histogram | s | | End-to-end duration of processing a single code indexing task |
| gkg.indexer.code.repository.fetch.duration | Histogram | s | | Duration of resolving a repository (cache check + optional download and extraction) |
| gkg.indexer.code.repository.resolution | Counter | count | strategy (cache_hit, incremental, full_download, full_download_fallback) | Repository resolution strategy used |
| gkg.indexer.code.indexing.duration | Histogram | s | | Duration of code-graph parsing and analysis |
| gkg.indexer.code.files.processed | Counter | count | outcome (parsed, skipped, errored) | Total files seen by the code-graph indexer |
| gkg.indexer.code.nodes.indexed | Counter | count | kind (directory, file, definition, imported_symbol, edge) | Total graph nodes and edges indexed |
| gkg.indexer.code.errors | Counter | count | stage (decode, repository_fetch, indexing, arrow_conversion, write, checkpoint) | Code indexing errors by pipeline stage |

Namespace deletion module metrics (gkg_indexer_namespace_deletion):

| Metric | Type | Unit | Labels | Description |
|---|---|---|---|---|
| gkg.indexer.namespace_deletion.table.duration | Histogram | s | table | Duration of a single table's soft-delete INSERT-SELECT |
| gkg.indexer.namespace_deletion.table.errors | Counter | count | table | Total per-table deletion failures |

KG Web Service:

  • Query Health: p50/p95 latency by tool (find_nodes, traverse, explore, aggregate), memory spikes, and rows/bytes read per query.
  • MCP tools latency (p50, p95, p99), usage and success rate

Query pipeline metrics (gkg_query_pipeline):

The query pipeline instruments end-to-end query execution from security check through formatted output. All duration histograms use seconds with OTel-recommended buckets. All histograms and counters carry a query_type label (for example, find_nodes, traverse, explore, aggregate).

| Metric | Type | Unit | Labels | Description |
|---|---|---|---|---|
| gkg.query.pipeline.queries | Counter | count | query_type, status (ok / error code) | Total queries processed through the pipeline |
| gkg.query.pipeline.duration | Histogram | s | query_type, status | End-to-end pipeline duration from security check to formatted output |
| gkg.query.pipeline.compile.duration | Histogram | s | query_type | Time spent compiling a query from JSON to parameterized SQL |
| gkg.query.pipeline.execute.duration | Histogram | s | query_type | Time spent executing the compiled query against ClickHouse |
| gkg.query.pipeline.authorization.duration | Histogram | s | query_type | Time spent on authorization exchange with Rails |
| gkg.query.pipeline.hydration.duration | Histogram | s | query_type | Time spent hydrating neighbor properties from ClickHouse |
| gkg.query.pipeline.result_set.size | Histogram | count | query_type | Number of rows returned after formatting |
| gkg.query.pipeline.batch.count | Histogram | count | query_type | Number of Arrow record batches returned from ClickHouse |
| gkg.query.pipeline.redacted.count | Histogram | count | query_type | Number of rows redacted per query |
| gkg.query.pipeline.ch.read_rows | Counter | count | query_type, label | ClickHouse rows read per query execution (from X-ClickHouse-Summary header) |
| gkg.query.pipeline.ch.read_bytes | Counter | bytes | query_type, label | ClickHouse bytes read per query execution (from X-ClickHouse-Summary header) |
| gkg.query.pipeline.ch.memory_usage | Histogram | bytes | query_type, label | ClickHouse peak memory usage per query execution (from X-ClickHouse-Summary header) |
| gkg.query.pipeline.error.security_rejected | Counter | count | reason (security) | Pipeline rejected due to invalid or missing security context |
| gkg.query.pipeline.error.execution_failed | Counter | count | reason (execution) | ClickHouse query execution failed |
| gkg.query.pipeline.error.authorization_failed | Counter | count | reason (authorization) | Authorization exchange with Rails failed |
| gkg.query.pipeline.error.content_resolution_failed | Counter | count | reason (content_resolution) | Virtual column resolution from remote service failed |
| gkg.query.pipeline.error.streaming_failed | Counter | count | reason (streaming) | Streaming channel unavailable during authorization |

Content resolution metrics (gkg_content_resolution):

The content resolution subsystem instruments Gitaly interactions during virtual column resolution. Duration histogram buckets target Gitaly call latency (1 ms to 5 s). Duration and total metrics carry an outcome label (gitaly_direct or error; Phase 2 will add cache_hit and cache_miss).

| Metric | Type | Unit | Labels | Description |
|---|---|---|---|---|
| gkg.content.resolve.duration | Histogram | s | outcome | Time spent resolving content from Gitaly |
| gkg.content.resolve | Counter | count | outcome | Total content resolution attempts |
| gkg.content.resolve.batch_size | Histogram | count | | Number of rows per content resolution batch |
| gkg.content.blob.size | Histogram | bytes | | Size of resolved blob content in bytes |
| gkg.content.gitaly.calls | Counter | count | | Total list_blobs RPCs issued to Gitaly |

Query engine metrics (gkg_query_engine):

The query engine fires counters during compilation to track security-relevant rejections. Each counter uses a reason label for low-cardinality breakdown. Counters marked "server layer" are exported for the gRPC/HTTP layer to increment.

| Metric | Type | Labels | Description |
|---|---|---|---|
| gkg.query.engine.threat.validation_failed | Counter | reason (parse/schema/reference/pagination) | Query rejected by structural validation |
| gkg.query.engine.threat.allowlist_rejected | Counter | reason (ontology/ontology_internal) | Entity, column, or relationship not in the ontology allowlist |
| gkg.query.engine.threat.auth_filter_missing | Counter | reason (security) | Security context invalid or absent (server layer) |
| gkg.query.engine.threat.timeout | Counter | reason | Query compilation or execution exceeded deadline (server layer) |
| gkg.query.engine.threat.rate_limited | Counter | reason | Caller throttled before compilation (server layer) |
| gkg.query.engine.threat.depth_exceeded | Counter | reason (depth) | Traversal depth or hop count exceeded the hard cap |
| gkg.query.engine.threat.limit_exceeded | Counter | reason (limit) | Array cardinality cap exceeded (node_ids count or IN filter value count) |
| gkg.query.engine.internal.pipeline_invariant_violated | Counter | reason (lowering/codegen) | Lowering or codegen hit a state upstream validation should have prevented |

Shared Infrastructure Metrics:

  • Disk and Memory usage per container
  • Network traffic between services

Prometheus scrapes these metrics into Grafana Mimir. We also maintain dashboards for the ClickHouse layer (queries, merges, background tasks).

Alert Rules

Alert rules are defined as PrometheusRule CRDs, automatically discovered by the Prometheus Operator. Thresholds are configurable via Helm values.

Metrics flow through Prometheus scraping PodMonitor endpoints exposed by the GKG chart. The OTel-to-Prometheus conversion replaces dots with underscores, appends unit suffixes (_seconds for "s", _bytes for "By"), and appends _total to counters (for example, gkg.query.engine.threat.auth_filter_missing becomes gkg_query_engine_threat_auth_filter_missing_total).
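The naming convention above can be sketched as a small helper; this is an illustration of the documented rules, not the exporter's actual implementation, which may handle additional units and edge cases:

```python
# Suffixes for the units this document uses ("s" and "By" per OTel notation).
UNIT_SUFFIXES = {"s": "_seconds", "By": "_bytes"}

def prometheus_name(otel_name: str, unit: str = "", is_counter: bool = False) -> str:
    """Convert an OTel metric name to its Prometheus-exposed form:
    dots become underscores, a unit suffix is appended, counters get _total."""
    name = otel_name.replace(".", "_")
    name += UNIT_SUFFIXES.get(unit, "")
    if is_counter:
        name += "_total"
    return name
```

For example, `prometheus_name("gkg.etl.message.duration", unit="s")` yields `gkg_etl_message_duration_seconds`.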

Security alerts (any non-zero count is anomalous):

| Alert | Metric | Default Threshold | Severity | For | Fires when |
|---|---|---|---|---|---|
| GKGAuthFilterMissing | gkg_query_engine_threat_auth_filter_missing_total | > 0 in 5m | critical | 1m | A query was processed without a valid security context, meaning authorization filtering was bypassed |
| GKGPipelineInvariantViolated | gkg_query_engine_internal_pipeline_invariant_violated_total | > 0 in 5m | critical | 1m | The query compiler reached a state that upstream validation should have prevented; may produce incorrect SQL |
| GKGSecurityRejected | gkg_query_pipeline_error_security_rejected_total | > 0 in 5m | warning | 5m | Pipeline rejected a request due to invalid or missing security context |
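As an illustration of the PrometheusRule shape these alerts take, the first security alert could look like the following sketch; the metadata, labels, and exact expression in the shipped chart may differ:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gkg-security-alerts   # illustrative name
spec:
  groups:
    - name: gkg.security
      rules:
        - alert: GKGAuthFilterMissing
          # Any non-zero count over the 5m window is anomalous.
          expr: increase(gkg_query_engine_threat_auth_filter_missing_total[5m]) > 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: A query was processed without a valid security context
```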

Query health alerts (sustained error rates or latency degradation):

| Alert | Metric | Default Threshold | Severity | For | Fires when |
|---|---|---|---|---|---|
| GKGQueryingErrorRateHigh | gkg_query_pipeline_queries_total{status!="ok"} / gkg_query_pipeline_queries_total | > 5% | warning | 5m | Aggregate error rate across all failure modes exceeds threshold (the availability SLI) |
| GKGQueryTimeoutRateHigh | gkg_query_engine_threat_timeout_total / gkg_query_pipeline_queries_total | > 5% | warning | 5m | More than 5% of queries time out, indicating ClickHouse saturation or pathological queries |
| GKGValidationFailedBurst | gkg_query_engine_threat_validation_failed_total | > 10/min | warning | 5m | Sustained burst of structural validation failures (broken client or probing) |
| GKGAllowlistRejectedBurst | gkg_query_engine_threat_allowlist_rejected_total | > 5/min | warning | 5m | Sustained ontology violations (schema drift or enumeration attempt) |
| GKGExecutionFailureRate | gkg_query_pipeline_error_execution_failed_total | > 1/min | warning | 5m | ClickHouse query execution is failing |
| GKGAuthorizationFailureRate | gkg_query_pipeline_error_authorization_failed_total | > 1/min | warning | 5m | Rails authorization exchange is failing |
| GKGPipelineLatencyP95High | gkg_query_pipeline_duration_seconds (histogram) | > 5s | warning | 10m | p95 end-to-end pipeline latency exceeds threshold |

Capacity alerts (traffic and limit pressure):

| Alert | Metric | Default Threshold | Severity | For | Fires when |
|---|---|---|---|---|---|
| GKGRateLimitedHigh | gkg_query_engine_threat_rate_limited_total | > 10/min | warning | 5m | High rate of throttled callers; may need capacity scaling |

Logging

All logs are structured JSON, shipped to Logstash and Elasticsearch. Every log entry includes a correlation ID so you can trace a request across services.

Logging Structure and Format

All log output is JSON. Each entry has standard fields plus context-specific data.

Standard Fields:

| Field | Type | Description |
|---|---|---|
| timestamp | String | ISO 8601 formatted timestamp (UTC) |
| level | String | Log level (e.g., info, warn, error) |
| service | String | Name of the service (e.g., gkg-indexer) |
| correlation_id | String | A unique ID for tracing a request |
| message | String | The log message |

Example Log Entry:

```json
{
  "timestamp": "2025-10-10T12:00:00.000Z",
  "level": "info",
  "service": "gkg-indexer",
  "correlation_id": "req-xyz-123",
  "message": "Indexing started for project"
}
```

Tracing

Services are instrumented with OpenTelemetry for distributed tracing. A single request can be followed across GKG and other GitLab services.

Health Checks

The Indexer and Dispatcher expose /live and /ready endpoints on dedicated health ports (default 4202 and 4203 respectively). The /ready probe checks downstream dependencies (NATS, ClickHouse graph, ClickHouse datalake) and returns HTTP 503 when any are unreachable. Traffic is only routed to healthy instances.
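In Kubernetes terms, these endpoints map directly onto standard probes. The snippet below is an illustrative sketch using the Indexer's default health port from this document; timings are placeholders, not shipped chart defaults:

```yaml
# Illustrative probe wiring for the Indexer (health port 4202 per this doc).
livenessProbe:
  httpGet:
    path: /live
    port: 4202
readinessProbe:
  httpGet:
    path: /ready        # returns 503 if NATS or either ClickHouse is unreachable
    port: 4202
  periodSeconds: 10     # placeholder timing
  failureThreshold: 3   # placeholder timing
```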

Self-Managed Instances

For self-managed deployments, we expose a stable integration surface so operators can integrate our metrics, logs, and tracing into their existing observability stacks.

Interface contracts (what we provide):

  • Metrics: Each service exposes a Prometheus-compatible /metrics endpoint for service-level KPIs; we also expose gauges for graph database disk usage where applicable. CPU and host/container resource utilization are expected to be collected via standard exporters alongside our service metrics.
  • Logs: All services emit structured JSON to stdout/stderr using the schema defined in Logging Structure and Format (including correlation_id).
  • Tracing: Services are instrumented with OpenTelemetry, allowing operators to configure an OTLP exporter (gRPC/HTTP) to a customer-managed collector or backend.
  • Health: Liveness (/live) and readiness (/ready) endpoints on dedicated health ports for orchestration and local SLOs.

Operator responsibilities:

  • Scrape /metrics with your Prometheus (or compatible) and manage storage, alerting, and retention.
  • Collect node/container resource metrics (CPU, memory, disk I/O, and usage) via standard exporters (e.g., cAdvisor, kube-state-metrics, node_exporter) and correlate with service metrics.
  • Collect and ship JSON logs (e.g., Fluentd/Vector/Filebeat) to your aggregator (e.g., Elasticsearch/Loki/Splunk) and manage parsing/retention.
  • Provide and operate an OpenTelemetry collector or tracing backend if traces are required.
  • Secure endpoints and govern egress in accordance with your environment.
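For operators running plain Prometheus rather than the Prometheus Operator, a minimal scrape job might look like the following sketch; the job name, target host, and port are placeholders for your deployment:

```yaml
scrape_configs:
  - job_name: gkg-indexer          # placeholder job name
    metrics_path: /metrics
    static_configs:
      - targets: ["gkg-indexer.example.internal:4202"]  # placeholder host:port
```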

Deployment to Omnibus-adjacent Kubernetes Environment

In Kubernetes, most of this is automatic:

  • Metrics: Prometheus Operator discovers and scrapes /metrics. Cluster exporters (cAdvisor, kube-state-metrics, node_exporter) handle CPU, memory, and disk.
  • Logging: Container logs go to stdout/stderr and get collected by the cluster's logging agent (Fluentd, Vector).
  • Health Checks: Kubernetes uses liveness and readiness probes to restart unhealthy pods and manage traffic during rollouts.

Ownership, On-call & Escalation

Who owns what, and who gets paged.

Service/Component Ownership

  • Siphon & NATS (development/bug fixes): Analytics stage (primarily Platform Insights), with collaboration from the Knowledge Graph team.
  • GKG Service (Indexer + API/MCP): Knowledge Graph team.

On-call

  • Tier 1: Production Engineering SRE (existing on-call rotation).
  • Tier 2: Analytics / Platform Insights.
  • Knowledge Graph Services: Dedicated on-call rotation (TBD). During initial launch the KG team will actively monitor the service.

Long-term Stewardship

Future ownership will be evaluated; for example, NATS may move under the Durability team, while Siphon is likely to remain with Data Engineering (Analytics).

Runbooks

Initial runbook procedures for GitLab.com. These will grow as we learn more in production.

Siphon

  • Monitoring: Regularly verify replication slots and monitor producer throughput and lag.
  • Snapshots: Be aware that snapshots can temporarily inflate JetStream storage. Plan for sufficient headroom per table, and use work-queue/limits retention settings during bulk loads.

NATS JetStream

  • Stream Policies: Enforce LimitsPolicy (size, age, and message caps) on streams.
  • Alerting: Configure alerts to trigger at 70%, 85%, and 95% of usage limits.

Database (ClickHouse)

  • Ingestion: Monitor background merge operations, as this is where data deduplication occurs.
  • ETL and Graph Ingestion: Establish a clear set of metrics for these processes.
  • Workload Isolation: Run GKG queries on a separate Warehouse to isolate them from ingestion workloads. Pin agent reads to read-only compute nodes.
  • Query Safety:
    • Limits: Set per-user quotas for max_memory_usage, max_rows_to_read, max_bytes_to_read, and timeouts on the GKG role.
    • Join & Scan Budgets: Enforce linting rules in the query planner to block unbounded joins, text-search filters in aggregates, or multi-hop traversals (>3) unless pre-materialized.
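The per-user limits above can be sketched as a ClickHouse settings profile. The profile name, role name, and values below are illustrative assumptions, not shipped defaults:

```sql
-- Illustrative resource limits for the GKG query role (all values are placeholders).
CREATE SETTINGS PROFILE IF NOT EXISTS gkg_query_limits SETTINGS
    max_memory_usage = 10000000000,   -- ~10 GB peak memory per query
    max_rows_to_read = 100000000,     -- hard cap on scanned rows
    max_bytes_to_read = 50000000000,  -- hard cap on scanned bytes
    max_execution_time = 30           -- seconds before the query is killed
TO gkg_reader;
```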