Server configuration runbook

Reference for all configurable knobs in the GKG server. All four modes (Webserver, Indexer, DispatchIndexing, HealthCheck) share the same AppConfig struct and loading mechanism.

Configuration loading

Config is loaded in layers, each overriding the previous:

Configuration file: Example config/default.yaml
Secrets: Files in /etc/secrets/ (Kubernetes secret mounts)
Environment variables: Prefixed with GKG_, using __ as a separator for nested keys and , for lists

Environment variable examples:

GKG_NATS__URL=nats://gkg-nats:4222
GKG_GRAPH__DATABASE=gkg-sandbox
GKG_ENGINE__MAX_CONCURRENT_WORKERS=16

Server modes

The binary (gkg-server) runs in one of four modes via --mode:

Mode	Purpose	Key config sections
`Webserver`	HTTP/gRPC query server	`bind_address`, `grpc_bind_address`, `grpc`, `tls`, `query`, `graph`, `gitlab`
`Indexer`	Consumes NATS messages and runs indexing handlers	`nats`, `engine`, `graph`, `datalake`, `gitlab`, `schedule`, `schema`
`DispatchIndexing`	Runs the scheduler loop that publishes indexing requests	`nats`, `graph`, `datalake`, `schedule`, `schema`
`HealthCheck`	K8s readiness/liveness probes	`health_check`, `graph`, `datalake`

All modes share the same configuration structure.

NATS

Connection

Config path	Env var	Default	Description
`nats.url`	`GKG_NATS__URL`	`localhost:4222`	Broker address
`nats.username`	`GKG_NATS__USERNAME`	None	Auth username
`nats.password`	`GKG_NATS__PASSWORD`	None	Auth password
`nats.tls_ca_cert_path`	`GKG_NATS__TLS_CA_CERT_PATH`	None	CA cert (PEM) for TLS. Setting any TLS path enables TLS.
`nats.tls_cert_path`	`GKG_NATS__TLS_CERT_PATH`	None	Client cert (PEM) for mTLS. Must pair with `tls_key_path`.
`nats.tls_key_path`	`GKG_NATS__TLS_KEY_PATH`	None	Client key (PEM) for mTLS. Must pair with `tls_cert_path`.
`nats.connection_timeout_secs`		`10`	Connection timeout
`nats.request_timeout_secs`		`5`	Request timeout

Consumer

Config path	Env var	Default	Description
`nats.consumer_name`	`GKG_NATS__CONSUMER_NAME`	None	Durable consumer name. `None` = ephemeral (lost on disconnect). Set in production for persistence across restarts.
`nats.ack_wait_secs`		`300`	Seconds before unacked message is redelivered
`nats.max_deliver`		`5`	Max redelivery attempts. `None` = unlimited.
`nats.batch_size`		`10`	Messages fetched per batch
`nats.subscription_buffer_size`		`100`	Internal channel buffer between fetch loop and handler
`nats.fetch_expires_secs`		`5`	Server-side timeout for batch fetch (clamped to min 1s)

Stream

Config path	Env var	Default	Description
`nats.auto_create_streams`	`GKG_NATS__AUTO_CREATE_STREAMS`	`true`	Create streams on startup
`nats.stream_replicas`	`GKG_NATS__STREAM_REPLICAS`	`1`	Replicas per stream. Use 3 in production for fault tolerance.
`nats.stream_max_age_secs`	`GKG_NATS__STREAM_MAX_AGE_SECS`	None	Max message age before deletion
`nats.stream_max_bytes`	`GKG_NATS__STREAM_MAX_BYTES`	None	Max stream size in bytes
`nats.stream_max_messages`	`GKG_NATS__STREAM_MAX_MESSAGES`	None	Max messages per stream

The GKG_INDEXER stream is created with:

Retention: WorkQueue (messages deleted after ack)
Discard: New with discard_new_per_subject: true
Max messages per subject: 1 (deduplication: rejects publishes while a handler hasn't acked)
Storage: File

ClickHouse

Two separate ClickHouse connections are required: one for the datalake (Siphon-replicated tables) and one for the graph (indexed property graph).

Datalake

Config path	Env var	Default	Description
`datalake.url`	`GKG_DATALAKE__URL`	`http://127.0.0.1:8123`	HTTP endpoint
`datalake.database`	`GKG_DATALAKE__DATABASE`	`default`	Database name
`datalake.username`	`GKG_DATALAKE__USERNAME`	`default`	Auth user
`datalake.password`	`GKG_DATALAKE__PASSWORD`	None	Auth password
`datalake.query_settings`		`{}`	ClickHouse session settings (e.g., `max_rows_to_read`)

Graph

Config path	Env var	Default	Description
`graph.url`	`GKG_GRAPH__URL`	`http://127.0.0.1:8123`	HTTP endpoint
`graph.database`	`GKG_GRAPH__DATABASE`	`default`	Database name
`graph.username`	`GKG_GRAPH__USERNAME`	`default`	Auth user
`graph.password`	`GKG_GRAPH__PASSWORD`	None	Auth password
`graph.query_settings`		`{}`	ClickHouse session settings

Profiling (debug)

Config path	Default	Description
`graph.profiling.enabled`	`false`	Enable query profiling
`graph.profiling.explain`	`false`	Collect EXPLAIN output
`graph.profiling.query_log`	`false`	Log to system.query_log
`graph.profiling.processors`	`false`	Collect processor stats
`graph.profiling.instance_health`	`false`	Check instance health

Worker pool

The worker pool limits how many messages are processed concurrently. It uses a two-level semaphore: a global limit and optional per-group limits.

Config path	Env var	Default	Description
`engine.max_concurrent_workers`	`GKG_ENGINE__MAX_CONCURRENT_WORKERS`	`16`	Global concurrency cap
`engine.concurrency_groups`		`{}`	Named group limits

How concurrency groups work

Each handler can declare a concurrency_group. When a message arrives, the handler acquires the group semaphore first, then the global semaphore. Both are released after processing.

This prevents one handler type from monopolizing all workers. For example, with 16 global workers, you can cap SDLC at 12 and code at 4:

engine:
  max_concurrent_workers: 16
  concurrency_groups:
    sdlc: 12
    code: 4

Topic configuration

Each NATS subscription has retry and concurrency settings under engine.topics.<name>:

Config path	Default	Description
`topics.<name>.concurrency_group`	None	Which group semaphore to use
`topics.<name>.max_attempts`	None	Total attempts (1 = no retry, 5 = 4 retries)
`topics.<name>.retry_interval_secs`	None	Delay between retries (NATS nack delay)

Default topic settings

From config/default.yaml:

engine:
  max_concurrent_workers: 16
  concurrency_groups:
    sdlc: 12
    code: 4
  topics:
    global-handler:
      concurrency_group: sdlc
      max_attempts: 1
      retry_interval_secs: 60
    namespace-handler:
      concurrency_group: sdlc
      max_attempts: 1
      retry_interval_secs: 60
    code-indexing-task:
      concurrency_group: code
      max_attempts: 5
      retry_interval_secs: 60
      dead_letter_on_exhaustion: true
    namespace-deletion:
      concurrency_group: code
      max_attempts: 1

Handler-specific settings

SDLC entity handler

Config path	Default	Description
`engine.handlers.entity-handler.datalake_batch_size`	`1,000,000`	Rows per datalake extraction query
`engine.handlers.entity-handler.batch_size_overrides.<Entity>`	None	Per-entity override for datalake batch size
`engine.handlers.entity-handler.partition_overrides.<Entity>`	None	Number of partitions for initial load parallelism

Code indexing task handler

Config path	Default	Description
`engine.handlers.code-indexing-task.pipeline.max_file_size_bytes`	`5,000,000`	Largest source file the v2 pipeline will parse
`engine.handlers.code-indexing-task.pipeline.max_files`	`1,000,000`	Maximum language-supported source files accepted for one pipeline run
`engine.handlers.code-indexing-task.pipeline.worker_threads`	`0`	Rayon workers per language; `0` uses Rayon default
`engine.handlers.code-indexing-task.pipeline.max_concurrent_languages`	`0`	Concurrent language pipelines; `0` uses the pipeline default

Retry strategy by topic

Topic	max_attempts	DLQ	Rationale
`global-handler` (SDLC)	1	No	Re-dispatched every cycle. No need to retry.
`namespace-handler` (SDLC)	1	No	Re-dispatched every cycle. No need to retry.
`code-indexing-task`	5	Yes	Event-driven. Won't be re-dispatched. Must retry and DLQ.
`namespace-deletion`	1	No	Re-dispatched on next scheduler cycle.

Scheduler configuration

Scheduled tasks run in DispatchIndexing mode. Each task has a 6-field cron expression (seconds, minutes, hours, day-of-month, month, day-of-week). Tasks without a cron expression fall back to a 60-second interval.

Distributed locking via NATS KV ensures only one dispatcher instance runs each task per interval.

Task	Config path	Default cron	Description
Global dispatch	`schedule.tasks.global.cron`	`0 /1 * * *` (every minute)	Publishes `GlobalIndexingRequest`
Namespace dispatch	`schedule.tasks.namespace.cron`	`0 /1 * * *` (every minute)	Publishes per-namespace requests
Code task dispatch	`schedule.tasks.code-indexing-task.cron`	`0 /1 * * *` (every minute)	Consumes Siphon CDC push events
Code backfill	`schedule.tasks.namespace-code-backfill.cron`	`0 /1 * * *` (every minute)	Backfills newly enabled namespaces
Table cleanup	`schedule.tasks.table-cleanup.cron`	`0 0 3 * * *` (daily 03:00 UTC)	Runs `OPTIMIZE TABLE ... FINAL CLEANUP`
Namespace deletion	`schedule.tasks.namespace-deletion.cron`	`0 0 3 * * *` (daily 03:00 UTC)	Schedules and executes namespace deletions
Migration completion	`schedule.tasks.migration-completion.cron`	`0 /1 * * *` (every minute)	Detects completed schema migrations

Code dispatch task settings

Config path	Default	Description
`schedule.tasks.code-indexing-task.events_stream_name`	`siphon_stream_main_db`	NATS stream for Siphon CDC events
`schedule.tasks.code-indexing-task.batch_size`	`100`	CDC events to process per cycle

Code backfill task settings

Config path	Default	Description
`schedule.tasks.namespace-code-backfill.events_stream_name`	`siphon_stream_main_db`	NATS stream for namespace events
`schedule.tasks.namespace-code-backfill.batch_size`	`100`	Events to process per cycle

GitLab client

Required for code indexing (repository archive download) and authorization.

Config path	Env var	Default	Description
`gitlab.base_url`	`GKG_GITLAB__BASE_URL`	None	GitLab instance URL
`gitlab.jwt.signing_key`	`GKG_GITLAB__JWT__SIGNING_KEY`	None	JWT signing key (for creating tokens)
`gitlab.jwt.verifying_key`	`GKG_GITLAB__JWT__VERIFYING_KEY`	(required)	JWT verification key
`gitlab.resolve_host`	`GKG_GITLAB__RESOLVE_HOST`	None	Override DNS resolution for GitLab

Observability

Logging

Config path	Env var	Default	Description
`metrics.log_level`	`GKG_METRICS__LOG_LEVEL`	None	Rust log filter string

Example: info,gkg_server=debug,gkg_indexer=trace

OpenTelemetry

Config path	Default	Description
`metrics.otel.enabled`	`false`	Enable OTEL tracing
`metrics.otel.endpoint`	`http://localhost:4317`	OTEL gRPC collector endpoint

Prometheus

Config path	Default	Description
`metrics.prometheus.enabled`	`false`	Expose `/metrics` endpoint
`metrics.prometheus.port`	`9394`	Prometheus scrape port

Webserver

These settings are used by the Webserver mode.

General

Config path	Env var	Default	Description
`bind_address`	`GKG_BIND_ADDRESS`	`127.0.0.1:4200`	HTTP server bind address
`grpc_bind_address`	`GKG_GRPC_BIND_ADDRESS`	`127.0.0.1:50054`	gRPC server bind address
`jwt_clock_skew_secs`	`GKG_JWT_CLOCK_SKEW_SECS`	`60`	Allowed JWT clock skew in seconds
`health_check_url`	`GKG_HEALTH_CHECK_URL`	None	Optional health check URL

TLS

Config path	Env var	Default	Description
`tls.cert_path`	`GKG_TLS__CERT_PATH`	None	TLS certificate path (PEM)
`tls.key_path`	`GKG_TLS__KEY_PATH`	None	TLS private key path (PEM)

gRPC tuning

Config path	Default	Description
`grpc.keepalive_interval_secs`	`20`	HTTP/2 keepalive ping interval
`grpc.keepalive_timeout_secs`	`20`	Keepalive ping timeout
`grpc.tcp_keepalive_secs`	`60`	TCP keepalive interval
`grpc.connection_window_size`	`2097152` (2 MB)	HTTP/2 connection flow control window
`grpc.stream_window_size`	`1048576` (1 MB)	HTTP/2 stream flow control window
`grpc.concurrency_limit`	`256`	Max concurrent requests
`grpc.max_connection_age_secs`	`300` (5 min)	Max connection age (for L4 ILB rebalancing)
`grpc.max_connection_age_grace_secs`	`30`	Graceful drain window after `max_connection_age_secs` fires. Must be non-zero to avoid a tonic 0.14.5 panic (hyperium/tonic#2522).
`grpc.stream_timeout_secs`	`60`	Stream timeout
`grpc.max_header_list_size_bytes`	`65536` (64 KiB)	HTTP/2 `SETTINGS_MAX_HEADER_LIST_SIZE` advertised to clients. tonic/hyper default of 16 KiB is too small for GitLab JWTs carrying many traversal IDs.

Query settings

Supports default settings and per-query-type overrides (e.g. aggregation, traversal, search):

query:
  default:
    max_execution_time: 30
    max_memory_usage: 1073741824
    use_query_cache: false
    query_cache_ttl: 60
  aggregation:
    max_execution_time: 60

Config path	Default	Description
`query.default.max_execution_time`	`30`	ClickHouse `max_execution_time` in seconds
`query.default.max_memory_usage`	unset	ClickHouse `max_memory_usage` in bytes
`query.default.max_bytes_to_read`	unset	ClickHouse `max_bytes_to_read` in bytes
`query.default.max_rows_to_read`	unset	ClickHouse `max_rows_to_read`
`query.default.max_rows_in_set`	unset	ClickHouse `max_rows_in_set` (IN subquery cap)
`query.default.use_query_cache`	`false`	Enable ClickHouse query cache
`query.default.query_cache_ttl`	`60`	Query cache TTL in seconds

Schema management

Config path	Default	Description
`schema.max_retained_versions`	`2`	Number of schema version table-sets to retain (min 2)

Analytics

Controls Snowplow product-analytics event emission. Events carry orbit_common and orbit_query contexts (consumer-owned, defined in gkg-analytics). Disabled by default -- Helm enables it for .com and Dedicated.

Config path	Env var	Default	Description
`analytics.enabled`	`GKG_ANALYTICS__ENABLED`	`false`	Enable Snowplow analytics event emission
`analytics.collector_url`	`GKG_ANALYTICS__COLLECTOR_URL`	`""`	Snowplow collector endpoint (e.g. `https://snowplowprd.trx.gitlab.net`)
`analytics.deployment.type`	`GKG_ANALYTICS__DEPLOYMENT__TYPE`	`self_managed`	`com`, `dedicated`, or `self_managed`
`analytics.deployment.environment`	`GKG_ANALYTICS__DEPLOYMENT__ENVIRONMENT`	`development`	`development`, `staging`, or `production`

Example for the .com staging cluster:

analytics:
  enabled: true
  collector_url: "https://snowplowprd.trx.gitlab.net"
  deployment:
    type: com
    environment: staging

Billing

Controls Snowplow billing-event emission and the CDot quota gate that enforces GitLab credit limits on metered Orbit queries.

Billing events

Config path	Env var	Default	Description
`billing.enabled`	`GKG_BILLING__ENABLED`	`false`	Enable Snowplow billing-event emission
`billing.collector_url`	`GKG_BILLING__COLLECTOR_URL`	`""`	Snowplow collector endpoint

Quota gate

When enabled, every metered Orbit query (mcp, rest source types) is checked against CDot before execution. Requests from namespaces with exhausted credits are rejected with codes.ResourceExhausted.

Config path	Env var	Default	Description
`billing.quota.enabled`	`GKG_BILLING__QUOTA__ENABLED`	`false`	Enable the CDot quota gate
`billing.quota.customers_dot_url`	`GKG_BILLING__QUOTA__CUSTOMERS_DOT_URL`	`""`	CDot base URL (e.g. `https://customers.gitlab.com`)
`billing.quota.request_timeout_ms`	`GKG_BILLING__QUOTA__REQUEST_TIMEOUT_MS`	`1000`	CDot request timeout in milliseconds
`billing.quota.api_user`	`GKG_BILLING__QUOTA__API_USER`	None	CDot admin email. Mounted from `/etc/secrets/billing/quota/api_user`.
`billing.quota.api_token`	`GKG_BILLING__QUOTA__API_TOKEN`	None	CDot admin token. Mounted from `/etc/secrets/billing/quota/api_token`.

Health check

Config path	Default	Description
`health_check.bind_address`	`0.0.0.0:4201`	HealthCheck mode bind address
`indexer_health_bind_address`	`0.0.0.0:4202`	Health check server address for Indexer mode
`dispatcher_health_bind_address`	`0.0.0.0:4203`	Health check server address for DispatchIndexing mode

Tuning guide

Increase indexing throughput

Increase global and group concurrency:

engine:
  max_concurrent_workers: 32
  concurrency_groups:
    sdlc: 24
    code: 8

Increase SDLC batch sizes for large namespaces:

engine:
  handlers:
    entity-handler:
      datalake_batch_size: 5000000

Reduce NATS pressure

Increase ack wait for slow handlers:

GKG_NATS__ACK_WAIT_SECS=600  # 10 minutes instead of default 5

Handle large CDC backlogs

Increase the code dispatch batch size:

schedule:
  tasks:
    code-indexing-task:
      batch_size: 500

Production NATS settings

GKG_NATS__CONSUMER_NAME=gkg-indexer       # Durable consumer (survives restarts)
GKG_NATS__STREAM_REPLICAS=3               # Fault tolerance
GKG_NATS__AUTO_CREATE_STREAMS=true        # Auto-create on startup

Helm chart configuration

In production, GKG is deployed via the gkg-helm-charts. Most configuration is set through Helm values rather than raw YAML or environment variables.

Key Helm values mapping

Helm value	Application config	Description
`nats.url`	`nats.url`	NATS broker address
`nats.consumerName`	`nats.consumer_name`	Durable consumer name
`clickhouse.datalake.host`	`datalake.url`	Datalake ClickHouse host
`clickhouse.datalake.database`	`datalake.database`	Datalake database name
`clickhouse.graph.host`	`graph.url`	Graph ClickHouse host
`clickhouse.graph.database`	`graph.database`	Graph database name
`gitlab.baseUrl`	`gitlab.base_url`	GitLab instance URL
`indexer.logLevel`	`metrics.log_level`	Indexer log level
`secrets.existingSecret`	(secret mounts)	Kubernetes secret with credentials

Overriding configuration via Helm

helm upgrade gkg gkg-helm-charts/gkg \
  --set nats.consumerName=gkg-indexer \
  --set clickhouse.graph.database=gkg-production

For complex overrides, use a values file:

helm upgrade gkg gkg-helm-charts/gkg -f custom-values.yaml

Troubleshooting with kubectl

Check pod status and logs

kubectl -n gkg get pods
kubectl -n gkg logs deployment/gkg-indexer -f
kubectl -n gkg logs deployment/gkg-dispatcher -f

# Filter logs for a specific project
kubectl -n gkg logs deployment/gkg-indexer -f | grep 'project_id=<id>'

Inspect running configuration

kubectl -n gkg exec deployment/gkg-indexer -- env | grep GKG_

Troubleshoot NATS with nats-box

Spin up a nats-box pod to run NATS commands inside the cluster:

kubectl -n gkg run nats-box --image=natsio/nats-box:latest --restart=Never -- sleep infinity
kubectl -n gkg exec -it nats-box -- sh

From inside nats-box:

# Check stream health
nats -s nats://gkg-nats:4222 stream ls
nats -s nats://gkg-nats:4222 stream info GKG_INDEXER

# Inspect consumers
nats -s nats://gkg-nats:4222 consumer ls GKG_INDEXER

# Check dead letter queue
nats -s nats://gkg-nats:4222 stream info GKG_DEAD_LETTERS

# Purge a stuck subject
nats -s nats://gkg-nats:4222 stream purge GKG_INDEXER \
  --subject='sdlc.namespace.indexing.requested.<org>.<ns>'

# Inspect KV locks
nats -s nats://gkg-nats:4222 kv ls indexing_locks

Clean up when done:

kubectl -n gkg delete pod nats-box

Example: production config

nats:
  url: nats://gkg-nats:4222
  consumer_name: gkg-indexer
  ack_wait_secs: 300
  auto_create_streams: true
  stream_replicas: 3

datalake:
  url: http://clickhouse:8123
  database: gitlab_clickhouse_main_production
  username: default

graph:
  url: http://clickhouse:8123
  database: gkg-sandbox
  username: default

gitlab:
  base_url: https://gitlab.example.com

engine:
  max_concurrent_workers: 16
  concurrency_groups:
    sdlc: 12
    code: 4
  topics:
    global-handler:
      concurrency_group: sdlc
      max_attempts: 1
      retry_interval_secs: 60
    namespace-handler:
      concurrency_group: sdlc
      max_attempts: 1
      retry_interval_secs: 60
    code-indexing-task:
      concurrency_group: code
      max_attempts: 5
      retry_interval_secs: 60
      dead_letter_on_exhaustion: true
    namespace-deletion:
      concurrency_group: code
      max_attempts: 1
  handlers:
    entity-handler:
      datalake_batch_size: 1000000
    code-indexing-task:
      pipeline:
        max_file_size_bytes: 5000000
        max_files: 1000000
        worker_threads: 0
        max_concurrent_languages: 0

schedule:
  tasks:
    table-cleanup:
      cron: "0 0 3 * * *"
    namespace-deletion:
      cron: "0 0 3 * * *"
    migration-completion:
      cron: "0 */1 * * * *"

metrics:
  log_level: info,gkg_server=debug
  prometheus:
    enabled: true
    port: 9394

FilesExpand file tree

server_configuration.md

Latest commit

History