Skip to content

Latest commit

 

History

History

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 
 
 
 
 

README.md

Indexing

Overview

The Knowledge Graph indexing architecture transforms GitLab SDLC metadata and code repositories into a queryable property graph. The indexing service operates as a distributed ETL (Extract, Transform, Load) pipeline that leverages the Data Insights Platform to process both SDLC events and code changes.

This document outlines the general architecture, shared patterns, and components across both indexing domains. For detailed implementation specifics, see:

Architecture Goals

The indexing architecture achieves the following:

  • Process GitLab SDLC metadata and code repositories into a unified property graph model
  • Scale horizontally across multiple indexer workers without overlapping work
  • Minimize operational load on production PostgreSQL databases
  • Share infrastructure and patterns between code and SDLC indexing to reduce operational complexity
  • Support incremental indexing for efficient updates

Shared Architecture Components

Both code and SDLC indexing leverage the same foundational infrastructure from the Data Insights Platform and share patterns for distributed coordination, data storage, and query access.

flowchart TD
  %% === Shared Infrastructure ===
  subgraph SOURCES["GitLab Sources"]
    PG["PostgreSQL (SDLC)"]
    Rails["GitLab Rails (Code Archives)"]
  end

  subgraph DIP["Data Insights Platform (Shared)"]
    SYP["Siphon (CDC)"]
    JS["NATS JetStream"]
    KV["NATS KV (Locks & State)"]
    CH_RAW["ClickHouse (Raw Data Lake)"]
  end

  subgraph INDEXERS["gkg-indexer workers"]
    direction TB
    SDLC_IDX["SDLC Indexer"]
    CODE_IDX["Code Indexer"]
  end

  subgraph STORAGE["Shared Graph Storage"]
    CH_GRAPH["ClickHouse (Graph Tables)"]
  end

  subgraph QUERY["gkg-webserver"]
    WEB["Web Server"]
    QE["Graph Query Engine"]
  end

  %% === Data Flow ===
  PG -- Logical Replication --> SYP
  SYP -- CDC Events --> JS
  JS -- Event Streams --> SDLC_IDX
  JS -- Code Indexing Tasks --> CODE_IDX
  CODE_IDX -- Archive Download --> Rails

  SDLC_IDX -- Queries --> CH_RAW
  SDLC_IDX -- Writes --> CH_GRAPH
  CODE_IDX -- Writes --> CH_GRAPH

  SDLC_IDX -.-> KV
  CODE_IDX -.-> KV

  WEB -- Queries --> CH_GRAPH
  WEB -- Status --> KV
  WEB --> QE

  %% === Styling ===
  classDef source fill:#f5f5f5,stroke:#9ca3af,color:#111
  classDef platform fill:#fff7ed,stroke:#fb923c,color:#7c2d12
  classDef indexer fill:#eef2ff,stroke:#4338ca,color:#111
  classDef storage fill:#ecfeff,stroke:#06b6d4,color:#0e7490
  classDef query fill:#f0fdf4,stroke:#16a34a,color:#065f46

  class PG,Rails source
  class SYP,JS,KV,CH_RAW platform
  class SDLC_IDX,CODE_IDX indexer
  class CH_GRAPH storage
  class WEB,QE query
Loading

1. Siphon (Change Data Capture)

Role: Streams data from PostgreSQL to NATS without impacting production database performance.

Shared Use:

  • SDLC indexing: Receives events for issues, merge requests, pipelines, projects, namespaces, and other SDLC entities
  • Code indexing: Receives p_knowledge_graph_code_indexing_tasks to trigger repository indexing and knowledge_graph_enabled_namespaces to trigger namespace backfill

Siphon uses PostgreSQL's logical replication to capture changes from the write-ahead log (WAL), publishing them as protobuf messages to NATS JetStream. This decouples the Knowledge Graph from the production database.

2. NATS JetStream and NATS KV (Event Broker and Distributed Coordination)

Role: Provides durable event streaming and distributed coordination.

Shared Use:

  • Delivers and distributes CDC events (including p_knowledge_graph_code_indexing_tasks) to indexing workers via NATS JetStream subjects.
  • Distributes workload across multiple indexer replicas
  • Provides NATS KV for code indexing task mutual exclusion and cadence coordination. SDLC dispatch deduplication uses per-subject message limits on the JetStream stream.

Both indexing pipelines subscribe to relevant NATS subjects and use the same NATS deployment for event distribution and coordination.

3. ClickHouse (Data Lake and Graph Storage)

Role: Acts as both the raw data lake and the final graph storage layer.

Shared Use:

  • Raw Data Lake: SDLC indexers query raw CDC data from ClickHouse to build graph transformations
  • Graph Storage: Both indexers write to property graph tables (nodes and edges) in ClickHouse
  • Query Backend: The web server queries the same ClickHouse tables for both code and SDLC graphs

We will leverage ClickHouse's columnar storage and merge tree engines to provide bulk inserts, background merging, and adjacency list optimizations for graph traversals.

4. Shared Schema and Data Model

Role: Property graph schema that both indexers write to.

Shared Tables:

  • Node tables for entities (e.g., projects, files, definitions, issues, merge_requests)
  • Edge tables for relationships (e.g., project_has_file, mr_closes_issue, definition_calls_definition)

We will use the same repository-defined graph schema for both code and SDLC indexing. In the current codebase, that shared source of truth is split across:

  • config/graph.sql for the implemented ClickHouse tables
  • config/ontology/ for entity definitions, relationship variants, redaction metadata, and ETL mappings

This shared schema allows linking between the two graphs (e.g., a Project node from the SDLC graph can be linked to a Branch, File, or Definition node from the Code Graph).