| title | GKG ADR 013: Materialize edges from system notes | |
|---|---|---|
| creation-date | 2026-05-18 | |
| authors |
|
|
| toc_hide | true |
Proposed
2026-05-18
The current Knowledge Graph drops a large class of cross-entity relationships before they reach gl_edge. config/ontology/nodes/core/note.yaml filters with where: "system = false", which silently excludes every system note Rails writes when an issue is closed, a merge request is merged, a commit is added to an MR, or one entity references another in free text. The graph-completeness epic and Angelo's "you don't have anything useful" verdict from the 2026-04-20 Orbit sync both name this gap. The concrete shapes missing today are MR<->MR mentions, MR<->WorkItem mentions, MR<->Commit linkages from commit/merge actions, and the REOPENED lifecycle transition (the MERGED and CLOSED slices exist via FK but are sparse on older data).
Three things make this hard. The structured discriminator, system_note_metadata.action, is not yet replicated into Siphon (siphon_notes exists; siphon_system_note_metadata does not, verified by absence from fixtures/siphon.sql and from the Siphon repo's sample config). The target entities of cross-references are encoded as GFM reference tokens inside the free-text body (e.g. mentioned in !123, mentioned in group/subgroup/project#456, mentioned in 54f7727c), not as structured foreign keys. And the source data is large: gitlab-org/orbit/knowledge-graph#499 cites ~6.7M system notes for gitlab-org alone, against a ~4TB global notes table.
The previous attempt at this work, !1109, reached green CI and was then closed by the author on 2026-05-18 with "I'll close this MR, and open an MR with an ADR first." That MR scoped itself to the lifecycle subset (merged, closed, reopened), added a siphon_system_note_metadata fixture, and materialized one pre-filtered ClickHouse view per action (siphon_system_note_merged, siphon_system_note_closed, siphon_system_note_reopened) so the existing standalone-edge ETL machinery could consume them. The closure was process-driven, not correctness-driven.
Three inputs from that MR: (a) standalone edge ETL has no WHERE clause in config/schemas/ontology.schema.json's edgeEtlConfig definition, which forces a per-action view pattern that does not scale to 10+ cross-reference actions × 3 noteable types; (b) the lifecycle-only slice has small marginal value because merge_user_id and closed_by_id FKs already cover most of it (only REOPENED is novel ground); (c) the Siphon prerequisite was acknowledged in the MR description but never filed as a Siphon-side issue, so the MR would have been inert in production.
Constraints: Rails owns authorization and is the source of system note bodies, so we do not control templates and must accept Rails-side phrasing changes as a maintenance cost. The Analytics team owns Siphon, so the new source-table replication is cross-team coordination on the critical path. The v0.5 migration framework from gitlab-org/orbit/knowledge-graph#443 is complete and validated on staging, so adding new edge kinds is a SCHEMA_VERSION bump, not a table rebuild.
This ADR proposes a Rust extraction handler running inside the SDLC indexer, with two batched ClickHouse lookups for entity resolution, gated on a one-time Siphon-side replication of system_note_metadata. The recommendation is backed by a POC harness (!1335) that measured parser throughput on both a 21-entry synthetic golden corpus and a 74,125-note GDK-seeded real corpus, plus an end-to-end ClickHouse resolver pass against a local GDK instance.
Build a dedicated Rust system-notes handler at crates/indexer/src/modules/sdlc/transform/system_notes/. Read siphon_notes joined to siphon_system_note_metadata on note_id (Mode A) or fall back to body-text filtering (Mode B) when the join target is unavailable. Parse the Rails TYPES_WITH_CROSS_REFERENCES subset plus selected EE cross-reference actions, commit (Markdown SHA list), and merge (auto-merge inline SHA, MR-ref fallback).
Resolve targets with two batched IN-list queries against siphon_routes and the entity tables (siphon_merge_requests, siphon_issues, siphon_work_items). Emit MENTIONS, RELATED_TO (supplemental), ADDS_COMMIT, MERGED_AT_COMMIT, CLOSED (supplemental), MERGED (supplemental), and a new REOPENED edge kind via the standard gl_edge writer.
Update (2026-06-17):
CONTAINSis no longer note-derived. The note path described below originally co-wroteWorkItem → WorkItemCONTAINSfrom the containment actions (epic_issue_added,issue_added_to_epic,epic_issue_moved, hierarchy-shapedtask). That path was removed: the FK path (siphon_work_item_parent_links→config/ontology/edges/contains.yaml) is the single current-state source of truth for hierarchy containment, and the note path redundantly co-wrote the samegl_edgetuple. Because both writers shared theReplacingMergeTreeidentity, the note row's wall-clock_version(and hardcoded_deleted = false) permanently outranked the FK tombstone whenever a child was removed from its parent — resurrecting containment the graph no longer holds. The four containment actions are no longer parsed or routed, andCONTAINSis dropped fromconfig/ontology/derived/core/system_note.yaml'semits:list (leaving[MENTIONS, REOPENED]). The MR only stops new note-derivedCONTAINSwrites; existing stale/resurrected rows clear on the next reindex (noSCHEMA_VERSIONbump, accepted in staging per the maintainer). Seegitlab-org/orbit/knowledge-graphMR removing note-derivedCONTAINSand the research verdict indgruzd/tasks#2902. The remaining sections that still mentionCONTAINSdescribe the original (now-superseded) note-derived design.
Vendor Rails' ICON_TYPES constant with a CI drift check modeled on scripts/check-goon-format-version.sh. Ship behind a feature flag with staging benchmarks 2–5 (lookup latency, end-to-end pass, edge-density gain) as the gate to GA.
| Edge | From → To | Source action(s) | Body token to extract |
|---|---|---|---|
MENTIONS |
MergeRequest / WorkItem / Commit → same (cross-typed) | cross_reference |
GFM ref after "mentioned in " |
RELATED_TO (supplement) |
WorkItem <-> WorkItem | relate, unrelate |
GFM ref after "marked … as related to" / "removed the relation with" |
MENTIONS (parent/child) |
WorkItem → WorkItem | relate_to_parent, relate_to_child, unrelate_from_parent, unrelate_from_child |
GFM ref + relation type |
MENTIONS (lifecycle moves) |
WorkItem → WorkItem | moved, cloned, duplicate |
GFM ref after verb phrase |
ADDS_COMMIT |
MergeRequest → Commit | commit |
Markdown list of SHAs |
MERGED_AT_COMMIT |
MergeRequest → Commit | merge (auto-merge variant) |
SHA inline |
MERGED (supplement to FK) |
User → MergeRequest | merged (lifecycle string) |
n/a |
CLOSED (supplement to FK) |
User → MergeRequest, WorkItem | closed |
n/a |
REOPENED (new edge kind) |
User → MergeRequest, WorkItem | reopened |
n/a |
Explicit non-goals: @-mention edges (separate *_user_mentions tables, tracked under a follow-up to gitlab-org/orbit/knowledge-graph#482), resource state / label / milestone events (dedicated resource_*_events tables under #482), Banzai HTML rendering (the parser is regex-only on plain text), external (Jira) issue references, and the Siphon-side replication MR itself.
crates/indexer/src/modules/sdlc/transform/system_notes/
mod.rs // SystemNotesPipeline: impl EntityPipeline + registration
extract.rs // SQL for the JOIN(siphon_notes ⋈ siphon_system_note_metadata)
parse.rs // Regex + per-action dispatch (lifted verbatim from xtask POC)
resolve.rs // siphon_routes IN-list and (project_id, iid) IN-list batchers
emit.rs // Edge row construction → gl_edge writer
vendored/icon_types.rs // Vendored copy of Rails ICON_TYPES, pinned SHA
tests/ // Unit tests for parse.rs (regex coverage)
Under ADR 014's entity-level SDLC dispatch (scaffolded in !1341), each entity-kind dispatched by EntityDispatcher flows through a single shared EntityIndexingHandler, which routes by entity_kind to a per-kind pipeline. ADR 014 introduces SimpleEntityPipeline as the default plan-driven pipeline and names SystemNotes specifically as the motivating example for the EntityPipeline custom-pipeline extension point:
"All current entities use
SimpleEntityPipeline… Future entities (e.g., SystemNotes) can implementEntityPipelinewith custom logic instead of usingSimpleEntityPipeline." — ADR 014, "Handler and pipeline"
ADR 013's SystemNotesPipeline is that custom impl. It receives an EntityIndexingRequest (entity_kind = "SystemNote", scope = IndexingScope::Namespace { namespace_id, traversal_path }, partition = None for v1) on sdlc.entity.indexing.requested.SystemNote.{dotted_traversal_path} and applies the two-stage extract → resolve → emit pipeline below. See Compatibility with entity-level SDLC indexing (ADR 014) for the full forward-compatibility analysis.
This is a custom EntityPipeline implementation, not an ontology-driven plan. The custom-pipeline path is the intended extension point for entities whose ETL shape (multi-edge dispatch off body parsing, cross-typed targets) cannot be expressed in YAML. The departure from the ontology-first convention (AGENTS.md: "New entity types start there, not in Rust") is deliberate and bounded:
- Edge ETL YAML has no
WHEREclause (verified againstconfig/schemas/ontology.schema.json::edgeEtlConfig, which hasadditionalProperties: falseand exposes onlyscope,source,watermark,deleted,order_by,from,to). - A single source table needs to dispatch into ~9 edge variants whose target type depends on body parsing, not on a fixed source-column enum.
- The 10-cross-reference-action × 3-noteable-type expansion through the only available alternative (per-action ClickHouse views, MR !1109's pattern) would land ~30 redundant projections in
fixtures/siphon.sql. Adam's guidance on the Siphon side ("skip indexes, not projections") argues against that shape generally.
Edge kinds are still declared in ontology YAML (config/ontology/edges/{mentions,adds_commit,merged_at_commit,reopened}.yaml); only the ETL logic is Rust. Existing edges that get new supplemental writes (MERGED, CLOSED, RELATED_TO) gain a documented comment pointing at the system-notes handler as an additional emitter.
for batch in extract(): # paginated, watermark-bounded, traversal-path-scoped
parsed = []
for row in batch: # siphon_notes ⋈ siphon_system_note_metadata
action = row.action # Mode A
if action not in TARGET_ACTIONS:
continue
refs = parse_dispatch(action, row.note, row.namespace_traversal_path)
parsed.push((row.id, row.note_id, row.noteable_id, row.noteable_type,
row.author_id, action, refs))
# Two batched lookups (same pattern as namespace_deletion/store.rs)
paths = distinct(ref.full_path for refs in parsed for ref in refs if ref.kind != Commit)
routes = ch.query("""
SELECT source_type, source_id, path, traversal_path
FROM siphon_routes
WHERE _siphon_deleted = false
AND startsWith(traversal_path, {traversal_path:String})
AND path IN ({paths:Array(String)})
""")
iid_pairs = distinct((route.source_id, ref.iid) for ...)
entity_ids = ch.query("""
SELECT id, iid, target_project_id AS project_id FROM merge_requests
WHERE _siphon_deleted = false
AND startsWith(traversal_path, {traversal_path:String})
AND (target_project_id, iid) IN ({pairs:Array(Tuple(Int64,Int64))})
UNION ALL
SELECT id, iid, project_id FROM issues
WHERE ... (analogous)
""")
edges = build_edges(parsed, routes, entity_ids)
emit(edges)
Dedup key: (system_note_metadata.id, edge_kind, source_kind, source_id, target_kind, target_id). ReplacingMergeTree handles re-processing idempotency. The two-stage IN (…) resolution pattern is precedented at config/ontology/nodes/core/project.yaml:115-121 and config/ontology/nodes/core/group.yaml:101-107 (both resolve in the inverse direction source_id → path via siphon_routes).
Default-project resolution for unqualified refs. Each siphon_notes row carries noteable_id and noteable_type. The handler resolves that pair to the source entity's project_id (via siphon_merge_requests / siphon_issues / siphon_work_items) and looks up that project's traversal_path from siphon_routes; the resulting path becomes the scope for unqualified GFM references on that row (!N, #N, short SHAs). The bench harness's --default-project flag is a harness-only artefact — the production handler derives the default per-row from the noteable, and does not call Gitaly to validate commit SHAs.
Future resolver shape: graph-DB dictionaries. Once the graph DB carries a projected, licensed-namespaces-only view of routes, the path-resolution step can be replaced by dictGetOrDefault('project_traversal_paths_dict', 'traversal_path', PROJECT_ID, '0/') — a strictly cheaper shape than the siphon_routes IN-list. This is left as a v2 lever (see Future optimization: graph-DB-side lookup dictionaries).
Both modes operate exclusively on system-authored notes — they read siphon_notes with WHERE system = true AND _siphon_deleted = false. The current config/ontology/nodes/core/note.yaml ETL is the exact inverse (where: "system = false", feeding the user-note Note node table); the system-notes handler is the first ETL that consumes the complementary half of the siphon_notes table.
- Mode A (preferred, production default): join
siphon_notestosiphon_system_note_metadataonnote_idand filtersnm.action IN (TARGET_ACTIONS). The join is needed becausesiphon_system_note_metadatacarries only(note_id, action, commit_count, description_version_id)— the body, noteable (noteable_id,noteable_type), and author live onsiphon_notes. Thesystem = trueconstraint is implicit on the join (the join target only contains system rows) but is still emitted onsiphon_notesfor query-shape clarity. This is the shape the handler is designed for. - Mode B (degraded fallback): if
siphon_system_note_metadatais not yet replicated, the handler readssiphon_notes WHERE system = trueand dispatches by body content: lifecycle actions are detected by exact equality (notes.note = 'closed','merged','reopened'), and cross-reference actions are detected with an anchored regex on the body prefix (^"mentioned in ",^"marked this issue as related to ", etc.). Thesystem = truefilter is load-bearing for Mode B's precision claim: it scopes the prefix-regex away from arbitrary user-typed notes that happen to start with"mentioned in ". Mode B is still slower than Mode A (no metadata-based pre-filter) and is intended only to unblock staging benchmarks and demos while the Siphon-side MR is in flight.
The handler reads its mode from config; production deployment uses Mode A.
ADR 014 replaced the former GlobalHandler + NamespaceHandler split with one EntityHandler per ontology entity type. Each handler subscribes to the shared global or namespace NATS topic and processes a single entity kind per message. Partitioning for initial loads is configured per entity via partition_overrides.
ADR 013's system-notes pipeline is fully compatible with this model: it registers as a namespaced EntityHandler via Plan, reuses the standard checkpoint key format (ns.{id}.SystemNote), and can opt into partitioning later via partition_overrides.SystemNote: N.
Three-layer defence:
- Vendored constant.
crates/indexer/src/modules/sdlc/transform/system_notes/vendored/icon_types.rscarries a literal copy of upstream RailsICON_TYPES(61 values at the time of writing), pinned to a SHA and documented in a header comment. - CI drift check.
scripts/check-system-note-actions.shmirrors the working pattern ofscripts/check-goon-format-version.sh(ADR 012, the analogous "upstream owns the source of truth, we vendor a copy" problem): the script fetches the upstreamsystem_note_metadata.rb, diffs theICON_TYPESarray against the vendored constant, and fails with an explicit message listing values present upstream but missing locally. Wired into lefthook pre-commit and into thelintCI stage. - Runtime safety. The handler's dispatch is
match action { ... _ => log_and_drop }, neverpanic!. Unknown actions surface as a new metricgkg.indexer.sdlc.system_notes.unknown_action_total{action}registered incrates/gkg-observability/src/indexer/sdlc.rs; cardinality is bounded byICON_TYPESsize (~60–100), so a label dimension is safe. See the metrics step of the implementation plan for the full instrument list.
-
(Parallel, lead time): File a Siphon-side issue against
gitlab-org/analytics-section/siphonrequestingsystem_note_metadatareplication; loop in@ahegyi @arun.sori. Specify: skip index onaction, primary key(traversal_path, note_id), mirror Railsdb/structure.sqlexactly. This work item can run in parallel with ADR review; it does not block "Accepted". -
Add
siphon_system_note_metadatatofixtures/siphon.sql. The DDL bytes from MR !1109 are reusable; the two fixture-only bugs the author hit there (stray;inside a SQL comment that broke the integration-testkit's naivesplit(';')schema runner, and aUSING(note_id)clause that should beON sn.id = snm.note_id) are documented in the research package and fixed in this round. -
Vendor
crates/indexer/src/modules/sdlc/transform/system_notes/vendored/icon_types.rsfrom upstream Rails at a pinned SHA. -
Add the CI drift check
scripts/check-system-note-actions.sh+ lefthook hook (model:scripts/check-goon-format-version.sh). -
Add new edge YAML:
config/ontology/edges/{mentions.yaml, adds_commit.yaml, merged_at_commit.yaml, reopened.yaml}.RELATED_TO,CLOSED,MERGEDget a documented comment that the system-notes handler is an additional emitter; no YAML schema change. -
Register the new edge kinds in
config/ontology/schema.yaml. -
Implement
SystemNotesPipelineatcrates/indexer/src/modules/sdlc/transform/system_notes/, liftingparser.rsandresolver.rsverbatim from the POC atcrates/xtask/src/system_notes_bench/. The type implementsEntityPipeline(the trait introduced by a stacked follow-up to !1341, per ADR 014's "Handler and pipeline" section). Registration: insert into theHashMap<String, Arc<dyn EntityPipeline>>held byEntityIndexingHandlerwith key"SystemNote", viamodules/sdlc/mod.rs::register_entity_handlers(added by !1341, now merged). The single shared handler routes byentity_kindautomatically.Custom-pipeline precedent: ADR 014 names SystemNotes specifically as the motivating example for the
EntityPipelineextension point. Custom-handler precedent in the existing codebase:crates/indexer/src/modules/code/. -
Metrics. Hook into the existing
gkg.indexer.sdlc.*catalog (crates/gkg-observability/src/indexer/sdlc.rs) wherever an instrument already fits; add two narrowly-scoped new instruments. This directly addresses the review request to "hook ourselves in the existing metrics":Concern Instrument Status Per-batch duration gkg.indexer.sdlc.pipeline.duration{entity="SystemNote"}Reuse existing Rows extracted gkg.indexer.sdlc.pipeline.rows.processed{entity="SystemNote"}Reuse existing Parse failures gkg.indexer.sdlc.pipeline.errors{entity="SystemNote", error_kind="parse_failure"}Reuse existing Edges emitted gkg.indexer.sdlc.edges_emitted_total{entity, edge_kind}Add to sdlc.rs(general-purpose; future entities benefit)Unknown action drift gkg.indexer.sdlc.system_notes.unknown_action_total{action}Add; cardinality bounded by ICON_TYPES(~60–100)Catalog regeneration via
metrics-catalog-check. The two new instruments land ingkg-observability/src/indexer/sdlc.rs(not in a system-notes-specific module) so the catalog stays domain-aligned. -
Bump
config/SCHEMA_VERSION(currently 44 → 45). -
Update
docs/design-documents/data_model.md,docs/design-documents/indexing/sdlc_indexing.md,AGENTS.md, andCLAUDE.mdin the same MR (per the AGENTS.md design-doc sync rule). -
Integration test
crates/integration-tests/tests/indexer/sdlc/notes.rs::materialises_cross_reference_edges, plus a lifecycle test ported from the closed !1109. The full source of the !1109 lifecycle test is preserved alongside the research package atdgruzd/droid-workspace/task/2685so future implementers do not need to spelunk a closed-MR branch. -
Feature flag (config-driven) defaulting to off; staging-only first. The flag is handler-config-driven and lives where the rest of the handler config lives under ADR 014's entity-dispatch model:
handlers: entity-handler: batch_size_overrides: SystemNote: 100000 # ~13s of resolver budget per pass on local GDK # No partition_overrides for v1; see "Out of scope".
HandlersConfigurationusesdeny_unknown_fields(crates/gkg-server-config/src/engine.rs), so the SystemNote-specific knobs must fit insideentity-handler.batch_size_overrides(the map key is the entity kind) rather than introducing a new top-level config block. Toggling between staging-on and staging-off is a config push, no code change.
POC harness: !1335, crates/xtask/src/system_notes_bench/. 43/43 unit tests pass, clippy -D warnings clean, cargo fmt clean. Breakdown:
| Check | Result |
|---|---|
| Parser correctness (16 actions, deep namespaces, multiple refs, lifecycle no-ops, negative cases) | 25 unit tests, all green |
SQL template shape (named parameters, startsWith(traversal_path), tuple IN-list, _siphon_deleted = false) |
8 unit tests, all green |
In-memory join semantics (path → source_id → entity, namespace rows filtered) |
2 unit tests, all green |
Golden corpus end-to-end through the parser (real Rails-template bodies vendored from app/services/system_notes/*.rb) |
3 corpus-level tests, all green |
xtask system-notes-bench inspect smoke against the corpus |
Output verified against Rails templates |
| End-to-end against a 74,125-note GDK-seeded real corpus, parser + ClickHouse resolver | All 16 action types round-tripped, zero panics, zero incorrect parses over 75,125 × 100 iterations; CH resolver 3-query batch ≤15 ms at batch=5,000. See Benchmark 1 (real data) and Benchmarks 2–3 (early E2E numbers) below. |
Pure-CPU, single core, release build. Two corpora: the 21-entry synthetic golden corpus and a 74,125-note GDK-seeded real corpus covering all 16 action types (full E2E report: !1335 (note 3360033462)).
| Corpus | Notes | Iterations | Median ns/note | Median notes/sec | Refs/pass |
|---|---|---|---|---|---|
| Golden (synthetic) | 21 | 5,000 | 666 | 1,501,501 | 20 |
| Real GDK (seeded) | 75,125 | 100 | 575 | 1,739,130 | 42,155 |
The real corpus runs ~16% faster than the synthetic baseline. The cause is the action mix: about 40% of real GDK notes are lifecycle actions (closed, merged, reopened, opened) that short-circuit before any regex work; the golden corpus over-represents multi-ref relate/commit patterns where the regex actually fires. Real-corpus body length distribution: min 6 chars, median 15, p95 53, max 121, short enough that the regex engine's startup dominates per-note cost.
Correctness on the real corpus: 75,125 notes × 100 iterations produced zero panics, zero incorrect parses, and zero false positives across all 16 action types. Unknown actions seeded into the GDK data (work_item_status, assignee, start_date_or_due_date) were silently dropped with a WARN log as expected by the log_and_drop design.
Both figures exceed the 500k/sec/core pass criterion by ~3×. At gitlab-org's ~6.7M system notes (#499, 2026-04-27) this puts pure-CPU parse time at ~4 seconds per full indexing pass on a single core.
Benchmarks 2 (siphon_routes IN-list latency) and 3 (entity tuple IN-list latency) were originally scoped to staging ClickHouse only. The E2E validation pass ran them against a local GDK instance (Docker CH 25.12.11.4, 93 routes, 120 MRs, 609 issues, traversal_path "") using the real 75k corpus. Staging measurements are still needed for the production-scale verdict, but the GDK numbers validate the query plan.
| batch_size | distinct paths | routes lookup | MR tuple lookup | WorkItem tuple lookup | 3-query total | MR hits | WI hits |
|---|---|---|---|---|---|---|---|
| 100 (synthetic) | 3 | 3 ms | 2 ms | 2 ms | 7 ms | — | — |
| 1,000 (real GDK) | 18 | 3 ms | 2 ms | 3 ms | 8 ms | 11 / 247 pairs | 33 / 337 pairs |
| 5,000 (real GDK) | 18 | 4 ms | 5 ms | 6 ms | 15 ms | 48 / 761 pairs | 119 / 1,217 pairs |
At batch=1,000 (the configured per-batch resolution size) the full 3-query plan resolves in 8 ms against real GDK data, against a ≤50 ms per-batch budget. Combined with the 575 ns/note parse cost: end-to-end throughput (parse + resolve) of ~125k notes/sec on a single core.
Hit rates of 6–10% are realistic for the GDK seed: the seeder references random IIDs up to 500, but GDK only has 120 MRs and 609 issues, so most synthetic references are unresolvable. Unresolvable refs produced zero rows from the entity-lookup queries (correct behaviour; the edge writer drops them).
The GDK numbers validate the query shape but do not exercise (a) the gitlab-org-scale ~6.7M-note corpus, (b) a non-empty traversal_path filter, or (c) siphon_routes.path IN-list against millions of rows where a skip index would matter. The remaining benchmarks need staging ClickHouse access and (for Mode A) the in-progress Siphon replication:
- Benchmark 4 — end-to-end pass against
gitlab-org. Pass criterion: ≤10 min wall-clock for the full namespace. The 125k notes/sec end-to-end figure from the GDK E2E run extrapolates to ~54 s of pure compute for 6.7M notes; the 10-minute budget is dominated by ClickHouse scan time, not by parse + resolve. - Benchmark 5 — edge density gain (the #499 acceptance criterion). Pass criteria: ≥3× MR<->WorkItem edges, ≥10× MR<->MR edges, both vs. current
gl_edgestate forgitlab-org. Proposed as the concrete numeric form of "a material increase in edge density" from the upstream issue; to be agreed at ADR review.
The handler implementation MR will not merge to behind-flag-on until Benchmarks 4 and 5 have produced numbers and the report is attached.
The GDK E2E pass uncovered two bugs in the bench harness (not in the parser or resolver). Both fixed in !1335 commit be315e04.
Array(Tuple)parameter serialization (blocked all CH benchmark runs). The bench built the(project_id, iid)pairs asserde_json::json!([0_i64, *iid]); theclickhousecrate v0.15.0 serialisesValue::Arrayas[0,123], but ClickHouse requires tuple syntax(0,123)forArray(Tuple(Int64,Int64)). Every CH bench run failed withCANNOT_PARSE_INPUT_ASSERTION_FAILED. Fix: use Rust(i64, i64)tuples directly so the serde serialiser produces the correct(pid,iid)form. Bench-harness call-site issue; the resolver SQL template and parameter typing inresolver.rswere already correct.- Hardcoded
gitlab-org/gitlabdefault project (zero entity hits). The bench used"gitlab-org/gitlab"as the default project path for un-namespaced refs. That path does not exist outside GitLab.com, so on GDK the routes lookup returned 0 rows and all(project_id, iid)pairs collapsed to(0, iid), matching no entity. Fix: added a--default-projectflag (defaulttoolbox/gitlab-smoke-tests, present in every GDK seed) and an--inputflag so the CH bench consumes the same JSONL dump as the parser bench. The production handler receives the default project path from indexer config.
Both fixes are localised to crates/xtask/src/system_notes_bench/; the parser and resolver modules themselves did not require changes.
| Layer | Where | Covers |
|---|---|---|
| Unit (Rust regex) | transform/system_notes/parse.rs tests |
All 16 action variants, namespace-prefixed refs (1–20 segments), shorthand refs, commit SHAs (7–40 hex), malformed bodies, lifecycle no-ops |
| Unit (dispatch) | transform/system_notes/mod.rs tests |
Action → edge-kind mapping, unknown action → log + drop, target-type resolution |
| Unit (POC, lifted) | crates/xtask/src/system_notes_bench/{parser,resolver,golden}.rs |
43/43 tests, carried over verbatim into the production handler |
| Snapshot | tests/snapshots/system_note_bodies.rs |
Real production-style note bodies pinned in fixtures |
| Integration | crates/integration-tests/tests/indexer/sdlc/notes.rs |
Full extract → transform → write against ClickHouse testcontainers, per-action assertions, lifecycle + cross-reference |
| CI drift | scripts/check-system-note-actions.sh |
Vendored ICON_TYPES vs. upstream Rails source-of-truth |
| Benchmark report | crates/xtask/src/system_notes_bench/ (POC harness output) |
Reproduces Benchmarks 1–5 numbers, attached to the implementation MR description |
Why not Option B — Rails internal endpoint. Calling Rails on the indexing hot path couples GKG throughput to Puma thread capacity for a workload whose primary cost is text regex matching. The MR-diff resolver (gitlab-org/orbit/knowledge-graph#482) chose Rails precisely because diff data lives in object storage; system notes are already in siphon_notes, so there is nothing structural Rails can give us that the lake does not have. Backfilling a ~4TB notes table through Rails would be a production incident waiting to happen. Where Option B could fit is a future contingency: calling Rails only for unknown action values as a schema authority. That is not a Stage 1 plan.
Why not Option C — structured-action only, no text parsing. MR !1109 was effectively Option C for the lifecycle subset and the author closed it as insufficient. The cross-reference gap (the actual "you don't have anything useful" problem) stays unaddressed. Lifecycle edges are 80% redundant with existing FKs (merge_user_id, closed_by_id); only REOPENED is novel ground. Option C is preserved as a degraded mode (Mode B above), but it is not the recommended target shape.
Why not Option D — Siphon-side precomputation. Siphon's table-mapping config supports only TransformationType: "ignore" (pkg/siphon/table_mapping_config.go). A reference-extraction primitive would be a net-new Go feature in a repo GKG does not own, carrying the same parser-drift risk as Option A but in a codebase where our reviewers cannot land fixes. No offsetting benefit; unconditionally dominated by Option A.
Option E — Cached HTML DOM parsing (future evolution path). Rails resolves GFM references at note insert time and caches the rendered HTML in notes.note_html (and description_html on issues / MRs / work items). Each resolved reference is emitted as an <a> element with data- attributes (data-reference-type, data-project, data-issue / data-merge-request / data-commit / data-work-item) — a stable machine-readable contract that the Rails redactor already relies on. An alternative extraction shape is: read note_html instead of note, walk the DOM (e.g. lol_html streaming, or scraper), and extract (reference_type, target_id, target_path) tuples directly from the <a data-…> attributes — falling back to body-parse when note_html is NULL or stale.
Why not in v1: (a) notes.note_html replication via Siphon has not been verified — siphon_notes.note is the known quantity; (b) the POC's 1.74M notes/sec/core throughput against body text is a measured baseline, while DOM-walk throughput on a note_html payload (10–50× larger per row) is unmeasured; (c) Option E still needs siphon_system_note_metadata for the lifecycle subset (closed / merged / reopened, which render no <a>), so it does not eliminate the Siphon coordination item on the critical path; (d) Option A's body-template surface, while less stable than the HTML render surface, is small (16 actions, vendored constant + CI drift check).
When we would flip: if Benchmark 4 (gitlab-org end-to-end) shows the body-parse regex path is too brittle at production scale, or the graph-completeness epic expands scope to user notes (system = false) and entity descriptions — Option E is a strict superset of Option A's coverage there, and the noteable-driven default-project resolution becomes free (Rails has already resolved data-project at render time). The implementation MR's POC harness (crates/xtask/src/system_notes_bench/) can be extended cheaply with a parse-html subcommand to microbench note vs note_html extraction on the existing 75k GDK corpus once Siphon replication of note_html is confirmed. Option E is therefore documented here as a future evolution path, not a v1 option.
Why not the per-action ClickHouse VIEW approach (MR !1109's pattern). Works for 3 lifecycle actions (the !1109 scope). Does not scale to 10+ cross-reference actions × 3 noteable types ≈ 30 views, each materializing redundant projections of siphon_notes ⋈ siphon_system_note_metadata. A single Rust handler with one inline WHERE snm.action IN (…) is the cleaner shape. Adam's guidance ("skip indexes, not projections") for new Siphon tables argues against introducing many derivative views in fixtures/siphon.sql as a matter of project style.
Why not split into two MRs (lifecycle first, cross-references second). This is the open question the ADR review should settle. The case for one MR (the current recommendation): both slices share the same Siphon prerequisite, the same handler module, the same SCHEMA_VERSION bump, and the same staging cycle; splitting forces two reviews of largely-overlapping code. The case for two MRs: smaller code per review, lifecycle ships visible value to dashboards faster, lower risk that a regex bug in cross-reference parsing blocks the lifecycle ship. The handler is feature-flagged per action, so the two-MR plan is recoverable from the one-MR codebase by toggling flags.
What improves:
- Closes the MR<->MR / MR<->WorkItem / MR<->Commit edge gap that motivates the graph-completeness epic.
- One end-to-end story for system-note edges: one handler, one CI drift check, one new metric family, one feature flag. Future cross-reference actions (Rails ships a new action; agent training surfaces a new edge need) become a regex / match-arm change, not a YAML + fixture + ETL change.
- Under the entity-based dispatch model (ADR 014 / !1341), system-notes gets its own NATS subject (
sdlc.entity.indexing.requested.SystemNote.{dotted_traversal_path}) and ack lifecycle. A slow or failing system-notes pass no longer redelivers MergeRequest / Issue / Pipeline messages for the same namespace; conversely, a slow MR pass does not block system-notes. The stream'smax_messages_per_subject: 1+discard_new_per_subject: truededuplication operates at the exact(entity_kind, scope)level, which is strictly finer-grained than today's per-namespace ack scope. - The POC harness output is reusable as a regression baseline: any throughput regression in the production handler can be checked against the 1.5M notes/sec/core synthetic POC number and the 1.74M notes/sec/core real-GDK E2E number, plus the 8 ms 3-query CH resolver budget at batch=1,000.
What gets harder:
- A new Rust module to maintain. Action-template drift is a real ongoing cost, quantified by the CI check and the unknown-action metric, but a real cost.
- Two new Siphon-side prerequisites:
system_note_metadatatable replication, and (likely, pending Benchmark 2) a skip index onsiphon_routes.path. - The handler is the first non-ontology-driven SDLC handler outside
modules/code/andmodules/namespace_deletion/. It departs from the documented "ontology first" convention and the deviation needs to be motivated in code comments + AGENTS.md. - Mode B is dead code if Analytics replication lands cleanly; carrying it adds test surface that exists purely as a fallback.
The current resolver runs against the analytics DB (siphon_routes, siphon_merge_requests, siphon_issues, siphon_work_items) because that is where the source rows live. @ahegyi and @michaelangeloio both raised, from different angles, the same structural question: the resolver workload is a hot-path lookup against a relatively small projected set (licensed namespaces' routes plus per-entity (project_id, iid) → id tuples), and the graph DB is the more natural home for that lookup once a projected view exists there.
Two concrete shapes have been proposed:
- ClickHouse
DICTIONARY(project_traversal_paths_dict). Replaces thesiphon_routesIN-list withdictGetOrDefault('project_traversal_paths_dict', 'traversal_path', PROJECT_ID, '0/')— a constant-time per-row lookup that avoids the table scan dimension entirely. Pre-requisite: the dictionary must be defined and refreshed in the graph DB (cadence, ownership, and source query are open questions for @ahegyi). - Load-routes-once (in-memory). The handler loads the licensed-namespaces route table once per pass into an in-process
HashMap<path, traversal_path>and resolves per-row in memory. Cheaper than even a dictionary for small route counts; bounded only by per-worker memory and route-table size.
Neither is needed for v1: the measured 3-query plan resolves in 8 ms at batch=1,000 against GDK, well inside the ≤50 ms per-batch budget. The trigger to revisit is either Benchmark 4 failing the 10-minute wall-clock criterion against gitlab-org, or >2× growth in route-lookup latency observed at staging scale. Both shapes are graph-DB-resident, so they are not blocked on Siphon coordination.
- Intra-batch parallelism (rayon / per-row parallel parse). Not pursued in v1: pure-CPU parse is ~4 s for the full 6.7M
gitlab-orgcorpus at 1.74M notes/sec/core — it is not the bottleneck. Horizontal partitioning via ADR 014'spartition_overrides.SystemNote: Nis the lever once we outgrow a single worker. @-mention edges (*_user_mentionstables). Separate effort, separate Siphon prerequisite.- Resource state / label / milestone events. Tracked under #482 with dedicated
resource_*_eventsSiphon replication. - Banzai HTML rendering. The parser only extracts GFM references from plain text;
lib/banzai/reference_parser/*_parser.rb(HTML-AST-based) is explicitly not what we port. - External (Jira) issue references. Out per upstream #499.
- The Siphon replication MR itself. Filed as a separate Analytics-owned MR against
gitlab-org/analytics-section/siphon; this ADR depends on it but does not specify it. @-link_typeproperty onMENTIONSto distinguishrelatevs.movedvs.duplicate. Open design question for review feedback; default proposal is yes, using the existinglink_typeenum pattern fromrelated_to.yaml. To be settled in the implementation MR, not the ADR.- Partitioning of the system-notes ETL across workers.
v1 ships with
partition = Noneon theEntityIndexingRequestbecause the POC measured ~125k notes/sec end-to-end on a single core, giving ~54 s pergitlab-org-scale pass — well inside the per-message budget. Partitioning is available later through ADR 014's dispatcher-ownedPartitionAssignmentmachinery: settinghandlers.entity-handler.partition_overrides.SystemNote: Nmakes the dispatcher compute quantile boundaries and publish N messages, each carrying aPartitionAssignmentwhoseRange { lower_bound, upper_bound }the pipeline applies as a SQLWHEREconjunct. The default partition column derivation in ADR 014 picks the first non-scope column of the sourceorder_by; forsiphon_notesthat isnoteable_type(low-cardinality, ~10 enum values), so when partitioning is enabled the implementation will need either a per-entitypartition_columnoverride or a customPartitionStrategyregistered forSystemNote. The natural partition column issiphon_system_note_metadata.note_id(high cardinality, primary key). Listed here so a future contributor does not re-derive that we already considered it.
system_note_metadataSiphon replication slip. Longest lead-time item. The Siphon-side MR is filed in parallel with this ADR; if it slips past the implementation MR review, the handler ships in Mode B and flips to Mode A when replication lands. The mode switch is a single config change with no schema impact.- Parser drift against Rails'
ICON_TYPESand body templates.ICON_TYPEShas grown across releases (now 61 values; prior captures show ~50) and Rails has moved system-note phrasing more than once. Mitigation: vendored constant + CI drift check +log_and_dropon unknown actions + regex anchored on the GFM-reference token rather than on the verb phrase (so phrasing changes do not break extraction). E2E validation against a real 75k-note GDK corpus seeded with unknown actions confirmed thelog_and_droppath silently absorbs unrecognised values without breaking the pass (see POC results). - Custom-handler maintenance cost. This is the first cross-reference-oriented handler departing from the ontology-first convention. Mitigation: confine the deviation to the materialization logic only (edge kinds still declare in YAML) and document the rationale in
AGENTS.mdso future ADRs do not treat this as precedent for arbitrary custom handlers. - Full-table-scan cost on resolver second-hop lookups.
siphon_routes,merge_requests, andwork_itemsareORDER BY (traversal_path, ...), but the resolver's filter columns (path,source_id,(target_project_id, iid),(project_id, iid)) are not usable PK prefixes — without atraversal_pathleg, every lookup is a full scan of the shared Siphon datalake. Mitigation (implemented): all four resolver queries carrystartsWith(traversal_path, {root_prefix:String})whereroot_prefixis the source note's top-level namespace prefix (<org>/<top_level_ns>/). This turns each full scan into a primary-index range scan bounded to one top-level namespace partition. The trade-off is that v1 resolves only same-top-level-namespace references; a cross-top-level reference (other-group/proj#5) lives outside the prefix and is silently not resolved (under-counts, never a wrong edge). Cross-top-level resolution is deferred to the graph-DB dictionary lever (see Future optimization), which also resolves the cross-namespace edge-visibility (authz) question. Thecross_top_level_reference_is_not_resolvedintegration test guards this as a deliberate limitation. - Acceptance threshold vagueness.
gitlab-org/orbit/knowledge-graph#499says "a material increase in edge density (numeric target to be set after initial measurement)". This ADR proposes ≥3× MR<->WorkItem and ≥10× MR<->MR as concrete numeric forms. If review prefers different thresholds, the POC harness (xtask system-notes-bench) can re-run cheaply.
v1 resolves only references whose target lives under the same top-level namespace as the source note. Every resolver query (siphon_routes, merge_requests, work_items) is bounded by startsWith(traversal_path, {root_prefix:String}) to avoid full-scanning the shared Siphon datalake tables. This means:
- Same-project references (
#123,!456): always resolved. - Same-top-level cross-project references (
sibling-group/project!42): resolved, because the target's route falls under the same top-level namespace prefix. - Cross-top-level references (
other-org/project#5): intentionally not resolved. The target's route lives outside theroot_prefix, so it never appears in the bounded scan. This produces under-counts (missing edges), never wrong edges. - Commit references (
deadbeef): parsed but not resolved (noCommitnode type yet).
Cross-top-level resolution is deferred to the graph-DB dictionary lever (§ Future optimization), which replaces the siphon_routes IN-list with a constant-time dictionary lookup that naturally spans all namespaces. That lever also resolves the cross-namespace edge-visibility (authz) question — a cross-top-level edge might reference a project the querying user cannot see, and the dictionary can enforce that at read time.
The system-notes handler is implemented as a BlockTransform (ADR 015) rather than the EntityPipeline described in the original decision section. ADR 015 refined ADR 014's extension point: the seam is the transform stage, not a custom pipeline. The SystemNotesTransform implements BlockTransform and is registered via TransformRegistry::register("system_notes", factory). The extract plan is declared as a derived entity in config/ontology/derived/core/system_note.yaml and rides the shared Pipeline for paging, checkpointing, and streaming writes. Resolver lookups against siphon_routes, merge_requests, and work_items split bound array params into engine.handlers.entity-handler.system_notes_resolve_lookup_batch_size chunks and union the decoded rows, because ClickHouse HTTP params are serialized into the request URL.
- Upstream issue:
gitlab-org/orbit/knowledge-graph#499 - POC MR: !1335
- POC E2E validation report against GDK: !1335 note 3360033462 (74,125-note seeded corpus, parser + ClickHouse resolver, bench-harness bug fixes)
- Closed prior MR: !1109
- Related:
gitlab-org/orbit/knowledge-graph#482(MR ingestion gaps),gitlab-org/orbit/knowledge-graph#443(migration framework, complete), graph-completeness epic - Entity-level SDLC indexing scaffold: !1341 (ADR 014, supersedes the earlier !1272 draft); pipeline-owned vs. dispatcher-owned partitioning trade-off discussed in this ADR's Compatibility section
- Prior research and POC plan: dgruzd/droid-workspace/task/2685, E2E validation report: task/2685-e2e, entity-refactor compatibility analysis: task/2685-entity-refactor
- Rails source of truth:
app/models/system_note_metadata.rb(ICON_TYPES,TYPES_WITH_CROSS_REFERENCES),app/services/system_notes/*.rb(body templates),app/models/{issue,merge_request,commit,project}.rb(reference_pattern) - Siphon repo:
gitlab-org/analytics-section/siphon - ADR precedent: 009 (Code Indexer Service) for implementation-plan shape; 012 (GOON Format) for benchmark-driven decision rationale and the vendored-constant + CI drift-check pattern
- Custom-handler precedent in code:
crates/indexer/src/modules/code/,crates/indexer/src/modules/namespace_deletion/ - Schema version file:
config/SCHEMA_VERSION(50 → 51 with this work) - Routes-join precedent:
config/ontology/nodes/core/project.yaml:115-121,config/ontology/nodes/core/group.yaml:101-107 - Note filter today:
config/ontology/nodes/core/note.yaml(where: "system = false")