| title | GKG ADR 011: Agent Command Surface (lazy-mcp pattern) |
|---|---|
| creation-date | 2026-05-06 |
| authors | |
| toc_hide | true |
Accepted
2026-05-06
GKG already exposes an agent-facing surface in three places:
- gRPC: KnowledgeGraphService on the GKG server (crates/gkg-server/proto/gkg.proto).
- REST: GET /api/v4/orbit/* endpoints owned by Rails (proxy to GKG).
- MCP: tools/list and tools/call handled by Rails, fanning out to GKG.
The existing agent tools and structured endpoints were each exposed with hand-written descriptions and JSON Schema. This worked for the first agent integrations (GitLab Duo, Agentic Chat) but broke down once we tried to make the same surface usable by external coding agents (Claude Code, OpenCode, Codex):
- Tool descriptions are truncated. Claude Code truncates anything over ~2000 characters. The query_graph description embeds the full query DSL (config/schemas/graph_query.schema.json) so the LLM has any chance of writing a valid query, and that pushes us well over the limit. The grammar gets cut off mid-token and the agent immediately produces invalid queries.
- Schema discovery is incomplete. get_graph_schema returns node and edge metadata but does not include the descriptions on properties, so the LLM cannot tell that definition_type is a coarse category and that the language-specific fine-grained labels (e.g. decorated_async_function) live somewhere else. Agents hallucinate filter values like definition_type = "function" that do not exist.
- No way to discover the response shape. Coding agents that compose Python or shell pipelines on top of query_graph need the response JSON Schema and its semver to write iteration code. Today that schema lives in config/schemas/query_response.json and is not exposed by any RPC or REST endpoint.
- Every new capability requires a Rails MR. Rails owns the MCP tool catalog, the REST routes, and the gRPC client. Adding a new tool means changes in three repositories with three review queues. We have shipped roughly one tool every few months, when we want to ship several times per week as agents surface new usability bugs.
- query_graph cannot move into the GKG executor. query_graph goes through GitLab Workhorse for streaming and JWT-scoped redaction. It relies on Rails-only context that the GKG executor does not have.
The team converged on the lazy-mcp pattern — a discovery and invocation tool pair that lets a single MCP entry point expose an arbitrary catalog of typed sub-commands. We already use lazy-mcp internally and trust the pattern. The decision is to apply the same pattern to GKG's agent surface, with one wrinkle: commands that need Rails context must still be intercepted by Rails before reaching the GKG executor.
Adopt a two-tool MCP surface (list_commands, invoke_command) backed by a typed command registry in GKG. Rails intercepts the commands that depend on Rails-only behavior; everything else dispatches directly to GKG.
The agent-facing surface collapses to two MCP tools and two REST endpoints. The structured query and schema tools become commands behind that surface. The structured REST endpoints (/api/v4/orbit/query, /schema, /graph_status, /tools) stay in place so the GitLab UI and existing programmatic consumers do not break.
| Layer | Surface |
|---|---|
| MCP tools/list | Returns only list_commands and invoke_command |
| MCP tools/call | Routes through invoke_command |
| REST (agent) | GET /api/v4/orbit/agent/commands, POST /api/v4/orbit/agent/commands/:name |
| REST (UI / programmatic) | GET /api/v4/orbit/{query,schema,graph_status,tools,status} (unchanged) |
| GKG gRPC | ListAgentCommands, InvokeAgentCommand (new); plus GetQueryDsl, GetResponseFormat (new), and the existing ExecuteQuery, GetGraphSchema, GetGraphStatus, ListTools, GetClusterHealth |
The new agent REST endpoints sit under /orbit/agent/* and are marked hidden: true in Grape. That namespace is the agent-only contract: GKG can change the command catalog at any time without breaking dashboards or hand-written API clients, because dashboards are expected to use the structured /orbit/{query,schema,...} endpoints.
The command registry lives in the GKG server at crates/gkg-server/src/tools/registry.rs (CommandRegistry). Each command has a name, a short description, and a JSON Schema for its parameters — the same ToolDefinition shape we already use for MCP tools.
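As a rough illustration of the registry shape described above, the following Python sketch models a catalog whose entries carry a name, a short description, and a JSON Schema. It is not the Rust implementation in registry.rs: the class and method names, the descriptions, and the parameter schemas are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class CommandDefinition:
    """One catalog entry: name, short description, JSON Schema for parameters."""

    name: str
    description: str
    parameters_schema: dict[str, Any]


class CommandRegistry:
    """Illustrative in-memory registry; names and behavior are assumptions."""

    def __init__(self) -> None:
        self._commands: dict[str, CommandDefinition] = {}

    def register(self, command: CommandDefinition) -> None:
        if command.name in self._commands:
            raise ValueError(f"duplicate command: {command.name}")
        self._commands[command.name] = command

    def get(self, name: str) -> Optional[CommandDefinition]:
        return self._commands.get(name)

    def list_commands(
        self, command_names: Optional[list[str]] = None
    ) -> list[CommandDefinition]:
        """Return the whole catalog, or only the requested names.

        Unknown names are silently skipped in this sketch.
        """
        if command_names is None:
            return list(self._commands.values())
        wanted = set(command_names)
        return [c for c in self._commands.values() if c.name in wanted]


# Hypothetical parameter schemas; the real ones live in the GKG server.
FORMAT_ONLY = {
    "type": "object",
    "properties": {"format": {"type": "string", "enum": ["raw", "llm"]}},
}
OPEN_OBJECT = {"type": "object"}

registry = CommandRegistry()
for name, description, schema in [
    ("query_graph", "Run a graph query", OPEN_OBJECT),
    ("get_graph_schema", "Return node and edge metadata", FORMAT_ONLY),
    ("get_query_dsl", "Return the query DSL grammar and its version", FORMAT_ONLY),
    ("get_response_format", "Return the response JSON Schema and its version", FORMAT_ONLY),
]:
    registry.register(CommandDefinition(name, description, schema))
```

Keeping the entry shape identical to a tool definition means a discovery response can be produced by serializing the registry entries directly.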
Initial catalog:
| Command | Where it executes | Why |
|---|---|---|
| query_graph | Rails interceptor builds workhorse_send_data; Workhorse calls GKG ExecuteQuery | Needs Workhorse streaming and the bidirectional redaction exchange |
| get_graph_schema | GKG executor (InvokeAgentCommand) | Pure ontology lookup, no Rails context required |
| get_query_dsl | GKG executor (InvokeAgentCommand) | Returns config/schemas/graph_query.schema.json and config/QUERY_DSL_VERSION (RAW) or a versioned TOON-condensed grammar (LLM) |
| get_response_format | GKG executor (InvokeAgentCommand) | Returns the response JSON Schema and its semver from RAW_OUTPUT_FORMAT_VERSION |
The two new commands (get_query_dsl, get_response_format) directly answer the discovery problems that motivated this ADR:
- get_query_dsl decouples the DSL grammar from the query_graph tool description. Agents that hit truncation can still fetch the full grammar on demand, along with QUERY_DSL_VERSION. Direct API consumers can use the GetQueryDsl RPC or a REST endpoint such as GET /api/v4/orbit/dsl; MCP agents use the command catalog and InvokeAgentCommand.
- get_response_format returns the JSON Schema for the formatter output plus the matching RAW_OUTPUT_FORMAT_VERSION. Coding agents that build Python iteration on top of query_graph get an authoritative shape they can pin against.
Both new commands accept a format: raw | llm parameter, mirroring get_graph_schema. RAW returns the verbatim JSON Schema; LLM returns a TOON-condensed form to save tokens.
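To make the idea of pinning against a response-format version concrete, here is a minimal Python sketch of the check a coding agent might run before iterating over query results. It assumes the version follows ordinary semver compatibility rules (same major, equal or newer minor and patch), and that the get_response_format payload exposes the version under a key named "version". Both are assumptions for illustration, not a documented contract.

```python
def parse_semver(version: str) -> tuple[int, int, int]:
    """Parse a plain MAJOR.MINOR.PATCH string (no pre-release tags)."""
    major, minor, patch = version.strip().split(".")
    return int(major), int(minor), int(patch)


def is_compatible(pinned: str, reported: str) -> bool:
    """True when `reported` has the same major version and is not older than `pinned`."""
    pinned_v = parse_semver(pinned)
    reported_v = parse_semver(reported)
    return reported_v[0] == pinned_v[0] and reported_v >= pinned_v


# Hypothetical payload shape; the key names are assumptions.
response_format = {"version": "1.3.1", "schema": {"type": "object"}}

PINNED_VERSION = "1.2.0"
if not is_compatible(PINNED_VERSION, response_format["version"]):
    raise RuntimeError("response format changed incompatibly; regenerate iteration code")
```

An agent that fails this check can re-fetch the schema and regenerate its iteration code instead of silently misreading rows.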
sequenceDiagram
participant Agent
participant Rails as Rails (MCP / REST)
participant Interceptor as CommandInterceptor
participant GKG as GKG (gRPC)
participant Workhorse
Agent->>Rails: tools/call invoke_command(name, params)
Rails->>Interceptor: intercept_orbit_command(name, args)
alt name == query_graph
Interceptor-->>Rails: workhorse_send_data
Rails-->>Agent: 200 (Workhorse handles streaming)
Note right of Workhorse: ExecuteQuery + redaction
else other command
Interceptor-->>Rails: not handled
Rails->>GKG: InvokeAgentCommand(name, params)
GKG-->>Rails: result_json or formatted_text
Rails-->>Agent: 200 JSON
end
Two control points keep this safe:
- Rails interceptor (Analytics::Orbit::CommandInterceptor). Sits in front of the GKG dispatch. For query_graph it builds the Workhorse send-data. It returns an InterceptResult { handled, result, workhorse_send_data } so the caller can tell whether the command was consumed.
- GKG executor guard (ExecutorError::InterceptedCommand). ToolService::resolve_command matches on query_graph and returns InterceptedCommand, which the gRPC handler maps to FAILED_PRECONDITION. If a misconfigured Rails ever forwards an intercepted command, GKG refuses rather than executing without the Rails context.
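The two control points can be modeled in a few lines. The Python sketch below is a simplified stand-in for the Ruby interceptor and the Rust executor guard: the function names, the return shapes, and the placeholder send-data string are invented for illustration. Only the InterceptResult field names and the refusal behavior come from the description above.

```python
from dataclasses import dataclass
from typing import Any, Optional

# Commands that must be consumed by Rails and never executed by GKG directly.
INTERCEPTED_COMMANDS = frozenset({"query_graph"})


@dataclass
class InterceptResult:
    handled: bool
    result: Optional[dict[str, Any]] = None
    workhorse_send_data: Optional[str] = None


class InterceptedCommandError(Exception):
    """Stand-in for ExecutorError::InterceptedCommand (FAILED_PRECONDITION over gRPC)."""


def intercept_orbit_command(name: str, args: dict[str, Any]) -> InterceptResult:
    """Rails side: consume commands that need Rails-only context."""
    if name in INTERCEPTED_COMMANDS:
        # Placeholder string; the real value is a Workhorse send-data header.
        return InterceptResult(handled=True, workhorse_send_data=f"send-data:{name}")
    return InterceptResult(handled=False)


def gkg_invoke_agent_command(name: str, params: dict[str, Any]) -> dict[str, Any]:
    """GKG side: refuse intercepted commands instead of running without Rails context."""
    if name in INTERCEPTED_COMMANDS:
        raise InterceptedCommandError(name)
    return {"command": name, "params": params}


def rails_invoke_command(name: str, args: dict[str, Any]) -> dict[str, Any]:
    """Two-layer dispatch: interceptor first, GKG executor as the fallback."""
    outcome = intercept_orbit_command(name, args)
    if outcome.handled:
        return {"via": "workhorse", "send_data": outcome.workhorse_send_data}
    return {"via": "gkg", "result": gkg_invoke_agent_command(name, args)}
```

The guard on the GKG side is deliberately redundant: it only fires when the Rails layer is bypassed or misconfigured.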
list_commands accepts an optional command_names array and returns a slice of the registry. invoke_command requires command_name and accepts a generic parameters object that GKG validates against the registered schema.
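Because the parameters object is generic, the validation step matters. The sketch below shows the kind of check involved, using a hand-rolled validator that covers only a tiny subset of JSON Schema (required, type, enum, additionalProperties). A real executor would presumably use a full JSON Schema validator; the schema shown for get_graph_schema is an assumption.

```python
from typing import Any

# JSON Schema type names mapped to Python types (subset only).
_PYTHON_TYPES: dict[str, tuple[type, ...]] = {
    "string": (str,),
    "integer": (int,),
    "number": (int, float),
    "boolean": (bool,),
    "array": (list,),
    "object": (dict,),
}


def _matches_type(value: Any, type_name: str) -> bool:
    expected = _PYTHON_TYPES.get(type_name)
    if expected is None:
        return True  # unknown type keyword: not checked in this sketch
    if isinstance(value, bool) and type_name != "boolean":
        return False  # bool subclasses int in Python; JSON treats it separately
    return isinstance(value, expected)


def validate_parameters(schema: dict[str, Any], parameters: Any) -> list[str]:
    """Return human-readable errors; an empty list means the parameters are valid."""
    if not isinstance(parameters, dict):
        return ["parameters must be a JSON object"]
    errors = [
        f"missing required parameter: {key}"
        for key in schema.get("required", [])
        if key not in parameters
    ]
    properties = schema.get("properties", {})
    for key, value in parameters.items():
        prop = properties.get(key)
        if prop is None:
            if schema.get("additionalProperties", True) is False:
                errors.append(f"unknown parameter: {key}")
            continue
        if "type" in prop and not _matches_type(value, prop["type"]):
            errors.append(f"{key}: expected {prop['type']}")
        elif "enum" in prop and value not in prop["enum"]:
            errors.append(f"{key}: must be one of {prop['enum']}")
    return errors


# Hypothetical registered schema for get_graph_schema.
GET_GRAPH_SCHEMA_PARAMS = {
    "type": "object",
    "properties": {
        "format": {"type": "string", "enum": ["raw", "llm"]},
        "expand_nodes": {"type": "array"},
    },
    "additionalProperties": False,
}
```

Returning a list of errors rather than raising on the first one lets the agent fix every parameter problem in a single retry.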
The MCP wrapper (API::Orbit::McpHandlers::CallTool) advertises either the legacy tool set or the new list_commands/invoke_command pair, controlled by a feature flag (see Feature flag rollout). Agents discover the command catalog by calling list_commands once at the start of a session.
The ai-assist Orbit agent prompt encodes the contract that agents are expected to follow (!5446):
- Call orbit_list_commands once per session.
- Before the first query, call orbit_invoke_command twice: once with command_name=get_query_dsl and once with command_name=get_graph_schema. Do not guess node, edge, or property names from GitLab API terminology.
- Call orbit_invoke_command with command_name=query_graph for queries.
- On a schema-violation error, re-fetch get_graph_schema with the relevant node expanded before retrying.
This is the same shape as lazy-mcp: a single discovery call followed by typed invoke_command calls. Agents keep one mental model regardless of whether the underlying command runs in GKG, Rails, or Workhorse.
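The session contract above can be sketched as a small driver loop. The client below is a hypothetical in-memory fake that records calls and simulates one schema violation. The error payload shape ("error", "node") and the parameter names passed to each command are assumptions, not the documented wire format.

```python
from typing import Any, Optional


class FakeOrbitClient:
    """Hypothetical client: records calls instead of talking to MCP."""

    def __init__(self) -> None:
        self.calls: list[str] = []
        self._fail_next_query = True  # simulate a single schema violation

    def orbit_list_commands(self) -> list[str]:
        self.calls.append("orbit_list_commands")
        return ["query_graph", "get_graph_schema", "get_query_dsl", "get_response_format"]

    def orbit_invoke_command(
        self, command_name: str, parameters: Optional[dict[str, Any]] = None
    ) -> dict[str, Any]:
        self.calls.append(command_name)
        if command_name == "query_graph" and self._fail_next_query:
            self._fail_next_query = False
            return {"error": "schema_violation", "node": "Definition"}
        return {"ok": True, "command": command_name}


def run_session(
    client: FakeOrbitClient, query: dict[str, Any], max_retries: int = 1
) -> dict[str, Any]:
    # 1. Discover the catalog once per session.
    client.orbit_list_commands()
    # 2. Fetch the grammar and the schema before the first query.
    client.orbit_invoke_command("get_query_dsl", {"format": "llm"})
    client.orbit_invoke_command("get_graph_schema", {"format": "llm"})
    # 3. Run the query through the same invoke entry point.
    response = client.orbit_invoke_command("query_graph", query)
    retries = 0
    # 4. On a schema violation, re-fetch the schema with the node expanded, then retry.
    while response.get("error") == "schema_violation" and retries < max_retries:
        client.orbit_invoke_command(
            "get_graph_schema", {"format": "llm", "expand_nodes": [response["node"]]}
        )
        response = client.orbit_invoke_command("query_graph", query)
        retries += 1
    return response
```

Bounding the retry count keeps a persistently invalid query from looping on schema fetches.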
A single "exploration" tool with an array of capability flags was considered. The team rejected it because:
- An array of opaque keys ("schema", "dsl", "status") does not type-check the parameters for each capability. Smaller models cannot reliably reason about which fields go with which key.
- list_commands plus invoke_command matches an existing pattern (lazy-mcp) the team already trusts and external agents already know how to consume.
A radical version of this proposal would push query_graph into InvokeAgentCommand as well, removing the interceptor entirely. We did not take that step because:
- query_graph runs through Workhorse for streaming and for the bidirectional redaction exchange with Rails. Moving it into the GKG executor would either lose streaming (buffering large result sets in Rails) or require a parallel Workhorse path that duplicates the existing one.

The two-layer dispatch (Rails interceptor first, GKG executor as fallback) preserves this contract while still letting every other command move at GKG's pace.
We considered three options for the REST surface:
- Keep REST fully structured. Stable, but every new command requires a new Rails endpoint and a Rails MR.
- Make the structured REST endpoints dynamic. Single entry point, fully driven by the GKG registry. Rejected because the GitLab UI and any community dashboards rely on stable schemas at /orbit/{query,schema,graph_status}.
- Mirror the MCP surface under /orbit/agent/* and keep the structured endpoints stable. Chosen.
/orbit/agent/commands and /orbit/agent/commands/:name give agents the same dynamic catalog the MCP surface gives them, and hidden: true documents that this namespace is agent-only and free to evolve. The structured endpoints remain the contract for the UI and for hand-written clients.
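For a hand-written client, the two agent endpoints might be addressed as in the Python sketch below, which only builds the requests and does not send them. The host name, the PRIVATE-TOKEN authentication header, and the {"parameters": ...} request body are assumptions; the ADR does not specify how these endpoints authenticate or how the body is shaped.

```python
import json
import urllib.parse
import urllib.request
from typing import Any

AGENT_COMMANDS_PATH = "/api/v4/orbit/agent/commands"


def list_commands_request(base_url: str, token: str) -> urllib.request.Request:
    """Build (but do not send) the discovery request."""
    return urllib.request.Request(
        base_url.rstrip("/") + AGENT_COMMANDS_PATH,
        headers={"PRIVATE-TOKEN": token},  # assumed auth scheme
        method="GET",
    )


def invoke_command_request(
    base_url: str, token: str, name: str, parameters: dict[str, Any]
) -> urllib.request.Request:
    """Build (but do not send) an invocation request for one command."""
    url = base_url.rstrip("/") + AGENT_COMMANDS_PATH + "/" + urllib.parse.quote(name, safe="")
    body = json.dumps({"parameters": parameters}).encode("utf-8")  # assumed body shape
    return urllib.request.Request(
        url,
        data=body,
        headers={"PRIVATE-TOKEN": token, "Content-Type": "application/json"},
        method="POST",
    )


# Placeholder host and token; nothing is sent over the network here.
request = invoke_command_request(
    "https://gitlab.example.com", "placeholder-token", "get_query_dsl", {"format": "raw"}
)
```

Because the command name is a path segment, it is percent-encoded before being appended to the URL.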
The agent command surface ships as three coordinated MRs:
| Repository | MR | Scope |
|---|---|---|
| gitlab-org/orbit/knowledge-graph | !1252 | ListAgentCommands, InvokeAgentCommand, GetQueryDsl, GetResponseFormat RPCs; CommandRegistry; ExecutorError::InterceptedCommand; ToolService::resolve_command |
| gitlab-org/gitlab | !234925 | MCP list_commands / invoke_command handlers; query command interception; GET /orbit/agent/commands, POST /orbit/agent/commands/:name; gRPC client methods |
| gitlab-org/modelops/.../ai-assist | !5446 | Orbit agent toolset switches to orbit_list_commands / orbit_invoke_command; prompt rewritten around discovery-first flow |
Order of merge:
- GKG first, so the gRPC contract exists.
- Rails second, so the MCP and REST surface go live.
- ai-assist last, so the agent prompt switches to the new surface only after Rails is shipping it.
The new MCP tool list (list_commands, invoke_command) lands in Rails atomically with the new gRPC client methods. Older Duo and Agentic Chat clients that talk to MCP keep working through the same tools/call endpoint — they just see a different tool list.
Rails gates the MCP tool list behind a feature flag (orbit_mcp_command_tools). The flag controls which tools tools/list returns:
| Flag state | MCP tools/list returns |
|---|---|
| Off (default) | Legacy tools: query_graph, get_graph_schema |
| On | New surface: list_commands, invoke_command |
The switch is atomic — an agent session sees one surface or the other, never both. Mixing legacy tools with the new command surface in the same session would confuse agents: they would see both query_graph as a top-level tool and as a command inside invoke_command, leading to unpredictable tool selection.
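The either-or behavior of the flag can be expressed as a single branch. This Python sketch is only a model of the table above; the function and constant names are invented, and the real gate lives in the Rails MCP handler.

```python
LEGACY_TOOLS = ("query_graph", "get_graph_schema")
COMMAND_TOOLS = ("list_commands", "invoke_command")


def mcp_tools_list(command_tools_enabled: bool) -> tuple[str, ...]:
    """Return exactly one surface; the two tool lists are never merged."""
    return COMMAND_TOOLS if command_tools_enabled else LEGACY_TOOLS
```

Returning one tuple or the other, rather than building a list incrementally, makes it structurally impossible for a response to contain both surfaces.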
The structured REST endpoints (/orbit/query, /schema, /graph_status, /tools) and the new agent REST endpoints (/orbit/agent/commands, /orbit/agent/commands/:name) are always available regardless of flag state. The flag only affects MCP tool discovery.
Once the flag is fully rolled out and the new surface is stable, the legacy MCP tool registrations can be removed in a follow-up cleanup MR.
- GKG-paced iteration. New commands land with a single GKG MR. No Rails or ai-assist change is required as long as the command does not need Rails-side context.
- Discovery for token-constrained agents. Coding agents that truncate tool descriptions can still discover the full DSL and response shape via get_query_dsl and get_response_format.
- Authoritative response shape. External agents can pin against RAW_OUTPUT_FORMAT_VERSION and write iteration code that survives schema bumps.
- One mental model. Agents see the same list_commands / invoke_command flow regardless of whether the command runs in GKG, Rails, or Workhorse.
- Three-repo coordination for the initial change. The first rollout touches GKG, Rails, and ai-assist together. After that, the registry can grow in GKG alone.
- GKG validates command parameters. Because invoke_command accepts a generic JSON object, GKG must JSON-Schema-validate parameters against the registered command schema before executing. This shifts validation responsibility from Grape (which would have validated structured parameters per endpoint) into the GKG command executor.
- Feature flag for clean switchover. Rails uses orbit_mcp_command_tools to swap the MCP tool list atomically between the legacy tools and the new list_commands / invoke_command pair. Both code paths must coexist until the flag is fully rolled out.
- Two registries during transition. The MR keeps ToolRegistry (the MCP tools/list source) and adds CommandRegistry (the lazy-mcp catalog). ToolRegistry is now restricted to list_commands and invoke_command plus the structured tools the existing UI still uses. We can collapse them in a follow-up once the legacy MCP tools are removed, but that is out of scope for this ADR.
- REST agent surface is dynamic. Hand-written clients that hit /orbit/agent/commands/:name are subject to schema changes without a Rails-side deprecation cycle. We accept this because the structured endpoints stay stable for non-agent consumers, and the hidden: true flag plus the /agent/ prefix signal the contract.
- Rails-intercepted commands stay opaque to GKG. GKG cannot tell from the executor that query_graph ran successfully, because Workhorse handles the response. Cross-cutting metrics (e.g. command-level latency histograms) need to be instrumented in Rails separately from the GKG-side metrics for the other commands.
- One extra round-trip for first-time discovery. Agents pay one list_commands call per session before they can compose queries. The lazy-mcp pattern accepts this cost in exchange for fitting under MCP description budgets.
A single MCP tool that takes { include: ["schema", "dsl", "response_format"] } and returns a top-level JSON object keyed by the requested capabilities. Considered and rejected during the design sync because:
- Smaller models cannot reliably map per-capability parameters (e.g. expand_nodes for schema, format for DSL) onto a flat union. Per-command JSON Schemas type-check far better.
- It does not handle query_graph, which needs its own parameter shape anyway. We would still end up with multiple top-level tools.
Auto-generate a REST endpoint per command, so REST and MCP stay one-to-one. Rejected because every new endpoint would need a fresh Rails route, Grape declaration, and would touch the API team's review queue. That defeats the goal of decoupling release cadence from Rails.
A pure hypermedia-style API, where every response embeds links to next actions. Conceptually elegant but harder for current LLM agents to navigate compared to a flat command list with typed parameter schemas. We may revisit this as agent capabilities improve.
- GKG Tool Design Sync 2026-05-06 transcript
- GKG MR !1252: agent command RPCs and registry
- Rails MR !234925: MCP/REST surface and command interceptor
- ai-assist MR !5446: agent prompt and command discovery
- lazy-mcp project
- ADR 003: API design
- ADR 004: Unified response schema
- ADR 010: Graph status endpoint
- Duo / Orbit prompt routing architecture — when Duo agents are routed through Rails to the MCP surface defined here