Skip to content

Latest commit

 

History

History
55 lines (35 loc) · 6.77 KB

File metadata and controls

55 lines (35 loc) · 6.77 KB

Cloud-Native KB Ingestion — Overview

Status — Phase 3B operational substrate. This guide describes the cloud-native Knowledge Base ingestion substrate (Epic #11624): tenant identity, ingestion contracts, server-side stamping, read-side filtering, and the Phase 2 push/bulk facades.

What "cloud-native KB ingestion" means

Neo's Knowledge Base (KB) is a Chroma-backed vector index over source content — guides, ADRs, skills, API docs, tickets, PRs, test specs. In a single-repo deployment, the KB indexes exactly one repository: Neo's own.

A cloud-native deployment runs Agent OS as a service for external client workspaces. Each client workspace is a distinct tenant with its own source content (its own repos, possibly in languages and layouts unlike Neo's). The cloud deployment must let every tenant ingest its own content into the KB while continuing to serve Neo's curated content (guides, ADRs, skills) to all tenants.

The substrate that makes this possible is the per-tenant ingestion contract: every chunk in the KB carries an authoritative identity tuple, content is server-stamped at write time, and reads are tenant-scoped. No tenant can read another tenant's private content; every tenant can read Neo's team-visible curated corpus.

The contract split (Phase 0/1)

Phase 0/1 defines the data contracts cloud ingestion is built on. PRs #11647, #11659, #11661, and #11662 are merged:

Contract Ticket / PR What it provides
parsed-chunk-v1 schema #11629 / PR #11647 The ingest-side chunk shape every Source/Parser emits. Rejects records carrying an embedding field (that field belongs only to the restore-only backup-record-v1 sibling).
Path-identity tuple #11629 / PR #11647 {tenantId, repoSlug, rootKind, sourcePath} in chunk metadata — replaces the legacy single-neoRootDir-relative source string. Neo's curated content uses tenantId: 'neo-shared', repoSlug: 'neo'.
Deletion-signaling contract #11629 / PR #11647 Tombstone + manifest + revision-boundary mechanisms for propagating source deletions into the index.
SourceRegistry #11658 / PR #11659 Data-driven Source/Parser registration. useDefaultSources / useDefaultParsers gate Neo's curated sources; customSources / customParsers register tenant-supplied classes; rawRepoSource explicitly enables the built-in raw repo fallback.
Per-source path externalization #11660 / PR #11661 aiConfig.sourcePaths — each default Source class reads its input path from config, so a tenant whose layout differs from Neo's can reuse the curated Source classes.
Write-side tenant stamping #11631 / PR #11662 VectorService.embed server-stamps {tenantId, repoSlug, visibility, originAgentIdentity} from the authenticated ingestion context; client-supplied identity fields are overwritten or rejected. Tenant-aware Chroma IDs prevent byte-identical chunks from different tenants colliding.

The split was deliberate: contracts before transport. The schemas and registry stabilized first; the Phase 2 ingestion endpoints were built against that frozen contract surface.

Topology anchor — unified Chroma

Per ADR 0003 — Chroma Topology Unified Only, a cloud deployment runs one ChromaDB daemon hosting three collections — knowledge-base, neo-agent-memory, neo-agent-sessions — reached by two MCP servers (KB + Memory Core). Cloud ingestion writes only to the knowledge-base collection. There is no per-tenant Chroma instance: tenant isolation is enforced by the identity tuple + write-side stamping + read-side filter, not by physical separation.

Default-source inheritance

A zero-config Neo deployment behaves identically with or without the cloud-ingestion substrate: useDefaultSources defaults to true, rawRepoSource defaults to false, the SourceRegistry auto-registers Neo's 10 curated Source classes, and aiConfig.sourcePaths carries Neo's default layout. A cloud tenant opting out of Neo's curated content sets useDefaultSources: false; a tenant whose repo layout differs overrides only the sourcePaths keys it needs; a tenant with no known shape can opt into rawRepoSource as a day-0 fallback. Inheritance is the default; divergence is opt-in and granular.

Registry contract split — Source vs Parser

  • A Source locates and reads content from a territory (a directory tree, an external workspace) and emits knowledge chunks into the full-corpus build.
  • A Parser transforms a specific file format into chunk content (e.g. a .proto parser, an ES5-aware parser).

SourceRegistry holds both. Neo's curated Sources auto-register; tenant Sources/Parsers register declaratively via aiConfig.customSources / aiConfig.customParsers or programmatically via SourceRegistry.registerSource(...). The Phase 2 ingestion facades can invoke a registered Parser during an ingestion call; with no parser match, the raw-text fallback ingests the whole file as one chunk.

Relationship to Neo's curated content

Neo's own guides, ADRs, skills, and API docs remain in the KB under tenantId: 'neo-shared', visibility: 'team' — readable by every tenant. A cloud tenant's content is stamped with the tenant's own tenantId and a visibility of team (tenant-internal) or private. Read-side filtering (#11632) resolves each query against the requester's authenticated identity: a tenant sees its own content plus neo-shared, never another tenant's private chunks.

Where to go next

  • Security — tenant-isolation invariants, write-side stamping + spoof-rejection, parser-execution boundary framing, and the KB-as-cache vs MC-as-store recovery model.
  • Day-0 Tutorial - linear PoC walkthrough for a fresh operator or agent: remote MCP healthcheck, MC/KB connection, tenant ingestion, client-side parsing, bulk path, and backup/redeploy handoff.
  • Migration Path — how an existing single-repo Neo deployment upgrades to the cloud-ingestion substrate with zero config changes.
  • Tenant Ingestion Model — the operator-facing model for tenant repo identity, credential boundaries, parser dispatch, source-family inventory, and push-vs-bulk ingestion choices.
  • Configuration — the aiConfig keys and the KnowledgeBaseTenantConfig / kb-config.yaml tenant-config storage.
  • Custom Sources / Custom Parsers — authoring a Source for the full-corpus build, or a Parser for the push path.
  • Hook Wiring — the ingest_source_files / ai:ingest-tenant facades and git-hook patterns; runnable companions live under examples/cloud-deployment/.