Status: Phase 3B. This guide documents how a deployment teaches the Knowledge Base to index a new content territory by authoring a custom Source class (the Phase 0/1B SourceRegistry substrate of Epic #11624). A runnable custom Source ships alongside this guide under ai/examples/cloud-deployment/.
The KB ingestion substrate splits content acquisition into two roles (see Overview):
- A Source locates and reads content from a territory it can see on disk (a directory tree, a co-located repo) and feeds it into the full-corpus build (npm run ai:sync-kb).
- A Parser transforms one file format into chunk content for the push path (ingest_source_files, ai:ingest-tenant). See Custom Parsers.
Most cloud tenants need Custom Parsers, not Custom Sources. A tenant pushing its content to a remote KB server uses the push path — parsers + parsed-chunk-v1. Custom Sources matter when a deployment runs its own full-corpus build over a repo it has on disk — e.g. a deployment that co-locates a tenant repo and indexes it directly rather than receiving pushes. If you are wiring git hooks to push content, start with Hook Wiring and Custom Parsers.
A Source is a Neo class extending source/Base.mjs. It implements one abstract method:
async extract(writeStream, createHashFn) : Promise<Number>
extract traverses its territory, builds a chunk record per unit of content, writes each as one JSON line to writeStream, and returns the number of chunks written. The full-corpus build (DatabaseService.createKnowledgeBase) calls extract on every registered Source in turn.
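The calling pattern can be pictured with a small stand-in. The sketch below is not DatabaseService.createKnowledgeBase: buildCorpus, createMemoryStream, createToySource, and toyHash are hypothetical names, used only to illustrate that each extract call writes JSON lines to the shared stream and reports its own count.

```javascript
// Hypothetical stand-in for the full-corpus build loop; not Neo's real implementation.
async function buildCorpus(sources, writeStream, createHashFn) {
    let total = 0;
    // Call extract on every registered Source in turn and sum the reported counts.
    for (const source of sources) {
        total += await source.extract(writeStream, createHashFn);
    }
    return total;
}

// Minimal in-memory stream: only the write() method that extract relies on.
function createMemoryStream() {
    const lines = [];
    return {lines, write: line => lines.push(line)};
}

// Toy Source factory: emits one chunk per entry in `docs`.
function createToySource(type, docs) {
    return {
        async extract(writeStream, createHashFn) {
            let count = 0;
            for (const [name, content] of Object.entries(docs)) {
                const chunk = {type, kind: 'doc-section', name, content, source: `${type}/${name}.md`};
                chunk.hash = createHashFn(chunk);
                writeStream.write(JSON.stringify(chunk) + '\n');
                count++;
            }
            return count;
        }
    };
}

// Placeholder hash, an assumption for this sketch; the real createHashFn is supplied by the build.
const toyHash = chunk => `${chunk.source}:${chunk.content.length}`;

const stream       = createMemoryStream();
const totalPromise = buildCorpus(
    [
        createToySource('guide', {intro: 'Hello', setup: 'Install'}),
        createToySource('proto', {user: 'message User {}'})
    ],
    stream,
    toyHash
);
```

Because every Source appends to the same stream, the order in which Sources run determines the order of lines in the generated corpus.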
Each chunk record carries the indexable content plus the metadata the KB ranks on — the shape Neo's built-in Sources emit (see AdrSource for the canonical precedent):
| Field | Meaning |
|---|---|
| type | Coarse content category, e.g. proto, guide, src. |
| kind | Finer semantic category, e.g. schema, method, doc-section. |
| name | Human-readable chunk name. |
| content | The chunk text, embedded server-side. |
| source | The chunk's source path. |
Set chunk.hash = createHashFn(chunk) before writing each record. createHashFn is the content-hash function the build supplies; it folds the tenant identity (tenantId, repoSlug) into the hash automatically — a custom Source does not thread tenant fields itself (see Identity-tuple semantics below).
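One way to catch malformed records before they reach the stream is a small shape check. The helper below is a hypothetical sketch, not part of Neo: finalizeChunk and REQUIRED_FIELDS are illustrative names, the rule that every field must be a non-empty string is an assumption this guide does not state, and placeholderHash only stands in for the createHashFn the build supplies.

```javascript
// Field names taken from the chunk-record table; the helper itself is hypothetical.
const REQUIRED_FIELDS = ['type', 'kind', 'name', 'content', 'source'];

function finalizeChunk(chunk, createHashFn) {
    // Assumption: each field is a non-empty string.
    for (const field of REQUIRED_FIELDS) {
        if (typeof chunk[field] !== 'string' || chunk[field].length === 0) {
            throw new TypeError(`chunk.${field} must be a non-empty string`);
        }
    }
    // Hash last, so it is computed from the final field values.
    chunk.hash = createHashFn(chunk);
    // One JSON document per line, as extract writes it.
    return JSON.stringify(chunk) + '\n';
}

// Placeholder for illustration only; the build provides the real function.
const placeholderHash = chunk => `sketch-${chunk.source}`;

const line = finalizeChunk({
    type   : 'proto',
    kind   : 'schema',
    name   : 'user',
    content: 'message User {}',
    source : 'schemas/user.proto'
}, placeholderHash);
```

Inside extract, the returned line would be passed straight to writeStream.write.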
The extract path is the full-corpus build's chunk contract. It is distinct from the parsed-chunk-v1 push contract in Custom Parsers: parsed-chunk-v1 is the shape a tenant pushes through ingest_source_files / ai:ingest-tenant, whereas extract chunks are produced in-process by the full-corpus build. The two are separate ingestion paths into the same knowledge-base collection.
A minimal Source — modelled on the built-in AdrSource — that indexes a tenant's .proto schema files:
import Base from './Base.mjs';
import fs from 'fs-extra';
import path from 'path';
import aiConfig from '../../../mcp/server/knowledge-base/config.mjs';
/**
* @summary Extracts knowledge chunks from a tenant's `.proto` schema files.
* @class MyOrg.kb.source.ProtoSource
* @extends Neo.ai.services.knowledge-base.source.Base
* @singleton
*/
class ProtoSource extends Base {
    static config = {
        className: 'MyOrg.kb.source.ProtoSource',
        singleton: true
    }
    async extract(writeStream, createHashFn) {
        let count = 0;
        // Path resolves from aiConfig.sourcePaths; the config leaf owns the default.
        const dir = path.resolve(aiConfig.neoRootDir, aiConfig.sourcePaths.ProtoSource);
        if (await fs.pathExists(dir)) {
            for (const file of (await fs.readdir(dir)).sort()) {
                if (!file.endsWith('.proto')) continue;
                const filePath = path.join(dir, file);
                const chunk = {
                    type   : 'proto',
                    kind   : 'schema',
                    name   : path.basename(file, '.proto'),
                    content: (await fs.readFile(filePath, 'utf-8')).trim(),
                    source : path.relative(aiConfig.neoRootDir, filePath)
                };
                chunk.hash = createHashFn(chunk);
                writeStream.write(JSON.stringify(chunk) + '\n');
                count++;
            }
        }
        return count;
    }
}
export default Neo.setupClass(ProtoSource);

Sort the territory deterministically (.sort() above) so the generated corpus is byte-stable run-to-run.
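To see why the sort matters: directory listings are not guaranteed to come back in the same order on every platform or run, so two builds over identical files could otherwise emit the same chunks in a different order. A minimal sketch, using a hypothetical renderCorpus helper and made-up file names:

```javascript
// Render a listing into JSON lines, optionally sorting first (hypothetical helper).
function renderCorpus(listing, {sorted}) {
    const files = sorted ? [...listing].sort() : listing;
    return files
        .filter(file => file.endsWith('.proto'))
        .map(file => JSON.stringify({name: file.slice(0, -'.proto'.length)}) + '\n')
        .join('');
}

// The same directory as two different runs might list it.
const runA = ['user.proto', 'README.md', 'account.proto'];
const runB = ['account.proto', 'user.proto', 'README.md'];

const stable   = renderCorpus(runA, {sorted: true})  === renderCorpus(runB, {sorted: true});
const unstable = renderCorpus(runA, {sorted: false}) === renderCorpus(runB, {sorted: false});

console.log(stable, unstable); // true false
```

With the sort in place, both listings produce identical output; without it, the corpus differs even though the files are the same.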
A Source class is registered in the SourceRegistry singleton under a stable name:
- Declaratively: add it to aiConfig.customSources as {SourceClass, sourceName?} (loaded once at boot). sourceName defaults to the final segment of the class's className (ProtoSource above).
- Programmatically: call SourceRegistry.registerSource(ProtoSource, {sourceName: 'ProtoSource'}) at runtime; re-registering the same name overwrites (idempotent, useful for hot-reload).
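The naming and overwrite rules can be illustrated with a toy registry. This is a hypothetical sketch built on a Map, not the SourceRegistry implementation: ToyRegistry and the two plain-object stand-ins for Source classes are illustrative, and only the two behaviours named above are modelled (the default name comes from the final className segment, and re-registering a name overwrites).

```javascript
// Toy registry for illustration; SourceRegistry's real internals are not shown in this guide.
class ToyRegistry {
    sources = new Map();

    registerSource(SourceClass, {sourceName} = {}) {
        // Default: the final dot-separated segment of className.
        const name = sourceName ?? SourceClass.className.split('.').pop();
        // Map#set replaces an existing entry, so re-registering a name overwrites it.
        this.sources.set(name, SourceClass);
        return name;
    }
}

// Plain objects standing in for two revisions of the same Source class.
const ProtoSourceV1 = {className: 'MyOrg.kb.source.ProtoSource', version: 1};
const ProtoSourceV2 = {className: 'MyOrg.kb.source.ProtoSource', version: 2};

const registry    = new ToyRegistry();
const defaultName = registry.registerSource(ProtoSourceV1);

// Same name again, e.g. after a hot-reload: the newer class replaces the older one.
registry.registerSource(ProtoSourceV2, {sourceName: 'ProtoSource'});
```

After both calls the registry holds a single ProtoSource entry pointing at the second revision.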
aiConfig.useDefaultSources (default true) controls whether Neo's 10 curated Source classes are also registered. A deployment indexing only tenant content sets it false; the registry then contains only the tenant's custom Sources. See Configuration.
Use rawRepoSource: true when a tenant needs day-0 ingestion before its repository shape is known well enough to justify a custom Source. This registers RawRepoSource, which walks aiConfig.sourcePaths.RawRepoSource.root and emits one raw-text parsed chunk per included file. It is not part of Neo's 10 curated default Sources, so zero-config Neo syncs never walk the full repository implicitly. This is the full-corpus Source-build path (kbSync lane / npm run ai:sync-kb) — not pull mode: server-side pull-mode ingestion takes the whole git tree from the deployment mirror regardless of sourcePaths.
rawRepoSource: true,
useDefaultSources: false,
sourcePaths: {
    RawRepoSource: {
        root             : '.',
        includeExtensions: ['.md', '.js', '.json'],
        excludePaths     : ['.git', 'node_modules', 'dist', 'docs/output'],
        excludeExtensions: ['.png', '.jpg', '.pdf', '.woff2']
    }
}

Graduate from RawRepoSource to a custom Source when a tenant needs semantic chunking, generated-manifest boundaries, or source-specific metadata.
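How the include and exclude lists might interact is easiest to see in code. The predicate below is a hypothetical sketch, not RawRepoSource's actual logic: isIncluded is an illustrative name, and it assumes that excludes win over includes and that excludePaths entries match as leading path prefixes relative to root, neither of which this guide specifies.

```javascript
const rawRepoConfig = {
    includeExtensions: ['.md', '.js', '.json'],
    excludePaths     : ['.git', 'node_modules', 'dist', 'docs/output'],
    excludeExtensions: ['.png', '.jpg', '.pdf', '.woff2']
};

// Hypothetical predicate over a POSIX-style path relative to `root`.
function isIncluded(relPath, config) {
    // Extension: text from the last dot of the file name, or '' when there is none.
    const dot = relPath.lastIndexOf('.');
    const ext = dot > relPath.lastIndexOf('/') ? relPath.slice(dot) : '';

    // Assumption: an excludePaths entry matches itself and everything beneath it.
    const underExcludedPath = config.excludePaths.some(
        prefix => relPath === prefix || relPath.startsWith(prefix + '/')
    );

    // Assumption: excludes take precedence over includes.
    if (underExcludedPath || config.excludeExtensions.includes(ext)) return false;

    return config.includeExtensions.includes(ext);
}

const candidates = [
    'README.md',
    'src/app.js',
    'node_modules/lib/index.js',
    'docs/output/api.json',
    'assets/logo.png',
    'docs/guide.md'
];

const kept = candidates.filter(file => isIncluded(file, rawRepoConfig));
```

Under these assumptions, kept contains README.md, src/app.js, and docs/guide.md; the remaining paths fall under an excluded directory or carry an extension outside the include list.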
A Source should not hard-code its territory path. Resolve it from aiConfig.sourcePaths keyed by the Source's registry name — aiConfig.sourcePaths.ProtoSource — so a deployment whose layout differs overrides only that key. Defaults belong in config.template.mjs, not in consumer-side optional chains. Each Source interprets its own entry shape (a string, a string-array, or a path→type object — see the built-in Source defaults in config.template.mjs).
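The override behaviour and the three entry shapes can be sketched as follows. Everything here is illustrative: the default values, the GuideSource key, resolveSourcePaths, normalizeEntry, and the 'default' type label are assumptions, not Neo APIs or the contents of config.template.mjs.

```javascript
// Hypothetical defaults, standing in for what a config template might declare.
const templateDefaults = {
    sourcePaths: {
        ProtoSource: 'schemas/proto',
        GuideSource: ['docs/guides', 'docs/tutorials']
    }
};

// A deployment with a different layout overrides only its own key.
const deploymentOverrides = {
    sourcePaths: {
        ProtoSource: {'api/proto': 'public', 'internal/proto': 'private'}
    }
};

// Shallow merge per registry name: overridden keys win, all others keep their defaults.
function resolveSourcePaths(defaults, overrides) {
    return {...defaults.sourcePaths, ...overrides.sourcePaths};
}

// One possible way for a Source to normalize string, string-array,
// and path->type object entries into a list of {path, type} records.
function normalizeEntry(entry) {
    if (typeof entry === 'string') return [{path: entry, type: 'default'}];
    if (Array.isArray(entry))      return entry.map(path => ({path, type: 'default'}));
    return Object.entries(entry).map(([path, type]) => ({path, type}));
}

const sourcePaths = resolveSourcePaths(templateDefaults, deploymentOverrides);
const protoDirs   = normalizeEntry(sourcePaths.ProtoSource);
const guideDirs   = normalizeEntry(sourcePaths.GuideSource);
```

Because the merge is keyed by registry name, the override for ProtoSource leaves the GuideSource default untouched.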
Every KB chunk is owned by the path-identity tuple {tenantId, repoSlug, rootKind, sourcePath} (see identity-tuple.md). A custom Source does not set tenantId / repoSlug itself:
- createHashFn folds tenantId + repoSlug into the content hash automatically, so byte-identical content under two tenants produces distinct chunk ids.
- The write-side server stamp applies the authoritative tenant tuple at embed time; client/Source-supplied tenant fields are never authoritative.
Neo's own curated content resolves to tenantId: 'neo-shared', repoSlug: 'neo', rootKind: 'neo-workspace' — the team namespace visible to every tenant. A custom Source emits content; the substrate stamps the identity.
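The hash-folding behaviour described above can be demonstrated with a toy hash factory. This is a sketch under stated assumptions: makeCreateHashFn is a hypothetical name, the FNV-1a hash is a stand-in chosen to keep the example dependency-free, the tenant names are made up, and the fields and algorithm the real build hashes are not specified in this guide.

```javascript
// 32-bit FNV-1a over UTF-16 code units; a stand-in, not the build's real hash.
function fnv1a(text) {
    let hash = 0x811c9dc5;
    for (let i = 0; i < text.length; i++) {
        hash ^= text.charCodeAt(i);
        hash = Math.imul(hash, 0x01000193);
    }
    return (hash >>> 0).toString(16).padStart(8, '0');
}

// Hypothetical factory: binds the tenant identity once, so a Source never passes it.
function makeCreateHashFn({tenantId, repoSlug}) {
    return chunk => fnv1a(JSON.stringify([tenantId, repoSlug, chunk.source, chunk.content]));
}

const chunk = {
    type   : 'proto',
    kind   : 'schema',
    name   : 'user',
    content: 'message User {}',
    source : 'schemas/user.proto'
};

// The same chunk hashed under two made-up tenants, then again under the first.
const hashForAcme   = makeCreateHashFn({tenantId: 'acme', repoSlug: 'acme-api'})(chunk);
const hashForGlobex = makeCreateHashFn({tenantId: 'globex', repoSlug: 'globex-api'})(chunk);
const hashAgain     = makeCreateHashFn({tenantId: 'acme', repoSlug: 'acme-api'})(chunk);
```

The Source-side code is identical in all three cases: it only calls the function it was handed, which is why a custom Source has no reason to carry tenant fields of its own.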
- Overview: the Source/Parser registry split and the contract layering.
- Custom Parsers: the push-path counterpart; what most cloud tenants need.
- Hook Wiring: the ingest_source_files / ai:ingest-tenant push facades.
- Configuration: useDefaultSources, customSources, sourcePaths.
- source/Base.mjs: the abstract Source contract · identity-tuple.md: the chunk-identity tuple.
- #11658 Phase 0/1B Source/Parser registry · #11660 per-source path externalization.