
Commit e6fc325
Merge pull request #69 from devlux76/copilot/p1-subtask-issues
2 parents: 619374d + a58ddbc

20 files changed: 2711 additions & 177 deletions

DESIGN.md

Lines changed: 14 additions & 6 deletions
````diff
@@ -442,7 +442,10 @@ interface Page {
 ```
 
 #### Book
-Ordered sequence of pages with representative medoid.
+Ordered sequence of pages from a **single ingest call** with a representative medoid.
+One `ingestText()` call always produces exactly one Book — the entire ingested document.
+A collection of Books forms a Volume; a collection of Volumes forms a Shelf.
+Books are identified by `SHA-256(sorted pageIds)` so their identity is content-addressed.
 
 ```typescript
 interface Book {
````
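The content-addressed Book identity described in this hunk can be sketched in a few lines. This is an illustrative sketch only: `deriveBookId` and the newline join separator are assumptions of mine, since the design text specifies only `SHA-256(sorted pageIds)` and not the exact canonicalisation.

```typescript
import { createHash } from "node:crypto";

// Sketch: a content-addressed Book identity derived from its page hashes.
// Sorting first makes the id independent of page insertion order.
// (Function name and "\n" separator are hypothetical, not from the repo.)
function deriveBookId(pageIds: string[]): string {
  const canonical = [...pageIds].sort().join("\n");
  return createHash("sha256").update(canonical, "utf8").digest("hex");
}
```

The useful property is that the same set of pages always yields the same `bookId`, regardless of the order they were supplied in.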
```diff
@@ -630,14 +633,19 @@ Rather than returning nearest neighbors by similarity, Cortex traces a coherent
 2. **Generate Embeddings** — Batch embed with selected provider
 3. **Persist Vectors** — Append to OPFS vector file
 4. **Persist Pages** — Write page metadata to IndexedDB; initialise `PageActivity` record
-5. **Build/Attach Hierarchy** — Construct/update books, volumes, shelves; attempt hotpath admission for each level's medoid/prototype using tier quota via `SalienceEngine`
-6. **Fast Semantic Neighbor Insert** — Update semantic neighbor graph incrementally; bounded degree via `HotpathPolicy`; check new page for hotpath admission
+5. **Create Ingest Book** — Build exactly one Book for the entire ingest: compute the medoid page (minimum total cosine distance to all other pages in the document), derive `bookId = SHA-256(sorted pageIds)`, persist. Hotpath admission for the book runs via `SalienceEngine`. Volumes and Shelves are assembled lazily by the Daydreamer from accumulated Books.
+6. **Fast Semantic Neighbor Insert** — Update semantic neighbor graph incrementally; bounded degree via `HotpathPolicy`; check new pages for hotpath admission
 7. **Mark Dirty** — Flag volumes for full recalc by Daydreamer
 
-**Incremental Strategy:**
-Fast local semantic neighbor insertion keeps ingest-time latency low. At ingest time, only the initial forward and reverse edges are created — neighbors are selected by cosine similarity within Williams-cutoff **distance** (not a fixed K; the cutoff is derived from `HotpathPolicy`). On degree overflow, the lowest-cosine-similarity neighbor is evicted.
+**Incremental Strategy (fast and lightweight):**
+Ingest must remain fast and lightweight. At ingest time only two classes of edges are created:
+- **Document-order adjacency** — Forward and reverse `SemanticNeighbor` edges between each consecutive page pair within the book slice, inserted unconditionally (document-adjacent chunks are always related). This uses a pre-built `Map<pageId, embedding>` for O(1) lookups; no O(n²) index scans.
+- **Proximity edges** — Additional `SemanticNeighbor` edges to nearby pages already in the corpus, bounded by cosine-distance cutoff and `maxDegree` eviction.
 
-Full cross-edge reconnection is intentionally deferred: Daydreamer walks the graph during idle passes to build additional edges, strengthening or pruning connections via LTP/LTD. This avoids a full graph recalculation on every insert while still converging to a well-connected graph over time. Hotpath admission runs at ingest time for new pages and hierarchy prototypes.
+Full cross-edge reconnection is intentionally deferred: Daydreamer walks the graph during idle passes to build additional edges — connections we never noticed at ingest time — and strengthens or prunes them via LTP/LTD. This keeps ingest cost sublinear while converging to a well-connected graph over time.
+
+**IndexedDB Schema Upgrade Strategy:**
+During early development (pre-v1.0) the schema upgrade path intentionally drops and recreates object stores rather than migrating data. This keeps upgrade code minimal and avoids cruft until the data model stabilises. The neighbor graph is rebuilt from scratch after any ingest replay.
 
 ## Consolidation Design
```

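The document-order adjacency step in DESIGN.md (unconditional forward and reverse edges between consecutive pages, with a pre-built embedding `Map` for O(1) lookups) could look roughly like this. It is a hypothetical sketch with assumed names (`linkDocumentOrder`, a plain `Map`-backed graph), not the repo's actual ingest code.

```typescript
interface SemanticNeighbor {
  neighborPageId: string;
  cosineSimilarity: number;
  distance: number; // 1 - cosineSimilarity
}

function cosine(a: Float32Array, b: Float32Array): number {
  let dot = 0, na = 0, nb = 0;
  const len = Math.min(a.length, b.length);
  for (let i = 0; i < len; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Unconditionally link each consecutive page pair, forward and reverse.
// The embeddings Map gives O(1) lookups; no index scan is needed.
function linkDocumentOrder(
  pageIds: string[],
  embeddings: Map<string, Float32Array>,
  graph: Map<string, SemanticNeighbor[]>,
): void {
  for (let i = 0; i + 1 < pageIds.length; i++) {
    const a = pageIds[i];
    const b = pageIds[i + 1];
    const sim = cosine(embeddings.get(a)!, embeddings.get(b)!);
    for (const [from, to] of [[a, b], [b, a]] as const) {
      const list = graph.get(from) ?? [];
      list.push({ neighborPageId: to, cosineSimilarity: sim, distance: 1 - sim });
      graph.set(from, list);
    }
  }
}
```

Interior pages end up with two document-order edges (previous and next); the first and last page get one each.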
core/types.ts

Lines changed: 2 additions & 0 deletions
```diff
@@ -67,12 +67,14 @@ export interface Edge {
 // Semantic nearest-neighbor graph
 // ---------------------------------------------------------------------------
 
+/** A single directed proximity edge in the sparse semantic neighbor graph. */
 export interface SemanticNeighbor {
   neighborPageId: Hash;
   cosineSimilarity: number; // threshold is defined by runtime policy
   distance: number; // 1 - cosineSimilarity (ready for TSP)
 }
 
+/** Induced subgraph returned by BFS expansion of the semantic neighbor graph. */
 export interface SemanticNeighborSubgraph {
   nodes: Hash[];
   edges: { from: Hash; to: Hash; distance: number }[];
```

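The `SemanticNeighborSubgraph` doc comment says it is produced by BFS expansion of the neighbor graph. The expansion itself is not part of this diff, so the following is a hypothetical sketch under assumed semantics: a depth-bounded BFS that collects visited nodes and the edges traversed (function name and depth parameter are mine).

```typescript
type Hash = string;

interface SemanticNeighbor {
  neighborPageId: Hash;
  cosineSimilarity: number;
  distance: number;
}

interface SemanticNeighborSubgraph {
  nodes: Hash[];
  edges: { from: Hash; to: Hash; distance: number }[];
}

// Hypothetical depth-bounded BFS over the sparse neighbor graph.
function expandSubgraph(
  graph: Map<Hash, SemanticNeighbor[]>,
  start: Hash,
  maxDepth: number,
): SemanticNeighborSubgraph {
  const nodes: Hash[] = [];
  const edges: { from: Hash; to: Hash; distance: number }[] = [];
  const seen = new Set<Hash>([start]);
  let frontier: Hash[] = [start];

  for (let depth = 0; depth <= maxDepth; depth++) {
    const next: Hash[] = [];
    for (const node of frontier) {
      nodes.push(node);
      if (depth === maxDepth) continue; // do not expand past the depth bound
      for (const n of graph.get(node) ?? []) {
        edges.push({ from: node, to: n.neighborPageId, distance: n.distance });
        if (!seen.has(n.neighborPageId)) {
          seen.add(n.neighborPageId);
          next.push(n.neighborPageId);
        }
      }
    }
    frontier = next;
  }
  return { nodes, edges };
}
```

The returned `nodes`/`edges` shape matches `SemanticNeighborSubgraph`, which is exactly what `solveOpenTSP` later consumes.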
cortex/KnowledgeGapDetector.ts

Lines changed: 66 additions & 0 deletions
```typescript
import type { Hash } from "../core/types";
import type { ModelProfile } from "../core/ModelProfile";
import { hashText } from "../core/crypto/hash";
import type { Metroid } from "./MetroidBuilder";

export interface KnowledgeGap {
  queryText: string;
  queryEmbedding: Float32Array;
  knowledgeBoundary: Hash | null;
  detectedAt: string;
}

export interface CuriosityProbe {
  probeId: Hash;
  queryText: string;
  queryEmbedding: Float32Array;
  knowledgeBoundary: Hash | null;
  mimeType: string;
  modelUrn: string;
  createdAt: string;
}

/**
 * Returns a KnowledgeGap when the metroid signals that m2 could not be found
 * (i.e. the engine has no antithesis for this query). Returns null when the
 * metroid is complete and no gap was detected.
 */
export async function detectKnowledgeGap(
  queryText: string,
  queryEmbedding: Float32Array,
  metroid: Metroid,
  // eslint-disable-next-line @typescript-eslint/no-unused-vars -- reserved for future model-aware gap categorisation
  _modelProfile: ModelProfile,
): Promise<KnowledgeGap | null> {
  if (!metroid.knowledgeGap) return null;

  return {
    queryText,
    queryEmbedding,
    knowledgeBoundary: metroid.m1 !== "" ? metroid.m1 : null,
    detectedAt: new Date().toISOString(),
  };
}

/**
 * Builds a serialisable CuriosityProbe from a detected KnowledgeGap.
 * The probeId is the SHA-256 of (queryText + detectedAt) so it is
 * deterministic for the same gap inputs.
 */
export async function buildCuriosityProbe(
  gap: KnowledgeGap,
  modelProfile: ModelProfile,
  mimeType = "text/plain",
): Promise<CuriosityProbe> {
  const probeId = await hashText(gap.queryText + gap.detectedAt);

  return {
    probeId,
    queryText: gap.queryText,
    queryEmbedding: gap.queryEmbedding,
    knowledgeBoundary: gap.knowledgeBoundary,
    mimeType,
    modelUrn: `urn:model:${modelProfile.modelId}`,
    createdAt: new Date().toISOString(),
  };
}
```
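The determinism property the `buildCuriosityProbe` doc comment claims (same gap inputs, same `probeId`) can be demonstrated standalone. The sketch below uses a synchronous Node hash as a stand-in for the repo's async `hashText` helper; `probeIdFor` and the stand-in are hypothetical names.

```typescript
import { createHash } from "node:crypto";

// Synchronous stand-in for the repo's hashText (SHA-256 hex digest).
function hashTextSync(text: string): string {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

// probeId = SHA-256(queryText + detectedAt): the same gap always maps to
// the same probe id, so retries cannot enqueue duplicate probes.
function probeIdFor(gap: { queryText: string; detectedAt: string }): string {
  return hashTextSync(gap.queryText + gap.detectedAt);
}
```

Note that determinism holds only because `detectedAt` is part of the stored gap; re-detecting the same query at a later time intentionally yields a new probe.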

cortex/MetroidBuilder.ts

Lines changed: 217 additions & 0 deletions
```typescript
import type { Hash, VectorStore } from "../core/types";
import type { ModelProfile } from "../core/ModelProfile";

export interface Metroid {
  m1: Hash;
  m2: Hash | null;
  c: Float32Array | null;
  knowledgeGap: boolean;
}

export interface MetroidBuilderOptions {
  modelProfile: ModelProfile;
  vectorStore: VectorStore;
}

/** Standard Matryoshka tier sizes in ascending order. */
const MATRYOSHKA_TIERS = [32, 64, 128, 256, 512, 768, 1024, 2048] as const;

function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  let dotProduct = 0;
  let normA = 0;
  let normB = 0;
  const len = Math.min(a.length, b.length);
  for (let i = 0; i < len; i++) {
    dotProduct += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}

function cosineDistance(a: Float32Array, b: Float32Array): number {
  return 1 - cosineSimilarity(a, b);
}

/**
 * Returns the index of the medoid: the element that minimises total cosine
 * distance to every other element in the set.
 */
function findMedoidIndex(embeddings: Float32Array[]): number {
  if (embeddings.length === 1) return 0;

  let bestIdx = 0;
  let bestTotal = Infinity;

  for (let i = 0; i < embeddings.length; i++) {
    let total = 0;
    for (let j = 0; j < embeddings.length; j++) {
      if (i !== j) {
        total += cosineDistance(embeddings[i], embeddings[j]);
      }
    }
    if (total < bestTotal) {
      bestTotal = total;
      bestIdx = i;
    }
  }

  return bestIdx;
}

interface CandidateEntry {
  pageId: Hash;
  embeddingOffset: number;
  embeddingDim: number;
}

interface CandidateWithEmbedding extends CandidateEntry {
  embedding: Float32Array;
}

/**
 * Searches for m2 among `others` (candidates excluding m1) using the free
 * dimensions starting at `protectedDim`.
 *
 * Returns the selected medoid candidate or `null` if no valid opposite set
 * can be assembled.
 */
function searchM2(
  others: CandidateWithEmbedding[],
  m1Embedding: Float32Array,
  protectedDim: number,
): CandidateWithEmbedding | null {
  if (others.length === 0) return null;

  const m1Free = m1Embedding.slice(protectedDim);

  const scored = others.map((c) => {
    const free = c.embedding.slice(protectedDim);
    return { candidate: c, score: -cosineSimilarity(free, m1Free) };
  });

  // Prefer candidates that are genuinely opposite (score >= 0).
  let oppositeSet = scored.filter((s) => s.score >= 0);

  // Fall back to the top 50% when the genuine-opposite set is too small.
  if (oppositeSet.length < 2) {
    const byScore = [...scored].sort((a, b) => b.score - a.score);
    const topHalf = Math.max(1, Math.ceil(byScore.length / 2));
    oppositeSet = byScore.slice(0, topHalf);
  }

  if (oppositeSet.length === 0) return null;

  const medoidIdx = findMedoidIndex(oppositeSet.map((s) => s.candidate.embedding.slice(protectedDim)));
  return oppositeSet[medoidIdx].candidate;
}

/**
 * Builds the dialectical probe (Metroid) for a given query embedding and a
 * ranked list of candidate memory nodes.
 *
 * Step overview
 * 1. Select m1 (thesis): the candidate with highest cosine similarity to the query.
 * 2. Select m2 (antithesis): the medoid of the cosine-opposite set in free dims.
 *    Uses Matryoshka dimensional unwinding when the initial tier yields no m2.
 * 3. Compute centroid c (synthesis): protected dims copied from m1, free dims
 *    averaged between m1 and m2.
 */
export async function buildMetroid(
  queryEmbedding: Float32Array,
  candidateMedoids: Array<{ pageId: Hash; embeddingOffset: number; embeddingDim: number }>,
  options: MetroidBuilderOptions,
): Promise<Metroid> {
  const { modelProfile, vectorStore } = options;

  if (candidateMedoids.length === 0) {
    return { m1: "", m2: null, c: null, knowledgeGap: true };
  }

  // Load all candidate embeddings in one pass.
  const candidates: CandidateWithEmbedding[] = await Promise.all(
    candidateMedoids.map(async (cand) => ({
      ...cand,
      embedding: await vectorStore.readVector(cand.embeddingOffset, cand.embeddingDim),
    })),
  );

  // Select m1: highest cosine similarity to the query.
  let m1Candidate = candidates[0];
  let m1Score = cosineSimilarity(queryEmbedding, candidates[0].embedding);

  for (let i = 1; i < candidates.length; i++) {
    const score = cosineSimilarity(queryEmbedding, candidates[i].embedding);
    if (score > m1Score) {
      m1Score = score;
      m1Candidate = candidates[i];
    }
  }

  const protectedDim = modelProfile.matryoshkaProtectedDim;

  if (protectedDim === undefined) {
    // Non-Matryoshka model: antithesis search is impossible.
    return { m1: m1Candidate.pageId, m2: null, c: null, knowledgeGap: true };
  }

  const others = candidates.filter((c) => c.pageId !== m1Candidate.pageId);

  // --- Matryoshka dimensional unwinding ---
  // Start at modelProfile.matryoshkaProtectedDim. If m2 not found, progressively
  // shrink the protected boundary (expand the free-dimension search region).

  const startingTierIndex = MATRYOSHKA_TIERS.indexOf(
    protectedDim as (typeof MATRYOSHKA_TIERS)[number],
  );

  // Build the list of tier boundaries to attempt, from the configured value
  // down to the smallest tier (expanding the free region at each step).
  const tierBoundaries: number[] = [];
  if (startingTierIndex !== -1) {
    for (let i = startingTierIndex; i >= 0; i--) {
      tierBoundaries.push(MATRYOSHKA_TIERS[i]);
    }
  } else {
    // protectedDim is not a standard tier; try it as-is plus any smaller standard tiers.
    tierBoundaries.push(protectedDim);
    for (const t of [...MATRYOSHKA_TIERS].reverse()) {
      if (t < protectedDim) tierBoundaries.push(t);
    }
  }

  let m2Candidate: CandidateWithEmbedding | null = null;
  let usedProtectedDim = protectedDim;

  for (const tierBoundary of tierBoundaries) {
    const found = searchM2(others, m1Candidate.embedding, tierBoundary);
    if (found !== null) {
      m2Candidate = found;
      usedProtectedDim = tierBoundary;
      break;
    }
  }

  if (m2Candidate === null) {
    return { m1: m1Candidate.pageId, m2: null, c: null, knowledgeGap: true };
  }

  // Compute frozen synthesis centroid c.
  const fullDim = m1Candidate.embedding.length;
  const c = new Float32Array(fullDim);

  for (let i = 0; i < usedProtectedDim; i++) {
    c[i] = m1Candidate.embedding[i];
  }
  for (let i = usedProtectedDim; i < fullDim; i++) {
    c[i] = (m1Candidate.embedding[i] + m2Candidate.embedding[i]) / 2;
  }

  return {
    m1: m1Candidate.pageId,
    m2: m2Candidate.pageId,
    c,
    knowledgeGap: false,
  };
}
```
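The frozen-synthesis centroid at the end of `buildMetroid` (protected dims copied from m1, free dims averaged between m1 and m2) can be exercised in isolation. The sketch below extracts just those two loops into a standalone helper; `synthesisCentroid` is an assumed name, not a repo export.

```typescript
// Frozen-synthesis centroid: copy the protected prefix from the thesis m1,
// average the free suffix between m1 and the antithesis m2.
function synthesisCentroid(
  m1: Float32Array,
  m2: Float32Array,
  protectedDim: number,
): Float32Array {
  const c = new Float32Array(m1.length);
  for (let i = 0; i < protectedDim; i++) {
    c[i] = m1[i];
  }
  for (let i = protectedDim; i < m1.length; i++) {
    c[i] = (m1[i] + m2[i]) / 2;
  }
  return c;
}
```

For m1 = [1, 1, 1, 1], m2 = [0, 0, 3, 3] and a protected boundary of 2, the centroid is [1, 1, 2, 2]: the first two dims stay anchored to m1 while the free dims land midway between thesis and antithesis.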

cortex/OpenTSPSolver.ts

Lines changed: 62 additions & 0 deletions
```typescript
import type { Hash, SemanticNeighborSubgraph } from "../core/types";

/**
 * Greedy nearest-neighbor open-path TSP heuristic.
 *
 * Visits every node in the subgraph exactly once, starting from the
 * lexicographically smallest node ID for determinism. At each step the
 * algorithm advances to the unvisited node nearest to the current one
 * (using edge distance). Ties are broken lexicographically. Missing edges
 * are treated as having distance Infinity.
 */
export function solveOpenTSP(subgraph: SemanticNeighborSubgraph): Hash[] {
  const { nodes, edges } = subgraph;
  if (nodes.length === 0) return [];

  // Build undirected adjacency map: node → (neighbor → distance).
  const adj = new Map<Hash, Map<Hash, number>>();
  for (const node of nodes) {
    adj.set(node, new Map());
  }
  for (const edge of edges) {
    const fromMap = adj.get(edge.from);
    const toMap = adj.get(edge.to);
    if (fromMap !== undefined) fromMap.set(edge.to, edge.distance);
    if (toMap !== undefined) toMap.set(edge.from, edge.distance);
  }

  // Pre-sort once so lexicographic tiebreaking is O(1) per step.
  const sorted = [...nodes].sort();

  const visited = new Set<Hash>();
  const path: Hash[] = [];
  let current = sorted[0];

  while (path.length < nodes.length) {
    visited.add(current);
    path.push(current);

    if (path.length === nodes.length) break;

    const neighbors = adj.get(current)!;
    let bestNode: Hash | undefined;
    let bestDist = Infinity;

    for (const node of sorted) {
      if (visited.has(node)) continue;
      const dist = neighbors.get(node) ?? Infinity;
      if (
        dist < bestDist ||
        (dist === bestDist && (bestNode === undefined || node < bestNode))
      ) {
        bestDist = dist;
        bestNode = node;
      }
    }

    // bestNode is always defined here because at least one unvisited node remains.
    current = bestNode!;
  }

  return path;
}
```
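A small worked example of the greedy open-path heuristic. Because `solveOpenTSP` imports repo-local types, the sketch below inlines a minimal re-implementation of the same algorithm (assumed name `greedyPath`) so it runs standalone; the logic mirrors the exported function.

```typescript
type Hash = string;

interface Subgraph {
  nodes: Hash[];
  edges: { from: Hash; to: Hash; distance: number }[];
}

// Minimal re-sketch of the greedy nearest-neighbor open path:
// start at the lexicographically smallest node, repeatedly hop to the
// nearest unvisited node; missing edges count as Infinity.
function greedyPath(g: Subgraph): Hash[] {
  const adj = new Map<Hash, Map<Hash, number>>();
  for (const n of g.nodes) adj.set(n, new Map());
  for (const e of g.edges) {
    adj.get(e.from)?.set(e.to, e.distance);
    adj.get(e.to)?.set(e.from, e.distance);
  }
  const sorted = [...g.nodes].sort();
  const visited = new Set<Hash>();
  const path: Hash[] = [];
  let current = sorted[0];
  while (path.length < g.nodes.length) {
    visited.add(current);
    path.push(current);
    if (path.length === g.nodes.length) break;
    let best: Hash | undefined;
    let bestDist = Infinity;
    for (const n of sorted) {
      if (visited.has(n)) continue;
      const d = adj.get(current)!.get(n) ?? Infinity;
      if (d < bestDist || (d === bestDist && (best === undefined || n < best))) {
        bestDist = d;
        best = n;
      }
    }
    current = best!;
  }
  return path;
}
```

On a triangle with distances a-b = 0.1, b-c = 0.2, a-c = 0.9, the walk starts at "a", takes the cheap hop to "b", then on to "c", skipping the expensive direct a-c edge entirely.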
