Data Model Deep Dive

Memory Types

The system supports four memory types, each with different lifecycle behaviors:

Type	Behavior	Expires?	Decays?	Supersedes?
`event`	Append-only log entry	Yes (30d TTL)	No	No
`fact`	Upsertable knowledge	No	Yes	Yes (by `key`)
`decision`	Immutable record	No	No	No
`status`	Current state tracker	No	Yes	Yes (by `subject`)

event

Something that happened. Append-only -- events are never superseded. After 30 days (configurable via EVENT_TTL_DAYS), events with access_count=0 and importance medium or low are soft-deleted by the consolidation cleanup.

fact

Persistent knowledge that evolves over time. Facts should always include a key for clean lifecycle management. When a new fact is stored with a key matching an existing active fact:

The old fact is marked active: false
The old fact gets superseded_by: new_id and superseded_at: timestamp
The old fact's valid_to is set to the current time
The new fact gets supersedes: old_id

Facts without a key generate a console warning -- they pile up forever with no supersedes mechanism.

decision

A choice made and the reasoning behind it. Decisions are immutable and never expire, decay, or get superseded. They serve as an audit trail.

status

The current state of a system, workflow, or process. Statuses should always include a subject. When a new status is stored with a subject matching an existing active status, the same supersedes logic as facts applies.

Memory Payload Schema

Every memory stored in Qdrant has this payload structure:

{
  "text": "The memory content (scrubbed of credentials)",
  "type": "fact",
  "source_agent": "claude-code",
  "observed_by": ["claude-code"],
  "observation_count": 1,
  "client_id": "acme-corp",
  "category": "semantic",
  "importance": "high",
  "knowledge_category": "technical",
  "content_hash": "abc123def456ab",
  "created_at": "2026-03-15T10:00:00.000Z",
  "last_accessed_at": "2026-03-29T12:00:00.000Z",
  "access_count": 7,
  "confidence": 1.0,
  "active": true,
  "consolidated": false,
  "supersedes": null,
  "superseded_by": null,
  "key": "acme-tech-stack",
  "subject": null,
  "status_value": null,
  "valid_from": "2026-03-15T10:00:00.000Z",
  "valid_to": null,
  "entities": [
    {"name": "acme-corp", "type": "client"},
    {"name": "Next.js", "type": "technology"}
  ],
  "metadata": {}
}

Field Reference

Field	Type	Description
`text`	string	Memory content after credential scrubbing
`type`	string	`event`, `fact`, `decision`, `status`
`source_agent`	string	Agent that originally stored this memory
`observed_by`	string[]	All agents that have stored identical content (corroboration)
`observation_count`	number	Length of `observed_by` (capped at 20)
`client_id`	string	Client scope (`"global"` for system-wide)
`category`	string	`semantic` (concepts), `episodic` (events), `procedural` (how-to)
`importance`	string	`critical`, `high`, `medium`, `low`
`knowledge_category`	string	`brand`, `strategy`, `meeting`, `content`, `technical`, `relationship`, `general`
`content_hash`	string	First 16 chars of SHA256 of scrubbed content
`created_at`	string	ISO 8601 creation timestamp
`last_accessed_at`	string	ISO 8601 -- updated on every search hit
`access_count`	number	Incremented on every search hit
`confidence`	number	Base confidence (always 1.0 at creation)
`active`	boolean	`false` = soft-deleted or superseded
`consolidated`	boolean	`true` after consolidation engine has processed it
`supersedes`	string	UUID of the memory this one replaced (or null)
`superseded_by`	string	UUID of the memory that replaced this one (or null)
`key`	string	Facts only: unique key for upsert matching
`subject`	string	Status only: what this status is about
`status_value`	string	Status only: the status value
`valid_from`	string	Facts/statuses: when this became true
`valid_to`	string	Facts/statuses: when this stopped being true (null = still valid)
`entities`	object[]	Extracted entities: `[{name, type}]`
`metadata`	object	Arbitrary user metadata

Deduplication

Deduplication is tenant-scoped and runs at write time (not async). The process:

Content is scrubbed of credentials
SHA256 hash is computed and truncated to 16 hex characters
Qdrant is queried for existing points matching content_hash + client_id + type + active: true
If a match is found:
- Same agent: Return the existing memory ID (true dedup, no write)
- Different agent: Record cross-agent corroboration by appending to observed_by array

The observed_by array is capped at MAX_OBSERVED_BY = 20 to prevent unbounded growth. Once the cap is reached, additional corroboration is acknowledged in the response but not recorded.

Consolidation Dedup

The consolidation engine has its own dedup layer for merged facts:

Exact dedup: Check content_hash against existing active memories
Semantic dedup: Embed the merged fact, search for similar active memories. If any result has cosine similarity >= 0.92 (SEMANTIC_DEDUP_THRESHOLD), skip the merge.

Supersedes Mechanism

Supersedes creates a version chain for facts and statuses:

fact_v1 (key: "acme-stack")
  active: false
  superseded_by: fact_v2
  superseded_at: 2026-03-20T...
  valid_to: 2026-03-20T...

fact_v2 (key: "acme-stack")
  active: true
  supersedes: fact_v1
  valid_from: 2026-03-20T...
  valid_to: null

When a fact/status is superseded:

Old memory: active=false, superseded_by=new_id, superseded_at=now, valid_to=now
New memory: supersedes=old_id, valid_from=now (or provided value), valid_to=null
Old memory is deactivated in keyword search index
Webhook notification memory_superseded is dispatched

By default, search excludes superseded memories (active: true filter). Pass include_superseded=true to see the full version history.

Temporal Validity

Facts and statuses support temporal validity windows:

Field	Default	Purpose
`valid_from`	`created_at`	When this fact became true
`valid_to`	`null`	When this fact stopped being true (`null` = still valid)

The at_time search parameter enables temporal queries: "what was true at time X?"

When at_time is provided, search adds range filters:

valid_from <= at_time
The Qdrant query returns only memories where valid_from is before the requested time

This enables historical queries like "what tech stack did acme use in January?" even after the fact has been superseded.

Confidence Decay

Confidence decay prevents stale facts and statuses from dominating search results. It only applies to fact and status types -- events and decisions are historical records that don't decay.

Formula

effective_confidence = base_confidence * (DECAY_FACTOR ^ days_since_last_access)

Where:

base_confidence = payload.confidence (always 1.0 at creation)
DECAY_FACTOR = 0.98 (configurable via env var, default 2% per day)
days_since_last_access = days since last_accessed_at (or created_at if never accessed)

Decay Curve

Days Without Access	Effective Confidence
0	1.000
7	0.869
14	0.755
30	0.545
60	0.297
90	0.162

The stats endpoint reports decayed_below_50pct -- the count of facts with effective confidence below 0.5 (sampled from a random 100 facts).

How Access Prevents Decay

Every time a memory appears in search results:

access_count is incremented
last_accessed_at is updated to current time

This resets the decay clock. Frequently-accessed memories stay confident; forgotten ones fade.

Search Scoring

Final search ranking combines multiple signals:

1. Vector Similarity Score

Cosine similarity from Qdrant (0.0 to 1.0). A minimum threshold of 0.3 filters out irrelevant results.

2. Confidence Decay

Applied as described above. Only affects facts and statuses.

3. Access Boost

access_boost = 1 + (0.3 * log2(access_count + 1))

Access Count	Boost Factor
0	1.000
1	1.300
3	1.600
7	1.900
15	2.200

4. Effective Score

effective_score = vector_score * effective_confidence * access_boost

Results are sorted by effective_score descending. In compact format, scores are rounded to 4 decimal places.

5. RRF Fusion (Multi-Path)

When multi-path retrieval is active (default), results from vector, keyword, and graph searches are merged using Reciprocal Rank Fusion before confidence decay and access boost are applied.

RRF Formula: rrf_score(d) = sum(1 / (k + rank)) across all ranked lists where d appears.

k = 60 (configurable via RRF_K). Higher values give more equal weight across ranks.
Items appearing in multiple lists (e.g., found by both vector and keyword search) get boosted.
Items missing from a list simply don't contribute score from that list (no penalty).

The three retrieval paths:

Path	Source	How It Ranks
Vector	Qdrant cosine similarity	By similarity score
Keyword	Postgres `ts_rank_cd` or SQLite FTS5 `rank`	By BM25 text relevance
Graph	BFS spreading activation through entity graph	By aggregate activation score

Entity Extraction

Entities are extracted from memory content at write time using a fast-path approach (no LLM calls):

Extraction Pipeline

Client ID -- If client_id is not "global", it becomes a client entity
Source agent -- Always extracted as an agent entity
Domain names -- Regex: *.com, *.ca, *.io, etc. Typed as domain
Known technology names -- Dictionary of 70+ tech names with canonical forms (e.g., "nextjs" -> "Next.js"). Pre-compiled regex per keyword.
Known system names -- Dictionary of internal system/product names (e.g., "agency system", "shared brain")
Quoted names -- Text in quotes/backticks (3-60 chars), filtered by junk detection
Capitalized multi-word phrases -- Regex for Capitalized Word Patterns, filtered by junk detection

Junk Filtering

The extractor uses pattern-based filtering to avoid false positives:

Action verb prefixes: "Fixed Something", "Added Feature" -- these are log messages, not entities
Noise words at start: "The System", "This Feature" -- determiners, not names
Time words: "Monday Meeting", "January Update"
CSS-like patterns: "Background Color", "Border Radius"
Generic tail words: "Error Fix", "Data Type", "Page View"
File paths, HTML, code: Anything containing /, \, <, {, etc.
Prose fragments: Phrases where >40% of words are common English/French prose words
camelCase/snake_case: Variable names, not entity names

Alias Cache

An in-memory alias cache maps alternative names to canonical entities:

Pre-loaded with 70+ built-in tech name mappings
Extended at startup from the structured store's entity_aliases table
Updated in real-time when consolidation discovers new entities/aliases

When extraction encounters a name that matches an alias, it resolves to the canonical entity name and type.

Entity Linking

After extraction, entities are linked to memories in the structured store:

For each extracted entity, find or create the entity record
Create an entity_memory_link (entity_id, memory_id, role)
Create co_occurrence relationships between all entities in the same memory

Roles: about (client_id), source (source_agent), mentioned (found in content).

Knowledge Categories

Memories can be classified into knowledge domains:

Category	What It Covers
`brand`	Voice, identity, guidelines, visual standards
`strategy`	Plans, positioning, campaigns, competitive analysis
`meeting`	Call notes, action items, client conversations
`content`	Published work, content performance, editorial
`technical`	Hosting, CMS, SEO issues, infrastructure, code
`relationship`	Contacts, preferences, communication style
`general`	Default when no specific category fits

The consolidation engine reclassifies memories with general or null categories to more specific ones based on content analysis. Only reclassifies from general -- never overwrites a specific category.

Entity Relationships

Entities can have typed relationships:

Type	Meaning	Example
`contact_of`	Person is a contact of a client	"Jean -> acme-corp"
`same_owner`	Entities share ownership	"site-a.com -> site-b.com"
`uses`	Entity uses another entity	"acme-corp -> Next.js"
`works_on`	Agent/person works on a project	"claude-code -> acme-corp"
`competitor_of`	Competitive relationship	"acme -> rival-corp"
`co_occurrence`	Appear together in memories (auto-generated)	any pair

Relationships have a strength counter that increments each time the relationship is created/observed. The graph search uses strength to weight BFS traversal: strengthFactor = min(strength / 5, 1.0).

Graph Search (BFS Spreading Activation)

The graph search traverses entity relationships starting from entities mentioned in the query:

Extract entities from query text (fast-path regex)
Resolve to entity IDs in the store
BFS with activation decay:
- Seed entities start with activation 1.0
- Each hop decays by GRAPH_SEARCH_DECAY (default 0.8)
- Typed relationships (uses, works_on, etc.) get CAUSAL_BOOST (default 2.0x)
- co_occurrence relationships get no boost
- Stop at MAX_DEPTH (default 2) or when activation < 0.1
Collect all memory IDs linked to activated entities
Sum activation scores per memory (memories linked to multiple activated entities score higher)

Consolidation Pipeline

The consolidation engine runs on a cron schedule (default: every 6 hours) and processes all unconsolidated memories:

What It Produces

Output	Action	Stored As
Merged facts	Creates new fact, supersedes source memories	`type: fact`, `source_agent: consolidation-engine`
Contradictions	Flags for review	`type: event`, `importance: high`
Connections	Updates metadata on existing points	`connections` + `connection_description` fields
Entities	Creates/updates entities + aliases	Structured store entities/aliases tables
Knowledge categories	Reclassifies `general` -> specific	Updates `knowledge_category` payload field
Entity relationships	Creates typed relationships	Structured store entity_relationships table
Insights	Disabled (noise factory)	Skipped since 2026-03-19

Merged Fact Dedup

Before storing a merged fact, the engine runs two checks:

Exact hash match against existing active memories
Semantic similarity check (embed + search): skip if any result >= 0.92 similarity

Event Cleanup

After consolidation, the engine runs cleanupOldEvents:

Scrolls all events with type: event
Filters for: active=true, access_count=0, created_at older than EVENT_TTL_DAYS (default 30), importance is medium or low
Soft-deletes matching events (sets active: false, expired_at: now)

Batch Processing

Memories are grouped by client_id for focused analysis
Each group is processed in batches of 50 (to stay within LLM context limits)
The LLM receives memories wrapped in XML tags (<memory>) to resist prompt injection
All memory IDs in LLM output are validated against the current batch -- IDs from other batches are stripped

Credential Scrubbing

The scrub.js service removes sensitive patterns from content before storage:

API keys, passwords, tokens
Connection strings with credentials
Bearer tokens

This runs on every store path (POST /memory, webhook, import).

Cross-References

Architecture -- data flow diagrams
API Reference -- endpoint schemas
MCP Tools -- tool parameters for each concept
Configuration -- decay, TTL, RRF tuning

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Data Model Deep Dive

Memory Types

event

fact

decision

status

Memory Payload Schema

Field Reference

Deduplication

Consolidation Dedup

Supersedes Mechanism

Temporal Validity

Confidence Decay

Formula

Decay Curve

How Access Prevents Decay

Search Scoring

1. Vector Similarity Score

2. Confidence Decay

3. Access Boost

4. Effective Score

5. RRF Fusion (Multi-Path)

Entity Extraction

Extraction Pipeline

Junk Filtering

Alias Cache

Entity Linking

Knowledge Categories

Entity Relationships

Graph Search (BFS Spreading Activation)

Consolidation Pipeline

What It Produces

Merged Fact Dedup

Event Cleanup

Batch Processing

Credential Scrubbing

Cross-References

FilesExpand file tree

data-model.md

Latest commit

History

data-model.md

File metadata and controls

Data Model Deep Dive

Memory Types

event

fact

decision

status

Memory Payload Schema

Field Reference

Deduplication

Consolidation Dedup

Supersedes Mechanism

Temporal Validity

Confidence Decay

Formula

Decay Curve

How Access Prevents Decay

Search Scoring

1. Vector Similarity Score

2. Confidence Decay

3. Access Boost

4. Effective Score

5. RRF Fusion (Multi-Path)

Entity Extraction

Extraction Pipeline

Junk Filtering

Alias Cache

Entity Linking

Knowledge Categories

Entity Relationships

Graph Search (BFS Spreading Activation)

Consolidation Pipeline

What It Produces

Merged Fact Dedup

Event Cleanup

Batch Processing

Credential Scrubbing

Cross-References