This document is a draft for potential codex-mem v2 work.
It is not the current normative spec. The v1 documents under this directory remain the source of truth for implemented behavior unless and until a future v2 spec is adopted.
Companion draft documents:
- v2-runtime-resurfacing.md
- v2-config-draft.md
- v2-embedding-storage-draft.md
- v2-conformance-scenarios-draft.md
- v2-migration-sequencing-draft.md
The current v1 system is intentionally local-first, scope-safe, and operationally simple:
- SQLite is the durable store
- notes use FTS5 for full-text search
- handoffs use structured retrieval and text matching
- ranking policy lives mostly in Go code
That design has worked well for structured note and handoff memory, but it leaves some room for improvement:
- lexical search misses semantically similar queries that do not share terms
- imported artifacts and larger free-form summaries may be harder to recover with keyword overlap alone
- retrieval quality may degrade when the same concept is described in very different wording across sessions
This draft describes a v2 direction that improves retrieval quality without discarding the scope, privacy, provenance, and local-first guarantees that define v1.
V2 should be evolutionary, not a rewrite.
Required continuity assumptions:
- v1 data remains readable without migration to a different primary store
- v1 MCP tools continue to exist unless a future normative spec explicitly changes them
- v1 retrieval behavior remains available as a fallback
- scope, privacy, and provenance rules remain mandatory
V2 should treat v1 as the stable baseline and add retrieval capabilities around it rather than replacing it wholesale.
The current leading candidate for v2 is hybrid retrieval.
Hybrid retrieval means:
- keep structured filters and lexical retrieval
- add embedding-backed candidate retrieval for notes and selected future artifacts
- fuse both candidate sets in the Go retrieval layer
- preserve explicit explanations for why a result was returned
This is intentionally different from "switch everything to a vector database."
The likely v2 direction is:
- SQLite remains the canonical durable record store
- FTS remains useful for exact terminology, code identifiers, and operational queries
- vector retrieval becomes an additional recall path, not the only one
- final ranking remains policy-aware and scope-aware in Go
V2 should support retrieving note candidates from both:
- lexical search such as SQLite FTS
- semantic similarity search using embeddings
The system should then merge and rerank results in a way that still respects:
- scope boundaries
- lifecycle state
- importance
- provenance
- recency
- explicit retrieval intent
V2 should improve not only explicit search, but also retrieval during an active session.
Today, the system already supports:
- bootstrap-time recovery at session start
- explicit retrieval through tools such as search and recent-history lookup
But a stronger v2 experience should also support:
- retrieving relevant prior notes while a new user request is being handled
- resurfacing previously solved problems that match the current task, even when the user did not explicitly ask to search memory
- keeping that retrieval bounded by scope, privacy, and provenance rules
This should be treated as task-conditioned retrieval, not as an always-on global memory dump.
The system should be able to use the current request, active task, and recent session context to decide whether durable memory should be consulted before or during response generation.
V2 should keep local-first operation as the default experience.
That means a v2 design should prefer:
- local embedding generation when practical
- local caching and backfill workflows
- storage layouts that do not force a remote SaaS dependency
V2 may allow external vector infrastructure, but it should not assume that a remote vector service is required for a normal local installation.
Semantic similarity must not weaken project isolation.
The retrieval order should still begin with hard policy filters:
- scope eligibility
- privacy and searchability eligibility
- lifecycle and provenance filters
- candidate generation
- ranking and dedupe
Semantic similarity should help rank eligible candidates. It should not decide whether ineligible records become visible.
V2 should keep retrieval explainable enough for debugging and operator trust.
At minimum, a result should remain traceable by:
- source scope
- record kind
- record source or provenance
- relation type for cross-project results
- retrieval reason or score components at debug level
V2 may improve support for larger text summaries or imported artifacts, but it should not turn codex-mem into an indiscriminate transcript archive by default.
Structured notes and handoffs should remain the core durable objects.
The v2 runtime experience should distinguish between three retrieval moments:
- bootstrap retrieval at session start
- explicit retrieval when the agent or user asks to search memory
- implicit runtime resurfacing while handling a new request
The third mode is the important addition.
In that mode, the system should be able to:
- inspect the current task or user request
- fetch a small set of relevant prior durable records
- inject only the most relevant records into the active reasoning context
- avoid repeatedly reloading the same irrelevant memory on every turn
Recommended safeguards:
- keep the candidate set small
- prefer current workspace and current project first
- require stronger confidence for related-project expansion
- preserve provenance labels so the caller can distinguish retrieved memory from newly derived reasoning
- allow the caller or deployment to disable implicit runtime resurfacing if desired
Implicit runtime resurfacing should not run blindly on every turn.
Recommended trigger classes:
- the current request appears to ask about a previously solved bug, fix, or design choice
- the current request contains terminology that strongly overlaps with prior durable memory
- the active task already has a durable history and the new request looks like a continuation of that work
- the request is ambiguous enough that prior project memory is likely to reduce repeated investigation cost
Recommended non-triggers:
- casual conversation that does not depend on project memory
- requests already fully satisfied by the current turn context
- repeated turns where the same memory was just surfaced and no meaningful new signal appeared
Runtime resurfacing should use an explicit confidence gate before injecting durable memory into the active reasoning context.
Possible contributors to resurfacing confidence:
- lexical overlap with note title or content
- semantic similarity to prior notes
- same-task continuity signals from the current session
- scope proximity
- note type relevance such as decision, bugfix, or discovery
- recency and importance
- provenance preference for explicit memory over weaker imported artifacts
If the confidence is below a configured threshold, the system should skip automatic resurfacing and remain usable without it.
The resurfacing path should stay intentionally small.
Recommended initial budget:
- default to 1 to 3 records for implicit runtime resurfacing
- allow a slightly larger budget when the user explicitly asks to search or recover memory
- prefer fewer high-confidence records over a larger noisy set
The system should optimize for relevance density, not recall at all costs.
The system should avoid reloading the same memory repeatedly within one active task unless there is a good reason.
Suggested safeguards:
- keep a session-local resurfacing cache of recently injected record ids
- suppress the same records for a cooling period unless the request changed materially
- allow resurfacing again when the task shifts, the confidence rises meaningfully, or the user explicitly asks to search memory
This helps keep runtime retrieval useful rather than repetitive.
Runtime resurfacing should not dump raw durable records into the live context without shaping.
Preferred behavior:
- retrieve a small set of records
- convert them into compact working-context summaries
- preserve the original record ids and provenance labels
- keep the full record available for explicit inspection when needed
Recommended working-context fields:
- record id
- kind
- source scope
- note or handoff title
- one compact reason why it is relevant now
- one compact summary of the durable content
This keeps the reasoning context compact while preserving traceability.
If the resurfacing path is unavailable or degraded, the system should still behave well.
Examples:
- embeddings unavailable: fall back to lexical-only resurfacing
- embedding coverage incomplete: use the records that do have coverage and continue
- confidence too low: skip implicit resurfacing
- indexing stale: warn in diagnostics, but do not fail ordinary request handling by default
The safest first v2 implementation boundary is narrower than the full long-term direction.
Recommended starting assumptions:
- notes are the first and only semantic retrieval target
- handoffs remain lexical or structured only in early v2
- imported artifacts participate through durable notes rather than a new direct semantic path
- implicit runtime resurfacing stays current-project-first
- related-project implicit resurfacing remains optional and higher-threshold
- implicit runtime resurfacing injects shaped summaries, not full durable records
This preserves a small initial blast radius while still testing the most valuable new retrieval behavior.
The following should remain out of scope unless separately justified:
- a global undifferentiated memory pool
- remote-first architecture
- mandatory hosted vector infrastructure
- automatic durable storage of raw transcripts by default
- black-box ranking with no retrieval explanation
- dropping SQLite as the canonical source of durable truth
Recommended v2 retrieval pipeline:
- Resolve the current scope and retrieval policy.
- Apply hard filters for scope, privacy, searchability, and state.
- Fetch lexical candidates.
- Fetch semantic candidates.
- Merge, dedupe, and rerank in Go.
- Return results with provenance and relation labels.
The lexical path can continue to use:
- SQLite FTS for notes
- optional future FTS support for handoffs
- structured filters for type, state, importance, and scope
The semantic path should be interface-driven and pluggable.
The v2 spec should not yet require one storage backend. Acceptable designs could include:
- SQLite plus a local sidecar vector index
- SQLite plus vector rows managed in a companion local store
- a configurable external vector backend for advanced deployments
The important requirement is not the brand of store. It is that semantic retrieval integrates cleanly with the existing scope and policy model.
Current draft recommendation:
- keep embedding readiness metadata in SQLite
- use a pluggable semantic-index interface
- prefer a local sidecar index for the first rollout
- allow optional future external backends later without changing the canonical store
The final ranking layer should remain in Go, not be delegated entirely to raw backend scores.
Candidate ranking should consider at least:
- scope proximity
- lifecycle state
- importance
- recency
- source and provenance bias
- lexical relevance
- semantic relevance
- explicit search intent
This keeps retrieval policy portable even if the semantic backend changes later.
V2 likely needs additional metadata for embedding-backed retrieval.
Possible additions include:
- embedding model identifier
- embedding version
- embedding status for each eligible record
- embedding updated time
- chunk or segment metadata for larger future artifacts
- retrieval explanation metadata for debug output
These additions should not change the canonical meaning of memory_items and handoffs.
They should extend retrieval support around the existing durable objects.
The safest v2 direction is to preserve the current tool names first.
Possible rollout order:
- keep
memory_searchandmemory_bootstrap_sessionunchanged - improve their retrieval quality internally behind configuration gates
- add optional debug or explanation fields only when the contract is ready
- consider new tool parameters only if the retrieval policy truly needs them
This minimizes client breakage and lets v2 mature behind stable entrypoints.
Any v2 implementation should maintain:
- compatibility with existing v1 records
- compatibility with installations that only want lexical retrieval
- graceful operation when embeddings are unavailable or stale
- deterministic fallback to lexical-only retrieval when semantic infrastructure is disabled
Recommended migration posture:
- do not require destructive schema changes to keep v1 running
- allow embeddings to be backfilled incrementally
- treat missing embeddings as degraded-but-usable, not fatal
- keep operator control over backfill cost and retention
The following questions should be resolved before a normative v2 spec is written:
- Should v2 require a specific local embedding provider, or only define an interface?
- After a notes-first rollout, when if ever should handoffs or imported artifacts join the semantic path?
- What fusion strategy best balances exact keyword matches with semantic recall?
- How much retrieval explanation should be returned by default versus debug mode only?
- Should a single-binary local install remain the baseline even when semantic retrieval is enabled?
- Which concrete sidecar format gives the best local-first tradeoff for the first semantic rollout?
If the project decides to pursue v2, a low-risk sequence would be:
- write a maintainer roadmap for hybrid retrieval slices
- add retrieval interfaces for semantic candidate generation
- add optional embedding metadata and backfill state
- implement semantic retrieval behind a config gate
- fuse lexical and semantic candidates in the existing Go ranking layer
- expand conformance coverage for lexical-only, semantic-enabled, and degraded fallback modes
Treat this document as a planning anchor, not as a committed product contract.
For now:
- v1 remains the active implemented baseline
- hybrid retrieval is the most likely v2 direction
- SQLite should remain the canonical durable store unless a future v2 spec explicitly changes that decision