This document is a draft for potential codex-mem v2 work.
It is exploratory and non-normative. The v1 documents under this directory remain the source of truth for implemented behavior unless a future v2 spec adopts this material.
The v2 outline establishes hybrid retrieval as the leading direction, but it leaves open a practical question with three parts:
- where embedding state should live
- how semantic candidate retrieval should be wired in
- how embeddings should be backfilled and refreshed without destabilizing the v1 baseline
This document narrows that storage and lifecycle question into a concrete draft direction.
Any v2 embedding-storage design should preserve the core v1 properties:
- SQLite remains the canonical durable store for notes and handoffs
- lexical retrieval remains available without semantic infrastructure
- scope, privacy, searchability, and lifecycle policy remain hard filters before semantic ranking
- local-first installation remains the default experience
- missing or stale embeddings degrade retrieval quality, not baseline product correctness
These constraints matter more than choosing one fashionable vector backend.
V2 does not need to make embeddings part of the canonical note body.
Instead, it needs two classes of state:
The first class is embedding metadata. This state should remain in SQLite because it is part of the system of record for retrieval readiness.
Recommended metadata fields per eligible durable record:
- durable record id
- durable record kind
- embedding eligibility state
- embedding model id
- embedding content version
- embedding content hash
- embedding status
- last embedded at
- last embedding error code
- last embedding error time
The second class is the vector payload: the high-dimensional embedding data used for semantic retrieval.
It does not need to become part of the canonical product record. It only needs stable linkage back to the canonical durable record id plus enough versioning metadata to detect staleness.
The v2 outline already allows multiple backend shapes. This draft compares them more directly.
Description (metadata and vectors both in SQLite):
- keep metadata and vector payloads in the same SQLite database
- attempt to support nearest-neighbor lookup from SQLite itself or from application-side scanning
Strengths:
- operationally simple on paper
- one file remains the full local data footprint
- backup story is straightforward
Weaknesses:
- practical vector search support inside the current SQLite setup is not yet established in this repository
- pure application-side scanning compares the query against every stored vector, so its cost grows with the note set
- coupling vector experimentation to the canonical SQLite file raises upgrade and portability risk earlier than necessary
Current assessment:
- acceptable as a future specialized path
- not the best first target for low-risk v2 exploration
Description (SQLite metadata plus a local sidecar index):
- keep canonical metadata in SQLite
- store vector payloads and nearest-neighbor data structures in a sidecar local index managed by a semantic-index interface
- link sidecar entries back to durable record ids and embedding content versions
Strengths:
- preserves SQLite as the canonical source of truth
- keeps the semantic path pluggable and easier to replace
- supports local-first installs without requiring a remote service
- lets the project evolve semantic internals without rewriting the canonical durable schema
Weaknesses:
- introduces another local artifact that must stay in sync
- requires explicit rebuild or repair workflows
- backup and diagnostics need to acknowledge a two-part local state model
Current assessment:
- best first-fit option for the project's current goals
- offers the cleanest separation between canonical memory state and experimental semantic infrastructure
Description (SQLite metadata plus an external vector backend):
- keep canonical metadata in SQLite
- delegate vector storage and nearest-neighbor search to a configurable external system
Strengths:
- can unlock more mature ANN capabilities quickly
- may scale better for advanced or centralized deployments
Weaknesses:
- weakens the default local-first story
- adds deployment and connectivity complexity
- risks making semantic retrieval feel like a separate product subsystem
Current assessment:
- useful as an advanced optional backend later
- not a good baseline for the first v2 rollout
The best first v2 direction is:
- keep canonical embedding metadata in SQLite
- use a pluggable semantic-index interface
- make the default implementation a local sidecar index
- allow optional future external backends only behind the same interface
This gives the project a narrow and reversible first step.
The first implementation should separate responsibilities clearly:
SQLite should continue to own:
- durable note and handoff records
- scope and lifecycle fields
- privacy and searchability state
- lexical retrieval structures
- embedding readiness metadata
- backfill queue visibility or status metadata if needed
The semantic index should own:
- vector payload persistence
- nearest-neighbor candidate lookup
- index-local rebuild state
- index version compatibility checks
The Go retrieval layer should own:
- hard policy filtering before semantic candidate use
- requesting semantic candidates for eligible records
- fusion and reranking across lexical and semantic paths
- degraded fallback behavior when semantic state is absent or stale
This keeps policy in Go and storage-specific behavior in the semantic backend.
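As a rough illustration of keeping policy in the retrieval layer, the sketch below drops any semantic candidate whose record did not pass the hard filters. The `Candidate` type, the `FilterByPolicy` name, and the idea of passing the allowed ids as a set are all assumptions made for this example, not part of the v1 codebase.

```go
package main

// Candidate is a hypothetical semantic hit returned by the semantic index.
type Candidate struct {
	RecordID string
	Score    float64
}

// FilterByPolicy keeps only candidates whose record id is in the allowed set.
// The allowed set is assumed to come from the existing scope, privacy,
// searchability, and lifecycle filters evaluated against SQLite.
func FilterByPolicy(cands []Candidate, allowed map[string]bool) []Candidate {
	out := make([]Candidate, 0, len(cands))
	for _, c := range cands {
		if allowed[c.RecordID] {
			out = append(out, c)
		}
	}
	return out
}
```

The semantic backend never decides visibility in this shape; it only proposes ids, and the Go layer discards anything policy would not have allowed anyway.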
The first v2 metadata extension can stay intentionally small.
Recommended per-record fields:
- `embedding_eligible`
- `embedding_status`
- `embedding_model_id`
- `embedding_content_hash`
- `embedding_content_version`
- `embedded_at`
- `embedding_error_code`
- `embedding_error_at`
Suggested status values:
- `not_applicable`
- `pending`
- `ready`
- `stale`
- `failed`
Suggested semantics:
- `not_applicable`: the record type or policy is not eligible for embeddings
- `pending`: eligible but not yet embedded
- `ready`: embedding is available and matches the current content version
- `stale`: embedding exists but the content or model version changed
- `failed`: the last embedding attempt failed and needs operator attention or retry
The content hash should be derived from the embedding source text, not from unrelated record metadata.
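One possible Go shape for these fields and status values is sketched below. The type names, the field types, and the `Reconcile` helper are assumptions for illustration; the draft only fixes the field and status names.

```go
package main

import "time"

// EmbeddingStatus mirrors the suggested status values.
type EmbeddingStatus string

const (
	StatusNotApplicable EmbeddingStatus = "not_applicable"
	StatusPending       EmbeddingStatus = "pending"
	StatusReady         EmbeddingStatus = "ready"
	StatusStale         EmbeddingStatus = "stale"
	StatusFailed        EmbeddingStatus = "failed"
)

// EmbeddingMeta is a hypothetical in-memory view of the per-record fields.
type EmbeddingMeta struct {
	RecordID       string
	Eligible       bool
	Status         EmbeddingStatus
	ModelID        string
	ContentHash    string
	ContentVersion int
	EmbeddedAt     time.Time
	ErrorCode      string
	ErrorAt        time.Time
}

// Reconcile returns the status a record should carry, given the hash of its
// current embedding source text and the currently configured model id.
// A ready record becomes stale when either value no longer matches.
func Reconcile(m EmbeddingMeta, currentHash, currentModelID string) EmbeddingStatus {
	if !m.Eligible {
		return StatusNotApplicable
	}
	switch m.Status {
	case StatusReady:
		if m.ContentHash != currentHash || m.ModelID != currentModelID {
			return StatusStale
		}
		return StatusReady
	case StatusStale, StatusFailed:
		return m.Status
	default:
		return StatusPending
	}
}
```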
For the first notes-only rollout, embedding text should be derived from a stable, bounded projection of the durable record.
Recommended source ingredients:
- note title
- note type
- compact normalized body content
- optional tags if the product later adopts them explicitly
It should not depend on:
- volatile session-local context
- diagnostic fields
- retrieval scores
This keeps staleness detection predictable and avoids unnecessary re-embedding.
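A minimal sketch of such a projection follows, assuming whitespace normalization, a rune bound on the body, and SHA-256 as the hash. The `NoteProjection` type, the `maxBodyRunes` value, and the text layout are placeholders; the draft does not fix any of them.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"strings"
)

// NoteProjection holds only the stable ingredients listed above.
type NoteProjection struct {
	Title string
	Type  string
	Body  string
}

// maxBodyRunes is an assumed bound; the draft does not specify a number.
const maxBodyRunes = 2000

// SourceText builds a bounded, whitespace-normalized embedding source text.
func SourceText(n NoteProjection) string {
	title := strings.Join(strings.Fields(n.Title), " ")
	body := strings.Join(strings.Fields(n.Body), " ")
	if r := []rune(body); len(r) > maxBodyRunes {
		body = string(r[:maxBodyRunes])
	}
	return "type: " + n.Type + "\ntitle: " + title + "\n\n" + body
}

// ContentHash returns the hex-encoded SHA-256 of the source text, so that
// only changes to the embedding input change the hash.
func ContentHash(n NoteProjection) string {
	sum := sha256.Sum256([]byte(SourceText(n)))
	return hex.EncodeToString(sum[:])
}
```

Because the hash covers only the projected text, edits to diagnostics or session-local fields would not mark a record stale under this scheme.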
The first backfill workflow should be explicit and operator-controlled.
Recommended first phase:
- one-shot backfill command
- current-project scope first
- notes only
- resumable progress by status and content hash
Recommended later phase:
- incremental background updates when durable notes change
- optional scheduled maintenance workflows
This matches the project's cautious rollout posture.
The first operator backfill flow should look like this:
- enumerate eligible records from SQLite
- skip records already marked `ready` for the current content hash and model id
- select a bounded batch of `pending`, `stale`, or retryable `failed` records
- generate embeddings
- write vector payloads into the semantic index
- update SQLite metadata only after the semantic write succeeds
- emit a summary of processed, skipped, failed, and stale-cleared records
This ordering avoids claiming a record is ready before the semantic index actually contains the vector.
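The flow above can be sketched roughly as follows, with in-memory stand-ins for both SQLite and the sidecar index. Every name here (`BackfillRecord`, `VectorWriter`, `MemIndex`, `Backfill`, `ErrEmptyText`) is hypothetical, and a real command would read and write SQLite rows rather than mutate structs.

```go
package main

import "errors"

// ErrEmptyText is a hypothetical failure a demo embedder can return.
var ErrEmptyText = errors.New("empty embedding source text")

// BackfillRecord is an assumed in-memory view of one eligible record.
type BackfillRecord struct {
	ID, Text, CurrentHash         string // derived from the durable record
	EmbeddedHash, ModelID, Status string // embedding metadata
}

// VectorWriter is the part of the semantic index the backfill needs.
type VectorWriter interface {
	Upsert(id, contentHash string, vec []float32) error
}

// MemIndex is an in-memory stand-in for the sidecar index.
type MemIndex map[string][]float32

func (m MemIndex) Upsert(id, contentHash string, vec []float32) error {
	m[id] = vec
	return nil
}

// BackfillSummary is the per-run report emitted at the end.
type BackfillSummary struct{ Processed, Skipped, Failed, StaleCleared int }

// Backfill embeds at most batch records that are not already ready.
func Backfill(recs []*BackfillRecord, idx VectorWriter,
	embed func(string) ([]float32, error), modelID string, batch int) BackfillSummary {
	var s BackfillSummary
	for _, r := range recs {
		if r.Status == "ready" && r.EmbeddedHash == r.CurrentHash && r.ModelID == modelID {
			s.Skipped++
			continue
		}
		if s.Processed+s.Failed >= batch {
			break // bounded batch; a later run resumes from status and hash
		}
		wasStale := r.Status == "stale"
		vec, err := embed(r.Text)
		if err == nil {
			err = idx.Upsert(r.ID, r.CurrentHash, vec) // semantic write first
		}
		if err != nil {
			r.Status = "failed"
			s.Failed++
			continue
		}
		// Metadata changes only after the semantic write has succeeded.
		r.Status, r.EmbeddedHash, r.ModelID = "ready", r.CurrentHash, modelID
		s.Processed++
		if wasStale {
			s.StaleCleared++
		}
	}
	return s
}
```

Because progress lives in the status and hash fields, rerunning the command after an interruption simply skips whatever already reached `ready`.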
Sidecar designs only work well if repair is explicit.
Recommended failure handling:
- semantic write fails after embedding generation: leave SQLite status as `pending` or mark `failed`
- SQLite metadata update fails after sidecar write: treat the sidecar entry as orphan-repairable and detect it during health checks
- sidecar index missing or incompatible: mark semantic retrieval unavailable and fall back to lexical retrieval
- content hash mismatch: mark record `stale` and exclude stale semantic candidates unless degraded policy says otherwise
Recommended repair workflows:
- rebuild the entire sidecar index from canonical SQLite metadata and durable records
- reconcile sidecar entries against SQLite embedding metadata
- clear failed states for retryable records without touching canonical note content
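The reconcile workflow could look roughly like the sketch below. It assumes both sides can be summarized as a map from durable record id to content hash; that representation, and the `ReconcileReport` and `ReconcileIndex` names, are illustrative only.

```go
package main

import "sort"

// ReconcileReport lists discrepancies between the two local stores.
type ReconcileReport struct {
	Orphans    []string // vector exists, but SQLite does not mark the record ready
	Missing    []string // SQLite says ready, but the sidecar has no vector
	Mismatched []string // both exist, but the content hashes differ
}

// ReconcileIndex compares record id -> content hash maps from both sides.
// sqliteReady is assumed to contain only records whose status is ready.
func ReconcileIndex(sqliteReady, sidecar map[string]string) ReconcileReport {
	var r ReconcileReport
	for id, hash := range sidecar {
		want, ok := sqliteReady[id]
		switch {
		case !ok:
			r.Orphans = append(r.Orphans, id)
		case want != hash:
			r.Mismatched = append(r.Mismatched, id)
		}
	}
	for id := range sqliteReady {
		if _, ok := sidecar[id]; !ok {
			r.Missing = append(r.Missing, id)
		}
	}
	// Sort so that reports are stable across runs despite map iteration order.
	sort.Strings(r.Orphans)
	sort.Strings(r.Missing)
	sort.Strings(r.Mismatched)
	return r
}
```

A repair command could then delete orphans, reset missing records to `pending`, and mark mismatched ones `stale`, all without touching canonical note content.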
Retrieval behavior should remain predictable in partially-built states:
- no semantic index present: lexical-only retrieval
- semantic index present but empty: lexical-only retrieval plus diagnostics
- semantic index partially populated: hybrid retrieval over only `ready` records
- semantic index incompatible with current model version: lexical-only retrieval until rebuild or repair
The fallback should be deterministic and boring.
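The decision table above is small enough to express as one pure function. The sketch below is one way to do that; `IndexState`, `Decision`, and the diagnostic strings are assumptions, and attaching a diagnostic to every lexical-only case goes slightly beyond what the list requires.

```go
package main

// IndexState is a hypothetical snapshot of semantic index readiness.
type IndexState struct {
	Present    bool // a sidecar index exists on disk
	Compatible bool // its version matches the current embedding model
	ReadyCount int  // number of records with status ready
}

// Mode names the retrieval path the Go layer should take.
type Mode string

const (
	ModeLexicalOnly Mode = "lexical_only"
	ModeHybrid      Mode = "hybrid"
)

// Decision pairs the chosen mode with a maintainer-facing reason.
type Decision struct {
	Mode       Mode
	Diagnostic string
}

// ChooseMode maps index state to a retrieval mode without side effects,
// so the same state always yields the same behavior.
func ChooseMode(s IndexState) Decision {
	switch {
	case !s.Present:
		return Decision{Mode: ModeLexicalOnly, Diagnostic: "semantic index not present"}
	case !s.Compatible:
		return Decision{Mode: ModeLexicalOnly, Diagnostic: "semantic index incompatible; rebuild or repair required"}
	case s.ReadyCount == 0:
		return Decision{Mode: ModeLexicalOnly, Diagnostic: "semantic index present but empty"}
	default:
		return Decision{Mode: ModeHybrid}
	}
}
```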
Any first v2 implementation should expose enough observability for maintainers to understand semantic readiness.
Useful health signals include:
- eligible record count
- ready record count
- stale record count
- failed record count
- sidecar index version
- sidecar orphans count
- current embedding model id
- last successful backfill time
These can remain maintainer-facing before any operator UX is formalized.
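The count-based signals can be derived from the status column alone. The sketch below assumes statuses arrive as plain strings and that any record not marked `not_applicable` counts as eligible; `ReadinessReport` and `SummarizeReadiness` are hypothetical names.

```go
package main

// ReadinessReport holds the count-based health signals listed above.
type ReadinessReport struct {
	Eligible, Ready, Stale, Failed, Pending int
}

// SummarizeReadiness tallies embedding statuses into a report.
// Records marked not_applicable are excluded from the eligible count.
func SummarizeReadiness(statuses []string) ReadinessReport {
	var r ReadinessReport
	for _, s := range statuses {
		if s == "not_applicable" {
			continue
		}
		r.Eligible++
		switch s {
		case "ready":
			r.Ready++
		case "stale":
			r.Stale++
		case "failed":
			r.Failed++
		case "pending":
			r.Pending++
		}
	}
	return r
}
```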
The implementation should likely split the semantic path into two interfaces:
Responsibilities of the embedding provider interface:
- turn bounded source text into vectors
- expose provider or model identity
- surface retryable versus permanent failures
Responsibilities of the semantic index interface:
- upsert vectors keyed by durable record id and content version
- fetch nearest candidates for a query vector
- report readiness or compatibility state
- rebuild, repair, or clear index-local state when needed
This keeps provider and storage concerns separate.
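A possible Go shape for the two interfaces is sketched below. The interface names, method signatures, and the `EmbedError` type are assumptions for discussion, not a settled API.

```go
package main

import (
	"context"
	"errors"
)

// Embedder turns bounded source text into vectors and exposes model identity.
type Embedder interface {
	ModelID() string
	Embed(ctx context.Context, text string) ([]float32, error)
}

// EmbedError lets a provider mark failures as retryable or permanent.
type EmbedError struct {
	Code      string
	Retryable bool
}

func (e *EmbedError) Error() string { return "embedding failed: " + e.Code }

// IsRetryable reports whether err is, or wraps, a retryable EmbedError.
func IsRetryable(err error) bool {
	var e *EmbedError
	return errors.As(err, &e) && e.Retryable
}

// Neighbor is one nearest-neighbor candidate.
type Neighbor struct {
	RecordID       string
	ContentVersion int
	Score          float32
}

// SemanticIndex owns vector persistence and candidate lookup.
type SemanticIndex interface {
	// Upsert stores a vector keyed by durable record id and content version.
	Upsert(ctx context.Context, recordID string, contentVersion int, vec []float32) error
	// Nearest returns up to k candidates for a query vector.
	Nearest(ctx context.Context, query []float32, k int) ([]Neighbor, error)
	// Compatible reports whether the index can serve the given model id.
	Compatible(ctx context.Context, modelID string) (bool, error)
	// Rebuild clears index-local state so it can be repopulated.
	Rebuild(ctx context.Context) error
}
```

With this split, swapping the embedding provider would not require touching the index, and replacing the sidecar with an external backend would only mean another `SemanticIndex` implementation.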
The first real implementation should keep all of these limits:
- notes only
- one embedding source-text shape
- one local sidecar implementation
- explicit backfill command first
- lexical fallback always available
- no mandatory startup indexing
- no semantic dependence for bootstrap recovery correctness
These limits are a feature, not a temporary embarrassment.
The following design questions remain open:
- What sidecar format best matches the project's portability goals: another SQLite file, a purpose-built local index file, or a small embedded library format?
- Should stale vectors ever remain queryable in degraded mode, or should only `ready` vectors participate in candidate generation?
- How much backfill progress state belongs in SQLite metadata versus index-local bookkeeping?
- Should the first backfill command operate only on the current project scope, or also allow related-project batches when policy permits?
Use this document as the working draft for v2 semantic storage and backfill design.
For the first implementation attempt:
- keep SQLite as the canonical metadata source
- add only minimal embedding lifecycle fields there
- use a local sidecar semantic index behind an interface
- start with one-shot operator backfill
- treat rebuild and lexical fallback as first-class paths