What the code actually does at runtime. If a section here contradicts the spec, trust the code.
Same read/write core, different wrappers:
- CLI - `src/dory_cli/main.py`
- HTTP - `src/dory_http/app.py`
- Native MCP bridge - `src/dory_mcp/server.py`
- Claude bridge - `scripts/claude-code/dory-mcp-http-bridge.py`
- Hermes provider - `plugins/hermes-dory/provider.py`
- OpenClaw plugin - `packages/openclaw-dory/src/index.ts`
Everything routes through dory_core. The wrappers are thin.
Three layers:
- Scanned by `MarkdownStore` in `src/dory_core/markdown_store.py`
- Frontmatter parsed by `src/dory_core/frontmatter.py`
- Chunked by `chunk_markdown()` in `src/dory_core/chunking.py`
Lives under `index_root`. Blow it away and rebuild any time.
- `index_root/dory.db` - SQLite. Holds the entity registry, claims, claim events, files, chunks, FTS, edge graph, embedding cache, chunk vectors, recall log, and parity tables.
- `chunk_vectors` - the vector table, managed by `SqliteVectorStore`. Brute-force O(n) cosine today. A legacy `index_root/lance/chunks_vec.json` is imported as a fallback when the SQLite vector table is empty.
- Raw session logs under `logs/sessions/**` are excluded from this durable index. They are raw evidence, not canonical memory.
- `index_root/session_plane.db` - SQLite, managed by `SessionEvidencePlane` in `src/dory_core/session_plane.py`. Used for session recall and ingest. Deliberately separate from durable memory so raw logs don't leak into canonical answers.
- Session-plane search is lexical/FTS-based with recency scoring; it is reached through `mode="recall"` or `corpus="sessions"`, and may be merged into `corpus="all"` only for explicit recent/session-style queries.
- `dory reindex` and `dory ops watch` sync session markdown into `session_plane.db` without embedding it into the durable vector index.
Full reindex:
- `MarkdownStore.scan()` walks durable `*.md`, excluding `logs/sessions/**`.
- Each doc is parsed and chunked.
- `reindex_corpus()` prepares file rows and chunk rows.
- New embeddings are generated for uncached content hashes.
- `SqliteStore.replace_documents()` rewrites file/chunk/FTS/cache state.
- `SqliteVectorStore.replace()` rewrites vector rows in `dory.db`.
- Session logs are synced separately into `session_plane.db`.
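The "uncached content hashes" step can be sketched like this (a minimal illustration; `embed_batch` and the cache shape are assumptions, not the real `reindex_corpus()` internals):

```python
import hashlib

def embed_with_cache(chunks, cache, embed_batch):
    """Embed only chunks whose content hash is not already cached."""
    hashes = [hashlib.sha256(c.encode("utf-8")).hexdigest() for c in chunks]
    missing = [(h, c) for h, c in zip(hashes, chunks) if h not in cache]
    if missing:
        # One batched model call for the cache misses only.
        vectors = embed_batch([c for _, c in missing])
        cache.update({h: v for (h, _), v in zip(missing, vectors)})
    return [cache[h] for h in hashes]

# Usage: a fake embedder that records how many chunks it was asked to embed.
calls = []
fake_embed = lambda texts: (calls.append(len(texts)), [[float(len(t))] for t in texts])[1]
cache = {}
embed_with_cache(["alpha", "beta"], cache, fake_embed)
embed_with_cache(["beta", "gamma"], cache, fake_embed)  # "beta" hits the cache
```

Unchanged docs therefore cost nothing at reindex time beyond hashing.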
Partial reindex:
- Changed relative paths are resolved.
- Old chunk IDs for those paths are deleted.
- Changed docs are reparsed and re-embedded.
- SQLite/vector rows are upserted.
Main code:
- `src/dory_core/index/reindex.py`
- `src/dory_core/index/sqlite_store.py`
- `src/dory_core/index/sqlite_vector_store.py`
WakeBuilder in src/dory_core/wake.py builds a frozen prompt block from:
- `core/user.md`
- `core/soul.md`
- `core/env.md`
- `core/active.md`
- optional pinned decisions
- optional recent session summaries
Current behavior:
- respects a token budget
- recent sessions are selected by file mtime, not lexicographic path order
- recent sessions are summarized to one line each
- it's a static snapshot at call time
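The mtime-based selection and budget behavior above can be sketched as follows (illustrative only; the function and parameter names are not WakeBuilder's real API):

```python
from pathlib import Path

def recent_session_lines(session_dir: Path, budget_tokens: int, max_files: int = 5):
    """Pick the most recently modified session files (mtime, not path order)
    and emit one summary line each, stopping once the rough budget is spent."""
    files = sorted(session_dir.glob("*.md"),
                   key=lambda p: p.stat().st_mtime, reverse=True)
    lines, used = [], 0
    for path in files[:max_files]:
        first = path.read_text(encoding="utf-8").strip().splitlines()
        summary = f"- {path.name}: {first[0] if first else '(empty)'}"
        cost = len(summary.split())  # crude word-count proxy for tokens
        if used + cost > budget_tokens:
            break  # respect the token budget
        lines.append(summary)
        used += cost
    return lines
```

Because selection happens at call time, the result is a static snapshot, as noted above.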
SearchEngine in src/dory_core/search.py supports:
`bm25`, `vector`, `hybrid`, `recall`, `exact`
Input aliases:
`text`, `keyword`, `lexical` -> `bm25`; `semantic` -> `vector`
corpus values:
`durable`, `sessions`, `all`
- BM25 reads from SQLite FTS; scores are negative (raw SQLite `bm25()` output)
- Vector search reads all vector rows and scores by cosine similarity (brute-force)
- Hybrid search:
- gets BM25 candidates
- gets vector candidates
- fuses rankings via RRF
- applies priors for canonical/current/source-backed docs
- returns `evidence_class` and confidence in the default client-facing response; `rank_score`, `score_normalized`, raw `score`, and `frontmatter` are diagnostic fields exposed only with `debug=true`
- keeps `rank_score` and `score_normalized` aligned with the final returned order after rerank, selection, or session merge
- demotes raw inbox, generated, and session-like material unless a scope/exact query asks for it
- adds lexical and temporal boosts
- optionally expands queries through OpenRouter (`query_expansion.py`) when `DORY_QUERY_EXPANSION_ENABLED=true`
- when OpenRouter retrieval planning is configured with `DORY_QUERY_PLANNER_ENABLED=true`, `retrieval_planner.py` can replace heuristic expansion/session decisions with a strict search plan of durable queries plus optional session queries
- can rerank rows when `DORY_QUERY_RERANKER_ENABLED=true` and `rerank` permits it
- expanded queries contribute to both BM25 and vector candidate generation
- durable/session fallback merge uses rank-plus-coverage scoring instead of raw concatenation
Score inconsistency: BM25-only returns negative scores; hybrid/vector return positive. API consumers shouldn't compare raw scores across modes. Use result order in normal mode; use debug scores only for diagnostics.
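The RRF fusion step is also why the score-scale mismatch doesn't matter inside hybrid mode: fusion uses ranks, not raw scores. A minimal sketch (without the priors, boosts, and demotions listed above; `k=60` is the common default, assumed here):

```python
def rrf_fuse(bm25_ids, vector_ids, k=60):
    """Reciprocal Rank Fusion: each candidate earns 1/(k + rank) per list it
    appears in. Negative BM25 scores and positive cosine scores never mix --
    only the ranks within each candidate list do."""
    scores = {}
    for ranking in (bm25_ids, vector_ids):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# "b" appears in both candidate lists, so it outranks single-list candidates.
fused = rrf_fuse(["a", "b", "c"], ["b", "d"])
```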
- `mode="recall"` or `corpus="sessions"` searches `session_plane.db`
- Session recall honors `SearchScope` filters for `agent`, `device`, `session_id`, `session_key`, `status`, `since`, and `until`
- Results are marked as lower-trust session evidence
- Session-plane ranking combines FTS hits with token coverage and recency
- Session snippets are cut around matching query terms instead of always returning the first bytes
- Recall mode searches all session documents unless a session scope filter is supplied
- Raw session logs are not part of durable BM25/vector memory. Use recall/session search when the user asks about recent work, prior conversations, or agent-session evidence.
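The term-centered snippet cutting mentioned above can be sketched like this (illustrative; the session plane's actual windowing logic may differ):

```python
def snippet_around_match(text: str, query_terms, width: int = 80) -> str:
    """Return a window of `width` characters centered on the first query-term
    hit, falling back to the head of the text when nothing matches."""
    lowered = text.lower()
    positions = [lowered.find(t.lower()) for t in query_terms]
    hits = [p for p in positions if p >= 0]
    if not hits:
        return text[:width]  # no match: fall back to the first bytes
    center = min(hits)
    start = max(0, center - width // 2)
    return text[start:start + width]

text = "x" * 200 + " the deploy failed on friday " + "y" * 200
snip = snippet_around_match(text, ["deploy"])
```

This keeps the matched evidence visible even when it sits deep inside a long transcript.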
In hybrid search:
- Deterministic path — weak durable results trigger a fallback into session-plane results.
- With the LLM retrieval planner on — session use becomes part of the explicit plan instead of relying on the weakness heuristic. The final merged durable + session set can also be reordered via strict-schema selection.
- Generic `corpus="all"` merges skip `active` and `interrupted` session rows unless the query asks for session or recent-history evidence. This keeps live agent transcripts out of normal project-state answers while preserving explicit recall.
- The background client shipper scans local harness stores on its poll interval and posts cleaned sessions to `/v1/session-ingest`.
- The Claude Code HTTP bridge also runs the shipper once immediately before `dory_wake` by default. This makes just-finished local Codex/Claude/OpenClaw/Hermes sessions available to a wake call without waiting for the next poll.
- The pre-wake sync uses the configured client env, spool, and checkpoint paths. Disable it with `DORY_SYNC_SESSIONS_ON_WAKE=false`.
Default stays deterministic so interactive search stays fast. The LLM path is opt-in.
ActiveMemoryEngine in src/dory_core/active_memory.py. Optional staged retrieval helper, not a full autonomous sub-agent.
Stages:
- Explicit call enters the staged flow.
- Request `profile` resolves to `general`, `coding`, `writing`, `privacy`, or `personal`. `auto` falls back to prompt classification.
- The profile's source policy decides the wake profile, whether session evidence is allowed, whether the generated wiki shell is used as helper context, and which path families are blocked.
- Optional `project` or inferred `cwd` project context resolves to `projects/<slug>/state.md` and is injected as durable evidence when available.
- Optional retrieval planner turns the prompt plus compact helper context into durable and optional session queries.
- Durable hybrid search with rerank enabled by default.
- Session-plane recall only when the profile allows it and the prompt asks for recency. Any `ActiveMemoryReq.scope` is applied to this session-recall leg, not to durable active-memory search.
- Optional composer compresses a tiny sanitized evidence packet into a synthesis block.
- Final output: synthesis + bounded durable/session evidence under the request token budget. Response includes the resolved profile.
Profile rules:
- `coding` blocks personal/voice paths.
- `writing` loads voice context without full identity.
- `privacy` is boundary-only: no session evidence, no wiki helper, no inbox, no people pages, no personal knowledge.
LLM path (optional):
- Provider via `DORY_ACTIVE_MEMORY_LLM_PROVIDER`: `openrouter`, `local` (OpenAI-compatible endpoint from `DORY_LOCAL_LLM_*`), `auto` (local then OpenRouter), or `off`.
- Stages via `DORY_ACTIVE_MEMORY_LLM_STAGES`: `plan`, `compose`, or `both`. Deterministic retrieval always runs; the LLM only touches the stages you enable.
- Read-only. The LLM sees sanitized snippets and strict schemas. No write path. If the deadline is tight, LLM stages are skipped.
Dream/digest jobs use `DORY_DREAM_LLM_PROVIDER` (`openrouter`, `local`, or `auto`). The local path reuses `DORY_LOCAL_LLM_*`, so session distillation and daily digests can run against a LAN OpenAI-compatible model without internet access from the container.
Other notes:
- Wiki pages are helper context only and are never returned as citeable sources. Durable evidence excludes `wiki/` so generated cache pages do not outrank canonical files.
- Output starts with a `## Active memory` synthesis across current focus, helper hints, project state, and selected durable/session hits, followed by bounded evidence sections.
- When a canonical file is available locally, active-memory combines the focused search snippet with a compact canonical excerpt so `include_wake=false` calls are not reduced to a single generic line.
- Use `scope.session_key` or related session filters when a client wants the block grounded in a specific active agent session.
get is intentionally simple.
HTTP GET /v1/get:
- resolves a corpus-relative path
- prevents escape from the corpus root
- slices lines via `from` and `lines` query params
- returns 404 for paths outside the configured corpus, even when a memory document cites those paths as implementation evidence
- returns content, frontmatter, hash, and line counts
Native MCP dory_get returns the same path, line, frontmatter, and hash metadata.
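The escape check and `from`/`lines` slicing can be sketched as follows (a framework-free illustration, not the actual route handler):

```python
from pathlib import Path

def safe_read(corpus_root, rel_path, frm=1, lines=None):
    """Resolve a corpus-relative path, refuse anything that escapes the
    corpus root, and slice 1-indexed lines like the from/lines params."""
    root = Path(corpus_root).resolve()
    target = (root / rel_path).resolve()
    try:
        target.relative_to(root)  # raises ValueError if target escaped root
    except ValueError:
        return None  # the HTTP layer turns this into a 404
    all_lines = target.read_text(encoding="utf-8").splitlines()
    end = len(all_lines) if lines is None else frm - 1 + lines
    return all_lines[frm - 1 : end]
```

Resolving before the containment check is what defeats `../` traversal and symlink-style escapes.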
WriteEngine in src/dory_core/write.py is the low-level writer.
Supported kinds:
`append`, `create`, `replace`, `forget`
Core behavior:
- validates relative targets and resolves final parents under the corpus root before writing
- validates/normalizes frontmatter
- scans for prompt-injection patterns (subset of spec patterns; doesn't cover "ignore all instructions" or Unicode tag characters)
- rejects invisible unicode
- supports quarantine mode for unsafe content
- preserves tombstones for `forget`
- allows dotted relative targets like `*.tombstone.md` for internal tombstone rewrites
- updates the index immediately when embedder and `index_root` are present
- resyncs link edges for changed docs
Known gaps:
- the main write engine uses atomic temp-write-and-replace, but non-core scripts still use plain `write_text()`
- `forget` still spans two file replacements and isn't transactional across both paths
- raw write validation is still path-based; the semantic layer is preferred for agent-facing writes
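The atomic temp-write-and-replace pattern referenced above, as a sketch (names are illustrative, not the core helper's actual signature):

```python
import os
import tempfile

def atomic_write_text(path: str, content: str) -> None:
    """Write to a temp file in the target's directory, then os.replace() it
    over the target. Readers see either the old file or the new one, never a
    half-written mix -- unlike a plain write_text()."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(content)
        os.replace(tmp, path)  # atomic rename on the same filesystem
    except BaseException:
        os.unlink(tmp)  # don't leave temp debris on failure
        raise
```

Writing the temp file into the same directory (not `/tmp`) keeps the rename on one filesystem, which is what makes `os.replace()` atomic.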
SemanticWriteEngine in src/dory_core/semantic_write.py is the preferred durable write layer.
Pipeline:
- Resolve the subject through `EntityRegistry` (with fallback matching).
- Route to a canonical target path.
- Drop an immutable evidence artifact under `sources/semantic/YYYY/MM/DD/*.md`.
- Append / replace / forget through `WriteEngine`.
- Update `ClaimStore` active claims and claim events using the evidence artifact as provenance.
- Rebuild the canonical page from active claims + events.
- On `forget`, republish the tombstone page from claim history + events.
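The dated evidence path in the pipeline can be sketched as follows (only the `sources/semantic/YYYY/MM/DD/` layout comes from this doc; the timestamp-plus-slug filename is an assumption for illustration):

```python
from datetime import datetime, timezone

def evidence_path(subject_slug: str, now=None) -> str:
    """Build an immutable evidence-artifact path under the dated
    sources/semantic/ tree. Filename scheme is hypothetical."""
    now = now or datetime.now(timezone.utc)
    return (
        f"sources/semantic/{now:%Y/%m/%d}/"
        f"{now:%H%M%S}-{subject_slug}.md"
    )
```

Date-partitioned, write-once artifacts give every claim event a stable `evidence_path` to cite as provenance.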
Notes:
- Person, project, concept, decision, and core pages are structured section docs.
- `Timeline` and `Evidence` on those pages come from claim events, not ad-hoc appends.
- Every resolved semantic write emits an immutable evidence artifact unless quarantined.
- Live and dry-run semantic write responses include the planned evidence path and match source. Dry-run responses include a compact preview payload with target, subject refs, resolved mode, and confidence.
- Optional `agent`, `session_id`, and `origin_surface` request fields are preserved on semantic evidence and inbox/quarantine captures.
- Claim events carry `entity_id`, `event_type`, `reason`, `evidence_path`, `created_at`.
- Semantic `forget` supersedes the original page and writes the retired history + evidence view to the tombstone.
- Unresolved low-confidence subjects get rejected or quarantined.
- `_sync_registry()` refreshes registry state inside the same process, so writes can resolve subjects that were just established.
MigrationEngine in src/dory_core/migration_engine.py. Four stages when LLM support is on:
- Per-document strict extraction
- Evidence-first staging for unresolved docs
- Corpus-level entity clustering across candidates
- Claim-store-backed canonical compilation, plus optional audit / repair pass on generated pages
- Markdown files + legacy structured inputs: `.json`, `.jsonl`, `.ndjson`, `.txt`, `.yaml`, `.yml`, `.toml`, `.csv`. Non-markdown inputs are normalized to markdown before classification and land as `*.json.md`, `*.jsonl.md`, etc. so the corpus stays markdown-first.
- Per-document artifacts land under `inbox/migration-documents/<run_id>/` and record fallback reasons when LLM promotion degrades to evidence-only.
Unresolved docs stay evidence-only or quarantined by default, but a narrow set of structured inputs can promote deterministically:
- Transcript-shaped `.jsonl`/`.ndjson` session evidence with clear assistant statements.
- Typed JSON with an explicit family (`project`, `person`, `decision`, `concept`) plus a title and summary.
- Schema-tagged exports with a registered adapter: `dory.project_export.v1`, `dory.person_export.v1`, `dory.decision_export.v1`, `dory.concept_export.v1`. Unknown schemas stay evidence-only.
Migration doesn't just append every atom — it maps them to claim kinds:
- `project_update` → claim kind `state`. A newer `project_update` for the same project replaces the current active state claim instead of leaving multiple actives side by side.
- `person_fact` / `concept_claim` → claim kind `fact`.
- `goal` / `open_question` / `followup` / `timeline_event` → claim kind `note`.

Extracted source dates (from `time_ref`) are preserved on the written claims so canonical timelines reflect when things happened, not when migration ran.
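The mapping above can be sketched as a lookup plus a replace-on-`state` rule (illustrative; the claim shape is simplified):

```python
CLAIM_KIND = {
    "project_update": "state",
    "person_fact": "fact",
    "concept_claim": "fact",
    "goal": "note",
    "open_question": "note",
    "followup": "note",
    "timeline_event": "note",
}

def commit_atom(active_claims, atom):
    """Map a migration atom to a claim kind; a `state` claim replaces any
    existing active state for the same entity instead of stacking."""
    kind = CLAIM_KIND[atom["type"]]
    claim = {"entity": atom["entity"], "kind": kind, "text": atom["text"],
             "source_date": atom.get("time_ref")}  # preserve extracted date
    if kind == "state":
        active_claims = [c for c in active_claims
                         if not (c["entity"] == claim["entity"] and c["kind"] == "state")]
    return active_claims + [claim]

claims = []
claims = commit_atom(claims, {"type": "project_update", "entity": "dory", "text": "v1 shipped"})
claims = commit_atom(claims, {"type": "project_update", "entity": "dory", "text": "v2 in progress"})
# only the newer state claim stays active
```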
Resolved docs go through a corpus-level clustering pass before atoms are committed. Clustering merges aliases (e.g. `person:primary-user` and `person:primary-user-alias`) across source files before claims and pages are written.
Core docs map to specific sections instead of a generic Role fallback:
- `core/user.md` → `Summary`
- `core/identity.md` → `Role`
- `core/soul.md` → `Voice`
- `core/env.md` → `Environment`
- `core/active.md` → `Current Focus`
- `core/defaults.md` → `Default Operating Assumptions`
- After compilation, migration emits `inbox/migration-runs/<run_id>.audit.json`.
- If pages are flagged, it emits `inbox/migration-runs/<run_id>.repair.json`, applies one bounded grounded repair pass, and re-audits before writing the final report and run artifact.
- Entity-resolution, audit, and repair failures that previously degraded silently now persist as fallback warnings in the run report.
SessionIngestService in src/dory_core/session_ingest.py:
- Only accepts `logs/sessions/**/*.md`.
- Writes a session markdown file under the corpus.
- Writes a session evidence row into `session_plane.db`.
- Doesn't trigger a full durable reindex.
Why the separation matters:
- Session evidence is searchable quickly.
- It doesn't pollute durable canonical memory automatically.
- The markdown write uses the same atomic temp-write-and-replace helper as the main durable write path.
Link behavior splits into two mechanisms:
- Explicit wikilinks like `[[people/alex|Alex]]`
- Implicit known-entity matching against corpus entities
During indexed writes:
- Edges are extracted from markdown.
- Previous edges from the source doc are deleted.
- New edges are inserted into SQLite.
Graph reads exposed via `neighbors`, `backlinks`, and `lint` in `src/dory_core/link.py`.
`neighbors` and `backlinks` are bounded by `max_edges` and can filter noisy path families with `exclude_prefixes`. Responses include `count`, `total_count`, and `truncated`, so agent clients can keep dense project/core graphs small without losing the fact that more edges exist.
Note: the code block regex in link.py (re.compile(r"```.*?```", re.DOTALL)) can match incorrectly across multiple fenced code blocks and doesn't handle unclosed blocks.
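A line-based fence toggle sidesteps both issues; here is a sketch of wikilink extraction built that way (not the actual link.py code):

```python
import re

# [[target]] or [[target|label]]; group 1 captures the link target.
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")

def extract_wikilinks(markdown: str):
    """Collect [[target|label]] targets, skipping fenced code blocks by
    toggling state on fence lines instead of one cross-block regex."""
    targets, in_fence = [], False
    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # an unclosed fence skips the remainder
            continue
        if not in_fence:
            targets.extend(WIKILINK.findall(line))
    return targets

doc = "see [[people/alex|Alex]]\n```\n[[not/a/link]]\n```\nand [[projects/dory]]"
```

The toggle degrades predictably on an unclosed block (everything after it is ignored) rather than mis-pairing fences across blocks.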
ResearchEngine in src/dory_core/research.py is still bounded, but no longer a snippet dump.
- Runs hybrid search with rerank enabled.
- Requests compact snippets rather than whole chunk bodies.
- Builds a grounded artifact body with `Question`, `Answer`, `Evidence`, and optional `Session Evidence` sections.
- Dedupes source paths before returning the artifact.
- Optionally persists the artifact through `ArtifactWriter`.
Artifact targets:
- reports -> `references/reports/`
- briefings -> `references/briefings/`
- wiki notes -> `wiki/concepts/`
- proposals -> `inbox/proposed/`
Semantic memory can now be proposed before it is committed. `dory_memory_propose` and `dory proposals create` run the same semantic router as memory-write in dry-run mode, then persist a proposal JSON document under `inbox/proposed/` with the planned route, preview, risk flags, source paths, and provenance. Reviewers can list/show pending, applied, or rejected proposals through CLI, HTTP, MCP, the Claude Code bridge, or Hermes.
Applying a pending proposal re-runs the dry-run route and compares the durable routing identity (`target_path`, `subject_ref`, and preview `target_subject_ref`) with the stored preview before performing the live write with canonical writes allowed. Rejected and applied proposals move to `inbox/rejected/` or `inbox/applied/` so agents can audit prior decisions without reprocessing the pending queue.
src/dory_core/openclaw_parity.py adds two parity surfaces:
- recall event tracking
- public artifact listing
Current code supports:
- `POST /v1/recall-event`
- `GET /v1/public-artifacts`
- parity diagnostics inside status
ClaimStore plus the canonical / wiki renderers split current state from history.
- Active / current sections: from active claims.
- `Timeline`: from ordered claim events.
- `Evidence`: from deduped `evidence_path` values on those events.
- Older callers can still synthesize events from claim history, but live semantic writes supply explicit claim events now.
Main code:
- `src/dory_core/claim_store.py`
- `src/dory_core/canonical_pages.py`
- `src/dory_core/compiled_wiki.py`
- Recall-promotion candidate tracking (used by dreaming)
Two implementations:
- `src/dory_core/token_counting.py` - tiktoken-based counting with per-agent encoding and heuristic fallback
- `src/dory_core/chunking.py` - uses `text.split()` (whitespace split), not tiktoken
Chunking doesn't use the tiktoken counter, so chunk boundaries are based on word count, not actual token count.
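The gap is easy to demonstrate (the ~4-chars-per-token heuristic below is an assumption for illustration, not the project's actual fallback formula):

```python
def word_count(text: str) -> int:
    """What chunking.py effectively measures: whitespace-separated words."""
    return len(text.split())

def heuristic_tokens(text: str) -> int:
    """A common rough estimate of ~4 characters per BPE token."""
    return max(1, len(text) // 4)

# One "word" by whitespace, but many tokens to a BPE-style tokenizer:
code_line = "SqliteVectorStore.replace_documents(index_root=index_root)"
words, approx = word_count(code_line), heuristic_tokens(code_line)
```

Code-heavy chunks therefore undercount badly by whitespace, so chunks sized this way can blow past a model's real token budget.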
Domain exceptions in src/dory_core/errors.py:
- `DoryError` - base
- `DoryConfigError` - configuration issues
- `DoryValidationError` - input validation failures
HTTP mapping: all DoryValidationError instances return 400 regardless of the specific failure (path invalid, precondition failed, injection blocked, quota exceeded, frontmatter invalid). The API contract defines distinct error codes and HTTP status codes for each.
MCP mapping: tool-call exceptions are caught at the server boundary and returned as JSON-RPC errors. Validation-style failures map to invalid-params, everything else to internal server error.
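The flattened HTTP mapping can be sketched like this (a framework-free illustration; mapping config errors to 500 is an assumption):

```python
class DoryError(Exception): ...
class DoryConfigError(DoryError): ...
class DoryValidationError(DoryError): ...

def to_http_error(exc: Exception):
    """Current behavior: every DoryValidationError collapses to 400 with the
    contract envelope, even though the API contract distinguishes cases like
    precondition-failed or quota-exceeded."""
    status = 400 if isinstance(exc, DoryValidationError) else 500
    return status, {"error": {"code": type(exc).__name__,
                              "message": str(exc),
                              "type": "validation" if status == 400 else "internal"}}

status, body = to_http_error(DoryValidationError("path escapes corpus root"))
```

Closing the gap with the contract would mean branching on the specific validation failure here instead of the exception class alone.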
- Vector search is SQLite + brute-force cosine, not ANN.
- TCP MCP enforces bearer auth via `params._auth.token` on the first request, sharing the HTTP `auth-tokens.json` file. Stdio MCP is process-local trust. The Claude Code bridge forwards HTTP bearer auth.
- Recall mode searches all sessions unless the caller supplies session scope.
- Critical write paths are atomic; some non-core scripts still use plain `write_text()`.
- `SubjectResolver` entries can go stale in long-running daemons.
- HTTP errors return the contract envelope `{"error": {"code", "message", "type"}}` for `/v1/*`, `/healthz`, `/metrics`, and `/v1/stream`. Wiki routes still use FastAPI's `{"detail": "..."}` because they're HTML-driven.
- Chunking splits on whitespace for token counts, not tiktoken.
- Chunk `overlap_ratio` is accepted but not implemented.