This document records design decisions and open questions that require stakeholder input.
Issue: When a project is re-discovered (POST /agent/discover for an existing project), the original project_token cannot be returned because only its SHA-256 hash is stored. The current behavior returns __existing__:<project_id> as the token value, which is not a usable token.
Question: What should happen on re-discovery of an existing project?
- Option A: Always regenerate the token (current behavior when regenerate_token: true)
- Option B: Return a sentinel indicating "project already registered; use your saved token or pass regenerate_token: true"
- Option C: Store the token encrypted rather than hashed (allows recovery but is a security trade-off)
Current behavior: Returns __existing__:<id> sentinel when project exists and regenerate_token is not set.
Option A is OK.
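A minimal sketch of why the original token cannot be handed back and what Option A amounts to. The in-memory `PROJECTS` dict, the function names, and the 32-byte token length are assumptions made for illustration, not the server's actual code.

```python
import hashlib
import secrets

# Hypothetical stand-in for the projects table: project name -> SHA-256 hex of the token.
PROJECTS = {}

def mint_token():
    """Return (plaintext_token, sha256_hex). Only the hash is meant to be stored."""
    token = secrets.token_hex(32)
    return token, hashlib.sha256(token.encode()).hexdigest()

def discover(project_name):
    """Option A: every discover call, first or repeated, mints a fresh token."""
    token, digest = mint_token()
    PROJECTS[project_name] = digest  # overwrites the old hash, so the old token stops working
    return token

def verify(project_name, presented):
    """Hash the presented token and compare it to the stored hash in constant time."""
    stored = PROJECTS.get(project_name)
    if stored is None:
        return False
    candidate = hashlib.sha256(presented.encode()).hexdigest()
    return secrets.compare_digest(candidate, stored)
```

Because only the digest is kept, the server can check a presented token but has nothing from which to reconstruct the plaintext; regenerating is the only way to hand a usable token back.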
Issue: The policy system stores agent_pattern, allowed_paths, and denied_paths, but the current Agent API endpoints do not enforce policies when fetching secrets. Policies are stored but not consulted at access time.
Question: Should policy enforcement be implemented?
- The data model is in place, but the enforcement logic (glob pattern matching on agent_id, path checking) is not wired into the /agent/secrets and /agent/config handlers.
- If yes: agent session tokens from /agent/authenticate would need to carry the agent_id and be validated against matching policies on each request.
Yes.
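One way the missing enforcement step could look, sketched with the three stored fields. The precedence rules here (a deny match beats an allow match, and no matching allow means deny) are assumptions, since the notes do not specify them; `Policy` and `is_allowed` are hypothetical names. Note that `fnmatchcase` lets `*` match across `/`.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase

@dataclass
class Policy:
    agent_pattern: str
    allowed_paths: list = field(default_factory=list)
    denied_paths: list = field(default_factory=list)

def is_allowed(policies, agent_id, key_path):
    """Assumed semantics: deny wins over allow; no matching allow means deny."""
    # Only policies whose glob matches the agent_id take part in the decision.
    matching = [p for p in policies if fnmatchcase(agent_id, p.agent_pattern)]
    if any(fnmatchcase(key_path, pat) for p in matching for pat in p.denied_paths):
        return False
    return any(fnmatchcase(key_path, pat) for p in matching for pat in p.allowed_paths)
```

A handler would call `is_allowed` with the agent_id carried by the session token before returning each secret.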
Issue: There are currently two separate auth flows:
- Agent auth flow: Ed25519 auth_proof → session_token (1-hour expiry)
- Project flow: project_token (no expiry, permanent until regenerated)
The session_token from agent authentication is not currently used to gate /agent/secrets or /agent/config. Only the project_token is used there.
Question: Should agent authentication be required before accessing project secrets?
- Combined flow: agent authenticates first, then uses session_token + project_token together?
- Separate flows: agents and projects are independent auth concerns?
Agents and projects are independent auth concerns. The /discover API should be called by the agent with a session token.
Issue: Currently, any project with a valid project_token can be mapped to ANY secret in the vault via the discover flow. There is no per-secret access control.
Question: Should secrets be namespaced or tagged with which projects/agents can access them?
- This would prevent one project from being mapped to secrets it shouldn't see.
Yes. Add a namespace for secrets, and also a namespace for projects/agents.
Resolved. The data model is now envelope-encrypted (per-row DEKs wrapped by an in-memory KEK).
POST /admin/rotate-key {"new_kek_password": "..."} re-derives the KEK from a new operator
passphrase, re-wraps every DEK, and bumps kek_version. Body ciphertexts are untouched, so
rotation is O(rows wrapped) rather than O(rows re-encrypted). The server must be restarted with
the new passphrase afterward to pick up the new KEK in memory.
Issue: SQLite with a single file does not support multiple concurrent writer instances. If the server needs to run as multiple replicas, a different database backend is needed.
Question: Is single-instance sufficient, or is horizontal scaling required?
- If scaling is needed: consider PostgreSQL backend (sqlx supports it with a feature flag change)
No, SQLite is OK currently.
Issue: cortex-cli uses std::os::unix::process::CommandExt::exec() which is Unix-only. The Windows equivalent is different.
Question: Is Windows support required?
- If yes: need to implement a Windows-compatible launcher (spawn child, wait, forward exit code)
No. Only Unix is supported.
Resolved. TLS is terminated in-process when both TLS_CERT_FILE and TLS_KEY_FILE
environment variables are set; otherwise the server falls back to plain HTTP for local dev.
Implementation uses tokio-rustls with PKCS8 private keys.
Resolved. Audit logs are deleted after a rolling 60-day window. The cleanup task runs
once a day from cortex-server/src/main.rs. Every state-changing API call is logged,
including KEK rotation, namespace lifecycle, and honey-token alarms.
Resolved. Each project_token now carries an explicit scope — the set of secret
key_paths frozen on the project row at discover time. /project/secrets filters its
response to the scope; a leaked token cannot read anything outside its mint-time scope.
The token is still a SHA-256-hashed opaque random string (not yet a signed Ed25519 JWT
— see open question #14).
Default TTL raised from 120 minutes to 14 days now that scope contains the blast radius.
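The scope filter and the TTL check described above can be sketched as follows. The dict-based secret store and both function names are illustrative assumptions; only the 14-day default comes from the text.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_TTL = timedelta(days=14)  # default TTL stated above

def filter_to_scope(secrets_by_path, scope):
    """Return only the secrets whose key_path is in the token's frozen scope."""
    allowed = frozenset(scope)
    return {path: value for path, value in secrets_by_path.items() if path in allowed}

def token_is_live(minted_at, now, ttl=DEFAULT_TTL):
    """A token is usable strictly before minted_at + ttl."""
    return now < minted_at + ttl
```

Because the scope is fixed at discover time, secrets added to the vault later stay invisible to an existing token until it is re-minted.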
Resolved. Audit rows are HMAC-SHA256 chained: each row's entry_mac covers
prev_hash || canonical_payload, with the running tail MAC stored in audit_mac_state.
The MAC key is derived from the KEK using HMAC with a fixed domain separator
(cortex-auth/audit-mac-v1).
Audit rows also carry optional caller metadata (caller_pid, caller_binary_sha256,
caller_argv_hash, caller_cwd, caller_git_commit, source_ip, hostname, os)
populated from X-Cortex-Caller-* request headers.
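A minimal sketch of the chaining and of the tamper check that walks it. Several details are assumptions not confirmed by the notes: that prev_hash is the previous row's entry_mac, that the first row chains from 32 zero bytes, that the canonical payload is sorted-key JSON, and that the MAC key is HMAC-SHA256 keyed by the KEK over the domain separator.

```python
import hashlib
import hmac
import json

DOMAIN = b"cortex-auth/audit-mac-v1"  # domain separator named above
GENESIS = b"\x00" * 32                # assumed prev_hash for the first row

def derive_mac_key(kek):
    """Assumption: key = HMAC-SHA256(key=KEK, msg=domain separator)."""
    return hmac.new(kek, DOMAIN, hashlib.sha256).digest()

def canonical(payload):
    """Assumption: canonical form is compact JSON with sorted keys."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()

def append_row(chain, mac_key, payload):
    """entry_mac = HMAC(mac_key, prev_hash || canonical_payload)."""
    prev_hash = chain[-1]["entry_mac"] if chain else GENESIS
    entry_mac = hmac.new(mac_key, prev_hash + canonical(payload), hashlib.sha256).digest()
    chain.append({"payload": payload, "prev_hash": prev_hash, "entry_mac": entry_mac})

def verify_chain(chain, mac_key):
    """Return the index of the first row that fails, or None if every row verifies."""
    prev_hash = GENESIS
    for i, row in enumerate(chain):
        expected = hmac.new(mac_key, prev_hash + canonical(row["payload"]), hashlib.sha256).digest()
        if row["prev_hash"] != prev_hash or not hmac.compare_digest(expected, row["entry_mac"]):
            return i
        prev_hash = row["entry_mac"]
    return None
```

Editing or deleting any row changes the MAC input of that row or of its successor, so the walk stops at the first inconsistent index.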
Open:
- Verification CLI: a cortex-cli verify-audit subcommand that walks the audit log and recomputes each entry MAC to detect tampering would be high-value but is not yet built.
- External anchoring: the design suggests periodically pinning the chain tail to an external commit (e.g. a Git repo) so an internal admin who holds the audit MAC key cannot rewrite history. Not implemented.
- The CLI does not yet populate the X-Cortex-Caller-* headers, so audit rows from cortex-cli run only carry source_ip (when behind a proxy that sets X-Forwarded-For).
Resolved. Secrets carry an is_honey_token boolean. A read attempt against a
honey-token immediately:
- Revokes the calling project's token (sets token_revoked_at).
- Writes an alarm-status row to the audit log (action="honey_token_access").
- Returns a generic 401 to the caller.
- Dispatches an outbound notification to every enabled channel.
Outbound notifications (resolved April 2026): notification_channels table holds
envelope-encrypted channel configs managed via the dashboard's Notifications tab and
POST /admin/notification-channels. Channel types:
- slack: incoming webhook
- discord: incoming webhook
- telegram: Bot API (bot_token + chat_id)
- email: pipes the message to himalaya-cli on stdin (when on PATH)
Dispatch is fire-and-forget per channel in a tokio task; a slow webhook never blocks
the calling request handler. See cortex-server/src/notifications.rs.
Still open: severity filters / per-channel templating / retry-with-backoff. Today every channel receives the same plain-text payload for both honey-token alarms and recovery-boot events. Tracked under #20.
Resolved. Every agent authenticates with an Ed25519 keypair. Migration
007_ed25519_and_devices.sql introduced the agent_pub column;
008_drop_legacy_jwt_secret.sql made it NOT NULL and dropped the legacy
HMAC jwt_secret_encrypted / wrapped_dek columns from agents. Agents
without an Ed25519 public key were dropped during that migration and must
re-register.
Registration (POST /admin/agents) requires agent_pub (base64url-encoded
Ed25519 public key). /agent/discover requires the request body to include
ts and nonce and verifies auth_proof as an Ed25519 signature over
ts | nonce | agent_id | /agent/discover. The ts must be within ±5 minutes
of the server clock (drop-replay window).
CLI:
- cortex-cli gen-key --agent-id <id> writes a private key to ~/.cortex/agent-<id>.key (mode 0600) and prints the base64url public key on stdout.
- cortex-cli sign-proof --agent-id <id> --priv-key-file <path> prints a JSON {ts, nonce, auth_proof} ready to splice into the discover body.
Still open:
- Replay nonce caching (an LRU of (agent_id, nonce) for the 5-minute window): the current path only enforces the timestamp bound. Tracked in NEXT_STEPS #4.
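The open nonce-cache item could be sketched like this. It is a bounded, oldest-first cache rather than a strict LRU, and the class name, capacity, and retention period are assumptions. Entries are kept for twice the window because a request stamped 5 minutes ahead of the server clock stays acceptable for up to 10 minutes after it is first seen.

```python
import time
from collections import OrderedDict

WINDOW_SECONDS = 300  # the ±5 minute bound described above

class NonceCache:
    """Bounded cache of (agent_id, nonce) pairs seen inside the replay window."""

    def __init__(self, max_entries=10_000):
        self.max_entries = max_entries
        self._seen = OrderedDict()  # (agent_id, nonce) -> time first seen

    def check_and_store(self, agent_id, nonce, ts, now=None):
        """Return True if the request is fresh and not a replay, else False."""
        now = time.time() if now is None else now
        if abs(now - ts) > WINDOW_SECONDS:
            return False  # outside the timestamp window
        # Drop entries old enough that their timestamp can no longer pass the check.
        while self._seen:
            _, first_seen = next(iter(self._seen.items()))
            if now - first_seen <= 2 * WINDOW_SECONDS:
                break
            self._seen.popitem(last=False)
        key = (agent_id, nonce)
        if key in self._seen:
            return False  # replay of a pair already accepted
        self._seen[key] = now
        if len(self._seen) > self.max_entries:
            self._seen.popitem(last=False)  # capacity eviction; weakens the guarantee under flood
        return True
```

A multi-replica deployment would need a shared store instead of process memory; with the single-instance SQLite setup an in-process cache is enough.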
Resolved (basic). The migration adds server_keys (envelope-encrypted server signing
key) and revoked_token_jti. On first boot the server generates an Ed25519 keypair,
stores it sealed under the KEK, and exposes the public key at
GET /.well-known/jwks.json (kid-versioned).
POST /agent/discover accepts signed_token: true. When set, the response carries an
EdDSA-signed JWT in signed_project_token alongside the legacy random project_token.
Claims: {iss, sub, aud, iat, exp, jti, scope, namespace, project_id}. Both formats are
accepted on /project/* — the verification path branches on whether the bearer token has
3 dot-separated segments (JWT) or not (legacy hex hash compare).
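A sketch of the format branch, with the legacy comparison included for contrast. Requiring all three segments to be non-empty is an added assumption, and both function names are hypothetical; JWT signature verification is out of scope here.

```python
import hashlib
import hmac

def classify_bearer(token):
    """Return "jwt" for three non-empty dot-separated segments, otherwise "legacy"."""
    parts = token.split(".")
    if len(parts) == 3 and all(parts):
        return "jwt"
    return "legacy"

def verify_legacy(token, stored_sha256_hex):
    """Legacy path: hash the opaque token and compare with the stored hex digest."""
    candidate = hashlib.sha256(token.encode()).hexdigest()
    return hmac.compare_digest(candidate, stored_sha256_hex)
```

The branch is purely syntactic, so a token classified as "jwt" still has to pass signature, expiry, and revocation checks before it is trusted.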
Revocation is via revoked_token_jti (checked on every signed-token request). The
/admin/projects/<name>/revoke endpoint also continues to set token_revoked_at for the
legacy path.
Still open:
- Insert a revoked_token_jti row when an admin revokes a project: today only legacy tokens are revoked; signed JWTs are revoked only by setting exp in the past or by waiting for natural expiry. The migration column exists; wiring the admin handler is small but not yet done.
- Body-replay protection (ts + nonce + path + method covered by an HMAC over the bearer): the design calls for this on every request; today only the discover path uses ts/nonce.
Resolved. The sharks crate provides a (m, n) Shamir split/recover primitive.
- POST /admin/shamir/generate {threshold, shares} splits the running KEK into n shares with threshold m and returns them once. The server retains nothing; the response carries a warning field telling the operator to distribute immediately.
- CORTEX_RECOVERY_MODE=1 CORTEX_RECOVERY_THRESHOLD=<m> boots in recovery mode. The server prompts for m shares interactively on stdin (echo disabled via rpassword), reconstructs the KEK, verifies the on-disk sentinel, and either succeeds or refuses to bind the listener.
- A successful recovery boot writes an alarm-status recovery_boot row to the audit log and dispatches notifications to every enabled channel (#12).
See cortex-server/src/shamir.rs and cortex-server/src/kek.rs::unseal_via_recovery.
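To illustrate the (m, n) threshold property, here is a small Shamir sketch over a prime field. This is not the sharks crate's arithmetic (to my knowledge it works byte-wise over GF(256)), and the prime, the integer encoding of the secret, and the function names are assumptions chosen to keep the example short.

```python
import secrets

PRIME = 2**521 - 1  # Mersenne prime, comfortably larger than a 256-bit secret

def split(secret, threshold, shares):
    """Split an integer secret into `shares` points; any `threshold` of them recover it."""
    if not (1 <= threshold <= shares) or not (0 <= secret < PRIME):
        raise ValueError("invalid parameters")
    # Random polynomial of degree threshold-1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]

    def evaluate(x):
        acc = 0
        for c in reversed(coeffs):  # Horner's rule
            acc = (acc * x + c) % PRIME
        return acc

    return [(x, evaluate(x)) for x in range(1, shares + 1)]

def recover(points):
    """Lagrange interpolation at x = 0 over the prime field."""
    secret = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+).
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret
```

Recovery with fewer than `threshold` points still returns a number, just not the secret, which is why the recovery boot verifies the on-disk sentinel before binding the listener.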
Still open:
- Multi-file share input (today: stdin only) — operators sometimes prefer dropping share files into a directory. Easy to add as a fallback.
- Restoring the KEK and the operator password: the current recovery boot only puts the KEK back in memory; rotating to a fresh password requires running POST /admin/rotate-key from the recovered instance.
Resolved (full). The daemon now holds project tokens transparently;
cortex-cli run no longer accepts --token/--agent-id/--priv-key-file.
Server endpoints (in cortex-server/src/api/agent.rs + admin.rs):
- POST /device/authorize issues device_code + user_code (10 min TTL).
- POST /device/token is polled by the daemon; it returns authorization_pending until approved, then mints an EdDSA JWT access token bound to the agent_id.
- GET /device serves a minimal HTML approval form.
- POST /admin/web/device/approve performs admin-token-gated approval (binds user_code → agent_id).
- GET /admin/devices lists pending + enrolled devices.
- DELETE /admin/devices/:agent_id revokes a device.
CLI (cortex-cli daemon login | status | logout) implements the OAuth 2.0 device-grant
client flow, persisting the access token at ~/.cortex/daemon-session.json (mode 0600).
cortex-daemon (separate binary in the cortex-cli crate) listens on
~/.cortex/agent.sock (mode 0600) with a single-line JSON protocol:
- {"cmd":"status"} → returns the cached daemon-session JSON + the live attestation session_id.
- {"cmd":"run","program":..,"args":..,"project":..,"url":..} → daemon discovers the project token internally (auto-rotating on expiry), fetches secrets, spawns the program with the env vars injected, returns {"ok":true,"exit_code":N} after the child exits. The raw secret values never travel back over the socket and the CLI never sees a project token. Pending admin approval is propagated as {"ok":false,"error_code":"pending_approval","grant_id":..,"requested_keys":[..]}.
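A client-side sketch of the message framing for the run command. Newline termination is an assumption drawn from "single-line JSON", and `encode_request` / `decode_run_response` are hypothetical helper names; the socket I/O itself is left out.

```python
import json

def encode_request(cmd, **fields):
    """Serialize one request as a single line of JSON terminated by a newline."""
    # json.dumps escapes embedded newlines, so the payload always fits on one line.
    line = json.dumps({"cmd": cmd, **fields}, separators=(",", ":"))
    return (line + "\n").encode()

def decode_run_response(raw):
    """Parse one response line for a run command into (kind, detail)."""
    msg = json.loads(raw.decode())
    if msg.get("ok") is True:
        return ("exited", msg.get("exit_code"))
    if msg.get("error_code") == "pending_approval":
        detail = {
            "grant_id": msg.get("grant_id"),
            "requested_keys": msg.get("requested_keys", []),
        }
        return ("pending_approval", detail)
    return ("error", msg.get("error_code"))
```

A CLI built on this would forward the exit code on "exited" and print the grant_id on "pending_approval" so the operator knows which grant to approve.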
First-access approval (pending_grants): /agent/discover no longer
auto-issues tokens for unfamiliar (agent_id, project_name) pairs. The first
discover creates a row in pending_grants (status=pending) and notifies the
configured channels. After the admin approves via POST /admin/pending-grants/:id/approve,
subsequent discovers within a 30-day window auto-pass as long as the requested
scope is a subset of the approved keys; broader scopes re-trigger approval.
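The auto-pass rule reads as a three-part check, sketched below. The shape of the `grant` record (a dict with status, approved_keys, approved_at) and the function name are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

APPROVAL_WINDOW = timedelta(days=30)  # window stated above

def discover_decision(grant, requested_keys, now):
    """Return "issue" when a token may be auto-issued, otherwise "pending".

    `grant` is None or a dict with "status", "approved_keys", "approved_at"
    (hypothetical shape standing in for a pending_grants row).
    """
    if grant is None or grant["status"] != "approved":
        return "pending"  # first access, or still awaiting the admin
    if now - grant["approved_at"] > APPROVAL_WINDOW:
        return "pending"  # approval has aged out
    if not set(requested_keys) <= set(grant["approved_keys"]):
        return "pending"  # broader scope re-triggers approval
    return "issue"
```

A narrower request passes while a single extra key sends the whole request back for approval.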
Daemon hardening implemented:
- prctl(PR_SET_DUMPABLE, 0) + mlockall(MCL_CURRENT|MCL_FUTURE) at process start.
- Per-connection SO_PEERCRED check rejects callers whose UID does not match the daemon's.
- Project token cache persisted at ~/.cortex/daemon-projects.json (mode 0600).
Still open:
- SSO integration at /auth/oidc/*: today device approval is admin-token-gated, not user-bound.
- Frequency limits (1 authorize/min/IP, 5 devices/user/day): rate limiting is generally absent; tracked under NEXT_STEPS #3.
- inject_template and ssh_proxy socket commands.
- systemd unit hardening (MemoryDenyWriteExecute=yes, NoNewPrivileges=yes, ProtectKernelTunables=yes) and PT_DENY_ATTACH on macOS: recipe is in docs/USAGE.md.
Resolved. Implemented end-to-end via the new /daemon/attest endpoint and
the X-Daemon-Attestation request header.
Schema (migration 010_pending_grants_and_daemon_attestation.sql):
- daemon_sessions(session_id, agent_id, attestation_pub, binary_sha256, daemon_version, daemon_pid, daemon_uid, hostname, created_at, expires_at, revoked_at): 24-hour TTL per session.
- allowed_daemon_versions(binary_sha256, version, description, enabled, created_at): operator-curated allowlist; an empty table means "not enforced", so existing deployments do not break on upgrade.
- daemon_attest_seen_jti(jti, seen_at): 5-minute replay-protection cache.
Daemon side (cortex-cli/src/daemon.rs):
- At startup the daemon hardens the process (PR_SET_DUMPABLE=0, mlockall), loads its access_token, decodes the sub claim to learn agent_id, and loads the agent Ed25519 private key from ~/.cortex/agent-<id>.key.
- It computes its own binary SHA-256 from /proc/self/exe.
- It generates an ephemeral Ed25519 attestation keypair; the private half never touches disk.
- It posts {attestation_pub, binary_sha256, daemon_version, daemon_pid, daemon_uid, hostname} to /daemon/attest and receives a session_id.
- Every sensitive HTTP request afterwards carries X-Daemon-Attestation: <session_id>.<ts>.<jti>.<body_sha256>.<sig_b64>, where sig is Ed25519 over "<ts>|<jti>|<METHOD>|<path>|<body_sha256>".
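The header layout and signing input can be sketched as below. The Ed25519 signing step is omitted because it needs a third-party library, so `signature` is treated as opaque bytes. Hex encoding of body_sha256, unpadded base64url for the signature, and the absence of dots in session_id and jti are all assumptions.

```python
import base64
import hashlib

def signing_input(ts, jti, method, path, body):
    """Build "<ts>|<jti>|<METHOD>|<path>|<body_sha256>" and return it with the body hash."""
    body_sha256 = hashlib.sha256(body).hexdigest()
    message = f"{ts}|{jti}|{method.upper()}|{path}|{body_sha256}".encode()
    return message, body_sha256

def build_header(session_id, ts, jti, body_sha256, signature):
    """Join the five fields with dots; the signature is base64url without padding."""
    sig_b64 = base64.urlsafe_b64encode(signature).rstrip(b"=").decode()
    return f"{session_id}.{ts}.{jti}.{body_sha256}.{sig_b64}"

def parse_header(value):
    """Split the header back into its fields, restoring base64 padding for the signature."""
    parts = value.split(".")
    if len(parts) != 5:
        raise ValueError("expected 5 dot-separated fields")
    session_id, ts, jti, body_sha256, sig_b64 = parts
    padded = sig_b64 + "=" * (-len(sig_b64) % 4)
    return {
        "session_id": session_id,
        "ts": int(ts),
        "jti": jti,
        "body_sha256": body_sha256,
        "signature": base64.urlsafe_b64decode(padded),
    }
```

Binding the method, path, and body hash into the signed message means a captured header cannot be reused on a different request even within the timestamp window.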
Server side (cortex-server/src/api/daemon.rs):
- verify_attestation_header(state, headers, method, path, body) validates the ts window (±5 min), recomputes body_sha256, looks up the session, checks revoked_at + expires_at, verifies the signature with the registered attestation_pub, and rejects replays via INSERT OR IGNORE against daemon_attest_seen_jti. Best-effort cleanup deletes jtis older than 10 min.
- Allowlist enforcement: when allowed_daemon_versions is non-empty, an attest call with a SHA-256 not in the allowlist or with enabled=0 returns 403 with error_code: binary_not_allowed / binary_disabled.
Dashboard: a "Daemon Allowlist" page lists active sessions and lets operators add or remove approved binary hashes (admin-token gated).
Design intent (UPDATED_DESIGN.md §4): Replace the single static ADMIN_TOKEN with
multi-user accounts where each admin has a namespace scope. Super-admins can manage all
namespaces; regular admins only their own.
Status: NOT IMPLEMENTED. Today the ADMIN_TOKEN is a single shared bearer token
checked by check_admin_token(). Migrating requires:
- admin_users table (id, email, password_hash via Argon2id mid-tier, namespace_scope, is_super, created_at).
- /auth/login and session cookie / OIDC integration.
- All admin handlers gated by namespace scope check.
- Migration story for the bootstrap super-admin.
Tied to #14. Once project tokens are Ed25519-signed, the server's signing key needs to
rotate without breaking in-flight tokens. The standard answer is JWKS at
GET /.well-known/jwks.json with kid versioning. Daemons cache JWKS for 24 hours.
Resolved (basic) — see #12. Slack / Discord / Telegram / email-via-himalaya channels are now implemented and dispatched on every honey-token access (and on Shamir recovery boots — #15).
Still open:
- Per-namespace channel mapping (today every channel receives every event).
- Severity filters so a slack channel can opt into honey-token alarms but not recovery-boot pages.
- Per-channel payload templating (today every channel gets the same plain-text body).
- Retry / exponential backoff (today a single attempt; failures land in tracing::warn!).
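For the retry item, one common shape is exponential backoff with full jitter, sketched here. The attempt count, base delay, cap, and function names are illustrative assumptions, not a decided design.

```python
import random
import time

def backoff_delay(attempt, base=1.0, cap=60.0):
    """Full-jitter backoff: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

def send_with_retry(send, max_attempts=4, sleep=time.sleep):
    """Call send() until it succeeds or max_attempts is reached; re-raise the last error."""
    for attempt in range(max_attempts):
        try:
            return send()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt))
```

Since dispatch already runs per channel in its own task, retries of this kind would delay only the failing channel and never the request handler.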
The server accepts X-Cortex-Caller-Pid, X-Cortex-Caller-Binary-SHA256,
X-Cortex-Caller-Argv-Hash, X-Cortex-Caller-Cwd, X-Cortex-Caller-Git-Commit,
X-Cortex-Hostname, X-Cortex-Os and stores them on the audit row. The current
cortex-cli does not populate them yet — when added it should:
- compute its own SHA-256 from /proc/self/exe (Linux) or _NSGetExecutablePath() (macOS);
- hash argv so the audit row records which invocation;
- read the cwd and git rev-parse HEAD if a .git directory is present;
- include hostname and uname -s -m.
These fields are advisory — a malicious caller can lie — but combined with the source
IP and signed project_token/auth_proof they make forensic post-mortem dramatically
faster.
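A sketch of how a caller could assemble most of these headers. Joining argv with NUL bytes before hashing and the "system machine" format for the OS value are assumptions; the Git commit header is left out because it requires shelling out to git. In Python the "binary" defaults to the interpreter path, which stands in for /proc/self/exe.

```python
import hashlib
import os
import platform
import socket
import sys

def sha256_file(path, chunk=1 << 16):
    """Stream a file through SHA-256 and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def caller_headers(argv=None, binary_path=None):
    """Collect advisory caller metadata as X-Cortex-* request headers."""
    argv = sys.argv if argv is None else argv
    binary_path = binary_path or sys.executable
    # Assumption: argv elements are joined with NUL so ["a b"] and ["a", "b"] hash differently.
    argv_bytes = "\0".join(argv).encode("utf-8", "surrogateescape")
    return {
        "X-Cortex-Caller-Pid": str(os.getpid()),
        "X-Cortex-Caller-Binary-SHA256": sha256_file(binary_path),
        "X-Cortex-Caller-Argv-Hash": hashlib.sha256(argv_bytes).hexdigest(),
        "X-Cortex-Caller-Cwd": os.getcwd(),
        "X-Cortex-Hostname": socket.gethostname(),
        "X-Cortex-Os": f"{platform.system()} {platform.machine()}",
    }
```

Hashing argv rather than sending it verbatim keeps command-line secrets out of the audit log while still letting two invocations be told apart.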