Commit cd366bc

rajivml and claude committed
docs(MIGRATION.md): reflect the darwin-kubernetes port
§5b — Dask topology section now points at the actual ported darwin-kubernetes/*.yaml manifests with a concrete cutover script, not just "you'll need to port these later" boilerplate.

§10a — Footgun is RESOLVED for the Darwin path (the 5 new Darwin manifests all wire REDIS_PASSWORD via optional secretKeyRef). Marks the entry as such rather than removing it, so the history of "why was this previously a concern" stays readable.

§12 — Commit count, file count, and totals updated for the two new commits (MIGRATION.md itself + the darwin-kubernetes port).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 017f127 commit cd366bc

1 file changed

Lines changed: 69 additions & 43 deletions

MIGRATION.md

@@ -122,33 +122,60 @@ as the `redis` ClusterIP Service on 6379. Pod restart drops the cache
 — that's intentional; the source of truth is Postgres, and counters
 self-heal as windows expire.
 
-### 5b. Dask scaling topology (optional, future)
-
-The bg-scaling commit adds these in `deployment/kubernetes/` (the
-upstream-style path, **not** the darwin-specific path):
-
-- `background-beat-deployment.yaml`
-- `background-celery-deployment.yaml`
-- `background-indexer-scheduler-deployment.yaml`
-- `dask-scheduler-service-deployment.yaml`
-- `dask-worker-deployment.yaml`
-- `docker-compose.dask-distributed.yml` (compose variant)
+### 5b. Dask scaling topology (opt-in, Darwin manifests ready)
+
+The bg-scaling commit added 5 upstream-style manifests under
+`deployment/kubernetes/` (legacy / reference tree; **not** what
+Darwin applies from — see AGENTS.md "Critical fact §9"). A later
+commit on this branch ported each one to Darwin conventions under
+`darwin-kubernetes/`, with the right image registry, configmap /
+secrets wiring, REDIS_PASSWORD (optional), indexcpu node affinity,
+darwin/indexing toleration, and PVCs:
+
+- `darwin-kubernetes/background-beat-deployment.yaml`
+- `darwin-kubernetes/background-celery-deployment.yaml`
+- `darwin-kubernetes/background-indexer-scheduler-deployment.yaml`
+- `darwin-kubernetes/dask-scheduler-service-deployment.yaml`
+- `darwin-kubernetes/dask-worker-deployment.yaml`
+
+Plus `deployment/docker_compose/docker-compose.dask-distributed.yml`
+(compose variant, for local reproduction of the remote-scheduler
+topology — not part of the prod deploy).
 
 Darwin currently runs `darwin-kubernetes/background-deployment.yaml`
-(a single combined background pod). **The new manifests are NOT
-applied automatically** and don't conflict with the existing
-`background-deployment.yaml`.
+(a single combined beat+celery+indexer pod via supervisord). **The new
+manifests are NOT applied automatically** by `kubectl apply -f
+darwin-kubernetes/` because the combined deployment is still in place
+— you apply each new file explicitly when you want to switch.
 
-If/when you want to switch Darwin to the new topology:
+To switch Darwin to the split topology:
 
-1. Mirror the Redis env wiring from `darwin-kubernetes/background-deployment.yaml`
-into each of the new manifests (none of them currently have
-`REDIS_PASSWORD` wired — see §10).
-2. Apply the new manifests, scale the old background-deployment to 0.
-3. Verify each pod boots and the worker logs show clean startup.
+```bash
+# 1. Apply the new five (order doesn't matter; they self-discover
+# the scheduler Service once it's up).
+kubectl apply -f darwin-kubernetes/dask-scheduler-service-deployment.yaml
+kubectl apply -f darwin-kubernetes/dask-worker-deployment.yaml
+kubectl apply -f darwin-kubernetes/background-beat-deployment.yaml
+kubectl apply -f darwin-kubernetes/background-celery-deployment.yaml
+kubectl apply -f darwin-kubernetes/background-indexer-scheduler-deployment.yaml
+
+# 2. Wait for all five to be Ready.
+kubectl get pods -l 'app in (background-beat,background-celery,background-indexer-scheduler,dask-scheduler,dask-worker)'
+
+# 3. Once healthy + you've seen an indexing attempt dispatch through
+# the new dask-scheduler-service (check the indexer-scheduler
+# pod logs), scale the old combined deployment to 0:
+kubectl scale deploy/background-deployment --replicas=0
+
+# 4. If anything goes wrong, scale back up:
+kubectl scale deploy/background-deployment --replicas=1
+# The split pods will keep running but no harm — only one set is
+# actually doing the work (whichever has --replicas > 0).
+```
 
-Out of scope for this PR; the relevant files are present so the
-migration is a one-step "apply" later.
+Both deployments can coexist briefly during cutover, but **do NOT
+run both at non-zero replicas long-term** — two beat schedulers on
+the same Postgres broker fire every crontab task twice.
 
 ---
 
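
The double-beat hazard described in §5b (both the combined and the split deployment left at non-zero replicas) can be checked mechanically during cutover. The following is a minimal sketch and not part of the commit. It assumes the split beat Deployment is named `background-beat-deployment`; the diff only gives the manifest filename and the `app=background-beat` label, so adjust the name to whatever the manifest actually declares. The function names are illustrative.

```bash
# Sketch: detect the "two beat schedulers" state during cutover.
# ASSUMPTION: the split beat Deployment is named "background-beat-deployment".
# The old combined Deployment name "background-deployment" comes from the
# cutover script in MIGRATION.md.

# Returns 0 (true) only when both replica counts are greater than zero.
both_beats_active() {
  local old_replicas="${1:-0}" new_replicas="${2:-0}"
  [ "$old_replicas" -gt 0 ] && [ "$new_replicas" -gt 0 ]
}

# Prints .spec.replicas for a Deployment; prints 0 if the lookup fails.
replicas_of() {
  kubectl get deploy "$1" -o jsonpath='{.spec.replicas}' 2>/dev/null || echo 0
}

# Prints a warning and returns 1 when both beat schedulers are scaled up.
check_beat_overlap() {
  local old new
  old="$(replicas_of background-deployment)"
  new="$(replicas_of background-beat-deployment)"   # hypothetical name
  if both_beats_active "${old:-0}" "${new:-0}"; then
    echo "WARNING: combined and split beat are both scaled up (old=${old} new=${new})"
    return 1
  fi
  echo "OK: at most one beat scheduler is scaled up (old=${old:-0} new=${new:-0})"
}
```

Running `check_beat_overlap` after step 3 of the cutover script should report `OK`; a `WARNING` while both sets are briefly up is expected, but it should not persist.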

@@ -274,25 +301,21 @@ All features default OFF means even without revert, setting all
 
 ## 10. Known footguns
 
-### 10a. Bg-scaling k8s manifests don't have `REDIS_PASSWORD` wired
-
-The new `deployment/kubernetes/{background-celery,background-beat,
-background-indexer-scheduler,dask-scheduler-service,dask-worker}-deployment.yaml`
-files don't include the `REDIS_PASSWORD` env var pattern that the
-existing `darwin-kubernetes/background-deployment.yaml` has.
-
-- **Today's impact: none.** Darwin runs the darwin-kubernetes/ tree,
-not the upstream-style deployment/kubernetes/ tree, and none of the
-bg-scaling processes currently invoke persona-mutating code paths
-that would need Redis access.
-- **Becomes a real concern if** Darwin adopts the new topology AND
-`PERSONA_CACHE_ENABLED=true` AND a future Celery task ever mutates a
-Persona / Persona__User / Persona__UserGroup row. In that scenario
-the mutation succeeds, the cache bust logs a warning, and `/persona`
-serves stale data for up to 24h (TTL backstop).
-- **Fix when relevant**: mirror the `REDIS_PASSWORD` `secretKeyRef`
-block from `darwin-kubernetes/background-deployment.yaml` into each
-of the new manifests.
+### 10a. ~~Bg-scaling k8s manifests don't have `REDIS_PASSWORD` wired~~ — RESOLVED
+
+**Closed for the Darwin path.** The 5 ported manifests under
+`darwin-kubernetes/` (added in `19335e31`) all wire `REDIS_PASSWORD`
+via `secretKeyRef` with `optional: true`, matching the existing
+`darwin-kubernetes/background-deployment.yaml` pattern. So persona-
+cache invalidation from any future Celery / indexer-scheduler /
+dask-worker task path will work correctly once you switch to the
+split topology.
+
+The upstream `deployment/kubernetes/*` files are still missing
+`REDIS_PASSWORD` env wiring, but **Darwin doesn't apply from that
+tree** — it's reference-only (see AGENTS.md "Critical fact §9").
+Leave them alone unless/until you adopt the upstream-style
+deployment shape outside Darwin.
 
 ### 10b. `backend/scripts/seed_assistants.py` bypasses persona-cache invalidation
 
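
The §10a claim that every ported manifest wires `REDIS_PASSWORD` can be spot-checked from a checkout before cutover. This is a minimal sketch under stated assumptions, not part of the commit: it is a plain text search for the variable name, so it does not verify the `secretKeyRef` or `optional: true` details, and the function name is illustrative.

```bash
# Sketch: list manifests that never mention REDIS_PASSWORD.
# ASSUMPTION: a textual match on the variable name is a good enough
# first-pass check; it does not parse the YAML or inspect secretKeyRef.

# Prints each given file that contains no REDIS_PASSWORD reference.
manifests_missing_redis_password() {
  local f
  for f in "$@"; do
    grep -q 'REDIS_PASSWORD' "$f" || echo "$f"
  done
}
```

Called as `manifests_missing_redis_password darwin-kubernetes/background-*-deployment.yaml darwin-kubernetes/dask-*-deployment.yaml`, empty output would be consistent with the entry being resolved; any printed path deserves a closer look.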

@@ -353,11 +376,14 @@ These need eyes — automated coverage doesn't catch them:
 
 ## 12. Branch contents at-a-glance
 
-14 commits on top of `rajiv/add-claude` (PR #45):
+16 commits on top of `rajiv/add-claude` (PR #45):
 
 ```
+[BG-scale] darwin-kubernetes: port split-background manifests + lock convention in AGENTS.md
 [BG-scale] Scale indexing via remote Dask scheduler topology
 
+[Docs] docs: add MIGRATION.md covering Redis / bg-scaling / UX
+
 [UX] Gallery: column picker as dropdown to match Sort
 [UX] Gallery: user-controllable column count (segmented control, persists)
 [UX] Show document-set names on assistant cards (was: count only)
@@ -374,4 +400,4 @@ These need eyes — automated coverage doesn't catch them:
 [Redis] docs: add Redis caching & scaling plan
 ```
 
-Total: **45 files changed, +5857 / −499**. 63 unit tests pass.
+Total: **51 files changed, +6372 / −499**. 63 unit tests pass.
