Commit 017f127

rajivml and claude committed
darwin-kubernetes: port split-background manifests + lock convention in AGENTS.md
The bg-scaling commit (03d1649) added 5 new k8s manifests under `deployment/kubernetes/` that split the combined background pod into beat / celery / indexer-scheduler / dask-scheduler / dask-worker. But Darwin doesn't apply from `deployment/kubernetes/`: its prod manifests live under `darwin-kubernetes/`, and the two trees aren't kept in sync.

Porting all five into `darwin-kubernetes/` with Darwin conventions:

- Image registry sfbrdevhelmweacr.azurecr.io/danswer/danswer-backend
- configMapRef env-configmap, secretKeyRef danswer-secrets
- POSTGRES_USER / POSTGRES_PASSWORD wired everywhere that talks to PG
- REDIS_PASSWORD wired as optional secretKeyRef (the latent footgun flagged in MIGRATION.md §10a is now closed for the Darwin path)
- indexcpu nodeAffinity + darwin/indexing toleration on every indexing-side pod (celery, indexer-scheduler, dask-scheduler, dask-worker); beat stays on the default pool (lightweight)
- dynamic-pvc + file-connector-pvc volume mounts where any task may stage files

The existing `darwin-kubernetes/background-deployment.yaml` (combined beat+celery+indexer via supervisord) is intentionally LEFT IN PLACE: the split is an opt-in rollout, not a forced cutover. To switch: apply the new five, verify the new pods are healthy, scale the combined deployment to 0.

Also lock the convention in AGENTS.md so this doesn't recur:

- New divergence-table row noting darwin-kubernetes/ is source of truth for prod.
- New "Critical facts that bite" §9 documenting the two-tree split, when to touch which, and the per-pod adaptation checklist (image registry, configmap, secrets, REDIS_PASSWORD, affinity, PVCs).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 5ce3f49 commit 017f127

6 files changed

Lines changed: 515 additions & 0 deletions

AGENTS.md

Lines changed: 30 additions & 0 deletions
@@ -75,6 +75,7 @@ moved on substantially. This table is the explicit map.
 | Test buckets | `backend/tests/{unit,external_dependency_unit,integration}` + Playwright e2e | No comparable structure here. Most code lacks tests; add tests with the change if practical, otherwise note in PR. |
 | Plan template | The "Creating a Plan" section in their `CLAUDE.md` (Issues / Notes / Strategy / Tests) | Useful template; can be borrowed for non-trivial changes here too. |
 | Frontend stack | Next.js 15+, React 18+ | Next.js 14.2.x (App Router), React 18 |
+| K8s manifest path | `deployment/kubernetes/*` is what upstream documents | **`darwin-kubernetes/*` is the source of truth for the Darwin prod cluster.** `deployment/kubernetes/*` is upstream legacy / scratch; Darwin doesn't apply from there. New manifests for Darwin go in `darwin-kubernetes/`. See critical fact §9. |
 
 **Rule of thumb when reading upstream code or upstream guidance:** assume
 it doesn't apply unless you can verify the same construct exists here.
@@ -340,6 +341,35 @@ auto-parse entirely with a raw `requests.get` against the
 `/drives/{drive_id}/items/{item_id}/content` endpoint using the bearer
 token. Don't reintroduce the lossy re-serialization.
 
+### 9. `darwin-kubernetes/` is the source of truth for the Darwin cluster
+
+The repo has two parallel k8s manifest trees and they are **not** kept
+in sync:
+
+| Path | What it is | When to touch |
+|---|---|---|
+| `darwin-kubernetes/*.yaml` | **The actual manifests applied to Darwin's AKS cluster (the `darwin` kube context).** Image registry is `sfbrdevhelmweacr.azurecr.io/...`, configmap is `env-configmap`, secrets is `danswer-secrets`, indexing pods have `indexcpu`-pool affinity + `darwin/indexing` toleration, env vars come from the Darwin configmap. | **Edit here for any prod-affecting change**, including new deployments. |
+| `deployment/kubernetes/*.yaml` | Upstream-style manifests inherited from Onyx / authored to match the OSS docker-compose. Generic image (`danswer/danswer-backend:latest`), no Azure-specific affinity / tolerations, no Darwin-specific configmap wiring. | Reference only; not deployed to Darwin. Useful for seeing the "upstream shape" of a new component before adapting it to `darwin-kubernetes/`. |
+
+When upstream (or a branch like `feature/backgroundscaling`) adds a
+new manifest in `deployment/kubernetes/`, the corresponding
+`darwin-kubernetes/` version must be hand-ported with:
+
+- Image: `sfbrdevhelmweacr.azurecr.io/danswer/danswer-backend:<tag>`
+- `envFrom: configMapRef name: env-configmap`
+- POSTGRES_USER / POSTGRES_PASSWORD via `secretKeyRef name: danswer-secrets`
+- REDIS_PASSWORD via `secretKeyRef name: danswer-secrets, optional: true`
+  (so unauth'd in-cluster Redis still works)
+- For indexing-related pods: `nodeAffinity` on `agentpool=indexcpu` +
+  `tolerations` for `darwin/indexing/NoSchedule` + `dynamic-pvc` /
+  `file-connector-pvc` volume mounts.
+
+A drop-in port that misses any of these will boot in Darwin but
+mis-route, miss secrets, or end up on the wrong node pool. The
+existing `darwin-kubernetes/background-deployment.yaml` and
+`api_server-service-deployment.yaml` are the canonical templates for
+the conventions.
+
 ---
 
 ## Common workflows
Lines changed: 73 additions & 0 deletions
@@ -0,0 +1,73 @@
+# Celery beat: periodic-task scheduler. (Darwin variant.)
+#
+# MUST be a singleton: two beats on the same broker fire every
+# crontab entry twice. `Recreate` strategy guarantees no overlap
+# during rollout (the old pod is terminated before the new one
+# starts), at the cost of a brief beat outage during deploy. That's
+# acceptable because beat-fired tasks are all "check / catch-up"
+# style: missing one cycle is harmless, the next one cleans up.
+#
+# Beat is light (~100MB RSS); doesn't need the indexcpu node pool
+# the indexing-side pods sit on. Stays on the default pool.
+#
+# This deployment is part of the split-background topology
+# (beat / celery / indexer-scheduler / dask-scheduler / dask-worker).
+# Apply alongside the other four to retire `background-deployment.yaml`.
+# Keep the old combined deployment in place during cutover so you can
+# scale it to 0 once the new pods are healthy.
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: background-beat-deployment
+spec:
+  replicas: 1
+  strategy:
+    type: Recreate
+  selector:
+    matchLabels:
+      app: background-beat
+  template:
+    metadata:
+      labels:
+        app: background-beat
+    spec:
+      containers:
+        - name: beat
+          image: sfbrdevhelmweacr.azurecr.io/danswer/danswer-backend:vha-5
+          imagePullPolicy: IfNotPresent
+          command:
+            - celery
+            - -A
+            - danswer.background.celery.celery_run:celery_app
+            - beat
+            - --loglevel=INFO
+          env:
+            - name: POSTGRES_USER
+              valueFrom:
+                secretKeyRef:
+                  key: postgres_user
+                  name: danswer-secrets
+            - name: POSTGRES_PASSWORD
+              valueFrom:
+                secretKeyRef:
+                  key: postgres_password
+                  name: danswer-secrets
+            # Beat itself doesn't mutate personas, but stays wired for
+            # parity with the other split-background pods. Optional secret
+            # so an unauth'd in-cluster Redis still works.
+            - name: REDIS_PASSWORD
+              valueFrom:
+                secretKeyRef:
+                  key: redis_password
+                  name: danswer-secrets
+                  optional: true
+          envFrom:
+            - configMapRef:
+                name: env-configmap
+          resources:
+            requests:
+              cpu: "50m"
+              memory: "128Mi"
+            limits:
+              cpu: "200m"
+              memory: "256Mi"
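For contrast, here is a sketch of what the `Recreate` choice avoids. When `strategy` is omitted, a Deployment falls back to `RollingUpdate` with `maxSurge: 25%` and `maxUnavailable: 25%` (the Kubernetes defaults). At `replicas: 1` the surge rounds up to one extra pod and the unavailable count rounds down to zero, so a new beat would start while the old one is still running. That is the double-firing window the header comment warns about. The fragment below only spells out that default for illustration; it is not part of this commit.

```yaml
# NOT what this manifest uses: the implicit default when `strategy` is omitted.
# Shown only to illustrate why beat sets `type: Recreate` instead.
spec:
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%        # rounds up to 1 extra pod at replicas: 1
      maxUnavailable: 25%  # rounds down to 0, so the old pod stays until the new one is Ready
```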
Lines changed: 102 additions & 0 deletions
@@ -0,0 +1,102 @@
+# Celery worker: executes periodic / on-demand tasks (prune, sync,
+# retention, deletion, etc.). (Darwin variant.)
+#
+# Horizontally scalable. Beat fires tasks → broker (Postgres) → any
+# worker pulls the task. Postgres-level row locks plus the
+# per-cc-pair advisory locks (DELE/RETENTIO namespaces) prevent
+# duplicate execution of the same task even when many workers race.
+#
+# Indexing is NOT in this deployment; see
+# background-indexer-scheduler-deployment.yaml. Slack listener is
+# still in its own deployment.
+#
+# Pool=threads (not prefork) is required because of the Celery +
+# SQLAlchemy SIGSEGV issue documented at the top of supervisord.conf.
+#
+# Lives on the indexcpu node pool with the same toleration as the old
+# combined background deployment: connector deletion / retention can
+# do heavy file-store work and that's where the dynamic /
+# file-connector PVCs are wired.
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: background-celery-deployment
+spec:
+  replicas: 2
+  selector:
+    matchLabels:
+      app: background-celery
+  template:
+    metadata:
+      labels:
+        app: background-celery
+    spec:
+      affinity:
+        nodeAffinity:
+          requiredDuringSchedulingIgnoredDuringExecution:
+            nodeSelectorTerms:
+              - matchExpressions:
+                  - key: agentpool
+                    operator: In
+                    values:
+                      - indexcpu
+      containers:
+        - name: celery
+          image: sfbrdevhelmweacr.azurecr.io/danswer/danswer-backend:vha-5
+          imagePullPolicy: IfNotPresent
+          command:
+            - celery
+            - -A
+            - danswer.background.celery.celery_run:celery_app
+            - worker
+            - --pool=threads
+            - --autoscale=3,10
+            - --loglevel=INFO
+          env:
+            - name: POSTGRES_USER
+              valueFrom:
+                secretKeyRef:
+                  key: postgres_user
+                  name: danswer-secrets
+            - name: POSTGRES_PASSWORD
+              valueFrom:
+                secretKeyRef:
+                  key: postgres_password
+                  name: danswer-secrets
+            # Celery tasks may transitively touch persona-cache invalidation
+            # via shared db/ code paths (e.g. cleanup tasks calling into
+            # functions imported from db/persona.py). Wired even when the
+            # task path doesn't currently need it (cheap, fail-open).
+            - name: REDIS_PASSWORD
+              valueFrom:
+                secretKeyRef:
+                  key: redis_password
+                  name: danswer-secrets
+                  optional: true
+          envFrom:
+            - configMapRef:
+                name: env-configmap
+          volumeMounts:
+            - name: dynamic-storage
+              mountPath: /home/storage
+            - name: file-connector-storage
+              mountPath: /home/file_connector_storage
+          resources:
+            requests:
+              cpu: "200m"
+              memory: "512Mi"
+            limits:
+              cpu: "1"
+              memory: "2Gi"
+      tolerations:
+        - effect: NoSchedule
+          key: darwin
+          operator: Equal
+          value: indexing
+      volumes:
+        - name: dynamic-storage
+          persistentVolumeClaim:
+            claimName: dynamic-pvc
+        - name: file-connector-storage
+          persistentVolumeClaim:
+            claimName: file-connector-pvc
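One detail in the worker command may deserve a second look. Celery documents `--autoscale` as `max_concurrency,min_concurrency`; its own example is `--autoscale=10,3` (keep 3, grow to 10). Read that way, `--autoscale=3,10` asks for a maximum of 3 and a minimum of 10. If the intent is "between 3 and 10 threads", the argument order would be swapped, as in the sketch below. This is an assumption about intent, not a confirmed bug. It is also worth verifying, against the Celery version baked into the image, that autoscaling takes effect with `--pool=threads` at all.

```yaml
# Sketch only: same command as in the manifest above, with the autoscale
# bounds in Celery's documented max,min order.
# Assumes the intended range is 3..10 workers.
command:
  - celery
  - -A
  - danswer.background.celery.celery_run:celery_app
  - worker
  - --pool=threads
  - --autoscale=10,3   # max_concurrency=10, min_concurrency=3
  - --loglevel=INFO
```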
Lines changed: 96 additions & 0 deletions
@@ -0,0 +1,96 @@
+# Indexer-scheduler: runs the polling loop in
+# `danswer/background/update.py`. Every 10s it scans Postgres for
+# cc-pairs due for re-indexing, creates index_attempt rows, and
+# submits `run_indexing_entrypoint` tasks to the remote Dask
+# scheduler service. The actual indexing CPU/RAM work happens on
+# dask-worker pods, NOT here. (Darwin variant.)
+#
+# Singleton (replicas: 1, strategy: Recreate). Scaling indexing
+# concurrency = scaling dask-worker pods, not this one. Two
+# scheduler loops would race on `index_attempt` table inserts.
+#
+# DASK_SCHEDULER_ADDRESS env switches `update.py` from in-process
+# LocalCluster to the remote-scheduler client mode; this is the
+# topology flip the bg-scaling work introduced.
+#
+# Lives on the indexcpu pool because it polls the indexing state and
+# needs the file-connector volume mounts (some connectors stage files
+# locally before handing off to dask-worker).
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: background-indexer-scheduler-deployment
+spec:
+  replicas: 1
+  strategy:
+    type: Recreate
+  selector:
+    matchLabels:
+      app: background-indexer-scheduler
+  template:
+    metadata:
+      labels:
+        app: background-indexer-scheduler
+    spec:
+      affinity:
+        nodeAffinity:
+          requiredDuringSchedulingIgnoredDuringExecution:
+            nodeSelectorTerms:
+              - matchExpressions:
+                  - key: agentpool
+                    operator: In
+                    values:
+                      - indexcpu
+      containers:
+        - name: indexer-scheduler
+          image: sfbrdevhelmweacr.azurecr.io/danswer/danswer-backend:vha-5
+          imagePullPolicy: IfNotPresent
+          command: ["python", "danswer/background/update.py"]
+          env:
+            - name: DASK_SCHEDULER_ADDRESS
+              value: "tcp://dask-scheduler-service:8786"
+            - name: CURRENT_PROCESS_IS_AN_INDEXING_JOB
+              value: "true"
+            - name: POSTGRES_USER
+              valueFrom:
+                secretKeyRef:
+                  key: postgres_user
+                  name: danswer-secrets
+            - name: POSTGRES_PASSWORD
+              valueFrom:
+                secretKeyRef:
+                  key: postgres_password
+                  name: danswer-secrets
+            - name: REDIS_PASSWORD
+              valueFrom:
+                secretKeyRef:
+                  key: redis_password
+                  name: danswer-secrets
+                  optional: true
+          envFrom:
+            - configMapRef:
+                name: env-configmap
+          volumeMounts:
+            - name: dynamic-storage
+              mountPath: /home/storage
+            - name: file-connector-storage
+              mountPath: /home/file_connector_storage
+          resources:
+            requests:
+              cpu: "200m"
+              memory: "512Mi"
+            limits:
+              cpu: "500m"
+              memory: "1Gi"
+      tolerations:
+        - effect: NoSchedule
+          key: darwin
+          operator: Equal
+          value: indexing
+      volumes:
+        - name: dynamic-storage
+          persistentVolumeClaim:
+            claimName: dynamic-pvc
+        - name: file-connector-storage
+          persistentVolumeClaim:
+            claimName: file-connector-pvc
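`DASK_SCHEDULER_ADDRESS` points at `tcp://dask-scheduler-service:8786`, so the cluster needs a Service with that name exposing that port in the same namespace as this pod. The dask-scheduler manifest from this commit is not shown in this view, so the following is only a hedged sketch of the shape such a Service would take. The Service name and port come from the env value above; the `app: dask-scheduler` selector label and the port name are assumptions and would have to match the labels on the real dask-scheduler pod template.

```yaml
# Hypothetical sketch, not taken from this commit.
apiVersion: v1
kind: Service
metadata:
  name: dask-scheduler-service   # must match the host in DASK_SCHEDULER_ADDRESS
spec:
  type: ClusterIP
  selector:
    app: dask-scheduler          # assumed label; align with the scheduler Deployment
  ports:
    - name: scheduler            # assumed port name
      protocol: TCP
      port: 8786                 # port referenced by DASK_SCHEDULER_ADDRESS
      targetPort: 8786
```

If the Service name or port ever changes, the env value in the indexer-scheduler manifest has to change with it, since nothing else ties the two together.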
