Commit 52e6282

backspace and claude committed
deploy: replace post-migrate-db / post-deploy-worker sleeps with real readiness checks
The two `sleep 180` jobs were workarounds from when pg-migration and worker containers had no healthcheck, so ECS service-stability returned on container start instead of on actual readiness (see commits 36cc92f, af51fcd, fbee5d7). Now that prerender + prerender-manager have proven the healthcheck-based approach (dbdfc33), apply the same pattern to the other two services so we can drop the heuristic sleeps from the critical path.

pg-migration: append `touch /tmp/migrations-complete` to the migration CMD and add a HEALTHCHECK that requires the sentinel. Service-stability now waits for migrations to actually finish, not just for ECS to start the task.

worker: pass `--port=3000` in the staging/prod startup scripts so worker-manager mounts its existing readiness endpoint (`GET /` returns 503 until `isReady = true`, then 200), and add a curl-based HEALTHCHECK on it.

Workflow: flip `wait-for-service-stability: true` on both deploys, set `timeout-minutes: 10`, delete the two `sleep 180` wait jobs, and rewire the `needs:` of dependents (`deploy-ai-bot`, `deploy-bot-runner`, `deploy-worker`, `deploy-realm-server`, `finalize-deployment`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent b57e1c2 commit 52e6282

5 files changed

Lines changed: 36 additions & 29 deletions

.github/workflows/manual-deploy.yml

Lines changed: 18 additions & 28 deletions
@@ -83,7 +83,7 @@ jobs:
       dockerfile: "packages/ai-bot/Dockerfile"
 
   deploy-ai-bot:
-    needs: [build-ai-bot, post-migrate-db]
+    needs: [build-ai-bot, migrate-db]
     name: Deploy ai-bot to AWS ECS
     uses: cardstack/gh-actions/.github/workflows/ecs-deploy.yml@main
     secrets: inherit
@@ -105,7 +105,7 @@ jobs:
       dockerfile: "packages/bot-runner/Dockerfile"
 
   deploy-bot-runner:
-    needs: [build-bot-runner, post-migrate-db]
+    needs: [build-bot-runner, migrate-db]
     name: Deploy bot-runner to AWS ECS
     uses: cardstack/gh-actions/.github/workflows/ecs-deploy.yml@main
     secrets: inherit
@@ -207,17 +207,13 @@ jobs:
       cluster: ${{ inputs.environment }}
       service-name: "boxel-pg-migration-${{ inputs.environment }}"
       image: ${{ needs.build-pg-migration.outputs.image }}
-      wait-for-service-stability: false
-
-  # the wait-for-service-stability flag doesn't seem to work in
-  # aws-actions/amazon-ecs-deploy-task-definition@v2. we keep getting timeouts
-  # waiting for service stability. So we are manually waiting here.
-  post-migrate-db:
-    name: Wait for db-migration
-    needs: [migrate-db]
-    runs-on: ubuntu-latest
-    steps:
-      - run: sleep 180
+      timeout-minutes: 10
+      # The pg-migration container writes /tmp/migrations-complete after
+      # `node-pg-migrate up` finishes (packages/postgres/Dockerfile) and its
+      # HEALTHCHECK depends on that sentinel, so service-stability now waits
+      # for migrations to actually finish rather than just for ECS to start
+      # the task. This replaces the heuristic `sleep 180` post-migrate job.
+      wait-for-service-stability: true
 
   deploy-prerender:
     name: Deploy prerender
@@ -250,7 +246,7 @@ jobs:
   deploy-worker:
     name: Deploy worker
     needs:
-      [build-worker, deploy-host, post-migrate-db, deploy-prerender-manager]
+      [build-worker, deploy-host, migrate-db, deploy-prerender-manager]
     uses: cardstack/gh-actions/.github/workflows/ecs-deploy.yml@main
     secrets: inherit
     with:
@@ -259,22 +255,18 @@ jobs:
       cluster: ${{ inputs.environment }}
       service-name: "boxel-worker-${{ inputs.environment }}"
       image: ${{ needs.build-worker.outputs.image }}
-      wait-for-service-stability: false
-
-  # the wait-for-service-stability flag doesn't seem to work in
-  # aws-actions/amazon-ecs-deploy-task-definition@v2. we keep getting timeouts
-  # waiting for service stability. So we are manually waiting here.
-  post-deploy-worker:
-    name: Wait for worker
-    needs: [deploy-worker]
-    runs-on: ubuntu-latest
-    steps:
-      - run: sleep 180
+      timeout-minutes: 10
+      # The worker container's HEALTHCHECK curls `GET /` on the worker-manager,
+      # which returns 200 only once `isReady = true` — i.e. all workers have
+      # actually spawned (worker-manager.ts). Service-stability therefore
+      # waits for true readiness, replacing the heuristic `sleep 180` we used
+      # to do after deploy-worker.
+      wait-for-service-stability: true
 
   deploy-realm-server:
     name: Deploy realm server
     needs:
-      [post-deploy-worker, build-realm-server, deploy-host, post-migrate-db]
+      [deploy-worker, build-realm-server, deploy-host, migrate-db]
     uses: cardstack/gh-actions/.github/workflows/ecs-deploy.yml@main
     secrets: inherit
     with:
@@ -350,11 +342,9 @@ jobs:
         build-worker,
         build-pg-migration,
         migrate-db,
-        post-migrate-db,
         deploy-prerender,
         deploy-prerender-manager,
         deploy-worker,
-        post-deploy-worker,
         deploy-realm-server,
         post-deploy-realm-server,
         apply-observability,
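
The workflow change swaps a fixed `sleep 180` for a wait that ends as soon as the service reports healthy, bounded by a timeout. A minimal sketch of that idea follows; it is not how ECS or the deploy action implements the wait. `wait_until` and `fake_health_check` are hypothetical names, and the fake check simply succeeds on its third poll.

```shell
# Hypothetical helper: run a check command up to MAX_ATTEMPTS times,
# sleeping INTERVAL seconds between attempts.
# Returns 0 as soon as the check succeeds, 1 if every attempt fails.
wait_until() {
  max_attempts="$1"
  interval="$2"
  shift 2
  attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$interval"
  done
}

# Stand-in for "is the service healthy yet?": succeeds from the third poll on.
polls=0
fake_health_check() {
  polls=$((polls + 1))
  [ "$polls" -ge 3 ]
}

if wait_until 5 0 fake_health_check; then
  echo "stable after ${polls} polls"
else
  echo "timed out"
fi
```

Unlike a fixed sleep, the loop returns early when readiness arrives quickly and fails loudly when it never arrives.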

packages/postgres/Dockerfile

Lines changed: 9 additions & 1 deletion
@@ -20,4 +20,12 @@ RUN CI=1 pnpm install -r --offline
 
 WORKDIR /boxel/packages/postgres
 
-CMD ./node_modules/.bin/ts-node --transpileOnly ./scripts/fix-migration-names.ts && ./node_modules/.bin/node-pg-migrate --check-order false --migrations-table migrations up && sleep infinity
+# Touch a sentinel file after migrations complete so the HEALTHCHECK can
+# signal readiness. ECS treats the task as "stable" only once the healthcheck
+# passes, which lets the deploy job's `wait-for-service-stability: true` block
+# until migrations have actually finished — replacing the heuristic 180s sleep
+# we used to do after migrate-db.
+HEALTHCHECK --interval=5s --timeout=2s --start-period=60s --retries=120 \
+  CMD test -f /tmp/migrations-complete || exit 1
+
+CMD ./node_modules/.bin/ts-node --transpileOnly ./scripts/fix-migration-names.ts && ./node_modules/.bin/node-pg-migrate --check-order false --migrations-table migrations up && touch /tmp/migrations-complete && sleep infinity
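
The sentinel pattern in this Dockerfile can be tried outside a container. The sketch below is illustrative only: `run_migrations` is a hypothetical stand-in for the real `node-pg-migrate` command, and the sentinel lives in a temporary directory rather than `/tmp/migrations-complete`.

```shell
# Hypothetical stand-ins, for illustration only.
workdir="$(mktemp -d)"
sentinel="$workdir/migrations-complete"

# Placeholder for the real migration command; always succeeds here.
run_migrations() { return 0; }

# Mirrors the HEALTHCHECK: succeed only once the sentinel file exists.
healthcheck() { test -f "$sentinel"; }

if healthcheck; then before="healthy"; else before="starting"; fi

# `&&` means the sentinel is written only if the migrations exit 0.
run_migrations && touch "$sentinel"

if healthcheck; then after="healthy"; else after="starting"; fi

echo "before: $before, after: $after"
```

Because the `touch` is chained with `&&`, a failed migration never creates the sentinel, so the check keeps failing instead of reporting a false ready state.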

packages/realm-server/scripts/start-worker-production.sh

Lines changed: 1 addition & 0 deletions
@@ -14,6 +14,7 @@ NODE_NO_WARNINGS=1 \
   OPENROUTER_REALM_URL='https://app.boxel.ai/openrouter/' \
   ts-node \
   --transpileOnly worker-manager \
+  --port=3000 \
   --allPriorityCount="${WORKER_ALL_PRIORITY_COUNT:-1}" \
   --highPriorityCount="${WORKER_HIGH_PRIORITY_COUNT:-0}" \
   --prerendererUrl='http://boxel-prerender-manager.boxel-production-internal:4222' \

packages/realm-server/scripts/start-worker-staging.sh

Lines changed: 1 addition & 0 deletions
@@ -14,6 +14,7 @@ NODE_NO_WARNINGS=1 \
   OPENROUTER_REALM_URL='https://realms-staging.stack.cards/openrouter/' \
   ts-node \
   --transpileOnly worker-manager \
+  --port=3000 \
   --allPriorityCount="${WORKER_ALL_PRIORITY_COUNT:-1}" \
   --highPriorityCount="${WORKER_HIGH_PRIORITY_COUNT:-0}" \
   --prerendererUrl='http://boxel-prerender-manager.boxel-staging-internal:4222' \

packages/realm-server/worker.Dockerfile

Lines changed: 7 additions & 0 deletions
@@ -22,4 +22,11 @@ RUN CI=1 pnpm install -r --offline
 
 EXPOSE 3000
 
+# `GET /` returns 200 once all workers have started (`isReady = true` in
+# worker-manager.ts) and 503 before that, so this is also a readiness probe.
+# ECS `wait-for-service-stability` won't return until this passes, replacing
+# the heuristic 180s sleep we used to do after deploy-worker.
+HEALTHCHECK --interval=10s --timeout=5s --start-period=30s --retries=6 \
+  CMD curl --fail --silent --show-error --max-time 5 --output /dev/null http://localhost:3000/ || exit 1
+
 CMD pnpm --filter "./packages/realm-server" $worker_script
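
The two HEALTHCHECK lines in this commit imply different grace windows. As a rough sketch, using Docker's documented semantics (failures during the start period are not counted, then `retries` consecutive failures mark the container unhealthy), the worst case is about start period plus retries times interval. This ignores probe run time and assumes the orchestrator applies the Dockerfile values unchanged, which this commit does not show. `unhealthy_after` is a hypothetical helper name.

```shell
# Approximate worst-case seconds before a container is marked unhealthy:
# start period + (interval * retries), ignoring the time each probe takes.
unhealthy_after() {
  start_period="$1"
  interval="$2"
  retries="$3"
  echo $((start_period + interval * retries))
}

pg_migration=$(unhealthy_after 60 5 120)  # flags from packages/postgres/Dockerfile
worker=$(unhealthy_after 30 10 6)         # flags from worker.Dockerfile
workflow_timeout=$((10 * 60))             # timeout-minutes: 10 in the workflow

echo "pg-migration: ${pg_migration}s, worker: ${worker}s, workflow timeout: ${workflow_timeout}s"
```

Under these assumptions the pg-migration window (660 s) is slightly longer than the 600 s workflow timeout, so a stuck migration would surface as a workflow timeout first, while a worker has roughly 90 s to become ready.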
