Commit 1ddde9e

ci: add cross-process cluster smoke test suite
Add a GitHub Actions workflow, Makefile target, and supporting scripts to
catch cross-node bugs that in-process unit tests miss.

- .github/workflows/cluster.yml: new CI job that boots the 5-node
  docker-compose stack, waits for all /healthz endpoints, runs the
  assertion script, and dumps container logs on failure
- Makefile: add `test-cluster` target mirroring the CI flow for local
  development, propagating the smoke's exit code on teardown
- scripts/tests/wait-for-cluster.sh: polling helper that blocks until
  every node's /healthz returns 200, configurable via PORTS /
  TIMEOUT_SECS / POLL_INTERVAL env vars
- CHANGELOG.md: document all additions under [Unreleased]
- cspell.config.yaml: add healthz to the word list

This specifically guards against the class of regressions that escaped
Phase D review: factory dropping DistMemoryOptions, seeds without node IDs
producing broken rings, and json.RawMessage mis-encoding on non-owner GET
requests.
1 parent 229f5fc commit 1ddde9e

5 files changed

Lines changed: 168 additions & 1 deletion

.github/workflows/cluster.yml

Lines changed: 59 additions & 0 deletions
@@ -0,0 +1,59 @@
+---
+name: cluster
+
+# Cross-process cluster smoke. Boots the 5-node docker-compose
+# stack defined at docker-compose.cluster.yml, waits for every
+# node's /healthz to flip to 200, then runs the assertion script
+# under scripts/tests/. Catches the class of bugs that unit
+# tests miss because they only exercise in-process behavior:
+# * config not flowing from HyperCache wrapper to DistMemory
+# * seed lists with empty node IDs producing broken rings
+# * wire-encoding asymmetries between writer and replica
+#
+# Container logs are dumped on failure so a CI failure is
+# debuggable without re-running locally.
+
+on:
+  pull_request:
+  push:
+    branches: [main]
+
+permissions:
+  contents: read
+
+jobs:
+  smoke:
+    name: 5-node smoke
+    runs-on: ubuntu-latest
+    timeout-minutes: 10
+
+    steps:
+      - uses: actions/checkout@v6
+
+      - name: Build cluster image
+        run: |
+          docker compose -f docker-compose.cluster.yml build
+
+      - name: Bring cluster up
+        run: |
+          docker compose -f docker-compose.cluster.yml up -d
+
+      - name: Wait for cluster /healthz
+        run: bash scripts/tests/wait-for-cluster.sh
+
+      - name: Run cross-node smoke
+        run: bash scripts/tests/10-test-cluster-api.sh
+
+      - name: Dump container logs (on failure)
+        if: failure()
+        run: |
+          for c in hypercache-1 hypercache-2 hypercache-3 hypercache-4 hypercache-5; do
+            echo "::group::$c"
+            docker logs --tail 200 "$c" || true
+            echo "::endgroup::"
+          done
+
+      - name: Tear down
+        if: always()
+        run: |
+          docker compose -f docker-compose.cluster.yml down -v --remove-orphans
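
The log-dump step wraps each container's output in GitHub Actions' `::group::` / `::endgroup::` workflow commands so the logs fold per node in the run view. As a rough sketch of the same loop in reusable form, the function below takes the tail length and container names as arguments; `dump_logs` is a hypothetical name and is not part of this commit.

```shell
#!/usr/bin/env bash

# dump_logs is a hypothetical helper (not in the commit). It prints the
# last <tail_lines> log lines of each named container inside a collapsible
# GitHub Actions log group.
# Usage: dump_logs <tail_lines> <container> [container...]
dump_logs() {
    local tail_lines="$1"
    shift

    local c
    for c in "$@"; do
        echo "::group::$c"
        # A missing container must not abort the loop, so the error is
        # swallowed, as in the workflow step.
        docker logs --tail "$tail_lines" "$c" || true
        echo "::endgroup::"
    done
}
```

Calling `dump_logs 200 hypercache-1 hypercache-2 hypercache-3 hypercache-4 hypercache-5` would reproduce the step's behavior, assuming the container names match those in `docker-compose.cluster.yml`.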

CHANGELOG.md

Lines changed: 35 additions & 1 deletion
@@ -6,6 +6,40 @@ adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
 ## [Unreleased]
 
+### Added
+
+- **Cross-process cluster smoke in CI**
+  [.github/workflows/cluster.yml](.github/workflows/cluster.yml) boots
+  the 5-node `docker-compose.cluster.yml` stack on every PR/push,
+  waits for `/healthz` on every node, then runs the assertion
+  script at
+  [scripts/tests/10-test-cluster-api.sh](scripts/tests/10-test-cluster-api.sh).
+  Container logs are dumped on failure for debuggability without a
+  re-run. This catches the class of bugs that escaped the previous
+  PR (factory dropped DistMemoryOptions, seeds without IDs,
+  json.RawMessage on non-owner GET); none would have been
+  detected by unit/integration tests because they only exercised
+  in-process behavior.
+- **`make test-cluster` Makefile target** mirrors the CI flow for
+  local development: brings the cluster up, waits, runs the smoke,
+  and tears down on the way out (preserving the smoke's exit code).
+- **`scripts/tests/wait-for-cluster.sh`** is the polling helper that
+  blocks until every node's `/healthz` returns 200, with a default
+  30-second deadline configurable via `TIMEOUT_SECS`. Used by both
+  the Makefile and the CI workflow so the assertion script downstream
+  never races the listener bind.
+- **`scripts/tests/10-test-cluster-api.sh` hardened** from a
+  print-only smoke into a real regression test: 17 explicit
+  assertions across propagation / wire-encoding / cross-node
+  delete, color-coded `OK`/`FAIL` output, exit code reflects
+  total failure count.
+- **`cmd/hypercache-server/main_test.go`**: fast Go unit tests
+  pinning the wire-encoding contracts on `writeValue` /
+  `decodeBase64Bytes`. Covers `[]byte` (writer path), `string`
+  (replica path), `json.RawMessage` (non-owner-GET path), and the
+  base64-heuristic length floors. Runs without docker for tight
+  feedback during development.
+
 ### Fixed
 
 - **Cluster propagation was completely broken.** The
@@ -77,7 +111,7 @@ adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
   the migration path swallowed errors silently, so the hint enqueue
   rate from rebalance ticks was much lower.
 
-### Added
+### Added (earlier in this cycle)
 
 - **Structured logging on the dist backend.** New `WithDistLogger(*slog.Logger)`
   option wires a structured logger into the dist backend's background
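
The changelog entry describes `10-test-cluster-api.sh` as emitting color-coded `OK`/`FAIL` lines and exiting with the total failure count, but that script is not part of this diff. The sketch below only illustrates how such a counter could work; `assert_eq` and `failures` are assumed names, not taken from the real script.

```shell
#!/usr/bin/env bash

# failures counts failed assertions; a script built on this sketch would
# finish with `exit "$failures"` (hypothetical, the real script is not shown).
failures=0

# assert_eq is a hypothetical helper: it compares an expected and an actual
# value, prints a green OK or a red FAIL line, and bumps the failure counter.
# Usage: assert_eq <name> <expected> <actual>
assert_eq() {
    local name="$1" expected="$2" actual="$3"

    if [[ "$expected" == "$actual" ]]; then
        printf '\033[32mOK\033[0m   %s\n' "$name"
    else
        printf '\033[31mFAIL\033[0m %s: expected %q, got %q\n' \
            "$name" "$expected" "$actual"
        failures=$((failures + 1))
    fi
}
```

Because shell exit statuses are truncated to 0-255, a count-based exit code is only unambiguous while the number of assertions stays below 256, which holds for the 17 mentioned above.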

Makefile

Lines changed: 18 additions & 0 deletions
@@ -37,6 +37,24 @@ stop-dev-cluster:
 	@echo
 	docker compose -f docker-compose.cluster.yml down -v --rmi local --remove-orphans
 
+# test-cluster brings up the 5-node docker-compose cluster, waits for
+# every node's /healthz to be 200, runs the cross-node smoke test
+# (PUT/GET/DELETE asserted on every node), and tears the stack down,
+# always, even on assertion failure, so a failing run leaves no
+# stragglers. The shell-script's exit code is propagated so CI can
+# fail the build on any regression of the bugs that escaped Phase D
+# initial review (factory dropped options, seeds without IDs,
+# json.RawMessage on non-owner GET).
+test-cluster: stop-dev-cluster
+	@echo "spinning up cluster + running cross-node smoke"
+	@echo
+	docker compose -f docker-compose.cluster.yml up --build -d
+	@bash scripts/tests/wait-for-cluster.sh
+	@rc=0; bash scripts/tests/10-test-cluster-api.sh || rc=$$?; \
+		echo ""; echo "tearing down cluster (rc=$$rc)"; \
+		docker compose -f docker-compose.cluster.yml down -v --rmi local --remove-orphans >/dev/null 2>&1 || true; \
+		exit $$rc
+
 # ci aggregates the gates required before declaring a task done (see AGENTS.md).
 ci: lint typecheck test-race sec build
 	@echo "All CI gates passed."
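
The `test-cluster` recipe captures the smoke script's status in `rc`, tears the stack down regardless, and exits with the captured value (`$$` is Make's escape for a literal `$`). The same pattern in plain shell might look like the sketch below; `run_then_teardown` is a hypothetical helper, not something this commit adds.

```shell
#!/usr/bin/env bash

# run_then_teardown is a hypothetical helper illustrating the recipe's
# pattern: run the test command, always run the teardown command, and
# return the test command's exit status rather than the teardown's.
# Usage: run_then_teardown <test_cmd> <teardown_cmd>
run_then_teardown() {
    local test_cmd="$1" teardown_cmd="$2"
    local rc=0

    # Record a failing status instead of aborting.
    bash -c "$test_cmd" || rc=$?

    # Teardown output and failures are deliberately ignored, as in the recipe.
    bash -c "$teardown_cmd" >/dev/null 2>&1 || true

    return "$rc"
}
```

In the recipe only the smoke script is wrapped this way. The wait step runs on its own recipe line, so a `/healthz` timeout would appear to stop `make` before the teardown line is reached.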

cspell.config.yaml

Lines changed: 1 addition & 0 deletions
@@ -111,6 +111,7 @@ words:
   - gosec
   - GOTOOLCHAIN
   - govulncheck
+  - healthz
   - histogramcollector
   - HMAC
   - honnef

scripts/tests/wait-for-cluster.sh

Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@
+#!/usr/bin/env bash
+# Block until every node in the docker-compose.cluster.yml stack
+# answers `GET /healthz` with HTTP 200, or fail with a clear
+# error after the deadline elapses. Used by `make test-cluster`
+# and by CI so the assertion script downstream is never racing
+# the listener bind.
+#
+# Usage:
+#   ./scripts/tests/wait-for-cluster.sh
+#   PORTS="8081 8082" TIMEOUT_SECS=60 ./scripts/tests/wait-for-cluster.sh
+
+set -euo pipefail
+
+readonly PORTS="${PORTS:-8081 8082 8083 8084 8085}"
+readonly TIMEOUT_SECS="${TIMEOUT_SECS:-30}"
+readonly POLL_INTERVAL="${POLL_INTERVAL:-1}"
+
+start_epoch=$(date +%s)
+deadline=$((start_epoch + TIMEOUT_SECS))
+
+# wait_one polls a single port's /healthz endpoint until it returns
+# 200 or the global deadline passes. Returns 0 on success, 1 on
+# timeout; caller decides whether to abort (we abort on the first
+# failed port).
+wait_one() {
+    local port="$1"
+
+    while true; do
+        now=$(date +%s)
+        if [[ "$now" -ge "$deadline" ]]; then
+            printf 'wait-for-cluster: port %s not ready after %ds\n' "$port" "$TIMEOUT_SECS" >&2
+            return 1
+        fi
+
+        status=$(curl -sS -o /dev/null -w '%{http_code}' \
+            --max-time 1 \
+            "http://localhost:$port/healthz" 2>/dev/null || true)
+
+        if [[ "$status" == "200" ]]; then
+            printf ' ready: :%s\n' "$port"
+
+            return 0
+        fi
+
+        sleep "$POLL_INTERVAL"
+    done
+}
+
+printf 'waiting for cluster ports: %s (timeout %ds)\n' "$PORTS" "$TIMEOUT_SECS"
+
+for port in $PORTS; do
+    wait_one "$port"
+done
+
+printf 'cluster ready in %ds\n' "$(( $(date +%s) - start_epoch ))"
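
The deadline-polling loop in `wait_one` can be exercised without a running cluster if the `curl` probe is swapped for an arbitrary command. The following is a minimal sketch under that assumption; `wait_until` and `healthz_ok` are hypothetical names that do not appear in the commit, and unlike `wait_one` this variant probes at least once before checking the deadline.

```shell
#!/usr/bin/env bash

# wait_until is a hypothetical helper: it retries a probe command until it
# succeeds or <timeout_secs> have elapsed. Returns 0 on success, 1 on timeout.
# Usage: wait_until <timeout_secs> <poll_interval> <command> [args...]
wait_until() {
    local timeout_secs="$1" poll_interval="$2"
    shift 2

    local deadline=$(( $(date +%s) + timeout_secs ))

    while true; do
        # Probe first, so a zero timeout still gets one attempt.
        if "$@"; then
            return 0
        fi
        if [[ "$(date +%s)" -ge "$deadline" ]]; then
            return 1
        fi
        sleep "$poll_interval"
    done
}

# healthz_ok mirrors the probe used by wait_one: it succeeds only when
# GET /healthz on the given localhost port answers with HTTP 200.
healthz_ok() {
    local status
    status=$(curl -sS -o /dev/null -w '%{http_code}' --max-time 1 \
        "http://localhost:$1/healthz" 2>/dev/null || true)
    [[ "$status" == "200" ]]
}
```

With these in place, `wait_until 30 1 healthz_ok 8081` would approximate one iteration of the script's port loop, and any other command can stand in for the probe when testing the timeout logic.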
