fix(go-services): mysql healthcheck false-healthy race (kafka-ecommerce CI flake)#14

Open
slayerjain wants to merge 1 commit into go from fix/mysql-healthcheck-false-healthy-race

@slayerjain

What

Fix the intermittent mysql-users/products/orders "container unhealthy" → "dependency failed to start" flake (seen in keploy/enterprise's kafka-ecommerce CI lane, which clones this repo's go branch and runs go-services under keploy record).

Root cause (traced)

The mysql healthchecks used mysqladmin ping -h localhost. MySQL 8.0's entrypoint first runs a temporary, socket-only init server (to apply the seed db.sql), then stops it and starts the real server on :3306. ping -h localhost hits that unix socket and checks exit code only — and mysqladmin exits 0 even on "Access denied". A per-second trace caught the probe reporting mysqld is alive at t=8s against the temp server, ~3s before the real :3306 listener came up (t=11s).

So docker marked the container healthy on the temp server. A dependent service (condition: service_healthy) that connects over TCP mysql-users:3306 then started too early and failed during the temp→real restart gap (~4-6s, wider under load); docker's next probes hit the now-stopped temp server, driving the failing streak to the 20-retry limit → unhealthy. There was also no start_period, so cold-init failures under CI contention burned the retry budget.
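The racy setup described above can be sketched as a Compose fragment. This is an illustration, not the repository's actual file: the `users-service` name, the `interval` and `timeout` values, and the exec-form `test` are assumptions; only the `mysqladmin ping -h localhost` probe, the 20-retry limit, the missing `start_period`, and the `condition: service_healthy` dependency come from the description.

```yaml
services:
  mysql-users:
    image: mysql:8.0
    healthcheck:
      # "-h localhost" makes the client use the unix socket, which the
      # temporary init server also serves; mysqladmin exits 0 even on
      # "Access denied", so this probe can pass before :3306 is listening.
      test: ["CMD", "mysqladmin", "ping", "-h", "localhost"]
      interval: 1s   # assumed value, not stated in the PR
      timeout: 3s    # assumed value, not stated in the PR
      retries: 20
      # no start_period: cold-init failures count against the retries

  users-service:     # hypothetical dependent service name
    depends_on:
      mysql-users:
        condition: service_healthy   # released as soon as the probe above passes
```

With this shape, the dependent starts when the socket probe first succeeds, but it connects over TCP to `mysql-users:3306`, which is not yet listening.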

Fix

Probe the real TCP listener with root creds, mysqladmin ping -h 127.0.0.1 -P 3306 -uroot -proot (only the fully-started real server answers on :3306, never the temp server), and add start_period: 60s so slow cold init doesn't count against the retries. Applied to all three mysql services.
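A minimal sketch of what the fixed healthcheck might look like for one of the three services. The probe command, `start_period: 60s`, and the 20-retry limit come from this description; the `interval` and `timeout` values and the exec-form `test` are assumptions, and the actual diff may be laid out differently.

```yaml
services:
  mysql-users:
    image: mysql:8.0
    healthcheck:
      # Forcing TCP to 127.0.0.1:3306 with real credentials means only the
      # fully started server can answer; the socket-only temp server cannot.
      test: ["CMD", "mysqladmin", "ping", "-h", "127.0.0.1", "-P", "3306", "-uroot", "-proot"]
      interval: 1s        # assumed value, not stated in the PR
      timeout: 3s         # assumed value, not stated in the PR
      retries: 20
      start_period: 60s   # failures during cold init do not count toward retries
```

The same `healthcheck` block would be repeated for the products and orders MySQL services.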

Validation

  • docker compose config validates.
  • Per-second health trace before: healthy at t=8s with Access denied … (using password: NO) — i.e. healthy without a real connection, against the temp server.
  • After: FailingStreak stays 0 through cold init (absorbed by start_period); healthy only at t=16s with mysqld is alive over real TCP :3306 — never the temp server. 5/5 cold-start loops healthy.

Only the three mysql healthchecks change (+30/−3); no service/app changes.

…it server

The mysql-users/products/orders healthchecks used `mysqladmin ping -h localhost`,
which hits MySQL 8.0's socket-only TEMPORARY init server (run to apply the
seed db.sql before the real server starts) and passes on exit-code-only — it
returns 0 even on "Access denied". So docker marked the container healthy
~3s before the real :3306 TCP listener was up. A service depending on it via
`condition: service_healthy` then connected over TCP too early and failed
during the temp-server -> real-server restart gap; docker's subsequent probes
against the now-stopped temp server drove the failing streak to the retry
limit -> "container unhealthy" -> "dependency failed to start". This is the
intermittent kafka-ecommerce CI flake (it passes whenever timing happens to
favour it).

Fix: probe the REAL TCP listener with root creds
(`mysqladmin ping -h 127.0.0.1 -P 3306 -uroot -proot`) — only the
fully-started real server answers there, never the temp server — and add
`start_period: 60s` so slow cold init under CI contention doesn't burn the
retry budget before :3306 is up. Applied to all three mysql services.

Signed-off-by: Shubham Jain <shubham@keploy.io>