fix(mysql): wait for SQL before PITR replay #2650

Open: weicao wants to merge 1 commit into release-1.0 from william/mysql-pitr-postready-sql-ready

weicao commented May 19, 2026

What

Add a bounded SQL readiness wait before MySQL PITR postReady binlog replay starts.

The replay script now:

  • probes ${DP_DB_HOST}:${DP_DB_PORT} with mysql ... -e "SELECT 1" before creating /var/lib/mysql/pitr-logs;
  • defaults to a 180s timeout and 2s interval;
  • uses ${DP_DB_PORT} in the actual WALG_MYSQL_BINLOG_REPLAY_COMMAND.
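
For readers who want to picture the wait, here is a minimal sketch of a bounded readiness loop with the stated 180s/2s defaults. The function name `wait_for_sql` and the two `SQL_READY_*` variables are hypothetical and are not taken from `mysql-pitr-restore.sh`; the probe is passed in as a command so the loop can be exercised without a MySQL server.

```shell
# Sketch only: function and variable names here are hypothetical,
# not the ones used in mysql-pitr-restore.sh.

# wait_for_sql PROBE...: rerun the probe command until it exits 0,
# or give up once the timeout (default 180s) has elapsed.
wait_for_sql() {
  local timeout="${SQL_READY_TIMEOUT_SECONDS:-180}"
  local interval="${SQL_READY_INTERVAL_SECONDS:-2}"
  local start now
  start="$(date +%s)"
  while :; do
    if "$@" >/dev/null 2>&1; then
      echo "SQL ready"
      return 0
    fi
    now="$(date +%s)"
    if [ $(( now - start )) -ge "${timeout}" ]; then
      echo "SQL not ready after ${timeout}s" >&2
      return 1
    fi
    sleep "${interval}"
  done
}

# Intended call shape (client options elided as in the description above):
#   wait_for_sql mysql -h "${DP_DB_HOST}" -P "${DP_DB_PORT}" ... -e "SELECT 1" || exit 1
```

The point of bounding the loop is that a server that never comes up makes the job fail with a clear message rather than hang, and it fails before any replay state has been written.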

Why

On idc4, the MySQL 5.7.44 PITR multifile run (N=1) with the syncer PR #160 patch got past the previous DCS/role blocker, but PostReady still failed:

  • leader CM existed and restore pod reached role=primary;
  • first archive replay container started at 2026-05-19T08:45:52Z;
  • replay failed at 2026-05-19T08:45:53Z with ERROR 2003 ... Can't connect to MySQL server ... :3306 (111);
  • retry failed because /var/lib/mysql/pitr-logs had already been created by the first failed attempt.

This patch prevents the observed pre-SQL-ready replay start. It keeps the existing pitr-logs guard fail-closed instead of blindly deleting the directory, because a directory left after replay has started may mean binlogs were partially applied.
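
The fail-closed behaviour can be sketched roughly as follows. This is an illustration under assumptions, not the guard as written in the script: `guard_pitr_logs` is a hypothetical name, and the error text is invented for the example.

```shell
# Sketch only: guard_pitr_logs is a hypothetical helper name.

# guard_pitr_logs DIR: create DIR for a fresh replay, but refuse to continue
# if DIR already exists, because an earlier attempt may have applied some
# binlogs. The directory is never deleted automatically.
guard_pitr_logs() {
  local dir="$1"
  if [ -e "${dir}" ]; then
    echo "refusing to replay: ${dir} already exists, binlogs may be partially applied" >&2
    return 1
  fi
  mkdir -p "${dir}"
}

# Intended order: run the SQL readiness wait first, then:
#   guard_pitr_logs /var/lib/mysql/pitr-logs || exit 1
```

Because the directory is created only after SQL readiness is confirmed, a connection refusal like the one observed no longer leaves a marker behind that blocks the retry.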

Validation

Static:

  • bash -n addons/mysql/dataprotection/mysql-pitr-restore.sh passed.
  • helm template mysql addons/mysql --namespace kb-system passed.
  • Render contains the SQL readiness wait and port-aware replay command.
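
The render check could be scripted along these lines. `check_render` is a hypothetical helper, and the two search strings are only examples taken from the description; the actual check used for this PR is not shown here.

```shell
# Sketch only: check_render is a hypothetical helper name.

# check_render NEEDLE...: read rendered manifests on stdin and succeed only
# if every fixed string given as an argument appears at least once.
check_render() {
  local rendered needle
  rendered="$(cat)"
  for needle in "$@"; do
    if ! printf '%s\n' "${rendered}" | grep -qF -- "${needle}"; then
      echo "missing in render: ${needle}" >&2
      return 1
    fi
  done
}

# Usage (assumes helm is installed and the command runs from the repo root):
#   helm template mysql addons/mysql --namespace kb-system \
#     | check_render 'SELECT 1' 'DP_DB_PORT'
```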

Runtime evidence before this patch:

  • evidence dir: evidence/mysql-tests-idc4-5.7-pitr-multifile-pr160-patch-20260519T084326Z/
  • evidence sha after adding PostReady readiness summary: ed3fe8ec829717f8be0588c24ca962c9eac1e1553135c4cfabba7b06ff6ec90c

Boundary

This is a candidate fix for the observed MySQL addon ActionSet readiness race. It is not release-ready and not yet a proven PITR fix. Patch-image runtime validation still needs to prove:

  • PostReady job start time;
  • SQL readiness confirmation before replay;
  • wal-g binlog-replay result;
  • final PostReady phase;
  • restored row count, expected row_count=8 for the PITR window;
  • retry behavior if triggered.

weicao requested review from a team, leon-ape and xuriwuyun as code owners May 19, 2026 09:05
@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 0.00%. Comparing base (66b6e94) to head (cb39930).

Additional details and impacted files
@@             Coverage Diff             @@
##           release-1.0   #2650   +/-   ##
===========================================
  Coverage         0.00%   0.00%           
===========================================
  Files               61      61           
  Lines             6506    6506           
===========================================
  Misses            6506    6506           

weicao commented May 19, 2026

Runtime validation update (idc4):

Scope:

Result:

  • Product gates inside backup-pitr-multifile: PASS: 5 FAIL: 0 SKIP: 0
  • PostReady Restore CR final phase: Completed
  • Hardgate row count: source_row_count=14, restored_row_count=8, T0=2026-05-19T09:13:56Z
  • Manual live watcher sampled restore pod at 2026-05-19T09:15:06Z, 09:15:11Z, 09:15:16Z: ready=True, role=primary, leader_cm=exists, PostReady/preparedata Completed, row_count=8

Runner boundary:

  • Overall runner rc was 1 only because cleanup observed transient residual/terminating objects during its cleanup window.
  • Final post-rollback live check is clean: no cluster,pod,backup,restore,job,pvc residue in mysql-tests.
  • Environment restored: Helm rollback done, ComponentVersion/mysql 5.7 init-syncer restored to docker.io/apecloud/syncer:0.6.8.
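
Since the rc=1 came from residue that was still terminating inside the cleanup window, one option is to poll until the namespace is empty instead of sampling once. The sketch below is not the runner's actual cleanup logic: `residue_count` and `wait_clean` are hypothetical names, and the 120s/5s values in the usage line are assumptions.

```shell
# Sketch only: helper names and the timings in the usage line are assumptions.

# residue_count LIST_CMD...: print how many non-empty lines the lister emits.
# Returns 2 if the lister itself fails, so an API error is not read as "clean".
residue_count() {
  local out
  out="$("$@" 2>/dev/null)" || return 2
  printf '%s' "${out}" | grep -c . || true
}

# wait_clean TIMEOUT INTERVAL LIST_CMD...: poll until the lister reports no
# objects, or fail once TIMEOUT seconds have passed.
wait_clean() {
  local timeout="$1" interval="$2" start n
  shift 2
  start="$(date +%s)"
  while :; do
    n="$(residue_count "$@")" || return 2
    if [ "${n}" -eq 0 ]; then
      echo "clean"
      return 0
    fi
    if [ $(( $(date +%s) - start )) -ge "${timeout}" ]; then
      echo "residue remains: ${n} object(s)" >&2
      return 1
    fi
    sleep "${interval}"
  done
}

# Usage:
#   wait_clean 120 5 kubectl get cluster,pod,backup,restore,job,pvc -n mysql-tests -o name
```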

Evidence:

  • evidence/mysql-tests-idc4-5.7-pitr-multifile-pr160-patch-20260519T091240Z/
  • evidence sha: 13a23496b96c7e9e3fbc8fcee2e1af03ac1d4139e3f7842db4c838880312789f

Boundary: this is one scoped validation run, not release-ready and not broad version coverage.

weicao commented May 19, 2026

Runtime validation update (idc4, N=3 scoped):

Scope:

Valid product-path runs:

| Run | Product gates | PostReady | Row count | Runner rc / boundary | Evidence sha |
|---|---|---|---|---|---|
| N=1 | PASS: 5 FAIL: 0 SKIP: 0 | Completed | source=14 restored=8 | rc=1, cleanup transient residual | 13a23496b96c7e9e3fbc8fcee2e1af03ac1d4139e3f7842db4c838880312789f |
| N=2 | PASS: 5 FAIL: 0 SKIP: 0 | Completed | source=14 restored=8 | rc=1, cleanup transient residual | 51247311ef9ef200b409ffc34f3249626d1fe262b25a58b206138ec17049bcc1 |
| N=3 | PASS: 5 FAIL: 0 SKIP: 0 | Completed | source=14 restored=8 | rc=1, cleanup transient residual | d4bd775d0eef3a0864ec95e79783b4cbcda8b68639f5e25d21b3664d38941832 |

Excluded from fix conclusion:

  • 20260519T093140Z hit an idc4 image gate first (config-manager rendered docker.io/library/mysql:5.7.44, Docker Hub pull timeout), then the wrapper script was edited while still running and aborted. No PITR product conclusion is taken from that run.
  • 20260519T091820Z failed before PostReady in archive startup, so it is excluded from the PR #2650 fix validation.

Environment cleanup/restoration:

  • Final live check after rollback: no cluster,pod,backup,restore,job,pvc residue in mysql-tests.
  • ComponentVersion/mysql 5.7 restored to init-syncer=docker.io/apecloud/syncer:0.6.8; MySQL 5.7 images restored to docker.io/apecloud/mysql:5.7.44.

Boundary: this is N=3 scoped validation for idc4 / MySQL 5.7.44 / PITR multifile with syncer PR #160 patch. It is not release-ready and not cross-version coverage. Successful runs still did not preserve the PostReady job container log line for MySQL SQL readiness confirmed before PITR replay; the live ActionSet render contains the wait path, and runtime behavior validates the scoped outcome.
