Overview of the Issue
When setState(Error, …) is invoked but the UPDATE _vt.vreplication SET state='Error', … cannot be executed (for example, the target's MySQL is in read_only / super_read_only mode during a reparent), the in-memory state metric is advanced before the DB write is attempted. The DB write then fails, the row in _vt.vreplication keeps its prior value, and no row is inserted into _vt.vreplication_log.
This error is expected to be temporary (e.g. the UPDATE failed because of a PRS): the workflow retry loop retries the work and the stream is actually fine. Because the persisted state never became Error, the state update is not triggered again and the in-memory state is not corrected back to Copying; it is only updated the next time the state changes (e.g. on going to Running).
The result is a stream that is actually Copying, but reports Error to operators via metrics and /debug/vars.
This causes false alerts: operators see vttablet_v_replication_stream_state{state="Error",…} and VReplicationStreamState in /debug/vars showing Error, while:
- select state from _vt.vreplication where id=… still returns the prior state,
- there is no corresponding LogStateChange1 row in _vt.vreplication_log,
- the workflow's controller keeps applying events / copying.
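The ordering problem described above can be illustrated with a small, self-contained sketch. This is not the Vitess code: the `stream` type, its fields, and both `setState…` helpers are hypothetical stand-ins, and `errReadOnly` merely simulates errno 1290. The second helper shows one possible ordering that avoids the divergence; it is offered as an assumption, not as the fix the project adopted.

```go
package main

import (
	"errors"
	"fmt"
)

// errReadOnly simulates MySQL errno 1290 (server running with --read-only).
var errReadOnly = errors.New("errno 1290: server is read-only")

// stream is a hypothetical stand-in for a vreplication stream.
type stream struct {
	metricState string // what metrics and /debug/vars would report
	dbState     string // the state column in _vt.vreplication
	readOnly    bool   // whether the simulated target rejects writes
}

// writeState simulates the UPDATE against _vt.vreplication.
func (s *stream) writeState(state string) error {
	if s.readOnly {
		return errReadOnly
	}
	s.dbState = state
	return nil
}

// setStateMetricFirst mirrors the ordering described in this issue:
// the in-memory metric is advanced before the DB write is attempted.
func (s *stream) setStateMetricFirst(state string) error {
	s.metricState = state
	return s.writeState(state)
}

// setStateWriteFirst advances the metric only after the write succeeds.
func (s *stream) setStateWriteFirst(state string) error {
	if err := s.writeState(state); err != nil {
		return err
	}
	s.metricState = state
	return nil
}

func main() {
	a := &stream{metricState: "Copying", dbState: "Copying", readOnly: true}
	_ = a.setStateMetricFirst("Error")
	fmt.Printf("metric-first: metric=%s db=%s\n", a.metricState, a.dbState)

	b := &stream{metricState: "Copying", dbState: "Copying", readOnly: true}
	_ = b.setStateWriteFirst("Error")
	fmt.Printf("write-first:  metric=%s db=%s\n", b.metricState, b.dbState)
}
```

With the simulated read-only target, the metric-first variant ends with the metric at Error and the persisted state at Copying, while the write-first variant leaves both at Copying.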
Reproduction Steps
- Start a vreplication workflow that hits an unrecoverable apply error on the target — anything that goes through runBlp's terminal-error handling at controller.go:309-318.
- Make the target's MySQL read_only (e.g., a planned reparent that hasn't promoted, or an external operator setting it). Errno 1290 ("MySQL server is running with the --read-only option") will be returned for any write.
- Observe the controller log spam:
  E… controller.go:314] INTERNAL: unable to setState() in controller: could not set state: update _vt.vreplication set state='Error', …: errno 1290 …
  E… dbclient.go:139] error in stream N, will retry after 5s: terminal error: …
- While this is happening, query and observe the divergence:
  - curl http:///debug/vars | jq '.VReplicationStreamState' → "Error"
  - vttablet_v_replication_stream_state{state="Error",…} → set
  - SELECT state, message FROM _vt.vreplication WHERE id=N; → still the prior state (e.g. Copying)
  - SELECT type, state, message FROM _vt.vreplication_log WHERE vrepl_id=N ORDER BY id DESC LIMIT 5; → no State Change row to Error
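The manual comparison in the last step could be automated along these lines. This is only a sketch: `divergentStreams` is a hypothetical helper, and it assumes the caller has already collected the reported state (from metrics or /debug/vars) and the persisted state (from _vt.vreplication) into maps keyed by stream id. How those maps are populated is deliberately left out.

```go
package main

import (
	"fmt"
	"sort"
)

// divergentStreams returns, in ascending order, the ids of streams whose
// reported (in-memory) state differs from the state persisted in the DB.
// Streams missing from the reported map are skipped.
func divergentStreams(reported, persisted map[int]string) []int {
	var ids []int
	for id, dbState := range persisted {
		if metricState, ok := reported[id]; ok && metricState != dbState {
			ids = append(ids, id)
		}
	}
	sort.Ints(ids)
	return ids
}

func main() {
	reported := map[int]string{1: "Error", 2: "Running"}
	persisted := map[int]string{1: "Copying", 2: "Running"}
	fmt.Println(divergentStreams(reported, persisted)) // prints [1]
}
```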
Binary Version
Reproducible on release-22.0 and current main.
Operating System and Environment details
Log Fragments