
Bug Report: VReplication: stream state metric falsely reports Error when _vt.vreplication UPDATE fails #20012

@mcrauwel

Overview of the Issue

When setState(Error, …) is invoked but the UPDATE _vt.vreplication SET state='Error', … cannot be executed (for example, the target's MySQL is in read_only / super_read_only mode during a reparent), the in-memory state metric has already been advanced, because it is updated before the DB write is attempted. The DB write then fails, the row in _vt.vreplication keeps its prior value, and no row is inserted into _vt.vreplication_log.

This error is often temporary (e.g. the UPDATE failed because of a PRS): the workflow retry loop retries the work and the stream is actually fine. But because the saved state never became "Error", the state update is not triggered again and the in-memory state is not corrected back to Copying. It is only updated when the state next changes (e.g. going to Running).

The result is a stream that is actually Copying, but reports Error to operators via metrics and /debug/vars.

This causes false alerts: operators see vttablet_v_replication_stream_state{state="Error",…} and VReplicationStreamState in /debug/vars showing Error, while:

  • select state from _vt.vreplication where id=… still returns the prior state,
  • there is no corresponding LogStateChange1 row in _vt.vreplication_log,
  • the workflow's controller keeps applying events / copying.

Reproduction Steps

  1. Start a vreplication workflow that hits an unrecoverable apply error on the target — anything that goes through runBlp's terminal-error handling at controller.go:309-318.
  2. Make the target's MySQL read_only (e.g., a planned reparent that hasn't promoted, or an external operator setting it). Errno 1290 ("MySQL server is running with the --read-only option") will be returned for any write.
  3. Observe the controller log spam:
    E… controller.go:314] INTERNAL: unable to setState() in controller: could not set state: update _vt.vreplication set state='Error', …: errno 1290 …
    E… dbclient.go:139] error in stream N, will retry after 5s: terminal error: …
  4. While this is happening, query and observe the divergence:
    - curl http:///debug/vars | jq '.VReplicationStreamState' → "Error"
    - vttablet_v_replication_stream_state{state="Error",…} → set
    - SELECT state, message FROM _vt.vreplication WHERE id=N; → still the prior state (e.g. Copying)
    - SELECT type, state, message FROM _vt.vreplication_log WHERE vrepl_id=N ORDER BY id DESC LIMIT 5; → no State Change row to Error
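The comparison in step 4 could be automated along these lines. This is a rough sketch under stated assumptions: it assumes VReplicationStreamState appears in /debug/vars as a JSON object mapping a stream key to a state string, which should be checked against the real payload. The key "wf.1" and the helpers `metricState` and `diverged` are hypothetical, and fetching the payload and the DB row is left out.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// metricState extracts the reported state for one stream key from a
// /debug/vars payload.
// Assumption: VReplicationStreamState is a JSON object of key -> state.
func metricState(debugVars []byte, key string) (string, error) {
	var vars struct {
		States map[string]string `json:"VReplicationStreamState"`
	}
	if err := json.Unmarshal(debugVars, &vars); err != nil {
		return "", fmt.Errorf("parsing debug vars: %w", err)
	}
	st, ok := vars.States[key]
	if !ok {
		return "", fmt.Errorf("no VReplicationStreamState entry for %q", key)
	}
	return st, nil
}

// diverged reports whether the metric state differs from the state read
// from _vt.vreplication (passed in as dbState by the caller).
func diverged(debugVars []byte, key, dbState string) (bool, error) {
	st, err := metricState(debugVars, key)
	if err != nil {
		return false, err
	}
	return st != dbState, nil
}

func main() {
	// Hand-written sample payload for illustration, not captured from a tablet.
	payload := []byte(`{"VReplicationStreamState":{"wf.1":"Error"}}`)

	d, err := diverged(payload, "wf.1", "Copying")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("diverged:", d)
}
```

With the sample payload and a DB state of Copying, `diverged` returns true, which matches the false-alert situation in this report.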

Binary Version

Reproducible on release-22.0 and current main.

Operating System and Environment details

PlanetScale vttablet

Log Fragments


Status

In progress
