Skip to content

Flaky test: PublicationManagerTest "handles relation tracker restart" races the supervisor restart #4606

Description

@erik-the-implementer

Summary

The test component restarts handles relation tracker restart in packages/sync-service/test/electric/replication/publication_manager_test.exs:641 is flaky. It fails intermittently with a no process exit when calling PublicationManager.remove_shape/2 after restarting the RelationTracker.

Observed on an unrelated PR (#4562, a telemetry-only change), CI run 27678953105, seed 839006:

1) test component restarts handles relation tracker restart (Electric.Replication.PublicationManagerTest)
   test/electric/replication/publication_manager_test.exs:641
   ** (exit) exited in: GenServer.call({:via, Registry, {:"Electric.ProcessRegistry:...",
        {Electric.Replication.PublicationManager.RelationTracker, nil}}}, {:remove_shape, "..."}, 5000)
      ** (EXIT) no process: the process is not alive or there's no process currently associated
        with the given name, possibly because its application isn't started
   code: PublicationManager.remove_shape(ctx.stack_id, shape_handle)
   stacktrace:
     (elixir 1.19.5) lib/gen_server.ex:1135: GenServer.call/3
     (electric 1.7.2) lib/electric/replication/publication_manager/relation_tracker.ex:77: ...RelationTracker.remove_shape/2
     test/electric/replication/publication_manager_test.exs:660: (test)

Root cause

The test is a timing race:

# Stop the RelationTracker - supervisor will restart it
relation_tracker_name = PublicationManager.RelationTracker.name(ctx.stack_id)
GenServer.stop(relation_tracker_name)                 # 653: async restart by supervisor

# After restart, the publication manager should repopulate from ShapeStatus.
# The publication should still have the relation.
assert_pub_tables(ctx, [ctx.relation], 2_000)         # 657: passes IMMEDIATELY

# Verify we can still remove the shape after the restart
PublicationManager.remove_shape(ctx.stack_id, shape_handle)  # 660: (EXIT) no process
assert_pub_tables(ctx, [], 2_000)

The assert_pub_tables on line 657 is intended to act as the "wait for restart" barrier, but it isn't one. The relation was already in the publication before the restart and the restart never removes it, so the assertion is satisfied immediately — well before the supervisor has restarted the RelationTracker and re-registered it in Electric.ProcessRegistry. Line 660 then does a GenServer.call to the via-Registry name and hits no process because the new tracker hasn't registered yet.

This subsystem's restart tests have a history of flakiness (see #3573 "Fix flaky restart test"); this is the same class of bug. The test was added in #3359.

Draft fix

Explicitly wait for the restarted process to be registered (and finished restoring) before issuing the next call, rather than relying on a no-op publication assertion. RelationTracker already exposes wait_for_restore/2, but it does a GenServer.call to the via-name and would itself hit no process if called before re-registration — so we first poll until the process is alive.

test "handles relation tracker restart", ctx do
  shape = generate_shape(ctx.relation_with_oid, @where_clause_1)

  {:ok, shape_handle} = Electric.ShapeCache.ShapeStatus.add_shape(ctx.stack_id, shape)
  assert :ok = PublicationManager.add_shape(ctx.stack_id, shape_handle, shape)
  assert_pub_tables(ctx, [ctx.relation])

  # Stop the RelationTracker - supervisor will restart it
  relation_tracker_name = PublicationManager.RelationTracker.name(ctx.stack_id)
  GenServer.stop(relation_tracker_name)

  # After restart, the publication manager should repopulate from ShapeStatus.
  # The publication should still have the relation.
  assert_pub_tables(ctx, [ctx.relation], 2_000)

  # Wait for the supervisor to actually restart the RelationTracker and finish
  # restoring before issuing further calls. The assert_pub_tables above passes
  # immediately (the relation is never removed), so it does NOT guarantee the
  # new process has registered yet — without this, remove_shape/2 races the
  # restart and hits the via-Registry name before the process exists.
  wait_until_registered(relation_tracker_name, 2_000)
  PublicationManager.RelationTracker.wait_for_restore(ctx.stack_id, timeout: 2_000)

  # Verify we can still remove the shape after the restart
  PublicationManager.remove_shape(ctx.stack_id, shape_handle)
  assert_pub_tables(ctx, [], 2_000)
end

Helper:

defp wait_until_registered(name, timeout, start_time \\ :erlang.monotonic_time(:millisecond)) do
  case GenServer.whereis(name) do
    pid when is_pid(pid) ->
      pid

    nil ->
      if :erlang.monotonic_time(:millisecond) - start_time < timeout do
        Process.sleep(10)
        wait_until_registered(name, timeout, start_time)
      else
        flunk("RelationTracker was not re-registered within #{timeout}ms after restart")
      end
  end
end

Note: GenServer.stop/1 is synchronous on the old pid, but GenServer.whereis/1 can still briefly return the dying pid until the registry entry is cleared and the new one registered — a short Process.sleep before the first whereis poll, or a Process.monitor/:DOWN wait on the old pid before polling, makes the wait fully robust. Either refinement is fine; the key point is to stop treating the line-657 assertion as a restart barrier.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions