
[Registry] Switch {:duplicate, :key} key_ets to ordered_set with composite keys#15304

Open
studzien wants to merge 23 commits into elixir-lang:main from studzien:registry/duplicate-ordered-set-redo

Conversation

@studzien
Contributor

Motivation

With {:duplicate, :key} partitioning, all entries for a given key land in a single partition's duplicate_bag ETS table. At high churn rates (many processes joining/leaving the same key), unregister becomes a bottleneck — :ets.match_delete/2 on a duplicate_bag must scan all entries for that key to find the ones belonging to the leaving process. This causes the partition GenServer's message queue to grow unboundedly.

We hit this in production with Phoenix.PubSub (backed by Registry) for livestreams: a single key with ~150k subscribers on one node (~800k across a 6-node cluster) at ~1k joins and ~1k leaves per second per node (~6k each cluster-wide) was enough to cause the partition GenServer's message queue to run away. This change eliminates the issue.

Approach

Replace the duplicate_bag with an ordered_set using composite keys {key, pid, counter}. Since ordered_set sorts entries, all entries for a {key, pid} prefix are adjacent in the tree, so:

  • Unregister can seek directly to {key, pid, _} instead of scanning all entries for that key — O(log N) vs O(N).
  • Lookup uses :ets.select/2 with a bound key prefix, which leverages the tree ordering — O(log N + m) seek plus scan of the m matching entries.
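The layout can be sketched with plain ETS (this is an illustration of the idea, not the actual Registry internals; the table, keys, and values are invented):

```elixir
# Minimal sketch of the composite-key layout in plain ETS; the real
# Registry internals differ, and the names here are invented.
table = :ets.new(:sketch, [:ordered_set, :public])
me = self()
other = spawn(fn -> Process.sleep(:infinity) end)

# Each registration is stored as {{key, pid, counter}, value}.
:ets.insert(table, {{"room:1", me, 1}, :v1})
:ets.insert(table, {{"room:1", me, 2}, :v2})
:ets.insert(table, {{"room:1", other, 1}, :v3})

# Lookup: select with the key bound in the prefix. The tree ordering keeps
# all "room:1" entries adjacent, so this is a seek plus a short scan.
values = :ets.select(table, [{{{"room:1", :_, :_}, :"$1"}, [], [:"$1"]}])

# Unregister: delete only this pid's entries via the {key, pid, _} prefix,
# without touching the key's other subscribers.
:ets.match_delete(table, {{"room:1", me, :_}, :_})
```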

Benchmark summary

With multiple partitions, {:duplicate, :key} outperforms {:duplicate, :pid} across all operations in both the "many keys" and "hot key" scenarios. The most dramatic difference is unregister on a hot key (200k subscribers): 146–1172x faster because the ordered_set can seek directly to {key, pid, _} instead of scanning 200k entries.

With a single partition and many keys, {:duplicate, :pid} wins lookup/dispatch by ~4x due to the direct hash lookup (lookup_element) vs the ordered_set match spec — but this advantage disappears with multiple partitions because :pid must fan out across all of them.

For the hot-key scenario with a single partition, lookup/dispatch throughput is comparable between the two at the ETS level, but the duplicate_bag exhibits higher tail latency (~30% deviation and 2x worse p99 vs ~9% for the ordered_set).

Full benchmark results and the benchmark script are in the comments below.

Caveats

  • No parallel dispatch: parallel: true in dispatch/4 is silently ignored for {:duplicate, :key} — all entries for a given key live in one partition, so there is nothing to parallelize across. This is inherent to key-based partitioning, not a limitation of the ordered_set change. In practice, even without parallelization, :key performs better on dispatch for hot keys because :pid partitioning must fan out across all partitions and merge results, which adds more overhead than it saves through parallelism.

  • :"$_" behavior in select/2 and count_select/2: The internal ETS representation changes from {key, {pid, value}} to {{key, pid, counter}, value}. Match specs that reference :"$_" (the whole object) will see the new shape. This is documented in the start_link/1 docs with guidance to use named match variables (:"$1", :"$2", etc.) instead. This could be improved with a runtime error when :"$_" is used in select specs for {:duplicate, :key} registries, but that is left for a follow-up.

  • Reserved-atom keys (e.g., :_, :"$1", :"$2"): When one of these atoms is used as a registry key, it clashes with ETS match spec syntax (:_ is a wildcard, :"$N" are capture variables). The implementation falls back to using a wildcard in the match head and a guard-based equality check, which means ETS cannot use the partial key bound to seek in the tree — it must do a full table scan. This only affects users who register with these specific atoms as keys, which is uncommon in practice.
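Both select caveats can be illustrated at the plain-ETS level. This is a sketch over the {{key, pid, counter}, value} layout described above, not the Registry API; table and value names are invented:

```elixir
# Sketch of the select caveats on the composite-key layout (plain ETS,
# invented names, not the Registry public API).
table = :ets.new(:sketch, [:ordered_set, :public])
me = self()
:ets.insert(table, {{"topic", me, 7}, :meta})

# :"$_" returns the whole object, i.e. the internal composite-key shape.
[{{"topic", pid, 7}, :meta}] = :ets.select(table, [{:_, [], [:"$_"]}])

# Capture variables rebuild whatever shape you want ({{...}} in the spec
# body constructs a tuple), so they stay portable across layouts.
[{"topic", :meta}] =
  :ets.select(table, [{{{:"$1", :_, :_}, :"$2"}, [], [{{:"$1", :"$2"}}]}])

# A reserved atom such as :"$1" cannot appear literally as the key in a
# match head (it would be read as a capture variable). Bind a variable and
# equality-check it in the guard instead — which is why ETS can no longer
# seek on the key prefix and must scan the table.
:ets.insert(table, {{:"$1", me, 1}, :weird})

[:weird] =
  :ets.select(table, [
    {{{:"$9", :_, :_}, :"$8"}, [{:"=:=", :"$9", {:const, :"$1"}}], [:"$8"]}
  ])
```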

All 216 parameterized duplicate registry tests pass (3 key types × 2 partition counts × 36 tests).

studzien and others added 23 commits April 23, 2026 12:29
Change init_key_ets to create an ordered_set table for {:duplicate, :key}
and add ordered boolean to Partition state.

Assisted by AI

Co-Authored-By: Claude <noreply@anthropic.com>
Assisted by AI

Co-Authored-By: Claude <noreply@anthropic.com>
Assisted by AI

Co-Authored-By: Claude <noreply@anthropic.com>
Assisted by AI

Co-Authored-By: Claude <noreply@anthropic.com>
Assisted by AI

Co-Authored-By: Claude <noreply@anthropic.com>
Assisted by AI

Co-Authored-By: Claude <noreply@anthropic.com>
The match spec body needs triple braces {{{A, B}}} in Elixir to produce
a tuple result in ETS. Also preserve parallel dispatch contract for
{:duplicate, :key} when parallel: true option is passed.

Assisted by AI

Co-Authored-By: Claude <noreply@anthropic.com>
All entries for a given key are in a single partition, so parallel
dispatch has no effect. Simplify by always dispatching in-process
and document this in the :parallel option.

Assisted by AI

Co-Authored-By: Claude <noreply@anthropic.com>
Handle {:duplicate, :key} (ordered) first, then {:duplicate, :pid}
multi-partition, then collapse the remaining cases (:unique and
{:duplicate, :pid} single-partition) into a single branch that builds
the bag spec once.

Assisted by AI

Co-Authored-By: Claude <noreply@anthropic.com>
Add an optional `ordered` parameter to __unregister__/4 so callers
don't need to branch between __unregister__ and ordered_unregister_key.
When ordered is true and pos is 1, the function builds the composite
key match pattern internally.

Assisted by AI

Co-Authored-By: Claude <noreply@anthropic.com>
Remove the duplicated bag_unregister_match/ordered_unregister_match
9-argument functions. The only difference was the match specs, so
extract a single unregister_match_specs/5 that returns {total_spec,
delete_spec} for both ordered and bag formats.

Assisted by AI

Co-Authored-By: Claude <noreply@anthropic.com>
The rewrite is needed because existing code (including tests) uses
{:element, 1, :"$_"} in select body to extract the key. Without
rewriting, this returns the composite tuple {key, pid, counter}
instead of just the key in ordered_set tables.

Assisted by AI

Co-Authored-By: Claude <noreply@anthropic.com>
Users of {:duplicate, :key} registries should use capture variables
(:"$1", :"$2") instead of :"$_" in select specs. The internal ETS
layout differs between registry kinds, and we don't paper over that.

Updated docs to note this explicitly and fixed the test that relied
on {:element, 1, :"$_"} to use capture variables instead.

Assisted by AI

Co-Authored-By: Claude <noreply@anthropic.com>
Users switching from {:duplicate, :pid} to {:duplicate, :key} should
know that match specs using :"$_" in select/count_select may behave
differently due to the ordered_set internal layout.

Assisted by AI

Co-Authored-By: Claude <noreply@anthropic.com>
Assisted by AI

Co-Authored-By: Claude <noreply@anthropic.com>
Replace the two separate lookup helpers with a single lookup_second/3
that dispatches on kind. This also allows collapsing the single-partition
{:duplicate, :key} and {:duplicate, :pid} branches in dispatch, lookup,
and values.

Assisted by AI

Co-Authored-By: Claude <noreply@anthropic.com>
Replace ordered_match_spec/3 and ordered_count_match_spec/3 with
match_spec/4 and count_match_spec/4 that dispatch on kind. The bag
spec construction moves from inline in match/4 and count_match/4 into
the helpers, simplifying both functions from 5 branches to 2.

Assisted by AI

Co-Authored-By: Claude <noreply@anthropic.com>
Assisted by AI

Co-Authored-By: Claude <noreply@anthropic.com>
@studzien
Contributor Author

Benchmark: {:duplicate, :key} vs {:duplicate, :pid}

Tested on 16 schedulers with 2s warmup and 10s measurement window per benchmark. Numbers below are averages from Benchee. Deviation is noted where it tells an important story.

Scenario A: 10,000 keys × 10 subscribers (100,000 total)

Operation    1 part / :key   1 part / :pid    16 part / :key   16 part / :pid
Lookup       1.09 μs         0.26 μs (4.3x)   1.13 μs          1.86 μs (1.6x)
Dispatch     1.13 μs         0.27 μs (4.2x)   1.16 μs          1.82 μs (1.6x)
Register     1.74 μs         1.53 μs (1.1x)   1.69 μs          2.35 μs (1.4x)
Unregister   2.65 μs         2.34 μs (1.1x)   2.53 μs          2.92 μs (1.2x)

With 1 partition, :pid wins lookup/dispatch by ~4x (direct hash lookup via lookup_element vs ordered_set match spec, no multi-partition overhead). With 16 partitions, :key wins across the board — the fan-out cost of querying all partitions erases :pid's raw ETS advantage.

Scenario B: 1 key × 200,000 subscribers

Operation    1 part / :key     1 part / :pid     16 part / :key   16 part / :pid
Lookup       10.79 ms (±17%)   15.29 ms (±34%)   10.01 ms (±9%)   24.51 ms (±6%)
Dispatch     10.40 ms (±9%)    14.40 ms (±34%)   10.31 ms (±9%)   28.32 ms (±6%)
Register     1.82 μs           3.38 μs (1.9x)    1.96 μs          3.43 μs (1.8x)
Unregister   2.71 μs           3.18 ms (1172x)   5.19 μs          759 μs (146x)

For lookup/dispatch with 1 partition, the averages suggest :key is ~1.4x faster. However, a raw ETS microbenchmark (stripping out Registry overhead) shows the median throughput of :ets.lookup_element/3 on the bag and :ets.select/2 on the ordered_set is comparable at this scale. The average difference is driven by tail latency: the duplicate_bag shows ~34% deviation and ~2x worse p99 compared to ~9-17% for the ordered_set.

With 16 partitions, the :pid fan-out overhead across all partitions adds a clear and consistent penalty on top of the variance.

The unregister gap (146–1172x) is the critical result and the primary motivation for this change — with 200k entries in a duplicate_bag, unregister must scan all entries for the key to find the matching pid. The ordered_set seeks directly to {key, pid, _}.
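The asymmetry can be reproduced with a rough plain-ETS sketch (invented table names; the absolute timings vary by machine, so only the relative gap is meaningful):

```elixir
# Rough sketch of the unregister asymmetry: deleting one pid's entry for a
# hot key from a duplicate_bag scans every entry under that key, while the
# composite-key ordered_set can seek to the {key, pid, _} prefix.
bag = :ets.new(:bag_sketch, [:duplicate_bag, :public])
ord = :ets.new(:ord_sketch, [:ordered_set, :public])
me = self()
victim = spawn(fn -> Process.sleep(:infinity) end)

for i <- 1..20_000 do
  :ets.insert(bag, {"hot", {me, i}})
  :ets.insert(ord, {{"hot", me, i}, :v})
end

:ets.insert(bag, {"hot", {victim, 0}})
:ets.insert(ord, {{"hot", victim, 0}, :v})

{bag_us, true} =
  :timer.tc(fn -> :ets.match_delete(bag, {"hot", {victim, :_}}) end)

{ord_us, true} =
  :timer.tc(fn -> :ets.match_delete(ord, {{"hot", victim, :_}, :_}) end)

IO.puts("bag: #{bag_us} μs, ordered_set: #{ord_us} μs")
```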

@studzien
Contributor Author

Benchmark script

Mix.install([{:benchee, "~> 1.0"}])

# Compare {:duplicate, :key} vs {:duplicate, :pid} registries.
#
# Scenario A: many keys, few subscribers each (10k × 10)
#   → :key wins on lookup/dispatch (single partition hit vs all partitions)
#
# Scenario B: one key, many subscribers (1 × 200k)
#   → :pid may win on lookup (bag lookup_element vs ordered_set select)
#
# Run: elixir bench/registry_key_vs_pid_bench.exs

defmodule Sub do
  @moduledoc """
  Minimal GenServer worker. Registry.register/3 must be called by the owning
  process, so each "subscriber" is its own GenServer that takes instructions
  via call.
  """
  use GenServer

  def start do
    {:ok, pid} = GenServer.start(__MODULE__, nil)
    pid
  end

  def register(pid, reg, key, val \\ :v) do
    GenServer.call(pid, {:register, reg, key, val})
  end

  def unregister(pid, reg, key) do
    GenServer.call(pid, {:unregister, reg, key})
  end

  def stop(pid), do: GenServer.stop(pid, :normal, :infinity)

  @impl true
  def init(nil), do: {:ok, nil}

  @impl true
  def handle_call({:register, reg, key, val}, _from, state) do
    {:ok, _} = Registry.register(reg, key, val)
    {:reply, :ok, state}
  end

  def handle_call({:unregister, reg, key}, _from, state) do
    Registry.unregister(reg, key)
    {:reply, :ok, state}
  end
end

defmodule Env do
  def start_registry(name, keys, partitions) do
    if pid = Process.whereis(name) do
      GenServer.stop(pid)
      Process.sleep(20)
    end

    {:ok, _} = Registry.start_link(keys: keys, name: name, partitions: partitions)
  end

  def populate(reg, n_keys, subs_per_key) do
    total = n_keys * subs_per_key
    agents = for _ <- 1..total, do: Sub.start()

    Enum.with_index(agents, fn pid, i ->
      key = "key_#{rem(i, n_keys)}"
      Sub.register(pid, reg, key)
    end)

    agents
  end

  def teardown(agents, name) do
    Enum.each(agents, &Sub.stop/1)
    if pid = Process.whereis(name), do: GenServer.stop(pid)
    Process.sleep(50)
  end
end

schedulers = System.schedulers_online()
IO.puts("Registry benchmark: {:duplicate, :key} vs {:duplicate, :pid}")
IO.puts("Schedulers: #{schedulers}\n")

bench_opts = [warmup: 2, time: 10, print: [configuration: false]]

run_scenario = fn title, n_keys, subs_per_key, partitions ->
  total = n_keys * subs_per_key
  key = "key_0"

  IO.puts(String.duplicate("=", 70))
  IO.puts("#{title}, #{partitions} partition(s)")
  IO.puts(String.duplicate("=", 70))

  Env.start_registry(:bench_key, {:duplicate, :key}, partitions)
  Env.start_registry(:bench_pid, {:duplicate, :pid}, partitions)

  IO.puts("Populating #{total} entries...")
  agents_key = Env.populate(:bench_key, n_keys, subs_per_key)
  agents_pid = Env.populate(:bench_pid, n_keys, subs_per_key)
  IO.puts("Done.\n")

  # -- Lookup --
  Benchee.run(
    %{
      "duplicate_key" => fn -> Registry.lookup(:bench_key, key) end,
      "duplicate_pid" => fn -> Registry.lookup(:bench_pid, key) end
    },
    [{:title, "Lookup"} | bench_opts]
  )

  # -- Dispatch --
  Benchee.run(
    %{
      "duplicate_key" => fn ->
        Registry.dispatch(:bench_key, key, fn entries -> length(entries) end)
      end,
      "duplicate_pid" => fn ->
        Registry.dispatch(:bench_pid, key, fn entries -> length(entries) end)
      end
    },
    [{:title, "Dispatch"} | bench_opts]
  )

  # -- Register --
  Benchee.run(
    %{
      "duplicate_key" =>
        {fn pid ->
           Sub.register(pid, :bench_key, key)
           pid
         end,
         before_each: fn _ -> Sub.start() end,
         after_each: fn pid ->
           Sub.unregister(pid, :bench_key, key)
           Sub.stop(pid)
         end},
      "duplicate_pid" =>
        {fn pid ->
           Sub.register(pid, :bench_pid, key)
           pid
         end,
         before_each: fn _ -> Sub.start() end,
         after_each: fn pid ->
           Sub.unregister(pid, :bench_pid, key)
           Sub.stop(pid)
         end}
    },
    [{:title, "Register"} | bench_opts]
  )

  # -- Unregister --
  Benchee.run(
    %{
      "duplicate_key" =>
        {fn pid ->
           Sub.unregister(pid, :bench_key, key)
           pid
         end,
         before_each: fn _ ->
           pid = Sub.start()
           Sub.register(pid, :bench_key, key)
           pid
         end,
         after_each: fn pid -> Sub.stop(pid) end},
      "duplicate_pid" =>
        {fn pid ->
           Sub.unregister(pid, :bench_pid, key)
           pid
         end,
         before_each: fn _ ->
           pid = Sub.start()
           Sub.register(pid, :bench_pid, key)
           pid
         end,
         after_each: fn pid -> Sub.stop(pid) end}
    },
    [{:title, "Unregister"} | bench_opts]
  )

  Env.teardown(agents_key, :bench_key)
  Env.teardown(agents_pid, :bench_pid)
  IO.puts("")
end

for partitions <- [1, schedulers] do
  # ── Scenario A: many keys, few subscribers ──────────────────────
  run_scenario.(
    "Scenario A: 10,000 keys x 10 subscribers (100,000 total)",
    10_000,
    10,
    partitions
  )

  # ── Scenario B: one key, many subscribers ───────────────────────
  run_scenario.("Scenario B: 1 key x 200,000 subscribers", 1, 200_000, partitions)
end

IO.puts("Done.")
