[Registry] Switch {:duplicate, :key} key_ets to ordered_set with composite keys #15304

studzien wants to merge 23 commits into elixir-lang:main

Conversation
Change init_key_ets to create an ordered_set table for {:duplicate, :key}
and add an ordered boolean to the Partition state.
Assisted by AI
Co-Authored-By: Claude <noreply@anthropic.com>
The match spec body needs triple braces ({{{A, B}}}) in Elixir to produce
a tuple result in ETS. Also preserve the parallel dispatch contract for
{:duplicate, :key} when the parallel: true option is passed.
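The tuple-escaping rule the commit refers to can be demonstrated standalone. A minimal sketch (table name and data are made up, not the PR's code): returning a tuple from an ETS select body requires wrapping it in an extra pair of braces, and next to a nested composite-key head the braces quickly stack up to three.

```elixir
# Illustrative only: an ordered_set with composite {key, pid, counter} keys.
t = :ets.new(:example, [:ordered_set, :public])
pid = self()
true = :ets.insert(t, {{"room:1", pid, 7}, :value})

# Head matches {{key, pid, counter}, value}; the body {{:"$1", :"$2"}} is the
# escaped form that constructs the tuple {key, pid} as the result.
spec = [{{{:"$1", :"$2", :_}, :_}, [], [{{:"$1", :"$2"}}]}]
[{"room:1", ^pid}] = :ets.select(t, spec)
```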
All entries for a given key are in a single partition, so parallel dispatch has no effect. Simplify by always dispatching in-process and document this in the :parallel option.
Handle {:duplicate, :key} (ordered) first, then {:duplicate, :pid}
multi-partition, then collapse the remaining cases (:unique and
{:duplicate, :pid} single-partition) into a single branch that builds
the bag spec once.
Add an optional `ordered` parameter to __unregister__/4 so callers don't need to branch between __unregister__ and ordered_unregister_key. When ordered is true and pos is 1, the function builds the composite key match pattern internally.
Remove the duplicated bag_unregister_match/ordered_unregister_match
9-argument functions. The only difference was the match specs, so
extract a single unregister_match_specs/5 that returns {total_spec,
delete_spec} for both ordered and bag formats.
The rewrite is needed because existing code (including tests) uses
{:element, 1, :"$_"} in select body to extract the key. Without
rewriting, this returns the composite tuple {key, pid, counter}
instead of just the key in ordered_set tables.
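The behavior motivating the rewrite is easy to reproduce in a standalone ETS session (the table and values below are illustrative, not Registry internals): on the new layout, `element(1, whole_object)` yields the composite key, not the registry key.

```elixir
# Illustrative only: ordered_set with composite keys, as in the new layout.
t = :ets.new(:example, [:ordered_set, :public])
pid = self()
:ets.insert(t, {{"room:1", pid, 1}, :value})

# {:element, 1, :"$_"} extracts the first element of the whole matched object
# {{key, pid, counter}, value}, i.e. the composite key tuple, not "room:1".
[{"room:1", ^pid, 1}] = :ets.select(t, [{:_, [], [{:element, 1, :"$_"}]}])
```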
Users of {:duplicate, :key} registries should use capture variables
(:"$1", :"$2") instead of :"$_" in select specs. The internal ETS
layout differs between registry kinds, and we don't paper over that.
Updated docs to note this explicitly and fixed the test that relied
on {:element, 1, :"$_"} to use capture variables instead.
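A sketch of the capture-variable style being recommended, using the public Registry.select/2 API (the registry name MyRegistry is made up, and keys: :duplicate is used so the snippet runs on current Elixir): the select head is always the {key, pid, value} triple, so named variables behave the same regardless of the registry's internal layout.

```elixir
# Illustrative only. Registry.select/2 heads are {key, pid, value} triples.
{:ok, _} = Registry.start_link(keys: :duplicate, name: MyRegistry)
{:ok, _} = Registry.register(MyRegistry, "room:1", :v)

# Named capture variables instead of :"$_": portable across registry kinds.
spec = [{{:"$1", :"$2", :"$3"}, [], [{{:"$1", :"$2", :"$3"}}]}]
[{"room:1", pid, :v}] = Registry.select(MyRegistry, spec)
true = pid == self()
```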
Users switching from {:duplicate, :pid} to {:duplicate, :key} should
know that match specs using :"$_" in select/count_select may behave
differently due to the ordered_set internal layout.
Replace the two separate lookup helpers with a single lookup_second/3
that dispatches on kind. This also allows collapsing the single-partition
{:duplicate, :key} and {:duplicate, :pid} branches in dispatch, lookup,
and values.
Replace ordered_match_spec/3 and ordered_count_match_spec/3 with match_spec/4 and count_match_spec/4 that dispatch on kind. The bag spec construction moves from inline in match/4 and count_match/4 into the helpers, simplifying both functions from 5 branches to 2.
Benchmark:

Scenario A: 10,000 keys × 10 subscribers (100,000 total)

| Operation | 1 part / :key | 1 part / :pid | 16 part / :key | 16 part / :pid |
|---|---|---|---|---|
| Lookup | 1.09 μs | 0.26 μs (4.3x) | 1.13 μs | 1.86 μs (1.6x) |
| Dispatch | 1.13 μs | 0.27 μs (4.2x) | 1.16 μs | 1.82 μs (1.6x) |
| Register | 1.74 μs | 1.53 μs (1.1x) | 1.69 μs | 2.35 μs (1.4x) |
| Unregister | 2.65 μs | 2.34 μs (1.1x) | 2.53 μs | 2.92 μs (1.2x) |
With 1 partition, :pid wins lookup/dispatch by ~4x (direct hash lookup via lookup_element vs ordered_set match spec, no multi-partition overhead). With 16 partitions, :key wins across the board — the fan-out cost of querying all partitions erases :pid's raw ETS advantage.
Scenario B: 1 key × 200,000 subscribers

| Operation | 1 part / :key | 1 part / :pid | 16 part / :key | 16 part / :pid |
|---|---|---|---|---|
| Lookup | 10.79 ms (±17%) | 15.29 ms (±34%) | 10.01 ms (±9%) | 24.51 ms (±6%) |
| Dispatch | 10.40 ms (±9%) | 14.40 ms (±34%) | 10.31 ms (±9%) | 28.32 ms (±6%) |
| Register | 1.82 μs | 3.38 μs (1.9x) | 1.96 μs | 3.43 μs (1.8x) |
| Unregister | 2.71 μs | 3.18 ms (1172x) | 5.19 μs | 759 μs (146x) |
For lookup/dispatch with 1 partition, the averages suggest :key is ~1.4x faster. However, a raw ETS microbenchmark (stripping out Registry overhead) shows the median throughput of ets:lookup_element on the bag and ets:select on the ordered_set is comparable at this scale. The average difference is driven by tail latency: the duplicate_bag shows ~34% deviation and ~2x worse p99 compared to ~9-17% for the ordered_set.
With 16 partitions, the :pid fan-out overhead across all partitions adds a clear and consistent penalty on top of the variance.
The unregister gap (146–1172x) is the critical result and the primary motivation for this change — with 200k entries in a duplicate_bag, unregister must scan all entries for the key to find the matching pid. The ordered_set seeks directly to {key, pid, _}.
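The seek-vs-scan difference can be sketched with a bare ETS table (the table, key, and filler entries below are made up for illustration; the filler "subscribers" use tagged tuples in place of real pids): with {key, pid, counter} composite keys in an ordered_set, a select_delete whose key and pid are bound lets ETS seek straight to the matching prefix.

```elixir
# Illustrative only: one hot key with many entries.
t = :ets.new(:example, [:ordered_set, :public])
pid = self()

# 1,000 other "subscribers" under the same hot key (stand-in terms, not pids).
for i <- 1..1000, do: :ets.insert(t, {{"hot", {:fake_pid, i}, i}, :v})
:ets.insert(t, {{"hot", pid, 0}, :v})

# Bound {"hot", pid} prefix: ETS seeks to this process's entry directly,
# deleting exactly one row without scanning the other 1,000.
1 = :ets.select_delete(t, [{{{"hot", pid, :_}, :_}, [], [true]}])
1000 = :ets.info(t, :size)
```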
Benchmark script:

Mix.install([{:benchee, "~> 1.0"}])
# Compare {:duplicate, :key} vs {:duplicate, :pid} registries.
#
# Scenario A: many keys, few subscribers each (10k × 10)
# → :key wins on lookup/dispatch (single partition hit vs all partitions)
#
# Scenario B: one key, many subscribers (1 × 200k)
# → :pid may win on lookup (bag lookup_element vs ordered_set select)
#
# Run: elixir bench/registry_key_vs_pid_bench.exs
defmodule Sub do
@moduledoc """
Minimal GenServer worker. Registry.register/3 must be called by the owning
process, so each "subscriber" is its own GenServer that takes instructions
via call.
"""
use GenServer
def start do
{:ok, pid} = GenServer.start(__MODULE__, nil)
pid
end
def register(pid, reg, key, val \\ :v) do
GenServer.call(pid, {:register, reg, key, val})
end
def unregister(pid, reg, key) do
GenServer.call(pid, {:unregister, reg, key})
end
def stop(pid), do: GenServer.stop(pid, :normal, :infinity)
@impl true
def init(nil), do: {:ok, nil}
@impl true
def handle_call({:register, reg, key, val}, _from, state) do
{:ok, _} = Registry.register(reg, key, val)
{:reply, :ok, state}
end
def handle_call({:unregister, reg, key}, _from, state) do
Registry.unregister(reg, key)
{:reply, :ok, state}
end
end
defmodule Env do
def start_registry(name, keys, partitions) do
if pid = Process.whereis(name) do
GenServer.stop(pid)
Process.sleep(20)
end
{:ok, _} = Registry.start_link(keys: keys, name: name, partitions: partitions)
end
def populate(reg, n_keys, subs_per_key) do
total = n_keys * subs_per_key
agents = for _ <- 1..total, do: Sub.start()
Enum.with_index(agents, fn pid, i ->
key = "key_#{rem(i, n_keys)}"
Sub.register(pid, reg, key)
end)
agents
end
def teardown(agents, name) do
Enum.each(agents, &Sub.stop/1)
if pid = Process.whereis(name), do: GenServer.stop(pid)
Process.sleep(50)
end
end
schedulers = System.schedulers_online()
IO.puts("Registry benchmark: {:duplicate, :key} vs {:duplicate, :pid}")
IO.puts("Schedulers: #{schedulers}\n")
bench_opts = [warmup: 2, time: 10, print: [configuration: false]]
run_scenario = fn title, n_keys, subs_per_key, partitions ->
total = n_keys * subs_per_key
key = "key_0"
IO.puts(String.duplicate("=", 70))
IO.puts("#{title}, #{partitions} partition(s)")
IO.puts(String.duplicate("=", 70))
Env.start_registry(:bench_key, {:duplicate, :key}, partitions)
Env.start_registry(:bench_pid, {:duplicate, :pid}, partitions)
IO.puts("Populating #{total} entries...")
agents_key = Env.populate(:bench_key, n_keys, subs_per_key)
agents_pid = Env.populate(:bench_pid, n_keys, subs_per_key)
IO.puts("Done.\n")
# -- Lookup --
Benchee.run(
%{
"duplicate_key" => fn -> Registry.lookup(:bench_key, key) end,
"duplicate_pid" => fn -> Registry.lookup(:bench_pid, key) end
},
[{:title, "Lookup"} | bench_opts]
)
# -- Dispatch --
Benchee.run(
%{
"duplicate_key" => fn ->
Registry.dispatch(:bench_key, key, fn entries -> length(entries) end)
end,
"duplicate_pid" => fn ->
Registry.dispatch(:bench_pid, key, fn entries -> length(entries) end)
end
},
[{:title, "Dispatch"} | bench_opts]
)
# -- Register --
Benchee.run(
%{
"duplicate_key" =>
{fn pid ->
Sub.register(pid, :bench_key, key)
pid
end,
before_each: fn _ -> Sub.start() end,
after_each: fn pid ->
Sub.unregister(pid, :bench_key, key)
Sub.stop(pid)
end},
"duplicate_pid" =>
{fn pid ->
Sub.register(pid, :bench_pid, key)
pid
end,
before_each: fn _ -> Sub.start() end,
after_each: fn pid ->
Sub.unregister(pid, :bench_pid, key)
Sub.stop(pid)
end}
},
[{:title, "Register"} | bench_opts]
)
# -- Unregister --
Benchee.run(
%{
"duplicate_key" =>
{fn pid ->
Sub.unregister(pid, :bench_key, key)
pid
end,
before_each: fn _ ->
pid = Sub.start()
Sub.register(pid, :bench_key, key)
pid
end,
after_each: fn pid -> Sub.stop(pid) end},
"duplicate_pid" =>
{fn pid ->
Sub.unregister(pid, :bench_pid, key)
pid
end,
before_each: fn _ ->
pid = Sub.start()
Sub.register(pid, :bench_pid, key)
pid
end,
after_each: fn pid -> Sub.stop(pid) end}
},
[{:title, "Unregister"} | bench_opts]
)
Env.teardown(agents_key, :bench_key)
Env.teardown(agents_pid, :bench_pid)
IO.puts("")
end
for partitions <- [1, schedulers] do
# ── Scenario A: many keys, few subscribers ──────────────────────
run_scenario.(
"Scenario A: 10,000 keys x 10 subscribers (100,000 total)",
10_000,
10,
partitions
)
# ── Scenario B: one key, many subscribers ───────────────────────
run_scenario.("Scenario B: 1 key x 200,000 subscribers", 1, 200_000, partitions)
end
IO.puts("Done.")
Motivation
With `{:duplicate, :key}` partitioning, all entries for a given key land in a single partition's `duplicate_bag` ETS table. At high churn rates (many processes joining and leaving the same key), `unregister` becomes a bottleneck: `:ets.match_delete/2` on a `duplicate_bag` must scan all entries for that key to find the ones belonging to the leaving process. This causes the partition GenServer's message queue to grow without bound.

We hit this in production with `Phoenix.PubSub` (backed by `Registry`) for livestreams: a single key with ~150k subscribers on one node (~800k across a 6-node cluster) at ~1k joins and ~1k leaves per second per node (~6k each cluster-wide) was enough to make the partition GenServer's message queue run away. This change eliminates the issue.

Approach

Replace the `duplicate_bag` with an `ordered_set` using composite keys `{key, pid, counter}`. Since an `ordered_set` sorts entries, all entries for a `{key, pid}` prefix are adjacent in the tree, so:

- unregister can delete by matching `{key, pid, _}` instead of scanning all entries for that key — O(log N) vs O(N);
- lookup uses `ets:select/2` with a bound key prefix, which leverages the tree ordering — O(log N + m) seek plus a scan of the matching entries.

Benchmark summary
With multiple partitions, `{:duplicate, :key}` outperforms `{:duplicate, :pid}` across all operations in both the "many keys" and "hot key" scenarios. The most dramatic difference is unregister on a hot key (200k subscribers): 146–1172x faster, because the `ordered_set` can seek directly to `{key, pid, _}` instead of scanning 200k entries.

With a single partition and many keys, `{:duplicate, :pid}` wins lookup/dispatch by ~4-5x due to the direct hash lookup (`lookup_element`) vs the `ordered_set` match spec — but this advantage disappears with multiple partitions, because `:pid` must fan out across all of them.

For the hot-key scenario with a single partition, lookup/dispatch throughput is comparable between the two at the ETS level, but the `duplicate_bag` exhibits higher tail latency (~30% deviation and 2x worse p99 vs ~9% for the `ordered_set`).

Full benchmark results and the benchmark script are in the comments below.
Caveats
No parallel dispatch: `parallel: true` in `dispatch/4` is silently ignored for `{:duplicate, :key}` — all entries for a given key live in one partition, so there is nothing to parallelize across. This is inherent to key-based partitioning, not a limitation of the ordered_set change. In practice, even without parallelization, `:key` performs better on dispatch for hot keys, because `:pid` partitioning must fan out across all partitions and merge results, which adds more overhead than it saves through parallelism.

`:"$_"` behavior in `select/2` and `count_select/2`: the internal ETS representation changes from `{key, {pid, value}}` to `{{key, pid, counter}, value}`. Match specs that reference `:"$_"` (the whole object) will see the new shape. This is documented in the `start_link/1` docs with guidance to use named match variables (`:"$1"`, `:"$2"`, etc.) instead. This could be improved with a runtime error when `:"$_"` is used in select specs for `{:duplicate, :key}` registries, but that is left for a follow-up.

Reserved-atom keys (e.g., `:_`, `:"$1"`, `:"$2"`): when one of these atoms is used as a registry key, it clashes with ETS match spec syntax (`:_` is a wildcard, `:"$N"` are capture variables). The implementation falls back to using a wildcard in the match head and a guard-based equality check, which means ETS cannot use the partial key bound to seek in the tree — it must do a full table scan. This only affects users who register these specific atoms as keys, which is uncommon in practice.

All 216 parameterized duplicate registry tests pass (3 key types × 2 partition counts × 36 tests).
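The reserved-atom fallback can be sketched in plain ETS (table and entries are illustrative, not Registry's code): a literal `:_` cannot appear in the match head because ETS reads it as a wildcard, so the key slot is left as a variable and a `{:const, :_}` equality guard does the filtering, which rules out a tree seek.

```elixir
# Illustrative only: a registry key that collides with match spec syntax.
t = :ets.new(:example, [:ordered_set, :public])
pid = self()
:ets.insert(t, {{:_, pid, 1}, :v})
:ets.insert(t, {{:other, pid, 2}, :v})

# Variable in the key slot plus a {:const, :_} guard: matches the literal
# atom :_ without treating it as a wildcard, at the cost of a full scan.
spec = [{{{:"$1", :_, :_}, :_}, [{:"=:=", :"$1", {:const, :_}}], [:"$_"]}]
[{{:_, ^pid, 1}, :v}] = :ets.select(t, spec)
```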