11defmodule Lightning.LogLines.SearchVectorWorker do
22 @moduledoc """
3- Asynchronously backfills `log_lines.search_vector` for rows that were left
4- `NULL` at insert time.
5-
6- ## Why defer the tsvector?
7-
8- Computing the full-text `search_vector` synchronously (via an insert trigger)
9- put `to_tsvector` on the hot path of every log line write. Under heavy run
10- load that work serialises behind the worker's log firehose and slows
11- ingestion. A sibling migration removes the synchronous trigger, leaving the
12- column `NULL` on insert, and adds:
13-
14- * a `safe_to_tsvector(regconfig, text)` SQL function (tolerant of bad input);
15- * a partial index `... WHERE search_vector IS NULL` so finding pending rows
16- stays cheap.
17-
18- This worker then fills `search_vector` out-of-band. The read side
19- (`Lightning.Invocation`) queries with `to_tsquery('english_nostop', ...)`, so
20- this worker MUST build vectors with the matching `english_nostop` config,
21- otherwise searches would silently miss freshly-written log lines.
22-
23- ## Draining and snowballing
24-
25- Each run drains pending rows in bounded batches (`@batch_size` rows, up to
26- `@max_batches` per run). When a run consumes its full budget there is almost
27- certainly more backlog, so it enqueues an immediate follow-up job (a
28- "snowball") rather than waiting for the next 1-minute cron tick. This lets the
29- worker keep pace with bursty load while the dedicated `search_indexing` queue
30- (concurrency 1) plus job uniqueness keep the snowball self-limiting.
31-
32- The cron entry enqueues with default args; the snowball uses
33- `%{"trigger" => "snowball"}`. The differing `trigger` key produces a distinct
34- uniqueness key, so a queued snowball is never swallowed by the cron job (and
35- vice versa).
3+ Backfills the full-text `search_vector` on `log_lines` rows.
4+
5+ Log lines are inserted with `search_vector` left `NULL`; the vector is built
6+ here rather than on the insert path, keeping `to_tsvector` off the hot path of
7+ high-volume log ingestion. Search is eventually consistent as a result,
8+ typically catching up within a minute.
9+
10+ Two database objects support this: `safe_to_tsvector(regconfig, text)`, which
11+ builds the vector while tolerating NULL and oversized input, and a partial
12+ index over `search_vector IS NULL`, which keeps locating pending rows cheap as
13+ the table grows. Vectors use the `english_nostop` config to match the read
14+ side (`Lightning.Invocation`), which queries with
15+ `to_tsquery('english_nostop', ...)`.
16+
17+ Each run drains pending rows newest-first, in batches of `@batch_size` up to
18+ `@max_batches` per run. A run that exhausts its budget leaves backlog behind
19+ and enqueues an immediate follow-up ("snowball"); otherwise the once-a-minute cron
20+ tick keeps pace. The worker runs on the dedicated `search_indexing` queue at
21+ concurrency 1, so only one job executes at a time, and the cron tick and the
22+ snowball carry distinct `trigger` args, so job uniqueness allows one of each to
23+ queue but never a duplicate.
3624 """
3725
3826 use Oban.Worker,
3927 queue: :search_indexing,
4028 priority: 1,
4129 max_attempts: 10,
42- # `states` is restricted to the queued states on purpose. Oban's default
43- # unique states include `:executing` and `:completed`, which would make a
44- # running snowball job match *itself* when it tries to enqueue its
45- # successor, silently dedup the insert, and break the chain after a single
46- # hop. Limiting uniqueness to `:available`/`:scheduled` still guarantees at
47- # most one queued snowball (and one queued cron heartbeat, via the distinct
48- # `:trigger` key) while letting the executing job enqueue the next link.
30+ # Restrict uniqueness to queued states. Oban's defaults also dedup against
31+ # :executing/:completed, so a running snowball would match itself and fail
32+ # to enqueue its successor — breaking the chain after one hop.
4933 unique: [period: 55, keys: [:trigger], states: [:available, :scheduled]]
5034
5135 alias Lightning.Repo
5236
5337 require Logger
5438
55- # Rows to fill per batch.
5639 @batch_size 2_500
57- # Maximum batches to drain in a single run (per-run budget).
40+ # Per-run budget.
5841 @max_batches 10
5942
6043 @drain_sql """
@@ -80,9 +63,8 @@ defmodule Lightning.LogLines.SearchVectorWorker do
8063 end)
8164
8265 if budget_exhausted? do
83- # The run hit its per-run budget, so more backlog almost certainly
84- # remains. Snowball an immediate follow-up with a distinct uniqueness key
85- # so the cron job's uniqueness does not swallow it.
66+ # Budget exhausted, so backlog likely remains: enqueue an immediate
67+ # follow-up rather than waiting for the next cron tick.
8668 Oban.insert(Lightning.Oban, __MODULE__.new(%{"trigger" => "snowball"}))
8769 end
8870
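
The body of the drain loop is elided from this diff, so the following is only a minimal sketch of the batch budget described in the moduledoc, not the worker's actual implementation. `DrainSketch`, `drain/3`, and the `fill_batch` callback are hypothetical names; the callback stands in for whatever runs `@drain_sql` and returns the number of rows it updated. The sketch assumes a run counts as "budget exhausted" only when every one of its batches came back full.

```elixir
defmodule DrainSketch do
  @moduledoc false

  # Hypothetical stand-in for the worker's drain loop. `fill_batch` is any
  # 1-arity function that fills up to `batch_size` pending rows and returns
  # how many rows it actually updated.
  def drain(fill_batch, batch_size, max_batches)
      when is_function(fill_batch, 1) and batch_size > 0 and max_batches > 0 do
    do_drain(fill_batch, batch_size, max_batches, 0)
  end

  # Budget spent while every batch came back full: backlog likely remains.
  defp do_drain(_fill_batch, _batch_size, 0, total), do: {total, true}

  defp do_drain(fill_batch, batch_size, batches_left, total) do
    case fill_batch.(batch_size) do
      # A short batch means no pending rows are left; stop early.
      filled when filled < batch_size ->
        {total + filled, false}

      filled ->
        do_drain(fill_batch, batch_size, batches_left - 1, total + filled)
    end
  end
end

# Simulate a backlog of 30_000 pending rows with an Agent (assumed numbers).
{:ok, backlog} = Agent.start_link(fn -> 30_000 end)

fill = fn limit ->
  Agent.get_and_update(backlog, fn pending ->
    taken = min(pending, limit)
    {taken, pending - taken}
  end)
end

# First run: 10 full batches of 2_500, budget exhausted, so it would snowball.
IO.inspect(DrainSketch.drain(fill, 2_500, 10))
# Second run: two full batches, then an empty one, so no follow-up is needed.
IO.inspect(DrainSketch.drain(fill, 2_500, 10))
```

Under these assumptions the first call returns `{25_000, true}` and the second returns `{5_000, false}`. When the backlog is an exact multiple of the budget, the sketch still reports exhaustion; the follow-up job then finds nothing to do and exits after a single empty batch, which matches the "backlog likely remains" wording in the code comment.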