Refine comments and moduledoc for deferred log_lines indexing

stuartc · stuartc · commit cae40bc438e9 · 2026-05-30T20:31:07.000+02:00
Trim hindsight/diff-narrating comments down to what's non-obvious, and
rewrite the SearchVectorWorker moduledoc to read as documentation of the
mechanism rather than a justification of the change.
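
For reviewers, the drain-and-snowball behaviour that the rewritten moduledoc
describes can be sketched in plain Elixir. This is a minimal sketch under
stated assumptions, not the worker's code: `DrainSketch`, `run/1` and the
integer `pending` count are hypothetical, the SQL drain and Oban are replaced
by arithmetic, and "budget exhausted" is assumed to mean that every one of the
`@max_batches` batches came back full.

```elixir
defmodule DrainSketch do
  # Hypothetical stand-in for the worker's drain loop (illustration only).
  # The constants mirror the values shown in the diff.
  @batch_size 2_500
  @max_batches 10

  # `pending` is an assumed count of rows whose search_vector is still NULL.
  # Returns {rows_filled, snowball?}, where snowball? says whether a
  # follow-up job would be enqueued.
  def run(pending) when is_integer(pending) and pending >= 0 do
    drain(pending, 0, 0)
  end

  # All batches in the budget were full: assume backlog remains and snowball.
  defp drain(_pending, filled, batches) when batches >= @max_batches do
    {filled, true}
  end

  defp drain(pending, filled, batches) do
    n = min(pending, @batch_size)

    if n < @batch_size do
      # A short (or empty) batch means the backlog is cleared: no snowball.
      {filled + n, false}
    else
      drain(pending - n, filled + n, batches + 1)
    end
  end
end
```

Under these assumptions `DrainSketch.run(6_000)` returns `{6_000, false}`,
so the next cron tick is enough, while `DrainSketch.run(30_000)` returns
`{25_000, true}`: the run stops at its budget and would snowball to pick up
the remaining 5,000 rows.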
diff --git a/lib/lightning/log_lines/search_vector_worker.ex b/lib/lightning/log_lines/search_vector_worker.ex
@@ -1,60 +1,43 @@
 defmodule Lightning.LogLines.SearchVectorWorker do
   @moduledoc """
-  Asynchronously backfills `log_lines.search_vector` for rows that were left
-  `NULL` at insert time.
-
-  ## Why defer the tsvector?
-
-  Computing the full-text `search_vector` synchronously (via an insert trigger)
-  put `to_tsvector` on the hot path of every log line write. Under heavy run
-  load that work serialises behind the worker's log firehose and slows
-  ingestion. A sibling migration removes the synchronous trigger, leaving the
-  column `NULL` on insert, and adds:
-
-    * a `safe_to_tsvector(regconfig, text)` SQL function (tolerant of bad input);
-    * a partial index `... WHERE search_vector IS NULL` so finding pending rows
-      stays cheap.
-
-  This worker then fills `search_vector` out-of-band. The read side
-  (`Lightning.Invocation`) queries with `to_tsquery('english_nostop', ...)`, so
-  this worker MUST build vectors with the matching `english_nostop` config,
-  otherwise searches would silently miss freshly-written log lines.
-
-  ## Draining and snowballing
-
-  Each run drains pending rows in bounded batches (`@batch_size` rows, up to
-  `@max_batches` per run). When a run consumes its full budget there is almost
-  certainly more backlog, so it enqueues an immediate follow-up job (a
-  "snowball") rather than waiting for the next 1-minute cron tick. This lets the
-  worker keep pace with bursty load while the dedicated `search_indexing` queue
-  (concurrency 1) plus job uniqueness keep the snowball self-limiting.
-
-  The cron entry enqueues with default args; the snowball uses
-  `%{"trigger" => "snowball"}`. The differing `trigger` key produces a distinct
-  uniqueness key, so a queued snowball is never swallowed by the cron job (and
-  vice versa).
+  Backfills the full-text `search_vector` on `log_lines` rows.
+
+  Log lines are inserted with `search_vector` left `NULL`; the vector is built
+  here rather than on the insert path, keeping `to_tsvector` off the hot path of
+  high-volume log ingestion. Search is eventually consistent as a result,
+  typically catching up within a minute.
+
+  Two database objects support this: `safe_to_tsvector(regconfig, text)`, which
+  builds the vector while tolerating NULL and oversized input, and a partial
+  index over `search_vector IS NULL`, which keeps locating pending rows cheap as
+  the table grows. Vectors use the `english_nostop` config to match the read
+  side (`Lightning.Invocation`), which queries with
+  `to_tsquery('english_nostop', ...)`.
+
+  Each run drains pending rows newest-first, in batches of `@batch_size` up to
+  `@max_batches` per run. A run that exhausts its budget leaves backlog behind
+  and enqueues an immediate follow-up ("snowball"); otherwise the once-a-minute
+  cron tick keeps pace. The worker runs on the dedicated `search_indexing`
+  queue at concurrency 1, so only one job executes at a time. The cron tick and
+  the snowball carry distinct `trigger` args, so job uniqueness allows one of
+  each to queue but never a duplicate.
   """
 
   use Oban.Worker,
     queue: :search_indexing,
     priority: 1,
     max_attempts: 10,
-    # `states` is restricted to the queued states on purpose. Oban's default
-    # unique states include `:executing` and `:completed`, which would make a
-    # running snowball job match *itself* when it tries to enqueue its
-    # successor, silently dedup the insert, and break the chain after a single
-    # hop. Limiting uniqueness to `:available`/`:scheduled` still guarantees at
-    # most one queued snowball (and one queued cron heartbeat, via the distinct
-    # `:trigger` key) while letting the executing job enqueue the next link.
+    # Restrict uniqueness to queued states. Oban's defaults also dedup against
+    # :executing/:completed, so a running snowball would match itself and fail
+    # to enqueue its successor — breaking the chain after one hop.
     unique: [period: 55, keys: [:trigger], states: [:available, :scheduled]]
 
   alias Lightning.Repo
 
   require Logger
 
-  # Rows to fill per batch.
   @batch_size 2_500
-  # Maximum batches to drain in a single run (per-run budget).
+  # Per-run budget.
   @max_batches 10
 
   @drain_sql """
@@ -80,9 +63,8 @@ defmodule Lightning.LogLines.SearchVectorWorker do
     end)
 
     if budget_exhausted? do
-      # The run hit its per-run budget, so more backlog almost certainly
-      # remains. Snowball an immediate follow-up with a distinct uniqueness key
-      # so the cron job's uniqueness does not swallow it.
+      # Budget exhausted, so backlog likely remains: enqueue an immediate
+      # follow-up rather than waiting for the next cron tick.
       Oban.insert(Lightning.Oban, __MODULE__.new(%{"trigger" => "snowball"}))
     end
 
diff --git a/priv/repo/migrations/20260530091125_add_safe_to_tsvector_function.exs b/priv/repo/migrations/20260530091125_add_safe_to_tsvector_function.exs
@@ -2,10 +2,9 @@ defmodule Lightning.Repo.Migrations.AddSafeToTsvectorFunction do
   use Ecto.Migration
 
   def up do
-    # Not STRICT: a STRICT function returns NULL (without running) when `doc` is
-    # NULL, which would leave the row's search_vector NULL forever and stuck in
-    # the pending index. COALESCE the doc instead so the function always yields
-    # a non-NULL tsvector. CREATE OR REPLACE keeps the migration re-runnable.
+    # Deliberately not STRICT: a STRICT function returns NULL for a NULL doc,
+    # which would leave search_vector NULL forever and stuck in the pending
+    # index. COALESCE instead so the result is always a non-NULL tsvector.
     execute("""
     CREATE OR REPLACE FUNCTION safe_to_tsvector(config regconfig, doc text) RETURNS tsvector
     LANGUAGE plpgsql IMMUTABLE AS $$
diff --git a/priv/repo/migrations/20260530091126_add_log_lines_pending_search_index.exs b/priv/repo/migrations/20260530091126_add_log_lines_pending_search_index.exs
@@ -7,7 +7,7 @@ defmodule Lightning.Repo.Migrations.AddLogLinesPendingSearchIndex do
   @num_partitions 100
 
   def up do
-    # Create the partial index on the parent table (ONLY), unattached so far.
+    # Parent-only partial index (ON ONLY); partition indexes attach below.
     execute("""
     CREATE INDEX IF NOT EXISTS log_lines_pending_search_idx
     ON ONLY log_lines (timestamp)
@@ -31,11 +31,9 @@ defmodule Lightning.Repo.Migrations.AddLogLinesPendingSearchIndex do
   end
 
   defp create_partition_index(_num_partitions, part_num) do
-    # A failed CREATE INDEX CONCURRENTLY leaves an INVALID index behind. The
-    # IF NOT EXISTS below would then skip rebuilding it, and the subsequent
-    # ATTACH would leave the parent permanently INVALID. Drop any invalid
-    # leftover first so a re-run rebuilds it cleanly. (A valid index is kept;
-    # only invalid ones are dropped.)
+    # A failed CREATE INDEX CONCURRENTLY leaves an INVALID index that IF NOT
+    # EXISTS would skip, so the ATTACH below would leave the parent INVALID.
+    # Drop any invalid leftover first so a re-run rebuilds cleanly.
     execute("""
     DO $$
     BEGIN
diff --git a/test/lightning/log_lines/search_vector_worker_test.exs b/test/lightning/log_lines/search_vector_worker_test.exs
@@ -18,9 +18,8 @@ defmodule Lightning.LogLines.SearchVectorWorkerTest do
     %{run: run}
   end
 
-  # Inserts a log line via the public API. With the synchronous trigger removed
-  # this leaves `search_vector` NULL, which is exactly the pending state the
-  # worker drains.
+  # Inserts via the public API, which leaves `search_vector` NULL — the pending
+  # state the worker drains.
   defp append_log(run, message) do
     {:ok, log_line} =
       Runs.append_run_log(run, %{
@@ -74,14 +73,12 @@ defmodule Lightning.LogLines.SearchVectorWorkerTest do
           append_log(run, "logline number #{n} doing work").id
         end
 
-      # Freshly inserted lines start out unindexed (deferred computation).
       for id <- ids do
         assert %{null?: true, matches?: false} = search_vector_state(id)
       end
 
       assert {:ok, 5} = perform_job(SearchVectorWorker, %{})
 
-      # After draining, every row has a populated, matching search_vector.
       for id <- ids do
         assert %{null?: false, matches?: true} = search_vector_state(id)
       end
@@ -145,11 +142,9 @@ defmodule Lightning.LogLines.SearchVectorWorkerTest do
   end
 
   describe "snowball uniqueness" do
-    # Regression: Oban's default unique states include :executing and :completed,
-    # so a running snowball job (state :executing) matched *itself* when it tried
-    # to enqueue its successor — the insert was silently deduped and the chain
-    # died after one hop. The worker restricts uniqueness to the queued states so
-    # an executing job can always enqueue the next link.
+    # Guards the snowball chain: an executing job must be able to enqueue its
+    # successor. Oban's default unique states include :executing, so a snowball
+    # would otherwise match itself and the chain would die after one hop.
     test "an executing snowball does not block enqueuing its successor" do
       Oban.Testing.with_testing_mode(:manual, fn ->
         {:ok, running} =
diff --git a/test/lightning/runs_test.exs b/test/lightning/runs_test.exs
@@ -921,9 +921,8 @@ defmodule Lightning.RunsTest do
           timestamp: DateTime.utc_now() |> DateTime.to_unix(:millisecond)
         })
 
-      # The synchronous trigger is gone: search_vector is computed out-of-band
-      # by Lightning.LogLines.SearchVectorWorker, so it starts NULL and is not
-      # yet matched by a full-text query.
+      # search_vector is computed out-of-band by SearchVectorWorker, so it
+      # starts NULL and isn't matched by a full-text query yet.
       assert search_vector_null?(log_line.id)
       refute log_line_searchable?(log_line.id, "searchable")
     end
@@ -953,12 +952,10 @@ defmodule Lightning.RunsTest do
       assert length(log_lines) == 3
       assert Enum.map(log_lines, & &1.message) == Enum.map(entries, & &1.message)
 
-      # Each inserted line broadcasts a LogAppended event.
       for _ <- entries do
         assert_received %Runs.Events.LogAppended{}
       end
 
-      # All rows are persisted with a NULL search_vector (deferred indexing).
       for log_line <- log_lines do
         assert search_vector_null?(log_line.id)
         refute log_line_searchable?(log_line.id, "logline")