Ident-1090
diff --git a/docs/backend/replay.md b/docs/backend/replay.md
Lines changed: 55 additions & 51 deletions
diff --git a/docs/frontend/trails-replay.md b/docs/frontend/trails-replay.md
Lines changed: 9 additions & 4 deletions
diff --git a/docs/getting-started/configuration.md b/docs/getting-started/configuration.md
Lines changed: 24 additions & 18 deletions
diff --git a/docs/operations/deployment.md b/docs/operations/deployment.md
Lines changed: 5 additions & 4 deletions
diff --git a/docs/operations/security.md b/docs/operations/security.md
Lines changed: 6 additions & 7 deletions
diff --git a/ident/src/data/replay.test.ts b/ident/src/data/replay.test.ts
Lines changed: 11 additions & 1 deletion
diff --git a/ident/src/data/replay.ts b/ident/src/data/replay.ts
Lines changed: 19 additions & 0 deletions
@@ -18,60 +18,66 @@ rebuilding trails from replay has to infer leg boundaries from the recorded data
 
 `identd` takes a snapshot of the aircraft present at most once per sample
 interval and appends it to an in-memory block that covers a fixed span of time,
-five minutes by default. A snapshot holds a timestamp and the set of aircraft
-visible at that moment. When a sample arrives that belongs to a later span, the
-open block is finalized and written, and a new one starts.
+five minutes. A snapshot holds a timestamp and the set of aircraft visible at
+that moment. Empty snapshots still matter because they prove the receiver was
+being sampled even when no aircraft were visible. When a sample arrives that
+belongs to a later span, the open block is finalized and written, and a new one
+starts.
 
 The block currently being filled is not listed and not served until it rolls
-over. The smallest thing a viewer can load is therefore one finalized block. Both
-the block length and the sample interval are configurable, with the block length
-bounded below at one minute so a block always spans more than a single sample.
+over. The smallest thing a viewer can load is therefore one finalized block. The
+sample interval is configurable; the block duration is fixed so storage paths,
+cache metadata, and frontend loading all agree about the same time grid.
 
 ## On-disk layout
 
-Blocks live in a single flat directory. Each finalized block is one
-zstd-compressed JSON file whose name encodes the time range it covers. An index
-file sits beside that directory and caches the list of blocks between restarts.
-
-At startup `identd` reads the index, scans the directory, and merges the two,
-preferring what the scan actually finds on disk. It does not decompress every
-block to validate it; the file name and size are enough to build the in-memory
-list, and decompressing the whole corpus on a cold boot would dominate startup
-time on modest hardware. A block is only read from disk when a viewer asks for
-it. If the index is missing, unreadable, or written in a version this build does
-not recognize, `identd` falls back to the directory scan and records a diagnostic
-rather than refusing to start.
-
-The block format carries its own version. A block whose version this build does
-not support is skipped, not deleted. Earlier behavior deleted mismatched blocks
-and could silently destroy recorded history across an upgrade or downgrade, so
-the current code never deletes a block on the basis of its version.
+Blocks are grouped by UTC day instead of all living in one directory. Each
+finalized block is one zstd-compressed JSON file whose name encodes the time
+range it covers. The grouping is for filesystem fanout and static serving; it is
+not a time-retention policy.
+
+Replay keeps cache manifests next to the blocks. The root cache is intentionally
+small: it records the covered days and the overall range, not every block. Each
+day cache records the blocks for that day. A valid cache lets startup avoid
+walking the full tree and statting every historical file, which matters on small
+receiver hosts. If the cache is missing or unreadable, an operator-controlled
+reindex setting decides whether `identd` scans filenames to rebuild the cache or
+starts with replay unavailable and records a diagnostic.
+
+The normal startup path trusts cache metadata. It does not decompress every block
+to validate it; the file name and size are enough to publish availability, and
+decompressing the whole corpus on a cold boot would dominate startup time on
+modest hardware. A block is only read from disk when a viewer asks for it. If a
+cached block is missing or a viewer reports that it could not be decoded, the
+cache is corrected for that day and a diagnostic is recorded instead of leaving
+the stale coverage in place.
 
 ## Retention
 
-Two limits bound disk use, and both are required when replay is enabled:
+Replay is bounded by a byte budget. The operator sets the high watermark for
+finalized blocks. When the estimated size rises above that watermark, `identd`
+removes the oldest cached blocks until usage falls below a lower target.
 
-- A byte budget caps the total size of finalized blocks. When the total would
-  exceed it, the oldest blocks are removed first until the total fits. This is
-  checked both before writing a new block and after.
-- An age cap sets the oldest a block may be. Blocks past that age are removed
-  regardless of how much room the byte budget has left.
-
-The two cover different failure modes. A byte budget alone does not bound how old
-data gets: on a quiet receiver the budget might never fill, leaving stale history
-around indefinitely. An age cap alone does not protect against disk exhaustion
-when traffic is unexpectedly heavy. Together the byte budget is the hard ceiling
-on space and the age cap sets the history window.
+Using two watermarks avoids deleting a single old block every time a new block
+rolls over near the limit. The tradeoff is that a cleanup pass can remove a
+batch of history at once. That is deliberate: it reduces metadata churn on
+storage that may be SD-card-backed. There is no separate age cap in this storage
+version, so a quiet receiver can keep old history as long as it fits inside the
+byte budget.
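
Review note: the two-watermark cleanup described in the added Retention text can be sketched as below. This is a minimal illustration, not the `identd` implementation; the `CachedBlock` and `CleanupPlan` shapes and the `planCleanup` name are hypothetical, and stopping at "at or below" the low target is an assumption about the exact boundary.

```typescript
// Hypothetical shape for a finalized block entry; field names are assumptions.
interface CachedBlock {
  name: string;
  startMs: number; // start of the time range the block covers
  bytes: number; // size of the compressed block on disk
}

interface CleanupPlan {
  remove: CachedBlock[];
  keep: CachedBlock[];
  bytesAfter: number;
}

// Plan one cleanup pass. Nothing is removed until usage exceeds the high
// watermark; once it does, the oldest blocks are dropped until usage is at or
// below the low target (high watermark times the low-watermark fraction).
function planCleanup(
  blocks: CachedBlock[],
  highWatermarkBytes: number,
  lowWatermarkFraction: number,
): CleanupPlan {
  const oldestFirst = [...blocks].sort((a, b) => a.startMs - b.startMs);
  let total = oldestFirst.reduce((sum, b) => sum + b.bytes, 0);
  if (total <= highWatermarkBytes) {
    return { remove: [], keep: oldestFirst, bytesAfter: total };
  }

  const lowTarget = highWatermarkBytes * lowWatermarkFraction;
  const remove: CachedBlock[] = [];
  for (const block of oldestFirst) {
    if (total <= lowTarget) break;
    remove.push(block);
    total -= block.bytes;
  }
  return { remove, keep: oldestFirst.slice(remove.length), bytesAfter: total };
}
```

The gap between the two thresholds is what turns many single-block deletions into one occasional batch, which is the metadata-churn tradeoff the paragraph describes.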
 
 ## Serving blocks
 
-Two endpoints make replay available to the frontend. One returns a manifest: the
-enabled flag, the time range covered, the block length, and the list of finalized
-blocks with their URLs and sizes. The other serves a single block file by name,
-after checking the requested name against the expected pattern so a request
-cannot reach outside the blocks directory. Finalized blocks are served as
-cacheable and immutable, since a block's contents are fixed once its time range
-has passed.
+Replay exposes a dynamic manifest endpoint plus a static artifact subtree. The
+manifest tells the frontend which finalized blocks are available for playback.
+The artifact subtree contains immutable block files and cache manifests that can
+be served directly by a reverse proxy. Dynamic repair endpoints live outside
+that subtree so a deployment can hand static replay artifacts to the proxy
+without hiding the `identd` APIs that still need application logic.
+
+When `identd` serves a block itself, it checks that the requested name has the
+date-partitioned shape that replay writes and that the block is present in its
+current cache. A name that does not match replay's own storage shape is rejected
+before it can become a filesystem lookup.
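
Review note: a sketch of the two checks described above, shape first and cache membership second. The exact name pattern is an assumption; the docs only say names are date-partitioned, so the `YYYY-MM-DD/<start>-<end>.json.zst` form, the `BLOCK_NAME` regex, and `resolveBlockName` are all illustrative.

```typescript
// Assumed name shape: "YYYY-MM-DD/<start>-<end>.json.zst". The real pattern
// used by identd may differ; only the date partition and fixed extension are
// taken from the docs.
const BLOCK_NAME = /^\d{4}-\d{2}-\d{2}\/\d+-\d+\.json\.zst$/;

// Returns the name to look up, or null when the request should be answered
// with not-found. Caller input never becomes a path unless both checks pass.
function resolveBlockName(
  requested: string,
  knownBlocks: ReadonlySet<string>,
): string | null {
  // Check 1: the name must match the shape replay writes. Anything with
  // relative segments or extra separators fails here.
  if (!BLOCK_NAME.test(requested)) return null;
  // Check 2: the name must be in the current cache of finalized blocks.
  if (!knownBlocks.has(requested)) return null;
  return requested;
}
```

Because the returned value is always a member of `knownBlocks`, the filesystem lookup only ever sees names the server produced itself.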
 
 ### Why blocks are negotiated, not decompressed
 
@@ -89,15 +95,13 @@ that common case.
 The server resolves this by content negotiation. When the request says it accepts
 zstd, the server sets the encoding header and ships the raw bytes; the browser
 decompresses them natively and JavaScript receives JSON. When the request does
-not say so — the plain-HTTP browser being the case that matters — the server
-ships the same raw bytes with no encoding header, and the frontend decompresses
-them itself. It decides which path applies by inspecting the first few bytes of
-the body for the zstd frame signature rather than trusting a response header,
-because a browser strips the encoding header once it has decoded a response and a
-cache may surface either form for the same URL. The frontend caps the size it
-will expand a block to, and a decode failure shows up as a diagnostic in the
-notification area rather than a silent blank.
+not say so, the server ships the same raw bytes with no encoding header, and the
+frontend decompresses them itself. The frontend decides which path applies by
+inspecting the first few bytes of the body for the zstd frame signature rather
+than trusting a response header, because a browser strips the encoding header
+once it has decoded a response and a cache may surface either form for the same
+URL. The frontend caps the size it will expand a block to, and a decode failure
+shows up as a diagnostic in the notification area rather than a silent blank.
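
Review note: the body sniffing described above can be sketched as follows. The zstd frame magic number is `0xFD2FB528`, stored little-endian as the bytes `28 B5 2F FD`. The function names are hypothetical, and this sketch only recognizes a standard zstd frame, not skippable frames.

```typescript
// First four bytes of a standard zstd frame (magic 0xFD2FB528, little-endian).
const ZSTD_MAGIC = [0x28, 0xb5, 0x2f, 0xfd] as const;

// True when the body starts with the zstd frame signature.
function looksLikeZstd(body: Uint8Array): boolean {
  if (body.length < ZSTD_MAGIC.length) return false;
  return ZSTD_MAGIC.every((byte, i) => body[i] === byte);
}

type BodyKind = "zstd" | "json";

// Decide how to handle a block body from its bytes alone, without consulting
// Content-Encoding, which the browser may already have stripped.
function classifyBlockBody(body: Uint8Array): BodyKind {
  return looksLikeZstd(body) ? "zstd" : "json";
}
```

A body classified as `"zstd"` would go through the frontend's own decompressor; anything else is treated as already-decoded JSON and parsed directly.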
 
 Treating a wildcard or an explicit request for no encoding as "does not accept
 zstd" is deliberate: a wildcard only says unlisted encodings are acceptable, not
 
@@ -37,6 +37,11 @@ immutable once written, so they are cached aggressively in the browser; the
 manifest itself is fetched without caching so a freshly recorded block becomes
 visible.
 
+The scrubber treats availability as coverage, not progress. A highlighted
+segment means Ident has a finalized block for that part of the selected window.
+Gaps can appear inside the same replay window when old blocks were cleaned up or
+when a cache repair removed a bad block.
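
Review note: one way to derive the gaps the scrubber leaves unhighlighted is sketched below. The `BlockRange` shape and `findCoverageGaps` name are assumptions for illustration; ranges are treated as half-open, with `endMs` exclusive.

```typescript
// Hypothetical time range in milliseconds; endMs is exclusive.
interface BlockRange {
  startMs: number;
  endMs: number;
}

// Return the parts of [windowStartMs, windowEndMs) not covered by any block.
// The input is sorted defensively rather than assumed to arrive in order.
function findCoverageGaps(
  blocks: BlockRange[],
  windowStartMs: number,
  windowEndMs: number,
): BlockRange[] {
  const sorted = [...blocks].sort((a, b) => a.startMs - b.startMs);
  const gaps: BlockRange[] = [];
  let cursor = windowStartMs; // everything before the cursor is accounted for

  for (const block of sorted) {
    if (block.endMs <= cursor) continue; // already behind the cursor
    if (block.startMs >= windowEndMs) break; // past the window
    if (block.startMs > cursor) {
      gaps.push({ startMs: cursor, endMs: block.startMs });
    }
    cursor = Math.max(cursor, block.endMs);
    if (cursor >= windowEndMs) break;
  }

  if (cursor < windowEndMs) {
    gaps.push({ startMs: cursor, endMs: windowEndMs });
  }
  return gaps;
}
```

With blocks covering 0 to 300 000 ms and 600 000 to 900 000 ms, a window of 0 to 900 000 ms yields a single gap from 300 000 to 600 000 ms, which is the case where a cleaned-up or repaired block leaves a hole mid-window.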
+
 The block list is held in time order, and the code that finds which blocks
 cover a requested range, and that later detects gaps between them, depends on
 that ordering. Rather than trust the manifest to arrive sorted, the list is
@@ -52,10 +57,10 @@ up fetches that will land too late to matter.
 
 Block failures are handled differently depending on the cause. Bytes that
 arrive but cannot be decoded as a valid block surface an error to the user and
-are not retried, because re-fetching the same bad bytes would not help. A
-failed or rejected request, by contrast, refreshes the manifest and retries,
-since the backend may have rotated the file and a newer manifest can point at a
-working URL.
+are reported back to `identd` so stale coverage can be repaired. A failed or
+rejected request, by contrast, refreshes the manifest and retries, since the
+backend may have rotated the file and a newer manifest can point at a working
+URL.
 
 ## Reconstructing replay trails
 
 
@@ -83,41 +83,47 @@ Disabling the restart cache keeps trails memory-only; they will be lost when
 
 Replay is opt-in because it writes longer-lived history blocks. When enabled,
 `identd` samples live `aircraft.json`, closes one compressed block every five
-minutes, writes an index, and prunes old blocks by both age and byte budget.
-The byte budget is mandatory so a misconfigured receiver cannot fill the host
-disk.
+minutes, writes cache manifests, and prunes old blocks by byte budget. The byte
+budget is mandatory so a misconfigured receiver cannot fill the host disk.
 
 ```sh
 IDENT_REPLAY_ENABLE=true
 IDENT_REPLAY_DIR=/var/lib/ident/replay
-IDENT_REPLAY_RETENTION_SEC=259200
 IDENT_REPLAY_MAX_BYTES=524288000
-IDENT_REPLAY_BLOCK_SEC=300
+IDENT_REPLAY_CLEANUP_LOW_WATERMARK=0.90
+IDENT_REPLAY_CACHE_REINDEX=true
 IDENT_REPLAY_SAMPLE_INTERVAL_SEC=5
 ```
 
-With the example above, Ident keeps up to three days of replay data and never
-keeps more than 500 MiB of finalized blocks. The currently open block is not
-listed or served until it rolls over, so the smallest replay unit is five
-minutes. See [Replay history](/backend/replay) for how blocks are recorded and
-served.
+With the example above, Ident treats 500 MiB as the high watermark. When the
+estimated finalized replay size exceeds that value, it may delete the oldest
+cached blocks until the estimate falls below 90% of the byte budget. The
+currently open block is not listed or served until it rolls over, so the
+smallest replay unit is five minutes. See [Replay history](/backend/replay) for
+how blocks are recorded and served.
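
Review note: the arithmetic behind the example settings, as a small sketch. The variable names come from the config block above; the `replayLimitsFromEnv` helper, the `ReplayLimits` shape, and the accepted range for the low watermark (strictly between 0 and 1) are assumptions, not documented behavior.

```typescript
interface ReplayLimits {
  highWatermarkBytes: number; // cleanup starts above this
  lowTargetBytes: number; // cleanup stops once usage reaches this
}

// Derive cleanup thresholds from the environment. The byte budget is
// mandatory per the docs; the validation rules here are illustrative.
function replayLimitsFromEnv(
  env: Record<string, string | undefined>,
): ReplayLimits {
  const maxBytes = Number(env["IDENT_REPLAY_MAX_BYTES"]);
  if (!Number.isSafeInteger(maxBytes) || maxBytes <= 0) {
    throw new Error("IDENT_REPLAY_MAX_BYTES must be a positive integer");
  }
  const low = Number(env["IDENT_REPLAY_CLEANUP_LOW_WATERMARK"]);
  if (!(low > 0 && low < 1)) {
    throw new Error("IDENT_REPLAY_CLEANUP_LOW_WATERMARK must be between 0 and 1");
  }
  return {
    highWatermarkBytes: maxBytes,
    lowTargetBytes: Math.floor(maxBytes * low),
  };
}
```

For the example values, 524288000 bytes is 500 MiB and the 0.90 fraction puts the low target at about 450 MiB, so a cleanup pass frees roughly 50 MiB of the oldest history.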
 
 ## Serving replay blocks through a reverse proxy
 
-`identd` can serve replay blocks itself through `/api/replay/blocks/*`. Replay
-blocks are JSON compressed with zstd. For busy public displays, put the replay
-directory behind the reverse proxy and let the proxy serve finalized
-`.json.zst` files directly:
+`identd` can serve replay artifacts itself through `/api/replay/blocks/*`.
+Replay blocks are JSON compressed with zstd, while `manifest.cache.json` files
+are ordinary JSON. For busy public displays, put the replay `blocks` directory
+behind the reverse proxy and let the proxy serve finalized artifacts directly:
 
 ```text
-@accepts_zstd header Accept-Encoding *zstd*
-
 handle_path /api/replay/blocks/* {
 	root * /var/lib/ident/replay/blocks
-	header Content-Type application/octet-stream
+
+	@zstd_block path *.zst
+	header @zstd_block Content-Type application/octet-stream
+	header @zstd_block Cache-Control "public, max-age=31536000, immutable"
+
+	@accepts_zstd {
+		path *.zst
+		header Accept-Encoding *zstd*
+	}
 	header @accepts_zstd Content-Type application/json
 	header @accepts_zstd Content-Encoding zstd
-	header Cache-Control "public, max-age=31536000, immutable"
+
 	file_server
 }
 
 
@@ -72,10 +72,11 @@ works at the root or behind a prefix without further configuration.
 
 ## Serving replay blocks from the proxy
 
-Ident can serve finalized replay blocks itself. For a busy public display, those
-files can instead be served straight from disk by the reverse proxy, taking that
-I/O off Ident. This works because finalized blocks are immutable files on disk,
-but it carries a constraint that is easy to miss.
+Ident can serve finalized replay artifacts itself. For a busy public display,
+those files can instead be served straight from disk by the reverse proxy,
+taking that I/O off Ident. This works because finalized blocks are immutable
+files on disk and cache manifests are ordinary JSON, but it carries a constraint
+that is easy to miss.
 
 The blocks are stored as raw zstd-compressed JSON, and Ident's own handler
 negotiates how to deliver them. When a client advertises that it accepts zstd,
 
@@ -45,13 +45,12 @@ cannot alter what the receiver produces.
 When replay is enabled, finalized blocks are served from disk by name under a
 fixed endpoint prefix. Two checks stand between a request and the filesystem.
 The requested name must match the exact shape `identd` gives its own blocks (a
-plain numeric pattern with a fixed extension, no path separators or relative
-segments), and a name that passes that check must also be present in the
-in-memory index of blocks `identd` has actually written. A crafted name aimed at
-escaping the blocks directory fails the first check; a well-formed name for a
-file `identd` never produced fails the second. Both paths return a not-found
-result before any filesystem path is built from caller input. Tests cover a
-traversal attempt against this endpoint.
+UTC day path and a fixed extension, with no relative segments), and a name that
+passes that check must also be present in the in-memory cache of finalized
+blocks. A crafted name aimed at escaping the blocks directory fails the first
+check; a well-formed name for a file `identd` never produced fails the second.
+Both paths return a not-found result before caller input can choose an arbitrary
+filesystem path.
 
 ## Outbound network access
 
 
@@ -574,7 +574,17 @@ describe("replay data loading", () => {
     await refreshReplayManifest();
     await ensureReplayRange(120_000, 180_000, { background: true });
 
-    expect(globalThis.fetch).toHaveBeenCalledTimes(3);
+    expect(globalThis.fetch).toHaveBeenCalledTimes(4);
+    expect(globalThis.fetch).toHaveBeenCalledWith(
+      "/ident/api/replay/block-failure",
+      expect.objectContaining({
+        method: "POST",
+        body: JSON.stringify({
+          url: "/api/replay/blocks/120000-180000.json.zst",
+          reason: "decode_failed",
+        }),
+      }),
+    );
     expect(warn).not.toHaveBeenCalledWith(
       "[ident replay] background block load failed",
       expect.any(Error),
 
@@ -212,6 +212,7 @@ function loadReplayBlock(block: ReplayBlockIndex): Promise<void> | null {
         throw err;
       }
       if (err instanceof ReplayBlockFormatError) {
+        void reportReplayBlockFailure(block.url, "decode_failed");
         emitFrontendDiagnostic({
           severity: "warning",
           channel: "frontend.replay",
@@ -221,6 +222,7 @@ function loadReplayBlock(block: ReplayBlockIndex): Promise<void> | null {
         throw err;
       }
       if (err instanceof ReplayBlockBodyError) {
+        void reportReplayBlockFailure(block.url, "decode_failed");
         emitFrontendDiagnostic({
           severity: "warning",
           channel: "frontend.replay",
@@ -249,6 +251,23 @@ function loadReplayBlock(block: ReplayBlockIndex): Promise<void> | null {
   return load;
 }
 
+async function reportReplayBlockFailure(
+  url: string,
+  reason: "decode_failed" | "missing",
+): Promise<void> {
+  try {
+    await fetch(appPath("api/replay/block-failure"), {
+      method: "POST",
+      headers: { "Content-Type": "application/json" },
+      cache: "no-store",
+      body: JSON.stringify({ url, reason }),
+    });
+  } catch {
+    // Best-effort cache repair signal only; replay loading already reports
+    // the user-visible diagnostic above.
+  }
+}
+
 function abortStaleBlockLoads(
   manifestBlocks: ReplayBlockIndex[],
   requestedBlocks: ReplayBlockIndex[],