perf(addresses_events): shrink first_activity lookback to profiled late-arrival bound #9783
a-monteiro wants to merge 1 commit into
…te-arrival bound

Shrink the incremental re-scan window on the addresses_events <chain>_first_activity family from a hardcoded 7d to the CUR2-2678 profiled landing-lag bound: 2d for EVM chains, 3d for bnb.

These models append only genuinely-new addresses (anti-join against the existing table), so the wide window only ever served to catch late-arriving first transactions -- measured at p99.9 = minutes for EVM and 6.6h (worst tail 1.9d) for bnb, far inside the new bounds.

Parameterize the shared macro (lookback_days, default 2; bnb passes 3) and update the 15 inline chain models.
Correctness / validation risk worth double-checking: Since this changes the correctness boundary for every CI full-refresh/seed tests do not exercise this late-arrival boundary, and the PR evidence currently only covers
@tomfutago good call, here's the full per-chain run. Method (read-only, prod, dtrino UTC): for every affected chain, reproduce the model's incremental insert set against the live
WITH na AS (
    -- Addresses the old 7-day window would insert: senders not yet in the target table.
    SELECT et."from" AS a, MIN(et.block_time) AS ft
    FROM delta_prod.<chain>.transactions et
    LEFT JOIN hive.addresses_events_<chain>.first_activity f ON et."from" = f.address
    WHERE f.address IS NULL -- genuinely-new addresses only
      AND et.block_time >= date_trunc('day', now() - interval '7' day)
    GROUP BY et."from"
)
SELECT count(*) AS old_window_new_inserts,
       -- Inserts whose first tx falls before the start of the new, narrower window.
       count(*) FILTER (WHERE ft < date_trunc('day', now() - interval '<bound>' day)) AS missed_by_new_window,
       date_diff('hour', min(ft), now()) AS oldest_insert_age_h
FROM na;
Every chain: 0 missed, 0 mistimed. The oldest genuinely-new address any chain inserts right now is 4 hours old (abstract); everywhere else it's ≤1h.

Because these models run hourly and anti-join against the full target, in steady state the only thing a wider window can ever recover is the late-arrival tail, and the CUR2-2678 profile bounds that tail at p99.9 = minutes for EVM (worst single incident: base 21.1h) and 6.6h for bnb (worst observed 1.9d), all inside 2d/3d.

Caveat you're right to flag: this is a point-in-time snapshot, so it can't itself observe a multi-day harvester catch-up. That worst case is exactly what the late-arrival profile measures (and is why bnb gets 3d, not 2d).

If you'd prefer more margin on any specific chain, easy to bump: the macro is now parameterized ( (CI is green —
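As a worked example of how the window start used in the query above behaves, the following Trino sketch evaluates the 2-day and 3-day floors at one fixed run time. The timestamp is made up for illustration and is not taken from the PR or the profile.

```sql
-- Illustrative only: run_ts is a hypothetical run time, not a value from the PR.
WITH t (run_ts) AS (
    VALUES (TIMESTAMP '2025-01-10 15:00:00')
)
SELECT
    -- EVM bound: subtract 2 days, then floor to midnight.
    date_trunc('day', run_ts - interval '2' day) AS evm_window_start,
    date_diff('hour', date_trunc('day', run_ts - interval '2' day), run_ts) AS evm_window_hours,
    -- bnb bound: subtract 3 days, then floor to midnight.
    date_trunc('day', run_ts - interval '3' day) AS bnb_window_start,
    date_diff('hour', date_trunc('day', run_ts - interval '3' day), run_ts) AS bnb_window_hours
FROM t;
```

For a run at 15:00 the 2-day window starts at 2025-01-08 00:00 and spans 63 hours, and the 3-day window starts at 2025-01-07 00:00 and spans 87 hours. Because of the day floor, the effective lookback is always somewhere between the nominal bound and one day more.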

Shrinks the hardcoded interval '7' day re-scan window on the addresses_events_<chain>_first_activity family down to the landing-lag bound measured in the CUR2-2678 late-arrival profile: 2 days for EVM chains, 3 days for bnb. Parameterizes the shared addresses_events_first_activity macro (lookback_days, default 2; bnb caller passes 3) and updates the 15 inline chain models.

Owner-gated: a lookback window is a correctness contract, not just a perf knob, so this is intentionally a draft pending Dune curated-data review.

The window only ever existed to re-catch late-arriving first transactions. These models are anti-join appends (insert addresses not already in the table whose first tx is in the window), and they run ~hourly, so in steady state the table already holds every address whose first tx landed more than one run ago.
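A minimal sketch of what the parameterized macro could look like, for readers who have not opened the diff. Only the macro name, the lookback_days parameter, and its default of 2 come from this PR; the blockchain argument, the source() reference, and the column aliases are assumptions made for illustration and may not match the real macro.

```sql
{# Hypothetical sketch, not the actual macro body from this PR. #}
{% macro addresses_events_first_activity(blockchain, lookback_days=2) %}

SELECT
    et."from" AS address,
    MIN(et.block_time) AS first_block_time
FROM {{ source(blockchain, 'transactions') }} et  -- assumed source naming
{% if is_incremental() %}
-- Anti-join: keep only addresses the target table does not hold yet.
LEFT JOIN {{ this }} f
    ON et."from" = f.address
WHERE f.address IS NULL
    -- Re-scan only the late-arrival window (previously a hardcoded 7 days).
    AND et.block_time >= date_trunc('day', now() - interval '{{ lookback_days }}' day)
{% endif %}
GROUP BY et."from"

{% endmacro %}
```

Under these assumptions an EVM model would call the macro with the default, while the bnb model would pass lookback_days=3. On a full refresh the is_incremental() branch is skipped, so the initial build still scans full history.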
Why it's safe (proven on prod data)
Of the new addresses a 7-day incremental run would insert right now, every single one has a first_block_time within the last ~40 minutes; zero are older than even 2 days.

This matches the profile: EVM raw p99.9 landing lag = minutes (0% of rows >1d late); bnb p99.9 = 6.6h, worst observed tail 1.9d. The 2d/3d bounds clear those tails with margin (and the date_trunc('day', ...) floor adds another 0–1 day).

Measured A/B (component, 3 warm-run medians, checksum-forced over all output columns)
The bnb IO cut is diluted by a fixed ~12 GB self-scan of {{this}} (a separate structural cost tracked in CUR2-2784); EVM chains shrink to 2d and cut more. Across the ~28-chain family (~11 CPU-hrs/day, ~3.5 TB IO/day on spellbook-hourly) this is an estimated ~30–40% reduction (~3.5–4 CPU-hrs/day, ~1–1.4 TB/day).

Notes
opbnb is treated as an EVM L2 (2d), not bnb-L1's 3d; flag if you'd prefer 3d.
The check_seed regression test needs the full-history initial build (it asserts genesis-era seed addresses are present). This PR doesn't change that path, so CI behaves as before; the heaviest initial full-refresh (bnb, 13.3B rows → 555M groups) measured ~11 min wall / 1.7 TB / 0 spill, well inside the 90-min CI budget.

Towards CUR2-2809