perf(addresses_events): shrink first_activity lookback to profiled late-arrival bound by a-monteiro · Pull Request #9783 · duneanalytics/spellbook

a-monteiro · 2026-06-13T21:47:07Z

Shrinks the hardcoded interval '7' day re-scan window on the addresses_events_<chain>_first_activity family down to the landing-lag bound measured in the CUR2-2678 late-arrival profile: 2 days for EVM chains, 3 days for bnb. Parameterizes the shared addresses_events_first_activity macro (lookback_days, default 2; bnb caller passes 3) and updates the 15 inline chain models.

Owner-gated: a lookback window is a correctness contract, not just a perf knob, so this is intentionally a draft pending Dune curated-data review. The window only ever existed to re-catch late-arriving first transactions. These models are anti-join appends (insert addresses not already in the table whose first tx is in the window), and they run ~hourly, so in steady state the table already holds every address whose first tx landed more than one run ago.
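The append semantics described above can be sketched as a toy model in Python. This is not the dbt macro: `incremental_run`, the dict standing in for the first_activity table, and the sample addresses and lags are all hypothetical, chosen only to show which rows a 2-day versus a 7-day window would pick up.

```python
from datetime import datetime, timedelta


def incremental_run(table, source_rows, now, lookback_days):
    """Toy anti-join append (hypothetical helper, not the real macro).

    table       -- dict address -> first_block_time, stands in for {{this}}
    source_rows -- iterable of (address, block_time) visible in the source
    """
    # Window floor, mirroring date_trunc('day', now() - interval 'N' day).
    cutoff = (now - timedelta(days=lookback_days)).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    candidates = {}
    for address, block_time in source_rows:
        if block_time < cutoff:
            continue  # outside the re-scan window
        if address in table:
            continue  # anti-join: address already recorded
        if address not in candidates or block_time < candidates[address]:
            candidates[address] = block_time  # keep the earliest tx seen
    table.update(candidates)
    return candidates


now = datetime(2026, 6, 13, 12, 0)
rows = [
    ("0xnew", now - timedelta(minutes=20)),  # ordinary fresh address
    ("0xlate", now - timedelta(hours=30)),   # first tx landed 30h late
    ("0xstale", now - timedelta(days=4)),    # first tx landed 4d late (made up)
]

inserted_2d = incremental_run({}, rows, now, lookback_days=2)
inserted_7d = incremental_run({}, rows, now, lookback_days=7)
print(sorted(inserted_2d))  # ['0xlate', '0xnew']
print(sorted(inserted_7d))  # ['0xlate', '0xnew', '0xstale']
```

In this sketch the only row the wider window recovers is the one that arrived more than the lookback late, which is why the measured landing-lag tail is the quantity that decides whether the shrink is safe.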

Why it's safe (proven on prod data)

Of the new addresses a 7-day incremental run would insert right now, every single one has a first_block_time within the last ~40 minutes — zero are older than even 2 days:

chain	new addrs a 7d run inserts now	within 2d	older than 2d
bnb	13,407	13,407	0
base	1,642	1,642	0

This matches the profile: EVM raw p99.9 landing lag = minutes (0% of rows >1d late); bnb p99.9 = 6.6h, worst observed tail 1.9d. The 2d/3d bounds clear those tails with margin (and the date_trunc('day', ...) floor adds another 0–1 day).
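To make the "another 0–1 day" remark concrete, here is a small sketch of how far back the window actually reaches at different run times. It assumes the window start is computed as date_trunc('day', now() - interval 'N' day); the helper names and timestamps are illustrative only.

```python
from datetime import datetime, timedelta


def window_start(now, lookback_days):
    """Equivalent of date_trunc('day', now() - interval 'N' day)."""
    shifted = now - timedelta(days=lookback_days)
    return shifted.replace(hour=0, minute=0, second=0, microsecond=0)


def effective_lookback_hours(now, lookback_days):
    """Hours actually covered by the window at run time `now`."""
    return (now - window_start(now, lookback_days)).total_seconds() / 3600


midnight = datetime(2026, 6, 13, 0, 0)
evening = datetime(2026, 6, 13, 18, 0)

print(effective_lookback_hours(midnight, 2))  # 48.0
print(effective_lookback_hours(evening, 2))   # 66.0
print(effective_lookback_hours(midnight, 3))  # 72.0
```

The minimum coverage is exactly the nominal bound (a run at midnight), so the worst observed bnb tail of 1.9d (45.6h) sits inside the 72h floor of the 3d window.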

Measured A/B (component, 3 warm-run medians, checksum-forced over all output columns)

chain (shrink)	CPU	IO (scan)	wall	peak mem	spill
bnb 7d→3d	457 → 320 s (−30%)	58.1 → 45.9 GB (−21%)	16.8 → 10.6 s (−37%)	6.6 → 6.0 GB	0
base 7d→2d	202 → 119 s (−41%)	16.7 → 11.0 GB (−34%)	8.8 → 4.4 s (−50%)	flat	0
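The percentages in the table can be re-derived from the before/after values. The snippet below is just that arithmetic, using the numbers as quoted; `pct_change` and the dictionary layout are illustrative.

```python
def pct_change(before, after):
    """Relative change in percent, rounded to the nearest integer."""
    return round((after - before) / before * 100)


# (before, after) pairs copied from the A/B table above.
measurements = {
    ("bnb", "CPU s"): (457, 320),
    ("bnb", "IO GB"): (58.1, 45.9),
    ("bnb", "wall s"): (16.8, 10.6),
    ("base", "CPU s"): (202, 119),
    ("base", "IO GB"): (16.7, 11.0),
    ("base", "wall s"): (8.8, 4.4),
}

changes = {key: pct_change(*pair) for key, pair in measurements.items()}
for (chain, metric), change in changes.items():
    print(f"{chain:5s} {metric:7s} {change:+d}%")
```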

The bnb IO cut is diluted by a fixed ~12 GB self-scan of {{this}} (a separate structural cost tracked in CUR2-2784); EVM chains shrink to 2d and cut more. Across the ~28-chain family (~11 CPU-hrs/day, ~3.5 TB IO/day on spellbook-hourly) this is an estimated ~30–40% reduction (~3.5–4 CPU-hrs/day, ~1–1.4 TB/day).

Notes

opbnb is treated as an EVM L2 (2d), not bnb-L1's 3d — flag if you'd prefer 3d.
No CI floor added: these models' full-refresh path was already unbounded, and the check_seed regression test needs the full-history initial build (it asserts genesis-era seed addresses are present). This PR doesn't change that path, so CI behaves as before; the heaviest initial full-refresh (bnb, 13.3B rows → 555M groups) measured ~11 min wall / 1.7 TB / 0 spill, well inside the 90-min CI budget.

Towards CUR2-2809

…te-arrival bound

Shrink the incremental re-scan window on the addresses_events_<chain>_first_activity family from a hardcoded 7d to the CUR2-2678 profiled landing-lag bound: 2d for EVM chains, 3d for bnb. These models append only genuinely-new addresses (anti-join against the existing table), so the wide window only ever served to catch late-arriving first transactions -- measured at p99.9 = minutes for EVM and 6.6h (worst tail 1.9d) for bnb, far inside the new bounds. Parameterize the shared macro (lookback_days, default 2; bnb passes 3) and update the 15 inline chain models.

tomfutago · 2026-06-18T08:32:01Z

Correctness / validation risk worth double-checking:

Since this changes the correctness boundary for every addresses_events_*_first_activity incremental model, can we add the same old-7d vs new-window comparison for all affected chains, or at least publish a per-chain table with old_window_new_inserts, missed_by_new_window, and max first_block_time age?

CI full-refresh/seed tests do not exercise this late-arrival boundary, and the PR evidence currently only covers bnb and base.

a-monteiro · 2026-06-20T12:41:46Z

@tomfutago good call — here's the full per-chain run.

Method (read-only, prod, dtrino UTC): for every affected chain, reproduce the model's incremental insert set against the live first_activity table —

WITH na AS (
  SELECT et."from" AS a, MIN(et.block_time) AS ft
  FROM delta_prod.<chain>.transactions et
  LEFT JOIN hive.addresses_events_<chain>.first_activity f ON et."from" = f.address
  WHERE f.address IS NULL                                   -- genuinely-new addresses only
    AND et.block_time >= date_trunc('day', now() - interval '7' day)
  GROUP BY et."from")
SELECT count(*)                                                          AS old_window_new_inserts,
       count(*) FILTER (WHERE ft < date_trunc('day', now() - interval '<bound>' day)) AS missed_by_new_window,
       date_diff('hour', min(ft), now())                                AS oldest_insert_age_h
FROM na;

missed_by_new_window counts new addresses whose earliest 7d tx is older than the new bound — i.e. every row the shrunk window would either drop or record a later first_block_time for. Bound = 3d for bnb, 2d for all other (EVM) chains.
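The three aggregates can be mirrored in plain Python for a quick sanity check of the classification rule. This is a sketch over made-up timestamps, not a replacement for the prod query; `summarize` is a hypothetical helper, and the age is truncated to whole hours on the assumption that date_diff('hour', ...) behaves that way.

```python
from datetime import datetime, timedelta


def summarize(first_times, now, bound_days):
    """Mirror the query's aggregates for a list of per-address first tx times."""
    # Same floor as date_trunc('day', now() - interval '<bound>' day).
    cutoff = (now - timedelta(days=bound_days)).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    missed = sum(1 for ft in first_times if ft < cutoff)
    oldest_age_h = None  # like SQL, no rows means no minimum
    if first_times:
        oldest_age_h = int((now - min(first_times)).total_seconds() // 3600)
    return {
        "old_window_new_inserts": len(first_times),
        "missed_by_new_window": missed,
        "oldest_insert_age_h": oldest_age_h,
    }


now = datetime(2026, 6, 20, 12, 0)
fresh = [now - timedelta(hours=1, minutes=30), now - timedelta(hours=4, minutes=10)]
with_straggler = fresh + [now - timedelta(days=3)]  # hypothetical 3d-late row

print(summarize(fresh, now, bound_days=2))
print(summarize(with_straggler, now, bound_days=2))
```

With only fresh rows the missed count is 0; adding the 3-day-old row makes it 1, which is the signal the per-chain run is looking for.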

chain	bound	old_window_new_inserts	oldest_insert_age (h)
bnb	3d	26,621	1
ethereum	2d	4,684	1
polygon	2d	4,044	1
arbitrum	2d	2,417	1
base	2d	2,246	1
celo	2d	1,370	1
monad	2d	380	0
abstract	2d	248	4
optimism	2d	73	1
hyperevm	2d	30	1
gnosis	2d	17	1
linea	2d	12	1
berachain	2d	8	0
mantle	2d	4	0
fantom	2d	3	1
avalanche_c	2d	3	0
katana	2d	2	0
scroll	2d	1	1
unichain	2d	1	0
apechain, blast, ink, opbnb, sei, zora, ronin, sepolia, zkevm, nova, zksync	2d	0	0

Every chain: 0 missed, 0 mistimed. The oldest genuinely-new address any chain inserts right now is 4 hours old (abstract); everywhere else it's ≤1h. Because these models run hourly and anti-join against the full target, in steady state the only thing a wider window can ever recover is the late-arrival tail — and the CUR2-2678 profile bounds that tail at p99.9 = minutes for EVM (worst single incident: base 21.1h) and 6.6h for bnb (worst observed 1.9d), all inside 2d/3d.

Caveat you're right to flag: this is a point-in-time snapshot, so it can't itself observe a multi-day harvester catch-up. That worst case is exactly what the late-arrival profile measures (and is why bnb gets 3d, not 2d). If you'd prefer more margin on any specific chain, easy to bump — the macro is now parameterized (lookback_days).

(CI is green — dbt-run (hourly_spellbook) passed, full-history initial build + seed/check_seed tests included.)

github-actions bot added the WIP (work in progress) and dbt: hourly (covers the hourly dbt subproject) labels on Jun 13, 2026.

a-monteiro wants to merge 1 commit into main from andre/cur2-2809-addresses-events-lookback
