
perf(addresses_events): shrink first_activity lookback to profiled late-arrival bound#9783

Draft
a-monteiro wants to merge 1 commit into main from andre/cur2-2809-addresses-events-lookback


a-monteiro (Member) commented Jun 13, 2026

Shrinks the hardcoded interval '7' day re-scan window on the addresses_events_<chain>_first_activity family down to the landing-lag bound measured in the CUR2-2678 late-arrival profile: 2 days for EVM chains, 3 days for bnb. Parameterizes the shared addresses_events_first_activity macro (lookback_days, default 2; bnb caller passes 3) and updates the 15 inline chain models.
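As a rough illustration of the parameterization described above, the per-chain window can be modelled as a default with overrides. This is a Python sketch, not the dbt macro itself: the names `LOOKBACK_OVERRIDES`, `lookback_days`, and `window_predicate` are hypothetical, and the predicate text is copied from the shape of the filter used in the validation query later in this thread.

```python
# Hypothetical model of the lookback parameterization; the real change lives
# in the addresses_events_first_activity dbt macro, which is not shown here.
DEFAULT_LOOKBACK_DAYS = 2            # default stated in the PR
LOOKBACK_OVERRIDES = {"bnb": 3}      # the only override the PR mentions


def lookback_days(chain: str) -> int:
    """Return the re-scan window, in days, for a chain."""
    return LOOKBACK_OVERRIDES.get(chain, DEFAULT_LOOKBACK_DAYS)


def window_predicate(chain: str, column: str = "block_time") -> str:
    """Render the incremental filter as a Trino-style SQL predicate string."""
    days = lookback_days(chain)
    return f"{column} >= date_trunc('day', now() - interval '{days}' day)"


if __name__ == "__main__":
    for chain in ("bnb", "base", "opbnb"):
        print(chain, "->", window_predicate(chain))
```

Keeping the override table small makes the exceptions easy to review: any chain not listed falls back to the default.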

Owner-gated: a lookback window is a correctness contract, not just a perf knob, so this is intentionally a draft pending Dune curated-data review. The window only ever existed to re-catch late-arriving first transactions. These models are anti-join appends (insert addresses not already in the table whose first tx is in the window), and they run ~hourly, so in steady state the table already holds every address whose first tx landed more than one run ago.
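The anti-join append described above can be sketched in a few lines of Python to show why the window matters only for late arrivals. Everything here is illustrative: the function `incremental_run`, the `(address, block_time, landed_at)` row shape, and the sample rows are assumptions, not the model's actual SQL or real data.

```python
from datetime import datetime, timedelta


def day_floor(ts: datetime) -> datetime:
    """Equivalent of date_trunc('day', ts)."""
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)


def incremental_run(table: dict, source_rows, now: datetime, lookback_days: int) -> dict:
    """Insert addresses missing from `table` whose tx falls inside the window.

    `source_rows` holds (address, block_time, landed_at) tuples; a row is only
    visible to the run once it has landed in the source table.
    """
    floor = day_floor(now - timedelta(days=lookback_days))
    first_seen = {}
    for address, block_time, landed_at in source_rows:
        if landed_at > now or block_time < floor:
            continue                      # not landed yet, or outside the window
        if address in table:
            continue                      # anti-join: already recorded
        if address not in first_seen or block_time < first_seen[address]:
            first_seen[address] = block_time
    table.update(first_seen)
    return first_seen


# Hypothetical rows: one on-time address, one that landed about 3 days late.
NOW = datetime(2026, 6, 13, 12, 0)
ROWS = [
    ("0xaaa", datetime(2026, 6, 13, 11, 30), datetime(2026, 6, 13, 11, 35)),
    ("0xlate", datetime(2026, 6, 10, 9, 0), datetime(2026, 6, 13, 11, 50)),
]

if __name__ == "__main__":
    for days in (2, 3, 7):
        inserted = incremental_run({}, ROWS, NOW, days)
        print(days, "day window inserts", sorted(inserted))
```

In this toy data the on-time address is caught by every window, while the address that landed about three days late is caught only when the window reaches back far enough. That late-landing case is the only one the window width affects.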

Why it's safe (proven on prod data)

Of the new addresses a 7-day incremental run would insert right now, every single one has a first_block_time within the last ~40 minutes — zero are older than even 2 days:

| chain | new addrs a 7d run inserts now | within 2d | older than 2d |
|---|---|---|---|
| bnb | 13,407 | 13,407 | 0 |
| base | 1,642 | 1,642 | 0 |

This matches the profile: EVM raw p99.9 landing lag = minutes (0% of rows >1d late); bnb p99.9 = 6.6h, worst observed tail 1.9d. The 2d/3d bounds clear those tails with margin (and the date_trunc('day', ...) floor adds another 0–1 day).
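The extra 0–1 day from the `date_trunc('day', ...)` floor can be checked with a small Python sketch. The helper name `effective_lookback` and the sample timestamps are hypothetical; the only assumption about the model is that its filter has the form `block_time >= date_trunc('day', now() - interval 'N' day)`.

```python
from datetime import datetime, timedelta


def effective_lookback(now: datetime, lookback_days: int) -> timedelta:
    """Actual span scanned once the window start is floored to midnight."""
    start = now - timedelta(days=lookback_days)
    floor = start.replace(hour=0, minute=0, second=0, microsecond=0)
    return now - floor


if __name__ == "__main__":
    # Sample run times across one (arbitrary) day.
    for hour in (0, 6, 12, 23):
        run_time = datetime(2026, 6, 13, hour, 0)
        print(f"run at {hour:02d}:00 scans", effective_lookback(run_time, 2))
```

A run at midnight scans exactly the nominal window, and a run late in the day scans almost one day more, so the nominal bound is the minimum coverage rather than the typical one.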

Measured A/B (component, 3 warm-run medians, checksum-forced over all output columns)

| chain (shrink) | CPU | IO (scan) | wall | peak mem | spill |
|---|---|---|---|---|---|
| bnb 7d→3d | 457 → 320 s (−30%) | 58.1 → 45.9 GB (−21%) | 16.8 → 10.6 s (−37%) | 6.6 → 6.0 GB | 0 |
| base 7d→2d | 202 → 119 s (−41%) | 16.7 → 11.0 GB (−34%) | 8.8 → 4.4 s (−50%) | flat | 0 |
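For readers who want to re-derive the percentages, here is a small Python sketch that recomputes them from the before/after medians reported in the A/B measurements. The dictionary layout and the name `pct_change` are illustrative only; the numbers are the ones quoted in the PR.

```python
def pct_change(before: float, after: float) -> int:
    """Relative change from `before` to `after`, rounded to a whole percent."""
    return round((after - before) / before * 100)


# (before, after) medians as reported in the PR description.
MEASUREMENTS = {
    ("bnb", "cpu_s"): (457, 320),
    ("bnb", "io_gb"): (58.1, 45.9),
    ("bnb", "wall_s"): (16.8, 10.6),
    ("base", "cpu_s"): (202, 119),
    ("base", "io_gb"): (16.7, 11.0),
    ("base", "wall_s"): (8.8, 4.4),
}

if __name__ == "__main__":
    for (chain, metric), (before, after) in MEASUREMENTS.items():
        print(f"{chain:5s} {metric:7s} {pct_change(before, after):+d}%")
```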

The bnb IO cut is diluted by a fixed ~12 GB self-scan of {{this}} (a separate structural cost tracked in CUR2-2784); EVM chains shrink to 2d and cut more. Across the ~28-chain family (~11 CPU-hrs/day, ~3.5 TB IO/day on spellbook-hourly) this is an estimated ~30–40% reduction (~3.5–4 CPU-hrs/day, ~1–1.4 TB/day).
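The family-wide estimate is simple proportional arithmetic, which this back-of-envelope Python sketch reproduces. The function name `savings_range` is hypothetical; the inputs are the approximate daily totals and the 30–40% range quoted above, so the outputs are only as precise as those figures.

```python
def savings_range(daily_total: float, low: float = 0.30, high: float = 0.40):
    """Return (low, high) absolute savings for a fractional reduction range."""
    return daily_total * low, daily_total * high


if __name__ == "__main__":
    cpu_low, cpu_high = savings_range(11.0)   # ~11 CPU-hrs/day across the family
    io_low, io_high = savings_range(3.5)      # ~3.5 TB IO/day across the family
    print(f"CPU saved: {cpu_low:.1f} to {cpu_high:.1f} CPU-hrs/day")
    print(f"IO saved:  {io_low:.2f} to {io_high:.2f} TB/day")
```

The quoted ~3.5–4 CPU-hrs/day and ~1–1.4 TB/day sit inside the ranges this produces, which is about as close as an estimate built from rounded totals can be expected to land.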

Notes

  • opbnb is treated as an EVM L2 (2d), not bnb-L1's 3d — flag if you'd prefer 3d.
  • No CI floor added: these models' full-refresh path was already unbounded, and the check_seed regression test needs the full-history initial build (it asserts genesis-era seed addresses are present). This PR doesn't change that path, so CI behaves as before; the heaviest initial full-refresh (bnb, 13.3B rows → 555M groups) measured ~11 min wall / 1.7 TB / 0 spill, well inside the 90-min CI budget.

Towards CUR2-2809

…te-arrival bound

Shrink the incremental re-scan window on the addresses_events_<chain>_first_activity
family from a hardcoded 7d to the CUR2-2678 profiled landing-lag bound: 2d for EVM
chains, 3d for bnb. These models append only genuinely-new addresses (anti-join against
the existing table), so the wide window only ever served to catch late-arriving first
transactions -- measured at p99.9 = minutes for EVM and 6.6h (worst tail 1.9d) for bnb,
far inside the new bounds. Parameterize the shared macro (lookback_days, default 2; bnb
passes 3) and update the 15 inline chain models.



github-actions (bot) added the labels "WIP" (work in progress) and "dbt: hourly" (covers the hourly dbt subproject) on Jun 13, 2026
tomfutago (Contributor) commented

Correctness / validation risk worth double-checking:

Since this changes the correctness boundary for every addresses_events_*_first_activity incremental model, can we add the same old-7d vs new-window comparison for all affected chains, or at least publish a per-chain table with old_window_new_inserts, missed_by_new_window, and max first_block_time age?

CI full-refresh/seed tests do not exercise this late-arrival boundary, and the PR evidence currently only covers bnb and base.

a-monteiro (Member, Author) commented

@tomfutago good call — here's the full per-chain run.

Method (read-only, prod, dtrino UTC): for every affected chain, reproduce the model's incremental insert set against the live first_activity table —

WITH na AS (
  SELECT et."from" AS a, MIN(et.block_time) AS ft
  FROM delta_prod.<chain>.transactions et
  LEFT JOIN hive.addresses_events_<chain>.first_activity f ON et."from" = f.address
  WHERE f.address IS NULL                                   -- genuinely-new addresses only
    AND et.block_time >= date_trunc('day', now() - interval '7' day)
  GROUP BY et."from")
SELECT count(*)                                                          AS old_window_new_inserts,
       count(*) FILTER (WHERE ft < date_trunc('day', now() - interval '<bound>' day)) AS missed_by_new_window,
       date_diff('hour', min(ft), now())                                AS oldest_insert_age_h
FROM na;

missed_by_new_window counts new addresses whose earliest 7d tx is older than the new bound — i.e. every row the shrunk window would either drop or record a later first_block_time for. Bound = 3d for bnb, 2d for all other (EVM) chains.
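To make the three output columns concrete, the query's aggregation can be mirrored in Python over a handful of made-up transactions. This is a sketch under stated assumptions: `window_report` is a hypothetical helper, the input is assumed to already contain only transactions from addresses absent from the target table (the anti-join step), and the hour age is truncated to whole hours as Trino's `date_diff('hour', ...)` does.

```python
from datetime import datetime, timedelta


def day_floor(ts: datetime) -> datetime:
    """Equivalent of date_trunc('day', ts)."""
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)


def window_report(new_address_txs, now: datetime, bound_days: int) -> dict:
    """Mirror the validation query for (address, block_time) rows."""
    old_floor = day_floor(now - timedelta(days=7))
    new_floor = day_floor(now - timedelta(days=bound_days))

    # MIN(block_time) per address, restricted to the old 7-day window.
    first_tx = {}
    for address, block_time in new_address_txs:
        if block_time < old_floor:
            continue
        if address not in first_tx or block_time < first_tx[address]:
            first_tx[address] = block_time

    missed = sum(1 for ft in first_tx.values() if ft < new_floor)
    oldest = min(first_tx.values(), default=None)
    age_h = None if oldest is None else int((now - oldest).total_seconds() // 3600)
    return {
        "old_window_new_inserts": len(first_tx),
        "missed_by_new_window": missed,
        "oldest_insert_age_h": age_h,      # None stands in for SQL NULL
    }


# Hypothetical sample: one fresh address, one first seen about 2.5 days ago.
NOW = datetime(2026, 6, 13, 12, 0)
TXS = [
    ("0xa", datetime(2026, 6, 13, 11, 30)),
    ("0xa", datetime(2026, 6, 13, 11, 45)),
    ("0xb", datetime(2026, 6, 10, 23, 0)),
]

if __name__ == "__main__":
    print(window_report(TXS, NOW, bound_days=2))
    print(window_report(TXS, NOW, bound_days=3))
```

With this sample the 2-day bound reports one missed address and the 3-day bound reports none, which is the signal the per-chain results are checking for.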

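| chain | bound | old_window_new_inserts | missed_by_new_window | oldest_insert_age (h) |
|---|---|---|---|---|
| bnb | 3d | 26,621 | 0 | 1 |
| ethereum | 2d | 4,684 | 0 | 1 |
| polygon | 2d | 4,044 | 0 | 1 |
| arbitrum | 2d | 2,417 | 0 | 1 |
| base | 2d | 2,246 | 0 | 1 |
| celo | 2d | 1,370 | 0 | 1 |
| monad | 2d | 380 | 0 | 0 |
| abstract | 2d | 248 | 0 | 4 |
| optimism | 2d | 73 | 0 | 1 |
| hyperevm | 2d | 30 | 0 | 1 |
| gnosis | 2d | 17 | 0 | 1 |
| linea | 2d | 12 | 0 | 1 |
| berachain | 2d | 8 | 0 | 0 |
| mantle | 2d | 4 | 0 | 0 |
| fantom | 2d | 3 | 0 | 1 |
| avalanche_c | 2d | 3 | 0 | 0 |
| katana | 2d | 2 | 0 | 0 |
| scroll | 2d | 1 | 0 | 1 |
| unichain | 2d | 1 | 0 | 0 |
| apechain, blast, ink, opbnb, sei, zora, ronin, sepolia, zkevm, nova, zksync | 2d | 0 | 0 | 0 |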

Every chain: 0 missed, 0 mistimed. The oldest genuinely-new address any chain inserts right now is 4 hours old (abstract); everywhere else it's ≤1h. Because these models run hourly and anti-join against the full target, in steady state the only thing a wider window can ever recover is the late-arrival tail — and the CUR2-2678 profile bounds that tail at p99.9 = minutes for EVM (worst single incident: base 21.1h) and 6.6h for bnb (worst observed 1.9d), all inside 2d/3d.

Caveat you're right to flag: this is a point-in-time snapshot, so it can't itself observe a multi-day harvester catch-up. That worst case is exactly what the late-arrival profile measures (and is why bnb gets 3d, not 2d). If you'd prefer more margin on any specific chain, easy to bump — the macro is now parameterized (lookback_days).

(CI is green — dbt-run (hourly_spellbook) passed, full-history initial build + seed/check_seed tests included.)
