Consul KV does not scale for dense catchup range storage on high-throughput chains #88

@vietddude

Description

Summary

The current catchup-progress model stores one KV entry per catchup range under:

block_states/<internal_code>/catchup_progress/<start>-<end>

This does not scale well on high-throughput chains such as Sui, where block/checkpoint production is very fast and backlog can generate a very large number of ranges.
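To make the key layout concrete, here is a minimal sketch of parsing one of these keys back into its components. The chain code `sui` and the helper name are hypothetical, chosen only for illustration:

```python
def parse_catchup_key(key: str):
    """Parse a key of the form block_states/<internal_code>/catchup_progress/<start>-<end>
    into (internal_code, start, end). Hypothetical helper for illustration."""
    parts = key.split("/")
    internal_code = parts[1]
    start, end = parts[-1].split("-")
    return internal_code, int(start), int(end)

# Example key for a single catchup range on a hypothetical "sui" chain code
print(parse_catchup_key("block_states/sui/catchup_progress/100-199"))
# → ('sui', 100, 199)
```

Every range requires one such key, so the per-chain key count grows linearly with the number of outstanding ranges.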

Problem

Even though large ranges are split into small chunks, the resulting number of KV keys becomes very large.

This creates multiple problems:

  • startup bootstrap becomes expensive because we need to list and parse all catchup keys
  • catchup worker initialization becomes slow
  • Consul becomes stressed by large key counts under a single prefix
  • local environments such as Colima become unstable more easily
  • the /status endpoint was originally slow because it scanned all catchup keys on the request path
  • even after moving /status to in-memory registry state, the underlying KV model is still too heavy

This is particularly visible on Sui because block/checkpoint creation is much faster than on slower chains, so a backlog produces very dense catchup state.

Current behavior

  • catchup ranges are split into very small chunks
  • each chunk is stored as an individual Consul KV key
  • a large backlog can produce tens or hundreds of thousands of keys per chain
  • loading catchup progress requires a full prefix scan

Expected behavior

Catchup progress should remain manageable even for chains with very high block throughput.

The storage model should not require extremely large numbers of KV keys per chain.

Impact

High

This affects reliability and scalability of catchup progress persistence, especially for Sui and other high-throughput chains.

Suggested directions

Possible fixes:

  1. Increase catchup range size to reduce KV key count
  2. Store summary metadata separately for status/monitoring
  3. Replace the many-small-range model with a more compact cursor + gaps model
  4. Move catchup-progress persistence away from Consul KV to a store better suited for high-churn range state

Notes

This issue is not only about /status latency. The deeper problem is the persistence model itself.
