
[RFC] Pluggable Remote Store Strategies for Remote-Backed Storage #21850

@hnguyen10atlassian

Is your feature request related to a problem? Please describe

OpenSearch remote-backed storage (RBS) uploads every Lucene segment file and every translog
generation individually to object storage. Under high-throughput indexing (200 active shards,
1-second refresh) this generates roughly 5.2 billion PUTs/month (~$26,000), the primary
contributor to S3 (or other cloud blob store) cost.

The upload logic is inlined in RemoteStoreUploaderService and RemoteFsTranslog. Optimising
it requires invasive core changes, and an optimisation cannot be shipped as a plugin or enabled
per index.

Describe the solution you'd like

Introduce two strategy interfaces in OpenSearch core and expose them through the existing plugin
framework. Custom strategies can then control how segment and translog data are physically stored,
downloaded, and garbage-collected without any changes to core.

Extension Points

1. RemoteStorePlugin - plugin registration

@ExperimentalApi
public interface RemoteStorePlugin {
    // Return a map of strategy-name → implementation.
    // "default" is always registered by core; plugins add additional keys.
    default Map<String, RemoteStoreSegmentStrategy>  getRemoteStoreSegmentStrategies()  { … }
    default Map<String, RemoteStoreTranslogStrategy> getRemoteStoreTranslogStrategies() { … }
}

2. RemoteStoreSegmentStrategy - segment lifecycle

@ExperimentalApi
public interface RemoteStoreSegmentStrategy {

    // Upload segment files for one refresh cycle.
    // UploadContext carries the file list, local Directory, checkpoint, and crypto config.
    // UploadListener receives per-file and batch success/failure callbacks.
    void upload(RemoteSegmentStoreDirectory dir, ShardId shard,
                UploadContext ctx, UploadListener listener) throws IOException;

    // Upload the metadata blob after a successful upload().
    // Default is a no-op — strategies that embed metadata inside the data blob (e.g. TAR
    // index.bin) do not need a separate metadata file.
    default void uploadMetadata(RemoteSegmentStoreDirectory dir, ShardId shard,
                                MetadataUploadContext ctx) throws IOException {}

    // Delete segment files no longer referenced by any active commit.
    void deleteStaleSegments(RemoteSegmentStoreDirectory dir, ShardId shard,
                             int minCommitsToKeep) throws IOException;

    void deleteFile(RemoteSegmentStoreDirectory dir, ShardId shard, String name) throws IOException;

    // Return a reader that can discover and parse segment metadata.
    // Default strategy uses the metadata__<term>_<gen>_<nodeId> naming convention.
    // TAR strategy reads index.bin from the start of each bundle via a single range-GET.
    MetadataReader getMetadataReader(RemoteSegmentStoreDirectory dir, ShardId shard);
}

RemoteSegmentFile is a companion interface that abstracts range reads on a remote segment file,
decoupling the core from Lucene's IndexInput so strategies can implement range-GETs directly
against their own storage layout (e.g. a byte slice inside a TAR blob).

3. RemoteStoreTranslogStrategy - translog lifecycle

@ExperimentalApi
public interface RemoteStoreTranslogStrategy {

    // Upload one translog snapshot. Returns true on success.
    boolean transferSnapshot(ShardId shard, TransferSnapshot snapshot,
                             TranslogTransferListener listener,
                             CryptoMetadata crypto) throws IOException;

    // Download all generations in [minGen..maxGen] to `location`.
    // Batch strategies (e.g. TAR) perform one master round-trip then issue targeted
    // range-GETs. Per-file strategies loop over generationToPrimaryTerm and call
    // downloadTranslog() once per generation.
    void downloadRange(ShardId shard, long minGen, long maxGen,
                       Map<String, String> generationToPrimaryTerm,
                       Path location) throws IOException;

    // Controls whether core's per-generation remote DELETE loop runs.
    // Return SKIP when generations are stored in shared bundles — the DELETE paths
    // do not exist, so core would flood S3 with 404-producing requests on every flush.
    // Plugin owns GC when SKIP is returned.
    default GcDecision resolveStaleTranslogBlobs() { return GcDecision.USE_DEFAULT; }

    // Fallback chain tried when downloadRange() cannot satisfy all generations.
    // Enables safe rolling upgrades: declare DefaultRemoteStoreTranslogStrategy here
    // so blobs written by the old strategy remain recoverable after a strategy switch.
    default List<RemoteStoreTranslogStrategy> fallbackStrategies() { return List.of(); }

    enum GcDecision { SKIP, USE_DEFAULT }
}

Code: RemoteStorePlugin, RemoteStoreSegmentStrategy, RemoteStoreTranslogStrategy, RemoteSegmentFile, settings and registry wiring

Settings

Two new index-level settings select the active strategy per index:

index.remote_store.segment.strategy   (default: "default")
index.remote_store.translog.strategy  (default: "default")
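
A minimal sketch of how a name-keyed registry could resolve these setting values. `StrategyRegistry` is a hypothetical stand-in rather than the RFC's actual wiring, and rejecting duplicate or unknown names is an assumption about the desired behaviour:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a name-keyed strategy registry; not the RFC's actual classes.
final class StrategyRegistry<T> {
    static final String DEFAULT_NAME = "default";
    private final Map<String, T> strategies = new HashMap<>();

    StrategyRegistry(T builtIn) {
        strategies.put(DEFAULT_NAME, builtIn); // core always registers "default"
    }

    // Called once per plugin at node startup. Assumption: duplicate names are rejected.
    void registerAll(Map<String, T> fromPlugin) {
        for (Map.Entry<String, T> e : fromPlugin.entrySet()) {
            if (strategies.putIfAbsent(e.getKey(), e.getValue()) != null) {
                throw new IllegalArgumentException("strategy already registered: " + e.getKey());
            }
        }
    }

    // Called at index open with the value of the index-level setting (null if unset).
    T resolve(String configuredName) {
        String name = configuredName == null ? DEFAULT_NAME : configuredName;
        T strategy = strategies.get(name);
        if (strategy == null) {
            throw new IllegalArgumentException("unknown remote store strategy: " + name);
        }
        return strategy;
    }
}
```

Failing fast on an unknown name at index open surfaces a missing plugin before any shard starts uploading; a silent fallback to "default" would hide it.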

Lifecycle

  1. Node startup: IndicesService collects strategy maps from all plugins and registers
    the built-in "default" implementations.
  2. Index open: IndexModule / IndexService resolve the strategy names from index settings
    against the registry and inject singleton instances into each shard.
  3. Upload: RemoteStoreUploaderService calls strategy.upload() + strategy.uploadMetadata()
    on every refresh; RemoteFsTranslog calls strategy.transferSnapshot() on every sync.
  4. Recovery: RemoteFsTranslog.downloadOnce() calls strategy.downloadRange(), then tries
    each strategy in fallbackStrategies() for any generations not yet found.
  5. GC: trimUnreferencedReaders() and deleteStaleRemotePrimaryTerms() check
    resolveStaleTranslogBlobs() before issuing remote DELETEs.
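
The recovery step (4) can be sketched as a loop over the primary strategy and its declared fallbacks. `GenerationSource` and `FallbackDownloader` are hypothetical simplifications: here `download()` reports which generations it supplied, whereas the RFC's `downloadRange()` returns void, so how core detects "not yet found" generations is an assumption of this sketch:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical simplification of RemoteStoreTranslogStrategy for illustration only.
interface GenerationSource {
    // Returns the subset of `wanted` generations this source was able to supply.
    Set<Long> download(Set<Long> wanted);

    default List<GenerationSource> fallbacks() { return List.of(); }
}

final class FallbackDownloader {
    // Try the primary source, then each declared fallback for whatever is still missing.
    // Returns the generations no source could supply (empty on full recovery).
    static Set<Long> downloadWithFallback(GenerationSource primary, long minGen, long maxGen) {
        Set<Long> missing = new TreeSet<>();
        for (long gen = minGen; gen <= maxGen; gen++) {
            missing.add(gen);
        }
        List<GenerationSource> chain = new ArrayList<>();
        chain.add(primary);
        chain.addAll(primary.fallbacks());
        for (GenerationSource source : chain) {
            if (missing.isEmpty()) {
                break; // everything recovered, skip remaining fallbacks
            }
            missing.removeAll(source.download(Set.copyOf(missing)));
        }
        return missing;
    }
}
```

A non-empty result means some generation exists under none of the known layouts, which a caller would presumably treat as a recovery failure.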

Code: DefaultRemoteStoreSegmentStrategy, DefaultRemoteStoreTranslogStrategy, and all core delegation wiring

Reference Implementation: rbs-tar Plugin

This RFC does not propose adopting this plugin; it is a reference implementation intended only to illustrate the solution.

The plugins/rbs-tar module ships two concrete strategies.

TarSegmentUploadStrategy packs all segment files for one refresh into a single TAR blob.
The first entry is always index.bin — a compact binary manifest of filename → byte offset + length.
Result: 1 PUT per refresh instead of N, reads via targeted range-GETs, and GC via readIndexBinOnly()
(skips segments_N, saving one range-GET per active bundle per flush).
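
To illustrate the manifest-first layout, here is a simplified sketch. It does not use real TAR framing, and the actual index.bin encoding is not specified in this RFC; `BundleSketch` and its byte layout are assumptions chosen only to show how a filename → offset + length manifest at the head of a blob lets a reader fetch a single file's bytes:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Map;

// Assumed layout: [int manifestLength][manifest][file bytes...], where the manifest is
// [int count] followed by (int nameLength, name, long offset, int length) per file.
final class BundleSketch {

    // Pack all files of one refresh into a single blob (one PUT instead of N).
    static byte[] pack(Map<String, byte[]> files) {
        int manifestLength = Integer.BYTES; // entry count
        int dataLength = 0;
        for (Map.Entry<String, byte[]> e : files.entrySet()) {
            int nameLength = e.getKey().getBytes(StandardCharsets.UTF_8).length;
            manifestLength += Integer.BYTES + nameLength + Long.BYTES + Integer.BYTES;
            dataLength += e.getValue().length;
        }
        ByteBuffer blob = ByteBuffer.allocate(Integer.BYTES + manifestLength + dataLength);
        blob.putInt(manifestLength).putInt(files.size());
        long offset = 0; // relative to the start of the data area
        for (Map.Entry<String, byte[]> e : files.entrySet()) {
            byte[] name = e.getKey().getBytes(StandardCharsets.UTF_8);
            blob.putInt(name.length).put(name).putLong(offset).putInt(e.getValue().length);
            offset += e.getValue().length;
        }
        for (byte[] content : files.values()) {
            blob.put(content);
        }
        return blob.array();
    }

    // Read one file: parse the manifest at the head of the blob, then copy only that slice.
    // Against object storage both steps would be range-GETs rather than array copies.
    static byte[] readFile(byte[] blob, String name) {
        ByteBuffer in = ByteBuffer.wrap(blob);
        int dataStart = Integer.BYTES + in.getInt();
        int count = in.getInt();
        for (int i = 0; i < count; i++) {
            byte[] entryName = new byte[in.getInt()];
            in.get(entryName);
            long offset = in.getLong();
            int length = in.getInt();
            if (new String(entryName, StandardCharsets.UTF_8).equals(name)) {
                int from = dataStart + (int) offset;
                return Arrays.copyOfRange(blob, from, from + length);
            }
        }
        throw new IllegalArgumentException("no such entry in bundle: " + name);
    }
}
```

Because the manifest sits at a known position at the start of the blob, a reader can fetch it without knowing the bundle's total size, which is what makes a manifest-only read for GC possible.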

TarTranslogUploadStrategy batches translog snapshots from all shards on a node into a single
TAR bundle per flush window via NodeTranslogUploadQueue:

  • transferSnapshot() enqueues; a worker drains and uploads one bundle per node per window.
  • downloadRange() fetches all FileLocations from NodeBundleRegistry in one master round-trip,
    then issues one range-GET per TAR file.
  • resolveStaleTranslogBlobs() returns SKIP; GC is owned by NodeBundleRegistry using
    min(minRemoteGenReferenced, globalCheckpoint+1) to avoid starving replicas.
  • fallbackStrategies() declares DefaultRemoteStoreTranslogStrategy for safe rolling upgrades.
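
The enqueue-then-drain behaviour in the first bullet can be sketched as follows. `UploadQueueSketch` is a hypothetical simplification of NodeTranslogUploadQueue (the plugin's real class is not shown here), and it leaves out how `transferSnapshot()` learns that the bundle was durably uploaded:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical simplification: entries are (shardId, generation) pairs, and "uploading"
// a bundle is represented by incrementing a counter instead of a remote PUT.
final class UploadQueueSketch {
    private final BlockingQueue<Map.Entry<String, Long>> queue = new LinkedBlockingQueue<>();
    private int bundlesUploaded = 0;

    // Called from every shard on the node; cheap, performs no remote I/O.
    void enqueue(String shardId, long generation) {
        queue.add(Map.entry(shardId, generation));
    }

    // Called by a single worker once per flush window: everything queued so far becomes
    // one bundle, so the PUT count scales with nodes and windows rather than shards.
    List<Map.Entry<String, Long>> drainWindow() {
        List<Map.Entry<String, Long>> bundle = new ArrayList<>();
        queue.drainTo(bundle);
        if (!bundle.isEmpty()) {
            bundlesUploaded++; // stands in for a single remote PUT
        }
        return bundle;
    }

    int bundlesUploaded() {
        return bundlesUploaded;
    }
}
```

An empty window produces no bundle, so idle nodes issue no PUTs in this sketch.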

Cost Impact

| Metric | Default | TAR bundle |
| --- | --- | --- |
| Segment PUTs/month | ~2.6 B | ~290 M (−89%) |
| Translog PUTs/month | ~2.6 B | ~52 M (−98%) |
| Total API cost | ~$26,000 | ~$1,700 |
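
As a back-of-the-envelope check of the table above, the sketch below recomputes the totals and percentages from the PUT counts. The price of $0.005 per 1,000 PUTs is an assumption (a typical S3 Standard list price), not a figure stated in this RFC:

```java
// Recomputes the cost table from its PUT counts under an assumed request price.
final class PutCostSketch {
    // Assumption: USD per 1,000 PUT requests; adjust for the actual store and region.
    static final double USD_PER_1000_PUTS = 0.005;

    static double monthlyCostUsd(long putsPerMonth) {
        return putsPerMonth / 1000.0 * USD_PER_1000_PUTS;
    }

    // Reduction from `before` to `after`, rounded to a whole percent.
    static long reductionPercent(long before, long after) {
        return Math.round(100.0 * (before - after) / before);
    }

    public static void main(String[] args) {
        long defaultSegments = 2_600_000_000L;
        long defaultTranslog = 2_600_000_000L;
        long tarSegments = 290_000_000L;
        long tarTranslog = 52_000_000L;

        System.out.println("default cost: " + Math.round(monthlyCostUsd(defaultSegments + defaultTranslog)));
        System.out.println("TAR cost:     " + Math.round(monthlyCostUsd(tarSegments + tarTranslog)));
        System.out.println("segment reduction %:  " + reductionPercent(defaultSegments, tarSegments));
        System.out.println("translog reduction %: " + reductionPercent(defaultTranslog, tarTranslog));
    }
}
```

Under that assumed price the totals come to about $26,000 and $1,710 per month, consistent with the rounded figures in the table.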

Code: TarSegmentUploadStrategy, TarTranslogUploadStrategy, NodeBundleRegistry, NodeTranslogUploadQueue, transport actions and tests

Related component

Storage:Remote

Describe alternatives you've considered

1. Hardcoding optimisations in core

Couples a specific format to the release cycle, blocks community experimentation, and makes core
harder to maintain.

2. Custom BlobStore / storage gateway

The BlobStore layer has no visibility into Lucene refresh boundaries or primary term checkpoints —
it cannot know which files belong to a single flush batch or when it is safe to bundle them.

3. External KV store for translog (Valkey / Redis)

RAM storage is orders of magnitude more expensive than S3, async replication cannot match S3's
durability guarantees, and it introduces an external operational dependency.

4. Shared filesystem (EFS / NFS)

EFS costs ~$0.30/GB-month vs S3's ~$0.023/GB-month, and NFS mounts introduce locking and
reliability risks under network partitions.

5. Pull-based ingestion (no-op translog)

Documents are ingested from a durable upstream log (e.g. Kafka); recovery replays from a
checkpoint instead of a translog. This eliminates translog PUTs entirely but is not viable as a
general solution for two reasons:

  1. Segment upload cost remains — every refresh still produces segment files that must be pushed
    individually to remote store, leaving the larger cost driver unaddressed.
  2. Not universally applicable — most deployments use direct REST indexing with no upstream
    log. Not everyone can switch to pull-based ingestion yet, and there is no migration path for
    already-indexed data.

The pluggable strategy approach is orthogonal: clusters that do adopt pull-based ingestion can
implement a no-op RemoteStoreTranslogStrategy while still benefiting from
TarSegmentUploadStrategy for segments.

Additional context

No response
