Skip to content

Design discussion: WithRetry option for transient operation failures #76

@bigbes

Description

@bigbes

Background

storage.WithRetry() currently exists as a no-op stub in storage.go. The question came up from a TCM use case: TCM has a GetWithRetry helper that issues up to len(endpoints) * 2 requests against the storage to ride out transient failures. We don't have an equivalent right now, and connection-pool failover alone isn't enough — the pool handles "instance is dead", not "this request just timed out / the leader changed".

What "retry" means here

Operation-level retry on transient errors (timeouts, Unavailable, network/leader-change errors) for safe-to-replay calls:

  • Execute is safe by construction — predicates make CAS-style transactions naturally idempotent.
  • Range is a pure read.
  • Watch: at most we'd retry the initial subscribe, never re-emit inside the stream — that's a caller concern.

What "retry" doesn't cover: business-logic errors (predicate-fail), validation errors, or anything the driver classifies as permanent.

Design axes

Independent of the chosen placement, every design needs to answer:

  • Policy shape: max attempts vs. unlimited+ctx, backoff curve, jitter, per-driver retriability classifier.
  • Default classifier per driver:
    • etcd: gRPC Unavailable, DeadlineExceeded, leader/election errors.
    • TCS: tarantool transient error codes, network errors, pool errors.
  • TCM parity heuristic: do we expose MaxAttempts = len(endpoints) * 2 out of the box, or just a flat number?

A reasonable starting policy:

type Policy struct {
    MaxAttempts    int           // 0 = unlimited, gated by context
    InitialBackoff time.Duration
    MaxBackoff     time.Duration
    Multiplier     float64
    Jitter         float64       // 0..1, full jitter
    IsRetriable    func(error) bool
}

Design A — Driver decorator + storage option

New driver/retry package; storage.WithRetry(...) wires it in.

package retry
func Wrap(d driver.Driver, policy Policy) driver.Driver

// usage
s := storage.NewStorage(d, storage.WithRetry(retry.WithMaxAttempts(5)))

Pros:

  • Orthogonal to the rest of the codebase, trivially testable with a fake driver.
  • Composes cleanly with pool failover (pool: dead instance, retry: flaky request).
  • Single retry surface for every consumer of the driver.

Cons:

  • Adds a layer between Storage and the underlying driver — slightly more indirection in stack traces.
  • Watch retry has to be handled specially or skipped.

Design B — Per-call option

Retry config travels with the call instead of the storage instance.

s.Range(ctx, storage.WithPrefix("/foo"), storage.WithRetry(...))
tx.Commit(storage.WithRetry(...))

Pros:

  • Fine-grained: callers pick retry per operation (e.g., aggressive on reads, none on writes).
  • No hidden behavior — what you see at the call site is what runs.

Cons:

  • Verbose; same config repeated at every call site.
  • Tx is a builder; threading retry through Commit() adds API surface to the tx package.
  • Two retry knobs (storage-level + call-level) if we ever add A on top — precedence rules get awkward.

Design C — Connection-level only

Push retry into the connect package and the underlying pool/etcd client configuration. No new abstraction at the driver/storage layer.

cfg := connect.Config{Endpoints: ..., Retry: connect.RetryConfig{...}}

Pros:

  • Closest to where the transient errors physically occur.
  • No new package; no decorator.

Cons:

  • Tarantool pool and etcd client have different retry models — we'd end up with two divergent configs hidden under one knob, or leak the difference.
  • Doesn't cover request-level retry that crosses pool semantics (e.g., timeout on a healthy instance) uniformly.
  • Roman's TCM use case explicitly motivates request-level retries that pool failover doesn't catch — this design alone may not solve it.

Design D — Helper functions, no built-in option

Drop the stub entirely. Provide retry.Do(ctx, policy, fn) as a standalone helper and let callers wrap their own calls.

err := retry.Do(ctx, policy, func() error {
    _, err := s.Range(ctx, ...)
    return err
})

Pros:

  • Smallest API surface; easiest to ship.
  • Caller has full control, no hidden behavior.

Cons:

  • Every consumer reimplements the same wrapper around storage calls.
  • No driver-aware default classifier — caller has to know which errors are transient for which backend.
  • Doesn't deliver on the existing WithRetry option's implied promise.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions