title: "Scheduling Pattern 5: Advanced Retry Chains and Circuit Breakers"
id: scheduling-pattern-advanced-retry-chains
skillLevel: advanced
applicationPatternId: scheduling-periodic-tasks
summary: Build sophisticated retry chains with circuit breakers, fallbacks, and complex failure patterns for production-grade reliability.
tags:
  - scheduling
  - retry
  - circuit-breaker
  - fault-tolerance
  - error-recovery
  - resilience
rule:
  description: Use retry chains with circuit breakers to handle complex failure scenarios, detect cascade failures early, and prevent resource exhaustion.
related:
  - scheduling-pattern-exponential-backoff
  - error-handling-pattern-accumulation
  - error-handling-pattern-propagation
author: effect_website
lessonOrder: 1

Guideline

Advanced retry strategies handle multiple failure types:

  • Circuit breaker: Stop retrying when error rate is high
  • Bulkhead: Limit concurrency per operation
  • Fallback chain: Try multiple approaches in order
  • Adaptive retry: Adjust strategy based on failure pattern
  • Health checks: Verify recovery before resuming

Pattern: Combine Effect.retry with a Schedule policy, Ref state, and error classification
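
A minimal sketch of the retry half of that pattern, assuming an effect `callService` that can fail (the name is illustrative):

import { Effect, Schedule } from "effect";

// Illustrative placeholder for any effect that may fail.
declare const callService: Effect.Effect<string, Error>;

// Retry with exponential backoff, capped at 5 attempts.
const withRetry = callService.pipe(
  Effect.retry(
    Schedule.exponential("100 millis").pipe(
      Schedule.intersect(Schedule.recurs(5))
    )
  )
);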


Rationale

Simple retry strategies break down in production:

Scenario 1: Cascade Failure

  • Service A calls Service B (down)
  • Retries pile up, consuming resources
  • A gets overloaded trying to recover B
  • System collapses

Scenario 2: Mixed Failures

  • 404 (not found) - retrying won't help
  • 500 (server error) - retrying might help
  • Network timeout - retrying might help
  • Same retry strategy for all = inefficient

Scenario 3: Thundering Herd

  • 10,000 clients all retrying at once
  • Server recovers, gets hammered again
  • Needs coordinated backoff plus jitter (see the sketch below)
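
A minimal sketch of randomized backoff, using Schedule.jittered to spread out clients that would otherwise retry in lockstep:

import { Schedule } from "effect";

// Exponential backoff with randomized (jittered) delays, at most 5 retries.
const jitteredBackoff = Schedule.exponential("200 millis").pipe(
  Schedule.jittered,
  Schedule.intersect(Schedule.recurs(5))
);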

Solutions:

Circuit breaker:

  • Monitor error rate
  • Stop requests when high
  • Resume gradually
  • Prevent cascade failures

Fallback chain:

  • Try primary endpoint
  • Try secondary endpoint
  • Use cache
  • Return degraded result

Adaptive retry:

  • Classify error type
  • Use appropriate strategy
  • Skip unretryable errors (see the sketch below)
  • Adjust backoff dynamically
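
A hedged sketch of the "skip unretryable errors" piece, assuming Effect.retry's options form with a while predicate (the error classes mirror the example below, and `fetchData` is an illustrative placeholder):

import { Data, Effect, Schedule } from "effect";

class RetryableError extends Data.TaggedError("RetryableError")<{ code: string }> {}
class NonRetryableError extends Data.TaggedError("NonRetryableError")<{ code: string }> {}

// Illustrative placeholder for a call that may fail either way.
declare const fetchData: Effect.Effect<string, RetryableError | NonRetryableError>;

// Retry only retryable errors, with exponential backoff and an attempt cap.
const adaptiveFetch = fetchData.pipe(
  Effect.retry({
    schedule: Schedule.exponential("100 millis").pipe(
      Schedule.intersect(Schedule.recurs(3))
    ),
    while: (error) => error._tag === "RetryableError",
  })
);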

Good Example

This example demonstrates circuit breaker, fallback chain, error classification, and bulkhead patterns.

import { Effect, Schedule, Ref, Data } from "effect";

// Error classification
class RetryableError extends Data.TaggedError("RetryableError")<{
  message: string;
  code: string;
}> {}

class NonRetryableError extends Data.TaggedError("NonRetryableError")<{
  message: string;
  code: string;
}> {}

class CircuitBreakerOpenError extends Data.TaggedError("CircuitBreakerOpenError")<{
  message: string;
}> {}

// Circuit breaker state
interface CircuitBreakerState {
  status: "closed" | "open" | "half-open";
  failureCount: number;
  lastFailureTime: Date | null;
  successCount: number;
}

// Create circuit breaker
const createCircuitBreaker = (config: {
  failureThreshold: number;
  resetTimeoutMs: number;
  halfOpenRequests: number;
}) =>
  Effect.gen(function* () {
    const state = yield* Ref.make<CircuitBreakerState>({
      status: "closed",
      failureCount: 0,
      lastFailureTime: null,
      successCount: 0,
    });

    const recordSuccess = Ref.update(state, (s): CircuitBreakerState => {
      if (s.status !== "half-open") {
        return s;
      }
      const successCount = s.successCount + 1;
      return {
        ...s,
        successCount,
        failureCount: 0,
        status: successCount >= config.halfOpenRequests ? "closed" : "half-open",
      };
    });

    const recordFailure = Ref.update(state, (s): CircuitBreakerState => {
      const failureCount = s.failureCount + 1;
      return {
        ...s,
        failureCount,
        lastFailureTime: new Date(),
        status: failureCount >= config.failureThreshold ? "open" : s.status,
      };
    });

    const canExecute = Effect.gen(function* () {
      const current = yield* Ref.get(state);

      if (current.status === "closed") {
        return true;
      }

      if (current.status === "open") {
        const timeSinceFailure = Date.now() - (current.lastFailureTime?.getTime() ?? 0);

        if (timeSinceFailure > config.resetTimeoutMs) {
          yield* Ref.update(state, (s): CircuitBreakerState => ({
            ...s,
            status: "half-open",
            failureCount: 0,
            successCount: 0,
          }));
          return true;
        }

        return false;
      }

      // half-open: allow limited requests
      return true;
    });

    return { recordSuccess, recordFailure, canExecute, state };
  });

// Main example
const program = Effect.gen(function* () {
  console.log(`\n[ADVANCED RETRY] Circuit breaker and fallback chains\n`);

  // Create circuit breaker
  const cb = yield* createCircuitBreaker({
    failureThreshold: 3,
    resetTimeoutMs: 150, // short enough for the demo below to reach half-open
    halfOpenRequests: 2,
  });

  // Example 1: Circuit breaker in action
  console.log(`[1] Circuit breaker state transitions:\n`);

  let requestCount = 0;

  const callWithCircuitBreaker = (shouldFail: boolean) =>
    Effect.gen(function* () {
      const canExecute = yield* cb.canExecute;

      if (!canExecute) {
        yield* Effect.fail(
          new CircuitBreakerOpenError({
            message: "Circuit breaker is open",
          })
        );
      }

      requestCount++;

      if (shouldFail) {
        yield* cb.recordFailure;
        yield* Effect.log(
          `[REQUEST ${requestCount}] FAILED (Circuit: ${(yield* Ref.get(cb.state)).status})`
        );
        yield* Effect.fail(
          new RetryableError({
            message: "Service error",
            code: "500",
          })
        );
      } else {
        yield* cb.recordSuccess;
        yield* Effect.log(
          `[REQUEST ${requestCount}] SUCCESS (Circuit: ${(yield* Ref.get(cb.state)).status})`
        );
        return "success";
      }
    });

  // Simulate failures then recovery
  const failSequence = [true, true, true, false, false, false];

  for (const shouldFail of failSequence) {
    yield* callWithCircuitBreaker(shouldFail).pipe(
      Effect.catchAll((error) =>
        Effect.gen(function* () {
          if (error._tag === "CircuitBreakerOpenError") {
            yield* Effect.log(
              `[REQUEST ${requestCount + 1}] REJECTED (Circuit open)`
            );
          } else {
            yield* Effect.log(
              `[REQUEST ${requestCount}] ERROR caught`
            );
          }
        })
      )
    );

    // Add delay between requests
    yield* Effect.sleep("100 millis");
  }

  // Example 2: Fallback chain
  console.log(`\n[2] Fallback chain (primary → secondary → cache):\n`);

  const endpoints = {
    primary: "https://api.primary.com/data",
    secondary: "https://api.secondary.com/data",
    cache: "cached-data",
  };

  const callEndpoint = (name: string, shouldFail: boolean) =>
    Effect.gen(function* () {
      yield* Effect.log(`[CALL] Trying ${name}`);

      if (shouldFail) {
        yield* Effect.sleep("50 millis");
        yield* Effect.fail(
          new RetryableError({
            message: `${name} failed`,
            code: "500",
          })
        );
      }

      yield* Effect.sleep("50 millis");
      return `data-from-${name}`;
    });

  const fallbackChain = callEndpoint("primary", true).pipe(
    Effect.orElse(() => callEndpoint("secondary", false)),
    Effect.orElse(() =>
      Effect.gen(function* () {
        yield* Effect.log(`[FALLBACK] Using cached data`);
        return endpoints.cache;
      })
    )
  );

  const result = yield* fallbackChain;

  yield* Effect.log(`[RESULT] Got: ${result}\n`);

  // Example 3: Error-specific retry strategy
  console.log(`[3] Error classification and adaptive retry:\n`);

  const classifyError = (code: string) => {
    if (["502", "503", "504"].includes(code)) {
      return "retryable-service-error";
    }
    if (["408", "429"].includes(code)) {
      return "retryable-rate-limit";
    }
    if (["404", "401", "403"].includes(code)) {
      return "non-retryable";
    }
    if (code === "timeout") {
      return "retryable-network";
    }
    return "unknown";
  };

  const errorCodes = ["500", "404", "429", "503", "timeout"];

  for (const code of errorCodes) {
    const classification = classifyError(code);
    const shouldRetry = !classification.startsWith("non-retryable");

    yield* Effect.log(
      `[ERROR ${code}] → ${classification} (Retry: ${shouldRetry})`
    );
  }

  // Example 4: Bulkhead pattern
  console.log(`\n[4] Bulkhead isolation (limit concurrency per endpoint):\n`);

  const bulkheads = {
    "primary-api": { maxConcurrent: 5, currentCount: 0 },
    "secondary-api": { maxConcurrent: 3, currentCount: 0 },
  };

  const acquirePermit = (endpoint: string) =>
    Effect.gen(function* () {
      const bulkhead = bulkheads[endpoint as keyof typeof bulkheads];

      if (!bulkhead) {
        return false;
      }

      if (bulkhead.currentCount < bulkhead.maxConcurrent) {
        bulkhead.currentCount++;
        return true;
      }

      yield* Effect.log(
        `[BULKHEAD] ${endpoint} at capacity (${bulkhead.currentCount}/${bulkhead.maxConcurrent})`
      );

      return false;
    });

  // Simulate requests
  for (let i = 0; i < 10; i++) {
    const endpoint = i < 6 ? "primary-api" : "secondary-api";
    const acquired = yield* acquirePermit(endpoint);

    if (acquired) {
      yield* Effect.log(
        `[REQUEST] Acquired permit for ${endpoint}`
      );
    }
  }
});

Effect.runPromise(program);
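
Example 4 above models the bulkhead with plain mutable counters so the capacity checks are easy to log. In real code, Effect's built-in Semaphore is the more idiomatic tool; a minimal sketch (endpoint names are illustrative), where excess callers wait for a permit instead of being rejected:

import { Effect } from "effect";

const bulkheadWithSemaphore = Effect.gen(function* () {
  // At most 5 concurrent calls to the primary API, 3 to the secondary.
  const primaryPermits = yield* Effect.makeSemaphore(5);
  const secondaryPermits = yield* Effect.makeSemaphore(3);

  const callPrimary = primaryPermits.withPermits(1)(
    Effect.log("[REQUEST] Calling primary-api").pipe(
      Effect.zipRight(Effect.sleep("50 millis"))
    )
  );

  const callSecondary = secondaryPermits.withPermits(1)(
    Effect.log("[REQUEST] Calling secondary-api").pipe(
      Effect.zipRight(Effect.sleep("50 millis"))
    )
  );

  yield* Effect.all(
    [
      ...Array.from({ length: 6 }, () => callPrimary),
      ...Array.from({ length: 4 }, () => callSecondary),
    ],
    { concurrency: "unbounded" }
  );
});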

Advanced: Complex Retry Strategy

Combine multiple retry conditions in a single Schedule: exponential backoff with jitter, a per-delay cap, an attempt limit, and an overall time budget (NonRetryableError is the class from the example above; `operation` is a placeholder):

import { Duration, Effect, Schedule } from "effect";

const createComplexRetryStrategy = (config: {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
  timeoutMs: number;
}) =>
  // Exponential backoff with jitter, each delay capped at maxDelayMs,
  // bounded by both an attempt count and a total elapsed-time budget.
  Schedule.exponential(Duration.millis(config.baseDelayMs)).pipe(
    Schedule.jittered,
    Schedule.modifyDelay((_, delay) =>
      Duration.min(delay, Duration.millis(config.maxDelayMs))
    ),
    Schedule.intersect(Schedule.recurs(config.maxAttempts)),
    Schedule.upTo(Duration.millis(config.timeoutMs))
  );

// `operation` stands for any effect that may fail (placeholder).
declare const operation: Effect.Effect<string, Error>;

// Usage with error filtering: don't retry non-retryable errors.
// NonRetryableError is the tagged error class from the Good Example above.
const robustCall = operation.pipe(
  Effect.retry(
    createComplexRetryStrategy({
      maxAttempts: 5,
      baseDelayMs: 100,
      maxDelayMs: 5000,
      timeoutMs: 30000,
    }).pipe(
      Schedule.whileInput((error) => !(error instanceof NonRetryableError))
    )
  )
);

Advanced: Health Check with Recovery

Verify recovery before resuming traffic:

import { Duration, Effect, Ref } from "effect";

const createHealthAwareRetry = (config: {
  maxRetries: number;
  healthCheckDelayMs: number;
  healthCheckTimeoutMs: number;
}) =>
  Effect.gen(function* () {
    const isHealthy = yield* Ref.make(true);

    const performHealthCheck = Effect.gen(function* () {
      yield* Effect.log("[HEALTH] Checking service health");

      // Replace with actual health check
      const healthy = Math.random() > 0.3;

      yield* Ref.set(isHealthy, healthy);
      yield* Effect.log(
        `[HEALTH] Service is ${healthy ? "healthy" : "still failing"}`
      );

      return healthy;
    }).pipe(
      Effect.timeout(Duration.millis(config.healthCheckTimeoutMs)),
      Effect.catchAll(() =>
        Effect.gen(function* () {
          yield* Effect.log("[HEALTH] Check timed out, assuming unhealthy");
          return false;
        })
      )
    );

    return performHealthCheck;
  });
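
One possible way to use it, a sketch that assumes Effect.repeat's options form: keep probing once per second until the check reports healthy, then resume traffic.

import { Effect, Schedule } from "effect";

const waitUntilHealthy = Effect.gen(function* () {
  const performHealthCheck = yield* createHealthAwareRetry({
    maxRetries: 3,
    healthCheckDelayMs: 1000,
    healthCheckTimeoutMs: 2000,
  });

  // Probe every second until the health check reports true.
  yield* performHealthCheck.pipe(
    Effect.repeat({
      schedule: Schedule.spaced("1 second"),
      until: (healthy) => healthy,
    })
  );

  yield* Effect.log("[HEALTH] Service recovered, resuming traffic");
});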

Advanced: Telemetry and Observability

Track retry metrics for debugging:

import { Duration, Effect, Ref, Schedule } from "effect";

interface RetryMetrics {
  attempts: number;
  failures: number;
  totalDelayMs: number;
  errorCodes: string[];
  lastError: Error | null;
}

const trackRetryMetrics = (operation: Effect.Effect<string, unknown>) =>
  Effect.gen(function* () {
    const metrics = yield* Ref.make<RetryMetrics>({
      attempts: 0,
      failures: 0,
      totalDelayMs: 0,
      errorCodes: [],
      lastError: null,
    });

    const result = yield* operation.pipe(
      Effect.retry(
        Schedule.exponential("100 millis").pipe(
          // Record each retry's delay before limiting the attempt count.
          Schedule.tapOutput((delay) =>
            Ref.update(metrics, (m) => ({
              ...m,
              attempts: m.attempts + 1,
              totalDelayMs: m.totalDelayMs + Duration.toMillis(delay),
            }))
          ),
          Schedule.intersect(Schedule.recurs(3))
        )
      ),
      // Record the final error without swallowing it.
      Effect.tapError((error) =>
        Ref.update(metrics, (m) => ({
          ...m,
          failures: m.failures + 1,
          lastError: error instanceof Error ? error : new Error(String(error)),
        }))
      )
    );

    const final = yield* Ref.get(metrics);

    yield* Effect.log(
      `[METRICS] Attempts: ${final.attempts}, Failures: ${final.failures}, Total delay: ${final.totalDelayMs}ms`
    );

    return result;
  });
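
For metrics that outlive a single call, Effect also ships a Metric module. A minimal sketch, assuming a process-wide counter named "retry_attempts" (the name is illustrative):

import { Effect, Metric, Schedule } from "effect";

// Counter incremented once per retry attempt, wherever this policy is used.
const retryAttempts = Metric.counter("retry_attempts");

const meteredRetryPolicy = Schedule.exponential("100 millis").pipe(
  Schedule.tapOutput(() => Metric.increment(retryAttempts)),
  Schedule.intersect(Schedule.recurs(3))
);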

When to Use This Pattern

Use circuit breaker when:

  • Calling external services
  • Error rate might spike
  • Want to fail fast
  • Protecting dependent services

Use fallback chain when:

  • Multiple approaches available
  • Want graceful degradation
  • Have secondary endpoints
  • Need cache as last resort

Use adaptive retry when:

  • Mixed error types expected
  • Different errors need different strategies
  • Want to optimize retry efficiency
  • Observability matters

⚠️ Trade-offs:

  • Complexity increases significantly
  • More state to manage
  • Requires careful tuning
  • Harder to reason about

Configuration Guide

Setting              Recommendation   Notes
Failure Threshold    5-10 errors      Too low = circuit opens incorrectly
Reset Timeout        30-60 seconds    Too short = rapid oscillation
Half-Open Requests   2-5              Verify recovery gradually
Max Retries          3-5              Balance between resilience and delay
Max Delay            10-30 seconds    Prevent long tail latencies
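
Applied to the createCircuitBreaker helper from the example above, a configuration in this range might look like:

const productionCircuitBreaker = createCircuitBreaker({
  failureThreshold: 5,    // 5-10 errors before the circuit opens
  resetTimeoutMs: 30_000, // 30-60 seconds before probing again
  halfOpenRequests: 3,    // 2-5 successes needed to close the circuit
});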

See Also