Skip to content

Latest commit

 

History

History
280 lines (208 loc) · 8.3 KB

File metadata and controls

280 lines (208 loc) · 8.3 KB

Replication Lag Optimization Guide

Target Metrics

  • P50 (Median): < 100 ms
  • P95 (95th Percentile): < 500 ms

Definition: Replication lag is the time for new data to appear on replica shards after being written to primary shards.

How to Measure

Run the Measurement Script

# Basic measurement (100 samples)
python3 scripts/measure_replication_lag.py

# Custom index and sample count (use 'direct' for true replication lag)
python3 scripts/measure_replication_lag.py \
  --index phish-us \
  --samples 200 \
  --method direct

# Save results to file
python3 scripts/measure_replication_lag.py \
  --output replication_lag_results.json

Measurement Methods

  1. direct (default): Measures actual replication lag - time for data to be written to replica shards (synchronous replication). This is the true replication lag and should be < 100ms.
  2. refresh: Measures searchability lag - includes refresh time. This measures when data becomes searchable (includes refresh interval overhead, typically 200-400ms).
  3. shard_stats: Uses shard-level statistics (more accurate but slower)

Important: Elasticsearch uses synchronous replication by default. When you write to a primary shard, it waits for replica acknowledgment before returning success. So actual replication lag is just the write latency (typically 10-50ms), not the refresh time.

Why you might see higher values with refresh method: The refresh method includes the time for documents to become searchable, which includes the refresh interval (default 1s). This is "searchability lag", not "replication lag". Use --method direct to measure true replication lag.

Elasticsearch Configuration for Low Replication Lag

1. Synchronous Replication (Default)

Elasticsearch uses synchronous replication by default, which means:

  • Write operations wait for replica acknowledgment
  • Replication happens immediately after primary write
  • No data loss risk

Configuration (already default):

{
  "index": {
    "number_of_replicas": 1
  }
}

2. Translog Settings

The transaction log (translog) controls durability and replication:

{
  "index": {
    "translog": {
      "durability": "request",  // Wait for translog sync (default: "request")
      "sync_interval": "5s",    // Sync translog every 5s (default: 5s)
      "flush_threshold_size": "512mb"  // Flush when translog reaches this size
    }
  }
}

For lower replication lag:

  • durability: "request" - Ensures immediate replication (default, good)
  • Smaller sync_interval - More frequent syncs (but higher I/O)
  • Smaller flush_threshold_size - More frequent flushes

3. Refresh Interval

Control how often indices are refreshed (made searchable):

{
  "index": {
    "refresh_interval": "1s"  // Refresh every 1 second (default: 1s)
  }
}

For lower replication lag:

  • Smaller refresh interval (e.g., "500ms" or "1s")
  • Trade-off: More frequent refreshes = higher CPU usage

4. Index Settings for Fast Replication

{
  "settings": {
    "index": {
      "number_of_replicas": 1,
      "refresh_interval": "1s",
      "translog": {
        "durability": "request",
        "sync_interval": "5s"
      }
    }
  }
}

5. Cluster Settings

{
  "persistent": {
    "cluster.routing.allocation.cluster_concurrent_rebalance": 2,
    "cluster.routing.allocation.node_concurrent_recoveries": 2
  }
}

Design Choices Supporting Low Replication Lag

1. Synchronous Replication

  • How it works: Primary waits for replica acknowledgment before returning success
  • Benefit: Ensures data is replicated immediately
  • Trade-off: Slightly higher write latency, but guarantees consistency

2. Translog Durability

  • How it works: Transaction log ensures writes are durable before acknowledgment
  • Benefit: Prevents data loss and ensures replication consistency
  • Configuration: durability: "request" (default) ensures immediate replication

3. Refresh Interval

  • How it works: Controls how often new data becomes searchable
  • Benefit: Frequent refreshes (1s) ensure low replication lag
  • Trade-off: More CPU usage, but acceptable for most use cases

4. Network Proximity

  • How it works: Replica shards on different nodes in same cluster
  • Benefit: Low network latency between nodes
  • Cloud deployment: Elastic Cloud nodes are in same region, minimizing latency

5. Shard Distribution

  • How it works: Primary and replica shards on different nodes
  • Benefit: Parallel replication across multiple nodes
  • Configuration: 3-node cluster ensures primary and replicas are on different nodes

Expected Performance

True Replication Lag (using --method direct)

Elasticsearch uses synchronous replication, so replication completes during the write operation:

  • P50: ~10-50 ms ✅ (well below 100ms target)
  • P95: ~50-100 ms ✅ (well below 500ms target)
  • P99: ~100-200 ms

Searchability Lag (using --method refresh)

This includes refresh time, which is separate from replication:

  • P50: ~200-300 ms (includes refresh overhead)
  • P95: ~300-500 ms
  • P99: ~500-1000 ms

Note: Replication lag (data on replica shards) is different from searchability lag (when data becomes searchable). The target metrics refer to replication lag, not searchability lag.

Optimization Steps

Step 1: Verify Current Settings

# Check index settings
curl "http://localhost:9200/phish-us/_settings?pretty"

# Check cluster settings
curl "http://localhost:9200/_cluster/settings?pretty"

Step 2: Optimize Index Settings

# Update refresh interval
curl -X PUT "http://localhost:9200/phish-*/_settings" \
  -H "Content-Type: application/json" \
  -d '{
    "index": {
      "refresh_interval": "1s"
    }
  }'

Step 3: Measure Replication Lag

python3 scripts/measure_replication_lag.py --samples 200

Step 4: Adjust if Needed

If P95 > 500ms:

  1. Reduce refresh interval to "500ms" (if CPU allows)
  2. Check network latency between nodes
  3. Verify translog settings
  4. Check for cluster resource constraints

Monitoring

Real-time Monitoring

# Check index stats
curl "http://localhost:9200/phish-us/_stats?pretty"

# Check shard stats
curl "http://localhost:9200/phish-us/_stats/level=shards?pretty"

Continuous Measurement

Run periodic measurements:

# Every 5 minutes
while true; do
  python3 scripts/measure_replication_lag.py --samples 50 --output lag_$(date +%s).json
  sleep 300
done

Troubleshooting

High Replication Lag (> 500ms P95)

Possible Causes:

  1. Network latency: Nodes too far apart

    • Solution: Ensure nodes in same region/datacenter
  2. High cluster load: Too many concurrent operations

    • Solution: Increase cluster resources or reduce load
  3. Slow disk I/O: Replica writes are slow

    • Solution: Use faster storage (SSD), check disk health
  4. Large documents: Big documents take longer to replicate

    • Solution: Optimize document size, use compression
  5. Too many replicas: More replicas = more replication work

    • Solution: Use 1 replica for most cases (balance vs. redundancy)

Verification Commands

# Check cluster health
curl "http://localhost:9200/_cluster/health?pretty"

# Check node stats
curl "http://localhost:9200/_nodes/stats?pretty"

# Check index refresh stats
curl "http://localhost:9200/phish-us/_stats/refresh?pretty"

Design Support Summary

Design Choice How It Supports Low Replication Lag
Synchronous Replication Immediate replication, no delay
Translog Durability Ensures replication consistency
1s Refresh Interval Frequent refreshes make data searchable quickly
3-Node Cluster Primary and replicas on different nodes, parallel replication
Regional Distribution Nodes in same region minimize network latency
Hash-Based Sharding Even distribution prevents hot spots that slow replication

Expected Results

With proper configuration:

  • P50: 50-80 ms (well below 100ms target) ✅
  • P95: 200-400 ms (well below 500ms target) ✅
  • P99: 500-1000 ms

These targets are achievable with default Elasticsearch settings on a properly configured cluster.