Replication Lag Optimization Guide

Target Metrics

P50 (Median): < 100 ms
P95 (95th Percentile): < 500 ms

Definition: Replication lag is the time for new data to appear on replica shards after being written to primary shards.
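
As a rough illustration of how the two targets can be checked against a set of lag samples, the sketch below uses a nearest-rank percentile. The function names and the sample values are hypothetical, and the measurement script may compute percentiles differently (for example, with interpolation).

```python
import math

# Targets from this guide, in milliseconds.
TARGETS_MS = {"p50": 100.0, "p95": 500.0}


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_targets(samples_ms):
    """Compare P50 and P95 of the samples against the guide's targets."""
    p50 = percentile(samples_ms, 50)
    p95 = percentile(samples_ms, 95)
    return {
        "p50_ms": p50,
        "p95_ms": p95,
        "p50_ok": p50 < TARGETS_MS["p50"],
        "p95_ok": p95 < TARGETS_MS["p95"],
    }


# Hypothetical lag samples in milliseconds, for illustration only.
print(check_targets([12, 18, 25, 31, 40, 44, 52, 60, 75, 480]))
```

With only ten samples, a single slow write decides the P95, which is one reason to use a larger sample count when measuring.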

How to Measure

Run the Measurement Script

# Basic measurement (100 samples)
python3 scripts/measure_replication_lag.py

# Custom index and sample count (use 'direct' for true replication lag)
python3 scripts/measure_replication_lag.py \
  --index phish-us \
  --samples 200 \
  --method direct

# Save results to file
python3 scripts/measure_replication_lag.py \
  --output replication_lag_results.json

Measurement Methods

direct (default): Measures actual replication lag - time for data to be written to replica shards (synchronous replication). This is the true replication lag and should be < 100ms.
refresh: Measures searchability lag - includes refresh time. This measures when data becomes searchable (includes refresh interval overhead, typically 200-400ms).
shard_stats: Uses shard-level statistics (more accurate but slower)

Important: Elasticsearch replicates writes synchronously by default. A write goes to the primary shard, which forwards it to every in-sync replica and waits for their acknowledgments before returning success. Actual replication lag is therefore part of the write latency (typically 10-50ms), not the refresh time.

Why you might see higher values with refresh method: The refresh method includes the time for documents to become searchable, which includes the refresh interval (default 1s). This is "searchability lag", not "replication lag". Use --method direct to measure true replication lag.
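
The difference between the two methods can be sketched as two variants of the same timed indexing request. This is not the code of measure_replication_lag.py, whose internals are not shown here; build_index_url, timed_ms, and index_probe are hypothetical helpers. The sketch assumes that "direct" amounts to timing an index request with wait_for_active_shards=all (so the write is only accepted when every shard copy is active), and that "refresh" amounts to timing the same request with refresh=wait_for (so the call returns only once a refresh has made the document searchable).

```python
import json
import time
import urllib.request
from urllib.parse import quote, urlencode


def build_index_url(base_url, index, doc_id, method="direct"):
    """Build the index-document URL for one probe write."""
    if method == "direct":
        params = {"wait_for_active_shards": "all"}
    elif method == "refresh":
        params = {"refresh": "wait_for"}
    else:
        raise ValueError(f"unsupported method: {method!r}")
    path = f"{quote(index, safe='')}/_doc/{quote(doc_id, safe='')}"
    return f"{base_url.rstrip('/')}/{path}?{urlencode(params)}"


def timed_ms(action):
    """Run action() and return its wall-clock duration in milliseconds."""
    start = time.perf_counter()
    action()
    return (time.perf_counter() - start) * 1000.0


def index_probe(base_url, index, doc_id, method="direct"):
    """Send one small probe document (this writes to the index)."""
    body = json.dumps({"probe": True, "sent_at": time.time()}).encode("utf-8")
    request = urllib.request.Request(
        build_index_url(base_url, index, doc_id, method),
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        response.read()
```

A single sample would then be `timed_ms(lambda: index_probe("http://localhost:9200", "phish-us", "lag-probe-1"))`. Because the probe writes a real document, a scratch index is a safer target than a production one.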

Elasticsearch Configuration for Low Replication Lag

1. Synchronous Replication (Default)

Elasticsearch uses synchronous replication by default, which means:

Write operations wait for replica acknowledgment
Replication happens immediately after primary write
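An acknowledged write exists on the primary and on every in-sync replica, so losing a single node does not lose acknowledged data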

Configuration (already default):

{
  "index": {
    "number_of_replicas": 1
  }
}

2. Translog Settings

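The transaction log (translog) controls write durability. With durability "request", each operation is fsynced to the translog on the primary and on its replicas before the write is acknowledged: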

{
  "index": {
    "translog": {
      "durability": "request",  // Fsync the translog before acknowledging each request (default: "request")
      "sync_interval": "5s",    // Background sync period; matters mainly when durability is "async" (default: 5s)
      "flush_threshold_size": "512mb"  // Flush when translog reaches this size (default: 512mb)
    }
  }
}

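For lower replication lag:

durability: "request" - Keep the default. An acknowledged write is already durable on every in-sync copy. Switching to "async" lowers write latency but can lose recently acknowledged writes if a node crashes
sync_interval - Only determines how often the translog is synced when durability is "async"; a smaller value means more frequent syncs (but higher I/O)
flush_threshold_size - A smaller value triggers more frequent flushes; this bounds translog size and recovery time rather than replication lag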

3. Refresh Interval

Control how often indices are refreshed (made searchable):

{
  "index": {
    "refresh_interval": "1s"  // Refresh every 1 second (default: 1s)
  }
}

For lower searchability lag (refresh does not affect replication lag):

Smaller refresh interval (e.g., "500ms" instead of the default "1s")
Trade-off: More frequent refreshes = higher CPU usage
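
A back-of-the-envelope model helps when choosing an interval. The sketch below assumes that writes arrive at uniformly random times relative to a strictly periodic refresh and that the refresh itself takes no time, so the wait until a document becomes searchable is uniform between 0 and the interval. Both assumptions are simplifications, the function names are hypothetical, and the script's --method refresh may trigger or wait for refreshes differently and report other numbers.

```python
def refresh_wait_ms(refresh_interval_ms, quantile):
    """Wait until the next periodic refresh at the given quantile,
    assuming the wait is uniform on [0, refresh_interval_ms]."""
    if refresh_interval_ms <= 0:
        raise ValueError("refresh interval must be positive")
    if not 0.0 <= quantile <= 1.0:
        raise ValueError("quantile must be between 0 and 1")
    return refresh_interval_ms * quantile


def wait_summary(refresh_interval_ms):
    """P50, P95, and worst-case wait under the uniform-arrival model."""
    return {
        "p50_ms": refresh_wait_ms(refresh_interval_ms, 0.50),
        "p95_ms": refresh_wait_ms(refresh_interval_ms, 0.95),
        "worst_ms": refresh_wait_ms(refresh_interval_ms, 1.0),
    }


for interval_ms in (1000, 500):
    print(interval_ms, wait_summary(interval_ms))
```

Under this model a 1s interval gives a median wait of about 500 ms and a P95 of about 950 ms, while a 500ms interval brings the P95 wait to about 475 ms, before adding the time the refresh itself takes.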

4. Index Settings for Fast Replication

{
  "settings": {
    "index": {
      "number_of_replicas": 1,
      "refresh_interval": "1s",
      "translog": {
        "durability": "request",
        "sync_interval": "5s"
      }
    }
  }
}

5. Cluster Settings

{
  "persistent": {
    "cluster.routing.allocation.cluster_concurrent_rebalance": 2,
    "cluster.routing.allocation.node_concurrent_recoveries": 2
  }
}

Design Choices Supporting Low Replication Lag

1. Synchronous Replication

How it works: Primary waits for replica acknowledgment before returning success
Benefit: Ensures data is replicated immediately
Trade-off: Slightly higher write latency, but guarantees consistency

2. Translog Durability

How it works: Transaction log ensures writes are durable before acknowledgment
Benefit: Prevents data loss and ensures replication consistency
Configuration: durability: "request" (default) fsyncs the translog before each write is acknowledged

3. Refresh Interval

How it works: Controls how often new data becomes searchable
Benefit: Frequent refreshes (1s) keep searchability lag low; replication lag itself does not depend on refresh
Trade-off: More CPU usage, but acceptable for most use cases

4. Network Proximity

How it works: Replica shards on different nodes in same cluster
Benefit: Low network latency between nodes
Cloud deployment: Elastic Cloud nodes are in same region, minimizing latency

5. Shard Distribution

How it works: Primary and replica shards on different nodes
Benefit: Parallel replication across multiple nodes
Configuration: Elasticsearch never allocates a replica to the same node as its primary; a 3-node cluster leaves room to assign every copy and to spread shards across nodes
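
Shard placement can be checked from the output of `GET /_cat/shards/phish-us?format=json`, which returns one row per shard copy with fields such as index, shard, prirep, state, and node. The sketch below summarizes that output; placement_report and under_replicated are hypothetical helper names, and the rows in the example are made up for illustration.

```python
def placement_report(cat_shards_rows):
    """Group _cat/shards rows by (index, shard) and record where each
    copy lives and how many copies are unassigned."""
    report = {}
    for row in cat_shards_rows:
        key = (row["index"], row["shard"])
        entry = report.setdefault(
            key, {"primary": None, "replicas": [], "unassigned": 0}
        )
        if row["state"] == "UNASSIGNED":
            entry["unassigned"] += 1
        elif row["prirep"] == "p":
            entry["primary"] = row["node"]
        else:
            entry["replicas"].append(row["node"])
    return report


def under_replicated(report):
    """Shards with an unassigned copy or with no assigned replica at all."""
    return sorted(
        key
        for key, entry in report.items()
        if entry["unassigned"] > 0 or not entry["replicas"]
    )


# Hypothetical rows, shaped like the JSON output of _cat/shards.
rows = [
    {"index": "phish-us", "shard": "0", "prirep": "p", "state": "STARTED", "node": "node-1"},
    {"index": "phish-us", "shard": "0", "prirep": "r", "state": "STARTED", "node": "node-2"},
    {"index": "phish-us", "shard": "1", "prirep": "p", "state": "STARTED", "node": "node-2"},
    {"index": "phish-us", "shard": "1", "prirep": "r", "state": "UNASSIGNED", "node": None},
]
print(under_replicated(placement_report(rows)))
```

A shard listed by under_replicated accepts writes with fewer copies than configured, so a low measured lag for that shard does not mean the data is fully replicated.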

Expected Performance

True Replication Lag (using `--method direct`)

Elasticsearch uses synchronous replication, so replication completes during the write operation:

P50: ~10-50 ms ✅ (well below 100ms target)
P95: ~50-100 ms ✅ (well below 500ms target)
P99: ~100-200 ms

Searchability Lag (using `--method refresh`)

This includes refresh time, which is separate from replication:

P50: ~200-300 ms (includes refresh overhead)
P95: ~300-500 ms
P99: ~500-1000 ms

Note: Replication lag (data on replica shards) is different from searchability lag (when data becomes searchable). The target metrics refer to replication lag, not searchability lag.

Optimization Steps

Step 1: Verify Current Settings

# Check index settings
curl "http://localhost:9200/phish-us/_settings?pretty"

# Check cluster settings
curl "http://localhost:9200/_cluster/settings?pretty"
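
The index settings response only lists values that were set explicitly; adding `include_defaults=true` to the request also returns a "defaults" section. As a sketch of how the relevant values could be pulled out of the parsed JSON, the hypothetical helpers below look in "settings" first and fall back to "defaults". The sample response is made up for illustration.

```python
KEYS_OF_INTEREST = (
    "index.number_of_replicas",
    "index.refresh_interval",
    "index.translog.durability",
)


def lookup(index_entry, dotted_key):
    """Find a dotted setting name in the nested 'settings' section,
    falling back to 'defaults' (present with include_defaults=true)."""
    for section in ("settings", "defaults"):
        node = index_entry.get(section, {})
        for part in dotted_key.split("."):
            if not isinstance(node, dict) or part not in node:
                node = None
                break
            node = node[part]
        if node is not None:
            return node
    return None


def summarize_settings(settings_response, keys=KEYS_OF_INTEREST):
    """Map each index name to the settings this guide cares about."""
    return {
        index: {key: lookup(entry, key) for key in keys}
        for index, entry in settings_response.items()
    }


# Hypothetical response from GET /phish-us/_settings?include_defaults=true
sample = {
    "phish-us": {
        "settings": {"index": {"number_of_replicas": "1"}},
        "defaults": {
            "index": {
                "refresh_interval": "1s",
                "translog": {"durability": "REQUEST"},
            }
        },
    }
}
print(summarize_settings(sample))
```

Setting values come back as strings (for example "1" rather than 1), so compare them as strings or convert before doing arithmetic.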

Step 2: Optimize Index Settings

# Update refresh interval
curl -X PUT "http://localhost:9200/phish-*/_settings" \
  -H "Content-Type: application/json" \
  -d '{
    "index": {
      "refresh_interval": "1s"
    }
  }'

Step 3: Measure Replication Lag

python3 scripts/measure_replication_lag.py --samples 200

Step 4: Adjust if Needed

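If P95 > 500ms:

Confirm which method produced the number; --method refresh includes refresh time on top of replication
For searchability lag, reduce refresh interval to "500ms" (if CPU allows)
Check network latency between nodes
Verify translog settings
Check for cluster resource constraints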

Monitoring

Real-time Monitoring

# Check index stats
curl "http://localhost:9200/phish-us/_stats?pretty"

# Check shard stats
curl "http://localhost:9200/phish-us/_stats?level=shards&pretty"
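
Shard-level stats include sequence-number checkpoints for every shard copy, which gives a way to see how many operations a replica is behind its primary. The sketch below assumes the parsed JSON of a shard-level stats response (the index stats API with level=shards); replica_ops_behind is a hypothetical helper and the sample data is made up. Stats for different copies are sampled at slightly different moments, so small non-zero values under write load are expected.

```python
def replica_ops_behind(stats_response, index):
    """For each shard, report primary local checkpoint minus each
    replica's local checkpoint, keyed by the replica's node id."""
    shards = stats_response["indices"][index]["shards"]
    report = {}
    for shard_id, copies in shards.items():
        primary = next((c for c in copies if c["routing"]["primary"]), None)
        if primary is None:
            continue  # primary not reported for this shard
        primary_checkpoint = primary["seq_no"]["local_checkpoint"]
        report[shard_id] = {
            copy["routing"]["node"]: primary_checkpoint
            - copy["seq_no"]["local_checkpoint"]
            for copy in copies
            if not copy["routing"]["primary"]
        }
    return report


# Hypothetical, heavily trimmed shard-level stats response.
sample_stats = {
    "indices": {
        "phish-us": {
            "shards": {
                "0": [
                    {
                        "routing": {"primary": True, "node": "node-1"},
                        "seq_no": {"local_checkpoint": 1500},
                    },
                    {
                        "routing": {"primary": False, "node": "node-2"},
                        "seq_no": {"local_checkpoint": 1497},
                    },
                ]
            }
        }
    }
}
print(replica_ops_behind(sample_stats, "phish-us"))
```

A gap that keeps growing across repeated calls is a stronger signal of a struggling replica than any single reading.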

Continuous Measurement

Run periodic measurements:

# Every 5 minutes
while true; do
  python3 scripts/measure_replication_lag.py --samples 50 --output lag_$(date +%s).json
  sleep 300
done

Troubleshooting

High Replication Lag (> 500ms P95)

Possible Causes:

Network latency: Nodes too far apart
- Solution: Ensure nodes in same region/datacenter
High cluster load: Too many concurrent operations
- Solution: Increase cluster resources or reduce load
Slow disk I/O: Replica writes are slow
- Solution: Use faster storage (SSD), check disk health
Large documents: Big documents take longer to replicate
- Solution: Optimize document size, use compression
Too many replicas: More replicas = more replication work
- Solution: Use 1 replica for most cases (balance vs. redundancy)

Verification Commands

# Check cluster health
curl "http://localhost:9200/_cluster/health?pretty"

# Check node stats
curl "http://localhost:9200/_nodes/stats?pretty"

# Check index refresh stats
curl "http://localhost:9200/phish-us/_stats/refresh?pretty"
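
The refresh stats report a refresh count and the cumulative time spent refreshing, from which an average refresh duration can be derived. The helper below is a hypothetical sketch that works on the parsed JSON of the refresh stats response; the sample numbers are made up.

```python
def average_refresh_ms(stats_response, scope="total"):
    """Average time per refresh in milliseconds, or None if no refresh
    has happened. scope is 'primaries' or 'total' (all shard copies)."""
    refresh = stats_response["_all"][scope]["refresh"]
    if refresh["total"] == 0:
        return None
    return refresh["total_time_in_millis"] / refresh["total"]


# Hypothetical, trimmed response from GET /phish-us/_stats/refresh
sample_refresh = {
    "_all": {
        "primaries": {"refresh": {"total": 40, "total_time_in_millis": 1000}},
        "total": {"refresh": {"total": 80, "total_time_in_millis": 2400}},
    }
}
print(average_refresh_ms(sample_refresh))
```

The counters are cumulative since the shards started, so comparing two readings taken a few minutes apart gives a better picture of current refresh cost than a single average.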

Design Support Summary

| Design Choice | How It Supports Low Replication Lag |
| --- | --- |
| Synchronous Replication | Immediate replication, no delay |
| Translog Durability | Ensures replication consistency |
| 1s Refresh Interval | Frequent refreshes make data searchable quickly |
| 3-Node Cluster | Primary and replicas on different nodes, parallel replication |
| Regional Distribution | Nodes in same region minimize network latency |
| Hash-Based Sharding | Even distribution prevents hot spots that slow replication |

Expected Results

With proper configuration:

P50: 50-80 ms (below 100ms target) ✅
P95: 200-400 ms (below 500ms target) ✅
P99: 500-1000 ms

These ranges are looser than the --method direct estimates under Expected Performance; both sets meet the targets. The targets are achievable with default Elasticsearch settings on a properly configured cluster.
