|
1 | 1 | # Replication Performance Analysis |
2 | 2 |
|
| 3 | +## Architecture Overview |
| 4 | + |
| 5 | +Replication in D-LOCKSS is driven by two complementary mechanisms: |
| 6 | + |
| 7 | +1. **CRDT Cluster Sync** — Each shard runs an embedded IPFS Cluster with CRDT consensus. When a file is pinned to a shard's cluster, the `LocalPinTracker` on every peer in that shard automatically syncs and pins the content locally. |
| 8 | + |
| 9 | +2. **ReplicationRequest Protocol** — The `replicationManager` (extracted from `ShardManager`) periodically broadcasts `ReplicationRequest` messages for pinned manifests. Peers that don't yet have the file perform **auto-replication**: fetch via `PinRecursive` and add to the cluster. |
| 10 | + |
| 11 | +## Key Constants and Defaults |
| 12 | + |
| 13 | +| Parameter | Default | Env Variable | Location | |
| 14 | +|-----------|---------|-------------|----------| |
| 15 | +| Replication Check Interval | 1 minute | `DLOCKSS_CHECK_INTERVAL` | `config.Replication.CheckInterval` | |
| 16 | +| Root Shard Check Interval | 20 seconds | (hardcoded) | `rootReplicationCheckInterval` | |
| 17 | +| Request Cooldown Per Manifest | 5 minutes | (hardcoded) | `replicationRequestCooldownDuration` | |
| 18 | +| Max Requests Per Cycle | 50 | (hardcoded) | `maxReplicationRequestsPerCycle` | |
| 19 | +| Auto-Replication Enabled | true | `DLOCKSS_AUTO_REPLICATION_ENABLED` | `config.Replication.AutoReplicationEnabled` | |
| 20 | +| Auto-Replication Timeout | 5 minutes | `DLOCKSS_AUTO_REPLICATION_TIMEOUT` | `config.Replication.AutoReplicationTimeout` | |
| 21 | +| Max Concurrent Checks | 5 | `DLOCKSS_MAX_CONCURRENT_CHECKS` | `config.Replication.MaxConcurrentReplicationChecks` | |
| 22 | +| Pin Reannounce Interval | 2 minutes | `DLOCKSS_PIN_REANNOUNCE_INTERVAL` | `config.Replication.PinReannounceInterval` | |
| 23 | +| Min Replication | 5 | `DLOCKSS_MIN_REPLICATION` | `config.Replication.MinReplication` | |
| 24 | +| Max Replication | 10 | `DLOCKSS_MAX_REPLICATION` | `config.Replication.MaxReplication` | |
| 25 | + |
| 26 | +## Convergence Timeline |
| 27 | + |
| 28 | +For a newly ingested file to reach full replication across a shard: |
| 29 | + |
| 30 | +1. **Ingest** (immediate): File pinned locally, `IngestMessage` broadcast to shard, cluster `Pin()` called. |
| 31 | +2. **CRDT Sync** (seconds): Cluster state propagates to peers via PubSub; `LocalPinTracker` detects new pin and starts `PinRecursive`. |
| 32 | +3. **First Replication Check** (up to 20s at root, 1m elsewhere): `replicationManager.runChecker()` sends `ReplicationRequest` for all pinned manifests. |
| 33 | +4. **Auto-Replication** (seconds to minutes): Peers receiving the request that don't have the file fetch it via `PinRecursive` (up to 5-minute timeout). |
| 34 | +5. **Cooldown** (5 minutes): After sending a request for a manifest, no new request is sent for that manifest for 5 minutes. |
| 35 | + |
| 36 | +**Typical convergence**: Most files replicate within 1-2 minutes via CRDT sync alone. Files that fail the initial sync (large DAGs, slow block propagation) recover on the next replication cycle after the 5-minute cooldown. |
| 37 | + |
3 | 38 | ## Current Bottlenecks |
4 | 39 |
|
5 | | -Based on code analysis, the following factors contribute to slow replication convergence: |
6 | | - |
7 | | -### 1. **Replication Check Interval** (Default: 1 minute) |
8 | | -- **Location**: `CheckInterval = 1*time.Minute` |
9 | | -- **Impact**: Replication levels are only checked once per minute |
10 | | -- **Effect**: Minimum delay of 1 minute before detecting under-replication |
11 | | - |
12 | | -### 2. **Hysteresis Verification Delay** (Default: 30 seconds) |
13 | | -- **Location**: `ReplicationVerificationDelay = 30*time.Second` |
14 | | -- **Impact**: When under-replication is detected, system waits ~30 seconds before triggering replication requests |
15 | | -- **Effect**: Adds 30+ seconds delay before NEED messages are broadcast |
16 | | -- **Rationale**: Prevents false alarms from transient DHT issues |
17 | | - |
18 | | -### 3. **Replication Check Cooldown** (Default: 15 seconds) |
19 | | -- **Location**: `ReplicationCheckCooldown = 15*time.Second` |
20 | | -- **Impact**: Prevents checking the same file more than once every 15 seconds |
21 | | -- **Effect**: Limits how quickly replication can be re-checked after a change |
22 | | - |
23 | | -### 4. **Replication Cache TTL** (Default: 5 minutes) |
24 | | -- **Location**: `ReplicationCacheTTL = 5*time.Minute` |
25 | | -- **Impact**: Cached replication counts prevent frequent DHT queries |
26 | | -- **Effect**: Replication counts may be stale for up to 5 minutes |
27 | | -- **Trade-off**: Reduces DHT load but slows convergence detection |
28 | | - |
29 | | -### 5. **DHT Query Timeout** (Default: 2 minutes) |
30 | | -- **Location**: `context.WithTimeout(ctx, 2*time.Minute)` in `checkReplication()` |
31 | | -- **Impact**: DHT queries can take up to 2 minutes to timeout |
32 | | -- **Effect**: Slow DHT queries delay replication checks |
33 | | - |
34 | | -### 6. **DHT Max Sample Size** (Default: 50) |
35 | | -- **Location**: `DHTMaxSampleSize = 50` |
36 | | -- **Impact**: Limits how many providers are queried per DHT lookup |
37 | | -- **Effect**: May underestimate replication count in large networks |
38 | | - |
39 | | -### 7. **Worker Pool Limit** (Default: 10 concurrent checks) |
40 | | -- **Location**: `MaxConcurrentReplicationChecks = 10` |
41 | | -- **Impact**: Limits parallelism of replication checks |
42 | | -- **Effect**: With many files, checks are serialized |
43 | | - |
44 | | -### 8. **Missing Automatic Replication** |
45 | | -- **Issue**: When a `ReplicationRequest` is received, nodes only check replication - they don't automatically fetch and pin missing files |
46 | | -- **Impact**: Nodes must already have the file to replicate it |
47 | | -- **Effect**: Replication requests don't trigger new replication, only verify existing state |
48 | | - |
49 | | -## Total Minimum Delay |
50 | | - |
51 | | -For a new file to reach target replication: |
52 | | -1. **Initial check**: Up to 1 minute (CheckInterval) |
53 | | -2. **Verification delay**: ~30 seconds (ReplicationVerificationDelay) |
54 | | -3. **Replication request broadcast**: Immediate |
55 | | -4. **Other nodes check**: Up to 1 minute (their CheckInterval) |
56 | | -5. **Re-check after replication**: Up to 1 minute + 15 seconds cooldown |
57 | | - |
58 | | -**Minimum time to convergence**: ~3-4 minutes in ideal conditions |
59 | | -**With DHT delays**: Can be 5-10 minutes or more |
| 40 | +### 1. Request Cooldown (5 minutes) |
| 41 | + |
| 42 | +Once a `ReplicationRequest` is sent for a manifest, `replicationRequestCooldownDuration` prevents resending for 5 minutes. If the first request fails (e.g., the receiving peer's `PinRecursive` times out), the file appears "stuck" until the cooldown expires. |
| 43 | + |
| 44 | +**Mitigation**: The cooldown prevents flooding but causes visible delays for files that fail on the first attempt. |
| 45 | + |
| 46 | +### 2. Auto-Replication Timeout (5 minutes) |
| 47 | + |
| 48 | +`PinRecursive` for large files or over slow links may hit the `AutoReplicationTimeout`. The file remains unreplicated until the next replication cycle. |
| 49 | + |
| 50 | +**Mitigation**: Heartbeat-driven re-pin gradually fills in missing blocks (see below). |
| 51 | + |
| 52 | +### 3. Concurrent Replication Limit (5) |
| 53 | + |
| 54 | +The `replicationManager.sem` channel limits concurrent auto-replications to `MaxConcurrentReplicationChecks` (default 5). When all slots are occupied, additional `ReplicationRequest` messages are silently dropped. |
| 55 | + |
| 56 | +**Mitigation**: Increase `DLOCKSS_MAX_CONCURRENT_CHECKS` for nodes with sufficient bandwidth. |
| 57 | + |
| 58 | +### 4. Max Requests Per Cycle (50) |
| 59 | + |
| 60 | +At most 50 `ReplicationRequest` messages are sent per checker cycle. With thousands of files, not all manifests are requested in a single cycle. |
| 61 | + |
| 62 | +**Mitigation**: Subsequent cycles pick up remaining manifests. The cooldown map ensures already-sent requests aren't duplicated. |
| 63 | + |
| 64 | +## Heartbeat-Driven Gradual DAG Completion (Built-In) |
| 65 | + |
| 66 | +Every heartbeat (~10s), each node picks **one** pinned manifest CID (round-robin) and: |
| 67 | + |
| 68 | +1. **Re-pins the ManifestCID recursively** (`PinRecursive`, 2-minute timeout). Idempotent — returns instantly when the DAG is already complete locally, and incrementally fetches missing blocks otherwise. |
| 69 | +2. **Pins the PayloadCID as its own root** so Kubo's reprovider (`pinned` strategy) re-announces it. |
| 70 | +3. **Provides both CIDs to the DHT** (only if the re-pin succeeded). |
| 71 | + |
| 72 | +A `CompareAndSwap` guard prevents concurrent re-provides from piling up. |
| 73 | + |
| 74 | +**Impact**: Resource-constrained nodes (e.g., Raspberry Pis) that failed the initial `PinRecursive` gradually complete the DAG over successive heartbeats without manual intervention. DHT provider records (which expire after ~24h) are kept fresh. |
60 | 75 |
|
61 | 76 | ## Optimization Options |
62 | 77 |
|
63 | | -### Option 1: Reduce Check Interval (Quick Win) |
| 78 | +### Reduce Check Interval (Quick Win) |
64 | 79 | ```bash |
65 | 80 | export DLOCKSS_CHECK_INTERVAL=15s # Default: 1m |
66 | 81 | ``` |
67 | | -**Pros**: Faster detection of under-replication |
68 | | -**Cons**: More DHT queries, higher CPU usage |
69 | | -**Recommendation**: Use 15-30s for testnets |
| 82 | +Faster detection at non-root shards. Root shards already check every 20s. |
70 | 83 |
|
71 | | -### Option 3: Reduce Replication Cooldown (Quick Win) |
| 84 | +### Increase Concurrent Checks (Moderate Impact) |
72 | 85 | ```bash |
73 | | -export DLOCKSS_REPLICATION_COOLDOWN=5s # Default: 15s |
| 86 | +export DLOCKSS_MAX_CONCURRENT_CHECKS=10 # Default: 5 |
74 | 87 | ``` |
75 | | -**Pros**: Faster re-checking after replication changes |
76 | | -**Cons**: More frequent checks of same files |
77 | | -**Recommendation**: Use 5s for testnets |
| 88 | +More parallel auto-replications. Higher bandwidth usage. |
78 | 89 |
|
79 | | -### Option 5: Increase Worker Pool (Moderate Impact) |
| 90 | +### Increase Auto-Replication Timeout (Large Files) |
80 | 91 | ```bash |
81 | | -export DLOCKSS_MAX_CONCURRENT_CHECKS=20 # Default: 10 |
| 92 | +export DLOCKSS_AUTO_REPLICATION_TIMEOUT=10m # Default: 5m |
82 | 93 | ``` |
83 | | -**Pros**: More parallel replication checks |
84 | | -**Cons**: Higher CPU/memory usage |
85 | | -**Recommendation**: Use 20-30 for testnets with many files |
86 | | - |
87 | | -### Option 8: Implement Automatic Replication (Major Feature) |
88 | | -**Code Change Required**: Add logic to fetch and pin files when receiving ReplicationRequest |
89 | | -**Pros**: Actually triggers replication, not just checks |
90 | | -**Cons**: Requires IPFS content fetching, bandwidth usage |
91 | | -**Recommendation**: High priority for production |
| 94 | +Allows more time for large DAG fetches. Ties up semaphore slots longer. |
92 | 95 |
|
93 | 96 | ## Recommended Testnet Configuration |
94 | 97 |
|
95 | | -For faster convergence in testnets, use: |
| 98 | +For faster convergence in testnets: |
96 | 99 |
|
97 | 100 | ```bash |
98 | 101 | export DLOCKSS_CHECK_INTERVAL=15s |
99 | | -export DLOCKSS_REPLICATION_VERIFICATION_DELAY=5s |
100 | | -export DLOCKSS_REPLICATION_COOLDOWN=5s |
101 | | -export DLOCKSS_REPLICATION_CACHE_TTL=30s |
102 | | -export DLOCKSS_MAX_CONCURRENT_CHECKS=20 |
103 | | -export DLOCKSS_DHT_MAX_SAMPLE_SIZE=100 |
| 102 | +export DLOCKSS_MAX_CONCURRENT_CHECKS=10 |
104 | 103 | ``` |
105 | 104 |
|
106 | | -This reduces minimum convergence time from ~3-4 minutes to ~30-60 seconds. |
107 | | - |
108 | | -### Heartbeat-Driven Gradual DAG Completion (Built-In) |
109 | | - |
110 | | -Every heartbeat (~10s), each node picks one pinned manifest (round-robin) and calls `PinRecursive` with a 2-minute timeout. This is idempotent: if the DAG is already fully local it returns instantly, otherwise it incrementally fetches the missing blocks. On success, the manifest and payload CIDs are re-provided to the DHT. |
111 | | - |
112 | | -**Impact on resource-constrained nodes (Raspberry Pis):** |
113 | | -- Initial `PinRecursive` during replication may time out or OOM before fetching the full DAG. |
114 | | -- Instead of leaving the file permanently incomplete, subsequent heartbeats gradually fetch the remaining blocks. |
115 | | -- After all blocks are local, Kubo's reprovider stops emitting "block not found locally, cannot provide" errors. |
116 | | -- DHT provider records (which expire after ~24h) are kept fresh without relying solely on Kubo's reprovider. |
117 | | - |
118 | | -No configuration needed — this runs automatically on every node. |
119 | | - |
120 | 105 | ## Production Considerations |
121 | 106 |
|
122 | | -For production networks: |
123 | | -- Keep `ReplicationVerificationDelay` at 30s to prevent false alarms |
124 | | -- Keep `ReplicationCacheTTL` at 5m to reduce DHT load |
125 | | -- Keep `CheckInterval` at 1m for reasonable resource usage |
126 | | -- Consider implementing Option 8 (automatic replication) for better convergence |
| 107 | +- Keep `CheckInterval` at 1m for reasonable resource usage (root shards already use 20s). |
| 108 | +- Keep `AutoReplicationTimeout` at 5m unless dealing with consistently large files. |
| 109 | +- The 5-minute request cooldown is a deliberate trade-off between convergence speed and network overhead; files that fail on the first attempt self-heal after the cooldown expires. |
127 | 110 |
|
128 | 111 | ## Monitoring |
129 | 112 |
|
130 | | -Watch these metrics to understand replication performance: |
131 | | -- `replicationChecks`: Number of checks performed |
132 | | -- `dhtQueries`: Number of DHT queries |
133 | | -- `dhtQueryTimeouts`: DHT query failures |
134 | | -- `filesAtTargetReplication`: Files with adequate replication |
135 | | -- `lowReplicationFiles`: Files needing replication |
136 | | -- `avgReplicationLevel`: Average replication across all files |
| 113 | +The monitor's `replication snapshot` log line reports: |
| 114 | +- `total_manifests`: Number of known manifests |
| 115 | +- `total_at_target`: Files with replica count >= min(MinReplication, shard_peer_count) |
| 116 | +- `avg_replication`: Average replica count across all manifests |
| 117 | + |
| 118 | +Node daemon logs to watch: |
| 119 | +- `"auto-replication: fetched and pinned"` — successful auto-replication |
| 120 | +- `"auto-replication: failed to fetch/pin"` — `PinRecursive` timeout or failure |
| 121 | +- `"auto-replication skipped, concurrency limit reached"` — semaphore full |
| 122 | +- `"ReplicationRequest sent"` — outbound request (debug level) |
0 commit comments