_posts/2025-10-13-Performance-Analysis-of-Parallel-Data-Replication-Between-Two-PostgreSQL-18-Instances-on-OVH.md
Lines changed: 27 additions & 55 deletions
@@ -215,53 +215,44 @@ One hypothesis is that network saturation at degree 128 acts as a pacing mechani
215
215
216
216
## 5. Disk I/O and Network Analysis
217
217
218
-
### 5.1 FIO Disk Benchmarks vs PostgreSQL Performance
**Figure 13a: FIO Benchmark vs PostgreSQL Actual Performance (Bandwidth)** - PostgreSQL achieves 3,759 MB/s peak (100.5% of FIO's 3,741 MB/s), demonstrating it can saturate disk during bursts. However, average is only 170 MB/s (4.5% of peak), revealing highly bursty behavior with long idle periods.
**Figure 13b: FIO Benchmark vs PostgreSQL Actual Performance (IOPS)** - Peak IOPS reaches 55,501 operations per second during intensive bursts, but average IOPS is only 207 operations per second, further confirming the bursty pattern with most time spent idle or at low activity levels.
220
+
The source instance has 256GB RAM, and the lineitem table is ~77GB. An important detail explains the disk behavior across test runs:
227
221
228
-
**FIO Results:**
222
+
**Degree 1 was the first test run** with no prior warm-up or cold run to pre-load the table into cache. During this first run:
- Mean Write: 170 MB/s, 207 IOPS (only 4.5% of peak)
238
-
- Mean Disk Utilization: 14.6% (disk idle 85% of the time)
228
+
**At degrees 2-128**: Essentially zero disk activity; the entire table remains cached in memory from the initial degree 1 load.
239
229
240
-
### 5.2 Source Disk I/O Analysis
230
+
**This explains why degree 2 is more than twice as fast as degree 1**: The degree 1 run includes the initial table-loading overhead (~10 seconds of intensive disk I/O), while degree 2 benefits from the already-cached table with no disk loading required. The speedup from degree 1 to 2 reflects both the doubling of parallelism AND the elimination of the initial cache-loading penalty.
241
231
242
-
This section examines source PostgreSQL disk I/O behavior to provide concrete evidence that the source is experiencing backpressure from the target bottleneck rather than being disk-limited.
232
+
<img src="/img/2025-10-13_01/source_disk_utilization_timeseries.png" alt="Source Disk Utilization Time Series." width="900">
243
233
244
-
#### The Cache Effect and First-Run Impact
234
+
**Figure 13: Source Disk Utilization Over Time** - Shows disk utilization across all test runs (vertical lines mark test boundaries for degrees 1, 2, 4, 8, 16, 32, 64, 128). At degree 1, utilization peaks at ~50% during the initial table load, then drops to near-zero. At higher degrees (2-128), utilization remains below 1% throughout, confirming the disk is idle and not limiting performance.
245
235
246
-
The source instance has 256GB RAM, and the lineitem table is ~77GB. An important detail explains the disk behavior across test runs:
236
+
**Source disk I/O is NOT a bottleneck at any parallelism degree.** The source exhibits different behavior depending on parallelism:
247
237
248
-
**Degree 1 was the first test run** with no prior warm-up or cold run to pre-load the table into cache. During this first run:
238
+
- **Degree 1**: Brief initial disk load (~10 seconds), then reads from RAM cache
**At degree 1 (first ~10 seconds)**: Heavy disk activity (500 MB/s, 100% utilization) loads the table into memory (shared_buffers + OS page cache).
241
+
The near-zero disk utilization (<1%) at high parallelism degrees shows that the source disk is idle and not limiting performance. Source processes therefore appear to be constrained by downstream backpressure rather than by local disk I/O capacity. Resolving the target CPU lock contention should in turn improve source throughput, since the source has substantial unused capacity.
251
242
252
-
**At degree 1 (remaining ~860 seconds)**: Near-zero disk activity; the table is fully cached in RAM, no disk reads needed.
243
+
### 5.2 Target Disk I/O Time Series
253
244
254
-
**At degrees 2-128**: Essentially zero disk activity; the entire table remains cached in memory from the initial degree 1 load.
245
+
<img src="/img/2025-10-13_01/target_disk_write_throughput_timeseries.png" alt="Target Disk Write Throughput Time Series." width="900">
255
246
256
-
**This explains why degree 2 is more than twice as fast as degree 1**: The degree 1 run includes the initial table-loading overhead (~10 seconds of intensive disk I/O), while degree 2 benefits from the already-cached table with no disk loading required. The speedup from degree 1 to 2 reflects both the doubling of parallelism AND the elimination of the initial cache-loading penalty.
247
+
**Figure 14: Target Disk Write Throughput Over Time** - Vertical lines mark test boundaries (degrees 1, 2, 4, 8, 16, 32, 64, 128). Throughput is bursty, with spikes to 2,000-3,759 MB/s followed by drops to near zero. The sustained baseline varies from ~100 MB/s (low degrees) to ~300 MB/s (degree 128) and never approaches disk capacity for a sustained period.
257
248
258
-
<img src="/img/2025-10-13_01/source_disk_utilization_timeseries.png" alt="Source Disk Utilization Time Series." width="900">
249
+
<img src="/img/2025-10-13_01/target_disk_utilization_timeseries.png" alt="Target Disk Utilization Time Series." width="900">
259
250
260
-
**Figure 14: Source Disk Utilization Over Time** - Shows disk utilization across all test runs (vertical lines mark test boundaries for degrees 1, 2, 4, 8, 16, 32, 64, 128). At degree 1, utilization peaks at ~50% during the initial table load, then drops to near-zero. At higher degrees (2-128), utilization remains below 1% throughout, confirming the disk is idle and not limiting performance.
251
+
**Figure 15: Target Disk Utilization Over Time** - Mean utilization remains below 25% across all degrees. Spikes reach 70-90% during bursts but quickly return to low baseline. This strongly suggests disk I/O is not the bottleneck.
261
252
262
-
#### The Evidence Chain
253
+
Disk utilization measures the percentage of time the disk is busy serving I/O requests.
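As a rough illustration of that definition, the sketch below derives a utilization percentage from two samples of a Linux `/proc/diskstats` line, using the counter that accumulates milliseconds spent doing I/O. The sample lines, the device name `nvme0n1`, and the function name are hypothetical; the article does not state which tool produced its utilization figures.

```python
def disk_utilization(sample_a: str, sample_b: str, interval_ms: float) -> float:
    """Percent of the interval the device was busy, from two /proc/diskstats lines."""
    # Each line is: major, minor, device name, then the stat fields.
    # The 10th stat field (index 12 after splitting) is the cumulative
    # number of milliseconds the device spent doing I/O.
    ticks_a = int(sample_a.split()[12])
    ticks_b = int(sample_b.split()[12])
    return 100.0 * (ticks_b - ticks_a) / interval_ms


# Hypothetical samples taken 1000 ms apart (values are made up for illustration).
before = "259 0 nvme0n1 100 0 800 50 200 0 1600 90 0 1000 140"
after = "259 0 nvme0n1 100 0 800 50 260 0 2080 120 0 1243 170"

print(f"{disk_utilization(before, after, 1000.0):.1f}% busy")
```

A value near 100% means the device had at least one request in flight for the whole interval; on devices that serve many requests in parallel, this does not necessarily mean that throughput is exhausted.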
263
254
264
-
The source disk analysis reveals the interplay between caching and backpressure:
255
+
The disk analysis reveals the interplay between caching and backpressure:
The near-zero disk utilization (<1%) at high parallelism degrees shows that the source disk is idle and not limiting performance. Source processes therefore appear to be constrained by downstream backpressure rather than by local disk I/O capacity. Resolving the target CPU lock contention should in turn improve source throughput, since the source has substantial unused capacity.
288
-
289
-
### 5.3 Target Disk I/O Time Series
290
-
291
-
<img src="/img/2025-10-13_01/target_disk_write_throughput_timeseries.png" alt="Target Disk Write Throughput Time Series." width="900">
292
-
293
-
**Figure 15: Target Disk Write Throughput Over Time** - Vertical lines mark test boundaries (degrees 1, 2, 4, 8, 16, 32, 64, 128). Throughput is bursty, with spikes to 2,000-3,759 MB/s followed by drops to near zero. The sustained baseline varies from ~100 MB/s (low degrees) to ~300 MB/s (degree 128) and never approaches disk capacity for a sustained period.
294
-
295
-
<img src="/img/2025-10-13_01/target_disk_utilization_timeseries.png" alt="Target Disk Utilization Time Series." width="900">
296
-
297
-
**Figure 16: Target Disk Utilization Over Time** - Mean utilization remains below 25% across all degrees. Spikes reach 70-90% during bursts but quickly return to low baseline. This strongly suggests disk I/O is not the bottleneck.
298
-
299
-
### 5.4 Network Throughput Analysis
271
+
### 5.3 Network Throughput Analysis
300
272
301
273
<img src="/img/2025-10-13_01/target_network_rx_timeseries.png" alt="Target Network RX Time Series." width="900">
302
274
303
-
**Figure 17: Target Network Ingress Over Time** - At degree 128, throughput plateaus at ~2,450 MB/s (98% of capacity) during active bursts, but averages only 1,088 MB/s (43.5%) due to alternating active/idle periods. At degrees 1-64, network remains well below capacity.
275
+
**Figure 16: Target Network Ingress Over Time** - At degree 128, throughput plateaus at ~2,450 MB/s (98% of capacity) during active bursts, but averages only 1,088 MB/s (43.5%) due to alternating active/idle periods. At degrees 1-64, network remains well below capacity.
304
276
305
277
Network saturation occurs **only at degree 128** during active bursts. The network therefore does not explain the poor scaling from degree 1 through 64; target CPU lock contention remains the primary bottleneck.
306
278
307
-
### 5.5 Cross-Degree Scaling Analysis
279
+
### 5.4 Cross-Degree Scaling Analysis
308
280
309
281
<img src="/img/2025-10-13_01/cross_degree_disk_write_mean.png" alt="Cross Degree Mean Disk Write." width="900">
310
282
311
-
**Figure 18: Mean Disk Write Throughput by Degree** - Scales from 90 MB/s (degree 1) to 1,099 MB/s (degree 128), only 12.3x improvement for 128x parallelism (9.6% efficiency).
283
+
**Figure 17: Mean Disk Write Throughput by Degree** - Scales from 90 MB/s (degree 1) to 1,099 MB/s (degree 128), only 12.3x improvement for 128x parallelism (9.6% efficiency).
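The speedup and efficiency quoted in this caption can be reproduced with a short calculation. The sketch below is illustrative only: the function name is hypothetical, and it uses the rounded means from the caption, which give roughly 12.2x and 9.5%. The caption's 12.3x and 9.6% were presumably computed from unrounded measurements.

```python
def scaling_efficiency(base_rate: float, scaled_rate: float,
                       base_degree: int, scaled_degree: int) -> tuple[float, float]:
    """Return (speedup, efficiency) when moving from base_degree to scaled_degree."""
    speedup = scaled_rate / base_rate          # observed throughput gain
    ideal = scaled_degree / base_degree        # gain under perfect linear scaling
    return speedup, speedup / ideal


# Rounded mean disk write throughput from the caption, in MB/s.
speedup, efficiency = scaling_efficiency(90.0, 1099.0, 1, 128)
print(f"speedup {speedup:.1f}x, efficiency {efficiency:.1%}")
```

Efficiency here is the observed speedup divided by the ideal linear speedup, so a value near 100% would mean each added worker contributes as much as the first.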
**Figure 19: Network Throughput Comparison: Source TX vs Target RX** - At degree 128, source transmits 1,684 MB/s while target receives only 1,088 MB/s, creating a 596 MB/s (35%) deficit. This suggests the target cannot keep pace with source data production, likely due to CPU lock contention.
287
+
**Figure 18: Network Throughput Comparison: Source TX vs Target RX** - At degree 128, source transmits 1,684 MB/s while target receives only 1,088 MB/s, creating a 596 MB/s (35%) deficit. This suggests the target cannot keep pace with source data production, likely due to CPU lock contention.
316
288
317
289
**Technical Note on TX/RX Discrepancy:** The apparent 35% violation of flow conservation is explained by TCP retransmissions. The source TX counter (measured via `sar -n DEV`) counts both original packets and retransmitted packets, while the target RX counter only counts successfully received unique packets. When the target is overloaded with CPU lock contention (83.9% system CPU at degree 64), it cannot drain receive buffers fast enough, causing packet drops that trigger TCP retransmissions. The 596 MB/s "deficit" is actually retransmitted data counted twice at the source but only once at the target, providing quantitative evidence of the target's inability to keep pace with source data production.
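The size of the gap described in this note follows directly from the two averages. The sketch below only reproduces that arithmetic, with a hypothetical function name; it does not measure retransmissions, and attributing the gap to them is the article's interpretation rather than something the code verifies.

```python
def tx_rx_gap(tx_mb_s: float, rx_mb_s: float) -> tuple[float, float]:
    """Return (absolute gap in MB/s, gap as a fraction of source TX)."""
    gap = tx_mb_s - rx_mb_s
    return gap, gap / tx_mb_s


# Degree 128 averages quoted in the text, in MB/s.
gap, share = tx_rx_gap(1684.0, 1088.0)
print(f"gap {gap:.0f} MB/s, {share:.0%} of source TX")
```

Note that the 35% figure is expressed relative to what the source transmitted; relative to what the target received, the same 596 MB/s gap would be a larger percentage.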
318
290
319
291
<img src="/img/2025-10-13_01/cross_degree_disk_utilization.png" alt="Cross Degree Disk Utilization." width="900">
320
292
321
293
322
-
**Figure 20: Disk Utilization by Degree** - Mean utilization increases from 2.2% (degree 1) to only 24.3% (degree 128), remaining far below the 80% saturation threshold at all degrees. This strongly indicates disk I/O is not the bottleneck.
294
+
**Figure 19: Disk Utilization by Degree** - Mean utilization increases from 2.2% (degree 1) to only 24.3% (degree 128), remaining far below the 80% saturation threshold at all degrees. This strongly indicates disk I/O is not the bottleneck.
323
295
324
-
### 5.6 I/O Analysis Conclusions
296
+
### 5.5 I/O Analysis Conclusions
325
297
326
298
1. **Disk does not appear to be the bottleneck**: 24% average utilization at degree 128 with 76% idle capacity. PostgreSQL matches FIO peak (3,759 MB/s) but sustains only 170 MB/s average.