
# Replication Performance Analysis

## Architecture Overview

Replication in D-LOCKSS is driven by two complementary mechanisms:

  1. CRDT Cluster Sync — Each shard runs an embedded IPFS Cluster with CRDT consensus. When a file is pinned to a shard's cluster, the LocalPinTracker on every peer in that shard automatically syncs and pins the content locally.

  2. ReplicationRequest Protocol — The replicationManager (extracted from ShardManager) periodically broadcasts ReplicationRequest messages for pinned manifests. Peers that don't yet have the file perform auto-replication: fetch via PinRecursive and add to the cluster.
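
The receive side of mechanism 2 can be sketched as follows. This is an illustrative shape, not the real handler: `handleReplicationRequest` and its `has`/`fetch`/`clusterPin` callbacks are stand-ins for the local pin check, `PinRecursive`, and the cluster add described above.

```go
package main

import "fmt"

// handleReplicationRequest sketches a peer's auto-replication path: if the
// file is missing locally, fetch it (PinRecursive in the real code) and add
// it to the shard's cluster. All three callbacks are hypothetical stand-ins.
func handleReplicationRequest(cid string, has func(string) bool,
	fetch func(string) error, clusterPin func(string) error) (replicated bool, err error) {
	if has(cid) {
		return false, nil // already held locally; nothing to do
	}
	if err := fetch(cid); err != nil {
		return false, err // surfaces as "auto-replication: failed to fetch/pin"
	}
	return true, clusterPin(cid)
}

func main() {
	local := map[string]bool{"bafy-a": true}
	got, err := handleReplicationRequest("bafy-b",
		func(c string) bool { return local[c] },
		func(c string) error { local[c] = true; return nil },
		func(c string) error { return nil })
	fmt.Println(got, err) // true <nil>
}
```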

## Key Constants and Defaults

| Parameter | Default | Env Variable | Location |
|---|---|---|---|
| Replication Check Interval | 1 minute | `DLOCKSS_CHECK_INTERVAL` | `config.Replication.CheckInterval` |
| Root Shard Check Interval | 20 seconds (hardcoded) | — | `rootReplicationCheckInterval` |
| Request Cooldown Per Manifest | 5 minutes (hardcoded) | — | `replicationRequestCooldownDuration` |
| Max Requests Per Cycle | 50 (hardcoded) | — | `maxReplicationRequestsPerCycle` |
| Auto-Replication Enabled | true | `DLOCKSS_AUTO_REPLICATION_ENABLED` | `config.Replication.AutoReplicationEnabled` |
| Auto-Replication Timeout | 5 minutes | `DLOCKSS_AUTO_REPLICATION_TIMEOUT` | `config.Replication.AutoReplicationTimeout` |
| Max Concurrent Checks | 5 | `DLOCKSS_MAX_CONCURRENT_CHECKS` | `config.Replication.MaxConcurrentReplicationChecks` |
| Pin Reannounce Interval | 2 minutes | `DLOCKSS_PIN_REANNOUNCE_INTERVAL` | `config.Replication.PinReannounceInterval` |
| Min Replication | 5 | `DLOCKSS_MIN_REPLICATION` | `config.Replication.MinReplication` |
| Max Replication | 10 | `DLOCKSS_MAX_REPLICATION` | `config.Replication.MaxReplication` |

## Convergence Timeline

For a newly ingested file to reach full replication across a shard:

  1. Ingest (immediate): File pinned locally, IngestMessage broadcast to shard, cluster Pin() called.
  2. CRDT Sync (seconds): Cluster state propagates to peers via PubSub; LocalPinTracker detects new pin and starts PinRecursive.
  3. First Replication Check (up to 20s at root, 1m elsewhere): replicationManager.runChecker() sends ReplicationRequest for all pinned manifests.
  4. Auto-Replication (seconds to minutes): Peers receiving the request that don't have the file fetch it via PinRecursive (up to 5-minute timeout).
  5. Cooldown (5 minutes): After sending a request for a manifest, no new request is sent for that manifest for 5 minutes.

Typical convergence: Most files replicate within 1-2 minutes via CRDT sync alone. Files that fail the initial sync (large DAGs, slow block propagation) recover on the next replication cycle after the 5-minute cooldown.

## Current Bottlenecks

### 1. Request Cooldown (5 minutes)

Once a ReplicationRequest is sent for a manifest, replicationRequestCooldownDuration prevents resending for 5 minutes. If the first request fails (e.g., the receiving peer's PinRecursive times out), the file appears "stuck" until the cooldown expires.

Trade-off: The cooldown prevents request flooding, at the cost of visible delays for files that fail on the first attempt.

### 2. Auto-Replication Timeout (5 minutes)

PinRecursive for large files or over slow links may hit the AutoReplicationTimeout. The file remains unreplicated until the next replication cycle.

Mitigation: Heartbeat-driven re-pin gradually fills in missing blocks (see below).

### 3. Concurrent Replication Limit (5)

The replicationManager.sem channel limits concurrent auto-replications to MaxConcurrentReplicationChecks (default 5). When all slots are occupied, additional ReplicationRequest messages are silently dropped.

Mitigation: Increase DLOCKSS_MAX_CONCURRENT_CHECKS for nodes with sufficient bandwidth.
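
The "silently dropped" behavior is the standard non-blocking channel-semaphore pattern: a `select` with a `default` branch that gives up instead of queueing. A sketch of the pattern behind `replicationManager.sem` (function name is illustrative):

```go
package main

import "fmt"

// tryAcquire attempts to take a semaphore slot without blocking. When all
// MaxConcurrentReplicationChecks slots are taken, the incoming request is
// dropped rather than queued.
func tryAcquire(sem chan struct{}) bool {
	select {
	case sem <- struct{}{}:
		return true // slot acquired; release with <-sem when the fetch ends
	default:
		return false // "auto-replication skipped, concurrency limit reached"
	}
}

func main() {
	sem := make(chan struct{}, 2) // default capacity is 5
	fmt.Println(tryAcquire(sem))  // true
	fmt.Println(tryAcquire(sem))  // true
	fmt.Println(tryAcquire(sem))  // false: both slots busy, request dropped
}
```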

### 4. Max Requests Per Cycle (50)

At most 50 ReplicationRequest messages are sent per checker cycle. With thousands of files, not all manifests are requested in a single cycle.

Mitigation: Subsequent cycles pick up remaining manifests. The cooldown map ensures already-sent requests aren't duplicated.
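
The per-cycle cap means large pin sets drain over several cycles. A sketch of that capped loop (`sendCycle` is an illustrative stand-in for the checker's request loop):

```go
package main

import "fmt"

// sendCycle sends a ReplicationRequest for at most maxPerCycle manifests
// and returns the remainder for subsequent cycles, mirroring
// maxReplicationRequestsPerCycle.
func sendCycle(manifests []string, maxPerCycle int, send func(string)) []string {
	n := maxPerCycle
	if len(manifests) < n {
		n = len(manifests)
	}
	for _, cid := range manifests[:n] {
		send(cid)
	}
	return manifests[n:] // picked up by later cycles
}

func main() {
	pending := make([]string, 120)
	for i := range pending {
		pending[i] = fmt.Sprintf("bafy-%03d", i)
	}
	cycles := 0
	for len(pending) > 0 {
		pending = sendCycle(pending, 50, func(string) {})
		cycles++
	}
	fmt.Println("cycles needed for 120 manifests:", cycles) // 3
}
```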

## Heartbeat-Driven Gradual DAG Completion (Built-In)

Every heartbeat (~10s), each node picks one pinned manifest CID (round-robin) and:

  1. Re-pins the ManifestCID recursively (PinRecursive, 2-minute timeout). Idempotent — returns instantly when the DAG is already complete locally, and incrementally fetches missing blocks otherwise.
  2. Pins the PayloadCID as its own root so Kubo's reprovider (pinned strategy) re-announces it.
  3. Provides both CIDs to the DHT (only if the re-pin succeeded).

A CompareAndSwap guard prevents concurrent re-provides from piling up.

Impact: Resource-constrained nodes (e.g., Raspberry Pis) that failed the initial PinRecursive gradually complete the DAG over successive heartbeats without manual intervention. DHT provider records (which expire after ~24h) are kept fresh.
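
The round-robin selection and the `CompareAndSwap` guard can be sketched together. This is an illustrative shape, not the real implementation: `reprovider` and `tick` are hypothetical names, and `repin` stands in for the `PinRecursive` + DHT-provide sequence described above.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// reprovider picks one pinned manifest per heartbeat, round-robin, and
// re-pins it; the atomic.Bool guard keeps overlapping heartbeats from
// piling up concurrent re-provides.
type reprovider struct {
	pins []string
	next int
	busy atomic.Bool
}

// tick runs one heartbeat step and reports whether a re-pin succeeded.
func (r *reprovider) tick(repin func(cid string) error) bool {
	if len(r.pins) == 0 || !r.busy.CompareAndSwap(false, true) {
		return false // previous re-provide still running: skip this beat
	}
	defer r.busy.Store(false)
	cid := r.pins[r.next%len(r.pins)]
	r.next++ // round-robin over the pin set
	return repin(cid) == nil
}

func main() {
	r := &reprovider{pins: []string{"bafy-a", "bafy-b"}}
	for i := 0; i < 3; i++ {
		r.tick(func(cid string) error { fmt.Println("re-pinned", cid); return nil })
	}
}
```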

## Optimization Options

### Reduce Check Interval (Quick Win)

```bash
export DLOCKSS_CHECK_INTERVAL=15s  # Default: 1m
```

Faster detection at non-root shards. Root shards already check every 20s.

### Increase Concurrent Checks (Moderate Impact)

```bash
export DLOCKSS_MAX_CONCURRENT_CHECKS=10  # Default: 5
```

More parallel auto-replications. Higher bandwidth usage.

### Increase Auto-Replication Timeout (Large Files)

```bash
export DLOCKSS_AUTO_REPLICATION_TIMEOUT=10m  # Default: 5m
```

Allows more time for large DAG fetches. Ties up semaphore slots longer.

## Recommended Testnet Configuration

For faster convergence in testnets:

```bash
export DLOCKSS_CHECK_INTERVAL=15s
export DLOCKSS_MAX_CONCURRENT_CHECKS=10
```

## Production Considerations

  • Keep CheckInterval at 1m for reasonable resource usage (root shards already use 20s).
  • Keep AutoReplicationTimeout at 5m unless dealing with consistently large files.
  • The 5-minute request cooldown is a deliberate trade-off between convergence speed and network overhead; files that fail on the first attempt self-heal after the cooldown expires.

## Monitoring

The monitor's replication snapshot log line reports:

  • total_manifests: Number of known manifests
  • total_at_target: Files with replica count >= min(MinReplication, shard_peer_count)
  • avg_replication: Average replica count across all manifests
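
These three fields reduce to a small aggregation over per-manifest replica counts. A sketch of that computation, assuming the target is `min(MinReplication, shard_peer_count)` as stated above (`snapshot` is an illustrative name, not the monitor's actual function):

```go
package main

import "fmt"

// snapshot computes the replication-snapshot fields. replicas maps manifest
// CID -> current replica count; the per-file target is
// min(minReplication, shardPeers).
func snapshot(replicas map[string]int, minReplication, shardPeers int) (total, atTarget int, avg float64) {
	target := minReplication
	if shardPeers < target {
		target = shardPeers
	}
	sum := 0
	for _, n := range replicas {
		total++
		sum += n
		if n >= target {
			atTarget++
		}
	}
	if total > 0 {
		avg = float64(sum) / float64(total)
	}
	return
}

func main() {
	replicas := map[string]int{"bafy-a": 5, "bafy-b": 3, "bafy-c": 4}
	total, atTarget, avg := snapshot(replicas, 5, 4) // 4 peers in the shard
	fmt.Printf("total_manifests=%d total_at_target=%d avg_replication=%.2f\n",
		total, atTarget, avg) // total_manifests=3 total_at_target=2 avg_replication=4.00
}
```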

Node daemon logs to watch:

  • "auto-replication: fetched and pinned" — successful auto-replication
  • "auto-replication: failed to fetch/pin" — PinRecursive timeout or failure
  • "auto-replication skipped, concurrency limit reached" — semaphore full
  • "ReplicationRequest sent" — outbound request (debug level)