
Commit 09d67f9

Add DD init visibility, metrics retries, shard tracking, scan progress, and team collection logging (#12913) (#13142)
* Add DD init and team collection logging for diagnosing slow startups

When SHARD_ENCODE_LOCATION_METADATA=true we take new code paths that are often opaque; add logging. For example, DD init once hung for 14-16 minutes with zero visibility into what was stuck. The only clue was a gap between the DDInitUpdatedReplicaKeys and DDInitGotInitialDD trace events. Diagnosing the root cause required extensive log splunking of SS metrics to determine that a single getRange(dataMoveKeys) read was queued on an overloaded storage server.

DDTxnProcessor.actor.cpp:
- Log elapsed time for the server list + data move read transaction (DDInitServerListAndDataMoveReadComplete) with NumDataMoves, NumServers
- Log elapsed time for the keyServer scan (DDInitKeyServerScanComplete) with NumShards
- Warn when getRange(dataMoveKeys) takes >5 seconds (DDInitSlowDataMoveRead)

DataDistribution.actor.cpp:
- Add NumShards and NumServers to DDInitGotInitialDD
- Promote DDInitFoundDataMove from SevDebug to SevInfo so individual data moves are visible in production logs
- Add a DDInitResumedDataMoves summary event with ValidMoves, CancelledMoves, EmptyMoves counts and elapsed time

DDTeamCollection.actor.cpp:
- Add Reason and Address details to UndesiredStorageServer trace events to distinguish version-lag, same-address, wrong-class, and exclusion causes without needing to correlate with other log lines

* Revert DDInitFoundDataMove to SevDebug to avoid log spam with many data moves

* Add DD startup visibility: metrics retries, shard tracking, scan progress

Additional logging to address DD operational opacity during startup, based on past incidents where DD hung with no visibility into the cause.

NativeAPI.actor.cpp:
- Log WaitStorageMetricsRetrying every 60s when waitStorageMetrics is stuck retrying wrong_shard_server or all_alternatives_failed, with the key range, retry count, and elapsed time. Previously these retries were silent (SevDebug only), making it impossible to identify which shard was stuck or that retries were even happening.

DDShardTracker.actor.cpp:
- Log TrackInitialShardsComplete after shard tracker setup with the shard count
- Log TrackInitialShardsMetricsComplete after changeSizes() finishes, with elapsed time. changeSizes() waits for ALL shards to report metrics via getFirstSize/waitStorageMetrics -- if any shard's metrics never arrive, this hangs silently.

DDTxnProcessor.actor.cpp:
- Log DDInitKeyServerScanProgress every 30s during the multi-transaction keyServer scan with the current beginKey, batch count, shards scanned, and elapsed time. With 255K shards this scan requires many transactions, and a stuck one was previously invisible.

DataDistribution.actor.cpp:
- Log DDInitComplete with elapsed time after DataDistributor::init() finishes, providing a single event showing total init duration.

Follow-up fixes:
- NativeAPI.actor.cpp: Move retry logging outside the error-type if block so all errors get keys/elapsed/retries details. Use a severity upgrade (SevDebug -> SevWarn after 60s) on the existing WaitStorageMetricsHandleError event instead of a separate event name.
- DataDistribution.actor.cpp: Add periodic progress logging (every 30s) in the resumeFromDataMoves loop so operators can watch counts go up during long data move recovery.
- CompileBoost.cmake: Remove BOOST_NO_CXX98_FUNCTION_BASE since 7.3 CI is broken independently of this change.

DD exits (e.g. movekeys_conflict) were invisible because reportErrorsExcept suppresses logging for "normal" DD errors. Add a DDExiting trace event at SevWarn with the error and code so every DD death is visible in trace logs.
1 parent 5c472d2 commit 09d67f9

5 files changed

Lines changed: 131 additions & 3 deletions


fdbclient/NativeAPI.actor.cpp

Lines changed: 10 additions & 1 deletion
@@ -5865,6 +5865,8 @@ Future<std::pair<Optional<StorageMetrics>, int>> waitStorageMetrics(Database cx,
                                                    int expectedShardCount,
                                                    Optional<Reference<TransactionState>> trState) {
 	Span span("NAPI:WaitStorageMetrics"_loc, generateSpanID(cx->transactionTracingSample));
+	double startTime = now();
+	int retryCount = 0;
 	while (true) {
 		if (trState.present()) {
 			co_await trState.get()->startTransaction();
@@ -5911,7 +5913,14 @@ Future<std::pair<Optional<StorageMetrics>, int>> waitStorageMetrics(Database cx,
 		} catch (Error& e) {
 			err = e;
 		}
-		TraceEvent(SevDebug, "WaitStorageMetricsHandleError").error(err);
+		retryCount++;
+		// Upgrade from SevDebug to SevWarn after 60 seconds of retrying
+		Severity sev = (now() - startTime > 60.0) ? SevWarn : SevDebug;
+		TraceEvent(sev, "WaitStorageMetricsHandleError")
+		    .error(err)
+		    .detail("Keys", keys)
+		    .detail("Elapsed", now() - startTime)
+		    .detail("Retries", retryCount);
 		if (err.code() == error_code_wrong_shard_server || err.code() == error_code_all_alternatives_failed) {
 			cx->invalidateCache(keys);
 			co_await delay(CLIENT_KNOBS->WRONG_SHARD_SERVER_DELAY, TaskPriority::DataDistribution);
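The hunk above keeps a single event name and escalates its severity once the retry loop has been stuck for a while. A minimal standalone sketch of that policy (plain C++, not the FDB Trace API; the enum values are illustrative stand-ins for flow/Trace.h):

```cpp
#include <cassert>

// Illustrative stand-ins for FDB's trace severities (the real values live in flow/Trace.h).
enum Severity { SevDebug = 5, SevWarn = 30 };

// Retries log quietly at SevDebug, but once the same waitStorageMetrics call
// has been retrying for more than 60 seconds the same event is emitted at
// SevWarn so it surfaces in production log searches.
inline Severity retrySeverity(double startTime, double currentTime) {
    return (currentTime - startTime > 60.0) ? SevWarn : SevDebug;
}
```

Reusing the existing event name means dashboards keyed on WaitStorageMetricsHandleError keep working; only the severity changes.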

fdbserver/datadistributor/DDShardTracker.cpp

Lines changed: 10 additions & 0 deletions
@@ -1051,9 +1051,19 @@ Future<Void> trackInitialShards(DataDistributionTracker* self, Reference<Initial
 		co_await yield(TaskPriority::DataDistribution);
 	}

+	TraceEvent("TrackInitialShardsComplete", self->distributorId).detail("ShardsTracked", s);
+
+	double changeSizesStart = now();
 	Future<Void> initialSize = changeSizes(self, KeyRangeRef(allKeys.begin, allKeys.end), 0, "ShardInit");
 	self->readyToStart.send(Void());
 	co_await initialSize;
+
+	TraceEvent("TrackInitialShardsMetricsComplete", self->distributorId)
+	    .detail("ElapsedSeconds", now() - changeSizesStart);
+
+	// DDInitDone bookends DDInitRunning -- marks DD fully operational. Uses DD* prefix so
+	// the full startup sequence can be queried with Type="DDInit*" in trace logs.
+	TraceEvent("DDInitDone", self->distributorId);
 	self->maxShardSizeUpdater = updateMaxShardSize(self->dbSizeEstimate, self->maxShardSize);
 }
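The bookend pattern above (capture a timestamp, await the long-running work, attach the elapsed seconds to the completion event) can be sketched as a tiny helper. This is not FDB code; the injected clock value stands in for g_network->now():

```cpp
#include <cassert>

// Capture a start timestamp before a long-running wait (here, changeSizes()
// waiting for all shard metrics), then report elapsed seconds when it
// completes, as TrackInitialShardsMetricsComplete does above.
struct ElapsedTimer {
    double start;
    explicit ElapsedTimer(double nowSeconds) : start(nowSeconds) {}
    double elapsed(double nowSeconds) const { return nowSeconds - start; }
};
```

Attaching the duration to the completion event (rather than logging start and end separately) lets a single trace query surface slow phases.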

fdbserver/datadistributor/DDTeamCollection.actor.cpp

Lines changed: 22 additions & 2 deletions
@@ -1379,7 +1379,17 @@ class DDTeamCollectionImpl {
 	state Future<Void> storageMetadataTracker = self->updateStorageMetadata(server);
 	try {
 		loop {
-			status.isUndesired = (!self->disableFailingLaggingServers.get() && server->ssVersionTooFarBehind.get());
+			{
+				bool versionLagUndesired =
+				    !self->disableFailingLaggingServers.get() && server->ssVersionTooFarBehind.get();
+				if (versionLagUndesired && !status.isUndesired) {
+					TraceEvent(SevWarn, "UndesiredStorageServer", self->distributorId)
+					    .detail("Server", server->getId())
+					    .detail("Address", server->getLastKnownInterface().address())
+					    .detail("Reason", "VersionLag");
+				}
+				status.isUndesired = versionLagUndesired;
+			}
 			status.isWrongConfiguration = false;
 			status.isWiggling = false;
 			hasWrongDC = !self->isCorrectDC(*server);
@@ -1415,6 +1425,7 @@ class DDTeamCollectionImpl {
 					TraceEvent(SevWarn, "UndesiredStorageServer", self->distributorId)
 					    .detail("Server", server->getId())
 					    .detail("Address", server->getLastKnownInterface().address())
+					    .detail("Reason", "SameAddress")
 					    .detail("OtherServer", i.second->getId())
 					    .detail("NumShards",
 					            self->shardsAffectedByTeamFailure->getNumberOfShards(server->getId()))
@@ -1440,6 +1451,8 @@ class DDTeamCollectionImpl {
 				if (self->optimalTeamCount > 0) {
 					TraceEvent(SevWarn, "UndesiredStorageServer", self->distributorId)
 					    .detail("Server", server->getId())
+					    .detail("Address", server->getLastKnownInterface().address())
+					    .detail("Reason", "WrongMachineClass")
 					    .detail("OptimalTeamCount", self->optimalTeamCount)
 					    .detail("Fitness", server->getLastKnownClass().machineClassFitness(ProcessClass::Storage));
 					status.isUndesired = true;
@@ -1517,9 +1530,16 @@ class DDTeamCollectionImpl {
 			}

 			if (worstStatus != DDTeamCollection::Status::NONE) {
+				const char* exclusionType = worstStatus == DDTeamCollection::Status::WIGGLING ? "Wiggling"
+				                            : worstStatus == DDTeamCollection::Status::FAILED ? "Failed"
+				                            : worstStatus == DDTeamCollection::Status::EXCLUDED ? "Excluded"
+				                                                                               : "Unknown";
 				TraceEvent(SevWarn, "UndesiredStorageServer", self->distributorId)
 				    .detail("Server", server->getId())
-				    .detail("Excluded", worstAddr.toString());
+				    .detail("Address", server->getLastKnownInterface().address())
+				    .detail("Reason", "Excluded")
+				    .detail("ExclusionType", exclusionType)
+				    .detail("ExcludedAddress", worstAddr.toString());
 				status.isUndesired = true;
 				status.isWrongConfiguration = true;
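The last hunk maps an internal status enum to a stable string so UndesiredStorageServer events can be filtered by cause without correlating other log lines. A standalone sketch of that mapping (the enum mirrors the DDTeamCollection::Status values named in the hunk but is illustrative only):

```cpp
#include <cassert>
#include <string>

// Illustrative mirror of the DDTeamCollection::Status values used above.
enum class Status { NONE, WIGGLING, FAILED, EXCLUDED };

// Map the worst observed status to a stable string for the ExclusionType
// trace detail, falling back to "Unknown" for anything unexpected.
inline const char* exclusionType(Status s) {
    return s == Status::WIGGLING ? "Wiggling"
           : s == Status::FAILED ? "Failed"
           : s == Status::EXCLUDED ? "Excluded"
                                   : "Unknown";
}
```

Emitting a closed set of strings (rather than a raw enum integer) keeps trace queries readable and stable across refactors.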

fdbserver/datadistributor/DDTxnProcessor.cpp

Lines changed: 29 additions & 0 deletions
@@ -329,6 +329,8 @@ class DDTxnProcessorImpl {
 	CODE_PROBE((bool)skipDDModeCheck, "DD Mode won't prevent read initial data distribution.");
 	// Get the server list in its own try/catch block since it modifies result. We don't want a subsequent failure
 	// causing entries to be duplicated
+	// Phase 1: Single transaction to read server list and all persisted data moves
+	double serverListAndDataMoveReadStart = now();
 	while (true) {
 		numDataMoves = 0;
 		server_dc.clear();
@@ -390,7 +392,12 @@ class DDTxnProcessorImpl {
 			}
 		}

+		double dataMoveReadStart = now();
 		RangeResult dms = co_await tr.getRange(dataMoveKeys, CLIENT_KNOBS->TOO_MANY);
+		if (now() - dataMoveReadStart > 5.0) {
+			TraceEvent(SevWarn, "DDInitSlowDataMoveRead", distributorId)
+			    .detail("ElapsedSeconds", now() - dataMoveReadStart);
+		}
 		ASSERT(!dms.more && dms.size() < CLIENT_KNOBS->TOO_MANY);
 		// For each data move, find out the src or dst servers are in primary or remote DC.
 		for (int i = 0; i < dms.size(); ++i) {
@@ -438,6 +445,11 @@ class DDTxnProcessorImpl {

 		succeeded = true;

+		TraceEvent("DDInitServerListAndDataMoveReadComplete", distributorId)
+		    .detail("NumDataMoves", numDataMoves)
+		    .detail("NumServers", result->allServers.size())
+		    .detail("ElapsedSeconds", now() - serverListAndDataMoveReadStart);
+
 		break;
 	} catch (Error& e) {
 		err = e;
@@ -450,6 +462,10 @@ class DDTxnProcessorImpl {

 	// If keyServers is too large to read in a single transaction, then we will have to break this process up into
 	// multiple transactions. In that case, each iteration should begin where the previous left off
+	// Scan keyServers in batches to build the shard map
+	double keyServerScanStart = now();
+	double lastScanLogTime = now();
+	int scanBatchCount = 0;
 	while (beginKey < allKeys.end) {
 		CODE_PROBE(beginKey > allKeys.begin, "Multi-transactional getInitialDataDistribution");
 		while (true) {
@@ -537,6 +553,15 @@ class DDTxnProcessorImpl {

 			ASSERT_GT(keyServers.size(), 0);
 			beginKey = keyServers.end()[-1].key;
+			scanBatchCount++;
+			if (now() - lastScanLogTime >= 30.0) {
+				lastScanLogTime = now();
+				TraceEvent("DDInitKeyServerScanProgress", distributorId)
+				    .detail("BeginKey", beginKey)
+				    .detail("Batches", scanBatchCount)
+				    .detail("ShardsScanned", result->shards.size())
+				    .detail("ElapsedSeconds", now() - keyServerScanStart);
+			}
 			break;
 		} catch (Error& e) {
 			err = e;
@@ -553,6 +578,10 @@ class DDTxnProcessorImpl {
 	// a dummy shard at the end with no keys or servers makes life easier for trackInitialShards()
 	result->shards.push_back(DDShardInfo(allKeys.end));

+	TraceEvent("DDInitKeyServerScanComplete", distributorId)
+	    .detail("NumShards", result->shards.size())
+	    .detail("ElapsedSeconds", now() - keyServerScanStart);
+
 	if (SERVER_KNOBS->SHARD_ENCODE_LOCATION_METADATA && numDataMoves > 0) {
 		for (int shard = 0; shard < result->shards.size() - 1; ++shard) {
 			const DDShardInfo& iShard = result->shards[shard];

fdbserver/datadistributor/DataDistribution.cpp

Lines changed: 60 additions & 0 deletions
@@ -595,6 +595,31 @@ struct DataDistributor : NonCopyable, ReferenceCounted<DataDistributor> {
 	// Initialize the required internal states of DataDistributor from system metadata. It's necessary before
 	// DataDistributor start working. Doesn't include initialization of optional components, like DDQueue,
 	// Tracker, TeamCollection. The components should call its own ::init methods.
+	//
+	// DD Startup Progress (trace events in order):
+	//   DDInitRunning - DD process recruited and starting init
+	//   DDInitTakingMoveKeysLock - Acquiring move keys lock
+	//   DDInitTookMoveKeysLock - Lock acquired
+	//   DDInitGotConfiguration - Database configuration loaded
+	//   DDInitUpdatedReplicaKeys - Replica keys updated
+	//   DDInitSlowDataMoveRead - (SevWarn) dataMoveKeys read taking >5s
+	//   DDInitServerListAndDataMoveReadComplete - Server list + data moves read: NumDataMoves, NumServers,
+	//                                             ElapsedSeconds
+	//   DDInitKeyServerScanProgress - (every 30s) keyServer scan: BeginKey, Batches, ShardsScanned
+	//   DDInitKeyServerScanComplete - keyServer scan done: NumShards, ElapsedSeconds
+	//   DDInitGotInitialDD - Init data loaded: NumShards, NumServers
+	//   DDInitDataLoaded - Init data loaded, ElapsedSeconds (does NOT mean DD is fully operational)
+	//
+	// After init(), the following startup events fire from other components:
+	//   DDInitResumeDataMovesProgress - (every 30s) data move resume: ValidMoves, CancelledMoves, EmptyMoves
+	//   DDInitResumedDataMoves - Data move resume complete with counts
+	//   TrackInitialShards - Shard tracker setup started with InitialShardCount
+	//   TrackInitialShardsComplete - Shard trackers created: ShardsTracked
+	//   DDTrackerStarting - Teams ready (fires from DDTeamCollection after readyToStart + delay)
+	//   TrackInitialShardsMetricsComplete - All shard metrics received: ElapsedSeconds
+	//     WaitStorageMetricsHandleError may fire (SevWarn after 60s) if a
+	//     shard's metrics read is stuck retrying: Keys, Retries
+	//   DDInitDone - DD is fully operational with all shard sizes loaded
 	static Future<Void> init(Reference<DataDistributor> self) {
 		while (true) {
 			co_await self->waitDataDistributorEnabled();
@@ -653,13 +678,17 @@ struct DataDistributor : NonCopyable, ReferenceCounted<DataDistributor> {
 			    .detail("E", self->initData->shards.end()[-1].key)
 			    .detail("Src", describe(self->initData->shards.end()[-2].primarySrc))
 			    .detail("Dest", describe(self->initData->shards.end()[-2].primaryDest))
+			    .detail("NumShards", self->initData->shards.size())
+			    .detail("NumServers", self->initData->allServers.size())
 			    .trackLatest(self->initialDDEventHolder->trackingKey);
 		} else {
 			TraceEvent("DDInitGotInitialDD", self->ddId)
 			    .detail("B", "")
 			    .detail("E", "")
 			    .detail("Src", "[no items]")
 			    .detail("Dest", "[no items]")
+			    .detail("NumShards", self->initData->shards.size())
+			    .detail("NumServers", self->initData->allServers.size())
 			    .trackLatest(self->initialDDEventHolder->trackingKey);
 		}

@@ -859,6 +888,11 @@ struct DataDistributor : NonCopyable, ReferenceCounted<DataDistributor> {
 	// TODO: unit test needed
 	static Future<Void> resumeFromDataMoves(Reference<DataDistributor> self, Future<Void> readyToStart) {
 		KeyRangeMap<std::shared_ptr<DataMove>>::iterator it = self->initData->dataMoveMap.ranges().begin();
+		int validMoves = 0;
+		int cancelledMoves = 0;
+		int emptyMoves = 0;
+		double resumeStart = now();
+		double lastLogTime = now();

 		co_await readyToStart;

@@ -867,6 +901,7 @@ struct DataDistributor : NonCopyable, ReferenceCounted<DataDistributor> {
 			DataMoveType dataMoveType = getDataMoveTypeFromDataMoveId(meta.id);
 			if (meta.ranges.empty()) {
 				TraceEvent(SevInfo, "EmptyDataMoveRange", self->ddId).detail("DataMoveMetaData", meta.toString());
+				emptyMoves++;
 				continue;
 			}
 			if (meta.bulkLoadTaskState.present()) {
@@ -879,6 +914,7 @@ struct DataDistributor : NonCopyable, ReferenceCounted<DataDistributor> {
 				TraceEvent(SevWarnAlways, "DDBulkLoadTaskCancelDataMove", self->ddId)
 				    .detail("Reason", "DDInit")
 				    .detail("DataMove", meta.toString());
+				cancelledMoves++;
 			} else if (dataMoveType == DataMoveType::LOGICAL_BULKLOAD ||
 			           dataMoveType == DataMoveType::PHYSICAL_BULKLOAD) {
 				// The metadata is from the old system
@@ -890,13 +926,15 @@ struct DataDistributor : NonCopyable, ReferenceCounted<DataDistributor> {
 				    .detail("Reason", "WrongTypeWhenDDInit")
 				    .detail("DataMoveType", dataMoveType)
 				    .detail("DataMove", meta.toString());
+				cancelledMoves++;
 			} else if (it.value()->isCancelled() ||
 			           (it.value()->valid && !SERVER_KNOBS->SHARD_ENCODE_LOCATION_METADATA)) {
 				RelocateShard rs(meta.ranges.front(), DataMovementReason::RECOVER_MOVE, RelocateReason::OTHER);
 				rs.dataMoveId = meta.id;
 				rs.cancelled = true;
 				self->relocationProducer.send(rs);
 				TraceEvent("DDInitScheduledCancelDataMove", self->ddId).detail("DataMove", meta.toString());
+				cancelledMoves++;
 			} else if (it.value()->valid) {
 				TraceEvent(SevDebug, "DDInitFoundDataMove", self->ddId).detail("DataMove", meta.toString());
 				ASSERT(meta.ranges.front() == it.range());
@@ -920,9 +958,24 @@ struct DataDistributor : NonCopyable, ReferenceCounted<DataDistributor> {
 				self->shardsAffectedByTeamFailure->moveShard(rs.keys, teams);
 				self->relocationProducer.send(rs);
 				co_await yield(TaskPriority::DataDistribution);
+				validMoves++;
+			}
+			if (now() - lastLogTime >= 30.0) {
+				lastLogTime = now();
+				TraceEvent("DDInitResumeDataMovesProgress", self->ddId)
+				    .detail("ValidMoves", validMoves)
+				    .detail("CancelledMoves", cancelledMoves)
+				    .detail("EmptyMoves", emptyMoves)
+				    .detail("ElapsedSeconds", now() - resumeStart);
 			}
 		}

+		TraceEvent("DDInitResumedDataMoves", self->ddId)
+		    .detail("ValidMoves", validMoves)
+		    .detail("CancelledMoves", cancelledMoves)
+		    .detail("EmptyMoves", emptyMoves)
+		    .detail("ElapsedSeconds", now() - resumeStart);
+
 		// Trigger background cleanup for datamove tombstones
 		if (!self->txnProcessor->isMocked()) {
 			self->addActor.send(self->removeDataMoveTombstoneBackground(self));
@@ -2701,6 +2754,7 @@ Future<Void> dataDistribution(Reference<DataDistributor> self,
 	TraceEvent(SevInfo, "DataDistributionInitProgress", self->ddId).detail("Phase", "DDConfigWatch Initialized");

 	while (true) {
+		double ddStartTime = now();
 		self->context->trackerCancelled = false;
 		// whether all initial shard are tracked
 		self->initialized = Promise<Void>();
@@ -2714,6 +2768,8 @@ Future<Void> dataDistribution(Reference<DataDistributor> self,
 		try {
 			co_await DataDistributor::init(self);

+			TraceEvent("DDInitDataLoaded", self->ddId).detail("ElapsedSeconds", now() - ddStartTime);
+
 			TraceEvent(SevInfo, "DataDistributionInitProgress", self->ddId).detail("Phase", "Metadata Initialized");

 			PromiseStream<Promise<int64_t>> getAverageShardBytes;
@@ -2925,6 +2981,7 @@ Future<Void> dataDistribution(Reference<DataDistributor> self,
 		self->context->markTrackerCancelled();
 		Error err = caughtErr;
 		TraceEvent("DataDistributorDestroyTeamCollections", self->ddId).error(caughtErr);
+		TraceEvent(SevWarn, "DDExiting", self->ddId).error(caughtErr);
 		std::vector<UID> teamForDroppedRange;
 		if (removeFailedServer.getFuture().isReady() && !removeFailedServer.getFuture().isError()) {
 			// Choose a random healthy team to host the to-be-dropped range.
@@ -5069,6 +5126,9 @@ Future<Void> dataDistributor_impl(DataDistributorInterface di, Reference<DataDis
 	std::map<UID, ErrorOr<Void>> ddSnapReqResultMap;

 	TraceEvent("DataDistributorRunning", di.id()).detail("IsMocked", isMocked);
+	// DDInitRunning duplicates the above with DDInit* prefix so the full startup sequence
+	// can be queried with Type="DDInit*" in trace logs
+	TraceEvent("DDInitRunning", di.id());
 	self->addActor.send(actors.getResult());
 	self->addActor.send(traceRole(Role::DATA_DISTRIBUTOR, di.id()));
 	self->addActor.send(waitFailureServer(di.waitFailure.getFuture()));
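The resumeFromDataMoves hunks classify each persisted data move exactly once (valid, cancelled, or empty) and report the running totals in DDInitResumeDataMovesProgress and DDInitResumedDataMoves. A standalone sketch of that bookkeeping (the enum is illustrative; the real loop derives the outcome from DataMoveMetaData and the knob settings):

```cpp
#include <cassert>

// Illustrative classification of a persisted data move during DD init.
enum class MoveOutcome { Valid, Cancelled, Empty };

// Counters mirroring validMoves/cancelledMoves/emptyMoves in the hunk above;
// each move increments exactly one counter, so the totals in the summary
// event account for every entry in the data move map.
struct ResumeCounters {
    int validMoves = 0;
    int cancelledMoves = 0;
    int emptyMoves = 0;
    void record(MoveOutcome o) {
        switch (o) {
        case MoveOutcome::Valid: validMoves++; break;
        case MoveOutcome::Cancelled: cancelledMoves++; break;
        case MoveOutcome::Empty: emptyMoves++; break;
        }
    }
};
```

Because every branch of the loop bumps exactly one counter, an operator can compare the summary totals against the number of persisted moves to confirm none were silently skipped.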
