
Commit 09d67f9

Add DD init visibility, metrics retries, shard tracking, scan progress, and team collection logging (#12913) (#13142)
* Add DD init and team collection logging for diagnosing slow startups

When SHARD_ENCODE_LOCATION_METADATA=true we take new code paths that are often opaque; add logging. For example, DD init once hung for 14-16 minutes with zero visibility into what was stuck. The only clue was a gap between the DDInitUpdatedReplicaKeys and DDInitGotInitialDD trace events. Diagnosing the root cause required extensive log splunking of SS metrics to determine that a single getRange(dataMoveKeys) read was queued on an overloaded storage server.

DDTxnProcessor.actor.cpp:
- Log elapsed time for the server list + data move read transaction (DDInitServerListAndDataMoveReadComplete) with NumDataMoves, NumServers
- Log elapsed time for the keyServer scan (DDInitKeyServerScanComplete) with NumShards
- Warn when getRange(dataMoveKeys) takes >5 seconds (DDInitSlowDataMoveRead)

DataDistribution.actor.cpp:
- Add NumShards and NumServers to DDInitGotInitialDD
- Promote DDInitFoundDataMove from SevDebug to SevInfo so individual data moves are visible in production logs
- Add a DDInitResumedDataMoves summary event with ValidMoves, CancelledMoves, EmptyMoves counts and elapsed time

DDTeamCollection.actor.cpp:
- Add Reason and Address details to UndesiredStorageServer trace events to distinguish version-lag, same-address, wrong-class, and exclusion causes without needing to correlate with other log lines

* Revert DDInitFoundDataMove to SevDebug to avoid log spam with many data moves

* Add DD startup visibility: metrics retries, shard tracking, scan progress

Additional logging to address DD operational opacity during startup, based on past incidents where DD hung with no visibility into the cause.

NativeAPI.actor.cpp:
- Log WaitStorageMetricsRetrying every 60s when waitStorageMetrics is stuck retrying wrong_shard_server or all_alternatives_failed, with the key range, retry count, and elapsed time. Previously these retries were silent (SevDebug only), making it impossible to identify which shard was stuck or that retries were even happening.

DDShardTracker.actor.cpp:
- Log TrackInitialShardsComplete after shard tracker setup with the shard count
- Log TrackInitialShardsMetricsComplete after changeSizes() finishes, with elapsed time. changeSizes() waits for ALL shards to report metrics via getFirstSize/waitStorageMetrics -- if any shard's metrics never arrive, this hangs silently.

DDTxnProcessor.actor.cpp:
- Log DDInitKeyServerScanProgress every 30s during the multi-transaction keyServer scan with the current beginKey, batch count, shards scanned, and elapsed time. With 255K shards this scan requires many transactions, and a stuck one was previously invisible.

DataDistribution.actor.cpp:
- Log DDInitComplete with elapsed time after DataDistributor::init() finishes, providing a single event showing total init duration.

Follow-up fixes:
- NativeAPI.actor.cpp: Move retry logging outside the error-type if block so all errors get keys/elapsed/retries details. Use a severity upgrade (SevDebug -> SevWarn after 60s) on the existing WaitStorageMetricsHandleError event instead of a separate event name.
- DataDistribution.actor.cpp: Add periodic progress logging (every 30s) in the resumeFromDataMoves loop so operators can watch counts go up during long data move recovery.
- CompileBoost.cmake: Remove BOOST_NO_CXX98_FUNCTION_BASE since 7.3 CI is broken independently of this change.

DD exits (e.g. movekeys_conflict) were invisible because reportErrorsExcept suppresses logging for "normal" DD errors. Add a DDExiting trace event at SevWarn with the error and code so every DD death is visible in trace logs.
1 parent 5c472d2 commit 09d67f9

5 files changed

Lines changed: 131 additions & 3 deletions


fdbclient/NativeAPI.actor.cpp

Lines changed: 10 additions & 1 deletion
@@ -5865,6 +5865,8 @@ Future<std::pair<Optional<StorageMetrics>, int>> waitStorageMetrics(Database cx,
                                                    int expectedShardCount,
                                                    Optional<Reference<TransactionState>> trState) {
 	Span span("NAPI:WaitStorageMetrics"_loc, generateSpanID(cx->transactionTracingSample));
+	double startTime = now();
+	int retryCount = 0;
 	while (true) {
 		if (trState.present()) {
 			co_await trState.get()->startTransaction();
@@ -5911,7 +5913,14 @@ Future<std::pair<Optional<StorageMetrics>, int>> waitStorageMetrics(Database cx,
 		} catch (Error& e) {
 			err = e;
 		}
-		TraceEvent(SevDebug, "WaitStorageMetricsHandleError").error(err);
+		retryCount++;
+		// Upgrade from SevDebug to SevWarn after 60 seconds of retrying
+		Severity sev = (now() - startTime > 60.0) ? SevWarn : SevDebug;
+		TraceEvent(sev, "WaitStorageMetricsHandleError")
+		    .error(err)
+		    .detail("Keys", keys)
+		    .detail("Elapsed", now() - startTime)
+		    .detail("Retries", retryCount);
 		if (err.code() == error_code_wrong_shard_server || err.code() == error_code_all_alternatives_failed) {
 			cx->invalidateCache(keys);
 			co_await delay(CLIENT_KNOBS->WRONG_SHARD_SERVER_DELAY, TaskPriority::DataDistribution);
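The hunk above keeps a single event name and escalates its severity once the retry loop has been stuck for a while. A minimal standalone sketch of that policy (plain C++, not the FDB Trace API; the enum values are illustrative stand-ins for flow/Trace.h):

```cpp
#include <cassert>

// Illustrative stand-ins for FDB's trace severities (the real values live in flow/Trace.h).
enum Severity { SevDebug = 5, SevWarn = 30 };

// Retries log quietly at SevDebug, but once the same waitStorageMetrics call
// has been retrying for more than 60 seconds the same event is emitted at
// SevWarn so it surfaces in production log searches.
inline Severity retrySeverity(double startTime, double currentTime) {
    return (currentTime - startTime > 60.0) ? SevWarn : SevDebug;
}
```

Reusing the existing event name means dashboards keyed on WaitStorageMetricsHandleError keep working; only the severity changes.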

fdbserver/datadistributor/DDShardTracker.cpp

Lines changed: 10 additions & 0 deletions
@@ -1051,9 +1051,19 @@ Future<Void> trackInitialShards(DataDistributionTracker* self, Reference<Initial
 		co_await yield(TaskPriority::DataDistribution);
 	}

+	TraceEvent("TrackInitialShardsComplete", self->distributorId).detail("ShardsTracked", s);
+
+	double changeSizesStart = now();
 	Future<Void> initialSize = changeSizes(self, KeyRangeRef(allKeys.begin, allKeys.end), 0, "ShardInit");
 	self->readyToStart.send(Void());
 	co_await initialSize;
+
+	TraceEvent("TrackInitialShardsMetricsComplete", self->distributorId)
+	    .detail("ElapsedSeconds", now() - changeSizesStart);
+
+	// DDInitDone bookends DDInitRunning -- marks DD fully operational. Uses DD* prefix so
+	// the full startup sequence can be queried with Type="DDInit*" in trace logs.
+	TraceEvent("DDInitDone", self->distributorId);
 	self->maxShardSizeUpdater = updateMaxShardSize(self->dbSizeEstimate, self->maxShardSize);
 }
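The bookend pattern above (capture a timestamp, await the long-running work, attach the elapsed seconds to the completion event) can be sketched as a tiny helper. This is not FDB code; the injected clock value stands in for g_network->now():

```cpp
#include <cassert>

// Capture a start timestamp before a long-running wait (here, changeSizes()
// waiting for all shard metrics), then report elapsed seconds when it
// completes, as TrackInitialShardsMetricsComplete does above.
struct ElapsedTimer {
    double start;
    explicit ElapsedTimer(double nowSeconds) : start(nowSeconds) {}
    double elapsed(double nowSeconds) const { return nowSeconds - start; }
};
```

Attaching the duration to the completion event (rather than logging start and end separately) lets a single trace query surface slow phases.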

fdbserver/datadistributor/DDTeamCollection.actor.cpp

Lines changed: 22 additions & 2 deletions
@@ -1379,7 +1379,17 @@ class DDTeamCollectionImpl {
 	state Future<Void> storageMetadataTracker = self->updateStorageMetadata(server);
 	try {
 		loop {
-			status.isUndesired = (!self->disableFailingLaggingServers.get() && server->ssVersionTooFarBehind.get());
+			{
+				bool versionLagUndesired =
+				    !self->disableFailingLaggingServers.get() && server->ssVersionTooFarBehind.get();
+				if (versionLagUndesired && !status.isUndesired) {
+					TraceEvent(SevWarn, "UndesiredStorageServer", self->distributorId)
+					    .detail("Server", server->getId())
+					    .detail("Address", server->getLastKnownInterface().address())
+					    .detail("Reason", "VersionLag");
+				}
+				status.isUndesired = versionLagUndesired;
+			}
 			status.isWrongConfiguration = false;
 			status.isWiggling = false;
 			hasWrongDC = !self->isCorrectDC(*server);
@@ -1415,6 +1425,7 @@ class DDTeamCollectionImpl {
 					TraceEvent(SevWarn, "UndesiredStorageServer", self->distributorId)
 					    .detail("Server", server->getId())
 					    .detail("Address", server->getLastKnownInterface().address())
+					    .detail("Reason", "SameAddress")
 					    .detail("OtherServer", i.second->getId())
 					    .detail("NumShards",
 					            self->shardsAffectedByTeamFailure->getNumberOfShards(server->getId()))
@@ -1440,6 +1451,8 @@ class DDTeamCollectionImpl {
 				if (self->optimalTeamCount > 0) {
 					TraceEvent(SevWarn, "UndesiredStorageServer", self->distributorId)
 					    .detail("Server", server->getId())
+					    .detail("Address", server->getLastKnownInterface().address())
+					    .detail("Reason", "WrongMachineClass")
 					    .detail("OptimalTeamCount", self->optimalTeamCount)
 					    .detail("Fitness", server->getLastKnownClass().machineClassFitness(ProcessClass::Storage));
 					status.isUndesired = true;
@@ -1517,9 +1530,16 @@ class DDTeamCollectionImpl {
 			}

 			if (worstStatus != DDTeamCollection::Status::NONE) {
+				const char* exclusionType = worstStatus == DDTeamCollection::Status::WIGGLING ? "Wiggling"
+				                            : worstStatus == DDTeamCollection::Status::FAILED ? "Failed"
+				                            : worstStatus == DDTeamCollection::Status::EXCLUDED ? "Excluded"
+				                                                                               : "Unknown";
 				TraceEvent(SevWarn, "UndesiredStorageServer", self->distributorId)
 				    .detail("Server", server->getId())
-				    .detail("Excluded", worstAddr.toString());
+				    .detail("Address", server->getLastKnownInterface().address())
+				    .detail("Reason", "Excluded")
+				    .detail("ExclusionType", exclusionType)
+				    .detail("ExcludedAddress", worstAddr.toString());
 				status.isUndesired = true;
 				status.isWrongConfiguration = true;
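The last hunk maps an internal status enum to a stable string so UndesiredStorageServer events can be filtered by cause without correlating other log lines. A standalone sketch of that mapping (the enum mirrors the DDTeamCollection::Status values named in the hunk but is illustrative only):

```cpp
#include <cassert>
#include <string>

// Illustrative mirror of the DDTeamCollection::Status values used above.
enum class Status { NONE, WIGGLING, FAILED, EXCLUDED };

// Map the worst observed status to a stable string for the ExclusionType
// trace detail, falling back to "Unknown" for anything unexpected.
inline const char* exclusionType(Status s) {
    return s == Status::WIGGLING ? "Wiggling"
           : s == Status::FAILED ? "Failed"
           : s == Status::EXCLUDED ? "Excluded"
                                   : "Unknown";
}
```

Emitting a closed set of strings (rather than a raw enum integer) keeps trace queries readable and stable across refactors.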

fdbserver/datadistributor/DDTxnProcessor.cpp

Lines changed: 29 additions & 0 deletions
@@ -329,6 +329,8 @@ class DDTxnProcessorImpl {
 	CODE_PROBE((bool)skipDDModeCheck, "DD Mode won't prevent read initial data distribution.");
 	// Get the server list in its own try/catch block since it modifies result. We don't want a subsequent failure
 	// causing entries to be duplicated
+	// Phase 1: Single transaction to read server list and all persisted data moves
+	double serverListAndDataMoveReadStart = now();
 	while (true) {
 		numDataMoves = 0;
 		server_dc.clear();
@@ -390,7 +392,12 @@ class DDTxnProcessorImpl {
 			}
 		}

+		double dataMoveReadStart = now();
 		RangeResult dms = co_await tr.getRange(dataMoveKeys, CLIENT_KNOBS->TOO_MANY);
+		if (now() - dataMoveReadStart > 5.0) {
+			TraceEvent(SevWarn, "DDInitSlowDataMoveRead", distributorId)
+			    .detail("ElapsedSeconds", now() - dataMoveReadStart);
+		}
 		ASSERT(!dms.more && dms.size() < CLIENT_KNOBS->TOO_MANY);
 		// For each data move, find out the src or dst servers are in primary or remote DC.
 		for (int i = 0; i < dms.size(); ++i) {
@@ -438,6 +445,11 @@ class DDTxnProcessorImpl {

 		succeeded = true;

+		TraceEvent("DDInitServerListAndDataMoveReadComplete", distributorId)
+		    .detail("NumDataMoves", numDataMoves)
+		    .detail("NumServers", result->allServers.size())
+		    .detail("ElapsedSeconds", now() - serverListAndDataMoveReadStart);
+
 		break;
 	} catch (Error& e) {
 		err = e;
@@ -450,6 +462,10 @@ class DDTxnProcessorImpl {

 	// If keyServers is too large to read in a single transaction, then we will have to break this process up into
 	// multiple transactions. In that case, each iteration should begin where the previous left off
+	// Scan keyServers in batches to build the shard map
+	double keyServerScanStart = now();
+	double lastScanLogTime = now();
+	int scanBatchCount = 0;
 	while (beginKey < allKeys.end) {
 		CODE_PROBE(beginKey > allKeys.begin, "Multi-transactional getInitialDataDistribution");
 		while (true) {
@@ -537,6 +553,15 @@ class DDTxnProcessorImpl {

 			ASSERT_GT(keyServers.size(), 0);
 			beginKey = keyServers.end()[-1].key;
+			scanBatchCount++;
+			if (now() - lastScanLogTime >= 30.0) {
+				lastScanLogTime = now();
+				TraceEvent("DDInitKeyServerScanProgress", distributorId)
+				    .detail("BeginKey", beginKey)
+				    .detail("Batches", scanBatchCount)
+				    .detail("ShardsScanned", result->shards.size())
+				    .detail("ElapsedSeconds", now() - keyServerScanStart);
+			}
 			break;
 		} catch (Error& e) {
 			err = e;
@@ -553,6 +578,10 @@ class DDTxnProcessorImpl {
 	// a dummy shard at the end with no keys or servers makes life easier for trackInitialShards()
 	result->shards.push_back(DDShardInfo(allKeys.end));

+	TraceEvent("DDInitKeyServerScanComplete", distributorId)
+	    .detail("NumShards", result->shards.size())
+	    .detail("ElapsedSeconds", now() - keyServerScanStart);
+
 	if (SERVER_KNOBS->SHARD_ENCODE_LOCATION_METADATA && numDataMoves > 0) {
 		for (int shard = 0; shard < result->shards.size() - 1; ++shard) {
 			const DDShardInfo& iShard = result->shards[shard];

fdbserver/datadistributor/DataDistribution.cpp

Lines changed: 60 additions & 0 deletions
@@ -595,6 +595,31 @@ struct DataDistributor : NonCopyable, ReferenceCounted<DataDistributor> {
 	// Initialize the required internal states of DataDistributor from system metadata. It's necessary before
 	// DataDistributor start working. Doesn't include initialization of optional components, like DDQueue,
 	// Tracker, TeamCollection. The components should call its own ::init methods.
+	//
+	// DD Startup Progress (trace events in order):
+	//   DDInitRunning - DD process recruited and starting init
+	//   DDInitTakingMoveKeysLock - Acquiring move keys lock
+	//   DDInitTookMoveKeysLock - Lock acquired
+	//   DDInitGotConfiguration - Database configuration loaded
+	//   DDInitUpdatedReplicaKeys - Replica keys updated
+	//   DDInitSlowDataMoveRead - (SevWarn) dataMoveKeys read taking >5s
+	//   DDInitServerListAndDataMoveReadComplete - Server list + data moves read: NumDataMoves, NumServers,
+	//                                             ElapsedSeconds
+	//   DDInitKeyServerScanProgress - (every 30s) keyServer scan: BeginKey, Batches, ShardsScanned
+	//   DDInitKeyServerScanComplete - keyServer scan done: NumShards, ElapsedSeconds
+	//   DDInitGotInitialDD - Init data loaded: NumShards, NumServers
+	//   DDInitDataLoaded - Init data loaded, ElapsedSeconds (does NOT mean DD is fully operational)
+	//
+	// After init(), the following startup events fire from other components:
+	//   DDInitResumeDataMovesProgress - (every 30s) data move resume: ValidMoves, CancelledMoves, EmptyMoves
+	//   DDInitResumedDataMoves - Data move resume complete with counts
+	//   TrackInitialShards - Shard tracker setup started with InitialShardCount
+	//   TrackInitialShardsComplete - Shard trackers created: ShardsTracked
+	//   DDTrackerStarting - Teams ready (fires from DDTeamCollection after readyToStart + delay)
+	//   TrackInitialShardsMetricsComplete - All shard metrics received: ElapsedSeconds
+	//     WaitStorageMetricsHandleError may fire (SevWarn after 60s) if a
+	//     shard's metrics read is stuck retrying: Keys, Retries
+	//   DDInitDone - DD is fully operational with all shard sizes loaded
 	static Future<Void> init(Reference<DataDistributor> self) {
 		while (true) {
 			co_await self->waitDataDistributorEnabled();
@@ -653,13 +678,17 @@ struct DataDistributor : NonCopyable, ReferenceCounted<DataDistributor> {
 			    .detail("E", self->initData->shards.end()[-1].key)
 			    .detail("Src", describe(self->initData->shards.end()[-2].primarySrc))
 			    .detail("Dest", describe(self->initData->shards.end()[-2].primaryDest))
+			    .detail("NumShards", self->initData->shards.size())
+			    .detail("NumServers", self->initData->allServers.size())
 			    .trackLatest(self->initialDDEventHolder->trackingKey);
 		} else {
 			TraceEvent("DDInitGotInitialDD", self->ddId)
 			    .detail("B", "")
 			    .detail("E", "")
 			    .detail("Src", "[no items]")
 			    .detail("Dest", "[no items]")
+			    .detail("NumShards", self->initData->shards.size())
+			    .detail("NumServers", self->initData->allServers.size())
 			    .trackLatest(self->initialDDEventHolder->trackingKey);
 		}

@@ -859,6 +888,11 @@ struct DataDistributor : NonCopyable, ReferenceCounted<DataDistributor> {
 	// TODO: unit test needed
 	static Future<Void> resumeFromDataMoves(Reference<DataDistributor> self, Future<Void> readyToStart) {
 		KeyRangeMap<std::shared_ptr<DataMove>>::iterator it = self->initData->dataMoveMap.ranges().begin();
+		int validMoves = 0;
+		int cancelledMoves = 0;
+		int emptyMoves = 0;
+		double resumeStart = now();
+		double lastLogTime = now();

 		co_await readyToStart;

@@ -867,6 +901,7 @@ struct DataDistributor : NonCopyable, ReferenceCounted<DataDistributor> {
 			DataMoveType dataMoveType = getDataMoveTypeFromDataMoveId(meta.id);
 			if (meta.ranges.empty()) {
 				TraceEvent(SevInfo, "EmptyDataMoveRange", self->ddId).detail("DataMoveMetaData", meta.toString());
+				emptyMoves++;
 				continue;
 			}
 			if (meta.bulkLoadTaskState.present()) {
@@ -879,6 +914,7 @@ struct DataDistributor : NonCopyable, ReferenceCounted<DataDistributor> {
 				TraceEvent(SevWarnAlways, "DDBulkLoadTaskCancelDataMove", self->ddId)
 				    .detail("Reason", "DDInit")
 				    .detail("DataMove", meta.toString());
+				cancelledMoves++;
 			} else if (dataMoveType == DataMoveType::LOGICAL_BULKLOAD ||
 			           dataMoveType == DataMoveType::PHYSICAL_BULKLOAD) {
 				// The metadata is from the old system
@@ -890,13 +926,15 @@ struct DataDistributor : NonCopyable, ReferenceCounted<DataDistributor> {
 				    .detail("Reason", "WrongTypeWhenDDInit")
 				    .detail("DataMoveType", dataMoveType)
 				    .detail("DataMove", meta.toString());
+				cancelledMoves++;
 			} else if (it.value()->isCancelled() ||
 			           (it.value()->valid && !SERVER_KNOBS->SHARD_ENCODE_LOCATION_METADATA)) {
 				RelocateShard rs(meta.ranges.front(), DataMovementReason::RECOVER_MOVE, RelocateReason::OTHER);
 				rs.dataMoveId = meta.id;
 				rs.cancelled = true;
 				self->relocationProducer.send(rs);
 				TraceEvent("DDInitScheduledCancelDataMove", self->ddId).detail("DataMove", meta.toString());
+				cancelledMoves++;
 			} else if (it.value()->valid) {
 				TraceEvent(SevDebug, "DDInitFoundDataMove", self->ddId).detail("DataMove", meta.toString());
 				ASSERT(meta.ranges.front() == it.range());
@@ -920,9 +958,24 @@ struct DataDistributor : NonCopyable, ReferenceCounted<DataDistributor> {
 				self->shardsAffectedByTeamFailure->moveShard(rs.keys, teams);
 				self->relocationProducer.send(rs);
 				co_await yield(TaskPriority::DataDistribution);
+				validMoves++;
+			}
+			if (now() - lastLogTime >= 30.0) {
+				lastLogTime = now();
+				TraceEvent("DDInitResumeDataMovesProgress", self->ddId)
+				    .detail("ValidMoves", validMoves)
+				    .detail("CancelledMoves", cancelledMoves)
+				    .detail("EmptyMoves", emptyMoves)
+				    .detail("ElapsedSeconds", now() - resumeStart);
 			}
 		}

+		TraceEvent("DDInitResumedDataMoves", self->ddId)
+		    .detail("ValidMoves", validMoves)
+		    .detail("CancelledMoves", cancelledMoves)
+		    .detail("EmptyMoves", emptyMoves)
+		    .detail("ElapsedSeconds", now() - resumeStart);
+
 		// Trigger background cleanup for datamove tombstones
 		if (!self->txnProcessor->isMocked()) {
 			self->addActor.send(self->removeDataMoveTombstoneBackground(self));
@@ -2701,6 +2754,7 @@ Future<Void> dataDistribution(Reference<DataDistributor> self,
 	TraceEvent(SevInfo, "DataDistributionInitProgress", self->ddId).detail("Phase", "DDConfigWatch Initialized");

 	while (true) {
+		double ddStartTime = now();
 		self->context->trackerCancelled = false;
 		// whether all initial shard are tracked
 		self->initialized = Promise<Void>();
@@ -2714,6 +2768,8 @@ Future<Void> dataDistribution(Reference<DataDistributor> self,
 		try {
 			co_await DataDistributor::init(self);

+			TraceEvent("DDInitDataLoaded", self->ddId).detail("ElapsedSeconds", now() - ddStartTime);
+
 			TraceEvent(SevInfo, "DataDistributionInitProgress", self->ddId).detail("Phase", "Metadata Initialized");

 			PromiseStream<Promise<int64_t>> getAverageShardBytes;
@@ -2925,6 +2981,7 @@ Future<Void> dataDistribution(Reference<DataDistributor> self,
 		self->context->markTrackerCancelled();
 		Error err = caughtErr;
 		TraceEvent("DataDistributorDestroyTeamCollections", self->ddId).error(caughtErr);
+		TraceEvent(SevWarn, "DDExiting", self->ddId).error(caughtErr);
 		std::vector<UID> teamForDroppedRange;
 		if (removeFailedServer.getFuture().isReady() && !removeFailedServer.getFuture().isError()) {
 			// Choose a random healthy team to host the to-be-dropped range.
@@ -5069,6 +5126,9 @@ Future<Void> dataDistributor_impl(DataDistributorInterface di, Reference<DataDis
 	std::map<UID, ErrorOr<Void>> ddSnapReqResultMap;

 	TraceEvent("DataDistributorRunning", di.id()).detail("IsMocked", isMocked);
+	// DDInitRunning duplicates the above with DDInit* prefix so the full startup sequence
+	// can be queried with Type="DDInit*" in trace logs
+	TraceEvent("DDInitRunning", di.id());
 	self->addActor.send(actors.getResult());
 	self->addActor.send(traceRole(Role::DATA_DISTRIBUTOR, di.id()));
 	self->addActor.send(waitFailureServer(di.waitFailure.getFuture()));
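The resumeFromDataMoves hunks classify each persisted data move exactly once (valid, cancelled, or empty) and report the running totals in DDInitResumeDataMovesProgress and DDInitResumedDataMoves. A standalone sketch of that bookkeeping (the enum is illustrative; the real loop derives the outcome from DataMoveMetaData and the knob settings):

```cpp
#include <cassert>

// Illustrative classification of a persisted data move during DD init.
enum class MoveOutcome { Valid, Cancelled, Empty };

// Counters mirroring validMoves/cancelledMoves/emptyMoves in the hunk above;
// each move increments exactly one counter, so the totals in the summary
// event account for every entry in the data move map.
struct ResumeCounters {
    int validMoves = 0;
    int cancelledMoves = 0;
    int emptyMoves = 0;
    void record(MoveOutcome o) {
        switch (o) {
        case MoveOutcome::Valid: validMoves++; break;
        case MoveOutcome::Cancelled: cancelledMoves++; break;
        case MoveOutcome::Empty: emptyMoves++; break;
        }
    }
};
```

Because every branch of the loop bumps exactly one counter, an operator can compare the summary totals against the number of persisted moves to confirm none were silently skipped.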
