Add DD init visibility, metrics retries, shard tracking, scan progress, and team collection logging (#12913) (#13142)
* Add DD init and team collection logging for diagnosing slow startups
When SHARD_ENCODE_LOCATION_METADATA=true, DD takes new code paths that
are often opaque. Add logging to make them visible.
For example, DD init hung for 14-16 minutes with zero visibility into
what was stuck. The only clue was a gap between DDInitUpdatedReplicaKeys
and DDInitGotInitialDD trace events. Diagnosing the root cause required
extensive log splunking of SS metrics to determine that a single
getRange(dataMoveKeys) read was queued on an overloaded storage server.
DDTxnProcessor.actor.cpp:
- Log elapsed time for the server list + data move read transaction
(DDInitServerListAndDataMoveReadComplete) with NumDataMoves, NumServers
- Log elapsed time for the keyServer scan (DDInitKeyServerScanComplete)
with NumShards
- Warn when getRange(dataMoveKeys) takes >5 seconds
(DDInitSlowDataMoveRead)
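
As a rough illustration of the elapsed-time and slow-read logging listed above, here is a minimal standalone C++ sketch. It does not use FoundationDB's Flow or TraceEvent APIs. The `TraceLine` struct, the `reportDataMoveRead` helper, and the severity assigned to the completion event are assumptions made for illustration; only the event names and the 5-second threshold come from the text above.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a trace event (not FoundationDB's TraceEvent).
struct TraceLine {
    std::string name;
    std::string severity;
    double elapsedSeconds;
};

// Always report the completion event with its elapsed time, and add a
// warning when the read took longer than the threshold (5 seconds here).
std::vector<TraceLine> reportDataMoveRead(double elapsedSeconds,
                                          double slowThresholdSeconds = 5.0) {
    assert(elapsedSeconds >= 0.0);
    std::vector<TraceLine> lines;
    // Severity of the completion event is an assumption.
    lines.push_back({ "DDInitServerListAndDataMoveReadComplete", "SevInfo", elapsedSeconds });
    if (elapsedSeconds > slowThresholdSeconds) {
        lines.push_back({ "DDInitSlowDataMoveRead", "SevWarn", elapsedSeconds });
    }
    return lines;
}
```

In this sketch a read that takes exactly 5 seconds does not warn, matching the ">5 seconds" wording.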
DataDistribution.actor.cpp:
- Add NumShards and NumServers to DDInitGotInitialDD
- Promote DDInitFoundDataMove from SevDebug to SevInfo so individual
data moves are visible in production logs
- Add DDInitResumedDataMoves summary event with ValidMoves,
CancelledMoves, EmptyMoves counts and elapsed time
DDTeamCollection.actor.cpp:
- Add Reason and Address details to UndesiredStorageServer trace events
to distinguish version lag, same-address, wrong-class, and exclusion
causes without needing to correlate with other log lines
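
The idea of attaching a Reason and Address to each UndesiredStorageServer event might look roughly like the following standalone C++ sketch. The enum, the reason strings, and the `describeUndesiredServer` helper are hypothetical and are not taken from DDTeamCollection.actor.cpp; the four causes are the ones named above.

```cpp
#include <string>

// Hypothetical enumeration of the four causes named in the commit message.
enum class UndesiredReason { VersionLag, SameAddress, WrongClass, Excluded };

// Map each cause to a short, greppable string (the strings are assumptions).
std::string reasonToString(UndesiredReason reason) {
    switch (reason) {
    case UndesiredReason::VersionLag:
        return "VersionLag";
    case UndesiredReason::SameAddress:
        return "SameAddress";
    case UndesiredReason::WrongClass:
        return "WrongClass";
    case UndesiredReason::Excluded:
        return "Excluded";
    }
    return "Unknown";
}

// Render one self-describing line so no other log line has to be correlated.
std::string describeUndesiredServer(UndesiredReason reason, const std::string& address) {
    return "UndesiredStorageServer Reason=" + reasonToString(reason) + " Address=" + address;
}
```

Carrying the cause on the event itself is what lets an operator filter by Reason instead of reconstructing it from neighboring events.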
* Revert DDInitFoundDataMove to SevDebug to avoid log spam with many data moves
* Add DD startup visibility: metrics retries, shard tracking, scan progress
Additional logging to address DD operational opacity during startup,
based on past incidents where DD hung with no visibility into the cause.
NativeAPI.actor.cpp:
- Log WaitStorageMetricsRetrying every 60s when waitStorageMetrics is
stuck retrying wrong_shard_server or all_alternatives_failed, with
the key range, retry count, and elapsed time. Previously these retries
were silent (SevDebug only), making it impossible to identify which
shard was stuck or that retries were even happening.
DDShardTracker.actor.cpp:
- Log TrackInitialShardsComplete after shard tracker setup with count
- Log TrackInitialShardsMetricsComplete after changeSizes() finishes
with elapsed time. changeSizes() waits for ALL shards to report
metrics via getFirstSize/waitStorageMetrics -- if any shard metrics
never arrive, this hangs silently.
DDTxnProcessor.actor.cpp:
- Log DDInitKeyServerScanProgress every 30s during the multi-transaction
keyServer scan with current beginKey, batch count, shards scanned,
and elapsed time. With 255K shards this scan requires many transactions
and a stuck one was previously invisible.
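
A minimal sketch of interval-based progress logging for a multi-batch scan, assuming a simple wall-clock throttle. `ProgressThrottle` and `countProgressEvents` are hypothetical names, and the loop simulates time rather than running real transactions; only the 30-second interval comes from the text above.

```cpp
#include <cassert>

// Hypothetical throttle: permits one progress log per interval.
class ProgressThrottle {
public:
    ProgressThrottle(double intervalSeconds, double startSeconds)
      : interval_(intervalSeconds), lastLog_(startSeconds) {
        assert(intervalSeconds > 0.0);
    }

    // Returns true when at least one full interval has passed since the
    // last emitted log (or since the start), and resets the timer.
    bool shouldLog(double nowSeconds) {
        if (nowSeconds - lastLog_ < interval_) {
            return false;
        }
        lastLog_ = nowSeconds;
        return true;
    }

private:
    double interval_;
    double lastLog_;
};

// Simulate a scan of `batches` batches that each take `secondsPerBatch`,
// and count how many progress events would be emitted.
int countProgressEvents(int batches, double secondsPerBatch, double intervalSeconds = 30.0) {
    ProgressThrottle throttle(intervalSeconds, 0.0);
    int events = 0;
    double now = 0.0;
    for (int i = 0; i < batches; ++i) {
        now += secondsPerBatch; // one batch (transaction) completes
        if (throttle.shouldLog(now)) {
            ++events; // here the real code would log beginKey, batch count, etc.
        }
    }
    return events;
}
```

Note that a throttle checked only between batches cannot fire while a single batch is stuck; it shows how far the scan got before it stalled.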
DataDistribution.actor.cpp:
- Log DDInitComplete with elapsed time after DataDistributor::init()
finishes, providing a single event showing total init duration.
- NativeAPI.actor.cpp: Move retry logging outside the error-type if block
so all errors get keys/elapsed/retries details. Use severity upgrade
(SevDebug -> SevWarn after 60s) on the existing WaitStorageMetricsHandleError
event instead of a separate event name.
- DataDistribution.actor.cpp: Add periodic progress logging (every 30s) in
resumeFromDataMoves loop so operators can watch counts go up during long
data move recovery.
- CompileBoost.cmake: Remove BOOST_NO_CXX98_FUNCTION_BASE since 7.3 CI is
broken independently of this change.
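
The severity upgrade described for NativeAPI.actor.cpp in the list above can be sketched as follows. This is a standalone illustration, not the actual code: the `Severity` enum and `retrySeverity` function are hypothetical, and whether the real comparison is strict at exactly 60 seconds is an assumption.

```cpp
#include <cassert>

// Hypothetical two-level severity, mirroring the SevDebug -> SevWarn upgrade.
enum class Severity { SevDebug, SevWarn };

// Keep retry errors quiet at first, then make the same event visible once
// retries have been going on for longer than the upgrade threshold.
Severity retrySeverity(double elapsedSeconds, double upgradeAfterSeconds = 60.0) {
    assert(elapsedSeconds >= 0.0);
    return elapsedSeconds > upgradeAfterSeconds ? Severity::SevWarn : Severity::SevDebug;
}
```

Reusing one event name with a rising severity keeps short-lived retries out of production logs while still surfacing a stuck shard under the name operators already search for.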
DD exits (e.g. movekeys_conflict) were invisible because
reportErrorsExcept suppresses logging for "normal" DD errors.
Add DDExiting trace event at SevWarn with error and code so
every DD death is visible in trace logs.
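
To illustrate the intent, here is a hedged standalone sketch in which an exit event is emitted unconditionally, while the regular error report is still suppressed for "normal" errors. The `ExitError` struct, the `eventsOnExit` helper, the `ErrorReported` line, and the placeholder error code are all hypothetical; the real reportErrorsExcept logic is not reproduced here.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical error record; the code value is supplied by the caller.
struct ExitError {
    std::string name;
    int code;
};

// Return the log lines produced when DD exits with error `e`.
// The DDExiting line is always emitted; the regular error report is
// skipped when the error is in the "normal" set.
std::vector<std::string> eventsOnExit(const ExitError& e,
                                      const std::set<std::string>& normalErrors) {
    std::vector<std::string> events;
    events.push_back("DDExiting Error=" + e.name + " Code=" + std::to_string(e.code));
    if (normalErrors.count(e.name) == 0) {
        events.push_back("ErrorReported Error=" + e.name);
    }
    return events;
}
```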