monitor lifecycle conductor #2723
Conversation
Hello benzekrimaha, my role is to assist you with the merge of this pull request.
Status report is not available.
Force-pushed 380069a to 25ea9d5.
Codecov Report

❌ Patch coverage is …

Additional details and impacted files: … and 3 files with indirect coverage changes.

```
@@            Coverage Diff             @@
##    development/9.3    #2723    +/-   ##
============================================
+ Coverage    74.48%    74.51%    +0.03%
============================================
  Files          200       200
  Lines        13603     13659       +56
============================================
+ Hits         10132     10178       +46
- Misses        3461      3471       +10
  Partials        10        10
```

Flags with carried forward coverage won't be shown.
Force-pushed 8316f88 to 408c96c.
Force-pushed 408c96c to e1c5b13.
Force-pushed e1c5b13 to aefb677.
```diff
  const log = this.logger.newRequestLogger();
- const start = new Date();
+ const start = Date.now();
+ this._scanId = uuid();
```
Hmm, we're storing the scan ID as a "global" field variable, but it sounds like it is really relevant/used only inside this function (through indirect calls). Could we drop the global field and instead pass it through to whatever uses it? Maybe in _createBucketTaskMessages?
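A minimal sketch of the refactor this suggests, with the scan ID threaded through the call chain instead of stored on the instance. The `contextInfo.conductorScanId` field mirrors the diff quoted later in this review; the function signatures and message shape here are assumed for illustration:

```js
const { v4: uuid } = require('uuid');

// Hypothetical shape: the scan ID is a local, passed explicitly to
// whatever builds the bucket task messages.
function createBucketTaskMessages(buckets, scanId) {
    return buckets.map(bucket => ({
        target: { bucket },
        contextInfo: { conductorScanId: scanId },
    }));
}

function processBuckets(buckets) {
    const scanId = uuid(); // lives only for the duration of this scan
    const messages = createBucketTaskMessages(buckets, scanId);
    // ... publish messages to the bucket-tasks topic ...
    return messages;
}

console.log(processBuckets(['bucket-1', 'bucket-2']));
```

With the ID threaded through explicitly, no state outlives the scan that created it, so overlapping or aborted scans cannot observe each other's ID.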
Force-pushed 725c3df to 11a94ea.
Force-pushed a2128cf to a464b39.
```js
const conductorScanId = result.contextInfo && result.contextInfo.conductorScanId;
const scanStartTimestamp = (result.contextInfo
    && result.contextInfo.scanStartTimestamp) || Date.now();
if (conductorScanId && conductorScanId !== this._metricsScanId) {
```
resetting the current scan ID when it changes WILL break if we have a bug, and make the metrics mostly useless...
```diff
  this.listBuckets(messageSendQueue, scanId, log, (err, listedBucketsCount) => {
      LifecycleMetrics.onBucketListing(log, err);
-     return next(err, nBucketsListed);
+     nBucketsListed = listedBucketsCount;
+     return next(err, listedBucketsCount);
  });
  },
- (nBucketsListed, next) => {
+ (listedBucketsCount, next) => {
      async.until(
-         () => nBucketsQueued === nBucketsListed,
+         () => nBucketsQueued === listedBucketsCount,
  });
```
```js
const bucketProcessorMessagesProcessed = ZenkoMetrics.createCounter({
    name: 's3_lifecycle_bucket_processor_messages_processed_total',
```
I think we already have this metric, in BackbeatConsumer?
Especially since there is no "scanId" parameter, it does not seem to add much information.
```js
    labelNames: [LIFECYCLE_LABEL_ORIGIN],
});

// Info-style gauge: a single active series per bucket-processor instance,
```
if we keep an unbounded series, why keep this one (which always has the value 1) instead of the previous one (the number of messages for a specific scan ID)?
```js
bucketProcessorLastConductorScanId.reset();
bucketProcessorLastConductorScanId.set({
    [LIFECYCLE_LABEL_ORIGIN]: 'bucket_processor',
    [LIFECYCLE_LABEL_CONDUCTOR_SCAN_ID]: conductorScanId,
}, 1);
```
Storing a metric does not mean it is scraped by Prometheus.
With this approach, if we have two scans (scan IDs) running in parallel, then depending on the workload pattern there is a very high probability that only a single scan ID will ever appear in the metrics.
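A runnable sketch of this failure mode, using the prom-client API mentioned elsewhere in this review (the metric and label names are illustrative): because `reset()` drops every series before `set()` writes one, only the most recent writer's scan ID survives to scrape time.

```js
const client = require('prom-client');

const lastConductorScanId = new client.Gauge({
    name: 's3_lifecycle_bucket_processor_last_conductor_scan_id', // assumed name
    help: 'Info-style gauge carrying the last seen conductor scan ID',
    labelNames: ['origin', 'conductor_scan_id'],
});

function recordScanId(conductorScanId) {
    lastConductorScanId.reset(); // drops ALL series, including other live scans
    lastConductorScanId.set({
        origin: 'bucket_processor',
        conductor_scan_id: conductorScanId,
    }, 1);
}

// Two scans interleaving on the same processor: whichever scan sent the
// most recently processed message "wins" the single exposed series.
recordScanId('scan-A');
recordScanId('scan-B');
client.register.metrics().then(body => console.log(body)); // only scan-B appears
```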
Review by Claude Code
Force-pushed d3d5c67 to 1db020b.
LGTM
Overall this is a well-structured PR. The scan ID propagation is thorough, and the new Prometheus metrics follow existing naming conventions. — Review by Claude Code
```js
}
entry.addLoggedAttributes({
    bucketName: 'target.bucket',
    objectKey: 'target.key',
```
Unlike LifecycleUpdateExpirationTask and LifecycleUpdateTransitionTask, this file does not include conductorScanId: 'contextInfo.conductorScanId' in addLoggedAttributes. The scan ID is still emitted via log.addDefaultFields, but for consistency with the other two task types, consider adding it here as well.
— Claude Code
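A sketch of the suggested addition, assuming this task's `addLoggedAttributes` call takes the same shape as in the two task types named above (the added key is the one quoted from those files):

```js
entry.addLoggedAttributes({
    bucketName: 'target.bucket',
    objectKey: 'target.key',
    // Suggested for parity with LifecycleUpdateExpirationTask and
    // LifecycleUpdateTransitionTask:
    conductorScanId: 'contextInfo.conductorScanId',
});
```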
Force-pushed a943c58 to 5499089.
```js
try {
    const now = Date.now();
    bucketProcessorScanMetricLastUpdated.set(conductorScanId, now);
    removeStaleBucketProcessorScanMetrics(now);
```
Defensive ordering: removeStaleBucketProcessorScanMetrics(now) runs before inc(). If prom-client's remove() throws for any stale entry during iteration, the exception propagates to the outer catch and the current scan's counter is never incremented — silently dropping counts on every subsequent message until the stale entry ages out of the map.
Moving the cleanup after inc() ensures the current message is always counted, even if stale-series removal fails.
— Claude Code
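A self-contained sketch of the proposed reordering. Only `bucketProcessorScanMetricLastUpdated` and `removeStaleBucketProcessorScanMetrics` are taken from the quoted diff; the counter name, staleness window, and helper behavior are assumed for illustration:

```js
const client = require('prom-client');

// Assumed supporting state, mirroring the names quoted in the diff:
const bucketProcessorScanMetricLastUpdated = new Map(); // scanId -> timestamp
const STALE_SCAN_MS = 60 * 60 * 1000; // hypothetical staleness window

const messagesByScan = new client.Counter({
    name: 's3_lifecycle_bucket_processor_scan_messages_total', // assumed name
    help: 'Messages processed per conductor scan',
    labelNames: ['conductor_scan_id'],
});

function removeStaleBucketProcessorScanMetrics(now) {
    for (const [scanId, last] of bucketProcessorScanMetricLastUpdated) {
        if (now - last > STALE_SCAN_MS) {
            messagesByScan.remove(scanId); // prune the stale series
            bucketProcessorScanMetricLastUpdated.delete(scanId);
        }
    }
}

function onMessageProcessed(conductorScanId) {
    const now = Date.now();
    bucketProcessorScanMetricLastUpdated.set(conductorScanId, now);
    // Count the current message FIRST, so a failure while pruning stale
    // series can never drop this increment.
    messagesByScan.inc({ conductor_scan_id: conductorScanId });
    try {
        removeStaleBucketProcessorScanMetrics(now);
    } catch (err) {
        // Cleanup is best-effort: a throwing remove() is retried on the
        // next message instead of blocking the counter.
    }
}

onMessageProcessed('scan-A');
```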
```js
const scanId = uuid();
this._currentScanId = scanId;
log.addDefaultFields({ conductorScanId: scanId });
LifecycleMetrics.onProcessBuckets(log);
```
onProcessBuckets was previously called at scan completion (after the success log), so s3_lifecycle_latest_batch_start_time was really a "latest batch completion" timestamp. Moving it to scan start is semantically correct, but the existing LifecycleLateScan alert in monitoring/lifecycle/alerts.yaml still depends on this metric with the description "Last lifecycle scan was performed more than … ago." With the new timing, a conductor that starts scans but never finishes them would keep refreshing this metric, delaying alert firing. Consider adding a companion alert on s3_lifecycle_latest_batch_end_time to catch scans that start but never complete, or updating the alert description.
— Claude Code
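A minimal sketch of the companion end-time metric suggested above, assuming prom-client gauges. The start-time and end-time metric names come from the comment itself, and the completion hook name echoes the `onConductorScanComplete` mentioned later in this thread:

```js
const client = require('prom-client');

const latestBatchStartTime = new client.Gauge({
    name: 's3_lifecycle_latest_batch_start_time',
    help: 'When the latest conductor scan started (ms since epoch)',
});
const latestBatchEndTime = new client.Gauge({
    name: 's3_lifecycle_latest_batch_end_time',
    help: 'When the latest conductor scan completed (ms since epoch)',
});

// Called at scan start under the new semantics described above:
function onProcessBuckets() {
    latestBatchStartTime.set(Date.now());
}

// Alerting on the end-time gauge as well would catch a conductor that
// keeps starting scans without ever finishing them:
function onConductorScanComplete() {
    latestBatchEndTime.set(Date.now());
}
```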
Force-pushed 3d8286a to ca6f753.
LGTM - well-structured monitoring enhancement. The conductorScanId propagation is consistent across all lifecycle task types, the stale metric cleanup in LifecycleMetrics avoids unbounded prom-client memory growth, the semantic change to onProcessBuckets (called at scan start rather than end) is correctly reflected in the alert wording, and the new onConductorScanComplete metrics cover the completion/duration use case. Test coverage is thorough.
LGTM
LGTM — Clean, well-tested implementation. The scan-id propagation is thorough.
Force-pushed bc5ef91 to 1a88160.
LGTM — clean, well-tested addition of conductor scan monitoring. The scan-id propagation through conductor → bucket tasks → bucket processor is consistent.
Force-pushed 1a88160 to 6967a0b.
```diff
- const start = new Date();
+ const start = Date.now();
+ const scanId = uuid();
+ this._currentScanId = scanId;
```
Race condition: this._currentScanId is assigned before _controlBacklog (line 437). If a cron-triggered processBuckets call overlaps with a running scan, this line overwrites _currentScanId with the new UUID, then the throttling path (line 468) nulls it out — corrupting _currentScanId for the in-flight scan. Any _taskToMessage calls that run after the overwrite would embed conductorScanId: null into their Kafka messages.
Impact is observability-only (no correctness issue), but easily fixable: move this._currentScanId = scanId into the _batchInProgress = true block (line 442) so it is only set after backlog control passes.
— Claude Code
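A self-contained sketch of the suggested fix, with the backlog-control and batch flags stubbed out. Only the field and method names quoted above (`_currentScanId`, `_controlBacklog`, `_batchInProgress`, `processBuckets`) are taken from the PR; the surrounding control flow and the stub logger are assumed for illustration:

```js
const { v4: uuid } = require('uuid');

class ConductorSketch {
    constructor(logger) {
        this.logger = logger;
        this._batchInProgress = false;
        this._currentScanId = null;
    }

    // Stub: fails when a batch is already running (the throttling path).
    _controlBacklog(cb) {
        cb(this._batchInProgress ? new Error('batch in progress') : null);
    }

    processBuckets(cb) {
        const start = Date.now();
        const scanId = uuid();
        const log = this.logger.newRequestLogger();
        return this._controlBacklog(err => {
            if (err) {
                // Throttled: this._currentScanId is left untouched, so the
                // in-flight scan keeps stamping its own ID into messages.
                return cb(err);
            }
            this._batchInProgress = true;
            // Claim the scan ID only once this scan is sure to run.
            this._currentScanId = scanId;
            log.addDefaultFields({ conductorScanId: scanId });
            // ... list buckets and queue bucket tasks ...
            return cb(null, { scanId, start });
        });
    }
}

// Minimal usage with a stub logger:
const conductor = new ConductorSketch({
    newRequestLogger: () => ({ addDefaultFields: () => {} }),
});
conductor.processBuckets((err, res) => console.log(err || res));
```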
Issue: BB-740