Skip to content

monitor lifecycle conductor#2723

Open
benzekrimaha wants to merge 5 commits intodevelopment/9.3from
improvement/BB-740-monitor-lifecycle-conductor
Open

monitor lifecycle conductor#2723
benzekrimaha wants to merge 5 commits intodevelopment/9.3from
improvement/BB-740-monitor-lifecycle-conductor

Conversation

@benzekrimaha
Copy link
Copy Markdown
Contributor

@benzekrimaha benzekrimaha commented Mar 2, 2026

Issue: BB-740

@bert-e
Copy link
Copy Markdown
Contributor

bert-e commented Mar 2, 2026

Hello benzekrimaha,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Available options
name description privileged authored
/after_pull_request Wait for the given pull request id to be merged before continuing with the current one.
/bypass_author_approval Bypass the pull request author's approval
/bypass_build_status Bypass the build and test status
/bypass_commit_size Bypass the check on the size of the changeset TBA
/bypass_incompatible_branch Bypass the check on the source branch prefix
/bypass_jira_check Bypass the Jira issue check
/bypass_peer_approval Bypass the pull request peers' approval
/bypass_leader_approval Bypass the pull request leaders' approval
/approve Instruct Bert-E that the author has approved the pull request. ✍️
/create_pull_requests Allow the creation of integration pull requests.
/create_integration_branches Allow the creation of integration branches.
/no_octopus Prevent Wall-E from doing any octopus merge and use multiple consecutive merge instead
/unanimity Change review acceptance criteria from one reviewer at least to all reviewers
/wait Instruct Bert-E not to run until further notice.
Available commands
name description privileged
/help Print Bert-E's manual in the pull request.
/status Print Bert-E's current status in the pull request TBA
/clear Remove all comments from Bert-E from the history TBA
/retry Re-start a fresh build TBA
/build Re-start a fresh build TBA
/force_reset Delete integration branches & pull requests, and restart merge process from the beginning.
/reset Try to remove integration branches unless there are commits on them which do not appear on the source branch.

Status report is not available.

@bert-e
Copy link
Copy Markdown
Contributor

bert-e commented Mar 2, 2026

Incorrect fix version

The Fix Version/s in issue BB-740 contains:

  • 9.3.0

Considering where you are trying to merge, I ignored possible hotfix versions and I expected to find:

  • 9.3.1

Please check the Fix Version/s of BB-740, or the target
branch of this pull request.

@benzekrimaha benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch from 380069a to 25ea9d5 Compare March 2, 2026 16:32
@codecov
Copy link
Copy Markdown

codecov Bot commented Mar 2, 2026

Codecov Report

❌ Patch coverage is 94.36620% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.51%. Comparing base (145f7a6) to head (6967a0b).
⚠️ Report is 33 commits behind head on development/9.3.

Files with missing lines Patch % Lines
...sions/lifecycle/tasks/LifecycleDeleteObjectTask.js 66.66% 1 Missing ⚠️
extensions/lifecycle/tasks/LifecycleTask.js 66.66% 1 Missing ⚠️
...s/lifecycle/tasks/LifecycleUpdateExpirationTask.js 66.66% 1 Missing ⚠️
...s/lifecycle/tasks/LifecycleUpdateTransitionTask.js 66.66% 1 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

Files with missing lines Coverage Δ
extensions/lifecycle/LifecycleMetrics.js 98.78% <100.00%> (+0.56%) ⬆️
...ecycle/bucketProcessor/LifecycleBucketProcessor.js 80.48% <100.00%> (+0.61%) ⬆️
...tensions/lifecycle/conductor/LifecycleConductor.js 84.16% <100.00%> (+0.45%) ⬆️
extensions/lifecycle/tasks/LifecycleTaskV2.js 88.88% <ø> (ø)
...sions/lifecycle/tasks/LifecycleDeleteObjectTask.js 92.25% <66.66%> (-0.51%) ⬇️
extensions/lifecycle/tasks/LifecycleTask.js 91.41% <66.66%> (-0.14%) ⬇️
...s/lifecycle/tasks/LifecycleUpdateExpirationTask.js 80.51% <66.66%> (-0.57%) ⬇️
...s/lifecycle/tasks/LifecycleUpdateTransitionTask.js 91.17% <66.66%> (-0.75%) ⬇️

... and 3 files with indirect coverage changes

Components Coverage Δ
Bucket Notification 80.37% <ø> (ø)
Core Library 80.71% <ø> (+0.01%) ⬆️
Ingestion 70.53% <ø> (-0.62%) ⬇️
Lifecycle 78.88% <94.36%> (+0.26%) ⬆️
Oplog Populator 85.83% <ø> (ø)
Replication 59.65% <ø> (ø)
Bucket Scanner 85.76% <ø> (ø)
@@                 Coverage Diff                 @@
##           development/9.3    #2723      +/-   ##
===================================================
+ Coverage            74.48%   74.51%   +0.03%     
===================================================
  Files                  200      200              
  Lines                13603    13659      +56     
===================================================
+ Hits                 10132    10178      +46     
- Misses                3461     3471      +10     
  Partials                10       10              
Flag Coverage Δ
api:retry 9.10% <0.00%> (-0.04%) ⬇️
api:routes 8.92% <0.00%> (-0.04%) ⬇️
bucket-scanner 85.76% <ø> (ø)
ft_test:queuepopulator 10.04% <11.26%> (+0.01%) ⬆️
ingestion 12.44% <0.00%> (-0.11%) ⬇️
lib 7.58% <0.00%> (-0.04%) ⬇️
lifecycle 18.95% <69.01%> (+0.10%) ⬆️
notification 1.02% <0.00%> (-0.01%) ⬇️
oplogPopulator 0.14% <0.00%> (-0.01%) ⬇️
replication 18.47% <11.26%> (-0.02%) ⬇️
unit 51.18% <88.73%> (+0.16%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@benzekrimaha benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch 5 times, most recently from 8316f88 to 408c96c Compare March 11, 2026 16:03
@benzekrimaha benzekrimaha marked this pull request as ready for review March 11, 2026 16:35
@benzekrimaha benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch from 408c96c to e1c5b13 Compare March 11, 2026 16:48
@francoisferrand francoisferrand requested a review from delthas March 18, 2026 09:19
@benzekrimaha benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch from e1c5b13 to aefb677 Compare March 18, 2026 10:04
@benzekrimaha benzekrimaha changed the title Improvement/bb 740 monitor lifecycle conductor Improvement/BB-740 monitor lifecycle conductor Mar 18, 2026
@francoisferrand francoisferrand changed the title Improvement/BB-740 monitor lifecycle conductor monitor lifecycle conductor Mar 18, 2026
const log = this.logger.newRequestLogger();
const start = new Date();
const start = Date.now();
this._scanId = uuid();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, we're storing the scan ID as a "global" field variable, but it sounds like it is really relevant/used only inside this function (through indirect calls). Could we drop the global field and instead pass it through to whatever uses it? Maybe in _createBucketTaskMessages?

Comment thread extensions/lifecycle/conductor/LifecycleConductor.js Outdated
Comment thread extensions/lifecycle/conductor/LifecycleConductor.js Outdated
Comment thread extensions/lifecycle/bucketProcessor/LifecycleBucketProcessor.js Outdated
@benzekrimaha benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch from 725c3df to 11a94ea Compare March 19, 2026 09:36
Comment thread extensions/lifecycle/conductor/LifecycleConductor.js Outdated
Comment thread extensions/lifecycle/bucketProcessor/LifecycleBucketProcessor.js Outdated
Comment thread extensions/lifecycle/bucketProcessor/LifecycleBucketProcessor.js Outdated
@benzekrimaha benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch from a2128cf to a464b39 Compare March 19, 2026 09:43
Comment thread extensions/lifecycle/conductor/LifecycleConductor.js Outdated
Comment thread tests/unit/lifecycle/LifecycleConductor.spec.js Outdated
Comment thread tests/unit/lifecycle/LifecycleConductor.spec.js Outdated
const conductorScanId = result.contextInfo && result.contextInfo.conductorScanId;
const scanStartTimestamp = (result.contextInfo
&& result.contextInfo.scanStartTimestamp) || Date.now();
if (conductorScanId && conductorScanId !== this._metricsScanId) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@benzekrimaha ?

resetting the current scan ID when it changes WILL break if we have a bug, and make the metrics mostly useless...

Comment on lines +455 to +463
this.listBuckets(messageSendQueue, scanId, log, (err, listedBucketsCount) => {
LifecycleMetrics.onBucketListing(log, err);
return next(err, nBucketsListed);
nBucketsListed = listedBucketsCount;
return next(err, listedBucketsCount);
});
},
(nBucketsListed, next) => {
(listedBucketsCount, next) => {
async.until(
() => nBucketsQueued === nBucketsListed,
() => nBucketsQueued === listedBucketsCount,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

});

const bucketProcessorMessagesProcessed = ZenkoMetrics.createCounter({
name: 's3_lifecycle_bucket_processor_messages_processed_total',
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we already have this metric, in BackbeatConsumer ?

esp. since there is no "scanId" parameter, it does not seem to add much information

labelNames: [LIFECYCLE_LABEL_ORIGIN],
});

// Info-style gauge: a single active series per bucket-processor instance,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we keep an unbounded series, why keep this one (with always has the value 1) instead of the previous one (number of message for a specific scanID) ?

Comment on lines +333 to +337
bucketProcessorLastConductorScanId.reset();
bucketProcessorLastConductorScanId.set({
[LIFECYCLE_LABEL_ORIGIN]: 'bucket_processor',
[LIFECYCLE_LABEL_CONDUCTOR_SCAN_ID]: conductorScanId,
}, 1);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

storing a metric does not mean it is scraped by prometheus.
with this approach, if we have 2 scans (ScanID) in parallel, depending on workload pattern we will have a very high probability that a single ScanID will appear in the metrics

Comment thread tests/utils/BackbeatTestConsumer.js Outdated
Comment thread tests/utils/BackbeatTestConsumer.js Outdated
@claude
Copy link
Copy Markdown

claude Bot commented Apr 27, 2026

  • tests/utils/BackbeatTestConsumer.js:44 — trailing whitespace after &&, will likely fail linter
  • tests/utils/BackbeatTestConsumer.js:44-47 — bucketSource normalization block references a contextInfo field that _taskToMessage never sets; dead code unless a future change adds it

Review by Claude Code

@benzekrimaha benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch from d3d5c67 to 1db020b Compare April 27, 2026 10:23
@claude
Copy link
Copy Markdown

claude Bot commented Apr 27, 2026

LGTM

This PR adds comprehensive conductor scan monitoring by:
- Generating a UUID conductorScanId per scan and propagating it through Kafka messages, bucket processors, and all lifecycle task types for end-to-end log correlation
- Adding new Prometheus metrics: batch end time, batch duration (seconds), bucket count, and per-scan message processing counter
- Moving onProcessBuckets to scan start (heartbeat semantics) and adding onConductorScanComplete at scan end
- Adding Grafana dashboard panels for scan duration, bucket count, and per-scan/per-pod message processing rate
- Bounded label cardinality via 24-hour retention cleanup of stale conductor_scan_id series in prom-client

Code is well-structured: error handling wraps all metric updates, the scan ID lifecycle is properly managed (set at start, nulled on all exit paths including throttling and error), and the async.until synchronization ensures _currentScanId is valid for all _taskToMessage calls. Kafka message format change is backward-compatible (additive field in contextInfo). Tests cover metric updates, error paths, scan ID generation, and stale series cleanup.

Review by Claude Code

Comment thread extensions/lifecycle/conductor/LifecycleConductor.js Outdated
@claude
Copy link
Copy Markdown

claude Bot commented Apr 27, 2026

  • extensions/lifecycle/conductor/LifecycleConductor.js:568 — isLifecycled: true added in the zookeeper listing path appears unrelated to conductor monitoring and is not consumed by any downstream code in this PR. Consider removing or splitting into a separate change.

Overall this is a well-structured PR. The scan ID propagation is thorough, the new Prometheus metrics follow existing naming conventions (s3_backbeat_* / s3_lifecycle_*), the per-scan counter cardinality is bounded by the 24h cleanup mechanism, and the test coverage is comprehensive. The onProcessBuckets relocation from end-of-scan to start-of-scan correctly aligns the s3_lifecycle_latest_batch_start_time metric with its documented semantics.

Review by Claude Code

}
entry.addLoggedAttributes({
bucketName: 'target.bucket',
objectKey: 'target.key',
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unlike LifecycleUpdateExpirationTask and LifecycleUpdateTransitionTask, this file does not include conductorScanId: 'contextInfo.conductorScanId' in addLoggedAttributes. The scan ID is still emitted via log.addDefaultFields, but for consistency with the other two task types, consider adding it here as well.

— Claude Code

@claude
Copy link
Copy Markdown

claude Bot commented Apr 27, 2026

  • extensions/lifecycle/tasks/LifecycleDeleteObjectTask.js:314 — Missing conductorScanId: 'contextInfo.conductorScanId' in addLoggedAttributes, unlike the other two action task types (LifecycleUpdateExpirationTask, LifecycleUpdateTransitionTask) which include it. Minor inconsistency.

Review by Claude Code

@benzekrimaha benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch from a943c58 to 5499089 Compare April 27, 2026 12:02
try {
const now = Date.now();
bucketProcessorScanMetricLastUpdated.set(conductorScanId, now);
removeStaleBucketProcessorScanMetrics(now);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Defensive ordering: removeStaleBucketProcessorScanMetrics(now) runs before inc(). If prom-client's remove() throws for any stale entry during iteration, the exception propagates to the outer catch and the current scan's counter is never incremented — silently dropping counts on every subsequent message until the stale entry ages out of the map.

Moving the cleanup after inc() ensures the current message is always counted, even if stale-series removal fails.

— Claude Code

@claude
Copy link
Copy Markdown

claude Bot commented Apr 27, 2026

  • extensions/lifecycle/LifecycleMetrics.js:314 — removeStaleBucketProcessorScanMetrics(now) is called before inc() inside the same try block. If remove() throws for any stale entry, the current scan's counter increment is skipped. Consider moving the cleanup after the inc() call.

Review by Claude Code

const scanId = uuid();
this._currentScanId = scanId;
log.addDefaultFields({ conductorScanId: scanId });
LifecycleMetrics.onProcessBuckets(log);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

onProcessBuckets was previously called at scan completion (after the success log), so s3_lifecycle_latest_batch_start_time was really a "latest batch completion" timestamp. Moving it to scan start is semantically correct, but the existing LifecycleLateScan alert in monitoring/lifecycle/alerts.yaml still depends on this metric with the description "Last lifecycle scan was performed more than … ago." With the new timing, a conductor that starts scans but never finishes them would keep refreshing this metric, delaying alert firing. Consider adding a companion alert on s3_lifecycle_latest_batch_end_time to catch scans that start but never complete, or updating the alert description.

— Claude Code

@claude
Copy link
Copy Markdown

claude Bot commented Apr 27, 2026

  • extensions/lifecycle/conductor/LifecycleConductor.js:383 — onProcessBuckets moved from scan completion to scan start, changing the semantics of s3_lifecycle_latest_batch_start_time. The existing LifecycleLateScan alert in alerts.yaml depends on this metric but its description and behavior assume completion semantics. Consider adding a companion alert on s3_lifecycle_latest_batch_end_time for stuck-scan detection, or updating the alert description.

    Overall the PR is well-structured: the conductor scan id threading is thorough across tasks and task versions, the stale-metric cleanup in removeStaleBucketProcessorScanMetrics correctly bounds prom-client memory, error handling is consistent, and test coverage is solid.

    Review by Claude Code

@benzekrimaha benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch from 3d8286a to ca6f753 Compare April 29, 2026 16:10
@claude
Copy link
Copy Markdown

claude Bot commented Apr 29, 2026

LGTM - well-structured monitoring enhancement. The conductorScanId propagation is consistent across all lifecycle task types, the stale metric cleanup in LifecycleMetrics avoids unbounded prom-client memory growth, the semantic change to onProcessBuckets (called at scan start rather than end) is correctly reflected in the alert wording, and the new onConductorScanComplete metrics cover the completion/duration use case. Test coverage is thorough.

Review by Claude Code

@claude
Copy link
Copy Markdown

claude Bot commented Apr 29, 2026

LGTM

Review by Claude Code

@claude
Copy link
Copy Markdown

claude Bot commented Apr 29, 2026

LGTM — Clean, well-tested implementation. The scan-id propagation is thorough, the conductor_scan_id label cardinality is bounded by the 24h stale-series cleanup, and the semantic shift of onProcessBuckets from scan-completion to scan-start is correctly reflected in the alert wording. New metrics follow existing naming conventions and the dashboard additions are consistent.

Review by Claude Code

@benzekrimaha benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch from bc5ef91 to 1a88160 Compare April 29, 2026 16:33
@claude
Copy link
Copy Markdown

claude Bot commented Apr 29, 2026

LGTM — clean, well-tested addition of conductor scan monitoring. The scan-id propagation through conductor → bucket tasks → bucket processor is consistent, the per-scan counter cleanup via removeStaleBucketProcessorScanMetrics bounds prom-client memory, the semantic fix moving onProcessBuckets to scan start (with matching alert text updates) is correct, and the new onConductorScanComplete metrics (end time, duration, bucket count) give dashboards what they need without PromQL mid-scan sign-flip issues. Test coverage is thorough (error paths, stale cleanup, no-op on missing scan id, UUID format). No issues found.

Review by Claude Code

@benzekrimaha benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch from 1a88160 to 6967a0b Compare April 29, 2026 16:39
const start = new Date();
const start = Date.now();
const scanId = uuid();
this._currentScanId = scanId;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Race condition: this._currentScanId is assigned before _controlBacklog (line 437). If a cron-triggered processBuckets call overlaps with a running scan, this line overwrites _currentScanId with the new UUID, then the throttling path (line 468) nulls it out — corrupting _currentScanId for the in-flight scan. Any _taskToMessage calls that run after the overwrite would embed conductorScanId: null into their Kafka messages.

Impact is observability-only (no correctness issue), but easily fixable: move this._currentScanId = scanId into the _batchInProgress = true block (line 442) so it is only set after backlog control passes.

Suggested change
this._currentScanId = scanId;
this._currentScanId = scanId;

— Claude Code

@claude
Copy link
Copy Markdown

claude Bot commented Apr 29, 2026

  • extensions/lifecycle/conductor/LifecycleConductor.js:381 — Race condition: this._currentScanId is set before _controlBacklog check. If a cron-triggered processBuckets call overlaps with an in-flight scan, it overwrites and then nulls out _currentScanId, causing remaining in-flight messages to embed conductorScanId: null. Move the assignment to after backlog control passes (inside the _batchInProgress = true block). Impact is observability-only.

    Review by Claude Code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants