Skip to content

refactor(metrics): migrate buffer/retry metrics to unified VendorAPIMetricsRecorder#6428

Merged
graytaylor0 merged 1 commit into
opensearch-project:mainfrom
chrisale000:BufferMetricMigrtion
Jan 26, 2026
Merged

refactor(metrics): migrate buffer/retry metrics to unified VendorAPIMetricsRecorder#6428
graytaylor0 merged 1 commit into
opensearch-project:mainfrom
chrisale000:BufferMetricMigrtion

Conversation

@chrisale000

@chrisale000 chrisale000 commented Jan 23, 2026

Copy link
Copy Markdown
Contributor

Description

This PR migrates buffer and retry metrics to the unified VendorAPIMetricsRecorder, consolidating metric recording across the codebase.

Changes

  • Consolidate buffer write metrics (latency, attempts, success, retry, failures)
  • Add error categorization metrics (retryable/non-retryable errors)
  • Enhance logsRequested() method to accept log count parameter
  • Comprehensive test coverage for all new metrics functionality

Testing

  • All unit tests pass
  • Spun up a pipeline locally and confirmed metrics were instantiated and emitted.

Signed-off-by: Alexander Christensen alchrisk@amazon.com


Runnable leaderScheduler = new LeaderScheduler(coordinator, crawler);
LeaderScheduler leaderScheduler = new LeaderScheduler(coordinator, crawler);
leaderScheduler.setLeaseInterval(sourceConfig.getLeaseInterval());

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you remind me again that 1 min is the current leaseInterval for LeaderScheduler?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the LeaderScheduler currently uses a 1-minute leaseInterval, which is why we set the default to 1 minute in CrawlerSourceConfig.

@chrisale000 chrisale000 Jan 26, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extracted this change into a separate PR for independent review: #6432

bufferWriteFailuresCounter.increment();
throw e;
}
metricsRecorder.recordBufferWriteLatency(() -> {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we don't need to wrap this with the try-catch block for error right?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct! We don't need the try-catch wrapper for two reasons: 1) writeRecordsWithRetry() already calls metricsRecorder.recordBufferWriteFailure() internally when exceptions occur, so wrapping it would double-count buffer write failures, and 2) it's already inside the outer try-catch block in executePartition() which properly handles all exceptions with metrics recording and retryable error categorization.

* CrawlerClient implementation for Office 365 Management API audit logs.
*
* Handles partition execution, audit log retrieval, JSON processing,
* and buffer management with retry logic and comprehensive metrics recording.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

am open to discussion, but I have concerns adding so many ReadMe information for each function.It is because one main goal is to add AI automation for future review and even code generation.One challenge is to avoid hitting context window. Do you think it make sense to remove function description on each connector?I am fine to keep it for shareable classes like VendorAPIMetricsRecorder

*
* @return Duration for lease interval
*/
default Duration getLeaseInterval() {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we avoid having so many changes in one PR/CR? It would slow down the review.In this case, getLeaseInterval should have its own PR so folks can discuss this independently. It is also higher priority given it is high impact.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extracted this change into a separate PR for independent review: #6432

…etricsRecorder

Signed-off-by: Alexander Christensen <alchrisk@amazon.com>
@graytaylor0 graytaylor0 merged commit 08bd403 into opensearch-project:main Jan 26, 2026
52 checks passed
san81 pushed a commit to san81/data-prepper that referenced this pull request Jan 27, 2026
…etricsRecorder (opensearch-project#6428)

Signed-off-by: Alexander Christensen <alchrisk@amazon.com>
simonelbaz pushed a commit to simonelbaz/data-prepper that referenced this pull request Jan 31, 2026
…etricsRecorder (opensearch-project#6428)

Signed-off-by: Alexander Christensen <alchrisk@amazon.com>
Signed-off-by: Simon ELBAZ <elbazsimon9@gmail.com>
simonelbaz pushed a commit to simonelbaz/data-prepper that referenced this pull request Jan 31, 2026
…etricsRecorder (opensearch-project#6428)

Signed-off-by: Alexander Christensen <alchrisk@amazon.com>
simonelbaz pushed a commit to simonelbaz/data-prepper that referenced this pull request Jan 31, 2026
…etricsRecorder (opensearch-project#6428)

Signed-off-by: Alexander Christensen <alchrisk@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants