Skip to content

Add Prometheus scrape/pull source to prometheus plugin [Adding in the same Prometheus Remote write source] #6743

Merged
dlvenable merged 3 commits into
opensearch-project:mainfrom
srikanthpadakanti:prometheus-scrape-source
Apr 21, 2026
Merged

Add Prometheus scrape/pull source to prometheus plugin [Adding in the same Prometheus Remote write source] #6743
dlvenable merged 3 commits into
opensearch-project:mainfrom
srikanthpadakanti:prometheus-scrape-source

Conversation

@srikanthpadakanti

Copy link
Copy Markdown
Collaborator

Description

Add Prometheus scrape/pull source to the existing prometheus plugin. The source periodically HTTP GETs target /metrics endpoints, parses the Prometheus text exposition format, and converts metrics into Data Prepper events (Gauge, Sum, Histogram, Summary). Supports static target URLs, configurable scrape interval/timeout, HTTP Basic Auth, Bearer Token, and TLS certificates. Both scrape and remote write modes can run simultaneously under the same plugin.

Three usage modes:

 Mode 1: Remote write only (existing behavior, unchanged)
  source:
    prometheus:
      port: 9090

Mode 2: Scrape only
  source:
    prometheus:
      scrape:
        targets:
          - url: "http://host:9090/metrics"

Mode 3: Both simultaneously
  source:
    prometheus:
      port: 9090
      scrape:
        targets:
          - url: "http://host:9090/metrics"

Issues Resolved

#1997
Resolves #1997

Check List

  • [ X ] New functionality includes testing.
  • New functionality has a documentation issue. Please link to it in this PR.
    • [ X ] New functionality has javadoc added
  • [ X ] Commits are signed with a real name per the DCO

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@dlvenable dlvenable left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @srikanthpadakanti .

At a high-level would a scraper be expected to get data from the same Prometheus nodes pushing? I'm wondering if it makes sense to have a separate source or not.

@JsonProperty("http_basic")
private HttpBasicConfig httpBasic;

@JsonProperty("bearer_token")

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this should be an oauth2 group with only bearer_token for consistency.

scrape:
  oauth2:
    bearer_token:

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bearer token here is a static token passed as an HTTP Authorization header, not an OAuth2 flow. Nesting it under oauth2 would imply token endpoint and refresh support that we do not have. Kept it flat under authentication to match the actual behavior.

Happy to add an oauth2 group later when we support the full flow with token exchange.

final HttpHeadersBuilder builder = HttpHeaders.builder()
.add(HttpHeaderNames.ACCEPT, ACCEPT_HEADER_VALUE);

if (config.getAuthentication() != null) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would probably be ideal to have small classes and a common interface for updating the request based on the authentication. This could allow more complicated auth schemes like SigV4 in the future.

I had some similar comments on an http sink PR that is in progress currently. #6747 (comment). Though my comments went the other way. Have the code more amenable to basic auth instead of SigV4. 😄

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extracted into a ScrapeRequestAuthenticator interface with BasicAuthenticator and BearerTokenAuthenticator as separate classes. Factory method handles selection. Adding SigV4 or other schemes in the future is just a new implementation without touching the scraper.

@JsonProperty("flatten_labels")
private boolean flattenLabels = false;

@JsonProperty("insecure")

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If any of the URLs in targets has a non-HTTPS scheme we should throw a validation error unless insecure is true.

You can add a JSR @AssertTrue to handle this in this class.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.
Added @AssertTrue validation.

Signed-off-by: Srikanth Padakanti <srikanth_padakanti@apple.com>
@srikanthpadakanti srikanthpadakanti force-pushed the prometheus-scrape-source branch from d118aa1 to 713b2d2 Compare April 10, 2026 17:29
@srikanthpadakanti

Copy link
Copy Markdown
Collaborator Author

Thanks @srikanthpadakanti .

At a high-level would a scraper be expected to get data from the same Prometheus nodes pushing? I'm wondering if it makes sense to have a separate source or not.

A scraper would not typically target the same nodes pushing via remote write. They serve different use cases: remote write is Prometheus servers configured to push, while scrape pulls from any endpoint exposing the Prometheus text exposition format, including application instances that are not Prometheus servers themselves.

The reason they share a plugin is the precedent from kkondaka's feedback on PR #6627 where the recommendation was to keep it as a single prometheus source with config options for different modes, similar to how s3 source handles s3-sqs and s3-scan under one plugin. Both modes can also run simultaneously which simplifies pipeline configuration when users need both ingestion paths.

That said, if you prefer a separate source I am open to splitting it out.

Signed-off-by: Srikanth Padakanti <srikanth_padakanti@apple.com>
@dlvenable

Copy link
Copy Markdown
Member

Thanks @srikanthpadakanti .
At a high-level would a scraper be expected to get data from the same Prometheus nodes pushing? I'm wondering if it makes sense to have a separate source or not.

A scraper would not typically target the same nodes pushing via remote write. They serve different use cases: remote write is Prometheus servers configured to push, while scrape pulls from any endpoint exposing the Prometheus text exposition format, including application instances that are not Prometheus servers themselves.

The reason they share a plugin is the precedent from kkondaka's feedback on PR #6627 where the recommendation was to keep it as a single prometheus source with config options for different modes, similar to how s3 source handles s3-sqs and s3-scan under one plugin. Both modes can also run simultaneously which simplifies pipeline configuration when users need both ingestion paths.

That said, if you prefer a separate source I am open to splitting it out.

@srikanthpadakanti , @kkondaka ,

I think this should be a different source and that the S3 analogy doesn't quite fit here.

The S3 source is entirely pull-based. Whether it is scan or SQS or EventBridge, it is pulling the data from S3. In no case does it open a port. The scan and the SQS are just two different ways to discover the content. The S3 source pulls the data from S3 the exact same way and uses the same code path regardless of the discovery. Additionally we had also once wanted the ability to scan and then start using SQS for a CDC-like behavior, similar to what we do for DynamoDB/RDS.

In this case, there is a big difference between opening a port and going out to an external system to make requests. The source configuration will have a security difference depending on the configuration and may not be clear.

I recommend that we split this into a prometheus_pull source (maybe there is a better name). @kkondaka , please chime in.

@kkondaka

Copy link
Copy Markdown
Collaborator

@dlvenable Whether it is pull or push is not something users/customers know about or care about, right? Why make it difficult for the users?

@dlvenable

Copy link
Copy Markdown
Member

@kkondaka ,

Whether it is pull or push is not something users/customers know about or care about, right? Why make it difficult for the users?

I think this is not at all the case. Users must know whether it is push or pull because that is how they get the data there right? If it is pull-based, they must enable a scrape URI and provide that. If it is push-based they must configure their remote-write protocol.

I think this simplifies it for users because these are in reality two different sources with disjoint configurations. The documentation will need to say for just about every field whether it is remote-write or pull/scape. It seems more straightforward as a user to start with one model and see the configurations only for that model.

In what way would two sources make the the configuration more complicated for users?

@dlvenable

Copy link
Copy Markdown
Member

@kkondaka , I suggest that we accept this PR and discuss further on the source.

public class ScrapeAuthenticationConfig {

@Valid
@JsonProperty("http_basic")

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like these auth mechanisms are common for all http. We should move them to a common location.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

acknowledge, offer to do it as a follow-up

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have created a new issue for the followup - #6766

…ource

Signed-off-by: Srikanth Padakanti <srikanth_padakanti@apple.com>
@srikanthpadakanti

srikanthpadakanti commented Apr 20, 2026

Copy link
Copy Markdown
Collaborator Author

@kkondaka , I suggest that we accept this PR and discuss further on the source.

Makes sense. Opened an issue to track moving these to a shared module. Will do that in a follow-up PR so other HTTP plugins can reuse them.

@dlvenable dlvenable left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @srikanthpadakanti for this contribution!

@dlvenable dlvenable merged commit 4990f26 into opensearch-project:main Apr 21, 2026
70 of 72 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Ingest data from Prometheus as Source

3 participants