Slow crawling memory leak with log_to_metric → aggregate → prometheus_remote_write pipeline #25424

@hrytskivr-tl

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

We're experiencing a slow, steady memory leak in Vector running as a Stateless-Aggregator on Kubernetes. Memory usage grows linearly over time until the pod eventually gets OOM-killed.

The pipeline is:

aws_s3 source → remap (ALB log parsing) → log_to_metric → aggregate → prometheus_remote_write

The memory growth is clearly visible in Grafana (screenshot attached below).

[Image: Grafana memory usage]

Environment

  • Vector version: 0.55.0 (Helm chart 0.52.0)
  • Deployment: Kubernetes (EKS), Stateless-Aggregator role
  • Resources: requests 1 CPU / 2Gi memory, limits 3 CPU / 6Gi memory
  • Persistence: disabled
  • Replicas: 1

Configuration

sources:
  alb:
    type: aws_s3
    region: us-east-1
    compression: gzip
    sqs:
      client_concurrency: 1
      max_number_of_messages: 1
      queue_url: <redacted>

  vector_metrics:
    type: internal_metrics
    scrape_interval_secs: 1

transforms:
  parser:
    type: remap
    inputs:
      - alb
    source: |
      alb_regex = r'^\S+ (?<timestamp>\S+) \S+ (?<client_ip>[^:]+):\d+ \S+ (?<request_processing_time>[-.\d]+) \S+ \S+ (?<elb_status_code>[-\d]+) \S+ (?<received_bytes>[-\d]+) (?<sent_bytes>[-\d]+) "(?<request_verb>\S+) \S+:\/\/[^\/]+(?<request_path>\/(?<service>[^\/?#]+)[^?\s"]*)[^"]*" "(?<user_agent>[^"]*)" .*$'
      parsed, err = parse_regex(.message, alb_regex)
      if err != null {
        abort
      }

      if !includes(["svc-a", "svc-b", "svc-c", "svc-d", "svc-e", "svc-f", "svc-g", "svc-h", "svc-i"], parsed.service) {
        abort
      }

      . = parsed

      ts, err = parse_timestamp(.timestamp, format: "%Y-%m-%dT%H:%M:%S.%fZ")
      if err != null { abort }
      .timestamp = ts

      .elb_status_code = to_int!(.elb_status_code)
      .received_bytes = to_int!(.received_bytes)
      .sent_bytes = to_int!(.sent_bytes)
      .request_processing_time = to_float!(.request_processing_time)
      .user_agent = to_string(.user_agent)

      if contains(.user_agent, "/") {
        .user_agent = split(.user_agent, "/")[0]
      } else {
        .user_agent = "other"
      }

      # Replace IDs/UUIDs in paths to reduce cardinality
      id_patterns = r'[0-9a-fA-F]{8}-(?:[0-9a-fA-F]{4}-){3}[0-9a-fA-F]{12}|[A-Z0-9]{3}(?:-[A-Z0-9]{3}){1,2}|\b[0-9a-fA-F]{20,}\b|\b\d{2,}\b'
      .request_path = replace!(.request_path, id_patterns, ":id")

  alb_metrics:
    type: log_to_metric
    inputs:
      - parser
    metrics:
      - type: counter
        name: requests_total
        field: elb_status_code
        tags:
          service: "{{ service }}"
          request_verb: "{{ request_verb }}"
          request_path: "{{ request_path }}"
          status_code: "{{ elb_status_code }}"
          user_agent: "{{ user_agent }}"

      - type: histogram
        name: request_processing_time_seconds
        field: request_processing_time
        tags:
          service: "{{ service }}"
          request_verb: "{{ request_verb }}"
          request_path: "{{ request_path }}"

      - type: summary
        name: response_received_bytes
        field: received_bytes
        tags:
          service: "{{ service }}"
          request_verb: "{{ request_verb }}"
          request_path: "{{ request_path }}"

      - type: summary
        name: response_sent_bytes
        field: sent_bytes
        tags:
          service: "{{ service }}"
          request_verb: "{{ request_verb }}"
          request_path: "{{ request_path }}"

  alb_metrics_aggregated:
    type: aggregate
    inputs:
      - alb_metrics
    interval_ms: 15000

  vector_metrics_aggregated:
    type: aggregate
    inputs:
      - vector_metrics
    interval_ms: 30000

sinks:
  prom_alb:
    type: prometheus_remote_write
    inputs:
      - alb_metrics_aggregated
    quantiles: []
    default_namespace: alb
    endpoint: http://prometheus.monitoring.svc.cluster.local:9090/api/v1/write
    expire_metrics_secs: 600
    batch:
      aggregate: false
      max_events: 1000
      timeout_secs: 1
    healthcheck:
      enabled: false

  prom_internal:
    type: prometheus_remote_write
    inputs:
      - vector_metrics_aggregated
    endpoint: http://prometheus.monitoring.svc.cluster.local:9090/api/v1/write
    expire_metrics_secs: 600
    batch:
      aggregate: false
      max_events: 1000
      timeout_secs: 1
    healthcheck:
      enabled: false

Observations

  • Memory grows linearly and steadily: a classic slow-leak pattern, not a spike
  • persistence is disabled, so no disk buffer is involved
  • We already apply ID/UUID normalization in request paths to reduce cardinality
  • expire_metrics_secs: 600 is set on both prometheus_remote_write sinks
  • Neither aggregate transform specifies a mode, so both use the default
  • No unusual errors in Vector's internal logs
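
To sanity-check the cardinality observation above, a sample of parsed requests could be replayed offline through the same normalization and the distinct tag combinations counted. Below is a minimal Python sketch of that idea. The function names and the sample records are hypothetical and exist only for illustration; the ID pattern is copied from the remap transform. Python's `re` module is not the regex engine VRL uses, so the results should be treated as an approximation rather than an exact reproduction of what Vector does.

```python
import re

# ID/UUID pattern copied from the remap transform in the configuration above.
# Assumption: Python's `re` treats this pattern closely enough to VRL's engine.
ID_PATTERN = re.compile(
    r"[0-9a-fA-F]{8}-(?:[0-9a-fA-F]{4}-){3}[0-9a-fA-F]{12}"
    r"|[A-Z0-9]{3}(?:-[A-Z0-9]{3}){1,2}"
    r"|\b[0-9a-fA-F]{20,}\b"
    r"|\b\d{2,}\b"
)


def normalize_path(path):
    """Mirror `replace!(.request_path, id_patterns, ":id")`."""
    return ID_PATTERN.sub(":id", path)


def normalize_user_agent(user_agent):
    """Mirror the user_agent branch: text before the first "/", else "other"."""
    return user_agent.split("/")[0] if "/" in user_agent else "other"


def count_series(records):
    """Count distinct tag combinations for the requests_total counter."""
    series = set()
    for record in records:
        series.add((
            record["service"],
            record["request_verb"],
            normalize_path(record["request_path"]),
            record["elb_status_code"],
            normalize_user_agent(record["user_agent"]),
        ))
    return len(series)


def make_record(path, user_agent):
    """Build a hypothetical parsed record; only path and user agent vary."""
    return {
        "service": "svc-a",
        "request_verb": "GET",
        "request_path": path,
        "elb_status_code": 200,
        "user_agent": user_agent,
    }


# Hypothetical sample, not taken from real ALB logs.
SAMPLE = [
    make_record("/svc-a/users/12345", "curl/8.5.0"),
    make_record("/svc-a/users/67890", "curl/7.88.1"),
    make_record("/svc-a/users/7", "curl/8.5.0"),
    make_record("/svc-a/users/abc123", "curl/8.5.0"),
]

if __name__ == "__main__":
    for record in SAMPLE:
        print(record["request_path"], "->", normalize_path(record["request_path"]))
    print("distinct series:", count_series(SAMPLE))
```

In this made-up sample the four requests collapse into three series, because a single-digit ID and a lowercase alphanumeric ID are not matched by the pattern. The user_agent tag also keeps whatever text the client sent before the first "/". Whether either case occurs in the real traffic is not known; running the same count over an actual log sample at different times of day would show whether the number of series keeps growing.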

Possibly related issues

Version

0.55.0
