Slow crawling memory leak with log_to_metric → aggregate → prometheus_remote_write pipeline

### A note for the community


* Please vote on this issue by adding a 👍 [reaction](https://blog.github.com/2016-03-10-add-reactions-to-pull-requests-issues-and-comments/) to the original issue to help the community and maintainers prioritize this request
* If you are interested in working on this issue or have submitted a pull request, please leave a comment



### Problem

We're experiencing a slow, steady memory leak in Vector running as a Stateless-Aggregator on Kubernetes. Memory usage grows linearly over time until the pod eventually gets OOM-killed.

The pipeline is:
```
aws_s3 source → remap (ALB log parsing) → log_to_metric → aggregate → prometheus_remote_write
```

The memory growth is clearly visible in the Grafana memory usage graph below.

<img width="1391" height="962" alt="Image" src="https://github.com/user-attachments/assets/3702513c-954c-467d-8746-906ce163bf54" />

### Environment

- **Vector version**: 0.55.0 (Helm chart 0.52.0)
- **Deployment**: Kubernetes (EKS), Stateless-Aggregator role
- **Resources**: requests 1 CPU / 2Gi memory, limits 3 CPU / 6Gi memory
- **Persistence**: disabled
- **Replicas**: 1

### Configuration

```yaml
sources:
  alb:
    type: aws_s3
    region: us-east-1
    compression: gzip
    sqs:
      client_concurrency: 1
      max_number_of_messages: 1
      queue_url: <redacted>

  vector_metrics:
    type: internal_metrics
    scrape_interval_secs: 1

transforms:
  parser:
    type: remap
    inputs:
      - alb
    source: |
      alb_regex = r'^\S+ (?<timestamp>\S+) \S+ (?<client_ip>[^:]+):\d+ \S+ (?<request_processing_time>[-.\d]+) \S+ \S+ (?<elb_status_code>[-\d]+) \S+ (?<received_bytes>[-\d]+) (?<sent_bytes>[-\d]+) "(?<request_verb>\S+) \S+:\/\/[^\/]+(?<request_path>\/(?<service>[^\/?#]+)[^?\s"]*)[^"]*" "(?<user_agent>[^"]*)" .*$'
      parsed, err = parse_regex(.message, alb_regex)
      if err != null {
        abort
      }

      if !includes(["svc-a", "svc-b", "svc-c", "svc-d", "svc-e", "svc-f", "svc-g", "svc-h", "svc-i"], parsed.service) {
        abort
      }

      . = parsed

      ts, err = parse_timestamp(.timestamp, format: "%Y-%m-%dT%H:%M:%S.%fZ")
      if err != null { abort }
      .timestamp = ts

      .elb_status_code = to_int!(.elb_status_code)
      .received_bytes = to_int!(.received_bytes)
      .sent_bytes = to_int!(.sent_bytes)
      .request_processing_time = to_float!(.request_processing_time)
      .user_agent = to_string(.user_agent)

      if contains(.user_agent, "/") {
        .user_agent = split(.user_agent, "/")[0]
      } else {
        .user_agent = "other"
      }

      # Replace IDs/UUIDs in paths to reduce cardinality
      id_patterns = r'[0-9a-fA-F]{8}-(?:[0-9a-fA-F]{4}-){3}[0-9a-fA-F]{12}|[A-Z0-9]{3}(?:-[A-Z0-9]{3}){1,2}|\b[0-9a-fA-F]{20,}\b|\b\d{2,}\b'
      .request_path = replace!(.request_path, id_patterns, ":id")

  alb_metrics:
    type: log_to_metric
    inputs:
      - parser
    metrics:
      - type: counter
        name: requests_total
        field: elb_status_code
        tags:
          service: "{{ service }}"
          request_verb: "{{ request_verb }}"
          request_path: "{{ request_path }}"
          status_code: "{{ elb_status_code }}"
          user_agent: "{{ user_agent }}"

      - type: histogram
        name: request_processing_time_seconds
        field: request_processing_time
        tags:
          service: "{{ service }}"
          request_verb: "{{ request_verb }}"
          request_path: "{{ request_path }}"

      - type: summary
        name: response_received_bytes
        field: received_bytes
        tags:
          service: "{{ service }}"
          request_verb: "{{ request_verb }}"
          request_path: "{{ request_path }}"

      - type: summary
        name: response_sent_bytes
        field: sent_bytes
        tags:
          service: "{{ service }}"
          request_verb: "{{ request_verb }}"
          request_path: "{{ request_path }}"

  alb_metrics_aggregated:
    type: aggregate
    inputs:
      - alb_metrics
    interval_ms: 15000

  vector_metrics_aggregated:
    type: aggregate
    inputs:
      - vector_metrics
    interval_ms: 30000

sinks:
  prom_alb:
    type: prometheus_remote_write
    inputs:
      - alb_metrics_aggregated
    quantiles: []
    default_namespace: alb
    endpoint: http://prometheus.monitoring.svc.cluster.local:9090/api/v1/write
    expire_metrics_secs: 600
    batch:
      aggregate: false
      max_events: 1000
      timeout_secs: 1
    healthcheck:
      enabled: false

  prom_internal:
    type: prometheus_remote_write
    inputs:
      - vector_metrics_aggregated
    endpoint: http://prometheus.monitoring.svc.cluster.local:9090/api/v1/write
    expire_metrics_secs: 600
    batch:
      aggregate: false
      max_events: 1000
      timeout_secs: 1
    healthcheck:
      enabled: false
```
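
To illustrate what the path normalization in the `parser` transform does and does not catch, here is a minimal Python sketch that applies the same `id_patterns` expression to a few made-up paths. It assumes Python's `re` module treats this pattern the same way as the regex engine used by VRL; `normalize_path` and the sample paths are hypothetical and only for illustration.

```python
import re

# Same expression as `id_patterns` in the remap transform above,
# split across lines for readability.
ID_PATTERNS = re.compile(
    r"[0-9a-fA-F]{8}-(?:[0-9a-fA-F]{4}-){3}[0-9a-fA-F]{12}"  # UUIDs
    r"|[A-Z0-9]{3}(?:-[A-Z0-9]{3}){1,2}"                      # codes like AB1-CD2
    r"|\b[0-9a-fA-F]{20,}\b"                                  # long hex strings
    r"|\b\d{2,}\b"                                            # numbers with 2+ digits
)


def normalize_path(path: str) -> str:
    """Replace every ID-like match with ':id' (hypothetical helper)."""
    return ID_PATTERNS.sub(":id", path)


# Made-up sample paths, not taken from real ALB logs.
samples = [
    "/svc-a/orders/123e4567-e89b-12d3-a456-426614174000/items",
    "/svc-b/users/42",
    "/svc-b/users/7",             # single digit: not matched by \d{2,}
    "/svc-c/coupons/AB1-CD2",
    "/svc-d/search/winter-sale",  # free-text segment: left unchanged
]

for path in samples:
    print(f"{path} -> {normalize_path(path)}")
```

In this sketch the UUID, the two-digit ID, and the `AB1-CD2` code collapse to `:id`, while the single-digit ID and the free-text segment pass through unchanged and would therefore still produce distinct `request_path` tag values.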

### Observations

- Memory grows linearly and steadily — classic slow leak pattern, not a spike
- `persistence` is disabled, so no disk buffer is involved
- We already apply ID/UUID normalization in request paths to reduce cardinality
- `expire_metrics_secs: 600` is set on both `prometheus_remote_write` sinks
- The `aggregate` transform does not specify a `mode` (uses default)
- No unusual errors in Vector's internal logs
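
To check offline whether tag cardinality could contribute to the growth, a sketch like the one below counts the distinct tag combinations that the `requests_total` counter would produce from a sample of parsed events. This is only an assumption-laden illustration: `series_key`, `count_series`, and the sample events are hypothetical and not part of Vector; the field names are the ones used by the `log_to_metric` config above.

```python
from collections import Counter
from typing import Iterable, Mapping, Tuple

# Event fields that feed the tags of the `requests_total` counter above.
COUNTER_FIELDS = (
    "service",
    "request_verb",
    "request_path",
    "elb_status_code",
    "user_agent",
)


def series_key(event: Mapping[str, object]) -> Tuple[str, ...]:
    """Build the tag tuple that identifies one time series (hypothetical helper)."""
    return tuple(str(event.get(field, "")) for field in COUNTER_FIELDS)


def count_series(events: Iterable[Mapping[str, object]]) -> Counter:
    """Count events per distinct tag combination."""
    return Counter(series_key(event) for event in events)


# Made-up parsed events, shaped like the output of the `parser` transform.
base = {
    "service": "svc-a",
    "request_verb": "GET",
    "request_path": "/svc-a/orders/:id",
    "elb_status_code": 200,
    "user_agent": "curl",
}
events = [
    base,
    dict(base),                                     # same series as `base`
    dict(base, user_agent="Mozilla"),               # new user_agent value
    dict(base, request_path="/svc-a/orders/7"),     # path that was not normalized
]

series = count_series(events)
print(f"{len(events)} events -> {len(series)} distinct series")
```

If the number of distinct series keeps climbing as more log files are fed in, that would point toward unbounded tag values rather than a leak in the transform or sink itself.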

### Possibly related issues

- #23093 — Slow memory leak in vector aggregate (open, looks very similar)
- #24357 / #24943 — Fixes for aggregate transform memory leaks with high-cardinality metrics
- #20470 — prometheus_remote_write memory leak (fixed in v0.49.0)
- #15295 — prometheus_remote_write + distribution memory leak (open)



### Version

0.55.0

Slow crawling memory leak with log_to_metric → aggregate → prometheus_remote_write pipeline #25424
