A note for the community
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
Problem
Slow crawling memory leak with log_to_metric → aggregate → prometheus_remote_write pipeline
Problem
We're experiencing a slow, steady memory leak in Vector running as a Stateless-Aggregator on Kubernetes. Memory usage grows linearly over time until the pod eventually gets OOM-killed.
The pipeline is:
aws_s3 source → remap (ALB log parsing) → log_to_metric → aggregate → prometheus_remote_write
The memory growth is clearly visible in Grafana (screenshot attached below).
[Screenshot: Grafana memory usage]
Environment
- Vector version: 0.55.0 (Helm chart 0.52.0)
- Deployment: Kubernetes (EKS), Stateless-Aggregator role
- Resources: requests 1 CPU / 2Gi memory, limits 3 CPU / 6Gi memory
- Persistence: disabled
- Replicas: 1
Configuration
sources:
  alb:
    type: aws_s3
    region: us-east-1
    compression: gzip
    sqs:
      client_concurrency: 1
      max_number_of_messages: 1
      queue_url: <redacted>
  vector_metrics:
    type: internal_metrics
    scrape_interval_secs: 1
transforms:
  parser:
    type: remap
    inputs:
      - alb
    source: |
      alb_regex = r'^\S+ (?<timestamp>\S+) \S+ (?<client_ip>[^:]+):\d+ \S+ (?<request_processing_time>[-.\d]+) \S+ \S+ (?<elb_status_code>[-\d]+) \S+ (?<received_bytes>[-\d]+) (?<sent_bytes>[-\d]+) "(?<request_verb>\S+) \S+:\/\/[^\/]+(?<request_path>\/(?<service>[^\/?#]+)[^?\s"]*)[^"]*" "(?<user_agent>[^"]*)" .*$'
      parsed, err = parse_regex(.message, alb_regex)
      if err != null {
        abort
      }
      if !includes(["svc-a", "svc-b", "svc-c", "svc-d", "svc-e", "svc-f", "svc-g", "svc-h", "svc-i"], parsed.service) {
        abort
      }
      . = parsed
      ts, err = parse_timestamp(.timestamp, format: "%Y-%m-%dT%H:%M:%S.%fZ")
      if err != null { abort }
      .timestamp = ts
      .elb_status_code = to_int!(.elb_status_code)
      .received_bytes = to_int!(.received_bytes)
      .sent_bytes = to_int!(.sent_bytes)
      .request_processing_time = to_float!(.request_processing_time)
      .user_agent = to_string(.user_agent)
      if contains(.user_agent, "/") {
        .user_agent = split(.user_agent, "/")[0]
      } else {
        .user_agent = "other"
      }
      # Replace IDs/UUIDs in paths to reduce cardinality
      id_patterns = r'[0-9a-fA-F]{8}-(?:[0-9a-fA-F]{4}-){3}[0-9a-fA-F]{12}|[A-Z0-9]{3}(?:-[A-Z0-9]{3}){1,2}|\b[0-9a-fA-F]{20,}\b|\b\d{2,}\b'
      .request_path = replace!(.request_path, id_patterns, ":id")
  alb_metrics:
    type: log_to_metric
    inputs:
      - parser
    metrics:
      - type: counter
        name: requests_total
        field: elb_status_code
        tags:
          service: "{{ service }}"
          request_verb: "{{ request_verb }}"
          request_path: "{{ request_path }}"
          status_code: "{{ elb_status_code }}"
          user_agent: "{{ user_agent }}"
      - type: histogram
        name: request_processing_time_seconds
        field: request_processing_time
        tags:
          service: "{{ service }}"
          request_verb: "{{ request_verb }}"
          request_path: "{{ request_path }}"
      - type: summary
        name: response_received_bytes
        field: received_bytes
        tags:
          service: "{{ service }}"
          request_verb: "{{ request_verb }}"
          request_path: "{{ request_path }}"
      - type: summary
        name: response_sent_bytes
        field: sent_bytes
        tags:
          service: "{{ service }}"
          request_verb: "{{ request_verb }}"
          request_path: "{{ request_path }}"
  alb_metrics_aggregated:
    type: aggregate
    inputs:
      - alb_metrics
    interval_ms: 15000
  vector_metrics_aggregated:
    type: aggregate
    inputs:
      - vector_metrics
    interval_ms: 30000
sinks:
  prom_alb:
    type: prometheus_remote_write
    inputs:
      - alb_metrics_aggregated
    quantiles: []
    default_namespace: alb
    endpoint: http://prometheus.monitoring.svc.cluster.local:9090/api/v1/write
    expire_metrics_secs: 600
    batch:
      aggregate: false
      max_events: 1000
      timeout_secs: 1
    healthcheck:
      enabled: false
  prom_internal:
    type: prometheus_remote_write
    inputs:
      - vector_metrics_aggregated
    endpoint: http://prometheus.monitoring.svc.cluster.local:9090/api/v1/write
    expire_metrics_secs: 600
    batch:
      aggregate: false
      max_events: 1000
      timeout_secs: 1
    healthcheck:
      enabled: false
Observations
- Memory grows linearly and steadily: a classic slow-leak pattern, not a spike
- persistence is disabled, so no disk buffer is involved
- We already apply ID/UUID normalization in request paths to reduce cardinality
- expire_metrics_secs: 600 is set on both prometheus_remote_write sinks
- The aggregate transform does not specify a mode (uses default)
- No unusual errors in Vector's internal logs
Possibly related issues
- Diffmode #24943: Fixes for aggregate transform memory leaks with high-cardinality metrics
Configuration
Version
0.55.0
Debug Output
Example Data
No response
Additional Context
No response
References
No response