Kubernetes API Server Impact During Collector Restarts

Table of Contents

Overview
Background
Resource Implications Per kubernetes_logs Source
- API Server Connections
Impact During Collector Restarts
Best Practices and Recommendations
Troubleshooting
Related Documentation
References

Overview

This document describes the behavior and impact of ClusterLogForwarder input configuration on the Kubernetes API server, particularly during sudden collector pod restarts and redeployments. Understanding these impacts is crucial for maintaining cluster stability and avoiding excessive load on the Kubernetes API server.

Background

The collector (Vector) uses the kubernetes_logs source to collect container logs from pods running on each node. Each application or infrastructure container input configured in a ClusterLogForwarder becomes a separate kubernetes_logs source in the collector configuration, and each source opens its own independent connections to the Kubernetes API server.

Resource Implications Per kubernetes_logs Source

With N application or infrastructure container inputs configured, each collector pod creates the following resources:

API Server Connections

N separate Kubernetes clients with N connection pools
N watch streams for Pods (node-scoped)
N watch streams for Namespaces (cluster-wide)
N watch streams for Nodes (single node)

This can create significant load on the Kubernetes API server and duplicate metadata caches in collector memory. The namespace watcher is particularly impactful as it is cluster-scoped, watching ALL namespaces regardless of where pods are running.
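
To make the scaling concrete, the following Python sketch multiplies out the per-source resources listed above across a cluster. It is an estimate only: it assumes one collector pod per node and exactly one client plus three watch streams per source, as described in this section. The class name, function name, and example numbers are illustrative and do not come from the collector itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WatchFootprint:
    clients: int            # Kubernetes clients, one connection pool each
    pod_watches: int        # node-scoped
    namespace_watches: int  # cluster-wide
    node_watches: int       # single node

    @property
    def total_watches(self) -> int:
        return self.pod_watches + self.namespace_watches + self.node_watches


def estimate_footprint(container_inputs: int, collector_pods: int) -> WatchFootprint:
    """Estimate API server connections opened by all collectors in a cluster.

    Assumption: each container input maps to one kubernetes_logs source, and
    each source opens one client and one watch per resource type.
    """
    if container_inputs < 0 or collector_pods < 0:
        raise ValueError("counts must be non-negative")
    per_resource = container_inputs * collector_pods
    return WatchFootprint(per_resource, per_resource, per_resource, per_resource)


if __name__ == "__main__":
    # Hypothetical 200-node cluster, comparing 1 input against 4 inputs.
    for inputs in (1, 4):
        fp = estimate_footprint(container_inputs=inputs, collector_pods=200)
        print(f"{inputs} input(s): {fp.clients} clients, {fp.total_watches} watch streams")
```

Under these assumptions, every additional container input adds one cluster-wide namespace watch per node, which is why the number of inputs matters more as the cluster grows.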

Impact During Collector Restarts

When collector pods restart or are redeployed (e.g., during ClusterLogForwarder updates):

Initial spike: All collectors simultaneously reconnect to the API server
LIST requests surge: Each collector performs initial LIST operations to populate metadata caches
WATCH stream establishment: Multiple WATCH connections established per collector
Metadata synchronization: Full pod, namespace, and node metadata loaded into memory

Best Practices and Recommendations

1. Minimize Number of Container Inputs

Problem: Each application or infrastructure container input creates a separate source with independent API server connections and metadata caches.

Recommendation: Configure the fewest application or infrastructure container inputs per ClusterLogForwarder that still meet your container log collection requirements. Add separate inputs only when you have specific filtering requirements (e.g., routing different workloads to different outputs).

apiVersion: observability.openshift.io/v1
kind: ClusterLogForwarder
metadata:
  name: instance
spec:
  serviceAccount:
    name: logcollector
  inputs:
    - name: application-logs
      type: application
    - name: infra-logs
      type: infrastructure
      sources:
      - container
  pipelines:
    - name: app-to-default
      inputRefs:
        - application-logs
        - infra-logs
      outputRefs:
        - default
  outputs:
    - name: default
      type: lokiStack
      lokiStack:
        target:
          name: logging-loki
          namespace: openshift-logging

2. Configure MaxUnavailable for Collector Redeployments

Recommendation: Set maxUnavailable to 10% in production environments to reduce API server load during collector redeployments.

This setting is configured via the ClusterLogForwarder resource:

apiVersion: observability.openshift.io/v1
kind: ClusterLogForwarder
metadata:
  name: instance
spec:
  serviceAccount:
    name: logcollector
  collector:
    maxUnavailable: 10%  # Recommended for production
  # ... inputs, pipelines, outputs

Note: The default is 100%. In production environments with a large number of nodes, always set this value to 10% or lower.

Impact of MaxUnavailable Setting

The maxUnavailable setting controls how many collector pods can be unavailable during updates:

MaxUnavailable	API Server Impact	Recommendation
100%	All collectors restart simultaneously. Massive spike in LIST, WATCH, and GET requests. Can overwhelm API server.	Default value. Not recommended for production; lower it before rolling out changes
30%	Moderate request rate increase. Controlled rollout reduces peak load.	Acceptable for small to medium clusters
10%	Minimal request rate increase. Requests settle to near steady-state levels. Graceful rollout.	Recommended for production - balances update speed with API server load
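
The effect of the setting can be approximated with a short calculation. The Python sketch below is a rough model, not the operator's implementation. It assumes one collector pod per node, that a percentage maxUnavailable on a DaemonSet is rounded up to a whole number of pods, and that every restarting pod issues one initial LIST per watch stream (three per container input). The function names and the example cluster size are hypothetical.

```python
def pods_restarting_at_once(total_pods: int, max_unavailable_percent: int) -> int:
    """Upper bound on collector pods that may restart concurrently.

    Assumption: the percentage is rounded up to a whole pod, so at least
    one pod can always be updated.
    """
    if total_pods <= 0:
        return 0
    if not 0 < max_unavailable_percent <= 100:
        raise ValueError("percentage must be in the range 1-100")
    # Integer ceiling division avoids floating-point rounding surprises.
    return (total_pods * max_unavailable_percent + 99) // 100


def peak_initial_lists(total_pods: int, max_unavailable_percent: int,
                       container_inputs: int) -> int:
    """Approximate LIST requests issued by one rollout batch.

    Assumption: 3 watched resource types (Pods, Namespaces, Nodes) per
    container input, each starting with a single LIST.
    """
    batch = pods_restarting_at_once(total_pods, max_unavailable_percent)
    return batch * container_inputs * 3


if __name__ == "__main__":
    # Hypothetical 200-node cluster with 2 container inputs.
    for percent in (100, 30, 10):
        batch = pods_restarting_at_once(200, percent)
        lists = peak_initial_lists(200, percent, container_inputs=2)
        print(f"maxUnavailable={percent}%: {batch} pods per batch, ~{lists} initial LISTs")
```

In this model, lowering the value from 100% to 10% reduces the size of each burst by a factor of ten, at the cost of a longer rollout.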

3. Monitor Collector Resource Usage

Use the collector monitoring dashboard to identify high memory or CPU usage. Also track the following metrics:

container_memory_working_set_bytes{container="collector"} - Collector memory usage
apiserver_request_total - API server request rates by verb (LIST, GET, WATCH)
apiserver_longrunning_requests - Watch stream counts
log_logged_bytes_total - Log throughput per collector

Warning signs:

Collector memory usage approaching limits during startup
Unusual spikes in API server LIST/WATCH requests
Collector CrashLoopBackOff or OOMKilled events

4. Plan for Cluster Growth

Consider future scaling when designing ClusterLogForwarder configurations:

Cluster Size	Namespaces	Recommendation
Small (<50 nodes)	<100	Multiple inputs acceptable with monitoring
Medium (50-250 nodes)	100-500	Minimize inputs, use 10% MaxUnavailable
Large (>250 nodes)	>500	Single input only, carefully tune MaxUnavailable
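
The sizing table can also be expressed as a small lookup, for example inside a capacity-planning script. The Python sketch below is one possible reading of the table, not an official rule: when the node count and the namespace count fall into different rows, it assumes the stricter row applies. The function name and return strings are illustrative.

```python
RECOMMENDATIONS = (
    "Multiple inputs acceptable with monitoring",        # Small
    "Minimize inputs, use 10% MaxUnavailable",           # Medium
    "Single input only, carefully tune MaxUnavailable",  # Large
)


def _tier(value: int, medium_from: int, large_above: int) -> int:
    """Return 0 (small), 1 (medium), or 2 (large) for one dimension."""
    if value > large_above:
        return 2
    if value >= medium_from:
        return 1
    return 0


def recommend(nodes: int, namespaces: int) -> str:
    """Pick a recommendation from the sizing table.

    Assumption: the larger of the two tiers (nodes, namespaces) wins.
    """
    level = max(_tier(nodes, 50, 250), _tier(namespaces, 100, 500))
    return RECOMMENDATIONS[level]


if __name__ == "__main__":
    print(recommend(nodes=30, namespaces=80))
    print(recommend(nodes=30, namespaces=600))
```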

Troubleshooting

Symptoms of Excessive API Server Load

Collector pods stuck in Pending or CrashLoopBackOff
OOMKilled events in collector pod status
API server request latency increases during collector rollouts
etcd performance degradation
Cluster-wide slowdowns during updates

Diagnostic Steps

Check number of container inputs:

oc get clusterlogforwarder instance -n openshift-logging -o yaml | grep -A 20 "inputs:"

Check collector memory usage:

oc adm top pods -n openshift-logging -l component=collector

Check API server request rates:

# Query Prometheus for per-second API server request rates over the last 5 minutes
sum(rate(apiserver_request_total{verb="WATCH"}[5m]))
sum(rate(apiserver_request_total{verb="LIST"}[5m]))

Check MaxUnavailable setting:

oc get daemonset collector -n openshift-logging -o yaml | grep -A 5 updateStrategy

Recovery Actions

Reduce number of container inputs - Consolidate to minimal inputs
Increase collector memory limits - If justified by log volume (use with caution)
Reduce MaxUnavailable percentage - Slow down collector rollouts to 10% or less
Stagger ClusterLogForwarder updates - Avoid multiple simultaneous CLF updates

References

JIRA Investigation: LOG-7588: Investigate kubernetes_source interactions with the kube-apiserver
Vector kubernetes_logs source: https://vector.dev/docs/reference/configuration/sources/kubernetes_logs/
