=== Log loss

Container logs are written to `/var/log/pods`.
The forwarder reads and forwards logs as quickly as possible with its available CPU and memory.
If the forwarder is too slow, adjusting its CPU and memory limits may help
(see <<Check forwarder CPU and Memory>>).

There are always some _unread logs_, written but not yet read by the forwarder.

There is no coordination or flow-control to ensure logs are forwarded before they are deleted.
_Log Loss_ occurs when _unread logs_ are deleted by CRI-O _before_ being read by the forwarder.
Lost logs are gone from the file-system, have not been forwarded anywhere, and cannot be recovered.

Logs can also be lost when short-lived pods or jobs terminate and their log files are deleted
before the collector reads them.
This is distinct from rotation-based loss and is difficult to mitigate.

NOTE: This guide focuses on _container logs_.
Other log types (journald, Linux audit, Kubernetes API audit) have different rotation mechanisms.
See <<Other types of logs>>.
=== Log rotation

CRI-O does the actual log rotation, but the rotation limits are configured via Kubelet parameters.
The parameters are:

[horizontal]
containerLogMaxSize:: Max size of a single log file (default 10MiB)
containerLogMaxFiles:: Max number of log files per container (default 5)

When the active file reaches `containerLogMaxSize` the log files are rotated:

. a new active file is created
. if there are more than `containerLogMaxFiles` files, the oldest is deleted.
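
On OpenShift, these Kubelet parameters can be changed with a `KubeletConfig` resource.
A minimal sketch, assuming a worker pool; the resource name and values are illustrative, not recommendations:

[source,yaml]
----
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: larger-container-logs # illustrative name
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
  kubeletConfig:
    containerLogMaxSize: 50Mi # more space per log file before rotation
    containerLogMaxFiles: 5   # keep up to 5 files per container
----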

[NOTE]
====
CRI-O may compress rotated log files (`.gz`).
The collector cannot read compressed files: they are excluded from collection.
Compressed rotated logs save disk space but are effectively lost to the collector.
Disk size calculations in this guide assume uncompressed log files.
====

=== Best effort delivery

OpenShift logging provides _best effort_ delivery of logs.
This article discusses how you can tune these limits to minimize log loss under overload.

[WARNING]
====
**NEVER** abuse logs as a way to store or send application data, especially financial data.
This is unreliable, insecure, and in all other ways inadvisable.
Use appropriate tools that meet your reliability requirements for application data.
For example: databases, object stores, or reliable messaging (Kafka, AMQP, MQTT).
====

=== Modes of operation

[horizontal]
writeRate:: long-term average bytes per second per container written to `/var/log/pods`
sendRate:: long-term average bytes per second per container forwarded to the store
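
As a sketch, these rates can be approximated with queries like the following.
The `log_logged_bytes_total` metric is described below; the `vector_component_sent_event_bytes_total` name and `component_kind` label are assumptions to verify against the metrics your collector actually exposes:

----
# Approximate writeRate: bytes/sec written, per namespace, 1h average.
sum by (namespace) (rate(log_logged_bytes_total[1h]))

# Approximate sendRate: bytes/sec leaving the forwarder's sinks, 1h average.
sum(rate(vector_component_sent_event_bytes_total{component_kind="sink"}[1h]))
----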
During _normal operation_ `sendRate` keeps up with `writeRate` (on average).
The number of unread logs is small, and does not grow over time.

During _overload_, `writeRate` exceeds `sendRate` and unread logs accumulate.
If this lasts long enough, log rotation will delete unread logs causing log loss.
After a load surge ends, the system has to _recover_ by processing the accumulated unread logs.
Until the backlog clears, the system is more vulnerable to log loss if there is another overload.

NOTE: If drop or filter rules are configured in the `ClusterLogForwarder`,
the effective write rate seen by the forwarder is reduced.
Also, the collector itself can be a bottleneck if its CPU or memory limits are too low,
causing slow reading and sending regardless of the remote store's capacity.
See <<Check forwarder CPU and Memory>>.
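
For example, a minimal sketch of a `drop` filter; the filter name, field, and match pattern are illustrative:

[source,yaml]
----
# Excerpt of a ClusterLogForwarder spec; attach the filter to a pipeline via filterRefs.
filters:
- name: drop-debug # illustrative name
  type: drop
  drop:
  - test:
    - field: .level # drop records whose level field matches "debug"
      matches: "debug"
----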

== Metrics for logging

Relevant metrics include:

[horizontal]
vector_*:: The `vector` process deployed by the log forwarder generates metrics for log collection, buffering and forwarding.
log_logged_bytes_total:: Produced by the `LogFileMetricExporter`, reported per namespace, pod, and container. Measures bytes written to disk _before_ the forwarder reads them, which is essential for detecting log loss.
kube_*:: Metrics from the Kubernetes cluster.

[NOTE]
====
Metrics named `_bytes_` count bytes, metrics named `_events_` count log records.

The forwarder adds metadata to the logs before sending, so a log record written to `/var/log`
is not the same size in bytes as the record sent to the store.
Use event and byte metrics carefully in calculations to get the correct results.
====

TIP: The OpenShift console includes logging dashboards under Observe > Dashboards.
These provide pre-built views of collection and forwarding metrics.

=== Log File Metric Exporter

The metric `log_logged_bytes_total` is the number of bytes written to each file in `/var/log/pods` by a container.
It is produced by the `LogFileMetricExporter`, created as a resource, for example:

[source,yaml]
----
apiVersion: logging.openshift.io/v1alpha1
kind: LogFileMetricExporter
metadata:
  name: instance
  namespace: openshift-logging
----

=== Limitations

Write rate metrics (`log_logged_bytes_total`) only cover container logs in `/var/log/pods`.
The following are excluded from these metrics:

* Node-level logs (journald, systemd, audit)
* Kubernetes API audit logs

This can cause discrepancies when comparing write vs send rates.
The principles still apply, but account for this additional volume in capacity planning.

.*MaxNodeWriteRateBytes* (bytes/sec per node): Identifies the busiest node for worst-case sizing.
----
max(sum by (instance) (rate(log_logged_bytes_total[1h])))
----

NOTE: The queries above are for container logs only.
Node journal and audit logs may also be forwarded (depending on your `ClusterLogForwarder` configuration)
which can cause discrepancies when comparing write and send rates.

== Other types of logs

There are other types of logs besides container logs.
All are stored under `/var/log`, but log rotation is configured differently.
The same general principles of log loss apply.

journald node logs:: Rotation is controlled by `journald.conf` configuration files.
Key settings include `SystemMaxUse`, `SystemMaxFileSize`, and `MaxRetentionSec`.
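+
For example, a sketch of `/etc/systemd/journald.conf` settings (values are illustrative, not recommendations):
+
----
[Journal]
# Total disk space the journal may use.
SystemMaxUse=1G
# Max size of each journal file before rotation.
SystemMaxFileSize=100M
# Delete entries older than this.
MaxRetentionSec=1month
----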

Linux audit node logs:: Rotation is controlled by `auditd`, configured in `/etc/audit/auditd.conf`.
Key settings include `max_log_file` and `num_logs`.
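+
A sketch of the corresponding `/etc/audit/auditd.conf` settings (illustrative values):
+
----
# Max size of each audit log file, in MB.
max_log_file = 8
# Number of log files to keep when rotating.
num_logs = 5
# Rotate when max_log_file is reached.
max_log_file_action = ROTATE
----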

Kubernetes API audit logs:: Audit log volume depends on the audit policy level.
The `kube-apiserver` audit configuration controls verbosity and rotation.
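+
On OpenShift, the top-level audit policy is set on the cluster `APIServer` resource; a sketch (valid profiles include `Default`, `WriteRequestBodies`, `AllRequestBodies`, and `None`):
+
[source,yaml]
----
apiVersion: config.openshift.io/v1
kind: APIServer
metadata:
  name: cluster
spec:
  audit:
    profile: Default # metadata only; body-logging profiles are far more verbose
----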

Node-level configuration in OpenShift is applied via `MachineConfig` resources.
See the OpenShift documentation on machine configuration for details.

NOTE: Kubernetes API audit logs can be extremely verbose; on large clusters, unfiltered audit logs
can include multi-megabyte request/response dumps. In addition to configuring the audit logs
produced by the API server, the `ClusterLogForwarder` provides a dedicated audit filter type to
select the audit logs you want to forward. If you forward audit logs, see the documentation to
configure an appropriate filter for your needs.
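
For example, a sketch of a `kubeAPIAudit` filter that keeps only event metadata; the filter name is illustrative and the full rule syntax is in the `ClusterLogForwarder` documentation:

[source,yaml]
----
# Excerpt of a ClusterLogForwarder spec; reference the filter from the audit pipeline's filterRefs.
filters:
- name: audit-metadata-only
  type: kubeAPIAudit
  kubeAPIAudit:
    rules:
    # Log all matching events at Metadata level, dropping request/response bodies.
    - level: Metadata
----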

== Recommendations

=== Check forwarder CPU and Memory

If the forwarder can't keep up with `writeRate`, there are two possible causes:

- The _remote store_, or the network to it, is too slow: the forwarder is often blocked waiting to send, which slows down reading once its internal buffers are full.
- The _forwarder itself_ is too slow: the CPU and memory limits for the forwarder may be set too low, causing the collector process to be throttled.

Check whether the collector pods are hitting their CPU or memory limits.
Collector resources can be configured via the `ClusterLogForwarder` resource's collector spec.
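
A sketch, assuming the `observability.openshift.io/v1` API; the values are illustrative starting points, not recommendations:

[source,yaml]
----
apiVersion: observability.openshift.io/v1
kind: ClusterLogForwarder
metadata:
  name: instance
  namespace: openshift-logging
spec:
  collector:
    resources:
      requests:
        cpu: 500m
        memory: 1Gi
      limits:
        memory: 2Gi # raise if collector pods are OOM-killed or CPU-throttled
----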

Adjusting CPU and memory for the forwarder is an easy first step for logging problems
and is always worth checking.

However, if the real problem is that `writeRate > sendRate` due to a slow remote store, adjusting collector resources alone won't solve the problem.

=== Estimate long-term load

Estimate your expected steady-state load, spike patterns, and tolerable outage duration.
The long-term average send rate *must* exceed the write rate (including spikes) to allow recovery after overloads.
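
As a rough worked example using the rotation defaults above: each container can hold at most `containerLogMaxSize` x `containerLogMaxFiles` = 10MiB x 5 = 50MiB of unread logs on disk.
If a container writes a steady 100KiB/s while forwarding is stalled, rotation starts deleting unread logs after roughly 50MiB / 100KiB/s ≈ 512 seconds, or about 8.5 minutes.
Your tolerable outage duration must fit inside that window, or the rotation limits must be raised.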