
Commit 6e4006f

fix: update log loss article to address comments.

This update addresses the unresolved comments from Pull Request #3166.

1 parent 3c28f59

1 file changed: docs/administration/high-volume-log-loss.adoc (146 additions & 70 deletions)
@@ -11,8 +11,9 @@ and how to configure your cluster to minimize this risk.
 === Log loss
 
 Container logs are written to `/var/log/pods`.
-The forwarder reads and forwards logs as quickly as possible with its available CPU/Memory.
-If the forwarder is too slow, in some cases adjusting its CPU/Memory may resolve the problem.
+The forwarder reads and forwards logs as quickly as possible with its available CPU and memory.
+If the forwarder is too slow, adjusting its CPU and memory limits may help
+(see <<Check forwarder CPU and Memory>>).
 
 There are always some _unread logs_, written but not yet read by the forwarder.
 
@@ -25,18 +26,19 @@ There is no coordination or flow-control to ensure logs are forwarded before the
 _Log Loss_ occurs when _unread logs_ are deleted by CRI-O _before_ being read by the forwarder.
 Lost logs are gone from the file-system, have not been forwarded anywhere, and cannot be recovered.
 
+Logs can also be lost when short-lived pods or jobs terminate and their log files are deleted
+before the collector reads them.
+This is distinct from rotation-based loss and is difficult to mitigate.
+
 NOTE: This guide focuses on _container logs_.
-The section <<Other types of logs>> briefly discusses other types of log.
-====
-Not all logs are container logs, the following types of logs are not discussed here but
-can be managed in similar ways:
+Other log types (journald, Linux audit, Kubernetes API audit) have different rotation mechanisms.
+See <<Other types of logs>>.
 
-- Journald (node) logs: are
-====
 === Log rotation
 
-CRI-O does the actual log rotation, but the rotation limits are specified via Kubelet.
+CRI-O does the actual log rotation, but the rotation limits are configured via Kubelet parameters.
 The parameters are:
+
 [horizontal]
 containerLogMaxSize:: Max size of a single log file (default 10MiB)
 containerLogMaxFiles:: Max number of log files per container (default 5)
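[Editor's note] For reference, these two parameters are set through a `KubeletConfig` resource; the full procedure appears in <<Configure Kubelet log limits>> below. A minimal sketch, assuming the standard worker machine-config pool label (the name and values here are illustrative only):

[source,yaml]
----
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: worker-log-rotation            # illustrative name
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
  kubeletConfig:
    containerLogMaxSize: 50Mi          # illustrative; derive from your own sizing
    containerLogMaxFiles: 5
----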
@@ -48,6 +50,14 @@ When the active file reaches `containerLogMaxSize` the log files are rotated:
 . a new active file is created
 . if there are more than `containerLogMaxFiles` files, the oldest is deleted.
 
+[NOTE]
+====
+CRI-O may compress rotated log files (`.gz`).
+The collector cannot read compressed files — they are excluded from collection.
+Compressed rotated logs save disk space but are effectively lost to the collector.
+Disk size calculations in this guide assume uncompressed log files.
+====
+
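[Editor's note] To make the compression caveat concrete, a container's log directory typically looks like the sketch below; CRI-O keeps one active file plus timestamped rotations (paths and timestamps are illustrative):

----
/var/log/pods/<namespace>_<pod>_<uid>/<container>/
  0.log                       # active file, read by the collector
  0.log.20250101-101010       # rotated, still readable
  0.log.20250101-090909.gz    # compressed, excluded from collection
----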
 === Best effort delivery
 
 OpenShift logging provides _best effort_ delivery of logs.
@@ -58,7 +68,7 @@ This article discusses how you can tune these limits to minimize log loss under
 
 [WARNING]
 ====
-**NEVER** abuse logs as a way to store or send application data - especially financial data.
+**NEVER** abuse logs as a way to store or send application data, especially financial data.
 This is unreliable, insecure, and in all other ways inconceivable.
 Use appropriate tools that meet your reliability requirements for application data.
 For example: databases, object stores, or reliable messaging (Kafka, AMQP, MQTT).
@@ -67,8 +77,8 @@ For example: databases, object stores, or reliable messaging (Kafka, AMQP, MQTT)
 === Modes of operation
 
 [horizontal]
-writeRate:: long-term average logs per second per container written to `/var/log/pods`
-sendRate:: long-term average logs per second per container forwarded to the store
+writeRate:: long-term average bytes per second per container written to `/var/log/pods`
+sendRate:: long-term average bytes per second per container forwarded to the store
 
 During _normal operation_ `sendRate` keeps up with `writeRate` (on average).
 The number of unread logs is small, and does not grow over time.
@@ -79,25 +89,33 @@ If this lasts long enough, log rotation will delete unread logs causing log loss
 After a load surge ends, the system has to _recover_ by processing the accumulated unread logs.
 Until the backlog clears, the system is more vulnerable to log loss if there is another overload.
 
+NOTE: If drop or filter rules are configured in the `ClusterLogForwarder`,
+the effective write rate seen by the forwarder is reduced.
+Also, the collector itself can be a bottleneck if its CPU or memory limits are too low,
+causing slow reading and sending regardless of the remote store's capacity.
+See <<Check forwarder CPU and Memory>>.
+
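[Editor's note] As an illustration of such rules, a `drop` filter sketch for the `ClusterLogForwarder`, assuming the observability.openshift.io/v1 API (names are hypothetical; outputs and serviceAccount are omitted for brevity):

[source,yaml]
----
apiVersion: observability.openshift.io/v1
kind: ClusterLogForwarder
metadata:
  name: instance
  namespace: openshift-logging
spec:
  filters:
  - name: drop-debug                  # hypothetical filter name
    type: drop
    drop:
    - test:
      - field: .level                 # drop records at debug or trace level
        matches: "debug|trace"
  pipelines:
  - name: app-logs                    # hypothetical pipeline
    inputRefs: [application]
    filterRefs: [drop-debug]
    outputRefs: [my-output]           # assumes an output defined elsewhere
----

Dropped records reduce the effective write rate seen downstream, but they are of course not forwarded at all.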
 == Metrics for logging
 
 Relevant metrics include:
+
 [horizontal]
 vector_*:: The `vector` process deployed by the log forwarder generates metrics for log collection, buffering and forwarding.
-log_logged_bytes_total:: The `LogFileMetricExporter` measures disk writes _before_ logs are read by the forwarder.
-To measure end-to-end log loss it is important to measure data that is _not_ yet read by the forwarder.
+log_logged_bytes_total:: Produced by the `LogFileMetricExporter`, reported per namespace, pod, and container. Measures bytes written to disk _before_ the forwarder reads them — essential for detecting log loss.
 kube_*:: Metrics from the Kubernetes cluster.
 
-[CAUTION]
+[NOTE]
 ====
 Metrics named `_bytes_` count bytes, metrics named `_events_` count log records.
 
-The forwarder adds metadata to the logs before sending so you cannot assume that a log
-record written to `/var/log` is the same size in bytes as the record sent to the store.
-
+The forwarder adds metadata to the logs before sending, so a log record written to `/var/log`
+is not the same size in bytes as the record sent to the store.
 Use event and byte metrics carefully in calculations to get the correct results.
 ====
 
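[Editor's note] One way to relate the two, assuming the collector's internal metrics are scraped by Prometheus: estimate the average record size at the source, then use it to convert between events and bytes. A sketch (verify the exact metric names in your cluster):

----
# Approximate average log record size in bytes at the source
sum(rate(vector_component_received_event_bytes_total{component_type="kubernetes_logs"}[1h]))
/
sum(rate(vector_component_received_events_total{component_type="kubernetes_logs"}[1h]))
----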
+TIP: The OpenShift console includes logging dashboards under Observe > Dashboards.
+These provide pre-built views of collection and forwarding metrics.
+
 === Log File Metric Exporter
 
 The metric `log_logged_bytes_total` is the number of bytes written to each file in `/var/log/pods` by a container.
@@ -113,15 +131,15 @@ metadata:
   namespace: openshift-logging
 ----
 
-== Limitations
+=== Limitations
 
-Write rate metrics only cover container logs in `/var/log/pods`.
+Write rate metrics (`log_logged_bytes_total`) only cover container logs in `/var/log/pods`.
 The following are excluded from these metrics:
 
-* Node-level logs (journal, systemd, audit)
-* API audit logs
+* Node-level logs (journald, systemd, audit)
+* Kubernetes API audit logs
 
-This may cause discrepancies when comparing write vs send rates.
+This can cause discrepancies when comparing write vs send rates.
 The principles still apply, but account for this additional volume in capacity planning.
 
 === Using metrics to measure log activity
@@ -149,48 +167,74 @@ sum(increase(vector_component_received_events_total{component_type="kubernetes_l
 max(rate(log_logged_bytes_total[1h]))
 ----
 
+.*MaxNodeWriteRateBytes* (bytes/sec per node): Identifies the busiest node for worst-case sizing.
+----
+max(sum by (instance) (rate(log_logged_bytes_total[1h])))
+----
+
 NOTE: The queries above are for container logs only.
-Node and audit may also be forwarded (depending on your `ClusterLogForwarder` configuration)
-which may cause discrepancies when comparing write and send rates.
+Node journal and audit logs may also be forwarded (depending on your `ClusterLogForwarder` configuration)
+which can cause discrepancies when comparing write and send rates.
 
 == Other types of logs
 
 There are other types of logs besides container logs.
 All are stored under `/var/log`, but log rotation is configured differently.
-The same general principles of log loss apply, here are some tips for configuration.
+The same general principles of log loss apply.
 
-journald node logs:: The write-rate in is the total volume of logs from _local_ processes on the node.
-Rotation is controlled by local `journald.conf` configuration files.
+journald node logs:: Rotation is controlled by `journald.conf` configuration files.
+Key settings include `SystemMaxUse`, `SystemMaxFileSize`, and `MaxRetentionSec`.
 
-Linux audit node logs:: The write-rate is total of all auditable actions on the node.
-Rotation is controlled by `auditd`, which is configured by `/etc/auditd/auditd.conf`.
+Linux audit node logs:: Rotation is controlled by `auditd`, configured in `/etc/audit/auditd.conf`.
+Key settings include `max_log_file` and `num_logs`.
 
-Openshift and Kubernetes audit logs:: #TODO: link to existing docs and features for API audit.#
+Kubernetes API audit logs:: Audit log volume depends on the audit policy level.
+The `kube-apiserver` audit configuration controls verbosity and rotation.
 
-#TODO#: explain how to set node configuration in a cluster.
+Node-level configuration in OpenShift is applied via `MachineConfig` resources.
+See the OpenShift documentation on machine configuration for details.
+
+NOTE: Kubernetes API audit logs can be extremely verbose — on large clusters, unfiltered audit logs
+can include multi-megabyte request/response dumps. In addition to configuring the audit logs
+produced by the API server, the `ClusterLogForwarder` provides a dedicated audit filter type to
+select the audit logs you want to forward. If you forward audit logs, see the documentation to
+configure an appropriate filter for your needs.
 
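[Editor's note] To illustrate the node-level settings named above, sketches of the two configuration files (values are illustrative, not recommendations):

----
# /etc/systemd/journald.conf (illustrative values)
[Journal]
SystemMaxUse=4G
SystemMaxFileSize=128M
MaxRetentionSec=1month

# /etc/audit/auditd.conf (illustrative values)
max_log_file = 100             # size per file, in MB
num_logs = 5
max_log_file_action = ROTATE
----

On OpenShift nodes these files are managed through `MachineConfig` resources, as noted above, rather than edited directly.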
 == Recommendations
 
 === Check forwarder CPU and Memory
 
 If the forwarder can't keep up with `writeRate`, there are two possible causes:
-- `sendRate` is to slow - the forwarder is often blocked waiting to send, which slows down reading once its internal buffers are full.
-- The _forwarder itself_ is too slow: the CPU and Memory limits for the forwarder may be set too low slowing down the forwarder process itself.
 
-Adjusting CPU and memory for the forwarder is an easy solution for some logging problems
-and is always a good thing to check.
+- The _remote store_, or the network to it, is too slow — the forwarder is often blocked waiting to send, which slows down reading once its internal buffers are full.
+- The _forwarder itself_ is too slow — the CPU and memory limits for the forwarder may be set too low, causing the collector process to be throttled.
+
+Check whether the collector pods are hitting their CPU or memory limits.
+Collector resources can be configured via the `ClusterLogForwarder` resource's collector spec.
 
-However, if the real problem is `writeRate > sendRate`, then this won't solve all the problems.
+Adjusting CPU and memory for the forwarder is an easy first step for logging problems
+and is always worth checking.
+
+However, if the real problem is that `writeRate > sendRate` due to a slow remote store, adjusting collector resources alone won't solve the problem.
 
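[Editor's note] A minimal sketch of raising collector resources, assuming the observability.openshift.io/v1 `ClusterLogForwarder` API (values illustrative; outputs, pipelines, and serviceAccount omitted):

[source,yaml]
----
apiVersion: observability.openshift.io/v1
kind: ClusterLogForwarder
metadata:
  name: instance
  namespace: openshift-logging
spec:
  collector:
    resources:
      requests:
        cpu: 500m        # illustrative values, size to your measured load
        memory: 1Gi
      limits:
        memory: 2Gi
----

`oc adm top pods -n openshift-logging` shows whether the collector pods are running close to their limits.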
 === Estimate long-term load
 
 Estimate your expected steady-state load, spike patterns, and tolerable outage duration.
 The long-term average send rate *must* exceed the write rate (including spikes) to allow recovery after overloads.
 
 ----
-TotalWriteRateBytes < TotalSendRateLogs × LogSizeBytes
+TotalWriteRateBytes < TotalSendRateEvents × LogSizeBytes
 ----
 
+[WARNING]
+====
+Cluster-wide averages can hide per-node variation.
+In practice, a small number of nodes often produce most of the log volume.
+Always size rotation parameters based on the _busiest nodes_, not cluster averages.
+
+Use `MaxNodeWriteRateBytes` (see <<Using metrics to measure log activity>>) to identify the worst-case node.
+====
+
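[Editor's note] A quick worked example of the inequality, with illustrative numbers:

----
TotalSendRateEvents = 5000 events/s     (measured send capacity, illustrative)
LogSizeBytes        = 800 bytes/record  (illustrative)
TotalWriteRateBytes < 5000 × 800 = 4MB/s sustainable write rate
----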
 === Configure rotation
 
 Configure rotation parameters based on the _noisiest_ containers in your cluster,
@@ -210,7 +254,15 @@ containerLogMaxSize = MaxContainerSizeBytes / N
 ----
 
 NOTE: N should be a relatively small number of files, the default is 5.
-The files can be as large as needed so that `N*containerLogMaxSize > MaxContainerSizeBytes`
+The files can be as large as needed so that `N × containerLogMaxSize > MaxContainerSizeBytes`.
+
+[CAUTION]
+====
+Large rotation settings mean more data accumulates on disk during outages.
+When the collector catches up after an outage, reading a large backlog causes heavy disk I/O
+on the node's primary partition, which can affect latency-sensitive workloads such as etcd.
+Balance rotation size against node I/O capacity.
+====
 
 === Estimate total disk requirements
 
@@ -222,11 +274,26 @@ Total disk space is based on cluster-wide average write rates, not on the noisie
 DiskTotalSize = MaxOutageTime × TotalWriteRateBytes × SafetyFactor
 ----
 
-.Recovery time to clear the backlog from a max outage:
+.Recovery time to clear the backlog from a max outage
 ----
-RecoveryTime = (MaxOutageTime × TotalWriteRateBytes) / (TotalSendRateLogs × LogSizeBytes)
+RecoveryTime = (MaxOutageTime × TotalWriteRateBytes) / (TotalSendRateEvents × LogSizeBytes)
 ----
 
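[Editor's note] For example, with illustrative numbers (a 1 hour outage at a 2MB/s write rate, against 4MB/s of send capacity):

----
RecoveryTime = (3600s × 2MB/s) / 4MB/s = 1800s ≈ 30 minutes
----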
+[NOTE]
+====
+These are cluster-wide estimates.
+Individual nodes may need more or less disk depending on their share of the log volume.
+Recovery time also varies per node — the busiest nodes take longer and may face backpressure
+from the remote store during catch-up.
+====
+
+[NOTE]
+====
+Standard OCP nodes typically use a single ~120GB partition for `/var/log`, `/var/lib`, `/etc`, and workload data.
+All log storage competes with other node processes for this space.
+With container densities of 200+ pods per node, per-container rotation settings multiply quickly.
+====
+
 [TIP]
 .To check the size of the /var/log partition on each node
 [source,console]
@@ -261,7 +328,7 @@ containerLogMaxFiles: 10
 containerLogMaxSize: 100MB
 ----
 
-For total disk space, suppose the cluster writes 2MB/s for all containers:
+For total disk space, suppose the busiest node writes 2MB/s across all its containers:
 
 ----
 MaxOutageTime = 3600
@@ -272,7 +339,7 @@ DiskTotalSize = 3600s × 2MB/s × 1.5 = 10GB
 ----
 
 NOTE: `MaxStoragePerContainerBytes=1GB` applies only to the noisiest containers.
-The `DiskTotalSize=10GB` is based on the cluster-wide average write rates.
+The `DiskTotalSize=10GB` is based on write rates for the busiest node.
 
 === Configure Kubelet log limits
 
@@ -301,7 +368,6 @@ You can modify `MachineConfig` resources on older versions of OpenShift that don
 *To apply the KubeletConfig:*
 [,bash]
 ----
-# Apply the configuration
 oc apply -f kubelet-log-limits.yaml
 
 # Monitor the roll-out (this will cause node reboots)
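[Editor's note] After the pools finish updating, one way to confirm the values reached a node is the kubelet's `configz` endpoint; a sketch, where `<node>` is a placeholder and `jq` is assumed to be available:

[source,console]
----
$ oc get --raw "/api/v1/nodes/<node>/proxy/configz" \
    | jq '.kubeletconfig | {containerLogMaxSize, containerLogMaxFiles}'
----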
@@ -325,58 +391,68 @@ find /var/log -name "*.log" -exec ls -lah {} \; | head -20
 
 ----
 
-== Alternative (non)-solutions
+== Bad alternatives
 
-This section presents what seem like alternative solutions at first glance, but have significant problems.
+WARNING: This section presents ideas that often come up in the context of log reliability.
+They _seem_ like good solutions at first glance, but be aware of the problems hidden underneath.
 
 === Large forwarder buffers
 
-Instead of modifying rotation parameters, make the forwarder's internal buffers very large.
+Instead of increasing rotation limits, why not make the forwarder's internal buffers very large?
 
 ==== Duplication of logs
 
-Forwarder buffers are stored on the same disk partition as `/var/log`.
+Forwarder buffers are stored in `/var/lib/vector`, which is normally on the same disk partition as `/var/log`.
 When the forwarder reads logs, they remain in `/var/log` until rotation deletes them.
-This means the forwarder buffer mostly duplicates data from `/var/log` files,
-which requires up to double the disk space for logs waiting to be forwarded.
+This means most of the data in the forwarder buffer is a duplicate of data still in `/var/log` files.
+Very large buffers create a lot of duplicate data on the same disk volume, which is not helpful if that volume begins to fill.
 
 ==== Buffer design mismatch
 
-Forwarder buffers are optimized for transmitting data efficiently, based on characteristics of the remote store.
+Forwarder buffers are intended for reliable transmission of data, not long-term storage.
+Long-term log retention is the purpose of the `/var/log` files themselves.
 
-- *Intended purpose:* Hold records that are ready-to-send or in-flight awaiting acknowledgement.
+- *Intended purpose:* Hold records that are sent and awaiting remote acknowledgment or re-transmit.
 - *Typical time-frame:* Seconds to minutes of buffering for round-trip request/response times.
-- *Not designed for:* Hours/days of log accumulation during extended outages
+- *Not designed for:* Hours/days of log accumulation during extended outages.
+
+Each output in each `ClusterLogForwarder` gets its own buffer, by default 256MB per output.
+This provides protection against brief network interruptions and re-transmits,
+but is too small for long-term, high-volume log accumulation.
 
-==== Supporting other logging tools
 
-Expanding `/var/log` benefits _any_ logging tool, including:
+Buffer data is stored in a component-dependent format (with compression and encoding),
+so buffer size in bytes does not correspond directly to log size in bytes.
+
+==== Why increasing rotation limits is better
+
+Increasing rotation limits benefits _any_ logging tool, including:
 
 - `oc logs` for local debugging or troubleshooting log collection
 - Standard Unix tools when debugging via `oc rsh`
 
-Expanding forwarder buffers only benefits the forwarder, and costs more in disk space.
+Expanding forwarder buffers only benefits the forwarder, and uses up valuable `/var/log` space.
+If you deploy multiple forwarders, each needs its own buffer space which multiplies disk usage.
 
-If you deploy multiple forwarders, each additional forwarder will need its own buffer space.
-If you expand `/var/log`, all forwarders share the same storage.
+Larger rotation limits are shared by all tools reading from `/var/log`, including multiple
+forwarders and other log collection tools.
 
 === Persistent volume buffers
 
-Since large forwarder buffers compete for disk space with `/var/log`,
+Since forwarder buffers compete for disk space with `/var/log` on the same partition,
 what about storing forwarder buffers on a separate persistent volume?
 
-This would still double the storage requirements (using a separate disk) but
-the real problem is that a PV is not a local disk, it is a network service.
-Using PVs for buffer storage introduces new network dependencies and reliability and performance issues.
-The underlying buffer management code is optimized for local disk response times.
+A persistent volume is typically network-attached or remotely-hosted storage.
+In effect it is another kind of "remote store", that can get backed up or
+become unavailable like your intended forwarding target.
+For reliable transmission, the forwarder needs buffers that are reliable and fast like a local disk.
 
 == Summary
 
-1. *Monitor log patterns:* Use Prometheus metrics to measure log generation and send rates
-2. *Calculate storage requirements:* Account for peak periods, recovery time, and spikes
-3. *Increase kubelet log rotation limits:* Allow greater storage for noisy containers
-4. *Plan for peak scenarios:* Size storage to handle expected patterns without loss
-
-TIP: The OpenShift console Observe>Dashboard section includes helpful log-related dashboards.
-
+1. *Check collector resources:* Ensure the forwarder has sufficient CPU and memory
+2. *Monitor log patterns:* Use Prometheus metrics to measure log generation and send rates per node
+3. *Calculate storage requirements:* Account for peak periods, recovery time, and per-node variation
+4. *Increase CRI-O log rotation limits:* Configure via Kubelet parameters to allow greater storage for noisy containers
+5. *Plan for peak scenarios:* Size storage to handle expected patterns on the busiest nodes without loss
 
+TIP: The OpenShift console Observe > Dashboards section includes logging dashboards for monitoring collection and forwarding metrics.