
Updates to log loss article. #3267

Open

alanconway wants to merge 1 commit into openshift:master from alanconway:log-loss-2

Conversation

@alanconway
Contributor

@alanconway alanconway commented May 4, 2026

Description

Update the log loss article to address unresolved comments on the original PR:
#3166

/assign jcantrill
/cc Clee2691
/cc r2d2rnd

Links

fix: update log loss article to address comments.

This update is to address the unresolved comments from Pull Request #3166.

Summary by CodeRabbit

  • Documentation
    • Updated guidance on high-volume log loss, clarifying deletion from short-lived pods vs. rotation
    • Clarified forwarder guidance (CPU and memory) and noted that drop rules and collector throttling can reduce the effective write rate (an illustrative drop-filter sketch follows this list)
    • Switched delivery-rate terminology to bytes/sec; added a worst-case sizing metric to the PromQL guidance
    • Restructured the metrics section with Limitations (container-only metric coverage) and recovery/backlog considerations
    • Expanded rotation/audit log notes; added a CAUTION on large rotation settings and a “Bad alternatives” rationale
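
For readers unfamiliar with drop rules: they are ClusterLogForwarder filters that discard matching records before forwarding. The snippet below is a heavily abridged, hypothetical sketch of the general shape and is not part of this PR; the .level field, the drop-debug filter name, and the my-remote-store output are placeholders, and the exact schema should be checked against the installed ClusterLogForwarder API version.

  apiVersion: observability.openshift.io/v1
  kind: ClusterLogForwarder
  metadata:
    name: collector
    namespace: openshift-logging
  spec:
    # serviceAccount and output definitions omitted for brevity.
    filters:
      - name: drop-debug             # discard records whose level matches "debug"
        type: drop
        drop:
          - test:
              - field: .level
                matches: "debug"
    pipelines:
      - name: application-logs
        inputRefs: [application]
        filterRefs: [drop-debug]
        outputRefs: [my-remote-store] # placeholder output name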

@openshift-ci openshift-ci Bot requested review from Clee2691 and r2d2rnd May 4, 2026 16:33
@coderabbitai

coderabbitai Bot commented May 4, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 00b3fa42-2826-4f03-abb9-af083c553ada

📥 Commits

Reviewing files that changed from the base of the PR and between 6e4006f and d969ade.

📒 Files selected for processing (1)
  • docs/administration/high-volume-log-loss.adoc
✅ Files skipped from review due to trivial changes (1)
  • docs/administration/high-volume-log-loss.adoc

Walkthrough

Refines the high-volume log-loss documentation: clarifies unread-container-log loss (including short-lived pods), switches operational metrics from events/sec to bytes/sec, expands rotation/other-log controls and limitations, restructures metrics guidance, warns about rotation backlog I/O, and replaces “alternatives” with a “Bad alternatives” section plus updated checklist.

Changes

High-volume log-loss guide

All changes are in docs/administration/high-volume-log-loss.adoc:

  • Core concepts / reader-facing clarifications: Clarifies forwarder wording to use “CPU and memory”; adds that logs can be lost when short-lived pods/jobs terminate before the collector reads their files; adds a NOTE distinguishing container logs from other log types.
  • Rotation details & assumptions: Adds a NOTE that CRI-O may compress rotated logs (.gz), which the collector cannot read; updates disk-sizing assumptions to treat rotated logs as uncompressed for sizing.
  • Operational units & forwarder effects: Changes operational rate units from logs/sec to bytes/sec; notes that ClusterLogForwarder drop/filter rules and collector CPU/memory throttling reduce effective write/read rates independently of remote store capacity.
  • Metrics restructuring & limitations: Replaces the prior CAUTION with NOTE/TIP blocks; clarifies that forwarder-added metadata affects byte vs. event size; adds a "Limitations" subsection explaining that log_logged_bytes_total covers only container logs (/var/log/pods) and excludes node/journald, Linux audit, and Kubernetes API audit logs.
  • PromQL & node sizing guidance: Adds a MaxNodeWriteRateBytes PromQL query for busiest-node worst-case sizing (an illustrative sketch follows this list); notes that node journal/audit logs can affect write/send comparisons depending on forwarder configuration.
  • Other log types and rotation controls: Expands “Other types of logs” with rotation/control sources for journald, Linux audit, and Kubernetes API audit logs; adds a NOTE on the extreme verbosity of unfiltered Kubernetes API audit logs and mentions the ClusterLogForwarder audit filter type.
  • Rotation caution & capacity planning: Adds a CAUTION that large rotation settings increase disk backlog and can cause heavy catch-up I/O affecting latency-sensitive workloads; updates capacity planning to size by busiest-node write rates and includes recovery/backlog considerations and the typical OpenShift node partition layout.
  • Bad alternatives & summary checklist: Replaces “Alternative (non)-solutions” with “Bad alternatives”, detailing why large forwarder buffers and persistent-volume buffers are problematic; updates the final checklist to include rotation-size/storage planning steps.
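
The actual MaxNodeWriteRateBytes query is not quoted in this summary. As a rough illustration only, a busiest-node query of the following shape could be used; the hostname label name and the 5-minute window are assumptions, not the article's definition.

  # Hypothetical sketch: highest per-node container-log write rate in bytes/sec,
  # averaged over a 5-minute window. Assumes log_logged_bytes_total carries a
  # node-identifying label named "hostname".
  max by (hostname) (rate(log_logged_bytes_total[5m]))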

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks: ✅ 11 passed | ❓ 1 inconclusive

❓ Inconclusive (1)
  • Title check: The title is vague and generic, using non-descriptive phrasing that doesn't convey meaningful information about the specific changes made. Resolution: replace it with a more specific title that highlights the main changes, such as "Clarify log loss mechanisms and update operational guidance" or "Expand log loss article with loss mechanisms and mitigation guidance".

✅ Passed (11)
  • Description check: The description includes key sections (description, links, reviewers/assignees) but lacks sufficient detail about the intent and rationale for the changes made.
  • Docstring Coverage: No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
  • Linked Issues check: Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: Check skipped because no linked issues were found for this pull request.
  • Stable And Deterministic Test Names: The custom check for stable and deterministic Ginkgo test names is not applicable to this pull request because it exclusively modifies the documentation file docs/administration/high-volume-log-loss.adoc, with no test files added or modified.
  • Test Structure And Quality: The PR contains only documentation updates to a single .adoc file with no Ginkgo test code, making this test quality check not applicable.
  • MicroShift Test Compatibility: The PR contains only documentation updates with no e2e tests added, making the MicroShift Test Compatibility check not applicable.
  • Single Node OpenShift (SNO) Test Compatibility: The PR contains only documentation updates to high-volume-log-loss.adoc with no new Ginkgo e2e test code.
  • Topology-Aware Scheduling Compatibility: The pull request contains only documentation updates to an AsciiDoc article explaining log loss in OpenShift clusters, with no code changes, YAML manifests, operator modifications, or scheduling constraints.
  • OTE Binary Stdout Contract: This PR contains only documentation changes to an AsciiDoc file with no executable code, so the OTE Binary Stdout Contract check does not apply.
  • IPv6 And Disconnected Network Test Compatibility: This PR is a documentation-only change that updates a single AsciiDoc file with clarifications and guidance about log loss management. The custom check for IPv6 and disconnected network compatibility is designed for new Ginkgo e2e tests; since this PR adds no test files and contains only documentation updates, the check is not applicable.


@qodo-code-review

Review Summary by Qodo

Comprehensive updates to high-volume log loss documentation

📝 Documentation


Walkthroughs

Description
• Clarifies log loss concepts and adds missing technical details
• Improves metrics documentation with better explanations and new queries
• Expands recommendations section with practical guidance on CPU/memory tuning
• Reorganizes and corrects information about other log types (journald, audit)
• Adds warnings about disk I/O impact and per-node variation in capacity planning
• Restructures "bad alternatives" section with clearer explanations of buffer limitations
Diagram
flowchart LR
  A["Log Loss Article"] -->|Clarifies concepts| B["Overview & Rotation"]
  A -->|Expands metrics| C["Metrics Documentation"]
  A -->|Adds guidance| D["Recommendations"]
  A -->|Improves explanations| E["Other Log Types"]
  A -->|Restructures| F["Bad Alternatives"]
  D -->|New section| G["Check Forwarder CPU/Memory"]
  F -->|Better clarity| H["Buffer Limitations"]


File Changes

1. docs/administration/high-volume-log-loss.adoc 📝 Documentation +146/-70

Comprehensive technical clarifications and expanded guidance

• Clarifies log loss mechanisms including short-lived pod scenarios and CRI-O compression effects
• Updates metrics section with improved log_logged_bytes_total explanation and adds MaxNodeWriteRateBytes query
• Expands "Check forwarder CPU and Memory" section with detailed troubleshooting guidance
• Reorganizes "Other types of logs" section with specific configuration details for journald, audit, and Kubernetes API logs
• Adds multiple NOTE and WARNING blocks highlighting per-node variation, disk I/O concerns, and buffer design mismatches
• Restructures "Alternative (non)-solutions" to "Bad alternatives" with clearer explanations of why buffers are insufficient
• Corrects terminology (e.g., "logs per second" to "bytes per second", "journal" to "journald")
• Improves summary with actionable steps and updated console navigation path




@qodo-code-review

qodo-code-review Bot commented May 4, 2026

Code Review by Qodo

🐞 Bugs (1) 📘 Rule violations (0)



Action required

1. Conflicting disk sizing guidance 🐞 Bug ≡ Correctness
Description
The new WARNING advises sizing rotation parameters based on the busiest nodes, but the later
disk-sizing section still states sizing is based on cluster-wide average write rates. This
contradiction can cause under-provisioned per-node /var/log space and continued log loss if readers
follow the cluster-average guidance.
Code

docs/administration/high-volume-log-loss.adoc[R229-236]

+[WARNING]
+====
+Cluster-wide averages can hide per-node variation.
+In practice, a small number of nodes often produce most of the log volume.
+Always size rotation parameters based on the _busiest nodes_, not cluster averages.
+
+Use `MaxNodeWriteRateBytes` (see <<Using metrics to measure log activity>>) to identify the worst-case node.
+====
Evidence
The document now explicitly instructs readers to size for busiest nodes, but later claims disk
sizing is based on cluster-wide averages; the worked example also uses the busiest-node write rate.
These statements cannot all be true at the same time without clarifying whether the calculations are
per-node vs cluster-total, and what each variable represents.

docs/administration/high-volume-log-loss.adoc[229-236]
docs/administration/high-volume-log-loss.adoc[269-275]
docs/administration/high-volume-log-loss.adoc[331-342]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The doc simultaneously (a) recommends sizing based on the busiest nodes and (b) states disk sizing is based on cluster-wide average write rates, while the example uses busiest-node rates. This creates conflicting instructions that can lead to incorrect capacity planning.

### Issue Context
This article is used to size node log rotation/disk usage to prevent log loss. The sizing section should be internally consistent about whether variables and formulas are **per-node** (node /var/log partition sizing) or **cluster-total** (sum across all nodes).

### Fix Focus Areas
- docs/administration/high-volume-log-loss.adoc[229-236]
- docs/administration/high-volume-log-loss.adoc[267-296]
- docs/administration/high-volume-log-loss.adoc[331-342]

### Suggested direction
- Either update the “Estimate total disk requirements” narrative/variables to be explicitly **per-node** (e.g., use busiest-node `MaxNodeWriteRateBytes` / “NodeWriteRateBytes”) and remove/adjust the “cluster-wide average” statement; **or**
- Keep cluster-wide calculations but add a clearly labeled, separate **per-node** sizing subsection (recommended, since node partitions are what actually fill), and make the example match the subsection it’s demonstrating.
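
To make the per-node framing concrete, a worked example of the kind of calculation the sizing section could show (the numbers are purely illustrative):

  MaxNodeWriteRateBytes (busiest node)      ≈ 5 MB/s
  Tolerated collector/store outage          ≈ 4 hours = 14,400 s
  Backlog to hold on that node's /var/log   ≈ 5 MB/s × 14,400 s ≈ 72 GB
  (plus headroom for other log types and a safety margin)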

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



@openshift-ci
Contributor

openshift-ci Bot commented May 4, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alanconway

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) May 4, 2026
Comment thread docs/administration/high-volume-log-loss.adoc
This update is to address the unresolved comments from Pull Request openshift#3166.
@openshift-ci
Contributor

openshift-ci Bot commented May 4, 2026

@alanconway: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Contributor

@jcantrill jcantrill left a comment


couple questions but overall lgtm

[NOTE]
====
CRI-O may compress rotated log files (`.gz`).
The collector cannot read compressed files — they are excluded from collection.
Contributor


Do we know this is a true statement, other than that we exclude them now? I'm not certain we have ever tested it.


Linux audit node logs:: The write-rate is total of all auditable actions on the node.
Rotation is controlled by `auditd`, which is configured by `/etc/auditd/auditd.conf`.
Linux audit node logs:: Rotation is controlled by `auditd`, configured in `/etc/audit/auditd.conf`.
Contributor


Is there a cluster config that controls the settings in this file? Can it be changed without rebuilding the image? If so, can we reference that here?


journald node logs:: The write-rate in is the total volume of logs from _local_ processes on the node.
Rotation is controlled by local `journald.conf` configuration files.
journald node logs:: Rotation is controlled by `journald.conf` configuration files.
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question here as below
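
For reference, rotation in those two files is driven by standard upstream settings; a hypothetical sketch of the relevant lines (values are examples, not recommendations):

  # /etc/systemd/journald.conf
  [Journal]
  SystemMaxUse=4G          # cap total disk used by the persistent journal
  SystemMaxFileSize=128M   # rotate individual journal files at this size
  MaxRetentionSec=1week    # discard entries older than this

  # /etc/audit/auditd.conf
  max_log_file = 100             # rotate audit.log at 100 MB
  num_logs = 5                   # number of rotated files to keep
  max_log_file_action = ROTATE   # rotate rather than suspend or halt

On OpenShift nodes such files are typically laid down via MachineConfig rather than edited in place; whether the article should reference that mechanism is the open question here.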

[CAUTION]
====
Large rotation settings mean more data accumulates on disk during outages.
When the collector catches up after an outage, reading a large backlog causes heavy disk I/O
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to mention anything here about receivers with ingest rate limits (i.e. Loki), where the burst may result in dropped logs and therefore they need to consider the config of the receiver?
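
For context, the receiver-side limits referred to here live in the store's own configuration. For a LokiStack deployment the relevant knobs look roughly like the following hypothetical sketch; field names and units should be verified against the installed LokiStack CRD, and the values are examples only.

  apiVersion: loki.grafana.com/v1
  kind: LokiStack
  metadata:
    name: logging-loki
    namespace: openshift-logging
  spec:
    # Abridged: only burst-related ingestion limits are shown.
    limits:
      global:
        ingestion:
          ingestionRate: 8         # per-tenant ingestion rate (MB/s), example value
          ingestionBurstSize: 16   # burst allowance above the rate (MB), example value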


Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • release/6.5
