
Updates to log loss article. #3267

Open

alanconway wants to merge 1 commit into openshift:master from alanconway:log-loss-2

Conversation

@alanconway
Contributor

@alanconway alanconway commented May 4, 2026

Description

Update the log loss article to address unresolved comments on the original PR:
#3166

/assign jcantrill
/cc Clee2691
/cc r2d2rnd

Links

fix: update log loss article to address comments.

This update is to address the unresolved comments from Pull Request #3166.

Summary by CodeRabbit

  • Documentation
    • Updated guidance on high-volume log loss, clarifying deletion from short-lived pods vs. rotation
    • Clarified forwarder guidance (CPU and memory) and noted that drop rules and collector throttling can reduce the effective write rate (an illustrative drop-filter sketch follows this list)
    • Switched delivery-rate terminology to bytes/sec; added a worst-case sizing metric to the PromQL guidance
    • Restructured the metrics section with Limitations (container-only metric coverage) and recovery/backlog considerations
    • Expanded rotation/audit log notes; added a CAUTION on large rotation settings and a “Bad alternatives” rationale
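
For readers unfamiliar with drop rules: they are ClusterLogForwarder filters that discard matching records before forwarding. The snippet below is a heavily abridged, hypothetical sketch of the general shape and is not part of this PR; the .level field, the drop-debug filter name, and the my-remote-store output are placeholders, and the exact schema should be checked against the installed ClusterLogForwarder API version.

  apiVersion: observability.openshift.io/v1
  kind: ClusterLogForwarder
  metadata:
    name: collector
    namespace: openshift-logging
  spec:
    # serviceAccount and output definitions omitted for brevity.
    filters:
      - name: drop-debug             # discard records whose level matches "debug"
        type: drop
        drop:
          - test:
              - field: .level
                matches: "debug"
    pipelines:
      - name: application-logs
        inputRefs: [application]
        filterRefs: [drop-debug]
        outputRefs: [my-remote-store] # placeholder output name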

@openshift-ci openshift-ci Bot requested review from Clee2691 and r2d2rnd May 4, 2026 16:33
@coderabbitai

coderabbitai Bot commented May 4, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 00b3fa42-2826-4f03-abb9-af083c553ada

📥 Commits

Reviewing files that changed from the base of the PR and between 6e4006f and d969ade.

📒 Files selected for processing (1)
  • docs/administration/high-volume-log-loss.adoc
✅ Files skipped from review due to trivial changes (1)
  • docs/administration/high-volume-log-loss.adoc

Walkthrough

Refines the high-volume log-loss documentation: clarifies unread-container-log loss (including short-lived pods), switches operational metrics from events/sec to bytes/sec, expands rotation/other-log controls and limitations, restructures metrics guidance, warns about rotation backlog I/O, and replaces “alternatives” with a “Bad alternatives” section plus updated checklist.

Changes

High-volume log-loss guide

All changes are in docs/administration/high-volume-log-loss.adoc:

  • Core concepts / reader-facing clarifications: Clarifies forwarder wording to use “CPU and memory”; adds that logs can be lost when short-lived pods/jobs terminate before the collector reads their files; adds a NOTE distinguishing container logs from other log types.
  • Rotation details & assumptions: Adds a NOTE that CRI-O may compress rotated logs (.gz), which the collector cannot read; updates disk-sizing assumptions to treat rotated logs as uncompressed for sizing.
  • Operational units & forwarder effects: Changes operational rate units from logs/sec to bytes/sec; notes that ClusterLogForwarder drop/filter rules and collector CPU/memory throttling reduce effective write/read rates independently of remote store capacity.
  • Metrics restructuring & limitations: Replaces the prior CAUTION with NOTE/TIP blocks; clarifies that forwarder-added metadata affects byte vs. event size; adds a "Limitations" subsection explaining that log_logged_bytes_total covers only container logs (/var/log/pods) and excludes node/journald, Linux audit, and Kubernetes API audit logs.
  • PromQL & node sizing guidance: Adds a MaxNodeWriteRateBytes PromQL query for busiest-node worst-case sizing (an illustrative sketch follows this list); notes that node journal/audit logs can affect write/send comparisons depending on forwarder configuration.
  • Other log types and rotation controls: Expands “Other types of logs” with rotation/control sources for journald, Linux audit, and Kubernetes API audit logs; adds a NOTE on the extreme verbosity of unfiltered Kubernetes API audit logs and mentions the ClusterLogForwarder audit filter type.
  • Rotation caution & capacity planning: Adds a CAUTION that large rotation settings increase disk backlog and can cause heavy catch-up I/O affecting latency-sensitive workloads; updates capacity planning to size by busiest-node write rates and includes recovery/backlog considerations and the typical OpenShift node partition layout.
  • Bad alternatives & summary checklist: Replaces “Alternative (non)-solutions” with “Bad alternatives”, detailing why large forwarder buffers and persistent-volume buffers are problematic; updates the final checklist to include rotation-size/storage planning steps.
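
The actual MaxNodeWriteRateBytes query is not quoted in this summary. As a rough illustration only, a busiest-node query of the following shape could be used; the hostname label name and the 5-minute window are assumptions, not the article's definition.

  # Hypothetical sketch: highest per-node container-log write rate in bytes/sec,
  # averaged over a 5-minute window. Assumes log_logged_bytes_total carries a
  # node-identifying label named "hostname".
  max by (hostname) (rate(log_logged_bytes_total[5m]))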

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks: ✅ 11 passed | ❓ 1 inconclusive

❓ Inconclusive (1)
  • Title check: The title is vague and generic, using non-descriptive phrasing that doesn't convey meaningful information about the specific changes made. Resolution: replace it with a more specific title that highlights the main changes, such as "Clarify log loss mechanisms and update operational guidance" or "Expand log loss article with loss mechanisms and mitigation guidance".

✅ Passed (11)
  • Description check: The description includes key sections (description, links, reviewers/assignees) but lacks sufficient detail about the intent and rationale for the changes made.
  • Docstring Coverage: No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
  • Linked Issues check: Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: Check skipped because no linked issues were found for this pull request.
  • Stable And Deterministic Test Names: The custom check for stable and deterministic Ginkgo test names is not applicable to this pull request because it exclusively modifies the documentation file docs/administration/high-volume-log-loss.adoc, with no test files added or modified.
  • Test Structure And Quality: The PR contains only documentation updates to a single .adoc file with no Ginkgo test code, making this test quality check not applicable.
  • MicroShift Test Compatibility: The PR contains only documentation updates with no e2e tests added, making the MicroShift Test Compatibility check not applicable.
  • Single Node OpenShift (SNO) Test Compatibility: The PR contains only documentation updates to high-volume-log-loss.adoc with no new Ginkgo e2e test code.
  • Topology-Aware Scheduling Compatibility: The pull request contains only documentation updates to an AsciiDoc article explaining log loss in OpenShift clusters, with no code changes, YAML manifests, operator modifications, or scheduling constraints.
  • OTE Binary Stdout Contract: This PR contains only documentation changes to an AsciiDoc file with no executable code, so the OTE Binary Stdout Contract check does not apply.
  • IPv6 And Disconnected Network Test Compatibility: This PR is a documentation-only change that updates a single AsciiDoc file with clarifications and guidance about log loss management. The custom check for IPv6 and disconnected network compatibility is designed for new Ginkgo e2e tests; since this PR adds no test files and contains only documentation updates, the check is not applicable.


@qodo-code-review

Review Summary by Qodo

Comprehensive updates to high-volume log loss documentation

📝 Documentation


Walkthroughs

Description
• Clarifies log loss concepts and adds missing technical details
• Improves metrics documentation with better explanations and new queries
• Expands recommendations section with practical guidance on CPU/memory tuning
• Reorganizes and corrects information about other log types (journald, audit)
• Adds warnings about disk I/O impact and per-node variation in capacity planning
• Restructures "bad alternatives" section with clearer explanations of buffer limitations
Diagram
flowchart LR
  A["Log Loss Article"] -->|Clarifies concepts| B["Overview & Rotation"]
  A -->|Expands metrics| C["Metrics Documentation"]
  A -->|Adds guidance| D["Recommendations"]
  A -->|Improves explanations| E["Other Log Types"]
  A -->|Restructures| F["Bad Alternatives"]
  D -->|New section| G["Check Forwarder CPU/Memory"]
  F -->|Better clarity| H["Buffer Limitations"]


File Changes

1. docs/administration/high-volume-log-loss.adoc 📝 Documentation +146/-70

Comprehensive technical clarifications and expanded guidance

• Clarifies log loss mechanisms including short-lived pod scenarios and CRI-O compression effects
• Updates metrics section with improved log_logged_bytes_total explanation and adds MaxNodeWriteRateBytes query
• Expands "Check forwarder CPU and Memory" section with detailed troubleshooting guidance
• Reorganizes "Other types of logs" section with specific configuration details for journald, audit, and Kubernetes API logs
• Adds multiple NOTE and WARNING blocks highlighting per-node variation, disk I/O concerns, and buffer design mismatches
• Restructures "Alternative (non)-solutions" to "Bad alternatives" with clearer explanations of why buffers are insufficient
• Corrects terminology (e.g., "logs per second" to "bytes per second", "journal" to "journald")
• Improves summary with actionable steps and updated console navigation path




@qodo-code-review

qodo-code-review Bot commented May 4, 2026

Code Review by Qodo

🐞 Bugs (1) 📘 Rule violations (0)



Action required

1. Conflicting disk sizing guidance 🐞 Bug ≡ Correctness
Description
The new WARNING advises sizing rotation parameters based on the busiest nodes, but the later
disk-sizing section still states sizing is based on cluster-wide average write rates. This
contradiction can cause under-provisioned per-node /var/log space and continued log loss if readers
follow the cluster-average guidance.
Code

docs/administration/high-volume-log-loss.adoc[R229-236]

+[WARNING]
+====
+Cluster-wide averages can hide per-node variation.
+In practice, a small number of nodes often produce most of the log volume.
+Always size rotation parameters based on the _busiest nodes_, not cluster averages.
+
+Use `MaxNodeWriteRateBytes` (see <<Using metrics to measure log activity>>) to identify the worst-case node.
+====
Evidence
The document now explicitly instructs readers to size for busiest nodes, but later claims disk
sizing is based on cluster-wide averages; the worked example also uses the busiest-node write rate.
These statements cannot all be true at the same time without clarifying whether the calculations are
per-node vs cluster-total, and what each variable represents.

docs/administration/high-volume-log-loss.adoc[229-236]
docs/administration/high-volume-log-loss.adoc[269-275]
docs/administration/high-volume-log-loss.adoc[331-342]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The doc simultaneously (a) recommends sizing based on the busiest nodes and (b) states disk sizing is based on cluster-wide average write rates, while the example uses busiest-node rates. This creates conflicting instructions that can lead to incorrect capacity planning.

### Issue Context
This article is used to size node log rotation/disk usage to prevent log loss. The sizing section should be internally consistent about whether variables and formulas are **per-node** (node /var/log partition sizing) or **cluster-total** (sum across all nodes).

### Fix Focus Areas
- docs/administration/high-volume-log-loss.adoc[229-236]
- docs/administration/high-volume-log-loss.adoc[267-296]
- docs/administration/high-volume-log-loss.adoc[331-342]

### Suggested direction
- Either update the “Estimate total disk requirements” narrative/variables to be explicitly **per-node** (e.g., use busiest-node `MaxNodeWriteRateBytes` / “NodeWriteRateBytes”) and remove/adjust the “cluster-wide average” statement; **or**
- Keep cluster-wide calculations but add a clearly labeled, separate **per-node** sizing subsection (recommended, since node partitions are what actually fill), and make the example match the subsection it’s demonstrating.
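
To make the per-node framing concrete, a worked example of the kind of calculation the sizing section could show (the numbers are purely illustrative):

  MaxNodeWriteRateBytes (busiest node)      ≈ 5 MB/s
  Tolerated collector/store outage          ≈ 4 hours = 14,400 s
  Backlog to hold on that node's /var/log   ≈ 5 MB/s × 14,400 s ≈ 72 GB
  (plus headroom for other log types and a safety margin)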

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



@openshift-ci
Contributor

openshift-ci Bot commented May 4, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alanconway

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) May 4, 2026
Comment thread docs/administration/high-volume-log-loss.adoc
This update is to address the unresolved comments from Pull Request openshift#3166.
@openshift-ci
Contributor

openshift-ci Bot commented May 4, 2026

@alanconway: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Contributor

@jcantrill jcantrill left a comment


couple questions but overall lgtm

[NOTE]
====
CRI-O may compress rotated log files (`.gz`).
The collector cannot read compressed files — they are excluded from collection.
Contributor


Do we know this is a true statement, other than that we exclude them now? I'm not certain we have ever tested it.


Linux audit node logs:: The write-rate is total of all auditable actions on the node.
Rotation is controlled by `auditd`, which is configured by `/etc/auditd/auditd.conf`.
Linux audit node logs:: Rotation is controlled by `auditd`, configured in `/etc/audit/auditd.conf`.
Contributor


Is there a cluster config that controls the settings in this file? Can it be changed without rebuilding the image? If so, can we reference that here?


journald node logs:: The write-rate in is the total volume of logs from _local_ processes on the node.
Rotation is controlled by local `journald.conf` configuration files.
journald node logs:: Rotation is controlled by `journald.conf` configuration files.
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question here as below
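
For reference, rotation in those two files is driven by standard upstream settings; a hypothetical sketch of the relevant lines (values are examples, not recommendations):

  # /etc/systemd/journald.conf
  [Journal]
  SystemMaxUse=4G          # cap total disk used by the persistent journal
  SystemMaxFileSize=128M   # rotate individual journal files at this size
  MaxRetentionSec=1week    # discard entries older than this

  # /etc/audit/auditd.conf
  max_log_file = 100             # rotate audit.log at 100 MB
  num_logs = 5                   # number of rotated files to keep
  max_log_file_action = ROTATE   # rotate rather than suspend or halt

On OpenShift nodes such files are typically laid down via MachineConfig rather than edited in place; whether the article should reference that mechanism is the open question here.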

[CAUTION]
====
Large rotation settings mean more data accumulates on disk during outages.
When the collector catches up after an outage, reading a large backlog causes heavy disk I/O
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to mention anything here about receivers with ingest rate limits (i.e. Loki), where the burst may result in dropped logs and therefore they need to consider the config of the receiver?
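
For context, the receiver-side limits referred to here live in the store's own configuration. For a LokiStack deployment the relevant knobs look roughly like the following hypothetical sketch; field names and units should be verified against the installed LokiStack CRD, and the values are examples only.

  apiVersion: loki.grafana.com/v1
  kind: LokiStack
  metadata:
    name: logging-loki
    namespace: openshift-logging
  spec:
    # Abridged: only burst-related ingestion limits are shown.
    limits:
      global:
        ingestion:
          ingestionRate: 8         # per-tenant ingestion rate (MB/s), example value
          ingestionBurstSize: 16   # burst allowance above the rate (MB), example value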


Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • release/6.5
