Skip to content

OCPBUGS-62269 - Fix race condition causing missing audit log entries for rapid commands#3052

Open
BhargaviGudi wants to merge 1 commit intokubernetes-sigs:mainfrom
BhargaviGudi:OCPBUGS-62269-fix
Open

OCPBUGS-62269 - Fix race condition causing missing audit log entries for rapid commands#3052
BhargaviGudi wants to merge 1 commit intokubernetes-sigs:mainfrom
BhargaviGudi:OCPBUGS-62269-fix

Conversation

@BhargaviGudi
Copy link
Copy Markdown
Contributor

@BhargaviGudi BhargaviGudi commented Nov 27, 2025

What type of PR is this?

/kind bug

What this PR does / why we need it:

This PR fixes a race condition in the JSON enricher that causes audit log entries to be missing or incomplete when commands are executed in rapid succession (e.g., touch test1 && touch test2 && touch test3).

Problem:
When short-lived processes execute quickly, audit events arrive at the JSON enricher before the corresponding BPF events have been processed from the ring buffer. This results in:

  • Empty cmdLine fields
  • Missing requestUID
  • Null resource information (pod/namespace/container)

Solution:
Implements a retry mechanism in dispatchSeccompLine() that attempts to fetch missing information multiple times with progressive delays (0ms, 10ms, 50ms):

  1. BPF Cache Retry: Retries cmdLine and requestUID lookup from BPF process cache, giving the BPF ring buffer poller time to process events
  2. Container Info Retry: Retries container information lookup if initially null

Since LogBuckets remain in cache for 60 seconds before being written to the audit log, there's ample time for retries to succeed while BPF events are processed asynchronously.

Testing Results:

  • Before fix: ~33% success rate for rapid consecutive commands
  • After fix: 100% success rate - all commands captured with complete information

Which issue(s) this PR fixes:

Fixes https://issues.redhat.com/browse/OCPBUGS-62269

Does this PR have test?

Yes - The fix was validated through:

  • Manual testing with rapid consecutive commands (oc exec and oc rsh)
  • Existing unit tests continue to pass
  • Build verification completed successfully

No new automated tests added as this is a timing-dependent race condition that's difficult to reliably reproduce in unit tests.

Special notes for your reviewer:

  • This fix is scoped exclusively to jsonenricher.go - no modifications to BPF components or ring buffer configuration
  • The retry mechanism only activates when information is missing, avoiding unnecessary delays for normal cases
  • Progressive delays (0ms → 10ms → 50ms) balance responsiveness with reliability
  • Early break on success minimizes latency impact

Does this PR introduce a user-facing change?

Fixed race condition in JSON enricher causing audit log entries to be missing or incomplete when commands are executed in rapid succession. Audit logs now reliably capture cmdLine, requestUID, and resource
information for all executed commands, even during bursts of activity.

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Nov 27, 2025
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: BhargaviGudi
Once this PR has been reviewed and has the lgtm label, please assign ccojocar for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Nov 27, 2025
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

Hi @BhargaviGudi. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Nov 27, 2025
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Nov 27, 2025

Codecov Report

❌ Patch coverage is 28.88889% with 32 lines in your changes missing coverage. Please review.
✅ Project coverage is 24.17%. Comparing base (11d77f4) to head (b75d4d8).
⚠️ Report is 1072 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff             @@
##             main    #3052       +/-   ##
===========================================
- Coverage   45.50%   24.17%   -21.33%     
===========================================
  Files          79      125       +46     
  Lines        7782    17822    +10040     
===========================================
+ Hits         3541     4309      +768     
- Misses       4099    13229     +9130     
- Partials      142      284      +142     
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@saschagrunert
Copy link
Copy Markdown
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Nov 27, 2025
@BhargaviGudi BhargaviGudi force-pushed the OCPBUGS-62269-fix branch 2 times, most recently from 96a2136 to 69ba88b Compare November 28, 2025 07:29
@BhargaviGudi BhargaviGudi marked this pull request as ready for review December 1, 2025 06:07
@BhargaviGudi BhargaviGudi changed the title WIP OCPBUGS-62269 - Fix race condition causing missing audit log entries for rapid commands OCPBUGS-62269 - Fix race condition causing missing audit log entries for rapid commands Dec 1, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 1, 2025
@BhargaviGudi
Copy link
Copy Markdown
Contributor Author

/retest-required

@saschagrunert
Copy link
Copy Markdown
Member

@BhargaviGudi do you mind a rebase?

e.retryBpfRequestUID(logBucket)

// Retry container info lookup if it's still missing
e.retryContainerInfo(ctx, logBucket, processID, nodeName)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BhargaviGudi Thanks for the PR. Can you try to test if e.retryContainerInfo(ctx, logBucket, processID, nodeName) alone can fix the issue without the retryBpfCmdLine and retryBpfRequestUID? I feel missing container info is the cause of the issue.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested with only retryContainerInfo as suggested, but it doesn't fix the issue for short-lived processes.

  • The short-lived processes exit so fast that by the time the LogBucket expires (30 seconds later), their /proc/ is gone and are missing from audit log

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ngopalak-redhat Could you please help me review the PR? Thanks

@BhargaviGudi
Copy link
Copy Markdown
Contributor Author

/test pull-security-profiles-operator-verify

@saschagrunert
Copy link
Copy Markdown
Member

@BhargaviGudi thank you for the PR! Please rebase to fix the CI.

…for rapid commands

OCPBUGS-62269 - Fix GitHub Actions linting issues

OCPBUGS-62269 - Fix GitHub Actions linting issues
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jan 13, 2026
@BhargaviGudi
Copy link
Copy Markdown
Contributor Author

@ngopalak-redhat Could you please help me review the PR? Thanks

@BhargaviGudi
Copy link
Copy Markdown
Contributor Author

/retitle OCPBUGS-62269: Fix race condition causing missing audit log entries for rapid commands

@k8s-ci-robot
Copy link
Copy Markdown
Contributor

@BhargaviGudi: Re-titling can only be requested by trusted users, like repository collaborators.

Details

In response to this:

/retitle OCPBUGS-62269: Fix race condition causing missing audit log entries for rapid commands

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@BhargaviGudi
Copy link
Copy Markdown
Contributor Author

/payload-job periodic-ci-openshift-openshift-tests-private-release-4.22-amd64-nightly-baremetalds-ipi-ovn-lvms-f14-security-profiles

Copy link
Copy Markdown
Member

@saschagrunert saschagrunert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The diagnosis is correct: short-lived processes with a single audit event miss BPF data because the ring buffer hasn't been polled yet. Retrying at eviction time is a reasonable place since 60 seconds is more than enough for BPF events to arrive. A few concerns and suggestions below.


for attempt := 0; attempt < maxRetries && logBucket.ProcessInfo.CmdLine == ""; attempt++ {
if attempt > 0 {
time.Sleep(retryDelays[attempt])
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dispatchSeccompLine is called from the OnEviction callback, and DeleteExpired() (line 274) runs every 30s from the Run() loop. Evictions are processed synchronously, so sleeping here (up to 60ms per retry function, 120ms total per entry) blocks all subsequent evictions and the main audit line processing loop. For N expired entries, that's N * 120ms of stall.

Consider dispatching into a goroutine so the callback returns immediately:

e.logLinesCache.OnEviction(
    func(ctx context.Context, reason ttlcache.EvictionReason, logItem *ttlcache.Item[int, *types.LogBucket]) {
        auditLogBucket := logItem.Value()
        go e.dispatchSeccompLine(ctx, auditLogBucket, logItem.Key(), nodeName)
    })

Comment on lines +434 to +435

break
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The loop condition already checks logBucket.ProcessInfo.CmdLine == "". After setting CmdLine to a non-empty value, the loop terminates on its own. The break is redundant.

Same applies to retryBpfRequestUID at line 462.

Suggested change
break

Comment on lines +446 to +447
maxRetries := 3
retryDelays := []time.Duration{0, 10 * time.Millisecond, 50 * time.Millisecond}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are identical to lines 421-422. Extract to a package-level variable to avoid the duplication:

var bpfRetryDelays = []time.Duration{0, 10 * time.Millisecond, 50 * time.Millisecond}

}
}

// retryContainerInfo retries fetching container info if it's missing.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unlike the BPF retry functions, this makes a single call with no delay or loop. The name retryContainerInfo is misleading. Either rename to something like refetchContainerInfo, or add actual retry logic consistent with the other two functions.


func (e *JsonEnricher) dispatchSeccompLine(
logBucket *types.LogBucket, nodeName string,
ctx context.Context, logBucket *types.LogBucket, processID int, nodeName string,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

processID is redundant since logBucket.ProcessInfo.Pid holds the same value (both sourced from auditLine.ProcessID). The BPF retry functions already use logBucket.ProcessInfo.Pid directly. Only retryContainerInfo uses this parameter, creating an inconsistency.

Comment on lines +520 to +525
// Retry BPF cache lookups with progressive delays to allow ring buffer polling to catch up
e.retryBpfCmdLine(logBucket)
e.retryBpfRequestUID(logBucket)

// Retry container info lookup if it's still missing
e.retryContainerInfo(ctx, logBucket, processID, nodeName)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function names already communicate intent. These comments restate what the code does.

Suggested change
// Retry BPF cache lookups with progressive delays to allow ring buffer polling to catch up
e.retryBpfCmdLine(logBucket)
e.retryBpfRequestUID(logBucket)
// Retry container info lookup if it's still missing
e.retryContainerInfo(ctx, logBucket, processID, nodeName)
e.retryBpfCmdLine(logBucket)
e.retryBpfRequestUID(logBucket)
e.retryContainerInfo(ctx, logBucket, processID, nodeName)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants