Skip to content

OPCT-428: fix leaktk-gcs-filter archive redaction by improving leak scanner#223

Open
bshaw7 wants to merge 10 commits into
mainfrom
fix/OPCT-428-leaktk-redaction
Open

OPCT-428: fix leaktk-gcs-filter archive redaction by improving leak scanner#223
bshaw7 wants to merge 10 commits into
mainfrom
fix/OPCT-428-leaktk-redaction

Conversation

@bshaw7

@bshaw7 bshaw7 commented Jun 24, 2026

Copy link
Copy Markdown
Collaborator

Summary

leaktk-gcs-filter was replacing OPCT CI archives in GCS because PR #219's scanner missed sensitive data. Ran leaktk scan with Red Hat internal patterns against the CI archive from S3: 3157 findings before, 0 after this fix.

Fixes: OPCT-428

Changes

1. Add .tar.xz nested archive support (internal/cleaner/cleaner.go)

Must-gather archives use .tar.xz compression. The scanner only recursed into .tar.gz files, so .tar.xz archives (43 MB) were copied without scanning. Added ScanPatchTarXzReaderFor() using the xz library already in go.mod.

2. Increase scan size limit (internal/cleaner/leakscanner.go)

Changed maxLeakScanSize from 10MB to 100MB. The 60MB HTML/JSON test result files inside nested archives were skipping the scan entirely.

3. Add 4 new leak patterns (internal/cleaner/leakpatterns.go)

  • Generic Secret YAML (rnWF160pWNg): catches token:, secret:, clientSecret: in YAML
  • Authorization Header (QqS4RvI6Zmg): catches Authorization: Bearer/Basic headers
  • Broad JWT (RS256): catches OIDC/audience tokens beyond service account JWTs
  • Unescaped registry auth: catches pull secrets in unescaped JSON format

4. Remove files with encoded secrets (internal/cleaner/cleaner.go)

  • Remove machineconfigs.json (percent+base64 encoded pull secrets, leaktk decodes but OPCT does not)
  • Remove all packagemanifests.json (base64-encoded PEM keys, previously kept 1 copy)

Test Results

BEFORE: 3157 findings (leaktk + Red Hat patterns)
AFTER:     0 findings

Tested with leaktk v0.3.3 using Red Hat pattern server (patterns.security.redhat.com) against S3 archive s3://opct-archive/uploads/5.0.0-0.nightly-2026-06-10-003138-20260611-HighlyAvailable-aws-External.tar.gz

Test plan

  • go build succeeds
  • go test ./internal/cleaner/ — all tests pass
  • opct adm cleaner runs on CI archive, produces cleaned output
  • leaktk scan on cleaned output: 0 findings
  • CI periodic job produces archive not redacted by leaktk-gcs-filter

…canner

leaktk-gcs-filter was replacing OPCT CI archives in GCS with a 76-byte
placeholder because PR #219 scanner missed sensitive data. Scanned
with leaktk + Red Hat internal patterns: 3157 findings before, 0 after.

Fixes:
- Add .tar.xz nested archive support (must-gather uses xz compression)
- Increase maxLeakScanSize from 10MB to 100MB (60MB test result files
  inside nested archives were skipping scan)
- Add 4 new leak patterns: Generic Secret YAML, Authorization Header,
  broad JWT (RS256), unescaped container registry auth
- Remove machineconfigs.json (contains percent+base64 encoded pull
  secrets that require multi-layer decoding to detect)
- Remove all packagemanifests.json copies (contain base64-encoded PEM
  keys, previously kept 1 copy)

Tested by running leaktk scan with Red Hat pattern server
(patterns.security.redhat.com) against the CI archive from S3.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 24, 2026
@openshift-ci

openshift-ci Bot commented Jun 24, 2026

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign mtulio for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai

coderabbitai Bot commented Jun 24, 2026

Copy link
Copy Markdown

Review Change Stack

Warning

Review limit reached

@bshaw7, you've reached your PR review limit, so we couldn't start this review.

Next review available in: 1 minute

Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available.

How can I continue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews.

How do review limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please refer docs for additional details.

Review details
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: a19dbbca-b6b8-4447-98d9-cd3d64624829

📥 Commits

Reviewing files that changed from the base of the PR and between 91108f4 and bcaeb5f.

📒 Files selected for processing (2)
  • internal/cleaner/cleaner.go
  • internal/cleaner/leakscanner.go
📝 Walkthrough

Walkthrough

This PR updates RemoveFilePatternRules with machineconfig-related entries, changes large-file bypass handling and logging in processTarHeader, adds four leakPatterns entries, and marks golang.org/x/sync as indirect in go.mod.

Changes

Cleaner package updates

Layer / File(s) Summary
New leak detection patterns
internal/cleaner/leakpatterns.go
Adds broader RS256 JWT, unescaped JSON registry auth, YAML generic secret, and HTTP authorization header leak patterns.
RemoveFilePatternRules and large-file log adjustments
internal/cleaner/cleaner.go
Rewrites the packagemanifests rule, adds machineconfig-related removal rules, and changes large-file bypass logic and logging in processTarHeader.

Estimated code review effort: 3 (Moderate) | ~20 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title matches the main theme: fixing archive redaction by improving the leak scanner.
Description check ✅ Passed The description is clearly about the same OPCT-428 leak-scanner redaction fix and related archive scanning changes.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/OPCT-428-leaktk-redaction

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@bshaw7 bshaw7 marked this pull request as ready for review June 24, 2026 11:34
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 24, 2026
@openshift-ci openshift-ci Bot requested review from mtulio and vr4manta June 24, 2026 11:34

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/cleaner/cleaner.go`:
- Around line 150-152: The nested archive rebuild in
ScanPatchTarXzReaderFor/processTarHeader is writing gzip data back into members
that are still named as .tar.xz, which breaks the archive contract. Update the
archive reconstruction path to preserve the original compression format for
.tar.xz entries instead of using gzip.NewWriter, and make sure the bytes emitted
by processTarHeader match the member type that downstream readers will expect.
Also review the related tar handling in the referenced helper flow so all
.tar.xz members are rebuilt consistently.

In `@internal/cleaner/leakpatterns.go`:
- Around line 146-150: The Authorization Header pattern in the leak pattern list
is matching only the exact capitalized header name, so lowercase and mixed-case
variants are missed. Update the regexp in the pattern entry with ID QqS4RvI6Zmg
inside the leak patterns definition to make the header name match
case-insensitively while keeping the rest of the capture behavior unchanged.
- Around line 69-74: The registry auth leak pattern in leakpatterns.go is too
narrow because the Regex for opct-registry-auth-unescaped only matches dotted
hostnames, so valid auths entries using host:port or URL-style keys can evade
redaction. Update the regex in the registry auth pattern block to match broader
registry key formats while still targeting unescaped JSON auth entries, and keep
the existing Description/Keywords intact so the pattern remains identifiable.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 7b39a22c-099f-45b3-a1a8-9d9744b1cc21

📥 Commits

Reviewing files that changed from the base of the PR and between 246b7d3 and 7e72e6b.

📒 Files selected for processing (3)
  • internal/cleaner/cleaner.go
  • internal/cleaner/leakpatterns.go
  • internal/cleaner/leakscanner.go

Comment thread internal/cleaner/cleaner.go Outdated
Comment thread internal/cleaner/leakpatterns.go
Comment thread internal/cleaner/leakpatterns.go Outdated
@bshaw7

bshaw7 commented Jun 24, 2026

Copy link
Copy Markdown
Collaborator Author

/kind bug

@openshift-ci openshift-ci Bot added the kind/bug Categorizes issue or PR as related to a bug. label Jun 24, 2026
bshaw7 and others added 2 commits June 24, 2026 17:30
… patterns, remove machineconfig YAMLs

Review feedback from CodeRabbit and reviewdog:
- Fix tar.xz output: write back as xz (not gzip) to preserve archive format
- Make Authorization header pattern case-insensitive
- Broaden registry auth pattern to match host:port entries
- Fix gofmt: use short struct literal syntax in RemoveFilePatternRules
- Remove machineconfig YAML files from must-gather (contain encoded
  pull secrets inside nested tar.xz)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@bshaw7

bshaw7 commented Jun 24, 2026

Copy link
Copy Markdown
Collaborator Author

/jira refresh

Comment thread internal/cleaner/cleaner.go

@mtulio mtulio left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good to me, thanks! I just have one comment related to the new file skips, as well if you don't mind posting the total time of opct retrieve with/ and without this change in a real archive, if you have one. So that we can measure the expectations of and eventually adjust CI timeouts if it is increasing higher than tresholds.

Comment thread internal/cleaner/cleaner.go Outdated
Comment thread internal/cleaner/cleaner.go
Comment thread internal/cleaner/cleaner.go Outdated
Comment thread internal/cleaner/cleaner.go Outdated
@mtulio

mtulio commented Jun 24, 2026

Copy link
Copy Markdown
Collaborator

Hold until validating retrieve manually (unfortunately we don't have presubmits for it) and code review.
/hold

@openshift-ci openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 24, 2026
…ge files

Address Marco review feedback:
- Update comment from >10MB to >100MB to match new maxLeakScanSize
- Add log.Debugf when files are skipped due to size limit, showing
  file name and size for easier debugging

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
internal/cleaner/cleaner.go (1)

277-285: 🚀 Performance & Scalability | 🔵 Trivial | ⚡ Quick win

Avoid the extra full-buffer copy for nested .tar.xz payloads.

ScanPatchTarXzReaderFor already returns the cleaned archive and its size. Copying it into archiveBuf before writing doubles peak memory for large nested archives on this new path.

Proposed change
 		header.Size = int64(size)
-		archiveBuf := new(bytes.Buffer)
-		if _, err = io.Copy(archiveBuf, resp); err != nil {
-			return nil, err
-		}
 		if err := tarWriter.WriteHeader(header); err != nil {
 			return nil, fmt.Errorf("unable to write file header to new archive: %w", err)
 		}
-		if _, err := tarWriter.Write(archiveBuf.Bytes()); err != nil {
+		if _, err := io.Copy(tarWriter, resp); err != nil {
 			return nil, fmt.Errorf("unable to write file data to new archive: %w", err)
 		}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/cleaner/cleaner.go` around lines 277 - 285, The nested .tar.xz
handling in ScanPatchTarXzReaderFor is making an unnecessary full-buffer copy by
reading the cleaned archive into archiveBuf before writing it out. Update the
cleaner flow around the archive write path to stream the returned cleaned
archive directly into tarWriter instead of buffering it first, using the
existing size/result from ScanPatchTarXzReaderFor and the surrounding cleaner
logic in cleaner.go.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@internal/cleaner/cleaner.go`:
- Around line 277-285: The nested .tar.xz handling in ScanPatchTarXzReaderFor is
making an unnecessary full-buffer copy by reading the cleaned archive into
archiveBuf before writing it out. Update the cleaner flow around the archive
write path to stream the returned cleaned archive directly into tarWriter
instead of buffering it first, using the existing size/result from
ScanPatchTarXzReaderFor and the surrounding cleaner logic in cleaner.go.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: eb58ffc5-daad-4cff-be9a-62f4f7eca137

📥 Commits

Reviewing files that changed from the base of the PR and between 7e72e6b and a926af0.

📒 Files selected for processing (2)
  • internal/cleaner/cleaner.go
  • internal/cleaner/leakpatterns.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • internal/cleaner/leakpatterns.go

Remove nested archive scanning from opct retrieve cleaner since sensitive
data is now cleaned at source in the artifacts-collector plugin (PR #88
in provider-certification-plugins).

Removed:
- ScanPatchTarXzReaderFor() function and xz import
- .tar.xz handling in processTarHeader()
- maxLeakScanSize increase (revert 100MB to 10MB)

Kept (non-nested improvements):
- 4 new leak patterns (Generic Secret YAML, Authorization Header, JWT RS256,
  Registry Auth unescaped)
- machineconfigs file removal rules
- packagemanifests KeepCount change (1 to 0)
- Debug log for skipped large files

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@bshaw7

bshaw7 commented Jul 2, 2026

Copy link
Copy Markdown
Collaborator Author

Local tests shows leaktk findings are 0 while scanning reports

[root@localhost ~]# oc get pods -n opct -l sonobuoy-plugin=99-openshift-artifacts-collector -o jsonpath='{.items[*].spec.containers[*].image}'; echo
quay.io/rhn-support-bshaw/plugin-artifacts-collector:opct-428 quay.io/opct/sonobuoy:v0.57.3
[root@localhost ~]# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.22.3    True        False         98m     Cluster version is 4.22.3
[root@localhost ~]# 
[root@localhost test]# time /root/opct/build/opct-linux-amd64 retrieve
INFO[2026-07-02T06:26:34Z] Collecting results...                        
INFO[2026-07-02T06:26:35Z] Downloading archive from aggregator server... 
INFO[2026-07-02T06:27:02Z] Downloaded 224.3 MB in 27s                   
INFO[2026-07-02T06:27:02Z] Scanning archive for sensitive data...       
INFO[2026-07-02T06:27:42Z] Results saved to /root/test/opct_202607012041_ea0d0485-6949-485f-b138-01fd9836f4ba.tar.gz 
INFO[2026-07-02T06:27:42Z] Run 'opct report -s ./report <archive>.tar.gz' to review the validation results. 

real	1m10.076s
user	0m40.508s
sys	0m2.850s
[root@localhost test]# 


[root@localhost test]# leaktk scan -c ~/.config/leaktk/config.toml --kind Files /Users/bshaw/splat/OPCT/OPCT-428/opct_202607012041_ea0d0485-6949-485f-b138-01fd9836f4ba.tar.gz  |python3 -c "import sys,json; data=json.load(sys.stdin); print(f'Findings:{len(data.get(\"results\",[]))}')"
[INFO] queueing scan: id="D0YUvjO7uSU" queue_size=1
[INFO] starting scan: id="D0YUvjO7uSU"
[INFO] queueing response: id="D0YUvjO7uSU" queue_size=1
Findings:0
[root@localhost test]# 

Now findings shows 0 in local testing

Comment thread internal/cleaner/cleaner.go
bshaw7 and others added 2 commits July 3, 2026 13:23
Change KeepCount from 0 to 1 for machineconfigs removal rules so one
cleaned copy remains in the archive for auditing purposes, instead of
removing all instances.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Keep one cleaned sample for audit instead of removing all.
The scanner obfuscates sensitive data in the kept copy.
bshaw7 added 3 commits July 3, 2026 23:50
…iles

- machineconfigs KeepCount back to 0 (remove all, scanner cannot fully
  redact registry auth in these files)
- packagemanifests stays KeepCount=1 (original behavior, per reviewer)
- Files kept by removal rules now bypass the 10MB size limit and are
  always scanned, ensuring kept copies are properly redacted
The size gating already happens in processTarHeader. The duplicate
check in ScanAndRedactLeaks and ScanContentForLeaks was preventing
kept files (bypassed size check in processTarHeader) from being
scanned, leaving private keys unredacted in packagemanifests.json.
Scanner cannot reliably redact multi-line private keys in the
11.4MB packagemanifests file. Remove it entirely as agreed
with reviewer — it's operator catalog metadata not needed
for OPCT validation.
@bshaw7

bshaw7 commented Jul 3, 2026

Copy link
Copy Markdown
Collaborator Author

There is 0 Findings post leaktk scan

[root@localhost test]#  time /root/opct/build/opct-linux-amd64 retrieve
INFO[2026-07-03T19:21:31Z] Collecting results...                        
INFO[2026-07-03T19:21:32Z] Downloading archive from aggregator server... 
INFO[2026-07-03T19:22:04Z] Downloaded 177.7 MB in 32s                   
INFO[2026-07-03T19:22:04Z] Scanning archive for sensitive data...       
INFO[2026-07-03T19:22:41Z] Results saved to /root/test/opct_202607030938_d621a366-53a8-4677-9531-24422731d4e0.tar.gz 
INFO[2026-07-03T19:22:41Z] Run 'opct report -s ./report <archive>.tar.gz' to review the validation results. 

real	1m11.651s
user	0m37.236s
sys	0m2.184s
[root@localhost test]# 
bshaw-mac:~ bshaw$  leaktk scan -c ~/.config/leaktk/config.toml --kind Files opct_202607030938_d621a366-53a8-4677-9531-24422731d4e0.tar.gz | python3 -c "import sys,json; data=json.load(sys.stdin); print(f'Findings:{len(data.get(\"results\",[]))}')"
[INFO] queueing scan: id="1T5yJCR2-24" queue_size=1
[INFO] starting scan: id="1T5yJCR2-24"
[INFO] queueing response: id="1T5yJCR2-24" queue_size=1
Findings:0
bshaw-mac:~ bshaw$ 

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/bug Categorizes issue or PR as related to a bug.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants