OPCT-428: fix leaktk-gcs-filter archive redaction by improving leak scanner by bshaw7 · Pull Request #223 · redhat-openshift-ecosystem/opct

bshaw7 · 2026-06-24T11:27:53Z

Summary

leaktk-gcs-filter was replacing OPCT CI archives in GCS because PR #219's scanner missed sensitive data. Ran leaktk scan with Red Hat internal patterns against the CI archive from S3: 3157 findings before, 0 after this fix.

Fixes: OPCT-428

Changes

1. Add .tar.xz nested archive support (internal/cleaner/cleaner.go)

Must-gather archives use .tar.xz compression. The scanner only recursed into .tar.gz files, so .tar.xz archives (43 MB) were copied without scanning. Added ScanPatchTarXzReaderFor() using the xz library already in go.mod.

2. Increase scan size limit (internal/cleaner/leakscanner.go)

Changed maxLeakScanSize from 10MB to 100MB. The 60MB HTML/JSON test result files inside nested archives were skipping the scan entirely.

3. Add 4 new leak patterns (internal/cleaner/leakpatterns.go)

Generic Secret YAML (rnWF160pWNg): catches token:, secret:, clientSecret: in YAML
Authorization Header (QqS4RvI6Zmg): catches Authorization: Bearer/Basic headers
Broad JWT (RS256): catches OIDC/audience tokens beyond service account JWTs
Unescaped registry auth: catches pull secrets in unescaped JSON format

4. Remove files with encoded secrets (internal/cleaner/cleaner.go)

Remove machineconfigs.json (percent+base64 encoded pull secrets, leaktk decodes but OPCT does not)
Remove all packagemanifests.json (base64-encoded PEM keys, previously kept 1 copy)

Test Results

BEFORE: 3157 findings (leaktk + Red Hat patterns)
AFTER:     0 findings

Tested with leaktk v0.3.3 using Red Hat pattern server (patterns.security.redhat.com) against S3 archive s3://opct-archive/uploads/5.0.0-0.nightly-2026-06-10-003138-20260611-HighlyAvailable-aws-External.tar.gz

Test plan

go build succeeds
go test ./internal/cleaner/ — all tests pass
opct adm cleaner runs on CI archive, produces cleaned output
leaktk scan on cleaned output: 0 findings
CI periodic job produces archive not redacted by leaktk-gcs-filter

…canner leaktk-gcs-filter was replacing OPCT CI archives in GCS with a 76-byte placeholder because PR #219 scanner missed sensitive data. Scanned with leaktk + Red Hat internal patterns: 3157 findings before, 0 after. Fixes: - Add .tar.xz nested archive support (must-gather uses xz compression) - Increase maxLeakScanSize from 10MB to 100MB (60MB test result files inside nested archives were skipping scan) - Add 4 new leak patterns: Generic Secret YAML, Authorization Header, broad JWT (RS256), unescaped container registry auth - Remove machineconfigs.json (contains percent+base64 encoded pull secrets that require multi-layer decoding to detect) - Remove all packagemanifests.json copies (contain base64-encoded PEM keys, previously kept 1 copy) Tested by running leaktk scan with Red Hat pattern server (patterns.security.redhat.com) against the CI archive from S3. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

openshift-ci · 2026-06-24T11:27:58Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign mtulio for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai · 2026-06-24T11:28:04Z

Warning

Review limit reached

@bshaw7, you've reached your PR review limit, so we couldn't start this review.

Next review available in: 1 minute

Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available.

How can I continue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews.

How do review limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please refer docs for additional details.

Review details

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: a19dbbca-b6b8-4447-98d9-cd3d64624829

📥 Commits

Reviewing files that changed from the base of the PR and between 91108f4 and bcaeb5f.

📒 Files selected for processing (2)

internal/cleaner/cleaner.go
internal/cleaner/leakscanner.go

📝 Walkthrough

Walkthrough

This PR updates RemoveFilePatternRules with machineconfig-related entries, changes large-file bypass handling and logging in processTarHeader, adds four leakPatterns entries, and marks golang.org/x/sync as indirect in go.mod.

Changes

Cleaner package updates

Layer / File(s)	Summary
New leak detection patterns `internal/cleaner/leakpatterns.go`	Adds broader RS256 JWT, unescaped JSON registry auth, YAML generic secret, and HTTP authorization header leak patterns.
RemoveFilePatternRules and large-file log adjustments `internal/cleaner/cleaner.go`	Rewrites the packagemanifests rule, adds machineconfig-related removal rules, and changes large-file bypass logic and logging in `processTarHeader`.

Estimated code review effort: 3 (Moderate) | ~20 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title matches the main theme: fixing archive redaction by improving the leak scanner.
Description check	✅ Passed	The description is clearly about the same OPCT-428 leak-scanner redaction fix and related archive scanning changes.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/OPCT-428-leaktk-redaction

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/cleaner/cleaner.go`:
- Around line 150-152: The nested archive rebuild in
ScanPatchTarXzReaderFor/processTarHeader is writing gzip data back into members
that are still named as .tar.xz, which breaks the archive contract. Update the
archive reconstruction path to preserve the original compression format for
.tar.xz entries instead of using gzip.NewWriter, and make sure the bytes emitted
by processTarHeader match the member type that downstream readers will expect.
Also review the related tar handling in the referenced helper flow so all
.tar.xz members are rebuilt consistently.

In `@internal/cleaner/leakpatterns.go`:
- Around line 146-150: The Authorization Header pattern in the leak pattern list
is matching only the exact capitalized header name, so lowercase and mixed-case
variants are missed. Update the regexp in the pattern entry with ID QqS4RvI6Zmg
inside the leak patterns definition to make the header name match
case-insensitively while keeping the rest of the capture behavior unchanged.
- Around line 69-74: The registry auth leak pattern in leakpatterns.go is too
narrow because the Regex for opct-registry-auth-unescaped only matches dotted
hostnames, so valid auths entries using host:port or URL-style keys can evade
redaction. Update the regex in the registry auth pattern block to match broader
registry key formats while still targeting unescaped JSON auth entries, and keep
the existing Description/Keywords intact so the pattern remains identifiable.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 7b39a22c-099f-45b3-a1a8-9d9744b1cc21

📥 Commits

Reviewing files that changed from the base of the PR and between 246b7d3 and 7e72e6b.

📒 Files selected for processing (3)

internal/cleaner/cleaner.go
internal/cleaner/leakpatterns.go
internal/cleaner/leakscanner.go

bshaw7 · 2026-06-24T11:42:14Z

/kind bug

… patterns, remove machineconfig YAMLs Review feedback from CodeRabbit and reviewdog: - Fix tar.xz output: write back as xz (not gzip) to preserve archive format - Make Authorization header pattern case-insensitive - Broaden registry auth pattern to match host:port entries - Fix gofmt: use short struct literal syntax in RemoveFilePatternRules - Remove machineconfig YAML files from must-gather (contain encoded pull secrets inside nested tar.xz) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

bshaw7 · 2026-06-24T16:44:33Z

/jira refresh

mtulio

Overall looks good to me, thanks! I just have one comment related to the new file skips, as well if you don't mind posting the total time of opct retrieve with/ and without this change in a real archive, if you have one. So that we can measure the expectations of and eventually adjust CI timeouts if it is increasing higher than tresholds.

mtulio · 2026-06-24T18:23:40Z

Hold until validating retrieve manually (unfortunately we don't have presubmits for it) and code review.
/hold

…ge files Address Marco review feedback: - Update comment from >10MB to >100MB to match new maxLeakScanSize - Add log.Debugf when files are skipped due to size limit, showing file name and size for easier debugging Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

coderabbitai

🧹 Nitpick comments (1)

internal/cleaner/cleaner.go (1)

277-285: 🚀 Performance & Scalability | 🔵 Trivial | ⚡ Quick win

Avoid the extra full-buffer copy for nested .tar.xz payloads.

ScanPatchTarXzReaderFor already returns the cleaned archive and its size. Copying it into archiveBuf before writing doubles peak memory for large nested archives on this new path.

Proposed change

 		header.Size = int64(size)
-		archiveBuf := new(bytes.Buffer)
-		if _, err = io.Copy(archiveBuf, resp); err != nil {
-			return nil, err
-		}
 		if err := tarWriter.WriteHeader(header); err != nil {
 			return nil, fmt.Errorf("unable to write file header to new archive: %w", err)
 		}
-		if _, err := tarWriter.Write(archiveBuf.Bytes()); err != nil {
+		if _, err := io.Copy(tarWriter, resp); err != nil {
 			return nil, fmt.Errorf("unable to write file data to new archive: %w", err)
 		}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/cleaner/cleaner.go` around lines 277 - 285, The nested .tar.xz
handling in ScanPatchTarXzReaderFor is making an unnecessary full-buffer copy by
reading the cleaned archive into archiveBuf before writing it out. Update the
cleaner flow around the archive write path to stream the returned cleaned
archive directly into tarWriter instead of buffering it first, using the
existing size/result from ScanPatchTarXzReaderFor and the surrounding cleaner
logic in cleaner.go.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@internal/cleaner/cleaner.go`:
- Around line 277-285: The nested .tar.xz handling in ScanPatchTarXzReaderFor is
making an unnecessary full-buffer copy by reading the cleaned archive into
archiveBuf before writing it out. Update the cleaner flow around the archive
write path to stream the returned cleaned archive directly into tarWriter
instead of buffering it first, using the existing size/result from
ScanPatchTarXzReaderFor and the surrounding cleaner logic in cleaner.go.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: eb58ffc5-daad-4cff-be9a-62f4f7eca137

📥 Commits

Reviewing files that changed from the base of the PR and between 7e72e6b and a926af0.

📒 Files selected for processing (2)

internal/cleaner/cleaner.go
internal/cleaner/leakpatterns.go

🚧 Files skipped from review as they are similar to previous changes (1)

internal/cleaner/leakpatterns.go

Remove nested archive scanning from opct retrieve cleaner since sensitive data is now cleaned at source in the artifacts-collector plugin (PR #88 in provider-certification-plugins). Removed: - ScanPatchTarXzReaderFor() function and xz import - .tar.xz handling in processTarHeader() - maxLeakScanSize increase (revert 100MB to 10MB) Kept (non-nested improvements): - 4 new leak patterns (Generic Secret YAML, Authorization Header, JWT RS256, Registry Auth unescaped) - machineconfigs file removal rules - packagemanifests KeepCount change (1 to 0) - Debug log for skipped large files Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

bshaw7 · 2026-07-02T06:33:41Z

Local tests shows leaktk findings are 0 while scanning reports

[root@localhost ~]# oc get pods -n opct -l sonobuoy-plugin=99-openshift-artifacts-collector -o jsonpath='{.items[*].spec.containers[*].image}'; echo
quay.io/rhn-support-bshaw/plugin-artifacts-collector:opct-428 quay.io/opct/sonobuoy:v0.57.3
[root@localhost ~]# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.22.3    True        False         98m     Cluster version is 4.22.3
[root@localhost ~]#

[root@localhost test]# time /root/opct/build/opct-linux-amd64 retrieve
INFO[2026-07-02T06:26:34Z] Collecting results...                        
INFO[2026-07-02T06:26:35Z] Downloading archive from aggregator server... 
INFO[2026-07-02T06:27:02Z] Downloaded 224.3 MB in 27s                   
INFO[2026-07-02T06:27:02Z] Scanning archive for sensitive data...       
INFO[2026-07-02T06:27:42Z] Results saved to /root/test/opct_202607012041_ea0d0485-6949-485f-b138-01fd9836f4ba.tar.gz 
INFO[2026-07-02T06:27:42Z] Run 'opct report -s ./report <archive>.tar.gz' to review the validation results. 

real	1m10.076s
user	0m40.508s
sys	0m2.850s
[root@localhost test]# 


[root@localhost test]# leaktk scan -c ~/.config/leaktk/config.toml --kind Files /Users/bshaw/splat/OPCT/OPCT-428/opct_202607012041_ea0d0485-6949-485f-b138-01fd9836f4ba.tar.gz  |python3 -c "import sys,json; data=json.load(sys.stdin); print(f'Findings:{len(data.get(\"results\",[]))}')"
[INFO] queueing scan: id="D0YUvjO7uSU" queue_size=1
[INFO] starting scan: id="D0YUvjO7uSU"
[INFO] queueing response: id="D0YUvjO7uSU" queue_size=1
Findings:0
[root@localhost test]#

Now findings shows 0 in local testing

Change KeepCount from 0 to 1 for machineconfigs removal rules so one cleaned copy remains in the archive for auditing purposes, instead of removing all instances. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Keep one cleaned sample for audit instead of removing all. The scanner obfuscates sensitive data in the kept copy.

…iles - machineconfigs KeepCount back to 0 (remove all, scanner cannot fully redact registry auth in these files) - packagemanifests stays KeepCount=1 (original behavior, per reviewer) - Files kept by removal rules now bypass the 10MB size limit and are always scanned, ensuring kept copies are properly redacted

The size gating already happens in processTarHeader. The duplicate check in ScanAndRedactLeaks and ScanContentForLeaks was preventing kept files (bypassed size check in processTarHeader) from being scanned, leaving private keys unredacted in packagemanifests.json.

Scanner cannot reliably redact multi-line private keys in the 11.4MB packagemanifests file. Remove it entirely as agreed with reviewer — it's operator catalog metadata not needed for OPCT validation.

bshaw7 · 2026-07-03T19:25:43Z

There is 0 Findings post leaktk scan

[root@localhost test]#  time /root/opct/build/opct-linux-amd64 retrieve
INFO[2026-07-03T19:21:31Z] Collecting results...                        
INFO[2026-07-03T19:21:32Z] Downloading archive from aggregator server... 
INFO[2026-07-03T19:22:04Z] Downloaded 177.7 MB in 32s                   
INFO[2026-07-03T19:22:04Z] Scanning archive for sensitive data...       
INFO[2026-07-03T19:22:41Z] Results saved to /root/test/opct_202607030938_d621a366-53a8-4677-9531-24422731d4e0.tar.gz 
INFO[2026-07-03T19:22:41Z] Run 'opct report -s ./report <archive>.tar.gz' to review the validation results. 

real	1m11.651s
user	0m37.236s
sys	0m2.184s
[root@localhost test]#

bshaw-mac:~ bshaw$  leaktk scan -c ~/.config/leaktk/config.toml --kind Files opct_202607030938_d621a366-53a8-4677-9531-24422731d4e0.tar.gz | python3 -c "import sys,json; data=json.load(sys.stdin); print(f'Findings:{len(data.get(\"results\",[]))}')"
[INFO] queueing scan: id="1T5yJCR2-24" queue_size=1
[INFO] starting scan: id="1T5yJCR2-24"
[INFO] queueing response: id="1T5yJCR2-24" queue_size=1
Findings:0
bshaw-mac:~ bshaw$

openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 24, 2026

bshaw7 marked this pull request as ready for review June 24, 2026 11:34

openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 24, 2026

openshift-ci Bot requested review from mtulio and vr4manta June 24, 2026 11:34

coderabbitai Bot reviewed Jun 24, 2026

View reviewed changes

Comment thread internal/cleaner/cleaner.go Outdated

Comment thread internal/cleaner/leakpatterns.go

Comment thread internal/cleaner/leakpatterns.go Outdated

openshift-ci Bot added the kind/bug Categorizes issue or PR as related to a bug. label Jun 24, 2026

bshaw7 and others added 2 commits June 24, 2026 17:30

OPCT-428: fix staticcheck lint — use raw string for regex

ed2fb1b