OPCT-428: fix leaktk-gcs-filter archive redaction by improving leak scanner#223
OPCT-428: fix leaktk-gcs-filter archive redaction by improving leak scanner#223bshaw7 wants to merge 10 commits into
Conversation
…canner leaktk-gcs-filter was replacing OPCT CI archives in GCS with a 76-byte placeholder because PR #219 scanner missed sensitive data. Scanned with leaktk + Red Hat internal patterns: 3157 findings before, 0 after. Fixes: - Add .tar.xz nested archive support (must-gather uses xz compression) - Increase maxLeakScanSize from 10MB to 100MB (60MB test result files inside nested archives were skipping scan) - Add 4 new leak patterns: Generic Secret YAML, Authorization Header, broad JWT (RS256), unescaped container registry auth - Remove machineconfigs.json (contains percent+base64 encoded pull secrets that require multi-layer decoding to detect) - Remove all packagemanifests.json copies (contain base64-encoded PEM keys, previously kept 1 copy) Tested by running leaktk scan with Red Hat pattern server (patterns.security.redhat.com) against the CI archive from S3. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Warning Review limit reached
Next review available in: 1 minute Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available. How can I continue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews. How do review limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window. Please refer docs for additional details. Review details⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughThis PR updates ChangesCleaner package updates
Estimated code review effort: 3 (Moderate) | ~20 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/cleaner/cleaner.go`:
- Around line 150-152: The nested archive rebuild in
ScanPatchTarXzReaderFor/processTarHeader is writing gzip data back into members
that are still named as .tar.xz, which breaks the archive contract. Update the
archive reconstruction path to preserve the original compression format for
.tar.xz entries instead of using gzip.NewWriter, and make sure the bytes emitted
by processTarHeader match the member type that downstream readers will expect.
Also review the related tar handling in the referenced helper flow so all
.tar.xz members are rebuilt consistently.
In `@internal/cleaner/leakpatterns.go`:
- Around line 146-150: The Authorization Header pattern in the leak pattern list
is matching only the exact capitalized header name, so lowercase and mixed-case
variants are missed. Update the regexp in the pattern entry with ID QqS4RvI6Zmg
inside the leak patterns definition to make the header name match
case-insensitively while keeping the rest of the capture behavior unchanged.
- Around line 69-74: The registry auth leak pattern in leakpatterns.go is too
narrow because the Regex for opct-registry-auth-unescaped only matches dotted
hostnames, so valid auths entries using host:port or URL-style keys can evade
redaction. Update the regex in the registry auth pattern block to match broader
registry key formats while still targeting unescaped JSON auth entries, and keep
the existing Description/Keywords intact so the pattern remains identifiable.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: 7b39a22c-099f-45b3-a1a8-9d9744b1cc21
📒 Files selected for processing (3)
internal/cleaner/cleaner.gointernal/cleaner/leakpatterns.gointernal/cleaner/leakscanner.go
|
/kind bug |
… patterns, remove machineconfig YAMLs Review feedback from CodeRabbit and reviewdog: - Fix tar.xz output: write back as xz (not gzip) to preserve archive format - Make Authorization header pattern case-insensitive - Broaden registry auth pattern to match host:port entries - Fix gofmt: use short struct literal syntax in RemoveFilePatternRules - Remove machineconfig YAML files from must-gather (contain encoded pull secrets inside nested tar.xz) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
/jira refresh |
There was a problem hiding this comment.
Overall looks good to me, thanks! I just have one comment related to the new file skips, as well if you don't mind posting the total time of opct retrieve with/ and without this change in a real archive, if you have one. So that we can measure the expectations of and eventually adjust CI timeouts if it is increasing higher than tresholds.
|
Hold until validating retrieve manually (unfortunately we don't have presubmits for it) and code review. |
…ge files Address Marco review feedback: - Update comment from >10MB to >100MB to match new maxLeakScanSize - Add log.Debugf when files are skipped due to size limit, showing file name and size for easier debugging Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
🧹 Nitpick comments (1)
internal/cleaner/cleaner.go (1)
277-285: 🚀 Performance & Scalability | 🔵 Trivial | ⚡ Quick winAvoid the extra full-buffer copy for nested
.tar.xzpayloads.
ScanPatchTarXzReaderForalready returns the cleaned archive and its size. Copying it intoarchiveBufbefore writing doubles peak memory for large nested archives on this new path.Proposed change
header.Size = int64(size) - archiveBuf := new(bytes.Buffer) - if _, err = io.Copy(archiveBuf, resp); err != nil { - return nil, err - } if err := tarWriter.WriteHeader(header); err != nil { return nil, fmt.Errorf("unable to write file header to new archive: %w", err) } - if _, err := tarWriter.Write(archiveBuf.Bytes()); err != nil { + if _, err := io.Copy(tarWriter, resp); err != nil { return nil, fmt.Errorf("unable to write file data to new archive: %w", err) }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/cleaner/cleaner.go` around lines 277 - 285, The nested .tar.xz handling in ScanPatchTarXzReaderFor is making an unnecessary full-buffer copy by reading the cleaned archive into archiveBuf before writing it out. Update the cleaner flow around the archive write path to stream the returned cleaned archive directly into tarWriter instead of buffering it first, using the existing size/result from ScanPatchTarXzReaderFor and the surrounding cleaner logic in cleaner.go.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@internal/cleaner/cleaner.go`:
- Around line 277-285: The nested .tar.xz handling in ScanPatchTarXzReaderFor is
making an unnecessary full-buffer copy by reading the cleaned archive into
archiveBuf before writing it out. Update the cleaner flow around the archive
write path to stream the returned cleaned archive directly into tarWriter
instead of buffering it first, using the existing size/result from
ScanPatchTarXzReaderFor and the surrounding cleaner logic in cleaner.go.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: eb58ffc5-daad-4cff-be9a-62f4f7eca137
📒 Files selected for processing (2)
internal/cleaner/cleaner.gointernal/cleaner/leakpatterns.go
🚧 Files skipped from review as they are similar to previous changes (1)
- internal/cleaner/leakpatterns.go
Remove nested archive scanning from opct retrieve cleaner since sensitive data is now cleaned at source in the artifacts-collector plugin (PR #88 in provider-certification-plugins). Removed: - ScanPatchTarXzReaderFor() function and xz import - .tar.xz handling in processTarHeader() - maxLeakScanSize increase (revert 100MB to 10MB) Kept (non-nested improvements): - 4 new leak patterns (Generic Secret YAML, Authorization Header, JWT RS256, Registry Auth unescaped) - machineconfigs file removal rules - packagemanifests KeepCount change (1 to 0) - Debug log for skipped large files Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Local tests shows leaktk findings are 0 while scanning reports Now findings shows 0 in local testing |
Change KeepCount from 0 to 1 for machineconfigs removal rules so one cleaned copy remains in the archive for auditing purposes, instead of removing all instances. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Keep one cleaned sample for audit instead of removing all. The scanner obfuscates sensitive data in the kept copy.
…iles - machineconfigs KeepCount back to 0 (remove all, scanner cannot fully redact registry auth in these files) - packagemanifests stays KeepCount=1 (original behavior, per reviewer) - Files kept by removal rules now bypass the 10MB size limit and are always scanned, ensuring kept copies are properly redacted
The size gating already happens in processTarHeader. The duplicate check in ScanAndRedactLeaks and ScanContentForLeaks was preventing kept files (bypassed size check in processTarHeader) from being scanned, leaving private keys unredacted in packagemanifests.json.
Scanner cannot reliably redact multi-line private keys in the 11.4MB packagemanifests file. Remove it entirely as agreed with reviewer — it's operator catalog metadata not needed for OPCT validation.
|
There is 0 Findings post leaktk scan |
Summary
leaktk-gcs-filter was replacing OPCT CI archives in GCS because PR #219's scanner missed sensitive data. Ran leaktk scan with Red Hat internal patterns against the CI archive from S3: 3157 findings before, 0 after this fix.
Fixes: OPCT-428
Changes
1. Add .tar.xz nested archive support (internal/cleaner/cleaner.go)
Must-gather archives use .tar.xz compression. The scanner only recursed into .tar.gz files, so .tar.xz archives (43 MB) were copied without scanning. Added ScanPatchTarXzReaderFor() using the xz library already in go.mod.
2. Increase scan size limit (internal/cleaner/leakscanner.go)
Changed maxLeakScanSize from 10MB to 100MB. The 60MB HTML/JSON test result files inside nested archives were skipping the scan entirely.
3. Add 4 new leak patterns (internal/cleaner/leakpatterns.go)
4. Remove files with encoded secrets (internal/cleaner/cleaner.go)
Test Results
Tested with leaktk v0.3.3 using Red Hat pattern server (patterns.security.redhat.com) against S3 archive s3://opct-archive/uploads/5.0.0-0.nightly-2026-06-10-003138-20260611-HighlyAvailable-aws-External.tar.gz
Test plan