Perf/secrets scanner performance #354

Open
shreelshah12 wants to merge 2 commits into release/0.9.0 from perf/secrets_scanner_performance
@shreelshah12

Contributor

Description

The scanner now runs TruffleHog once per chunk of notebooks instead of spawning two subprocesses per file. It parallelizes notebook and cluster I/O, scans workspaces concurrently, and replaces the fixed 10s inter-page and per-cluster sleeps with backoff that triggers only on HTTP 429. In the 403 fallback it skips the per-notebook get-status call when the list response already carries modified_at. This PR also raises the inner notebook.run timeout from 1h to 4h and adds an 8h job-level timeout.
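For reviewers, a minimal sketch of what 429-only exponential backoff can look like. It is an illustration under assumptions, not the code in this PR: `RateLimited`, `call_with_backoff`, and every parameter name are hypothetical, and the caller is assumed to raise `RateLimited` when the API answers with HTTP 429.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical error a caller-supplied request function raises on HTTP 429."""


def call_with_backoff(request, max_retries=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call request(); retry only when it raises RateLimited.

    Waits base_delay * 2**attempt seconds (capped at max_delay) plus up to
    10% jitter between attempts. Any other exception propagates immediately,
    and a successful call never sleeps at all.
    """
    for attempt in range(max_retries + 1):
        try:
            return request()
        except RateLimited:
            if attempt == max_retries:
                raise  # out of retries: surface the rate-limit error
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(delay + random.uniform(0, delay * 0.1))
```

Unlike a fixed sleep between pages, the wait here is paid only after the server has actually signalled throttling.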

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation Update (non-code changes like README or docs)

…xed sleeps)

The SAT Secrets Scanner could run ~14h and time out on large or
locked-down accounts. Address the bottlenecks:

- Batch TruffleHog: materialize a chunk of notebooks and scan the whole
  chunk in one invocation instead of two subprocesses per notebook.
- Parallelize I/O: thread-pool the 403-fallback workspace traversal,
  notebook export/FUSE copy, and per-cluster config fetch + scan.
- Skip the per-leaf get-status in the workspace/list fallback when the
  list response already carries modified_at; otherwise fetch in parallel.
- Remove the unconditional 10s inter-page and per-cluster sleeps; rate
  limiting is now reactive (429-only exponential backoff with retry).
- Scan workspaces concurrently and run notebook+cluster scans in
  parallel; run_ids are pre-allocated sequentially to avoid racing on
  SELECT max(runID). Configurable via secrets_max_parallel_workspaces.
- Raise inner notebook.run timeout 1h -> 4h and add an 8h job-level
  timeout (DABS + Terraform) so runs finish or fail fast.
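
The run_id pre-allocation described above could be sketched roughly as follows. This is an assumption-laden illustration rather than the PR's implementation: `scan_workspaces`, `scan_one`, `last_run_id`, and `max_parallel` are hypothetical names, with `last_run_id` standing in for a single read of max(runID) and `max_parallel` playing the role of secrets_max_parallel_workspaces.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_workspaces(workspaces, scan_one, last_run_id, max_parallel=4):
    """Assign run IDs up front, then scan the workspaces in a thread pool.

    last_run_id is read once before any worker starts. scan_one(workspace,
    run_id) is a hypothetical per-workspace scan function.
    """
    # Sequential allocation: no two workers can ever observe the same max(runID).
    run_ids = {ws: last_run_id + i for i, ws in enumerate(workspaces, start=1)}

    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = {ws: pool.submit(scan_one, ws, run_ids[ws]) for ws in workspaces}
        # result() re-raises any exception thrown inside a worker.
        return {ws: fut.result() for ws, fut in futures.items()}
```

Because the IDs are fixed before the pool starts, the order in which workers finish has no effect on which run_id a workspace gets.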

Co-authored-by: Isaac
…llisions

When multiple workspaces are scanned concurrently, every child notebook
shares the driver-local filesystem. A fixed shared SCAN_BATCH_DIR let one
workspace's rmtree/writes clobber another's in-flight batch and cross-wire
finding attribution. Use tempfile.mkdtemp() per chunk and clean it up in a
finally block so concurrent and sequential scans can never collide.
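
A minimal sketch of the per-chunk temp directory pattern, assuming the chunk is a mapping of file names to notebook source. `scan_chunk` and `scan_dir` are hypothetical names; the real scanner invocation is not shown here.

```python
import os
import shutil
import tempfile


def scan_chunk(notebooks, scan_dir):
    """Write one chunk of notebooks to a private temp dir, scan it, clean up.

    notebooks maps a file name to its source text. scan_dir(path) is a
    hypothetical callable that runs the scanner over a directory.
    """
    batch_dir = tempfile.mkdtemp(prefix="secrets_scan_")  # unique per chunk
    try:
        for name, source in notebooks.items():
            with open(os.path.join(batch_dir, name), "w", encoding="utf-8") as fh:
                fh.write(source)
        return scan_dir(batch_dir)
    finally:
        # Runs on success and on error, so no batch directory is left behind.
        shutil.rmtree(batch_dir, ignore_errors=True)
```

Each call gets its own directory from mkdtemp, so one scan's cleanup cannot remove files another scan is still reading.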

Co-authored-by: Isaac