Skip to content

feat: parallelize container monitor dependency requests#6757

Merged
bgardiner merged 1 commit into
mainfrom
feat/parallelize-container-monitor
May 14, 2026
Merged

feat: parallelize container monitor dependency requests#6757
bgardiner merged 1 commit into
mainfrom
feat/parallelize-container-monitor

Conversation

@bgardiner
Copy link
Copy Markdown
Contributor

@bgardiner bgardiner commented Apr 29, 2026

What does this PR do?

Rewrites monitorDependencies in src/lib/ecosystems/monitor.ts to fan out per-ScanResult PUT /monitor-dependencies calls in parallel via pMap, bounded by the getRequestConcurrency() helper introduced in #6756 (default 5, user-overridable via SNYK_REQUEST_CONCURRENCY).

Effective change for users: container monitor goes from sequential (1 in-flight request) to 5 in-flight by default, with the same env-var override surface as snyk container test.

  • Per-ScanResult work is extracted into a small monitorOneScanResult helper for testability.
  • Outer loop over paths still runs sequentially (one path per monitor invocation is the common case anyway).
  • Inner loop is now pMap(scanResults, ..., { concurrency }).

This PR is based on feat/snyk-request-concurrency (#6756) so its diff shows only the monitor-flow changes.

Why?

snyk container monitor previously ran fully sequentially — one full RTT per ScanResult — because of a nested for...of await loop. For large images with many ScanResults (e.g. fat-JAR-heavy images that produce one ScanResult per directory of JARs in snyk-docker-plugin's groupJarFingerprintsByPath), this is the dominant wall-clock cost.

Benchmark — quay.io/wildfly/wildfly:34.0.1.Final-jdk21 (512 ScanResults)

Sequential baseline measured against the released CLI (snyk v1.1303.2, prior code path). PR configurations measured against the locally-built PR binary, varying only SNYK_REQUEST_CONCURRENCY. Re-monitoring updates existing project snapshots rather than creating duplicates.

Configuration Wall-clock vs sequential
sequential (snyk v1.1303.2, prior code) 481 s (8m 1s) 1.00×
default (c=5, this PR) 104 s 4.6× faster
override (c=10) 54 s 8.9× faster
override (c=20) 29 s 16.6× faster

Each PR run successfully updated all 512 snapshots. The sequential run produced 510 snapshots; possibly minor version drift between v1.1303.2 and the PR binary's bundled snyk-docker-plugin. No errors logged; doesn't affect the timing comparison.

The default-of-5 row is the new behavior users will see without any opt-in.

Error semantics preserved

  • 401 → still throws AuthFailedError (terminates the run).
  • Other 4xx → still throws MonitorError (terminates the run via pMap's default fail-fast on rejection).
  • 5xx and other non-4xx errors are accumulated per-ScanResult into the errors array, matching prior continue-on-error behavior.

Result ordering preserved

pMap returns results in input order regardless of completion order, so the formatted output remains deterministic.

Where should the reviewer start?

  • src/lib/ecosystems/monitor.tsmonitorDependencies rewrite + new monitorOneScanResult.
  • test/jest/unit/ecosystems-monitor-docker.spec.ts — 5 new tests in a parallelization of monitor-dependencies requests describe:
    • Concurrency cap honored at default 5.
    • SNYK_INTERNAL_REQUEST_CONCURRENCY=3 override respected (the internal env var the TS code reads; the user-facing SNYK_REQUEST_CONCURRENCY is plumbed through Go in feat: introduce SNYK_REQUEST_CONCURRENCY for dependency request parallelism #6756).
    • Result ordering preserved when responses arrive out of order.
    • 4xx fail-fast (throws MonitorError).
    • 5xx accumulation (continues, returns errors).

How should this be manually tested?

  1. Run unit tests:
    npx jest --selectProjects coreCli --testPathPattern 'test/jest/unit/ecosystems-monitor-docker'
  2. Optional smoke test against a throwaway org with a multi-project image — use the Go-wrapped binary so SNYK_REQUEST_CONCURRENCY is plumbed through:
    ./binary-releases/snyk-... container monitor <image> --org=<throwaway-org>
    SNYK_REQUEST_CONCURRENCY=20 ./binary-releases/snyk-... container monitor <image> --org=<throwaway-org>

Risk assessment

Low. The change is scoped to a single function. Error semantics are preserved (verified by tests). Concurrency is bounded and configurable. pMap is already used by runTest so it's a familiar dependency in this codebase.

Background

Depends on / based on #6756, which:

  • introduces the getRequestConcurrency helper and clamps,
  • pipes the user-facing SNYK_REQUEST_CONCURRENCY env var through the Go-side GAF configuration so the legacy CLI receives the resolved value via the internal SNYK_INTERNAL_REQUEST_CONCURRENCY env var.

Supersedes #6748:

  • That PR used unbounded Promise.all, which on a 500-ScanResult image would fire 500+ concurrent PUTs — a real risk of 429s, socket exhaustion, or thundering-herd backend load.
  • That PR added a "base scan first, then Promise.all the rest" hybrid pattern; once results are kept in input order via pMap, the special case isn't load-bearing.

@snyk-io
Copy link
Copy Markdown

snyk-io Bot commented Apr 29, 2026

Snyk checks have passed. No issues have been found so far.

Status Scan Engine Critical High Medium Low Total (0)
Open Source Security 0 0 0 0 0 issues
Licenses 0 0 0 0 0 issues
Code Security 0 0 0 0 0 issues

💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse.

@bgardiner
Copy link
Copy Markdown
Contributor Author

bgardiner commented Apr 30, 2026

Benchmark — quay.io/wildfly/wildfly:34.0.1.Final-jdk21 (512 ScanResults)

Sequential baseline measured against the released CLI in PATH (snyk v1.1303.2, prior code path). PR-B configurations measured against the locally-built PR binary, varying only SNYK_REQUEST_CONCURRENCY. Re-monitoring updates existing project snapshots rather than creating duplicates.

Configuration Wall-clock vs sequential
sequential (snyk v1.1303.2, prior code) 481 s (8m 1s) 1.00×
SNYK_REQUEST_CONCURRENCY=5 (PR B) 104 s 4.6× faster
default c=10 (PR B) 54 s 8.9× faster
SNYK_REQUEST_CONCURRENCY=20 (PR B) 29 s 16.6× faster

Each PR-B run successfully updated all 512 snapshots. The sequential run produced 510 snapshots; possibly minor version drift between v1.1303.2 and the PR binary's bundled snyk-docker-plugin. No errors logged; doesn't affect the timing comparison.

const MAX_REQUEST_CONCURRENCY = 50;

/**
* Returns the maximum number of in-flight Snyk dependency-test or
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know if these APIs rate limit? Could we run into this if a customer (like Walmart) sets this to 50?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. I am worried about that. I'm not sure how the rate limiting occurs (e.g., dest endpoint, source IP, etc.). I need to look into this. I'm hoping going from 5->10 as a default is relatively safe, but I don't think we can do more without some testing.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rate-limit investigation summary

I dug into the api-gateway config (github.com/snyk/api-gateway) to confirm the bumped concurrency won't push CLI users over rate limits. Here's what I found.

Rate-limit policy for our endpoints

POST /v1/test-dependencies, POST /v1/test-dep-graph, and PUT /v1/monitor-dependencies all match the default /v1/test* and /v1/monitor* prefix routes in helm/templates/route-tables/api-v1.yaml. None of them have a custom envoyMetadata.snyk.ratelimiter.keys.bucket override, so they fall into the default high bucket of the api-v1 rate-limit config.

  • Keyed by principal_id — per-authenticated-user (per-token), per docs/rate-limiting.md. Not per-IP, not per-cluster.
  • Default bucket: high. Limits per token:
Window Limit
Per second 200 req/s
Per minute 2,000 req/min (≈ 33/s sustained)
Per hour 60,000 req/hour (≈ 16.7/s sustained)

All three apply simultaneously (burst, 1-min, 1-hour ceilings).

The only adjacent exception is lowerDepGraphRateLimit.enabled (polaris-prod-pc-pc01-1 only), which drops /v1/monitor/dep-graph (with slash, not our endpoint) to extra_low. Doesn't affect us.

Headroom analysis for this PR

The default change is 5 → 10 in-flight requests per CLI invocation. The proposed env override (SNYK_REQUEST_CONCURRENCY) caps at 50.

Scenario Observed peak rate Limit Headroom
Wildfly container test, c=10 ~13 req/s sustained 200 req/s burst 15×
Wildfly container test, c=20 ~20 req/s sustained 200 req/s burst 10×
Wildfly container test, c=50 (max override) ~50 req/s burst 200 req/s burst

For a single invocation, no setting in the supported range gets close to the per-second ceiling.

Other context from registry-side telemetry

For completeness, I checked Datadog for actual 429s on the registry pod:

Endpoint 30-day requests 30-day 429s 429 rate
POST /test-dep-graph 9.0 M 24 0.00027%
POST /test-dependencies 1.65 M 22 0.0013%
PUT /monitor-dependencies 1.04 M 0 0%

Caveat: the Envoy gateway returns 429s to clients before the request reaches the registry pod, so registry-side 429 counts under-represent total rejections. At the cluster level, ~1% of total traffic to static-default-registry-web-30100_api-gateway (~807M requests over 30 days) is rejected — but that's all paths combined, and per-endpoint attribution isn't available at the metric layer.

Verdict

The bump from 5 → 10 (and the 50 cap on the env override) is well within the per-token rate limit policy for these endpoints. Even at the highest supported override, a single CLI burst stays at ≤25% of the per-second ceiling and doesn't change per-minute/per-hour quota consumption.

Recommended follow-up (optional, separate PR)

The current retry logic in run-test.ts:265-296 retries on 500/502/503 but not 429. If a heavy user (CI fleet sharing a token) ever hits the per-hour bucket — which is plausible regardless of this PR — they get a hard failure today.

import {
RETRY_ATTEMPTS,
RETRY_DELAY,
getRequestConcurrency,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we know why concurrency was not enabled for monitor-deps ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume just an oversight. I don't know

Copy link
Copy Markdown
Contributor

@bdemeo12 bdemeo12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Just a few questions!

@bgardiner bgardiner marked this pull request as ready for review May 1, 2026 16:30
@bgardiner bgardiner requested review from a team as code owners May 1, 2026 16:30
@snyk-pr-review-bot

This comment has been minimized.

@snyk-pr-review-bot

This comment has been minimized.

@snyk-pr-review-bot

This comment has been minimized.

@snyk-pr-review-bot

This comment has been minimized.

@snyk-pr-review-bot

This comment has been minimized.

@bgardiner bgardiner force-pushed the feat/parallelize-container-monitor branch from 9b8b7f1 to 0523d40 Compare May 13, 2026 16:19
@bgardiner bgardiner changed the base branch from main to feat/snyk-request-concurrency May 13, 2026 16:19
@snyk-pr-review-bot

This comment has been minimized.

@snyk-pr-review-bot

This comment has been minimized.

@bgardiner bgardiner force-pushed the feat/parallelize-container-monitor branch from 86726c6 to 2450535 Compare May 13, 2026 18:07
@snyk-pr-review-bot

This comment has been minimized.

@bgardiner bgardiner force-pushed the feat/parallelize-container-monitor branch from 2450535 to 5ae8410 Compare May 14, 2026 00:32
@snyk-pr-review-bot

This comment has been minimized.

Base automatically changed from feat/snyk-request-concurrency to main May 14, 2026 08:12
@bgardiner bgardiner force-pushed the feat/parallelize-container-monitor branch from 5ae8410 to fcd0bad Compare May 14, 2026 14:07
@snyk-pr-review-bot

This comment has been minimized.

@bgardiner bgardiner enabled auto-merge May 14, 2026 14:33
@snyk-pr-review-bot

This comment has been minimized.

@bgardiner bgardiner disabled auto-merge May 14, 2026 16:58
@snyk-pr-review-bot

This comment has been minimized.

Rewrite monitorDependencies in src/lib/ecosystems/monitor.ts to fan out
per-ScanResult /monitor-dependencies PUTs in parallel via pMap, bounded
by the SNYK_REQUEST_CONCURRENCY limit (default 5). Per-ScanResult work
is extracted into monitorOneScanResult for testability and clarity.

Container images that produce many ScanResults (e.g. one per directory
of JARs in fat-JAR-heavy images) previously incurred one full RTT per
scan result, since the prior implementation used a nested for-loop with
await. With bounded parallelism this collapses to ~ceil(N / concurrency)
sequential batches, materially reducing wall-clock for large images.

Error semantics are preserved:
- 401 still throws AuthFailedError (terminates the run).
- Other 4xx still throws MonitorError (terminates via pMap fail-fast).
- 5xx and other non-4xx errors are accumulated per-ScanResult into the
  errors array, matching the prior continue-on-error behavior.

Result order is preserved by pMap based on input order, so output
remains deterministic regardless of completion order.

Tests cover concurrency cap (default 5), env override via the internal
SNYK_INTERNAL_REQUEST_CONCURRENCY contract that the wrapping Go CLI
forwards, ordering preservation, 4xx fail-fast, and 5xx accumulation.
@bgardiner bgardiner force-pushed the feat/parallelize-container-monitor branch from 7026ae2 to 186c5fb Compare May 14, 2026 17:29
@bgardiner bgardiner enabled auto-merge May 14, 2026 17:30
@snyk-pr-review-bot
Copy link
Copy Markdown

PR Reviewer Guide 🔍

🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Import Resolution Risk 🟡 [minor]

The PR uses import * as pMap from 'p-map'. For p-map version 4.0.0 (as specified in package.json), the package is often exported as a default export in CommonJS environments or requires specific flag handling in ESM. While other parts of the monorepo (like snyk-fix) use this syntax, if the build target/environment differs, this can result in pMap being an object containing the function rather than the function itself, leading to 'pMap is not a function' at runtime.

import * as pMap from 'p-map';
📚 Repository Context Analyzed

This review considered 8 relevant code sections from 7 files (average relevance: 0.84)

@bgardiner bgardiner merged commit afb29b2 into main May 14, 2026
9 checks passed
@bgardiner bgardiner deleted the feat/parallelize-container-monitor branch May 14, 2026 18:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants