[https://nvbugs/5911304][fix] Add URL validation and request hardening for media input loading#12748
Conversation
9ecc0e4 to
8da414a
Compare
📝 WalkthroughWalkthroughAdded security validation for remote URL fetching in Changes
Sequence Diagram(s)sequenceDiagram
participant Client as Calling Code
participant Validator as URL Validator
participant Resolver as IP Resolver
participant HTTPClient as HTTP Client
participant RemoteServer as Remote Server
Client->>Validator: Request with URL
Validator->>Validator: Check scheme (http/https only)
alt Invalid scheme
Validator-->>Client: Raise ValueError
end
Validator->>Resolver: Resolve URL to IP address
alt Resolution fails
Resolver-->>Validator: Error
Validator-->>Client: Raise error
end
Resolver-->>Validator: Return IP address
Validator->>Validator: Validate IP (not private/loopback/reserved)
alt Invalid IP range
Validator-->>Client: Raise error
end
Validator->>HTTPClient: Make safe request
HTTPClient->>RemoteServer: HTTP GET request
RemoteServer-->>HTTPClient: Response
HTTPClient->>HTTPClient: Check response size ≤ 200 MB
alt Oversized response
HTTPClient-->>Client: Raise error
end
HTTPClient->>HTTPClient: Validate redirect target (≤5 redirects)
alt Redirect limit exceeded
HTTPClient-->>Client: Raise error
end
HTTPClient-->>Validator: Valid response
Validator-->>Client: Return content
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 1 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (1)
tensorrt_llm/inputs/utils.py (1)
107-131: SSRF validation is solid, with a known DNS rebinding caveat.The implementation correctly validates all resolved IPs against private/loopback/link-local/reserved ranges. However, there's an inherent TOCTOU (time-of-check to time-of-use) window: DNS could resolve to a different IP between validation and the actual HTTP request due to TTL expiration or DNS rebinding attacks.
This is a known limitation of application-layer SSRF defenses. For defense-in-depth, consider documenting this limitation or adding a comment noting that network-level controls (firewall rules blocking egress to private ranges) provide the strongest protection.
📝 Suggested documentation comment
def _validate_url(url: str) -> None: """Validate that *url* points to a public, non-internal HTTP(S) resource. Raises ``ValueError`` for URLs that target private, loopback, or link-local addresses, or that use a scheme other than http / https. + + Note: This provides application-layer SSRF protection but cannot fully + prevent DNS rebinding attacks. Network-level egress controls blocking + traffic to internal IP ranges provide additional defense-in-depth. """🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/inputs/utils.py` around lines 107 - 131, The SSRF validation in _validate_url correctly checks resolved IPs but has a TOCTOU/DNS-rebinding gap between DNS resolution and actual HTTP use; update the code by adding a clear comment above the _validate_url function explaining this limitation (DNS can change after validation), and mention recommended defense-in-depth measures (e.g., enforce network-level egress controls/firewall rules to block private ranges, or perform request-time validation/connection-level checks) so future maintainers know this is not a complete protection.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tensorrt_llm/inputs/utils.py`:
- Around line 134-162: The redirect loop in _safe_request_get can leak
connections because intermediate resp objects aren’t closed when stream=True;
before reassigning resp inside the for loop (just prior to calling
requests.get(redirect_url,...)), call resp.close() for the previous response
when stream is True (ensure resp is not None), validate redirect_url with
_validate_url as already done, then proceed to request the redirect; leave the
final resp open to return so callers can consume/close it.
- Around line 165-179: The function _safe_aiohttp_get currently lets aiohttp
follow redirects automatically which allows intermediate redirect targets to be
requested before validation; change it to perform manual redirect handling: call
session.get(..., allow_redirects=False), validate the initial URL with
_validate_url, then if response is a 3xx redirect, read the Location header,
resolve it to an absolute URL (preserving response.url base), call _validate_url
on that new URL, and repeat up to _MAX_REDIRECTS; only when you receive a
non-redirect (2xx) response proceed to read response.content (enforcing
_MAX_RESPONSE_BYTES and response.raise_for_status()). Ensure you reference and
update _safe_aiohttp_get, use _validate_url, and respect _MAX_REDIRECTS and
_MAX_RESPONSE_BYTES while handling relative Location headers correctly.
---
Nitpick comments:
In `@tensorrt_llm/inputs/utils.py`:
- Around line 107-131: The SSRF validation in _validate_url correctly checks
resolved IPs but has a TOCTOU/DNS-rebinding gap between DNS resolution and
actual HTTP use; update the code by adding a clear comment above the
_validate_url function explaining this limitation (DNS can change after
validation), and mention recommended defense-in-depth measures (e.g., enforce
network-level egress controls/firewall rules to block private ranges, or perform
request-time validation/connection-level checks) so future maintainers know this
is not a complete protection.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 4ab187e6-c8fc-471e-ae15-41ce2936cea9
📒 Files selected for processing (1)
tensorrt_llm/inputs/utils.py
|
/bot run --disable-fail-fast |
|
PR_Github #41789 [ run ] triggered by Bot. Commit: |
|
PR_Github #41789 [ run ] completed with state |
|
@2ez4bz @yechank-nvidia @Wanli-Jiang I saw your recent work in this file and figured you might be good reviewer of this PR, could you please take a look? Thank you! |
2ez4bz
left a comment
There was a problem hiding this comment.
Can we add tests for the new code? 🙏
1a5a0ea to
f534614
Compare
|
/bot run |
1 similar comment
|
/bot run |
|
PR_Github #42765 [ run ] triggered by Bot. Commit: |
|
PR_Github #42765 [ run ] completed with state
|
|
/bot run |
|
PR_Github #42775 [ run ] triggered by Bot. Commit: |
|
PR_Github #42775 [ run ] completed with state |
|
@2ez4bz could you review again, all comments are responded and pipeline test passed |
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
… loading Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
…uest_get Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
64e11b7 to
7c1c17f
Compare
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
|
/bot run |
|
PR_Github #44579 [ run ] triggered by Bot. Commit: |
|
PR_Github #44579 [ run ] completed with state
|
|
|
||
| async def _safe_aiohttp_get(url: str, timeout_sec: int = 30) -> bytes: | ||
| """Aiohttp GET wrapper that validates every redirect hop before following.""" | ||
| _validate_url(url) |
There was a problem hiding this comment.
_validate_url internally calls socket.getaddrinfo(), which is synchronous and can block the event loop for seconds on slow DNS. Since this function is async and intended for use in concurrent request paths, consider wrapping it with await asyncio.to_thread(_validate_url, url).
| _validate_url(url) | ||
| timeout = aiohttp.ClientTimeout(total=timeout_sec) | ||
| current_url = url | ||
| async with aiohttp.ClientSession(timeout=timeout) as session: |
There was a problem hiding this comment.
Previously the codebase used _get_aiohttp_session(), please consider accepting an existing session as a parameter.
| image = await asyncio.to_thread(_load_and_convert_image, | ||
| BytesIO(content)) | ||
| content = await _safe_aiohttp_get(image) | ||
| image = _load_and_convert_image(BytesIO(content)) |
There was a problem hiding this comment.
Seems we are not using async, can you go back to original code where we used asyncio.to_thread?
| image = _load_and_convert_image(BytesIO(content)) | ||
| elif parsed_url.scheme == "data": | ||
| image = await asyncio.to_thread(load_base64_image, parsed_url) | ||
| image = load_base64_image(parsed_url) |
| image = await asyncio.to_thread(load_base64_image, parsed_url) | ||
| image = load_base64_image(parsed_url) | ||
| elif parsed_url.scheme in ("", "file"): | ||
| image = _load_and_convert_image(Path(parsed_url.path)) |
| content = await response.content.read() | ||
| return await asyncio.to_thread(_load_from_bytes, content) | ||
| video_data = await _safe_aiohttp_get(video) | ||
| return _load_from_bytes(video_data) |
There was a problem hiding this comment.
Same as image case.
| elif parsed_url.scheme == "data": | ||
| decoded_video = load_base64_video(video) | ||
| return await asyncio.to_thread(_load_from_bytes, decoded_video) | ||
| return _load_from_bytes(decoded_video) |
Summary by CodeRabbit
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.