Skip to content

fix(security): validate untrusted input in sandbox pip install and langchain separator parsing#588

Closed
charliegillet wants to merge 3 commits into
rocketride-org:developfrom
charliegillet:fix/security-input-validation-split
Closed

fix(security): validate untrusted input in sandbox pip install and langchain separator parsing#588
charliegillet wants to merge 3 commits into
rocketride-org:developfrom
charliegillet:fix/security-input-validation-split

Conversation

@charliegillet
Copy link
Copy Markdown
Contributor

@charliegillet charliegillet commented Apr 1, 2026

Split from #581 per @asclearuc's review.

Summary

  • sandbox.py: Add regex validation for pip package names to reject path traversal, URLs, and shell metacharacters before passing to subprocess. Add threading.Lock around pip install to prevent concurrent conflicts.
  • langchain.py: Add input length validation (_MAX_SEPARATOR_INPUT_LEN = 1024) before ast.literal_eval in the langchain preprocessor to prevent resource exhaustion from oversized separator input.

Why this bug happens in this codebase

sandbox.py passes user-provided package names directly to pip install via subprocess without validation. A malicious or malformed package name containing path traversal sequences (../), URLs, or shell metacharacters could be passed through to pip, which would attempt to install from arbitrary sources. The regex allowlist ensures only valid PyPI package name characters are accepted.

langchain.py calls ast.literal_eval on a user-provided separator string. While ast.literal_eval is safe against code execution, it still needs to parse the input — an extremely long string could cause unnecessary resource consumption. A 1024-character limit is generous for separator values (which are typically 1-10 characters) while preventing abuse.

Validation

  • Both changes are minimal and focused — 2 files, +24/-8 lines total
  • ruff check and ruff format pass clean
  • The package name regex (^[a-zA-Z0-9]([a-zA-Z0-9._-]*[a-zA-Z0-9])?$) matches the PyPI naming convention

Type

  • Bug fix / Security hardening

#Hack-with-bay-2

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes
    • Enforced input size limits to reject oversized or malformed inputs before processing, preventing errors.
    • Rejected invalid package names early (type, emptiness, length and format checks) to avoid failed installs.
    • Serialized package installation and cache updates to improve reliability under concurrent operations.

…ngchain separator parsing

Split from rocketride-org#581 per @asclearuc's review.

- Add regex validation for pip package names to reject path traversal,
  URLs, and shell metacharacters before passing to subprocess
- Add threading lock around pip install to prevent concurrent conflicts
- Add input length validation for ast.literal_eval in langchain preprocessor

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added module:nodes Python pipeline nodes module:ai AI/ML modules labels Apr 1, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 1, 2026

Warning

Rate limit exceeded

@charliegillet has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 4 minutes and 16 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 4 minutes and 16 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 46645901-73ae-4152-9875-172ae71319ac

📥 Commits

Reviewing files that changed from the base of the PR and between bddcd8f and 7663810.

📒 Files selected for processing (1)
  • nodes/src/nodes/preprocessor_langchain/langchain.py
📝 Walkthrough

Walkthrough

Added input validation and safety controls: the langchain preprocessor enforces a 1024-character maximum for separator parsing, and the sandbox pip helper validates package names, restricts install subprocess stderr, and serializes pip installs with a module-level lock.

Changes

Cohort / File(s) Summary
Langchain Input Validation
nodes/src/nodes/preprocessor_langchain/langchain.py
Introduce _MAX_SEPARATOR_INPUT_LEN = 1024; _parseSeparators(user_input) now raises ValueError if len(user_input) > 1024; existing parsing and error wrapping retained for valid lengths.
Pip Installation Safety
packages/ai/src/ai/common/sandbox.py
Add _pip_lock to serialize installs; _pip_install(package) now validates package (type, non-empty, ≤128 chars, regex for allowed chars/format), raises ValueError on invalid names; pip subprocess call discards stderr and importlib.invalidate_caches() is performed under the lock.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

🐇 I nibble code and tidy threads,
Count to 1024 before I tread,
I lock the pip, I check the name,
Small safeguards, a cautious game.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and accurately describes the main security-focused changes across both modified files (input validation in sandbox pip install and langchain separator parsing).
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nodes/src/nodes/preprocessor_langchain/langchain.py`:
- Around line 77-79: The constant _MAX_SEPARATOR_INPUT_LEN is defined inside
_parseSeparators, causing it to be recreated on each call and inconsistent with
other class-level constants; move _MAX_SEPARATOR_INPUT_LEN = 1024 up to the
class scope alongside the other constants (use the same private naming
convention) and update _parseSeparators to reference the class-level constant
(via self._MAX_SEPARATOR_INPUT_LEN or ClassName._MAX_SEPARATOR_INPUT_LEN) so the
value is reused and kept with the class configuration.

In `@packages/ai/src/ai/common/sandbox.py`:
- Around line 316-322: The pip install call inside the _pip_lock block uses
subprocess.check_call with stderr=subprocess.PIPE which can block if not read;
change it to redirect stderr to subprocess.DEVNULL (matching stdout) or use
subprocess.run(..., capture_output=True, check=True) if you want to capture
errors. Locate the call that invokes subprocess.check_call([sys.executable,
'-m', 'pip', 'install', '--quiet', package], stdout=subprocess.DEVNULL,
stderr=subprocess.PIPE, timeout=60) and replace stderr=subprocess.PIPE with
stderr=subprocess.DEVNULL (or refactor to subprocess.run with
capture_output=True and check=True and handle the result) to avoid pipe buffer
deadlocks.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: ec7be14e-d215-487d-907d-76bd927c0222

📥 Commits

Reviewing files that changed from the base of the PR and between 6f1c58c and e2b34ff.

📒 Files selected for processing (2)
  • nodes/src/nodes/preprocessor_langchain/langchain.py
  • packages/ai/src/ai/common/sandbox.py

Comment thread nodes/src/nodes/preprocessor_langchain/langchain.py Outdated
Comment thread packages/ai/src/ai/common/sandbox.py
…ketride-org#588

- Move _MAX_SEPARATOR_INPUT_LEN to class-level constant for consistency
- Use subprocess.DEVNULL for stderr in pip install to prevent pipe buffer deadlocks

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@charliegillet
Copy link
Copy Markdown
Contributor Author

Addressed CodeRabbit feedback:

  • Moved _MAX_SEPARATOR_INPUT_LEN to class-level constant for consistency
  • Changed stderr=subprocess.PIPE to subprocess.DEVNULL to prevent pipe buffer deadlocks

Thanks for the review!

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nodes/src/nodes/preprocessor_langchain/langchain.py`:
- Around line 78-79: The length check on user_input can raise TypeError for
non-string truthy values; add an explicit type guard before the len(...) call:
validate that user_input is an instance of str (or coerce to str if intended)
and if not raise a ValueError with a clear message, so the code path using
self._MAX_SEPARATOR_INPUT_LEN always fails deterministically for invalid
separator input; update the check around user_input and
self._MAX_SEPARATOR_INPUT_LEN accordingly (e.g., verify isinstance(user_input,
str) then perform the len check and raise ValueError on failure).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6855d57b-10ab-4226-9722-6a7fedb1dbf0

📥 Commits

Reviewing files that changed from the base of the PR and between e2b34ff and bddcd8f.

📒 Files selected for processing (2)
  • nodes/src/nodes/preprocessor_langchain/langchain.py
  • packages/ai/src/ai/common/sandbox.py

Comment thread nodes/src/nodes/preprocessor_langchain/langchain.py
@asclearuc
Copy link
Copy Markdown
Collaborator

Closing this in favour of keeping the status quo.

_pip_lock is not needed — pip has used its own per-environment filelock since v10, so concurrent pip install subprocesses are already serialised at the pip level. Adding a process-level threading lock on top serialises all installs globally, meaning a slow torch install would block an unrelated boto3 install in another thread — a net regression for concurrent sandbox use.

stderr=subprocess.DEVNULL is a step backwards from the original stderr=subprocess.PIPE. Swallowing pip's stderr means install failures (network errors, version conflicts, missing build deps) produce no actionable output, making sandbox errors much harder to diagnose.

langchain.py input validation is the one genuinely useful change here, but it's small enough to land as part of a broader input-validation pass rather than its own PR.

Thanks for the work, @charliegillet — the intent is right, the execution just introduces more problems than it solves.

@asclearuc asclearuc closed this Apr 7, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

module:ai AI/ML modules module:nodes Python pipeline nodes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants