
[GSoC 2026] Sandboxed Execution Environments with Devcontainers: Implement @devcontainer decorator for local sandboxing #3069

Closed
tsvlgd wants to merge 9 commits into Netflix:master from tsvlgd:rfc-gsoc-proposal

@tsvlgd tsvlgd commented Mar 30, 2026

PR Type

  • Bug fix
  • New feature
  • Core Runtime change
  • Docs / tooling
  • Refactoring

Summary

Introduces the @devcontainer decorator, enabling Metaflow steps to run locally in isolated, reproducible Docker sandboxes defined by the DevContainer specification. This bridges the gap between uncontained local execution and cloud-based container orchestration.

Issue

Fixes # (Not applicable - This is a GSoC 2026 RFC/Prototype)

Reproduction

This PR includes a functional prototype and validation flows.

Runtime: local (Docker)

Commands to run:

# 1. Verify spec discovery and parsing
python test_flows/test_phase0_discovery.py run

# 2. Verify CLI Hijack (Host Python -> Docker Container)
python test_flows/test_phase1_hijack.py run

# 3. Verify Zero Host Environment Leakage (Isolation Test)
python test_flows/test_phase2_clean_env.py run

Where evidence shows up: Parent console (stdout)

After (Evidence of Zero Leakage & UID Mapping)
=== Environment Leakage Check ===
  CLEAN:  VIRTUAL_ENV not present
  CLEAN:  SHELL not present
  ...
SUCCESS: Zero host environment leakage detected.

tsvlgd added 2 commits March 30, 2026 16:03
Phase 1: Registration — Metaflow recognizes @devcontainer via plugins/__init__.py.
Phase 2: CLI Hijack + Spec Parsing — runs steps inside Docker using
         image and env vars from .devcontainer/devcontainer.json.

Uses Docker SDK (docker-py) backend to fix host environment leakage.
The shell approach passed the host's full env into the container;
the SDK passes only explicitly specified vars (clean boot).
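
The "clean boot" idea in this commit message can be sketched as follows. This is a minimal illustration under assumptions, not the PR's code: `build_container_env` is a hypothetical helper, the variables are assumed to live under the spec's `containerEnv` key, and the spec is assumed to be plain JSON without comments.

```python
import json


def build_container_env(spec_text):
    """Return only the variables declared in the spec (hypothetical helper).

    The host's os.environ is never consulted, so nothing leaks by default.
    """
    spec = json.loads(spec_text)
    declared = spec.get("containerEnv", {})
    # Environment values are always strings, so coerce defensively.
    return {str(key): str(value) for key, value in declared.items()}


# Example spec; the image and variable names are illustrative only.
spec_text = '{"image": "python:3.11", "containerEnv": {"METAFLOW_USER": "sandbox"}}'
env = build_container_env(spec_text)
print(env)
```

A dictionary built this way could then be handed to the container backend as its `environment`, in contrast to a shell invocation that inherits every host variable.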

greptile-apps Bot commented Mar 30, 2026

Greptile Summary

This PR introduces a @devcontainer step decorator that transparently redirects Metaflow step execution into a Docker container defined by a .devcontainer/devcontainer.json spec, bridging local uncontained execution and cloud container orchestration.

The implementation has matured considerably from the initial prototype: prior feedback around shell injection, broad home-directory mounts, Windows UID availability, stale Metaflow version installation, container accumulation, and spec-path fragility has all been addressed. One new critical issue was introduced during those fixes:

  • remove=True + container.wait() exit-code bug (_docker_launcher.py lines 56–77): The previous iteration used remove=False and a manual container.remove() in a try/except. The fix switched to remove=True for automatic cleanup, but combined with detach=True this creates a race: Docker removes the container as soon as it exits, so by the time the log stream finishes and container.wait() is called, the container is gone. Docker-py raises NotFound, which the except Exception handler catches and converts to sys.exit(1) — meaning every successful step (exit code 0) reports failure.
  • Missing __init__.py: metaflow/plugins/devcontainer/ has no __init__.py, inconsistent with all other Metaflow plugin packages and potentially problematic in some toolchain configurations.

Confidence Score: 3/5

Not safe to merge — the exit-code propagation bug makes every successful decorated step appear to have failed

The PR resolves all previously flagged concerns (injection, broad mounts, UID portability, version pinning, container leaks, path discovery), but the fix for container cleanup introduced a new P1 regression: using remove=True with detach=True makes container.wait() raise NotFound on every successful run, so exit code 0 is never propagated and all steps appear to fail. This single bug prevents the decorator from being functional.

metaflow/plugins/devcontainer/_docker_launcher.py — the remove=True/container.wait() interaction must be corrected before this PR can be merged

Important Files Changed

| Filename | Overview |
| --- | --- |
| metaflow/plugins/devcontainer/_docker_launcher.py | Docker SDK launcher. Critical exit-code bug: remove=True + detach=True + container.wait() always raises NotFound after the log stream ends, causing every successful step to report exit code 1 |
| metaflow/plugins/devcontainer/devcontainer_decorator.py | Step decorator with robust spec discovery, targeted volume mounting, Windows UID guard, and version-pinned install. The package directory is missing __init__.py |
| metaflow/plugins/__init__.py | Registers devcontainer in STEP_DECORATORS_DESC; clean one-line addition |
| setup.py | Adds devcontainer extras group with docker>=7.0.0; appropriate soft dependency |
| .devcontainer/devcontainer.json.example | Neutral example devcontainer spec replacing the previously committed personal config |
| .gitignore | Correctly ignores personal devcontainer.json and adds venv/ exclusion |

Comments Outside Diff (1)

  1. metaflow/plugins/devcontainer/_docker_launcher.py, lines 56-77

    remove=True + container.wait() — exit code never propagated

    When remove=True is combined with detach=True, Docker auto-removes the container the moment it exits. container.logs(stream=True, follow=True) blocks until the container has finished — and Docker has already removed it — so the subsequent container.wait() call sends a request for a container that no longer exists. Docker-py converts the resulting 404 into docker.errors.NotFound, which is caught by the outer except Exception block and unconditionally calls sys.exit(1).

    The practical consequence: a Metaflow step that exits cleanly (exit code 0) will always be reported as failed (exit code 1) by the host process. Non-zero exits are reported as failures too, but only by accident and always with code 1 rather than the container's real status. Successful steps are always mislabelled as failures, which breaks the decorator's core contract.

    The fix is to drop remove=True and manage cleanup manually after wait():

    container = client.containers.run(
        image=image,
        command=["bash", "-c", full_cmd],
        environment=env_vars,
        volumes=volumes,
        working_dir=working_dir,
        user=config.get("user", ""),
        remove=False,   # manage removal manually so wait() can read the exit code
        detach=True,
        stdout=True,
        stderr=True,
    )
    
    for chunk in container.logs(stream=True, follow=True):
        sys.stdout.buffer.write(chunk)
        sys.stdout.buffer.flush()
    
    result = container.wait()   # safe: container still exists at this point
    try:
        container.remove(force=True)
    except Exception:
        pass
    sys.exit(result.get("StatusCode", 1))
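
    The same ordering (stream logs, wait, then remove) can also be sketched as a small helper that cleans up even if log streaming fails partway through. This is an illustration under assumptions, not the code in the PR: `stream_and_wait` and `FakeContainer` are hypothetical names, and the fake stands in for a docker-py container so the sketch runs without a Docker daemon.

```python
import io


def stream_and_wait(container, out):
    """Stream logs, read the exit status, then remove the container."""
    try:
        for chunk in container.logs(stream=True, follow=True):
            out.write(chunk)
            out.flush()
        # Started without auto-removal, so the container still exists here.
        result = container.wait()
        return result.get("StatusCode", 1)
    finally:
        try:
            container.remove(force=True)
        except Exception:
            pass  # best-effort cleanup; never mask the real exit status


class FakeContainer:
    """Hypothetical stand-in for a docker-py container object."""

    def __init__(self, status):
        self.status = status
        self.removed = False

    def logs(self, stream, follow):
        return iter([b"step output\n"])

    def wait(self):
        return {"StatusCode": self.status}

    def remove(self, force=False):
        self.removed = True


buf = io.BytesIO()
fake = FakeContainer(status=0)
code = stream_and_wait(fake, buf)
```

    With a real container, the caller would pass `sys.stdout.buffer` as `out` and hand the returned code to `sys.exit`, so a clean step exits 0 and a failing step keeps its own status.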


Reviews (7): Last reviewed commit: "Merge branch 'master' into rfc-gsoc-prop..."

Comment thread metaflow/plugins/devcontainer/_docker_launcher.py Outdated
Comment thread metaflow/plugins/devcontainer/devcontainer_decorator.py
Comment thread metaflow/plugins/devcontainer/devcontainer_decorator.py Outdated
Comment thread metaflow/plugins/devcontainer/_docker_launcher.py Outdated
Comment thread metaflow/plugins/devcontainer/_docker_launcher.py Outdated
Comment thread .devcontainer/devcontainer.json Outdated
Comment thread metaflow/plugins/devcontainer/devcontainer_decorator.py Outdated
@tsvlgd tsvlgd marked this pull request as draft March 30, 2026 23:24

tsvlgd commented Mar 30, 2026

Hi @savingoyal @saikonen @romain-intel @valayDave

I’m currently working on a GSoC proposal for the project “Sandboxed Execution Environments with Devcontainers” in Metaflow.

I’ve pushed a draft version here:
https://github.com/tsvlgd/metaflow/blob/b6efe14fd2888ff61c097d23eb31278fe23f4033/draft-proposal/proposal.md

This draft is based on a working prototype I’ve been developing around the core execution flow (@devcontainer, Docker backend, datastore handling, etc.), and I’ve tried to reflect those design decisions in the proposal.

It’s still not final—I’m actively refining it—but I wanted to share it early to get feedback on the approach, scope, and alignment before I submit the final version tonight.

Any guidance or pointers would be really helpful.

Thanks!

@tsvlgd tsvlgd marked this pull request as ready for review March 31, 2026 17:04
@talsperre (Collaborator)

Please submit your proposal on the GSoC website directly. We won't be able to consider just the PRs or email submissions.

tsvlgd and others added 5 commits April 2, 2026 18:50
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
… discovery

Summary of changes:
- Resolved shell injection risk using shlex-based command sanitization.
- Implemented recursive parent-directory search for .devcontainer.json discovery.
- Applied Principle of Least Privilege to volume mounts (restricting to ~/.metaflow).
- Added auto_remove=True to launcher to ensure container cleanup.
- Synchronized host/container Metaflow versions for environment parity.
- Integrated docker-py into extras_require in setup.py.
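
The shlex-based sanitization mentioned in this list can be sketched as follows. It is a minimal illustration, assuming the step command is available as an argument list before it is handed to `bash -c`; `build_step_command` is a hypothetical helper, not a function from the PR.

```python
import shlex


def build_step_command(argv):
    """Quote every argument so the joined string is safe for `bash -c`."""
    return shlex.join(argv)


# An argument containing shell metacharacters stays a single literal token.
argv = ["python", "flow.py", "step", "start; rm -rf /"]
cmd = build_step_command(argv)
print(cmd)
```

Because each argument is quoted individually, the `;` in the last argument is passed through as data rather than being interpreted by the shell as a command separator.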
Hardened the sandbox architecture to address P1 security and reliability concerns:

- Security: Implemented shlex-based command sanitization to prevent shell injection.
- Security: Applied Principle of Least Privilege by restricting volume mounts to ~/.metaflow.
- Robustness: Replaced os.getcwd() with a recursive parent-search for .devcontainer discovery.
- Reliability: Integrated auto_remove=True in Docker SDK to prevent orphan container leaks.
- Compatibility: Pinned host/container Metaflow versions and added Windows/WSL2 platform guards.
- Project Hygiene: Moved Docker SDK to extras_require and sanitized configuration templates.

This update aligns the prototype with production standards while the official proposal is under review.
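
The parent-directory search for the spec can be sketched as follows. This is a minimal illustration under assumptions, not the PR's implementation: `find_devcontainer_spec` is a hypothetical name, the walk is written iteratively rather than recursively, and it checks the two conventional locations, `.devcontainer/devcontainer.json` and `.devcontainer.json`.

```python
import tempfile
from pathlib import Path


def find_devcontainer_spec(start):
    """Walk from `start` up to the filesystem root looking for a spec file."""
    start = Path(start).resolve()
    for directory in [start, *start.parents]:
        for candidate in (
            directory / ".devcontainer" / "devcontainer.json",
            directory / ".devcontainer.json",
        ):
            if candidate.is_file():
                return candidate
    return None  # no spec anywhere between `start` and the root


# Demonstrate discovery from a nested working directory.
with tempfile.TemporaryDirectory() as root:
    spec = Path(root) / ".devcontainer" / "devcontainer.json"
    spec.parent.mkdir()
    spec.write_text('{"image": "python:3.11"}')
    nested = Path(root) / "flows" / "training"
    nested.mkdir(parents=True)
    found = find_devcontainer_spec(nested)
    found_matches = found == spec.resolve()
```

Starting from the flow's directory rather than `os.getcwd()` alone means the spec is still found when a flow is launched from a subdirectory of the project.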

tsvlgd commented Apr 2, 2026

> Please submit your proposal on the GSoC website directly. We won't be able to consider just the PRs or email submissions.

Thank you, @talsperre. I have officially submitted my proposal on the GSoC website as required.

I’ve just updated this RFC branch to reflect the final technical approach in my proposal. This update "hardens" the sandbox architecture by resolving the initial feedback regarding shell injection (via shlex), implementing the Principle of Least Privilege for volume mounts, and adding robust recursive path discovery for .devcontainer specs.

I'm keeping this prototype active here to demonstrate the production-readiness of the proposed implementation. Thanks for the guidance!

@npow npow closed this Apr 9, 2026