feat(infra): split Docker image into sglang and vllm variants #985

Merged: rchardx merged 2 commits into main from fw/docker on Mar 8, 2026

@garrett4wade (Collaborator) commented Mar 5, 2026

Summary

Split the monolithic Docker image build into sglang and vllm variants using a single parameterized Dockerfile with build arguments. Each variant ships only one inference backend, eliminating dependency conflicts and reducing image size.

User-facing documentation (installation guides, example configs, READMEs) is intentionally not updated in this PR — those changes will land in a follow-up when the split images are officially released.

Docker

  • Unified Dockerfile with ARG BASE_IMAGE / ARG VARIANT — builds either variant from one file
  • sglang base: lmsysorg/sglang:v0.5.7-cu129-amd64-runtime
  • vllm base: vllm/vllm-openai:v0.14.0 (with ENTRYPOINT [] reset)
  • Image naming convention: ghcr.io/inclusionai/areal-runtime:{tag}-{variant} (e.g. :dev-sglang, :v1.0.2-vllm)
  • latest tag points to the sglang variant
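
The naming convention and build arguments listed above can be sketched as a small helper. This is an illustration only: `image_ref` and `build_command` are hypothetical names, the build context is assumed to be the repository root, and the PR's real builds run from workflow YAML that is not shown in this conversation.

```python
"""Sketch: compose image references and `docker build` arguments per variant."""

REGISTRY_IMAGE = "ghcr.io/inclusionai/areal-runtime"

# Base images as listed in the PR description.
BASE_IMAGES = {
    "sglang": "lmsysorg/sglang:v0.5.7-cu129-amd64-runtime",
    "vllm": "vllm/vllm-openai:v0.14.0",
}


def image_ref(tag: str, variant: str) -> str:
    """Return `<registry image>:<tag>-<variant>` for a known variant."""
    if variant not in BASE_IMAGES:
        raise ValueError(f"unknown variant: {variant!r}")
    return f"{REGISTRY_IMAGE}:{tag}-{variant}"


def build_command(tag: str, variant: str, context: str = ".") -> list[str]:
    """Return an argv list that builds one variant from the shared Dockerfile."""
    ref = image_ref(tag, variant)  # validates the variant first
    return [
        "docker", "build",
        "--build-arg", f"BASE_IMAGE={BASE_IMAGES[variant]}",
        "--build-arg", f"VARIANT={variant}",
        "-t", ref,
        context,  # assumed build context (hypothetical default)
    ]


if __name__ == "__main__":
    for variant in BASE_IMAGES:
        print(" ".join(build_command("dev", variant)))
```

Returning an argv list rather than a shell string keeps the helper safe to pass to `subprocess.run` without quoting concerns.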

Dependencies (pyproject.toml)

  • Declared sglang and vllm as conflicting extras via [tool.uv.conflicts]
  • Renamed cuda extra → cuda-train (training packages only, no inference backend)
  • Removed torchao from direct dependencies (transitive dep of sglang only)
  • Removed coexistence overrides that were only needed when both backends were installed
  • Kept litellm-vs-sglang overrides (openai, soundfile)
  • Registered sglang and vllm pytest markers in [tool.pytest.ini_options]

CI Workflows

  • build-docker-image.yml: Builds both images → tests each via test-areal.yml → promotes to :dev-{variant}
  • test-areal.yml: Matrix strategy with variant parameter; per-variant GCP runners; excludes opposite backend's tests via -m "not {backend}"
  • tag-release-image.yml: Matrix build for release tags (v1.0.2-sglang, v1.0.2-vllm)
  • install-test.yml: Docker test matrix for both variants
  • .pre-commit-config.yaml: Excluded .github/workflows/ from check-yaml (GHA expressions use non-standard YAML)
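
The `-m "not {backend}"` exclusion used by `test-areal.yml` can be sketched as follows. The function names and the `tests/` path are assumptions for illustration; the actual workflow expresses this in GitHub Actions YAML.

```python
"""Sketch: derive the pytest marker expression that excludes the other backend."""

BACKENDS = ("sglang", "vllm")


def exclusion_marker(variant: str) -> str:
    """Return a `-m` expression that deselects every backend except `variant`."""
    if variant not in BACKENDS:
        raise ValueError(f"unknown variant: {variant!r}")
    others = [name for name in BACKENDS if name != variant]
    return " and ".join(f"not {name}" for name in others)


def pytest_args(variant: str, test_path: str = "tests/") -> list[str]:
    """Return an argv list for running the suite on one image variant."""
    return ["pytest", "-m", exclusion_marker(variant), test_path]


if __name__ == "__main__":
    for variant in BACKENDS:
        print(pytest_args(variant))
```

Tests that carry neither marker still run on both images, because the expression only deselects tests marked for the opposite backend.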

Tests

  • Tagged backend-specific tests with @pytest.mark.sglang / @pytest.mark.vllm markers
  • CI excludes tests for the non-installed backend via -m "not {backend}"
  • test_inference_engines.py: Conditional sglang fixture (skip when sglang not installed)
  • test_tool_call_parser.py: pytest.importorskip("sglang") guard
  • test_grpo.py: Parametrized over (training_backend, inference_backend) with per-case marks — runs sglang configs on sglang image, vllm configs on vllm image
  • Added vllm GRPO config YAMLs for fsdp, megatron, and archon training backends
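
A minimal sketch of the per-case marking described for `test_grpo.py`, assuming every training backend is paired with both inference backends. The test bodies are placeholders and the names are hypothetical; the real tests launch GRPO runs from the config YAMLs.

```python
"""Sketch: parametrize over (training_backend, inference_backend) with per-case marks."""
import pytest

TRAINING_BACKENDS = ("fsdp", "megatron", "archon")
INFERENCE_BACKENDS = ("sglang", "vllm")

# Each case carries the marker of its inference backend, so running with
# `-m "not vllm"` keeps only the sglang cases and vice versa.
GRPO_CASES = [
    pytest.param(train, infer, marks=getattr(pytest.mark, infer), id=f"{train}-{infer}")
    for train in TRAINING_BACKENDS
    for infer in INFERENCE_BACKENDS
]


@pytest.mark.parametrize(("training_backend", "inference_backend"), GRPO_CASES)
def test_grpo_case(training_backend, inference_backend):
    # Placeholder body: only checks that the parameters are well formed.
    assert training_backend in TRAINING_BACKENDS
    assert inference_backend in INFERENCE_BACKENDS


@pytest.mark.sglang
def test_requires_sglang():
    # Skips at run time when sglang is not importable in the current image.
    pytest.importorskip("sglang")
```

Marking cases individually, instead of setting a module-level `pytestmark`, lets one test module contribute cases to both image variants.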

Validation Tools

  • validate_docker_installation.py: Auto-detect installed inference variant instead of requiring both
  • validation_base.py: Removed sglang from base critical packages (variant-specific)
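
One way to auto-detect the installed variant is to probe for each backend's top-level module. This is a sketch under that assumption, not the code in `validate_docker_installation.py`; `installed_backends` and its injectable probe are hypothetical.

```python
"""Sketch: detect which inference backend is installed in the current environment."""
import importlib.util
from typing import Callable

INFERENCE_BACKENDS = ("sglang", "vllm")


def _is_installed(module_name: str) -> bool:
    """Return True when the top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


def installed_backends(probe: Callable[[str], bool] = _is_installed) -> list[str]:
    """Return the backends found by `probe`, in a fixed order."""
    return [name for name in INFERENCE_BACKENDS if probe(name)]


if __name__ == "__main__":
    found = installed_backends()
    print("installed inference backends:", found or "none")
```

Using `find_spec` avoids importing the backend itself, which can be slow or require a GPU, and the injectable probe makes the detection testable without either package installed.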

What is NOT included (deferred to release)

  • Installation guide updates (Docker image names, uv sync commands)
  • Example README / SkyPilot config updates
  • AGENTS.md / CLAUDE.md install command updates

Type of Change

  • New feature (non-breaking change that adds functionality)
  • Breaking change — no user-facing breaking changes in this PR; docs stay on current release

Checklist

  • I have read the Contributing Guide
  • I have run formatting tools (pre-commit or manual)
  • My branch is up to date with main
  • CI passes for both sglang and vllm variants

@gemini-code-assist (Contributor) commented

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the project's Docker infrastructure and dependency management to better support distinct SGLang and vLLM inference backends. By splitting the Dockerfile and introducing explicit uv extras, it improves modularity, reduces dependency conflicts, and streamlines the build and testing processes for different backend configurations. This change also impacts user installation instructions and Docker image references, which have been updated across the documentation and examples.

Highlights

  • Docker Infrastructure Refactoring: The single Dockerfile was split into two variant-specific Dockerfiles (Dockerfile.sglang and Dockerfile.vllm) with dedicated base images, enhancing modularity for SGLang and vLLM backends.
  • CI Workflow Updates: All Continuous Integration (CI) workflows were updated to concurrently build, test, and promote both the SGLang and vLLM Docker images.
  • Dependency Management Enhancement: The cuda extra was renamed to cuda-train to specifically denote CUDA training packages, and sglang and vllm were declared as mutually exclusive extras using uv's conflicts feature, simplifying dependency resolution.
  • Dependency Override Reduction: Several unnecessary dependency overrides were removed from pyproject.toml due to the improved isolation provided by the new backend-specific extras.
  • Documentation and Example Updates: All relevant documentation files and example configurations were updated to reflect the new Docker image names and the revised uv sync installation commands.
  • Pytest Marker Integration: New sglang and vllm pytest markers were introduced and applied to backend-specific tests, enabling more granular and efficient testing in CI.

Changelog
  • .pre-commit-config.yaml
    • Excluded GitHub workflow YAML files from linting checks.
  • AGENTS.md
    • Updated installation instructions to use new uv sync extras.
  • CLAUDE.md
    • Updated installation instructions to use new uv sync extras.
  • Dockerfile.sglang
    • Renamed and modified the Dockerfile to specifically build the SGLang runtime image.
  • Dockerfile.vllm
    • Added a new Dockerfile to build the vLLM runtime image.
  • README.md
    • Updated installation commands and Docker image references.
  • docs/en/tutorial/installation.md
    • Updated Docker image names and uv sync commands in English documentation.
  • docs/zh/tutorial/installation.md
    • Updated Docker image names and uv sync commands in Chinese documentation.
  • examples/openclaw/README.md
    • Updated uv sync command for OpenClaw example.
  • examples/skypilot/README.md
    • Updated Docker image reference for SkyPilot examples.
  • examples/skypilot/ray_cluster.sky.yaml
    • Updated Docker image reference.
  • examples/skypilot/single_node.sky.yaml
    • Updated Docker image reference.
  • examples/tau2/README.md
    • Updated Docker image reference for Tau2 example.
  • pyproject.toml
    • Removed torchao from dependencies.
    • Renamed cuda extra to cuda-train.
    • Added conflicts section for sglang and vllm.
    • Removed several override-dependencies.
    • Introduced new sglang and vllm pytest markers.
  • tests/experimental/openai/test_client.py
    • Added sglang pytest marker.
  • tests/experimental/openai/test_client_with_tool.py
    • Added sglang pytest marker.
  • tests/experimental/openai/test_proxy_integration.py
    • Added sglang pytest marker.
  • tests/grpo/test_grpo.py
    • Added sglang pytest marker.
  • tests/test_examples.py
    • Applied sglang and vllm pytest markers to various example tests.
  • tests/test_fsdp_engine_nccl.py
    • Added sglang pytest marker.
  • tests/test_inference_engines.py
    • Updated inference engine fixture to use sglang and vllm pytest markers.
  • tests/test_rollout_controller.py
    • Added sglang pytest marker.
Ignored Files
  • Ignored by pattern: .github/workflows/** (4)
    • .github/workflows/build-docker-image.yml
    • .github/workflows/install-test.yml
    • .github/workflows/tag-release-image.yml
    • .github/workflows/test-areal.yml
Activity
  • No human activity has been recorded on this pull request yet.

@gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request does a great job of splitting the Docker build into sglang and vllm variants, which simplifies dependency management. The use of uv's conflicts feature in pyproject.toml to make the backends mutually exclusive is an excellent implementation choice. The updates to documentation and CI workflows are thorough and consistent with these changes.

My main feedback is a medium-severity maintainability concern regarding the significant code duplication between Dockerfile.sglang and the new Dockerfile.vllm. I've left a specific comment with a suggestion to refactor this using a single, multi-stage Dockerfile to reduce future maintenance overhead.

Comment thread Dockerfile.vllm Outdated
@garrett4wade force-pushed the fw/docker branch 11 times, most recently from c28cc00 to 03cd54d, on March 6, 2026 11:39
garrett4wade added a commit that referenced this pull request Mar 6, 2026
Parametrize GRPO tests over both sglang and vllm inference backends
so the vllm CI image can run GRPO tests (previously exit code 5
because all tests were marked sglang-only).

Key changes:
- Add vllm config YAMLs for fsdp, megatron, and archon backends
- Replace global pytestmark=sglang with per-case sglang/vllm marks
- Conditionally set config.vllm.model vs config.sglang.model_path
- README table reformatted by mdformat hook

Refs: #985
garrett4wade added a commit that referenced this pull request Mar 6, 2026
Restore test_ppo_stats.py and reward_curve.png which were
accidentally included in the Docker split branch. These
deletions are unrelated to the sglang/vllm image separation.

Refs: #985
garrett4wade added a commit that referenced this pull request Mar 6, 2026
…ase state

Revert user-facing docs, example configs, and install commands to
match the current released image naming and extras. These will be
updated in a follow-up when the split images are officially released.

Key changes:
- Restore README, AGENTS.md, CLAUDE.md install commands
- Restore EN/ZH installation guides (Docker image names, uv sync)
- Restore SkyPilot configs and example READMEs

Refs: #985
garrett4wade added a commit that referenced this pull request Mar 6, 2026
Build separate sglang and vllm Docker images from a single
parameterized Dockerfile. Each variant ships only one inference
backend, eliminating dependency conflicts and reducing image size.

Key changes:
- Dockerfile with ARG BASE_IMAGE/VARIANT for per-variant builds
- CI workflows: build-docker-image, test-areal, tag-release, install-test
- pyproject.toml: sglang/vllm conflicting extras, cuda renamed to cuda-train
- pytest markers and conditional fixtures for backend-specific tests
- GRPO tests parametrized over both sglang and vllm inference backends
- Docker validation tools auto-detect installed inference variant

Refs: #985
@garrett4wade added and removed the safe-to-test label ("Ready to run unit-tests in a PR.") Mar 6, 2026
@garrett4wade force-pushed the fw/docker branch 7 times, most recently from 373a0d3 to 8da6b47, on March 7, 2026 10:06
@garrett4wade (Collaborator, Author) commented

The workflow succeeds at https://github.com/inclusionAI/AReaL/actions/runs/22798596790

Pending review.

@garrett4wade garrett4wade requested review from nuzant and rchardx March 7, 2026 14:53
Comment thread pyproject.toml
Comment thread areal/tools/validate_docker_installation.py
Comment thread tests/grpo/test_grpo.py Outdated
@garrett4wade added and removed the safe-to-test label ("Ready to run unit-tests in a PR.") Mar 8, 2026
…onflicts

Runner 2.317.0 does not support node24 required by actions/checkout@v6,
causing 'Set up job' failures on dynamically provisioned GCP instances.

Key changes:
- Bump RUNNER_VERSION from 2.317.0 to 2.332.0 in test-areal.yml and build-docker-image.yml
- Remove duplicate imports in test_rollout_controller.py from rebase with PR #996
@garrett4wade added and removed the safe-to-test label ("Ready to run unit-tests in a PR.") Mar 8, 2026
@rchardx (Collaborator) left a comment

LGTM. Waiting for CI.

@rchardx rchardx merged commit 13ea771 into main Mar 8, 2026
13 checks passed
@rchardx rchardx deleted the fw/docker branch March 8, 2026 03:35
dingzhiqiang pushed a commit that referenced this pull request Mar 16, 2026
* feat(infra): split Docker image into sglang and vllm variants

Build separate sglang and vllm Docker images from a single
parameterized Dockerfile. Each variant ships only one inference
backend, eliminating dependency conflicts and reducing image size.

Key changes:
- Dockerfile with ARG BASE_IMAGE/VARIANT for per-variant builds
- CI workflows: build-docker-image, test-areal, tag-release, install-test
- pyproject.toml: sglang/vllm conflicting extras, cuda renamed to cuda-train
- pytest markers and conditional fixtures for backend-specific tests
- GRPO tests parametrized over both sglang and vllm inference backends
- Docker validation tools auto-detect installed inference variant

Refs: #985

* fix(infra): update GCP runner to v2.332.0 and resolve rebase import conflicts

Runner 2.317.0 does not support node24 required by actions/checkout@v6,
causing 'Set up job' failures on dynamically provisioned GCP instances.

Key changes:
- Bump RUNNER_VERSION from 2.317.0 to 2.332.0 in test-areal.yml and build-docker-image.yml
- Remove duplicate imports in test_rollout_controller.py from rebase with PR #996
leandermaben pushed a commit to leandermaben/AReaL that referenced this pull request Mar 24, 2026
SathyaGnanakumar pushed a commit to danielkiely/AReaL that referenced this pull request Apr 29, 2026
Labels: high priority, safe-to-test ("Ready to run unit-tests in a PR.")
