feat(infra): split Docker image into sglang and vllm variants by garrett4wade · Pull Request #985 · areal-project/AReaL

garrett4wade · 2026-03-05T13:54:55Z

Summary

Split the monolithic Docker image build into sglang and vllm variants using a single parameterized Dockerfile with build arguments. Each variant ships only one inference backend, eliminating dependency conflicts and reducing image size.

User-facing documentation (installation guides, example configs, READMEs) is intentionally not updated in this PR — those changes will land in a follow-up when the split images are officially released.

Docker

Unified Dockerfile with ARG BASE_IMAGE / ARG VARIANT — builds either variant from one file
sglang base: lmsysorg/sglang:v0.5.7-cu129-amd64-runtime
vllm base: vllm/vllm-openai:v0.14.0 (with ENTRYPOINT [] reset)
Image naming convention: ghcr.io/inclusionai/areal-runtime:{tag}-{variant} (e.g. :dev-sglang, :v1.0.2-vllm)
latest tag points to the sglang variant
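
A minimal sketch of how the naming convention and the two build arguments fit together. The helper names `image_ref` and `build_command` are hypothetical and do not come from this PR; the CI workflows may invoke the build differently. Only the base images, the `BASE_IMAGE` / `VARIANT` argument names, and the `{tag}-{variant}` pattern are taken from the list above.

```python
# Hypothetical helpers illustrating the {tag}-{variant} naming convention.
REGISTRY_REPO = "ghcr.io/inclusionai/areal-runtime"

# Base images as listed in the PR description.
BASE_IMAGES = {
    "sglang": "lmsysorg/sglang:v0.5.7-cu129-amd64-runtime",
    "vllm": "vllm/vllm-openai:v0.14.0",
}


def image_ref(tag: str, variant: str) -> str:
    """Return the full image reference for a tag/variant pair."""
    if variant not in BASE_IMAGES:
        raise ValueError(f"unknown variant: {variant!r}")
    return f"{REGISTRY_REPO}:{tag}-{variant}"


def build_command(tag: str, variant: str, context: str = ".") -> list[str]:
    """Return a `docker build` argv that selects one variant via build args."""
    return [
        "docker", "build",
        "--build-arg", f"BASE_IMAGE={BASE_IMAGES[variant]}",
        "--build-arg", f"VARIANT={variant}",
        "-t", image_ref(tag, variant),
        context,
    ]


print(image_ref("dev", "sglang"))   # ghcr.io/inclusionai/areal-runtime:dev-sglang
print(image_ref("v1.0.2", "vllm"))  # ghcr.io/inclusionai/areal-runtime:v1.0.2-vllm
print(" ".join(build_command("dev", "vllm")))
```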

Dependencies (pyproject.toml)

Declared sglang and vllm as conflicting extras via [tool.uv.conflicts]
Renamed cuda extra → cuda-train (training packages only, no inference backend)
Removed torchao from direct dependencies (transitive dep of sglang only)
Removed coexistence overrides that were only needed when both backends were installed
Kept litellm-vs-sglang overrides (openai, soundfile)
Registered sglang and vllm pytest markers in [tool.pytest.ini_options]

CI Workflows

build-docker-image.yml: Builds both images → tests each via test-areal.yml → promotes to :dev-{variant}
test-areal.yml: Matrix strategy with variant parameter; per-variant GCP runners; excludes opposite backend's tests via -m "not {backend}"
tag-release-image.yml: Matrix build for release tags (v1.0.2-sglang, v1.0.2-vllm)
install-test.yml: Docker test matrix for both variants
.pre-commit-config.yaml: Excluded .github/workflows/ from check-yaml (GHA expressions use non-standard YAML)
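
The per-variant test selection can be sketched as follows. This is an illustration only: `opposite_backend` and `pytest_args` are hypothetical names, and the `tests` directory argument is an assumption. The one detail taken from the PR is that each image deselects the other backend's tests with `-m "not {backend}"`.

```python
# Hypothetical sketch of how a CI job might derive the pytest marker
# expression from the image variant it is running on.
BACKENDS = ("sglang", "vllm")


def opposite_backend(variant: str) -> str:
    """Return the inference backend that is NOT installed in this variant."""
    if variant not in BACKENDS:
        raise ValueError(f"unknown variant: {variant!r}")
    return "vllm" if variant == "sglang" else "sglang"


def pytest_args(variant: str, test_dir: str = "tests") -> list[str]:
    """Build a pytest argv that deselects the opposite backend's tests."""
    return ["pytest", "-m", f"not {opposite_backend(variant)}", test_dir]


print(pytest_args("sglang"))  # ['pytest', '-m', 'not vllm', 'tests']
print(pytest_args("vllm"))    # ['pytest', '-m', 'not sglang', 'tests']
```

Tests that carry neither marker are unaffected by the expression, so they run on both images.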

Tests

Tagged backend-specific tests with @pytest.mark.sglang / @pytest.mark.vllm markers
CI excludes tests for the non-installed backend via -m "not {backend}"
test_inference_engines.py: Conditional sglang fixture (skip when sglang not installed)
test_tool_call_parser.py: pytest.importorskip("sglang") guard
test_grpo.py: Parametrized over (training_backend, inference_backend) with per-case marks — runs sglang configs on sglang image, vllm configs on vllm image
Added vllm GRPO config YAMLs for fsdp, megatron, and archon training backends
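
To picture the `test_grpo.py` parametrization, here is a stdlib-only sketch of the case matrix. It assumes a full cross product of training and inference backends, which the PR description does not confirm, and `GrpoCase` and `selected_cases` are hypothetical names. In pytest itself, the per-case marks would typically be attached with `pytest.param(..., marks=pytest.mark.vllm)`.

```python
from dataclasses import dataclass
from itertools import product

TRAINING_BACKENDS = ("fsdp", "megatron", "archon")
INFERENCE_BACKENDS = ("sglang", "vllm")


@dataclass(frozen=True)
class GrpoCase:
    """One (training_backend, inference_backend) test case (hypothetical)."""

    training_backend: str
    inference_backend: str

    @property
    def mark(self) -> str:
        # Each case carries the marker of its inference backend.
        return self.inference_backend

    @property
    def case_id(self) -> str:
        return f"{self.training_backend}-{self.inference_backend}"


# Assumption: every training backend is paired with every inference backend.
ALL_CASES = [GrpoCase(t, i) for t, i in product(TRAINING_BACKENDS, INFERENCE_BACKENDS)]


def selected_cases(installed_variant: str) -> list[GrpoCase]:
    """Cases that remain after deselecting the other backend's marker."""
    return [case for case in ALL_CASES if case.mark == installed_variant]


print([case.case_id for case in selected_cases("vllm")])
# ['fsdp-vllm', 'megatron-vllm', 'archon-vllm']
```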

Validation Tools

validate_docker_installation.py: Auto-detect installed inference variant instead of requiring both
validation_base.py: Removed sglang from base critical packages (variant-specific)
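
One way variant auto-detection could work is to probe for each backend package without importing it. This is a sketch under assumptions, not the code in `validate_docker_installation.py`: the function name `detect_variant` is hypothetical, and treating "both installed" as an error is an assumption based on the extras being declared as conflicting.

```python
import importlib.util
from typing import Callable, Optional

INFERENCE_BACKENDS = ("sglang", "vllm")


def _is_installed(module_name: str) -> bool:
    """Check whether a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def detect_variant(
    is_installed: Callable[[str], bool] = _is_installed,
) -> Optional[str]:
    """Return the installed inference backend, or None if neither is present.

    The probe is injectable so the logic can be exercised without either
    backend being installed.
    """
    found = [name for name in INFERENCE_BACKENDS if is_installed(name)]
    if len(found) > 1:
        # Assumption: the split images ship exactly one backend each.
        raise RuntimeError(f"expected one inference backend, found: {found}")
    return found[0] if found else None


# Example with a fake probe that pretends only vllm is installed.
print(detect_variant(lambda name: name == "vllm"))  # vllm
```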

What is NOT included (deferred to release)

Installation guide updates (Docker image names, uv sync commands)
Example README / SkyPilot config updates
AGENTS.md / CLAUDE.md install command updates

Type of Change

New feature (non-breaking change that adds functionality)
Breaking change — no user-facing breaking changes in this PR; docs stay on current release

Checklist

I have read the Contributing Guide
I have run formatting tools (pre-commit or manual)
My branch is up to date with main
CI passes for both sglang and vllm variants

gemini-code-assist · 2026-03-05T13:55:25Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the project's Docker infrastructure and dependency management to better support distinct SGLang and vLLM inference backends. By splitting the Dockerfile and introducing explicit uv extras, it improves modularity, reduces dependency conflicts, and streamlines the build and testing processes for different backend configurations. This change also impacts user installation instructions and Docker image references, which have been updated across the documentation and examples.

Highlights

Docker Infrastructure Refactoring: The single Dockerfile was split into two variant-specific Dockerfiles (Dockerfile.sglang and Dockerfile.vllm) with dedicated base images, enhancing modularity for SGLang and vLLM backends.
CI Workflow Updates: All Continuous Integration (CI) workflows were updated to concurrently build, test, and promote both the SGLang and vLLM Docker images.
Dependency Management Enhancement: The cuda extra was renamed to cuda-train to specifically denote CUDA training packages, and sglang and vllm were declared as mutually exclusive extras using uv's conflicts feature, simplifying dependency resolution.
Dependency Override Reduction: Several unnecessary dependency overrides were removed from pyproject.toml due to the improved isolation provided by the new backend-specific extras.
Documentation and Example Updates: All relevant documentation files and example configurations were updated to reflect the new Docker image names and the revised uv sync installation commands.
Pytest Marker Integration: New sglang and vllm pytest markers were introduced and applied to backend-specific tests, enabling more granular and efficient testing in CI.

Changelog

.pre-commit-config.yaml
- Excluded GitHub workflow YAML files from linting checks.
AGENTS.md
- Updated installation instructions to use new uv sync extras.
CLAUDE.md
- Updated installation instructions to use new uv sync extras.
Dockerfile.sglang
- Renamed and modified the Dockerfile to specifically build the SGLang runtime image.
Dockerfile.vllm
- Added a new Dockerfile to build the vLLM runtime image.
README.md
- Updated installation commands and Docker image references.
docs/en/tutorial/installation.md
- Updated Docker image names and uv sync commands in English documentation.
docs/zh/tutorial/installation.md
- Updated Docker image names and uv sync commands in Chinese documentation.
examples/openclaw/README.md
- Updated uv sync command for OpenClaw example.
examples/skypilot/README.md
- Updated Docker image reference for SkyPilot examples.
examples/skypilot/ray_cluster.sky.yaml
- Updated Docker image reference.
examples/skypilot/single_node.sky.yaml
- Updated Docker image reference.
examples/tau2/README.md
- Updated Docker image reference for Tau2 example.
pyproject.toml
- Removed torchao from dependencies.
- Renamed cuda extra to cuda-train.
- Added conflicts section for sglang and vllm.
- Removed several override-dependencies.
- Introduced new sglang and vllm pytest markers.
tests/experimental/openai/test_client.py
- Added sglang pytest marker.
tests/experimental/openai/test_client_with_tool.py
- Added sglang pytest marker.
tests/experimental/openai/test_proxy_integration.py
- Added sglang pytest marker.
tests/grpo/test_grpo.py
- Added sglang pytest marker.
tests/test_examples.py
- Applied sglang and vllm pytest markers to various example tests.
tests/test_fsdp_engine_nccl.py
- Added sglang pytest marker.
tests/test_inference_engines.py
- Updated inference engine fixture to use sglang and vllm pytest markers.
tests/test_rollout_controller.py
- Added sglang pytest marker.

Ignored Files

Ignored by pattern: .github/workflows/** (4)
- .github/workflows/build-docker-image.yml
- .github/workflows/install-test.yml
- .github/workflows/tag-release-image.yml
- .github/workflows/test-areal.yml

Activity

No human activity has been recorded on this pull request yet.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request does a great job of splitting the Docker build into sglang and vllm variants, which simplifies dependency management. The use of uv's conflicts feature in pyproject.toml to make the backends mutually exclusive is an excellent implementation choice. The updates to documentation and CI workflows are thorough and consistent with these changes.

My main feedback is a medium-severity maintainability concern regarding the significant code duplication between Dockerfile.sglang and the new Dockerfile.vllm. I've left a specific comment with a suggestion to refactor this using a single, multi-stage Dockerfile to reduce future maintenance overhead.

Parametrize GRPO tests over both sglang and vllm inference backends so the vllm CI image can run GRPO tests. Previously pytest exited with code 5 (no tests collected) on that image because all tests were marked sglang-only.

Key changes:
- Add vllm config YAMLs for fsdp, megatron, and archon backends
- Replace global pytestmark=sglang with per-case sglang/vllm marks
- Conditionally set config.vllm.model vs config.sglang.model_path
- README table reformatted by mdformat hook

Refs: #985
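
The conditional model setting mentioned in this commit might look roughly like the sketch below. Only the attribute paths `config.vllm.model` and `config.sglang.model_path` come from the commit message; the function name `set_model_path` and the use of `SimpleNamespace` as a stand-in for the real config object are assumptions for illustration.

```python
from types import SimpleNamespace


def set_model_path(config, inference_backend: str, model_path: str):
    """Write the model path to the field the chosen backend reads."""
    if inference_backend == "vllm":
        config.vllm.model = model_path
    elif inference_backend == "sglang":
        config.sglang.model_path = model_path
    else:
        raise ValueError(f"unknown inference backend: {inference_backend!r}")
    return config


# Stand-in config object (hypothetical); the real config class is not shown in the PR.
config = SimpleNamespace(
    sglang=SimpleNamespace(model_path=None),
    vllm=SimpleNamespace(model=None),
)
set_model_path(config, "vllm", "/path/to/model")
print(config.vllm.model)         # /path/to/model
print(config.sglang.model_path)  # None
```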

Restore test_ppo_stats.py and reward_curve.png which were accidentally included in the Docker split branch. These deletions are unrelated to the sglang/vllm image separation.

Refs: #985

…ase state

Revert user-facing docs, example configs, and install commands to match the current released image naming and extras. These will be updated in a follow-up when the split images are officially released.

Key changes:
- Restore README, AGENTS.md, CLAUDE.md install commands
- Restore EN/ZH installation guides (Docker image names, uv sync)
- Restore SkyPilot configs and example READMEs

Refs: #985

Build separate sglang and vllm Docker images from a single parameterized Dockerfile. Each variant ships only one inference backend, eliminating dependency conflicts and reducing image size.

Key changes:
- Dockerfile with ARG BASE_IMAGE/VARIANT for per-variant builds
- CI workflows: build-docker-image, test-areal, tag-release, install-test
- pyproject.toml: sglang/vllm conflicting extras, cuda renamed to cuda-train
- pytest markers and conditional fixtures for backend-specific tests
- GRPO tests parametrized over both sglang and vllm inference backends
- Docker validation tools auto-detect installed inference variant

Refs: #985

garrett4wade · 2026-03-07T14:53:09Z

The workflow succeeds at https://github.com/inclusionAI/AReaL/actions/runs/22798596790

Pending review.

…onflicts

Runner 2.317.0 does not support node24 required by actions/checkout@v6, causing 'Set up job' failures on dynamically provisioned GCP instances.

Key changes:
- Bump RUNNER_VERSION from 2.317.0 to 2.332.0 in test-areal.yml and build-docker-image.yml
- Remove duplicate imports in test_rollout_controller.py from rebase with PR #996

rchardx

LGTM. Waiting for CI.

* feat(infra): split Docker image into sglang and vllm variants

  Build separate sglang and vllm Docker images from a single parameterized Dockerfile. Each variant ships only one inference backend, eliminating dependency conflicts and reducing image size.

  Key changes:
  - Dockerfile with ARG BASE_IMAGE/VARIANT for per-variant builds
  - CI workflows: build-docker-image, test-areal, tag-release, install-test
  - pyproject.toml: sglang/vllm conflicting extras, cuda renamed to cuda-train
  - pytest markers and conditional fixtures for backend-specific tests
  - GRPO tests parametrized over both sglang and vllm inference backends
  - Docker validation tools auto-detect installed inference variant

  Refs: #985

* fix(infra): update GCP runner to v2.332.0 and resolve rebase import conflicts

  Runner 2.317.0 does not support node24 required by actions/checkout@v6, causing 'Set up job' failures on dynamically provisioned GCP instances.

  Key changes:
  - Bump RUNNER_VERSION from 2.317.0 to 2.332.0 in test-areal.yml and build-docker-image.yml
  - Remove duplicate imports in test_rollout_controller.py from rebase with PR #996

gemini-code-assist Bot reviewed Mar 5, 2026

Comment thread Dockerfile.vllm Outdated

garrett4wade force-pushed the fw/docker branch 11 times, most recently from c28cc00 to 03cd54d Compare March 6, 2026 11:39

garrett4wade temporarily deployed to AReaL-unittests March 6, 2026 12:00 — with GitHub Actions Inactive

garrett4wade had a problem deploying to AReaL-unittests March 6, 2026 12:26 — with GitHub Actions Failure

garrett4wade had a problem deploying to AReaL-unittests March 6, 2026 13:40 — with GitHub Actions Failure

garrett4wade had a problem deploying to AReaL-unittests March 6, 2026 13:43 — with GitHub Actions Error

garrett4wade had a problem deploying to AReaL-unittests March 6, 2026 14:37 — with GitHub Actions Error

garrett4wade temporarily deployed to AReaL-unittests March 6, 2026 14:38 — with GitHub Actions Inactive

garrett4wade force-pushed the fw/docker branch from 1680456 to 8ecc683 Compare March 6, 2026 15:49

garrett4wade force-pushed the fw/docker branch from 8ecc683 to 0692853 Compare March 6, 2026 15:53

garrett4wade force-pushed the fw/docker branch from b3c125e to bdf1dec Compare March 6, 2026 16:12

garrett4wade added safe-to-test Ready to run unit-tests in a PR. and removed safe-to-test Ready to run unit-tests in a PR. labels Mar 6, 2026

garrett4wade force-pushed the fw/docker branch 7 times, most recently from 373a0d3 to 8da6b47 Compare March 7, 2026 10:06

garrett4wade temporarily deployed to AReaL-unittests March 7, 2026 12:03 — with GitHub Actions Inactive

garrett4wade temporarily deployed to AReaL-unittests March 7, 2026 12:12 — with GitHub Actions Inactive

garrett4wade requested review from nuzant and rchardx March 7, 2026 14:53

rchardx reviewed Mar 7, 2026

Comment thread pyproject.toml

rchardx reviewed Mar 7, 2026

Comment thread areal/tools/validate_docker_installation.py

rchardx reviewed Mar 7, 2026

Comment thread tests/grpo/test_grpo.py Outdated

garrett4wade force-pushed the fw/docker branch from 8da6b47 to cba1df4 Compare March 8, 2026 01:34

garrett4wade added safe-to-test Ready to run unit-tests in a PR. and removed safe-to-test Ready to run unit-tests in a PR. labels Mar 8, 2026

garrett4wade force-pushed the fw/docker branch from cba1df4 to c5e56d9 Compare March 8, 2026 01:42

garrett4wade added safe-to-test Ready to run unit-tests in a PR. and removed safe-to-test Ready to run unit-tests in a PR. labels Mar 8, 2026

garrett4wade temporarily deployed to AReaL-unittests March 8, 2026 01:51 — with GitHub Actions Inactive

rchardx approved these changes Mar 8, 2026

rchardx merged commit 13ea771 into main Mar 8, 2026
13 checks passed

rchardx deleted the fw/docker branch March 8, 2026 03:35

rchardx merged 2 commits into main from fw/docker
