docs: restructure site, add changelogs, improve content by junpuf · Pull Request #6098 · aws/deep-learning-containers

junpuf · 2026-05-15T07:22:36Z

Summary

Major documentation site restructure for clarity, navigation, and content quality.

Site Structure

Nav bar: Home → User Guide → Blog Posts → Resources
User Guide: vLLM, vLLM-Omni, Ray, PyTorch, Base — each multi-page with sidebar nav (Overview, Supported Models, Deployment, Configuration, Changelog)
Resources: Reference (Image Access, Available Images, Region Availability, Support Policy, Release Notifications) + Security

Home Page

New landing page: DLC intro, quick start example, use-case cards (3 + 2 grid), footer links
Replaces previous README.md copy
New "Build Your Own Image" card pointing to Base guide

Framework Pages

vLLM, vLLM-Omni, Ray: split monolithic pages into focused sub-pages with changelogs
PyTorch: new guide for the AL2023 PyTorch training image (overview + EC2/SageMaker deployment + changelog)
Base: new guide for the lightweight CUDA + Python base images (overview + changelog, covers both v1 / CUDA 12.9 and v2 / CUDA 13.0)
Updated SageMaker deployment docs with standard-supervisor features (PR #6044, "feat: wire vLLM SageMaker entrypoint to standard-supervisor")
Added "What's Included" sections, default port columns, model coverage labels (Smoke / Benchmark / Smoke + Benchmark)
Documented SageMaker model resolution order, S3 model loading via runai-streamer, OpenAI-compatible API endpoints

Release-Notes Pipeline Removal

The auto-generated docs/releasenotes/ pages were not wired into the site nav and had no inbound links — effectively dead surface. The manually-maintained User Guide changelogs now own image history.

Deleted docs/releasenotes/ tree
Removed generate_release_notes() and helpers from docs/src/generate.py
Removed release-notes templates, table config, and tests
Stripped announcements: and packages: from 91 data YAMLs (no longer consumed)
Trimmed scripts/autocurrency/docs-pr.sh to drop docker-image introspection and the dead announcements:/packages: emit
Removed docs_packages: from .github/config/autocurrency-tracker.yml
Updated test_generate.py to drop the release-notes mock
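
For reviewers who want to reproduce the YAML cleanup, here is a minimal sketch of how the `announcements:` and `packages:` blocks could be stripped. It assumes both keys sit at the top level of each data file and uses a plain line scan so comments and key order elsewhere are preserved. `strip_top_level_keys` is a hypothetical helper, not the script used in this PR.

```python
import re

# Matches a key that starts in column 0, e.g. "packages:".
TOP_LEVEL_KEY = re.compile(r"^([A-Za-z_][\w.-]*)\s*:")


def strip_top_level_keys(yaml_text, keys=("announcements", "packages")):
    """Return yaml_text without the named top-level blocks.

    Everything from a matching top-level key up to (but not including)
    the next top-level key is dropped. Other lines pass through unchanged.
    """
    kept = []
    skipping = False
    for line in yaml_text.splitlines(keepends=True):
        match = TOP_LEVEL_KEY.match(line)
        if match:
            # A new top-level key decides whether the following block is dropped.
            skipping = match.group(1) in keys
        if not skipping:
            kept.append(line)
    return "".join(kept)
```

Note that a comment or blank line inside a dropped block is removed along with it, which is usually what you want for keys that are no longer consumed.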

Reference

Separated Region Availability into its own page
Trimmed Image Access to essentials
Simplified Release Notifications

Fact-Check Findings Fixed

RAYSERVE_NUM_GPUS was a phantom variable in Ray deployment docs (no script reads it) — removed
CodeArtifact (CA_REPOSITORY_ARN) was incorrectly implied to work on Ray EC2 image — corrected to SageMaker-only

CI Hardening

mkdocs build → mkdocs build --strict in docs-test.yml. PRs that introduce broken internal links, missing nav targets, broken anchors, or page conflicts will now fail CI (existing test_links.py covered .md link targets but not anchors).
Added /README.md, /tutorials/README.md, /DEVELOPMENT.md to mkdocs.yaml exclude_docs so contributor docs don't conflict with the published site.
Sidebar nav section titles darkened for readability.
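
To run the same strict check locally before pushing, a small wrapper like the one below may help. It is a sketch only: `strict_build_command` and `run_strict_build` are hypothetical names, and it assumes `mkdocs` is on `PATH` and that the config file is `mkdocs.yaml` at the repo root.

```python
import subprocess


def strict_build_command(config_file="mkdocs.yaml"):
    """Build the argv for a strict MkDocs build (warnings become errors)."""
    return ["mkdocs", "build", "--strict", "--config-file", config_file]


def run_strict_build(config_file="mkdocs.yaml"):
    """Run the strict build and return its exit code (0 means no warnings)."""
    completed = subprocess.run(strict_build_command(config_file), check=False)
    return completed.returncode


# Example (not executed here): exit_code = run_strict_build()
```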

Test plan

mkdocs build --strict passes locally
All 74 docs tests pass (pytest test/docs/)
pre-commit run --all-files passes (except actionlint which needs network in the local env)
All pages return 200 (verified locally)
Autocurrency unit tests: same pre-existing baseline (no new regressions)

Redesign the documentation site structure and content for clarity:

Site structure:
- Top nav: Home, User Guide, Blog Posts, Resources
- User Guide: vLLM, vLLM-Omni, Ray (each multi-page with sidebar nav)
- Resources: Reference (Image Access, Available Images, Region Availability, Support Policy, Release Notifications) + Security

Home page:
- New landing page with DLC intro, quick start example, use-case cards
- Replaces the previous README copy

Framework pages (vLLM, vLLM-Omni, Ray):
- Split monolithic pages into Overview, Supported Models, Deployment (EC2/EKS/SageMaker), Configuration, and Changelog sub-pages
- Add changelogs with real release content from PRs
- Remove auto-generated vllm-server release notes (covered by changelog)
- Update SageMaker docs with standard-supervisor features

Reference:
- Separate Region Availability into its own page
- Trim Image Access to essentials
- Simplify Release Notifications

Tests:
- Update test_generate_available_images.py for removed Region Availability section from available_images template

User Guide additions:
- Add Base image guide (overview + changelog) under docs/base/
- Add PyTorch image guide (overview + EC2/SageMaker deployment + changelog) under docs/pytorch/
- Wire both into top-level User Guide nav
- Expand vLLM/vLLM-Omni/Ray content (What's Included, API endpoints, port columns, model coverage labels, fact-check fixes)

Release-notes pipeline removal:
- Delete docs/releasenotes/ tree (output had no nav entry, dead surface)
- Remove generate_release_notes() and helpers from docs/src/generate.py
- Remove release-notes templates, table config, tests
- Strip announcements/packages from 91 data YAMLs
- Trim scripts/autocurrency/docs-pr.sh to drop docker introspection and announcement/packages emission; update autocurrency-tracker.yml

Net: User Guide changelogs are the single source of truth; ~2100 lines removed.

- CI: switch docs-test.yml to `mkdocs build --strict`. Catches broken internal links, missing nav targets, broken anchors, and orphan-page warnings that the existing test_links.py tests don't cover (anchors in particular).
- Fix pre-existing strict-mode warnings:
  - Add /README.md, /tutorials/README.md, /DEVELOPMENT.md to mkdocs.yaml exclude_docs (these conflicted with index.md or were not in nav). Anchored with leading / so per-tutorial README files still build.
  - Remove broken `available_images.md#tensorflow-training` anchor link from the home page (TensorFlow Training section was previously removed from the available_images table).
- Sidebar: darken section titles ("User Guide", "vLLM", etc.) to pure black/white in light/dark mode for readability.
- Home page: switch use-case grid to 3 columns (3 + 2) to accommodate the new "Build Your Own Image" Base card.

- vLLM Inference → LLM Serving using vLLM DLC
- vLLM-Omni Inference → Multimodal Serving using vLLM-Omni DLC
- Ray Serve Inference → ML Serving using Ray DLC
- PyTorch Training → ML Training using PyTorch DLC
- Base Inference → Build Custom Images using Base DLC

Sidebar nav labels (vLLM / vLLM-Omni / Ray / PyTorch / Base) are unchanged; only the H1 title on each guide's overview page is updated.

… rendering

- Each guide overview (vLLM, vLLM-Omni, Ray, PyTorch, Base) now points to its respective ECR Public Gallery page next to the existing Image Access reference.
- Ray Example Deployments table: drop inline-code wrapping on path links so they render as plain links (the code-block background made the text hard to read against the link color).

…-and-content

# Conflicts:
#   scripts/autocurrency/docs-pr.sh

Eren-Jeager123

autocurrency/docspr part lgtm

Per review feedback: while the DLC image doesn't read RAYSERVE_NUM_GPUS, the mnist-direct-app example's deployment.py uses it to parameterize ray_actor_options.num_gpus. Restore the env var in the example, with a comment clarifying it's a user-side convention rather than a DLC contract. Also extend docs/ray/deployment/ec2.md Direct App Import section to call out this pattern: env vars consumed by the user's deployment.py are valid; they're just not defined by the DLC.
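
A minimal sketch of the user-side pattern described above, assuming the example's deployment.py reads `RAYSERVE_NUM_GPUS` itself. `ray_actor_options_from_env` is a hypothetical helper name, and the 0-GPU default is an assumption, not something the DLC defines.

```python
import os


def ray_actor_options_from_env(environ=os.environ, default_gpus=0.0):
    """Translate RAYSERVE_NUM_GPUS into a ray_actor_options dict.

    This is a user-side convention: the DLC image never reads the
    variable, only the user's own deployment code does.
    """
    raw = environ.get("RAYSERVE_NUM_GPUS", "")
    num_gpus = float(raw) if raw.strip() else default_gpus
    return {"num_gpus": num_gpus}


# In a real deployment.py the result would typically be passed to Ray Serve:
#
#   from ray import serve
#
#   @serve.deployment(ray_actor_options=ray_actor_options_from_env())
#   class Model: ...
```

Keeping the parsing in one small function makes it clear that the variable is owned by the application, and it lets the default be tested without a Ray cluster.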

Per #6098 review feedback: RAYSERVE_BACKEND_URL is an internal default in the SageMaker adapter (always 127.0.0.1:8000 on the DLC), added in #5704 and explicitly marked internal in code. Customers have no supported reason to override it, so it shouldn't appear in the user-facing env-var table.

sirutBuasai

A small thing: the release notes used to list the main package versions, such as the Python, CUDA, and PyTorch versions. The new changelogs don't show which version or commit those come from; they only show the source commit of the main package (e.g., vllm).

Another small nit: we should use the variables from global.yml (such as the EC2/ECS/EKS/SageMaker ones) as much as possible, so the docs stay consistent and future changes are easy to make.

Base images use nvidia/cuda:*-{base,runtime,devel}-amzn2023, not the -cudnn flavors. cuDNN is not installed in v1 or v2.

- Use global.yml variables ({{ ec2_short }}, {{ eks_short }}, {{ sm_short }}, {{ sagemaker }}) for AWS service names in guide pages and docs/index.md, user_guide/index.md instead of hardcoded "EC2", "EKS", "SageMaker", "Amazon SageMaker AI" so future renames update everywhere
- Add "Bundled versions" line per release in docs/vllm/changelog/index.md (CUDA, Python, FlashInfer, DeepEP) so the changelog conveys per-release framework state, matching the existing PyTorch/Ray changelog format
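
As an illustration of how a "Bundled versions" line could be derived from an env-style versions file, here is a rough sketch. The key names (`CUDA_VERSION`, `PYTHON_VERSION`, and so on) and the sample values are placeholders, not the actual contents of the repo's versions file, and both helpers are hypothetical.

```python
def parse_env_file(text):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def bundled_versions_line(values, fields):
    """Format the changelog line from (label, key) pairs; missing keys are skipped."""
    parts = [f"{label} {values[key]}" for label, key in fields if key in values]
    return "**Bundled versions:** " + " · ".join(parts)


# Placeholder input for illustration only.
sample = 'CUDA_VERSION=12.9\n# comment\nPYTHON_VERSION="3.12"\n'
fields = [
    ("CUDA", "CUDA_VERSION"),
    ("Python", "PYTHON_VERSION"),
    ("FlashInfer", "FLASHINFER_VERSION"),
    ("DeepEP", "DEEPEP_VERSION"),
]
line = bundled_versions_line(parse_env_file(sample), fields)
```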

junpuf · 2026-05-20T01:03:00Z

@sirutBuasai thanks for the review. Both points addressed in 6fcc8a0e:

1. Per-release version info on changelogs. PyTorch and Ray changelogs already include per-release framework versions; only the vLLM changelog was light. Added a **Bundled versions:** CUDA · Python · FlashInfer · DeepEP line to each vLLM entry (v1.0, v1.1, v1.2, v1.3), plus the wheel-version tag (e.g., 0.20.0.dev361+amzn2023.3f5bd482) inline with the source-commit link. Sourced from docker/vllm/versions.env at each release commit.

2. Use global.yml variables for service names. Converted hardcoded EC2, EKS, SageMaker, Amazon SageMaker AI literals to {{ ec2_short }} / {{ eks_short }} / {{ sm_short }} / {{ sagemaker }} across:

docs/{vllm,vllm-omni,ray,pytorch}/index.md
docs/index.md (landing page cards + walkthrough line)
docs/user_guide/index.md

Image tag URLs (*-sagemaker-cuda), file paths, and code blocks left literal since they're identifiers, not service names.
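
For context, the substitution works roughly like the sketch below. This is a stdlib stand-in for whatever template engine the docs build actually uses, and the variable values are assumed from the literals listed above rather than copied from global.yml.

```python
import re

# Assumed mapping, inferred from the literals that were replaced.
VARIABLES = {
    "ec2_short": "EC2",
    "eks_short": "EKS",
    "sm_short": "SageMaker",
    "sagemaker": "Amazon SageMaker AI",
}

# Matches {{ name }} with optional inner whitespace.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(text, variables=VARIABLES):
    """Replace {{ name }} placeholders; fail loudly on undefined names."""

    def substitute(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"undefined docs variable: {name}")
        return variables[name]

    return PLACEHOLDER.sub(substitute, text)


rendered = render("Deploy on {{ ec2_short }}, {{ eks_short }}, or {{ sm_short }}.")
```

Raising on an undefined name mirrors the intent of the strict build: a typo in a variable name should fail rather than ship as literal braces.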

One follow-up thought: it's worth weighing whether this level of templating pays off across the doc tree. Writing EC2 is meaningfully easier to read and write than {{ ec2_short }}, and well-known acronyms like EC2 / EKS / SageMaker are unlikely to be mistyped by humans or coding agents. Variables make a clear difference for complex / evolving strings (full product names, version-pinned identifiers, paths) where a future rename should propagate automatically. For short service abbreviations the cost-benefit is closer: the consistency win is real, but the maintenance / readability cost is non-trivial.

aws-deep-learning-containers-ci Bot added the authorized label May 15, 2026

junpuf added 8 commits May 15, 2026 07:26

fix: apply pre-commit linter fixes (ruff, mdformat)

da81c30

docs: add PyTorch and Base entries to User Guide index

a12560c

docs: rename home card 'Serve LLMs' to 'Serve Large Language Models'

8b36355

Merge remote-tracking branch 'origin/main' into docs/restructure-site…

0203b69

Eren-Jeager123 reviewed May 18, 2026

jinyan-li1 reviewed May 18, 2026

Comment thread examples/ray/sagemaker/deploy_direct_app.py Outdated

jinyan-li1 reviewed May 18, 2026

Comment thread docs/ray/deployment/sagemaker.md Outdated