# Single-host inference benchmark of Wan2.2 with SGLang on G4

This recipe shows how to serve and benchmark the Wan-AI/Wan2.2-T2V-A14B and Wan-AI/Wan2.2-I2V-A14B models using [SGLang](https://github.com/sgl-project/sglang/tree/main) on a single GCP G4 VM with RTX PRO 6000 GPUs. For more information on G4 machine types, see the [GCP documentation](https://cloud.google.com/compute/docs/accelerator-optimized-machines#g4-machine-types).

## Before you begin

### 1. Create a GCP VM with G4 GPUs

First, create a Google Cloud Platform (GCP) virtual machine (VM) with the necessary GPU resources.

Make sure you have the following prerequisites:
* The [Google Cloud SDK](https://cloud.google.com/sdk/docs/install) is installed and initialized.
* Your project has GPU quota. See [Request a quota increase](https://cloud.google.com/docs/quota/view-request#requesting_higher_quota).
* The required APIs are [enabled](https://console.cloud.google.com/flows/enableapi?apiid=compute.googleapis.com).

The following commands set up environment variables and create a GCE instance. `MACHINE_TYPE` is set to `g4-standard-384`, a multi-GPU VM with 8 GPUs. The boot disk is set to 200 GB to accommodate the model weights and dependencies.

```bash
# Note: GCE instance names may only contain lowercase letters, digits, and
# hyphens, so the name uses "wan2-2" rather than "wan2.2".
export VM_NAME="${USER}-g4-sglang-wan2-2"
export PROJECT_ID="your-project-id"
export ZONE="your-zone"
export MACHINE_TYPE="g4-standard-384"
export IMAGE_PROJECT="ubuntu-os-accelerator-images"
export IMAGE_FAMILY="ubuntu-accelerator-2404-amd64-with-nvidia-570"

gcloud compute instances create ${VM_NAME} \
    --machine-type=${MACHINE_TYPE} \
    --project=${PROJECT_ID} \
    --zone=${ZONE} \
    --image-project=${IMAGE_PROJECT} \
    --image-family=${IMAGE_FAMILY} \
    --maintenance-policy=TERMINATE \
    --boot-disk-size=200GB
```
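Before running the create command, it can help to confirm that every variable is actually set. The `check_vars` function below is a small sketch, not part of the gcloud CLI; it simply warns about unset shell variables.

```shell
# check_vars is a hypothetical helper: it warns about any unset shell
# variables before you call gcloud, and returns non-zero if any are missing.
check_vars() {
  local missing=0
  for var in "$@"; do
    if [ -z "${!var:-}" ]; then
      echo "WARNING: $var is not set" >&2
      missing=1
    fi
  done
  return "$missing"
}

if check_vars VM_NAME PROJECT_ID ZONE MACHINE_TYPE IMAGE_PROJECT IMAGE_FAMILY; then
  echo "All variables are set; ready to create the VM"
else
  echo "Set the missing variables above before continuing"
fi
```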

### 2. Connect to the VM

Use `gcloud compute ssh` to connect to the newly created instance.

```bash
gcloud compute ssh ${VM_NAME?} --project=${PROJECT_ID?} --zone=${ZONE?}
```

Once connected, run `nvidia-smi` to verify the driver installation and list the available GPUs.

```bash
nvidia-smi
```
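As a quick sanity check, you can count the GPUs the driver reports. The `count_gpus` function below is a hypothetical helper that counts the lines of `nvidia-smi`'s CSV output; on a `g4-standard-384` VM it should report 8.

```shell
# count_gpus counts non-empty lines on stdin. nvidia-smi prints one line per
# GPU with --query-gpu=name --format=csv,noheader, so piping that output
# through count_gpus yields the GPU count.
count_gpus() { grep -c .; }

# On the VM you would run:
#   nvidia-smi --query-gpu=name --format=csv,noheader | count_gpus
# Demonstration with canned output for two GPUs:
printf 'NVIDIA RTX PRO 6000\nNVIDIA RTX PRO 6000\n' | count_gpus
```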

## Serve a model

### 1. Install Docker

Before you can serve the model, Docker must be installed on the VM. Follow the official documentation to install Docker Engine on Ubuntu:
[Install Docker Engine on Ubuntu](https://docs.docker.com/engine/install/ubuntu/)

After installing Docker, make sure the Docker daemon is running.

### 2. Install the NVIDIA Container Toolkit

To give Docker containers access to the GPUs, install the NVIDIA Container Toolkit. Follow the official NVIDIA installation guide:
[NVIDIA Container Toolkit Install Guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)
### 3. Set up SGLang

The following prepares the host for model storage and starts the SGLang Docker container. We mount the `/scratch` directory so that model weights persist on the host disk, and pass the `--gpus all` flag so the container can use the G4 GPUs.

```bash
# Create a local directory to store model weights and cache
mkdir -p /scratch/cache

# Define the SGLang image
export IMAGE_URL="lmsysorg/sglang:latest"

# Start the container with GPU support and persistent volume mounts
docker run -it \
    --gpus all \
    -v /scratch:/scratch \
    -v /scratch/cache:/root/.cache \
    --ipc=host \
    $IMAGE_URL \
    /bin/bash
```

### 4. Download the Model Weights

Inside the container, use the Hugging Face CLI to download the Wan2.2 model files. Save them to the `/scratch` mount so they survive container deletion.

```bash
# Install the Hugging Face CLI (it ships with the huggingface_hub package)
pip install -U "huggingface_hub[cli]"

# Download both model variants from Hugging Face into separate directories
huggingface-cli download Wan-AI/Wan2.2-T2V-A14B --local-dir /scratch/models/Wan2.2-T2V-A14B
huggingface-cli download Wan-AI/Wan2.2-I2V-A14B --local-dir /scratch/models/Wan2.2-I2V-A14B
```
| 98 | + |
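Because the downloads are large, it is worth confirming they actually landed on the persistent mount. The `model_dir_sizes` function below is a minimal sketch that prints the size of each directory under a given path (`/scratch/models` in this recipe).

```shell
# model_dir_sizes is a hypothetical helper: it prints the disk usage of each
# subdirectory so you can confirm the weights were written to /scratch.
model_dir_sizes() { du -sh "$1"/* 2>/dev/null; }

model_dir_sizes /scratch/models || echo "no model directories found yet"
```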
## Run Benchmarks

Use the following commands to test video generation. The examples show how to run each model on a single GPU and across multiple GPUs using tensor parallelism (`--tp-size`). For the image-to-video benchmarks, supply an input image via `--image-path`; the commands below use `assets/logo.png`, so download or copy an image to that path (or point the flag at any image on disk) first.

*Benchmark: Text-to-Video on 1 GPU*
```bash
sglang generate --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers --dit-layerwise-offload false --text-encoder-cpu-offload false --vae-cpu-offload false --pin-cpu-memory --dit-cpu-offload false --prompt "Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. The fluffy-furred feline gazes directly at the camera with a relaxed expression. Blurred beach scenery forms the background featuring crystal-clear waters, distant green hills, and a blue sky dotted with white clouds. The cat assumes a naturally relaxed posture, as if savoring the sea breeze and warm sunlight. A close-up shot highlights the feline's intricate details and the refreshing atmosphere of the seaside." --save-output --num-gpus 1 --num-frames 81
```
*Benchmark: Text-to-Video on 4 GPUs*
```bash
sglang generate --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers --dit-layerwise-offload false --text-encoder-cpu-offload false --vae-cpu-offload false --pin-cpu-memory --dit-cpu-offload false --prompt "Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. The fluffy-furred feline gazes directly at the camera with a relaxed expression. Blurred beach scenery forms the background featuring crystal-clear waters, distant green hills, and a blue sky dotted with white clouds. The cat assumes a naturally relaxed posture, as if savoring the sea breeze and warm sunlight. A close-up shot highlights the feline's intricate details and the refreshing atmosphere of the seaside." --save-output --num-gpus 4 --tp-size 4 --num-frames 93
```
*Benchmark: Image-to-Video on 1 GPU*
```bash
sglang generate --model-path Wan-AI/Wan2.2-I2V-A14B-Diffusers --image-path assets/logo.png --dit-layerwise-offload false --text-encoder-cpu-offload false --vae-cpu-offload false --pin-cpu-memory --dit-cpu-offload false --prompt "A curious raccoon" --save-output --num-gpus 1 --num-frames 81
```
*Benchmark: Image-to-Video on 4 GPUs*
```bash
sglang generate --model-path Wan-AI/Wan2.2-I2V-A14B-Diffusers --image-path assets/logo.png --dit-layerwise-offload false --text-encoder-cpu-offload false --vae-cpu-offload false --pin-cpu-memory --dit-cpu-offload false --prompt "A curious raccoon" --save-output --num-gpus 4 --tp-size 4 --num-frames 93
```
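The `--num-frames` values above are not arbitrary: assuming Wan's video VAE compresses the time axis by a factor of 4, valid frame counts have the form 4n + 1 (such as 81 or 93), so that the frames line up with the latent chunks. A small sketch of that check, under this 4n + 1 assumption:

```shell
# valid_num_frames checks the 4n + 1 pattern (assumption: the video VAE has
# 4x temporal compression, so valid counts are 1, 5, 9, ..., 81, ..., 93, ...).
valid_num_frames() { [ $(( ($1 - 1) % 4 )) -eq 0 ]; }

for n in 81 93; do
  if valid_num_frames "$n"; then
    echo "$n frames: ok"
  else
    echo "$n frames: not of the form 4n + 1"
  fi
done
```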

## Clean up

### 1. Exit the container

```bash
exit
```

### 2. Delete the VM

This command deletes the GCE instance and all of its disks.

```bash
gcloud compute instances delete ${VM_NAME?} --zone=${ZONE?} --project=${PROJECT_ID?} --quiet --delete-disks=all
```