diff --git a/README.md b/README.md
index 1b463f20..2b4c5359 100644
--- a/README.md
+++ b/README.md
@@ -92,17 +92,17 @@ Models | GPU Machine Type
 | Models | GPU Machine Type | Framework | Workload Type | Orchestrator | Link to the recipe |
 | ---------------- | ---------------- | --------- | ------------------- | ------------ | ------------------ |
-| **DeepSeek R1 671B** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | vLLM | Inference | GKE | [Link](./inference/a4x/single-host-serving/vllm/README.md)
-| **Wan2.2 T2V A14B Diffusers** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | SGLang | Inference | GKE | [Link](./inference/a4x/single-host-serving/sglang/README.md)
-| **Wan2.2 I2V A14B Diffusers** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | SGLang | Inference | GKE | [Link](./inference/a4x/single-host-serving/sglang/README.md)
-| **DeepSeek R1 671B** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | TensorRT-LLM | Inference | GKE | [Link](./inference/a4x/single-host-serving/tensorrt-llm/README.md)

[Link for Using Google Cloud Storage (GCS) as Storage Option]((./inference/a4x/single-host-serving/tensorrt-llm-gcs/README.md))

[Link for Using Lustre as Storage Option]((./inference/a4x/single-host-serving/tensorrt-llm-lustre/README.md))
-| **Llama 3.1 405B** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | TensorRT-LLM | Inference | GKE | [Link](./inference/a4x/single-host-serving/tensorrt-llm/README.md)
-| **Llama 3.1 70B** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | TensorRT-LLM | Inference | GKE | [Link](./inference/a4x/single-host-serving/tensorrt-llm/README.md)
-| **Llama 3.1 8B** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | TensorRT-LLM | Inference | GKE | [Link](./inference/a4x/single-host-serving/tensorrt-llm/README.md)
-| **Qwen 2.5 VL 7B** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | TensorRT-LLM | Inference | GKE | [Link](./inference/a4x/single-host-serving/tensorrt-llm/README.md)
-| **Qwen 3 235B A22B** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | TensorRT-LLM | Inference | GKE | [Link](./inference/a4x/single-host-serving/tensorrt-llm/README.md)
-| **Qwen 3 32B** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | TensorRT-LLM | Inference | GKE | [Link](./inference/a4x/single-host-serving/tensorrt-llm/README.md)
-| **Qwen 3 4B** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | TensorRT-LLM | Inference | GKE | [Link](./inference/a4x/single-host-serving/tensorrt-llm/README.md)
+| **DeepSeek R1 671B** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | vLLM (v0.14.0rc1) | Inference | GKE | [Link](./inference/a4x/single-host-serving/vllm/README.md)
+| **Wan2.2 T2V A14B Diffusers** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | SGLang (latest) | Inference | GKE | [Link](./inference/a4x/single-host-serving/sglang/README.md)
+| **Wan2.2 I2V A14B Diffusers** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | SGLang (latest) | Inference | GKE | [Link](./inference/a4x/single-host-serving/sglang/README.md)
+| **DeepSeek R1 671B** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | TensorRT-LLM (1.3.0rc5) | Inference | GKE | [Link](./inference/a4x/single-host-serving/tensorrt-llm/README.md)

[Link for Using Google Cloud Storage (GCS) as Storage Option]((./inference/a4x/single-host-serving/tensorrt-llm-gcs/README.md))

[Link for Using Lustre as Storage Option]((./inference/a4x/single-host-serving/tensorrt-llm-lustre/README.md))
+| **Llama 3.1 405B** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | TensorRT-LLM (1.3.0rc5) | Inference | GKE | [Link](./inference/a4x/single-host-serving/tensorrt-llm/README.md)
+| **Llama 3.1 70B** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | TensorRT-LLM (1.3.0rc5) | Inference | GKE | [Link](./inference/a4x/single-host-serving/tensorrt-llm/README.md)
+| **Llama 3.1 8B** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | TensorRT-LLM (1.3.0rc5) | Inference | GKE | [Link](./inference/a4x/single-host-serving/tensorrt-llm/README.md)
+| **Qwen 2.5 VL 7B** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | TensorRT-LLM (1.3.0rc5) | Inference | GKE | [Link](./inference/a4x/single-host-serving/tensorrt-llm/README.md)
+| **Qwen 3 235B A22B** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | TensorRT-LLM (1.3.0rc5) | Inference | GKE | [Link](./inference/a4x/single-host-serving/tensorrt-llm/README.md)
+| **Qwen 3 32B** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | TensorRT-LLM (1.3.0rc5) | Inference | GKE | [Link](./inference/a4x/single-host-serving/tensorrt-llm/README.md)
+| **Qwen 3 4B** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | TensorRT-LLM (1.3.0rc5) | Inference | GKE | [Link](./inference/a4x/single-host-serving/tensorrt-llm/README.md)
 ### Inference benchmarks G4