SLA Planner Quick Start Guide

This guide walks through the complete workflow for deploying SLA-based autoscaling for Dynamo deployments, consolidating all necessary steps into a single sequential process.

Important

Prerequisites: This guide assumes you have a Kubernetes cluster with GPU nodes and have completed the Dynamo Platform installation.

Overview

The SLA Planner automatically scales prefill and decode workers to meet your TTFT (Time To First Token) and ITL (Inter-Token Latency) targets.

The deployment process consists of two mandatory phases:

  1. Pre-Deployment Profiling (2-4 hours) - Generates performance data
  2. SLA Planner Deployment (5-10 minutes) - Enables autoscaling

Tip

Fast Profiling with AI Configurator: For TensorRT-LLM users, AI Configurator (AIC) can complete profiling in 20-30 seconds by simulating performance instead of running real deployments. Support for vLLM and SGLang is coming soon. See the AI Configurator section in the Profiling Guide.

flowchart TD
    A[Start Setup] --> B{Profiling Done?}
    B -->|No| C[Run Profiling<br/>2-4 hours]
    C --> D[Verify Results]
    D --> E[Deploy Planner<br/>5-10 minutes]
    B -->|Yes| E
    E --> F[Test System]
    F --> G[Ready!]

    style A fill:#e1f5fe
    style C fill:#fff3e0
    style E fill:#e8f5e8
    style G fill:#f3e5f5
    style B fill:#fff8e1

Phase 1: Pre-Deployment Profiling (REQUIRED)

Warning

MANDATORY: Pre-deployment profiling must be completed before deploying SLA planner. This process analyzes your model's performance characteristics to determine optimal tensor parallelism configurations and scaling parameters.

Step 1.1: Set Up Profiling Environment

Set up your Kubernetes namespace for profiling (one-time per namespace). If your namespace is already set up, skip this step.

export NAMESPACE=your-namespace

Prerequisites: Ensure all dependencies are installed:

pip install -r deploy/utils/requirements.txt

Step 1.2: Inject Your Configuration

Use the injector utility to copy your DynamoGraphDeployment (DGD) manifest into the PVC:

# Use default disagg.yaml config
python3 -m deploy.utils.inject_manifest --namespace $NAMESPACE --src components/backends/vllm/deploy/disagg.yaml --dest /data/configs/disagg.yaml

# Or use a custom disagg config file
python3 -m deploy.utils.inject_manifest --namespace $NAMESPACE --src my-custom-disagg.yaml --dest /data/configs/disagg.yaml

Note: For security reasons, all destination paths (--dest) must start with /data/.
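The /data/ rule can be checked locally before calling the injector. The following is a minimal sketch, assuming the rule applies to the --dest argument and that rejecting `..` segments is also desirable. `validate_dest` is a hypothetical helper and not part of deploy.utils.

```python
from pathlib import PurePosixPath


def validate_dest(dest: str) -> str:
    """Return dest unchanged if it names a file under /data/, else raise ValueError.

    Hypothetical pre-flight check; the injector's real validation may differ.
    """
    parts = PurePosixPath(dest).parts
    # A valid destination looks like ('/', 'data', ..., 'file.yaml').
    if len(parts) < 3 or parts[0] != "/" or parts[1] != "data":
        raise ValueError(f"destination must start with /data/: {dest!r}")
    # Assumption: parent-directory segments are rejected so the path cannot leave /data/.
    if ".." in parts:
        raise ValueError(f"destination must not contain '..': {dest!r}")
    return dest
```

Calling `validate_dest("/data/configs/disagg.yaml")` returns the path unchanged, while a path such as `/tmp/disagg.yaml` raises `ValueError`.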

Step 1.3: Configure SLA Targets

For dense models, edit $DYNAMO_HOME/benchmarks/profiler/deploy/profile_sla_job.yaml:

spec:
  template:
    spec:
      containers:
        - name: profile-sla
          args:
            - --isl
            - "3000" # average ISL is 3000 tokens
            - --osl
            - "150" # average OSL is 150 tokens
            - --ttft
            - "200" # target TTFT is 200ms
            - --itl
            - "20" # target ITL is 20ms
            - --backend
            - <vllm/sglang>

For MoE models, edit $DYNAMO_HOME/benchmarks/profiler/deploy/profile_sla_moe_job.yaml instead.
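If you maintain several SLA profiles, it can help to generate the container args rather than edit them by hand. The sketch below builds the same argument list shown in the job manifest above. The flag names come from that manifest; the `SlaTargets` class, the `profiler_args` function, and the validation rules are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlaTargets:
    isl: int         # average input sequence length, in tokens
    osl: int         # average output sequence length, in tokens
    ttft_ms: float   # target time to first token, in milliseconds
    itl_ms: float    # target inter-token latency, in milliseconds
    backend: str     # "vllm" or "sglang", as listed in the manifest placeholder


def profiler_args(t: SlaTargets) -> list[str]:
    """Build the args list for the profile-sla container (hypothetical helper)."""
    if t.backend not in ("vllm", "sglang"):
        raise ValueError(f"unsupported backend: {t.backend!r}")
    if min(t.isl, t.osl) <= 0 or min(t.ttft_ms, t.itl_ms) <= 0:
        raise ValueError("sequence lengths and latency targets must be positive")
    return [
        "--isl", str(t.isl),
        "--osl", str(t.osl),
        "--ttft", f"{t.ttft_ms:g}",   # :g drops a trailing ".0"
        "--itl", f"{t.itl_ms:g}",
        "--backend", t.backend,
    ]
```

Each element of the returned list corresponds to one `- ...` entry under `args:` in the manifest.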

Step 1.4: Run Profiling

Set the container image and config path:

export DOCKER_IMAGE=nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
export DGD_CONFIG_FILE=/data/configs/disagg.yaml

Run profiling:

# for dense models
envsubst < benchmarks/profiler/deploy/profile_sla_job.yaml | kubectl apply -f -

# for MoE models
envsubst < benchmarks/profiler/deploy/profile_sla_moe_job.yaml | kubectl apply -f -

# using AI Configurator (aiconfigurator) instead of real sweeping (see the AI Configurator section in the Profiling Guide)
envsubst < benchmarks/profiler/deploy/profile_sla_aic_job.yaml | kubectl apply -f -

Step 1.5: Monitor Profiling Progress

kubectl get jobs -n $NAMESPACE
kubectl logs job/profile-sla -n $NAMESPACE
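Because profiling runs for hours, you may prefer to poll the job status from a script instead of watching logs. The sketch below classifies the JSON form of a Job, for example the output of `kubectl get job profile-sla -n $NAMESPACE -o json`. It relies only on the standard Kubernetes Job `status.conditions` field; the `job_state` function name and the three return labels are illustrative choices.

```python
import json


def job_state(job_json: str) -> str:
    """Classify a Kubernetes Job JSON document as 'succeeded', 'failed', or 'running'."""
    status = json.loads(job_json).get("status", {})
    # conditions is absent until the Job reaches a terminal or suspended state.
    for cond in status.get("conditions") or []:
        if cond.get("status") != "True":
            continue
        if cond.get("type") == "Complete":
            return "succeeded"
        if cond.get("type") == "Failed":
            return "failed"
    return "running"
```

A wrapper script could call this every few minutes and stop once the result is no longer `running`.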

Note

Time Investment: This profiling process is comprehensive and typically takes 2-4 hours to complete. The script systematically tests multiple tensor parallelism configurations and load conditions to find optimal performance settings.

Step 1.6: Download Profiling Results (Optional)

If you want to view the profiling results and performance plots:

# Download to directory
python3 -m deploy.utils.download_pvc_results --namespace $NAMESPACE --output-dir ./results --folder /data/profiling_results

For detailed information about the output structure, performance plots, and how to analyze the results, see the Viewing Profiling Results section in the Profiling Guide.

Verify Success: Look for output like the following in the profiling job logs:

Suggested prefill TP:4 (TTFT 48.37 ms, throughput 15505.23 tokens/s/GPU)
Suggested decode TP:4 (ITL 4.83 ms, throughput 51.22 tokens/s/GPU)
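To use the suggested values in automation, the two summary lines can be parsed with a regular expression. This is a sketch that assumes the log lines match the sample above exactly; `parse_suggestions` and the dictionary keys are hypothetical names, and the real profiler output may vary between versions.

```python
import re

# Matches lines such as:
#   Suggested prefill TP:4 (TTFT 48.37 ms, throughput 15505.23 tokens/s/GPU)
_SUGGESTION = re.compile(
    r"Suggested (prefill|decode) TP:(\d+) "
    r"\((TTFT|ITL) ([\d.]+) ms, throughput ([\d.]+) tokens/s/GPU\)"
)


def parse_suggestions(log_text: str) -> dict[str, dict]:
    """Return {'prefill': {...}, 'decode': {...}} for each suggestion line found."""
    found = {}
    for match in _SUGGESTION.finditer(log_text):
        role, tp, _metric, latency_ms, throughput = match.groups()
        found[role] = {
            "tp": int(tp),
            "latency_ms": float(latency_ms),   # TTFT for prefill, ITL for decode
            "throughput": float(throughput),   # tokens/s/GPU
        }
    return found
```

If either key is missing from the result, the profiling run likely did not finish or the log format differs from the sample.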

Phase 2: Deploy SLA Planner

Step 2.1: Verify Prerequisites

Before deploying the SLA planner, ensure:

  • Pre-deployment profiling completed successfully (from Phase 1)
  • Profiling results saved to dynamo-pvc PVC
  • kube-prometheus-stack installed and running. By default, the Prometheus server is expected in the monitoring namespace. If it is deployed to a different namespace, set dynamo-operator.dynamo.metrics.prometheusEndpoint="http://prometheus-kube-prometheus-prometheus.<namespace>.svc.cluster.local:9090".
  • Dynamo platform installed (see Installation Guide)
  • Prefill and decode workers use the best parallelization mapping from profiling
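The Prometheus endpoint setting from the list above follows a fixed pattern, so it can be generated for any namespace. The sketch below is illustrative: `prometheus_endpoint_setting` is a hypothetical helper, the service name and port are taken from the text above, and how the resulting key=value pair is applied (for example as a Helm value) is an assumption.

```python
import re

# Kubernetes namespace names must be valid RFC 1123 DNS labels.
_DNS_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]{0,61}[a-z0-9])?")


def prometheus_endpoint_setting(namespace: str, port: int = 9090) -> str:
    """Return the prometheusEndpoint key=value pair for the given namespace."""
    if not _DNS_LABEL.fullmatch(namespace):
        raise ValueError(f"invalid namespace name: {namespace!r}")
    url = (
        "http://prometheus-kube-prometheus-prometheus."
        f"{namespace}.svc.cluster.local:{port}"
    )
    return f"dynamo-operator.dynamo.metrics.prometheusEndpoint={url}"
```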

Step 2.2: Deploy the System

This guide uses vLLM as the backend engine. The SLA planner also supports SGLang and TensorRT-LLM.

# Apply the disaggregated planner deployment for your backend (run only one of the following)
kubectl apply -f components/backends/vllm/deploy/disagg_planner.yaml -n $NAMESPACE # for vllm

kubectl apply -f components/backends/sglang/deploy/disagg_planner.yaml -n $NAMESPACE # for sglang

kubectl apply -f components/backends/trtllm/deploy/disagg_planner.yaml -n $NAMESPACE # for trtllm

# Check deployment status
kubectl get pods -n $NAMESPACE

Expected pods (all should be 1/1 Running):

vllm-disagg-planner-frontend-*            1/1 Running
vllm-disagg-planner-planner-*             1/1 Running
vllm-disagg-planner-backend-*             1/1 Running
vllm-disagg-planner-prefill-*             1/1 Running
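The readiness check can also be scripted. The sketch below scans the default text output of `kubectl get pods -n $NAMESPACE` and returns any pod that is not fully ready and Running. It assumes the usual NAME, READY, STATUS column order; `not_ready_pods` is a hypothetical helper.

```python
def not_ready_pods(kubectl_output: str) -> list[str]:
    """Return names of pods whose STATUS is not Running or whose READY count is incomplete."""
    bad = []
    for line in kubectl_output.strip().splitlines():
        fields = line.split()
        # Skip the header row and anything that does not look like a pod row.
        if len(fields) < 3 or fields[0] == "NAME":
            continue
        name, ready, status = fields[:3]
        ready_count, _, total = ready.partition("/")
        if status != "Running" or ready_count != total:
            bad.append(name)
    return bad
```

An empty list means every listed pod matches the expected state.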

Step 2.3: Test the System

# Port forward to frontend
kubectl port-forward -n $NAMESPACE deployment/vllm-disagg-planner-frontend 8000:8000

# Send a request
curl -N http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [
    {
        "role": "user",
        "content": "Hello, how are you?"
    }
    ],
    "stream":true,
    "max_tokens": 30
  }'
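With "stream": true, the response arrives as a series of `data:` lines. The sketch below reassembles the generated text from such lines. It assumes the frontend emits OpenAI-style chat completion chunks (a `choices[].delta.content` field and a final `data: [DONE]` line), which the /v1/chat/completions path suggests but this guide does not confirm. `collect_stream_text` is a hypothetical helper.

```python
import json
from typing import Iterable


def collect_stream_text(lines: Iterable[str]) -> str:
    """Concatenate delta.content pieces from server-sent event lines."""
    pieces = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":  # assumed end-of-stream marker
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            content = (choice.get("delta") or {}).get("content")
            if content:
                pieces.append(content)
    return "".join(pieces)
```

The function accepts any iterable of lines, so it can be fed a saved curl output file or a streaming HTTP response.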

Step 2.4: Monitor Scaling

# Check planner logs for scaling decisions
kubectl logs -n $NAMESPACE deployment/vllm-disagg-planner-planner --tail=10

Expected successful output (after streaming requests):

New adjustment interval started!
Observed num_req: X.XXX isl: X.XXX osl: X.XXX
Observed ttft: X.XXXs itl: X.XXXs
Number of prefill workers: 1, number of decode workers: 1
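To compare what the planner observes against your targets, the latency line can be parsed from the logs. This sketch assumes the log format shown above, with observed values in seconds, and targets in milliseconds as configured in Step 1.3. `latest_sla_violations` is a hypothetical helper.

```python
import re
from typing import Optional

_OBSERVED = re.compile(r"Observed ttft: ([\d.]+)s itl: ([\d.]+)s")


def latest_sla_violations(
    log_text: str, ttft_target_ms: float, itl_target_ms: float
) -> Optional[list[str]]:
    """Return which targets the most recent observation exceeds, or None if none was logged."""
    matches = _OBSERVED.findall(log_text)
    if not matches:
        return None
    ttft_s, itl_s = (float(v) for v in matches[-1])  # most recent interval
    violations = []
    if ttft_s * 1000 > ttft_target_ms:  # logs are in seconds, targets in ms
        violations.append("ttft")
    if itl_s * 1000 > itl_target_ms:
        violations.append("itl")
    return violations
```

An empty list means the latest interval met both targets; `None` means no latency observation was logged, which can happen before any requests have been served.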

Phase 3: Production Readiness

Monitoring Metrics

  • Basic metrics (request count): Available with any request type
  • Latency metrics (TTFT/ITL): Available for both streaming and non-streaming requests
  • Scaling decisions: Require sufficient request volume

Troubleshooting

Connection Issues:

# Verify Prometheus is accessible
kubectl port-forward svc/prometheus-kube-prometheus-prometheus -n monitoring 9090:9090
curl "http://localhost:9090/api/v1/query?query=up"
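The `up` query returns JSON in the standard Prometheus HTTP API format, where each result carries a metric label set and a `[timestamp, "value"]` pair. The sketch below lists the scrape targets that report a value other than 1. `down_targets` is a hypothetical helper, and using the `instance` label to identify a target is an assumption about your scrape configuration.

```python
import json


def down_targets(response_body: str) -> list[str]:
    """Return the instance label of every target whose `up` value is not 1."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {payload.get('error')}")
    down = []
    for sample in payload["data"]["result"]:
        _timestamp, value = sample["value"]  # value is a string in the API response
        if float(value) != 1.0:
            down.append(sample["metric"].get("instance", "<unknown>"))
    return down
```

An empty result list from the query itself usually means Prometheus is reachable but is not scraping anything yet.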

Missing Metrics:

# Check frontend metrics
kubectl port-forward -n $NAMESPACE deployment/vllm-disagg-planner-frontend 8000:8000
curl http://localhost:8000/metrics | grep nv_llm_http_service
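As an alternative to grep, the /metrics output can be filtered in a script. The sketch below keeps samples whose series name starts with the nv_llm_http_service prefix used above. It assumes the Prometheus text exposition format without trailing timestamps; `metrics_with_prefix` is a hypothetical helper.

```python
def metrics_with_prefix(
    exposition_text: str, prefix: str = "nv_llm_http_service"
) -> dict[str, float]:
    """Map each matching series (name plus labels) to its sample value."""
    samples = {}
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # Assumption: the value is the last space-separated field (no timestamp).
        series, _, value = line.rpartition(" ")
        if series.startswith(prefix):
            samples[series] = float(value)
    return samples
```

An empty dictionary corresponds to the "Missing Metrics" case: the frontend is reachable but has not exported any matching series.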

Worker Issues:

  • Large models can take 10+ minutes to initialize
  • Check worker logs: kubectl logs -n $NAMESPACE deployment/vllm-disagg-planner-backend
  • Ensure GPU resources are available for workers

Unknown Field subComponentType:

If you encounter the following error when applying the deployment:

Error from server (BadRequest): error when creating "components/backends/vllm/deploy/disagg.yaml": DynamoGraphDeployment in version "v1alpha1" cannot be handled as a DynamoGraphDeployment: strict decoding error: unknown field "spec.services.DecodeWorker.subComponentType", unknown field "spec.services.PrefillWorker.subComponentType"

This error occurs because the subComponentType field was added only in newer versions of the DynamoGraphDeployment CRD (> 0.5.0). You can upgrade the CRD version by following the instructions here.

Next Steps

Quick Reference

| Phase | Duration | Purpose | Status Check |
|---|---|---|---|
| Profiling | 2-4 hours | Generate performance data | kubectl logs job/profile-sla |
| Deployment | 5-10 minutes | Enable autoscaling | kubectl get pods |
| Testing | 5 minutes | Verify functionality | kubectl logs deployment/planner |

Tip

Need Help? If you encounter issues, check the troubleshooting section or refer to the detailed guides linked in Next Steps.