| title | Disaggregated Serving |
|---|---|
| subtitle | Find optimal prefill/decode configuration for disaggregated serving deployments |
AIConfigurator is a performance optimization tool that helps you find the optimal configuration for deploying LLMs with Dynamo. It automatically determines the best number of prefill and decode workers, parallelism settings, and deployment parameters to meet your SLA targets while maximizing throughput.
When deploying LLMs with Dynamo, you need to make several critical decisions:
- Aggregated vs Disaggregated: Which architecture gives better performance for your workload?
- Worker Configuration: How many prefill and decode workers to deploy?
- Parallelism Settings: What tensor/pipeline parallel configuration to use?
- SLA Compliance: How to meet your TTFT and TPOT targets?
AIConfigurator answers these questions in seconds, providing:
- Recommended configurations that meet your SLA requirements
- Ready-to-deploy Dynamo configuration files (including Kubernetes manifests)
- Performance comparisons between different deployment strategies
- Up to 1.7x better throughput compared to manual configuration
AIConfigurator evaluates two deployment architectures and recommends the best one for your workload:
# Install
pip3 install aiconfigurator
# Find optimal configuration for vLLM backend
aiconfigurator cli default \
--model Qwen/Qwen3-32B-FP8 \
--total-gpus 8 \
--system h200_sxm \
--backend vllm \
--backend-version 0.12.0 \
--isl 4000 \
--osl 500 \
--ttft 600 \
--tpot 16.67 \
--save-dir ./results_vllm
# Deploy on Kubernetes
kubectl apply -f ./results_vllm/agg/top1/agg/k8s_deploy.yamlThis section walks through a validated example deploying Qwen3-32B-FP8 on 8× H200 GPUs using vLLM.
aiconfigurator cli default \
--model Qwen/Qwen3-32B-FP8 \
--system h200_sxm \
--total-gpus 8 \
--isl 4000 \
--osl 500 \
--ttft 600 \
--tpot 25 \
--backend vllm \
--backend-version 0.12.0 \
--generator-dynamo-version 1.0.0 \
--generator-set K8sConfig.k8s_namespace=$YOUR_NAMESPACE \
--generator-set K8sConfig.k8s_pvc_name=$YOUR_PVC \
--save-dir ./results_vllmParameters explained:
--model: HuggingFace model ID or local path (e.g.,Qwen/Qwen3-32B-FP8)--system: GPU system type (h200_sxm,h100_sxm,a100_sxm)--total-gpus: Number of GPUs available for deployment--isl/--osl: Input/Output sequence lengths in tokens--ttft/--tpot: SLA targets - Time To First Token (ms) and Time Per Output Token (ms)--backend: Inference backend (vllm,trtllm, orsglang)--backend-version: Backend version (e.g.,0.12.0for vLLM)--save-dir: Directory to save generated deployment configs
AIConfigurator outputs a comparison of aggregated vs disaggregated deployment strategies:
********************************************************************************
* Dynamo aiconfigurator Final Results *
********************************************************************************
----------------------------------------------------------------------------
Input Configuration & SLA Target:
Model: Qwen/Qwen3-32B-FP8 (is_moe: False)
Total GPUs: 8
Best Experiment Chosen: disagg at 446.85 tokens/s/gpu (disagg 1.38x better)
----------------------------------------------------------------------------
Overall Best Configuration:
- Best Throughput: 3,574.80 tokens/s
- Per-GPU Throughput: 446.85 tokens/s/gpu
- Per-User Throughput: 53.58 tokens/s/user
- TTFT: 453.18ms
- TPOT: 18.66ms
- Request Latency: 9766.51ms
----------------------------------------------------------------------------
Pareto Frontier:
Qwen/Qwen3-32B-FP8 Pareto Frontier: tokens/s/gpu_cluster vs tokens/s/user
┌─────────────────────────────────────────────────────────────────────────┐
850.0┤ •• agg │
│ ff disagg │
│ xx disagg best │
│ │
708.3┤ │
│ f │
│ f │
│ fff │
566.7┤ f │
│ f │
│ f │
│ •• fffffffffffffffffx │
425.0┤ •••• ff │
│ ••• f │
│ ••••• f │
│ •••••••••• f │
283.3┤ ••• f │
│ •• f │
│ •• f │
│ ••••f │
141.7┤ •f• │
│ f••••• │
│ f ••••••• │
│ fffff •••• │
0.0┤ •••• │
└┬─────────────────┬─────────────────┬─────────────────┬─────────────────┬┘
0 30 60 90 120
tokens/s/gpu_cluster tokens/s/user
----------------------------------------------------------------------------
Deployment Details:
(p) stands for prefill, (d) stands for decode, bs stands for batch size, a replica stands for the smallest scalable unit xPyD of the disagg system
Some math: total gpus used = replicas * gpus/replica
gpus/replica = (p)gpus/worker * (p)workers + (d)gpus/worker * (d)workers; for Agg, gpus/replica = gpus/worker
gpus/worker = tp * pp * dp = etp * ep * pp for MoE models; tp * pp for dense models (underlined numbers are the actual values in math)
agg Top Configurations: (Sorted by tokens/s/gpu)
+------+---------+--------------+---------------+--------+-----------------+-------------+-------------------+----------+--------------+-------------+----------+----+
| Rank | backend | tokens/s/gpu | tokens/s/user | TTFT | request_latency | concurrency | total_gpus (used) | replicas | gpus/replica | gpus/worker | parallel | bs |
+------+---------+--------------+---------------+--------+-----------------+-------------+-------------------+----------+--------------+-------------+----------+----+
| 1 | vllm | 322.69 | 41.78 | 546.92 | 12490.03 | 64 (=32x2) | 8 (8=2x4) | 2 | 4 | 4 (=4x1x1) | tp4pp1 | 32 |
| 2 | vllm | 293.94 | 44.43 | 593.10 | 11823.67 | 56 (=14x4) | 8 (8=4x2) | 4 | 2 | 2 (=2x1x1) | tp2pp1 | 14 |
| 3 | vllm | 208.87 | 42.90 | 460.58 | 12093.52 | 40 (=40x1) | 8 (8=1x8) | 1 | 8 | 8 (=8x1x1) | tp8pp1 | 40 |
+------+---------+--------------+---------------+--------+-----------------+-------------+-------------------+----------+--------------+-------------+----------+----+
disagg Top Configurations: (Sorted by tokens/s/gpu)
+------+---------+--------------+---------------+--------+-----------------+-------------+-------------------+----------+--------------+------------+----------------+-------------+-------+------------+----------------+-------------+-------+
| Rank | backend | tokens/s/gpu | tokens/s/user | TTFT | request_latency | concurrency | total_gpus (used) | replicas | gpus/replica | (p)workers | (p)gpus/worker | (p)parallel | (p)bs | (d)workers | (d)gpus/worker | (d)parallel | (d)bs |
+------+---------+--------------+---------------+--------+-----------------+-------------+-------------------+----------+--------------+------------+----------------+-------------+-------+------------+----------------+-------------+-------+
| 1 | vllm | 446.85 | 53.58 | 453.18 | 9766.51 | 76 (=76x1) | 8 (8=1x8) | 1 | 8 (=2x2+1x4) | 2 | 2 (=2x1) | tp2pp1 | 1 | 1 | 4 (=4x1) | tp4pp1 | 76 |
| 2 | vllm | 446.85 | 41.14 | 453.18 | 12581.87 | 144 (=72x2) | 8 (8=2x4) | 2 | 4 (=1x2+1x2) | 1 | 2 (=2x1) | tp2pp1 | 1 | 1 | 2 (=2x1) | tp2pp1 | 72 |
| 3 | vllm | 333.73 | 40.22 | 453.18 | 12860.32 | 72 (=36x2) | 8 (8=2x4) | 2 | 4 (=1x2+2x1) | 1 | 2 (=2x1) | tp2pp1 | 1 | 2 | 1 (=1x1) | tp1pp1 | 18 |
+------+---------+--------------+---------------+--------+-----------------+-------------+-------------------+----------+--------------+------------+----------------+-------------+-------+------------+----------------+-------------+-------+
Reading the output:
- tokens/s/gpu: Overall throughput efficiency — higher is better
- tokens/s/user: Per-request generation speed (inverse of TPOT)
- TTFT: Predicted time to first token
- concurrency: Total concurrent requests across all replicas (e.g.,
56 (=14x4)means batch size 14 × 4 replicas) - agg Rank 1 recommends TP4 with 2 replicas — simpler to deploy
- disagg Rank 1 recommends 2 prefill workers (TP2) + 1 decode worker (TP4) — higher throughput but requires RDMA
The --save-dir generates ready-to-use Kubernetes manifests:
├── agg
│ ├── best_config_topn.csv
│ ├── exp_config.yaml
│ ├── pareto.csv
│ ├── top1
│ │ ├── agg_config.yaml
│ │ ├── bench_run.sh # aiperf benchmark sweep script (bare-metal)
│ │ ├── generator_config.yaml
│ │ ├── k8s_bench.yaml # aiperf benchmark sweep Job (Kubernetes)
│ │ ├── k8s_deploy.yaml # Kubernetes DynamoGraphDeployment
│ │ └── run_0.sh
│ ...
├── disagg
│ ├── best_config_topn.csv
│ ├── exp_config.yaml
│ ├── pareto.csv
│ ├── top1
│ │ ├── bench_run.sh # aiperf benchmark sweep script (bare-metal)
│ │ ├── decode_config.yaml
│ │ ├── generator_config.yaml
│ │ ├── k8s_bench.yaml # aiperf benchmark sweep Job (Kubernetes)
│ │ ├── k8s_deploy.yaml # Kubernetes DynamoGraphDeployment
│ │ ├── prefill_config.yaml
│ │ ├── run_0.sh
│ │ └── run_1.sh (for multi-node setups)
│ ...
└── pareto_frontier.png
Before deploying, ensure you have:
-
HuggingFace Token Secret (for gated models):
kubectl create secret generic hf-token-secret \ -n your-namespace \ --from-literal=HF_TOKEN="your-huggingface-token" -
Model Cache PVC (recommended for faster restarts):
apiVersion: v1 kind: PersistentVolumeClaim metadata: name: model-cache namespace: your-namespace spec: accessModes: - ReadWriteMany resources: requests: storage: 100Gi
The generated k8s_deploy.yaml provides a starting point. You'll typically need to customize it for your environment:
kubectl apply -f ./results_vllm/agg/top1/agg/k8s_deploy.yamlComplete deployment example with model cache and production settings:
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
name: dynamo-agg
namespace: your-namespace
spec:
backendFramework: vllm
pvcs:
- name: model-cache
create: false # Use existing PVC
services:
Frontend:
componentType: frontend
replicas: 1
volumeMounts:
- name: model-cache
mountPoint: /opt/models
envs:
- name: HF_HOME
value: /opt/models
extraPodSpec:
mainContainer:
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.1.1
imagePullPolicy: IfNotPresent
VLLMWorker:
envFromSecret: hf-token-secret
componentType: worker
replicas: 4
resources:
limits:
gpu: "2"
sharedMemory:
size: 16Gi # Required for vLLM
volumeMounts:
- name: model-cache
mountPoint: /opt/models
envs:
- name: HF_HOME
value: /opt/models
extraPodSpec:
mainContainer:
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.1.1
workingDir: /workspace
imagePullPolicy: IfNotPresent
command:
- python3
- -m
- dynamo.vllm
args:
- --model
- "Qwen/Qwen3-32B-FP8"
- "--no-enable-prefix-caching"
- "--tensor-parallel-size"
- "2"
- "--pipeline-parallel-size"
- "1"
- "--data-parallel-size"
- "1"
- "--kv-cache-dtype"
- "fp8"
- "--max-model-len"
- "6000"
- "--max-num-seqs"
- "1024"Key deployment settings:
| Setting | Purpose | Notes |
|---|---|---|
backendFramework: vllm |
Tells Dynamo which runtime to use | Required at spec level |
pvcs + volumeMounts |
Caches model weights across restarts | Mount at /opt/models (not /root/) |
HF_HOME env var |
Points HuggingFace to cache location | Must match mountPoint |
sharedMemory.size: 16Gi |
IPC memory for vLLM | 16Gi for vLLM, 80Gi for TRT-LLM |
envFromSecret |
Injects HF_TOKEN | Required for gated models |
After deployment, validate the predictions against actual performance using AIPerf.
ℹ️ Run AIPerf inside the cluster to avoid network latency affecting measurements.
AIC automatically generates AIPerf scripts along with Dynamo configs and stores them in the results folder (when --save-dir ... is specified). For Kubernetes deployments, you can run benchmarks using k8s_bench.yaml; while for bare-metal systems, use the bench_run.sh script. These scripts execute AIPerf across a concurrency list: the default set (1 2 8 16 32 64 128) along with BenchConfig.estimated_concurrency and its values within ±5%. You can also customize this concurrency list as needed.
By default, AIPerf results will be saved in /tmp/bench_artifacts of the containers. If PVC name is specified in --generator-set K8sConfig.k8s_pvc_name=$YOUR_PVC, result artifacts will be saved in the PVC volume mount instead.
| AIC Output | AIPerf Parameter | Notes |
|---|---|---|
concurrency: 56 (=14x4) |
--concurrency 56 |
Use total concurrency when benchmarking via the frontend |
| ISL/OSL targets | --isl 4000 --osl 500 |
Match your AIC inputs |
| - | --num-requests 800 |
Use concurrency × 40 minimum for statistical stability |
| - | --extra-inputs "ignore_eos:true" |
Ensures exact OSL tokens generated |
Note on concurrency: AIC reports concurrency as
total (=bs × replicas). When benchmarking through the frontend (which routes to all replicas), use the total value. If benchmarking a single replica directly, use the per-replicabsvalue instead.
apiVersion: batch/v1
kind: Job
metadata:
name: aiperf-benchmark
namespace: your-namespace
spec:
template:
spec:
restartPolicy: Never
containers:
- name: aiperf
image: python:3.10
command:
- /bin/bash
- -c
- |
pip install aiperf
aiperf profile \
-m Qwen/Qwen3-32B-FP8 \
--endpoint-type chat \
-u http://dynamo-agg-frontend:8000 \
--isl 4000 --isl-stddev 0 \
--osl 500 --osl-stddev 0 \
--num-requests 800 \
--concurrency 56 \
--streaming \
--extra-inputs "ignore_eos:true" \
--num-warmup-requests 40 \
--ui-type simplekubectl apply -f k8s_bench.yaml
kubectl logs -f -l job-name=aiperf-benchmarkValidated results (Qwen3-32B-FP8, 8× H200, TP2×4 replicas, aggregated):
| Metric | AIC Prediction | Actual (avg) | Status |
|---|---|---|---|
| TTFT (ms) | 509 | 209 | Better than target |
| ITL/TPOT (ms) | 16.49 | 15.06 | Within 10% |
| Throughput (req/s) | ~6.3 | 6.9 | Within 10% |
| Total Output TPS | ~3,178 | 3,462 | Within 10% |
AIConfigurator provides a strong starting point. Here's how to iterate for production:
If your real workload differs from the benchmark parameters:
# For longer outputs (chat/code generation):
# increase OSL, relax TTFT target
aiconfigurator cli default \
--model Qwen/Qwen3-32B-FP8 \
--total-gpus 8 \
--system h200_sxm \
--backend vllm \
--backend-version 0.12.0 \
--isl 2000 \
--osl 2000 \
--ttft 1000 \
--tpot 10 \
--save-dir ./results_long_outputUse exp mode to compare custom configurations:
# custom_exp.yaml
exps:
- exp_tp2
- exp_tp4
exp_tp2:
mode: "patch"
serving_mode: "agg"
model_path: "Qwen/Qwen3-32B-FP8"
total_gpus: 8
system_name: "h200_sxm"
backend_name: "vllm"
backend_version: "0.12.0"
isl: 4000
osl: 500
ttft: 600
tpot: 16.67
config:
agg_worker_config:
tp_list: [2]
exp_tp4:
mode: "patch"
serving_mode: "agg"
model_path: "Qwen/Qwen3-32B-FP8"
total_gpus: 8
system_name: "h200_sxm"
backend_name: "vllm"
backend_version: "0.12.0"
isl: 4000
osl: 500
ttft: 600
tpot: 16.67
config:
agg_worker_config:
tp_list: [4]aiconfigurator cli exp --yaml-path custom_exp.yaml --save-dir ./results_customCritical: Disaggregated deployments require RDMA for KV cache transfer. Without RDMA, performance degrades by 40x (TTFT increases from 355ms to 10+ seconds). See the Disaggregated Deployment section below.
Disaggregated deployments transfer KV cache between prefill and decode workers. Without RDMA, this transfer becomes a severe bottleneck, causing 40x performance degradation.
- RDMA-capable network (InfiniBand or RoCE)
- RDMA device plugin installed on the cluster (provides
rdma/ibresources) - ETCD and NATS deployed (for coordination)
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
name: dynamo-disagg
namespace: your-namespace
spec:
backendFramework: vllm
pvcs:
- name: model-cache
create: false
services:
Frontend:
componentType: frontend
replicas: 1
volumeMounts:
- name: model-cache
mountPoint: /opt/models
envs:
- name: HF_HOME
value: /opt/models
extraPodSpec:
mainContainer:
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.1.1
imagePullPolicy: IfNotPresent
VLLMPrefillWorker:
envFromSecret: hf-token-secret
componentType: worker
subComponentType: prefill
replicas: 2
resources:
limits:
gpu: "2"
sharedMemory:
size: 16Gi
volumeMounts:
- name: model-cache
mountPoint: /opt/models
envs:
- name: HF_HOME
value: /opt/models
- name: UCX_TLS
value: "rc_x,rc,dc_x,dc,cuda_copy,cuda_ipc" # Enable RDMA transports
- name: UCX_RNDV_SCHEME
value: "get_zcopy"
- name: UCX_RNDV_THRESH
value: "0"
extraPodSpec:
mainContainer:
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.1.1
workingDir: /workspace
imagePullPolicy: IfNotPresent
securityContext:
capabilities:
add: ["IPC_LOCK"] # Required for RDMA memory registration
resources:
limits:
rdma/ib: "2" # Request RDMA resources
requests:
rdma/ib: "2"
command: ["python3", "-m", "dynamo.vllm"]
args:
- --model
- "Qwen/Qwen3-32B-FP8"
- "--tensor-parallel-size"
- "2"
- "--kv-cache-dtype"
- "fp8"
- "--max-num-seqs"
- "1" # Prefill workers use batch size 1
- --disaggregation-mode
- prefill
VLLMDecodeWorker:
envFromSecret: hf-token-secret
componentType: worker
subComponentType: decode
replicas: 1
resources:
limits:
gpu: "4"
sharedMemory:
size: 16Gi
volumeMounts:
- name: model-cache
mountPoint: /opt/models
envs:
- name: HF_HOME
value: /opt/models
- name: UCX_TLS
value: "rc_x,rc,dc_x,dc,cuda_copy,cuda_ipc"
- name: UCX_RNDV_SCHEME
value: "get_zcopy"
- name: UCX_RNDV_THRESH
value: "0"
extraPodSpec:
mainContainer:
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.1.1
workingDir: /workspace
imagePullPolicy: IfNotPresent
securityContext:
capabilities:
add: ["IPC_LOCK"]
resources:
limits:
rdma/ib: "4"
requests:
rdma/ib: "4"
command: ["python3", "-m", "dynamo.vllm"]
args:
- --model
- "Qwen/Qwen3-32B-FP8"
- "--tensor-parallel-size"
- "4"
- "--kv-cache-dtype"
- "fp8"
- "--max-num-seqs"
- "1024" # Decode workers handle high concurrency
- --disaggregation-mode
- decodeCritical RDMA settings:
| Setting | Purpose |
|---|---|
rdma/ib: "N" |
Request N RDMA resources (match TP size) |
IPC_LOCK capability |
Required for RDMA memory registration |
UCX_TLS env var |
Enables RDMA transports (rc_x, dc_x) |
UCX_RNDV_SCHEME=get_zcopy |
Zero-copy RDMA transfers |
After deployment, check the worker logs for UCX initialization:
kubectl logs <prefill-worker-pod> | grep -i "UCX\|NIXL"You should see:
NIXL INFO Backend UCX was instantiated
If you see only TCP transports, RDMA is not active - check your RDMA device plugin and resource requests.
Override vLLM engine parameters with --generator-set:
aiconfigurator cli default \
--model Qwen/Qwen3-32B-FP8 \
--total-gpus 8 \
--system h200_sxm \
--backend vllm \
--backend-version 0.12.0 \
--isl 4000 --osl 500 \
--ttft 600 --tpot 16.67 \
--save-dir ./results_tuned \
--generator-set Workers.agg.kv_cache_free_gpu_memory_fraction=0.85 \
--generator-set Workers.agg.max_num_seqs=2048Run aiconfigurator cli default --generator-help to see all available parameters.
For workloads with repeated prefixes (e.g., system prompts):
- Enable prefix caching when you have high prefix hit rates
- Disable prefix caching (
--no-enable-prefix-caching) for diverse prompts
AIConfigurator's default predictions assume no prefix caching. Enable it post-deployment if your workload benefits.
For a comprehensive breakdown of which model/system/backend/version combinations are supported in both aggregated and disaggregated modes, refer to the support matrix. The raw data is available as per-system CSV files, which are automatically generated and tested to ensure accuracy across all supported configurations.
You can also check if a system / framework version is supported via the aiconfigurator cli support command. For example:
aiconfigurator cli support --model Qwen/Qwen3-32B-FP8 --system h100_sxm --backend-version 1.2.0rc5# Strict latency SLAs (real-time chat)
aiconfigurator cli default \
--model meta-llama/Llama-3.1-70B \
--total-gpus 16 \
--system h200_sxm \
--backend vllm \
--backend-version 0.12.0 \
--ttft 200 --tpot 8
# High throughput (batch processing)
aiconfigurator cli default \
--model Qwen/Qwen3-32B-FP8 \
--total-gpus 32 \
--system h200_sxm \
--backend trtllm \
--ttft 2000 --tpot 50
# Request latency constraint (end-to-end SLA)
aiconfigurator cli default \
--model Qwen/Qwen3-32B-FP8 \
--total-gpus 16 \
--system h200_sxm \
--backend vllm \
--backend-version 0.12.0 \
--request-latency 12000 \
--isl 4000 --osl 500# Web interface for interactive exploration
pip3 install aiconfigurator[webapp]
aiconfigurator webapp # Visit http://127.0.0.1:7860
# Quick config generation (no parameter sweep)
aiconfigurator cli generate \
--model Qwen/Qwen3-32B-FP8 \
--total-gpus 8 \
--system h200_sxm \
--backend vllm
# Check model/system support
aiconfigurator cli support \
--model Qwen/Qwen3-32B-FP8 \
--system h200_sxm \
--backend vllmModel not found: Use the full HuggingFace path (e.g., Qwen/Qwen3-32B-FP8 not QWEN3_32B)
Backend version mismatch: Check supported versions with aiconfigurator cli support --model <model> --system <system> --backend <backend>
Pods crash with "Permission denied" on cache directory:
- Mount the PVC at
/opt/modelsinstead of/root/.cache/huggingface - Set
HF_HOME=/opt/modelsenvironment variable - Ensure the PVC has
ReadWriteManyaccess mode
Workers stuck in CrashLoopBackOff:
- Check logs:
kubectl logs <pod-name> --previous - Verify
sharedMemory.sizeis set (16Gi for vLLM, 80Gi for TRT-LLM) - Ensure HuggingFace token secret exists and is named correctly
Model download slow on every restart:
- Add PVC for model caching (see deployment example above)
- Verify
volumeMountsandHF_HOMEare configured on workers
"Context stopped or killed" errors (disaggregated only):
- Deploy ETCD and NATS infrastructure (required for KV cache transfer)
- See Dynamo Kubernetes Guide for platform setup
OOM errors: Reduce --max-num-seqs or increase tensor parallelism
Performance below predictions:
- Verify warmup requests are sufficient (40+ recommended)
- Check for competing workloads on the cluster
- Ensure KV cache memory fraction is optimized
- Run benchmarks from inside the cluster to eliminate network latency
Disaggregated TTFT extremely high (10+ seconds): This is almost always caused by missing RDMA configuration. Without RDMA, KV cache transfer falls back to TCP and becomes a severe bottleneck.
To diagnose:
# Check if RDMA resources are allocated
kubectl get pod <worker-pod> -o yaml | grep -A5 "resources:"
# Check UCX transport in logs
kubectl logs <worker-pod> | grep -i "UCX\|transport"To fix:
- Ensure your cluster has RDMA device plugin installed
- Add
rdma/ibresource requests to worker pods - Add
IPC_LOCKcapability to security context - Add UCX environment variables (see Disaggregated Deployment section)
Disaggregated working but throughput lower than aggregated: For balanced workloads (ISL/OSL ratio between 2:1 and 10:1), aggregated is often better. Disaggregated shines for:
- Very long inputs (ISL > 8000) with short outputs
- Workloads needing independent prefill/decode scaling