Integration with Dynamo

⚠️ Experimental Feature: ChReK is currently in beta/preview. The ChReK DaemonSet runs in privileged mode to perform CRIU operations. See Limitations for details.

Checkpointing captures the complete state of a running worker pod (including GPU memory) and saves it to storage. New pods can restore from this checkpoint instead of performing a full cold start.

| Startup Type | Time | What Happens |
|---|---|---|
| Cold Start | ~1 min | Download model, load to GPU, initialize engine |
| Warm Start (checkpoint) | < 10 sec | Restore from checkpoint tar |

Prerequisites

  • Dynamo Platform installed (v0.4.0+) on k8s cluster with GPU nodes
  • ChReK Helm chart installed (separate from platform)
  • RWX PVC storage (PVC is currently the only supported backend)

Quick Start

1. Install ChReK Infrastructure

First, install the ChReK Helm chart in each namespace where you need checkpointing:

# Install ChReK infrastructure
helm install chrek nvidia/chrek \
  --namespace my-team \
  --create-namespace \
  --set storage.pvc.size=100Gi

This creates:

  • A PVC for checkpoint storage (chrek-pvc)
  • A DaemonSet for CRIU operations (chrek-agent)

2. Configure Operator Values

Update your Helm values to point to the ChReK infrastructure:

# values.yaml
dynamo-operator:
  checkpoint:
    enabled: true
    storage:
      type: pvc  # Only PVC is currently supported (S3/OCI planned)
      pvc:
        pvcName: "chrek-pvc"  # Must match ChReK chart
        basePath: "/checkpoints"
      signalHostPath: "/var/lib/chrek/signals"  # Must match ChReK chart

3. Configure Your DGD

Add checkpoint configuration to your worker service. Both vLLM and SGLang are supported — use the appropriate backendFramework, command, and CLI flags.

vLLM Example

apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-llm
spec:
  services:
    worker:
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/dynamo-vllm-placeholder:latest
          command: ["python3"]
          args:
            - "-m"
            - "dynamo.vllm"
            - "--model"
            - "meta-llama/Llama-3-8B"
            - "--max-model-len"
            - "4096"
            - "--gpu-memory-utilization"
            - "0.90"
          env:
            # Required for cross-node checkpoint/restore
            - name: GLOO_SOCKET_IFNAME
              value: "lo"
            - name: NCCL_SOCKET_IFNAME
              value: "lo"
      resources:
        limits:
          nvidia.com/gpu: "1"
      checkpoint:
        enabled: true
        mode: auto
        identity:
          model: "meta-llama/Llama-3-8B"
          backendFramework: "vllm"
          tensorParallelSize: 1
          dtype: "bfloat16"
          maxModelLen: 4096

SGLang Example

apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-sglang-llm
spec:
  services:
    worker:
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/dynamo-sglang-placeholder:latest
          command: ["python3"]
          args:
            - "-m"
            - "dynamo.sglang"
            - "--model"
            - "meta-llama/Llama-3-8B"
            - "--mem-fraction-static"
            - "0.90"
          env:
            # Required for cross-node checkpoint/restore
            - name: GLOO_SOCKET_IFNAME
              value: "lo"
            - name: NCCL_SOCKET_IFNAME
              value: "lo"
      resources:
        limits:
          nvidia.com/gpu: "1"
      checkpoint:
        enabled: true
        mode: auto
        identity:
          model: "meta-llama/Llama-3-8B"
          backendFramework: "sglang"
          tensorParallelSize: 1
          dtype: "bfloat16"
          maxModelLen: 4096

Key differences between backends:

| Setting | vLLM | SGLang |
|---|---|---|
| Module | `dynamo.vllm` | `dynamo.sglang` |
| Max context (optional) | `--max-model-len` | `--context-length` |
| GPU memory | `--gpu-memory-utilization` | `--mem-fraction-static` |
| Placeholder image | `dynamo-vllm-placeholder` | `dynamo-sglang-placeholder` |
| Identity `backendFramework` | `"vllm"` | `"sglang"` |

Note: Do not set DYN_READY_FOR_CHECKPOINT_FILE or DYN_CHECKPOINT_READY_FILE in the DGD worker env vars. These are injected automatically by the operator's checkpoint controller into checkpoint job pods only. Setting them on worker pods causes all workers to enter checkpoint mode instead of cold-starting normally.

4. Deploy

kubectl apply -f my-llm.yaml -n dynamo-system

On first deployment:

  1. A checkpoint job runs to create the checkpoint
  2. Worker pods start with cold start (checkpoint not ready yet)
  3. Once checkpoint is ready, new pods (scale-up, restarts) restore from checkpoint

Checkpoint Modes

Auto Mode (Recommended)

The operator automatically creates a DynamoCheckpoint CR if one doesn't exist:

checkpoint:
  enabled: true
  mode: auto
  identity:
    model: "meta-llama/Llama-3-8B"
    backendFramework: "vllm"  # or "sglang"
    tensorParallelSize: 1
    dtype: "bfloat16"
    maxModelLen: 4096

Reference Mode

Reference an existing DynamoCheckpoint CR by its 16-character hash using checkpointRef:

checkpoint:
  enabled: true
  checkpointRef: "e5962d34ba272638"  # 16-char hash of DynamoCheckpoint CR

This is useful when:

  • You want to pre-warm checkpoints before creating DGDs
  • You want explicit control over which checkpoint to use

Flow:

  1. Create a DynamoCheckpoint CR (see DynamoCheckpoint CRD section)
  2. Wait for it to become Ready
  3. Reference it in your DGD using checkpointRef with the hash
# Check checkpoint status (using 16-char hash name)
kubectl get dynamocheckpoint e5962d34ba272638 -n dynamo-system
NAME                MODEL                   BACKEND  PHASE  HASH              AGE
e5962d34ba272638    meta-llama/Llama-3-8B  vllm     Ready  e5962d34ba272638  5m

# Now create DGD referencing it
kubectl apply -f my-dgd.yaml

Checkpoint Identity

Checkpoints are uniquely identified by a 16-character SHA-256 hash prefix (64 bits) of the configuration that affects runtime state. All of the following identity fields contribute to the hash:

| Field | Example |
|---|---|
| `model` | `meta-llama/Llama-3-8B` |
| `framework` | `sglang`, `trtllm`, `vllm` |
| `dynamoVersion` | `0.9.0`, `1.0.0` |
| `tensorParallelSize` | `1`, `2`, `4`, `8` (default: `1`) |
| `pipelineParallelSize` | `1`, `2` (default: `1`) |
| `dtype` | `float16`, `bfloat16`, `fp8` |
| `maxModelLen` | `4096`, `8192` |
| `extraParameters` | Custom key-value pairs |

Not included in hash (don't invalidate checkpoint):

  • replicas
  • nodeSelector, affinity, tolerations
  • resources (requests/limits)
  • Logging/observability config
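As an illustration of this scheme, here is a minimal sketch of how such a 16-character hash could be derived. The operator's actual canonicalization (field order, default handling, serialization format) is an implementation detail and may differ; the point is that only identity fields feed the hash, so deployment-only settings can never invalidate a checkpoint.

```python
import hashlib
import json


def identity_hash(identity: dict) -> str:
    """Illustrative only: derive a 16-character (64-bit) hash from
    checkpoint identity fields by hashing a canonical JSON encoding."""
    canonical = json.dumps(identity, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]


identity = {
    "model": "meta-llama/Llama-3-8B",
    "backendFramework": "vllm",
    "tensorParallelSize": 1,
    "dtype": "bfloat16",
    "maxModelLen": 4096,
}

# Deployment-only settings (replicas, resources, nodeSelector, ...) are not
# part of the identity dict, so changing them cannot change the hash.
h = identity_hash(identity)
```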

Example with all fields:

checkpoint:
  enabled: true
  mode: auto
  identity:
    model: "meta-llama/Llama-3-8B"
    backendFramework: "vllm"
    dynamoVersion: "0.9.0"
    tensorParallelSize: 1
    pipelineParallelSize: 1
    dtype: "bfloat16"
    maxModelLen: 8192
    extraParameters:
      enableChunkedPrefill: "true"
      quantization: "awq"

Checkpoint Naming: The DynamoCheckpoint CR is automatically named using the 16-character identity hash (e.g., e5962d34ba272638).

Checkpoint Sharing: Multiple DGDs with the same identity automatically share the same checkpoint.

DynamoCheckpoint CRD

The DynamoCheckpoint (shortname: dckpt) is a Kubernetes Custom Resource that manages checkpoint lifecycle.

When to create a DynamoCheckpoint directly:

  • Pre-warming: Create checkpoints before deploying DGDs for instant startup
  • Explicit control: Manage checkpoint lifecycle independently from DGDs

Note: With the new hash-based naming, checkpoint names are automatically generated (16-character hash). The operator handles checkpoint discovery and reuse automatically in auto mode.

Create a checkpoint:

apiVersion: nvidia.com/v1alpha1
kind: DynamoCheckpoint
metadata:
  name: e5962d34ba272638  # Use the computed 16-char hash
spec:
  identity:
    model: "meta-llama/Llama-3-8B"
    backendFramework: "vllm"
    tensorParallelSize: 1
    dtype: "bfloat16"

  job:
    activeDeadlineSeconds: 3600
    podTemplateSpec:
      spec:
        containers:
          - name: main
            image: nvcr.io/nvidia/ai-dynamo/dynamo-vllm:latest
            command: ["python3", "-m", "dynamo.vllm"]
            args: ["--model", "meta-llama/Llama-3-8B"]
            resources:
              limits:
                nvidia.com/gpu: "1"
            env:
              - name: HF_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-token-secret
                    key: HF_TOKEN

Note: You can compute the hash yourself, or use auto mode to let the operator create it.

Check status:

# List all checkpoints
kubectl get dynamocheckpoint -n dynamo-system
# Or use shortname
kubectl get dckpt -n dynamo-system

NAME                MODEL                          BACKEND  PHASE    HASH              AGE
e5962d34ba272638    meta-llama/Llama-3-8B         vllm     Ready    e5962d34ba272638  5m
a7b4f89c12de3456    meta-llama/Llama-3-70B        vllm     Creating a7b4f89c12de3456  2m

Phases:

| Phase | Description |
|---|---|
| Pending | CR created, waiting for job to start |
| Creating | Checkpoint job is running |
| Ready | Checkpoint available for use |
| Failed | Checkpoint creation failed |
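When pre-warming, the phase progression above can be scripted. A minimal polling sketch, assuming a caller-supplied `get_phase` function (for example, a wrapper around `kubectl get dckpt <hash> -o jsonpath='{.status.phase}'`):

```python
import time
from typing import Callable


def wait_until_ready(get_phase: Callable[[], str],
                     timeout_s: float = 600.0,
                     poll_s: float = 5.0) -> bool:
    """Poll the DynamoCheckpoint phase until it is Ready.

    Returns True when the checkpoint becomes Ready (safe to reference via
    checkpointRef), False on Failed or timeout. Pending/Creating keep waiting.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        phase = get_phase()
        if phase == "Ready":
            return True
        if phase == "Failed":
            return False  # inspect the checkpoint job logs, then recreate
        time.sleep(poll_s)
    return False
```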

Detailed status:

kubectl describe dckpt e5962d34ba272638 -n dynamo-system
Status:
  Phase: Ready
  IdentityHash: e5962d34ba272638
  Location: /checkpoints/e5962d34ba272638
  StorageType: pvc
  CreatedAt: 2026-01-29T10:05:00Z

Reference from DGD:

Once the checkpoint is Ready, you can reference it by hash:

spec:
  services:
    VllmWorker:
      checkpoint:
        enabled: true
        checkpointRef: "e5962d34ba272638"  # 16-char hash

Or use auto mode and the operator will find/create it automatically.

Limitations

  • vLLM and SGLang backends only: TensorRT-LLM support is planned.
  • LLM workers only: Checkpoint/restore supports LLM decode and prefill workers. Specialized workers (multimodal, embedding, diffusion) are not supported.
  • Single-GPU only: Multi-GPU configurations are not yet supported (planned).
  • Network state: Active TCP connections are closed during restore (handled with the tcp-close CRIU option).
  • Storage: Only the PVC backend is currently implemented (S3/OCI planned).
  • Security: ChReK runs as a privileged DaemonSet, which is required to run CRIU.

Troubleshooting

Checkpoint Not Creating

  1. Check the checkpoint job:

    kubectl get jobs -l nvidia.com/chrek-is-checkpoint-source=true -n dynamo-system
    kubectl logs job/checkpoint-<name> -n dynamo-system
  2. Check the DaemonSet:

    kubectl logs daemonset/chrek-agent -n dynamo-system
  3. Verify storage access:

    kubectl exec -it <checkpoint-agent-pod> -- ls -la /checkpoints

Restore Failing

  1. Check pod logs:

    kubectl logs <worker-pod> -n dynamo-system
  2. Verify checkpoint file exists:

    # For PVC
    kubectl exec -it <any-pod-with-pvc> -- ls -la /checkpoints/
  3. Check environment variables:

    kubectl exec <worker-pod> -- env | grep DYN_CHECKPOINT

Cold Start Despite Checkpoint

Pods fall back to cold start if:

  • Checkpoint file doesn't exist yet (still being created)
  • Checkpoint file is corrupted
  • CRIU restore fails

Check logs for "Falling back to cold start" message.
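The first two fallback conditions can be sketched as a simple pre-restore check. This is illustrative only: `should_restore` is a hypothetical helper, not part of ChReK, and the real agent additionally falls back when the CRIU restore itself fails at runtime.

```python
import os
import tarfile


def should_restore(checkpoint_tar: str) -> bool:
    """Illustrative pre-restore check mirroring the documented fallback:
    attempt a restore only if the checkpoint tar exists and is a readable
    tar archive; otherwise fall back to a cold start."""
    if not os.path.exists(checkpoint_tar):
        print("Falling back to cold start: checkpoint not found")
        return False
    if not tarfile.is_tarfile(checkpoint_tar):
        print("Falling back to cold start: checkpoint corrupted")
        return False
    return True
```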

Environment Variables

| Variable | Description |
|---|---|
| `DYN_CHECKPOINT_STORAGE_TYPE` | Backend: `pvc`, `s3`, `oci` (`s3` and `oci` are currently no-ops) |
| `DYN_CHECKPOINT_LOCATION` | Full checkpoint location (checkpoint jobs) |
| `DYN_CHECKPOINT_PATH` | Base checkpoint directory (restore pods, PVC) |
| `DYN_CHECKPOINT_HASH` | Identity hash |
| `DYN_READY_FOR_CHECKPOINT_FILE` | Ready-for-checkpoint file path (checkpoint jobs) |
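For example, a restore-side helper could combine these variables to locate its checkpoint under the PVC layout shown earlier (`<base path>/<identity hash>`, e.g. `/checkpoints/e5962d34ba272638`). `checkpoint_dir` is a hypothetical name for illustration, not part of ChReK:

```python
import os
from typing import Optional


def checkpoint_dir() -> Optional[str]:
    """Illustrative: resolve the checkpoint directory from the injected
    environment. Returns None when the variables are absent, i.e. the pod
    should cold-start instead of restoring."""
    base = os.environ.get("DYN_CHECKPOINT_PATH")   # e.g. /checkpoints
    hash_ = os.environ.get("DYN_CHECKPOINT_HASH")  # e.g. e5962d34ba272638
    if not base or not hash_:
        return None
    return os.path.join(base, hash_)
```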

Complete Example

Create a checkpoint and use it in a DGD:

# 1. Create the DynamoCheckpoint CR
apiVersion: nvidia.com/v1alpha1
kind: DynamoCheckpoint
metadata:
  name: e5962d34ba272638  # 16-char hash (computed from identity)
  namespace: dynamo-system
spec:
  identity:
    model: "meta-llama/Meta-Llama-3-8B-Instruct"
    backendFramework: "vllm"
    tensorParallelSize: 1
    dtype: "bfloat16"
  job:
    activeDeadlineSeconds: 3600
    backoffLimit: 3
    podTemplateSpec:
      spec:
        containers:
          - name: main
            image: nvcr.io/nvidia/ai-dynamo/dynamo-vllm-placeholder:latest
            command: ["python3"]
            args:
              - "-m"
              - "dynamo.vllm"
              - "--model"
              - "meta-llama/Meta-Llama-3-8B-Instruct"
              - "--max-model-len"
              - "4096"
              - "--gpu-memory-utilization"
              - "0.90"
            env:
              - name: HF_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-token-secret
                    key: HF_TOKEN
              - name: GLOO_SOCKET_IFNAME
                value: "lo"
              - name: NCCL_SOCKET_IFNAME
                value: "lo"
            resources:
              limits:
                nvidia.com/gpu: "1"
        restartPolicy: Never
---
# 2. Wait for Ready: kubectl get dckpt e5962d34ba272638 -n dynamo-system -w
---
# 3. Reference the checkpoint in your DGD
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-llm
  namespace: dynamo-system
spec:
  services:
    worker:
      replicas: 2
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/dynamo-vllm-placeholder:latest
          command: ["python3"]
          args:
            - "-m"
            - "dynamo.vllm"
            - "--model"
            - "meta-llama/Meta-Llama-3-8B-Instruct"
            - "--max-model-len"
            - "4096"
            - "--gpu-memory-utilization"
            - "0.90"
          env:
            - name: GLOO_SOCKET_IFNAME
              value: "lo"
            - name: NCCL_SOCKET_IFNAME
              value: "lo"
      resources:
        limits:
          nvidia.com/gpu: "1"
      checkpoint:
        enabled: true
        checkpointRef: "e5962d34ba272638"  # Reference by hash

Related Documentation