---
title: Testing
---

This document describes the test infrastructure for validating Dynamo's fault tolerance mechanisms. The testing framework covers request cancellation, request migration, etcd high availability (HA), and hardware fault injection scenarios.

Overview

Dynamo's fault tolerance test suite is located in tests/fault_tolerance/ and includes:

| Test Category | Location        | Purpose                                          |
|---------------|-----------------|--------------------------------------------------|
| Cancellation  | `cancellation/` | Request cancellation during in-flight operations |
| Migration     | `migration/`    | Request migration when workers fail              |
| etcd HA       | `etcd_ha/`      | etcd failover and recovery                       |
| Hardware      | `hardware/`     | GPU and network fault injection                  |
| Deployment    | `deploy/`       | End-to-end deployment testing                    |

Test Directory Structure

tests/fault_tolerance/
├── cancellation/
│   ├── test_vllm.py
│   ├── test_trtllm.py
│   ├── test_sglang.py
│   └── utils.py
├── migration/
│   ├── test_vllm.py
│   ├── test_trtllm.py
│   ├── test_sglang.py
│   └── utils.py
├── etcd_ha/
│   ├── test_vllm.py
│   ├── test_trtllm.py
│   ├── test_sglang.py
│   └── utils.py
├── hardware/
│   └── fault_injection_service/
│       ├── api_service/
│       └── agents/
├── deploy/
│   ├── test_deployment.py
│   ├── scenarios.py
│   ├── base_checker.py
│   └── ...
└── client.py

Request Cancellation Tests

Test that in-flight requests can be properly canceled.

Running Cancellation Tests

# Run all cancellation tests
pytest tests/fault_tolerance/cancellation/ -v

# Run for specific backend
pytest tests/fault_tolerance/cancellation/test_vllm.py -v

Cancellation Test Utilities

The cancellation/utils.py module provides:

CancellableRequest

Thread-safe request cancellation via TCP socket manipulation:

import time
from threading import Thread

from tests.fault_tolerance.cancellation.utils import CancellableRequest

request = CancellableRequest()

# Send the request in a separate thread (send_request is your own helper
# that issues the HTTP call using this CancellableRequest)
thread = Thread(target=send_request, args=(request,))
thread.start()

# Cancel after some time
time.sleep(1)
request.cancel()  # Closes the underlying socket
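The mechanism behind this can be sketched as holding a reference to the request's TCP socket and closing it from another thread. The following is a minimal illustration of that pattern, not the actual `CancellableRequest` implementation (which lives in `cancellation/utils.py`):

```python
import socket
import threading


class SimpleCancellableRequest:
    """Holds a reference to the request's TCP socket so another
    thread can abort the transfer by closing it."""

    def __init__(self):
        self._sock = None
        self._lock = threading.Lock()
        self.cancelled = False

    def attach(self, sock: socket.socket):
        with self._lock:
            self._sock = sock

    def cancel(self):
        # Closing the socket makes any blocked send/recv in the
        # request thread fail immediately, aborting the request.
        with self._lock:
            if self._sock is not None:
                self._sock.close()
            self.cancelled = True


# Demonstration with a local socket pair instead of a real HTTP call
client, server = socket.socketpair()
req = SimpleCancellableRequest()
req.attach(client)
req.cancel()
server.close()
```

Closing the socket from outside the request thread is what makes cancellation work even while the request thread is blocked in a read.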

send_completion_request / send_chat_completion_request

Send cancellable completion requests:

from tests.fault_tolerance.cancellation.utils import (
    send_completion_request,
    send_chat_completion_request
)

# Non-streaming
response = send_completion_request(
    base_url="http://localhost:8000",
    model="Qwen/Qwen3-0.6B",
    prompt="Hello, world!",
    max_tokens=100
)

# Streaming with cancellation
responses = send_chat_completion_request(
    base_url="http://localhost:8000",
    model="Qwen/Qwen3-0.6B",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
    cancellable_request=request
)

poll_for_pattern

Wait for specific patterns in logs:

from tests.fault_tolerance.cancellation.utils import poll_for_pattern

# Wait for cancellation confirmation
found = poll_for_pattern(
    log_file="/var/log/dynamo/worker.log",
    pattern="Request cancelled",
    timeout=30,
    interval=0.5
)
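Internally, such a poller amounts to repeatedly re-reading the log file until the pattern appears or the timeout elapses. A simplified sketch (the real utility may differ, e.g. by reading incrementally):

```python
import re
import time


def poll_for_pattern_sketch(log_file, pattern, timeout=30, interval=0.5):
    """Return True once `pattern` (a regex) appears in `log_file`,
    or False if `timeout` seconds pass first."""
    deadline = time.monotonic() + timeout
    regex = re.compile(pattern)
    while time.monotonic() < deadline:
        try:
            with open(log_file) as f:
                if regex.search(f.read()):
                    return True
        except FileNotFoundError:
            pass  # log file may not exist yet
        time.sleep(interval)
    return False
```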

Migration Tests

Test that requests migrate to healthy workers when failures occur.

Running Migration Tests

# Run all migration tests
pytest tests/fault_tolerance/migration/ -v

# Run for specific backend
pytest tests/fault_tolerance/migration/test_vllm.py -v

Migration Test Utilities

The migration/utils.py module provides:

  • Frontend wrapper with configurable request planes
  • Long-running request spawning for migration scenarios
  • Health check disabling for controlled testing

Example Migration Test

def test_migration_on_worker_failure():
    # Start a deployment with 2 workers (helpers shown are illustrative)
    deployment = start_deployment(workers=2)

    # Send a long-running request in the background
    request_thread = spawn_long_request(max_tokens=1000)

    # Kill one worker mid-generation
    kill_worker(deployment.workers[0])

    # Verify the request completes on the remaining worker
    response = request_thread.join()
    assert response.status_code == 200
    assert len(response.tokens) > 0

etcd HA Tests

Test system behavior during etcd failures and recovery.

Running etcd HA Tests

pytest tests/fault_tolerance/etcd_ha/ -v

Test Scenarios

  • Leader failover: etcd leader node fails, cluster elects new leader
  • Network partition: etcd node becomes unreachable
  • Recovery: System recovers after etcd becomes available
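The recovery scenario boils down to repeatedly probing the system until it reports healthy again. A generic sketch of that check (`probe_fn` and the timings here are illustrative, not part of the test suite):

```python
import time


def wait_for_recovery(probe_fn, timeout=60, interval=1.0):
    """Poll `probe_fn` (which returns True when healthy) until it
    succeeds or `timeout` seconds elapse. Returns True on recovery."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe_fn():
            return True
        time.sleep(interval)
    return False


# Example: a probe that fails twice before the "cluster" recovers
attempts = {"n": 0}

def probe():
    attempts["n"] += 1
    return attempts["n"] >= 3

recovered = wait_for_recovery(probe, timeout=5, interval=0.05)
```

In a real etcd HA test, the probe would be an etcd health check or a request against the Dynamo frontend.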

Hardware Fault Injection

The fault injection service enables testing under simulated hardware failures.

Fault Injection Service

Located at tests/fault_tolerance/hardware/fault_injection_service/, this FastAPI service orchestrates fault injection:

# Start the fault injection service
cd tests/fault_tolerance/hardware/fault_injection_service
python -m api_service.main

Supported Fault Types

GPU Faults

| Fault Type         | Description                              |
|--------------------|------------------------------------------|
| `XID_ERROR`        | Simulate a GPU XID error (various codes) |
| `THROTTLE`         | GPU thermal throttling                   |
| `MEMORY_PRESSURE`  | GPU memory exhaustion                    |
| `OVERHEAT`         | GPU overheating condition                |
| `COMPUTE_OVERLOAD` | GPU compute saturation                   |

Network Faults

| Fault Type        | Description                            |
|-------------------|----------------------------------------|
| `FRONTEND_WORKER` | Partition between frontend and workers |
| `WORKER_NATS`     | Partition between workers and NATS     |
| `WORKER_WORKER`   | Partition between workers              |
| `CUSTOM`          | Custom network partition               |

Fault Injection API

Inject GPU Fault

curl -X POST http://localhost:8080/api/v1/faults/gpu/inject \
  -H "Content-Type: application/json" \
  -d '{
    "target_pod": "vllm-worker-0",
    "fault_type": "XID_ERROR",
    "severity": "HIGH"
  }'

Inject Specific XID Error

# Inject XID 79 (GPU memory page fault)
curl -X POST http://localhost:8080/api/v1/faults/gpu/inject/xid-79 \
  -H "Content-Type: application/json" \
  -d '{"target_pod": "vllm-worker-0"}'

Supported XID codes: 43, 48, 74, 79, 94, 95, 119, 120

Inject Network Partition

curl -X POST http://localhost:8080/api/v1/faults/network/inject \
  -H "Content-Type: application/json" \
  -d '{
    "partition_type": "FRONTEND_WORKER",
    "duration_seconds": 30
  }'

Recover from Fault

curl -X POST http://localhost:8080/api/v1/faults/{fault_id}/recover

List Active Faults

curl http://localhost:8080/api/v1/faults

GPU Fault Injector Agent

The GPU fault injector runs as a DaemonSet on worker nodes:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gpu-fault-injector
spec:
  selector:
    matchLabels:
      app: gpu-fault-injector
  template:
    spec:
      containers:
      - name: agent
        image: dynamo/gpu-fault-injector:latest
        securityContext:
          privileged: true
        volumeMounts:
        - name: dev
          mountPath: /dev

The agent injects fake XID messages via /dev/kmsg to trigger NVSentinel detection.
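For illustration, a fake XID line can be constructed to resemble the NVIDIA driver's kernel log output. The exact message format below is an assumption modeled on typical `NVRM: Xid` log lines, and writing to `/dev/kmsg` requires the privileged context shown above:

```python
def build_fake_xid_message(xid: int, pci_bus: str = "0000:3b:00") -> str:
    # Modeled on the driver's "NVRM: Xid (PCI:...): <code>, ..." format;
    # the real agent's payload may differ.
    return f"NVRM: Xid (PCI:{pci_bus}): {xid}, simulated fault for testing"


msg = build_fake_xid_message(79)

# The agent would write this into the kernel ring buffer, e.g.:
#   with open("/dev/kmsg", "w") as kmsg:
#       kmsg.write(msg)
# (requires root inside the privileged DaemonSet container)
```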

Deployment Testing Framework

The deploy/ directory contains an end-to-end testing framework.

Test Phases

Tests run through three phases:

| Phase    | Description                                  |
|----------|----------------------------------------------|
| STANDARD | Baseline performance under normal conditions |
| OVERFLOW | System behavior during fault/overload        |
| RECOVERY | System recovery after fault resolution       |
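A checker can key its expectations off the current phase. For example (phase names from the table above; the threshold values are purely illustrative):

```python
from enum import Enum


class Phase(Enum):
    STANDARD = "standard"
    OVERFLOW = "overflow"
    RECOVERY = "recovery"


# Minimum acceptable request success rate per phase (illustrative values)
SUCCESS_RATE_FLOOR = {
    Phase.STANDARD: 0.99,
    Phase.OVERFLOW: 0.50,  # degraded service is expected mid-fault
    Phase.RECOVERY: 0.95,
}


def phase_ok(phase: Phase, success_rate: float) -> bool:
    """Check whether a measured success rate is acceptable for a phase."""
    return success_rate >= SUCCESS_RATE_FLOOR[phase]
```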

Scenario Configuration

Define test scenarios in scenarios.py:

from tests.fault_tolerance.deploy.scenarios import Scenario, Load, Failure

scenario = Scenario(
    name="worker_failure_migration",
    backend="vllm",
    load=Load(
        clients=10,
        requests_per_client=100,
        max_tokens=256
    ),
    failure=Failure(
        type="pod_kill",
        target="vllm-worker-0",
        trigger_after_requests=50
    )
)

Running Deployment Tests

# Run all deployment tests
pytest tests/fault_tolerance/deploy/test_deployment.py -v

# Run specific scenario
pytest tests/fault_tolerance/deploy/test_deployment.py::test_worker_failure -v

Validation Checkers

The framework includes pluggable validators:

from tests.fault_tolerance.deploy.base_checker import BaseChecker, ValidationContext

class MigrationChecker(BaseChecker):
    def check(self, context: ValidationContext) -> bool:
        # Verify migrations occurred
        migrations = context.metrics.get("migrations_total", 0)
        return migrations > 0

Results Parsing

Parse test results for analysis:

from tests.fault_tolerance.deploy.parse_results import process_overflow_recovery_test

results = process_overflow_recovery_test(log_dir="/path/to/logs")
print(f"Success rate: {results['success_rate']}")
print(f"P99 latency: {results['p99_latency_ms']}ms")

Client Utilities

The client.py module provides shared client functionality:

Multi-Threaded Load Generation

from tests.fault_tolerance.client import client

# Generate load with multiple clients
results = client(
    base_url="http://localhost:8000",
    num_clients=10,
    requests_per_client=100,
    model="Qwen/Qwen3-0.6B",
    max_tokens=256,
    log_dir="/tmp/test_logs"
)

Request Options

| Parameter             | Description                          |
|-----------------------|--------------------------------------|
| `base_url`            | Frontend URL                         |
| `num_clients`         | Number of concurrent clients         |
| `requests_per_client` | Requests per client                  |
| `model`               | Model name                           |
| `max_tokens`          | Max tokens per request               |
| `log_dir`             | Directory for client logs            |
| `endpoint`            | `completions` or `chat/completions`  |

Running the Full Test Suite

Prerequisites

  1. Kubernetes cluster with GPU nodes
  2. Dynamo deployment
  3. etcd cluster (for HA tests)
  4. Fault injection service (for hardware tests)

Environment Setup

export KUBECONFIG=/path/to/kubeconfig
export DYNAMO_NAMESPACE=dynamo-test
export FRONTEND_URL=http://localhost:8000

Run All Tests

# Install test dependencies
pip install pytest pytest-asyncio

# Run all fault tolerance tests
pytest tests/fault_tolerance/ -v --tb=short

# Run with specific markers
pytest tests/fault_tolerance/ -v -m "not slow"

Test Markers

| Marker    | Description                       |
|-----------|-----------------------------------|
| `slow`    | Long-running tests (> 5 minutes)  |
| `gpu`     | Requires GPU resources            |
| `k8s`     | Requires Kubernetes cluster       |
| `etcd_ha` | Requires multi-node etcd          |
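If markers like these are not registered, pytest emits unknown-mark warnings. They can be declared in `pytest.ini` (or the equivalent `pyproject.toml` section) — an illustrative fragment:

```ini
[pytest]
markers =
    slow: long-running tests (> 5 minutes)
    gpu: requires GPU resources
    k8s: requires Kubernetes cluster
    etcd_ha: requires multi-node etcd
```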

Best Practices

1. Isolate Test Environments

Run fault tolerance tests in dedicated namespaces:

kubectl create namespace dynamo-fault-test

2. Clean Up After Tests

Ensure fault injection is recovered:

# List and recover all active faults
curl http://localhost:8080/api/v1/faults | jq -r '.[].id' | \
  xargs -I {} curl -X POST http://localhost:8080/api/v1/faults/{}/recover

3. Collect Logs

Preserve logs for debugging:

pytest tests/fault_tolerance/ -v \
  --log-dir=/tmp/fault_test_logs \
  --capture=no

4. Monitor During Tests

Watch system state during tests:

# Terminal 1: Watch pods
watch kubectl get pods -n dynamo-test

# Terminal 2: Watch metrics
watch 'curl -s localhost:8000/metrics | grep -E "(migration|rejection)"'

Related Documentation