Add StatefulSet workload support to CRUD benchmarking framework by diamondpowell · Pull Request #1132 · Azure/telescope

diamondpowell · 2026-04-14T18:45:28Z

Summary

Adds StatefulSet workload support to the CRUD benchmarking framework — the second of three planned workload methods (deployment, statefulset, jobs). Measures K8s StatefulSet create/verify latency on AKS node pools.

Branch cleanup note: Rebased and squashed for reviewability; commits are logically grouped. Also includes a fix for the gpu_profile driver setting that caused node pool creation to fail on non-GPU pools.

Changes

modules/python/crud/workload_templates/statefulset.yml

New K8s manifest template with headless Service (clusterIP: None) for stable pod DNS
Uses STATEFULSET_REPLICAS placeholder, configurable node affinity via label_selector
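
The placeholder substitution described above can be sketched roughly as follows. The template text here is an abbreviated stand-in, not the contents of statefulset.yml, and NAME_PLACEHOLDER and render_statefulset_manifest are hypothetical names introduced for illustration; only STATEFULSET_REPLICAS and the headless Service (clusterIP: None) come from the PR description.

```python
from typing import List

# Abbreviated, hypothetical template. A real StatefulSet also needs a
# selector and a pod template; only the fields relevant here are shown.
TEMPLATE = """\
apiVersion: v1
kind: Service
metadata:
  name: NAME_PLACEHOLDER
spec:
  clusterIP: None
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: NAME_PLACEHOLDER
spec:
  serviceName: NAME_PLACEHOLDER
  replicas: STATEFULSET_REPLICAS
"""


def render_statefulset_manifest(template: str, name: str, replicas: int) -> List[str]:
    """Fill in the placeholders and split the multi-document YAML text."""
    if replicas < 0:
        raise ValueError("replicas must be non-negative")
    rendered = (
        template
        .replace("STATEFULSET_REPLICAS", str(replicas))
        .replace("NAME_PLACEHOLDER", name)  # hypothetical placeholder
    )
    # Documents in a multi-doc YAML file are separated by a line with '---'.
    documents = [doc.strip() for doc in rendered.split("\n---\n")]
    return [doc for doc in documents if doc]


if __name__ == "__main__":
    for doc in render_statefulset_manifest(TEMPLATE, "sts-1", 3):
        print(doc, end="\n\n")
```

The headless Service comes first so that the StatefulSet's serviceName refers to a Service that already exists when the pods start.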

modules/python/crud/azure/node_pool_crud.py

Add create_statefulset() — same loop pattern as create_deployment
Uses ready condition (not available) since StatefulSets don't support the available condition type
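
A minimal sketch of the loop pattern, assuming a client that exposes apply_manifest and wait_for_condition. StubK8sClient, both method signatures, the naming scheme, and the False return on a missing client are assumptions for illustration, not the repository's actual API.

```python
import logging

logger = logging.getLogger(__name__)


class StubK8sClient:
    """Hypothetical stand-in for the real Kubernetes client."""

    def __init__(self, failing):
        self.failing = set(failing)  # names that never become ready
        self.applied = []

    def apply_manifest(self, name, replicas):
        self.applied.append((name, replicas))

    def wait_for_condition(self, resource_type, name, condition, timeout_seconds):
        return name not in self.failing


def create_statefulset(client, node_pool_name, number_of_statefulsets=1,
                       replicas=1, timeout_seconds=300):
    """Create N StatefulSets; keep going on failure, return True only if all are ready."""
    if client is None:
        logger.error("Kubernetes client unavailable; skipping StatefulSet creation")
        return False  # assumed return value for the early exit

    failures = 0
    for index in range(1, number_of_statefulsets + 1):
        name = f"{node_pool_name}-sts-{index}"  # hypothetical naming scheme
        try:
            client.apply_manifest(name, replicas)
            # 'ready' rather than 'available', as noted in the PR description.
            ready = client.wait_for_condition("statefulset", name, "ready", timeout_seconds)
        except Exception:
            logger.exception("StatefulSet %s failed", name)
            ready = False
        if not ready:
            failures += 1  # record the failure and continue with the next one
    return failures == 0
```

Counting failures instead of returning on the first one matches the partial-success behaviour described in the tests: every StatefulSet is attempted, and the overall result is False if any of them did not become ready.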

modules/python/crud/main.py

Add statefulset subparser with --node-pool-name, --number-of-statefulsets, --replicas, --manifest-dir
Add elif command == "statefulset" routing in handle_workload_operations
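
The subparser and routing could look roughly like this. The flag names come from the PR description; the defaults, the required setting, the integer return codes, and the placeholder bodies are assumptions.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="main.py")
    subparsers = parser.add_subparsers(dest="command", required=True)

    sts = subparsers.add_parser("statefulset", help="Create StatefulSets on a node pool")
    sts.add_argument("--node-pool-name", required=True)
    sts.add_argument("--number-of-statefulsets", type=int, default=1)  # default assumed
    sts.add_argument("--replicas", type=int, default=1)                # default assumed
    sts.add_argument("--manifest-dir", default=None)
    return parser


def handle_workload_operations(command, **kwargs):
    """Route a workload command; returns 0 on success and 1 on an unknown command."""
    if command == "deployment":
        print(f"would create deployments with {kwargs}")   # placeholder body
        return 0
    elif command == "statefulset":
        print(f"would create statefulsets with {kwargs}")  # placeholder body
        return 0
    print(f"unknown workload command: {command}")
    return 1


if __name__ == "__main__":
    args = build_parser().parse_args()
    options = {k: v for k, v in vars(args).items() if k != "command"}
    raise SystemExit(handle_workload_operations(args.command, **options))
```

argparse converts the dashed flags to underscore attribute names, so --number-of-statefulsets is read back as args.number_of_statefulsets.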

modules/python/clients/kubernetes_client.py

Add _is_statefulset_ready and _check_statefulset_condition for readiness polling
Extend wait_for_condition to support StatefulSet resource type
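
One plausible shape for the readiness polling is sketched below. It treats a StatefulSet as ready when status.readyReplicas has reached spec.replicas, and it keeps polling a missing object until the timeout. Both choices are assumptions, and the function names and parameters are hypothetical rather than those in kubernetes_client.py.

```python
import time


def is_statefulset_ready(statefulset):
    """True when the observed ready replica count has reached the desired count."""
    if not statefulset:
        return False  # not found (None) or empty object
    desired = statefulset.get("spec", {}).get("replicas", 1)
    ready = statefulset.get("status", {}).get("readyReplicas") or 0
    return ready >= desired


def wait_for_statefulset(fetch, timeout_seconds=300.0, interval_seconds=2.0,
                         sleep=time.sleep, clock=time.monotonic):
    """Poll fetch() until the StatefulSet is ready or the timeout elapses."""
    deadline = clock() + timeout_seconds
    while True:
        if is_statefulset_ready(fetch()):
            return True
        if clock() >= deadline:
            return False
        sleep(interval_seconds)
```

Passing sleep and clock as parameters lets a unit test drive the success, timeout, and not-found paths without real delays.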

steps/engine/crud/k8s/execute.yml

Add statefulset script block calling python3 main.py statefulset
Add number_of_statefulsets parameter

steps/topology/k8s-crud-gpu/execute-crud.yml

Wire number_of_statefulsets through to engine step

modules/python/clients/aks_client.py

Fix: set gpu_profile driver to "None" for non-GPU node pools (was incorrectly set to "Install", causing creation failures)
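
The fix amounts to choosing the driver value by pool type. In this sketch, resolve_gpu_driver and its gpu_driver parameter are hypothetical, and the value used for GPU pools is an assumption, since the PR text only states the non-GPU case.

```python
def resolve_gpu_driver(gpu_node_pool, gpu_driver="Install"):
    """Pick the gpu_profile driver value for a node pool."""
    if not gpu_node_pool:
        # The string "None", not Python's None object.
        return "None"
    return gpu_driver  # assumed: GPU pools keep the caller-supplied setting


if __name__ == "__main__":
    print(resolve_gpu_driver(False))
    print(resolve_gpu_driver(True))
```

The detail worth noticing is that the non-GPU value is a string: passing Python's None would typically omit the field rather than set it explicitly.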

Tests

test_azure_node_pool_crud.py:

test_create_statefulset_success — happy path
test_create_statefulset_failure — all fail to become ready
test_create_statefulset_no_client — returns early when k8s client unavailable
test_create_statefulset_partial_success — continues on failures, returns False

test_kubernetes_client.py:

test_wait_for_condition_statefulset_success
test_wait_for_condition_statefulset_timeout
test_wait_for_condition_statefulset_not_found

Dependencies

Based on test-refactor (PR #879), which must merge first.

Add create_deployment() to NodePoolCRUD that deploys K8s workloads onto node pools after provisioning. Implements multi-doc YAML manifest parsing, configurable replica count, and per-deployment readiness validation via wait_for_condition.

- Add handle_workload_operations() dispatcher in main.py with 'deployment' subcommand supporting --number-of-deployments, --replicas, --manifest-dir
- Add deployment.yml workload template with configurable replicas and node affinity via label_selector
- Derive label_selector from node pool name parameter (removes hardcoding)
- Return error on unknown workload command

Add deployment execution step between scale-up and scale-down in the k8s CRUD engine pipeline. Parameters (number_of_deployments, replicas, manifest_dir) flow from pipeline matrix → topology → engine step → main.py.

- Add deployment script block to steps/engine/crud/k8s/execute.yml
- Pass deployment parameters through topology execute-crud.yml
- Deployment runs after scale-up, before scale-down + delete

begin_create_or_update() returns an LROPoller that was being discarded in scale_node_pool and _progressive_scale, allowing execution to continue while Azure still had an operation in-progress. Subsequent scale/delete calls were rejected with OperationNotAllowed. Call poller.result() to block until Azure fully completes each operation before proceeding. Aligns scale behavior with create_node_pool and delete_node_pool which already awaited the poller.
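
The difference between discarding and awaiting the poller can be illustrated with a fake client. FakePoller, FakeAgentPools, and this simplified scale_node_pool are illustrative stand-ins, not the Azure SDK classes or the repository's implementation; the only behaviour taken from the commit message is that .result() blocks until the operation completes.

```python
class FakePoller:
    """Stand-in for an LROPoller: result() marks the operation as finished."""

    def __init__(self, events, count):
        self.events = events
        self.count = count

    def result(self):
        self.events.append(f"done count={self.count}")
        return {"count": self.count, "provisioning_state": "Succeeded"}


class FakeAgentPools:
    """Stand-in for the agent pool operations client."""

    def __init__(self):
        self.events = []

    def begin_create_or_update(self, resource_group, cluster, pool_name, parameters):
        self.events.append(f"start count={parameters['count']}")
        return FakePoller(self.events, parameters["count"])


def scale_node_pool(agent_pools, resource_group, cluster, pool_name, node_count):
    """Start the scale operation and block until it has fully completed."""
    parameters = {"count": node_count}
    # Chaining .result() keeps the next call from starting mid-operation.
    return agent_pools.begin_create_or_update(
        resource_group, cluster, pool_name, parameters
    ).result()


if __name__ == "__main__":
    pools = FakeAgentPools()
    scale_node_pool(pools, "rg", "cluster", "np1", 2)
    scale_node_pool(pools, "rg", "cluster", "np1", 3)
    print(pools.events)
```

With the .result() call in place, each operation's completion is recorded before the next one starts, which is the ordering that avoids the OperationNotAllowed rejection described above.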

Add comprehensive test coverage for create_deployment and handle_workload_operations:

- test_create_deployment_success: single deployment with readiness check
- test_create_deployment_partial_success: some deployments fail, operation continues and reports partial success
- test_create_deployment_failure: all deployments fail
- test_multiple_deployments: verifies N deployments created sequentially
- test_handle_workload_operations: deployment command routing + kwargs
- test_handle_workload_operations_unknown_command: returns error
- test_progressive_scaling_failure: scale continues after step failure
- test_scale_down_fails_continues: delete still runs after scale-down error
- test_returns_false_early_exit: unknown operation returns False

Add create_statefulset() to NodePoolCRUD that deploys K8s StatefulSets onto node pools after provisioning. Follows the same pattern as create_deployment: multi-doc YAML manifest parsing, configurable replica count, and per-statefulset readiness validation via wait_for_condition.

- Add 'statefulset' subcommand to handle_workload_operations() in main.py with --number-of-statefulsets and --replicas args
- Add statefulset.yml workload template with configurable replicas and node affinity via label_selector
- Add _is_statefulset_ready and _check_statefulset_condition to kubernetes_client.py for readiness polling

Add statefulset execution step to the k8s CRUD engine pipeline between deployment and scale-down. Parameters (number_of_statefulsets, replicas) flow from pipeline matrix → topology → engine step → main.py.

- Add statefulset script block to steps/engine/crud/k8s/execute.yml
- Pass number_of_statefulsets through topology execute-crud.yml

Fixes node pool creation failure when gpu_node_pool=False by explicitly setting gpu_profile driver to 'None' string for non-GPU pools, matching the Azure API expectation.

Add test coverage for create_statefulset and statefulset wait_for_condition:

- test_create_statefulset_success: single statefulset with readiness check
- test_create_statefulset_failure: statefulset fails to become ready
- test_create_statefulset_partial_success: continues on individual failures
- test_create_statefulset_no_client: returns early when k8s client unavailable
- test_statefulset_wait_for_condition: validates _is_statefulset_ready and _check_statefulset_condition polling logic

diamondpowell force-pushed the dipowell/crud-statefulset branch 3 times, most recently from 8a14575 to 1207d1d · April 21, 2026 01:58

diamondpowell requested a review from lokesh-keyan April 21, 2026 15:50

diamondpowell added 8 commits May 4, 2026 16:27

fix: set gpu_profile driver to None for non-GPU node pools

2f4315e

Fixes node pool creation failure when gpu_node_pool=False by explicitly setting gpu_profile driver to 'None' string for non-GPU pools, matching the Azure API expectation.

diamondpowell force-pushed the dipowell/crud-statefulset branch from 695cd4e to 61a2300 · May 4, 2026 20:58

diamondpowell mentioned this pull request May 4, 2026

Add Job workload support to CRUD benchmarking framework #1133

Draft

diamondpowell force-pushed the dipowell/crud-statefulset branch 2 times, most recently from 540ea80 to 7ea865e · May 5, 2026 18:12

test: configure pipeline for statefulset workload validation

dea3f3f

diamondpowell force-pushed the dipowell/crud-statefulset branch from 7ea865e to dea3f3f · May 5, 2026 19:21

refactor: chain .result() directly to match create/delete pattern

77ed08a

diamondpowell mentioned this pull request May 7, 2026

Improve deployment workload support in CRUD benchmarking framework #879

Open

diamondpowell wants to merge 10 commits into main from dipowell/crud-statefulset
