Add StatefulSet workload support to CRUD benchmarking framework #1132

Draft
diamondpowell wants to merge 10 commits into main from dipowell/crud-statefulset

diamondpowell commented Apr 14, 2026

Summary

Adds StatefulSet workload support to the CRUD benchmarking framework — the second of three planned workload methods (deployment, statefulset, jobs). Measures K8s StatefulSet create/verify latency on AKS node pools.

Branch cleanup note: Rebased and squashed for reviewability, with commits grouped logically. Also includes a fix for the gpu_profile driver setting, which caused node pool creation to fail on non-GPU pools.

Changes

modules/python/crud/workload_templates/statefulset.yml

  • New K8s manifest template with headless Service (clusterIP: None) for stable pod DNS
  • Uses STATEFULSET_REPLICAS placeholder, configurable node affinity via label_selector
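
For reviewers who want a feel for the placeholder substitution, here is a minimal sketch. The template text, the `WORKLOAD_NAME` placeholder, and the `render_statefulset` helper are hypothetical and not taken from this PR; only `STATEFULSET_REPLICAS`, `clusterIP: None`, and the multi-document layout come from the description above. Node affinity is left out for brevity.

```python
# Hypothetical template: a headless Service plus a StatefulSet in one
# multi-document manifest. Only STATEFULSET_REPLICAS and clusterIP: None
# come from the PR description; everything else is illustrative.
TEMPLATE = """\
apiVersion: v1
kind: Service
metadata:
  name: WORKLOAD_NAME
spec:
  clusterIP: None
  selector:
    app: WORKLOAD_NAME
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: WORKLOAD_NAME
spec:
  serviceName: WORKLOAD_NAME
  replicas: STATEFULSET_REPLICAS
  selector:
    matchLabels:
      app: WORKLOAD_NAME
  template:
    metadata:
      labels:
        app: WORKLOAD_NAME
    spec:
      containers:
        - name: app
          image: registry.k8s.io/pause:3.9
"""


def render_statefulset(name: str, replicas: int) -> list:
    """Fill in the placeholders and split the manifest into its documents."""
    rendered = TEMPLATE.replace("STATEFULSET_REPLICAS", str(replicas))
    rendered = rendered.replace("WORKLOAD_NAME", name)
    # Split on the YAML document separator. A real implementation would
    # more likely parse the text with a YAML library instead.
    return [doc.strip() for doc in rendered.split("\n---\n") if doc.strip()]
```

The headless Service comes first so that the `serviceName` referenced by the StatefulSet already exists when the pods start resolving their stable DNS names.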

modules/python/crud/azure/node_pool_crud.py

  • Add create_statefulset() — same loop pattern as create_deployment
  • Uses ready condition (not available) since StatefulSets don't support the available condition type
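
The loop pattern can be sketched roughly as follows. This is not the code in `node_pool_crud.py`: `FakeK8sClient`, `apply_manifest`, the naming scheme, and the `wait_for_condition` keyword names are stand-ins chosen for illustration, and returning `False` when the client is missing is an assumption.

```python
import logging

logger = logging.getLogger(__name__)


class FakeK8sClient:
    """Stand-in for the real Kubernetes client; all methods are hypothetical."""

    def __init__(self, failing=()):
        self.failing = set(failing)
        self.applied = []

    def apply_manifest(self, name, replicas):
        self.applied.append((name, replicas))

    def wait_for_condition(self, resource_type, resource_name, wait_condition_type):
        # Pretend the StatefulSets listed in `failing` never become ready.
        return resource_name not in self.failing


def create_statefulsets(client, node_pool_name, count, replicas):
    """Create `count` StatefulSets in turn; True only if every one becomes ready."""
    if client is None:
        logger.error("Kubernetes client unavailable")
        return False

    ready = 0
    for i in range(1, count + 1):
        name = f"{node_pool_name}-sts-{i}"
        try:
            client.apply_manifest(name, replicas)
            # StatefulSets are polled for "ready" rather than "available".
            if client.wait_for_condition("statefulset", name, "ready"):
                ready += 1
            else:
                logger.warning("StatefulSet %s did not become ready", name)
        except Exception:
            # Keep going so one bad StatefulSet does not hide the rest.
            logger.exception("Failed to create StatefulSet %s", name)
    return ready == count
```

Continuing past individual failures keeps the benchmark producing data for the remaining StatefulSets while still reporting the run as failed.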

modules/python/crud/main.py

  • Add statefulset subparser with --node-pool-name, --number-of-statefulsets, --replicas, --manifest-dir
  • Add elif command == "statefulset" routing in handle_workload_operations
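
A minimal sketch of the subparser and routing, assuming standard `argparse`. The flag names match the list above; the defaults, help text, and the `route_workload` helper are hypothetical, and the real `handle_workload_operations` calls the CRUD methods rather than returning their names.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="main.py")
    subparsers = parser.add_subparsers(dest="command", required=True)

    sts = subparsers.add_parser(
        "statefulset", help="Create StatefulSets on a node pool"
    )
    sts.add_argument("--node-pool-name", required=True)
    # Defaults below are illustrative, not taken from the PR.
    sts.add_argument("--number-of-statefulsets", type=int, default=1)
    sts.add_argument("--replicas", type=int, default=1)
    sts.add_argument("--manifest-dir", default=None)
    return parser


def route_workload(command: str):
    """Map a subcommand to the name of the CRUD method that would handle it."""
    if command == "deployment":
        return "create_deployment"
    elif command == "statefulset":
        return "create_statefulset"
    # Unknown commands are reported as errors by the caller.
    return None
```

An invocation such as `python3 main.py statefulset --node-pool-name np1 --number-of-statefulsets 2 --replicas 3` would then parse into `args.number_of_statefulsets == 2` and `args.replicas == 3`.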

modules/python/clients/kubernetes_client.py

  • Add _is_statefulset_ready and _check_statefulset_condition for readiness polling
  • Extend wait_for_condition to support StatefulSet resource type
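
One plausible shape for the readiness check is sketched below; it is not the implementation in `kubernetes_client.py`. The attribute names `spec.replicas` and `status.ready_replicas` mirror the official Python client's StatefulSet model, but the objects here are plain `SimpleNamespace` stand-ins, and `wait_for_statefulset` with its `read_fn`, `sleep`, and `clock` parameters is hypothetical.

```python
import time
from types import SimpleNamespace


def is_statefulset_ready(sts) -> bool:
    """Treat a StatefulSet as ready when every desired replica reports ready."""
    desired = sts.spec.replicas or 0
    ready = sts.status.ready_replicas or 0  # None until the first pod is ready
    return ready == desired


def wait_for_statefulset(read_fn, timeout_seconds=60.0, interval_seconds=1.0,
                         sleep=time.sleep, clock=time.monotonic) -> bool:
    """Poll `read_fn` until the StatefulSet is ready or the timeout expires.

    `read_fn` returns the current StatefulSet object, or None if it is not
    found yet.
    """
    deadline = clock() + timeout_seconds
    while True:
        sts = read_fn()
        if sts is not None and is_statefulset_ready(sts):
            return True
        if clock() >= deadline:
            return False
        sleep(interval_seconds)


def make_sts(desired, ready):
    """Build a stand-in StatefulSet object for experimentation."""
    return SimpleNamespace(
        spec=SimpleNamespace(replicas=desired),
        status=SimpleNamespace(ready_replicas=ready),
    )
```

Comparing ready replicas against the desired count avoids relying on an `available` condition, which is why the create step polls for `ready`.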

steps/engine/crud/k8s/execute.yml

  • Add statefulset script block calling python3 main.py statefulset
  • Add number_of_statefulsets parameter

steps/topology/k8s-crud-gpu/execute-crud.yml

  • Wire number_of_statefulsets through to engine step

modules/python/clients/aks_client.py

  • Fix: set gpu_profile driver to "None" for non-GPU node pools (was incorrectly set to "Install", causing creation failures)
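
The intent of the fix can be illustrated with a small helper. `gpu_profile_for` and the plain-dict payload are hypothetical, and keeping `"Install"` for GPU pools is an assumption; the actual change in `aks_client.py` may build the profile differently.

```python
def gpu_profile_for(gpu_node_pool: bool) -> dict:
    """Return an illustrative gpu_profile payload for a node pool request."""
    # "None" is the literal string value, not Python's None.
    # Assumption: GPU pools keep "Install"; non-GPU pools must send "None".
    return {"driver": "Install" if gpu_node_pool else "None"}
```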

Tests

test_azure_node_pool_crud.py:

  • test_create_statefulset_success — happy path
  • test_create_statefulset_failure — all fail to become ready
  • test_create_statefulset_no_client — returns early when k8s client unavailable
  • test_create_statefulset_partial_success — continues on failures, returns False

test_kubernetes_client.py:

  • test_wait_for_condition_statefulset_success
  • test_wait_for_condition_statefulset_timeout
  • test_wait_for_condition_statefulset_not_found

Dependencies

Based on test-refactor (PR #879), which must merge first.

diamondpowell force-pushed the dipowell/crud-statefulset branch 3 times, most recently from 8a14575 to 1207d1d on April 21, 2026 01:58

Add create_deployment() to NodePoolCRUD that deploys K8s workloads onto
node pools after provisioning. Implements multi-doc YAML manifest parsing,
configurable replica count, and per-deployment readiness validation via
wait_for_condition.

- Add handle_workload_operations() dispatcher in main.py with 'deployment'
  subcommand supporting --number-of-deployments, --replicas, --manifest-dir
- Add deployment.yml workload template with configurable replicas and
  node affinity via label_selector
- Derive label_selector from node pool name parameter (removes hardcoding)
- Return error on unknown workload command

Add deployment execution step between scale-up and scale-down in the
k8s CRUD engine pipeline. Parameters (number_of_deployments, replicas,
manifest_dir) flow from pipeline matrix → topology → engine step → main.py.

- Add deployment script block to steps/engine/crud/k8s/execute.yml
- Pass deployment parameters through topology execute-crud.yml
- Deployment runs after scale-up, before scale-down + delete

begin_create_or_update() returns an LROPoller that was being discarded
in scale_node_pool and _progressive_scale, allowing execution to continue
while Azure still had an operation in-progress. Subsequent scale/delete
calls were rejected with OperationNotAllowed.

Call poller.result() to block until Azure fully completes each operation
before proceeding. Aligns scale behavior with create_node_pool and
delete_node_pool which already awaited the poller.

Add comprehensive test coverage for create_deployment and
handle_workload_operations:

- test_create_deployment_success: single deployment with readiness check
- test_create_deployment_partial_success: some deployments fail, operation
  continues and reports partial success
- test_create_deployment_failure: all deployments fail
- test_multiple_deployments: verifies N deployments created sequentially
- test_handle_workload_operations: deployment command routing + kwargs
- test_handle_workload_operations_unknown_command: returns error
- test_progressive_scaling_failure: scale continues after step failure
- test_scale_down_fails_continues: delete still runs after scale-down error
- test_returns_false_early_exit: unknown operation returns False

Add create_statefulset() to NodePoolCRUD that deploys K8s StatefulSets
onto node pools after provisioning. Follows the same pattern as
create_deployment — multi-doc YAML manifest parsing, configurable replica
count, and per-statefulset readiness validation via wait_for_condition.

- Add 'statefulset' subcommand to handle_workload_operations() in main.py
  with --number-of-statefulsets and --replicas args
- Add statefulset.yml workload template with configurable replicas and
  node affinity via label_selector
- Add _is_statefulset_ready and _check_statefulset_condition to
  kubernetes_client.py for readiness polling

Add statefulset execution step to the k8s CRUD engine pipeline between
deployment and scale-down. Parameters (number_of_statefulsets, replicas)
flow from pipeline matrix → topology → engine step → main.py.

- Add statefulset script block to steps/engine/crud/k8s/execute.yml
- Pass number_of_statefulsets through topology execute-crud.yml

Fixes node pool creation failure when gpu_node_pool=False by explicitly
setting gpu_profile driver to 'None' string for non-GPU pools, matching
the Azure API expectation.

Add test coverage for create_statefulset and statefulset wait_for_condition:

- test_create_statefulset_success: single statefulset with readiness check
- test_create_statefulset_failure: statefulset fails to become ready
- test_create_statefulset_partial_success: continues on individual failures
- test_create_statefulset_no_client: returns early when k8s client unavailable
- test_statefulset_wait_for_condition: validates _is_statefulset_ready and
  _check_statefulset_condition polling logic
diamondpowell force-pushed the dipowell/crud-statefulset branch from 695cd4e to 61a2300 on May 4, 2026 20:58
diamondpowell force-pushed the dipowell/crud-statefulset branch 2 times, most recently from 540ea80 to 7ea865e on May 5, 2026 18:12
diamondpowell force-pushed the dipowell/crud-statefulset branch from 7ea865e to dea3f3f on May 5, 2026 19:21