Add StatefulSet workload support to CRUD benchmarking framework #1132
Draft
diamondpowell wants to merge 10 commits into
Add create_deployment() to NodePoolCRUD that deploys K8s workloads onto node pools after provisioning. Implements multi-doc YAML manifest parsing, configurable replica count, and per-deployment readiness validation via wait_for_condition.
- Add handle_workload_operations() dispatcher in main.py with a 'deployment' subcommand supporting --number-of-deployments, --replicas, and --manifest-dir
- Add deployment.yml workload template with configurable replicas and node affinity via label_selector
- Derive label_selector from the node pool name parameter (removes hardcoding)
- Return an error on unknown workload commands
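The manifest-handling part of the commit above can be sketched as follows. This is an illustrative sketch, not the PR's code: the placeholder tokens and function name are assumptions, and PyYAML is assumed to be available.

```python
import yaml  # PyYAML, assumed available in the framework's environment


def render_manifest(text: str, replicas: int, label_selector: str) -> list:
    """Substitute placeholders into a multi-document YAML manifest and
    parse it into one object per document. Placeholder names here are
    illustrative stand-ins for the template's actual tokens."""
    text = text.replace("DEPLOYMENT_REPLICAS", str(replicas))
    text = text.replace("LABEL_SELECTOR", label_selector)
    # safe_load_all yields one parsed object per `---`-separated document
    return [doc for doc in yaml.safe_load_all(text) if doc]
```

Each returned dict can then be passed to the Kubernetes client's create call, with readiness checked per deployment afterwards.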
Add a deployment execution step between scale-up and scale-down in the k8s CRUD engine pipeline. Parameters (number_of_deployments, replicas, manifest_dir) flow from pipeline matrix → topology → engine step → main.py.
- Add deployment script block to steps/engine/crud/k8s/execute.yml
- Pass deployment parameters through topology execute-crud.yml
- Deployment runs after scale-up, before scale-down + delete
begin_create_or_update() returns an LROPoller that was being discarded in scale_node_pool and _progressive_scale, allowing execution to continue while Azure still had an operation in progress. Subsequent scale/delete calls were rejected with OperationNotAllowed. Call poller.result() to block until Azure fully completes each operation before proceeding. This aligns scale behavior with create_node_pool and delete_node_pool, which already awaited the poller.
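A minimal sketch of the fix, with stand-in names (the real change lives in NodePoolCRUD against the Azure SDK's agent pool operations):

```python
def scale_node_pool(aks_client, resource_group, cluster, pool_name, node_count):
    """Illustrative sketch, not the PR's exact code: block on the
    LROPoller returned by begin_create_or_update instead of discarding
    it, so the next scale/delete call does not hit OperationNotAllowed
    while Azure is still mid-operation."""
    pool = aks_client.agent_pools.get(resource_group, cluster, pool_name)
    pool.count = node_count
    poller = aks_client.agent_pools.begin_create_or_update(
        resource_group, cluster, pool_name, pool
    )
    # .result() waits for the long-running operation to finish
    # server-side, then returns the final agent pool object.
    return poller.result()
```

Dropping the `poller = ...` assignment and the `.result()` call reproduces the original bug: the method returns while Azure is still reconciling.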
Add comprehensive test coverage for create_deployment and handle_workload_operations:
- test_create_deployment_success: single deployment with readiness check
- test_create_deployment_partial_success: some deployments fail, operation continues and reports partial success
- test_create_deployment_failure: all deployments fail
- test_multiple_deployments: verifies N deployments are created sequentially
- test_handle_workload_operations: deployment command routing + kwargs
- test_handle_workload_operations_unknown_command: returns an error
- test_progressive_scaling_failure: scaling continues after a step failure
- test_scale_down_fails_continues: delete still runs after a scale-down error
- test_returns_false_early_exit: unknown operation returns False
Add create_statefulset() to NodePoolCRUD that deploys K8s StatefulSets onto node pools after provisioning. Follows the same pattern as create_deployment: multi-doc YAML manifest parsing, configurable replica count, and per-statefulset readiness validation via wait_for_condition.
- Add a 'statefulset' subcommand to handle_workload_operations() in main.py with --number-of-statefulsets and --replicas args
- Add statefulset.yml workload template with configurable replicas and node affinity via label_selector
- Add _is_statefulset_ready and _check_statefulset_condition to kubernetes_client.py for readiness polling
Add a statefulset execution step to the k8s CRUD engine pipeline between deployment and scale-down. Parameters (number_of_statefulsets, replicas) flow from pipeline matrix → topology → engine step → main.py.
- Add statefulset script block to steps/engine/crud/k8s/execute.yml
- Pass number_of_statefulsets through topology execute-crud.yml
Fixes a node pool creation failure when gpu_node_pool=False by explicitly setting the gpu_profile driver to the string 'None' for non-GPU pools, matching the Azure API's expectation.
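The fix could look roughly like this; the helper name and dict shape are illustrative stand-ins for the real change in aks_client.py.

```python
def build_gpu_profile(gpu_node_pool: bool) -> dict:
    """Illustrative sketch: the AKS API expects the literal string
    "None" (not Python's None) as the gpu_profile driver on pools
    without GPUs. Per the PR, the driver was previously left as
    "Install", which made non-GPU node pool creation fail."""
    return {"driver": "Install" if gpu_node_pool else "None"}
```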
Add test coverage for create_statefulset and statefulset wait_for_condition: - test_create_statefulset_success: single statefulset with readiness check - test_create_statefulset_failure: statefulset fails to become ready - test_create_statefulset_partial_success: continues on individual failures - test_create_statefulset_no_client: returns early when k8s client unavailable - test_statefulset_wait_for_condition: validates _is_statefulset_ready and _check_statefulset_condition polling logic
Summary
Adds StatefulSet workload support to the CRUD benchmarking framework: the second of three planned workload methods (deployment, statefulset, jobs). Measures K8s StatefulSet create/verify latency on AKS node pools.

Branch cleanup note: Rebased and squashed for reviewability. All commits are logically grouped. Also includes a fix for the gpu_profile driver setting that caused node pool creation failures on non-GPU pools.

Changes
- modules/python/crud/workload_templates/statefulset.yml: headless service (clusterIP: None) for stable pod DNS; STATEFULSET_REPLICAS placeholder; configurable node affinity via label_selector
- modules/python/crud/azure/node_pool_crud.py: create_statefulset(), same loop pattern as create_deployment; waits on the ready condition (not available), since StatefulSets don't support the available condition type
- modules/python/crud/main.py: statefulset subparser with --node-pool-name, --number-of-statefulsets, --replicas, --manifest-dir; elif command == "statefulset" routing in handle_workload_operations
- modules/python/clients/kubernetes_client.py: _is_statefulset_ready and _check_statefulset_condition for readiness polling; extends wait_for_condition to support the StatefulSet resource type
- steps/engine/crud/k8s/execute.yml: statefulset script block calling python3 main.py statefulset; number_of_statefulsets parameter
- steps/topology/k8s-crud-gpu/execute-crud.yml: passes number_of_statefulsets through to the engine step
- modules/python/clients/aks_client.py: sets the gpu_profile driver to "None" for non-GPU node pools (was incorrectly set to "Install", causing creation failures)

Tests
test_azure_node_pool_crud.py:
- test_create_statefulset_success: happy path
- test_create_statefulset_failure: all fail to become ready
- test_create_statefulset_no_client: returns early when k8s client unavailable
- test_create_statefulset_partial_success: continues on failures, returns False

test_kubernetes_client.py:
- test_wait_for_condition_statefulset_success
- test_wait_for_condition_statefulset_timeout
- test_wait_for_condition_statefulset_not_found

Dependencies
Based on test-refactor (PR #879), which must merge first.