Add Job workload support to CRUD benchmarking framework #1133
Draft
diamondpowell wants to merge 10 commits into
Conversation
Force-pushed 84f388c to 1da0a77
Add create_deployment() to NodePoolCRUD that deploys K8s workloads onto node pools after provisioning. Implements multi-doc YAML manifest parsing, configurable replica count, and per-deployment readiness validation via wait_for_condition.
- Add handle_workload_operations() dispatcher in main.py with 'deployment' subcommand supporting --number-of-deployments, --replicas, --manifest-dir
- Add deployment.yml workload template with configurable replicas and node affinity via label_selector
- Derive label_selector from node pool name parameter (removes hardcoding)
- Return error on unknown workload command
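The multi-doc manifest handling described above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the placeholder names (REPLICAS, LABEL_SELECTOR) and the function name are hypothetical, and a full version would use yaml.safe_load_all rather than splitting on the document separator.

```python
def render_manifest_docs(template: str, replicas: int, node_pool_name: str) -> list[str]:
    """Substitute placeholders, then split a multi-doc YAML stream.

    Placeholder names are illustrative; the label selector is derived
    from the node pool name instead of being hardcoded.
    """
    rendered = (template
                .replace("REPLICAS", str(replicas))
                .replace("LABEL_SELECTOR", node_pool_name))
    # Split on the YAML document separator and drop empty fragments
    return [doc.strip() for doc in rendered.split("\n---") if doc.strip()]
```

Each returned document would then be applied individually, so readiness can be validated per deployment.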
Add deployment execution step between scale-up and scale-down in the k8s CRUD engine pipeline. Parameters (number_of_deployments, replicas, manifest_dir) flow from pipeline matrix → topology → engine step → main.py.
- Add deployment script block to steps/engine/crud/k8s/execute.yml
- Pass deployment parameters through topology execute-crud.yml
- Deployment runs after scale-up, before scale-down + delete
begin_create_or_update() returns an LROPoller that was being discarded in scale_node_pool and _progressive_scale, allowing execution to continue while Azure still had an operation in progress; subsequent scale/delete calls were rejected with OperationNotAllowed. Call poller.result() to block until Azure fully completes each operation before proceeding. This aligns scale behavior with create_node_pool and delete_node_pool, which already awaited the poller.
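The race and its fix can be illustrated with a stand-in poller. FakePoller below only mimics the shape of azure.core.polling.LROPoller for demonstration; the real code calls begin_create_or_update on the AKS agent-pools client.

```python
class FakePoller:
    """Illustrative stand-in for azure.core.polling.LROPoller."""
    def __init__(self):
        self.completed = False

    def result(self):
        # In the real SDK this blocks until the service-side operation
        # finishes, then returns the final resource.
        self.completed = True
        return "Succeeded"


def scale_node_pool(begin_update):
    poller = begin_update()
    # Before the fix the poller was discarded here, so a subsequent
    # scale/delete call could race the in-flight operation and be
    # rejected with OperationNotAllowed.
    return poller.result()
```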
Add comprehensive test coverage for create_deployment and handle_workload_operations:
- test_create_deployment_success: single deployment with readiness check
- test_create_deployment_partial_success: some deployments fail, operation continues and reports partial success
- test_create_deployment_failure: all deployments fail
- test_multiple_deployments: verifies N deployments created sequentially
- test_handle_workload_operations: deployment command routing + kwargs
- test_handle_workload_operations_unknown_command: returns error
- test_progressive_scaling_failure: scale continues after step failure
- test_scale_down_fails_continues: delete still runs after scale-down error
- test_returns_false_early_exit: unknown operation returns False
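The partial-success semantics these tests exercise can be sketched as a continue-on-failure loop. create_workloads and create_one here are hypothetical names standing in for the per-deployment create call; the point is only that one failure does not abort the run.

```python
def create_workloads(create_one, count: int) -> tuple[int, int]:
    # Individual failures do not abort the loop; the caller gets a
    # tally and can report partial success to the pipeline.
    succeeded = failed = 0
    for index in range(count):
        try:
            create_one(index)
            succeeded += 1
        except Exception:
            failed += 1
    return succeeded, failed
```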
Add create_job() to NodePoolCRUD that deploys K8s Jobs onto node pools. Unlike deployments/statefulsets which run indefinitely, Jobs are run-to-completion workloads — success means the pod terminated cleanly (succeeded > 0), failure raises immediately (no self-healing).
- Add 'jobs' subcommand to handle_workload_operations() in main.py with --number-of-jobs and --completions args
- Add job.yml workload template with configurable completions and node affinity via label_selector
- Add _check_job_condition and _is_job_condition_met to kubernetes_client.py — checks completion_time + succeeded count
- Add wait_for_job_completed with 5-min timeout and 30s polling
- Job kind support in apply/update/delete manifest methods
Add job execution step to the k8s CRUD engine pipeline between deployment and scale-down. Parameters (number_of_jobs, completions) flow from pipeline matrix → topology → engine step → main.py.
- Add jobs script block to steps/engine/crud/k8s/execute.yml
- Pass number_of_jobs and completions through topology execute-crud.yml
- Jobs run after deployment, before scale-down + delete
Fixes node pool creation failure when gpu_node_pool=False by setting gpu_profile driver to 'None' for non-GPU pools. Original logic only set 'None' for a specific GPU VM size, leaving non-GPU pools incorrectly set to 'Install'.
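A minimal sketch of the corrected branch. The function name is illustrative; the real change sets the driver on the AKS agent pool's gpu_profile rather than returning a string.

```python
def gpu_driver_for_pool(gpu_node_pool: bool) -> str:
    # Non-GPU pools must opt out explicitly: leaving the driver at
    # 'Install' on a VM size without a GPU makes pool creation fail.
    return "Install" if gpu_node_pool else "None"
```

The original code only special-cased one GPU VM size, so every other non-GPU pool fell through to 'Install'.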
Add comprehensive test coverage for create_job and job wait_for_condition:
- test_create_job_success: single job completes successfully
- test_create_job_failure: job fails to complete
- test_create_job_partial_success: continues on individual failures
- test_job_wait_for_condition: validates _check_job_condition and _is_job_condition_met for 'complete' and 'failed' states
- test_wait_for_job_completed: timeout and polling behavior
- Tests cover Job-specific semantics (succeeded count, completion_time, failed + no active pods)
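The timeout-and-polling behavior that test_wait_for_job_completed exercises can be sketched with an injectable clock, which is also what makes it unit-testable without real waiting. The 300 s / 30 s defaults match the values described above; the parameter names are illustrative.

```python
import time


def wait_for_job_completed(check_complete, timeout_s=300, interval_s=30,
                           clock=time.monotonic, sleep=time.sleep):
    # Poll until the job reports complete or the deadline passes.
    # clock/sleep are injectable so tests can simulate time.
    deadline = clock() + timeout_s
    while clock() < deadline:
        if check_complete():
            return True
        sleep(interval_s)
    return False
```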
Force-pushed e88a51f to 787e4d6
Force-pushed 787e4d6 to 5a0978d
Summary
Adds Job workload support to the CRUD benchmarking framework — the third and final planned workload method. Unlike deployments/statefulsets which run indefinitely, Jobs are run-to-completion workloads: success means the pod terminated cleanly (succeeded > 0), failure raises immediately.

Branch cleanup note: Rebased and squashed for reviewability. Independent of StatefulSet PR (#1132).
Changes
- modules/python/crud/workload_templates/job.yml
  - batch/v1 API with restartPolicy: Never
  - JOB_COMPLETIONS placeholder, no parallelism (defaults to match completions)
- modules/python/crud/azure/node_pool_crud.py
  - create_job() — same loop pattern as other workloads
  - Waits on the complete condition instead of available/ready, since Jobs terminate after completion
  - No wait_for_pods_ready — pods exit after the job finishes
- modules/python/crud/main.py
  - jobs subparser with --node-pool-name, --number-of-jobs, --completions, --manifest-dir
  - elif command == "jobs" routing in handle_workload_operations
- modules/python/clients/kubernetes_client.py
  - _check_job_condition and _is_job_condition_met — checks completion_time + succeeded count
  - wait_for_job_completed with 5-min timeout and 30s polling
  - Job kind support in apply_manifest, update_manifest, delete_manifest
- steps/engine/crud/k8s/execute.yml
  - jobs script block calling python3 main.py jobs
  - number_of_jobs and completions parameters
- steps/topology/k8s-crud-gpu/execute-crud.yml
  - Passes number_of_jobs and completions through to the engine step
- modules/python/clients/aks_client.py
  - Sets gpu_profile driver to "None" for non-GPU node pools

Tests
test_azure_node_pool_crud.py:
- test_create_job_success
- test_create_job_failure
- test_create_job_no_client
- test_create_job_partial_success

test_kubernetes_client.py:
- test_wait_for_condition_job_success — Job completes successfully
- test_wait_for_condition_job_timeout — fails within timeout
- test_wait_for_condition_job_not_found — not found, returns failure

Dependencies
Based on test-refactor (PR #879) — must merge first. Independent of StatefulSet PR (#1132).