
Add Job workload support to CRUD benchmarking framework #1133

Draft
diamondpowell wants to merge 10 commits into main from dipowell/crud-jobs

Conversation

diamondpowell commented Apr 14, 2026

Summary

Adds Job workload support to the CRUD benchmarking framework — the third and final planned workload method. Unlike deployments/statefulsets which run indefinitely, Jobs are run-to-completion workloads: success means the pod terminated cleanly (succeeded > 0), failure raises immediately.

Branch cleanup note: Rebased and squashed for reviewability. Independent of StatefulSet PR (#1132).

Changes

modules/python/crud/workload_templates/job.yml

  • New K8s manifest template using batch/v1 API with restartPolicy: Never
  • Uses JOB_COMPLETIONS placeholder; parallelism is left unset (Kubernetes defaults it to 1)
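A minimal sketch of what such a template might look like. The `JOB_COMPLETIONS` placeholder and `restartPolicy: Never` come from the PR description; the substitution syntax, names, image, and node-selector wiring are illustrative assumptions:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: crud-job                       # illustrative name
spec:
  completions: ${JOB_COMPLETIONS}      # substituted by the framework (syntax assumed)
  # parallelism intentionally omitted — Kubernetes defaults it to 1
  template:
    spec:
      restartPolicy: Never             # a failed pod is not restarted
      nodeSelector:
        agentpool: ${NODE_POOL_NAME}   # assumed label_selector wiring
      containers:
        - name: work
          image: busybox
          command: ["sh", "-c", "echo done"]
```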

modules/python/crud/azure/node_pool_crud.py

  • Add create_job() — same loop pattern as other workloads
  • Uses complete condition instead of available/ready since Jobs terminate after completion
  • No wait_for_pods_ready — pods exit after job finishes
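The "same loop pattern" and partial-success behavior can be sketched as follows. The client method name `wait_for_job_completed` comes from the PR; the function signature, names, and return shape are assumptions, not the actual implementation:

```python
# Sketch of the create_job() loop pattern: create N Jobs sequentially and
# collect per-job success instead of aborting on the first failure (this
# mirrors the partial-success test in the PR).
def create_job(client, number_of_jobs, completions, timeout=300):
    results = []
    for i in range(number_of_jobs):
        name = f"crud-job-{i}"
        # apply the job.yml manifest here with JOB_COMPLETIONS -> completions,
        # then block until the Job reaches the 'complete' condition
        results.append(client.wait_for_job_completed(name, timeout=timeout))
    return results
```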

modules/python/crud/main.py

  • Add jobs subparser with --node-pool-name, --number-of-jobs, --completions, --manifest-dir
  • Add elif command == "jobs" routing in handle_workload_operations
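The CLI wiring described above might look like this with `argparse`. Flag names are taken from the PR summary; defaults and the `required` choice are assumptions:

```python
import argparse

# Hypothetical reconstruction of the 'jobs' subparser in main.py.
parser = argparse.ArgumentParser(prog="main.py")
sub = parser.add_subparsers(dest="command")
jobs = sub.add_parser("jobs")
jobs.add_argument("--node-pool-name", required=True)
jobs.add_argument("--number-of-jobs", type=int, default=1)
jobs.add_argument("--completions", type=int, default=1)
jobs.add_argument("--manifest-dir")

args = parser.parse_args(
    ["jobs", "--node-pool-name", "np1", "--number-of-jobs", "2", "--completions", "3"]
)
```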

modules/python/clients/kubernetes_client.py

  • Add _check_job_condition and _is_job_condition_met — checks completion_time + succeeded count
  • Add wait_for_job_completed with 5-min timeout and 30s polling
  • Add Job kind support to apply_manifest, update_manifest, delete_manifest

steps/engine/crud/k8s/execute.yml

  • Add jobs script block calling python3 main.py jobs
  • Add number_of_jobs and completions parameters
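A hedged sketch of how the engine step's script block might invoke the new subcommand. The parameter names come from the PR; the step structure and variable plumbing are illustrative:

```yaml
- script: |
    python3 main.py jobs \
      --node-pool-name "$NODE_POOL_NAME" \
      --number-of-jobs "$NUMBER_OF_JOBS" \
      --completions "$COMPLETIONS"
  displayName: Run Job workload   # illustrative step name
```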

steps/topology/k8s-crud-gpu/execute-crud.yml

  • Wire number_of_jobs and completions through to engine step

modules/python/clients/aks_client.py

  • Fix: set gpu_profile driver to "None" for non-GPU node pools

Tests

test_azure_node_pool_crud.py:

  • test_create_job_success
  • test_create_job_failure
  • test_create_job_no_client
  • test_create_job_partial_success

test_kubernetes_client.py:

  • test_wait_for_condition_job_success — Job completes successfully
  • test_wait_for_condition_job_timeout — Job fails to complete within the timeout
  • test_wait_for_condition_job_not_found — not found, returns failure

Dependencies

Based on test-refactor (PR #879) — must merge first. Independent of StatefulSet PR (#1132).

diamondpowell force-pushed the dipowell/crud-jobs branch 2 times, most recently from 84f388c to 1da0a77 on April 21, 2026 at 04:08

Add create_deployment() to NodePoolCRUD that deploys K8s workloads onto
node pools after provisioning. Implements multi-doc YAML manifest parsing,
configurable replica count, and per-deployment readiness validation via
wait_for_condition.

- Add handle_workload_operations() dispatcher in main.py with 'deployment'
  subcommand supporting --number-of-deployments, --replicas, --manifest-dir
- Add deployment.yml workload template with configurable replicas and
  node affinity via label_selector
- Derive label_selector from node pool name parameter (removes hardcoding)
- Return error on unknown workload command

Add deployment execution step between scale-up and scale-down in the
k8s CRUD engine pipeline. Parameters (number_of_deployments, replicas,
manifest_dir) flow from pipeline matrix → topology → engine step → main.py.

- Add deployment script block to steps/engine/crud/k8s/execute.yml
- Pass deployment parameters through topology execute-crud.yml
- Deployment runs after scale-up, before scale-down + delete

begin_create_or_update() returns an LROPoller that was being discarded
in scale_node_pool and _progressive_scale, allowing execution to continue
while Azure still had an operation in-progress. Subsequent scale/delete
calls were rejected with OperationNotAllowed.

Call poller.result() to block until Azure fully completes each operation
before proceeding. Aligns scale behavior with create_node_pool and
delete_node_pool which already awaited the poller.
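A minimal sketch of this fix: keep the `LROPoller` and block on `.result()` so the scale operation finishes before the next CRUD call. The client shape mirrors azure-mgmt-containerservice's `AgentPoolsOperations`; the wrapper function itself is illustrative, not the real scale_node_pool:

```python
# Before the fix, the poller returned by begin_create_or_update() was
# discarded and execution raced ahead of the in-progress Azure operation,
# causing later calls to fail with OperationNotAllowed.
def scale_node_pool(client, resource_group, cluster, pool_name, node_count):
    poller = client.agent_pools.begin_create_or_update(
        resource_group, cluster, pool_name, {"count": node_count}
    )
    return poller.result()  # block until Azure reports the operation complete
```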

Add comprehensive test coverage for create_deployment and
handle_workload_operations:

- test_create_deployment_success: single deployment with readiness check
- test_create_deployment_partial_success: some deployments fail, operation
  continues and reports partial success
- test_create_deployment_failure: all deployments fail
- test_multiple_deployments: verifies N deployments created sequentially
- test_handle_workload_operations: deployment command routing + kwargs
- test_handle_workload_operations_unknown_command: returns error
- test_progressive_scaling_failure: scale continues after step failure
- test_scale_down_fails_continues: delete still runs after scale-down error
- test_returns_false_early_exit: unknown operation returns False

Add create_job() to NodePoolCRUD that deploys K8s Jobs onto node pools.
Unlike deployments/statefulsets which run indefinitely, Jobs are
run-to-completion workloads — success means the pod terminated cleanly
(succeeded > 0), failure raises immediately (no self-healing).

- Add 'jobs' subcommand to handle_workload_operations() in main.py
  with --number-of-jobs and --completions args
- Add job.yml workload template with configurable completions and
  node affinity via label_selector
- Add _check_job_condition and _is_job_condition_met to
  kubernetes_client.py — checks completion_time + succeeded count
- Add wait_for_job_completed with 5-min timeout and 30s polling
- Job kind support in apply/update/delete manifest methods

Add job execution step to the k8s CRUD engine pipeline between
deployment and scale-down. Parameters (number_of_jobs, completions)
flow from pipeline matrix → topology → engine step → main.py.

- Add jobs script block to steps/engine/crud/k8s/execute.yml
- Pass number_of_jobs and completions through topology execute-crud.yml
- Jobs run after deployment, before scale-down + delete

Fixes node pool creation failure when gpu_node_pool=False by setting
gpu_profile driver to 'None' for non-GPU pools. Original logic only
set 'None' for a specific GPU VM size, leaving non-GPU pools incorrectly
set to 'Install'.
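An illustrative reconstruction of this fix: every non-GPU pool now gets driver "None", where previously only one specific GPU VM size did. The helper name and the skip-list entry are hypothetical, not the real aks_client.py code:

```python
# Hypothetical helper showing the corrected branching order: check the
# non-GPU case first, then any driver skip-list, then default to "Install".
def gpu_driver(gpu_node_pool, vm_size):
    if not gpu_node_pool:
        return "None"  # the fix: non-GPU pools never request a driver install
    if vm_size.lower() == "standard_nd96isr_h100_v5":  # hypothetical skip-list entry
        return "None"
    return "Install"
```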

Add comprehensive test coverage for create_job and job wait_for_condition:

- test_create_job_success: single job completes successfully
- test_create_job_failure: job fails to complete
- test_create_job_partial_success: continues on individual failures
- test_job_wait_for_condition: validates _check_job_condition and
  _is_job_condition_met for 'complete' and 'failed' states
- test_wait_for_job_completed: timeout and polling behavior
- Tests cover Job-specific semantics (succeeded count, completion_time,
  failed + no active pods)

diamondpowell force-pushed the dipowell/crud-jobs branch 2 times, most recently from e88a51f to 787e4d6 on May 5, 2026 at 17:27
