
Add Job workload support to CRUD benchmarking framework #1133

Draft
diamondpowell wants to merge 10 commits into main from dipowell/crud-jobs

Conversation

diamondpowell commented Apr 14, 2026

Summary

Adds Job workload support to the CRUD benchmarking framework — the third and final planned workload method. Unlike deployments/statefulsets which run indefinitely, Jobs are run-to-completion workloads: success means the pod terminated cleanly (succeeded > 0), failure raises immediately.

Branch cleanup note: Rebased and squashed for reviewability. Independent of StatefulSet PR (#1132).

Changes

modules/python/crud/workload_templates/job.yml

  • New K8s manifest template using batch/v1 API with restartPolicy: Never
  • Uses JOB_COMPLETIONS placeholder; parallelism is left unset (Kubernetes defaults it to 1)
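A minimal sketch of what such a template might look like. The `JOB_COMPLETIONS` placeholder and `restartPolicy: Never` come from the PR description; the substitution syntax, names, image, and node-selector wiring are illustrative assumptions:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: crud-job                       # illustrative name
spec:
  completions: ${JOB_COMPLETIONS}      # substituted by the framework (syntax assumed)
  # parallelism intentionally omitted — Kubernetes defaults it to 1
  template:
    spec:
      restartPolicy: Never             # a failed pod is not restarted
      nodeSelector:
        agentpool: ${NODE_POOL_NAME}   # assumed label_selector wiring
      containers:
        - name: work
          image: busybox
          command: ["sh", "-c", "echo done"]
```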

modules/python/crud/azure/node_pool_crud.py

  • Add create_job() — same loop pattern as other workloads
  • Uses complete condition instead of available/ready since Jobs terminate after completion
  • No wait_for_pods_ready — pods exit after job finishes
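The "same loop pattern" and partial-success behavior can be sketched as follows. The client method name `wait_for_job_completed` comes from the PR; the function signature, names, and return shape are assumptions, not the actual implementation:

```python
# Sketch of the create_job() loop pattern: create N Jobs sequentially and
# collect per-job success instead of aborting on the first failure (this
# mirrors the partial-success test in the PR).
def create_job(client, number_of_jobs, completions, timeout=300):
    results = []
    for i in range(number_of_jobs):
        name = f"crud-job-{i}"
        # apply the job.yml manifest here with JOB_COMPLETIONS -> completions,
        # then block until the Job reaches the 'complete' condition
        results.append(client.wait_for_job_completed(name, timeout=timeout))
    return results
```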

modules/python/crud/main.py

  • Add jobs subparser with --node-pool-name, --number-of-jobs, --completions, --manifest-dir
  • Add elif command == "jobs" routing in handle_workload_operations
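The CLI wiring described above might look like this with `argparse`. Flag names are taken from the PR summary; defaults and the `required` choice are assumptions:

```python
import argparse

# Hypothetical reconstruction of the 'jobs' subparser in main.py.
parser = argparse.ArgumentParser(prog="main.py")
sub = parser.add_subparsers(dest="command")
jobs = sub.add_parser("jobs")
jobs.add_argument("--node-pool-name", required=True)
jobs.add_argument("--number-of-jobs", type=int, default=1)
jobs.add_argument("--completions", type=int, default=1)
jobs.add_argument("--manifest-dir")

args = parser.parse_args(
    ["jobs", "--node-pool-name", "np1", "--number-of-jobs", "2", "--completions", "3"]
)
```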

modules/python/clients/kubernetes_client.py

  • Add _check_job_condition and _is_job_condition_met — checks completion_time + succeeded count
  • Add wait_for_job_completed with 5-min timeout and 30s polling
  • Add Job kind support to apply_manifest, update_manifest, delete_manifest

steps/engine/crud/k8s/execute.yml

  • Add jobs script block calling python3 main.py jobs
  • Add number_of_jobs and completions parameters
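A hedged sketch of how the engine step's script block might invoke the new subcommand. The parameter names come from the PR; the step structure and variable plumbing are illustrative:

```yaml
- script: |
    python3 main.py jobs \
      --node-pool-name "$NODE_POOL_NAME" \
      --number-of-jobs "$NUMBER_OF_JOBS" \
      --completions "$COMPLETIONS"
  displayName: Run Job workload   # illustrative step name
```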

steps/topology/k8s-crud-gpu/execute-crud.yml

  • Wire number_of_jobs and completions through to engine step

modules/python/clients/aks_client.py

  • Fix: set gpu_profile driver to "None" for non-GPU node pools

Tests

test_azure_node_pool_crud.py:

  • test_create_job_success
  • test_create_job_failure
  • test_create_job_no_client
  • test_create_job_partial_success

test_kubernetes_client.py:

  • test_wait_for_condition_job_success — Job completes successfully
  • test_wait_for_condition_job_timeout — Job fails to complete within the timeout
  • test_wait_for_condition_job_not_found — not found, returns failure

Dependencies

Based on test-refactor (PR #879) — must merge first. Independent of StatefulSet PR (#1132).

diamondpowell force-pushed the dipowell/crud-jobs branch 2 times, most recently from 84f388c to 1da0a77 on April 21, 2026 at 04:08

Add create_deployment() to NodePoolCRUD that deploys K8s workloads onto
node pools after provisioning. Implements multi-doc YAML manifest parsing,
configurable replica count, and per-deployment readiness validation via
wait_for_condition.

- Add handle_workload_operations() dispatcher in main.py with 'deployment'
  subcommand supporting --number-of-deployments, --replicas, --manifest-dir
- Add deployment.yml workload template with configurable replicas and
  node affinity via label_selector
- Derive label_selector from node pool name parameter (removes hardcoding)
- Return error on unknown workload command

Add deployment execution step between scale-up and scale-down in the
k8s CRUD engine pipeline. Parameters (number_of_deployments, replicas,
manifest_dir) flow from pipeline matrix → topology → engine step → main.py.

- Add deployment script block to steps/engine/crud/k8s/execute.yml
- Pass deployment parameters through topology execute-crud.yml
- Deployment runs after scale-up, before scale-down + delete

begin_create_or_update() returns an LROPoller that was being discarded
in scale_node_pool and _progressive_scale, allowing execution to continue
while Azure still had an operation in-progress. Subsequent scale/delete
calls were rejected with OperationNotAllowed.

Call poller.result() to block until Azure fully completes each operation
before proceeding. Aligns scale behavior with create_node_pool and
delete_node_pool which already awaited the poller.
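A minimal sketch of this fix: keep the `LROPoller` and block on `.result()` so the scale operation finishes before the next CRUD call. The client shape mirrors azure-mgmt-containerservice's `AgentPoolsOperations`; the wrapper function itself is illustrative, not the real scale_node_pool:

```python
# Before the fix, the poller returned by begin_create_or_update() was
# discarded and execution raced ahead of the in-progress Azure operation,
# causing later calls to fail with OperationNotAllowed.
def scale_node_pool(client, resource_group, cluster, pool_name, node_count):
    poller = client.agent_pools.begin_create_or_update(
        resource_group, cluster, pool_name, {"count": node_count}
    )
    return poller.result()  # block until Azure reports the operation complete
```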

Add comprehensive test coverage for create_deployment and
handle_workload_operations:

- test_create_deployment_success: single deployment with readiness check
- test_create_deployment_partial_success: some deployments fail, operation
  continues and reports partial success
- test_create_deployment_failure: all deployments fail
- test_multiple_deployments: verifies N deployments created sequentially
- test_handle_workload_operations: deployment command routing + kwargs
- test_handle_workload_operations_unknown_command: returns error
- test_progressive_scaling_failure: scale continues after step failure
- test_scale_down_fails_continues: delete still runs after scale-down error
- test_returns_false_early_exit: unknown operation returns False

Add create_job() to NodePoolCRUD that deploys K8s Jobs onto node pools.
Unlike deployments/statefulsets which run indefinitely, Jobs are
run-to-completion workloads — success means the pod terminated cleanly
(succeeded > 0), failure raises immediately (no self-healing).

- Add 'jobs' subcommand to handle_workload_operations() in main.py
  with --number-of-jobs and --completions args
- Add job.yml workload template with configurable completions and
  node affinity via label_selector
- Add _check_job_condition and _is_job_condition_met to
  kubernetes_client.py — checks completion_time + succeeded count
- Add wait_for_job_completed with 5-min timeout and 30s polling
- Job kind support in apply/update/delete manifest methods

Add job execution step to the k8s CRUD engine pipeline between
deployment and scale-down. Parameters (number_of_jobs, completions)
flow from pipeline matrix → topology → engine step → main.py.

- Add jobs script block to steps/engine/crud/k8s/execute.yml
- Pass number_of_jobs and completions through topology execute-crud.yml
- Jobs run after deployment, before scale-down + delete

Fixes node pool creation failure when gpu_node_pool=False by setting
gpu_profile driver to 'None' for non-GPU pools. Original logic only
set 'None' for a specific GPU VM size, leaving non-GPU pools incorrectly
set to 'Install'.
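An illustrative reconstruction of this fix: every non-GPU pool now gets driver "None", where previously only one specific GPU VM size did. The helper name and the skip-list entry are hypothetical, not the real aks_client.py code:

```python
# Hypothetical helper showing the corrected branching order: check the
# non-GPU case first, then any driver skip-list, then default to "Install".
def gpu_driver(gpu_node_pool, vm_size):
    if not gpu_node_pool:
        return "None"  # the fix: non-GPU pools never request a driver install
    if vm_size.lower() == "standard_nd96isr_h100_v5":  # hypothetical skip-list entry
        return "None"
    return "Install"
```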

Add comprehensive test coverage for create_job and job wait_for_condition:

- test_create_job_success: single job completes successfully
- test_create_job_failure: job fails to complete
- test_create_job_partial_success: continues on individual failures
- test_job_wait_for_condition: validates _check_job_condition and
  _is_job_condition_met for 'complete' and 'failed' states
- test_wait_for_job_completed: timeout and polling behavior
- Tests cover Job-specific semantics (succeeded count, completion_time,
  failed + no active pods)

diamondpowell force-pushed the dipowell/crud-jobs branch 2 times, most recently from e88a51f to 787e4d6 on May 5, 2026 at 17:27
