Improve deployment workload support in CRUD benchmarking framework#879
Open
diamondpowell wants to merge 17 commits into
Conversation
Author
@microsoft-github-policy-service agree company="Microsoft"
Add create_deployment() to NodePoolCRUD that deploys K8s workloads onto node pools after provisioning. Implements multi-doc YAML manifest parsing, configurable replica count, and per-deployment readiness validation via wait_for_condition.
- Add handle_workload_operations() dispatcher in main.py with 'deployment' subcommand supporting --number-of-deployments, --replicas, --manifest-dir
- Add deployment.yml workload template with configurable replicas and node affinity via label_selector
- Derive label_selector from node pool name parameter (removes hardcoding)
- Return error on unknown workload command
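A minimal sketch of what a dispatcher like the one described above might look like; the function signature and keyword names here are assumptions for illustration, not the PR's actual code:

```python
# Hypothetical sketch of a workload dispatcher in the spirit of
# handle_workload_operations(); names and defaults are assumptions.
def handle_workload_operations(command, crud, **kwargs):
    """Route a workload subcommand to the matching CRUD method."""
    if command == "deployment":
        return crud.create_deployment(
            number_of_deployments=kwargs.get("number_of_deployments", 1),
            replicas=kwargs.get("replicas", 1),
            manifest_dir=kwargs.get("manifest_dir"),
        )
    # Unknown workload commands are rejected instead of silently ignored
    raise ValueError(f"Unknown workload command: {command!r}")
```

The key design point the commit message calls out is the final error path: an unrecognized subcommand fails loudly rather than falling through.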
Add deployment execution step between scale-up and scale-down in the k8s CRUD engine pipeline. Parameters (number_of_deployments, replicas, manifest_dir) flow from pipeline matrix → topology → engine step → main.py.
- Add deployment script block to steps/engine/crud/k8s/execute.yml
- Pass deployment parameters through topology execute-crud.yml
- Deployment runs after scale-up, before scale-down + delete
begin_create_or_update() returns an LROPoller that was being discarded in scale_node_pool and _progressive_scale, allowing execution to continue while Azure still had an operation in-progress. Subsequent scale/delete calls were rejected with OperationNotAllowed. Call poller.result() to block until Azure fully completes each operation before proceeding. Aligns scale behavior with create_node_pool and delete_node_pool which already awaited the poller.
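The fix described above can be sketched as follows; the method names mirror the Azure SDK pattern the commit mentions, but the wrapper function and its arguments are illustrative assumptions:

```python
# Illustrative sketch of the LRO fix: block on the poller returned by
# begin_create_or_update() instead of discarding it. The wrapper function
# and its parameters are assumptions for illustration.
def scale_node_pool(aks_client, resource_group, cluster_name, pool_name, node_count):
    poller = aks_client.agent_pools.begin_create_or_update(
        resource_group, cluster_name, pool_name, {"count": node_count}
    )
    # Previously the poller was discarded and execution raced ahead; calling
    # .result() waits until ARM marks the operation complete, so the next
    # scale/delete call cannot be rejected with OperationNotAllowed.
    return poller.result()
```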
Add comprehensive test coverage for create_deployment and handle_workload_operations:
- test_create_deployment_success: single deployment with readiness check
- test_create_deployment_partial_success: some deployments fail, operation continues and reports partial success
- test_create_deployment_failure: all deployments fail
- test_multiple_deployments: verifies N deployments created sequentially
- test_handle_workload_operations: deployment command routing + kwargs
- test_handle_workload_operations_unknown_command: returns error
- test_progressive_scaling_failure: scale continues after step failure
- test_scale_down_fails_continues: delete still runs after scale-down error
- test_returns_false_early_exit: unknown operation returns False
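The partial-success behavior these tests exercise can be sketched as a loop that records per-deployment readiness rather than aborting on the first failure; all names here are hypothetical, not the PR's actual code:

```python
# Hypothetical sketch of partial-failure handling in a create_deployment()-style
# loop: keep creating and checking all N deployments, then report overall success.
def create_deployments(apply_fn, wait_fn, count):
    results = []
    for i in range(count):
        name = f"deployment-{i}"
        try:
            apply_fn(name)
            results.append(wait_fn(name))  # True when the deployment is ready
        except Exception:
            results.append(False)  # record the failure and keep going
    # Overall success only when every deployment became ready
    return all(results), results
```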
Force-pushed from 80dc153 to ac30cca
This was referenced May 4, 2026
Force-pushed from 017c519 to 182fb73
Contributor
Pull request overview
This PR extends Telescope’s Python CRUD benchmarking framework to support applying and verifying Kubernetes Deployment workloads (intended for AKS node pool benchmarking) and updates AKS scaling to properly block on Azure ARM long-running operations to avoid sequential scale race conditions.
Changes:
- Added a new `deployment` workload command/dispatcher and wired it into the CRUD pipeline execution flow.
- Introduced a templated multi-document Deployment/Service manifest and an Azure `create_deployment()` implementation that applies and waits for readiness.
- Updated AKS node pool scale operations to call `poller.result()` to ensure ARM operations complete before continuing.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| steps/topology/k8s-crud-gpu/execute-crud.yml | Plumbs workload-related parameters from topology to engine execution. |
| steps/engine/crud/k8s/execute.yml | Adds a workload-deploy step between scale-up and scale-down and wires env/params. |
| modules/python/crud/main.py | Adds deployment subcommand and handle_workload_operations() dispatcher. |
| modules/python/crud/azure/node_pool_crud.py | Implements create_deployment() to apply templated manifests and wait for readiness. |
| modules/python/crud/workload_templates/deployment.yml | Adds templated Deployment/Service YAML used by the workload step. |
| modules/python/clients/aks_client.py | Ensures scale operations block on ARM LRO completion via .result(). |
| modules/python/tests/crud/test_main.py | Adds unit tests for workload dispatcher routing and failure modes. |
| modules/python/tests/crud/test_azure_node_pool_crud.py | Adds unit tests for create_deployment() and improved all()-sequence behavior. |
Author
- Use per-deployment unique labels to avoid selector collision in wait_for_pods_ready
- Resolve template path via __file__ to work regardless of workingDirectory
- Pass namespace to apply_manifest_from_file for consistency with wait calls
- Remove empty spec.template.metadata.name from deployment template
- Fix docstring to match actual method signature
- Remove unused deployment_name parameter from pipeline topology
- Make --manifest-dir conditional (only pass when non-empty)
- Rename test to match actual behavior (deployment, not pod)
…pipeline
- Add conditional on deployment step: only runs when cloud=='azure'
- Add hasattr check in handle_workload_operations for unsupported providers
- Switch test pipeline to Standard_D4s_v3 (quota available)
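The hasattr guard mentioned above might look like this sketch; the function name and error wording are assumptions based on the description, not the PR's actual code:

```python
# Sketch of a capability check for CRUD providers that do not implement
# workload deployments; names here are illustrative assumptions.
def ensure_deployment_support(crud):
    """Fail fast when a provider's CRUD object lacks create_deployment()."""
    if not hasattr(crud, "create_deployment"):
        raise NotImplementedError(
            f"{type(crud).__name__} does not support deployment workloads"
        )
```

Checking the capability up front gives unsupported providers a clear error instead of an AttributeError deep inside the workload step.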
Force-pushed from df0c61b to 0fe7641
Author
Addressed Copilot review feedback:
xinWeiWei24 reviewed May 6, 2026
xinWeiWei24 reviewed May 7, 2026
xinWeiWei24 reviewed May 7, 2026
xinWeiWei24 reviewed May 7, 2026
xinWeiWei24 reviewed May 7, 2026
xinWeiWei24 reviewed May 8, 2026
xinWeiWei24 reviewed May 8, 2026
xinWeiWei24 approved these changes May 9, 2026
Summary
Adds deployment workload support to the CRUD benchmarking framework, enabling Telescope to measure K8s Deployment create/verify latency on AKS node pools. Also fixes a race condition in Azure LRO handling for scale operations.
Branch cleanup note: Rebased and squashed from 141 commits → 4 logical commits for reviewability. All prior review feedback has been addressed.
Changes
`modules/python/crud/azure/node_pool_crud.py`
- `create_deployment()` — creates N deployments from a YAML template, validates readiness via `wait_for_condition`, and handles partial failures gracefully
- `label_selector` derived from parameters (no hardcoding)
- `self.step_timeout` used instead of a hardcoded timeout
- `%`-style logging for consistency with existing methods
- `namespace` configurable, with `"default"` as the default value

`modules/python/crud/main.py`
- `handle_workload_operations()` dispatcher with a `deployment` subcommand
- `--number-of-deployments`, `--replicas`, `--manifest-dir`, and `--node-pool-name` arguments; removed unused `--deployment-name` argument
- `else` clause returning an error on unknown workload commands

`modules/python/crud/workload_templates/deployment.yml`

`steps/engine/crud/k8s/execute.yml`
- `NUMBER_OF_DEPLOYMENTS`, `REPLICAS`, `MANIFEST_DIR` environment variables

`steps/topology/k8s-crud-gpu/execute-crud.yml`

`modules/python/clients/aks_client.py`
- `.result()` on `begin_create_or_update()` in `scale_node_pool` and `_progressive_scale` — blocks until ARM completes before proceeding. Matches the existing pattern in `create_node_pool` (line 331) and `delete_node_pool` (line 573), which already call `.result()`. Prevents a race condition where `wait_for_nodes_ready` runs before ARM has processed the scale request. Consistent with the internal repo, which also calls `.result()` on all operations.

Tests
Unit tests in `test_azure_node_pool_crud.py` and `test_main.py`:
- `test_create_deployment_success` — all deployments ready
- `test_create_deployment_failure` — all fail to become ready
- `test_create_deployment_partial_success` — continues on individual failures, returns False
- `test_multiple_deployments` — verifies N deployments created sequentially
- `test_handle_workload_operations` — deployment command routing
- `test_handle_workload_operations_unknown_command` — returns error
- `test_progressive_scaling_failure` — scale continues after step failure
- `test_scale_down_fails_continues` — delete still runs after scale-down error
- `test_returns_false_early_exit` — unknown operation returns False