Improve deployment workload support in CRUD benchmarking framework#879
Open
diamondpowell wants to merge 17 commits into
Conversation
Author
@microsoft-github-policy-service agree company="Microsoft"
Add create_deployment() to NodePoolCRUD that deploys K8s workloads onto node pools after provisioning. Implements multi-doc YAML manifest parsing, configurable replica count, and per-deployment readiness validation via wait_for_condition.
- Add handle_workload_operations() dispatcher in main.py with 'deployment' subcommand supporting --number-of-deployments, --replicas, --manifest-dir
- Add deployment.yml workload template with configurable replicas and node affinity via label_selector
- Derive label_selector from node pool name parameter (removes hardcoding)
- Return error on unknown workload command
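A minimal sketch of what a dispatcher like the one described above might look like; the function signature and keyword names here are assumptions for illustration, not the PR's actual code:

```python
# Hypothetical sketch of a workload dispatcher in the spirit of
# handle_workload_operations(); names and defaults are assumptions.
def handle_workload_operations(command, crud, **kwargs):
    """Route a workload subcommand to the matching CRUD method."""
    if command == "deployment":
        return crud.create_deployment(
            number_of_deployments=kwargs.get("number_of_deployments", 1),
            replicas=kwargs.get("replicas", 1),
            manifest_dir=kwargs.get("manifest_dir"),
        )
    # Unknown workload commands are rejected instead of silently ignored
    raise ValueError(f"Unknown workload command: {command!r}")
```

The key design point the commit message calls out is the final error path: an unrecognized subcommand fails loudly rather than falling through.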
Add deployment execution step between scale-up and scale-down in the k8s CRUD engine pipeline. Parameters (number_of_deployments, replicas, manifest_dir) flow from pipeline matrix → topology → engine step → main.py.
- Add deployment script block to steps/engine/crud/k8s/execute.yml
- Pass deployment parameters through topology execute-crud.yml
- Deployment runs after scale-up, before scale-down + delete
begin_create_or_update() returns an LROPoller that was being discarded in scale_node_pool and _progressive_scale, allowing execution to continue while Azure still had an operation in-progress. Subsequent scale/delete calls were rejected with OperationNotAllowed. Call poller.result() to block until Azure fully completes each operation before proceeding. Aligns scale behavior with create_node_pool and delete_node_pool which already awaited the poller.
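The fix described above can be sketched as follows; the method names mirror the Azure SDK pattern the commit mentions, but the wrapper function and its arguments are illustrative assumptions:

```python
# Illustrative sketch of the LRO fix: block on the poller returned by
# begin_create_or_update() instead of discarding it. The wrapper function
# and its parameters are assumptions for illustration.
def scale_node_pool(aks_client, resource_group, cluster_name, pool_name, node_count):
    poller = aks_client.agent_pools.begin_create_or_update(
        resource_group, cluster_name, pool_name, {"count": node_count}
    )
    # Previously the poller was discarded and execution raced ahead; calling
    # .result() waits until ARM marks the operation complete, so the next
    # scale/delete call cannot be rejected with OperationNotAllowed.
    return poller.result()
```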
Add comprehensive test coverage for create_deployment and handle_workload_operations:
- test_create_deployment_success: single deployment with readiness check
- test_create_deployment_partial_success: some deployments fail, operation continues and reports partial success
- test_create_deployment_failure: all deployments fail
- test_multiple_deployments: verifies N deployments created sequentially
- test_handle_workload_operations: deployment command routing + kwargs
- test_handle_workload_operations_unknown_command: returns error
- test_progressive_scaling_failure: scale continues after step failure
- test_scale_down_fails_continues: delete still runs after scale-down error
- test_returns_false_early_exit: unknown operation returns False
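The partial-success behavior these tests exercise can be sketched as a loop that records per-deployment readiness rather than aborting on the first failure; all names here are hypothetical, not the PR's actual code:

```python
# Hypothetical sketch of partial-failure handling in a create_deployment()-style
# loop: keep creating and checking all N deployments, then report overall success.
def create_deployments(apply_fn, wait_fn, count):
    results = []
    for i in range(count):
        name = f"deployment-{i}"
        try:
            apply_fn(name)
            results.append(wait_fn(name))  # True when the deployment is ready
        except Exception:
            results.append(False)  # record the failure and keep going
    # Overall success only when every deployment became ready
    return all(results), results
```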
Force-pushed from 80dc153 to ac30cca
This was referenced May 4, 2026
Force-pushed from 017c519 to 182fb73
Contributor
Pull request overview
This PR extends Telescope’s Python CRUD benchmarking framework to support applying and verifying Kubernetes Deployment workloads (intended for AKS node pool benchmarking) and updates AKS scaling to properly block on Azure ARM long-running operations to avoid sequential scale race conditions.
Changes:
- Added a new `deployment` workload command/dispatcher and wired it into the CRUD pipeline execution flow.
- Introduced a templated multi-document Deployment/Service manifest and an Azure `create_deployment()` implementation that applies and waits for readiness.
- Updated AKS node pool scale operations to call `poller.result()` to ensure ARM operations complete before continuing.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| steps/topology/k8s-crud-gpu/execute-crud.yml | Plumbs workload-related parameters from topology to engine execution. |
| steps/engine/crud/k8s/execute.yml | Adds a workload-deploy step between scale-up and scale-down and wires env/params. |
| modules/python/crud/main.py | Adds deployment subcommand and handle_workload_operations() dispatcher. |
| modules/python/crud/azure/node_pool_crud.py | Implements create_deployment() to apply templated manifests and wait for readiness. |
| modules/python/crud/workload_templates/deployment.yml | Adds templated Deployment/Service YAML used by the workload step. |
| modules/python/clients/aks_client.py | Ensures scale operations block on ARM LRO completion via .result(). |
| modules/python/tests/crud/test_main.py | Adds unit tests for workload dispatcher routing and failure modes. |
| modules/python/tests/crud/test_azure_node_pool_crud.py | Adds unit tests for create_deployment() and improved all()-sequence behavior. |
Author
- Use per-deployment unique labels to avoid selector collision in wait_for_pods_ready
- Resolve template path via __file__ to work regardless of workingDirectory
- Pass namespace to apply_manifest_from_file for consistency with wait calls
- Remove empty spec.template.metadata.name from deployment template
- Fix docstring to match actual method signature
- Remove unused deployment_name parameter from pipeline topology
- Make --manifest-dir conditional (only pass when non-empty)
- Rename test to match actual behavior (deployment, not pod)
…pipeline
- Add conditional on deployment step: only runs when cloud=='azure'
- Add hasattr check in handle_workload_operations for unsupported providers
- Switch test pipeline to Standard_D4s_v3 (quota available)
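The hasattr guard mentioned above might look like this sketch; the function name and error wording are assumptions based on the description, not the PR's actual code:

```python
# Sketch of a capability check for CRUD providers that do not implement
# workload deployments; names here are illustrative assumptions.
def ensure_deployment_support(crud):
    """Fail fast when a provider's CRUD object lacks create_deployment()."""
    if not hasattr(crud, "create_deployment"):
        raise NotImplementedError(
            f"{type(crud).__name__} does not support deployment workloads"
        )
```

Checking the capability up front gives unsupported providers a clear error instead of an AttributeError deep inside the workload step.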
Force-pushed from df0c61b to 0fe7641
Author
Addressed Copilot review feedback:
xinWeiWei24 reviewed May 6, 2026
xinWeiWei24 reviewed May 7, 2026
xinWeiWei24 reviewed May 7, 2026
xinWeiWei24 reviewed May 7, 2026
xinWeiWei24 reviewed May 7, 2026
xinWeiWei24 reviewed May 8, 2026
xinWeiWei24 reviewed May 8, 2026
xinWeiWei24 approved these changes May 9, 2026
Summary
Adds deployment workload support to the CRUD benchmarking framework, enabling Telescope to measure K8s Deployment create/verify latency on AKS node pools. Also fixes a race condition in Azure LRO handling for scale operations.
Branch cleanup note: Rebased and squashed from 141 commits → 4 logical commits for reviewability. All prior review feedback has been addressed.
Changes
`modules/python/crud/azure/node_pool_crud.py`
- `create_deployment()` — creates N deployments from a YAML template, validates readiness via `wait_for_condition`, and handles partial failures gracefully
- `label_selector` derived from parameters (no hardcoding)
- `self.step_timeout` used instead of a hardcoded timeout
- `%`-style logging for consistency with existing methods
- `namespace` configurable, with `"default"` as the default value

`modules/python/crud/main.py`
- `handle_workload_operations()` dispatcher with a `deployment` subcommand
- `--number-of-deployments`, `--replicas`, `--manifest-dir`, and `--node-pool-name` arguments; removed unused `--deployment-name` argument
- `else` clause returning an error on unknown workload commands

`modules/python/crud/workload_templates/deployment.yml`

`steps/engine/crud/k8s/execute.yml`
- `NUMBER_OF_DEPLOYMENTS`, `REPLICAS`, `MANIFEST_DIR` environment variables

`steps/topology/k8s-crud-gpu/execute-crud.yml`

`modules/python/clients/aks_client.py`
- `.result()` on `begin_create_or_update()` in `scale_node_pool` and `_progressive_scale` — blocks until ARM completes before proceeding. Matches the existing pattern in `create_node_pool` (line 331) and `delete_node_pool` (line 573), which already call `.result()`. Prevents a race condition where `wait_for_nodes_ready` runs before ARM has processed the scale request. Consistent with the internal repo, which also calls `.result()` on all operations.

Tests
Unit tests in `test_azure_node_pool_crud.py` and `test_main.py`:
- `test_create_deployment_success` — all deployments ready
- `test_create_deployment_failure` — all fail to become ready
- `test_create_deployment_partial_success` — continues on individual failures, returns False
- `test_multiple_deployments` — verifies N deployments created sequentially
- `test_handle_workload_operations` — deployment command routing
- `test_handle_workload_operations_unknown_command` — returns error
- `test_progressive_scaling_failure` — scale continues after step failure
- `test_scale_down_fails_continues` — delete still runs after scale-down error
- `test_returns_false_early_exit` — unknown operation returns False