Skip to content

Improve deployment workload support in CRUD benchmarking framework#879

Open
diamondpowell wants to merge 17 commits into
mainfrom
test-refactor
Open

Improve deployment workload support in CRUD benchmarking framework#879
diamondpowell wants to merge 17 commits into
mainfrom
test-refactor

Conversation

@diamondpowell
Copy link
Copy Markdown

@diamondpowell diamondpowell commented Sep 26, 2025

Summary

Adds deployment workload support to the CRUD benchmarking framework, enabling Telescope to measure K8s Deployment create/verify latency on AKS node pools. Also fixes a race condition in Azure LRO handling for scale operations.

Branch cleanup note: Rebased and squashed from 141 commits → 4 logical commits for reviewability. All prior review feedback has been addressed.

Changes

'modules/python/crud/azure/node_pool_crud.py'

  • Add create_deployment() — creates N deployments from a YAML template, validates readiness via wait_for_condition, and handles partial failures gracefully
  • Derive [label_selector] from parameters (no hardcoding)
  • Use [self.step_timeout] instead of hardcoded timeout
  • Use %-style logging for consistency with existing methods
  • Make namespace configurable with "default" as default value

'modules/python/crud/main.py'

  • Add handle_workload_operations() dispatcher with deployment subcommand
  • CLI args: --number-of-deployments, --replicas, --manifest-dir, [--node-pool-name]
  • Remove unnecessary [--deployment-name] argument
  • Add else clause returning error on unknown workload commands

'modules/python/crud/workload_templates/deployment.yml'

  • Parameterized deployment manifest with configurable replicas, label_selector, and namespace

[steps/engine/crud/k8s/execute.yml]

  • Add deployment script block between scale-up and scale-down operations
  • Wire NUMBER_OF_DEPLOYMENTS, REPLICAS, MANIFEST_DIR environment variables

[steps/topology/k8s-crud-gpu/execute-crud.yml]

  • Pass deployment parameters from pipeline matrix through topology layer

'modules/python/clients/aks_client.py'

  • Fix: chain .result() on begin_create_or_update() in scale_node_pool and _progressive_scale — blocks until ARM completes before proceeding. Matches existing pattern in create_node_pool (line 331) and delete_node_pool (line 573) which already call .result(). Prevents race condition where wait_for_nodes_ready runs before ARM has processed the scale request. Consistent with the internal repo which also calls .result() on all operations.

Tests

Unit tests in test_azure_node_pool_crud.py and test_main.py:

test_create_deployment_success — all deployments ready
test_create_deployment_failure — all fail to become ready
test_create_deployment_partial_success — continues on individual failures, returns False
test_multiple_deployments — verifies N deployments created sequentially
test_handle_workload_operations — deployment command routing
test_handle_workload_operations_unknown_command — returns error
test_progressive_scaling_failure — scale continues after step failure
test_scale_down_fails_continues — delete still runs after scale-down error
test_returns_false_early_exit — unknown operation returns False

@diamondpowell
Copy link
Copy Markdown
Author

@microsoft-github-policy-service agree company="Microsoft"

Comment thread modules/python/clients/aks_client.py Outdated
Comment thread modules/python/crud/azure/node_pool_crud.py Outdated
Comment thread modules/python/crud/azure/node_pool_crud.py Outdated
Comment thread modules/python/crud/azure/node_pool_crud.py Outdated
Comment thread modules/python/crud/azure/node_pool_crud.py Outdated
Comment thread modules/python/crud/main.py Outdated
Comment thread modules/python/crud/main.py Outdated
Comment thread modules/python/crud/main.py
Comment thread modules/python/crud/aws/node_pool_crud.py Outdated
Comment thread steps/engine/crud/k8s/execute.yml Outdated
@diamondpowell diamondpowell changed the title initial update Improve deployment workload support in CRUD benchmarking framework Apr 14, 2026
Add create_deployment() to NodePoolCRUD that deploys K8s workloads onto
node pools after provisioning. Implements multi-doc YAML manifest parsing,
configurable replica count, and per-deployment readiness validation via
wait_for_condition.

- Add handle_workload_operations() dispatcher in main.py with 'deployment'
  subcommand supporting --number-of-deployments, --replicas, --manifest-dir
- Add deployment.yml workload template with configurable replicas and
  node affinity via label_selector
- Derive label_selector from node pool name parameter (removes hardcoding)
- Return error on unknown workload command
Add deployment execution step between scale-up and scale-down in the
k8s CRUD engine pipeline. Parameters (number_of_deployments, replicas,
manifest_dir) flow from pipeline matrix → topology → engine step → main.py.

- Add deployment script block to steps/engine/crud/k8s/execute.yml
- Pass deployment parameters through topology execute-crud.yml
- Deployment runs after scale-up, before scale-down + delete
begin_create_or_update() returns an LROPoller that was being discarded
in scale_node_pool and _progressive_scale, allowing execution to continue
while Azure still had an operation in-progress. Subsequent scale/delete
calls were rejected with OperationNotAllowed.

Call poller.result() to block until Azure fully completes each operation
before proceeding. Aligns scale behavior with create_node_pool and
delete_node_pool which already awaited the poller.
Add comprehensive test coverage for create_deployment and
handle_workload_operations:

- test_create_deployment_success: single deployment with readiness check
- test_create_deployment_partial_success: some deployments fail, operation
  continues and reports partial success
- test_create_deployment_failure: all deployments fail
- test_multiple_deployments: verifies N deployments created sequentially
- test_handle_workload_operations: deployment command routing + kwargs
- test_handle_workload_operations_unknown_command: returns error
- test_progressive_scaling_failure: scale continues after step failure
- test_scale_down_fails_continues: delete still runs after scale-down error
- test_returns_false_early_exit: unknown operation returns False
@diamondpowell diamondpowell force-pushed the test-refactor branch 4 times, most recently from 017c519 to 182fb73 Compare May 5, 2026 19:21
@diamondpowell diamondpowell marked this pull request as ready for review May 5, 2026 21:50
Copilot AI review requested due to automatic review settings May 5, 2026 21:50
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR extends Telescope’s Python CRUD benchmarking framework to support applying and verifying Kubernetes Deployment workloads (intended for AKS node pool benchmarking) and updates AKS scaling to properly block on Azure ARM long-running operations to avoid sequential scale race conditions.

Changes:

  • Added a new deployment workload command/dispatcher and wired it into the CRUD pipeline execution flow.
  • Introduced a templated multi-document Deployment/Service manifest and an Azure create_deployment() implementation that applies and waits for readiness.
  • Updated AKS node pool scale operations to call poller.result() to ensure ARM operations complete before continuing.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 12 comments.

Show a summary per file
File Description
steps/topology/k8s-crud-gpu/execute-crud.yml Plumbs workload-related parameters from topology to engine execution.
steps/engine/crud/k8s/execute.yml Adds a workload-deploy step between scale-up and scale-down and wires env/params.
modules/python/crud/main.py Adds deployment subcommand and handle_workload_operations() dispatcher.
modules/python/crud/azure/node_pool_crud.py Implements create_deployment() to apply templated manifests and wait for readiness.
modules/python/crud/workload_templates/deployment.yml Adds templated Deployment/Service YAML used by the workload step.
modules/python/clients/aks_client.py Ensures scale operations block on ARM LRO completion via .result().
modules/python/tests/crud/test_main.py Adds unit tests for workload dispatcher routing and failure modes.
modules/python/tests/crud/test_azure_node_pool_crud.py Adds unit tests for create_deployment() and improved all()-sequence behavior.

Comment thread modules/python/crud/azure/node_pool_crud.py Outdated
Comment thread modules/python/crud/azure/node_pool_crud.py Outdated
Comment thread modules/python/crud/azure/node_pool_crud.py Outdated
Comment thread modules/python/crud/azure/node_pool_crud.py Outdated
Comment thread modules/python/crud/workload_templates/deployment.yml Outdated
Comment thread steps/engine/crud/k8s/execute.yml Outdated
Comment thread steps/topology/k8s-crud-gpu/execute-crud.yml Outdated
Comment thread modules/python/tests/crud/test_main.py Outdated
Comment thread steps/engine/crud/k8s/execute.yml Outdated
Comment thread modules/python/crud/main.py
@diamondpowell
Copy link
Copy Markdown
Author

@microsoft-github-policy-service agree company="Microsoft"

diamondpowell and others added 4 commits May 6, 2026 13:20
- Use per-deployment unique labels to avoid selector collision in wait_for_pods_ready
- Resolve template path via __file__ to work regardless of workingDirectory
- Pass namespace to apply_manifest_from_file for consistency with wait calls
- Remove empty spec.template.metadata.name from deployment template
- Fix docstring to match actual method signature
- Remove unused deployment_name parameter from pipeline topology
- Make --manifest-dir conditional (only pass when non-empty)
- Rename test to match actual behavior (deployment, not pod)
…pipeline

- Add conditional on deployment step: only runs when cloud=='azure'
- Add hasattr check in handle_workload_operations for unsupported providers
- Switch test pipeline to Standard_D4s_v3 (quota available)
@diamondpowell
Copy link
Copy Markdown
Author

diamondpowell commented May 6, 2026

Addressed Copilot review feedback:

  • --manifest-dir made optional — defaults to None, uses built-in template at workload_templates/deployment.yml when not provided
  • Deployment step gated on Azure (${{ if eq(parameters.cloud, 'azure') }}), manifest-dir passed conditionally via ${MANIFEST_DIR:+--manifest-dir "$MANIFEST_DIR"}
  • Per-deployment unique labels (f"{label_selector.split('=', 1)[-1]}-{deployment_index}")
  • Template path resolved via os.path.dirname(__file__) (no hardcoded paths)
  • Namespace passthrough to deployment operations
  • Removed empty name: field from deployment template spec
  • hasattr(node_pool_crud, 'create_deployment') guard for backward compatibility
  • Renamed test to match actual operation (test_handle_workload_operations_create_deployment_success)

Files changed:

  • modules/python/crud/azure/node_pool_crud.py
  • modules/python/crud/main.py
  • modules/python/crud/workload_templates/deployment.yml
  • modules/python/tests/crud/test_main.py
  • steps/engine/crud/k8s/execute.yml

Comment thread modules/python/crud/main.py Outdated
Comment thread steps/topology/k8s-crud-gpu/execute-crud.yml Outdated
Comment thread modules/python/crud/azure/node_pool_crud.py
Comment thread modules/python/crud/azure/node_pool_crud.py Outdated
Comment thread modules/python/crud/azure/node_pool_crud.py Outdated
Comment thread modules/python/crud/azure/node_pool_crud.py
Comment thread modules/python/crud/azure/node_pool_crud.py
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants