Commit 63ac789
authored
test: add nova tests (aws#5933)
* test(serve): harden model customization deployment integ tests
Add post-deploy invoke verification and make the Bedrock import-job
lifecycle robust in test_model_customization_deployment.py.
- Verify deployed endpoints by invoking them and validating the
response structure (LORA uses the adapter IC name, otherwise the
default base IC).
- Replace unconditional stop-all cleanup with age-based (>24h) and
status-aware cleanup: stop only InProgress/Pending jobs and delete
completed imported models, with logging on failures.
- Add a class-scoped autouse cleanup_import_jobs fixture to replace the
zzz-prefixed ordering hack.
- Bound the import-job wait loop with a 60-minute timeout and fail fast
on Failed status; fix importedModelName -> importedModelArn.
- Delete the imported model after tests via a yielding deployed_model_arn
fixture.
- Configure bedrock-runtime with standard retries (10 attempts) and add a
slow-marked, retrying test_bedrock_model_invoke to tolerate
"model not ready" exceptions.
X-AI-Prompt: Write commit message for the us-west-2 model customization deployment test hardening changes
X-AI-Tool: kiro-cli
* test(serve): add Nova model customization deployment integ tests (SageMaker)
Add a Nova counterpart to test_model_customization_deployment.py covering
ModelBuilder deployment of fine-tuned Nova models to SageMaker endpoints,
running against the Nova test account in us-east-1 (784379639078).
- TestModelCustomizationFromTrainingJob: build, deploy + invoke (Nova
messages format), and fetch_endpoint_names_for_base_model.
- TestModelCustomizationFromModelPackage: build and deploy from a
registered model package.
- TestInstanceTypeAutoDetection: instance type auto-detection from recipe.
- TestModelCustomizationDetection: customization detection and model
package ARN fetch.
- TestTrainerIntegration: SFT and RLVR trainer build (DPO replaced with
RLVR since Nova has no DPO recipe in SageMakerPublicHub).
- Model package is resolved dynamically from the sdk-test-finetuned-models
group (latest Completed), mirroring test_benchmark_evaluation_nova_model;
dependent tests skip when none exists.
- All tests marked us_east_1 so they run in the PR check
integ-tests-us-east-1 job (intentionally not gpu_intensive, so they do
not run in the scheduled GPU workflow).
- Register gpu_intensive and us_east_1 markers in sagemaker-serve/tox.ini.
The Bedrock deployment suite is kept commented out for now; the Nova for
Bedrock integ tests will be added in a follow-up.
X-AI-Prompt: Write commit message for the Nova-for-SageMaker model customization deployment integ tests and marker registration
X-AI-Tool: kiro-cli
* test(serve): add Nova for Bedrock model customization deployment integ tests
Add TestNovaBedrockDeployment covering deployment of a fine-tuned Nova
model to Amazon Bedrock via BedrockModelBuilder, complementing the existing
Nova-for-SageMaker tests in the same file.
- Deploy a Nova model package through BedrockModelBuilder.deploy(), which
routes Nova models to create_custom_model + create_custom_model_deployment
and polls each resource to Active (vs the create_model_import_job path used
for open-weight models).
- test_nova_bedrock_deployment_active asserts the deployment reaches Active.
- test_nova_bedrock_invoke (slow) invokes the deployed model end-to-end via
bedrock-runtime, with standard retries to tolerate the cold-start window.
- Model package is resolved dynamically from sdk-test-finetuned-models
(latest Completed); deployment fixture cleans up the deployment and custom
model afterwards. Role is resolved via get_execution_role().
- Marked us_east_1 (Nova test account, us-east-1) to run in the PR check
integ-tests-us-east-1 job; not gpu_intensive.
- Replace the previously commented-out OSS-style Bedrock suite (it used the
import-job API, which does not apply to Nova) and update the module
docstring to describe both SageMaker and Bedrock deployment targets.
X-AI-Prompt: Write commit message for the Nova-for-Bedrock model customization deployment integ tests
X-AI-Tool: kiro-cli
* test: fix Nova deployment and Lake Formation integ tests
- Nova deploy/Bedrock tests: build from the TrainingJob instead of a
ModelPackage, since Nova escrow artifacts are only resolvable from the
training job's manifest (deploying from a ModelPackage is unsupported).
- Lake Formation tests: register the S3 location with an explicit role
(use_service_linked_role=False) to avoid the WithFederation+SLR
combination that Lake Formation rejects.
* test(serve): discover Nova SFT training job dynamically
The training_job_name fixture hardcoded a reusable job whose output model
package (sdk-test-nova-finetuned-models/1) was deleted, so every test that
resolves the job's output model package failed with "ModelPackage ... does not
exist".
Discover the latest completed sft-nova-integ-* job at runtime (produced every
few hours by the scheduled Nova SFT workflow) and verify its output model
package still exists before using it; skip if none is found. This avoids
depending on a hardcoded job name that goes stale once resource cleanup deletes
its model package.
X-AI-Prompt: Replace the hardcoded Nova training job fixture with runtime discovery of the latest completed sft-nova-integ job whose output model package still exists
X-AI-Tool: kiro-cli
* fix(serve): resolve Nova Bedrock manifest from output_data_config
BedrockModelBuilder._get_checkpoint_uri_from_manifest located manifest.json via
self.model.model_artifacts.s3_model_artifacts. Nova fine-tuning jobs produced by
SFTTrainer/RLVRTrainer/DPOTrainer run serverless and do not populate
model_artifacts (it is Unassigned; there is no model.tar.gz), so deploying a Nova
TrainingJob to Bedrock failed with
"AttributeError: 'Unassigned' object has no attribute 's3_model_artifacts'".
Build the manifest path from output_data_config.s3_output_path and the training
job name instead. This aligns with the two other implementations that locate the
Nova manifest the same way:
- ModelBuilder._resolve_nova_escrow_uri (SageMaker deployment path), and
- the official Nova Studio notebook
(v3-examples/.../sm-studio-nova-training-job-sample-notebook.ipynb, which
derives the manifest from OutputDataConfig.S3OutputPath, not model_artifacts).
Verified the derived key is identical to the previous logic when model_artifacts
is present, and matches the real manifest location
({s3_output}/{job_name}/output/output/manifest.json) confirmed in the test
account.
Also update the TestGetCheckpointUri unit tests to mock output_data_config, and
keep the Nova Bedrock integ tests driving BedrockModelBuilder from the
TrainingJob.
X-AI-Prompt: Fix BedrockModelBuilder Nova manifest resolution to use output_data_config (matching ModelBuilder._resolve_nova_escrow_uri and the official Nova Studio notebook) and update unit tests
X-AI-Tool: kiro-cli
* fix(serve): support BaseTrainer in Nova escrow resolution; skip deploy on capacity shortage
- _resolve_nova_escrow_uri only accepted TrainingJob/ModelTrainer, so building a
Nova model from an SFTTrainer/RLVRTrainer/DPOTrainer (BaseTrainer subclasses)
failed with "Nova escrow URI resolution requires a TrainingJob or
ModelTrainer". Resolve the underlying job via _latest_training_job for
BaseTrainer, matching _is_model_customization and _fetch_model_package_arn.
- Nova deploy integ tests could fail with InsufficientInstanceCapacity, a
transient region-wide ml.g6.48xlarge availability issue. Add a
_deploy_or_skip_on_capacity helper that skips (instead of failing) in that
case, used by the training-job and model-package deploy tests.
X-AI-Prompt: Support BaseTrainer in _resolve_nova_escrow_uri and skip Nova deploy tests on transient InsufficientInstanceCapacity
X-AI-Tool: kiro-cli
* Fix flaky feature store integ tests: LF negative-role assertion and async FG deletion
test_enable_lake_formation_fails_with_nonexistent_role asserted the registration
error contains EntityNotFoundException, but under a least-privilege iam:PassRole
policy the failure surfaces as an AccessDeniedException on iam:PassRole before
Lake Formation is reached. Accept EntityNotFoundException, AccessDeniedException,
or iam:PassRole as valid "role not usable" outcomes for this negative test.
test_delete_feature_group used a fixed 2s sleep then a single get(), but
FeatureGroup deletion is asynchronous and the group stays describable while in
Deleting status, causing intermittent "DID NOT RAISE". Poll get() until it
raises (group fully gone) or a 120s timeout.
X-AI-Prompt: Fix LF nonexistent-role negative test assertion and poll for async feature group deletion
X-AI-Tool: kiro-cli
* test(serve): use Nova messages-v1 schema for Bedrock invoke
test_nova_bedrock_invoke sent content items as {"type": "text", "text": ...},
which Bedrock rejected with "Malformed input request: #/messages/0/content/0:
extraneous key [type] is not permitted".
Use the Nova messages-v1 InvokeModel schema instead (content items are
{"text": ...} with no type key, plus schemaVersion and inferenceConfig),
matching the official Nova Studio notebook, and assert on the Nova response
shape output.message.content[0].text.
X-AI-Prompt: Fix the Nova Bedrock invoke payload to the messages-v1 schema (no type key) per the official Nova notebook and assert the Nova response structure
X-AI-Tool: kiro-cli
* chore(serve): trim verbose comments
* test(serve): pick latest Nova SFT job without requiring its model package
The training_job_name fixture required the job's output model package to still
exist, but the resource cleaner keeps only the oldest and newest package in the
group, so every job's package was deleted and all dependent tests skipped.
Build/deploy resolve artifacts from the job manifest (not the model package),
so just pick the latest completed sft-nova-integ job.
X-AI-Prompt: Stop requiring the Nova SFT job's output model package to exist in the fixture so tests stop skipping
X-AI-Tool: kiro-cli
* test(serve): resolve Nova training job from an existing model package
ModelBuilder.build fetches the training job's output model package, so the
package must exist. Resource cleanup keeps only the oldest and newest package
in the group, so picking the latest job left it pointing at a deleted package
and every build/deploy test failed.
Instead, start from a model package that currently exists and resolve the
training job that produced it (parsed from the package's escrow S3 URI),
preferring an SFT job. The cleaner always retains the oldest package, so this
reliably yields a job whose output package is present.
X-AI-Prompt: Resolve the Nova training job by reverse-lookup from an existing model package's escrow S3 URI so build/deploy tests stop failing on deleted packages
X-AI-Tool: kiro-cli1 parent e736e8b commit 63ac789
8 files changed
Lines changed: 760 additions & 65 deletions
File tree
- sagemaker-mlops/tests/integ
- sagemaker-serve
- src/sagemaker/serve
- tests
- integ
- unit
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
155 | 155 | | |
156 | 156 | | |
157 | 157 | | |
158 | | - | |
159 | | - | |
160 | | - | |
161 | | - | |
| 158 | + | |
| 159 | + | |
| 160 | + | |
| 161 | + | |
| 162 | + | |
| 163 | + | |
| 164 | + | |
| 165 | + | |
| 166 | + | |
| 167 | + | |
| 168 | + | |
| 169 | + | |
| 170 | + | |
| 171 | + | |
| 172 | + | |
| 173 | + | |
| 174 | + | |
| 175 | + | |
| 176 | + | |
| 177 | + | |
162 | 178 | | |
163 | 179 | | |
164 | 180 | | |
| |||
Lines changed: 31 additions & 5 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
160 | 160 | | |
161 | 161 | | |
162 | 162 | | |
163 | | - | |
| 163 | + | |
| 164 | + | |
| 165 | + | |
| 166 | + | |
| 167 | + | |
| 168 | + | |
164 | 169 | | |
165 | 170 | | |
166 | 171 | | |
| |||
198 | 203 | | |
199 | 204 | | |
200 | 205 | | |
| 206 | + | |
| 207 | + | |
201 | 208 | | |
202 | 209 | | |
203 | 210 | | |
| |||
467 | 474 | | |
468 | 475 | | |
469 | 476 | | |
470 | | - | |
471 | | - | |
| 477 | + | |
| 478 | + | |
| 479 | + | |
| 480 | + | |
| 481 | + | |
| 482 | + | |
| 483 | + | |
| 484 | + | |
| 485 | + | |
| 486 | + | |
| 487 | + | |
472 | 488 | | |
473 | 489 | | |
474 | 490 | | |
| |||
503 | 519 | | |
504 | 520 | | |
505 | 521 | | |
506 | | - | |
| 522 | + | |
| 523 | + | |
| 524 | + | |
| 525 | + | |
| 526 | + | |
| 527 | + | |
507 | 528 | | |
508 | 529 | | |
509 | 530 | | |
| |||
546 | 567 | | |
547 | 568 | | |
548 | 569 | | |
549 | | - | |
| 570 | + | |
| 571 | + | |
| 572 | + | |
| 573 | + | |
| 574 | + | |
| 575 | + | |
550 | 576 | | |
551 | 577 | | |
552 | 578 | | |
| |||
Lines changed: 11 additions & 13 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
645 | 645 | | |
646 | 646 | | |
647 | 647 | | |
648 | | - | |
649 | | - | |
650 | | - | |
651 | | - | |
| 648 | + | |
| 649 | + | |
| 650 | + | |
652 | 651 | | |
653 | 652 | | |
654 | 653 | | |
| |||
660 | 659 | | |
661 | 660 | | |
662 | 661 | | |
663 | | - | |
664 | | - | |
665 | | - | |
| 662 | + | |
| 663 | + | |
| 664 | + | |
| 665 | + | |
| 666 | + | |
| 667 | + | |
666 | 668 | | |
667 | | - | |
668 | | - | |
669 | | - | |
670 | | - | |
671 | | - | |
672 | | - | |
| 669 | + | |
| 670 | + | |
673 | 671 | | |
674 | 672 | | |
675 | 673 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
4699 | 4699 | | |
4700 | 4700 | | |
4701 | 4701 | | |
| 4702 | + | |
| 4703 | + | |
| 4704 | + | |
4702 | 4705 | | |
4703 | 4706 | | |
4704 | 4707 | | |
| |||
0 commit comments