Skip to content

Commit 63ac789

Browse files
authored
test: add nova tests (aws#5933)
* test(serve): harden model customization deployment integ tests Add post-deploy invoke verification and make the Bedrock import-job lifecycle robust in test_model_customization_deployment.py. - Verify deployed endpoints by invoking them and validating the response structure (LORA uses the adapter IC name, otherwise the default base IC). - Replace unconditional stop-all cleanup with age-based (>24h) and status-aware cleanup: stop only InProgress/Pending jobs and delete completed imported models, with logging on failures. - Add a class-scoped autouse cleanup_import_jobs fixture to replace the zzz-prefixed ordering hack. - Bound the import-job wait loop with a 60-minute timeout and fail fast on Failed status; fix importedModelName -> importedModelArn. - Delete the imported model after tests via a yielding deployed_model_arn fixture. - Configure bedrock-runtime with standard retries (10 attempts) and add a slow-marked, retrying test_bedrock_model_invoke to tolerate "model not ready" exceptions. X-AI-Prompt: Write commit message for the us-west-2 model customization deployment test hardening changes X-AI-Tool: kiro-cli * test(serve): add Nova model customization deployment integ tests (SageMaker) Add a Nova counterpart to test_model_customization_deployment.py covering ModelBuilder deployment of fine-tuned Nova models to SageMaker endpoints, running against the Nova test account in us-east-1 (784379639078). - TestModelCustomizationFromTrainingJob: build, deploy + invoke (Nova messages format), and fetch_endpoint_names_for_base_model. - TestModelCustomizationFromModelPackage: build and deploy from a registered model package. - TestInstanceTypeAutoDetection: instance type auto-detection from recipe. - TestModelCustomizationDetection: customization detection and model package ARN fetch. - TestTrainerIntegration: SFT and RLVR trainer build (DPO replaced with RLVR since Nova has no DPO recipe in SageMakerPublicHub). - Model package is resolved dynamically from the sdk-test-finetuned-models group (latest Completed), mirroring test_benchmark_evaluation_nova_model; dependent tests skip when none exists. - All tests marked us_east_1 so they run in the PR check integ-tests-us-east-1 job (intentionally not gpu_intensive, so they do not run in the scheduled GPU workflow). - Register gpu_intensive and us_east_1 markers in sagemaker-serve/tox.ini. The Bedrock deployment suite is kept commented out for now; the Nova for Bedrock integ tests will be added in a follow-up. X-AI-Prompt: Write commit message for the Nova-for-SageMaker model customization deployment integ tests and marker registration X-AI-Tool: kiro-cli * test(serve): add Nova for Bedrock model customization deployment integ tests Add TestNovaBedrockDeployment covering deployment of a fine-tuned Nova model to Amazon Bedrock via BedrockModelBuilder, complementing the existing Nova-for-SageMaker tests in the same file. - Deploy a Nova model package through BedrockModelBuilder.deploy(), which routes Nova models to create_custom_model + create_custom_model_deployment and polls each resource to Active (vs the create_model_import_job path used for open-weight models). - test_nova_bedrock_deployment_active asserts the deployment reaches Active. - test_nova_bedrock_invoke (slow) invokes the deployed model end-to-end via bedrock-runtime, with standard retries to tolerate the cold-start window. - Model package is resolved dynamically from sdk-test-finetuned-models (latest Completed); deployment fixture cleans up the deployment and custom model afterwards. Role is resolved via get_execution_role(). - Marked us_east_1 (Nova test account, us-east-1) to run in the PR check integ-tests-us-east-1 job; not gpu_intensive. - Replace the previously commented-out OSS-style Bedrock suite (it used the import-job API, which does not apply to Nova) and update the module docstring to describe both SageMaker and Bedrock deployment targets. X-AI-Prompt: Write commit message for the Nova-for-Bedrock model customization deployment integ tests X-AI-Tool: kiro-cli * test: fix Nova deployment and Lake Formation integ tests - Nova deploy/Bedrock tests: build from the TrainingJob instead of a ModelPackage, since Nova escrow artifacts are only resolvable from the training job's manifest (deploying from a ModelPackage is unsupported). - Lake Formation tests: register the S3 location with an explicit role (use_service_linked_role=False) to avoid the WithFederation+SLR combination that Lake Formation rejects. * test(serve): discover Nova SFT training job dynamically The training_job_name fixture hardcoded a reusable job whose output model package (sdk-test-nova-finetuned-models/1) was deleted, so every test that resolves the job's output model package failed with "ModelPackage ... does not exist". Discover the latest completed sft-nova-integ-* job at runtime (produced every few hours by the scheduled Nova SFT workflow) and verify its output model package still exists before using it; skip if none is found. This avoids depending on a hardcoded job name that goes stale once resource cleanup deletes its model package. X-AI-Prompt: Replace the hardcoded Nova training job fixture with runtime discovery of the latest completed sft-nova-integ job whose output model package still exists X-AI-Tool: kiro-cli * fix(serve): resolve Nova Bedrock manifest from output_data_config BedrockModelBuilder._get_checkpoint_uri_from_manifest located manifest.json via self.model.model_artifacts.s3_model_artifacts. Nova fine-tuning jobs produced by SFTTrainer/RLVRTrainer/DPOTrainer run serverless and do not populate model_artifacts (it is Unassigned; there is no model.tar.gz), so deploying a Nova TrainingJob to Bedrock failed with "AttributeError: 'Unassigned' object has no attribute 's3_model_artifacts'". Build the manifest path from output_data_config.s3_output_path and the training job name instead. This aligns with the two other implementations that locate the Nova manifest the same way: - ModelBuilder._resolve_nova_escrow_uri (SageMaker deployment path), and - the official Nova Studio notebook (v3-examples/.../sm-studio-nova-training-job-sample-notebook.ipynb, which derives the manifest from OutputDataConfig.S3OutputPath, not model_artifacts). Verified the derived key is identical to the previous logic when model_artifacts is present, and matches the real manifest location ({s3_output}/{job_name}/output/output/manifest.json) confirmed in the test account. Also update the TestGetCheckpointUri unit tests to mock output_data_config, and keep the Nova Bedrock integ tests driving BedrockModelBuilder from the TrainingJob. X-AI-Prompt: Fix BedrockModelBuilder Nova manifest resolution to use output_data_config (matching ModelBuilder._resolve_nova_escrow_uri and the official Nova Studio notebook) and update unit tests X-AI-Tool: kiro-cli * fix(serve): support BaseTrainer in Nova escrow resolution; skip deploy on capacity shortage - _resolve_nova_escrow_uri only accepted TrainingJob/ModelTrainer, so building a Nova model from an SFTTrainer/RLVRTrainer/DPOTrainer (BaseTrainer subclasses) failed with "Nova escrow URI resolution requires a TrainingJob or ModelTrainer". Resolve the underlying job via _latest_training_job for BaseTrainer, matching _is_model_customization and _fetch_model_package_arn. - Nova deploy integ tests could fail with InsufficientInstanceCapacity, a transient region-wide ml.g6.48xlarge availability issue. Add a _deploy_or_skip_on_capacity helper that skips (instead of failing) in that case, used by the training-job and model-package deploy tests. X-AI-Prompt: Support BaseTrainer in _resolve_nova_escrow_uri and skip Nova deploy tests on transient InsufficientInstanceCapacity X-AI-Tool: kiro-cli * Fix flaky feature store integ tests: LF negative-role assertion and async FG deletion test_enable_lake_formation_fails_with_nonexistent_role asserted the registration error contains EntityNotFoundException, but under a least-privilege iam:PassRole policy the failure surfaces as an AccessDeniedException on iam:PassRole before Lake Formation is reached. Accept EntityNotFoundException, AccessDeniedException, or iam:PassRole as valid "role not usable" outcomes for this negative test. test_delete_feature_group used a fixed 2s sleep then a single get(), but FeatureGroup deletion is asynchronous and the group stays describable while in Deleting status, causing intermittent "DID NOT RAISE". Poll get() until it raises (group fully gone) or a 120s timeout. X-AI-Prompt: Fix LF nonexistent-role negative test assertion and poll for async feature group deletion X-AI-Tool: kiro-cli * test(serve): use Nova messages-v1 schema for Bedrock invoke test_nova_bedrock_invoke sent content items as {"type": "text", "text": ...}, which Bedrock rejected with "Malformed input request: #/messages/0/content/0: extraneous key [type] is not permitted". Use the Nova messages-v1 InvokeModel schema instead (content items are {"text": ...} with no type key, plus schemaVersion and inferenceConfig), matching the official Nova Studio notebook, and assert on the Nova response shape output.message.content[0].text. X-AI-Prompt: Fix the Nova Bedrock invoke payload to the messages-v1 schema (no type key) per the official Nova notebook and assert the Nova response structure X-AI-Tool: kiro-cli * chore(serve): trim verbose comments * test(serve): pick latest Nova SFT job without requiring its model package The training_job_name fixture required the job's output model package to still exist, but the resource cleaner keeps only the oldest and newest package in the group, so every job's package was deleted and all dependent tests skipped. Build/deploy resolve artifacts from the job manifest (not the model package), so just pick the latest completed sft-nova-integ job. X-AI-Prompt: Stop requiring the Nova SFT job's output model package to exist in the fixture so tests stop skipping X-AI-Tool: kiro-cli * test(serve): resolve Nova training job from an existing model package ModelBuilder.build fetches the training job's output model package, so the package must exist. Resource cleanup keeps only the oldest and newest package in the group, so picking the latest job left it pointing at a deleted package and every build/deploy test failed. Instead, start from a model package that currently exists and resolve the training job that produced it (parsed from the package's escrow S3 URI), preferring an SFT job. The cleaner always retains the oldest package, so this reliably yields a job whose output package is present. X-AI-Prompt: Resolve the Nova training job by reverse-lookup from an existing model package's escrow S3 URI so build/deploy tests stop failing on deleted packages X-AI-Tool: kiro-cli
1 parent e736e8b commit 63ac789

8 files changed

Lines changed: 760 additions & 65 deletions

File tree

sagemaker-mlops/tests/integ/test_feature_store.py

Lines changed: 20 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -155,10 +155,26 @@ def test_delete_feature_group(feature_group_name, sample_dataframe, bucket, role
155155
fg.wait_for_status("Created")
156156

157157
fg.delete()
158-
time.sleep(2)
159-
160-
with pytest.raises(Exception):
161-
FeatureGroup.get(feature_group_name=feature_group_name)
158+
159+
# FeatureGroup deletion is asynchronous: after delete() returns the group
160+
# stays in "Deleting" status and is still describable for a while, so a
161+
# fixed short sleep + single get() is racy. Poll until get() raises (the
162+
# group is fully gone) or we hit the timeout.
163+
deadline = time.time() + 120
164+
last_exc = None
165+
while time.time() < deadline:
166+
try:
167+
FeatureGroup.get(feature_group_name=feature_group_name)
168+
except Exception as e: # noqa: BLE001 - any error means it's no longer retrievable
169+
last_exc = e
170+
break
171+
time.sleep(5)
172+
else:
173+
pytest.fail(
174+
f"FeatureGroup {feature_group_name} was still retrievable 120s after delete()"
175+
)
176+
177+
assert last_exc is not None
162178

163179

164180
# Test 7: Ingest to both OnlineStore and OfflineStore

sagemaker-mlops/tests/integ/test_feature_store_lakeformation.py

Lines changed: 31 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -160,7 +160,12 @@ def test_create_feature_group_and_enable_lake_formation(s3_uri, role, region):
160160
assert fg.feature_group_status == "Created"
161161

162162
# Enable Lake Formation governance
163-
result = fg.enable_lake_formation(hybrid_access_mode_enabled=False, acknowledge_risk=True)
163+
result = fg.enable_lake_formation(
164+
hybrid_access_mode_enabled=False,
165+
acknowledge_risk=True,
166+
use_service_linked_role=False,
167+
registration_role_arn=role,
168+
)
164169

165170
# Verify all phases completed successfully
166171
assert result["s3_location_registered"] is True
@@ -198,6 +203,8 @@ def test_create_feature_group_with_lake_formation_enabled(s3_uri, role, region):
198203
enabled=True,
199204
hybrid_access_mode_enabled = False,
200205
acknowledge_risk=True,
206+
use_service_linked_role=False,
207+
registration_role_arn=role,
201208
)
202209

203210
fg = FeatureGroupManager.create(
@@ -467,8 +474,17 @@ def test_enable_lake_formation_fails_with_nonexistent_role(
467474
# Verify we got an appropriate error
468475
error_msg = str(exc_info.value)
469476
print(exc_info)
470-
# Should mention role-related issues (not found, invalid, access denied, etc.)
471-
assert "EntityNotFoundException" in error_msg
477+
# The registration must fail because the role is not usable. Depending on
478+
# how the build/execution role's iam:PassRole policy is scoped, this surfaces
479+
# either as Lake Formation rejecting the unknown role (EntityNotFoundException)
480+
# or as IAM denying PassRole before the call reaches Lake Formation
481+
# (AccessDeniedException on iam:PassRole). Both are valid "nonexistent / not
482+
# usable role" outcomes for this negative test.
483+
assert (
484+
"EntityNotFoundException" in error_msg
485+
or "AccessDeniedException" in error_msg
486+
or "iam:PassRole" in error_msg
487+
), f"Unexpected error for nonexistent role registration: {error_msg}"
472488

473489

474490
# ============================================================================
@@ -503,7 +519,12 @@ def test_enable_lake_formation_full_flow_with_policy_output(s3_uri, role, region
503519

504520
# Enable Lake Formation governance
505521
with caplog.at_level(logging.WARNING, logger="sagemaker.mlops.feature_store.feature_group_manager"):
506-
result = fg.enable_lake_formation(hybrid_access_mode_enabled=False, acknowledge_risk=True)
522+
result = fg.enable_lake_formation(
523+
hybrid_access_mode_enabled=False,
524+
acknowledge_risk=True,
525+
use_service_linked_role=False,
526+
registration_role_arn=role,
527+
)
507528

508529
# Verify all phases completed successfully
509530
assert result["s3_location_registered"] is True
@@ -546,7 +567,12 @@ def test_enable_lake_formation_default_logs_recommended_policy(s3_uri, role, reg
546567

547568
# Enable Lake Formation governance with hybrid_access_mode_enabled=False
548569
with caplog.at_level(logging.WARNING, logger="sagemaker.mlops.feature_store.feature_group_manager"):
549-
result = fg.enable_lake_formation(hybrid_access_mode_enabled=False, acknowledge_risk=True)
570+
result = fg.enable_lake_formation(
571+
hybrid_access_mode_enabled=False,
572+
acknowledge_risk=True,
573+
use_service_linked_role=False,
574+
registration_role_arn=role,
575+
)
550576

551577
# Verify phases completed successfully
552578
assert result["s3_location_registered"] is True

sagemaker-serve/src/sagemaker/serve/bedrock_model_builder.py

Lines changed: 11 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -645,10 +645,9 @@ def _get_checkpoint_uri_from_manifest(self) -> Optional[str]:
645645
"""Get checkpoint URI from manifest.json for Nova models.
646646
647647
Steps:
648-
1. Fetch S3 model artifacts from training job
649-
2. Construct path to manifest.json in the output directory
650-
3. Read and parse manifest.json
651-
4. Return checkpoint_s3_bucket value
648+
1. Build the manifest.json path from the training job output_data_config
649+
2. Read and parse manifest.json
650+
3. Return checkpoint_s3_bucket value
652651
653652
Returns:
654653
Checkpoint URI from manifest.json.
@@ -660,16 +659,15 @@ def _get_checkpoint_uri_from_manifest(self) -> Optional[str]:
660659
if not isinstance(self.model, TrainingJob):
661660
raise ValueError("Model must be a TrainingJob instance for Nova models")
662661

663-
s3_artifacts = self.model.model_artifacts.s3_model_artifacts
664-
if not s3_artifacts:
665-
raise ValueError("No S3 model artifacts found in training job")
662+
# Nova serverless training jobs have no model_artifacts; the manifest
663+
# lives under the job's output_data_config path.
664+
output_data_config = getattr(self.model, "output_data_config", None)
665+
s3_output_path = getattr(output_data_config, "s3_output_path", None)
666+
if not s3_output_path:
667+
raise ValueError("No S3 output path found in training job output_data_config")
666668

667-
logger.info("S3 artifacts path: %s", s3_artifacts)
668-
669-
# Construct manifest path
670-
# s3://bucket/path/output/model.tar.gz -> s3://bucket/path/output/output/manifest.json
671-
parts = s3_artifacts.rstrip("/").rsplit("/", 1)
672-
manifest_path = parts[0] + "/output/manifest.json"
669+
output_path = s3_output_path.rstrip("/")
670+
manifest_path = f"{output_path}/{self.model.training_job_name}/output/output/manifest.json"
673671

674672
logger.info("Manifest path: %s", manifest_path)
675673

sagemaker-serve/src/sagemaker/serve/model_builder.py

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -4699,6 +4699,9 @@ def _resolve_nova_escrow_uri(self) -> str:
46994699
training_job = self.model
47004700
elif isinstance(self.model, ModelTrainer):
47014701
training_job = self.model._latest_training_job
4702+
elif isinstance(self.model, BaseTrainer) and hasattr(self.model, "_latest_training_job"):
4703+
# SFTTrainer / RLVRTrainer / DPOTrainer expose the job via _latest_training_job.
4704+
training_job = self.model._latest_training_job
47024705
else:
47034706
raise ValueError("Nova escrow URI resolution requires a TrainingJob or ModelTrainer")
47044707

0 commit comments

Comments
 (0)