fix: Model builder unable to (5667)#5754
Conversation
sagemaker-bot
left a comment
There was a problem hiding this comment.
🤖 AI Code Review
This PR fixes a crash in ModelBuilder when a ModelPackage's BaseModel has Unassigned sentinel values for hub_content_version and recipe_name. The approach is reasonable—resolving missing fields early before they're accessed. However, there are several issues: broad exception catching, potential duplicate HubContent.get calls, line length violations, and the resolution guard pattern using a dynamic attribute instead of proper initialization.
| """_is_nova_model should return False without raising when fields are Unassigned.""" | ||
| mb = _make_model_builder() | ||
| base_model = _make_base_model( | ||
| hub_content_name=None, # Unassigned |
There was a problem hiding this comment.
Good test for the error case! However, you should also add a test for the case where _resolve_base_model_fields itself fails (the except Exception path at line 737 in model_builder.py) to verify that the method gracefully handles HubContent.get failures and that downstream code still raises the appropriate ValueError.
🤖 Iteration #1 — Review Comments AddressedDescriptionModelBuilder crashes when a ModelPackage's BaseModel has Root Cause
Fix
Changes
Related IssueRelated issue: 5667 Merge Checklist
Comments reviewed: 12
|
sagemaker-bot
left a comment
There was a problem hiding this comment.
🤖 AI Code Review
This PR fixes a crash when ModelPackage's BaseModel has Unassigned sentinel values for hub_content_version and recipe_name. The approach is reasonable — resolve missing fields early before they're accessed. However, there are several issues: the _resolve_base_model_fields method is too long and deeply nested, there's a missing type annotation on the return type, the _base_model_fields_resolved attribute is set in __post_init__ without a type annotation in the class body, and the error handling swallows exceptions in some cases where it probably shouldn't.
sagemaker-bot
left a comment
There was a problem hiding this comment.
🤖 AI Code Review
This PR fixes a crash when ModelPackage's BaseModel has Unassigned sentinel values for hub_content_version and recipe_name. The approach is sound — resolving missing fields early before they're accessed. However, there are several issues: the _resolve_base_model_fields method is too long and deeply nested, there's a missing docstring on __post_init__ attribute, exception handling swallows errors too broadly, and the test file has some gaps.
| if self.log_level is not None: | ||
| logger.setLevel(self.log_level) | ||
|
|
||
| self._base_model_fields_resolved: bool = False |
There was a problem hiding this comment.
This instance attribute is set in __post_init__ but lacks a type annotation on the class body. Per SDK conventions (PEP 484, Pydantic BaseModel), all class attributes should have type annotations. Consider declaring this as a class-level field:
_base_model_fields_resolved: bool = Falsein the class body (or as a PrivateAttr if this is a Pydantic model), rather than setting it in __post_init__.
| @@ -680,18 +681,182 @@ def _infer_instance_type_from_jumpstart(self) -> str: | |||
|
|
|||
| raise ValueError(error_msg) | |||
|
|
|||
There was a problem hiding this comment.
Missing type annotation on the return type. Per SDK conventions, all public and non-trivial private methods should have type annotations:
def _resolve_base_model_fields(self) -> None:| return | ||
|
|
||
| model_package = self._fetch_model_package() | ||
| if not model_package: |
There was a problem hiding this comment.
The early-return pattern with self._base_model_fields_resolved = True is repeated 4 times before the main logic. Consider restructuring to reduce duplication — e.g., extract the chain of getattr checks into a helper that returns the base_model or None, then have a single early return:
base_model = self._get_base_model_from_package()
if not base_model:
self._base_model_fields_resolved = True
returnThis would also make the method significantly shorter and more readable.
| logger.info( | ||
| "Resolved hub_content_version to '%s' " | ||
| "for hub content '%s'.", | ||
| cached_hub_content.hub_content_version, |
There was a problem hiding this comment.
Catching ValueError here is quite broad. What ValueError would HubContent.get() raise? If this is for input validation errors from sagemaker-core, that's fine, but consider documenting why ValueError is caught alongside ClientError. Also, if the ClientError is something other than ResourceNotFoundException (e.g., AccessDeniedException), silently swallowing it and returning early could mask real permission issues. Consider only catching specific error codes:
except ClientError as e:
if e.response['Error']['Code'] == 'ResourceNotFoundException':
logger.warning(...)
else:
raise| and cached_hub_content.hub_content_version | ||
| == base_model.hub_content_version | ||
| ): | ||
| hub_content = cached_hub_content |
There was a problem hiding this comment.
The version comparison cached_hub_content.hub_content_version == base_model.hub_content_version — at this point base_model.hub_content_version was just set to cached_hub_content.hub_content_version a few lines above (line 741), so this condition will always be True when cached_hub_content is not None. The else branch (re-fetching with a specific version) is dead code in the current flow. Consider simplifying by always reusing the cached hub content when available, or adding a comment explaining when the else branch would be hit.
| ), | ||
| ): | ||
| from sagemaker.serve.model_builder import ModelBuilder | ||
| defaults = dict( |
There was a problem hiding this comment.
The test helper creates a ModelBuilder by importing it inside the function and patching Session and get_execution_role. This is fragile — if ModelBuilder.__post_init__ changes to call other services, this will break. Consider also patching _fetch_model_package at the helper level, or using a more targeted approach to construct the object for unit testing.
| defaults = dict( | ||
| model="dummy-model", | ||
| role_arn=( | ||
| "arn:aws:iam::123456789012:role/SageMakerRole" |
There was a problem hiding this comment.
The role ARN arn:aws:iam::123456789012:role/SageMakerRole has a double colon (iam::123456789012) — it's missing the partition. While this is just a test mock value, using a properly formatted ARN (arn:aws:iam:us-west-2:123456789012:role/SageMakerRole) would be more realistic and avoid any future validation issues.
| mock_hub_content_cls.get.reset_mock() | ||
|
|
||
| mb._resolve_base_model_fields() | ||
| mock_hub_content_cls.get.assert_not_called() |
There was a problem hiding this comment.
The test file covers the main scenarios well, but is missing a test for the case where _fetch_model_package() returns None. The test_no_base_model_is_noop test covers base_model = None, but there should be an explicit test for _fetch_model_package returning None to cover line 701 in the source.
| mock_hub_content_cls.get.reset_mock() | ||
|
|
||
| mb._resolve_base_model_fields() | ||
| mock_hub_content_cls.get.assert_not_called() |
There was a problem hiding this comment.
Missing test: the case where hub_content_document contains a RecipeCollection with an entry that has no Name key (covers the logger.warning("RecipeCollection found but first recipe has no Name...") branch at ~line 800 in the source).
| Nova training recipes don't have hosting configs in the JumpStart hub document. | ||
| This provides the hardcoded fallback, matching Rhinestone's getNovaHostingConfigs(). | ||
| """ | ||
| self._resolve_base_model_fields() |
There was a problem hiding this comment.
The _resolve_base_model_fields() call was added here in _get_nova_hosting_config, but _is_nova_model() (which gates the call to this method) does NOT call _resolve_base_model_fields(). This means _is_nova_model() could return False for a Nova model if hub_content_name is Unassigned, and _get_nova_hosting_config would never be reached. Should _is_nova_model() also call _resolve_base_model_fields() first?
🤖 Iteration #2 — Review Comments AddressedDescriptionModelBuilder crashes when a ModelPackage's BaseModel has Root Cause
Fix
Changes
Merge Checklist
Comments reviewed: 15
|
Description
ModelBuilder crashes when a ModelPackage's BaseModel has HubContentName set but HubContentVersion and/or RecipeName are Unassigned (the sagemaker-core sentinel value for missing fields). The root cause is in
_fetch_hub_document_for_custom_model()which passesbase_model.hub_content_versiondirectly toHubContent.get()without checking for Unassigned, and_fetch_and_cache_recipe_config()which readsbase_model.recipe_namewithout validation. The fix is to: (1) Add a new method_resolve_base_model_fields()that auto-resolves missing hub_content_version by callingHubContent.get()with just the hub_content_name (no version), and auto-resolves missing recipe_name from the hub document's RecipeCollection. (2) Call this method early in the model-package build flow, before any code accesses these fields. (3) Update_fetch_hub_document_for_custom_model()to handle the case where hub_content_version might still be Unassigned (call without version param).Related Issue
Related issue: 5667
Changes Made
sagemaker-serve/src/sagemaker/serve/model_builder.pysagemaker-serve/tests/unit/test_resolve_base_model_fields.pyAI-Generated PR
This PR was automatically generated by the PySDK Issue Agent.
Merge Checklist
prefix: descriptionformat