fix: Support local source for sagemaker.core.shapes.ProcessingInput (5672)#29
fix: Support local source for sagemaker.core.shapes.ProcessingInput (5672)#29
Conversation
mufaddal-rohawala
left a comment
There was a problem hiding this comment.
🤖 AI Code Review
This PR adds a source convenience parameter to ProcessingInput for local file paths, but the actual ProcessingInput class modification in shapes.py is missing from the diff. The PR only shows an unrelated change to InferenceComponentComputeResourceRequirements in shapes.py. The _normalize_inputs changes and tests look reasonable, but without the core ProcessingInput model change, the tests would fail. Additionally, there are several code quality and architectural concerns.
| if file_input.dataset_definition: | ||
| normalized_inputs.append(file_input) | ||
| continue | ||
| # Handle case where source was set but s3_input was not created |
There was a problem hiding this comment.
Concern: getattr usage suggests source is not a proper model field. Using getattr(file_input, "source", None) implies uncertainty about whether the attribute exists, which is a code smell for a Pydantic model. If ProcessingInput properly defines source as a field (which the missing diff should show), you should use file_input.source directly. If the source field is properly defined on the model, getattr is unnecessary.
Also, this fallback logic duplicates the validation that should happen in ProcessingInput.__init__ (or model_validator). If the convenience __init__ logic always creates s3_input when source is provided, this fallback should never trigger. Consider whether this defensive code is actually needed, or if it masks bugs.
| # (e.g., if ProcessingInput was constructed without using the | ||
| # convenience __init__ logic) | ||
| if file_input.s3_input is None and getattr(file_input, "source", None): | ||
| file_input.s3_input = ProcessingS3Input( |
There was a problem hiding this comment.
Hardcoded magic strings. "S3Prefix" and "File" are repeated here and presumably in the ProcessingInput model validator. Extract these into module-level constants (e.g., DEFAULT_S3_DATA_TYPE = "S3Prefix", DEFAULT_S3_INPUT_MODE = "File") to avoid duplication and improve maintainability.
| normalized = processor._normalize_inputs(inputs) | ||
|
|
||
| assert len(normalized) == 1 | ||
| assert normalized[0].s3_input.s3_uri == "s3://my-bucket/job-name/input/my-input/data.csv" |
There was a problem hiding this comment.
Line exceeds 100 characters. This assertion line is quite long. Consider breaking it into a variable:
expected_uri = "s3://my-bucket/job-name/input/my-input/data.csv"
assert normalized[0].s3_input.s3_uri == expected_uri| # ANY KIND, either express or implied. See the License for the specific | ||
| # language governing permissions and limitations under the License. | ||
| """Tests for ProcessingInput source parameter and local file upload behavior.""" | ||
| from __future__ import absolute_import |
There was a problem hiding this comment.
Use from __future__ import annotations instead of from __future__ import absolute_import. Per SDK coding standards, new modules should use from __future__ import annotations to enable PEP 604 union syntax. absolute_import is a Python 2 artifact and unnecessary in Python 3.
mufaddal-rohawala
left a comment
There was a problem hiding this comment.
🤖 AI Code Review
This PR adds a source convenience parameter to ProcessingInput to simplify specifying local file paths for processing jobs. While the intent is good, there are several issues: the sagemaker-core shapes file is auto-generated and should not be manually modified, the Pydantic validation approach has issues, there's an unrelated change to InferenceComponentComputeResourceRequirements, and the _normalize_inputs logic has a gap where s3_input could be Unassigned rather than None.
Description
Add
sourceconvenience parameter toProcessingInputfor local file pathsProblem
In V3, specifying a local file or directory as input to a processing job requires setting
s3_input=ProcessingS3Input(s3_uri="/local/path/to/data", ...), which is confusing because:s3_uri, suggesting only S3 paths are validThe underlying
_normalize_inputs()method already handles local paths correctly (detecting non-S3 URIs viaurlparseand uploading them viaS3Uploader.upload()), but the API surface doesn't make this clear.Changes
Added
sourceconvenience parameter toProcessingInput: Users can now write:instead of:
When
sourceis provided ands3_inputis not, aProcessingS3Inputis automatically created. Both local paths and S3 URIs are accepted. Specifying bothsourceands3_inputraises aValueError.Updated
ProcessingS3Input.s3_uridocumentation: The docstring now explicitly states that local file/directory paths are accepted and will be automatically uploaded to S3.Added logging for local file uploads:
_normalize_inputs()now logs anINFOmessage when a local path is detected and being uploaded to S3, improving visibility into what's happening.Added fallback handling in
_normalize_inputs(): If aProcessingInputhas asourceattribute set buts3_inputisNone(edge case), the normalization logic creates theProcessingS3Inputautomatically.Backward Compatibility
This change is fully backward compatible:
sourceparameter is optional and defaults toNones3_inputdirectly continues to work unchangedsourcefield is excluded from serialization (not sent to the API)Testing
Added comprehensive unit tests covering:
sourceparameter creatingProcessingS3Inputautomaticallysourcewith S3 URI passthroughsourceands3_inputare provided_normalize_inputsRelated Issue
Related issue: 5672
Changes Made
The V3 sagemaker.core.processing.Processor._normalize_inputs() method already supports local file paths as input. When a ProcessingInput is provided with a non-S3 URI in its s3_input.s3_uri field, the normalization logic detects this (via urlparse checking scheme != 's3'), uploads the local file to S3 using s3.S3Uploader.upload(), and replaces the s3_uri with the resulting S3 URI. This is the same auto-upload behavior that existed in V2. However, the user experience is confusing because: (1) the field is named 's3_uri' in ProcessingS3Input, suggesting only S3 paths are valid, and (2) there's no documentation or parameter alias making this clear. The code change needed is to add a convenience parameter 'source' (or similar) to ProcessingInput that accepts local file paths explicitly, and/or update the ProcessingS3Input class to document that local paths are accepted in the s3_uri field. The most minimal and non-breaking approach is to add a 'local_path' or 'source' convenience parameter on ProcessingInput that, when provided, populates s3_input.s3_uri with the local path (which _normalize_inputs will then upload). This makes the API more intuitive for V2 users migrating to V3.
AI-Generated PR
This PR was automatically generated by the PySDK Issue Agent.
Merge Checklist
prefix: descriptionformat