fix: Support local source for sagemaker.core.shapes.ProcessingInput (5672)#30
fix: Support local source for sagemaker.core.shapes.ProcessingInput (5672)#30
Conversation
mufaddal-rohawala
left a comment
There was a problem hiding this comment.
🤖 AI Code Review
This PR adds two convenience factory functions for creating ProcessingInput from local paths, which is a reasonable UX improvement. However, there are several design concerns: the ad-hoc source attribute monkey-patching pattern in _normalize_inputs() is fragile and goes against SDK conventions, the docstrings use Google-style Args: instead of reST format, and the processing_input_from_local function is missing validation for the destination parameter. The factory functions themselves are well-tested but could benefit from type annotation improvements.
mufaddal-rohawala
left a comment
There was a problem hiding this comment.
🤖 AI Code Review
This PR adds convenience factory functions for creating ProcessingInput from local paths, improving V3 UX. While the factories are well-tested, there are several design and code quality concerns: the ad-hoc source attribute monkey-patching pattern is fragile and anti-pattern for a Pydantic-based SDK, the docstrings use Google-style Args/Returns instead of reST format, and there are missing type annotations and validation gaps.
| # a subclass), populate s3_input.s3_uri from it so the existing | ||
| # upload logic handles it. | ||
| _source = getattr(file_input, "source", None) | ||
| if _source is not None: |
There was a problem hiding this comment.
Critical design concern: getattr(file_input, "source", None) monkey-patching pattern is fragile and anti-pattern.
This relies on users monkey-patching an ad-hoc source attribute onto a Pydantic BaseModel (ProcessingInput is auto-generated from sagemaker-core shapes). Pydantic models by default reject extra attributes, so file_input.source = '/tmp/data' would raise a ValidationError unless the model is configured with model_config = ConfigDict(extra='allow'). This code path is essentially dead unless users bypass Pydantic validation.
Moreover, encouraging monkey-patching on auto-generated shapes is a maintenance hazard — if the shape ever adds a real source field, this code will silently conflict.
Recommendation: Remove this entire _source = getattr(...) block. The two factory functions (processing_input_from_local and create_processing_input) already solve the UX problem cleanly by populating s3_input.s3_uri correctly. There's no need for this fallback path.
| This will be uploaded to S3 automatically before the | ||
| processing job starts. | ||
| destination: The container path where the input data will be | ||
| made available (e.g. ``/opt/ml/processing/input/data``). |
There was a problem hiding this comment.
Missing validation for destination parameter. processing_input_from_local validates input_name and local_path but not destination, while create_processing_input validates all three. This inconsistency could confuse users. Add:
if not destination:
raise ValueError(f"destination must be a non-empty string, got: {destination!r}")| s3_uri=_source, | ||
| local_path="/opt/ml/processing/input", | ||
| s3_data_type="S3Prefix", | ||
| s3_input_mode="File", |
There was a problem hiding this comment.
Hardcoded default /opt/ml/processing/input — this magic string should be extracted to a module-level constant, e.g.:
DEFAULT_PROCESSING_INPUT_PATH = "/opt/ml/processing/input"This is also used in the factory functions' docstrings and examples, so a constant would ensure consistency.
Description
Add convenience factories for creating ProcessingInput from local paths
Problem
In the V3
Processor._normalize_inputs(), local file paths are already supported — whenProcessingS3Input.s3_uricontains a non-S3 scheme (e.g., a local path), the file is automatically uploaded to S3. However, the user experience is poor because users must put a local file path into thes3_urifield ofProcessingS3Input, which is semantically confusing. The V2ProcessingInputhad asourceparameter that clearly accepted local paths.Solution
This PR adds two public convenience factory functions and enhances
_normalize_inputs()to improve the developer experience:processing_input_from_local()— Creates aProcessingInputfrom a local file/directory path with clear parameter names (local_path,destination).create_processing_input()— A V2-like factory that accepts asourceparameter (local path or S3 URI) anddestination, providing familiar ergonomics._normalize_inputs()enhancement — Now checks for an ad-hocsourceattribute onProcessingInputobjects and populatess3_input.s3_urifrom it, providing a fallback for users who set the attribute directly.Usage Examples
Testing
processing_input_from_local()(valid inputs, custom params, validation errors, pipeline config integration)create_processing_input()(local paths, S3 URIs, defaults, validation errors)_normalize_inputs()local path upload behaviorRelated Issue
Related issue: 5672
Changes Made
The V3
Processor._normalize_inputs()insagemaker-core/src/sagemaker/core/processing.pyalready handles local file paths: whenfile_input.s3_input.s3_urihas a non-s3 scheme (e.g., a local path), it uploads the file to S3 automatically. However, the user experience is poor because users must put a local file path into thes3_urifield ofProcessingS3Input, which is semantically confusing. The V2ProcessingInputhad asourceparameter that clearly accepted local paths. The fix should add an optionalsourceconvenience parameter toProcessingInputconstruction in the V3 processing module, or provide a helper/factory that wraps local paths into the expectedProcessingS3Inputshape. SinceProcessingInputis auto-generated insagemaker.core.shapes, the best approach is to add a convenience factory function or extend theProcessor._normalize_inputs()to also accept asourcefield, plus provide clear documentation.AI-Generated PR
This PR was automatically generated by the PySDK Issue Agent.
Merge Checklist
prefix: descriptionformat