Skip to content

fix: Support local source for sagemaker.core.shapes.ProcessingInput (5672)#29

Closed
aviruthen wants to merge 2 commits intomasterfrom
fix/support-local-source-for-sagemaker-core-shapes-5672
Closed

fix: Support local source for sagemaker.core.shapes.ProcessingInput (5672)#29
aviruthen wants to merge 2 commits intomasterfrom
fix/support-local-source-for-sagemaker-core-shapes-5672

Conversation

@aviruthen
Copy link
Copy Markdown
Owner

Description

Add source convenience parameter to ProcessingInput for local file paths

Problem

In V3, specifying a local file or directory as input to a processing job requires setting s3_input=ProcessingS3Input(s3_uri="/local/path/to/data", ...), which is confusing because:

  1. The field is named s3_uri, suggesting only S3 paths are valid
  2. There's no documentation indicating that local paths are accepted and will be auto-uploaded to S3
  3. V2 users migrating to V3 may not discover this capability

The underlying _normalize_inputs() method already handles local paths correctly (detecting non-S3 URIs via urlparse and uploading them via S3Uploader.upload()), but the API surface doesn't make this clear.

Changes

  1. Added source convenience parameter to ProcessingInput: Users can now write:

    ProcessingInput(input_name="my-input", source="/local/path/to/data")

    instead of:

    ProcessingInput(
        input_name="my-input",
        s3_input=ProcessingS3Input(s3_uri="/local/path/to/data", s3_data_type="S3Prefix", s3_input_mode="File")
    )

    When source is provided and s3_input is not, a ProcessingS3Input is automatically created. Both local paths and S3 URIs are accepted. Specifying both source and s3_input raises a ValueError.

  2. Updated ProcessingS3Input.s3_uri documentation: The docstring now explicitly states that local file/directory paths are accepted and will be automatically uploaded to S3.

  3. Added logging for local file uploads: _normalize_inputs() now logs an INFO message when a local path is detected and being uploaded to S3, improving visibility into what's happening.

  4. Added fallback handling in _normalize_inputs(): If a ProcessingInput has a source attribute set but s3_input is None (edge case), the normalization logic creates the ProcessingS3Input automatically.

Backward Compatibility

This change is fully backward compatible:

  • The source parameter is optional and defaults to None
  • Existing code using s3_input directly continues to work unchanged
  • The source field is excluded from serialization (not sent to the API)

Testing

Added comprehensive unit tests covering:

  • source parameter creating ProcessingS3Input automatically
  • source with S3 URI passthrough
  • Error when both source and s3_input are provided
  • Local file path upload behavior in _normalize_inputs
  • S3 URI not triggering upload
  • Local directory upload
  • Pipeline config path generation

Related Issue

Related issue: 5672

Changes Made

The V3 sagemaker.core.processing.Processor._normalize_inputs() method already supports local file paths as input. When a ProcessingInput is provided with a non-S3 URI in its s3_input.s3_uri field, the normalization logic detects this (via urlparse checking scheme != 's3'), uploads the local file to S3 using s3.S3Uploader.upload(), and replaces the s3_uri with the resulting S3 URI. This is the same auto-upload behavior that existed in V2. However, the user experience is confusing because: (1) the field is named 's3_uri' in ProcessingS3Input, suggesting only S3 paths are valid, and (2) there's no documentation or parameter alias making this clear. The code change needed is to add a convenience parameter 'source' (or similar) to ProcessingInput that accepts local file paths explicitly, and/or update the ProcessingS3Input class to document that local paths are accepted in the s3_uri field. The most minimal and non-breaking approach is to add a 'local_path' or 'source' convenience parameter on ProcessingInput that, when provided, populates s3_input.s3_uri with the local path (which _normalize_inputs will then upload). This makes the API more intuitive for V2 users migrating to V3.

AI-Generated PR

This PR was automatically generated by the PySDK Issue Agent.

  • Confidence score: 65%
  • Classification: feature_request
  • SDK version target: V3

Merge Checklist

  • Changes are backward compatible
  • Commit message follows prefix: description format
  • Unit tests added/updated
  • Integration tests added (if applicable)
  • Documentation updated (if applicable)

Copy link
Copy Markdown

@mufaddal-rohawala mufaddal-rohawala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 AI Code Review

This PR adds a source convenience parameter to ProcessingInput for local file paths, but the actual ProcessingInput class modification in shapes.py is missing from the diff. The PR only shows an unrelated change to InferenceComponentComputeResourceRequirements in shapes.py. The _normalize_inputs changes and tests look reasonable, but without the core ProcessingInput model change, the tests would fail. Additionally, there are several code quality and architectural concerns.

Comment thread sagemaker-core/src/sagemaker/core/shapes/shapes.py
if file_input.dataset_definition:
normalized_inputs.append(file_input)
continue
# Handle case where source was set but s3_input was not created
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concern: getattr usage suggests source is not a proper model field. Using getattr(file_input, "source", None) implies uncertainty about whether the attribute exists, which is a code smell for a Pydantic model. If ProcessingInput properly defines source as a field (which the missing diff should show), you should use file_input.source directly. If the source field is properly defined on the model, getattr is unnecessary.

Also, this fallback logic duplicates the validation that should happen in ProcessingInput.__init__ (or model_validator). If the convenience __init__ logic always creates s3_input when source is provided, this fallback should never trigger. Consider whether this defensive code is actually needed, or if it masks bugs.

# (e.g., if ProcessingInput was constructed without using the
# convenience __init__ logic)
if file_input.s3_input is None and getattr(file_input, "source", None):
file_input.s3_input = ProcessingS3Input(
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hardcoded magic strings. "S3Prefix" and "File" are repeated here and presumably in the ProcessingInput model validator. Extract these into module-level constants (e.g., DEFAULT_S3_DATA_TYPE = "S3Prefix", DEFAULT_S3_INPUT_MODE = "File") to avoid duplication and improve maintainability.

Comment thread sagemaker-core/tests/unit/test_processing_local_input.py
Comment thread sagemaker-core/tests/unit/test_processing_local_input.py
normalized = processor._normalize_inputs(inputs)

assert len(normalized) == 1
assert normalized[0].s3_input.s3_uri == "s3://my-bucket/job-name/input/my-input/data.csv"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line exceeds 100 characters. This assertion line is quite long. Consider breaking it into a variable:

expected_uri = "s3://my-bucket/job-name/input/my-input/data.csv"
assert normalized[0].s3_input.s3_uri == expected_uri

# ANY KIND, either express or implied. See the License for the specific
# language governing permissions and limitations under the License.
"""Tests for ProcessingInput source parameter and local file upload behavior."""
from __future__ import absolute_import
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use from __future__ import annotations instead of from __future__ import absolute_import. Per SDK coding standards, new modules should use from __future__ import annotations to enable PEP 604 union syntax. absolute_import is a Python 2 artifact and unnecessary in Python 3.

Comment thread sagemaker-core/tests/unit/test_processing_local_input.py
Comment thread sagemaker-core/tests/unit/test_processing_local_input.py
Copy link
Copy Markdown

@mufaddal-rohawala mufaddal-rohawala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 AI Code Review

This PR adds a source convenience parameter to ProcessingInput to simplify specifying local file paths for processing jobs. While the intent is good, there are several issues: the sagemaker-core shapes file is auto-generated and should not be manually modified, the Pydantic validation approach has issues, there's an unrelated change to InferenceComponentComputeResourceRequirements, and the _normalize_inputs logic has a gap where s3_input could be Unassigned rather than None.

⚠️ Inline comments could not be posted (line mapping issue).

@aviruthen aviruthen closed this Mar 27, 2026
@aviruthen aviruthen deleted the fix/support-local-source-for-sagemaker-core-shapes-5672 branch March 27, 2026 17:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants