Skip to content

fix: Support local source for sagemaker.core.shapes.ProcessingInput (5672)#30

Closed
aviruthen wants to merge 2 commits intomasterfrom
fix/support-local-source-for-sagemaker-core-shapes-5672
Closed

fix: Support local source for sagemaker.core.shapes.ProcessingInput (5672)#30
aviruthen wants to merge 2 commits intomasterfrom
fix/support-local-source-for-sagemaker-core-shapes-5672

Conversation

@aviruthen
Copy link
Copy Markdown
Owner

Description

Add convenience factories for creating ProcessingInput from local paths

Problem

In the V3 Processor._normalize_inputs(), local file paths are already supported — when ProcessingS3Input.s3_uri contains a non-S3 scheme (e.g., a local path), the file is automatically uploaded to S3. However, the user experience is poor because users must put a local file path into the s3_uri field of ProcessingS3Input, which is semantically confusing. The V2 ProcessingInput had a source parameter that clearly accepted local paths.

Solution

This PR adds two public convenience factory functions and enhances _normalize_inputs() to improve the developer experience:

  1. processing_input_from_local() — Creates a ProcessingInput from a local file/directory path with clear parameter names (local_path, destination).

  2. create_processing_input() — A V2-like factory that accepts a source parameter (local path or S3 URI) and destination, providing familiar ergonomics.

  3. _normalize_inputs() enhancement — Now checks for an ad-hoc source attribute on ProcessingInput objects and populates s3_input.s3_uri from it, providing a fallback for users who set the attribute directly.

Usage Examples

from sagemaker.core.processing import (
    Processor,
    processing_input_from_local,
    create_processing_input,
)

# Using the explicit local path factory
inp = processing_input_from_local(
    input_name="my-data",
    local_path="/home/user/data/",
    destination="/opt/ml/processing/input/data",
)

# Using the V2-like source/destination factory
inp = create_processing_input(
    source="/home/user/data/",  # or "s3://bucket/data/"
    destination="/opt/ml/processing/input/data",
    input_name="my-data",
)

processor.run(inputs=[inp])

Testing

  • Added tests for processing_input_from_local() (valid inputs, custom params, validation errors, pipeline config integration)
  • Added tests for create_processing_input() (local paths, S3 URIs, defaults, validation errors)
  • Added tests for _normalize_inputs() local path upload behavior
  • All existing tests continue to pass (no regressions)

Related Issue

Related issue: 5672

Changes Made

The V3 Processor._normalize_inputs() in sagemaker-core/src/sagemaker/core/processing.py already handles local file paths: when file_input.s3_input.s3_uri has a non-s3 scheme (e.g., a local path), it uploads the file to S3 automatically. However, the user experience is poor because users must put a local file path into the s3_uri field of ProcessingS3Input, which is semantically confusing. The V2 ProcessingInput had a source parameter that clearly accepted local paths. The fix should add an optional source convenience parameter to ProcessingInput construction in the V3 processing module, or provide a helper/factory that wraps local paths into the expected ProcessingS3Input shape. Since ProcessingInput is auto-generated in sagemaker.core.shapes, the best approach is to add a convenience factory function or extend the Processor._normalize_inputs() to also accept a source field, plus provide clear documentation.

AI-Generated PR

This PR was automatically generated by the PySDK Issue Agent.

  • Confidence score: 72%
  • Classification: feature_request
  • SDK version target: V3

Merge Checklist

  • Changes are backward compatible
  • Commit message follows prefix: description format
  • Unit tests added/updated
  • Integration tests added (if applicable)
  • Documentation updated (if applicable)

Copy link
Copy Markdown

@mufaddal-rohawala mufaddal-rohawala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 AI Code Review

This PR adds two convenience factory functions for creating ProcessingInput from local paths, which is a reasonable UX improvement. However, there are several design concerns: the ad-hoc source attribute monkey-patching pattern in _normalize_inputs() is fragile and goes against SDK conventions, the docstrings use Google-style Args: instead of reST format, and the processing_input_from_local function is missing validation for the destination parameter. The factory functions themselves are well-tested but could benefit from type annotation improvements.

⚠️ Inline comments could not be posted (line mapping issue).

Copy link
Copy Markdown

@mufaddal-rohawala mufaddal-rohawala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 AI Code Review

This PR adds convenience factory functions for creating ProcessingInput from local paths, improving V3 UX. While the factories are well-tested, there are several design and code quality concerns: the ad-hoc source attribute monkey-patching pattern is fragile and anti-pattern for a Pydantic-based SDK, the docstrings use Google-style Args/Returns instead of reST format, and there are missing type annotations and validation gaps.

# a subclass), populate s3_input.s3_uri from it so the existing
# upload logic handles it.
_source = getattr(file_input, "source", None)
if _source is not None:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Critical design concern: getattr(file_input, "source", None) monkey-patching pattern is fragile and anti-pattern.

This relies on users monkey-patching an ad-hoc source attribute onto a Pydantic BaseModel (ProcessingInput is auto-generated from sagemaker-core shapes). Pydantic models by default reject extra attributes, so file_input.source = '/tmp/data' would raise a ValidationError unless the model is configured with model_config = ConfigDict(extra='allow'). This code path is essentially dead unless users bypass Pydantic validation.

Moreover, encouraging monkey-patching on auto-generated shapes is a maintenance hazard — if the shape ever adds a real source field, this code will silently conflict.

Recommendation: Remove this entire _source = getattr(...) block. The two factory functions (processing_input_from_local and create_processing_input) already solve the UX problem cleanly by populating s3_input.s3_uri correctly. There's no need for this fallback path.

Comment thread sagemaker-core/src/sagemaker/core/processing.py
Comment thread sagemaker-core/src/sagemaker/core/processing.py
This will be uploaded to S3 automatically before the
processing job starts.
destination: The container path where the input data will be
made available (e.g. ``/opt/ml/processing/input/data``).
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing validation for destination parameter. processing_input_from_local validates input_name and local_path but not destination, while create_processing_input validates all three. This inconsistency could confuse users. Add:

if not destination:
    raise ValueError(f"destination must be a non-empty string, got: {destination!r}")

Comment thread sagemaker-core/src/sagemaker/core/processing.py
s3_uri=_source,
local_path="/opt/ml/processing/input",
s3_data_type="S3Prefix",
s3_input_mode="File",
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hardcoded default /opt/ml/processing/input — this magic string should be extracted to a module-level constant, e.g.:

DEFAULT_PROCESSING_INPUT_PATH = "/opt/ml/processing/input"

This is also used in the factory functions' docstrings and examples, so a constant would ensure consistency.

Comment thread sagemaker-core/tests/unit/test_processing.py
Comment thread sagemaker-core/tests/unit/test_processing.py
Comment thread sagemaker-core/tests/unit/test_processing.py
Comment thread sagemaker-core/src/sagemaker/core/processing.py
@aviruthen aviruthen closed this Mar 27, 2026
@aviruthen aviruthen deleted the fix/support-local-source-for-sagemaker-core-shapes-5672 branch March 27, 2026 17:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants