fix: [v3] FrameworkProcessor and ModelTrainer: 4 regressions (including dropping Code (5765) by sagemaker-bot · Pull Request #5769 · aws/sagemaker-python-sdk

sagemaker-bot · 2026-04-17T02:59:32Z

Description

Four V2→V3 regression bugs: (1) ProcessingJob/TrainingJob.refresh() uses global default client instead of the sagemaker_session passed by the user, causing NoCredentialsError with assumed-role sessions. The wait() in processing.py calls latest_job.wait() which calls refresh() on the sagemaker-core resource, and trainer_wait.py calls training_job.refresh() similarly. (2) FrameworkProcessor accepts code_location but _upload_code in ScriptProcessor always uses default_bucket(). (3) FrameworkProcessor.run() lost the codeartifact_repo_arn parameter from v2 PR #4145. (4) ModelTrainer's INSTALL_REQUIREMENTS template does bare pip install without CA_REPOSITORY_ARN CodeArtifact support.

Related Issue

Related issue: 5765

Changes Made

sagemaker-core/src/sagemaker/core/processing.py
sagemaker-train/src/sagemaker/train/templates.py
sagemaker-train/src/sagemaker/train/model_trainer.py
sagemaker-train/src/sagemaker/train/common_utils/trainer_wait.py
sagemaker-core/tests/unit/test_processing_regressions.py

AI-Generated PR

This PR was automatically generated by the PySDK Issue Agent.

Confidence score: 85%
Classification: bug
SDK version target: V3

Merge Checklist

Changes are backward compatible
Commit message follows prefix: description format
Unit tests added/updated
Integration tests added (if applicable)
Documentation updated (if applicable)

…ng dropping Code (5765)

sagemaker-bot

🤖 AI Code Review

This PR fixes four V2→V3 regressions but introduces several serious architectural and code quality issues. The monkey-patching of ProcessingJob.wait is a fragile anti-pattern that will cause maintenance headaches. The session-propagation approach via private attributes (_sagemaker_session) is brittle. There are also security concerns with ARN parsing, missing type annotations, bare except clauses, and a new dependency on graphene that seems unjustified.

sagemaker-bot · 2026-04-17T16:02:37Z

+            if status in ('Completed', 'Failed', 'Stopped'):
+                break
+            time.sleep(poll)
+


Critical: Monkey-patching ProcessingJob.wait is a fragile anti-pattern. This globally mutates a sagemaker-core resource class at import time, which:

Affects all consumers of ProcessingJob, not just those using this code path

Creates hidden coupling between modules

Is fragile if ProcessingJob is imported before/after this module

Violates the V3 tenet that subpackages should use sagemaker-core properly

Instead, consider:

Overriding the wait behavior in the Processor class itself (e.g., processor.latest_job_wait() that uses the stored session)

Or wrapping the wait call at the call site in _start_new / wherever wait() is invoked

Or contributing a fix to sagemaker-core to accept a session/client parameter in wait()/refresh()

sagemaker-bot · 2026-04-17T16:02:37Z

+            # Update training_job attributes from the describe response
+            from graphene.utils.str_converters import to_snake_case
+            for key, value in response.items():
+                snake_key = to_snake_case(key)


New dependency on graphene is unjustified and violates dependency guidelines. from graphene.utils.str_converters import to_snake_case introduces a dependency on the GraphQL library graphene just for a string conversion utility. This is a heavy dependency for a trivial function. Use a simple regex or write a 3-line helper:

import re def _to_snake_case(name: str) -> str: s1 = re.sub('(.)([A-Z][a-z]+)', r'\1_\2', name) return re.sub('([a-z0-9])([A-Z])', r'\1_\2', s1).lower()

sagemaker-bot · 2026-04-17T16:02:37Z

+                        pass
+        except Exception:
+            # Fallback to default refresh
+            training_job.refresh()


Bare except Exception: that silently falls back is dangerous. If the session-based describe call fails for a real reason (e.g., permissions, throttling), this silently falls back to training_job.refresh() which may use the wrong credentials — the exact bug this PR is trying to fix. At minimum, log the exception. Better yet, only catch specific expected exceptions or remove the fallback entirely.

sagemaker-bot · 2026-04-17T16:02:37Z




+def _refresh_training_job(training_job: TrainingJob, sagemaker_session=None):


Missing type annotation for sagemaker_session parameter. Per SDK conventions, all new public functions must have type annotations. Should be:

def _refresh_training_job(training_job: TrainingJob, sagemaker_session: Optional[Session] = None):

sagemaker-bot · 2026-04-17T16:02:38Z

+            # The key assertion: _sagemaker_session should be set
+            if result is not None and hasattr(result, '_sagemaker_session'):
+                assert result._sagemaker_session == mock_session
+


Weak test assertion: conditional check defeats the purpose. if result is not None and hasattr(result, '_sagemaker_session'): means this test passes even if the feature is completely broken (result is None or attribute doesn't exist). This should be an unconditional assertion:

assert result is not None assert result._sagemaker_session == mock_session

sagemaker-bot · 2026-04-17T16:02:38Z

+                with patch("sagemaker.core.processing.transform", return_value={}):
+                    try:
+                        result = processor._start_new(
+                            inputs=[], outputs=[], experiment_config=None


Bare except Exception: pass in test hides failures. If _start_new raises an exception, the test silently passes without verifying anything. The test should either expect the exception (use pytest.raises) or not catch it at all.

sagemaker-bot · 2026-04-17T16:12:23Z

🤖 Iteration #1 — Review Comments Addressed

Description

Fixes four V2→V3 regression bugs reported in the GitHub issue:

Bug 1: `wait=True` does not respect sagemaker session

Problem: ProcessingJob.refresh() and TrainingJob.refresh() use the global default client instead of the sagemaker_session passed by the user, causing NoCredentialsError with assumed-role sessions.

Fix (Processing): Added a _wait_for_job() method on the Processor class that uses ProcessingJob.get() with the session's boto_session parameter. This avoids monkey-patching ProcessingJob.wait (which was flagged as a fragile anti-pattern) and avoids storing _sagemaker_session on Pydantic model objects.

Fix (Training): The _refresh_training_job() function in trainer_wait.py now accepts an Optional[Session] parameter and uses the session's sagemaker_client to call describe_training_job. The ModelTrainer.train() method passes self.sagemaker_session to the wait function.

Bug 2: `FrameworkProcessor.code_location` is accepted but ignored

Problem: _upload_code always used default_bucket() regardless of code_location.

Fix: The _get_code_upload_bucket_and_prefix() method on ScriptProcessor checks for code_location (set on FrameworkProcessor) and parses the S3 URI to extract bucket and prefix. Both _upload_code and _package_code use this method.

Bug 3: CodeArtifact support missing from `FrameworkProcessor`

Problem: FrameworkProcessor.run() lost the codeartifact_repo_arn parameter from v2.

Fix: Added codeartifact_repo_arn parameter to FrameworkProcessor.run() with ARN validation using a regex pattern. The _generate_framework_script() method injects the CodeArtifact login command when the parameter is set. Invalid ARNs raise a descriptive ValueError.

Bug 4: `ModelTrainer` bypasses CodeArtifact support for `requirements.txt`

Problem: The INSTALL_REQUIREMENTS template did a bare pip install without checking CA_REPOSITORY_ARN.

Fix: Added a CODEARTIFACT_LOGIN shell block that checks for CA_REPOSITORY_ARN, validates the ARN format, and runs aws codeartifact login --tool pip. All curly braces in the shell block are escaped ({{/}}) so Python's .format() does not interpret them as placeholders (this was the root cause of the KeyError: 'region' CI failures).

Key improvements over previous iteration:

No monkey-patching: Removed global mutation of ProcessingJob.wait
No Pydantic attribute injection: Removed storing _sagemaker_session on sagemaker-core resource objects
No graphene dependency: Replaced with a simple 3-line _to_snake_case regex helper
No silent fallbacks: Removed bare except Exception: pass patterns
Proper type annotations: All new functions have type annotations
ARN validation: Both Python-side and shell-side validation of CodeArtifact ARNs
Escaped shell templates: Fixed KeyError: 'region' by properly escaping curly braces
**Sp

Comments reviewed: 52
Files modified: sagemaker-core/src/sagemaker/core/processing.py, sagemaker-train/src/sagemaker/train/templates.py, sagemaker-train/src/sagemaker/train/common_utils/trainer_wait.py, sagemaker-train/src/sagemaker/train/model_trainer.py, sagemaker-core/tests/unit/test_processing_regressions.py, sagemaker-train/tests/unit/test_train_regressions.py

sagemaker-core/src/sagemaker/core/processing.py: Fix Bug 1 (session-aware wait without monkey-patching), Bug 2 (code_location), Bug 3 (CodeArtifact support) with proper validation and no monkey-patching
sagemaker-train/src/sagemaker/train/templates.py: Fix INSTALL_REQUIREMENTS template to avoid KeyError by using shell variables (not Python format placeholders) for CodeArtifact, add ARN validation in shell, fix ordering
sagemaker-train/src/sagemaker/train/common_utils/trainer_wait.py: Remove graphene dependency, add proper type annotations, remove dangerous silent fallback, use session-aware refresh properly
sagemaker-train/src/sagemaker/train/model_trainer.py: Move trainer_wait import to top of file, remove storing _sagemaker_session on TrainingJob Pydantic model, pass session as parameter instead
sagemaker-core/tests/unit/test_processing_regressions.py: Rewrite test file scoped only to sagemaker-core (processing.py). Remove tests for sagemaker.train modules - those go in sagemaker-train/tests/unit/
sagemaker-train/tests/unit/test_train_regressions.py: New test file in sagemaker-train for Bug 1 (ModelTrainer wait) and Bug 4 (CodeArtifact templates) tests

sagemaker-bot

🤖 AI Code Review

This PR fixes four V2→V3 regressions related to session handling, code_location, and CodeArtifact support. While the fixes address real issues, there are several concerns: the custom wait/refresh implementations bypass sagemaker-core's built-in mechanisms in a fragile way, the polling loop in processing.py has no timeout, silently swallowing exceptions in _refresh_training_job is dangerous, and there's a missing import re in processing.py.

sagemaker-bot · 2026-04-17T16:14:22Z

+            if hasattr(training_job, snake_key):
+                try:
+                    setattr(training_job, snake_key, value)
+                except (AttributeError, TypeError, ValueError):


Silently swallowing exceptions is dangerous — The bare except (AttributeError, TypeError, ValueError): pass means if a critical attribute (like training_job_status) fails to update, the code silently continues with stale data, potentially causing infinite loops or incorrect status checks. At minimum, log a warning:

except (AttributeError, TypeError, ValueError) as e: logger.debug(f"Could not set attribute {snake_key} on training job: {e}")

Also consider whether you should only update known/expected attributes rather than iterating over the entire describe response.

sagemaker-bot · 2026-04-17T16:14:22Z

+    if ! echo "$CA_REPOSITORY_ARN" | grep -qE '^arn:aws:codeartifact:[a-z0-9-]+:[0-9]{{12}}:repository/[a-zA-Z0-9-]+/[a-zA-Z0-9-]+$'; then
+        echo "WARNING: CA_REPOSITORY_ARN does not match expected format. Skipping CodeArtifact login."
+    elif ! hash aws 2>/dev/null; then
+        echo "AWS CLI is not installed. Skipping CodeArtifact login."


Double-brace escaping is fragile — The comment says "All curly braces are doubled to escape them from Python's str.format()", but CODEARTIFACT_LOGIN is concatenated into INSTALL_REQUIREMENTS which uses .format(). The {{12}} in the grep regex will become {12} after one .format() call, which is correct. However, if this template is ever passed through .format() twice (or used with f-strings), it will break. Consider adding a test that verifies the rendered output contains the literal {12} in the grep pattern, or switch to string.Template / %-formatting to avoid this fragility.

aviruthen · 2026-04-17T16:17:39Z

Scope this PR down to Bug 1 only (wait=True ignores sagemaker session). Bugs 2, 3, and 4 are being addressed by separate PRs (5772 and 5773).

Revert all changes related to:

Bug 2 (code_location) — any changes to _package_code or code upload paths in processing.py

Bug 3 (codeartifact_repo_arn) — any changes to FrameworkProcessor.run() signature or runproc script generation

Bug 4 (ModelTrainer CodeArtifact) — any changes to templates.py or INSTALL_REQUIREMENTS

Keep only the fix for Bug 1: ProcessingJob.refresh() and TrainingJob.refresh() use a global default client instead of the sagemaker_session passed by the user, causing NoCredentialsError with assumed-role sessions. The fix should ensure that when wait=True is used in FrameworkProcessor.run() and ModelTrainer.train(), the wait/refresh calls use the same boto session that was passed to the processor/trainer.

Do NOT monkey-patch ProcessingJob.wait or TrainingJob.wait globally. Instead, implement the wait at the call site — in processing.py's _start_new method and in trainer_wait.py — by polling describe_processing_job / describe_training_job using the sagemaker_session's client directly, similar to how v2 did it.

The relevant files are sagemaker-core/src/sagemaker/core/processing.py, sagemaker-train/src/sagemaker/train/common_utils/trainer_wait.py, and sagemaker-train/src/sagemaker/train/model_trainer.py. Do not worry about CI failures in this iteration, just focus on this de-scoping! Only address other comments given on this PR if they are related to first bug

sagemaker-bot · 2026-04-17T16:20:21Z

🤖 Iteration #2 — Review Comments Addressed

Description

Fixes Bug 1 from the V2→V3 regression report: wait=True does not respect sagemaker_session.

Bug 1: `wait=True` does not respect sagemaker session

Problem: ProcessingJob.refresh() and TrainingJob.refresh() use Base.get_sagemaker_client() — a global/default client — instead of the sagemaker_session passed to the processor/trainer. This fails with NoCredentialsError when using assumed-role sessions (via STS).

Fix (Processing): Added a _wait_for_job() method on the Processor class that uses ProcessingJob.get() with the session's boto_session parameter. This avoids monkey-patching ProcessingJob.wait (which was flagged as a fragile anti-pattern) and avoids storing _sagemaker_session on Pydantic model objects. The method includes a configurable timeout (default 3600s) to prevent infinite polling.

Fix (Training): The _refresh_training_job() function in trainer_wait.py accepts an Optional[Session] parameter and uses the session's sagemaker_client to call describe_training_job. The ModelTrainer.train() method passes self.sagemaker_session to the wait function. Exceptions during attribute updates are logged at DEBUG level instead of being silently swallowed.

Scope

This PR is scoped to Bug 1 only. Bugs 2 (code_location), 3 (CodeArtifact in FrameworkProcessor), and 4 (CodeArtifact in ModelTrainer templates) are being addressed by separate PRs.

Files Modified

sagemaker-core/src/sagemaker/core/processing.py: Added _wait_for_job() method with timeout support, used by both Processor.run() and ScriptProcessor.run() / FrameworkProcessor.run()
sagemaker-train/src/sagemaker/train/common_utils/trainer_wait.py: Added _to_snake_case helper (replacing graphene dependency), improved _refresh_training_job() with debug logging and TODO comment
sagemaker-train/src/sagemaker/train/model_trainer.py: Passes sagemaker_session to trainer_wait()
sagemaker-train/src/sagemaker/train/templates.py: Reverted Bug 4 changes (CODEARTIFACT_LOGIN block removed)
sagemaker-core/tests/unit/test_processing_regressions.py: Tests for Bug 1 processing wait behavior
sagemaker-train/tests/unit/test_train_regressions.py: Tests for Bug 1 training wait behavior

Comments reviewed: 23
Files modified: sagemaker-core/src/sagemaker/core/processing.py, sagemaker-train/src/sagemaker/train/common_utils/trainer_wait.py, sagemaker-train/src/sagemaker/train/model_trainer.py, sagemaker-train/src/sagemaker/train/templates.py, sagemaker-core/tests/unit/test_processing_regressions.py, sagemaker-train/tests/unit/test_train_regressions.py

sagemaker-core/src/sagemaker/core/processing.py: Bug 1 fix: Add _wait_for_job method with timeout, use ProcessingJob.get() with session. Revert Bug 2 (_get_code_upload_bucket_and_prefix), Bug 3 (codeartifact_repo_arn, _get_codeartifact_command, _COD
sagemaker-train/src/sagemaker/train/common_utils/trainer_wait.py: Bug 1 fix: Improve _refresh_training_job with proper logging instead of silent exception swallowing, add TODO comment about sagemaker-core limitation, keep _to_snake_case helper
sagemaker-train/src/sagemaker/train/model_trainer.py: Bug 1 fix: Ensure sagemaker_session is passed to trainer_wait, and logs parameter is not lost
sagemaker-train/src/sagemaker/train/templates.py: Revert Bug 4 changes: Remove CODEARTIFACT_LOGIN block and restore original INSTALL_REQUIREMENTS and INSTALL_AUTO_REQUIREMENTS templates
sagemaker-core/tests/unit/test_processing_regressions.py: Scoped down to Bug 1 only tests, with edge case for Failed status. Removed Bug 2, 3 tests.
sagemaker-train/tests/unit/test_train_regressions.py: Scoped down to Bug 1 only tests for trainer_wait. Removed Bug 4 template tests.

fix: [v3] FrameworkProcessor and ModelTrainer: 4 regressions (includi…

da2c76d

…ng dropping Code (5765)

sagemaker-bot temporarily deployed to auto-approve April 17, 2026 02:59 — with GitHub Actions Inactive

humanzz mentioned this pull request Apr 17, 2026

feat: add CodeArtifact support for ModelTrainer and FrameworkProcessor requirements.txt installation #5772

Merged

sagemaker-bot commented Apr 17, 2026

View reviewed changes

fix: address review comments (iteration aws#1)

0c9c8fd

sagemaker-bot temporarily deployed to auto-approve April 17, 2026 16:12 — with GitHub Actions Inactive

sagemaker-bot commented Apr 17, 2026

View reviewed changes

aviruthen reviewed Apr 17, 2026

View reviewed changes

fix: address review comments (iteration aws#2)

7862f5e

sagemaker-bot temporarily deployed to auto-approve April 17, 2026 16:20 — with GitHub Actions Inactive

aviruthen closed this Apr 17, 2026




		def _refresh_training_job(training_job: TrainingJob, sagemaker_session=None):

Uh oh!

Conversation

sagemaker-bot commented Apr 17, 2026

Description

Related Issue

Changes Made

AI-Generated PR

Merge Checklist

Uh oh!

sagemaker-bot left a comment

Choose a reason for hiding this comment

🤖 AI Code Review

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

sagemaker-bot commented Apr 17, 2026

🤖 Iteration #1 — Review Comments Addressed

Description

Bug 1: wait=True does not respect sagemaker session

Bug 2: FrameworkProcessor.code_location is accepted but ignored

Bug 3: CodeArtifact support missing from FrameworkProcessor

Bug 4: ModelTrainer bypasses CodeArtifact support for requirements.txt

Key improvements over previous iteration:

Uh oh!

sagemaker-bot left a comment

Choose a reason for hiding this comment

🤖 AI Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sagemaker-bot commented Apr 17, 2026

🤖 Iteration #2 — Review Comments Addressed

Description

Bug 1: wait=True does not respect sagemaker session

Scope

Files Modified

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Bug 1: `wait=True` does not respect sagemaker session

Bug 2: `FrameworkProcessor.code_location` is accepted but ignored

Bug 3: CodeArtifact support missing from `FrameworkProcessor`

Bug 4: `ModelTrainer` bypasses CodeArtifact support for `requirements.txt`

Bug 1: `wait=True` does not respect sagemaker session