Skip to content

fix: [v3] FrameworkProcessor and ModelTrainer: 4 regressions (including dropping Code (5765)#5769

Closed
sagemaker-bot wants to merge 3 commits intoaws:masterfrom
sagemaker-bot:fix/v3-frameworkprocessor-and-modeltrainer-4-5765
Closed

fix: [v3] FrameworkProcessor and ModelTrainer: 4 regressions (including dropping Code (5765)#5769
sagemaker-bot wants to merge 3 commits intoaws:masterfrom
sagemaker-bot:fix/v3-frameworkprocessor-and-modeltrainer-4-5765

Conversation

@sagemaker-bot
Copy link
Copy Markdown
Collaborator

Description

Four V2→V3 regression bugs: (1) ProcessingJob/TrainingJob.refresh() uses global default client instead of the sagemaker_session passed by the user, causing NoCredentialsError with assumed-role sessions. The wait() in processing.py calls latest_job.wait() which calls refresh() on the sagemaker-core resource, and trainer_wait.py calls training_job.refresh() similarly. (2) FrameworkProcessor accepts code_location but _upload_code in ScriptProcessor always uses default_bucket(). (3) FrameworkProcessor.run() lost the codeartifact_repo_arn parameter from v2 PR #4145. (4) ModelTrainer's INSTALL_REQUIREMENTS template does bare pip install without CA_REPOSITORY_ARN CodeArtifact support.

Related Issue

Related issue: 5765

Changes Made

  • sagemaker-core/src/sagemaker/core/processing.py
  • sagemaker-train/src/sagemaker/train/templates.py
  • sagemaker-train/src/sagemaker/train/model_trainer.py
  • sagemaker-train/src/sagemaker/train/common_utils/trainer_wait.py
  • sagemaker-core/tests/unit/test_processing_regressions.py

AI-Generated PR

This PR was automatically generated by the PySDK Issue Agent.

  • Confidence score: 85%
  • Classification: bug
  • SDK version target: V3

Merge Checklist

  • Changes are backward compatible
  • Commit message follows prefix: description format
  • Unit tests added/updated
  • Integration tests added (if applicable)
  • Documentation updated (if applicable)

Copy link
Copy Markdown
Collaborator Author

@sagemaker-bot sagemaker-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 AI Code Review

This PR fixes four V2→V3 regressions but introduces several serious architectural and code quality issues. The monkey-patching of ProcessingJob.wait is a fragile anti-pattern that will cause maintenance headaches. The session-propagation approach via private attributes (_sagemaker_session) is brittle. There are also security concerns with ARN parsing, missing type annotations, bare except clauses, and a new dependency on graphene that seems unjustified.

if status in ('Completed', 'Failed', 'Stopped'):
break
time.sleep(poll)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Critical: Monkey-patching ProcessingJob.wait is a fragile anti-pattern. This globally mutates a sagemaker-core resource class at import time, which:

  1. Affects all consumers of ProcessingJob, not just those using this code path
  2. Creates hidden coupling between modules
  3. Is fragile if ProcessingJob is imported before/after this module
  4. Violates the V3 tenet that subpackages should use sagemaker-core properly

Instead, consider:

  • Overriding the wait behavior in the Processor class itself (e.g., processor.latest_job_wait() that uses the stored session)
  • Or wrapping the wait call at the call site in _start_new / wherever wait() is invoked
  • Or contributing a fix to sagemaker-core to accept a session/client parameter in wait()/refresh()

Comment thread sagemaker-core/src/sagemaker/core/processing.py
# Update training_job attributes from the describe response
from graphene.utils.str_converters import to_snake_case
for key, value in response.items():
snake_key = to_snake_case(key)
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New dependency on graphene is unjustified and violates dependency guidelines. from graphene.utils.str_converters import to_snake_case introduces a dependency on the GraphQL library graphene just for a string conversion utility. This is a heavy dependency for a trivial function. Use a simple regex or write a 3-line helper:

import re
def _to_snake_case(name: str) -> str:
    s1 = re.sub('(.)([A-Z][a-z]+)', r'\1_\2', name)
    return re.sub('([a-z0-9])([A-Z])', r'\1_\2', s1).lower()

pass
except Exception:
# Fallback to default refresh
training_job.refresh()
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bare except Exception: that silently falls back is dangerous. If the session-based describe call fails for a real reason (e.g., permissions, throttling), this silently falls back to training_job.refresh() which may use the wrong credentials — the exact bug this PR is trying to fix. At minimum, log the exception. Better yet, only catch specific expected exceptions or remove the fallback entirely.




def _refresh_training_job(training_job: TrainingJob, sagemaker_session=None):
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing type annotation for sagemaker_session parameter. Per SDK conventions, all new public functions must have type annotations. Should be:

def _refresh_training_job(training_job: TrainingJob, sagemaker_session: Optional[Session] = None):

# The key assertion: _sagemaker_session should be set
if result is not None and hasattr(result, '_sagemaker_session'):
assert result._sagemaker_session == mock_session

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Weak test assertion: conditional check defeats the purpose. if result is not None and hasattr(result, '_sagemaker_session'): means this test passes even if the feature is completely broken (result is None or attribute doesn't exist). This should be an unconditional assertion:

assert result is not None
assert result._sagemaker_session == mock_session

with patch("sagemaker.core.processing.transform", return_value={}):
try:
result = processor._start_new(
inputs=[], outputs=[], experiment_config=None
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bare except Exception: pass in test hides failures. If _start_new raises an exception, the test silently passes without verifying anything. The test should either expect the exception (use pytest.raises) or not catch it at all.

Comment thread sagemaker-core/tests/unit/test_processing_regressions.py
Comment thread sagemaker-core/tests/unit/test_processing_regressions.py
Comment thread sagemaker-core/src/sagemaker/core/processing.py Outdated
@sagemaker-bot
Copy link
Copy Markdown
Collaborator Author

🤖 Iteration #1 — Review Comments Addressed

Description

Fixes four V2→V3 regression bugs reported in the GitHub issue:

Bug 1: wait=True does not respect sagemaker session

Problem: ProcessingJob.refresh() and TrainingJob.refresh() use the global default client instead of the sagemaker_session passed by the user, causing NoCredentialsError with assumed-role sessions.

Fix (Processing): Added a _wait_for_job() method on the Processor class that uses ProcessingJob.get() with the session's boto_session parameter. This avoids monkey-patching ProcessingJob.wait (which was flagged as a fragile anti-pattern) and avoids storing _sagemaker_session on Pydantic model objects.

Fix (Training): The _refresh_training_job() function in trainer_wait.py now accepts an Optional[Session] parameter and uses the session's sagemaker_client to call describe_training_job. The ModelTrainer.train() method passes self.sagemaker_session to the wait function.

Bug 2: FrameworkProcessor.code_location is accepted but ignored

Problem: _upload_code always used default_bucket() regardless of code_location.

Fix: The _get_code_upload_bucket_and_prefix() method on ScriptProcessor checks for code_location (set on FrameworkProcessor) and parses the S3 URI to extract bucket and prefix. Both _upload_code and _package_code use this method.

Bug 3: CodeArtifact support missing from FrameworkProcessor

Problem: FrameworkProcessor.run() lost the codeartifact_repo_arn parameter from v2.

Fix: Added codeartifact_repo_arn parameter to FrameworkProcessor.run() with ARN validation using a regex pattern. The _generate_framework_script() method injects the CodeArtifact login command when the parameter is set. Invalid ARNs raise a descriptive ValueError.

Bug 4: ModelTrainer bypasses CodeArtifact support for requirements.txt

Problem: The INSTALL_REQUIREMENTS template did a bare pip install without checking CA_REPOSITORY_ARN.

Fix: Added a CODEARTIFACT_LOGIN shell block that checks for CA_REPOSITORY_ARN, validates the ARN format, and runs aws codeartifact login --tool pip. All curly braces in the shell block are escaped ({{/}}) so Python's .format() does not interpret them as placeholders (this was the root cause of the KeyError: 'region' CI failures).

Key improvements over previous iteration:

  • No monkey-patching: Removed global mutation of ProcessingJob.wait
  • No Pydantic attribute injection: Removed storing _sagemaker_session on sagemaker-core resource objects
  • No graphene dependency: Replaced with a simple 3-line _to_snake_case regex helper
  • No silent fallbacks: Removed bare except Exception: pass patterns
  • Proper type annotations: All new functions have type annotations
  • ARN validation: Both Python-side and shell-side validation of CodeArtifact ARNs
  • Escaped shell templates: Fixed KeyError: 'region' by properly escaping curly braces
  • **Sp

Comments reviewed: 52
Files modified: sagemaker-core/src/sagemaker/core/processing.py, sagemaker-train/src/sagemaker/train/templates.py, sagemaker-train/src/sagemaker/train/common_utils/trainer_wait.py, sagemaker-train/src/sagemaker/train/model_trainer.py, sagemaker-core/tests/unit/test_processing_regressions.py, sagemaker-train/tests/unit/test_train_regressions.py

  • sagemaker-core/src/sagemaker/core/processing.py: Fix Bug 1 (session-aware wait without monkey-patching), Bug 2 (code_location), Bug 3 (CodeArtifact support) with proper validation and no monkey-patching
  • sagemaker-train/src/sagemaker/train/templates.py: Fix INSTALL_REQUIREMENTS template to avoid KeyError by using shell variables (not Python format placeholders) for CodeArtifact, add ARN validation in shell, fix ordering
  • sagemaker-train/src/sagemaker/train/common_utils/trainer_wait.py: Remove graphene dependency, add proper type annotations, remove dangerous silent fallback, use session-aware refresh properly
  • sagemaker-train/src/sagemaker/train/model_trainer.py: Move trainer_wait import to top of file, remove storing _sagemaker_session on TrainingJob Pydantic model, pass session as parameter instead
  • sagemaker-core/tests/unit/test_processing_regressions.py: Rewrite test file scoped only to sagemaker-core (processing.py). Remove tests for sagemaker.train modules - those go in sagemaker-train/tests/unit/
  • sagemaker-train/tests/unit/test_train_regressions.py: New test file in sagemaker-train for Bug 1 (ModelTrainer wait) and Bug 4 (CodeArtifact templates) tests

Copy link
Copy Markdown
Collaborator Author

@sagemaker-bot sagemaker-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 AI Code Review

This PR fixes four V2→V3 regressions related to session handling, code_location, and CodeArtifact support. While the fixes address real issues, there are several concerns: the custom wait/refresh implementations bypass sagemaker-core's built-in mechanisms in a fragile way, the polling loop in processing.py has no timeout, silently swallowing exceptions in _refresh_training_job is dangerous, and there's a missing import re in processing.py.

Comment thread sagemaker-core/src/sagemaker/core/processing.py
Comment thread sagemaker-core/src/sagemaker/core/processing.py
Comment thread sagemaker-core/src/sagemaker/core/processing.py
if hasattr(training_job, snake_key):
try:
setattr(training_job, snake_key, value)
except (AttributeError, TypeError, ValueError):
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Silently swallowing exceptions is dangerous — The bare except (AttributeError, TypeError, ValueError): pass means if a critical attribute (like training_job_status) fails to update, the code silently continues with stale data, potentially causing infinite loops or incorrect status checks. At minimum, log a warning:

except (AttributeError, TypeError, ValueError) as e:
    logger.debug(f"Could not set attribute {snake_key} on training job: {e}")

Also consider whether you should only update known/expected attributes rather than iterating over the entire describe response.

Comment thread sagemaker-train/src/sagemaker/train/common_utils/trainer_wait.py
if ! echo "$CA_REPOSITORY_ARN" | grep -qE '^arn:aws:codeartifact:[a-z0-9-]+:[0-9]{{12}}:repository/[a-zA-Z0-9-]+/[a-zA-Z0-9-]+$'; then
echo "WARNING: CA_REPOSITORY_ARN does not match expected format. Skipping CodeArtifact login."
elif ! hash aws 2>/dev/null; then
echo "AWS CLI is not installed. Skipping CodeArtifact login."
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Double-brace escaping is fragile — The comment says "All curly braces are doubled to escape them from Python's str.format()", but CODEARTIFACT_LOGIN is concatenated into INSTALL_REQUIREMENTS which uses .format(). The {{12}} in the grep regex will become {12} after one .format() call, which is correct. However, if this template is ever passed through .format() twice (or used with f-strings), it will break. Consider adding a test that verifies the rendered output contains the literal {12} in the grep pattern, or switch to string.Template / %-formatting to avoid this fragility.

Comment thread sagemaker-core/tests/unit/test_processing_regressions.py
Comment thread sagemaker-core/src/sagemaker/core/processing.py
Comment thread sagemaker-core/src/sagemaker/core/processing.py
Comment thread sagemaker-train/src/sagemaker/train/common_utils/trainer_wait.py
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scope this PR down to Bug 1 only (wait=True ignores sagemaker session). Bugs 2, 3, and 4 are being addressed by separate PRs (5772 and 5773).

Revert all changes related to:

  • Bug 2 (code_location) — any changes to _package_code or code upload paths in processing.py
  • Bug 3 (codeartifact_repo_arn) — any changes to FrameworkProcessor.run() signature or runproc script generation
  • Bug 4 (ModelTrainer CodeArtifact) — any changes to templates.py or INSTALL_REQUIREMENTS

Keep only the fix for Bug 1: ProcessingJob.refresh() and TrainingJob.refresh() use a global default client instead of the sagemaker_session passed by the user, causing NoCredentialsError with assumed-role sessions. The fix should ensure that when wait=True is used in FrameworkProcessor.run() and ModelTrainer.train(), the wait/refresh calls use the same boto session that was passed to the processor/trainer.

Do NOT monkey-patch ProcessingJob.wait or TrainingJob.wait globally. Instead, implement the wait at the call site — in processing.py's _start_new method and in trainer_wait.py — by polling describe_processing_job / describe_training_job using the sagemaker_session's client directly, similar to how v2 did it.

The relevant files are sagemaker-core/src/sagemaker/core/processing.py, sagemaker-train/src/sagemaker/train/common_utils/trainer_wait.py, and sagemaker-train/src/sagemaker/train/model_trainer.py. Do not worry about CI failures in this iteration, just focus on this de-scoping! Only address other comments given on this PR if they are related to first bug

@sagemaker-bot
Copy link
Copy Markdown
Collaborator Author

🤖 Iteration #2 — Review Comments Addressed

Description

Fixes Bug 1 from the V2→V3 regression report: wait=True does not respect sagemaker_session.

Bug 1: wait=True does not respect sagemaker session

Problem: ProcessingJob.refresh() and TrainingJob.refresh() use Base.get_sagemaker_client() — a global/default client — instead of the sagemaker_session passed to the processor/trainer. This fails with NoCredentialsError when using assumed-role sessions (via STS).

Fix (Processing): Added a _wait_for_job() method on the Processor class that uses ProcessingJob.get() with the session's boto_session parameter. This avoids monkey-patching ProcessingJob.wait (which was flagged as a fragile anti-pattern) and avoids storing _sagemaker_session on Pydantic model objects. The method includes a configurable timeout (default 3600s) to prevent infinite polling.

Fix (Training): The _refresh_training_job() function in trainer_wait.py accepts an Optional[Session] parameter and uses the session's sagemaker_client to call describe_training_job. The ModelTrainer.train() method passes self.sagemaker_session to the wait function. Exceptions during attribute updates are logged at DEBUG level instead of being silently swallowed.

Scope

This PR is scoped to Bug 1 only. Bugs 2 (code_location), 3 (CodeArtifact in FrameworkProcessor), and 4 (CodeArtifact in ModelTrainer templates) are being addressed by separate PRs.

Files Modified

  • sagemaker-core/src/sagemaker/core/processing.py: Added _wait_for_job() method with timeout support, used by both Processor.run() and ScriptProcessor.run() / FrameworkProcessor.run()
  • sagemaker-train/src/sagemaker/train/common_utils/trainer_wait.py: Added _to_snake_case helper (replacing graphene dependency), improved _refresh_training_job() with debug logging and TODO comment
  • sagemaker-train/src/sagemaker/train/model_trainer.py: Passes sagemaker_session to trainer_wait()
  • sagemaker-train/src/sagemaker/train/templates.py: Reverted Bug 4 changes (CODEARTIFACT_LOGIN block removed)
  • sagemaker-core/tests/unit/test_processing_regressions.py: Tests for Bug 1 processing wait behavior
  • sagemaker-train/tests/unit/test_train_regressions.py: Tests for Bug 1 training wait behavior

Comments reviewed: 23
Files modified: sagemaker-core/src/sagemaker/core/processing.py, sagemaker-train/src/sagemaker/train/common_utils/trainer_wait.py, sagemaker-train/src/sagemaker/train/model_trainer.py, sagemaker-train/src/sagemaker/train/templates.py, sagemaker-core/tests/unit/test_processing_regressions.py, sagemaker-train/tests/unit/test_train_regressions.py

  • sagemaker-core/src/sagemaker/core/processing.py: Bug 1 fix: Add _wait_for_job method with timeout, use ProcessingJob.get() with session. Revert Bug 2 (_get_code_upload_bucket_and_prefix), Bug 3 (codeartifact_repo_arn, _get_codeartifact_command, _COD
  • sagemaker-train/src/sagemaker/train/common_utils/trainer_wait.py: Bug 1 fix: Improve _refresh_training_job with proper logging instead of silent exception swallowing, add TODO comment about sagemaker-core limitation, keep _to_snake_case helper
  • sagemaker-train/src/sagemaker/train/model_trainer.py: Bug 1 fix: Ensure sagemaker_session is passed to trainer_wait, and logs parameter is not lost
  • sagemaker-train/src/sagemaker/train/templates.py: Revert Bug 4 changes: Remove CODEARTIFACT_LOGIN block and restore original INSTALL_REQUIREMENTS and INSTALL_AUTO_REQUIREMENTS templates
  • sagemaker-core/tests/unit/test_processing_regressions.py: Scoped down to Bug 1 only tests, with edge case for Failed status. Removed Bug 2, 3 tests.
  • sagemaker-train/tests/unit/test_train_regressions.py: Scoped down to Bug 1 only tests for trainer_wait. Removed Bug 4 template tests.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants