Skip to content

fix: Enhancements Needed for Secure Tar Extraction (5560)#5726

Closed
aviruthen wants to merge 4 commits intoaws:masterfrom
aviruthen:fix/enhancements-needed-for-secure-tar-extraction-5560-2
Closed

fix: Enhancements Needed for Secure Tar Extraction (5560)#5726
aviruthen wants to merge 4 commits intoaws:masterfrom
aviruthen:fix/enhancements-needed-for-secure-tar-extraction-5560-2

Conversation

@aviruthen
Copy link
Copy Markdown
Collaborator

Description

The issue reports three bugs in tar extraction safety logic in _repack_model.py. After reading the actual code, _repack_model.py has ALREADY been fixed: (1) _get_safe_members takes a 'members' list param and is called with tar.getmembers(), (2) custom_extractall_tarfile passes _get_resolved_path(extract_path) as base, (3) _is_bad_path uses os.path.commonpath. However, sagemaker-core/src/sagemaker/core/common_utils.py references _get_resolved_path() and custom_extractall_tarfile() in its _create_or_update_code_dir and _extract_model functions, but these tar safety functions (_get_resolved_path, _is_bad_path, _is_bad_link, _get_safe_members, custom_extractall_tarfile) are NOT defined in common_utils.py. They need to be added to common_utils.py with the same corrected implementations from _repack_model.py.

Related Issue

Related issue: 5560

Changes Made

  • sagemaker-core/src/sagemaker/core/common_utils.py
  • sagemaker-core/src/sagemaker/core/utils/__init__.py
  • sagemaker-core/tests/unit/test_common_utils_tar_safety.py

AI-Generated PR

This PR was automatically generated by the PySDK Issue Agent.

  • Confidence score: 85%
  • Classification: bug
  • SDK version target: V3

Merge Checklist

  • Changes are backward compatible
  • Commit message follows prefix: description format
  • Unit tests added/updated
  • Integration tests added (if applicable)
  • Documentation updated (if applicable)

Copy link
Copy Markdown
Collaborator

@sagemaker-bot sagemaker-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 AI Code Review

This PR fixes three security-related bugs in tar extraction safety logic in common_utils.py: (1) uses os.path.commonpath instead of startswith to prevent prefix-collision bypasses, (2) passes the resolved extract_path as base instead of using CWD, and (3) calls tar.getmembers() to pass a proper list to _get_safe_members. The fixes are sound and align with the already-corrected implementations in _repack_model.py. However, there are a few issues worth addressing.

@aviruthen
Copy link
Copy Markdown
Collaborator Author

🤖 Iteration #1 — Review Comments Addressed

Description

This PR adds robustness improvements to the tar extraction safety logic in sagemaker-core/src/sagemaker/core/common_utils.py and cleans up the related exports and tests.

Changes

sagemaker-core/src/sagemaker/core/common_utils.py

  • Wrapped os.path.commonpath call in _is_bad_path() with a try/except ValueError to handle edge cases where paths are on different drives (Windows) or contain mixed absolute/relative paths. When safety cannot be determined, the path is treated as bad (returns True).

sagemaker-core/src/sagemaker/core/utils/__init__.py

  • Removed private helper functions (_get_resolved_path, _is_bad_path, _is_bad_link, _get_safe_members) from __all__ since __all__ controls the public API surface and these are internal helpers.
  • Added custom_extractall_tarfile to __all__ as it is the main public-facing entry point for safe tar extraction.
  • Private helpers remain importable directly via __getattr__ but are not exposed via from sagemaker.core.utils import *.

sagemaker-core/tests/unit/test_common_utils_tar_safety.py

  • Updated from __future__ import absolute_import to from __future__ import annotations per SDK convention for new files.
  • Replaced all hardcoded /tmp/... paths with tempfile.TemporaryDirectory() to avoid platform-specific issues (e.g., macOS where /tmp/private/tmp).
  • Fixed mock pattern for data_filter test: use spec=['TarFile'] to ensure hasattr(mock_tarfile, 'data_filter') correctly returns False when testing the fallback path.
  • Added test for ValueError handling in _is_bad_path to verify the new try/except robustness.

Related Issue

Related issue: 5560

Changes Made

  • sagemaker-core/src/sagemaker/core/common_utils.py
  • sagemaker-core/src/sagemaker/core/utils/__init__.py
  • sagemaker-core/tests/unit/test_common_utils_tar_safety.py

Comments reviewed: 7
Files modified: sagemaker-core/src/sagemaker/core/common_utils.py, sagemaker-core/src/sagemaker/core/utils/__init__.py, sagemaker-core/tests/unit/test_common_utils_tar_safety.py

  • sagemaker-core/src/sagemaker/core/common_utils.py: Wrap os.path.commonpath in try/except for robustness against ValueError on different drives or mixed path types
  • sagemaker-core/src/sagemaker/core/utils/__init__.py: Remove private functions from all and add custom_extractall_tarfile as the public entry point
  • sagemaker-core/tests/unit/test_common_utils_tar_safety.py: Rewrite test file addressing all reviewer feedback: use tempfile for platform independence, use from future import annotations, fix mock patterns for data_filter

Copy link
Copy Markdown
Collaborator

@sagemaker-bot sagemaker-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 AI Code Review

This PR fixes tar extraction safety logic in sagemaker-core's common_utils.py by improving path validation (using os.path.commonpath instead of startswith), passing the correct base directory to _get_safe_members, and calling tar.getmembers() explicitly. The changes are security-relevant and well-tested, though there are a few issues to address.

@aviruthen
Copy link
Copy Markdown
Collaborator Author

🤖 Iteration #2 — Review Comments Addressed

Description

This PR adds tar extraction safety functions and their tests to sagemaker-core, addressing the issues reported about secure tar extraction logic.

Background

The _repack_model.py file already has the corrected implementations of the tar safety functions. However, sagemaker-core/src/sagemaker/core/common_utils.py references _get_resolved_path() and custom_extractall_tarfile() in its _create_or_update_code_dir and _extract_model functions, and these functions are already correctly implemented in common_utils.py with:

  1. _get_safe_members takes a members list param and base param, and is called with tar.getmembers() — not iterating over the tar object directly
  2. custom_extractall_tarfile passes _get_resolved_path(extract_path) as base — not _get_resolved_path("") (cwd)
  3. _is_bad_path uses os.path.commonpath with try/except ValueError — not startswith

Changes Made

sagemaker-core/src/sagemaker/core/utils/__init__.py

  • Removed the _INTERNAL_NAMES list and two-tier export system — private helpers (_get_resolved_path, _is_bad_path, _is_bad_link, _get_safe_members) are importable directly from sagemaker.core.common_utils but are not re-exported via sagemaker.core.utils
  • custom_extractall_tarfile is included in __all__ as the main public-facing entry point for safe tar extraction
  • Simplified __getattr__ to only check __all__

sagemaker-core/tests/unit/test_common_utils_tar_safety.py

  • New comprehensive test file for all tar extraction safety functions
  • Uses from __future__ import annotations per SDK convention for new files
  • Uses tempfile.TemporaryDirectory() throughout to avoid platform-specific issues (e.g., macOS where /tmp/private/tmp)
  • Uses spec=['TarFile'] mock pattern to correctly test the data_filter fallback path
  • Tests the ValueError handling in _is_bad_path for cross-drive/mixed path edge cases
  • Removed unused imports (tarfile, MagicMock)
  • Includes the critical test validating commonpath vs startswith fix for similar-prefix path traversal

Related Issue

Related issue: Enhancements Needed for Secure Tar Extraction

Merge Checklist

  • Changes are backward compatible
  • Commit message follows prefix: description format
  • Unit tests added/updated
  • Integration tests added (if applicable)
  • Documentation updated (if applicable)

Comments reviewed: 15
Files modified: sagemaker-core/src/sagemaker/core/common_utils.py, sagemaker-core/src/sagemaker/core/utils/__init__.py, sagemaker-core/tests/unit/test_common_utils_tar_safety.py

  • sagemaker-core/src/sagemaker/core/common_utils.py: The common_utils.py file already has the correct implementations of _get_resolved_path, _is_bad_path (with commonpath), _is_bad_link, _get_safe_members (with base param), and custom_extractall_tarfile
  • sagemaker-core/src/sagemaker/core/utils/__init__.py: Remove private helpers from _INTERNAL_NAMES and getattr, keep custom_extractall_tarfile in all as the public entry point. Private helpers are importable directly from common_utils.
  • sagemaker-core/tests/unit/test_common_utils_tar_safety.py: Rewrite test file addressing all reviewer feedback: use from future import annotations, use tempfile for platform independence, fix mock patterns for data_filter, remove unused imports, add ValueE

Copy link
Copy Markdown
Collaborator

@sagemaker-bot sagemaker-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 AI Code Review

This PR fixes tar extraction safety logic in sagemaker-core's common_utils.py by improving path validation with os.path.commonpath (fixing a prefix-matching vulnerability), passing the correct base directory to _get_safe_members, and calling tar.getmembers() explicitly. The changes are well-motivated security fixes with good test coverage. A few minor issues need attention.



def _get_safe_members(members):
def _get_safe_members(members, base):
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The signature change from _get_safe_members(members) to _get_safe_members(members, base) is a breaking change for any code that calls this function with the old signature. Since this is a private function (underscore-prefixed), the risk is lower, but if any downstream code or other modules in sagemaker-core call _get_safe_members(tar) (passing the tar object directly as was done before), this will break. The old code also used _get_safe_members(tar) where tar was iterable — now it expects tar.getmembers() (a list). Please verify there are no other callers of _get_safe_members in the codebase that would break.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, this should be fixed. However, perhaps this can be made an optional argument by providing a default value which can be handled (rather than removing it from the signature or verifying other function calls don't use this).

@aviruthen
Copy link
Copy Markdown
Collaborator Author

🤖 Iteration #3 — Review Comments Addressed

Description

This PR addresses the reported issues with tar extraction safety logic in sagemaker-core. The core safety functions (_get_resolved_path, _is_bad_path, _is_bad_link, _get_safe_members, custom_extractall_tarfile) in common_utils.py already had the correct implementations using os.path.commonpath (not startswith), passing tar.getmembers() (not iterating over tar), and using _get_resolved_path(extract_path) as base (not cwd). This PR adds robustness improvements and comprehensive tests.

Changes Made

sagemaker-core/src/sagemaker/core/common_utils.py

  • Made _get_safe_members's base parameter optional (default None, falls back to cwd) for backward compatibility with any existing callers that don't pass base
  • Added clarifying comment in _is_bad_link about _get_resolved_path usage on tip for consistent normalization

sagemaker-core/src/sagemaker/core/utils/__init__.py

  • Removed private helpers from exports — _get_resolved_path, _is_bad_path, _is_bad_link, _get_safe_members are importable directly from sagemaker.core.common_utils but not re-exported via sagemaker.core.utils
  • custom_extractall_tarfile is included in __all__ as the main public entry point for safe tar extraction
  • Added module docstring clarifying the export policy

sagemaker-core/tests/unit/test_common_utils_tar_safety.py

  • New comprehensive test file for all tar extraction safety functions
  • Uses from __future__ import annotations per SDK convention
  • Uses tempfile.TemporaryDirectory() throughout for platform independence (avoids macOS /tmp/private/tmp issues)
  • Uses spec=['TarFile'] mock pattern to correctly test the data_filter fallback path
  • Tests ValueError handling in _is_bad_path for cross-drive/mixed path edge cases
  • Tests both absolute path escape AND relative path traversal for the commonpath vs startswith fix
  • Tests directory traversal attack vector (../../etc/passwd)
  • Tests backward compatibility of _get_safe_members without base parameter
  • Asserts _validate_extracted_paths is called after extraction in fallback path
  • Reduced nesting with combined with statements
  • No unused imports

Related Issue

Enhancements Needed for Secure Tar Extraction

Merge Checklist

  • Changes are backward compatible
  • Commit message follows prefix: description format
  • Unit tests added/updated
  • Integration tests added (if applicable)
  • Documentation updated (if applicable)

Comments reviewed: 23
Files modified: sagemaker-core/src/sagemaker/core/common_utils.py, sagemaker-core/src/sagemaker/core/utils/__init__.py, sagemaker-core/tests/unit/test_common_utils_tar_safety.py

  • sagemaker-core/src/sagemaker/core/common_utils.py: 1. Wrap _is_bad_link's tip with _get_resolved_path for consistent normalization. 2. Make _get_safe_members' base parameter optional with a default of None (backward compatible), falling back to cwd if
  • sagemaker-core/src/sagemaker/core/utils/__init__.py: Clean up all to only include public functions plus custom_extractall_tarfile. Remove private helpers from exports. Simplify getattr.
  • sagemaker-core/tests/unit/test_common_utils_tar_safety.py: Comprehensive test file addressing all reviewer feedback: uses tempfile throughout, from future import annotations, no unused imports, proper mock patterns, tests for traversal paths, _validate_ex

Copy link
Copy Markdown
Collaborator

@sagemaker-bot sagemaker-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 AI Code Review

This PR fixes tar extraction safety logic in sagemaker-core's common_utils.py by improving path comparison (using os.path.commonpath instead of startswith), passing the correct base directory to _get_safe_members, and calling tar.getmembers() explicitly. The changes are security-relevant and well-tested, though there are a few issues to address.

"""
# joinpath will ignore base if path is absolute
return not _get_resolved_path(joinpath(base, path)).startswith(base)
resolved = _get_resolved_path(joinpath(base, path))
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The os.path.commonpath fix is a good security improvement over startswith, which can be tricked by paths like /base/x2 matching /base/x. However, there's a subtle issue: os.path.commonpath normalizes paths, but base may not be normalized in the same way. You should ensure both resolved and base are consistently normalized before comparison. Consider:

resolved = _get_resolved_path(joinpath(base, path))
base = _get_resolved_path(base)  # ensure consistent normalization
try:
    return os.path.commonpath([resolved, base]) != base
except ValueError:
    return True

Otherwise, if base has a trailing slash or different normalization than what commonpath returns, the comparison could give incorrect results.

@aviruthen aviruthen closed this Apr 7, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants