Describe the feature you'd like
Expose additional `CreateInferenceComponent` API parameters through `ModelBuilder.deploy()` when deploying Inference Components. Currently, `ModelBuilder._deploy_core_endpoint()` builds a minimal `InferenceComponentSpecification` and hardcodes several values. The following API-supported configuration is not surfaced:

- `Specification.DataCacheConfig.EnableCaching` — Cache model artifacts and container images on instances for faster auto-scaling cold starts
- `Specification.BaseInferenceComponentName` — Adapter component deployment (e.g. LoRA adapters attached to a base model)
- `Specification.Container` (`Image`, `ArtifactUrl`, `Environment`) — Custom container images, artifact URLs, and environment variables at the IC level
- `VariantName` (top-level) — Currently hardcoded to `"AllTraffic"`, not configurable for multi-variant endpoints
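For orientation, a minimal sketch of where these fields sit in the `CreateInferenceComponent` request, expressed as a Python dict (all values are placeholders, and the fields are shown together for illustration — see the API reference for which combinations are valid):

```python
# Sketch of a CreateInferenceComponent request highlighting the fields
# that ModelBuilder.deploy() does not currently surface. Placeholder values.
create_inference_component_request = {
    "InferenceComponentName": "my-ic",
    "EndpointName": "my-endpoint",
    "VariantName": "AllTraffic",  # hardcoded today; should be configurable
    "Specification": {
        "ModelName": "my-model",
        # "BaseInferenceComponentName": "base-ic",  # for adapter deployments
        # "Container": {"Image": "...", "ArtifactUrl": "...", "Environment": {}},
        "ComputeResourceRequirements": {
            "MinMemoryRequiredInMb": 8192,
            "NumberOfAcceleratorDevicesRequired": 1,
            "NumberOfCpuCoresRequired": 2,
        },
        "DataCacheConfig": {"EnableCaching": True},  # the immediate need
    },
    "RuntimeConfig": {"CopyCount": 1},
}
```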
How would this feature be used? Please describe.
Our immediate need is `DataCacheConfig.EnableCaching`. When deploying large model artifacts (700MB+ Triton ensembles) on auto-scaling endpoints, new instances must re-download all artifacts during scale-out. Enabling caching eliminates this overhead.

Ideally these would be optional parameters on `deploy()` or on `ResourceRequirements`:
```python
builder.deploy(
    endpoint_name="my-endpoint",
    inference_component_name="my-ic",
    instance_type="ml.g5.2xlarge",
    initial_instance_count=1,
    inference_config=ResourceRequirements(
        requests={"memory": 8192, "num_accelerators": 1, "num_cpus": 2, "copies": 1}
    ),
    data_cache_config={"enable_caching": True},  # new
)
```
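One hypothetical way the new kwargs could be translated into `InferenceComponentSpecification` arguments inside `ModelBuilder._deploy_core_endpoint()`, sketched as a standalone helper (the helper name and the dict-to-kwargs mapping are assumptions for illustration, not the actual SDK internals):

```python
# Hypothetical mapping from the proposed deploy() kwargs to
# InferenceComponentSpecification keyword arguments. The key names mirror
# sagemaker.core.shapes, but this helper itself is illustrative only.
def build_ic_spec_kwargs(model_name, compute_resource_requirements,
                         data_cache_config=None,
                         base_inference_component_name=None):
    kwargs = {
        "model_name": model_name,
        "compute_resource_requirements": compute_resource_requirements,
    }
    if data_cache_config is not None:
        # e.g. {"enable_caching": True} would become
        # InferenceComponentDataCacheConfig(enable_caching=True)
        kwargs["data_cache_config"] = dict(data_cache_config)
    if base_inference_component_name is not None:
        kwargs["base_inference_component_name"] = base_inference_component_name
    return kwargs

spec_kwargs = build_ic_spec_kwargs(
    "my-model",
    {"min_memory_required_in_mb": 8192},
    data_cache_config={"enable_caching": True},
)
```

Parameters the caller omits simply stay absent, so existing `deploy()` call sites would be unaffected.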
Describe alternatives you've considered
The current workaround is to use `ModelBuilder.build()` for model creation, then call `EndpointConfig.create()`, `Endpoint.create()`, and `InferenceComponent.create()` directly via sagemaker-core, which supports the full `InferenceComponentSpecification`:
```python
from sagemaker.core.resources import EndpointConfig, Endpoint, InferenceComponent
from sagemaker.core.shapes import (
    ProductionVariant,
    InferenceComponentSpecification,
    InferenceComponentComputeResourceRequirements,
    InferenceComponentRuntimeConfig,
    InferenceComponentDataCacheConfig,
)

builder.build(model_name=sm_model_name)

EndpointConfig.create(
    endpoint_config_name=epc_name,
    production_variants=[
        ProductionVariant(
            variant_name="AllTraffic",
            instance_type="ml.g5.2xlarge",
            initial_instance_count=1,
        )
    ],
    execution_role_arn=role,
)

Endpoint.create(
    endpoint_name=endpoint_name,
    endpoint_config_name=epc_name,
)

InferenceComponent.create(
    inference_component_name=ic_name,
    endpoint_name=endpoint_name,
    variant_name="AllTraffic",
    specification=InferenceComponentSpecification(
        model_name=sm_model_name,
        compute_resource_requirements=InferenceComponentComputeResourceRequirements(
            min_memory_required_in_mb=8192,
            number_of_accelerator_devices_required=1,
            number_of_cpu_cores_required=2,
        ),
        data_cache_config=InferenceComponentDataCacheConfig(enable_caching=True),
    ),
    runtime_config=InferenceComponentRuntimeConfig(copy_count=1),
)
```
This works but defeats the purpose of `ModelBuilder.deploy()` as a high-level abstraction. Customers shouldn't need to drop down to sagemaker-core or boto3 for commonly used API parameters.
Additional context
- API reference: `CreateInferenceComponent`
- The shape classes (`InferenceComponentDataCacheConfig`, `InferenceComponentContainerSpecification`, etc.) already exist in `sagemaker.core.shapes` — they just need to be wired into `ModelBuilder._deploy_core_endpoint()`.
- SageMaker SDK version: 3.x