The KServe InferenceService CRD is the primary user interface that ModelMesh Serving uses for deploying models. An InferenceService comprises three components: a Predictor, a Transformer, and an Explainer. Currently, ModelMesh Serving primarily supports the Predictor component for deploying models. There is preliminary support for Transformers; however, transformer deployment is handled by the KServe controller, so each InferenceService's transformer requires its own pod.
Here is an example of an InferenceService spec containing fields that would typically be used with ModelMesh:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-tensorflow-predictor
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
spec:
  predictor:
    model:
      modelFormat:
        name: tensorflow
        version: "1.15" # Optional
      runtime: triton-2.x # Optional
      storage:
        key: my_storage
        path: my_models/mnist-tf
        schemaPath: my_schemas/mnist-schema.json # Optional
        parameters:
          bucket: my_bucket # Optional if bucket specified in secret

Note
- While both the KServe controller and ModelMesh controller will reconcile InferenceService resources, the ModelMesh controller will only handle those InferenceServices with the serving.kserve.io/deploymentMode: ModelMesh annotation. Otherwise, the KServe controller will handle reconciliation. Likewise, the KServe controller will not reconcile an InferenceService with the serving.kserve.io/deploymentMode: ModelMesh annotation, and will defer under the assumption that the ModelMesh controller will handle it.
- runtime is optional. If included, the model will be loaded/served using the ServingRuntime with the specified name, and the predictor's modelFormat must match an entry in that runtime's supportedModelFormats list (see runtimes).
- The above spec makes use of the InferenceService predictor storage spec interface for passing in storage-related information.
Users can alternatively continue to use storageUri to pass in storage information:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-tensorflow-predictor
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
    serving.kserve.io/secretKey: my_storage
    serving.kserve.io/schemaPath: my_schemas/mnist-schema.json
spec:
  predictor:
    model:
      modelFormat:
        name: tensorflow
        version: "1.15"
      runtime: triton-2.x
      storageUri: s3://my_bucket/my_models/mnist-tf

When using the storageUri field instead of the storage spec, additional information can be passed in as annotations:
- serving.kserve.io/secretKey: For specifying the storage secret key (if needed).
- serving.kserve.io/schemaPath: The path within the object storage of a schema file. This allows specifying the input and output schema of ML models.
  - For example, if your model storageUri was s3://modelmesh-example-models/pytorch/pytorch-cifar, the schema file would currently need to be in the same bucket (modelmesh-example-models). The path within this bucket is what would be specified in this annotation (e.g. pytorch/schema/schema.json).
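As a concrete illustration of the schemaPath annotation, the following sketch combines the example values from the description above into a single resource. The resource name, the pytorch model format, and the my_storage secret key are assumptions made for illustration; only the bucket and the two paths come from the example.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: example-pytorch-cifar # hypothetical name
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
    serving.kserve.io/secretKey: my_storage # assumed secret key, reused from the earlier example
    # Path of the schema file relative to the bucket named in storageUri below
    serving.kserve.io/schemaPath: pytorch/schema/schema.json
spec:
  predictor:
    model:
      modelFormat:
        name: pytorch # assumed model format for this example
      # The schema file must live in this same bucket (modelmesh-example-models)
      storageUri: s3://modelmesh-example-models/pytorch/pytorch-cifar
```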
InferenceService Status
The Status section of the InferenceService custom resource reflects details about its current state. Here are fields relevant to ModelMesh:
- components.predictor - Predictor-related endpoint information.
  - url - The primary URL that will distribute traffic over the provided traffic targets. This will be one of the REST or gRPC endpoints that are available.
  - restUrl - REST endpoint of the component, if available. This endpoint is provided through a REST proxy sidecar (if enabled), and will be the same for all predictors owned by a given ModelMesh Serving installation.
  - grpcUrl - gRPC endpoint of the component, if available. Note that this will currently be the same for all InferenceServices owned by a given ModelMesh Serving installation.
- conditions - Various condition entries. Pertinent entries are:
  - PredictorReady: Predictor readiness condition. Status is true when the predictor's endpoints are ready to serve inferencing requests. Note that this does not necessarily mean requests will be answered immediately, since the corresponding model may or may not be loaded in memory. If it isn't, there may be some delay before the response comes back.
  - Ready: Aggregated condition of all conditions.
- modelStatus - Model-related statuses.
  - states - State information of the predictor's model.
    - activeModelState - The state of the model currently being served by the predictor's endpoints. It may be one of:
      - Pending - The ModelMesh Serving controller has not yet acknowledged/registered this (new) predictor/model.
      - Standby - The model is currently not loaded in memory anywhere, but will be loaded automatically upon first usage. This means the first requests to this predictor will likely take longer to respond.
      - Loading - The model is in the process of loading. Requests may take longer to respond since they will be blocked until the loading completes.
      - Loaded - The model is loaded in at least one pod and ready to respond immediately to inferencing requests.
      - FailedToLoad - The model could not be loaded for some reason. See the lastFailureInfo field for more details.
    - targetModelState - This will be set only when transitionStatus is not UpToDate, meaning that the target model differs from the currently active model. The target model always corresponds to the InferenceService predictor's current spec. The possible values are the same as for activeModelState, but should generally only be either Loading or FailedToLoad.
  - transitionStatus - Indicates the state of the predictor relative to its current spec. It may be one of:
    - UpToDate - The predictor's current model reflects its spec, that is, its active model matches its target model.
    - InProgress - The predictor's currently active model configuration is older than its current spec reflects. This is usually the case immediately after the spec changes, while a new target model is loading (targetModelState should be Loading). Once the target model finishes loading successfully, the active model will become the target model and the transitionStatus will return to UpToDate.
    - BlockedByFailedLoad - The predictor's currently active model configuration is older than its current spec reflects, because there was a problem loading the corresponding model. See the lastFailureInfo field for more details of the failure.
    - InvalidSpec - The predictor's currently active model configuration was not transitioned to match its current spec because the current spec is invalid. There may be more details of the error in the lastFailureInfo field.
  - modelCopies - Model copy information of the predictor's model.
    - failedCopies - The number of copies of the active or target model that failed to load recently (there will be at most one of each per pod).
    - totalCopies - The total number of copies of this predictor's models that are currently loaded.
  - lastFailureInfo - Details about the most recent error associated with this predictor. Not all of the contained fields will necessarily have a value.
    - reason - A high-level code indicating the nature of the failure. It may be one of:
      - ModelLoadFailed - The model failed to load within a serving runtime container. Loading is automatically retried in other pods of the same runtime if they exist; the failedCopies field indicates how many different pods the model recently failed to load in.
      - RuntimeUnhealthy - Corresponding ServingRuntime pods failed to start or are unhealthy.
      - NoSupportingRuntime - There are no ServingRuntimes which support the specified model type.
      - RuntimeNotRecognized - There is no ServingRuntime defined with the specified runtime name.
      - InvalidPredictorSpec - The current InferenceService predictor spec is invalid or unsupported.
    - location - Indication of the pod in which a loading failure most recently occurred, if applicable. Its value will be the last 12 digits of the pod's full name.
    - message - A message containing more detail about the error/failure.
    - modelId - The internal id of the model in question. This includes a hash of the InferenceService's predictor spec.
    - time - The time at which the failure occurred, if applicable.
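To show how the model status fields relate to one another, here is a sketch of what modelStatus might contain after a spec update whose new model failed to load while the previous model is still active. The nesting follows the field descriptions above; the message text and timestamp are invented placeholders, and real output may contain additional fields.

```yaml
status:
  modelStatus:
    states:
      activeModelState: Loaded # model from the previous spec is still active
      targetModelState: FailedToLoad # model for the current spec could not be loaded
    transitionStatus: BlockedByFailedLoad # active model is older than the current spec
    lastFailureInfo:
      reason: ModelLoadFailed
      message: "placeholder: details of the load failure" # illustrative text only
      time: "2024-01-01T00:00:00Z" # placeholder timestamp
```

Because transitionStatus is not UpToDate in this sketch, targetModelState is set, and lastFailureInfo carries the cause of the failed load.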
Upon creation, the active model status of an InferenceService will always transition to the Loaded state (unless loading fails). Later, if the model is unused, its active model status may end up in the Standby state, which means the model is still available to serve requests but the first request could incur a loading delay. Whether this happens depends on the available capacity and the usage pattern of other models. Models may transition from Standby back to Loaded "by themselves" if more capacity becomes available.
Model loading will be retried immediately in other pods if it fails, after which it will be re-attempted periodically (every ten minutes or so).