Component
Helm Chart
Desired use case or feature
Currently the llm-d Helm chart supports two protocols for sampleApplication.model.modelArtifactURI:
- hf://: This pulls the model just-in-time when vLLM starts up.
- pvc://: This pulls the model from a PVC. Optionally, a model can be downloaded with a transfer job from HuggingFace and stored in the specified PVC.
vLLM also supports streaming a model directly from object storage with higher concurrency using --load-format=runai_streamer (docs). This allows loading from an object storage backend or a filesystem, rather than from HuggingFace or a PVC with the default loader.
Proposed solution
In order to use the model streamer, vLLM needs additional command line arguments:
- Load format: --load-format runai_streamer
- Model name: can be specified either through the --model argument or directly as a served model name (e.g. --model=s3://<path-to-model> or vllm serve s3://<path-to-model>).
I propose adding an optional .modelService.vllm.loadFormat parameter to the Helm chart. When it is set to runai_streamer, the "Protocol" constraint is relaxed (the model source check is skipped when the runai_streamer vLLM load format is specified). The loadFormat value is also passed to vLLM as the --load-format command line argument.
- When pvc:// is specified as the protocol, the pvc:// protocol can continue to be used. The model streamer can reference the path just as the default loader does today.
- When the protocol is not recognized (e.g. s3://), the modelArtifactURI will be used as the model name, passing s3://<path-to-model> through as the served model argument to vLLM, much as is done today for the PVC case (where the PVC protocol path suffix is passed in as the .ModelPath).
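The resolution rules above could be sketched roughly as follows. This is only an illustration in Python, not the chart's actual template or launcher logic: the function names, the `PVC_MOUNT_ROOT` mount point, and the error behaviour are all assumptions made for the example.

```python
from typing import List, Optional, Tuple

# Hypothetical mount point for the PVC inside the vLLM container (an assumption,
# not taken from the chart).
PVC_MOUNT_ROOT = "/models"


def split_uri(uri: str) -> Tuple[str, str]:
    """Split 'proto://rest' into (proto, rest)."""
    proto, sep, rest = uri.partition("://")
    if not sep:
        raise ValueError(f"missing protocol in {uri!r}")
    return proto, rest


def resolve_model_args(uri: str, load_format: Optional[str] = None) -> List[str]:
    """Turn a modelArtifactURI and an optional loadFormat into vLLM arguments."""
    proto, rest = split_uri(uri)
    if proto == "hf":
        # hf://<org>/<model>: the suffix is the HuggingFace model name.
        model = rest
    elif proto == "pvc":
        # pvc://<pvc_name>/<path_to_model>: only the path suffix reaches vLLM.
        _pvc_name, _, path = rest.partition("/")
        model = f"{PVC_MOUNT_ROOT}/{path}"
    elif load_format == "runai_streamer":
        # Unrecognized protocol (e.g. s3://): pass the URI through unchanged.
        model = uri
    else:
        raise ValueError(f"unsupported protocol {proto!r} without runai_streamer")

    args = ["--model", model]
    if load_format:
        args += ["--load-format", load_format]
    return args
```

With this shape, `resolve_model_args("s3://bucket/model", "runai_streamer")` passes the S3 URI straight to vLLM, while the same URI without a load format is still rejected, so existing validation is unchanged for users who do not opt in.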
Additionally, loading can be tuned with the --model-loader-extra-config parameter or with environment variables passed to vLLM. Command line args can be passed in through .sampleApplication.decode.extraArgs or .sampleApplication.prefill.extraArgs today, but there may be a better way of passing these parameters consistently to all instances of vLLM (e.g. .modelService.vllm.extraArgs and .modelService.vllm.extraEnvVars parameters).
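To illustrate why a single shared setting may be preferable, here is a small Python sketch that renders the tuning options once and reuses them for both decode and prefill. The helper name is hypothetical, and the `concurrency` key is used only as an example of a loader option; the supported keys should be checked against the vLLM docs.

```python
import json
from typing import Dict, List


def streamer_extra_args(loader_config: Dict[str, int]) -> List[str]:
    """Render tuning options as a --model-loader-extra-config argument pair."""
    if not loader_config:
        return []
    # vLLM expects the extra config as a JSON string.
    return ["--model-loader-extra-config", json.dumps(loader_config, sort_keys=True)]


# Example key only; verify the supported options in the vLLM docs.
shared_args = streamer_extra_args({"concurrency": 16})

# The same list is reused, so decode and prefill cannot drift apart.
decode_args = ["--load-format", "runai_streamer"] + shared_args
prefill_args = ["--load-format", "runai_streamer"] + shared_args
```

Defining the options in one place mirrors what a shared .modelService.vllm.extraArgs parameter would give chart users.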
Alternatives
Another option could be to add a new runai_streamer "Protocol" to the modelArtifactURI chart parameter. This could wrap either an object storage URI or a filesystem path.
- If an object storage system is used, the runai_streamer protocol would be unwrapped to identify the underlying model protocol. For example, runai_streamer://s3://<path_to_model> would allow the suffix s3://<path_to_model> to be used as the model name for the inference server.
- If a local filesystem is used, this complicates things, as the user may want to specify a PVC. So this may require wrapping protocols (e.g. runai_streamer://pvc://<pvc_name>/<path_to_model>).
I think this option is less intuitive for the end user, as it could lead to a complex modelArtifactURI, and more challenging unnesting logic in the llm-d launcher script.
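As a rough illustration of the unnesting this alternative would require, a Python sketch of the unwrapping step might look like the following. The function name and error handling are hypothetical, and the real launcher script is not written in Python.

```python
from typing import Optional, Tuple


def unwrap_streamer_uri(uri: str) -> Tuple[Optional[str], str]:
    """Return (load_format, inner_uri) for a possibly wrapped modelArtifactURI."""
    outer, sep, inner = uri.partition("://")
    if not sep:
        raise ValueError(f"missing protocol in {uri!r}")
    if outer != "runai_streamer":
        # Not wrapped: keep the URI as-is and use the default loader.
        return None, uri
    if "://" not in inner:
        raise ValueError("wrapped URI must carry its own protocol")
    # The inner URI (s3://... or pvc://...) still needs a second resolution pass.
    return outer, inner
```

Even in this small form the wrapped pvc:// case needs two parsing passes, one to strip the runai_streamer prefix and another to resolve the PVC name and path, which is the extra complexity mentioned above.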
Additional context or screenshots
No response