Add support for loading models using the Run:AI Model Streamer

### Component

Helm Chart

### Desired use case or feature

Currently the llm-d Helm chart supports two protocols for `sampleApplication.model.modelArtifactURI`:
 * `hf://`: This pulls the model just-in-time when vLLM starts up
 * `pvc://`: This pulls the model from a PVC. Optionally, a model can be downloaded with a transfer job from HuggingFace and stored in the specified PVC.

vLLM also supports streaming a model directly from object storage with higher concurrency using `--load-format=runai_streamer` ([docs](https://docs.vllm.ai/en/v0.8.5/models/extensions/runai_model_streamer.html)). This allows loading from an object storage backend or a filesystem, rather than from HuggingFace or a PVC with the default loader.

### Proposed solution

In order to use the model streamer, vLLM needs additional command line arguments:
 * `--load-format runai_streamer`
 * Model name: can be specified either through the `--model` argument or as the positional model argument to `vllm serve` (e.g. `--model=s3://<path-to-model>` or `vllm serve s3://<path-to-model>`).
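To make the two forms concrete, here is a minimal sketch of the resulting command lines. The bucket and model path are hypothetical placeholders, and pairing `--model` with the `vllm.entrypoints.openai.api_server` module is my assumption about where that flag would be used. The commands are printed rather than executed, so the snippet runs without vLLM installed.

```shell
# Hypothetical bucket and path, for illustration only.
MODEL_URI="s3://example-bucket/example-model"

# Form 1: model given as the positional argument to `vllm serve`.
CMD_POSITIONAL="vllm serve ${MODEL_URI} --load-format runai_streamer"

# Form 2: model given through --model (assumed here to go to the API server module).
CMD_FLAG="python3 -m vllm.entrypoints.openai.api_server --model=${MODEL_URI} --load-format runai_streamer"

# Print instead of executing, so no vLLM installation is required.
printf '%s\n' "$CMD_POSITIONAL" "$CMD_FLAG"
```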

I propose adding an optional `.modelService.vllm.loadFormat` parameter to the Helm chart. Its value is passed to vLLM as the `--load-format` command line argument. When it is set to `runai_streamer`, the "Protocol" constraint is relaxed by skipping the [model source check](https://github.com/llm-d/llm-d-deployer/blob/c9e16e91d264ff719d4e9885fbe5e1b239eb87a1/quickstart/llmd-installer.sh#L380):
  * When `pvc://` is specified as the protocol, it continues to work. The model streamer references the path on the PVC just as the default loader does today.
  * When the protocol is not recognized (e.g. `s3://`), the `modelArtifactURI` is used as the model name, passing `s3://<path-to-model>` through as the model argument to vLLM. This mirrors the PVC case today, where the path suffix of the `pvc://` URI is passed in as the `.ModelPath`.
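The protocol handling described above could look roughly like the sketch below. `resolve_model_arg` is a hypothetical helper and not the chart's or launcher's actual code. The `hf://` branch and the exact form of the PVC path are assumptions, and the PVC mount point is deliberately left out.

```shell
# Hypothetical helper: prints the model argument to hand to vLLM.
# $1 = model artifact URI, $2 = load format (may be empty).
resolve_model_arg() {
  case "$1" in
    hf://*)
      # Assumption: the Hugging Face repo id is whatever follows the prefix.
      printf '%s\n' "${1#hf://}"
      ;;
    pvc://*/*)
      # pvc://<pvc_name>/<path_to_model>: drop the PVC name, keep the path suffix.
      _rest="${1#pvc://}"
      printf '/%s\n' "${_rest#*/}"
      ;;
    pvc://*)
      echo "pvc URI has no model path: $1" >&2
      return 1
      ;;
    *://*)
      # Unrecognized protocol: pass it through only for the model streamer.
      if [ "${2:-}" = "runai_streamer" ]; then
        printf '%s\n' "$1"
      else
        echo "unsupported protocol in: $1" >&2
        return 1
      fi
      ;;
    *)
      echo "not a URI: $1" >&2
      return 1
      ;;
  esac
}
```

With this sketch, `resolve_model_arg "s3://example-bucket/example-model" runai_streamer` prints the URI unchanged, while the same call without a load format fails, which corresponds to keeping the existing check in place for the default loader.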

Additionally, loading can be tuned with the `--model-loader-extra-config` parameter or with environment variables passed to vLLM. Command line arguments can be passed in through `.sampleApplication.decode.extraArgs` or `.sampleApplication.prefill.extraArgs` today, but there may be a better way of passing these parameters consistently to all instances of vLLM (e.g. `.modelService.vllm.extraArgs` and `.modelService.vllm.extraEnvVars` parameters).
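As a rough illustration of the two tuning routes, the sketch below sets the streamer's concurrency once as JSON on the command line and once as an environment variable. The value `16` and the bucket path are illustrative. The `concurrency` key and the `RUNAI_STREAMER_CONCURRENCY` variable are taken from my reading of the vLLM and model streamer documentation and should be checked against the versions in use.

```shell
# Option 1: JSON passed through --model-loader-extra-config (illustrative value).
LOADER_CONFIG='{"concurrency":16}'
TUNED_CMD="vllm serve s3://example-bucket/example-model --load-format runai_streamer --model-loader-extra-config '${LOADER_CONFIG}'"

# Print instead of executing, so no vLLM installation is required.
printf '%s\n' "$TUNED_CMD"

# Option 2: environment variable read by the streamer (name assumed from its docs).
export RUNAI_STREAMER_CONCURRENCY=16
```

Either form would need to reach every vLLM instance, which is the motivation for chart-wide `extraArgs` and `extraEnvVars` parameters rather than per-role ones.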

### Alternatives

Another option could be to add a new `runai_streamer` "Protocol" to the `modelArtifactURI` chart parameter. This would encode the load format together with the object storage URI or filesystem path.
 * If an object storage system is used, the `runai_streamer` protocol would be unwrapped to identify the underlying model protocol. For example, `runai_streamer://s3://<path_to_model>` would allow the suffix `s3://<path_to_model>` to be used as the model name for the inference server.
 * If a local filesystem is used, this complicates things, as the user may want to specify a PVC. So this may require wrapping protocols (e.g. `runai_streamer://pvc://<pvc_name>/<path_to_model>`).
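For comparison, the unwrapping step this alternative would require might look like the sketch below. `unwrap_runai_uri` is a hypothetical name. The result still has to go through the regular protocol handling afterwards, which is where the extra nesting logic comes from.

```shell
# Hypothetical helper: strips the outer runai_streamer:// wrapper and prints
# the inner URI (e.g. s3://... or pvc://...). Fails if the wrapper is missing
# or does not contain a nested protocol.
unwrap_runai_uri() {
  case "$1" in
    runai_streamer://*://*)
      printf '%s\n' "${1#runai_streamer://}"
      ;;
    *)
      echo "not a wrapped runai_streamer URI: $1" >&2
      return 1
      ;;
  esac
}
```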

I think this option is less intuitive for the end user, as it could lead to a complex `modelArtifactURI` and more challenging unnesting logic in the llm-d launcher script.

### Additional context or screenshots

_No response_
