
Commit 3d3f3cc (parent a914365)

docs(FR-2603): apply 26.4 model serving doc feedback — service launcher fields, vLLM/SGLang runtime params

File tree: 8 files changed (+271 −316 lines)

packages/backend.ai-webui-docs/src/en/model_serving/model_serving.md

Lines changed: 60 additions & 78 deletions
@@ -1,3 +1,5 @@
+<a id="model-serving"></a>
+
 # Model Serving
 
 ## Model Service
@@ -22,58 +24,35 @@ administrators only need to specify the scaling parameters required for
 the Model Service, without the need to manually create or delete compute
 sessions.
 
-<details>
-<summary>Model Service in Version 23.03 and Earlier</summary>
-
-Although the model serving-specific feature is officially supported from
-version 23.09, you can still use model service in earlier versions.
-
-For example, in version 23.03, you can configure a model service by
-modifying the compute session for training in the following way:
-
-1. Add pre-opened ports during session creation to map the running
-   server port inside the session for model serving.
-   (For instructions on how to use preopen ports, refer to [Set Preopen Ports](#set-preopen-ports).)
+## Guide to Steps for Using Model Service
 
-2. Check 'Open app to public' to allow the service mapped to the
-   pre-opened port to be publicly accessible.
-   (For detailed information about "Open app to public," refer to [Open app to public](#open-app-to-public).)
+Starting from version 26.4.0, you can deploy a model service easily without a separate configuration file.
 
-However, there are certain limitations in version 23.03:
+**Quick Deploy (Recommended)**: Browse pre-configured models in the [Model Store](#model-store) and click the `Deploy` button to deploy immediately.
 
-- Sessions do not automatically recover if they are terminated due to
-  external factors such as idle timeout or system errors.
-- The app port changes every time a session is restarted.
-- If sessions are repeatedly restarted, the idle ports may be
-  exhausted.
+**Deploy via Service Launcher**: Click the `Start Service` button on the Serving page to open the service launcher, then select a runtime variant such as `vLLM` or `SGLang` to create a model service without a separate model definition file.
 
-The official Model Service feature in version 23.09 resolves these
-limitations. Therefore, starting from version 23.09, it is recommended
-to create/manage Model Services through the model serving menu whenever
-possible. The use of pre-opened ports is recommended only for
-development and testing purposes.
+The general workflow is as follows:
 
-</details>
+1. Create a model service using the service launcher.
+2. (If the model service is not public) Generate a token.
+3. (For end users) Access the service endpoint to verify the service.
+4. (If needed) Modify the model service.
+5. (If needed) Terminate the model service.
 
-## Guide to Steps for Using Model Service
+<details>
+<summary>Advanced: Using Model Definition and Service Definition Files (Custom Runtime)</summary>
 
-To use the Model Service, you need to follow the steps below:
+If you are using the `Custom` runtime variant or need finer control, you can create and use model definition and service definition files:
 
 1. Create a model definition file.
 2. Create a service definition file.
 3. Upload the definition files to the model type folder.
-4. Create/Validate the Model Service using the service launcher.
-5. (If the Model Service is not public) Obtain a token.
-6. (For end users) Access the endpoint corresponding to the Model
-   Service to verify the service.
-7. (If needed) Modify the Model Service.
-8. (If needed) Terminate the Model Service.
+4. Select the `Custom` runtime in the service launcher to create/validate the model service.
 
-:::tip
-As an alternative workflow, you can browse pre-configured models in the
-[Model Store](#model-store) and deploy them with a single click using the
-`Deploy` button.
-:::
+For details, refer to the [Creating a Model Definition File](#model-definition-guide) and [Creating a Service Definition File](#service-definition-file) sections below.
+
+</details>
 
 <a id="model-definition-guide"></a>
 
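Step 3 of the workflow added above (accessing the service endpoint to verify the service) can be sketched in a few lines. This is an illustrative example only, not part of the documented UI flow: the endpoint URL, token, and model name are placeholders, and the OpenAI-compatible `/v1/completions` route assumes a `vLLM`- or `SGLang`-style runtime.

```python
import json
import urllib.request

def build_completion_request(endpoint: str, token: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-compatible completion request for a model service,
    passing the generated service token as a Bearer header."""
    payload = json.dumps({
        "model": "my-model",  # placeholder: the Served Model Name configured in the launcher
        "prompt": prompt,
        "max_tokens": 32,
    }).encode()
    return urllib.request.Request(
        f"{endpoint}/v1/completions",
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # omit for services opened to public
        },
    )

# To actually call a running service (placeholders shown):
#   req = build_completion_request("https://<service-endpoint>", "<token>", "Hello")
#   with urllib.request.urlopen(req) as resp:
#       print(json.load(resp)["choices"][0]["text"])
```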
@@ -371,9 +350,11 @@ Click the `Start Service` button on the Serving page to open the service launche
 
 First, provide a service name. The following fields are available:
 
+- **Service Name**: A unique name to identify the endpoint.
 - **Open To Public**: This option allows access to the model service without any separate token. By default, it is disabled.
-- **Model Storage**: The model storage folder to mount, which contains the model definition file inside the directory.
+- **Model Storage Folder to Mount**: Select the storage folder containing the model files.
 - **Inference Runtime Variant**: Selects the runtime variant for the model service. The available variants are dynamically loaded from the backend and may include `vLLM`, `SGLang`, `NVIDIA NIM`, `Modular MAX`, `Custom`, and others depending on your installation.
+- **Environment / Version**: Configure the execution environment for the model service. Selecting a runtime variant automatically filters the environment images.
 
 ![](../images/service_launcher1.png)
 
@@ -393,18 +374,16 @@ Select `Enter Command` to paste a CLI command directly. For example:
 vllm serve /models/my-model --tp 2
 ```
 
-The system automatically parses the command and fills in the following fields:
-
-- **Port**: Auto-detected from the command (default `8000`).
-- **Health Check URL**: Auto-detected from the command (default `/health`).
-- **Model mount path**: Auto-detected from the command.
-
 ![](../images/service_launcher_command_mode.png)
 
-You can also configure:
+The system automatically parses the command and fills in the following fields:
 
-- **Initial Delay**: Seconds to wait before the first health check after the service starts.
-- **Max Retries**: Maximum number of health check attempts before the service is considered failed.
+- **Start Command**: The command to be executed directly for model serving.
+- **Model Mount**: The path where the model storage folder is mounted in the container (default `/models`).
+- **Port**: Auto-detected from the command (default `8000`). The port number that the model serving process listens on.
+- **Health Check URL**: Auto-detected from the command (default `/health`). The HTTP endpoint path called during service health checks.
+- **Initial Delay**: Seconds to wait before the first health check after the service starts (default `60.0`).
+- **Max Retries**: Maximum number of health check attempts before the service is considered failed (default `10`).
 
 :::tip
 If the command suggests multi-GPU usage (e.g., `--tp 2`), a GPU hint will appear
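The **Initial Delay** and **Max Retries** semantics described in the hunk above can be illustrated with a small sketch. This is not the actual launcher implementation; the probe function and the retry interval between attempts are assumptions.

```python
import time
import urllib.request
from urllib.error import URLError

def default_probe(url: str) -> bool:
    """Probe a health-check URL; any HTTP error or timeout counts as unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (URLError, OSError):
        return False

def wait_until_healthy(base_url: str, health_path: str = "/health",
                       initial_delay: float = 60.0, max_retries: int = 10,
                       interval: float = 10.0, probe=default_probe) -> bool:
    """Wait `initial_delay` seconds after the service starts, then call the
    health-check URL up to `max_retries` times before declaring failure."""
    time.sleep(initial_delay)
    for _ in range(max_retries):
        if probe(base_url + health_path):
            return True
        time.sleep(interval)  # assumed spacing between retries
    return False
```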
@@ -425,38 +404,41 @@ Select `Use Config File` to use the traditional `model-definition.yaml` approach
 
 When you select the `vLLM` or `SGLang` runtime variant, a **Runtime Parameters** section appears. This section lets you fine-tune the model serving behavior without manually editing configuration files.
 
-![](../images/service_launcher_runtime_params.png)
+Parameters are organized into tab-separated categories. The tab list varies by runtime variant.
+
+:::note
+Unchanged parameters will use the runtime's default values.
+:::
 
-The parameters are organized into categories:
+**vLLM Runtime Parameters**
 
-**Sampling Parameters:**
+![](../images/service_launcher_runtime_params_vllm.png)
 
-- **Temperature**: Controls randomness in text generation. Higher values produce more diverse output.
-- **Top P**: Nucleus sampling threshold.
-- **Top K**: Limits the number of highest-probability tokens to consider.
-- **Min P**: Minimum probability threshold for token selection.
-- **Frequency Penalty**: Penalizes tokens based on their frequency in the generated text.
-- **Presence Penalty**: Penalizes tokens that have already appeared.
-- **Repetition Penalty**: Penalizes repeated tokens. Values above 1.0 discourage repetition.
-- **Seed**: Random seed for reproducible generation.
+vLLM provides the following tabs: **Model Loading**, **Resource Memory**, **Serving Performance**, **Multimodal**, **Tool Reasoning**, and others.
 
-**Context / Engine Parameters:**
+Key fields in the **Model Loading** tab:
 
-- **Context Length**: Maximum context length the model can process.
-- **Data Type**: Data type for model weights and computation.
-- **KV Cache Data Type**: Data type for the key-value cache.
-- **GPU Memory Utilization**: Fraction of GPU memory to use for the model.
+- **Model**: The name or path of the model to use.
+- **DType**: The data type for model weights and computation (e.g., `Auto`, `float16`, `bfloat16`).
+- **Quantization**: The model quantization method (e.g., `awq`, `gptq`, `fp8`).
+- **Max Model Length**: The maximum context length (number of tokens) the model can process.
+- **Served Model Name**: The model name to expose at the API endpoint.
 - **Trust Remote Code**: Allow execution of custom model code from the model repository.
-- **Enforce Eager Mode** (vLLM only): Disable CUDA graph optimization for debugging.
-- **Disable CUDA Graph** (SGLang only): Disable CUDA graph capture.
-- **Memory Fraction Static** (SGLang only): Static memory fraction for the model.
-- **Max Model Length**: Maximum context length (number of tokens) the model can process.
 
-**Additional Arguments**: A text field for extra CLI arguments not covered by the controls above.
+**SGLang Runtime Parameters**
 
-:::note
-Unchanged parameters will use the runtime's default values.
-:::
+![](../images/service_launcher_runtime_params_sglang.png)
+
+SGLang provides the following tabs: **Model Loading**, **Resource Memory**, **Serving Performance**, **Tool Reasoning**, and others.
+
+Key fields in the **Model Loading** tab:
+
+- **Model**: The name or path of the model to use.
+- **DType**: The data type for model weights and computation (e.g., `Auto`, `float16`, `bfloat16`).
+- **Quantization**: The model quantization method (e.g., `awq`, `gptq`, `fp8`).
+- **Context Length**: The maximum context length the model can process.
+- **Served Model Name**: The model name to expose at the API endpoint.
+- **Trust Remote Code**: Allow execution of custom model code from the model repository.
 
 In addition to runtime parameters, the `vLLM` and `SGLang` runtime variants expose specific environment variables in the **Environment Variables** section of the service launcher:
 
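As a rough illustration of how the **Model Loading** fields above correspond to runtime flags, the following sketch assembles a `vllm serve` command line from a dictionary of field values. The flag names (`--dtype`, `--quantization`, `--max-model-len`, `--served-model-name`, `--trust-remote-code`) are vLLM's own CLI options; the mapping function itself is hypothetical and is not launcher code.

```python
def vllm_args(params: dict) -> list[str]:
    """Translate launcher-style Model Loading fields into vLLM CLI arguments
    (illustrative mapping; missing fields fall back to vLLM defaults)."""
    flag_for = {
        "dtype": "--dtype",
        "quantization": "--quantization",
        "max_model_len": "--max-model-len",
        "served_model_name": "--served-model-name",
    }
    args = ["vllm", "serve", params["model"]]
    for key, flag in flag_for.items():
        if params.get(key) is not None:
            args += [flag, str(params[key])]
    if params.get("trust_remote_code"):
        args.append("--trust-remote-code")  # boolean flag, takes no value
    return args
```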
@@ -650,9 +632,9 @@ Click the `Add Rules` button to open the **Add Auto Scaling Rule** editor. To mo
 
 - **Step Size**: A positive integer specifying how many replicas to add or remove per scaling event. The direction (add or remove) is derived automatically from which threshold is configured:
 
-  - Only a minimum threshold is set: `[metric] < [minThreshold]`. Replicas are scaled **in** when the metric falls below the threshold.
-  - Only a maximum threshold is set: `[maxThreshold] < [metric]`. Replicas are scaled **out** when the metric rises above the threshold.
-  - Both thresholds are set: `[metric] < [minThreshold]` or `[maxThreshold] < [metric]`. Replicas are scaled in or out depending on which boundary the metric crosses.
+  - Only a minimum threshold is set: `[metric] < [minThreshold]` triggers **Scale In** (replicas decrease when the metric falls below the threshold).
+  - Only a maximum threshold is set: `[metric] > [maxThreshold]` triggers **Scale Out** (replicas increase when the metric rises above the threshold).
+  - Both thresholds are set: replicas are scaled in or out depending on which boundary the metric crosses (`[minThreshold] < [metric] < [maxThreshold]` is the normal operating range).
 
 - **Time Window**: The time window, in seconds, over which the metric is aggregated and evaluated for scaling. This replaces the legacy `CoolDown Seconds` field and has a different meaning.
 - **Min Replicas** and **Max Replicas**: The lower and upper bounds that auto-scaling enforces on the replica count. Auto-scaling will not reduce the number of replicas below **Min Replicas** or increase it above **Max Replicas**.
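The threshold rules above can be summarized in a short sketch. This is illustrative only; the actual scheduler also applies **Step Size**, **Time Window** aggregation, and the **Min Replicas**/**Max Replicas** bounds.

```python
from typing import Optional

def scaling_decision(metric: float,
                     min_threshold: Optional[float],
                     max_threshold: Optional[float]) -> str:
    """Derive the scaling direction from which configured threshold the
    aggregated metric crosses, per the rules described above."""
    if min_threshold is not None and metric < min_threshold:
        return "scale-in"   # remove Step Size replicas (not below Min Replicas)
    if max_threshold is not None and metric > max_threshold:
        return "scale-out"  # add Step Size replicas (not above Max Replicas)
    return "no-op"          # metric is within the normal operating range
```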
@@ -796,7 +778,7 @@ The page uses a search and sort layout at the top:
 - **Sort**: Choose how results are ordered. The available options are `Name (A→Z)`, `Name (Z→A)`, `Oldest first`, and `Newest first`.
 - **Refresh**: Click the refresh button to reload the card list.
 
-Each card displays the model brand icon, title (or name when no title is set), task tag, relative creation time, and the author with an icon. Cards that have **no compatible presets** for the current project are shown at 50% opacity. You can still open such a card to view its details, but its **Deploy** button is disabled and an error alert is shown in the drawer: *No compatible presets available. This model cannot be deployed.*
+Each card displays the model brand icon, title (or name when no title is set), task tag, relative creation time, and the author with an icon. Cards that have **no compatible presets** for the current project are shown at 50% opacity. You can still open such a card to view its details, but its **Deploy** button is disabled and an error alert is shown in the drawer: **No compatible presets available. This model cannot be deployed.**
 
 If the `MODEL_STORE` project is not set up on the server, the page shows a *Model Store project not found* message with instructions to contact an administrator. If no model cards match your filters, the page displays *No models found*.
 