<!-- Source file: packages/backend.ai-webui-docs/src/en/model_serving/model_serving.md -->
<a id="model-serving"></a>

# Model Serving

## Model Service
administrators only need to specify the scaling parameters required for
the Model Service, without the need to manually create or delete compute
sessions.

## Guide to Steps for Using Model Service

Starting from version 26.4.0, you can deploy a model service easily without a separate configuration file.

**Quick Deploy (Recommended)**: Browse pre-configured models in the [Model Store](#model-store) and click the `Deploy` button to deploy immediately.

**Deploy via Service Launcher**: Click the `Start Service` button on the Serving page to open the service launcher, then select a runtime variant such as `vLLM` or `SGLang` to create a model service without a separate model definition file.

The general workflow is as follows:

1. Create a model service using the service launcher.
2. (If the model service is not public) Generate a token.
3. (For end users) Access the service endpoint to verify the service.
4. (If needed) Modify the model service.
5. (If needed) Terminate the model service.
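
For step 3, the sketch below shows how an end user might construct a request to a token-protected service endpoint, assuming an OpenAI-compatible runtime such as vLLM. The endpoint URL, token value, and `Authorization` header scheme are placeholders: substitute the endpoint shown on the Serving page and the token generated in step 2, and check how your installation expects the token to be sent.

```python
import json
from urllib.request import Request

# Hypothetical values -- replace with the endpoint URL from the Serving
# page and the token generated in step 2.
ENDPOINT = "https://example-endpoint.backend.ai"
TOKEN = "<generated-token>"

def build_chat_request(prompt: str, model: str = "my-model") -> Request:
    """Build (but do not send) an OpenAI-compatible chat completion request."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return Request(
        f"{ENDPOINT}/v1/chat/completions",
        data=body,
        headers={
            # The exact auth header may differ in your installation.
            "Authorization": f"Bearer {TOKEN}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_chat_request("Hello!")
# urllib.request.urlopen(req) would actually send the request.
```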

<details>
<summary>Advanced: Using Model Definition and Service Definition Files (Custom Runtime)</summary>

If you are using the `Custom` runtime variant or need finer control, you can create and use model definition and service definition files:

1. Create a model definition file.
2. Create a service definition file.
3. Upload the definition files to the model type folder.
4. Select the `Custom` runtime in the service launcher to create/validate the model service.

For details, refer to the [Creating a Model Definition File](#model-definition-guide) and [Creating a Service Definition File](#service-definition-file) sections below.
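
For reference, a minimal `model-definition.yaml` for the `Custom` runtime might look like the sketch below. The model name, paths, and command are placeholders, and the exact schema (including optional keys) is described in the [Creating a Model Definition File](#model-definition-guide) section:

```yaml
models:
  - name: "my-model"          # placeholder model/service name
    model_path: "/models"     # where the model storage folder is mounted
    service:
      start_command:
        - python
        - server.py           # placeholder serving entrypoint
      port: 8000              # port the serving process listens on
      health_check:
        path: /health         # HTTP path polled by health checks
```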

</details>

<a id="model-definition-guide"></a>
Click the `Start Service` button on the Serving page to open the service launcher.

First, provide a service name. The following fields are available:

- **Service Name**: A unique name to identify the endpoint.
- **Open To Public**: This option allows access to the model service without any separate token. By default, it is disabled.
- **Model Storage Folder to Mount**: Select the storage folder containing the model files.
- **Inference Runtime Variant**: Selects the runtime variant for the model service. The available variants are dynamically loaded from the backend and may include `vLLM`, `SGLang`, `NVIDIA NIM`, `Modular MAX`, `Custom`, and others depending on your installation.
- **Environment / Version**: Configure the execution environment for the model service. Selecting a runtime variant automatically filters the environment images.
Select `Enter Command` to paste a CLI command directly. For example:

```
vllm serve /models/my-model --tp 2
```

The system automatically parses the command and fills in the following fields:

- **Start Command**: Enter the command to be executed in model serving directly.
- **Model Mount**: The path where the model storage folder is mounted in the container (default `/models`).
- **Port**: Auto-detected from the command (default `8000`). The port number that the model serving process listens on.
- **Health Check URL**: Auto-detected from the command (default `/health`). The HTTP endpoint path called during service health checks.
- **Initial Delay**: Seconds to wait before the first health check after the service starts (default `60.0`).
- **Max Retries**: Maximum number of health check attempts before the service is considered failed (default `10`).
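
As an illustration of the auto-detection described above, a port could be recovered from the start command roughly like this (a sketch only; the WebUI's actual parser may behave differently):

```python
import shlex

def detect_port(start_command: str, default: int = 8000) -> int:
    """Sketch: extract a --port value from a serving command,
    falling back to the documented default of 8000."""
    tokens = shlex.split(start_command)
    for i, tok in enumerate(tokens):
        if tok == "--port" and i + 1 < len(tokens):
            return int(tokens[i + 1])
        if tok.startswith("--port="):
            return int(tok.split("=", 1)[1])
    return default

print(detect_port("vllm serve /models/my-model --tp 2"))       # 8000
print(detect_port("vllm serve /models/my-model --port 9000"))  # 9000
```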

:::tip
If the command suggests multi-GPU usage (e.g., `--tp 2`), a GPU hint will appear.
:::
Select `Use Config File` to use the traditional `model-definition.yaml` approach.

When you select the `vLLM` or `SGLang` runtime variant, a **Runtime Parameters** section appears. This section lets you fine-tune the model serving behavior without manually editing configuration files.

SGLang provides the following tabs: **Model Loading**, **Resource Memory**, **Serving Performance**, **Tool Reasoning**, and others.

Key fields in the **Model Loading** tab:

- **Model**: The name or path of the model to use.
- **DType**: The data type for model weights and computation (e.g., `Auto`, `float16`, `bfloat16`).
- **Quantization**: The model quantization method (e.g., `awq`, `gptq`, `fp8`).
- **Context Length**: The maximum context length the model can process.
- **Served Model Name**: The model name to expose at the API endpoint.
- **Trust Remote Code**: Allow execution of custom model code from the model repository.
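
For intuition, these fields roughly correspond to the serving engine's launch flags. The sketch below builds a vLLM-style command line from the field values (illustrative only; the mapping the launcher actually applies may differ):

```python
def to_vllm_args(model_path, dtype=None, quantization=None,
                 context_length=None, served_model_name=None,
                 trust_remote_code=False):
    """Sketch: translate Model Loading fields into vLLM-style flags."""
    args = ["vllm", "serve", model_path]
    if dtype and dtype != "Auto":
        args += ["--dtype", dtype]
    if quantization:
        args += ["--quantization", quantization]
    if context_length:
        args += ["--max-model-len", str(context_length)]
    if served_model_name:
        args += ["--served-model-name", served_model_name]
    if trust_remote_code:
        args.append("--trust-remote-code")
    return args

print(" ".join(to_vllm_args("/models/my-model", dtype="bfloat16",
                            quantization="awq", context_length=4096)))
```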

In addition to runtime parameters, the `vLLM` and `SGLang` runtime variants expose specific environment variables in the **Environment Variables** section of the service launcher:
Click the `Add Rules` button to open the **Add Auto Scaling Rule** editor.

- **Step Size**: A positive integer specifying how many replicas to add or remove per scaling event. The direction (add or remove) is derived automatically from which threshold is configured:

  - Only a minimum threshold is set: `[metric] < [minThreshold]` triggers **Scale In** (replicas decrease when the metric falls below the threshold).
  - Only a maximum threshold is set: `[metric] > [maxThreshold]` triggers **Scale Out** (replicas increase when the metric rises above the threshold).
  - Both thresholds are set: replicas are scaled in or out depending on which boundary the metric crosses (`[minThreshold] < [metric] < [maxThreshold]` is the normal operating range).

- **Time Window**: The time window, in seconds, over which the metric is aggregated and evaluated for scaling. This replaces the legacy `CoolDown Seconds` field and has a different meaning.
- **Min Replicas** and **Max Replicas**: The lower and upper bounds that auto-scaling enforces on the replica count. Auto-scaling will not reduce the number of replicas below **Min Replicas** or increase it above **Max Replicas**.
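
The threshold semantics above can be summarized as a small decision function (an illustrative sketch of the documented behavior, not the scheduler's implementation; in practice the metric is first aggregated over the Time Window):

```python
def scaling_decision(metric, replicas, step_size,
                     min_threshold=None, max_threshold=None,
                     min_replicas=1, max_replicas=10):
    """Return the new replica count after one evaluation of a rule."""
    if min_threshold is not None and metric < min_threshold:
        replicas -= step_size  # metric below minimum -> Scale In
    elif max_threshold is not None and metric > max_threshold:
        replicas += step_size  # metric above maximum -> Scale Out
    # Min Replicas / Max Replicas bound the result in both directions.
    return max(min_replicas, min(max_replicas, replicas))

print(scaling_decision(0.2, replicas=4, step_size=1, min_threshold=0.3))  # 3
print(scaling_decision(0.9, replicas=4, step_size=2, max_threshold=0.8))  # 6
```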
The page uses a search and sort layout at the top:

- **Sort**: Choose how results are ordered. The available options are `Name (A→Z)`, `Name (Z→A)`, `Oldest first`, and `Newest first`.
- **Refresh**: Click the refresh button to reload the card list.

Each card displays the model brand icon, title (or name when no title is set), task tag, relative creation time, and the author with an icon. Cards that have **no compatible presets** for the current project are shown at 50% opacity. You can still open such a card to view its details, but its **Deploy** button is disabled and an error alert is shown in the drawer: **No compatible presets available. This model cannot be deployed.**

If the `MODEL_STORE` project is not set up on the server, the page shows a *Model Store project not found* message with instructions to contact an administrator. If no model cards match your filters, the page displays *No models found*.