
Commit 63ab0ad

Merge pull request #2508 from Bihan/update_llama_example_new

[Example] Update Llama 4 Examples

2 parents: b38ff56 + 83353fb

7 files changed: +226 −17 lines

docs/examples.md

Lines changed: 3 additions & 13 deletions

```diff
@@ -129,24 +129,14 @@ hide:
         Deploy and train Deepseek models
       </p>
     </a>
-    <a href="/examples/llms/llama31"
+    <a href="/examples/llms/llama"
        class="feature-cell sky">
       <h3>
-        Llama 3.1
+        Llama
       </h3>

       <p>
-        Deploy and fine-tune Llama 3.1
-      </p>
-    </a>
-    <a href="/examples/llms/llama32"
-       class="feature-cell sky">
-      <h3>
-        Llama 3.2
-      </h3>
-
-      <p>
-        Deploy Llama 3.2 vision models
+        Deploy Llama 4 models
       </p>
     </a>
   </div>
```
File renamed without changes.

docs/examples/llms/llama32/index.md

Whitespace-only changes.

examples/llms/llama/README.md

Lines changed: 171 additions & 0 deletions (new file)

# Llama

This example walks you through how to deploy the Llama 4 Scout model with `dstack`.

??? info "Prerequisites"
    Once `dstack` is [installed](https://dstack.ai/docs/installation), clone the repo and run `dstack init`.

    <div class="termy">

    ```shell
    $ git clone https://github.com/dstackai/dstack
    $ cd dstack
    $ dstack init
    ```

    </div>

## Deployment

Here's an example of a service that deploys
[`Llama-4-Scout-17B-16E-Instruct` :material-arrow-top-right-thin:{ .external }](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct){:target="_blank"}
using [SGLang :material-arrow-top-right-thin:{ .external }](https://github.com/sgl-project/sglang){:target="_blank"} or [vLLM :material-arrow-top-right-thin:{ .external }](https://github.com/vllm-project/vllm){:target="_blank"}
on NVIDIA `H200` GPUs.

=== "SGLang"

    <div editor-title="examples/llms/llama/sglang/nvidia/.dstack.yml">

    ```yaml
    type: service
    name: llama4-scout

    image: lmsysorg/sglang
    env:
      - HF_TOKEN
      - MODEL_ID=meta-llama/Llama-4-Scout-17B-16E-Instruct
      - CONTEXT_LEN=256000
    commands:
      - python3 -m sglang.launch_server
        --model-path $MODEL_ID
        --tp $DSTACK_GPUS_NUM
        --context-length $CONTEXT_LEN
        --kv-cache-dtype fp8_e5m2
        --port 8000

    port: 8000
    # Register the model
    model: meta-llama/Llama-4-Scout-17B-16E-Instruct

    resources:
      gpu: H200:2
      disk: 500GB..
    ```

    </div>

=== "vLLM"

    <div editor-title="examples/llms/llama/vllm/nvidia/.dstack.yml">

    ```yaml
    type: service
    name: llama4-scout

    image: vllm/vllm-openai
    env:
      - HF_TOKEN
      - MODEL_ID=meta-llama/Llama-4-Scout-17B-16E-Instruct
      - VLLM_DISABLE_COMPILE_CACHE=1
      - MAX_MODEL_LEN=256000
    commands:
      - |
        vllm serve $MODEL_ID \
          --tensor-parallel-size $DSTACK_GPUS_NUM \
          --max-model-len $MAX_MODEL_LEN \
          --kv-cache-dtype fp8 \
          --override-generation-config='{"attn_temperature_tuning": true}'

    port: 8000
    # Register the model
    model: meta-llama/Llama-4-Scout-17B-16E-Instruct

    resources:
      gpu: H200:2
      disk: 500GB..
    ```

    </div>

!!! info "Note"
    With vLLM, add `--override-generation-config='{"attn_temperature_tuning": true}'` to
    improve accuracy for [contexts longer than 32K tokens :material-arrow-top-right-thin:{ .external }](https://blog.vllm.ai/2025/04/05/llama4.html){:target="_blank"}.
### Memory requirements

Below are the approximate memory requirements for loading the model.
This excludes memory for the model context and CUDA kernel reservations.

| Model      | Size     | FP16  | FP8   | INT4   |
|------------|----------|-------|-------|--------|
| `Behemoth` | **2T**   | 4TB   | 2TB   | 1TB    |
| `Maverick` | **400B** | 800GB | 400GB | 200GB  |
| `Scout`    | **109B** | 218GB | 109GB | 54.5GB |
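The rule of thumb behind these numbers is simply bytes per parameter: 2 for FP16, 1 for FP8, 0.5 for INT4. A minimal illustrative sketch (not part of the example itself):

```python
# Approximate memory needed just to hold the model weights.
# Bytes per parameter: FP16 = 2, FP8 = 1, INT4 = 0.5.
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "INT4": 0.5}

def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Return approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

# Scout has ~109B parameters:
print(weight_memory_gb(109e9, "FP16"))  # 218.0
print(weight_memory_gb(109e9, "INT4"))  # 54.5
```

Actual usage is higher once the KV cache for the configured context length and framework overhead are added on top.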
### Running a configuration

To run a configuration, use the [`dstack apply`](https://dstack.ai/docs/reference/cli/dstack/apply.md) command.

<div class="termy">

```shell
$ HF_TOKEN=...
$ dstack apply -f examples/llms/llama/sglang/nvidia/.dstack.yml

 #  BACKEND  REGION      RESOURCES                       SPOT  PRICE
 1  vastai   is-iceland  48xCPU, 128GB, 2xH200 (140GB)   no    $7.87
 2  runpod   EU-SE-1     40xCPU, 128GB, 2xH200 (140GB)   no    $7.98

Submit the run llama4-scout? [y/n]: y

Provisioning...
---> 100%
```

</div>
Once the service is up, it will be available via the service endpoint
at `<dstack server URL>/proxy/services/<project name>/<run name>/`.

<div class="termy">

```shell
$ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \
    -X POST \
    -H 'Authorization: Bearer &lt;dstack token&gt;' \
    -H 'Content-Type: application/json' \
    -d '{
      "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
      "messages": [
        {
          "role": "system",
          "content": "You are a helpful assistant."
        },
        {
          "role": "user",
          "content": "What is Deep Learning?"
        }
      ],
      "stream": true,
      "max_tokens": 512
    }'
```

</div>
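The same request can be assembled from Python; the sketch below only builds the OpenAI-compatible URL and JSON body shown in the `curl` example above (the helper name and arguments are illustrative, and `main` is assumed to be the project name):

```python
import json

def chat_request(server_url: str, project: str, model: str, prompt: str):
    """Build the model-proxy chat/completions URL and JSON request body."""
    url = f"{server_url}/proxy/models/{project}/chat/completions"
    body = json.dumps({
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        "stream": True,
        "max_tokens": 512,
    })
    return url, body

url, body = chat_request(
    "http://127.0.0.1:3000", "main",
    "meta-llama/Llama-4-Scout-17B-16E-Instruct",
    "What is Deep Learning?",
)
print(url)  # http://127.0.0.1:3000/proxy/models/main/chat/completions
```

To actually send it, pass `url` and `body` to any HTTP client along with the `Authorization: Bearer <dstack token>` header.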
When a [gateway](https://dstack.ai/docs/concepts/gateways.md) is configured, the service endpoint
is available at `https://<run name>.<gateway domain>/`.

[//]: # (TODO: https://github.com/dstackai/dstack/issues/1777)

## Source code

The source code of this example can be found in
[`examples/llms/llama` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/llms/llama).

## What's next?

1. Check [dev environments](https://dstack.ai/docs/dev-environments), [tasks](https://dstack.ai/docs/tasks),
   [services](https://dstack.ai/docs/services), and [protips](https://dstack.ai/docs/protips).
2. Browse [Llama 4 with SGLang :material-arrow-top-right-thin:{ .external }](https://github.com/sgl-project/sglang/blob/main/docs/references/llama4.md)
   and [Llama 4 with vLLM :material-arrow-top-right-thin:{ .external }](https://blog.vllm.ai/2025/04/05/llama4.html).
Lines changed: 23 additions & 0 deletions (new file)

```yaml
type: dev-environment
name: llama4-scout
ide: vscode
image: lmsysorg/sglang
env:
  - HF_TOKEN
  - MODEL_ID=meta-llama/Llama-4-Scout-17B-16E-Instruct
  - CONTEXT_LEN=256000
commands:
  - python3 -m sglang.launch_server
    --model-path $MODEL_ID
    --tp $DSTACK_GPUS_NUM
    --context-length $CONTEXT_LEN
    --port 8000
    --kv-cache-dtype fp8_e5m2

port: 8000
# Register the model
model: meta-llama/Llama-4-Scout-17B-16E-Instruct

resources:
  gpu: H200:2
  disk: 500GB..
```
Lines changed: 24 additions & 0 deletions (new file)

```yaml
type: service
name: llama4-scout

image: vllm/vllm-openai
env:
  - HF_TOKEN
  - MODEL_ID=meta-llama/Llama-4-Scout-17B-16E-Instruct
  - VLLM_DISABLE_COMPILE_CACHE=1
  - MAX_MODEL_LEN=256000
commands:
  - |
    vllm serve $MODEL_ID \
      --tensor-parallel-size $DSTACK_GPUS_NUM \
      --max-model-len $MAX_MODEL_LEN \
      --kv-cache-dtype fp8 \
      --override-generation-config='{"attn_temperature_tuning": true}'

port: 8000
# Register the model
model: meta-llama/Llama-4-Scout-17B-16E-Instruct

resources:
  gpu: H200:2
  disk: 500GB..
```

mkdocs.yml

Lines changed: 5 additions & 4 deletions

```diff
@@ -90,8 +90,10 @@ plugins:
       'docs/tasks.md': 'docs/concepts/tasks.md'
       'docs/services.md': 'docs/concepts/services.md'
       'docs/fleets.md': 'docs/concepts/fleets.md'
-      'docs/examples/llms/llama31.md': 'examples/llms/llama31/index.md'
-      'docs/examples/llms/llama32.md': 'examples/llms/llama32/index.md'
+      'docs/examples/llms/llama31.md': 'examples/llms/llama/index.md'
+      'docs/examples/llms/llama32.md': 'examples/llms/llama/index.md'
+      'examples/llms/llama31/index.md': 'examples/llms/llama/index.md'
+      'examples/llms/llama32/index.md': 'examples/llms/llama/index.md'
       'docs/examples/accelerators/amd/index.md': 'examples/accelerators/amd/index.md'
       'docs/examples/deployment/nim/index.md': 'examples/deployment/nim/index.md'
       'docs/examples/deployment/vllm/index.md': 'examples/deployment/vllm/index.md'
@@ -247,8 +249,7 @@ nav:
       - NIM: examples/deployment/nim/index.md
       - LLMs:
         - Deepseek: examples/llms/deepseek/index.md
-        - Llama 3.1: examples/llms/llama31/index.md
-        - Llama 3.2: examples/llms/llama32/index.md
+        - Llama: examples/llms/llama/index.md
       - Accelerators:
         - AMD: examples/accelerators/amd/index.md
         - Intel Gaudi: examples/accelerators/intel/index.md
```

0 commit comments
