is available at `https://<run name>.<gateway domain>/`.

[//]: # (TODO: https://github.com/dstackai/dstack/issues/1777)

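For reference, here's a minimal sketch of querying the deployed service, assuming it exposes an OpenAI-compatible API (as vLLM and SGLang do); the base URL placeholders and model name below are illustrative, not taken from this example:

```python
# Query the deployed service via its OpenAI-compatible endpoint.
# The base URL and model name are illustrative; substitute your run's
# gateway URL and the model you actually deployed.
from openai import OpenAI

client = OpenAI(
    base_url="https://<run name>.<gateway domain>/v1",
    api_key="<your dstack token>",
)

completion = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[{"role": "user", "content": "Say hello!"}],
)
print(completion.choices[0].message.content)
```
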
## Fine-tuning

Here's an example of FSDP and QLoRA fine-tuning of the 4-bit quantized [Llama-4-Scout-17B-16E :material-arrow-top-right-thin:{ .external }](https://huggingface.co/axolotl-quants/Llama-4-Scout-17B-16E-Linearized-bnb-nf4-bf16) on two NVIDIA H100 GPUs using [Axolotl :material-arrow-top-right-thin:{ .external }](https://github.com/OpenAccess-AI-Collective/axolotl){:target="_blank"}.

<div editor-title="examples/fine-tuning/axolotl/.dstack.yml">

```yaml
type: task
# The name is optional; if not specified, it's generated randomly
name: axolotl-nvidia-llama-scout-train

# Use the official Axolotl Docker image
image: axolotlai/axolotl:main-latest

# Required environment variables
env:
  - HF_TOKEN
  - WANDB_API_KEY
  - WANDB_PROJECT
  - WANDB_NAME=axolotl-nvidia-llama-scout-train
  - HUB_MODEL_ID
# Commands of the task
commands:
  - wget https://raw.githubusercontent.com/axolotl-ai-cloud/axolotl/main/examples/llama-4/scout-qlora-fsdp1.yaml
  - axolotl train scout-qlora-fsdp1.yaml
      --wandb-project $WANDB_PROJECT
      --wandb-name $WANDB_NAME
      --hub-model-id $HUB_MODEL_ID

resources:
  # Two GPUs (required by FSDP)
  gpu: H100:2
  # Shared memory size for inter-process communication
  shm_size: 24GB
  # At least 500GB of disk
  disk: 500GB..
```

</div>

The task uses Axolotl's official Docker image, which comes with Axolotl pre-installed.

### Memory requirements

Below are the approximate memory requirements for loading the model.
This excludes memory for the model context and CUDA kernel reservations.

| Model      | Size     | Full fine-tuning | LoRA  | QLoRA |
|------------|----------|------------------|-------|-------|
| `Behemoth` | **2T**   | 32TB             | 4.3TB | 1.3TB |
| `Maverick` | **400B** | 6.5TB            | 864GB | 264GB |
| `Scout`    | **109B** | 1.75TB           | 236GB | 72GB  |

The memory estimates assume FP16 precision for model weights, with low-rank adaptation (LoRA/QLoRA) layers comprising 1% of the total model parameters. For full fine-tuning, the 16 bytes per parameter account for the FP16 weights and gradients plus the FP32 Adam optimizer state (master weights and two moments).

| Fine-tuning type | Calculation                                  |
|------------------|----------------------------------------------|
| Full fine-tuning | 2T × 16 bytes = 32TB                         |
| LoRA             | 2T × 2 bytes + 1% of 2T × 16 bytes = 4.3TB   |
| QLoRA (4-bit)    | 2T × 0.5 bytes + 1% of 2T × 16 bytes = 1.3TB |

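To make the arithmetic explicit, here's a small Python sketch (not part of the example) that reproduces the estimates above under the same assumptions: 2 bytes per parameter for FP16 weights, 0.5 bytes for 4-bit weights, roughly 16 bytes per parameter for full fine-tuning with Adam, and LoRA adapters sized at 1% of the base model:

```python
# Reproduce the approximate memory estimates from the tables above.
# Assumptions: FP16 weights (2 B/param), 4-bit weights (0.5 B/param),
# full fine-tuning with Adam at ~16 B/param (weights + gradients +
# FP32 optimizer state), LoRA adapters at 1% of base parameters.
# Results differ from the tables only by rounding.

PARAMS = {"Behemoth": 2e12, "Maverick": 400e9, "Scout": 109e9}

def fmt(n_bytes: float) -> str:
    """Format a byte count as TB or GB, matching the tables' units."""
    if n_bytes >= 1e12:
        return f"{n_bytes / 1e12:.2f}TB"
    return f"{n_bytes / 1e9:.0f}GB"

for name, params in PARAMS.items():
    full = params * 16                         # full fine-tuning
    lora = params * 2 + 0.01 * params * 16     # FP16 base + LoRA layers
    qlora = params * 0.5 + 0.01 * params * 16  # 4-bit base + LoRA layers
    print(f"{name:>9}: full={fmt(full)}  LoRA={fmt(lora)}  QLoRA={fmt(qlora)}")
```
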
## Running a configuration

Once the configuration is ready, run `dstack apply -f <configuration file>`, and `dstack` will automatically provision the
cloud resources and run the configuration.

<div class="termy">

```shell
$ HF_TOKEN=...
$ WANDB_API_KEY=...
$ WANDB_PROJECT=...
$ WANDB_NAME=axolotl-nvidia-llama-scout-train
$ HUB_MODEL_ID=...
$ dstack apply -f examples/fine-tuning/axolotl/.dstack.yml
```

</div>
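
Once the run completes, the fine-tuned weights are pushed to `$HUB_MODEL_ID` on Hugging Face. Here's a minimal sketch of loading them for inference, assuming the run pushed a PEFT (LoRA/QLoRA) adapter along with its tokenizer, and that you have enough GPU memory for the base model:

```python
# Load the fine-tuned adapter pushed by the training run.
# Assumes HUB_MODEL_ID points to a PEFT adapter repo on the Hugging Face Hub.
import os

import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

repo_id = os.environ["HUB_MODEL_ID"]

# device_map="auto" shards the model across available GPUs.
model = AutoPeftModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(repo_id)
```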

## Source code

The source code for the deployment examples can be found in
[`examples/llms/llama` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/llms/llama), and the source code for the fine-tuning example in [`examples/fine-tuning/axolotl` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/fine-tuning/axolotl){:target="_blank"}.

## What's next?

1. Check [dev environments](https://dstack.ai/docs/dev-environments), [tasks](https://dstack.ai/docs/tasks),
   [services](https://dstack.ai/docs/services), and [protips](https://dstack.ai/docs/protips).
2. Browse [Llama 4 with SGLang :material-arrow-top-right-thin:{ .external }](https://github.com/sgl-project/sglang/blob/main/docs/references/llama4.md), [Llama 4 with vLLM :material-arrow-top-right-thin:{ .external }](https://blog.vllm.ai/2025/04/05/llama4.html), [Llama 4 with AMD :material-arrow-top-right-thin:{ .external }](https://rocm.blogs.amd.com/artificial-intelligence/llama4-day-0-support/README.html), and [Axolotl :material-arrow-top-right-thin:{ .external }](https://github.com/OpenAccess-AI-Collective/axolotl){:target="_blank"}.