Commit 83219e3

jan-wassenberg authored and copybara-github committed
Add note on attention length and SFP
PiperOrigin-RevId: 738698399
1 parent 3d419ec commit 83219e3

2 files changed

Lines changed: 18 additions & 10 deletions

README.md

Lines changed: 17 additions & 9 deletions
@@ -347,6 +347,12 @@ instruction-tuned and thus does not respond to instructions. Make sure you are
 using an instruction-tuned model (`2b-it-sfp`, `2b-it`, `7b-it-sfp`, `7b-it`)
 and not a pre-trained model (any model with a `-pt` suffix).
 
+**What sequence lengths are supported?**
+
+See `seq_len` in `configs.cc`. For the Gemma 3 models larger than 1B, this is
+typically 32K but 128K would also work given enough RAM. Note that long
+sequences will be slow due to the quadratic cost of attention.
+
 **How do I convert my fine-tune to a `.sbs` compressed model file?**
 
 For PaliGemma (1 and 2) checkpoints, you can use
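
The sequence-length note added in the hunk above attributes the slowdown to the quadratic cost of attention. As a rough illustration, and not code from this commit, the sketch below counts query-key pairs under the assumption of full causal attention in every layer. The function names are hypothetical. Layers that use a bounded local window would do less work than this estimate.

```cpp
#include <cstdint>

// Hypothetical helper: number of query-key pairs scored when processing
// seq_len tokens with full causal attention. Token i attends to tokens 0..i,
// so the total is n * (n + 1) / 2, which grows quadratically in n.
constexpr uint64_t CausalAttentionPairs(uint64_t seq_len) {
  return seq_len * (seq_len + 1) / 2;
}

// Hypothetical helper: how much more attention work `longer` needs compared
// to `shorter`, per layer and per head.
constexpr double AttentionCostRatio(uint64_t longer, uint64_t shorter) {
  return static_cast<double>(CausalAttentionPairs(longer)) /
         static_cast<double>(CausalAttentionPairs(shorter));
}
```

Under these assumptions, `AttentionCostRatio(131072, 32768)` comes out just under 16: going from 32K to 128K tokens is a 4x increase in length but roughly a 16x increase in attention work.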
@@ -372,15 +378,17 @@ pytorch checkpoint. (The code may need updates to work with Gemma-2 models.)
 
 **What are some easy ways to make the model run faster?**
 
-1. Make sure you are using the 8-bit switched floating point `-sfp` models.
-2. If you're on a laptop, make sure power mode is set to maximize performance
-and saving mode is **off**. For most laptops, the power saving modes get
-activated automatically if the computer is not plugged in.
-3. Close other unused cpu-intensive applications.
-4. On macs, anecdotally we observe a "warm-up" ramp-up in speed as performance
-cores get engaged.
-5. Experiment with the `--num_threads` argument value. Depending on the device,
-larger numbers don't always mean better performance.
+1. Make sure you are using the 8-bit switched floating point `-sfp` models.
+These are half the size of bf16 and thus use less memory bandwidth and cache
+space.
+2. If you're on a laptop, make sure power mode is set to maximize performance
+and saving mode is **off**. For most laptops, the power saving modes get
+activated automatically if the computer is not plugged in.
+3. Close other unused cpu-intensive applications.
+4. On macs, anecdotally we observe a "warm-up" ramp-up in speed as performance
+cores get engaged.
+5. Experiment with the `--num_threads` argument value. Depending on the device,
+larger numbers don't always mean better performance.
 
 We're also working on algorithmic and optimization approaches for faster
 inference, stay tuned.
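
The new sentence in item 1 says the `-sfp` models are half the size of bf16. A back-of-the-envelope sketch of that arithmetic follows. It assumes every weight is stored at the stated bit width and ignores any per-tensor metadata or tensors kept in another format. The helper names, the 2-billion parameter count, and the 50 GB/s bandwidth figure are illustrative assumptions, not values from the repository.

```cpp
#include <cstdint>

// Hypothetical helper: bytes needed to store num_params weights at the given
// bit width, assuming no padding or metadata.
constexpr uint64_t WeightBytes(uint64_t num_params, uint64_t bits_per_weight) {
  return num_params * bits_per_weight / 8;
}

// Hypothetical helper: lower bound on seconds per generated token if decoding
// is limited purely by streaming all weights from memory once per token.
constexpr double MinSecondsPerToken(uint64_t weight_bytes,
                                    double bytes_per_second) {
  return static_cast<double>(weight_bytes) / bytes_per_second;
}

// Illustrative parameter count only; real models differ.
constexpr uint64_t kExampleParams = 2000000000ull;
constexpr uint64_t kBf16Bytes = WeightBytes(kExampleParams, 16);  // 4e9 bytes
constexpr uint64_t kSfpBytes = WeightBytes(kExampleParams, 8);    // 2e9 bytes
```

With these assumed numbers, halving the bytes per weight also halves the bandwidth-bound floor on time per token, which is the effect the README item describes.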

gemma/common.cc

Lines changed: 1 addition & 1 deletion
@@ -80,7 +80,7 @@ constexpr PromptWrapping kPromptWrapping[] = {
     PromptWrapping::PALIGEMMA, PromptWrapping::PALIGEMMA,  // PG2 3B 224/448
     PromptWrapping::PALIGEMMA, PromptWrapping::PALIGEMMA,  // PG2 10B 224/448
     PromptWrapping::GEMMA_VLM,  // Gemma3 4B
-    PromptWrapping::GEMMA_IT,   // Gemma3 1B
+    PromptWrapping::GEMMA_PT,   // Gemma3 1B
     PromptWrapping::GEMMA_VLM,  // Gemma3 12B
     PromptWrapping::GEMMA_VLM,  // Gemma3 27B
 };
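
The table in this hunk is positional: each entry is identified only by its trailing comment. The sketch below shows one way such a table could be indexed and kept in sync with a model enum. It is a simplified, self-contained illustration, not the actual code in `common.cc`. The `Model` enum, its enumerator names, the order of the `PromptWrapping` enumerators, and `WrappingFor` are all assumptions made for the example.

```cpp
#include <cstddef>

// Enumerator names taken from the diff; their order here is an assumption.
enum class PromptWrapping { GEMMA_IT, GEMMA_PT, GEMMA_VLM, PALIGEMMA };

// Hypothetical model enum covering only the Gemma 3 rows shown in the hunk.
enum class Model { kGemma3_4B, kGemma3_1B, kGemma3_12B, kGemma3_27B, kCount };

// Positional table: entry i describes the model whose enum value is i.
constexpr PromptWrapping kExampleWrapping[] = {
    PromptWrapping::GEMMA_VLM,  // Gemma3 4B
    PromptWrapping::GEMMA_PT,   // Gemma3 1B (value after this commit)
    PromptWrapping::GEMMA_VLM,  // Gemma3 12B
    PromptWrapping::GEMMA_VLM,  // Gemma3 27B
};

// Fails to compile if a model is added without a matching table entry.
static_assert(sizeof(kExampleWrapping) / sizeof(kExampleWrapping[0]) ==
                  static_cast<size_t>(Model::kCount),
              "table must have one entry per model");

// Hypothetical lookup by model.
constexpr PromptWrapping WrappingFor(Model model) {
  return kExampleWrapping[static_cast<size_t>(model)];
}
```

A size check like the `static_assert` above catches a missing row, but not a row with the wrong value, which is the kind of change this hunk makes by hand.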
