
Commit 44ee7e6

Merge pull request #2888 from CIeNET-International:doc/multimodal

PiperOrigin-RevId: 852109457
2 parents: fbc8fef + 3336b53

1 file changed: docs/tutorials/posttraining/multimodal.md (17 additions, 4 deletions)
````diff
@@ -25,7 +25,14 @@ Multimodal Large Language Models (LLMs) extend traditional text-only models by i
 
 ## Checkpoint Conversion
 
-Recently we have onboarded a new centralized tool for bidirectional checkpoint conversion between MaxText and HuggingFace ([README](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/MaxText/utils/ckpt_conversion/README.md)). This tool is used for the Gemma3 model family. Use this command to convert an unscanned checkpoint from HuggingFace to MaxText, and save it to `MAXTEXT_CKPT_GCS_PATH`:
+Recently we have onboarded a new centralized tool for bidirectional checkpoint conversion between MaxText and HuggingFace ([README](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/MaxText/utils/ckpt_conversion/README.md)).
+
+Install PyTorch:
+```
+python3 -m pip install torch --index-url https://download.pytorch.org/whl/cpu
+```
+
+Then use this command to convert an unscanned checkpoint from HuggingFace to MaxText, and save it to `MAXTEXT_CKPT_GCS_PATH`:
 
 ```shell
 export HF_ACCESS_TOKEN=hf_...
````
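The conversion command itself is cut off at the bottom of this hunk, and the CPU-only wheel in the `pip install` line suggests the converter needs torch only to read the HuggingFace weights, not for accelerated compute. As a rough sketch of the truncated command, assuming the entry point `MaxText.utils.ckpt_conversion.to_maxtext` from the linked README and reusing only flags that appear elsewhere in this tutorial (`base_output_directory` and the exact flag set are assumptions; defer to the README):

```shell
# Sketch only: the entry point and flag names are assumptions based on the
# linked ckpt_conversion README; verify against the repo before running.
export HF_ACCESS_TOKEN=hf_...
export MAXTEXT_CKPT_GCS_PATH=gs://...
python3 -m MaxText.utils.ckpt_conversion.to_maxtext MaxText/configs/base.yml \
  model_name=gemma3-4b \
  hf_access_token=$HF_ACCESS_TOKEN \
  base_output_directory=$MAXTEXT_CKPT_GCS_PATH \
  use_multimodal=true \
  scan_layers=false
```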
````diff
@@ -66,7 +73,7 @@ python -m MaxText.decode \
   MaxText/configs/base.yml \
   model_name=gemma3-4b \
   hf_access_token=$HF_ACCESS_TOKEN \
-  tokenizer_path=assets/tokenizer.gemma3 \
+  tokenizer_path=src/MaxText/assets/tokenizer.gemma3 \
   load_parameters_path=$MAXTEXT_CKPT_GCS_PATH/0/items \
   per_device_batch_size=1 \
   run_name=ht_test \
````
````diff
@@ -77,7 +84,7 @@ python -m MaxText.decode \
   scan_layers=false \
   use_multimodal=true \
   prompt='Describe image <start_of_image>' \
-  image_path='MaxText/test_assets/test_image.jpg' \
+  image_path='src/MaxText/test_assets/test_image.jpg' \
   attention='dot_product'
 ```
 
````
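Reading the two decode hunks together, the updated single-image decode command now reads as follows; the `...` stands for the few flags that fall between the hunks and are not visible in this diff:

```shell
python -m MaxText.decode \
  MaxText/configs/base.yml \
  model_name=gemma3-4b \
  hf_access_token=$HF_ACCESS_TOKEN \
  tokenizer_path=src/MaxText/assets/tokenizer.gemma3 \
  load_parameters_path=$MAXTEXT_CKPT_GCS_PATH/0/items \
  per_device_batch_size=1 \
  run_name=ht_test \
  ... \
  scan_layers=false \
  use_multimodal=true \
  prompt='Describe image <start_of_image>' \
  image_path='src/MaxText/test_assets/test_image.jpg' \
  attention='dot_product'
```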
````diff
@@ -94,10 +101,15 @@ Describe image <start_of_image><end_of_turn>
 To decode with multiple images at once, you can provide multiple image paths like this:
 
 ```
+export TARGET_LENGTH=...   # Adjust to fit expected output length
+export PREDICT_LENGTH=...  # Adjust to fit image tokens + text prompt
+
 python -m MaxText.decode \
   MaxText/configs/base.yml \
   model_name=gemma3-4b \
   ... \
+  max_prefill_predict_length=$PREDICT_LENGTH \
+  max_target_length=$TARGET_LENGTH \
   image_path=/path/to/image1.jpg,/path/to/image2.jpg \
   prompt="Describe each image in a short sentence."  # <start_of_image> will be added to prompt if not provided
   # or prompt="Describe each image in a short sentence: <start_of_image> and <start_of_image>"
````
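As a hypothetical sizing example for the two variables introduced above (the per-image token count is an assumption: Gemma 3 is commonly described as encoding each image as 256 tokens, but check your model's config):

```shell
# Hypothetical budget for a two-image prompt, assuming ~256 tokens per image.
export PREDICT_LENGTH=1024  # covers 2 * 256 image tokens plus the text prompt
export TARGET_LENGTH=2048   # prefill budget plus room for the generated answer
```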
````diff
@@ -113,8 +125,9 @@ Here, we use [ChartQA](https://huggingface.co/datasets/HuggingFaceM4/ChartQA) as
 
 
 ```shell
+export UNSCANNED_CKPT_PATH=...  # either set to an already available MaxText ckpt or to the one we just converted in the previous step
 python -m MaxText.sft_trainer \
-  $MAXTEXT_REPO_ROOT/configs/sft-vision-chartqa.yml \
+  src/MaxText/configs/sft-vision-chartqa.yml \
   run_name="chartqa-sft" \
   model_name=gemma3-4b \
   tokenizer_path="google/gemma-3-4b-it" \
````
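The hunk cuts the `sft_trainer` command off after `tokenizer_path`; a sketch of how it might continue, reusing only names that appear elsewhere in this tutorial (every flag below `tokenizer_path` is an assumption, not part of this diff):

```shell
# Sketch only: the flags after tokenizer_path are assumptions; check
# sft-vision-chartqa.yml and the tutorial for the actual continuation.
python -m MaxText.sft_trainer \
  src/MaxText/configs/sft-vision-chartqa.yml \
  run_name="chartqa-sft" \
  model_name=gemma3-4b \
  tokenizer_path="google/gemma-3-4b-it" \
  load_parameters_path=$UNSCANNED_CKPT_PATH \
  hf_access_token=$HF_ACCESS_TOKEN \
  use_multimodal=true \
  per_device_batch_size=1
```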
