Train_loss = 0 and Eval_loss = NaN in stage2_sft #31

@xuxiaoang

Hello!
Thank you for your work on MLLMs.
I have run into a fine-tuning bug that I cannot fix: when I run the stage2_sft.sh script and train with speech_conv_datasets only, the logger reports a train loss of 0 at every step and an eval loss of NaN, as shown in the screenshot below.
Screenshot 2024-07-20 210750
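
One sanity check that might help narrow this down is to look at how many label positions in a batch are actually supervised. The sketch below is not part of the AnyGPT code. It assumes the labels follow the common PyTorch / Hugging Face convention of marking ignored positions with -100, and the helper names `supervised_fraction` and `mean_loss` are hypothetical. If every position in a batch is ignored, a mean over the supervised tokens is a mean over zero elements, which is one way a NaN loss can appear.

```python
import math

IGNORE_INDEX = -100  # assumption: PyTorch / Hugging Face default ignore index


def supervised_fraction(label_rows, ignore_index=IGNORE_INDEX):
    """Return the share of label positions that contribute to the loss."""
    total = sum(len(row) for row in label_rows)
    if total == 0:
        return 0.0
    kept = sum(1 for row in label_rows for tok in row if tok != ignore_index)
    return kept / total


def mean_loss(per_token_losses, label_rows, ignore_index=IGNORE_INDEX):
    """Average per-token losses over supervised positions only.

    Returns NaN when no position is supervised, mirroring a 0/0 mean.
    """
    kept = [
        loss
        for losses, row in zip(per_token_losses, label_rows)
        for loss, tok in zip(losses, row)
        if tok != ignore_index
    ]
    return sum(kept) / len(kept) if kept else math.nan


if __name__ == "__main__":
    # Hypothetical batch in which every label is masked out.
    labels = [[-100, -100, -100], [-100, -100, -100]]
    losses = [[1.2, 0.7, 0.9], [1.1, 0.4, 0.8]]
    print(supervised_fraction(labels))
    print(mean_loss(losses, labels))
```

In a real run, `label_rows` would come from converting a batch's `labels` tensor to nested lists (for example with `.tolist()`). A fraction of 0.0 on the speech conversation data would point at the preprocessing or masking step rather than at the optimizer settings.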

The command in stage2_sft.sh is as follows:

torchrun \
    --nproc_per_node 2 \
    anygpt/src/train/stage2_sft.py \
    --model_name_or_path "${METAROOT}" \
    --run_name "mm_sft" \
    --cache_dir ${CACHEROOT} \
    --report_to "wandb" \
    --speech_conv_datasets "$speech_conv_datasets" \
    --speech_datasets "$speech_datasets" \
    --preprocessing_num_workers 100 \
    --bf16 True \
    --do_train \
    --do_eval \
    --output_dir "${OUTROOT}" \
    --model_max_length 4096 \
    --save_strategy "steps" \
    --save_steps 5 \
    --evaluation_strategy "steps" \
    --eval_steps 5 \
    --max_steps 5 \
    --concatenating False \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --val_set_size 10 \
    --num_train_epochs 3 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --log_level debug \
    --logging_steps 1 \
    --overwrite_output_dir False \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --use_flash_attn True \
    --ddp_timeout 7200 \
    --save_total_limit 10

I'm using the following Python environment:

transformers              4.34.1
huggingface-hub           0.24.0
tokenizers                0.14.1
torch                     2.1.0
torchaudio                2.1.0
torchvision               0.16.0
flash-attn                2.5.9.post1
