configuration of SFT on SDAR-4B-Chat  #55

@zaiquanyang

Hi, I ran into an issue when reproducing SFT on SDAR-4B-Chat on a machine with 8 H20 GPUs.

I prepared the dataset sft_openr1math_sdar.json (2k samples) following the guidance and trained SDAR-4B-Chat with sft_sdar.yaml:

wandb:
  entity: null
  resume: 'auto'


experiment:
    project: "sft_sdar_4b_b32" # must be the same as this file's name
    num_node: 1 # the number of machines you have


model:
    pretrained_model: "/mnt/workspace/yyzq/data/dLLM/model_zoo/SDAR-4B-Chat-b32" # absolute path of your model
    optimized_name: "optimized" # the output name for your optimized model, will be saved under sft_sdar/ckpt



# sft dataset
dataset:
    optimization_data: "sft_openr1math_sdar" # "sft_openr1math_sdar"

training:
    gradient_checkpointing_enable: True # set to True if sequences are very long
    gradient_accumulation_steps: 16
    batch_size_lm: 1
    mixed_precision: "bf16"
    enable_tf32: True
    seed: 10086
    num_train_epochs: 1
    max_grad_norm: 1
    method: "semi-ar" # "semi-ar""trace"
    lower_p: 0.1
    upper_p: 0.9
    block_size: 32 # the block size of your model
    shrink: 1
    post_num: 0 # number of pad tokens that need to be trained for each data point
    max_gen_length: 2000
    max_prompt_len: 784



optimizer:
    name: adamw
    params: # default adamw params
        learning_rate: 1e-5
        scale_lr: False # scale learning rate by total batch size
        beta1: 0.9
        beta2: 0.999
        weight_decay: 0.0
        epsilon: 1e-8

lr_scheduler:
    scheduler: "cosine"
    params:
        learning_rate: ${optimizer.params.learning_rate}
        warmup_steps: 0
        min_lr_scale: 1.0
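For context, the training budget implied by this config can be estimated with a few lines of arithmetic. This is only a rough sketch: it assumes one data-parallel process per GPU on the 8-GPU machine, that the dataset has exactly 2000 samples, and that `min_lr_scale` is the ratio of the final learning rate to the base learning rate. The `cosine_lr` helper is a generic cosine schedule written for illustration, not the repository's scheduler.

```python
import math

# Values copied from the config above.
num_samples = 2000                 # "2k samples" (treated as exact here)
batch_size_lm = 1
gradient_accumulation_steps = 16
num_train_epochs = 1
base_lr = 1e-5

# Assumption: one data-parallel process per GPU (8 GPUs in total).
num_gpus = 8

# Samples consumed per optimizer update, and number of updates in one epoch.
effective_batch = batch_size_lm * gradient_accumulation_steps * num_gpus
optimizer_steps = math.ceil(num_samples * num_train_epochs / effective_batch)


def cosine_lr(step, total_steps, base_lr=1e-5, warmup_steps=0, min_lr_scale=1.0):
    """Generic cosine decay with linear warmup (illustrative, hypothetical helper).

    Assumes min_lr_scale is the final learning rate divided by base_lr.
    """
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return base_lr * (min_lr_scale + (1.0 - min_lr_scale) * cosine)


schedule = [cosine_lr(s, optimizer_steps, base_lr) for s in range(optimizer_steps)]
print(effective_batch, optimizer_steps)  # prints: 128 16
```

Under these assumptions the run performs about 16 optimizer updates, and with `min_lr_scale: 1.0` and `warmup_steps: 0` the schedule stays flat at 1e-5 for all of them.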

After training, the saved model reaches acc: 0.572, avg length: 619.924, while the base model reaches acc: 0.606, avg length: 642.392.

The sdar_eval yaml is:

experiment:
    project: "sdar_eval" # must be the same as this file's name
    num_node: 1 # the number of machines you have
    node_index: 0 # no need to change



# model: "/mnt/workspace/yyzq/data/dLLM/model_zoo/SDAR-4B-Chat-b32"
model: "/mnt/workspace/yyzq/data/dLLM/dLLM-RL/sft_sdar_4b_b32/ckpt/optimized" # absolute path of your model
model_base: "sdar" # set sdar for TraDo and SDAR


# dataset you want to eval on, you need to download first, you can also modify your own dataset, see instructions in ./data
dataset:
    eval_dataset: "MATH500" #"MBPP""MATH500""GSM8K""AIME2024""GPQA""LiveCodeBench""HumanEval""LiveBench"
    data_type: "math" #"code""math"

execute:
    num_chunk: 128

rollout:
    tensor_parallel_size: 1 # set to 1 by default, if oom, try reduce max_active first, if still oom, set tensor_parallel_size to 8
    max_active: 256
    num_response_per_task: 1
    temperature: 1.0
    max_token: 4096 # max generation token num
    block_size: 32
    denoising_steps_per_block: 32
    top_p: 1.0
    top_k: 0
    remasking_strategy: "low_confidence_dynamic" #"low_confidence_static""low_confidence_dynamic"
    dynamic_threshold: 0.99
    start_with_think: False # if not reasoning model, set to False, otherwise True
    output_unmasking_history: True

There is a large gap compared with the results in Table 1, and I am not sure what causes it.
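Since the evaluation samples a single response per task at temperature 1.0, part of the difference between the two accuracies above may be sampling noise. The sketch below gives a rough estimate of that noise. It assumes MATH500 contains 500 problems and treats the two runs as independent binomial samples, which is only an approximation because both runs use the same problems.

```python
import math

n = 500  # assumption: number of problems in MATH500
acc_base, acc_sft = 0.606, 0.572  # accuracies reported above


def binomial_se(p, n):
    """Standard error of an accuracy p measured on n independent problems."""
    return math.sqrt(p * (1.0 - p) / n)


gap = acc_base - acc_sft
# Standard error of the difference between two independent estimates.
se_diff = math.sqrt(binomial_se(acc_base, n) ** 2 + binomial_se(acc_sft, n) ** 2)
z = gap / se_diff
problems = round(gap * n)  # how many problems the gap corresponds to

print(f"gap={gap:.3f} ({problems} problems), se_diff={se_diff:.3f}, z={z:.2f}")
```

Under these assumptions the gap of 0.034 corresponds to 17 problems and is roughly one standard error of the difference, so averaging over several responses per task (a larger `num_response_per_task`) would give a more reliable comparison.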
