We support generating datasets in the alpaca, sharegpt, and chatml formats.
- Example
In supervised fine-tuning, the `instruction` column is concatenated with the `input` column and used as the user prompt, so the user prompt is `instruction\ninput`. The `output` column represents the model response.
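
The concatenation described above can be sketched in a few lines of Python. This is only an illustration, not the project's actual code: `build_user_prompt` is a hypothetical helper, and falling back to the instruction alone when `input` is missing or empty is an assumption based on `input` being optional.

```python
def build_user_prompt(record: dict) -> str:
    """Join the instruction and the optional input with a newline."""
    instruction = record["instruction"]
    user_input = record.get("input", "")
    # Assumption: with no input, the instruction alone serves as the prompt.
    if user_input:
        return f"{instruction}\n{user_input}"
    return instruction


example = {
    "instruction": "Translate to French.",
    "input": "Good morning",
    "output": "Bonjour",
}
prompt = build_user_prompt(example)  # "Translate to French.\nGood morning"
```
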
[
{
"instruction": "user instruction (required)",
"input": "user input (optional)",
"output": "model response (required)"
}
]
- Example
Compared to the alpaca format, the sharegpt format allows datasets to have more roles, such as human, gpt, observation, and function. They are presented as a list of objects in the `conversations` column.
Note that human and observation messages should appear in odd positions, while gpt and function messages should appear in even positions. The model learns from the gpt and function messages.
In our implementation, only human and gpt will be used.
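
The position rule above can be checked with a small validator. This is a minimal sketch, assuming positions are counted from 1; `check_turn_order` and the two role sets are hypothetical names used only for illustration.

```python
PROMPT_ROLES = {"human", "observation"}  # expected at odd positions (1st, 3rd, ...)
RESPONSE_ROLES = {"gpt", "function"}     # expected at even positions (2nd, 4th, ...)


def check_turn_order(conversations: list) -> bool:
    """Return True if every turn's role matches its position."""
    for position, turn in enumerate(conversations, start=1):
        expected = PROMPT_ROLES if position % 2 == 1 else RESPONSE_ROLES
        if turn["from"] not in expected:
            return False
    return True


valid = check_turn_order([
    {"from": "human", "value": "user instruction"},
    {"from": "gpt", "value": "model response"},
])
```
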
[
{
"conversations": [
{
"from": "human",
"value": "user instruction (required)"
},
{
"from": "gpt",
"value": "model response (required)"
}
]
}
]
- Example
Like the sharegpt format, the chatml format also allows datasets to have more roles, such as user, assistant, system, and tool. They are presented as a list of objects in the `messages` column.
In our implementation, only user and assistant will be used.
[
{
"messages": [
{
"role": "user",
"content": "user instruction (required)"
},
{
"role": "assistant",
"content": "model response (required)"
}
]
}
]
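
Because the sharegpt and chatml records shown here differ only in key names and role labels, one can be converted to the other mechanically. The sketch below is illustrative rather than part of the project: `sharegpt_to_chatml` is a hypothetical helper, and the mapping of human to user and gpt to assistant is an assumption. Turns with other roles are skipped, since only these two roles are used.

```python
# Assumed role correspondence between the two formats.
ROLE_MAP = {"human": "user", "gpt": "assistant"}


def sharegpt_to_chatml(record: dict) -> dict:
    """Convert one sharegpt record into a chatml record."""
    messages = [
        {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        for turn in record["conversations"]
        if turn["from"] in ROLE_MAP  # skip roles outside the assumed mapping
    ]
    return {"messages": messages}


sharegpt_record = {
    "conversations": [
        {"from": "human", "value": "user instruction"},
        {"from": "gpt", "value": "model response"},
    ]
}
chatml_record = sharegpt_to_chatml(sharegpt_record)
```
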