Preference Fine-Tuning
Preference fine-tuning allows you to train models using pairs of preferred and non-preferred examples. This approach is more effective than standard fine-tuning when you have paired examples that show which responses your model should generate and which it should avoid.
We use Direct Preference Optimization (DPO) for this type of fine-tuning.
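For reference, DPO trains the policy $\pi_\theta$ against a frozen reference model $\pi_{\text{ref}}$ on triples of a prompt $x$, a preferred response $y_w$, and a non-preferred response $y_l$, minimizing the objective below (from the DPO paper). The coefficient $\beta$ corresponds to the `--dpo-beta` hyperparameter described later on this page:

$$
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]
$$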
Data Preparation
Your dataset should contain examples with:
- An `input` field with messages in the conversational format.
- A `preferred_output` field with the ideal assistant response.
- A `non_preferred_output` field with a suboptimal assistant response.
Both outputs must contain exactly one message from the assistant role.
Format your data in JSONL, with each line structured as:
```json
{
  "input": {
    "messages": [
      {
        "role": "assistant",
        "content": "Hello, how can I assist you today?"
      },
      {
        "role": "user",
        "content": "Can you tell me about the rise of the Roman Empire?"
      }
    ]
  },
  "preferred_output": [
    {
      "role": "assistant",
      "content": "The Roman Empire rose from a small city-state founded in 753 BCE. Through military conquests and strategic alliances, Rome expanded across the Italian peninsula. After the Punic Wars, it grew even stronger, and in 27 BCE, Augustus became the first emperor, marking the start of the Roman Empire. This led to a period of peace and prosperity known as the Pax Romana."
    }
  ],
  "non_preferred_output": [
    {
      "role": "assistant",
      "content": "The Roman Empire rose due to military strength and strategic alliances."
    }
  ]
}
```
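Before uploading, you can sanity-check your file against these requirements with a short script. This is only an illustrative sketch; the helper function and the file name `preference_data.jsonl` are placeholders, not part of the bytecompute SDK.

```python
# Minimal sketch: sanity-check a preference JSONL file against the format above.
# The helper below is illustrative only, not part of the bytecompute SDK.
import json

def validate_preference_file(path: str) -> None:
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            example = json.loads(line)

            # Every example needs all three top-level fields.
            for field in ("input", "preferred_output", "non_preferred_output"):
                assert field in example, f"line {line_number}: missing '{field}'"

            assert example["input"].get("messages"), \
                f"line {line_number}: 'input.messages' must be a non-empty list"

            # Both outputs must contain exactly one assistant message.
            for field in ("preferred_output", "non_preferred_output"):
                output = example[field]
                assert len(output) == 1 and output[0]["role"] == "assistant", \
                    f"line {line_number}: '{field}' must contain exactly one assistant message"

validate_preference_file("preference_data.jsonl")
```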
Launching preference fine-tuning
Hyperparameters
- Set `--training-method="dpo"`.
- The `--dpo-beta` parameter controls how much the model is allowed to deviate from its reference (pre-tuned) model during fine-tuning. The default value is `0.1`, but you can experiment with values between `0.05` and `0.9`.
  - A lower value of beta (e.g., 0.1) allows the model to update more aggressively toward preferred responses.
  - A higher value of beta (e.g., 0.7) keeps the updated model closer to the reference behavior.
- The `--dpo-normalize-logratios-by-length` parameter (optional, default is `False`) enables normalization of log ratios by sample length during the DPO loss calculation.
- The `--rpo-alpha` coefficient (optional, default is `0.0`) incorporates the NLL loss on the preferred (chosen) samples with the corresponding weight.
- The `--simpo-gamma` coefficient (optional, default is `0.0`) adds a margin to the loss calculation, force-enables log ratio normalization (`--dpo-normalize-logratios-by-length`), and excludes reference logits from the loss computation. The resulting loss function is equivalent to the one used in the SimPO paper.

For an illustration of how these coefficients enter the loss, see the sketch after the launch examples below.

```shell
bytecompute fine-tuning create \
  --training-file $FILE_ID \
  --model "meta-llama/Llama-3.2-3B-Instruct" \
  --wandb-api-key $WANDB_API_KEY \
  --lora \
  --training-method "dpo" \
  --dpo-beta 0.2
```

```python
import os

from bytecompute import bytecompute

client = bytecompute(api_key=os.environ.get("bytecompute_API_KEY"))

file_id = "your-training-file-id"  # your training file ID

response = client.fine_tuning.create(
    training_file=file_id,
    model='meta-llama/Llama-3.2-3B-Instruct',
    lora=True,
    training_method='dpo',
    dpo_beta=0.2,
    rpo_alpha=1.0,
    simpo_gamma=1.0,
)

print(response)
```
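To make these options concrete, here is an illustrative sketch of how a DPO-style loss could combine them. This is not bytecompute's internal implementation; the function name `preference_loss`, the tensor arguments, and the exact way `--rpo-alpha` and length normalization are applied are assumptions for illustration only.

```python
# Illustrative sketch of a DPO-style loss with the knobs described above.
# NOT bytecompute's internal implementation; names and formulas are assumptions.
import torch
import torch.nn.functional as F

def preference_loss(
    policy_chosen_logps: torch.Tensor,    # summed log-probs of preferred tokens, shape (batch,)
    policy_rejected_logps: torch.Tensor,  # summed log-probs of non-preferred tokens, shape (batch,)
    ref_chosen_logps: torch.Tensor,       # same quantities under the frozen reference model
    ref_rejected_logps: torch.Tensor,
    chosen_lengths: torch.Tensor,         # token counts of the preferred responses
    rejected_lengths: torch.Tensor,       # token counts of the non-preferred responses
    dpo_beta: float = 0.1,
    normalize_logratios_by_length: bool = False,
    rpo_alpha: float = 0.0,
    simpo_gamma: float = 0.0,
) -> torch.Tensor:
    chosen_lengths = chosen_lengths.clamp(min=1).float()
    rejected_lengths = rejected_lengths.clamp(min=1).float()

    # simpo_gamma > 0 force-enables length normalization and drops the reference logits.
    if simpo_gamma > 0.0:
        normalize_logratios_by_length = True
        ref_chosen_logps = torch.zeros_like(ref_chosen_logps)
        ref_rejected_logps = torch.zeros_like(ref_rejected_logps)

    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps

    if normalize_logratios_by_length:
        chosen_logratio = chosen_logratio / chosen_lengths
        rejected_logratio = rejected_logratio / rejected_lengths

    # Core term: push the preferred log-ratio above the non-preferred one, scaled
    # by beta; simpo_gamma adds a target margin between the two.
    margin = dpo_beta * (chosen_logratio - rejected_logratio) - simpo_gamma
    loss = -F.logsigmoid(margin)

    # rpo_alpha mixes in a plain NLL term on the preferred (chosen) responses.
    if rpo_alpha > 0.0:
        nll = -policy_chosen_logps / chosen_lengths
        loss = loss + rpo_alpha * nll

    return loss.mean()
```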
Note
- For LoRA long-context fine-tuning, we currently use half of the context length for the preferred response and half for the non-preferred response. So, if you are using a 32K model, the effective context length will be 16K.
- Preference fine-tuning calculates loss based on the preferred and non-preferred outputs. Therefore, the `--train-on-inputs` flag is ignored with preference fine-tuning.
Combining methods: supervised fine-tuning & preference fine-tuning
Supervised fine-tuning (SFT) is the default method on our platform. The recommended approach is to perform SFT first, followed by preference fine-tuning, as follows:
- First perform supervised fine-tuning (SFT) on your data.
- Then refine with preference fine-tuning using continued fine-tuning on your SFT checkpoint.
Performing SFT on your dataset prior to DPO can significantly improve the resulting model quality, especially if your training data differs substantially from the data the base model observed during pretraining. To perform SFT, you can concatenate the context with the preferred output and use one of our SFT data formats, as in the sketch below.
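The following is a minimal sketch of that conversion. It assumes the preference JSONL format above and a conversational SFT format with a single `messages` field (check the SFT data format docs for the exact schema your job expects); the file names are placeholders.

```python
# Minimal sketch: turn a preference dataset into SFT data by concatenating the
# input messages with the preferred output. File names and the single "messages"
# field for SFT examples are illustrative assumptions.
import json

with open("preference_data.jsonl", encoding="utf-8") as src, \
        open("sft_data.jsonl", "w", encoding="utf-8") as dst:
    for line in src:
        example = json.loads(line)
        sft_example = {
            "messages": example["input"]["messages"] + example["preferred_output"]
        }
        dst.write(json.dumps(sft_example) + "\n")
```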
