
Fine-tune Llama 3 with ORPO

A cheaper and faster unified fine-tuning approach

By Maxime Labonne, Towards Data Science, April 2024
Image generated with DALL-E 3 by the author

ORPO is an exciting new fine-tuning technique that combines the traditional supervised fine-tuning and preference alignment stages into a single process. This reduces the computational resources and time required for training. Moreover, empirical results show that ORPO outperforms other alignment methods on various model sizes and benchmarks.

In this article, we will fine-tune the new Llama 3 8B model using ORPO with the TRL library. The code is available on Google Colab and in the LLM Course on GitHub.

Instruction tuning and preference alignment are essential techniques for adapting Large Language Models (LLMs) to specific tasks. Traditionally, this involves a multi-stage process: 1/ Supervised Fine-Tuning (SFT) on instructions to adapt the model to the target domain, followed by 2/ preference alignment methods like Reinforcement Learning from Human Feedback (RLHF) or Direct Preference Optimization (DPO) to increase the likelihood of generating preferred responses over rejected ones.

Image by author

However, researchers have identified a limitation in this approach. While SFT effectively adapts the model to the desired domain, it inadvertently increases the probability of generating undesirable answers alongside preferred ones. This is why the preference alignment stage is necessary to widen the gap between the likelihoods of preferred and rejected outputs.

Note how the probability of rejected responses increases during supervised fine-tuning (image from the ORPO paper).

Introduced by Hong and Lee (2024), ORPO offers an elegant solution to this problem by combining instruction tuning and preference alignment into a single, monolithic training process. ORPO modifies the standard language modeling objective, combining the negative log-likelihood loss with an odds ratio (OR) term. This OR loss weakly penalizes rejected responses while strongly rewarding preferred ones, allowing the model to simultaneously learn the target task and align with human preferences.
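To make the objective concrete, here is a minimal PyTorch sketch of the combined loss. It assumes chosen_logps and rejected_logps are the length-normalized average log-probabilities of the chosen and rejected completions; this is a simplified illustration, not TRL's exact implementation.

import torch
import torch.nn.functional as F

def orpo_loss(nll_loss, chosen_logps, rejected_logps, beta=0.1):
    # odds(y|x) = p / (1 - p), computed in log space for stability
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))
    # The OR term rewards a large gap between chosen and rejected odds
    or_loss = -F.logsigmoid(log_odds_chosen - log_odds_rejected).mean()
    # Total objective: standard NLL (SFT) loss plus the weighted OR term
    return nll_loss + beta * or_loss

In TRL, the beta parameter of ORPOConfig plays the role of this weighting term (the λ of the paper, discussed later in this article).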

ORPO has been implemented in the major fine-tuning libraries, like TRL, Axolotl, and LLaMA-Factory. In the next section, we will see how to use it with TRL.

Llama 3 is the latest family of LLMs developed by Meta. The models were trained on an extensive dataset of 15 trillion tokens (compared to 2T tokens for Llama 2). Two model sizes have been released: a 70 billion parameter model and a smaller 8 billion parameter model. The 70B model has already demonstrated impressive performance, scoring 82 on the MMLU benchmark and 81.7 on the HumanEval benchmark.

Llama 3 models also increased the context length up to 8,192 tokens (versus 4,096 tokens for Llama 2), and potentially scale up to 32k with RoPE. Additionally, the models use a new tokenizer with a 128K-token vocabulary, reducing the number of tokens required to encode text by 15%. This larger vocabulary also explains the bump from 7B to 8B parameters, as the quick calculation below shows.
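A back-of-envelope estimate, assuming a hidden size of 4,096 and untied input/output embeddings, shows that the vocabulary jump from 32K to roughly 128K tokens accounts for most of the extra parameters:

# Approximate parameter cost of the larger vocabulary
# (assumes hidden size 4096 and untied input/output embeddings)
llama3_embed = 128_256 * 4_096 * 2  # ~1.05B parameters
llama2_embed = 32_000 * 4_096 * 2   # ~0.26B parameters
print(f"~{(llama3_embed - llama2_embed) / 1e9:.1f}B extra parameters")  # ~0.8B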

Samples from ORPO-DPO-mix-40k (image by author).

ORPO requires a preference dataset, including a prompt, a chosen answer, and a rejected answer. In this example, we will use mlabonne/orpo-dpo-mix-40k, a combination of several high-quality DPO datasets.

Thanks to argilla, unalignment, M4-ai, and jondurbin for providing the source datasets.
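For reference, each sample follows the usual preference-pair schema, with the "chosen" and "rejected" columns stored as message lists. A hypothetical row looks like this (the content is invented for illustration):

sample = {
    "prompt": "What is the capital of France?",
    "chosen": [
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."},
    ],
    "rejected": [
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "France is a country in Western Europe."},
    ],
}

This message-list format is what the apply_chat_template() call in the formatting step below expects.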

As per usual, let's start by installing the required libraries:

pip install -U transformers datasets accelerate peft trl bitsandbytes wandb

Once everything is installed, we can import the necessary libraries and log in to W&B (optional):

import gc
import os

import torch
import wandb
from datasets import load_dataset
from google.colab import userdata
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
)
from trl import ORPOConfig, ORPOTrainer, setup_chat_format

# Log in to Weights & Biases with a token stored in Colab secrets
wb_token = userdata.get('wandb')
wandb.login(key=wb_token)

If you have a recent GPU, you should also be able to use the Flash Attention library to replace the default eager attention implementation with a more efficient one.

# Use Flash Attention 2 on Ampere GPUs (compute capability >= 8) or newer
if torch.cuda.get_device_capability()[0] >= 8:
    !pip install -qqq flash-attn
    attn_implementation = "flash_attention_2"
    torch_dtype = torch.bfloat16
else:
    attn_implementation = "eager"
    torch_dtype = torch.float16

In the following, we will load the Llama 3 8B model in 4-bit precision thanks to bitsandbytes. We then set the LoRA configuration using PEFT for QLoRA. I'm also using the convenient setup_chat_format() function to modify the model and tokenizer for ChatML support. It automatically applies the chat template, adds special tokens, and resizes the model's embedding layer to match the new vocabulary size.

# Model
base_model = "meta-llama/Meta-Llama-3-8B"
new_model = "OrpoLlama-3-8B"

# QLoRA config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch_dtype,
    bnb_4bit_use_double_quant=True,
)

# LoRA config
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=['up_proj', 'down_proj', 'gate_proj', 'k_proj', 'q_proj', 'v_proj', 'o_proj']
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Load model
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation=attn_implementation
)
model, tokenizer = setup_chat_format(model, tokenizer)
model = prepare_model_for_kbit_training(model)

Now that the model is ready for training, we can take care of the dataset. We load mlabonne/orpo-dpo-mix-40k and use the apply_chat_template() function to convert the "chosen" and "rejected" columns into the ChatML format. Note that I'm only using 1,000 samples and not the entire dataset, as it would take too long to run.

dataset_name = "mlabonne/orpo-dpo-mix-40k"
dataset = load_dataset(dataset_name, split="all")
dataset = dataset.shuffle(seed=42).select(range(1000))

def format_chat_template(row):
    row["chosen"] = tokenizer.apply_chat_template(row["chosen"], tokenize=False)
    row["rejected"] = tokenizer.apply_chat_template(row["rejected"], tokenize=False)
    return row

dataset = dataset.map(
    format_chat_template,
    num_proc=os.cpu_count(),
)
dataset = dataset.train_test_split(test_size=0.01)

First, we need to set a few hyperparameters:

  • learning_rate: ORPO uses very low learning rates compared to traditional SFT or even DPO. This value of 8e-6 comes from the original paper, and roughly corresponds to an SFT learning rate of 1e-5 and a DPO learning rate of 5e-6. I would recommend adjusting it to around 1e-6 for a real fine-tune.
  • beta: This is the λ parameter in the paper, with a default value of 0.1. An appendix of the original paper shows how it was chosen with an ablation study.
  • Other parameters, like max_length and batch size, are set to use as much VRAM as available (~20 GB in this configuration). Ideally, we would train the model for 3-5 epochs, but we'll stick to 1 here.

Finally, we can train the model using the ORPOTrainer, which acts as a wrapper.

orpo_args = ORPOConfig(
    learning_rate=8e-6,
    beta=0.1,
    lr_scheduler_type="linear",
    max_length=1024,
    max_prompt_length=512,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=4,
    optim="paged_adamw_8bit",
    num_train_epochs=1,
    evaluation_strategy="steps",
    eval_steps=0.2,
    logging_steps=1,
    warmup_steps=10,
    report_to="wandb",
    output_dir="./results/",
)

trainer = ORPOTrainer(
    model=model,
    args=orpo_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    peft_config=peft_config,
    tokenizer=tokenizer,
)

trainer.train()
trainer.save_model(new_model)

Training the model on these 1,000 samples took about 2 hours on an L4 GPU. Let's check the W&B plots:

While the loss goes down, the difference between the chosen and rejected answers is not clear: the average margin and accuracy are only slightly above zero and 0.5, respectively.
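As a reminder of what these two curves track, here is a simplified sketch of how such reward metrics are typically derived in TRL-style trainers (the rewards are the policy log-probabilities scaled by beta; this is an illustration, not the exact ORPOTrainer code):

import torch

def reward_metrics(chosen_logps, rejected_logps, beta=0.1):
    # Rewards are the policy log-probabilities scaled by beta
    chosen_rewards = beta * chosen_logps
    rejected_rewards = beta * rejected_logps
    margin = (chosen_rewards - rejected_rewards).mean()            # healthy runs push this well above 0
    accuracy = (chosen_rewards > rejected_rewards).float().mean()  # and this well above 0.5
    return margin, accuracy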

In the original paper, the authors trained models on the Anthropic/hh-rlhf dataset (161k samples) for 10 epochs, which is a lot longer than our quick run. They also experimented with Llama 3 and kindly shared their logs with me (thanks, Jiwoo Hong).

To complete this tutorial, let's merge the QLoRA adapter with the base model and push it to the Hugging Face Hub.

# Flush memory
del trainer, model
gc.collect()
torch.cuda.empty_cache()

# Reload tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
model, tokenizer = setup_chat_format(model, tokenizer)

# Merge adapter with base model
model = PeftModel.from_pretrained(model, new_model)
model = model.merge_and_unload()
model.push_to_hub(new_model, use_temp_dir=False)
tokenizer.push_to_hub(new_model, use_temp_dir=False)
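Since we already imported pipeline, we can quickly sanity-check the merged model before moving on. The prompt here is just an example:

# Quick test of the merged model with the ChatML template
messages = [{"role": "user", "content": "What is a large language model?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
outputs = generator(prompt, max_new_tokens=128, do_sample=True, temperature=0.7)
print(outputs[0]["generated_text"])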

Congrats, we finished this quick fine-tune of Llama 3: mlabonne/OrpoLlama-3-8B. You can play with it using this Hugging Face Space (here's a notebook to make your own). Although the model is undertrained, as highlighted by the W&B curves, I ran some evaluations on Nous' benchmark suite using LLM AutoEval.

Our ORPO fine-tune is actually quite decent and improves the base model's performance on every benchmark. This is encouraging, and likely means that a fine-tune on the entire 40k samples would yield great results.

This is an exciting time for the open-source community, with more and more high-quality open-weight models being released. The gap between closed-source and open-weight models is slowly closing, and fine-tuning is an essential tool to get the best performance for your use cases.

Image by author

In this article, we introduced the ORPO algorithm and explained how it unifies the SFT and preference alignment stages into a single process. Then, we used TRL to fine-tune a Llama 3 8B model on a custom preference dataset. The final model shows encouraging results and highlights ORPO's potential as a new fine-tuning paradigm.

I hope it was useful, and I recommend running the Colab notebook to fine-tune your own Llama 3 models. In future articles, we will see how to create high-quality datasets, a point that is often overlooked. If you liked this article, please follow me on Hugging Face and Twitter @maximelabonne.

  • J. Hong, N. Lee, and J. Thorne, ORPO: Monolithic Preference Optimization without Reference Model. 2024.
  • L. von Werra et al., TRL: Transformer Reinforcement Learning. GitHub, 2020. [Online]. Available: https://github.com/huggingface/trl
  • A. Bartolome, G. Martin, and D. Vila, Notus. GitHub, 2023. [Online]. Available: https://github.com/argilla-io/notus
  • AI at Meta, Introducing Meta Llama 3. 2024.




