
LLM Alignment: Reward-Based vs Reward-Free Methods | by Anish Dubey | Jul, 2024



Optimization strategies for LLM alignment

Towards Data Science

Language models have demonstrated remarkable abilities in producing a wide range of compelling text based on prompts provided by users. However, defining what constitutes "good" text is difficult, since it often depends on personal preferences and the specific context. For instance, in storytelling, creativity is key; in crafting informative content, accuracy and reliability are crucial; and when generating code, ensuring it runs correctly is essential. Hence the "LLM alignment problem," which refers to the challenge of ensuring that large language models (LLMs) act in ways that are consistent with human values, intentions, and preferences.

Designing a loss function that captures the diverse qualities we value in text (such as creativity, accuracy, or executability) is highly complex and often impractical. Concepts like these are not differentiable and hence cannot be back-propagated or trained on with simple next-token generation.

Imagine if we could harness human feedback to evaluate the quality of generated text or, even better, use that feedback as a guiding loss function to improve the model's performance. This concept is at the heart of Reinforcement Learning from Human Feedback (RLHF). By applying reinforcement learning techniques, RLHF allows us to fine-tune language models based on direct human feedback, aligning the models more closely with nuanced human values and expectations. This approach has opened up new possibilities for training language models that are not only more responsive but also more aligned with the complexity of human preferences.

Below, we will aim to learn more about RLHF via the reward-based method and then about RLHF via the reward-free method.

Let's go through Reinforcement Learning from Human Feedback (RLHF). It consists of three main phases:

  1. Supervised fine-tuning
  2. Reward modeling phase
  3. RL fine-tuning phase

Supervised fine-tuning

RLHF starts with a pre-trained model which is already fine-tuned on a high-quality dataset. Its objective is simple: when given an input (prompt), it produces an output. The ultimate objective here is to further fine-tune this model to produce output according to human preference. Hence, let's call this the base model for reference. At this point, the model is a vanilla base model which is not aware of any human preference.
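As a rough, hypothetical illustration of this stage, the sketch below fine-tunes GPT-2 on a single made-up prompt/response pair with the ordinary next-token prediction loss; the model choice, data, and hyperparameters are placeholder assumptions, not a recipe from any specific RLHF pipeline.

```python
# Minimal supervised fine-tuning sketch (assumptions: GPT-2 as the base model,
# a toy prompt/response pair; not any particular production setup).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Hypothetical high-quality (prompt, response) pair.
pairs = [("What is the capital of France?", "The capital of France is Paris.")]
texts = [prompt + " " + response for prompt, response in pairs]

batch = tokenizer(texts, return_tensors="pt", padding=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.train()
# Next-token prediction loss; in a real setup, padding tokens would be masked
# out of the labels.
outputs = model(**batch, labels=batch["input_ids"])
outputs.loss.backward()
optimizer.step()
```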

Reward modeling phase

Reward model innovation: This is where the new innovation begins on how reward models are incorporated into RLHF. The idea behind the reward model is that a new LLM, which can be the same as the above-mentioned base model, will have the ability to generate a human preference score. The reason it is similar to a large language model is that this model also needs to understand the language semantics before it can rate whether an output is human-preferred or not. Since the reward is a scalar, we add a linear layer on top of the LLM to generate a scalar score in terms of human preference.
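A minimal sketch of this architecture, assuming a Hugging Face GPT-2 backbone and a hypothetical linear value head; the class and variable names are illustrative, not taken from any particular paper or library:

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class RewardModel(nn.Module):
    """LLM backbone plus a linear head that maps the last hidden state to a scalar reward."""
    def __init__(self, backbone_name: str = "gpt2"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)
        self.value_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        # Use the hidden state of the final non-padded token as a summary of (x, y).
        last_idx = attention_mask.sum(dim=1) - 1
        summary = hidden[torch.arange(hidden.size(0)), last_idx]
        return self.value_head(summary).squeeze(-1)  # one scalar reward per sequence

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
reward_model = RewardModel()

batch = tokenizer(["What is the capital of France? The capital of France is Paris."],
                  return_tensors="pt", padding=True)
print(reward_model(**batch))  # scalar score for the prompt + response pair
```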

Data collection phase: This is done using the base model from the supervised fine-tuning stage, which is asked to generate 2 outputs for a given text. Example: for an input token x, two output tokens y1 and y2 are generated by the base model. These outputs are shown to human raters, and the human preference is recorded for each individual output.

Training phase: Once the data samples are collected from the data collection phase, the reward model is trained with the following prompt: "Given the following input: <x>, the LLM generated the output <y>. Can you rate the performance of the output?" The model will output r (the reward), and we already know which of the two outputs the human raters preferred from the data collection phase. Now, this can be back-propagated with the loss function and the model can be trained. Below is the objective loss function which the model optimizes through back-propagation:

LR(rΦ, Ɗ) = −E(x, yw, yl)~Ɗ [ log σ( rΦ(x, yw) − rΦ(x, yl) ) ]

Equation from this paper: https://arxiv.org/pdf/2305.18290

Notation:

  • rΦ(x, y): a reward model parameterized by Φ which estimates the reward. Parameterized means we don't know the actual values and they need to be optimized from the above equation. This is the reward LLM model itself. Mostly, the LLM parameters are frozen here and only a few parameters are left to change; the most important one is the linear layer added on top, which does most of the learning to rate the score of an output.
  • Ɗ: a dataset of triplets (x, yw, yl) where x is the input, yw is the winner output, and yl is the loser output
  • σ: the sigmoid function, which maps the difference in reward to a probability (0–1)
  • E(x, yw, yl)~Ɗ: means x, yw, yl are all sampled from Ɗ, i.e., the expectation is taken over triplets drawn from the dataset

Example scenario: Imagine you are training a reward model to evaluate responses. You have pairs of responses to a given prompt, and human feedback tells you which response is better. For context, for x ("What is the capital of France?"), you have yw ("The capital of France is Paris.") as the winner and yl ("The capital of France is Berlin.") as the loser. The reward model should eventually learn to give a higher reward to the "The capital of France is Paris." output than to the "The capital of France is Berlin." output when the input "What is the capital of France?" is given.
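Continuing the hypothetical RewardModel sketch above (it reuses reward_model and tokenizer from that snippet), a minimal illustration of this training step on a single assumed (x, yw, yl) triplet could look like this; the loss is the pairwise term written out in the equation above:

```python
import torch.nn.functional as F

# Hypothetical triplet collected from human raters.
x = "What is the capital of France?"
y_w = "The capital of France is Paris."   # winner
y_l = "The capital of France is Berlin."  # loser

win = tokenizer(x + " " + y_w, return_tensors="pt")
lose = tokenizer(x + " " + y_l, return_tensors="pt")

r_w = reward_model(**win)   # scalar reward for the preferred output
r_l = reward_model(**lose)  # scalar reward for the rejected output

# -log sigmoid(r_w - r_l): pushes the winner's reward above the loser's.
loss = -F.logsigmoid(r_w - r_l).mean()
loss.backward()
```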

RL fine-tuning phase

Reinforcement learning idea: Now that the base model and reward model are trained, the question is how to leverage the reward model score and update the base model parameters to reflect human preference. Since the reward model outputs a scalar score and is not differentiable, we cannot use simple back-propagation to update the base model parameters. Hence, we need other techniques to update the base model. This is where reinforcement learning comes in, which helps the base model change its parameters through the reward model score. This is done through PPO (Proximal Policy Optimization). Understanding the core architecture of PPO is not required to understand this concept, so we will not cover it here, but at a high level the idea is that PPO can use a scalar score to update base model parameters. Now let's understand how the base and reward models are combined to make the base model learn human preference.

RL fine-tuning idea: In reinforcement learning, we have actions, a state space, and rewards. The idea is to come up with a policy which tells the agent what action to take in a given state so as to maximize the reward. This can get quite complicated, but in a simplified sense, π is the policy, which is our base LLM model itself. Πref denotes the base model and ΠӨ denotes a different, optimal LLM model which we are trying to obtain. We need to find ΠӨ (the base model's neural network weights will be fine-tuned) which gives human-preferred output. It's just that we don't know ΠӨ, and the idea is to find this optimal model.

RL training and feedback loop phase: An input x is given to two policy models, Πref (baseline model) and ΠӨ (optimal model which we are trying to obtain). Initially both models are kept the same. Giving the input x to the two models separately produces two outputs, respectively. The output from the ΠӨ model is also fed to the reward model (input: x, output: y, as discussed above), which is asked to output the reward score rΦ(x, y). Now we have three things: the output from the baseline model, the output from the optimal model, and a reward score for the optimal model's output. There are two things we are optimizing here: one is to maximize the reward, because ultimately we want the model to be as close to human preference as possible, and the other is to minimize the divergence from the baseline model. Maximizing the reward is easy since it is already a scalar quantity, but how do we minimize the divergence between the baseline and optimal models? Here we use the "Kullback–Leibler divergence," which estimates the difference between two continuous probability distributions. Let's take a deeper look into the objective loss function:

max ΠӨ  E x~Ɗ, y~ΠӨ(y | x) [ rΦ(x, y) ] − β·DKL( ΠӨ(y | x) || Πref(y | x) )

Equation from this paper: https://arxiv.org/pdf/2305.18290

Notation:

  • rΦ(x, y): a scalar value for an input x and output y (from the optimal model). To be explicit, the output from the optimal model is fed into the reward model.
  • DKL(ΠӨ (y | x) || Πref (y | x)): this computes the Kullback–Leibler divergence between two probability distributions. Each token from each model is a probability distribution, and KL estimates how far the distributions are from each other.
  • β: hyperparameter which is used to determine how important it is to keep the optimal model close to the baseline model.

Example scenario: Imagine you ask "What is the capital of France?", Πref (baseline model) says "The capital of France is Berlin." and ΠӨ (optimal model) says "There are 3 capitals, Paris, Versailles, and Lyon, but Paris is considered the official capital." Now rΦ("x: What is the capital…", "y: There are 3 capitals…") should give a low score since it is less human-preferred, and the Kullback–Leibler divergence DKL(ΠӨ (y | x) || Πref (y | x)) should be high as well, since the probability distributions of the two outputs differ. Hence the loss will be high from both terms. We don't want the model to only optimize for reward but also to stay closer to the baseline model, and hence both terms are used in the objective. In the next iteration, with some learning, let's say ΠӨ (optimal model) says "The capital of France is Delhi." In this case the model has learned to stay closer to Πref (baseline model) and output a format closer to the baseline model, but the reward component will still be low. Hopefully, in the third iteration, ΠӨ (optimal model) is able to learn and output "The capital of France is Paris." with a higher reward and with its output aligning closely with the baseline model.
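A minimal, hedged sketch of how the two terms could be combined in code. The function name, the per-token KL computation, and the toy tensors are all illustrative assumptions; the full PPO machinery (advantages, clipping, value function) is deliberately left out:

```python
import torch
import torch.nn.functional as F

beta = 0.1  # hypothetical KL-penalty weight

def rlhf_loss(policy_logits, ref_logits, response_mask, reward):
    """Negative of (reward - beta * KL(PiTheta || PiRef)), averaged over the batch.

    policy_logits, ref_logits: (batch, seq_len, vocab) logits from PiTheta and PiRef
    response_mask: (batch, seq_len) with 1.0 on generated response tokens
    reward: (batch,) scalar scores rPhi(x, y) from the reward model
    """
    policy_logp = F.log_softmax(policy_logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    # Per-token KL divergence between the two next-token distributions.
    kl_per_token = (policy_logp.exp() * (policy_logp - ref_logp)).sum(dim=-1)
    kl = (kl_per_token * response_mask).sum(dim=1)
    objective = reward - beta * kl   # the quantity RLHF tries to maximize
    return -objective.mean()         # turned into a loss to minimize

# Toy tensors only (batch=2, seq_len=5, vocab=50) to show the shapes involved.
policy_logits = torch.randn(2, 5, 50, requires_grad=True)
loss = rlhf_loss(policy_logits, torch.randn(2, 5, 50),
                 torch.ones(2, 5), torch.tensor([0.7, -0.2]))
loss.backward()
```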

The diagram below helps illustrate the logic. I would also highly recommend going through the RLHF blog post from Hugging Face.

Image by author, inspired by https://huggingface.co/blog/rlhf

With RLHF using a reward-based method in mind, let's move to the reward-free method. According to the paper: "our key insight is to leverage an analytical mapping from reward functions to optimal policies, which enables us to transform a loss function over reward functions into a loss function over policies. This change-of-variables approach avoids fitting an explicit, standalone reward model, while still optimizing under existing models of human preferences". Very complicated to understand, but let's try to break this down in simple stages in the next section.

Reward-free method's key idea: In RLHF, a separate new reward model is trained, which is expensive and costly to maintain. Is there any mechanism to avoid training a new reward model and use the existing base model to reach a new optimal model? This is exactly what the reward-free method does, i.e., it avoids training a new reward model and instead changes the equation in such a way that there is no reward model term in the loss function of DPO (Direct Preference Optimization). One way to think about this is that we need to reach the optimal model policy (ΠӨ) from the base model (Πref). It can be reached either by optimizing over the reward function space, which acts as a proxy for reaching the optimal model policy, or by directly learning a mapping function from reward to policy and in turn optimizing the policy itself. This is exactly what the authors have done: they remove the reward function component from the loss function and replace it directly with the model policy parameters. This is what the authors mean when they say "leverage an analytical mapping from reward function to optimal policies …. into a loss function over policies". This is the core innovation of the paper.
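To make the "analytical mapping" more concrete, here is the change of variables the authors rely on, restated from the DPO paper in LaTeX form (π_θ stands for the optimal model ΠӨ, π_ref for the baseline Πref, and Z(x) is a normalization term):

```latex
% Optimal policy of the KL-constrained reward-maximization objective:
\pi_\theta(y \mid x) \;=\; \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y \mid x)\,
  \exp\!\Big(\tfrac{1}{\beta}\, r_\Phi(x, y)\Big)

% Inverting it expresses the reward purely through the two policies:
r_\Phi(x, y) \;=\; \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
  \;+\; \beta \log Z(x)
```

Substituting this expression for the reward into the pairwise reward-model loss makes the intractable Z(x) term cancel, which leaves a loss written purely over policies: the DPO loss shown below.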

DPO training and feedback loop phase: Using Πref (baseline model), the input x is given and the model is asked to produce two outputs (y1 and y2). The x, y1, and y2 are shown to human raters, who decide the winning output yw and the losing output yl. An offline dataset is collected with the triplet information <x, yw, yl>. With this information, we know what the winning (human-preferred) and losing (not human-preferred) answers are. Now, the same input x is given to the two policies (models), Πref (baseline model) and ΠӨ (optimal model). Initially both models are kept the same for training purposes. For both the reference and the optimal model, we then compute the probabilities they assign to the winning and losing answers, and the loss contrasts these. Let's take a deeper look into the objective loss function:

LDPO(ΠӨ; Πref) = −E(x, yw, yl)~Ɗ [ log σ( β·log( ΠӨ(yw | x) / Πref(yw | x) ) − β·log( ΠӨ(yl | x) / Πref(yl | x) ) ) ]

Equation from https://arxiv.org/pdf/2305.18290

Notation:

  • ΠӨ (yw | x): the probability that the optimal model assigns to the winning output yw given the input x. The same quantity is computed for the other combinations as well: Πref (yw | x), Πref (yl | x), and ΠӨ (yl | x). The log-ratio of the optimal and baseline model probabilities, log( ΠӨ (y | x) / Πref (y | x) ), acts as an implicit reward, so the loss pushes the model to raise the likelihood of the winning output and lower the likelihood of the losing output relative to the baseline model. Each of these terms is a scalar value.
  • β: hyperparameter which is used to determine how important it is to keep the optimal model close to the baseline model.
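A minimal, hedged sketch of the DPO loss for a single assumed triplet, using two GPT-2 copies as stand-ins for ΠӨ and Πref; the helper function and hyperparameter value are illustrative, not the authors' implementation:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

beta = 0.1  # hypothetical hyperparameter
tokenizer = AutoTokenizer.from_pretrained("gpt2")
policy = AutoModelForCausalLM.from_pretrained("gpt2")     # stand-in for PiTheta (trained)
reference = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in for PiRef (frozen)
reference.requires_grad_(False)

def sequence_logprob(model, text):
    """Sum of log P(token_t | tokens_<t) over the sequence, i.e. log Pi(prompt + response)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    logits = model(ids).logits[:, :-1]                     # predictions for tokens 1..T
    logp = F.log_softmax(logits, dim=-1)
    return logp.gather(2, ids[:, 1:].unsqueeze(-1)).sum()

# Hypothetical preference triplet <x, yw, yl>.
x = "What is the capital of France?"
y_w, y_l = "The capital of France is Paris.", "The capital of France is Berlin."

# For simplicity the log-probability is summed over prompt + response tokens;
# the prompt terms cancel when the two log-ratios are subtracted.
logratio_w = sequence_logprob(policy, x + " " + y_w) - sequence_logprob(reference, x + " " + y_w)
logratio_l = sequence_logprob(policy, x + " " + y_l) - sequence_logprob(reference, x + " " + y_l)

# DPO loss: -log sigmoid( beta * (winner log-ratio - loser log-ratio) )
loss = -F.logsigmoid(beta * (logratio_w - logratio_l))
loss.backward()
```
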
Image by author, inspired by https://huggingface.co/blog/rlhf
Naturally, the question comes down to which one is better: RLHF via the reward-based method using PPO, or the reward-free method using DPO? There is no right answer to this question. A recent paper, "Is DPO Superior to PPO for LLM Alignment?" (paper link), compares the two and concludes that PPO is generally better than DPO, and that DPO suffers more heavily from out-of-distribution data. "Out-of-distribution" data means the human preference data is different from the data the baseline model was trained on. This can happen if the base model is trained on one dataset while the preference output is collected on some other dataset.

Overall, the research is still out on which one is better, while we have seen companies like OpenAI, Anthropic, and Meta leverage both RLHF via PPO and DPO as tools for LLM alignment.


