Monday, September 9, 2024

Scaling laws for reward model overoptimization

In reinforcement learning from human feedback, it is common to optimize against a reward model trained to predict human preferences. Because the reward model is an imperfect proxy, optimizing its value too much can hinder ground-truth performance, in accordance with Goodhart's law. This effect has been frequently observed, but not carefully measured due to the expense of collecting human preference data. In this work, we use a synthetic setup in which a fixed "gold-standard" reward model plays the role of humans, providing labels used to train a proxy reward model. We study how the gold reward model score changes as we optimize against the proxy reward model using either reinforcement learning or best-of-n sampling. We find that this relationship follows a different functional form depending on the method of optimization, and that in both cases its coefficients scale smoothly with the number of reward model parameters. We also study how this relationship is affected by the size of the reward model dataset, the number of reward model and policy parameters, and the coefficient of the KL penalty added to the reward in the reinforcement learning setup. We explore the implications of these empirical results for theoretical considerations in AI alignment.
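
The best-of-n side of the setup is simple to sketch: sample n completions from the policy, keep the one the proxy reward model scores highest, and then evaluate that pick with the gold reward model to see how ground-truth quality changes as n grows. The snippet below is a minimal illustration under assumed interfaces; `generate`, `proxy_rm`, and `gold_rm` are hypothetical callables standing in for the policy and the two reward models, and are not part of the original work's code.

```python
import math
from typing import Callable, List

def best_of_n(
    prompt: str,
    generate: Callable[[str], str],          # hypothetical policy sampler
    proxy_rm: Callable[[str, str], float],   # hypothetical proxy reward model score
    n: int,
) -> str:
    """Draw n samples from the policy and keep the one the proxy RM rates highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: proxy_rm(prompt, c))

def gold_score_of_bon(
    prompts: List[str],
    generate: Callable[[str], str],
    proxy_rm: Callable[[str, str], float],
    gold_rm: Callable[[str, str], float],    # hypothetical held-out "gold" reward model
    n: int,
) -> float:
    """Average gold-RM score of best-of-n selections: the quantity tracked as n increases."""
    picks = [best_of_n(p, generate, proxy_rm, n) for p in prompts]
    return sum(gold_rm(p, c) for p, c in zip(prompts, picks)) / len(prompts)

def bon_kl(n: int) -> float:
    """Analytical estimate of the KL divergence from the base policy induced by
    best-of-n selection, KL(n) = log(n) - (n - 1) / n, commonly used to put
    best-of-n and RL optimization on a shared x-axis."""
    return math.log(n) - (n - 1) / n
```

As n increases, the proxy score keeps rising by construction, while the gold score measured by `gold_score_of_bon` is what can eventually flatten or decline, which is the overoptimization effect the abstract describes.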


