We've trained a model to achieve a new state-of-the-art in mathematical problem solving by rewarding each correct step of reasoning (“process supervision”) instead of simply rewarding the correct final answer (“outcome supervision”). In addition to boosting performance relative to outcome supervision, process supervision also has an important alignment benefit: it directly trains the model to produce a chain-of-thought that is endorsed by humans.
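To make the contrast between the two kinds of supervision concrete, here is a minimal sketch of how the reward signal might differ for a single multi-step solution. It is an illustration only, not the training setup used for the model: the function names `outcome_rewards` and `process_rewards`, the 0/1 reward values, and the per-step correctness labels are all assumptions made for this example.

```python
from typing import List


def outcome_rewards(num_steps: int, final_answer_correct: bool) -> List[float]:
    """Outcome supervision (hypothetical sketch): a single reward for the
    whole solution, attached to the last step; earlier steps get no signal."""
    rewards = [0.0] * num_steps
    if num_steps > 0:
        rewards[-1] = 1.0 if final_answer_correct else 0.0
    return rewards


def process_rewards(step_is_correct: List[bool]) -> List[float]:
    """Process supervision (hypothetical sketch): one reward per reasoning
    step, based on a label saying whether that step is correct."""
    return [1.0 if ok else 0.0 for ok in step_is_correct]


# A three-step solution whose second step is wrong but whose final answer
# happens to be right (assumed labels, for illustration).
step_labels = [True, False, True]

outcome = outcome_rewards(len(step_labels), final_answer_correct=True)
process = process_rewards(step_labels)

print(outcome)  # [0.0, 0.0, 1.0]
print(process)  # [1.0, 0.0, 1.0]
```

In this toy case the outcome signal rewards the solution in full even though it contains a flawed step, while the per-step signal points at the exact step that went wrong. That is the intuition behind the claim that process supervision trains the model toward reasoning that people would endorse.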
Improving mathematical reasoning with process supervision
