
How to Construct a Reward Model?

To build an effective reward model, the first step is choosing the appropriate granularity. Rather than evaluating only the final answer, we work at step-level granularity, which better supports LLM capabilities such as reflection and backtracking. When constructing the fine-tuning data, we distinguish the steps of each solution by line number so that the more detailed cognitive process can be captured, as sketched below.
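As a concrete illustration, the following Python sketch splits a solution into numbered steps so that each step can carry its own reward label. The function name, field names, and label scheme are assumptions for illustration, not the exact format of our pipeline.

```python
# Hypothetical sketch: split a model solution into numbered steps so each
# step can receive its own step-level reward label.
# (Field names and the +1 / -1 / 0 label convention are assumptions.)

def build_step_level_examples(question: str, solution: str, step_labels: list[int]) -> list[dict]:
    """Pair each solution line with a step-level correctness label."""
    steps = [line.strip() for line in solution.splitlines() if line.strip()]
    assert len(steps) == len(step_labels), "one label per step is expected"

    examples = []
    for idx, (step, label) in enumerate(zip(steps, step_labels), start=1):
        examples.append({
            "question": question,
            "context": steps[: idx - 1],   # steps preceding the current one
            "step_index": idx,             # line number used to distinguish steps
            "step": step,
            "label": label,                # e.g. +1 correct, -1 incorrect, 0 neutral
        })
    return examples


if __name__ == "__main__":
    sol = "Compute 3 * 4 = 12.\nAdd 5 to get 17.\nThe answer is 17."
    for ex in build_step_level_examples("What is 3*4+5?", sol, [1, 1, 1]):
        print(ex["step_index"], ex["step"], ex["label"])
```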

Meta-Evaluation

We evaluated both open-source and proprietary reward models on subsets of the PRM800K and MR-GSM8K datasets and compared their performance. The results, presented in the tables below, show that o1-mini consistently performs best across both datasets.

| Model | F1 score |
| --- | --- |
| o1-mini | 0.855 |
| GPT-4o-mini | 0.722 |
| Math-shepherd | 0.734 |
| ReasonEval-7B | 0.728 |
| ReasonEval-34B | 0.735 |

Results on the subset of MR-GSM8K

| Model | F1 score |
| --- | --- |
| GPT-4o-mini | 0.756 |
| o1-mini | 0.880 |
| o1-preview | 0.867 |

Results on the subset of PRM800K
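The F1 scores above can be read as measuring agreement between a model's per-step verdicts and the reference annotations, with erroneous steps treated as the positive class. The sketch below illustrates one way such an F1 could be computed; the binary label convention is an assumption for illustration only.

```python
# Illustrative sketch of the F1 computation used for meta-evaluation:
# compare a reward model's per-step verdicts against reference annotations,
# treating "step flagged as erroneous" as the positive class (an assumption).

def f1_score(predicted: list[bool], reference: list[bool]) -> float:
    """F1 over binary step-level verdicts (True = step flagged as erroneous)."""
    tp = sum(p and r for p, r in zip(predicted, reference))
    fp = sum(p and not r for p, r in zip(predicted, reference))
    fn = sum(not p and r for p, r in zip(predicted, reference))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    preds = [True, False, True, False, True]
    refs = [True, False, False, False, True]
    print(f"F1 = {f1_score(preds, refs):.3f}")  # prints F1 = 0.800
```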