How to Construct a Reward Model?
To build an effective reward model, the first step is choosing the right granularity. Rather than evaluating only the final answer, we work at step-level granularity, which better supports the LLM's ability to reflect and backtrack. In the fine-tuning data, we split each solution into numbered lines (as sketched below) so that the reward model can assess each reasoning step individually rather than the solution as a whole.
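The sketch below illustrates this step-level segmentation under stated assumptions; the function names and prompt layout are illustrative, not the exact pipeline used here.

```python
# Minimal sketch of step-level segmentation (illustrative names, not the
# exact pipeline): each solution is split on line breaks so that every
# reasoning step can receive its own reward label.

def segment_solution(solution: str) -> list[dict]:
    """Split a model solution into numbered steps for step-level labeling."""
    steps = [line.strip() for line in solution.splitlines() if line.strip()]
    return [{"step_id": i + 1, "text": step} for i, step in enumerate(steps)]


def format_for_reward_model(question: str, steps: list[dict]) -> str:
    """Render the question and numbered steps as one prompt, so the reward
    model can point to the first erroneous step by its number."""
    numbered = "\n".join(f"Step {s['step_id']}: {s['text']}" for s in steps)
    return f"Question: {question}\n\nSolution:\n{numbered}"


if __name__ == "__main__":
    sol = "Compute 3 * 4 = 12.\nAdd 5 to get 17.\nThe answer is 17."
    steps = segment_solution(sol)
    print(format_for_reward_model("What is 3 * 4 + 5?", steps))
```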
Meta-Evaluation
We evaluated both open-source and proprietary reward models on subsets of the MR-GSM8K and PRM800K datasets. As the tables below show, o1-mini performs best on both benchmarks.
| Model | F1 score | 
|---|---|
| o1-mini | 0.855 | 
| GPT-4o-mini | 0.722 | 
| Math-shepherd | 0.734 | 
| ReasonEval-7B | 0.728 | 
| ReasonEval-34B | 0.735 | 
Results on the subset of MR-GSM8K
| Model | F1 score | 
|---|---|
| GPT-4o-mini | 0.756 | 
| o1-mini | 0.880 | 
| o1-preview | 0.867 | 
Results on the subset of PRM800K
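The F1 scores above can be computed by treating the reward model's judgments as binary classification against human labels. The snippet below is a minimal sketch of that metric, assuming the benchmark reduces to binary labels (erroneous vs. correct); the label convention is illustrative and not taken directly from PRM800K or MR-GSM8K.

```python
# Minimal sketch of the meta-evaluation metric: F1 over binary labels,
# where 1 marks an erroneous solution/step (assumed convention).

def f1_score(gold: list[int], pred: list[int]) -> float:
    """F1 of reward-model judgments (pred) against human labels (gold)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = [1, 0, 1, 1, 0, 0, 1]   # human judgments
    pred = [1, 0, 0, 1, 0, 1, 1]   # reward-model judgments
    print(f"F1 = {f1_score(gold, pred):.3f}")
```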