How to Construct a Reward Model?
To build an effective reward model, the first step is choosing the right granularity. Rather than evaluating only the final answer, we work at step-level granularity to strengthen the LLM's ability to reflect and backtrack. In the fine-tuning data, we split each solution by line number so that individual reasoning steps can be judged separately, capturing the cognitive process in more detail.
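The sketch below illustrates this step-level split: a multi-line solution is broken into per-step samples, each paired with the steps before it and its own correctness label. The data class fields and the binary label scheme are illustrative assumptions, not the exact format of our training data.

```python
# Hypothetical sketch: turning a line-numbered solution into step-level
# samples for a process reward model. Field names and labels are assumptions.

from dataclasses import dataclass
from typing import List


@dataclass
class StepSample:
    question: str
    prior_steps: List[str]   # steps preceding the one being judged
    current_step: str        # the step the reward model scores
    label: int               # 1 = correct step, 0 = erroneous step


def split_into_step_samples(question: str, solution: str,
                            step_labels: List[int]) -> List[StepSample]:
    """Split a multi-line solution into one sample per step (line)."""
    steps = [line.strip() for line in solution.splitlines() if line.strip()]
    assert len(steps) == len(step_labels), "expect one label per solution line"
    return [
        StepSample(question, steps[:i], step, label)
        for i, (step, label) in enumerate(zip(steps, step_labels))
    ]


if __name__ == "__main__":
    solution = (
        "Step 1: 3 apples + 2 apples = 5 apples\n"
        "Step 2: 5 * 2 = 11"  # second step contains an arithmetic error
    )
    for sample in split_into_step_samples("How many apples after doubling?",
                                          solution, [1, 0]):
        print(sample.current_step, "->", sample.label)
```

Each sample keeps the full prefix of earlier steps as context, so the reward model scores a step given everything the solver had written up to that point.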
Meta-Evaluation
We evaluated both open-source and proprietary reward models on subsets of the PRM800K and MR-GSM8K datasets and compared their performance. The results, presented in the tables below, show that o1-mini consistently performs best across both datasets.
| Model | F1 score |
|---|---|
| o1-mini | 0.855 |
| GPT-4o-mini | 0.722 |
| Math-shepherd | 0.734 |
| ReasonEval-7B | 0.728 |
| ReasonEval-34B | 0.735 |

Results on the subset of MR-GSM8K
| Model | F1 score |
|---|---|
| GPT-4o-mini | 0.756 |
| o1-mini | 0.880 |
| o1-preview | 0.867 |

Results on the subset of PRM800K
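For context, the F1 scores above can be computed by framing the meta-evaluation as binary step-level error detection, with an erroneous step as the positive class. The labeling convention in this sketch is an assumption, not the verbatim evaluation protocol of PRM800K or MR-GSM8K.

```python
# Hypothetical sketch: F1 over step-level judgments, treating an erroneous
# step as the positive class. Gold and predicted labels are toy values.

from typing import List


def f1(gold: List[int], pred: List[int]) -> float:
    """F1 = harmonic mean of precision and recall for the positive class."""
    tp = sum(g == 1 and p == 1 for g, p in zip(gold, pred))
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)


# gold[i] = 1 if step i is annotated as erroneous, pred[i] = reward model's flag
gold = [0, 0, 1, 0, 1, 1, 0]
pred = [0, 1, 1, 0, 1, 0, 0]
print(f"F1 = {f1(gold, pred):.3f}")
```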