How to Construct a Reward Model?
To build an effective reward model, the first step is choosing the right granularity. Rather than evaluating only the final answer, we work at step-level granularity, which better supports the LLM's ability to reflect and backtrack. In the fine-tuning data, we split each solution into numbered lines (as sketched below) so that the reward model can assess each reasoning step individually rather than the solution as a whole.
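The sketch below illustrates this step-level segmentation under stated assumptions; the function names and prompt layout are illustrative, not the exact pipeline used here.

```python
# Minimal sketch of step-level segmentation (illustrative names, not the
# exact pipeline): each solution is split on line breaks so that every
# reasoning step can receive its own reward label.

def segment_solution(solution: str) -> list[dict]:
    """Split a model solution into numbered steps for step-level labeling."""
    steps = [line.strip() for line in solution.splitlines() if line.strip()]
    return [{"step_id": i + 1, "text": step} for i, step in enumerate(steps)]


def format_for_reward_model(question: str, steps: list[dict]) -> str:
    """Render the question and numbered steps as one prompt, so the reward
    model can point to the first erroneous step by its number."""
    numbered = "\n".join(f"Step {s['step_id']}: {s['text']}" for s in steps)
    return f"Question: {question}\n\nSolution:\n{numbered}"


if __name__ == "__main__":
    sol = "Compute 3 * 4 = 12.\nAdd 5 to get 17.\nThe answer is 17."
    steps = segment_solution(sol)
    print(format_for_reward_model("What is 3 * 4 + 5?", steps))
```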
Meta-Evaluation
We evaluated both open-source and proprietary reward models on subsets of the MR-GSM8K and PRM800K datasets. As the tables below show, o1-mini performs best on both benchmarks.
| Model | F1 score | 
|---|---|
| o1-mini | 0.855 | 
| GPT-4o-mini | 0.722 | 
| Math-shepherd | 0.734 | 
| ReasonEval-7B | 0.728 | 
| ReasonEval-34B | 0.735 | 
Results on the subset of MR-GSM8K
| Model | F1 score | 
|---|---|
| GPT-4o-mini | 0.756 | 
| o1-mini | 0.880 | 
| o1-preview | 0.867 | 
Results on the subset of PRM800K
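The F1 scores above can be computed by treating the reward model's judgments as binary classification against human labels. The snippet below is a minimal sketch of that metric, assuming the benchmark reduces to binary labels (erroneous vs. correct); the label convention is illustrative and not taken directly from PRM800K or MR-GSM8K.

```python
# Minimal sketch of the meta-evaluation metric: F1 over binary labels,
# where 1 marks an erroneous solution/step (assumed convention).

def f1_score(gold: list[int], pred: list[int]) -> float:
    """F1 of reward-model judgments (pred) against human labels (gold)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = [1, 0, 1, 1, 0, 0, 1]   # human judgments
    pred = [1, 0, 0, 1, 0, 1, 1]   # reward-model judgments
    print(f"F1 = {f1_score(gold, pred):.3f}")
```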