How to Construct an On-policy Reasoning Tree?

Constructing a reasoning tree requires a policy model that performs single-step reasoning. Starting from the problem as the root node, the model generates candidate reasoning steps as child nodes and expands the tree iteratively until a maximum depth is reached or the correct answer is found.
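
As a rough illustration, the sketch below shows this expansion loop. It is a minimal sketch, not the actual pipeline: `generate_steps` (single-step sampling from the policy model) and `is_correct` (a check against the reference answer) are hypothetical wrappers introduced only for this example.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    # Partial reasoning: the problem plus the steps taken so far.
    steps: list[str]
    children: list["Node"] = field(default_factory=list)

def build_tree(problem, generate_steps, is_correct, max_depth=8, branching=4):
    """Expand a reasoning tree breadth-first from the problem (root) until
    a correct answer is reached or max_depth is hit.

    `generate_steps(steps, n)` and `is_correct(steps)` are hypothetical
    wrappers around the policy model and the answer checker."""
    root = Node(steps=[problem])
    frontier = [root]
    for _ in range(max_depth):
        next_frontier = []
        for node in frontier:
            # The policy model proposes `branching` single-step continuations.
            for step in generate_steps(node.steps, branching):
                child = Node(steps=node.steps + [step])
                node.children.append(child)
                if is_correct(child.steps):
                    return root  # stop once a correct answer is found
                next_frontier.append(child)
        frontier = next_frontier
    return root
```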

  • Policy Model and Step Segmentation: To structure reasoning steps, we used the dataset from Abel and fine-tuned DeepSeekMath-7B-Base to create Abel-DSMath. This model generates reasoning steps cleanly segmented by line, which allows for controlled, precise stepwise reasoning (a minimal segmentation sketch follows this list).

  • Reward Model and Pruning: Generating a full reasoning tree is computationally expensive, so we implemented beam search to prune erroneous steps and improve efficiency. Two reward models were tested: math-shepherd and o1-mini. While math-shepherd scores each step’s probability of being correct, o1-mini offers more robust step-level rewards that directly indicate whether a reasoning step is correct. Keeping only the K highest-scoring steps at each iteration drastically reduces the number of generated steps (see the beam-search sketch below).
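
Because the Abel-style fine-tuning makes the policy model emit one reasoning step per line, step segmentation reduces to splitting the completion on newlines. A minimal sketch; the helper name and the example completion are illustrative, not taken from the original code:

```python
def split_into_steps(completion: str) -> list[str]:
    """Split a policy-model completion into reasoning steps.

    Assumes an Abel-DSMath-style model that emits one reasoning step per
    line, so newline boundaries are step boundaries."""
    return [line.strip() for line in completion.splitlines() if line.strip()]

# Illustrative completion (not real model output):
completion = "Let x be the number of apples.\n2x + 3 = 11.\nSo x = 4."
print(split_into_steps(completion))
# ['Let x be the number of apples.', '2x + 3 = 11.', 'So x = 4.']
```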
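
The pruning itself is a standard beam search over partial reasoning chains: expand each surviving chain, score every candidate step with the reward model, and keep only the K best. The sketch below is a minimal illustration under the same assumptions as above, where `score_step` stands in for either reward model (e.g. math-shepherd's estimated probability that a step is correct):

```python
def beam_search(problem, generate_steps, score_step, k=4, branching=4, max_depth=8):
    """Beam search over partial reasoning chains, keeping only the K
    highest-scoring continuations at each depth.

    `generate_steps(chain, n)` samples n candidate next steps from the
    policy model; `score_step(chain, step)` is the reward model's score
    for appending `step` to `chain`. Both are hypothetical wrappers."""
    beams = [[problem]]  # each beam is the problem plus the steps so far
    for _ in range(max_depth):
        candidates = []
        for chain in beams:
            for step in generate_steps(chain, branching):
                candidates.append((score_step(chain, step), chain + [step]))
        # Prune: keep only the K highest-scoring partial chains.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = [chain for _, chain in candidates[:k]]
    return beams
```

Compared with expanding the full tree, this keeps the number of generated steps per depth bounded by K × branching rather than growing exponentially.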