How Do We Train Our Models?

Our experiments use the pre-trained language model deepseek-math-7b-base. The training process is divided into two main phases: Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO).

Phase 1: Supervised Fine-Tuning (SFT)

The SFT process consists of two stages:

  1. Shortcut Learning: In this initial stage, we fine-tune the model on responses that contain only the correct intermediate steps and the final correct answer. We fine-tune deepseek-math-7b-base on the Abel dataset, which comprises 120k examples, and on the PRM800K dataset. For each question in PRM800K, we keep a single correct step-by-step solution and discard responses that do not lead to the correct final answer, yielding 6,998 examples for fine-tuning. In this stage we fine-tune for one epoch on each dataset, primarily to familiarize the model with the desired response format.

  2. Journey Learning: In this second stage, we further fine-tune the stage-one SFT model on the long thoughts we constructed, which comprise 327 examples. This stage is designed to enhance the model's ability to detect errors, incorporate reflections, execute corrections, and perform backtracking. By training on long thoughts that include not only the correct reasoning paths but also erroneous trials, we aim to equip the model with a deeper understanding of the complexities involved in longer reasoning chains. As a comparison, we also fine-tune the model on the corresponding shortcut data generated from the same reasoning tree, which likewise consists of 327 examples. Both the long-thought SFT and shortcut SFT settings are trained for 3 epochs on these 327 examples; a hedged sketch of the fine-tuning setup is shown after this list.
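
Below is a minimal sketch of how such a two-stage SFT setup could be run with Hugging Face transformers and trl. Only the model name, datasets, and epoch counts come from the description above; the file name, field names ("question", "solution"), batch size, and learning rate are illustrative assumptions, and exact trl argument names may vary slightly across versions.

```python
# Hedged SFT sketch: fine-tune deepseek-math-7b-base on (question, solution) pairs.
# Only the model name and epoch counts come from the text above; file names,
# field names, and the remaining hyperparameters are illustrative assumptions.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

model_name = "deepseek-ai/deepseek-math-7b-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Stage 1 uses the Abel data and the filtered PRM800K shortcut data (1 epoch each);
# stage 2 reuses the same loop on the 327 long-thought examples (3 epochs).
dataset = load_dataset("json", data_files="sft_stage1.jsonl", split="train")  # hypothetical file

def to_text(example):
    # Concatenate the question and its step-by-step solution into one training string.
    return {"text": example["question"] + "\n" + example["solution"] + tokenizer.eos_token}

dataset = dataset.map(to_text)

config = SFTConfig(
    output_dir="sft-stage1",
    num_train_epochs=1,               # 1 epoch per dataset in stage 1, 3 epochs in stage 2
    per_device_train_batch_size=1,    # assumed; not stated in the text
    gradient_accumulation_steps=16,   # assumed; not stated in the text
    learning_rate=1e-5,               # assumed; not stated in the text
    bf16=True,
)

trainer = SFTTrainer(model=model, args=config, train_dataset=dataset)
trainer.train()
```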

Phase 2: Direct Preference Optimization (DPO)

In this phase, we generate 20 responses per question from the MATH Train dataset, a re-split of PRM800K containing 12,000 examples, using nucleus sampling with top_p = 0.95 and temperature T = 0.7. These 20 responses are categorized as positive or negative based on the correctness of the final answer. From these, we randomly select 5 positive responses and 5 negative responses to form 5 preference pairs, and then train the model on these pairs with the DPO loss, allowing it to learn from the comparison between correct and incorrect answers.
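
The sketch below illustrates this sampling and pairing procedure on top of the SFT model. The sampling parameters (20 responses per question, top_p = 0.95, T = 0.7) and the 5-positive / 5-negative pairing come from the description above; the checkpoint path, the simplified boxed-answer correctness check, the generation length limit, and the rule for skipping questions without enough positives or negatives are all assumptions.

```python
# Hedged sketch of preference-pair construction for DPO. The checkpoint path,
# the simplified correctness check, max_new_tokens, and the skip rule for
# questions lacking 5 positives or 5 negatives are assumptions; the sampling
# settings and the 5-positive / 5-negative pairing follow the text above.
import random
import re
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "sft-stage2-journey"  # hypothetical path to the journey-learning SFT checkpoint
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, device_map="auto")

def sample_responses(question, n=20):
    inputs = tokenizer(question, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        do_sample=True,
        top_p=0.95,              # nucleus sampling, as described above
        temperature=0.7,
        num_return_sequences=n,  # 20 responses per question
        max_new_tokens=2048,     # assumed length limit
        pad_token_id=tokenizer.eos_token_id,
    )
    prompt_len = inputs["input_ids"].shape[1]
    return [tokenizer.decode(o[prompt_len:], skip_special_tokens=True) for o in outputs]

def extract_final_answer(text):
    # Simplified assumption: take the content of the last \boxed{...}, if any.
    matches = re.findall(r"\\boxed\{([^}]*)\}", text)
    return matches[-1].strip() if matches else None

def is_correct(response, reference_answer):
    # Simplified final-answer check by string match (assumption; real grading is stricter).
    return extract_final_answer(response) == reference_answer

def build_pairs(question, reference_answer, n_pairs=5):
    responses = sample_responses(question)
    positives = [r for r in responses if is_correct(r, reference_answer)]
    negatives = [r for r in responses if not is_correct(r, reference_answer)]
    if len(positives) < n_pairs or len(negatives) < n_pairs:
        return []  # assumed: skip questions without enough of both kinds
    chosen = random.sample(positives, n_pairs)
    rejected = random.sample(negatives, n_pairs)
    return [{"prompt": question, "chosen": c, "rejected": r}
            for c, r in zip(chosen, rejected)]

# The resulting {"prompt", "chosen", "rejected"} records can then be passed to a
# standard DPO implementation (e.g. trl's DPOTrainer) to optimize the DPO loss.
```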

Results

| Setting                     | deepseek-sft-abel | deepseek-sft-prm800k |
| --------------------------- | ----------------- | -------------------- |
| SFT-phase1                  | 0.372             | 0.290                |
| SFT-phase2-shortcutLearning | 0.386             | 0.348                |
| SFT-phase2-journeyLearning  | 0.470             | 0.428                |
| DPO                         | 0.472             | 0.440                |

Table: Training Results on MATH Test Set