📄️ Outline
This section outlines the core of our O1 replication project, guiding readers through our research journey using the key questions that reflect the complexity of the process. From our initial evaluation of O1 on the OlympicArena dataset to the construction of long thoughts, our work has involved numerous attempts, iterations, and in-depth analysis of O1’s capabilities.
📄️ What does O1’s Thought Look Like?
An official example illustrating what O1’s thought process looks like.
📄️ How does Long Thought Work?
While we are still in the hypothesis stage without sufficient empirical evidence, we believe the success of O1’s long-thought approach is due to journey learning, as discussed earlier. Unlike shortcut learning, journey learning allows the model to explore the entire decision-making process, much like human problem-solving. O1 can consider multiple solution paths, learn from mistakes, and develop a deeper understanding of the problem—not just finding the correct answer but understanding why and how to reach it.
📄️ How to Construct Long Thoughts?
Constructing long thoughts with actions such as reflection and backtracking is a key element of journey learning. We have explored several approaches to achieve this.
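The sketch below is our own simplification, not the project’s actual data format: the `Step` structure and the connective phrase are illustrative assumptions. It shows one way a long thought could be assembled from a sequence of attempted steps, inserting reflection and backtracking text after steps that fail verification.

```python
# A minimal sketch (our naming, not the project's API) of assembling a long
# thought from attempted reasoning steps, adding reflection and backtracking
# connectives after steps marked as incorrect.

from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    text: str         # the reasoning step itself
    is_correct: bool  # whether this step survives verification


def build_long_thought(steps: List[Step]) -> str:
    """Concatenate steps, inserting reflection/backtracking text after mistakes."""
    parts = []
    for step in steps:
        parts.append(step.text)
        if not step.is_correct:
            # Hypothetical connective; real data would use varied, natural phrasing.
            parts.append("Wait, that doesn't look right. Let me go back and try another approach.")
    return "\n".join(parts)


if __name__ == "__main__":
    demo = [
        Step("Assume the answer is even and substitute x = 2k.", is_correct=False),
        Step("Instead, consider the parity of both sides directly.", is_correct=True),
        Step("Both sides are odd, so the assumption fails; the answer is odd.", is_correct=True),
    ]
    print(build_long_thought(demo))
```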
📄️ How to Construct Reward Model?
To build an effective reward model, the first step is determining the appropriate granularity. Rather than evaluating only final results, we focus on step-level granularity to strengthen the LLM’s capabilities in reflection and backtracking. In our fine-tuning data, we separate solutions into numbered lines so that finer-grained cognitive processes can be captured and scored.
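As a rough illustration of step-level scoring, the following sketch splits a solution into its numbered steps and scores each step in the context of the problem and the preceding steps. The helper names and the step-prefix convention are hypothetical; the actual reward model and data format may differ.

```python
# A minimal sketch of step-level reward scoring with hypothetical helpers.

from typing import Callable, List, Tuple


def split_into_steps(solution: str) -> List[str]:
    """Split a solution whose steps are prefixed like 'Step 1:', 'Step 2:', ..."""
    steps, current = [], []
    for line in solution.splitlines():
        if line.strip().startswith("Step ") and current:
            steps.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        steps.append("\n".join(current))
    return steps


def score_steps(problem: str,
                solution: str,
                reward_fn: Callable[[str, List[str]], float]) -> List[Tuple[str, float]]:
    """Score each step given the problem and the steps that precede it.

    `reward_fn` stands in for the trained reward model: it receives the problem
    and the step prefix and returns a scalar score for the latest step.
    """
    steps = split_into_steps(solution)
    return [(steps[i], reward_fn(problem, steps[: i + 1])) for i in range(len(steps))]
```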
📄️ How to Construct an On-policy Reasoning Tree?
Constructing a reasoning tree requires a policy model that performs single-step reasoning. Starting from a problem as the root node, the model generates possible reasoning steps as child nodes, continuing iteratively until a maximum depth is reached or the correct answer is found.
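The following is a minimal sketch of this expansion loop under our own assumptions: `propose_steps` stands in for the policy model and `is_correct` for an answer verifier; neither name comes from the project’s code.

```python
# A minimal sketch of building an on-policy reasoning tree: the policy model
# proposes single-step expansions, and a branch stops growing when it reaches
# the maximum depth or a verifier accepts its answer.

from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    state: str                     # problem plus the reasoning steps so far
    depth: int = 0
    is_terminal: bool = False      # reached the correct answer
    children: List["Node"] = field(default_factory=list)


def build_reasoning_tree(problem: str,
                         propose_steps: Callable[[str, int], List[str]],
                         is_correct: Callable[[str], bool],
                         max_depth: int = 6,
                         branching: int = 3) -> Node:
    """Expand nodes breadth-first with `propose_steps` (the policy model)."""
    root = Node(state=problem)
    frontier = [root]
    while frontier:
        node = frontier.pop(0)
        if node.depth >= max_depth:
            continue
        for step in propose_steps(node.state, branching):
            child_state = node.state + "\n" + step
            child = Node(state=child_state,
                         depth=node.depth + 1,
                         is_terminal=is_correct(child_state))
            node.children.append(child)
            if not child.is_terminal:
                frontier.append(child)
    return root
```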
📄️ How to Derive a Long Thought from a Reasoning Tree?
Once the reasoning tree is constructed, the next step is to derive a long thought that includes trial and error, moving beyond traditional shortcuts focused solely on the correct answer.
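One way to realize this, sketched below under our own assumptions about the tree nodes (the `state`/`children` attributes from the sketch above), is to walk a correct root-to-leaf path while deliberately visiting a few incorrect siblings first, so the resulting trace contains trial and error followed by backtracking.

```python
# A minimal sketch of deriving a long thought from a reasoning tree; the
# backtracking phrase and the detour policy are illustrative assumptions.

def derive_long_thought(path_to_answer, detours_per_level=1):
    """`path_to_answer` is the list of nodes on a correct root-to-leaf path."""
    lines = []
    for parent, next_on_path in zip(path_to_answer, path_to_answer[1:]):
        # Visit up to `detours_per_level` incorrect siblings before the correct step.
        detours = [c for c in parent.children if c is not next_on_path][:detours_per_level]
        for wrong in detours:
            lines.append(wrong.state.splitlines()[-1])
            lines.append("Hmm, this path does not seem to lead anywhere. Let me backtrack.")
        lines.append(next_on_path.state.splitlines()[-1])
    return "\n".join(lines)
```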
📄️ How to Evaluate our Trials?
In addition to measuring accuracy with specific evaluation metrics on benchmarks, manually reviewing actual cases is a crucial step in evaluating data and models. Therefore, to evaluate the model’s performance on specific problems more intuitively, we built a visual data analysis platform using Streamlit. The platform visualizes synthetic trees and their corresponding long thoughts, as well as the outputs of the trained model. When visualizing results, it supports detailed conditional filtering, such as filtering for correctly or incorrectly answered questions, or for outputs containing keywords that indicate reflection or hesitation (e.g., “wait”). It also supports comparison across different iterations of synthetic data and model outputs, which makes it highly intuitive and helps us easily validate whether a new round of data or models is effective.
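To give a flavor of the filtering described above, here is a minimal Streamlit sketch; the data file and record schema (`question`, `output`, `is_correct` fields) are assumptions for illustration, not the platform’s actual format.

```python
# A minimal Streamlit sketch of conditional filtering over model outputs.

import json
import streamlit as st

# Hypothetical JSONL file of model outputs, one record per line.
records = [json.loads(line) for line in open("model_outputs.jsonl")]

correctness = st.sidebar.selectbox("Answer correctness", ["all", "correct", "incorrect"])
keyword = st.sidebar.text_input("Output must contain keyword (e.g. 'wait')", "")

for rec in records:
    if correctness == "correct" and not rec["is_correct"]:
        continue
    if correctness == "incorrect" and rec["is_correct"]:
        continue
    if keyword and keyword.lower() not in rec["output"].lower():
        continue
    st.subheader(rec["question"])
    st.write(rec["output"])
```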
📄️ How to Train our Models?
Our experiments utilize the pre-trained language model deepseek-math-7b-base. The training process is divided into two main phases: Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO).
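For the DPO phase, the standard objective can be written directly in PyTorch. The sketch below assumes per-example sequence log-probabilities have already been computed under the policy and a frozen reference model; the batching details and the value of `beta` are illustrative, not our exact training configuration.

```python
# A minimal sketch of the Direct Preference Optimization (DPO) loss.

import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Each tensor holds per-example log-probabilities of the chosen (preferred)
    or rejected response under the policy or the frozen reference model."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # -log sigmoid(margin) == softplus(-margin), averaged over the batch.
    return F.softplus(-(chosen_rewards - rejected_rewards)).mean()
```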
📄️ What would be an effective annotation strategy for human-AI collaboration?
We have developed a human-AI pipeline that generates high-quality, long-form reasoning data based on the MATH dataset, following our “journey learning” paradigm. This pipeline expands human-annotated solutions from a few lines to thousands of tokens, using key techniques to ensure efficient annotation.
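A minimal sketch of the expansion step might look like the following; the prompt wording and function names are our own illustrations, not the pipeline’s actual prompts.

```python
# A minimal sketch of expanding a terse human-annotated solution into a
# long-form reasoning trace with an LLM; `complete` is any text-completion
# function (e.g., a wrapped chat API call) supplied by the caller.

from typing import Callable

EXPAND_PROMPT = """You are expanding a terse math solution into a detailed thought process.
Problem:
{problem}

Terse solution:
{solution}

Rewrite the solution as a long, step-by-step reasoning trace, explaining why each
step is taken and noting any natural checks along the way."""


def expand_solution(problem: str, solution: str,
                    complete: Callable[[str], str]) -> str:
    """Return an expanded long-form solution for an annotator to review."""
    return complete(EXPAND_PROMPT.format(problem=problem, solution=solution))
```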