How to Construct Long Thoughts?
Constructing long thoughts with actions such as reflection and backtracking is a key element of journey learning. We have explored several approaches to achieve this.
- Attempt1: Tree Search with LLM and Reward: In this method, reasoning is modeled as a search on a tree, where the problem is the root, and each node represents a reasoning step. When incorrect paths are identified, the model backtracks to find the correct solution. A fine-grained reward model guides this process, incorporating errors into the reasoning chain.
- Attempt2: Propose-Critique Loop: Attempt 1 constructs long thought by executing searches on the tree based on predefined rules, but this limits the freedom of actions like backtracking and reflection. Therefore, we allow the model to choose its current actions. We constructed a Propose-Critique Loop, where we pre-define some possible actions for the model (i.e., continue, backtracking, reflection, terminate) and let the model select actions to build the reasoning tree. If the tree does not reach the final answer, the model can be informed of this negative signal, guiding it to reflect and correct its approach.
- Attempt3: Multi-Agent Approach: Building long thought on the foundation of a reasoning tree presents several challenges, including the presence of numerous ineffective nodes that do not contribute to constructing Long Thought, as well as issues of logical inconsistency caused by reasoning steps that do not depend on the reflection behavior. To address this, we designed an algorithm utilizing multi-agent debate, where one agent acts as the policy model, continuously reasoning, while another agent serves as the critique model, indicating whether the policy model should continue with the current reasoning or perform actions like backtracking. The two agents engage in ongoing dialogue, naturally constructing a long thought dataset when the correct answer is found.
- Attempt4: Human Thought Process Annotation: By observing how humans solve reasoning problems—through reflection, backtracking, and revision—we can comprehensively document and model high-quality long thought processes that reflect human-like reasoning.