🤔 A thought-provoking research question
Self-improvement through post-training methods has been acclaimed for enhancing the problem-solving capabilities (e.g., mathematical reasoning) of Large Language Models (LLMs) without human intervention. However, current research concentrates on maximizing benchmark scores through iterative self-improvement, with little exploration of the underlying factors that contribute to the performance gains. As a result, the progress and reliability of different self-improvement methods are not guaranteed. Amidst the quest for self-improvement in LLMs, a persistent question arises: are these iterative post-training methods truly fostering progress, or are they inadvertently leading to regression?
👩‍💻 Our research route
We first provide a comprehensive overview of the main iterative post-training paradigms for self-improvement, examining both the explicit and implicit factors that contribute to consistent performance improvements. This provides actionable insights for practitioners on how to perform iterative self-improvement more effectively.
We further develop an evaluative framework equipped with a comprehensive suite of metrics to assess improved problems, solution diversity, and out-of-distribution (OOD) generalization within the iterative process, enabling us to scrutinize what actually changes beneath the surface of self-improvement.
🧐 What do we reveal?
We identify the key variables that influence peak performance and the improvement trend during iterative post-training: the foundation model M, the problem-solving task D, the number of iteration steps T, and the post-training method F. For our experimental setup, we choose M = {LLaMA2-7B, LLaMA3-8B, Mistral-7B}, D = {CommonsenseQA for Commonsense Knowledge, GSM8K and MATH for Mathematical Reasoning, MBPP for Code Generation}, T = {1, 2, 3, 4, 5}, and F = {Iterative SFT, Iterative DPO, Iterative SFT-DPO}.
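As a rough illustration, the experimental grid over (M, D, T, F) can be enumerated as below. This is only a sketch: the identifiers and the per-run launch logic are placeholders, not the actual API of this repository.

```python
# Illustrative sketch of the (M, D, T, F) experimental grid described above.
# The string identifiers and the launching step are placeholders, not the repo's API.
from itertools import product

MODELS = ["LLaMA2-7B", "LLaMA3-8B", "Mistral-7B"]                    # M
TASKS = ["CommonsenseQA", "GSM8K", "MATH", "MBPP"]                   # D
ITERATIONS = [1, 2, 3, 4, 5]                                         # T
METHODS = ["iterative_sft", "iterative_dpo", "iterative_sft_dpo"]    # F

for model, task, steps, method in product(MODELS, TASKS, ITERATIONS, METHODS):
    config = {"model": model, "task": task, "iterations": steps, "method": method}
    print(config)  # in practice: launch one post-training + evaluation run per config
```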
We then engage in a critical examination and reevaluation of iterative self-improvement, discerning whether the improvements constitute genuine progress or merely regression.
Reversal Observation: As N grows, M1 achieves near-perfect pass@N accuracy on IS(t) (the improvement set at iteration t), suggesting that it already possesses the inherent capacity to solve the problems later deemed "improved".
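For reference, a minimal way to estimate pass@N is the standard unbiased estimator of Chen et al. (2021), sketched below. This is only one common choice and may differ in detail from the exact protocol used in the paper.

```python
# Minimal sketch of estimating pass@N for problems in the improvement set IS(t),
# using the standard unbiased estimator (Chen et al., 2021); the paper's exact
# sampling protocol may differ.
from math import comb

def pass_at_n(num_samples: int, num_correct: int, n: int) -> float:
    """Probability that at least one of n draws (without replacement)
    from num_samples generations is correct."""
    if num_samples - num_correct < n:
        return 1.0
    return 1.0 - comb(num_samples - num_correct, n) / comb(num_samples, n)

# Example: 100 sampled solutions for one problem, 7 of them correct.
print(pass_at_n(100, 7, 1))   # ~0.07
print(pass_at_n(100, 7, 50))  # ~0.99 -> the model can already solve it given enough samples
```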
Reversal Observation: All methods consistently diminish the diversity of model outputs over iterations, affecting both correct and incorrect answers. This reduction is evident across all three metrics: syntactic (Distinct N-grams), semantic (SentenceBERT cosine similarity), and logical (distinct equations) diversity.
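The sketch below illustrates two of the metrics named above, Distinct-N and average pairwise SentenceBERT cosine similarity. The tokenization, the `all-MiniLM-L6-v2` checkpoint, and the aggregation are assumptions for illustration and may differ from the paper's setup.

```python
# Hedged sketch of two diversity metrics: Distinct-N (syntactic) and average
# pairwise SentenceBERT cosine similarity (semantic; higher similarity = lower
# diversity). Checkpoint choice and tokenization are illustrative assumptions.
from itertools import combinations
from sentence_transformers import SentenceTransformer, util

def distinct_n(texts: list[str], n: int = 2) -> float:
    """Unique n-grams divided by total n-grams across all generations."""
    ngrams = []
    for t in texts:
        toks = t.split()
        ngrams += [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)

def avg_pairwise_cosine(texts: list[str],
                        model_name: str = "all-MiniLM-L6-v2") -> float:
    """Average cosine similarity between all pairs of generations."""
    emb = SentenceTransformer(model_name).encode(texts, convert_to_tensor=True)
    sims = [util.cos_sim(emb[i], emb[j]).item()
            for i, j in combinations(range(len(texts)), 2)]
    return sum(sims) / len(sims)

solutions = ["x = 3 + 4 = 7", "first add 3 and 4 to get 7", "3 + 4 equals 7"]
print(distinct_n(solutions, n=2), avg_pairwise_cosine(solutions))
```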
Reversal Observation: As the number of iterative steps increases, Iterative SFT and Iterative SFT-DPO can significantly impair OOD generalization, whereas Iterative DPO shows a noticeable improvement. However, all three iterative post-training methods can exacerbate generalization disparities across groups, inadvertently causing models to focus on easier problems instead of improving their ability to solve more complex ones.
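One simple way to track the group-wise disparity mentioned above is to bucket accuracy by a difficulty label and watch the gap between the strongest and weakest bucket across iterations. The record fields below are hypothetical placeholders, not the repository's data schema.

```python
# Illustrative sketch of measuring group-wise generalization disparity:
# per-group accuracy and the gap between the best and worst group.
# The 'difficulty'/'correct' field names are hypothetical.
from collections import defaultdict

def accuracy_by_group(records: list[dict]) -> dict[str, float]:
    """records: [{'difficulty': 'easy'|'medium'|'hard', 'correct': bool}, ...]"""
    totals, hits = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["difficulty"]] += 1
        hits[r["difficulty"]] += int(r["correct"])
    return {g: hits[g] / totals[g] for g in totals}

def disparity(records: list[dict]) -> float:
    acc = accuracy_by_group(records)
    return max(acc.values()) - min(acc.values())  # a widening gap indicates growing disparity

records = [{"difficulty": "easy", "correct": True},
           {"difficulty": "easy", "correct": True},
           {"difficulty": "hard", "correct": False},
           {"difficulty": "hard", "correct": True}]
print(accuracy_by_group(records), disparity(records))
```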
If you have any questions regarding this project, feel free to submit a GitHub issue or reach out to us via email.
If you find our paper and code helpful, please consider citing our work:
@article{wu2024progressregressselfimprovementreversal,
title={Progress or Regress? Self-Improvement Reversal in Post-training},
author={Ting Wu and Xuefeng Li and Pengfei Liu},
year={2024},
eprint={2407.05013},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2407.05013}
}