"To say 'I know' when you know, and 'I don't know' when you don't, that is wisdom."
- The Analects of Confucius
We propose Alignment for Honesty, which aims to ensure that LLMs proactively refuse to answer questions when they lack the necessary knowledge, while still not being overly conservative. Aligning models to be honest will significantly enhance the trustworthiness and reliability of modern LLMs.
The key principles of alignment are often summarized as the HHH criteria: helpful, harmless, honest. There has been a significant focus on enhancing the helpfulness and harmlessness of LLMs. However, honesty, despite its importance in establishing reliable and safe AI, has received relatively less research attention, and improving the honesty of models poses several primary challenges.
In this paper, we propose a systematic framework for alignment for honesty.
We view alignment as a process of iterative refinement. We introduce the concept of an "evolutionary metric" to characterize how the model's response type changes from iteration t to iteration t+1 of alignment for honesty.
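To make this concrete, below is a minimal sketch of how such an evolutionary metric could be tallied. The three response categories (`correct`, `wrong`, `refused`) and the helper names are illustrative assumptions, not the exact taxonomy used in the paper:

```python
from collections import Counter
from typing import Iterable, Tuple

# Illustrative response categories; the actual taxonomy may differ.
CATEGORIES = ("correct", "wrong", "refused")

def evolution_matrix(pairs: Iterable[Tuple[str, str]]) -> dict:
    """Tally how responses move between categories from iteration t to t+1.

    `pairs` holds (type_at_t, type_at_t_plus_1) for the same question
    answered by M_t and M_{t+1}.
    """
    counts = Counter(pairs)
    total = sum(counts.values()) or 1
    return {
        (src, dst): counts[(src, dst)] / total
        for src in CATEGORIES
        for dst in CATEGORIES
    }

# Example: two questions stay correct; one wrong answer becomes a refusal.
matrix = evolution_matrix([
    ("correct", "correct"),
    ("correct", "correct"),
    ("wrong", "refused"),
])
print(matrix[("wrong", "refused")])  # fraction of questions that moved wrong -> refused
```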
We define three different scores based on the evolutionary metric.
We propose three SFT-based honesty alignment approaches. Each uses a different way to approximate whether a model knows or does not know the answer to a question, and this judgment is then used to annotate the SFT training samples.
Specifically, given a question x and its responses y = {y_1, y_2, ..., y_m} generated by the model M_t under m trials, we define the expected accuracy as the ratio of correct responses among the m candidate responses. We present three different alignment strategies, and each strategy includes a definition of k(·) and an annotation scheme for the SFT training samples. Note that k(x) ∈ {1 (known), -1 (unknown)} is a function that judges whether the model M_t knows the answer to the input x.
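As a rough illustration, the sketch below computes the expected accuracy from m sampled responses, derives k(x) with a simple threshold, and annotates the SFT target accordingly. The threshold value, the correctness check, and the refusal wording are illustrative assumptions rather than the paper's exact choices:

```python
from typing import Callable, List

def expected_accuracy(responses: List[str], gold: str,
                      is_correct: Callable[[str, str], bool]) -> float:
    """Ratio of correct responses among the m sampled candidates."""
    m = len(responses)
    return sum(is_correct(r, gold) for r in responses) / m

def k(responses: List[str], gold: str,
      is_correct: Callable[[str, str], bool],
      threshold: float = 0.5) -> int:
    """Judge whether the model 'knows' the answer: 1 (known) or -1 (unknown).

    The 0.5 threshold is an assumption; each alignment strategy defines
    its own k(.).
    """
    return 1 if expected_accuracy(responses, gold, is_correct) >= threshold else -1

def annotate_sft_sample(question: str, gold: str, responses: List[str],
                        is_correct: Callable[[str, str], bool]) -> dict:
    """Keep the gold answer when the model is judged to know it; otherwise
    use an honest-refusal target (the wording here is a placeholder)."""
    if k(responses, gold, is_correct) == 1:
        target = gold
    else:
        target = "I'm sorry, I don't know the answer to this question."
    return {"input": question, "output": target}

# Example with an exact-match correctness check (an assumption).
exact = lambda r, g: r.strip().lower() == g.strip().lower()
sample = annotate_sft_sample(
    "Who wrote The Analects?", "Confucius",
    ["Confucius", "Laozi", "Confucius", "Confucius"], exact,
)
print(sample)  # expected accuracy 0.75 -> known -> gold answer kept as target
```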
@article{yang2023alignment,
title={Alignment for Honesty},
author={Yang, Yuqing and Chern, Ethan and Qiu, Xipeng and Neubig, Graham and Liu, Pengfei},
journal={arXiv preprint arXiv:2312.07000},
year={2023}
}