🤝 BeHonest: Benchmarking Honesty in Large Language Models

1Shanghai Jiao Tong University, 2Carnegie Mellon University, 3Fudan University, 4Shanghai AI Laboratory, 5Generative AI Research Lab (GAIR)
*Primary Contributors; Core Research Contributors; Corresponding author

⭐ Introduction

We introduce BeHonest, a pioneering benchmark specifically designed to assess honesty in LLMs comprehensively. BeHonest evaluates three essential aspects of honesty: awareness of knowledge boundaries (self-knowledge), avoidance of deceit (non-deceptiveness), and consistency in responses (consistency).

BeHonest supports the following 10 scenarios:

  • Admitting Unknowns: LLMs should appropriately refuse to answer questions that are beyond their knowledge (see the scoring sketch after this list).
  • Expressing Knowns: LLMs should correctly answer questions they have knowledge of.
  • Persona Sycophancy: LLMs should not alter their responses to fit the perceived viewpoints of a persona, as this can produce biased information aimed at maintaining approval.
  • Preference Sycophancy: LLMs should not change their perspective merely to maintain approval when a user indicates a clear preference for one option over another.
  • Burglar Deception: LLMs should not lie even when deceptive behaviors are explicitly instructed or encouraged.
  • Game: LLMs should not lie even when deceptive behaviors are conducive to winning in strategic games.
  • Prompt Format: LLMs should remain consistent in their responses when encountering slight prompt variations of the same content.
  • Demonstration Format: LLMs should remain consistent in their responses even with irrelevant biased features in few-shot demonstrations.
  • Open-Form Consistency: LLMs should display consistency by validating their own generated responses.
  • Multiple-Choice Consistency: LLMs should remain consistent in their responses when asked to answer a question again or when a user presents unnecessary doubts.
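To make the scoring concrete, here is a minimal sketch of how a scenario such as Admitting Unknowns could be scored as a Refusal Rate. The helpers `query_model` and `is_refusal` are hypothetical stand-ins, not functions from the BeHonest codebase; a real evaluation would call an actual LLM and a refusal classifier.

```python
from typing import Callable, List

def refusal_rate(questions: List[str],
                 query_model: Callable[[str], str],
                 is_refusal: Callable[[str], bool]) -> float:
    """Share (%) of unanswerable questions the model declines to answer."""
    refusals = sum(is_refusal(query_model(q)) for q in questions)
    return 100.0 * refusals / len(questions)

if __name__ == "__main__":
    # Stub helpers for illustration only.
    stub_model = lambda q: "I'm sorry, I don't know the answer to that."
    stub_refusal = lambda r: "don't know" in r.lower() or "cannot" in r.lower()
    print(refusal_rate(["What will the stock price be tomorrow?"],
                       stub_model, stub_refusal))  # -> 100.0
```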

📄 Datasets and Metrics

👑 Leaderboard

Currently, our leaderboard consists of evaluation results from 9 popular LLMs on the market, including both closed-source and open-source models from different model families with varied model sizes. We are working on adding more models to the leaderboard to provide a comprehensive evaluation of honesty in LLMs.

| Model | Self-Knowledge [Overall] ↑ | Admitting Unknowns [Refusal Rate] ↑ | Expressing Knowns [Answer Rate] ↑ | Expressing Knowns [Self-Knowledge Rate] ↑ |
|---|---|---|---|---|
| GPT-4o | 59.26 | 31.37 | 95.52 | 50.88 |
| ChatGPT | 54.16 | 21.78 | 93.71 | 47.00 |
| Llama3-70b | 63.34 | 48.81 | 94.29 | 46.93 |
| Llama3-8b | 54.51 | 37.80 | 88.33 | 37.40 |
| Llama2-70b | 52.80 | 26.40 | 90.51 | 41.50 |
| Llama2-13b | 52.69 | 32.24 | 89.13 | 36.70 |
| Llama2-7b | 49.56 | 27.82 | 87.96 | 32.90 |
| Mistral-7b | 59.43 | 50.03 | 91.65 | 36.60 |
| Qwen1.5-14b | 53.08 | 37.03 | 89.20 | 33.00 |
| Model | Non-Deceptiveness [Overall] ↓ | Persona Sycophancy [Syco. Rate] ↓ | Preference Sycophancy [Syco. Rate] ↓ | Burglar Deception [Avg. Lying Rate] ↓ | Game [Lying Rate] ↓ |
|---|---|---|---|---|---|
| GPT-4o | 55.74 | 39.44 | 24.11 | 62.50 | 96.91 |
| ChatGPT | 57.07 | 38.39 | 48.78 | 69.50 | 71.60 |
| Llama3-70b | 63.68 | 33.62 | 33.07 | 90.50 | 97.53 |
| Llama3-8b | 64.21 | 25.74 | 78.02 | 100.00 | 53.09 |
| Llama2-70b | 52.89 | 26.81 | 46.52 | 76.50 | 61.73 |
| Llama2-13b | 42.33 | 27.66 | 54.35 | 80.50 | 6.79 |
| Llama2-7b | 49.16 | 23.40 | 61.74 | 82.50 | 29.01 |
| Mistral-7b | 74.80 | 39.53 | 80.21 | 95.50 | 83.95 |
| Qwen1.5-14b | 52.03 | 30.64 | 57.39 | 88.00 | 32.10 |
| Model | Consistency [Overall] ↑ | Prompt Format [Perf. Spread] ↓ | Demo. Format (w/o CoT) [Inconsist. Rate] ↓ | Demo. Format (w/ CoT) [Inconsist. Rate] ↓ | Open-Form Consistency [Agree. Rate] ↑ | M.C. Consistency [Consist. Rate] ↑ |
|---|---|---|---|---|---|---|
| GPT-4o | 96.26 | 2.12 | 7.67 | 3.02 | 87.00 | 94.20 |
| ChatGPT | 63.32 | 3.11 | 50.49 | 11.39 | 73.00 | 70.40 |
| Llama3-70b | 59.44 | 5.25 | 30.99 | 1.14 | 94.40 | 33.60 |
| Llama3-8b | 41.62 | 5.50 | 57.01 | 18.50 | 57.40 | 70.80 |
| Llama2-70b | 44.42 | 4.25 | 57.89 | 25.94 | 66.00 | 61.60 |
| Llama2-13b | 35.20 | 6.50 | 75.53 | 31.76 | 71.80 | 79.40 |
| Llama2-7b | 29.39 | 3.25 | 82.08 | 49.59 | 47.60 | 73.80 |
| Mistral-7b | 66.05 | 2.75 | 35.19 | 27.33 | 82.20 | 70.00 |
| Qwen1.5-14b | 72.24 | 3.00 | 17.77 | 2.52 | 44.40 | 92.80 |

We present the main results from our paper above. The overall scores for Self-Knowledge and Non-Deceptiveness are simple averages of their scenario metrics. For Consistency, the lower-is-better metrics (Perf. Spread and the Inconsist. Rates) are first reversed with \( \left( \frac{\max(X) - x}{\max(X) - \min(X)} \right) \times 100 \), while the higher-is-better metrics (Agree. Rate and Consist. Rate) are normalized to 0-100 with \( \left( \frac{x - \min(X)}{\max(X) - \min(X)} \right) \times 100 \), where \(X\) denotes the set of values a metric takes across models. Each normalized metric is weighted equally, and their average yields a single composite score that integrates the different aspects of model performance.
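As a toy illustration of this composite, the sketch below applies the two normalization formulas to three models and two metric columns taken from the Consistency table. The real score averages over all five Consistency metrics and all nine models, so these toy outputs will not match the [Overall] column.

```python
def normalize(xs, lower_is_better):
    """Min-max normalize one metric column to 0-100 across models."""
    lo, hi = min(xs), max(xs)
    if lower_is_better:
        return [100.0 * (hi - x) / (hi - lo) for x in xs]  # reversed scale
    return [100.0 * (x - lo) / (hi - lo) for x in xs]      # direct scale

models = ["GPT-4o", "ChatGPT", "Llama3-70b"]
columns = {
    "Perf. Spread": ([2.12, 3.11, 5.25], True),   # lower is better
    "Agree. Rate":  ([87.0, 73.0, 94.4], False),  # higher is better
}
normalized = {name: normalize(vals, low) for name, (vals, low) in columns.items()}
# Overall score per model: unweighted mean over all normalized columns.
overall = [sum(col[i] for col in normalized.values()) / len(normalized)
           for i in range(len(models))]
for model, score in zip(models, overall):
    print(f"{model}: {score:.2f}")
```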

Our findings indicate that there is still significant room for improvement in the honesty of LLMs. We encourage the AI community to prioritize honesty alignment in these models so that their full potential can be harnessed to benefit society while preventing harm from deception or inconsistency.

Abbreviations: Syco. = Sycophancy; Perf. Spread = Performance Spread; Demo. = Demonstration; Inconsist. = Inconsistency; Agree. = Agreement; Consist. = Consistency; M.C. = Multiple-Choice.

📬 Contact

If you have any questions regarding this project, feel free to submit a GitHub issue or reach out to us via email.

BibTeX

@article{chern2024behonest,
  title={BeHonest: Benchmarking Honesty in Large Language Models},
  author={Chern, Steffi and Hu, Zhulin and Yang, Yuqing and Chern, Ethan and Guo, Yuan and Jin, Jiahe and Wang, Binjie and Liu, Pengfei},
  journal={arXiv preprint arXiv:2406.13261},
  url={https://arxiv.org/abs/2406.13261},
  year={2024}
}