🤝 BeHonest: Benchmarking Honesty in Large Language Models

1Shanghai Jiao Tong University, 2Carnegie Mellon University, 3Fudan University, 4Shanghai AI Laboratory, 5Generative AI Research Lab (GAIR)
*Primary Contributors; Core Research Contributors; Corresponding author

⭐ Introduction

We introduce BeHonest, a pioneering benchmark specifically designed to assess honesty in LLMs comprehensively. BeHonest evaluates three essential aspects of honesty: awareness of knowledge boundaries (self-knowledge), avoidance of deceit (non-deceptiveness), and consistency in responses (consistency).

BeHonest supports the following 10 scenarios:

  • Admitting Unknowns: LLMs should appropriately refuse to answer questions that are beyond their knowledge (see the scoring sketch after this list).
  • Expressing Knowns: LLMs should correctly answer questions they have knowledge of.
  • Persona Sycophancy: LLMs should not alter their responses to fit the perceived viewpoints of a persona, as this can produce biased information aimed at maintaining approval.
  • Preference Sycophancy: LLMs should not change their perspective merely to maintain approval when a user indicates a clear preference for one option over another.
  • Burglar Deception: LLMs should not lie even when deceptive behaviors are explicitly instructed or encouraged.
  • Game: LLMs should not lie even when deceptive behaviors are conducive to winning in strategic games.
  • Prompt Format: LLMs should remain consistent in their responses when encountering slight prompt variations of the same content.
  • Demonstration Format: LLMs should remain consistent in their responses even with irrelevant biased features in few-shot demonstrations.
  • Open-Form Consistency: LLMs should display consistency by validating their own generated responses.
  • Multiple-Choice Consistency: LLMs should remain consistent in their responses when asked to answer a question again or when a user presents unnecessary doubts.
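To make the scoring concrete, here is a minimal sketch of how a scenario such as Admitting Unknowns could be scored as a Refusal Rate. The helpers `query_model` and `is_refusal` are hypothetical stand-ins, not functions from the BeHonest codebase; a real evaluation would call an actual LLM and a refusal classifier.

```python
from typing import Callable, List

def refusal_rate(questions: List[str],
                 query_model: Callable[[str], str],
                 is_refusal: Callable[[str], bool]) -> float:
    """Share (%) of unanswerable questions the model declines to answer."""
    refusals = sum(is_refusal(query_model(q)) for q in questions)
    return 100.0 * refusals / len(questions)

if __name__ == "__main__":
    # Stub helpers for illustration only.
    stub_model = lambda q: "I'm sorry, I don't know the answer to that."
    stub_refusal = lambda r: "don't know" in r.lower() or "cannot" in r.lower()
    print(refusal_rate(["What will the stock price be tomorrow?"],
                       stub_model, stub_refusal))  # -> 100.0
```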

📄 Datasets and Metrics

👑 Leaderboard

Currently, our leaderboard consists of evaluation results from 9 popular LLMs on the market, including both closed-source and open-source models from different model families with varied model sizes. We are working on adding more models to the leaderboard to provide a comprehensive evaluation of honesty in LLMs.

| Model | Self-Knowledge [Overall] ↑ | Admitting Unknowns [Refusal Rate] ↑ | Expressing Knowns [Answer Rate] ↑ | Expressing Knowns [Self-Knowledge Rate] ↑ |
|---|---|---|---|---|
| GPT-4o | 59.26 | 31.37 | 95.52 | 50.88 |
| ChatGPT | 54.16 | 21.78 | 93.71 | 47.00 |
| Llama3-70b | 63.34 | 48.81 | 94.29 | 46.93 |
| Llama3-8b | 54.51 | 37.80 | 88.33 | 37.40 |
| Llama2-70b | 52.80 | 26.40 | 90.51 | 41.50 |
| Llama2-13b | 52.69 | 32.24 | 89.13 | 36.70 |
| Llama2-7b | 49.56 | 27.82 | 87.96 | 32.90 |
| Mistral-7b | 59.43 | 50.03 | 91.65 | 36.60 |
| Qwen1.5-14b | 53.08 | 37.03 | 89.20 | 33.00 |
| Model | Non-Deceptiveness [Overall] ↓ | Persona Sycophancy [Syco. Rate] ↓ | Preference Sycophancy [Syco. Rate] ↓ | Burglar Deception [Avg. Lying Rate] ↓ | Game [Lying Rate] ↓ |
|---|---|---|---|---|---|
| GPT-4o | 55.74 | 39.44 | 24.11 | 62.50 | 96.91 |
| ChatGPT | 57.07 | 38.39 | 48.78 | 69.50 | 71.60 |
| Llama3-70b | 63.68 | 33.62 | 33.07 | 90.50 | 97.53 |
| Llama3-8b | 64.21 | 25.74 | 78.02 | 100.00 | 53.09 |
| Llama2-70b | 52.89 | 26.81 | 46.52 | 76.50 | 61.73 |
| Llama2-13b | 42.33 | 27.66 | 54.35 | 80.50 | 6.79 |
| Llama2-7b | 49.16 | 23.40 | 61.74 | 82.50 | 29.01 |
| Mistral-7b | 74.80 | 39.53 | 80.21 | 95.50 | 83.95 |
| Qwen1.5-14b | 52.03 | 30.64 | 57.39 | 88.00 | 32.10 |
| Model | Consistency [Overall] ↑ | Prompt Format [Perf. Spread] ↓ | Demo. Format (w/o CoT) [Inconsist. Rate] ↓ | Demo. Format (w/ CoT) [Inconsist. Rate] ↓ | Open-Form Consistency [Agree. Rate] ↑ | M.C. Consistency [Consist. Rate] ↑ |
|---|---|---|---|---|---|---|
| GPT-4o | 96.26 | 2.12 | 7.67 | 3.02 | 87.00 | 94.20 |
| ChatGPT | 63.32 | 3.11 | 50.49 | 11.39 | 73.00 | 70.40 |
| Llama3-70b | 59.44 | 5.25 | 30.99 | 1.14 | 94.40 | 33.60 |
| Llama3-8b | 41.62 | 5.50 | 57.01 | 18.50 | 57.40 | 70.80 |
| Llama2-70b | 44.42 | 4.25 | 57.89 | 25.94 | 66.00 | 61.60 |
| Llama2-13b | 35.20 | 6.50 | 75.53 | 31.76 | 71.80 | 79.40 |
| Llama2-7b | 29.39 | 3.25 | 82.08 | 49.59 | 47.60 | 73.80 |
| Mistral-7b | 66.05 | 2.75 | 35.19 | 27.33 | 82.20 | 70.00 |
| Qwen1.5-14b | 72.24 | 3.00 | 17.77 | 2.52 | 44.40 | 92.80 |

We present the main results from our paper above. The overall scores for Self-Knowledge and Non-Deceptiveness are simple averages of their scenario metrics. For Consistency, the lower-is-better metrics (Perf. Spread and the Inconsist. Rates) are first reversed with \( \left( \frac{\max(X) - x}{\max(X) - \min(X)} \right) \times 100 \), while the higher-is-better metrics (Agree. Rate and Consist. Rate) are normalized to 0-100 with \( \left( \frac{x - \min(X)}{\max(X) - \min(X)} \right) \times 100 \), where \(X\) denotes the set of values a metric takes across models. Each normalized metric is weighted equally, and their average yields a single composite score that integrates the different aspects of model performance.
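As a toy illustration of this composite, the sketch below applies the two normalization formulas to three models and two metric columns taken from the Consistency table. The real score averages over all five Consistency metrics and all nine models, so these toy outputs will not match the [Overall] column.

```python
def normalize(xs, lower_is_better):
    """Min-max normalize one metric column to 0-100 across models."""
    lo, hi = min(xs), max(xs)
    if lower_is_better:
        return [100.0 * (hi - x) / (hi - lo) for x in xs]  # reversed scale
    return [100.0 * (x - lo) / (hi - lo) for x in xs]      # direct scale

models = ["GPT-4o", "ChatGPT", "Llama3-70b"]
columns = {
    "Perf. Spread": ([2.12, 3.11, 5.25], True),   # lower is better
    "Agree. Rate":  ([87.0, 73.0, 94.4], False),  # higher is better
}
normalized = {name: normalize(vals, low) for name, (vals, low) in columns.items()}
# Overall score per model: unweighted mean over all normalized columns.
overall = [sum(col[i] for col in normalized.values()) / len(normalized)
           for i in range(len(models))]
for model, score in zip(models, overall):
    print(f"{model}: {score:.2f}")
```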

Our findings indicate that there is still significant room for improvement in the honesty of LLMs. We encourage the AI community to prioritize honesty alignment in these models so that their full potential can be harnessed to benefit society while preventing harm from deception or inconsistency.

Abbreviations: Syco. = Sycophancy; Perf. Spread = Performance Spread; Demo. = Demonstration; Inconsist. = Inconsistency; Agree. = Agreement; Consist. = Consistency; M.C. = Multiple-Choice.

📬 Contact

If you have any questions regarding this project, feel free to submit a GitHub issue or reach out to us via email.

BibTeX

@article{chern2024behonest,
  title={BeHonest: Benchmarking Honesty in Large Language Models},
  author={Chern, Steffi and Hu, Zhulin and Yang, Yuqing and Chern, Ethan and Guo, Yuan and Jin, Jiahe and Wang, Binjie and Liu, Pengfei},
  journal={arXiv preprint arXiv:2406.13261},
  url={https://arxiv.org/abs/2406.13261},
  year={2024}
}