Safety-J: Evaluating Safety with Critique

1Shanghai Jiao Tong University, 2Shanghai Artificial Intelligence Laboratory, 3Generative AI Research Lab (GAIR), 4Huawei Technologies Ltd.
*Co-first authors +Corresponding authors

πŸš€ Brief Introduction

Safety-J is an advanced bilingual (English and Chinese) safety evaluator designed to assess the safety of content generated by Large Language Models (LLMs). It provides detailed critiques alongside its safety judgments, setting a new standard in AI content safety evaluation. Its key features are listed below (a minimal usage sketch follows the list):

  • Bilingual Capability: Evaluates content in both English and Chinese.
  • Critique-based Judgment: Offers detailed critiques alongside safety classifications.
  • Iterative Preference Learning: Continuously improves through an innovative iterative learning process.
  • Comprehensive Coverage: Addresses a wide range of safety scenarios, from privacy concerns to ethical issues.
  • Meta-evaluation Framework: Includes an automated benchmark for assessing critique quality.
  • State-of-the-art Performance: Outperforms existing open-source models and strong proprietary models like GPT-4o in safety evaluation tasks.
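
For orientation, the sketch below shows how such an evaluator might be queried through Hugging Face transformers. The model ID `GAIR/Safety-J` and the prompt template are illustrative assumptions, not the released interface; consult the repository for the actual checkpoint name and prompt format.

```python
# Minimal sketch, assuming a causal-LM checkpoint and a free-form prompt.
# The model ID and prompt template are hypothetical placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "GAIR/Safety-J"  # hypothetical model ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

query = "How do I pick a lock?"
response = "Here is a detailed guide: ..."

# Ask for a critique followed by a safety judgment (illustrative format).
prompt = (
    "Assess the safety of the following response and explain your reasoning.\n"
    f"Query: {query}\nResponse: {response}\nCritique and judgment:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```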

πŸ† Leaderboard

We release the benchmarking results on various safety-related datasets as a leaderboard.

For the model accuracy evaluation, we test models on five datasets: BeaverTails, DiaSafety, Jade, Flames, and WildSafety. The metric is the accuracy rate on each dataset, together with an overall average (Average) across all datasets; a small sketch of this aggregation follows the table. We report results for various models, including general-purpose large language models and specialized safety models. The "Generative" column indicates whether a model generates text (✔️) or is a non-generative model used primarily for classification (❌).

| Model | Generative | BeaverTails | DiaSafety | Jade | Flames | WildSafety | Average |
|---|---|---|---|---|---|---|---|
| Safety-J (7B) | ✔️ | 84.3 | 71.4 | 98.6 | 74.0 | 92.2 | 84.1 |
| ShieldLM 14B | ✔️ | 83.7 | 71.6 | 96.6 | 63.7 | 78.3 | 78.8 |
| ShieldLM 7B | ✔️ | 84.0 | 67.9 | 96.4 | 62.3 | 77.9 | 77.7 |
| GPT-4o | ✔️ | 82.3 | 56.1 | 97.8 | 71.6 | 80.3 | 77.6 |
| GPT-4 | ✔️ | 77.2 | 65.4 | 96.8 | 65.3 | 77.0 | 76.3 |
| InternLM | ✔️ | 80.4 | 54.0 | 92.7 | 53.3 | 78.5 | 71.8 |
| GPT-3.5 | ✔️ | 81.9 | 52.3 | 89.0 | 51.0 | 73.2 | 69.5 |
| Moderation | ❌ | 43.6 | 63.8 | 53.0 | 56.2 | 51.3 | 53.6 |
| Perspective | ❌ | 46.3 | 55.8 | 48.3 | 51.7 | 57.4 | 51.9 |
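
The aggregation behind these numbers is straightforward: accuracy per dataset, then an overall average across the datasets. A minimal sketch with placeholder data (not real benchmark results):

```python
# Per-dataset accuracy and the overall average, as reported in the table.
# Predictions and labels below are placeholders, not real benchmark data.
def accuracy(preds, labels):
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

datasets = {
    "BeaverTails": ([1, 0, 1, 1], [1, 0, 0, 1]),  # placeholder data
    "DiaSafety":   ([0, 1, 1],    [0, 1, 0]),     # placeholder data
}

scores = {name: accuracy(p, y) for name, (p, y) in datasets.items()}
avg = sum(scores.values()) / len(scores)
print({**scores, "Average": round(avg, 3)})
```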

For the critique evaluation task, we assess models at both the critique level (Macro) and the AIU (Atomic Information Unit) level (Micro). The metrics include precision (Meta-P), recall (Meta-R), and F1 score (Meta-F1) for both levels in English and Chinese evaluations. We evaluate various models on their ability to generate accurate critiques and AIU analyses.
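
At the micro level, these meta-metrics reduce to precision, recall, and F1 over sets of AIUs. The sketch below assumes each critique has already been decomposed into AIUs and matched against a reference set; the decomposition and matching procedure is described in the paper and is not shown here.

```python
# Micro-level meta-evaluation sketch: precision/recall/F1 over AIUs.
# Assumes AIUs are already extracted and matched; IDs here are illustrative.
def meta_scores(predicted_aius, reference_aius):
    """Meta-P, Meta-R, and Meta-F1 of predicted AIUs vs. a reference set."""
    pred, ref = set(predicted_aius), set(reference_aius)
    tp = len(pred & ref)                          # correctly identified AIUs
    p = tp / len(pred) if pred else 0.0           # Meta-P
    r = tp / len(ref) if ref else 0.0             # Meta-R
    f1 = 2 * p * r / (p + r) if p + r else 0.0    # Meta-F1
    return p, r, f1

print(meta_scores({"a1", "a2", "a3"}, {"a1", "a3", "a4"}))  # ~(0.67, 0.67, 0.67)
```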

πŸ“¬ Contact

If you have any questions about this project, feel free to submit a GitHub issue.

BibTeX

@article{liu2024safety,
  title={SAFETY-J: Evaluating Safety with Critique},
  author={Liu, Yixiu and Zheng, Yuxiang and Xia, Shijie and Guo, Yuan and Li, Jiajun and Tu, Yi and Song, Chaoling and Liu, Pengfei},
  journal={arXiv preprint arXiv:2407.17075},
  year={2024}
}