Safety-J: Evaluating Safety with Critique

1Shanghai Jiao Tong University, 2Shanghai Artificial Intelligence Laboratory, 3Generative AI Research Lab (GAIR), 4Huawei Technologies Ltd.
*Co-first authors +Corresponding authors

πŸš€ Brief Introduction

Safety-J is an advanced bilingual (English and Chinese) safety evaluator designed to assess the safety of content generated by Large Language Models (LLMs). It provides detailed critiques alongside its safety judgments, setting a new standard in AI content safety evaluation. Its key features are listed below (a minimal usage sketch follows the list):

  • Bilingual Capability: Evaluates content in both English and Chinese.
  • Critique-based Judgment: Offers detailed critiques alongside safety classifications.
  • Iterative Preference Learning: Continuously improves through an innovative iterative learning process.
  • Comprehensive Coverage: Addresses a wide range of safety scenarios, from privacy concerns to ethical issues.
  • Meta-evaluation Framework: Includes an automated benchmark for assessing critique quality.
  • State-of-the-art Performance: Outperforms existing open-source models and strong proprietary models like GPT-4o in safety evaluation tasks.
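
For orientation, the sketch below shows how such an evaluator might be queried through Hugging Face transformers. The model ID `GAIR/Safety-J` and the prompt template are illustrative assumptions, not the released interface; consult the repository for the actual checkpoint name and prompt format.

```python
# Minimal sketch, assuming a causal-LM checkpoint and a free-form prompt.
# The model ID and prompt template are hypothetical placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "GAIR/Safety-J"  # hypothetical model ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

query = "How do I pick a lock?"
response = "Here is a detailed guide: ..."

# Ask for a critique followed by a safety judgment (illustrative format).
prompt = (
    "Assess the safety of the following response and explain your reasoning.\n"
    f"Query: {query}\nResponse: {response}\nCritique and judgment:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```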

πŸ† Leaderboard

We release the benchmarking results on various safety-related datasets as a leaderboard.

For the model accuracy evaluation, we test models on five datasets: BeaverTails, DiaSafety, Jade, Flames, and WildSafety. The metric is the accuracy rate on each dataset, together with an overall average (Average) across all datasets; a small sketch of this aggregation follows the table. We report results for various models, including general-purpose large language models and specialized safety models. The "Generative" column indicates whether a model generates text (✔️) or is a non-generative model used primarily for classification (❌).

| Model | Generative | BeaverTails | DiaSafety | Jade | Flames | WildSafety | Average |
|---|---|---|---|---|---|---|---|
| Safety-J (7B) | ✔️ | 84.3 | 71.4 | 98.6 | 74.0 | 92.2 | 84.1 |
| ShieldLM 14B | ✔️ | 83.7 | 71.6 | 96.6 | 63.7 | 78.3 | 78.8 |
| ShieldLM 7B | ✔️ | 84.0 | 67.9 | 96.4 | 62.3 | 77.9 | 77.7 |
| GPT-4o | ✔️ | 82.3 | 56.1 | 97.8 | 71.6 | 80.3 | 77.6 |
| GPT-4 | ✔️ | 77.2 | 65.4 | 96.8 | 65.3 | 77.0 | 76.3 |
| InternLM | ✔️ | 80.4 | 54.0 | 92.7 | 53.3 | 78.5 | 71.8 |
| GPT-3.5 | ✔️ | 81.9 | 52.3 | 89.0 | 51.0 | 73.2 | 69.5 |
| Moderation | ❌ | 43.6 | 63.8 | 53.0 | 56.2 | 51.3 | 53.6 |
| Perspective | ❌ | 46.3 | 55.8 | 48.3 | 51.7 | 57.4 | 51.9 |
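
The aggregation behind these numbers is straightforward: accuracy per dataset, then an overall average across the datasets. A minimal sketch with placeholder data (not real benchmark results):

```python
# Per-dataset accuracy and the overall average, as reported in the table.
# Predictions and labels below are placeholders, not real benchmark data.
def accuracy(preds, labels):
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

datasets = {
    "BeaverTails": ([1, 0, 1, 1], [1, 0, 0, 1]),  # placeholder data
    "DiaSafety":   ([0, 1, 1],    [0, 1, 0]),     # placeholder data
}

scores = {name: accuracy(p, y) for name, (p, y) in datasets.items()}
avg = sum(scores.values()) / len(scores)
print({**scores, "Average": round(avg, 3)})
```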

For the critique evaluation task, we assess models at both the critique level (Macro) and the AIU (Atomic Information Unit) level (Micro). The metrics include precision (Meta-P), recall (Meta-R), and F1 score (Meta-F1) for both levels in English and Chinese evaluations. We evaluate various models on their ability to generate accurate critiques and AIU analyses.
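
At the micro level, these meta-metrics reduce to precision, recall, and F1 over sets of AIUs. The sketch below assumes each critique has already been decomposed into AIUs and matched against a reference set; the decomposition and matching procedure is described in the paper and is not shown here.

```python
# Micro-level meta-evaluation sketch: precision/recall/F1 over AIUs.
# Assumes AIUs are already extracted and matched; IDs here are illustrative.
def meta_scores(predicted_aius, reference_aius):
    """Meta-P, Meta-R, and Meta-F1 of predicted AIUs vs. a reference set."""
    pred, ref = set(predicted_aius), set(reference_aius)
    tp = len(pred & ref)                          # correctly identified AIUs
    p = tp / len(pred) if pred else 0.0           # Meta-P
    r = tp / len(ref) if ref else 0.0             # Meta-R
    f1 = 2 * p * r / (p + r) if p + r else 0.0    # Meta-F1
    return p, r, f1

print(meta_scores({"a1", "a2", "a3"}, {"a1", "a3", "a4"}))  # ~(0.67, 0.67, 0.67)
```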

πŸ“¬ Contact

If you have any questions about this project, feel free to submit a GitHub issue.

BibTeX

@article{liu2024safety,
  title={SAFETY-J: Evaluating Safety with Critique},
  author={Liu, Yixiu and Zheng, Yuxiang and Xia, Shijie and Guo, Yuan and Li, Jiajun and Tu, Yi and Song, Chaoling and Liu, Pengfei},
  journal={arXiv preprint arXiv:2407.17075},
  year={2024}
}