Because training processes are opaque, directly comparing large language models (LLMs) can be unfair, which in turn hinders the healthy development of the field.
This leaderboard reports the relative likelihood that a model was trained verbatim on a benchmark's training set (rather than its test set) to enhance capabilities, measured via perplexity (PPL) and n-gram accuracy. Near-zero values suggest either that neither split was used for training or that both splits were used. A high value does not imply cheating; it indicates that the benchmark data was potentially used during the (pre-)training phase. While using benchmarks to enhance capabilities is acceptable, the lack of relevant documentation reduces transparency, potentially resulting in unfair comparisons and hindering the field's healthy development.
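As a rough illustration of the PPL side of this signal, the sketch below scores a model's perplexity on benchmark answers for the training and test splits and reports their relative gap. This is a minimal sketch assuming a HuggingFace-style causal LM; the helper names (`answer_perplexity`, `ppl_gap`), the split format, and the exact normalization are our assumptions, not the leaderboard's published formula.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# model = AutoModelForCausalLM.from_pretrained(...)   # any causal LM
# tokenizer = AutoTokenizer.from_pretrained(...)

def answer_perplexity(model, tokenizer, question: str, answer: str) -> float:
    """Perplexity of the answer tokens conditioned on the question."""
    q_ids = tokenizer(question, return_tensors="pt").input_ids
    full = tokenizer(question + answer, return_tensors="pt").input_ids
    labels = full.clone()
    # Approximate: assumes the question tokenizes identically as a prefix.
    labels[:, : q_ids.shape[1]] = -100  # -100 = ignored, so only answer tokens are scored
    with torch.no_grad():
        loss = model(input_ids=full, labels=labels).loss  # mean NLL over answer tokens
    return torch.exp(loss).item()

def ppl_gap(model, tokenizer, train_split, test_split) -> float:
    """Relative PPL difference between test and train splits (illustrative).

    A clearly positive gap means the model fits the training split
    unusually well relative to the test split, hinting at verbatim
    exposure during training; near-zero suggests neither or both
    splits were seen.
    """
    def mean_ppl(split):
        return sum(answer_perplexity(model, tokenizer, q, a) for q, a in split) / len(split)
    ppl_train, ppl_test = mean_ppl(train_split), mean_ppl(test_split)
    return 100.0 * (ppl_test - ppl_train) / ppl_test
```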
Model | Δ via 5-gram accuracy | Δ via PPL (answer) |
---|---|---|
Aquila2-7B | 23.24 | 75.25 |
InternLM2-20B | 29.11 | 67.77 |
InternLM2-7B | 30.12 | 64.75 |
Qwen1.5-7B | 27.38 | 48.78 |
Qwen1.5-14B | 25.62 | 40.24 |
Aquila2-34B | 23.12 | 39.78 |
Qwen-14B | 35.16 | 26.94 |
Qwen-7B | 35.54 | 25.81 |
Qwen1.5-1.8B | 25.17 | 17.59 |
ChatGLM2-6B | 13.65 | 28.76 |
Qwen-1.8B | 22.87 | 10.30 |
Baichuan2-13B-Base | 17.85 | 11.82 |
InternLM-20B | 10.68 | 16.49 |
Orca-2-7B | 6.45 | 17.10 |
Yi-34B | 11.55 | 10.19 |
Yi-6B | 9.05 | 5.30 |
Aquila-7B | 2.57 | 9.80 |
Phi-2 | 10.60 | -2.03 |
InternLM-7B | 3.86 | 4.13 |
ChatGLM3-6B | 1.57 | 5.43 |
Yi1.5-34B | 3.41 | 1.84 |
LLaMA2-7B | 3.82 | -0.29 |
Yi1.5-6B | 3.42 | 0.05 |
Phi-1.5 | 4.46 | -1.74 |
Baichuan-7B | 1.66 | -1.08 |
Mistral-7B-v0.1 | 0.96 | -0.63 |
LLaMA-7B | 1.07 | -0.85 |
Grok-1 | -0.54 | -0.39 |
Gemma-2B | -0.57 | -1.48 |
Gemma-7B | -1.91 | -0.70 |
Llama-3-8B | -2.13 | -0.87 |
Baichuan2-7B-Base | -2.06 | -1.23 |
Baichuan-13B-Base | -2.58 | -0.91 |
InternLM2-20B-Base | -2.42 | -1.16 |
DeepSeekMath-7B | -2.64 | -0.97 |
InternLM2-7B-Base | -3.21 | -0.83 |
Llama-3-8B-Instruct | -4.50 | -0.22 |
Model | Δ via 5-gram accuracy | Δ via PPL (answer) |
---|---|---|
Aquila2-7B | 15.24 | 158.76 |
InternLM2-20B | 20.48 | 72.44 |
Aquila2-34B | 18.50 | 70.23 |
Yi1.5-6B | 19.21 | 59.88 |
Yi1.5-34B | 20.55 | 54.17 |
InternLM2-7B | 17.67 | 45.22 |
Qwen1.5-1.8B | 8.08 | 23.11 |
Qwen1.5-7B | 4.74 | 25.32 |
Qwen1.5-14B | 5.06 | 22.20 |
Qwen-14B | 3.15 | 20.34 |
Qwen-7B | 4.20 | 17.60 |
Qwen-1.8B | 9.94 | 9.79 |
ChatGLM3-6B | 5.33 | 9.69 |
InternLM-7B | 4.11 | 6.05 |
InternLM-20B | 0.71 | 4.85 |
Gemma-2B | 2.66 | 1.05 |
Orca-2-7B | 2.24 | 1.31 |
Llama-3-8B | 3.06 | -0.01 |
Gemma-7B | 3.54 | -0.63 |
Yi-6B | 3.05 | -0.55 |
Yi-34B | 2.81 | -0.60 |
Llama-3-8B-Instruct | 2.47 | -0.28 |
Grok-1 | 0.36 | 1.39 |
Phi-2 | 2.07 | -0.41 |
LLaMA-7B | 2.28 | -0.88 |
DeepSeekMath-7B | 1.34 | -0.64 |
ChatGLM2-6B | 1.57 | -0.92 |
InternLM2-7B-Base | 0.58 | -0.20 |
Phi-1.5 | 0.78 | -0.43 |
LLaMA2-7B | 0.09 | 0.18 |
Baichuan-7B | 0.15 | -0.09 |
InternLM2-20B-Base | 0.46 | -0.61 |
Baichuan2-7B-Base | -0.58 | -0.13 |
Mistral-7B-v0.1 | -0.49 | -0.62 |
Baichuan-13B-Base | -0.61 | -0.53 |
Aquila-7B | -2.10 | 0.48 |
Baichuan2-13B-Base | -2.40 | -0.56 |
Amid the expanding use of pre-training data, benchmark dataset leakage has become increasingly prominent, exacerbated by opaque training processes and the often undisclosed inclusion of supervised data in contemporary Large Language Models (LLMs). This issue skews benchmark effectiveness and fosters potentially unfair comparisons, impeding the field's healthy development. Because training data and model details are often opaque, and because leakage detection is influenced by factors such as model size and training strategies, detecting benchmark leakage is not a trivial task. In this work, we do not pursue technical contributions in system development; instead, we attempt to encourage the healthy development of this field, particularly through the lens of mathematical reasoning tasks, in the following aspects:
High prediction accuracy across all n-grams of an example suggests a high probability that the sample was encountered during training. To investigate instance-level leakage, we took a closer look at n-gram predictions across different models. Additionally, considering that benchmark data may undergo reformatting, paraphrasing, or other modifications when integrated into model training, we employ lenient metrics, namely ROUGE-L and edit-distance similarity, for comparing n-grams. In this setting, an instance is deemed correctly predicted if it achieves an Exact Match (all predictions align perfectly), or if every prediction both exceeds an edit-distance similarity of 0.9 (indicating substantial similarity) and surpasses a ROUGE-L score of 0.75.
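The lenient matching rule can be sketched as follows. The helper names (`edit_similarity`, `rouge_l`, `instance_leaked`) are ours, and `difflib`'s ratio is used as a stand-in for edit-distance similarity; the paper's exact implementation may differ (e.g., it may use Levenshtein distance directly). Generating the n-gram predictions themselves is omitted here.

```python
from difflib import SequenceMatcher

def edit_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a proxy for
    1 - edit_distance / max_len."""
    return SequenceMatcher(None, a, b).ratio()

def rouge_l(pred: str, ref: str) -> float:
    """Word-level ROUGE-L F1 from the longest common subsequence."""
    p, r = pred.split(), ref.split()
    dp = [[0] * (len(r) + 1) for _ in range(len(p) + 1)]
    for i, pw in enumerate(p):
        for j, rw in enumerate(r):
            dp[i + 1][j + 1] = dp[i][j] + 1 if pw == rw else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(p), lcs / len(r)
    return 2 * prec * rec / (prec + rec)

def instance_leaked(predictions, references) -> bool:
    """An instance counts as correctly predicted if all its n-gram
    predictions match exactly, or if every prediction clears both
    lenient thresholds (edit similarity > 0.9 and ROUGE-L > 0.75)."""
    if all(p == r for p, r in zip(predictions, references)):
        return True
    return all(
        edit_similarity(p, r) > 0.9 and rouge_l(p, r) > 0.75
        for p, r in zip(predictions, references)
    )
```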
We observe that many models can precisely predict all n-grams of examples from a benchmark's training set, and even its test set. Surprisingly, Qwen-1.8B can accurately predict all 5-grams in 223 examples from the GSM8K training set and 67 from the MATH training set, with an additional 25 correct predictions even on the MATH test set. We would like to emphasize that the n-gram accuracy metric can mitigate blind spots in our detection pipeline, particularly when the training and test sets are leaked simultaneously and would otherwise go undetected. However, it also has limitations: it can only detect examples that were integrated into model training in their original format and wording, unless the organizational format of the model's training data is known in advance.
In the first case, the Qwen-1.8B model achieves perfect n-gram predictions on a sample from the GSM8K training set, completing all 5-grams accurately; this strongly suggests data leakage from the GSM8K training set. We also conducted a case study on the Aquila2-34B model, which is known to have been accidentally exposed to the entire GSM8K test set. It consistently predicts the n-gram "The answer is" wherever the ground truth contains the placeholder "####", which illustrates exactly why such reformatted leakage is challenging to detect with our n-gram accuracy metric. To help readers better understand model behaviors, we have released an interactive demo for case studies, available at Hugging Face Space: BenBench.
To ensure fairness in the evaluation of large language models moving forward, we propose the following suggestions:
Documentation: Any LLM release should be accompanied by comprehensive documentation. At a minimum, this documentation should specify whether the model was trained on the training or test sets of commonly used benchmarks, so as to prevent potentially unfair comparisons. To this end, we introduce the Benchmark Transparency Card, which complements the Data Card and Model Card by documenting benchmark usage (such as whether any benchmark splits were used for training and whether any data augmentation techniques were applied) as well as benchmark evaluation details. We hope this card will be widely adopted upon model release to foster the healthy development of large language models.
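To make this concrete, a single entry of such a card might look like the sketch below. The structure and field names (`benchmark_transparency_card` and its keys) are purely illustrative assumptions on our part, not a released specification.

```python
# A hypothetical Benchmark Transparency Card entry, sketched as a Python dict.
# All field names and values are illustrative.
benchmark_transparency_card = {
    "model": "ExampleLM-7B",  # hypothetical model name
    "benchmarks": [
        {
            "name": "GSM8K",
            "train_split_used_in_training": True,
            "test_split_used_in_training": False,
            "data_augmentation": "answers rewritten as chain-of-thought",  # if any
            "evaluation_protocol": "8-shot, greedy decoding",
        },
    ],
}
```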
@article{xu2024benchmarking,
title={Benchmarking Benchmark Leakage in Large Language Models},
author={Xu, Ruijie and Wang, Zengzhi and Fan, Run-Ze and Liu, Pengfei},
year={2024},
journal={arXiv preprint arXiv:2404.18824},
url={https://arxiv.org/abs/2404.18824}
}