We evaluate a range of models, including both closed- and open-source LLMs and LMMs. All experiments use zero-shot prompts tailored to each answer type; each prompt specifies the expected output format to facilitate answer extraction and rule-based matching.
For CS problems, we set the inference temperature to 0.2 to sample multiple, diverse candidate answers; for all other disciplines, the temperature is set to 0.0. We also cap the output at 2048 tokens. Below is the leaderboard for our benchmark's validation and test sets.
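The per-discipline sampling settings above can be sketched as a small helper. This is an illustrative sketch, not the benchmark's actual code: the function name, the discipline label `"CS"`, and the candidate count of 5 are assumptions (the text only says "multiple" candidates are sampled for CS).

```python
def inference_config(discipline: str) -> dict:
    """Hypothetical helper returning sampling parameters per discipline.

    CS problems use temperature 0.2 so that several diverse candidate
    answers can be sampled; all other disciplines use greedy decoding
    (temperature 0.0). Output length is capped at 2048 tokens in all cases.
    """
    is_cs = discipline == "CS"
    return {
        "temperature": 0.2 if is_cs else 0.0,
        "max_tokens": 2048,
        # Number of sampled candidates: 5 is an assumed value for CS;
        # non-CS disciplines run a single deterministic pass.
        "n_samples": 5 if is_cs else 1,
    }
```

The split mirrors the usual trade-off: deterministic decoding for single-answer grading, and a small nonzero temperature when multiple candidates are aggregated.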
Note: The results reported in the paper are based on the combined outcomes of the validation and test sets, plus some additional problems on which the models were evaluated, so they may differ slightly from the numbers shown below.