🏟️ OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI

1Shanghai Jiao Tong University, 2Shanghai Artificial Intelligence Laboratory, 3Generative AI Research Lab (GAIR)
*Corresponding authors

🚀 Brief Introduction

The evolution of Artificial Intelligence (AI) has been significantly accelerated by advances in Large Language Models (LLMs) and Large Multimodal Models (LMMs), which increasingly demonstrate cognitive reasoning abilities in problem-solving and scientific discovery (i.e., AI4Science) once exclusive to human intellect. In this paper, we introduce OlympicArena, a comprehensive, highly challenging, and rigorously curated benchmark with a detailed, fine-grained evaluation mechanism designed to assess advanced AI capabilities across a broad spectrum of Olympic-level challenges.

  • Comprehensive: The benchmark comprises 11,163 problems from 62 distinct Olympic competitions, structured with 13 answer types. It spans seven core disciplines: mathematics, physics, chemistry, biology, geography, astronomy, and computer science, encompassing a total of 34 specialized branches.
  • Highly challenging: The benchmark focuses on Olympic-level problems and covers 8 types of logical reasoning abilities and 5 types of visual reasoning abilities.
  • Rigorous: Given the increasing scale of pre-training corpora, it is crucial to detect potential benchmark leakage. We employ a recently proposed instance-level leakage detection metric to validate our benchmark’s effectiveness.
  • Fine-grained Evaluation: We conduct comprehensive evaluations from both the answer-level and process-level perspectives. Additionally, we perform fine-grained evaluations and analyses on different types of cognitive reasoning, from both logical and visual perspectives to better interpret the current capabilities of AI.

Through the OlympicArena, we aim to advance AI towards superintelligence, equipping it to address more complex challenges in science and beyond.

🏟️ Leaderboard

We evaluate a variety of models, including LLMs and LMMs, covering both closed-source and open-source versions. All experiments use zero-shot prompts tailored to each answer type; the prompts specify the expected output format to facilitate answer extraction and rule-based matching (a sketch of such matching follows).
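
For illustration, the snippet below sketches how rule-based answer matching might work for a single-value answer type. The prompt instruction, regex, and tolerance are illustrative assumptions, not the exact rules used by the official evaluation scripts.

```python
import re

# Hypothetical instruction appended to the prompt so the final answer is machine-readable.
ANSWER_INSTRUCTION = 'End your response with: "The final answer is \\boxed{ANSWER}."'

def extract_final_answer(model_output: str) -> str | None:
    """Return the content of the last \\boxed{...} span in the model output, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", model_output)
    return matches[-1].strip() if matches else None

def matches_gold(prediction: str | None, gold: str, tol: float = 1e-2) -> bool:
    """Rule-based check: numeric comparison with tolerance, else normalized string match."""
    if prediction is None:
        return False
    try:
        return abs(float(prediction) - float(gold)) <= tol
    except ValueError:
        return prediction.replace(" ", "").lower() == gold.replace(" ", "").lower()

# e.g. matches_gold(extract_final_answer(r"... so the answer is \boxed{42}."), "42")  # True
```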

For CS problems, we set the inference temperature to 0.2 to obtain multiple and diverse candidate results, while for all other disciplines, the temperature is set to 0.0. Additionally, we set the maximum length of output tokens to 2048. Here is the leaderboard for our benchmark's validation and test sets.
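
For reference, here is a minimal sketch of these decoding settings using an OpenAI-style chat client; the client, model name, and helper function are illustrative assumptions rather than our released inference code.

```python
from openai import OpenAI  # any chat-completion client exposing the same parameters works

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def run_inference(prompt: str, discipline: str, model: str = "gpt-4o") -> str:
    """Query a model with the decoding settings described above:
    temperature 0.2 for CS (to sample diverse candidate programs),
    temperature 0.0 for all other disciplines, and at most 2048 output tokens."""
    temperature = 0.2 if discipline == "CS" else 0.0
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        max_tokens=2048,
    )
    return response.choices[0].message.content
```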

Note: The results reported in the paper are based on the combined outcomes of the validation and test sets, together with some additional problems evaluated by the models; they may therefore differ slightly from the numbers presented below.

| Model | Setting | Date | Overall | Math | Physics | Chemistry | Biology | Geography | Astronomy | CS | EN | ZH | Text-only | Multi-modal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | LMM | 2024-6-12 | 34.01 | 34.43 | 34.44 | 29.23 | 30.16 | 44.12 | 33.33 | 15.56 | 34.44 | 33.55 | 35.93 | 31.54 |
| GPT-4V | LMM | 2024-6-12 | 26.18 | 22.95 | 26.67 | 30.77 | 25.40 | 33.82 | 28.89 | 7.78 | 24.47 | 28.01 | 27.30 | 24.73 |
| Claude-3-Sonnet | LMM | 2024-6-12 | 18.97 | 12.70 | 20.00 | 18.46 | 28.57 | 25.00 | 25.56 | 6.67 | 19.34 | 18.57 | 18.66 | 19.35 |
| Gemini Pro Vision | LMM | 2024-6-12 | 17.08 | 11.07 | 15.56 | 15.38 | 33.33 | 30.88 | 16.67 | 2.22 | 12.69 | 21.82 | 16.43 | 17.92 |
| Qwen-VL-Max | LMM | 2024-6-12 | 17.08 | 10.25 | 10.00 | 20.00 | 28.57 | 35.29 | 22.22 | 0 | 11.18 | 23.45 | 16.43 | 17.92 |
| LLaVA-NeXT-34B | LMM | 2024-6-12 | 12.70 | 3.69 | 8.89 | 15.38 | 26.98 | 35.29 | 14.44 | 0 | 6.34 | 19.54 | 8.91 | 17.56 |
| InternVL-Chat-V1.5 | LMM | 2024-6-12 | 12.70 | 5.74 | 7.78 | 15.38 | 28.57 | 27.94 | 13.33 | 2.78 | 7.85 | 17.92 | 11.70 | 13.98 |
| Yi-VL-34B | LMM | 2024-6-12 | 9.87 | 3.28 | 6.67 | 13.85 | 19.05 | 25.00 | 12.22 | 0 | 5.44 | 14.66 | 9.19 | 10.75 |
| Qwen-VL-Chat | LMM | 2024-6-12 | 4.08 | 2.05 | 5.56 | 4.62 | 4.76 | 7.35 | 5.56 | 0 | 3.63 | 4.56 | 4.18 | 3.94 |
| Qwen1.5-32B-Chat | LLM | 2024-6-20 | 21.16 | 13.11 | 15.56 | 16.92 | 28.57 | 47.06 | 30.00 | 3.33 | 15.11 | 27.69 | 19.22 | 23.66 |
| Internlm2-Chat-20B | LLM | 2024-6-20 | 13.95 | 7.38 | 7.78 | 18.46 | 31.75 | 32.35 | 10.00 | 2.22 | 8.16 | 20.20 | 11.70 | 16.85 |
| Yi-34B-Chat | LLM | 2024-6-20 | 12.54 | 3.28 | 7.78 | 15.38 | 20.63 | 42.65 | 14.44 | 0 | 5.44 | 20.20 | 8.64 | 17.56 |
| Qwen-7B-Chat | LLM | 2024-6-20 | 3.45 | 2.05 | 3.33 | 3.08 | 7.94 | 2.94 | 5.56 | 0 | 2.11 | 4.89 | 3.34 | 3.58 |
| Llama-3-70B-Instruct | LLM | 2024-6-21 | 21.63 | 15.16 | 22.22 | 21.54 | 26.98 | 32.35 | 30.00 | 2.22 | 19.03 | 24.43 | 22.56 | 20.43 |
| Claude-3.5-Sonnet | LMM | 2024-6-21 | 33.54 | 30.74 | 31.11 | 30.77 | 39.68 | 45.59 | 35.56 | 14.44 | 30.21 | 37.13 | 35.38 | 31.18 |
| Gemini-1.5-Pro | LMM | 2024-6-22 | 28.68 | 23.77 | 27.78 | 26.15 | 31.75 | 44.12 | 33.33 | 12.22 | 25.08 | 32.57 | 28.69 | 28.67 |
| Doubao-Pro-32k | LLM | 2024-6-25 | 31.66 | 25.82 | 22.22 | 36.92 | 50.79 | 41.18 | 36.67 | 8.89 | 23.26 | 40.72 | 33.43 | 29.39 |
| DeepSeek-Coder-V2 | LLM | 2024-6-26 | 29.31 | 29.51 | 26.67 | 23.08 | 33.33 | 35.29 | 32.22 | 6.67 | 27.79 | 30.94 | 32.03 | 25.81 |

In addition to presenting scores by discipline, we also display scores by language (EN: English, ZH: Chinese) and by modality. Results marked with an asterisk (*) are provided by the authors. If you wish to have your model's results featured on our website's leaderboard (either val or test), please contact us via email at gair.olympicarena@gmail.com with your Submission ID and results.

Medal Rank

The OlympicArena Medal Table, modeled on the medal system of the Olympic Games, is a ranking mechanism designed to compare the performance of AI models across academic disciplines. Medals are awarded to the models that achieve the top three scores in each discipline, providing a clear and competitive framework for comparison. Specifically, we rank models first by the number of Gold medals, then Silver, then Bronze, and finally by the overall score if ties remain (see the sketch below). This offers a straightforward and intuitive way to identify the leading models in each field and to understand their relative strengths and weaknesses.

Note: The medal table is generated based on the leaderboard scores of the test set.
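
A minimal sketch of this tie-breaking order is shown below; the dictionary layout and field names are illustrative assumptions, not the format of the released evaluation code.

```python
from typing import Dict, List

def medal_rank(models: Dict[str, Dict[str, float]]) -> List[str]:
    """Sort models by Gold, then Silver, then Bronze medal counts,
    breaking remaining ties with the overall score (all descending)."""
    return sorted(
        models,
        key=lambda name: (
            models[name]["gold"],
            models[name]["silver"],
            models[name]["bronze"],
            models[name]["overall"],
        ),
        reverse=True,
    )

# Example with made-up medal counts:
# medal_rank({
#     "Model-A": {"gold": 3, "silver": 1, "bronze": 0, "overall": 34.0},
#     "Model-B": {"gold": 3, "silver": 1, "bronze": 0, "overall": 33.5},
#     "Model-C": {"gold": 2, "silver": 4, "bronze": 1, "overall": 31.7},
# })
# -> ["Model-A", "Model-B", "Model-C"]
```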

📬 Contact

If you have any questions about this project, feel free to reach out via email at gair.olympicarena@gmail.com or open a GitHub issue directly.

BibTeX

@article{huang2024olympicarena,
      title={OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI},
      author={Zhen Huang and Zengzhi Wang and Shijie Xia and Xuefeng Li and Haoyang Zou and Ruijie Xu and Run-Ze Fan and Lyumanshan Ye and Ethan Chern and Yixin Ye and Yikai Zhang and Yuqing Yang and Ting Wu and Binjie Wang and Shichao Sun and Yang Xiao and Yiyuan Li and Fan Zhou and Steffi Chern and Yiwei Qin and Yan Ma and Jiadi Su and Yixiu Liu and Yuxiang Zheng and Shaoting Zhang and Dahua Lin and Yu Qiao and Pengfei Liu},
      year={2024},
      journal={arXiv preprint arXiv:2406.12753},
      url={https://arxiv.org/abs/2406.12753}
}