We evaluate a range of models, including both closed- and open-source LLMs and LMMs. All experiments use zero-shot prompts tailored to each answer type; each prompt specifies the expected output format to facilitate answer extraction and rule-based matching.
For CS problems, we set the inference temperature to 0.2 to sample multiple, diverse candidate answers; for all other disciplines, the temperature is set to 0.0. We also cap the output at 2048 tokens. Below is the leaderboard for our benchmark's validation and test sets.
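The per-discipline sampling settings above can be sketched as a small helper. This is an illustrative sketch, not the benchmark's actual code: the function name, the discipline label `"CS"`, and the candidate count of 5 are assumptions (the text only says "multiple" candidates are sampled for CS).

```python
def inference_config(discipline: str) -> dict:
    """Hypothetical helper returning sampling parameters per discipline.

    CS problems use temperature 0.2 so that several diverse candidate
    answers can be sampled; all other disciplines use greedy decoding
    (temperature 0.0). Output length is capped at 2048 tokens in all cases.
    """
    is_cs = discipline == "CS"
    return {
        "temperature": 0.2 if is_cs else 0.0,
        "max_tokens": 2048,
        # Number of sampled candidates: 5 is an assumed value for CS;
        # non-CS disciplines run a single deterministic pass.
        "n_samples": 5 if is_cs else 1,
    }
```

The split mirrors the usual trade-off: deterministic decoding for single-answer grading, and a small nonzero temperature when multiple candidates are aggregated.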
Note: The results reported in the paper are based on the combined outcomes of the validation and test sets, plus some additional problems on which the models were evaluated, so they may differ slightly from the numbers shown below.