We evaluate various models, including LLMs and LMMs, covering both closed-source and open-source versions. All experiments use zero-shot prompts tailored to each answer type; each prompt specifies the expected output format to facilitate answer extraction and rule-based matching. For the LLMs, we do not provide any image information, given the potential hallucinations and instability of image captions; this also allows a more direct comparison with LMMs.
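As a rough illustration of what such an answer-type-specific prompt and the accompanying rule-based matching could look like (the template wording and the `extract_answer` helper below are hypothetical, not the exact prompts or scripts used in the benchmark):

```python
import re

# Hypothetical zero-shot template for multiple-choice problems; the prompt
# fixes the output format so the final answer can be matched by a rule.
MC_PROMPT = (
    "Answer the following multiple-choice question. "
    'End your response with a line of the form "Final answer: X", '
    "where X is a single option letter.\n\n{question}"
)

def extract_answer(response: str):
    """Pull the option letter from the model response with a simple regex."""
    match = re.search(r"Final answer:\s*([A-E])", response)
    return match.group(1) if match else None

# Rule-based matching against the gold label.
prediction = extract_answer("Some reasoning...\nFinal answer: C")
print(prediction == "C")  # True
```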
For CS problems, we set the inference temperature to 0.2 to obtain multiple, diverse candidate results; for all other disciplines, the temperature is set to 0.0. We also set the maximum number of output tokens to 2048. Note that some LMMs support only a single image input; in such cases, we input only the first image. For LMMs that require a mandatory image input, we use their corresponding text-only model when handling text-only problems. A minimal sketch of these inference settings is shown below, followed by the leaderboard for our benchmark's validation and test sets.
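The sketch below shows roughly how these decoding settings could be wired up, assuming an OpenAI-style chat-completions client; the client interface, the default model name, and the `discipline` argument are illustrative assumptions rather than our actual evaluation harness:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def run_inference(prompt: str, discipline: str, model: str = "gpt-4o") -> str:
    """Query a model with the settings described above: temperature 0.2 for
    CS problems (to sample diverse candidates), 0.0 for other disciplines,
    and at most 2048 output tokens."""
    temperature = 0.2 if discipline == "CS" else 0.0
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        max_tokens=2048,
    )
    return response.choices[0].message.content
```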
Note: The results reported in the paper are based on the combined outcomes of the validation and test sets, as well as some additional problems evaluated by the models, so they may differ slightly from the data presented below.