Generative AI for Math: Abel

1Generative Artificial Intelligence Research Lab (GAIR), Shanghai Jiaotong University
2Shanghai AI Lab
*Core contributors +Corresponding Author

Abstract

📝 Abel is named in tribute to Niels Henrik Abel for his groundbreaking work in algebra and analysis, areas in which our model is also relatively strong. There is still a long way for us to go, though 🏃‍♂️🏃‍♀️🏁🏃‍♂️🏃‍♀️.

We show that:

  • without tools
  • without continuing pretraining
  • without reward model
  • without RLHF
  • ONLY using SFT

We have established new state-of-the-art performance among open-source LLMs (that do not use external tools) on the GSM8K (83.62) and MATH (28.26) benchmarks. Specifically:

  • the performance on GSM8K, at 83.62%, surpasses top-tier models such as PaLM-1, Minerva (Google), Claude-Instant (Anthropic), and ChatGPT (OpenAI), lagging only about 1 percentage point behind Google's latest model, PaLM-2-Flan.
  • achieving an accuracy of 28.26% on highly challenging competition-level math problems (compared to GPT-4's 42.5%), it maintains a significant lead over other open-source models, surpassing the previous best open-source model by 5.56 percentage points.
  • the 7B and 13B models achieve new state-of-the-art results among open-source models of comparable size on both GSM8K and MATH.
  • GAIRMath-Abel secures 3 positions in the Top 10 rankings and stands as the only university-led project on the list (the others are either star startups or big tech companies).
  • using our approach, we not only achieve excellent results on GSM8K and MATH, but also, when given a new dataset (TALSCQ-EN), quickly attain state-of-the-art (SOTA) performance without much effort, surpassing the commercial multi-billion-dollar model MathGPT as well as GPT-4.

We demonstrate that:

  • the capabilities of SFT are significantly underestimated, and researchers should approach SFT with due reverence and caution
  • exceptional mathematical problem-solving capability can be achieved solely through SFT, which opens up more imaginative possibilities for future exploration in this direction.

Models and Performance

Numbers in "()" represent the improvement over the previous SOTA open-source method, i.e., WizardMath.

| Model Name | HF Checkpoints | GSM8K | MATH | License |
|---|---|---|---|---|
| GAIRMath-Abel-70B | 🤗 70B | 83.62 (+2.02) | 28.26 (+5.56) | Llama 2 |
| GAIRMath-Abel-13B | 🤗 13B | 66.41 (+2.51) | 17.34 (+3.34) | Llama 2 |
| GAIRMath-Abel-7B | 🤗 7B | 59.74 (+4.84) | 13.00 (+2.40) | Llama 2 |
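
For a quick start, the released checkpoints can be loaded with the Hugging Face `transformers` library as in the sketch below. The repository id (`GAIR/GAIRMath-Abel-7b`) and the prompt template are illustrative assumptions; refer to the 🤗 model cards linked above for the exact conventions used during fine-tuning.

```python
# Minimal sketch: load an Abel checkpoint and answer one GSM8K-style question
# with greedy decoding. The repository id and the prompt template below are
# assumptions -- consult the linked model cards for the exact conventions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "GAIR/GAIRMath-Abel-7b"  # hypothetical id; use the 🤗 links above

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

question = (
    "Natalia sold clips to 48 of her friends in April, and then she sold "
    "half as many clips in May. How many clips did Natalia sell altogether?"
)
prompt = f"Question:\n{question}\nAnswer:\n"  # assumed template

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
answer = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(answer)
```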

Methodology

We propose Parental Oversight, a babysitting strategy for supervised fine-tuning.

Parental Oversight is not limited to any specific data processing method. Instead, it defines the data processing philosophy that should guide supervised fine-tuning in the era of Generative AI (GAI). We believe that in the era of GAI, data structure engineering has emerged as a new paradigm. Within this paradigm, the manner in which the fine-tuning data is processed significantly impacts the performance of the trained GAI. We expect a growing number of studies in the community to focus on this data processing philosophy.

The principle of Parental Oversight emphasizes treating supervised fine-tuning with care and prudence. This is analogous to the way parents are encouraged to educate their children. Different types of data, along with their presentation formats (e.g., step-by-step reasoning, iterative refinement), can be likened to varied educational methods. Just as parents cautiously select the most effective approach to instruct their children, GAI practitioners should cautiously select the most effective data processing approaches to better instruct their LLMs.

Furthermore, the "more data, the better" philosophy does not always hold true. The quality and relevance of annotated samples can often outweigh their quantity. Training samples used in SFT should not just present the right answer, but also instruct the model on how the correct answer is derived from the knowledge the LLM already has. Additionally, if the LLM's knowledge is not sufficient to answer a question, Parental Oversight should step in to address the knowledge gaps promptly.
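
To make this philosophy concrete, the sketch below shows one plausible way to structure SFT records so that every sample carries a step-by-step derivation rather than a bare answer, with a toy quality check standing in for "parental oversight". The field names and the heuristic are illustrative assumptions, not Abel's actual data pipeline.

```python
# Illustrative sketch of data structure engineering for SFT: each training
# sample pairs a question with the reasoning steps that lead to the final
# answer, and records that only state an answer are filtered out.
# Field names and the quality heuristic are assumptions, not Abel's pipeline.
import json
from dataclasses import dataclass


@dataclass
class SFTSample:
    question: str
    steps: list[str]      # intermediate reasoning, one step per entry
    final_answer: str     # the terminal answer the steps lead to


def to_training_text(sample: SFTSample) -> dict:
    """Render a sample into the prompt/response pair fed to the trainer."""
    response = "\n".join(sample.steps) + f"\nThe answer is {sample.final_answer}."
    return {"prompt": f"Question:\n{sample.question}\nAnswer:\n", "response": response}


def passes_oversight(sample: SFTSample) -> bool:
    """Toy 'parental oversight' check: keep only samples that actually show
    how the answer is derived instead of just asserting it."""
    return len(sample.steps) >= 2 and sample.final_answer in sample.steps[-1]


samples = [
    SFTSample(
        question="Natalia sold 48 clips in April and half as many in May. How many in total?",
        steps=["In May she sold 48 / 2 = 24 clips.",
               "In total she sold 48 + 24 = 72 clips."],
        final_answer="72",
    ),
]

with open("sft_data.jsonl", "w") as f:
    for s in samples:
        if passes_oversight(s):
            f.write(json.dumps(to_training_text(s)) + "\n")
```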

Leaderboard for Mathematical Reasoning

🔒 denotes a proprietary model, while 🌍 denotes an open-source model

🎓 indicates that model development is led by a university (rather than a company)

We only consider models that do not use any external tools (e.g., Python)

| Ranking | Model | Param. | Leading Organization | GSM8K | MATH |
|---|---|---|---|---|---|
| 🔒 1 | GPT-4 | unknown | OpenAI | 92.0 | 42.5 |
| 🔒 2 | Claude-2 | unknown | Anthropic | 88.0 | - |
| 🔒 3 | PaLM-2-Flan | unknown | Google | 84.7 | 33.2 |
| 🌍 4 | GAIRMath-Abel | 70B | 🎓 GAIR Lab at Shanghai Jiaotong University | 83.6 | 28.3 |
| 🌍 5 | WizardMath | 70B | Microsoft | 81.6 | 22.7 |
| 🔒 6 | Claude-Instant | unknown | Anthropic | 80.9 | - |
| 🔒 7 | ChatGPT | unknown | OpenAI | 80.8 | 34.1 |
| 🔒 8 | ChatGPT-0301 | unknown | OpenAI | 74.9 | - |
| 🌍 9 | GAIRMath-Abel | 13B | 🎓 GAIR Lab at Shanghai Jiaotong University | 66.4 | 17.3 |
| 🌍 10 | GAIRMath-Abel | 7B | 🎓 GAIR Lab at Shanghai Jiaotong University | 59.7 | 13.0 |
| 🔒 11 | Minerva | 540B | Google | 58.8 | 33.6 |
| 🔒 12 | PaLM | 540B | Google | 56.9 | 8.8 |
| 🌍 13 | Llama-2 | 70B | Meta | 56.8 | 13.5 |
| 🌍 14 | RFT | 33B | OFA | 56.5 | 7.4 |
| 🌍 15 | Baichuan2-13B | 13B | Baichuan | 52.8 | 10.1 |
| 🔒 16 | Minerva | 62B | Google | 52.4 | 27.6 |
| 🔒 17 | PaLM | 64B | Google | 52.4 | 4.4 |
| 🌍 18 | RFT | 13B | OFA | 52.1 | 5.1 |
| 🌍 19 | LLaMA | 65B | Meta | 50.9 | 10.6 |
| 🌍 20 | Qwen | 7B | Alibaba | 44.9 | 8.5 |
| 🔒 21 | Chinchilla | 70B | DeepMind | 43.7 | - |
| 🌍 22 | Llama-2 | 34B | Meta | 42.2 | 6.24 |
| 🌍 23 | Galactica | 30B | Meta | 41.7 | 12.7 |
| 🌍 24 | ChatGLM2 | 12B | Zhipu | 40.9 | - |
| 🔒 25 | Text-davinci-002 | 175B | OpenAI | 40.7 | 19.1 |
| 🌍 26 | LLaMA | 33B | Meta | 35.6 | 7.1 |
| 🔒 27 | GPT-3 | 175B | OpenAI | 34 | 5.2 |
| 🌍 28 | InternLM | 7B | Shanghai AI Lab | 31.2 | - |
| 🌍 29 | Llama-2 | 13B | Meta | 28.7 | 3.9 |
| 🌍 30 | Vicuna v1.3 | 13B | LMSys | 27.6 | - |
| 🌍 31 | Falcon | 40B | Technology Innovation Institute | 19.6 | 2.5 |
| 🌍 32 | LLaMA | 13B | Meta | 17.8 | 3.9 |
| 🌍 33 | MPT | 30B | MosaicML | 15.2 | 3.1 |
| 🌍 34 | Galactica | 6.7B | Meta | 10.2 | 2.2 |
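
For reference, tool-free accuracy on GSM8K-style benchmarks is conventionally computed by extracting the final number from each generated solution and comparing it with the reference answer. The sketch below illustrates that convention; it is a generic recipe, not the exact script behind this leaderboard.

```python
# Generic sketch of GSM8K-style scoring: pull the last number out of each
# generated solution and compare it to the gold answer. This mirrors the common
# tool-free evaluation convention; it is not the leaderboard's exact script.
import re

NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def extract_final_number(text: str) -> str | None:
    """Return the last number mentioned in a solution, with commas stripped."""
    matches = NUMBER.findall(text.replace(",", ""))
    return matches[-1] if matches else None


def accuracy(predictions: list[str], references: list[str]) -> float:
    correct = 0
    for pred, ref in zip(predictions, references):
        p, r = extract_final_number(pred), extract_final_number(ref)
        if p is not None and r is not None and float(p) == float(r):
            correct += 1
    return correct / len(predictions)


preds = ["In May she sold 24 clips, so in total 48 + 24 = 72."]
refs = ["72"]
print(f"accuracy = {accuracy(preds, refs):.2%}")  # -> accuracy = 100.00%
```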

Demo

[Figure: Llama2 vs. Abel on a GSM8K question]

[Figure: Llama2 vs. Abel on a MATH question]

BibTeX

@misc{abel,
  author = {Chern, Ethan and Zou, Haoyang and Li, Xuefeng and Hu, Jiewen and Feng, Kehua and Li, Junlong and Liu, Pengfei},
  title = {Generative AI for Math: Abel},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/GAIR-NLP/abel}},
}