High-quality, large-scale corpora are the cornerstone of building powerful foundation models. In this work, we introduce MathPile, a diverse and high-quality math-centric corpus comprising about 9.5 billion tokens. Our work differs significantly from previous work in the following characteristics:
We hope MathPile can help enhance the mathematical reasoning abilities of language models. See our paper for more technical details.
These invaluable corpora are the culmination of human intellect and should be utilized for the betterment of humanity, aiding in the improvement of human life. We strongly urge all users to refrain from using our corpus for any activities that may harm national or social security or violate the law.
We have done our utmost to ensure the high quality and lawful use of the data. However, unforeseen issues may still arise, including but not limited to data security concerns and any risks or problems stemming from misuse. We shall not be held responsible for any such issues.
If the source data of MathPile is governed by a license more restrictive than the CC BY-NC-SA 4.0 license, MathPile adheres to that stricter license. In all other cases, it is released under the CC BY-NC-SA 4.0 license. We also plan to release a commercially usable version of the dataset soon.
@article{wang2023mathpile,
title={Generative AI for Math: Part I -- MathPile: A Billion-Token-Scale Pretraining Corpus for Math},
author={Wang, Zengzhi and Xia, Rui and Liu, Pengfei},
journal={arXiv preprint arXiv:2312.17120},
year={2023}
}