Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale

1Shanghai Jiao Tong University 2Shanghai Artificial Intelligence Laboratory
3Sea AI Lab 4Generative Artificial Intelligence Research Lab (GAIR)
*Equal Contribution +Corresponding Author

Introduction Video for ProX

Overview

Still relying heavily on rules crafted by human experts to improve the quality of pre-training corpora? Despite their efficiency, such rules lack the flexibility to address the unique characteristics of each individual example and struggle with fine-grained cleaning operations. Meanwhile, having human experts tailor rules to every example is simply impractical.

🚀 We introduce Programming Every Example (ProX), a novel framework that treats data refinement as a programming task, enabling models to refine corpora by generating and executing fine-grained operations, such as string normalization, for each individual example at scale. ProX proceeds through two refining stages with different scopes: document level and chunk level. We use very small language models (e.g., 🐣 0.3B) to generate a program for each example in the training data, and then execute these programs with a Python executor to refine the data.

We demonstrate that ProX can significantly improve the quality of pre-training corpora:

  • A ๐Ÿ˜1.7B language model trained on corpus refined by ProX with 50B tokens training perform similarly to TinyLlama-1.1B-3T, which saves 30 times training compute!
  • ProX works well across various corpora (e.g., C4, RedPajama-v2, FineWeb) and model sizes
  • ProX can refine domain-specific corpora (such as math) without customization, boosting math reasoning performance by 7%~20% for models like Mistral-7B and Llama-7B. Additionally, ProX builds a performance-comparable Llemma-7B model with 20x less training compute
  • Quantitative analysis demonstreate that investment compute on the inference side to refine corpora is totally worthwhile.


Programming Every Example

To keep things lean, the steps are generally as follows:

Step 1: Adapt a base language model (typically less than 1B) to perform data refinement via fine-tuning on synthetic supervised data. (How to adapt? See below 👇)

Step 2: Apply this refinement model to generate a program for each document in the corpus.

Step 3: Execute the generated program for each document, producing a refined corpus ready for pre-training.
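
Below is a rough Python sketch of this loop. It is an illustration under our own assumptions, not the released ProX code: generate_program stands in for the fine-tuned small refining model (served with whatever inference stack you prefer), and execute_program for the Python executor that applies the generated operations (a sketch of that executor appears in the program design section below).

```python
# Illustrative sketch of the ProX refining loop (not the official implementation).
from typing import Callable, Iterable, Optional

def refine_corpus(
    docs: Iterable[str],
    generate_program: Callable[[str], str],                 # Step 2: the small refining LM
    execute_program: Callable[[str, str], Optional[str]],   # Step 3: the Python executor
) -> list[str]:
    """Program every example, then keep whatever survives execution."""
    refined_corpus = []
    for doc in docs:
        program = generate_program(doc)           # LM writes a per-document program
        refined = execute_program(doc, program)   # executor applies it; None means dropped
        if refined:
            refined_corpus.append(refined)
    return refined_corpus

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end without a real model.
    demo_generate = lambda doc: "keep_doc()" if "theorem" in doc.lower() else "drop_doc()"
    demo_execute = lambda doc, prog: doc if "keep_doc" in prog else None
    print(refine_corpus(
        ["Theorem 1. Every finite group of prime order is cyclic.",
         "Buy cheap followers now!!!"],
        demo_generate, demo_execute))
```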


Figure: The ProX framework, which lifts pre-training data quality by generating a program for each example in the training data.

How to adapt an LM to perform data refinement

Figure: The ProX adaptation pipeline.

To keep things lean, the steps are generally as follows:

Step 1: Prompt an advanced language model (e.g., the Llama-3 series) to annotate randomly sampled seed documents, guided by carefully designed prompts (including scoring criteria, the function interface, and its description).

Step 2: Fine-tune a base language model (typically less than 1B) on the resulting synthetic data in a supervised manner.

We can then employ this refinement model to annotate documents from the corpus at scale.
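
As a minimal sketch of the annotation step in Step 1 (using our own simplified prompt and file layout, not the exact ProX prompts, which additionally spell out scoring criteria and the full function interface shown in the table below), each seed document is paired with the program written by the advanced model and stored as one supervised fine-tuning example:

```python
import json

# Hypothetical prompt skeleton; the real ProX prompts also include detailed
# scoring criteria and function descriptions.
ANNOTATION_PROMPT = """You are a data-cleaning expert.
Given the document below, write a short program using only these functions:
drop_doc(), keep_doc(), remove_lines(line_start, line_end),
normalize(source_str, target_str), skip_chunk().

Document:
{document}

Program:"""

def build_sft_record(document: str, annotate) -> dict:
    """`annotate` calls the advanced LM and returns the program text;
    the (prompt, program) pair becomes one SFT example."""
    prompt = ANNOTATION_PROMPT.format(document=document)
    return {"prompt": prompt, "completion": annotate(prompt)}

if __name__ == "__main__":
    # Toy annotator so the sketch runs without an actual LM endpoint.
    toy_annotate = lambda p: "keep_doc()"
    seed_docs = ["Theorem 1. Every finite group of prime order is cyclic."]
    with open("prox_sft_seed.jsonl", "w") as f:   # placeholder output path
        for doc in seed_docs:
            f.write(json.dumps(build_sft_record(doc, toy_annotate)) + "\n")
```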

ProX Program Design

Stage            Function Interface                            Function Description
Document Level   drop_doc()                                    Delete the document from corpus.
                 keep_doc()                                    Keep the document in corpus.
Chunk Level      remove_lines(line_start:int, line_end:int)    Delete noisy lines from the chunk.
                 normalize(source_str:str, target_str:str)     Replace strings with normalized ones.
                 skip_chunk()                                  Keep the chunk as it is.
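
To make this interface concrete, here is a minimal Python sketch of how generated programs could be executed on one example. It is an illustration under our assumptions, not the official ProX executor: ChunkRefiner and execute_doc_program are our own names, and treating line_end as inclusive is a guess.

```python
# Illustrative executor for the ProX program interface (not the official code).

def execute_doc_program(program: str) -> bool:
    """Document-level stage: return True iff the document should be kept."""
    decision = {"keep": True}
    ops = {
        "drop_doc": lambda: decision.update(keep=False),
        "keep_doc": lambda: decision.update(keep=True),
    }
    exec(program, {"__builtins__": {}}, ops)  # program is e.g. "drop_doc()"
    return decision["keep"]


class ChunkRefiner:
    """Chunk-level stage: collect the operations emitted by a generated
    program, then apply them to the chunk's lines."""

    def __init__(self, chunk: str):
        self.lines = chunk.split("\n")
        self.dropped = set()        # line indices scheduled for deletion
        self.replacements = []      # (source_str, target_str) pairs

    # --- the chunk-level function interface from the table above ---
    def remove_lines(self, line_start: int, line_end: int):
        self.dropped.update(range(line_start, line_end + 1))  # assumes inclusive end

    def normalize(self, source_str: str, target_str: str):
        self.replacements.append((source_str, target_str))

    def skip_chunk(self):
        pass  # keep the chunk as it is

    def execute(self, program: str) -> str:
        # Expose only the whitelisted operations to the generated program.
        allowed = {
            "remove_lines": self.remove_lines,
            "normalize": self.normalize,
            "skip_chunk": self.skip_chunk,
        }
        exec(program, {"__builtins__": {}}, allowed)
        kept = [l for i, l in enumerate(self.lines) if i not in self.dropped]
        text = "\n".join(kept)
        for src, tgt in self.replacements:
            text = text.replace(src, tgt)
        return text


if __name__ == "__main__":
    chunk = ("Click here to subscribe!\n"
             "Theorem 1. Every finite group of prime order is cyclic.\n"
             "Proof: immediate, since |G| = p &amp; p is prime.")
    program = ('remove_lines(line_start=0, line_end=0)\n'
               'normalize(source_str="&amp;", target_str="&")')
    print(ChunkRefiner(chunk).execute(program))
```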

Benchmarking Performance

TL;DR: ProX can significantly lift the quality of pre-training corpora (e.g., C4, RedPajama-V2, FineWeb, and OpenWebMath), enabling the creation of performance-equivalent language models with far fewer training FLOPs.

Figure: Training FLOPs vs. average downstream performance. Although these corpora have already been filtered with expert-crafted rules, applying ProX still yields significant improvements over the baseline models trained on the original data. Moreover, with far fewer training FLOPs, models trained on ProX-curated data achieve performance comparable to existing models.

Main Results: Applying ProX to RedPajama-V2

We first conduct a series of experiments to verify the effectiveness of each ProX operation. We begin by training a 750M model on the raw RedPajama-V2 data for approximately 26B tokens (roughly 12.5K steps) as the initial baseline. We then apply different data refinement approaches (i.e., various rules and ProX) to the raw data and pre-train models from scratch for the same number of steps. We evaluate performance across ten selected tasks using lighteval's implementation (Fourrier et al., 2023) and report zero-shot accuracy.


Method ARC-C ARC-E CSQA HellaS MMLU OBQA PIQA SIQA WinoG SciQ AVG # Win
Raw 26.1 44.3 29.7 39.1 27.3 29.2 66.9 39.0 52.0 67.4 42.1 1 / 10
Apply Existing Rules
Gopher Rules 25.7 44.0 31.3 40.2 27.3 29.0 66.3 39.0 51.2 68.9 42.3 0 / 10
C4 Rules 25.0 46.0 31.0 40.5 27.1 29.2 68.5 40.5 51.7 66.6 42.6 2 / 10
FineWeb Rules 25.2 46.8 32.6 39.6 27.2 29.0 66.5 39.4 52.4 69.2 42.8 2 / 10
Gopher + C4 + FineWeb 25.2 43.9 30.0 41.9 27.5 31.0 67.0 39.9 51.9 65.3 42.3 0 / 10
ProX (Ours): D = Document-level programming, C = Chunk-level programming.
ProX-D 26.6 49.7 30.1 40.5 29.4 30.4 66.3 39.0 51.2 71.6 43.5 2 / 10
ProX-D+C 26.4 51.9 30.9 42.4 29.4 31.6 67.9 40.0 52.2 73.5 44.6 5 / 10

Beyond heuristic rules, ProX also outperforms existing data selection methods

We adopt the setting of MATES (Yu et al., 2024) to enable a controlled comparison with existing data selection methods, showcasing the effectiveness of ProX. Specifically, we apply ProX to the C4 dataset and pre-train models with the Pythia architecture. We report both zero-shot and 2-shot performance using the lm-evaluation-harness.
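
For reference, a harness run looks roughly like the sketch below. This is our approximation, not the project's evaluation script: the exact API, task names, and few-shot handling depend on the harness version, and the checkpoint path is a placeholder.

```python
# Approximate lm-evaluation-harness usage (API details vary across versions;
# the checkpoint path below is a placeholder, not a released ProX model).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=path/to/pythia-410m-prox",
    tasks=["arc_easy", "arc_challenge", "hellaswag", "winogrande"],
    num_fewshot=0,  # rerun with num_fewshot=2 for the 2-shot setting
)
print(results["results"])
```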


Method 0-shot 2-shot # Win
Model Architecture: Pythia-410M
Random 42.7 43.8 0 / 8
DSIR (Xie et al., 2023) 42.5 43.7 1 / 8
DsDM (Engstrom et al., 2024) 43.4 44.1 0 / 8
QuRating (Wettig et al., 2024) 43.5 44.6 0 / 8
MATES (Yu et al., 2024) 44.0 45.0 0 / 8
ProX 46.2 47.5 7 / 8
Model Architecture: Pythia-1B
Random 44.7 45.4 0 / 8
MATES (Yu et al., 2024) 45.8 46.4 1 / 8
ProX 46.8 48.4 7 / 8

Applying ProX on Domain-specific Continual Pre-training

We also demonstrate the potential of ProX in the continual pre-training scenario, specifically in the mathematical domain, by applying exactly the same pipeline used for general domains to the already-cleaned OpenWebMath corpus.


All models are tested using few-shot CoT prompts. Llemma and InternLM2-Math are models continually pre-trained from CodeLlama and InternLM2, respectively, on publicly available data. Note that the unique-token and training-token columns refer exclusively to token counts from the math-specific corpora (computed with the corresponding tokenizers). +: The MQA evaluation of InternLM2 uses an alternative prompt due to non-prediction issues with the original prompt. Bold entries denote the best results within the same base model.


Model Size Method Uniq-Toks Train-Toks GSM8K MATH SVAMP ASDiv MAWPS TAB MQA MMLU-STEM SAT-MATH AVG
Existing Continual Pre-training for Reference
DeepSeek-LLM 1.3B - - - 2.9 3.0 - - - - - 19.5 15.6 -
1.3B - 14B 150B 11.5 8.9 - - - - - 29.6 31.3 -
CodeLlama (Base) 7B - - - 11.8 5.0 44.2 50.7 62.6 30.6 14.3 20.4 21.9 29.1
34B - - - 31.8 10.8 61.9 66.0 83.4 51.6 23.7 43.0 53.1 47.3
Llemma 7B - 55B 200B 38.8 17.2 56.1 69.1 82.4 48.7 41.0 45.4 59.4 50.9 (+21.8)
34B - 55B 50B 54.2 23.0 67.9 75.7 90.1 57.9 49.8 54.7 68.8 60.1 (+12.8)
InternLM2 7B - - - 27.0 6.6 49.0 59.3 74.8 40.1 20.9+ 19.0 28.1 36.1
20B - - - 50.6 18.8 72.5 75.9 93.9 45.4 33.1 53.7 59.4 55.9
InternLM2-Math 7B - 31B 125B 41.8 14.4 61.6 66.8 83.7 50.0 57.3 24.8 37.5 48.7 (+12.6)
20B - 120B 500B 65.4 30.0 75.7 79.3 94.0 50.9 38.5 53.1 71.9 62.1 (+6.2)
Applying Data Refinement Approaches
TinyLlama (Base) 1.1B - - - 2.8 3.2 10.9 18.0 20.2 12.5 14.6 16.4 21.9 14.7
TinyLlama (CPT) 1.1B - 15B 15B 6.2 4.8 22.3 36.2 47.6 19.3 11.6 20.7 25.0 21.5 (+8.1)
1.1B Rho 15B 9B 7.1 5.0 23.5 41.2 53.8 - 18.0 - - -
1.1B Rule 6.5B 15B 4.5 2.8 17.5 29.4 39.3 15.1 12.4 19.4 25.0 18.4 (+3.7)
1.1B ProX 5B 15B 9.0 5.6 23.8 41.9 56.9 22.2 15.6 26.8 31.2 25.7 (+11.0)
Llama-2 (Base) 7B - - - 14.1 3.8 39.5 51.6 63.6 30.9 12.5 32.9 34.4 31.5
Llama-2 (CPT) 7B - 15B 10B 29.6 13.6 49.2 61.9 78.4 36.3 31.9 40.5 43.8 42.8 (+11.3)
7B ProX 5B 10B 30.6 16.8 50.2 63.7 79.3 37.3 40.1 43.8 53.1 46.1 (+14.6)
CodeLlama (Base) 7B - - - 11.8 5.0 44.2 50.7 62.6 30.6 14.3 20.4 21.9 29.1
CodeLlama (CPT) 7B - 15B 10B 31.1 14.8 51.4 62.1 81.2 33.6 30.4 40.5 43.8 43.2 (+14.1)
7B ProX 5B 10B 35.6 17.6 55.8 67.9 82.7 41.3 38.9 42.6 62.5 49.4 (+20.3)
Mistral (Base) 7B - - - 40.6 11.4 65.4 68.5 87.0 52.9 32.3 50.0 56.2 51.6
Mistral (CPT) 7B - 15B 10B 44.4 19.2 65.2 69.6 88.4 46.6 43.1 50.8 65.6 54.8 (+3.2)
7B ProX 4.7B 10B 51.0 22.4 64.9 72.9 89.2 49.8 53.0 54.2 75.0 59.2 (+7.6)

Please refer to our paper for more details.


What's More?

ProX might just be a pivotal moment for data engineering in language model pre-training. It is a pioneering effort to optimize pre-training corpora through various advanced operations powered by language models at scale.


But this is only the beginning! Stay tuned for more exciting updates, and follow us on Hugging Face and GitHub for the latest developments.

BibTeX

@article{zhou2024programming,
  title={Programming Every Example: Lifting Pre-training Data Quality like Experts at Scale},
  author={Zhou, Fan and Wang, Zengzhi and Liu, Qian and Li, Junlong and Liu, Pengfei},
  journal={arXiv preprint arXiv:2409.17115},
  year={2024}
}