What would be an effective annotation strategy for human-AI collaboration?

We have developed a human-AI pipeline that generates high-quality, long-form reasoning data based on the MATH dataset, following our “journey learning” paradigm. This pipeline expands human-annotated solutions from a few lines to thousands of tokens, relying on two key techniques to keep annotation efficient.

  1. Complete Thought Process: It is vital to document trials, reflections, associations, and corrections during reasoning. Even cognitive transitions the annotator does not consciously register should be captured, as they are crucial for training large language models.

  2. Explicit Common-Sense Explanations: To prevent hallucinations, human annotations should spell out common-sense knowledge explicitly, even when it seems obvious, so the model does not misinterpret silently omitted information. A minimal sketch of the resulting annotation record follows this list.
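
To make these two principles concrete, the sketch below shows what a single annotation record might look like in Python. The schema, the `StepType` categories, and the worked example are our own illustration under assumed conventions, not the pipeline's actual data format.

```python
from dataclasses import dataclass, field
from enum import Enum


class StepType(Enum):
    """Cognitive moves the annotator records, not only the polished steps."""
    TRIAL = "trial"              # an attempted approach, possibly a dead end
    REFLECTION = "reflection"    # pausing to evaluate progress so far
    ASSOCIATION = "association"  # recalling a related fact or problem
    CORRECTION = "correction"    # revising an earlier step after spotting an error


@dataclass
class ReasoningStep:
    step_type: StepType
    text: str
    # Common-sense knowledge spelled out explicitly, even when it seems
    # obvious, so the model never has to guess at omitted background.
    common_sense_note: str = ""


@dataclass
class AnnotationRecord:
    problem: str                                  # MATH problem statement
    steps: list[ReasoningStep] = field(default_factory=list)
    final_answer: str = ""


# Hypothetical example of a record that keeps trials and corrections visible:
record = AnnotationRecord(
    problem="Find the sum of the roots of x^2 - 5x + 6 = 0.",
    steps=[
        ReasoningStep(StepType.TRIAL, "Try factoring: (x - 2)(x - 3) = 0."),
        ReasoningStep(
            StepType.ASSOCIATION,
            "Vieta's formulas give the sum directly.",
            common_sense_note="For ax^2 + bx + c = 0, the roots sum to -b/a.",
        ),
        ReasoningStep(StepType.CORRECTION, "Factoring was unnecessary; -(-5)/1 = 5."),
    ],
    final_answer="5",
)
```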

Once the human annotations are complete, AI-driven processes take over. We use carefully designed prompts for data augmentation in the following ways:

  • Data Granularity: We break down the problem-solving process into smaller, more digestible steps to enhance understanding.

  • Gradual Reasoning: LLMs are prompted to pause and reflect, simulating how students think and process information.

  • Student-Explorer Perspective: The LLM approaches each problem with curiosity, thinking it through as if encountering it for the first time, which encourages genuine critical engagement with the reasoning process. A prompt sketch illustrating these three techniques follows below.
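
As an illustration, the sketch below folds the three techniques into a single augmentation prompt. The prompt wording, `expand_solution`, and the `call_llm` placeholder are hypothetical stand-ins; the actual prompts used in the pipeline are more elaborate.

```python
# Illustrative augmentation prompt; the pipeline's real prompts are not shown here.
AUGMENT_PROMPT = """You are a curious student seeing this problem for the first time.

Problem: {problem}
Reference solution: {solution}

Rewrite the solution as your own exploration:
1. Break it into many small, fine-grained steps (data granularity).
2. After each step, pause and reflect on whether it makes sense before
   moving on (gradual reasoning).
3. Think aloud with curiosity, as if discovering each idea for the first
   time, and question any step you are unsure about (student-explorer
   perspective).
"""


def call_llm(prompt: str) -> str:
    """Placeholder for whatever chat-completion client the pipeline uses."""
    raise NotImplementedError("wire up your LLM client here")


def expand_solution(problem: str, solution: str) -> str:
    """Expand a terse annotated solution into long-form reasoning."""
    return call_llm(AUGMENT_PROMPT.format(problem=problem, solution=solution))
```

Applied to an annotation record like the one sketched earlier, this kind of prompt is what turns a few annotated lines into the thousands-of-tokens reasoning traces the pipeline targets.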