What would be an effective annotation strategy for human-AI collaboration?

We have developed a human-AI pipeline that generates high-quality, long-form reasoning data based on the MATH dataset, following our “journey learning” paradigm. This pipeline expands human-annotated solutions from a few lines to thousands of tokens, relying on two key techniques to keep annotation efficient.

  1. Complete Thought Process: It is vital to document trials, reflections, associations, and corrections during reasoning. Even cognitive transitions the annotator does not consciously register should be captured, as they are crucial for training large language models.

  2. Explicit Common-Sense Explanations: To prevent hallucinations, human annotations should spell out common-sense knowledge explicitly, even when it seems obvious, so the model does not misinterpret silently omitted information. A minimal sketch of the resulting annotation record follows this list.
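
To make these two principles concrete, the sketch below shows what a single annotation record might look like in Python. The schema, the `StepType` categories, and the worked example are our own illustration under assumed conventions, not the pipeline's actual data format.

```python
from dataclasses import dataclass, field
from enum import Enum


class StepType(Enum):
    """Cognitive moves the annotator records, not only the polished steps."""
    TRIAL = "trial"              # an attempted approach, possibly a dead end
    REFLECTION = "reflection"    # pausing to evaluate progress so far
    ASSOCIATION = "association"  # recalling a related fact or problem
    CORRECTION = "correction"    # revising an earlier step after spotting an error


@dataclass
class ReasoningStep:
    step_type: StepType
    text: str
    # Common-sense knowledge spelled out explicitly, even when it seems
    # obvious, so the model never has to guess at omitted background.
    common_sense_note: str = ""


@dataclass
class AnnotationRecord:
    problem: str                                  # MATH problem statement
    steps: list[ReasoningStep] = field(default_factory=list)
    final_answer: str = ""


# Hypothetical example of a record that keeps trials and corrections visible:
record = AnnotationRecord(
    problem="Find the sum of the roots of x^2 - 5x + 6 = 0.",
    steps=[
        ReasoningStep(StepType.TRIAL, "Try factoring: (x - 2)(x - 3) = 0."),
        ReasoningStep(
            StepType.ASSOCIATION,
            "Vieta's formulas give the sum directly.",
            common_sense_note="For ax^2 + bx + c = 0, the roots sum to -b/a.",
        ),
        ReasoningStep(StepType.CORRECTION, "Factoring was unnecessary; -(-5)/1 = 5."),
    ],
    final_answer="5",
)
```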

Once the human annotations are complete, AI-driven processes take over. We use carefully designed prompts for data augmentation in the following ways:

  • Data Granularity: We break down the problem-solving process into smaller, more digestible steps to enhance understanding.

  • Gradual Reasoning: LLMs are prompted to pause and reflect, simulating how students think and process information.

  • Student-Explorer Perspective: The LLM approaches each problem with curiosity, thinking it through as if encountering it for the first time, which encourages genuine critical engagement with the reasoning process. A prompt sketch illustrating these three techniques follows below.
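
As an illustration, the sketch below folds the three techniques into a single augmentation prompt. The prompt wording, `expand_solution`, and the `call_llm` placeholder are hypothetical stand-ins; the actual prompts used in the pipeline are more elaborate.

```python
# Illustrative augmentation prompt; the pipeline's real prompts are not shown here.
AUGMENT_PROMPT = """You are a curious student seeing this problem for the first time.

Problem: {problem}
Reference solution: {solution}

Rewrite the solution as your own exploration:
1. Break it into many small, fine-grained steps (data granularity).
2. After each step, pause and reflect on whether it makes sense before
   moving on (gradual reasoning).
3. Think aloud with curiosity, as if discovering each idea for the first
   time, and question any step you are unsure about (student-explorer
   perspective).
"""


def call_llm(prompt: str) -> str:
    """Placeholder for whatever chat-completion client the pipeline uses."""
    raise NotImplementedError("wire up your LLM client here")


def expand_solution(problem: str, solution: str) -> str:
    """Expand a terse annotated solution into long-form reasoning."""
    return call_llm(AUGMENT_PROMPT.format(problem=problem, solution=solution))
```

Applied to an annotation record like the one sketched earlier, this kind of prompt is what turns a few annotated lines into the thousands-of-tokens reasoning traces the pipeline targets.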