We propose On-the-fly Preference Optimization (OPO), a real-time alignment method that works in a streaming fashion. OPO employs an external memory to store established rules for alignment. These rules constrain LLMs’ behavior without further training, so the human values being enforced can be updated and customized conveniently.
OPO consists of a rule creation module, an alignment module, and an evaluation module.
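To illustrate how an external rule memory can steer a model without retraining, here is a minimal Python sketch. It is not the authors' implementation: the names `RuleMemory`, `retrieve`, and `build_prompt` are hypothetical, and the keyword-overlap scoring is a simple stand-in assumed for illustration, where a real system would more likely use embedding-based retrieval.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class RuleMemory:
    """Hypothetical external store of alignment rules (plain strings)."""

    rules: list[str] = field(default_factory=list)

    def add(self, rule: str) -> None:
        # Updating the memory is the only step needed to change behavior.
        self.rules.append(rule)

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        # Toy relevance score (an assumption): count of lowercase words
        # shared between the rule and the query.
        query_words = set(query.lower().split())
        scored = [(len(query_words & set(r.lower().split())), r) for r in self.rules]
        # Drop rules with no overlap, then rank by score (stable sort).
        relevant = [(s, r) for s, r in scored if s > 0]
        relevant.sort(key=lambda pair: pair[0], reverse=True)
        return [r for _, r in relevant[:k]]


def build_prompt(memory: RuleMemory, user_query: str, k: int = 2) -> str:
    """Prepend retrieved rules to the query; the model itself is untouched."""
    rules = memory.retrieve(user_query, k)
    rule_block = "\n".join(f"- {r}" for r in rules) or "- (no matching rule)"
    return f"Follow these rules:\n{rule_block}\n\nQuestion: {user_query}"


if __name__ == "__main__":
    memory = RuleMemory()
    memory.add("Do not share private medical records without consent.")
    memory.add("Drivers must stop at red lights.")
    print(build_prompt(memory, "Can I share medical records of my patient?"))
```

In this sketch, swapping or adding rules only changes the contents of `memory`, which mirrors the idea that the rule set can be customized without further training.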
@article{xu2023align,
title={Align on the Fly: Adapting Chatbot Behavior to Established Norms},
author={Xu, Chunpu and Chern, Steffi and Chern, Ethan and Zhang, Ge and Wang, Zekun and Liu, Ruibo and Li, Jing and Fu, Jie and Liu, Pengfei},
journal={arXiv preprint arXiv:2312.15907},
year={2023}
}