Anole: An Open, Autoregressive, and Native Multimodal Model for Interleaved Image-Text Generation

Ethan Chern*1, 4, Jiadi Su*3,4, Yan Ma*3,4, Pengfei Liu1, 2, 4+
1Shanghai Jiao Tong University, 2Shanghai Artificial Intelligence Laboratory, 3Fudan University, 4Generative AI Research Lab (GAIR)
*Co-first authors +Corresponding authors

👋 Overview

Anole is the first open-source, autoregressive, and natively trained large multimodal model capable of interleaved image-text generation (without using Stable Diffusion). Building upon the strengths of Chameleon, Anole excels at the complex task of generating coherent sequences of alternating text and images. Through an innovative fine-tuning process using a carefully curated dataset of approximately 6,000 images, Anole achieves strong image generation and understanding capabilities with minimal additional training. This efficient approach, combined with its open-source nature, positions Anole as a catalyst for accelerated research and development in multimodal AI. Preliminary tests show that Anole follows nuanced instructions well, producing high-quality images and interleaved text-image content that closely aligns with user prompts.

The major functionalities of Anole are listed below:
  • **Text-to-Image Generation**
  • **Interleaved Text-Image Generation**
  • Text Generation
  • Fine-grained Evaluation
where **bold** indicates capabilities newly added on top of Chameleon.

📊 Examples

🔍 Methodology

Based on publicly available information and our own testing, the latest release of Chameleon has demonstrated strong performance in text understanding, text generation, and multimodal understanding. Anole, built on top of Chameleon, aims to unlock Chameleon's image generation and multimodal generation capabilities.

Chameleon’s pre-training data natively includes both text and image modalities, theoretically equipping it with image generation capabilities. Our goal is to unlock this ability without compromising its text understanding, text generation, and multimodal comprehension. To achieve this, we froze most of Chameleon’s parameters and fine-tuned only the logits corresponding to image token ids in the transformer’s output head layer.
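To make this concrete, below is a minimal PyTorch sketch of the idea, not our actual training code: freeze the backbone and mask the output head's gradient so that only the rows producing image-token logits receive updates. The vocabulary size, hidden size, and `IMG_TOKEN_IDS` range here are illustrative placeholders, not Chameleon's real values.

```python
import torch
import torch.nn as nn

# Illustrative stand-ins for a Chameleon-style decoder: a frozen backbone plus
# an output head projecting hidden states onto a unified text+image vocabulary.
VOCAB_SIZE, HIDDEN = 65536, 4096           # placeholder sizes
IMG_TOKEN_IDS = torch.arange(8192, 16384)  # placeholder image-token id range

backbone = nn.TransformerEncoderLayer(d_model=HIDDEN, nhead=32)  # stand-in backbone
lm_head = nn.Linear(HIDDEN, VOCAB_SIZE, bias=False)

# 1) Freeze all backbone parameters.
for p in backbone.parameters():
    p.requires_grad = False

# 2) Keep the head trainable, but zero the gradient for every row that does
#    not correspond to an image token, so only image-token logits get updated.
row_mask = torch.zeros(VOCAB_SIZE, 1)
row_mask[IMG_TOKEN_IDS] = 1.0
lm_head.weight.register_hook(lambda grad: grad * row_mask.to(grad.device))

# The optimizer then only needs the head's weight.
optimizer = torch.optim.AdamW([lm_head.weight], lr=1e-4)
```

With this gradient mask, each optimizer step can touch at most `len(IMG_TOKEN_IDS) × HIDDEN` weights, which is what keeps the number of effectively trained parameters small.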

Specifically, Anole-7b-v0.1 was developed using a small amount of image data (5,859 images, approximately 6 million image tokens) and was fine-tuned on only a small number of parameters (fewer than 40M) in a short time (around 30 minutes on 8 A100 GPUs). Despite this, Anole-7b-v0.1 demonstrates impressive image generation capabilities.
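As a sanity check, these figures are mutually consistent under Chameleon's published tokenization scheme (each image becomes 1,024 discrete tokens drawn from an 8,192-entry codebook): 5,859 images × 1,024 tokens ≈ 6.0M image tokens, and, assuming the 7B model's 4,096-dimensional hidden state, the image-token rows of the output head amount to 8,192 × 4,096 ≈ 33.6M trainable parameters, below the 40M figure above.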

📬 Contact

If you have any questions regarding this project, feel free to submit a GitHub issue.

BibTeX

@article{chern2024anole,
  title={ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation},
  author={Chern, Ethan and Su, Jiadi and Ma, Yan and Liu, Pengfei},
  journal={arXiv preprint arXiv:2407.06135},
  year={2024}
}