Research Vision

In-context vision and language models like Flamingo support arbitrarily interleaved sequences of images and text as input. This format not only enables few-shot learning via interleaving independent supervised (image, text) examples, but also more complex prompts involving interaction between images. Multimodal C4 is an augmentation of the popular text-only C4 corpus with images interleaved.
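As a rough illustration of what an interleaved few-shot prompt looks like, the sketch below alternates independent (image, text) examples and ends with a query image for the model to continue from. This is not the interface of Flamingo or of the Multimodal C4 release; the names `Image`, `Segment`, and `build_few_shot_prompt`, and the file names, are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Image:
    """Placeholder for an image; `ref` is a hypothetical path or URL."""
    ref: str


# An interleaved sequence is an ordered list of text strings and images.
Segment = Union[str, Image]


def build_few_shot_prompt(
    examples: List[Tuple[Image, str]], query: Image
) -> List[Segment]:
    """Interleave independent (image, text) examples, then append the query image."""
    sequence: List[Segment] = []
    for image, text in examples:
        sequence.append(image)
        sequence.append(text)
    # The trailing image has no text; the model is expected to supply it.
    sequence.append(query)
    return sequence


prompt = build_few_shot_prompt(
    [
        (Image("cat.jpg"), "A cat on a sofa."),
        (Image("dog.jpg"), "A dog in a park."),
    ],
    Image("bird.jpg"),
)
```

A prompt involving interaction between images would use the same list representation, with several images followed by a single piece of text that refers to all of them.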

Paper

Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved With Text

Wanrong Zhu, Jack Hessel, Anas Awadalla, S. Gadre, Jesse Dodge, Alex Fang, and 4 more... arXiv 2023