Looking at a wide mix of tasks, an image encoder pre-trained on image/alt-text pairs via captioning (Cap/CapPa) almost matches a contrastive one (CLIP) on classification tasks, and largely outperforms it on image-text tasks.
The method is almost as straightforward as it gets: Cap (middle) is an encoder-decoder model with a ViT encoder and an auto-regressive text decoder.
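To make that setup concrete, here is a minimal PyTorch sketch of such an encoder-decoder captioner. Everything in it is illustrative rather than the paper's actual configuration: `TinyViT`, the toy dimensions, the vocabulary size, and the `causal` flag (kept so the same module can also be driven with the parallel prediction discussed in the next highlight) are assumptions on my part.

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Minimal ViT-style image encoder: patchify, add positions, run a Transformer encoder."""
    def __init__(self, img_size=224, patch=16, d_model=512, n_layers=6, n_heads=8):
        super().__init__()
        self.patchify = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, (img_size // patch) ** 2, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)

    def forward(self, images):                                  # images: (B, 3, H, W)
        x = self.patchify(images).flatten(2).transpose(1, 2)    # (B, N, d_model) patch tokens
        return self.blocks(x + self.pos)

class Cap(nn.Module):
    """Encoder-decoder captioner: the text decoder cross-attends to the image tokens."""
    def __init__(self, vocab_size=32_000, d_model=512, n_layers=6, n_heads=8, max_len=64):
        super().__init__()
        self.encoder = TinyViT(d_model=d_model)
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, images, caption_ids, causal=True):
        memory = self.encoder(images)                           # image tokens for cross-attention
        B, T = caption_ids.shape
        pos = torch.arange(T, device=caption_ids.device)
        x = self.tok_emb(caption_ids) + self.pos_emb(pos)
        mask = None
        if causal:                                              # auto-regressive next-token setup
            mask = torch.triu(torch.full((T, T), float("-inf"),
                                         device=caption_ids.device), diagonal=1)
        h = self.decoder(x, memory, tgt_mask=mask)
        return self.lm_head(h)                                  # (B, T, vocab_size) logits
```

Training is plain next-token prediction: feed `caption_ids[:, :-1]` to the decoder and compute cross-entropy against `caption_ids[:, 1:]`; the image encoder is the part that gets kept after pre-training.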
Because predicting the rest of the caption after the first few tokens may be too easy, leading to little signal from later tokens to the image…
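The paper's answer to this (the "Pa" in CapPa) is parallel prediction: for a large fraction of training examples the decoder input is replaced entirely with mask tokens and the causal mask is dropped, so every caption token has to be predicted from the image alone. Below is a rough sketch of such a mixed training step, reusing the `Cap` module from the previous snippet; `MASK_ID`, the 0.75 mixing ratio, and the per-batch coin flip are assumptions made to keep the example short, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

MASK_ID = 1           # assumed id of a dedicated [MASK] token in the tokenizer vocabulary
PARALLEL_FRAC = 0.75  # illustrative fraction of steps trained with parallel prediction

def captioning_loss(model, images, caption_ids):
    """One training step that mixes auto-regressive and parallel prediction."""
    if torch.rand(()) < PARALLEL_FRAC:
        # Parallel prediction: the decoder sees only mask tokens and no causal mask,
        # so the caption can only be recovered from the image features.
        decoder_in = torch.full_like(caption_ids, MASK_ID)
        logits = model(images, decoder_in, causal=False)
        targets = caption_ids
    else:
        # Standard auto-regressive captioning: predict token t from the image and tokens < t.
        logits = model(images, caption_ids[:, :-1], causal=True)
        targets = caption_ids[:, 1:]
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```

The paper mixes the two objectives across training examples rather than per batch, but the effect is the same: later caption tokens can no longer free-ride on the preceding text tokens, so the learning signal has to flow through the image encoder.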
If you’re generally interested in pre-training models with noisy image-text data, I highly recommend you read it: