Who killed non-contrastive image-text pretraining?

Highlights

  • Looking at a wide mix of tasks, an image encoder pre-trained on image/alt-text pairs via captioning (Cap/CapPa) almost matches a contrastive one (CLIP) on classification tasks, and largely outperforms it on image-text tasks.
  • The method is almost as straightforward as it gets: Cap (middle) is an encoder-decoder model with a ViT encoder and an auto-regressive decoder. Because predicting the rest of the caption after the first few tokens may be too easy, leading to little signal from later tokens to image… (A minimal sketch of this captioning setup follows the highlights.)
  • If you’re generally interested in pre-training models with noisy image-text data, I highly recommend you read it.
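
To make the Cap setup described above concrete, here is a minimal PyTorch sketch of captioning pre-training: a Transformer encoder standing in for the ViT, an auto-regressive decoder that cross-attends to the image tokens, and plain next-token cross-entropy on the alt-text. All names (`CapSketch`, `caption_loss`) and hyperparameters are illustrative assumptions, not the paper's implementation; patchification and positional embeddings are left out for brevity.

```python
# Illustrative sketch of Cap-style captioning pre-training (not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class CapSketch(nn.Module):
    """Encoder-decoder captioner: image tokens in, next caption token out."""

    def __init__(self, vocab_size=32_000, dim=512, depth=6, heads=8):
        super().__init__()
        # Stand-in for the ViT encoder (operates on pre-embedded image patches).
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), depth)
        # Auto-regressive text decoder that cross-attends to the image tokens.
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, heads, batch_first=True), depth)
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, patch_emb, caption_in):
        # patch_emb: (B, num_patches, dim) image patch embeddings
        # caption_in: (B, T) tokenized alt-text, shifted right by one position
        img_tokens = self.encoder(patch_emb)
        T = caption_in.size(1)
        # Causal mask so each position only attends to earlier caption tokens.
        causal = torch.triu(
            torch.full((T, T), float("-inf"), device=caption_in.device), diagonal=1)
        txt = self.decoder(self.tok_emb(caption_in), img_tokens, tgt_mask=causal)
        return self.lm_head(txt)


def caption_loss(model, patch_emb, caption_in, caption_target):
    # Plain next-token cross-entropy on the alt-text is the only training
    # signal the image encoder receives.
    logits = model(patch_emb, caption_in)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), caption_target.reshape(-1))
```

The point of the second highlight is visible here: gradients reach the image encoder only through how well the decoder predicts the caption, so later tokens that are easy to guess from the preceding text alone contribute little signal back to the image.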