Highlights

  • Molmo is a family of open state-of-the-art multimodal AI models. Our most powerful model closes the gap between open and proprietary systems across a wide range of academic benchmarks as well as human evaluations. Our smaller models outperform models 10x their size. (View Highlight)
  • While current multimodal models interpret multimodal data and express it in natural language, their full potential remains untapped. Molmo goes beyond. By learning to point at what it perceives, Molmo enables rich interactions with physical and virtual worlds, empowering the next generation of applications capable of acting and interacting with their environments. (View Highlight)
  • Today’s most advanced multimodal models remain proprietary. Research efforts aimed at building vision-language models (VLMs) using open data lag significantly behind this state of the art. Recent, stronger open-weight models rely heavily on synthetic data from proprietary VLMs to achieve good performance, effectively distilling these closed models into open ones. As a result, the community is still missing foundational knowledge about how to build performant VLMs from scratch. (View Highlight)
  • We present Molmo, a new family of state-of-the-art VLMs. Starting from a pre-trained vision encoder (CLIP) and language-only LLMs, the entire remainder of our VLM pipeline – weights, code, data, and evaluations – is open and free from VLM distillation. Our key innovation is a novel, highly detailed image caption dataset collected entirely from human annotators using speech-based descriptions. To enable a wide array of capabilities, we also introduce a diverse dataset mixture for fine-tuning. This includes innovative 2D pointing data that enables Molmo to answer questions not just with natural language but also with nonverbal cues. We believe this opens up important future directions for VLMs, enabling agents to interact in virtual and physical worlds. The success of our approach relies on careful choices for the model architecture details, a well-tuned training pipeline, and, most critically, the quality of our newly collected datasets, all of which will be released. (View Highlight)
  • The best-in-class model within the Molmo family not only outperforms other open-weight, open-data models, but also compares favorably against proprietary systems like GPT-4o, Claude 3.5, and Gemini 1.5. We will be releasing all of our model weights, captioning and fine-tuning data, and source code in the near future. Select model weights, inference code, and a public demo (using the Molmo-7B-D model) are available starting today. (View Highlight)
  • VLM Openness Comparison. We characterize the openness of VLMs based on two attributes (open weights; open data and code) across three model components (the VLM and its two pre-trained components, the LLM backbone and the vision encoder). In addition to open vs. closed, we use the “distilled” label to indicate that the data used to train the VLM includes images and text generated by a different, proprietary VLM, meaning that the model cannot be reproduced without a dependency on the proprietary VLM. (View Highlight)
  • Large VLMs are conventionally trained on billions of image-text pairs sourced from the web. Such massive corpora tend to be extremely noisy, requiring models to separate signal from noise during training. Noisy text also leads to hallucinations in a model’s output. We take a vastly different approach to sourcing data, with an intense focus on data quality, and are able to train powerful models with fewer than 1M image-text pairs, roughly three orders of magnitude less data than many competitive approaches. (View Highlight)
  • The most critical ingredient to the success of the Molmo family of models is PixMo, Molmo’s training data. PixMo includes two broad categories of data: (1) dense captioning data for multimodal pre-training and (2) supervised fine-tuning data for enabling a wide array of user interactions, including behaviors like question answering, document reading, and pointing. Our primary constraint in collecting this data is to avoid making use of existing VLMs, since we want to build a performant VLM from the ground up rather than by distilling an existing system (note that we do make use of language-only LLMs, but we never pass images to these models). (View Highlight)
  • In practice, it is challenging to collect dense captioning datasets from human annotators. If asked to write an image description, the result often only mentions a few salient visual elements and lacks detail. If a minimum word count is enforced, annotators will either take too long to type, making collection uneconomical, or copy-and-paste responses from proprietary VLMs, circumventing our goal to avoid distillation. As a result, the open research community has struggled to create such datasets without relying on synthetic data from proprietary VLMs. Our key innovation is a simple but effective data collection methodology that avoids these problems: we ask annotators to describe images in speech for 60 to 90 seconds rather than asking them to write descriptions. We prompt the annotators to describe everything they see in great detail and to include descriptions of spatial positioning and relationships. Empirically, we found that with this modality-switching “trick” annotators provide far more detailed descriptions in less time, and for each description we collect an audio receipt (i.e., the annotator’s recording) proving that a VLM was not used. In total, we collected detailed audio descriptions for 712k images sampled from 50 high-level topics. (View Highlight)
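A minimal sketch of this collection flow, assuming an off-the-shelf speech-to-text model (openai-whisper below) and a generic language-only LLM callable for the cleanup step; the post does not name the transcription or refinement tools actually used, so treat both as placeholders:

```python
import whisper  # openai-whisper; any ASR system would work here


def transcribe_description(audio_path: str) -> str:
    """Turn a 60-90 second spoken image description into raw text."""
    model = whisper.load_model("base")
    return model.transcribe(audio_path)["text"]


def refine_with_llm(raw_transcript: str, llm) -> str:
    """Ask a language-only LLM (no image input, so no VLM distillation)
    to fix ASR errors without adding new visual content."""
    prompt = (
        "Clean up this spoken image description. Fix transcription and "
        "grammar errors, keep every visual detail, and add nothing new:\n\n"
        + raw_transcript
    )
    return llm(prompt)


# The audio file itself is kept as a "receipt" proving no VLM wrote the caption.
```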
  • Our fine-tuning data mixture includes standard academic datasets as well as several newly collected datasets which we will also be releasing. While the academic datasets primarily allow the model to work well on benchmark datasets, our newly collected datasets enable a wide range of important functionality: answering general questions about images in chats with users beyond the scope of the academic benchmarks, improved OCR-centric tasks like reading documents and charts, accurate reading of analog clocks, and pointing to one or more visual elements in the image. Pointing provides a natural explanation grounded in image pixels, resulting in new and improved capabilities for Molmo. We believe that in the future pointing will be an important communication channel between VLMs and agents. For example, a robot could query a pointing-enabled VLM for a waypoint or the location of an object to pick up, or a web agent could query the VLM for the location of a user interface element to click. (View Highlight)
  • PixMo-Cap is a dataset for pre-training VLMs to understand images in great detail. It contains 712,000 distinct images with approximately 1.3 million dense image captions. The captions were generated by human annotators who provided detailed 60-90 second spoken descriptions of diverse web images, which were then transcribed and refined using language models. The dataset covers a wide range of topics and includes detailed descriptions of image contents, objects, text, positions, subtle details, background, style, and color. (View Highlight)
  • PixMo-AskModelAnything is a dataset designed to enable AI models to answer diverse questions about images. It includes 162,000 question-answer pairs for 73,000 images, created through a process where human annotators selected images, wrote questions, and iteratively refined answers generated by a language model based on image captions and OCR output. The dataset also incorporates unusual requests, such as answers written upside down, to increase diversity. (View Highlight)
  • PixMo-Points is a dataset in which human annotators were asked to point at objects in images and write descriptions of them. The dataset contains 2.3 million question-point pairs from 428,000 images, including instances where annotators pointed to every occurrence of a described object and cases where the object was not present in the image. This dataset aims to enable models to point to anything described by text, count objects by pointing, and use pointing as a form of visual explanation. (View Highlight)
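A hypothetical data structure for such question-point pairs is sketched below; the field names and normalized-coordinate convention are assumptions for illustration, not the released PixMo-Points schema:

```python
from dataclasses import dataclass, field


@dataclass
class PointAnnotation:
    """One question-point pair (illustrative fields, not the released schema)."""
    image_id: str
    query: str                                      # e.g. "the mugs on the shelf"
    points: list[tuple[float, float]] = field(default_factory=list)
    # one (x, y) per matching occurrence, in normalized image coordinates;
    # an empty list encodes "the described object is not present"


# Counting falls out of pointing: the answer is simply the number of points.
example = PointAnnotation(
    image_id="web_000123",
    query="all the mugs on the shelf",
    points=[(0.21, 0.44), (0.35, 0.46), (0.52, 0.45)],
)
print(f"count = {len(example.points)}")
```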
  • This dataset includes 255,000 text- and figure-heavy images (charts, documents, tables, diagrams) with corresponding code generated by a language model. It also contains 2.3 million question-answer pairs based on the generated code. (View Highlight)
  • Vision-language model evaluation is evolving rapidly with new academic benchmarks constantly appearing. These benchmarks work well for evaluating specific skills, but doing well on them often requires answering questions in a benchmark-specific style. These answers are often short and do not work well in other settings. As a result, academic benchmarks provide only a partial picture of how a model performs. To complement these benchmarks we also perform a human evaluation that allows us to rank models according to user preference. (View Highlight)
  • For academic benchmarking, we attempted to collect results for all models on a set of 11 commonly used academic benchmarks.[1] We prioritized numbers published by the authors themselves when they were available, but many were missing. When results were not available, we attempted to find the best previously reported values from other technical reports or from public leaderboards, such as the OpenVLM Leaderboard. Finally, if a value was still missing, we computed it ourselves. We note that computing results is difficult in practice, and for a fixed model, results on a given benchmark can vary by a large amount (e.g., 10 percentage points) depending on the details of how it was evaluated. Further complicating matters, in many cases critical evaluation details, such as what prompts were used or how the data was processed, may not be available, making it difficult to reproduce published results. These issues underscore the importance of open evaluation. (View Highlight)
  • We also avoid making a strong distinction between claimed “zero-shot” performance (often reported for closed-data models) and the supervised performance of models that explicitly train on benchmark training sets. The distinction between supervised training and zero-shot transfer is fuzzy since one can curate new data sources that serve as effective proxies for any given benchmark’s literal training data. When training data is not disclosed, the community has no means of evaluating zero-shot transfer claims. (View Highlight)
  • For our human evaluation, we collected a diverse set of image and text prompt pairs and queried a set of VLMs for responses. We then presented the resulting image-text-response triplets for all VLM pairings to a set of ~870 human annotators who gave pairwise preference rankings. From these preference rankings, we calculated an Elo ranking using the Bradley-Terry model, following the methodology of LMSYS Org’s Chatbot Arena. We collected 325,231 pairwise comparisons across 27 models, making it the largest human preference evaluation for multimodal models to date. As a reference, our Elo rankings are based on 3x more votes than Chatbot Arena (LMSYS) for vision models. (View Highlight)
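For readers who want the mechanics, the sketch below fits Bradley-Terry strengths from pairwise votes with the classic minorization-maximization updates and maps them to an Elo-like scale. Chatbot Arena's production pipeline fits the same model via logistic regression, the model names and votes here are made up, and the sketch assumes every model wins at least one comparison:

```python
import math
from collections import defaultdict


def bradley_terry_elo(comparisons, iters=200, base=1000.0):
    """Fit Bradley-Terry strengths from (winner, loser) pairs and return
    Elo-like ratings (a 400-point gap is roughly a 10x strength ratio)."""
    wins = defaultdict(float)    # total wins per model
    games = defaultdict(float)   # comparisons per unordered pair
    models = set()
    for winner, loser in comparisons:
        wins[winner] += 1.0
        games[frozenset((winner, loser))] += 1.0
        models.update((winner, loser))

    p = {m: 1.0 for m in models}          # initial strengths
    for _ in range(iters):                # MM updates (Hunter, 2004)
        new_p = {}
        for i in models:
            denom = sum(
                games[frozenset((i, j))] / (p[i] + p[j])
                for j in models
                if j != i and frozenset((i, j)) in games
            )
            new_p[i] = wins[i] / denom if denom > 0 else p[i]
        # renormalize so the geometric mean of strengths stays at 1
        log_mean = sum(math.log(v) for v in new_p.values()) / len(new_p)
        p = {m: v / math.exp(log_mean) for m, v in new_p.items()}

    return {m: base + 400.0 * math.log10(v) for m, v in p.items()}


# Toy usage with invented model names and votes:
votes = [
    ("Model-A", "Model-B"), ("Model-A", "Model-C"),
    ("Model-B", "Model-C"), ("Model-C", "Model-B"),
]
print(bradley_terry_elo(votes))
```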
  • Our most efficient Molmo model MolmoE-1B, based on our fully open OLMoE-1B-7B mixture-of-experts LLM, nearly matches the performance of GPT-4V on both academic benchmarks and human evaluation. (View Highlight)
  • Our two Molmo-7B models perform comfortably between GPT-4V and GPT-4o on both academic benchmarks and human evaluation, and significantly outperform the recently released Pixtral 12B model on both evaluations. (View Highlight)
  • Our best Molmo model also outperforms several state-of-the-art proprietary systems, including Gemini 1.5 Pro and Flash and Claude 3.5 Sonnet. (View Highlight)
  • Our model architecture follows the simple and standard design of combining a language model with an image encoder. It consists of four components: (1) a pre-processor that converts the input image into a set of multiscale, multi-crop images; (2) a ViT image encoder that independently maps each of these images into a set of vision tokens; (3) a connector that projects the vision tokens to the language model’s input dimension with an MLP and then pools the vision tokens to reduce their count; and (4) a decoder-only Transformer LLM. (View Highlight)
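A rough PyTorch sketch of this four-component layout follows; the dimensions, mean pooling, and module choices are placeholders for illustration rather than Molmo's actual configuration:

```python
import torch
import torch.nn as nn


class SimpleVLM(nn.Module):
    """Illustrative four-part VLM: crops -> ViT -> MLP connector + pooling -> LLM."""

    def __init__(self, vit: nn.Module, llm: nn.Module,
                 vit_dim: int = 1024, llm_dim: int = 4096, pool: int = 4):
        super().__init__()
        self.vit = vit            # (2) pre-trained ViT image encoder
        self.llm = llm            # (4) decoder-only Transformer LLM
        self.pool = pool          # how many vision tokens to merge into one
        self.connector = nn.Sequential(     # (3) projects vision tokens to LLM width
            nn.Linear(vit_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, crops: torch.Tensor, text_embeds: torch.Tensor):
        # (1) the pre-processor has already produced multiscale, multi-crop
        # images of shape (batch, num_crops, 3, H, W)
        b = crops.shape[0]
        tokens = self.vit(crops.flatten(0, 1))          # encode each crop independently
        tokens = self.connector(tokens)                 # project to the LLM's input dim
        tokens = tokens.reshape(b, -1, tokens.shape[-1])
        # reduce the token count by averaging groups of adjacent vision tokens
        tokens = tokens.reshape(b, -1, self.pool, tokens.shape[-1]).mean(dim=2)
        # prepend the vision tokens to the text embeddings and run the LLM
        return self.llm(torch.cat([tokens, text_embeds], dim=1))
```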
  • From this template, we construct a family of models parameterized by the choice of vision encoder and LLM. Given these choices, the subsequent training data and recipe are the same for all models (aside from optimizer learning rates). For the vision encoder, all of our released models use OpenAI’s ViT-L/14 336px CLIP model, which provides consistently good results (while this model uses closed data, it can be reproduced from scratch, as shown by MetaCLIP; we use the model from OpenAI because it was trained for higher-resolution images). For the LLM, we have trained models on a variety of choices at different scales and degrees of openness, including: the fully open-weight and open-data OLMo-7B-1024 (using the October 2024 pre-release weights, which will be public at a later date), the efficient fully open-weight and open-data OLMoE-1B-7B-0924, open-weight Qwen2 7B, open-weight Qwen2 72B, open-weight Mistral 7B, open-weight Gemma2 9B, and Phi 3 Medium. Today we are releasing 4 samples from this family. (View Highlight)
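The released variants can be summarized as a simple mapping from model name to LLM backbone, with the vision encoder held fixed. The MolmoE-1B and Molmo-7B-O pairings follow directly from the text; the Molmo-7B-D and Molmo-72B pairings are inferred from the LLM list above, so treat this as a best-effort reading rather than an official configuration:

```python
# Vision encoder is the same for every released variant.
VISION_ENCODER = "OpenAI CLIP ViT-L/14 336px"

# LLM backbone per released Molmo variant (7B-D and 72B pairings inferred).
MOLMO_FAMILY = {
    "MolmoE-1B":  "OLMoE-1B-7B-0924",   # fully open weights and data
    "Molmo-7B-O": "OLMo-7B-1024",       # fully open weights and data
    "Molmo-7B-D": "Qwen2 7B",           # open weights (demo model)
    "Molmo-72B":  "Qwen2 72B",          # open weights (best model)
}
```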
  • Starting from an independently pre-trained vision encoder and LLM, our training process is simple and consists of two stages: (1) multimodal pre-training for caption generation using our newly collected captioning data and (2) supervised fine-tuning using our dataset mixture described above. All model parameters are updated in both stages. We do not use RLHF. (View Highlight)
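A schematic of the two-stage recipe, assuming a model whose forward pass returns a language-modeling loss; the step counts and learning rates referenced in the comments are placeholders, not values from the report:

```python
import torch


def train_stage(model, loader, steps, lr):
    """Run one training stage; every parameter is updated (nothing is frozen)."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    data = iter(loader)
    for _ in range(steps):
        batch = next(data)
        loss = model(**batch)     # assumes the forward pass returns a scalar loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model


# Stage 1: caption pre-training on PixMo-Cap, e.g.
#   model = train_stage(model, pixmo_cap_loader, steps=..., lr=...)
# Stage 2: supervised fine-tuning on the full data mixture, e.g.
#   model = train_stage(model, sft_mixture_loader, steps=..., lr=...)
# There is no RLHF stage.
```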
  • Release Plan: Our first release includes a demo, inference code, a brief technical report on arXiv, and the following model weights:
    • MolmoE-1B, a mixture-of-experts model with 1B active and 7B total parameters
    • Molmo-7B-O, our most open 7B model
    • Molmo-7B-D, our demo model
    • Molmo-72B, our best model
    In the following two months we will build upon this work by releasing:
    • A more detailed version of the arXiv technical report
    • Our PixMo family of datasets
    • Additional model weights and checkpoints
    • Training and evaluation code (View Highlight)