Today Google releases Gemma 3, a new iteration of its Gemma family of models. The models range from 1B to 27B parameters, have a context window of up to 128k tokens, can accept images and text, and support 140+ languages.
Gemma 3 is Google’s latest iteration of open-weight LLMs. It comes in four sizes (1 billion, 4 billion, 12 billion, and 27 billion parameters), each with base (pre-trained) and instruction-tuned versions. Gemma 3 goes multimodal! The 4B, 12B, and 27B models can process both images and text, while the 1B variant is text only.
The input context window has been increased from Gemma 2’s 8k tokens to 32k for the 1B variant and 128k for all others. As with other VLMs (vision-language models), Gemma 3 generates text in response to user inputs, which consist of text and, optionally, images. Example uses include question answering, analyzing image content, and summarizing documents.
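As a quick illustration of this text-plus-image usage, here is a minimal sketch using the transformers image-text-to-text pipeline. It assumes the google/gemma-3-4b-it checkpoint ID on the Hugging Face Hub, a recent transformers release with Gemma 3 support, and a hypothetical image URL.

```python
from transformers import pipeline

# Assumes the google/gemma-3-4b-it checkpoint and a transformers release that
# ships Gemma 3 support and the "image-text-to-text" pipeline task.
pipe = pipeline("image-text-to-text", model="google/gemma-3-4b-it", device_map="auto")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/chart.png"},  # hypothetical image URL
            {"type": "text", "text": "What does this chart show? Summarize it in two sentences."},
        ],
    }
]

# The returned chat includes the model's generated reply.
out = pipe(text=messages, max_new_tokens=128)
print(out[0]["generated_text"])
```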
Scaling the context length to 128k tokens is achieved efficiently without training models from scratch. Models are pretrained with 32k sequences, and only the 4B, 12B, and 27B models are scaled up to 128k tokens at the end of pretraining, saving significant compute. The RoPE positional embeddings are adjusted accordingly: the base frequency is upgraded from 10k in Gemma 2 to 1M in Gemma 3, and positions are scaled by a factor of 8 for longer contexts.
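These numbers translate into a small change to how RoPE angles are computed. The sketch below is illustrative only, with a hypothetical head dimension of 128; it shows the 1M base frequency and the linear factor-8 position scaling, not Gemma 3's exact implementation.

```python
import numpy as np

def rope_angles(positions, head_dim=128, base=1_000_000.0, scaling_factor=8.0):
    # Standard RoPE: one inverse frequency per pair of head dimensions.
    inv_freq = 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))
    # Linear position scaling: dividing positions by 8 maps 128k-token
    # positions back into the 32k range seen during pretraining.
    return np.outer(np.asarray(positions) / scaling_factor, inv_freq)

# Gemma 2 used base=10_000 with no scaling; Gemma 3 moves to base=1_000_000
# and scales positions by 8 for the long-context models.
angles = rope_angles(np.arange(131_072))
print(angles.shape)  # (131072, 64): one rotation angle per position and frequency pair
```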
Gemma 3 models use SigLIP as an image encoder, which encodes images into tokens that are ingested by the language model. The vision encoder takes square images resized to 896x896 as input. This fixed input resolution makes it harder to process non-square aspect ratios and high-resolution images. To address these limitations at inference time, images can be adaptively cropped, with each crop resized to 896x896 and encoded by the image encoder. This algorithm, called pan and scan, effectively lets the model zoom in on smaller details in the image.
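A rough sketch of the idea (not Google's exact cropping heuristic) could look like the following: tile a wide or tall image into a grid of crops, resize every crop to 896x896, and feed the resized full image plus the crops through the encoder.

```python
from PIL import Image

def pan_and_scan(image: Image.Image, crop_size: int = 896, max_crops_per_axis: int = 4):
    """Illustrative pan-and-scan sketch: tile the image into crops so that
    non-square or high-resolution inputs are not squashed into one 896x896 view."""
    w, h = image.size
    # Hypothetical heuristic: roughly one crop per 896 pixels along each axis.
    nx = min(max_crops_per_axis, max(1, round(w / crop_size)))
    ny = min(max_crops_per_axis, max(1, round(h / crop_size)))

    views = [image.resize((crop_size, crop_size))]  # keep a global view of the whole image
    for j in range(ny):
        for i in range(nx):
            box = (w * i // nx, h * j // ny, w * (i + 1) // nx, h * (j + 1) // ny)
            views.append(image.crop(box).resize((crop_size, crop_size)))
    return views  # each view is encoded by SigLIP into image tokens
```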
Similar to PaliGemma, attention in Gemma 3 works differently for text and image inputs. Text is handled with one-way (causal) attention, where the model attends only to previous tokens in the sequence. Images, on the other hand, get full attention with no mask, allowing the model to look at every part of the image bidirectionally and giving it a complete, unrestricted understanding of the visual input.
In the figure below, one can see that the image tokens <img> get bidirectional attention (the entire square is lit up), while the text tokens have causal attention. The figure also shows how attention works with the sliding-window algorithm.
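A simplified way to picture this mask is sketched below: causal attention everywhere, plus full bidirectional attention among the image tokens of a single image. The additional sliding-window restriction applied on local layers is not shown.

```python
import numpy as np

def build_mask(is_image):
    """is_image: booleans marking which positions are <img> tokens.
    Returns mask[q, k] = True where query position q may attend to key position k."""
    img = np.asarray(is_image, dtype=bool)
    n = len(img)
    mask = np.tril(np.ones((n, n), dtype=bool))   # causal (one-way) attention for text
    mask |= np.outer(img, img)                    # full bidirectional attention between image tokens
    return mask

# Toy sequence: 2 text tokens, 4 image tokens, 2 text tokens.
print(build_mask([0, 0, 1, 1, 1, 1, 0, 0]).astype(int))
```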
To make an LLM multilingual, the pretraining dataset incorporates more languages. Gemma 3's dataset has double the amount of multilingual data to improve language coverage.
To account for these changes, the tokenizer is the same as that of Gemini 2.0: a SentencePiece tokenizer with 262K entries. The new tokenizer significantly improves the encoding of Chinese, Japanese, and Korean text, at the expense of a slight increase in token counts for English and code.
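A quick way to see the effect is to count tokens per language with the released tokenizer. This sketch assumes the google/gemma-3-4b-it checkpoint ID on the Hugging Face Hub; all Gemma 3 sizes share the same tokenizer.

```python
from transformers import AutoTokenizer

# Assumes the google/gemma-3-4b-it checkpoint ID on the Hugging Face Hub.
tok = AutoTokenizer.from_pretrained("google/gemma-3-4b-it")

samples = {
    "English":  "The quick brown fox jumps over the lazy dog.",
    "Chinese":  "敏捷的棕色狐狸跳过了懒狗。",
    "Japanese": "素早い茶色の狐がのろまな犬を飛び越える。",
    "Korean":   "빠른 갈색 여우가 게으른 개를 뛰어넘는다.",
}
for lang, text in samples.items():
    print(f"{lang}: {len(tok(text)['input_ids'])} tokens")
```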
The LMSys Elo score ranks language models by how well they perform in head-to-head comparisons judged by human preferences. On the LMSys Chatbot Arena, Gemma 3 27B IT reports an Elo score of 1339 and ranks among the top 10 models, alongside leading closed ones. Its Elo is comparable to o1-preview and above that of other non-thinking open models. This score is achieved with Gemma 3 handling text-only inputs, like the other LLMs in the table.
Gemma 3 ships with day-zero support in mlx-vlm, an open source library for running vision language models on Apple Silicon devices, including Macs and iPhones.
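A minimal sketch of running Gemma 3 through mlx-vlm's Python API might look like the following. It assumes a community 4-bit conversion under the mlx-community/gemma-3-4b-it-4bit ID and the load/generate/apply_chat_template helpers; check the mlx-vlm README for the exact, current signatures.

```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# Assumed community 4-bit MLX conversion of the 4B instruction-tuned model.
model_path = "mlx-community/gemma-3-4b-it-4bit"
model, processor = load(model_path)
config = load_config(model_path)

images = ["cat.png"]  # local path or URL
prompt = "Describe this image in one sentence."

# Wrap the prompt in the model's chat template before generating.
formatted = apply_chat_template(processor, config, prompt, num_images=len(images))
print(generate(model, processor, formatted, images, verbose=False))
```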