Transformers, based on the attention mechanism, are the dominant architecture in all of today's strongest large language models. Yet the attention mechanism is fundamentally limited in processing long sequences, as its compute and memory costs grow with sequence length. Various alternative architectures, in particular State Space Language Models (SSLMs), have tried to address this sequence-scaling limitation, but have fallen short of SoTA transformers in performance.
With Falcon Mamba, we demonstrate that the sequence scaling limitation can indeed be overcome without loss in performance. Falcon Mamba is based on the original Mamba architecture, proposed in Mamba: Linear-Time Sequence Modeling with Selective State Spaces, with the addition of extra RMS normalization layers to ensure stable training at scale. This choice of architecture ensures that Falcon Mamba:
• can process sequences of arbitrary length without any increase in memory storage, in particular, fitting on a single A10 24GB GPU.
• takes a constant amount of time to generate a new token, regardless of the size of the context.
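For reference, the model can be loaded and prompted through the standard transformers API. The snippet below is a minimal sketch: the Hub id tiiuae/falcon-mamba-7b, the bfloat16 choice, and the prompt are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hub id for the released checkpoint (illustrative)
model_id = "tiiuae/falcon-mamba-7b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # reduce the memory footprint on a single GPU
    device_map="auto",
)

inputs = tokenizer("The Mamba architecture is", return_tensors="pt").to(model.device)

# During decoding only the fixed-size recurrent state is carried over,
# so memory does not grow with the number of generated tokens.
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```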
Falcon Mamba was trained on ~5500 GT (gigatokens) of data, mainly composed of RefinedWeb data with the addition of high-quality technical and code data from public sources. We used a constant learning rate for most of the training, followed by a relatively short learning-rate decay stage. In this last stage, we also added a small portion of high-quality curated data to further enhance model performance.
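Such a schedule is straightforward to express. The sketch below shows one way to implement a constant stage followed by a short linear decay with PyTorch's LambdaLR; the step counts, base learning rate, and final ratio are illustrative placeholders, not the Falcon Mamba hyperparameters.

```python
import torch

def constant_then_decay(step, total_steps, decay_steps, final_ratio=0.1):
    """Constant LR for most of training, then a short linear decay.

    All step counts and the final_ratio are illustrative placeholders,
    not the actual Falcon Mamba hyperparameters.
    """
    decay_start = total_steps - decay_steps
    if step < decay_start:
        return 1.0  # constant stage: keep the base learning rate
    # decay stage: linearly interpolate from 1.0 down to final_ratio
    progress = (step - decay_start) / max(1, decay_steps)
    return 1.0 - (1.0 - final_ratio) * min(1.0, progress)

model = torch.nn.Linear(8, 8)  # stand-in for the actual model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda s: constant_then_decay(s, total_steps=10_000, decay_steps=1_000)
)
```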
Following the theoretically expected efficiency of SSM models in processing long sequences, we compare the memory usage and generation throughput of Falcon Mamba and popular transformer models using the optimum-benchmark library. For a fair comparison, we rescaled the vocabulary size of all transformer models to match Falcon Mamba, since vocabulary size has a large impact on a model's memory requirements.
Before going to the results, let’s first discuss the difference between the prompt (prefill) and generated (decode) parts of the sequence. As we will see, the details of prefill are more important for state space models than for transformer models. When a transformer generates the next token, it needs to attend to the keys and values of all previous tokens in the context. This implies linear scaling of both memory requirements and generation time with context length. A state space model attends to and stores only its recurrent state and, therefore, doesn’t need additional memory or time to generate large sequences. While this explains the claimed advantage of SSMs over transformers in the decode stage, the prefill stage requires additional effort to fully utilize the SSM architecture.
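A back-of-the-envelope comparison makes the decode-time scaling concrete: a transformer's KV cache grows linearly with context length, while an SSM carries only a fixed-size state per layer. The layer counts and dimensions below are illustrative placeholders, not the exact Falcon Mamba or baseline configurations.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Transformer decode memory: keys + values for every past token in every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * context_len * bytes_per_elem

def ssm_state_bytes(num_layers, hidden_dim, state_dim, bytes_per_elem=2):
    """SSM decode memory: one fixed-size recurrent state per layer, independent of context."""
    return num_layers * hidden_dim * state_dim * bytes_per_elem

# Illustrative 7B-scale numbers (placeholders, not the real model configs)
for context_len in (1_000, 10_000, 100_000):
    kv = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, context_len=context_len)
    ssm = ssm_state_bytes(num_layers=64, hidden_dim=8192, state_dim=16)
    print(f"{context_len:>7} tokens: KV cache {kv / 1e9:.2f} GB vs SSM state {ssm / 1e9:.3f} GB")
```

The KV cache estimate grows by orders of magnitude as the context lengthens, while the recurrent state stays the same size.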
A standard approach to prefill is to process the whole prompt in parallel to fully utilize the GPU. This approach is used in the optimum-benchmark library, and we will refer to it as parallel prefill. Parallel prefill needs to store in memory the hidden states of each token in the prompt. For transformers, this additional memory is dominated by the stored KV caches. For SSM models, no caching is required, and the memory for storing hidden states becomes the only component proportional to the prompt length. As a result, the memory requirement scales with prompt length, and SSM models lose the ability to process arbitrarily long sequences, similar to transformers.
An alternative to parallel prefill is to process the prompt token by token, which we will refer to as sequential prefill. Akin to sequence parallelism, it can also be done on larger chunks of the prompt instead of individual tokens for better GPU usage. While sequential prefill makes little sense for transformers, it brings back the possibility of processing arbitrarily long prompts with SSM models.
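Conceptually, sequential (chunked) prefill only needs to carry a fixed-size recurrent state between chunks. The toy sketch below illustrates the idea with a simple linear recurrence standing in for the real Mamba update; the function names and chunk size are hypothetical, and this is not the kernel used in practice.

```python
import torch

def chunked_prefill(prompt_embeddings, state, step_fn, chunk_size=512):
    """Process a long prompt chunk by chunk, carrying only the recurrent state.

    `step_fn(chunk, state) -> state` is a stand-in for one forward pass of a
    recurrent / SSM block; only the fixed-size `state` survives between chunks,
    so peak memory is set by `chunk_size`, not by the full prompt length.
    """
    for start in range(0, prompt_embeddings.shape[0], chunk_size):
        chunk = prompt_embeddings[start:start + chunk_size]
        state = step_fn(chunk, state)
    return state

# Toy linear recurrence as a stand-in for the real SSM update (illustrative only)
def toy_ssm_step(chunk, state, decay=0.9):
    for token in chunk:          # sequential update inside the chunk
        state = decay * state + token
    return state

prompt = torch.randn(20_000, 64)            # a long "prompt" of 64-dim embeddings
final_state = chunked_prefill(prompt, torch.zeros(64), toy_ssm_step)
print(final_state.shape)                    # state size is constant regardless of prompt length
```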
With these remarks in mind, we first test the largest sequence length that can fit on a single 24 GB A10 GPU, with the results shown in the figure below. The batch size is fixed at 1, and we use float32 precision. Even with parallel prefill, Falcon Mamba can fit larger sequences than a transformer, while with sequential prefill it unlocks its full potential and can process arbitrarily long prompts.
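In a simplified form, this maximum-length test can be approximated by doubling the prompt length until parallel prefill runs out of GPU memory. The probe below is only a rough stand-in for what the optimum-benchmark setup automates; the starting length and the doubling strategy are assumptions.

```python
import torch

@torch.no_grad()
def max_prefill_length(model, vocab_size, start=1_024, device="cuda"):
    """Double the prompt length until parallel prefill runs out of GPU memory.

    A simplified stand-in for the benchmark setup: batch size 1, random token ids,
    a single forward pass over the whole prompt.
    """
    length = start
    while True:
        try:
            input_ids = torch.randint(0, vocab_size, (1, length), device=device)
            model(input_ids)                 # parallel prefill over the full prompt
            torch.cuda.empty_cache()
            length *= 2
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()
            return length // 2               # last length that still fit
```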
Next, we measure generation throughput and memory usage as a function of the number of generated tokens, using batch size 1 and an H100 GPU. The results are reported in the figure below. We observe that Falcon Mamba generates all tokens at constant throughput and without any increase in CUDA peak memory. For the transformer model, peak memory grows and generation slows down as the number of generated tokens increases.
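The decode-side measurement can be approximated with a short script that times generate calls of increasing length while tracking CUDA peak memory. Again, this is a simplified illustration rather than the optimum-benchmark code; the output lengths and the generation settings are assumptions.

```python
import time
import torch

@torch.no_grad()
def generation_profile(model, tokenizer, prompt, lengths=(1_024, 4_096, 16_384)):
    """Measure end-to-end generation time and CUDA peak memory for growing outputs."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    for max_new_tokens in lengths:
        torch.cuda.reset_peak_memory_stats()
        start = time.perf_counter()
        # Force exactly max_new_tokens greedy tokens so runs are comparable
        model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            min_new_tokens=max_new_tokens,
            do_sample=False,
        )
        elapsed = time.perf_counter() - start
        peak_gb = torch.cuda.max_memory_allocated() / 1e9
        print(f"{max_new_tokens:>6} new tokens | "
              f"{max_new_tokens / elapsed:6.1f} tok/s | peak {peak_gb:.2f} GB")
```

For an SSM such as Falcon Mamba, the reported throughput and peak memory should stay roughly flat across the three lengths, whereas a transformer's peak memory grows with the number of generated tokens.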