Self-speculative decoding, proposed in "LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding", is a novel approach to text generation. It combines the strengths of speculative decoding with early exiting from a large language model (LLM). This method allows for efficient generation by using the same model's early layers to draft tokens and its later layers to verify them.
This technique not only speeds up text generation, but also achieves significant memory savings and reduces computational latency. To obtain an end-to-end speedup, the output of the earlier layers needs to be close enough to that of the last layer. This is achieved by a training recipe which, as described in the paper, can be applied during pretraining or while fine-tuning on a specific domain. Self-speculative decoding is especially attractive for real-world applications, enabling deployment on smaller GPUs and lowering the overall hardware footprint needed for large-scale inference.
In this blog post, we explore the concept of self-speculative decoding, its implementation, and practical applications using the 🤗 transformers library. You’ll learn about the technical underpinnings, including early exit layers, unembedding, and training modifications. To ground these concepts in practice, we offer code examples, benchmark comparisons with traditional speculative decoding, and insights into performance trade-offs.
Speculative Decoding and Self-Speculative Decoding
Traditional speculative decoding uses two models: a smaller one (draft model) to generate a sequence of draft tokens, and a larger one (verification model) to verify the draft’s accuracy. The smaller model performs a significant portion of the generation, while the larger model refines the results. This increases text generation speed, since the larger model verifies full sequences at once instead of generating one token at a time.
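As a point of reference, here is a minimal sketch of traditional two-model speculative decoding with the 🤗 transformers assisted-generation API. The checkpoint names are placeholders (not from the original post); any draft/verifier pair that shares a tokenizer could be used:

```python
# A minimal sketch of traditional two-model speculative (assisted) decoding.
# Checkpoint names are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

verifier_checkpoint = "meta-llama/Llama-2-7b-hf"          # large verification model (example)
draft_checkpoint = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"   # small draft model (example)

tokenizer = AutoTokenizer.from_pretrained(verifier_checkpoint)
model = AutoModelForCausalLM.from_pretrained(verifier_checkpoint, torch_dtype=torch.bfloat16, device_map="auto")
assistant_model = AutoModelForCausalLM.from_pretrained(draft_checkpoint, torch_dtype=torch.bfloat16, device_map="auto")

inputs = tokenizer("Once upon a time", return_tensors="pt").to(model.device)
# Passing `assistant_model` turns on speculative (assisted) decoding:
outputs = model.generate(**inputs, assistant_model=assistant_model, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```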
In self-speculative decoding, the authors build on this concept but use the early layers of a large model to generate draft tokens that are then verified by the model’s deeper layers. This “self” aspect of speculative decoding, which requires specific training, allows the model to perform both drafting and verification. This, in turn, improves speed and reduces computational costs compared to traditional speculative decoding.
Note: While the assistant_early_exit argument can potentially enable early-exit self-speculative decoding for any decoder-only transformer, the hidden states from intermediate layers cannot be reliably unembedded (the process of decoding through the LM head, described later in the blog post) unless the model is specifically trained for that. You will also only obtain speedups for a checkpoint that was trained in such a way as to increase the accuracy of earlier layers. The LayerSkip paper proposes a training recipe to achieve that (namely, applying early exit loss and progressively increasing layer dropout rates). A collection of Llama2, Llama3, and Code Llama checkpoints that have been continually pretrained with the LayerSkip training recipe is provided here.
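In 🤗 transformers, enabling this looks roughly like the sketch below. The checkpoint name and exit layer are illustrative, and, as noted above, a LayerSkip-trained checkpoint is needed to actually see a speedup:

```python
# A minimal sketch of early-exit self-speculative decoding via `assistant_early_exit`.
# The checkpoint and exit layer are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "facebook/layerskip-llama2-7B"  # example LayerSkip checkpoint

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16, device_map="auto")

inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to(model.device)
# Draft with the first 4 layers, then verify with the full model:
outputs = model.generate(**inputs, assistant_early_exit=4, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```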
One key technique in self-speculative decoding is early exit, where the generation process can halt at a pre-specified layer. To accomplish this, we unembed the hidden states of these layers by projecting them through the language model (LM) head to predict the next token. This allows the model to skip subsequent layers and improve inference time.
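As a rough illustration of what unembedding means, the sketch below assumes a Llama-style model that exposes its final norm as model.model.norm and its LM head as model.lm_head; the checkpoint and exit layer are placeholders:

```python
# A rough sketch of unembedding an intermediate layer: take the hidden state after an early
# layer, apply the final norm, and project it through the LM head to get next-token logits.
# For checkpoints not trained with early exit losses, these logits are usually poor.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "facebook/layerskip-llama2-7B"  # illustrative
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16)

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**inputs, output_hidden_states=True).hidden_states

exit_layer = 4                                               # illustrative early-exit layer
early_hidden = model.model.norm(hidden_states[exit_layer])   # hidden state after `exit_layer` blocks
early_logits = model.lm_head(early_hidden)                   # "unembed" through the LM head
print(tokenizer.decode(early_logits[0, -1].argmax().item()))
```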
Unembedding can be performed at any transformer layer, turning early exit into an efficient token-prediction mechanism. A natural question arises: how can the LM head be adapted to unembed the outputs of earlier layers when it was originally trained to work with the final layer only? This is where the training modifications come into play.
In the training phase, we introduce layer dropout, which randomly skips certain layers during training. The dropout rate increases progressively for deeper layers, making the model less reliant on its later layers while also enhancing the model’s generalization and speeding up training.
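The toy sketch below illustrates the idea, not the paper's exact recipe (which uses a specific curve across layers and training steps): each layer is skipped with a probability that grows with depth.

```python
# A conceptual sketch of layer dropout, not the exact LayerSkip implementation: each decoder
# layer is stochastically skipped during training, with a skip probability that grows with depth.
import torch
import torch.nn as nn

class LayerDropoutStack(nn.Module):
    def __init__(self, layers, max_dropout=0.2):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        num_layers = len(layers)
        # Skip probability rises from 0 at the first layer to `max_dropout` at the last one.
        self.dropout_rates = [max_dropout * i / max(num_layers - 1, 1) for i in range(num_layers)]

    def forward(self, hidden_states):
        for layer, rate in zip(self.layers, self.dropout_rates):
            if self.training and torch.rand(()).item() < rate:
                continue  # skip this layer for the current step (simplified to the whole batch)
            hidden_states = layer(hidden_states)
        return hidden_states

# Example usage with toy feed-forward blocks standing in for transformer layers:
stack = LayerDropoutStack([nn.Linear(16, 16) for _ in range(8)])
stack.train()
out = stack(torch.randn(2, 16))
```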
In addition to layer dropout, early exit loss is applied to ensure the LM head learns to unembed different layers. The total loss function for training the model with early exits is a summation of the normalized losses from each exit (intermediate layer). This technique enables efficient training by distributing the learning task across all layers.
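Schematically, if $\mathcal{E}$ is the set of exit layers, $h_l$ the hidden state at layer $l$, and $\tilde{e}_l$ the normalized per-layer weights (whose exact schedule is defined in the paper), the objective takes a form like:

$$
\mathcal{L}_{\text{total}} \;=\; \sum_{l \in \mathcal{E}} \tilde{e}_l \,\mathcal{L}_{\text{CE}}\big(\mathrm{LM\_head}(h_l),\, y\big),
\qquad \sum_{l \in \mathcal{E}} \tilde{e}_l = 1 .
$$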
Once training is complete, we can apply self-speculative decoding during inference. The process begins with self-drafting, where tokens are generated by exiting early from some intermediate layer. The number of speculative tokens defines how many draft tokens are produced during this stage, while the layer we exit at defines how large and accurate the draft stage is. Both parameters can be specified at inference time based on a trade-off between the speed and the accuracy of the draft stage.
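A sketch of setting both knobs at inference time is shown below, assuming the assistant_early_exit and num_assistant_tokens generation arguments; the checkpoint and values are illustrative:

```python
# A sketch of configuring both draft-stage knobs at inference time.
# Checkpoint name and parameter values are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

checkpoint = "facebook/layerskip-llama2-7B"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16, device_map="auto")

generation_config = GenerationConfig(
    max_new_tokens=64,
    assistant_early_exit=4,   # earlier exit -> faster but less accurate drafts
    num_assistant_tokens=5,   # number of speculative tokens drafted per verification step
)

inputs = tokenizer("Write a short poem about GPUs.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, generation_config=generation_config)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```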
The next stage is self-verification, where the full model is used to verify the draft tokens. The verification model reuses the portion of the cache produced during drafting. If the draft tokens align with the verified tokens, they are added to the final output. This results in better usage of the system’s memory bandwidth, because verifying a draft with the full model is much cheaper than generating the same sequence of tokens with it, as long as several of the tokens match.
In the self-verification stage, only the remaining layers are computed for verification, because the results from the early layers are cached during the drafting phase.
Self-speculative decoding benefits significantly from cache reuse, particularly the KV cache, which stores key-value pairs computed during the drafting stage. This cache allows the model to skip redundant calculations, as both the draft and verification stages use the same early layers. Additionally, the exit query cache stores the query vector from the exit layer, allowing verification to continue seamlessly from the draft stage.
Compared to traditional two-model speculative decoding, early-exit self-speculative decoding can benefit from the following savings:
• Shared Weights: Reuses the weights of the first E layers for both drafting and verification.
• Shared KV Cache: Reuses the key-value pairs of the first E layers for both drafting and verification.
• Shared Compute: Reuses the compute of the first E layers by using an Exit Query Cache that saves only the query vector of the exit layer E−1, so that the verification process won’t need to recompute layers 0 to E−1.
So far, the 🤗 transformers library has implemented the first optimization (Shared Weights) in this pull request. As the number of models that use this method increases, we’ll consider the additional optimizations. Feel free to open a PR if you’re interested! (View Highlight)
The early exit layer of the draft stage is a hyperparameter that we can tune or modify at inference time (a simple sweep over exit layers is sketched after this list):
• The earlier we exit, the faster the draft tokens are generated, but the less accurate they will be.
• The later we exit, the more accurate the draft tokens are, but the slower their generation will be.
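A simple way to explore this trade-off for a given model and prompt is to sweep candidate exit layers and measure throughput. The sketch below assumes a LayerSkip-trained checkpoint; names and values are illustrative:

```python
# A rough benchmarking sketch: sweep candidate exit layers and time generation.
# For a fair comparison, add warmup runs and average over several repetitions.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "facebook/layerskip-llama2-7B"  # illustrative
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16, device_map="auto")
inputs = tokenizer("Explain speculative decoding in one paragraph.", return_tensors="pt").to(model.device)

for exit_layer in [2, 4, 8, 12, 16]:
    start = time.perf_counter()
    outputs = model.generate(**inputs, assistant_early_exit=exit_layer, max_new_tokens=128, do_sample=False)
    elapsed = time.perf_counter() - start
    new_tokens = outputs.shape[-1] - inputs["input_ids"].shape[-1]
    print(f"exit layer {exit_layer:>2}: {new_tokens / elapsed:.1f} tokens/s")
```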
For baseline checkpoints that have not been pretrained or continually pretrained with the LayerSkip training recipe, early-exit self-speculative decoding is slower than autoregressive decoding. This is because, during the training of most LLMs, earlier layers are not encouraged to learn to predict the output, so generating tokens from earlier layers will have a very low acceptance rate.
On the other hand, for the Llama checkpoints that were continually pretrained with the LayerSkip recipe, early-exit self-speculative decoding provides a speedup over autoregressive decoding for at least a subset of the layers.
• For most models, except Llama3.2 1B, we notice a regular pattern as we traverse the layers: the speedup starts low for the first few layers, increases gradually to a sweet spot, and then decreases again.
• The early exit layer sweet spot is where we get the optimal trade-off between high accuracy of predictions and low overhead of generating tokens. This sweet spot depends on each model, and may also depend on the prompt or the domain of the prompt.
LayerSkip leverages the synergy between early exit, layer dropout, and cache reuse to create a fast and efficient text generation pipeline. By training the model to unembed the outputs of different layers and optimizing the verification process with caches, this approach strikes a balance between speed and accuracy. As a result, it significantly improves inference times in large language models while maintaining high-quality outputs. It also reduces memory compared to traditional speculative decoding techniques, since a single model serves as both the draft and verification model.
Self-speculation is an exciting field where the same LLM can create draft tokens and correct them itself. Other self-speculation approaches include:
• Draft & Verify: where the draft stage involves skipping pre-determined attention and feed forward layers.
• MagicDec: where the draft stage uses a subset of the KV cache, which is useful for long context inputs.
• Jacobi Decoding and Lookahead Decoding: where the draft stage is a series of “guess tokens” that can be either random or obtained from an n-gram lookup table.