After the release of Qwen2.5, we heard the community’s demand for processing longer contexts. In recent months, we have made many optimizations to the model’s capabilities and inference performance for extremely long contexts. Today, we are proud to introduce the new Qwen2.5-Turbo, which features:
Longer Context Support: We have extended the model’s context length from 128K to 1M tokens, roughly 1 million English words or 1.5 million Chinese characters, equivalent to 10 full-length novels, 150 hours of speech transcripts, or 30,000 lines of code. The model achieves 100% accuracy on the 1M-token Passkey Retrieval task (a sketch of how such a test is typically constructed appears after this list) and scores 93.1 on the long-context benchmark RULER, surpassing GPT-4’s 91.6 and GLM4-9B-1M’s 89.9. At the same time, the model remains highly competitive on short sequences, on par with GPT-4o-mini.
Faster Inference Speed: Using a sparse attention mechanism, we reduced the time to first token for a 1M-token context from 4.9 minutes to 68 seconds, a 4.3x speedup.
Lower Cost: The price remains ¥0.3 per 1M tokens. At the same cost, Qwen2.5-Turbo can process 3.6 times as many tokens as GPT-4o-mini.
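For readers unfamiliar with the Passkey Retrieval setup referenced above: such a test is typically built by hiding a short random key inside a long span of filler text and asking the model to repeat it. The sketch below is a minimal, hypothetical illustration of that setup; the filler sentence, prompt wording, and the `query_model` placeholder are assumptions for demonstration, not the exact evaluation we ran.

```python
import random

def build_passkey_prompt(total_tokens: int, filler: str = "The grass is green. The sky is blue. ") -> tuple[str, str]:
    """Hide a random passkey inside roughly total_tokens of filler text
    (word count is used as a rough proxy for token count)."""
    passkey = str(random.randint(10_000, 99_999))
    needle = f"The passkey is {passkey}. Remember it. "
    n_repeats = max(1, total_tokens // len(filler.split()))
    haystack = [filler] * n_repeats
    # Insert the needle at a random position in the haystack.
    haystack.insert(random.randint(0, len(haystack)), needle)
    prompt = (
        "".join(haystack)
        + "\nWhat is the passkey mentioned in the text above? Answer with the number only."
    )
    return prompt, passkey

# Usage: score the model by exact match of the returned passkey.
# prompt, passkey = build_passkey_prompt(total_tokens=1_000_000)
# correct = passkey in query_model(prompt)  # query_model is a placeholder for an API call
```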
Beyond improving performance on long-context tasks, we also care about how the model performs on short contexts. Existing context-length extension methods often cause significant degradation on short texts, so we paid special attention to this issue when building Qwen2.5-Turbo, ensuring that the extended context length has almost no impact on short-text capabilities.
Results on short-text benchmarks show that Qwen2.5-Turbo significantly surpasses previous open-source models with 1M-token context lengths on most tasks. Compared with GPT-4o-mini and Qwen2.5-14B-Instruct, Qwen2.5-Turbo achieves similar performance on short-text tasks while supporting 8 times their context length.
We measured the time to first token (TTFT) for inputs of different lengths. For sequences of 1M tokens, sparse attention compresses the attention computation by roughly 12.5x, yielding a 3.2x to 4.3x speedup across different hardware configurations.
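As a rough sanity check of the headline number, 4.9 minutes is 294 seconds, and 294 / 68 ≈ 4.3, matching the reported end-to-end speedup. To illustrate where a ~12.5x compression factor can come from: if each query block attends to only a small budget of key blocks instead of the full 1M-token key set, the attention score computation shrinks proportionally. The toy single-head sketch below assumes a simple top-k block selection rule based on mean-pooled similarity; it is only an illustration of the general idea, not the exact mechanism used in Qwen2.5-Turbo.

```python
import numpy as np

def block_sparse_attention(q, k, v, block_size=64, keep_ratio=0.08):
    """Toy single-head attention: each query block attends only to its top-k
    key blocks, cutting score computation by ~1 / keep_ratio (~12.5x at 0.08)."""
    n, d = q.shape
    n_blocks = n // block_size
    qb = q[: n_blocks * block_size].reshape(n_blocks, block_size, d)
    kb = k[: n_blocks * block_size].reshape(n_blocks, block_size, d)
    vb = v[: n_blocks * block_size].reshape(n_blocks, block_size, d)

    # Cheap block-level relevance: similarity between mean-pooled query and key blocks.
    block_scores = qb.mean(axis=1) @ kb.mean(axis=1).T          # (n_blocks, n_blocks)
    k_keep = max(1, int(keep_ratio * n_blocks))
    top_blocks = np.argsort(-block_scores, axis=1)[:, :k_keep]  # kept key blocks per query block

    out = np.zeros((n_blocks, block_size, d), dtype=q.dtype)
    for i in range(n_blocks):
        keys = kb[top_blocks[i]].reshape(-1, d)                 # only the selected key blocks
        vals = vb[top_blocks[i]].reshape(-1, d)
        scores = qb[i] @ keys.T / np.sqrt(d)                    # scores over the kept blocks only
        weights = np.exp(scores - scores.max(axis=1, keepdims=True))
        weights /= weights.sum(axis=1, keepdims=True)
        out[i] = weights @ vals
    return out.reshape(-1, d)

# Example: 8192 tokens, 64-dim head; each query block attends to ~8% of key blocks.
# x = np.random.randn(8192, 64).astype(np.float32)
# y = block_sparse_attention(x, x, x)
```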
While we are pleased to finally extend the context of Qwen2.5-Turbo to 1M tokens, we also recognize that the current model does not always perform satisfactorily on long-sequence tasks in real applications. Many challenges remain: performance on long sequences can be unstable, and inference cost makes it difficult to deploy larger models. Going forward, we will explore further alignment with human preferences on long sequences, optimize inference efficiency to reduce computation time, and work toward launching larger and stronger long-context models. We look forward to sharing new progress with you soon, so stay tuned!