Metadata
- Author: Zhuohan Li
- Full Title: Thrilled to introduce vLLM with @woosuk_k!
- URL: https://twitter.com/zhuohan123/status/1671234707206590464
Highlights
- vLLM is an open-source LLM inference and serving library that accelerates HuggingFace Transformers by 24x and powers @lmsysorg Vicuna and Chatbot Arena.
- The core of vLLM is PagedAttention, a novel attention algorithm that brings the classic idea of paging in OS's virtual memory to LLM serving. Without modifying the model, PagedAttention can batch 5x more sequences together, increasing GPU utilization and thus the throughput.
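The paging analogy in the highlight above can be sketched in a few lines of Python. This is a toy illustration of the idea only, not vLLM's actual implementation; all names (`BlockAllocator`, `Sequence`, `BLOCK_SIZE`) are hypothetical. The KV cache is split into fixed-size blocks, and each sequence keeps a "block table" mapping its logical blocks to physical block ids, so a sequence's cache need not be contiguous and blocks are allocated only on demand, which is what lets many more sequences share the same GPU memory.

```python
# Toy sketch of the paging idea behind PagedAttention.
# Assumption: these names are illustrative, not vLLM's real API.

BLOCK_SIZE = 4  # tokens stored per KV-cache block (vLLM uses e.g. 16)

class BlockAllocator:
    """Hands out physical KV-cache block ids from a fixed pool."""
    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))

    def allocate(self):
        if not self.free:
            raise MemoryError("out of KV-cache blocks")
        return self.free.pop(0)

class Sequence:
    """Tracks one request's logical-to-physical block mapping."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # A new physical block is allocated only when the last one fills,
        # so memory is committed on demand rather than pre-reserved.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

alloc = BlockAllocator(num_physical_blocks=8)
seq_a, seq_b = Sequence(alloc), Sequence(alloc)
for _ in range(6):
    seq_a.append_token()  # 6 tokens -> occupies 2 blocks
for _ in range(3):
    seq_b.append_token()  # 3 tokens -> occupies 1 block
print(seq_a.block_table, seq_b.block_table, len(alloc.free))
```

Because each sequence holds only the blocks it actually uses, the allocator can pack far more concurrent sequences into the same pool than a scheme that reserves a contiguous maximum-length buffer per sequence up front.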