vLLM significantly speeds up LLM serving by managing the attention key-value (KV) cache memory efficiently with PagedAttention and by batching incoming requests continuously. It also supports a wide range of quantization methods and ships optimized CUDA kernels for faster model execution.
A high-throughput and memory-efficient inference engine for LLMs.
Intended for developers and researchers who need to deploy and serve large language models with high performance and reduced resource usage.
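As a quick illustration, a minimal sketch of offline inference with vLLM's Python API (the model name and sampling values below are placeholders, and vLLM is assumed to be installed):

```python
from vllm import LLM, SamplingParams

# Load a model; the checkpoint name here is only an example.
llm = LLM(model="facebook/opt-125m")

# Sampling settings are illustrative, not recommended defaults.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

prompts = ["The capital of France is"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    # Each result carries the prompt and one or more generated completions.
    print(output.outputs[0].text)
```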