vLLM
High-throughput LLM inference engine with PagedAttention
vLLM is an open-source, high-performance LLM inference and serving engine from UC Berkeley. It introduced PagedAttention, a memory-management technique that stores the KV cache in fixed-size pages to reduce fragmentation, significantly improving GPU utilization and throughput when serving large language models. The vLLM paper reports up to 24x higher throughput than HuggingFace Transformers for the same model, with no loss in output quality. It has become a de facto standard for self-hosting open-source LLMs in production, offering an OpenAI-compatible API, tensor parallelism, and quantization support.
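A minimal sketch of the typical serving workflow, assuming vLLM is installed (`pip install vllm`) and using `meta-llama/Llama-3.1-8B-Instruct` purely as a placeholder model name: start the server, then query it with the standard OpenAI Python client pointed at the local endpoint.

```python
# First start a vLLM server in another shell (model name is a placeholder):
#   vllm serve meta-llama/Llama-3.1-8B-Instruct
from openai import OpenAI

# vLLM exposes an OpenAI-compatible endpoint under /v1; no API key is
# required by default, so any non-empty string works.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```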
Key Features
- ✓ PagedAttention
- ✓ OpenAI-compatible API
- ✓ Tensor parallelism (see sketch below)
- ✓ High throughput
- ✓ Quantization support
- ✓ Open source
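Tensor parallelism and quantization from the list above are selected when the model is loaded. A minimal offline-inference sketch, assuming a multi-GPU host; the model name, GPU count, and AWQ scheme are illustrative placeholders:

```python
from vllm import LLM, SamplingParams

# Load an AWQ-quantized checkpoint sharded across 2 GPUs (placeholders;
# adjust the model, scheme, and GPU count to your setup).
llm = LLM(
    model="TheBloke/Llama-2-13B-AWQ",
    quantization="awq",
    tensor_parallel_size=2,
)

params = SamplingParams(temperature=0.7, max_tokens=128)

# generate() batches prompts internally; PagedAttention manages the KV cache.
outputs = llm.generate(["What is PagedAttention?"], params)
for out in outputs:
    print(out.outputs[0].text)
```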
Quick Info
- Category: Code & Development
- Pricing: Free