vLLM
High-throughput LLM inference engine with PagedAttention
vLLM is an open-source, high-performance LLM inference and serving engine from UC Berkeley. It introduced PagedAttention, a memory-management technique that stores the KV cache in fixed-size pages to reduce fragmentation, significantly improving GPU utilization and throughput when serving large language models. The vLLM paper reports up to 24x higher throughput than HuggingFace Transformers for the same model, with no loss in output quality. It has become a de facto standard for self-hosting open-source LLMs in production, offering an OpenAI-compatible API, tensor parallelism, and quantization support.
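A minimal sketch of the typical serving workflow, assuming vLLM is installed (`pip install vllm`) and using `meta-llama/Llama-3.1-8B-Instruct` purely as a placeholder model name: start the server, then query it with the standard OpenAI Python client pointed at the local endpoint.

```python
# First start a vLLM server in another shell (model name is a placeholder):
#   vllm serve meta-llama/Llama-3.1-8B-Instruct
from openai import OpenAI

# vLLM exposes an OpenAI-compatible endpoint under /v1; no API key is
# required by default, so any non-empty string works.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```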
Key Features
- ✓ PagedAttention
- ✓ OpenAI-compatible API
- ✓ Tensor parallelism (see sketch below)
- ✓ High throughput
- ✓ Quantization support
- ✓ Open source
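Tensor parallelism and quantization from the list above are selected when the model is loaded. A minimal offline-inference sketch, assuming a multi-GPU host; the model name, GPU count, and AWQ scheme are illustrative placeholders:

```python
from vllm import LLM, SamplingParams

# Load an AWQ-quantized checkpoint sharded across 2 GPUs (placeholders;
# adjust the model, scheme, and GPU count to your setup).
llm = LLM(
    model="TheBloke/Llama-2-13B-AWQ",
    quantization="awq",
    tensor_parallel_size=2,
)

params = SamplingParams(temperature=0.7, max_tokens=128)

# generate() batches prompts internally; PagedAttention manages the KV cache.
outputs = llm.generate(["What is PagedAttention?"], params)
for out in outputs:
    print(out.outputs[0].text)
```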
Quick Info
- Category: Code & Development
- Pricing: Free