vLLM

High-throughput LLM inference engine with PagedAttention

vLLM is an open-source, high-performance LLM inference and serving engine from UC Berkeley. It introduced PagedAttention, a memory-management technique that stores attention key-value caches in non-contiguous, paged blocks, significantly improving GPU utilization and throughput when serving large language models. The project reports up to 24x higher throughput than HuggingFace Transformers for the same model, with no loss in output quality. vLLM has become a de facto standard for self-hosting open-source LLMs in production, supporting an OpenAI-compatible API, tensor parallelism, and quantization.
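For example, serving a model through vLLM's OpenAI-compatible API takes a single command, after which any standard OpenAI client can talk to it. The sketch below is illustrative rather than definitive: the model name is an example, port 8000 is vLLM's default, and the placeholder API key is arbitrary since vLLM does not check it by default.

    # First, launch the OpenAI-compatible server (shell command):
    #
    #   vllm serve meta-llama/Llama-3.1-8B-Instruct

    from openai import OpenAI

    # Point the standard OpenAI client at the local vLLM endpoint.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",  # example model, not required
        messages=[{"role": "user", "content": "Explain PagedAttention in one sentence."}],
        max_tokens=100,
    )
    print(response.choices[0].message.content)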

Key Features

  • PagedAttention
  • OpenAI-compatible API
  • Tensor parallelism
  • High throughput
  • Quantization support (sketched below together with tensor parallelism)
  • Open source
#llm #inference #open-source #self-hosted #gpu-optimization
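Tensor parallelism and quantization are also exposed through vLLM's offline Python API. The following is a minimal sketch assuming two GPUs and an AWQ-quantized checkpoint; the model name, GPU count, and quantization method are illustrative choices, not requirements.

    from vllm import LLM, SamplingParams

    # Shard an AWQ-quantized model across 2 GPUs with tensor parallelism.
    llm = LLM(
        model="TheBloke/Llama-2-13B-chat-AWQ",  # illustrative checkpoint
        tensor_parallel_size=2,                 # number of GPUs to shard across
        quantization="awq",
    )

    params = SamplingParams(temperature=0.7, max_tokens=64)
    outputs = llm.generate(["What does PagedAttention do?"], params)
    print(outputs[0].outputs[0].text)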

Get Started

Visit vLLM
Free: completely free to use

Quick Info

Category: Code & Development
Pricing: Free
