ExLlamaV2

High-speed quantized LLM inference library for NVIDIA GPUs with EXL2 format

ExLlamaV2 is an open-source inference library for running quantized LLMs at high speed on NVIDIA GPUs. It supports its own EXL2 quantization format, which mixes 2- to 8-bit quantization levels within a model to hit a target average bitrate, and also runs GPTQ models. Among open-source solutions it consistently ranks near the top of tokens-per-second throughput benchmarks for quantized models. Local AI enthusiasts, researchers with limited GPU resources, and developers building local LLM applications use ExLlamaV2 to run large models on consumer or prosumer hardware at usable speeds.
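
The snippet below is a minimal sketch of loading an EXL2 model and generating text, modeled on the Python API used in the project's example scripts; exact class names and signatures can shift between releases, and the model path is a placeholder.

    # Minimal generation sketch, modeled on exllamav2's example scripts.
    # API details may differ between library versions; verify against the
    # release you install. The model directory below is a placeholder.
    from exllamav2 import (
        ExLlamaV2,
        ExLlamaV2Cache,
        ExLlamaV2Config,
        ExLlamaV2Tokenizer,
    )
    from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

    config = ExLlamaV2Config()
    config.model_dir = "/models/llama2-7b-exl2-4.0bpw"  # placeholder path
    config.prepare()

    model = ExLlamaV2(config)
    cache = ExLlamaV2Cache(model, lazy=True)  # KV cache, allocated as layers load
    model.load_autosplit(cache)               # split weights across available GPUs

    tokenizer = ExLlamaV2Tokenizer(config)
    generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

    settings = ExLlamaV2Sampler.Settings()
    settings.temperature = 0.8
    settings.top_p = 0.9

    # Third argument is the number of new tokens to generate.
    print(generator.generate_simple("Quantization works because", settings, 128))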

Key Features

  • EXL2 quantization (mixed 2–8 bits per weight; see the conversion sketch below)
  • High-throughput generation
  • GPTQ support
  • Optimized for NVIDIA GPUs
  • Runs large models on consumer GPUs
#llm-inference #quantization #open-source #local-ai #nvidia
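
As a sketch of producing an EXL2 model in the first place, the repository ships a convert.py script; the invocation below follows the flags shown in the project README (input model, working directory, compiled output directory, target average bits per weight), with placeholder paths. Check the README of the version you install for current options.

    # Quantize an FP16 model to EXL2 at ~4.0 bits per weight (paths are placeholders).
    python convert.py \
        -i /models/llama2-7b-fp16 \
        -o /tmp/exl2-work \
        -cf /models/llama2-7b-exl2-4.0bpw \
        -b 4.0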

Quick Info

Category: AI Infrastructure
Pricing: Free (completely free to use)
