ExLlamaV2
High-speed quantized LLM inference library for NVIDIA GPUs with EXL2 format
ExLlamaV2 is an open-source inference library for running quantized LLMs at high speed on NVIDIA GPUs. It supports its own EXL2 quantization format alongside GPTQ and other formats; EXL2 in particular allows mixed quantization levels within a model, so the average bits per weight can be tuned (e.g., 2.5 or 4.65 bpw) to fit a given VRAM budget. In open-source comparisons it consistently ranks among the fastest options for tokens-per-second throughput on quantized models. Local AI enthusiasts, researchers with limited GPU budgets, and developers building local LLM applications use it to run large models at usable speeds on consumer or prosumer hardware.
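For orientation, here is a minimal generation sketch following the loading pattern shown in the project's README; the model path is a placeholder, and exact class names should be checked against the installed version.

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

# Placeholder path: a directory containing an EXL2-quantized model
model_dir = "/models/llama-3-8b-exl2-4.0bpw"

config = ExLlamaV2Config(model_dir)
model = ExLlamaV2(config)

# A lazily allocated cache lets load_autosplit spread layers across available GPUs
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)

# The dynamic generator is the current recommended interface in recent releases
generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)

output = generator.generate(prompt="Once upon a time,", max_new_tokens=150)
print(output)
```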
Key Features
- ✓ EXL2 quantization (workflow sketched after this list)
- ✓ High throughput
- ✓ GPTQ support
- ✓ NVIDIA optimized
- ✓ Consumer GPU friendly
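As referenced above, EXL2 quants are produced offline with the convert script bundled in the ExLlamaV2 repository. A hedged sketch, assuming the convert.py flags documented in the repo and placeholder paths throughout:

```python
import subprocess

# Sketch only: invokes the repo's bundled convert.py from the ExLlamaV2
# source tree; all paths below are placeholders.
subprocess.run(
    [
        "python", "convert.py",
        "-i", "/models/llama-3-8b-hf",     # input: FP16 Hugging Face model dir
        "-o", "/tmp/exl2-work",            # working/scratch directory
        "-cf", "/models/llama-3-8b-exl2",  # output dir for the quantized model
        "-b", "4.0",                       # target average bits per weight
    ],
    check=True,
)
```

Fractional -b values work because EXL2 mixes quantization levels across a model's layers to hit the requested average bitrate.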
Quick Info
- Category: AI Infrastructure
- Pricing: Free
More AI Infrastructure Tools
- Inferless (AI Infrastructure): Serverless AI model deployment platform with GPU auto-scaling and cold start optimization
- Colossal AI (AI Infrastructure): Open-source system for efficient large-scale AI model training and fine-tuning
- Neural Magic (AI Infrastructure): Software-defined AI inference engine that runs LLMs at GPU speed on CPUs
- Weaviate Cloud (AI Infrastructure): Fully managed cloud service for the Weaviate open-source vector database