ExLlamaV2
High-speed quantized LLM inference library for NVIDIA GPUs with EXL2 format
ExLlamaV2 is an open-source inference library for running quantized LLMs at high speed on NVIDIA GPUs. It supports its own EXL2 quantization format alongside GPTQ and other formats; EXL2 in particular allows mixed quantization levels within a model, so the average bits per weight can be tuned (e.g., 2.5 or 4.65 bpw) to fit a given VRAM budget. In open-source comparisons it consistently ranks among the fastest options for tokens-per-second throughput on quantized models. Local AI enthusiasts, researchers with limited GPU budgets, and developers building local LLM applications use it to run large models at usable speeds on consumer or prosumer hardware.
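For orientation, here is a minimal generation sketch following the loading pattern shown in the project's README; the model path is a placeholder, and exact class names should be checked against the installed version.

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

# Placeholder path: a directory containing an EXL2-quantized model
model_dir = "/models/llama-3-8b-exl2-4.0bpw"

config = ExLlamaV2Config(model_dir)
model = ExLlamaV2(config)

# A lazily allocated cache lets load_autosplit spread layers across available GPUs
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)

# The dynamic generator is the current recommended interface in recent releases
generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)

output = generator.generate(prompt="Once upon a time,", max_new_tokens=150)
print(output)
```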
Key Features
- ✓ EXL2 quantization (workflow sketched after this list)
- ✓ High throughput
- ✓ GPTQ support
- ✓ NVIDIA optimized
- ✓ Consumer GPU friendly
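As referenced above, EXL2 quants are produced offline with the convert script bundled in the ExLlamaV2 repository. A hedged sketch, assuming the convert.py flags documented in the repo and placeholder paths throughout:

```python
import subprocess

# Sketch only: invokes the repo's bundled convert.py from the ExLlamaV2
# source tree; all paths below are placeholders.
subprocess.run(
    [
        "python", "convert.py",
        "-i", "/models/llama-3-8b-hf",     # input: FP16 Hugging Face model dir
        "-o", "/tmp/exl2-work",            # working/scratch directory
        "-cf", "/models/llama-3-8b-exl2",  # output dir for the quantized model
        "-b", "4.0",                       # target average bits per weight
    ],
    check=True,
)
```

Fractional -b values work because EXL2 mixes quantization levels across a model's layers to hit the requested average bitrate.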
Quick Info
- Category: AI Infrastructure
- Pricing: Free
More AI Infrastructure Tools
- Inferless (AI Infrastructure): Serverless AI model deployment platform with GPU auto-scaling and cold start optimization
- Colossal AI (AI Infrastructure): Open-source system for efficient large-scale AI model training and fine-tuning
- Neural Magic (AI Infrastructure): Software-defined AI inference engine that runs LLMs at GPU speed on CPUs
- Weaviate Cloud (AI Infrastructure): Fully managed cloud service for the Weaviate open-source vector database