HuggingFace TGI
HuggingFace's production LLM serving toolkit with continuous batching and streaming
Text Generation Inference (TGI) is HuggingFace's production-ready toolkit for deploying and serving large language models, with features including continuous batching, token streaming, tensor parallelism, and quantization support. It ships as a Docker-based deployment with optimized kernels for NVIDIA and AMD GPUs, making it straightforward to serve HuggingFace models in production. TGI is the serving engine behind HuggingFace's hosted inference API, and it is used for production serving by ML engineers deploying open-source models, cloud providers, and enterprises hosting their own LLM infrastructure.
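As a rough sketch of what serving looks like in practice, the snippet below sends a single prompt to a TGI server's `/generate` REST endpoint. It assumes an instance is already running (e.g. started separately via the official Docker image); the host, port, prompt, and parameter values are illustrative placeholders, not defaults fixed by TGI.

```python
# Minimal sketch: query a running TGI server over its REST API.
import requests

TGI_URL = "http://localhost:8080"  # assumption: a local TGI container

response = requests.post(
    f"{TGI_URL}/generate",
    json={
        "inputs": "What is continuous batching?",
        # Sampling parameters are optional; these values are illustrative.
        "parameters": {"max_new_tokens": 64, "temperature": 0.7},
    },
    timeout=60,
)
response.raise_for_status()
# TGI returns the completion under the "generated_text" key.
print(response.json()["generated_text"])
```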
Key Features
- Continuous batching
- Token streaming (see the streaming sketch after this list)
- Multi-GPU support
- Quantization
- Docker deployment
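To illustrate the token-streaming feature referenced above, here is a hedged sketch using the `InferenceClient` from the `huggingface_hub` library, which can talk to a TGI endpoint. It again assumes a local server at `http://localhost:8080`; the prompt and parameters are placeholders.

```python
# Sketch of token streaming against a TGI server, assuming it is
# reachable at http://localhost:8080.
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")  # assumption: local server

# With stream=True, tokens are yielded incrementally as the server
# generates them, instead of waiting for the full completion.
for token in client.text_generation(
    "Explain tensor parallelism in one sentence.",
    max_new_tokens=64,
    stream=True,
):
    print(token, end="", flush=True)
print()
```

Streaming like this is what lets chat UIs show partial output immediately while the server's continuous batcher keeps the GPU busy with other requests.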
Quick Info
- Category: AI Infrastructure & MLOps
- Pricing: Free