LMDeploy
Efficient LLM deployment toolkit with quantization and high-throughput serving for production
LMDeploy is a toolkit for compressing, deploying, and serving large language models in production environments. Developed by Shanghai AI Lab, it provides 4-bit and 8-bit quantization to shrink model memory footprints, a high-performance inference engine (TurboMind) optimized for continuous batching, and an OpenAI-compatible API server. In published benchmarks, LMDeploy delivers higher throughput than vLLM for certain model architectures. Teams deploying open-source models in production use LMDeploy to maximize GPU utilization and reduce inference costs. The toolkit supports Llama, InternLM, and many other popular model architectures and integrates with cloud deployment platforms.
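The memory savings from quantization can be estimated with simple arithmetic: weight storage scales linearly with bit width, so dropping from 16-bit to 4-bit weights cuts the footprint roughly fourfold (quantization scales and activations add some overhead on top). A back-of-envelope sketch for a 7B-parameter model:

```python
def quantized_weight_gib(n_params: float, bits: int) -> float:
    """Approximate weight memory in GiB for a model stored at `bits` bits per parameter.

    Ignores quantization scales, KV cache, and activation memory, so real
    usage will be somewhat higher.
    """
    return n_params * bits / 8 / 2**30

fp16_gib = quantized_weight_gib(7e9, 16)  # roughly 13 GiB of weights
w4_gib = quantized_weight_gib(7e9, 4)     # roughly 3.3 GiB of weights
print(f"fp16: {fp16_gib:.1f} GiB, 4-bit: {w4_gib:.1f} GiB")
```

This is why a 4-bit 7B model fits comfortably on a single consumer GPU that could not hold the fp16 weights plus a usable KV cache.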
Key Features
- ✓ 4/8-bit quantization
- ✓ TurboMind engine
- ✓ OpenAI-compatible API
- ✓ High throughput
- ✓ Multi-model support
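Because the API server is OpenAI-compatible, clients can talk to a deployed model with a standard chat-completions request body and no LMDeploy-specific SDK. A minimal sketch of such a request payload, using only the standard library (the endpoint path follows the OpenAI convention; the model name here is illustrative, not a fixed LMDeploy identifier):

```python
import json

# OpenAI-style chat-completions payload; the model name is a placeholder
# for whatever model the server was launched with.
payload = {
    "model": "internlm2-chat-7b",
    "messages": [
        {"role": "user", "content": "Summarize LMDeploy in one sentence."}
    ],
    "max_tokens": 128,
    "temperature": 0.7,
}

# Serialize the body as it would be POSTed to /v1/chat/completions.
body = json.dumps(payload)
```

In practice this body would be sent with any HTTP client, or the official OpenAI client pointed at the server's base URL, which is the main benefit of the compatible API: existing tooling works unchanged.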
Quick Info
- Category: AI Infrastructure
- Pricing: Free