LMDeploy

Efficient LLM deployment toolkit with quantization and high-throughput serving for production

AI Infrastructure

LMDeploy is an efficient toolkit for compressing, deploying, and serving large language models in production environments. Developed by Shanghai AI Lab, it provides 4-bit and 8-bit quantization to reduce model memory footprint, a high-performance inference engine (TurboMind) optimized for continuous batching, and an OpenAI-compatible API server. In a number of benchmarks, LMDeploy achieves higher throughput than vLLM on specific model architectures. AI teams deploying open-source models in production use LMDeploy to maximize GPU utilization and minimize inference costs. The toolkit supports Llama, InternLM, and many other popular model architectures, and it integrates with cloud deployment platforms.

Key Features

  • 4/8-bit quantization
  • TurboMind engine
  • OpenAI-compatible API
  • High throughput
  • Multi-model support
#llm-serving #inference #quantization #open-source #deployment
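
Because the server speaks the OpenAI chat-completions protocol, it can be called with nothing but the standard library. The sketch below assumes a server already launched with `lmdeploy serve api_server <model>` (which by default listens on port 23333); the model name, port, and the `build_chat_request` helper are illustrative, not part of LMDeploy itself:

```python
import json
import urllib.request

# Assumed endpoint: the default address of `lmdeploy serve api_server`.
API_URL = "http://localhost:23333/v1/chat/completions"

def build_chat_request(model, messages, max_tokens=128):
    """Build an OpenAI-style chat-completions payload (illustrative helper)."""
    return {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
    }

def chat(model, messages):
    """POST a chat request to the LMDeploy server and return the reply text."""
    payload = build_chat_request(model, messages)
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # The response follows the OpenAI schema: choices[0].message.content.
    return body["choices"][0]["message"]["content"]

# Example usage (requires a running server):
#   chat("internlm2", [{"role": "user", "content": "Hello"}])
```

Any existing OpenAI client library can be pointed at the same endpoint by overriding its base URL, which is what makes the server a drop-in replacement in most serving stacks.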


Quick Info

Category
AI Infrastructure
Pricing
Free
