LMDeploy Release V0.5.2
Highlight
- LMDeploy supports Llama 3.1 and its tool calling. An example of calling "Wolfram Alpha" to perform complex mathematical calculations can be found here
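As a minimal sketch of what a tool-calling request might look like against an LMDeploy OpenAI-compatible api_server (the server URL, served model name, and the `wolfram_alpha` tool schema below are illustrative assumptions, not taken from this release):

```python
# Sketch: building a Llama 3.1 tool-calling request for LMDeploy's
# OpenAI-compatible chat completions endpoint.
# Assumes a server started with `lmdeploy serve api_server <model>`;
# the tool name and schema here are hypothetical examples.
tools = [{
    "type": "function",
    "function": {
        "name": "wolfram_alpha",  # hypothetical tool name
        "description": "Query Wolfram Alpha to evaluate a mathematical expression.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "The expression to evaluate.",
                },
            },
            "required": ["query"],
        },
    },
}]

payload = {
    "model": "llama3.1",  # served model name depends on your deployment
    "messages": [
        {"role": "user", "content": "What is the integral of x^2 from 0 to 3?"}
    ],
    "tools": tools,
}

# With a live server, this payload would be POSTed to
# http://localhost:23333/v1/chat/completions (e.g. via the `openai` client),
# and the model's returned tool call would be executed by your code.
print(payload["tools"][0]["function"]["name"])
```

The model responds with a structured tool call rather than free text; the client then runs the named function and feeds the result back as a `tool` message.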
What's Changed
🚀 Features
- Support glm4 awq by @AllentDan in #1993
- Support llama3.1 by @lvhan028 in #2122
- Support Llama3.1 tool calling by @AllentDan in #2123
💥 Improvements
- Remove the triton inference server backend "turbomind_backend" by @lvhan028 in #1986
- Remove kv cache offline quantization by @AllentDan in #2097
- Remove `session_len` and deprecated short names of the chat templates by @lvhan028 in #2105
- Clarify that `n>1` in `GenerationConfig` hasn't been supported yet by @lvhan028 in #2108
🐞 Bug fixes
- Fix stop words for glm4 by @RunningLeon in #2044
- Disable peer access code by @lzhangzz in #2082
- Set log level to ERROR in benchmark scripts by @lvhan028 in #2086
- Raise thread exceptions by @irexyc in #2071
- Fix index error when profiling token generation with `-ct 1` by @lvhan028 in #1898
🌐 Other
- misc: replace slow Jimver/cuda-toolkit by @zhyncs in #2065
- misc: update bug issue template by @zhyncs in #2083
- Update daily test cases by @zhulinJulia24 in #2035
- bump version to v0.5.2 by @lvhan028 in #2143
Full Changelog: v0.5.1...v0.5.2