LMDeploy Release V0.5.2
Highlight
- LMDeploy supports Llama 3.1 and its tool calling. An example of calling "Wolfram Alpha" to perform complex mathematical calculations can be found here
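As a minimal sketch of what a tool-calling request might look like against an LMDeploy OpenAI-compatible api_server (the server URL, served model name, and the `wolfram_alpha` tool schema below are illustrative assumptions, not taken from this release):

```python
# Sketch: building a Llama 3.1 tool-calling request for LMDeploy's
# OpenAI-compatible chat completions endpoint.
# Assumes a server started with `lmdeploy serve api_server <model>`;
# the tool name and schema here are hypothetical examples.
tools = [{
    "type": "function",
    "function": {
        "name": "wolfram_alpha",  # hypothetical tool name
        "description": "Query Wolfram Alpha to evaluate a mathematical expression.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "The expression to evaluate.",
                },
            },
            "required": ["query"],
        },
    },
}]

payload = {
    "model": "llama3.1",  # served model name depends on your deployment
    "messages": [
        {"role": "user", "content": "What is the integral of x^2 from 0 to 3?"}
    ],
    "tools": tools,
}

# With a live server, this payload would be POSTed to
# http://localhost:23333/v1/chat/completions (e.g. via the `openai` client),
# and the model's returned tool call would be executed by your code.
print(payload["tools"][0]["function"]["name"])
```

The model responds with a structured tool call rather than free text; the client then runs the named function and feeds the result back as a `tool` message.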
What's Changed
🚀 Features
- Support glm4 awq by @AllentDan in #1993
- Support llama3.1 by @lvhan028 in #2122
- Support Llama3.1 tool calling by @AllentDan in #2123
💥 Improvements
- Remove the triton inference server backend "turbomind_backend" by @lvhan028 in #1986
- Remove kv cache offline quantization by @AllentDan in #2097
- Remove `session_len` and deprecated short names of the chat templates by @lvhan028 in #2105
- Clarify that `n>1` in `GenerationConfig` hasn't been supported yet by @lvhan028 in #2108
🐞 Bug fixes
- Fix stop words for glm4 by @RunningLeon in #2044
- Disable peer access code by @lzhangzz in #2082
- Set log level to ERROR in benchmark scripts by @lvhan028 in #2086
- Raise thread exceptions by @irexyc in #2071
- Fix index error when profiling token generation with `-ct 1` by @lvhan028 in #1898
🌐 Other
- misc: replace slow Jimver/cuda-toolkit by @zhyncs in #2065
- misc: update bug issue template by @zhyncs in #2083
- Update daily test cases by @zhulinJulia24 in #2035
- bump version to v0.5.2 by @lvhan028 in #2143
Full Changelog: v0.5.1...v0.5.2