Skip to content

Releases: InternLM/lmdeploy

LMDeploy Release V0.5.2.post1

26 Jul 12:22
fb6f8ea
Compare
Choose a tag to compare

What's Changed

🐞 Bug fixes

  • [Hotfix] miss parentheses when calcuating the coef of llama3 rope which causes needle-in-hays experiment failed by @lvhan028 in #2157

🌐 Other

Full Changelog: v0.5.2...v0.5.2.post1

LMDeploy Release V0.5.2

26 Jul 08:07
7199b4e
Compare
Choose a tag to compare

Highlight

  • LMDeploy support Llama3.1 and its Tool Calling. An example of calling "Wolfram Alpha" to perform complex mathematical calculations can be found from here

What's Changed

🚀 Features

💥 Improvements

  • Remove the triton inference server backend "turbomind_backend" by @lvhan028 in #1986
  • Remove kv cache offline quantization by @AllentDan in #2097
  • Remove session_len and deprecated short names of the chat templates by @lvhan028 in #2105
  • clarify "n>1" in GenerationConfig hasn't been supported yet by @lvhan028 in #2108

🐞 Bug fixes

🌐 Other

Full Changelog: v0.5.1...v0.5.2

LMDeploy Release V0.5.1

16 Jul 10:05
9cdce39
Compare
Choose a tag to compare

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

📚 Documentations

🌐 Other

New Contributors

Full Changelog: v0.5.0...v0.5.1

LMDeploy Release V0.5.0

01 Jul 07:22
4cb3854
Compare
Choose a tag to compare

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

📚 Documentations

🌐 Other

New Contributors

Full Changelog: v0.4.2...v0.5.0

LMDeploy Release V0.4.2

27 May 08:56
54b7230
Compare
Choose a tag to compare

Highlight

  • Support 4-bit weight-only quantization and inference on VMLs, such as InternVL v1.5, LLaVa, InternLMXComposer2

Quantization

lmdeploy lite auto_awq OpenGVLab/InternVL-Chat-V1-5 --work-dir ./InternVL-Chat-V1-5-AWQ

Inference with quantized model

from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

pipe = pipeline('./InternVL-Chat-V1-5-AWQ', backend_config=TurbomindEngineConfig(tp=1, model_format='awq'))

img = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
out = pipe(('describe this image', img))
print(out)
  • Balance vision model when deploying VLMs with multiple GPUs
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

pipe = pipeline('OpenGVLab/InternVL-Chat-V1-5', backend_config=TurbomindEngineConfig(tp=2))

img = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
out = pipe(('describe this image', img))
print(out)

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

📚 Documentations

🌐 Other

New Contributors

Full Changelog: v0.4.1...v0.4.2

LMDeploy Release V0.4.1

07 May 08:20
14e9953
Compare
Choose a tag to compare

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

  • fix local variable 'response' referenced before assignment in async_engine.generate by @irexyc in #1513
  • Fix turbomind import in windows by @irexyc in #1533
  • Fix convert qwen2 to turbomind by @AllentDan in #1546
  • Adding api_key and model_name parameters to the restful benchmark by @NiuBlibing in #1478

📚 Documentations

🌐 Other

New Contributors

Full Changelog: v0.4.0...v0.4.1

LMDeploy Release V0.4.0

23 Apr 11:18
04ba0ff
Compare
Choose a tag to compare

Highlights

Support for Llama3 and additional Vision-Language Models (VLMs):

  • We now support Llama3 and an extended range of Vision-Language Models (VLMs), including InternVL versions 1.1 and 1.2, MiniGemini, and InternLMXComposer2.

Introduce online int4/int8 KV quantization and inference

  • data-free online quantization
  • Supports all nvidia GPU models with Volta architecture (sm70) and above
  • KV int8 quantization has almost lossless accuracy, and KV int4 quantization accuracy is within an acceptable range
  • Efficient inference, with int8/int4 KV quantization applied to llama2-7b, RPS is improved by approximately 30% and 40% respectively compared to fp16

The following table shows the evaluation results of three LLM models with different KV numerical precision:

- - - llama2-7b-chat - - internlm2-chat-7b - - qwen1.5-7b-chat - -
dataset version metric kv fp16 kv int8 kv int4 kv fp16 kv int8 kv int4 fp16 kv int8 kv int4
ceval - naive_average 28.42 27.96 27.58 60.45 60.88 60.28 70.56 70.49 68.62
mmlu - naive_average 35.64 35.58 34.79 63.91 64 62.36 61.48 61.56 60.65
triviaqa 2121ce score 56.09 56.13 53.71 58.73 58.7 58.18 44.62 44.77 44.04
gsm8k 1d7fe4 accuracy 28.2 28.05 27.37 70.13 69.75 66.87 54.97 56.41 54.74
race-middle 9a54b6 accuracy 41.57 41.78 41.23 88.93 88.93 88.93 87.33 87.26 86.28
race-high 9a54b6 accuracy 39.65 39.77 40.77 85.33 85.31 84.62 82.53 82.59 82.02

The below table presents LMDeploy's inference performance with quantized KV.

model kv type test settings RPS v.s. kv fp16
llama2-chat-7b fp16 tp1 / ratio 0.8 / bs 256 / prompts 10000 14.98 1.0
- int8 tp1 / ratio 0.8 / bs 256 / prompts 10000 19.01 1.27
- int4 tp1 / ratio 0.8 / bs 256 / prompts 10000 20.81 1.39
llama2-chat-13b fp16 tp1 / ratio 0.9 / bs 128 / prompts 10000 8.55 1.0
- int8 tp1 / ratio 0.9 / bs 256 / prompts 10000 10.96 1.28
- int4 tp1 / ratio 0.9 / bs 256 / prompts 10000 11.91 1.39
internlm2-chat-7b fp16 tp1 / ratio 0.8 / bs 256 / prompts 10000 24.13 1.0
- int8 tp1 / ratio 0.8 / bs 256 / prompts 10000 25.28 1.05
- int4 tp1 / ratio 0.8 / bs 256 / prompts 10000 25.80 1.07

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

📚 Documentations

🌐 Other

New Contributors

Full Changelog: v0.3.0...v0.4.0

LMDeploy Release V0.3.0

03 Apr 01:55
4822fba
Compare
Choose a tag to compare

Highlight

  • Refactor attention and optimize GQA(#1258 #1307 #1116), achieving 22+ and 16+ RPS for internlm2-7b and internlm2-20b, about 1.8x faster than vLLM
  • Support new models, including Qwen1.5-MOE(#1372), DBRX(#1367), DeepSeek-VL(#1335)

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

📚 Documentations

🌐 Other

Full Changelog: v0.2.6...v0.3.0

LMDeploy Release V0.2.6

19 Mar 02:43
b69e717
Compare
Choose a tag to compare

Highlight

Support vision-languange models (VLM) inference pipeline and serving.
Currently, it supports the following models, Qwen-VL-Chat, LLaVA series v1.5, v1.6 and Yi-VL

  • VLM Inference Pipeline
from lmdeploy import pipeline
from lmdeploy.vl import load_image

pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b')

image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('describe this image', image))
print(response)

Please refer to the detailed guide from here

  • VLM serving by openai compatible server
lmdeploy server api_server liuhaotian/llava-v1.6-vicuna-7b --server-port 8000
  • VLM Serving by gradio
lmdeploy serve gradio liuhaotian/llava-v1.6-vicuna-7b --server-port 6006

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

📚 Documentations

🌐 Other

New Contributors

Full Changelog: v0.2.5...v0.2.6

LMDeploy Release V0.2.5

05 Mar 08:39
c5f4014
Compare
Choose a tag to compare

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

📚 Documentations

🌐 Other

New Contributors

Full Changelog: v0.2.4...v0.2.5