Releases · InternLM/lmdeploy
LMDeploy Release V0.2.4
What's Changed
💥 Improvements
- use stricter rules to get weight file by @irexyc in #1070
- check pytorch engine environment by @grimoire in #1107
- Update Dockerfile order to launch the http service by `docker run` directly by @AllentDan in #1162
- Support torch cache_max_entry_count by @grimoire in #1166
- Remove the manual model conversion during benchmark by @lvhan028 in #953
- update llama triton example by @zhyncs in #1153
🐞 Bug fixes
- fix embedding copy size by @irexyc in #1036
- fix pytorch engine with peft==0.8.2 by @grimoire in #1122
- support triton2.2 by @grimoire in #1137
- Add `top_k` in ChatCompletionRequest by @lvhan028 in #1174
- minor fix benchmark generation guide and script by @lvhan028 in #1175
📚 Documentations
🌐 Other
- Add eval ci by @RunningLeon in #1060
- Ete testcase add more models by @zhulinJulia24 in #1077
- Fix win ci by @irexyc in #1132
- bump version to v0.2.4 by @lvhan028 in #1171
Full Changelog: v0.2.3...v0.2.4
LMDeploy Release V0.2.3
What's Changed
🚀 Features
💥 Improvements
- Remove caching tokenizer.json by @grimoire in #1074
- Refactor `get_logger` to remove the dependency of MMLogger from mmengine by @yinfan98 in #1064
- Use TM_LOG_LEVEL environment variable first by @zhyncs in #1071
- Speed up the initialization of w8a8 model for torch engine by @yinfan98 in #1088
- Make logging.logger's behavior consistent with MMLogger by @irexyc in #1092
- Remove owned_session for torch engine by @grimoire in #1097
- Unify engine initialization in pipeline by @irexyc in #1085
- Add skip_special_tokens in GenerationConfig by @grimoire in #1091
- Use default stop words for turbomind backend in pipeline by @irexyc in #1119
- Add input_token_len to Response and update Response document by @AllentDan in #1115
🐞 Bug fixes
- Fix fast tokenizer swallows prefix space when there are too many white spaces by @AllentDan in #992
- Fix turbomind CUDA runtime error invalid argument by @zhyncs in #1100
- Add safety check for incremental decode by @AllentDan in #1094
- Fix device type of get_ppl for turbomind by @RunningLeon in #1093
- Fix pipeline init turbomind from workspace by @irexyc in #1126
- Add dependency version check and fix `ignore_eos` logic by @grimoire in #1099
- Change configuration_internlm.py to configuration_internlm2.py by @HIT-cwh in #1129
📚 Documentations
🌐 Other
New Contributors
Full Changelog: v0.2.2...v0.2.3
LMDeploy Release V0.2.2
Highlight
English version
- The allocation strategy for the k/v cache has changed. The parameter `cache_max_entry_count` now means the proportion of FREE GPU memory rather than TOTAL memory, and its default value is 0.8. This helps prevent OOM issues (see the configuration sketch after the highlights).
- The pipeline API supports streaming inference. You may give it a try!
```python
from lmdeploy import pipeline
pipe = pipeline('internlm/internlm2-chat-7b')
for item in pipe.stream_infer('hi, please intro yourself'):
    print(item)
```
- Add api key and ssl to `api_server`
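With a key configured, the server's OpenAI-style `/v1/chat/completions` route can be called from any HTTP client. A minimal sketch, assuming the key is passed as a standard Bearer token; the host, port, key, and model name below are placeholders rather than values from this release:

```python
import requests

# Placeholders: adjust to the address, API key, and model the server was launched with.
url = "http://0.0.0.0:23333/v1/chat/completions"
headers = {"Authorization": "Bearer YOUR_API_KEY"}
payload = {
    "model": "internlm2-chat-7b",
    "messages": [{"role": "user", "content": "hi, please intro yourself"}],
}

resp = requests.post(url, headers=headers, json=payload)
print(resp.json())
```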
Chinese version
- The TurboMind engine's GPU memory allocation strategy has changed. The default value of the k/v cache memory-ratio parameter `cache_max_entry_count` is now 0.8, and it denotes the proportion of FREE GPU memory rather than TOTAL GPU memory.
- The pipeline supports a streaming output interface. You can try the following code:
```python
from lmdeploy import pipeline
pipe = pipeline('internlm/internlm2-chat-7b')
for item in pipe.stream_infer('hi, please intro yourself'):
    print(item)
```
- api_server adds an api_key to its interface
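As mentioned in the first highlight, the new memory ratio can be lowered when the GPU is shared with other workloads. A minimal configuration sketch, assuming the `TurbomindEngineConfig` introduced in the v0.2 series is importable from the top-level `lmdeploy` package and that `pipeline` accepts a `backend_config` argument (names may differ slightly in this release):

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# cache_max_entry_count is now a fraction of FREE GPU memory (default 0.8).
# Lower it if other processes share the GPU or OOM errors persist.
backend_config = TurbomindEngineConfig(cache_max_entry_count=0.5)
pipe = pipeline('internlm/internlm2-chat-7b', backend_config=backend_config)
print(pipe('hi, please intro yourself'))
```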
What's Changed
🚀 Features
- add alignment tools by @grimoire in #1004
- support min_length for turbomind backend by @irexyc in #961
- Add stream mode function to pipeline by @AllentDan in #974
- [Feature] Add api key and ssl to http server by @AllentDan in #1048
💥 Improvements
- hide stop-words in output text by @grimoire in #991
- optimize sleep by @grimoire in #1034
- set example values to /v1/chat/completions in swagger UI by @AllentDan in #984
- Update adapters cli argument by @RunningLeon in #1039
- Fix turbomind end session bug. Add huggingface demo document by @AllentDan in #1017
- Support linking the custom built mpi by @lvhan028 in #1025
- sync mem size for tp by @lzhangzz in #1053
- Remove model name when loading hf model by @irexyc in #1022
- support internlm2-1_8b by @lvhan028 in #1073
- Update chat template for internlm2 base model by @lvhan028 in #1079
🐞 Bug fixes
- fix TorchEngine stuck when benchmarking with `tp>1` by @grimoire in #942
- fix module mapping error of baichuan model by @grimoire in #977
- fix import error for triton server by @RunningLeon in #985
- fix qwen-vl example by @irexyc in #996
- fix missing init file in modules by @RunningLeon in #1013
- fix tp mem usage by @grimoire in #987
- update indexes_containing_token function by @AllentDan in #1050
- fix flash kernel on sm 70 by @grimoire in #1027
- Fix baichuan2 lora by @grimoire in #1042
- Fix modelconfig in pytorch engine, support YI. by @grimoire in #1052
- Fix repetition penalty for long context by @irexyc in #1037
- [Fix] Support QLinear in rowwise_parallelize_linear_fn and colwise_parallelize_linear_fn by @HIT-cwh in #1072
📚 Documentations
- add docs for evaluation with opencompass by @RunningLeon in #995
- update docs for kvint8 by @RunningLeon in #1026
- [doc] Introduce project OpenAOE by @JiaYingLii in #1049
- update pipeline guide and FAQ about OOM by @lvhan028 in #1051
- docs update cache_max_entry_count for turbomind config by @zhyncs in #1067
🌐 Other
- update ut ci to new server node by @RunningLeon in #1024
- Ete testcase update by @zhulinJulia24 in #1023
- fix OOM in BlockManager by @zhyncs in #973
- fix use engine_config.tp when tp is None by @zhyncs in #1057
- Fix serve api by moving logger inside process for turbomind by @AllentDan in #1061
- bump version to v0.2.2 by @lvhan028 in #1076
New Contributors
- @zhyncs made their first contribution in #973
- @JiaYingLii made their first contribution in #1049
Full Changelog: v0.2.1...v0.2.2
LMDeploy Release V0.2.1
What's Changed
💥 Improvements
- [Fix] internlm2 chat format by @Harold-lkk in #1002
🐞 Bug fixes
- fix baichuan2 conversion by @AllentDan in #972
- [Fix] internlm messages2prompt by @Harold-lkk in #1003
📚 Documentations
🌐 Other
Full Changelog: v0.2.0...v0.2.1
LMDeploy Release V0.2.0
What's Changed
🚀 Features
- Support internlm2 by @lvhan028 in #963
- [Feature] Add params config for api server web_ui by @amulil in #735
- [Feature] Merge `lmdeploy lite calibrate` and `lmdeploy lite auto_awq` by @pppppM in #849
- Compute cross entropy loss given a list of input tokens by @lvhan028 in #830
- Support QoS in api_server by @sallyjunjun in #877
- Refactor torch inference engine by @lvhan028 in #871
- add image chat demo by @irexyc in #874
- check-in generation config by @lvhan028 in #902
- check-in ModelConfig by @AllentDan in #907
- pytorch engine config by @grimoire in #908
- Check-in turbomind engine config by @irexyc in #909
- S-LoRA support by @grimoire in #894
- add init in adapters by @grimoire in #923
- Refactor LLM inference pipeline API by @AllentDan in #916
- Refactor gradio and api_server by @AllentDan in #918
- Add request distributor server by @AllentDan in #903
- Upgrade lmdeploy cli by @RunningLeon in #922
💥 Improvements
- add top_k value for /v1/completions and update the documents by @AllentDan in #870
- export "num_tokens_per_iter", "max_prefill_iters" and etc when converting a model by @lvhan028 in #845
- Move `api_server` dependencies from serve.txt to runtime.txt by @lvhan028 in #879
- Refactor benchmark bash script by @lvhan028 in #884
- Add test case for function regression by @zhulinJulia24 in #844
- Update test triton CI by @RunningLeon in #893
- Update dockerfile by @RunningLeon in #891
- Perform fuzzy matching on chat template according to model path by @AllentDan in #839
- support accessing lmdeploy version by lmdeploy.version_info by @lvhan028 in #910
- Remove `flash-attn` dependency of lmdeploy lite module by @lvhan028 in #917
- Improve setup by removing pycuda dependency and adding cuda runtime and cublas to RPATH by @irexyc in #912
- remove unused settings in turbomind engine config by @irexyc in #921
- Cleanup fixed attributes in turbomind engine config by @irexyc in #928
- fix get_gpu_mem by @grimoire in #934
- remove instance_num argument by @AllentDan in #931
- Fix matching results of several chat templates like llama2, solar, yi and so on by @AllentDan in #925
- add pytorch random sampling by @grimoire in #930
- suppress turbomind chat warning by @irexyc in #937
- modify type hint of api to avoid import _turbomind by @AllentDan in #936
- accelerate pytorch benchmark by @grimoire in #946
- Remove `tp` from pipeline argument list by @lvhan028 in #947
- set gradio default value the same as chat.py by @AllentDan in #949
- print help for cli in case of failure by @RunningLeon in #955
- return dataclass for pipeline by @AllentDan in #952
- set random seed when it is None by @AllentDan in #958
- avoid run get_logger when import lmdeploy by @RunningLeon in #956
- support mlp s-lora by @grimoire in #957
- skip resume logic for pytorch backend by @AllentDan in #968
- Add ci for ut by @RunningLeon in #966
🐞 Bug fixes
- add tritonclient req by @RunningLeon in #872
- Fix uninitialized parameter by @lvhan028 in #875
- Fix overflow by @irexyc in #897
- Fix data offset by @AllentDan in #900
- Fix context decoding stuck issue when tp > 1 by @irexyc in #904
- [Fix] set scaling_factor 1 forcefully when sequence length is less than max_pos_emb by @lvhan028 in #911
- fix pytorch llama2 with new transformers by @grimoire in #914
- fix local variable 'output_ids' referenced before assignment by @irexyc in #919
- fix pipeline stop_words type error by @AllentDan in #929
- pass stop words to openai api by @AllentDan in #887
- fix profile generation multiprocessing error by @AllentDan in #933
- Fix missing init.py in modeling folder by @lvhan028 in #951
- fix cli with special arg names by @RunningLeon in #959
- fix logger in tokenizer by @RunningLeon in #960
📚 Documentations
- Improve user guide by @lvhan028 in #899
- Add user guide about pytorch engine by @grimoire in #915
- Update supported models and add quick start section in README by @lvhan028 in #926
- Fix scripts in benchmark doc by @panli889 in #941
- Update get_started and w4a16 tutorials by @lvhan028 in #945
- Add more docstring to api_server and proxy_server by @AllentDan in #965
- stable api_server benchmark result by a non-zero await by @AllentDan in #885
- fix pytorch backend can not properly stop by @AllentDan in #962
- [Fix] Fix `calibrate` bug when `transformers>4.36` by @pppppM in #967
🌐 Other
New Contributors
- @amulil made their first contribution in #735
- @zhulinJulia24 made their first contribution in #844
- @sallyjunjun made their first contribution in #877
- @panli889 made their first contribution in #941
Full Changelog: v0.1.0...v0.2.0
LMDeploy Release V0.1.0
What's Changed
🚀 Features
- Add extra_requires to reduce dependencies by @RunningLeon in #580
- TurboMind 2 by @lzhangzz in #590
- Support loading hf model directly by @irexyc in #685
- convert model with hf repo_id by @irexyc in #774
- Support turbomind bf16 by @grimoire in #803
- support image_embs input by @irexyc in #799
- Add api.py by @AllentDan in #805
💥 Improvements
- Fix Tokenizer encode by @AllentDan in #645
- Optimize for throughput by @lzhangzz in #701
- Replace mmengine with mmengine-lite by @zhouzaida in #715
- Set the default value of `max_context_token_num` to 1 by @lvhan028 in #761
- add triton server test and workflow yml by @RunningLeon in #760
- improvement(build): enable ninja and gold linker by @tpoisonooo in #767
- Report first-token-latency and token-latency percentiles by @lvhan028 in #736
- Unify prefill & decode passes by @lzhangzz in #775
- add cuda12.1 build check ci by @irexyc in #782
- auto upload cuda12.1 python pkg to release when create new tag by @irexyc in #784
- Report the inference benchmark of models with different size by @lvhan028 in #794
- Simplify block manager by @lzhangzz in #812
- Disable attention mask when it is not needed by @lzhangzz in #813
- FIFO pipe strategy for api_server by @AllentDan in #795
- simplify the header of the benchmark table by @lvhan028 in #820
- add encode for opencompass by @AllentDan in #828
- fix: awq should save bin files by @hscspring in #793
- Support building docker image manually in CI by @RunningLeon in #825
🐞 Bug fixes
- Fix init of batch state by @lzhangzz in #682
- fix turbomind stream canceling by @grimoire in #686
- [Fix] Fix load_checkpoint_in_model bug by @HIT-cwh in #690
- Fix wrong eos_id and bos_id obtained through grpc api by @lvhan028 in #644
- Fix cache/output length calculation by @lzhangzz in #738
- [Fix] Skip empty batch by @lzhangzz in #747
- [Fix] build docker image failed since `packaging` is missing by @lvhan028 in #753
- [Fix] Rollback the data type of `input_ids` to `TYPE_UINT32` in preprocessor's proto by @lvhan028 in #758
- fix turbomind build on sm<80 by @grimoire in #754
- Fix early-exit condition in attention kernel by @lzhangzz in #788
- Fix missed arguments when benchmark static inference performance by @lvhan028 in #787
- fix extra colon in InternLMChat7B template by @C1rN09 in #796
- Fix local kv head num by @lvhan028 in #806
- Fix out-of-bound access by @lzhangzz in #809
- Set smem size for repetition penalty kernel by @lzhangzz in #818
- Fix cache verification by @lzhangzz in #821
- fix finish_reason by @AllentDan in #816
- fix turbomind awq by @grimoire in #847
- Fix stop requests by await before turbomind queue.get() by @AllentDan in #850
- [Fix] Fix meta tensor error by @pppppM in #848
- Fix cuda reinitialization in a multiprocessing setting by @grimoire in #862
- launch gradio server directly with hf model by @AllentDan in #856
- fix typo by @grimoire in #769
- Add chat template for Yi by @AllentDan in #779
- fix api_server stop_session and end_session by @AllentDan in #835
- Return the iterator after erasing it from a map by @irexyc in #864
📚 Documentations
- [Docs] Update Supported Matrix by @pppppM in #679
- [Docs] Update KV8 Docs by @pppppM in #681
- [Doc] Update restful api doc by @AllentDan in #662
- Check-in user guide about turbomind config by @lvhan028 in #680
- Update benchmark user guide by @lvhan028 in #763
- [Docs] Fix typo in `restful_api` user guide by @maxchiron in #858
- [Docs] Fix typo in `restful_api` user guide by @maxchiron in #859
🌐 Other
- bump version to v0.1.0a0 by @lvhan028 in #709
- bump version to 0.1.0a1 by @lvhan028 in #776
- bump version to v0.1.0a2 by @lvhan028 in #807
- bump version to v0.1.0 by @lvhan028 in #834
New Contributors
- @zhouzaida made their first contribution in #715
- @C1rN09 made their first contribution in #796
- @maxchiron made their first contribution in #858
Full Changelog: v0.0.14...v0.1.0
LMDeploy Release V0.1.0a2
What's Changed
💥 Improvements
- Unify prefill & decode passes by @lzhangzz in #775
- add cuda12.1 build check ci by @irexyc in #782
- auto upload cuda12.1 python pkg to release when create new tag by @irexyc in #784
- Report the inference benchmark of models with different size by @lvhan028 in #794
- Add chat template for Yi by @AllentDan in #779
🐞 Bug fixes
- Fix early-exit condition in attention kernel by @lzhangzz in #788
- Fix missed arguments when benchmark static inference performance by @lvhan028 in #787
- fix extra colon in InternLMChat7B template by @C1rN09 in #796
- Fix local kv head num by @lvhan028 in #806
📚 Documentations
🌐 Other
New Contributors
Full Changelog: v0.1.0a1...v0.1.0a2
LMDeploy Release V0.1.0a1
What's Changed
💥 Improvements
- Set the default value of `max_context_token_num` to 1 by @lvhan028 in #761
- add triton server test and workflow yml by @RunningLeon in #760
- improvement(build): enable ninja and gold linker by @tpoisonooo in #767
- Report first-token-latency and token-latency percentiles by @lvhan028 in #736
- convert model with hf repo_id by @irexyc in #774
🐞 Bug fixes
- [Fix] build docker image failed since `packaging` is missing by @lvhan028 in #753
- [Fix] Rollback the data type of `input_ids` to `TYPE_UINT32` in preprocessor's proto by @lvhan028 in #758
- fix turbomind build on sm<80 by @grimoire in #754
- fix typo by @grimoire in #769
🌐 Other
Full Changelog: v0.1.0a0...v0.1.0a1
LMDeploy Release V0.1.0a0
What's Changed
🚀 Features
- Add extra_requires to reduce dependencies by @RunningLeon in #580
- TurboMind 2 by @lzhangzz in #590
- Support loading hf model directly by @irexyc in #685
💥 Improvements
- Fix Tokenizer encode by @AllentDan in #645
- Optimize for throughput by @lzhangzz in #701
- Replace mmengine with mmengine-lite by @zhouzaida in #715
🐞 Bug fixes
- Fix init of batch state by @lzhangzz in #682
- fix turbomind stream canceling by @grimoire in #686
- [Fix] Fix load_checkpoint_in_model bug by @HIT-cwh in #690
- Fix wrong eos_id and bos_id obtained through grpc api by @lvhan028 in #644
- Fix cache/output length calculation by @lzhangzz in #738
- [Fix] Skip empty batch by @lzhangzz in #747
📚 Documentations
- [Docs] Update Supported Matrix by @pppppM in #679
- [Docs] Update KV8 Docs by @pppppM in #681
- [Doc] Update restful api doc by @AllentDan in #662
- Check-in user guide about turbomind config by @lvhan028 in #680
🌐 Other
New Contributors
- @zhouzaida made their first contribution in #715
Full Changelog: v0.0.14...v0.1.0a0
LMDeploy Release V0.0.14
What's Changed
💥 Improvements
- Improve api_server and webui usage by @AllentDan in #544
- fix: gradio gr.Button.update deprecated after 4.0.0 by @hscspring in #637
- add cli to list the supported model names by @RunningLeon in #639
- Refactor model conversion by @irexyc in #296
- [Enhance] internlm message to prompt by @Harold-lkk in #499
- update turbomind session_len with model.session_len by @AllentDan in #634
- Manage session id using random int for gradio local mode by @aisensiy in #553
- Add UltraCM and WizardLM chat templates by @AllentDan in #599
- Add check env sub command by @RunningLeon in #654
🐞 Bug fixes
- [Fix] Qwen's quantization results are abnormal & Baichuan cannot be quantized by @pppppM in #605
- FIX: fix stop_session func bug by @yunzhongyan0 in #578
- fix benchmark serving computation mistake by @AllentDan in #630
- fix Tokenizer load error when the path of the being-converted model is not writable by @irexyc in #669
- fix tokenizer_info when convert the model by @irexyc in #661
🌐 Other
New Contributors
- @hscspring made their first contribution in #637
- @yunzhongyan0 made their first contribution in #578
Full Changelog: v0.0.13...v0.0.14