Releases · InternLM/lmdeploy
LMDeploy Release V0.2.4
What's Changed
💥 Improvements
- use stricter rules to get weight file by @irexyc in #1070
- check pytorch engine environment by @grimoire in #1107
- Update Dockerfile order to launch the http service by `docker run` directly by @AllentDan in #1162
- Support torch cache_max_entry_count by @grimoire in #1166
- Remove the manual model conversion during benchmark by @lvhan028 in #953
- update llama triton example by @zhyncs in #1153
🐞 Bug fixes
- fix embedding copy size by @irexyc in #1036
- fix pytorch engine with peft==0.8.2 by @grimoire in #1122
- support triton2.2 by @grimoire in #1137
- Add `top_k` in ChatCompletionRequest by @lvhan028 in #1174
- minor fix benchmark generation guide and script by @lvhan028 in #1175
📚 Documentations
🌐 Other
- Add eval ci by @RunningLeon in #1060
- Ete testcase add more models by @zhulinJulia24 in #1077
- Fix win ci by @irexyc in #1132
- bump version to v0.2.4 by @lvhan028 in #1171
Full Changelog: v0.2.3...v0.2.4
LMDeploy Release V0.2.3
What's Changed
🚀 Features
💥 Improvements
- Remove caching tokenizer.json by @grimoire in #1074
- Refactor `get_logger` to remove the dependency of MMLogger from mmengine by @yinfan98 in #1064
- Use TM_LOG_LEVEL environment variable first by @zhyncs in #1071
- Speed up the initialization of w8a8 model for torch engine by @yinfan98 in #1088
- Make logging.logger's behavior consistent with MMLogger by @irexyc in #1092
- Remove owned_session for torch engine by @grimoire in #1097
- Unify engine initialization in pipeline by @irexyc in #1085
- Add skip_special_tokens in GenerationConfig by @grimoire in #1091
- Use default stop words for turbomind backend in pipeline by @irexyc in #1119
- Add input_token_len to Response and update Response document by @AllentDan in #1115
🐞 Bug fixes
- Fix fast tokenizer swallows prefix space when there are too many white spaces by @AllentDan in #992
- Fix turbomind CUDA runtime error invalid argument by @zhyncs in #1100
- Add safety check for incremental decode by @AllentDan in #1094
- Fix device type of get_ppl for turbomind by @RunningLeon in #1093
- Fix pipeline init turbomind from workspace by @irexyc in #1126
- Add dependency version check and fix `ignore_eos` logic by @grimoire in #1099
- Change configuration_internlm.py to configuration_internlm2.py by @HIT-cwh in #1129
📚 Documentations
🌐 Other
New Contributors
Full Changelog: v0.2.2...v0.2.3
LMDeploy Release V0.2.2
Highlight
English version
- The allocation strategy for the k/v cache has changed. The parameter `cache_max_entry_count` now means the proportion of FREE GPU memory rather than TOTAL memory, and its default value is 0.8. This helps prevent OOM issues (see the configuration sketch after the highlights).
- The pipeline API supports streaming inference. You may give it a try!
```python
from lmdeploy import pipeline
pipe = pipeline('internlm/internlm2-chat-7b')
for item in pipe.stream_infer('hi, please intro yourself'):
    print(item)
```
- Add api key and ssl to `api_server`
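With a key configured, the server's OpenAI-style `/v1/chat/completions` route can be called from any HTTP client. A minimal sketch, assuming the key is passed as a standard Bearer token; the host, port, key, and model name below are placeholders rather than values from this release:

```python
import requests

# Placeholders: adjust to the address, API key, and model the server was launched with.
url = "http://0.0.0.0:23333/v1/chat/completions"
headers = {"Authorization": "Bearer YOUR_API_KEY"}
payload = {
    "model": "internlm2-chat-7b",
    "messages": [{"role": "user", "content": "hi, please intro yourself"}],
}

resp = requests.post(url, headers=headers, json=payload)
print(resp.json())
```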
Chinese version
- The TurboMind engine's GPU memory allocation strategy has changed. The default value of the k/v cache memory-ratio parameter `cache_max_entry_count` is now 0.8, and it denotes the proportion of FREE GPU memory rather than TOTAL GPU memory.
- The pipeline supports a streaming output interface. You can try the following code:
```python
from lmdeploy import pipeline
pipe = pipeline('internlm/internlm2-chat-7b')
for item in pipe.stream_infer('hi, please intro yourself'):
    print(item)
```
- api_server adds an api_key to its interface
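As mentioned in the first highlight, the new memory ratio can be lowered when the GPU is shared with other workloads. A minimal configuration sketch, assuming the `TurbomindEngineConfig` introduced in the v0.2 series is importable from the top-level `lmdeploy` package and that `pipeline` accepts a `backend_config` argument (names may differ slightly in this release):

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# cache_max_entry_count is now a fraction of FREE GPU memory (default 0.8).
# Lower it if other processes share the GPU or OOM errors persist.
backend_config = TurbomindEngineConfig(cache_max_entry_count=0.5)
pipe = pipeline('internlm/internlm2-chat-7b', backend_config=backend_config)
print(pipe('hi, please intro yourself'))
```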
What's Changed
🚀 Features
- add alignment tools by @grimoire in #1004
- support min_length for turbomind backend by @irexyc in #961
- Add stream mode function to pipeline by @AllentDan in #974
- [Feature] Add api key and ssl to http server by @AllentDan in #1048
💥 Improvements
- hide stop-words in output text by @grimoire in #991
- optimize sleep by @grimoire in #1034
- set example values to /v1/chat/completions in swagger UI by @AllentDan in #984
- Update adapters cli argument by @RunningLeon in #1039
- Fix turbomind end session bug. Add huggingface demo document by @AllentDan in #1017
- Support linking the custom built mpi by @lvhan028 in #1025
- sync mem size for tp by @lzhangzz in #1053
- Remove model name when loading hf model by @irexyc in #1022
- support internlm2-1_8b by @lvhan028 in #1073
- Update chat template for internlm2 base model by @lvhan028 in #1079
🐞 Bug fixes
- fix TorchEngine stuck when benchmarking with `tp>1` by @grimoire in #942
- fix module mapping error of baichuan model by @grimoire in #977
- fix import error for triton server by @RunningLeon in #985
- fix qwen-vl example by @irexyc in #996
- fix missing init file in modules by @RunningLeon in #1013
- fix tp mem usage by @grimoire in #987
- update indexes_containing_token function by @AllentDan in #1050
- fix flash kernel on sm 70 by @grimoire in #1027
- Fix baichuan2 lora by @grimoire in #1042
- Fix modelconfig in pytorch engine, support YI. by @grimoire in #1052
- Fix repetition penalty for long context by @irexyc in #1037
- [Fix] Support QLinear in rowwise_parallelize_linear_fn and colwise_parallelize_linear_fn by @HIT-cwh in #1072
📚 Documentations
- add docs for evaluation with opencompass by @RunningLeon in #995
- update docs for kvint8 by @RunningLeon in #1026
- [doc] Introduce project OpenAOE by @JiaYingLii in #1049
- update pipeline guide and FAQ about OOM by @lvhan028 in #1051
- docs update cache_max_entry_count for turbomind config by @zhyncs in #1067
🌐 Other
- update ut ci to new server node by @RunningLeon in #1024
- Ete testcase update by @zhulinJulia24 in #1023
- fix OOM in BlockManager by @zhyncs in #973
- fix use engine_config.tp when tp is None by @zhyncs in #1057
- Fix serve api by moving logger inside process for turbomind by @AllentDan in #1061
- bump version to v0.2.2 by @lvhan028 in #1076
New Contributors
- @zhyncs made their first contribution in #973
- @JiaYingLii made their first contribution in #1049
Full Changelog: v0.2.1...v0.2.2
LMDeploy Release V0.2.1
What's Changed
💥 Improvements
- [Fix] internlm2 chat format by @Harold-lkk in #1002
🐞 Bug fixes
- fix baichuan2 conversion by @AllentDan in #972
- [Fix] internlm messages2prompt by @Harold-lkk in #1003
📚 Documentations
🌐 Other
Full Changelog: v0.2.0...v0.2.1
LMDeploy Release V0.2.0
What's Changed
🚀 Features
- Support internlm2 by @lvhan028 in #963
- [Feature] Add params config for api server web_ui by @amulil in #735
- [Feature] Merge `lmdeploy lite calibrate` and `lmdeploy lite auto_awq` by @pppppM in #849
- Compute cross entropy loss given a list of input tokens by @lvhan028 in #830
- Support QoS in api_server by @sallyjunjun in #877
- Refactor torch inference engine by @lvhan028 in #871
- add image chat demo by @irexyc in #874
- check-in generation config by @lvhan028 in #902
- check-in ModelConfig by @AllentDan in #907
- pytorch engine config by @grimoire in #908
- Check-in turbomind engine config by @irexyc in #909
- S-LoRA support by @grimoire in #894
- add init in adapters by @grimoire in #923
- Refactor LLM inference pipeline API by @AllentDan in #916
- Refactor gradio and api_server by @AllentDan in #918
- Add request distributor server by @AllentDan in #903
- Upgrade lmdeploy cli by @RunningLeon in #922
💥 Improvements
- add top_k value for /v1/completions and update the documents by @AllentDan in #870
- export "num_tokens_per_iter", "max_prefill_iters" and etc when converting a model by @lvhan028 in #845
- Move `api_server` dependencies from serve.txt to runtime.txt by @lvhan028 in #879
- Refactor benchmark bash script by @lvhan028 in #884
- Add test case for function regression by @zhulinJulia24 in #844
- Update test triton CI by @RunningLeon in #893
- Update dockerfile by @RunningLeon in #891
- Perform fuzzy matching on chat template according to model path by @AllentDan in #839
- support accessing lmdeploy version by lmdeploy.version_info by @lvhan028 in #910
- Remove `flash-attn` dependency of lmdeploy lite module by @lvhan028 in #917
- Improve setup by removing pycuda dependency and adding cuda runtime and cublas to RPATH by @irexyc in #912
- remove unused settings in turbomind engine config by @irexyc in #921
- Cleanup fixed attributes in turbomind engine config by @irexyc in #928
- fix get_gpu_mem by @grimoire in #934
- remove instance_num argument by @AllentDan in #931
- Fix matching results of several chat templates like llama2, solar, yi and so on by @AllentDan in #925
- add pytorch random sampling by @grimoire in #930
- suppress turbomind chat warning by @irexyc in #937
- modify type hint of api to avoid import _turbomind by @AllentDan in #936
- accelerate pytorch benchmark by @grimoire in #946
- Remove `tp` from pipeline argument list by @lvhan028 in #947
- set gradio default value the same as chat.py by @AllentDan in #949
- print help for cli in case of failure by @RunningLeon in #955
- return dataclass for pipeline by @AllentDan in #952
- set random seed when it is None by @AllentDan in #958
- avoid run get_logger when import lmdeploy by @RunningLeon in #956
- support mlp s-lora by @grimoire in #957
- skip resume logic for pytorch backend by @AllentDan in #968
- Add ci for ut by @RunningLeon in #966
🐞 Bug fixes
- add tritonclient req by @RunningLeon in #872
- Fix uninitialized parameter by @lvhan028 in #875
- Fix overflow by @irexyc in #897
- Fix data offset by @AllentDan in #900
- Fix context decoding stuck issue when tp > 1 by @irexyc in #904
- [Fix] set scaling_factor 1 forcefully when sequence length is less than max_pos_emb by @lvhan028 in #911
- fix pytorch llama2 with new transformers by @grimoire in #914
- fix local variable 'output_ids' referenced before assignment by @irexyc in #919
- fix pipeline stop_words type error by @AllentDan in #929
- pass stop words to openai api by @AllentDan in #887
- fix profile generation multiprocessing error by @AllentDan in #933
- Fix missing init.py in modeling folder by @lvhan028 in #951
- fix cli with special arg names by @RunningLeon in #959
- fix logger in tokenizer by @RunningLeon in #960
📚 Documentations
- Improve user guide by @lvhan028 in #899
- Add user guide about pytorch engine by @grimoire in #915
- Update supported models and add quick start section in README by @lvhan028 in #926
- Fix scripts in benchmark doc by @panli889 in #941
- Update get_started and w4a16 tutorials by @lvhan028 in #945
- Add more docstring to api_server and proxy_server by @AllentDan in #965
- stable api_server benchmark result by a non-zero await by @AllentDan in #885
- fix pytorch backend can not properly stop by @AllentDan in #962
- [Fix] Fix `calibrate` bug when `transformers>4.36` by @pppppM in #967
🌐 Other
New Contributors
- @amulil made their first contribution in #735
- @zhulinJulia24 made their first contribution in #844
- @sallyjunjun made their first contribution in #877
- @panli889 made their first contribution in #941
Full Changelog: v0.1.0...v0.2.0
LMDeploy Release V0.1.0
What's Changed
🚀 Features
- Add extra_requires to reduce dependencies by @RunningLeon in #580
- TurboMind 2 by @lzhangzz in #590
- Support loading hf model directly by @irexyc in #685
- convert model with hf repo_id by @irexyc in #774
- Support turbomind bf16 by @grimoire in #803
- support image_embs input by @irexyc in #799
- Add api.py by @AllentDan in #805
💥 Improvements
- Fix Tokenizer encode by @AllentDan in #645
- Optimize for throughput by @lzhangzz in #701
- Replace mmengine with mmengine-lite by @zhouzaida in #715
- Set the default value of `max_context_token_num` to 1 by @lvhan028 in #761
- add triton server test and workflow yml by @RunningLeon in #760
- improvement(build): enable ninja and gold linker by @tpoisonooo in #767
- Report first-token-latency and token-latency percentiles by @lvhan028 in #736
- Unify prefill & decode passes by @lzhangzz in #775
- add cuda12.1 build check ci by @irexyc in #782
- auto upload cuda12.1 python pkg to release when create new tag by @irexyc in #784
- Report the inference benchmark of models with different size by @lvhan028 in #794
- Simplify block manager by @lzhangzz in #812
- Disable attention mask when it is not needed by @lzhangzz in #813
- FIFO pipe strategy for api_server by @AllentDan in #795
- simplify the header of the benchmark table by @lvhan028 in #820
- add encode for opencompass by @AllentDan in #828
- fix: awq should save bin files by @hscspring in #793
- Support building docker image manually in CI by @RunningLeon in #825
🐞 Bug fixes
- Fix init of batch state by @lzhangzz in #682
- fix turbomind stream canceling by @grimoire in #686
- [Fix] Fix load_checkpoint_in_model bug by @HIT-cwh in #690
- Fix wrong eos_id and bos_id obtained through grpc api by @lvhan028 in #644
- Fix cache/output length calculation by @lzhangzz in #738
- [Fix] Skip empty batch by @lzhangzz in #747
- [Fix] build docker image failed since `packaging` is missing by @lvhan028 in #753
- [Fix] Rollback the data type of `input_ids` to `TYPE_UINT32` in preprocessor's proto by @lvhan028 in #758
- fix turbomind build on sm<80 by @grimoire in #754
- Fix early-exit condition in attention kernel by @lzhangzz in #788
- Fix missed arguments when benchmark static inference performance by @lvhan028 in #787
- fix extra colon in InternLMChat7B template by @C1rN09 in #796
- Fix local kv head num by @lvhan028 in #806
- Fix out-of-bound access by @lzhangzz in #809
- Set smem size for repetition penalty kernel by @lzhangzz in #818
- Fix cache verification by @lzhangzz in #821
- fix finish_reason by @AllentDan in #816
- fix turbomind awq by @grimoire in #847
- Fix stop requests by await before turbomind queue.get() by @AllentDan in #850
- [Fix] Fix meta tensor error by @pppppM in #848
- Fix cuda reinitialization in a multiprocessing setting by @grimoire in #862
- launch gradio server directly with hf model by @AllentDan in #856
- fix typo by @grimoire in #769
- Add chat template for Yi by @AllentDan in #779
- fix api_server stop_session and end_session by @AllentDan in #835
- Return the iterator after erasing it from a map by @irexyc in #864
📚 Documentations
- [Docs] Update Supported Matrix by @pppppM in #679
- [Docs] Update KV8 Docs by @pppppM in #681
- [Doc] Update restful api doc by @AllentDan in #662
- Check-in user guide about turbomind config by @lvhan028 in #680
- Update benchmark user guide by @lvhan028 in #763
- [Docs] Fix typo in `restful_api` user guide by @maxchiron in #858
- [Docs] Fix typo in `restful_api` user guide by @maxchiron in #859
🌐 Other
- bump version to v0.1.0a0 by @lvhan028 in #709
- bump version to 0.1.0a1 by @lvhan028 in #776
- bump version to v0.1.0a2 by @lvhan028 in #807
- bump version to v0.1.0 by @lvhan028 in #834
New Contributors
- @zhouzaida made their first contribution in #715
- @C1rN09 made their first contribution in #796
- @maxchiron made their first contribution in #858
Full Changelog: v0.0.14...v0.1.0
LMDeploy Release V0.1.0a2
What's Changed
💥 Improvements
- Unify prefill & decode passes by @lzhangzz in #775
- add cuda12.1 build check ci by @irexyc in #782
- auto upload cuda12.1 python pkg to release when create new tag by @irexyc in #784
- Report the inference benchmark of models with different size by @lvhan028 in #794
- Add chat template for Yi by @AllentDan in #779
🐞 Bug fixes
- Fix early-exit condition in attention kernel by @lzhangzz in #788
- Fix missed arguments when benchmark static inference performance by @lvhan028 in #787
- fix extra colon in InternLMChat7B template by @C1rN09 in #796
- Fix local kv head num by @lvhan028 in #806
📚 Documentations
🌐 Other
New Contributors
Full Changelog: v0.1.0a1...v0.1.0a2
LMDeploy Release V0.1.0a1
What's Changed
💥 Improvements
- Set the default value of `max_context_token_num` to 1 by @lvhan028 in #761
- add triton server test and workflow yml by @RunningLeon in #760
- improvement(build): enable ninja and gold linker by @tpoisonooo in #767
- Report first-token-latency and token-latency percentiles by @lvhan028 in #736
- convert model with hf repo_id by @irexyc in #774
🐞 Bug fixes
- [Fix] build docker image failed since `packaging` is missing by @lvhan028 in #753
- [Fix] Rollback the data type of `input_ids` to `TYPE_UINT32` in preprocessor's proto by @lvhan028 in #758
- fix turbomind build on sm<80 by @grimoire in #754
- fix typo by @grimoire in #769
🌐 Other
Full Changelog: v0.1.0a0...v0.1.0a1
LMDeploy Release V0.1.0a0
What's Changed
🚀 Features
- Add extra_requires to reduce dependencies by @RunningLeon in #580
- TurboMind 2 by @lzhangzz in #590
- Support loading hf model directly by @irexyc in #685
💥 Improvements
- Fix Tokenizer encode by @AllentDan in #645
- Optimize for throughput by @lzhangzz in #701
- Replace mmengine with mmengine-lite by @zhouzaida in #715
🐞 Bug fixes
- Fix init of batch state by @lzhangzz in #682
- fix turbomind stream canceling by @grimoire in #686
- [Fix] Fix load_checkpoint_in_model bug by @HIT-cwh in #690
- Fix wrong eos_id and bos_id obtained through grpc api by @lvhan028 in #644
- Fix cache/output length calculation by @lzhangzz in #738
- [Fix] Skip empty batch by @lzhangzz in #747
📚 Documentations
- [Docs] Update Supported Matrix by @pppppM in #679
- [Docs] Update KV8 Docs by @pppppM in #681
- [Doc] Update restful api doc by @AllentDan in #662
- Check-in user guide about turbomind config by @lvhan028 in #680
🌐 Other
New Contributors
- @zhouzaida made their first contribution in #715
Full Changelog: v0.0.14...v0.1.0a0
LMDeploy Release V0.0.14
What's Changed
💥 Improvements
- Improve api_server and webui usage by @AllentDan in #544
- fix: gradio gr.Button.update deprecated after 4.0.0 by @hscspring in #637
- add cli to list the supported model names by @RunningLeon in #639
- Refactor model conversion by @irexyc in #296
- [Enhance] internlm message to prompt by @Harold-lkk in #499
- update turbomind session_len with model.session_len by @AllentDan in #634
- Manage session id using random int for gradio local mode by @aisensiy in #553
- Add UltraCM and WizardLM chat templates by @AllentDan in #599
- Add check env sub command by @RunningLeon in #654
🐞 Bug fixes
- [Fix] Qwen's quantization results are abnormal & Baichuan cannot be quantized by @pppppM in #605
- FIX: fix stop_session func bug by @yunzhongyan0 in #578
- fix benchmark serving computation mistake by @AllentDan in #630
- fix Tokenizer load error when the path of the being-converted model is not writable by @irexyc in #669
- fix tokenizer_info when convert the model by @irexyc in #661
🌐 Other
New Contributors
- @hscspring made their first contribution in #637
- @yunzhongyan0 made their first contribution in #578
Full Changelog: v0.0.13...v0.0.14