LMDeploy Release V0.1.0
What's Changed
🚀 Features
- Add extra_requires to reduce dependencies by @RunningLeon in #580
- TurboMind 2 by @lzhangzz in #590
- Support loading hf model directly by @irexyc in #685 (see the sketch after this list)
- Convert model with hf repo_id by @irexyc in #774
- Support turbomind bf16 by @grimoire in #803
- Support image_embs input by @irexyc in #799
- Add api.py by @AllentDan in #805
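
Several of the features above change how models are loaded: a hf model can now be served or converted straight from its repo_id (#685, #774), and api.py (#805) adds a high-level entry point. Below is a minimal sketch of how these pieces fit together; the `pipeline` name and call style are assumptions based on how later LMDeploy releases document this API, and the repo_id is only illustrative.

```python
# Minimal sketch: loading a hf model directly, with no prior `lmdeploy convert` step.
# ASSUMPTIONS: the `pipeline` entry point and its call style are taken from
# later LMDeploy documentation; the repo_id below is illustrative.
from lmdeploy import pipeline

# Passing a hf repo_id downloads and loads the model directly (#685, #774).
pipe = pipeline('internlm/internlm-chat-7b')

# Batch of prompts in, list of responses out.
responses = pipe(['Hi, please introduce yourself'])
print(responses)
```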
💥 Improvements
- Fix Tokenizer encode by @AllentDan in #645
- Optimize for throughput by @lzhangzz in #701
- Replace mmengine with mmengine-lite by @zhouzaida in #715
- Set the default value of `max_context_token_num` to 1 by @lvhan028 in #761
- Add triton server test and workflow yml by @RunningLeon in #760
- Improve the build: enable ninja and the gold linker by @tpoisonooo in #767
- Report first-token-latency and token-latency percentiles by @lvhan028 in #736
- Unify prefill & decode passes by @lzhangzz in #775
- Add CUDA 12.1 build check CI by @irexyc in #782
- Automatically upload the CUDA 12.1 Python package to the release when a new tag is created by @irexyc in #784
- Report the inference benchmark of models of different sizes by @lvhan028 in #794
- Simplify block manager by @lzhangzz in #812
- Disable attention mask when it is not needed by @lzhangzz in #813
- FIFO pipe strategy for api_server by @AllentDan in #795
- Simplify the header of the benchmark table by @lvhan028 in #820
- Add encode for OpenCompass by @AllentDan in #828
- Fix: AWQ should save bin files by @hscspring in #793
- Support building docker image manually in CI by @RunningLeon in #825
🐞 Bug fixes
- Fix init of batch state by @lzhangzz in #682
- Fix TurboMind stream canceling by @grimoire in #686
- [Fix] Fix load_checkpoint_in_model bug by @HIT-cwh in #690
- Fix wrong eos_id and bos_id obtained through grpc api by @lvhan028 in #644
- Fix cache/output length calculation by @lzhangzz in #738
- [Fix] Skip empty batch by @lzhangzz in #747
- [Fix] Build docker image failed since `packaging` is missing by @lvhan028 in #753
- [Fix] Rollback the data type of `input_ids` to `TYPE_UINT32` in preprocessor's proto by @lvhan028 in #758
- Fix TurboMind build on sm<80 by @grimoire in #754
- Fix early-exit condition in attention kernel by @lzhangzz in #788
- Fix missed arguments when benchmark static inference performance by @lvhan028 in #787
- Fix extra colon in InternLMChat7B template by @C1rN09 in #796
- Fix local kv head num by @lvhan028 in #806
- Fix out-of-bound access by @lzhangzz in #809
- Set smem size for repetition penalty kernel by @lzhangzz in #818
- Fix cache verification by @lzhangzz in #821
- Fix finish_reason by @AllentDan in #816
- Fix TurboMind AWQ by @grimoire in #847
- Fix stop requests by awaiting before TurboMind's queue.get() by @AllentDan in #850
- [Fix] Fix meta tensor error by @pppppM in #848
- Fix cuda reinitialization in a multiprocessing setting by @grimoire in #862
- Launch Gradio server directly with a hf model by @AllentDan in #856
- Fix typo by @grimoire in #769
- Add chat template for Yi by @AllentDan in #779
- Fix api_server stop_session and end_session by @AllentDan in #835
- Return the iterator after erasing it from a map by @irexyc in #864
📚 Documentations
- [Docs] Update Supported Matrix by @pppppM in #679
- [Docs] Update KV8 Docs by @pppppM in #681
- [Doc] Update restful api doc by @AllentDan in #662
- Check-in user guide about turbomind config by @lvhan028 in #680
- Update benchmark user guide by @lvhan028 in #763
- [Docs] Fix typo in `restful_api` user guide by @maxchiron in #858
- [Docs] Fix typo in `restful_api` user guide by @maxchiron in #859
🌐 Other
- Bump version to v0.1.0a0 by @lvhan028 in #709
- Bump version to v0.1.0a1 by @lvhan028 in #776
- Bump version to v0.1.0a2 by @lvhan028 in #807
- Bump version to v0.1.0 by @lvhan028 in #834
New Contributors
- @zhouzaida made their first contribution in #715
- @C1rN09 made their first contribution in #796
- @maxchiron made their first contribution in #858
Full Changelog: v0.0.14...v0.1.0