What's Changed
🚀 Features
- Support MoE w8a8 in PyTorch engine by @grimoire in #2894
- Support DeepSeek-V3 FP8 by @grimoire in #2967
- Support new Cambricon backend by @JackWeiw in #3002
- Support MoE FP8 by @RunningLeon in #3007
- Add internlm3-dense (TurboMind) & chat template by @irexyc in #3024
- Support internlm3 on the PyTorch engine by @RunningLeon in #3026
- Support internlm3 quantization by @AllentDan in #3027
💥 Improvements
- Optimize AWQ kernel in PyTorch engine by @grimoire in #2965
- Support FP8 w8a8 for the PyTorch backend by @RunningLeon in #2959
- Optimize LoRA kernel by @grimoire in #2975
- Remove threadsafe mode by @grimoire in #2907
- Refactor async engine & TurboMind IO by @lzhangzz in #2968
- [dlinfer] Refine RoPE by @JackWeiw in #2984
- Expose spaces_between_special_tokens by @AllentDan in #2991 (see the sketch after this list)
- [dlinfer] Change the LLM op interface of paged_prefill_attention by @JackWeiw in #2977
- Update request logger by @lvhan028 in #2981
- Remove decoding by @grimoire in #3016
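
Regarding #2991: a minimal sketch of how the newly exposed option might be used, assuming it lands on lmdeploy's `GenerationConfig` (the model path below is an illustrative placeholder):

```python
# Minimal sketch, assuming #2991 exposes spaces_between_special_tokens
# on GenerationConfig; the model path is an illustrative placeholder.
from lmdeploy import GenerationConfig, pipeline

pipe = pipeline('internlm/internlm3-8b-instruct')
gen_config = GenerationConfig(
    max_new_tokens=128,
    skip_special_tokens=False,            # keep special tokens in the output
    spaces_between_special_tokens=False,  # don't insert spaces between them
)
print(pipe(['Hello'], gen_config=gen_config)[0].text)
```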
🐞 Bug fixes
- Fix build crash in nvcr.io/nvidia/pytorch:24.06-py3 image by @zgjja in #2964
- Add tool role in BaseChatTemplate as the tool response in messages by @AllentDan in #2979 (see the sketch after this list)
- Fix Ascend Dockerfile by @jinminxi104 in #2989
- Fix InternVL2 QK norm by @grimoire in #2987
- Fix XComposer2 when transformers is upgraded past 4.46 by @irexyc in #3001
- Fix get_ppl & get_logits by @lvhan028 in #3008
- Fix typo in w4a16 guide by @Yan-Xiangjun in #3018
- Fix blocked FP8 MoE kernel by @grimoire in #3009
- Fix async engine by @lzhangzz in #3029
- [hotfix] Fix get_ppl by @lvhan028 in #3023
- Fix MoE gating for DeepSeek V2 by @lzhangzz in #3030
- Fix empty response for pipeline by @lzhangzz in #3034
- Fix potential hang during TP model initialization by @lzhangzz in #3033
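
For context on #2979: a hedged sketch of what a messages list with a tool-response turn might look like (field names follow the common OpenAI-style schema as an illustration; the exact fields BaseChatTemplate accepts may differ):

```python
# Sketch of a messages list containing a tool-response turn (#2979).
# Field names follow the common OpenAI-style schema as an illustration;
# exact handling is up to BaseChatTemplate.
messages = [
    {'role': 'user', 'content': "What's the weather in Shanghai?"},
    {'role': 'assistant', 'content': '', 'tool_calls': [
        {'id': 'call_0', 'type': 'function',
         'function': {'name': 'get_weather', 'arguments': '{"city": "Shanghai"}'}},
    ]},
    # The tool's output is fed back as a `tool` role message:
    {'role': 'tool', 'content': '22°C, sunny', 'tool_call_id': 'call_0'},
]
```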
🌐 Other
- [ci] Add w8a8 and InternVL2.5 models to testcases by @zhulinJulia24 in #2949
- Bump version to v0.7.0 by @lvhan028 in #3010
New Contributors
- @zgjja made their first contribution in #2964
- @Yan-Xiangjun made their first contribution in #3018
Full Changelog: v0.6.5...v0.7.0