DJLServing v0.24.0 release
Key Features
- Updates Neuron to 2.14.1 and DeepSpeed to 0.10.0
- Improves Python logging
- Improves SeqScheduler
- Adds DeepSpeed dynamic int8 quantization with SmoothQuant (illustrated in the configuration sketch below)
- Adds support for Llama 2
- Adds Safetensors support
- Adds Neuron dynamic batching and rolling batch
- Adds Adapter API preview
- Adds support for HuggingFace stop words
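Several of these features are configured through serving.properties. The sketch below is a hedged illustration only: it assumes the LMI-style `option.*` property names and uses a placeholder model id; exact keys can differ between handlers and container versions, so consult the DJLServing configuration docs for the authoritative names.

```properties
# Illustrative serving.properties only (assumed LMI-style keys).
engine=Python
option.model_id=meta-llama/Llama-2-7b-hf   # placeholder model id
option.tensor_parallel_degree=max          # "max" uses all visible accelerators
option.rolling_batch=auto                  # rolling (continuous) batching
option.max_rolling_batch_size=32
option.dtype=fp16
# DeepSpeed handler only (assumption): dynamic int8 via SmoothQuant (#1138)
# option.quantize=smoothquant
```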
Enhancements
- Allow overriding truncate parameter in request by @maaquib in #953
- Enable multi-gpu inference (device_map='auto') on seq_batch_scheduler by @KexinFeng in #960
- [wlm] Allows setting default options with an environment variable by @frankfliu in #961
- Enable MPI mode by environment variable by @frankfliu in #964
- Add built-in json formatter by @frankfliu in #965
- [serving] Update tnx handler for 2.12 supported models by @tosterberg in #896
- [serving] Adds more built-in logging options by @frankfliu in #974
- Bump up DJL version to 0.24.0 by @frankfliu in #979
- [serving] Print out CUDA and Neuron device information by @frankfliu in #978
- [docker] bump transformers-neuronx for small llama-2 support by @tosterberg in #980
- [python] Update lmi-dist by @xyang16 in #975
- Install flash attention using wheel by @xyang16 in #982
- [python] Make paged attention configurable by @xyang16 in #986
- [python] Refactor lmi_dist rolling batch by @xyang16 in #987
- [docker] Upgrade to DJL 0.24.0 by @frankfliu in #989
- Set jsonlines formatter for lmi-dist rolling batch test by @xyang16 in #991
- Install FasterTransformer libs with llama support by @rohithkrn in #993
- Add trust_remote_code to ft handler by @siddvenk in #994
- [serving] Improves PyProcess lifecycle logging by @frankfliu in #996
- [python] Adds pid to python process log by @frankfliu in #997
- [python] Includes individual headers for server side batching by @frankfliu in #1001
- update ft python wheel with llama support by @rohithkrn in #1002
- [serving] Install commons-logging dependency for XGBoost engine by @frankfliu in #1004
- [python] Finds optimal batch partition by @bryanktliu in #984
- add error handling for rolling batch by @lanking520 in #1005
- [serving] Allows printing access log to console by @frankfliu in #1009
- [serving] Adds unregister model log by @frankfliu in #1010
- [python] validate each request in the batch by @frankfliu in #1008
- Update dependencies version by @frankfliu in #1012
- [serving] Return proper HTTP status code for each batch by @frankfliu in #1013
- [HF Streaming] use decode instead of batch_decode for streaming by @lanking520 in #1016
- [docker] disable TORCH_CUDNN_V8_API_DISABLED for PyTorch 2.0.1 by @frankfliu in #1018
- Allows setting TENSOR_PARALLEL_DEGREE=max by @frankfliu in #1019
- Simplify handling of min/max workers by @zachgk in #1021
- [docker] Updates cache directory by @frankfliu in #1027
- [benchmark] Adds safetensors support by @frankfliu in #1031
- [VLLM] use more complex logic to ensure all results are captured by @lanking520 in #1035
- [VLLM] add option to set batched tokens by @lanking520 in #1036
- update inf2 dependencies to 2.13.1 by @lanking520 in #1044
- add data collection and some inf2 bug fixes by @lanking520 in #1047
- [RollingBatch] create request simulator to batch by @lanking520 in #1050
- [DeepSpeed] upgrade dependencies by @lanking520 in #1049
- [docker] Upgrades to inf2 2.13.2 version by @frankfliu in #1052
- add revision to handler by @lanking520 in #1056
- [docker] Change default OMP_NUM_THREADS back to 1 for GPU by @frankfliu in #1073
- Worker type by @zachgk in #1022
- [Handler] add dynamic batching to transformers neuronx by @lanking520 in #1076
- add Neuron RollingBatch implementation by @lanking520 in #1078
- [Neuron] upgrade to Neuron 2.14.0 SDK by @lanking520 in #1089
- [vLLM] add pyarrow dependency by @lanking520 in #1093
- [Handler] formalize all engines with same settings by @lanking520 in #1077
- Removes quick abort of python reader threads by @zachgk in #1095
- Adds adapter support by @zachgk in #1082
- Add unmerged lora support in HF handler by @rohithkrn in #1088
- Cleans some unused pieces of PyProcess by @zachgk in #1100
- Creates adapters by directory by @zachgk in #1094
- Use custom peft wheel by @rohithkrn in #1103
- [feature] Enable model sharding on seq_scheduler tested on gpt_neox_20B by @KexinFeng in #1086
- [vLLM] capture max_rolling_batch setting issues by @lanking520 in #1112
- [RollingBatch] add active requests and pending requests for skip tokens by @lanking520 in #1113
- Upgrade lmi_dist by @xyang16 in #1108
- [INF2][Handler] added optimization level per Neuron instruction by @lanking520 in #1107
- [Handler] add neuron int8 quantization by @lanking520 in #1115
- [Docker] upgrade dependencies version by @lanking520 in #1119
- Upgrade flash attention v2 version to 2.3.0 by @xyang16 in #1123
- [Handler] bump up vllm version and fix some bugs by @lanking520 in #1124
- Integrate with seq_scheduler wheel by @KexinFeng in #1122
- [INF2] remove neuron settings on cache hit for the folder by @lanking520 in #1126
- [python] Make rolling batch output not escape unicode characters by @xyang16 in #1135
- [vLLM][Handler] add quantization option for vLLM by @lanking520 in #1136
- [INF2][Handler] remove type conversion in Neuron by @lanking520 in #1134
- Update vllm_rolling_batch.py by @lanking520 in #1140
- Add support for stopwords in huggingface handler by @ydm-amazon in #1118 (see the request sketch after this list)
- Give a version of seq scheduler by @KexinFeng in #1146
- Support adapters by properties by @zachgk in #1148
- [serving] Allow model_id to point to the DJL model zoo by @frankfliu in #1150
- Assert local lora models in the handler by @rohithkrn in #1153
- Block remote adapter url and handler override by @zachgk in #1147
- Add feature flag for adapters by @zachgk in #1152
- [feat] Modify deepspeed handler to support SmoothQuant by @chen3933 in #1138
- add flash2 support for huggingface accelerate by @lanking520 in #1111
- Clarify error message with unsupported quantization algorithm, since … by @davidthomas426 in #1157
- [Handler] disable circular import by @lanking520 in #1158
- Add error message for quantization when using checkpoint loading. by @chen3933 in #1156
- When doing smoothquant calibration, pass tokenizer through in deepspe… by @davidthomas426 in #1159
- Update vllm wheel name by @xyang16 in #1161
- installing official vLLM into container by @lanking520 in #1162
- Update java dependencies by @zachgk in #1169
- [INF2] add neuron batch size default and support rolling batch configs by @lanking520 in #1168
- Faster in-memory weight transfer for transformers-neuronx by @Lokiiiiii in #1172
- [LMI][Handler] add more model support coverage by @lanking520 in #1176
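As a rough client-side illustration of the new stop-word support in the huggingface handler (#1118): the snippet posts the usual inputs/parameters payload to a locally running model endpoint. The model name and, in particular, the stop-word parameter key are assumptions rather than confirmed API; verify them against the handler before relying on this.

```python
# Hedged sketch of a request exercising HF stop-word support (#1118).
# "my_model" and the "stopwords" parameter key are assumptions; check
# the huggingface handler for the exact names before using this.
import requests

resp = requests.post(
    "http://127.0.0.1:8080/predictions/my_model",
    json={
        "inputs": "Write a haiku about autumn:",
        "parameters": {
            "max_new_tokens": 64,
            "stopwords": ["\n\n"],  # assumed parameter name
        },
    },
    timeout=60,
)
print(resp.json())
```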
Bug fixes
- [python] Clean up dangling process in java by @frankfliu in #983
- [fix] Fix the cpu unittests issue due to device_map = 'auto' by @KexinFeng in #970
- [serving] Fixes console log configuration by @frankfliu in #985
- [python] Fixes json output formatter by @frankfliu in #988
- Fix the rolling batch integration test by @xyang16 in #992
- Fix batch offset computation in FT handler by @rohithkrn in #999
- [fix] add no-code rename step to stop runners by @tosterberg in #1000
- [python] Fixes batch header key issue by @frankfliu in #1006
- [fix] Fix Kwargs in AutoConfig by @KexinFeng in #995
- fix the error typing by @lanking520 in #1011
- fix vllm inference error by @lanking520 in #1014
- add some fix to the error messages by @lanking520 in #1020
- fix logging bug in vllm by @lanking520 in #1034
- [python] Fixes tokenizer bug when using HuggingFace pipeline by @frankfliu in #1037
- Fixes OOM checker bug by @frankfliu in #1038
- [python] Avoid holding the lock while running inference by @frankfliu in #1045
- [INF2] disable checker for saved model by @lanking520 in #1058
- [fix] Device, format, and implementation optimizations by @KexinFeng in #1055
- [serving] Ensure JNI is extracted from jar files in the deps folder by @frankfliu in #1061
- [serving] Fixes wrong device mapping for non-tp mode by @frankfliu in #1067
- [Handler] fix device mapping issues by @lanking520 in #1065
- [fix] Fix device map by @KexinFeng in #1074
- Fixes API responses by @zachgk in #1080
- [serving] Fixes log rotation issue by @frankfliu in #1083
- [INF2] don't install linux headers by @lanking520 in #1087
- [handler] fix a few issues by @lanking520 in #1081
- [INF2] fix some bugs and remove old tests by @lanking520 in #1090
- fix some bugs in handler by @lanking520 in #1098
- Fix setting adapters arg by @rohithkrn in #1099
- Fix typo by @rohithkrn in #1104
- fix deepspeed bugs and have better logging by @lanking520 in #1105
- [INF2][Handler] fix none type check by @lanking520 in #1117
- [fix] Device and search config issue by @KexinFeng in #1120
- [fix] falcon test model failure in unittest by @KexinFeng in #1129
- [fix] Fix falcon in seq_scheduler by @KexinFeng in #1131
- Revert flash_attn v2 version back to 2.0.1 by @xyang16 in #1133
- [fix] fix hf transformer handler dependency by @KexinFeng in #1132
- [fix] GPTQ dependency by @KexinFeng in #1137
- [fix] Version fix by @KexinFeng in #1144
- [Handler] disable flash attention by default for now by @lanking520 in #1165
- [fix] add fast loading to partition test by @tosterberg in #1164
- Fix flash_attn import issue by @xyang16 in #1174
- [fix] update tp for dynamic llama2 test back to 4 by @tosterberg in #1175
- [bugfix] parsing waiting steps to integer by @lanking520 in #1178
Documentation and Examples
- Update docs to djl 0.23.0 by @sindhuvahinis in #954
- [docs] Update rolling batch document by @frankfliu in #966
- [docs] Updates rolling batch document by @frankfliu in #1003
- Adds streaming docs by @zachgk in #1017
- Adding docs for llm tuning params by @maaquib in #1026
- [docs] Adds log4j configuration document by @frankfliu in #1030
- [docs] Updates TENSOR_PARALLEL_DEGREE description by @frankfliu in #1032
- [docs] Document Python engine alias by @frankfliu in #1039
CI
- Add support for testing candidate release images in sagemaker tests by @siddvenk in #958
- [serving] Update djlbench snapcraft version to 0.23.0 by @xyang16 in #963
- [ci] Update client for llama inf2 and move hf testing off for opt performance testing by @tosterberg in #971
- [ci] Use smaller instance for inf2 bloom test by @tosterberg in #972
- [docker] Update install inf2 script dependencies by @tosterberg in #976
- [ci] move no-code models to s3 to avoid hub download failure by @tosterberg in #998
- Add FT llama integration test by @rohithkrn in #1028
- [ci] Upgrades gradle to 8.3 by @frankfliu in #1029
- add mpt and starcoder tests by @lanking520 in #1023
- [docker] add version labels for sagemaker by @tosterberg in #1040
- [rollingbatch] add standalone script to run by @lanking520 in #1041
- [ci] Fixes PMD warning by @frankfliu in #1062
- [ci] Fixes gradle deprecation warnings by @frankfliu in #1063
- [RollingBatch][CI] use tag for test and not hardcode by @lanking520 in #1070
- [CI] add vllm tests by @lanking520 in #1072
- [INF2] grant write permission by @lanking520 in #1091
- add no code testing for rollingbatch by @lanking520 in #1097
- [CI] fix the inf2 container build failure by @lanking520 in #1102
- Add unmerged lora integration test by @rohithkrn in #1110
- Unmerged lora correctness test by @rohithkrn in #1114
- Add dependency on stop-runners for lora correctness test by @rohithkrn in #1121
- Add rolling batch gptq integration test by @xyang16 in #1125
- [feature] Test llama-7b-gptq on scheduler_rolling_batch by @KexinFeng in #1101
- Adding smoothquant integ tests by @maaquib in #1139
- [CI][Neuron] add extra timeout time for gpt neox by @lanking520 in #1142
- [CI] allow inf2 instance to sleep longer by @lanking520 in #1143
- [INF2][CI] switch the model to pythia by @lanking520 in #1145
- Instant benchmark by @lanking520 in #1149
- Instant Benchmark Rev2 by @lanking520 in #1151
- [IB] remove empty lines by @lanking520 in #1155
- Enable adapters preview in llm_integration test by @zachgk in #1166
- Adding llama2 w/ SmoothQuant ci test by @maaquib in #1171
- [Docker] free disk space for docker build by @lanking520 in #1170
- [CI] change xgen to standard llama model by @lanking520 in #1177
New Contributors
- @ydm-amazon made their first contribution in #1118
- @chen3933 made their first contribution in #1138
- @davidthomas426 made their first contribution in #1157
- @Lokiiiiii made their first contribution in #1172
Full Changelog: v0.23.0...v0.24.0