DJLServing v0.24.0 release
Key Features
- Updates Neuron to 2.14.1 and DeepSpeed to 0.10.0
- Improves Python logging
- Improves SeqScheduler
- Adds DeepSpeed dynamic int8 quantization with SmoothQuant (illustrated in the configuration sketch below)
- Adds support for Llama 2
- Adds Safetensors support
- Adds Neuron dynamic batching and rolling batch
- Adds Adapter API preview
- Adds support for HuggingFace stop words
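Several of these features are configured through serving.properties. The sketch below is a hedged illustration only: it assumes the LMI-style `option.*` property names and uses a placeholder model id; exact keys can differ between handlers and container versions, so consult the DJLServing configuration docs for the authoritative names.

```properties
# Illustrative serving.properties only (assumed LMI-style keys).
engine=Python
option.model_id=meta-llama/Llama-2-7b-hf   # placeholder model id
option.tensor_parallel_degree=max          # "max" uses all visible accelerators
option.rolling_batch=auto                  # rolling (continuous) batching
option.max_rolling_batch_size=32
option.dtype=fp16
# DeepSpeed handler only (assumption): dynamic int8 via SmoothQuant (#1138)
# option.quantize=smoothquant
```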
Enhancements
- Allow overriding truncate parameter in request by @maaquib in #953
- Enable multi-gpu inference (device_map='auto') on seq_batch_scheduler by @KexinFeng in #960
- [wlm] Allows setting default options with an environment variable by @frankfliu in #961
- Enable MPI mode by environment variable by @frankfliu in #964
- Add built-in json formatter by @frankfliu in #965
- [serving] Update tnx handler for 2.12 supported models by @tosterberg in #896
- [serving] Adds more built-in logging options by @frankfliu in #974
- Bump up DJL version to 0.24.0 by @frankfliu in #979
- [serving] Print out CUDA and Neuron device information by @frankfliu in #978
- [docker] bump transformers-neuronx for small llama-2 support by @tosterberg in #980
- [python] Update lmi-dist by @xyang16 in #975
- Install flash attention using wheel by @xyang16 in #982
- [python] Make paged attention configurable by @xyang16 in #986
- [python] Refactor lmi_dist rolling batch by @xyang16 in #987
- [docker] Upgrade to DJL 0.24.0 by @frankfliu in #989
- Set jsonlines formatter for lmi-dist rolling batch test by @xyang16 in #991
- Install FasterTransformer libs with llama support by @rohithkrn in #993
- Add trust_remote_code to ft handler by @siddvenk in #994
- [serving] Improves PyProcess lifecycle logging by @frankfliu in #996
- [python] Adds pid to python process log by @frankfliu in #997
- [python] Includes individual headers for server side batching by @frankfliu in #1001
- update ft python wheel with llama support by @rohithkrn in #1002
- [serving] Install commons-logging dependency for XGBoost engine by @frankfliu in #1004
- [python] Finds optimal batch partition by @bryanktliu in #984
- add error handling for rolling batch by @lanking520 in #1005
- [serving] Allows printing access log to console by @frankfliu in #1009
- [serving] Adds unregister model log by @frankfliu in #1010
- [python] validate each request in the batch by @frankfliu in #1008
- Update dependencies version by @frankfliu in #1012
- [serving] Return proper HTTP status code for each batch by @frankfliu in #1013
- [HF Streaming] use decode instead of batch_decode for streaming by @lanking520 in #1016
- [docker] disable TORCH_CUDNN_V8_API_DISABLED for PyTorch 2.0.1 by @frankfliu in #1018
- Allows setting TENSOR_PARALLEL_DEGREE=max by @frankfliu in #1019
- Simplify handling of min/max workers by @zachgk in #1021
- [docker] Updates cache directory by @frankfliu in #1027
- [benchmark] Adds safetensors support by @frankfliu in #1031
- [VLLM] use more complex logic to ensure all results are captured by @lanking520 in #1035
- [VLLM] add option to set batched tokens by @lanking520 in #1036
- update inf2 dependencies to 2.13.1 by @lanking520 in #1044
- add data collection and some inf2 bug fixes by @lanking520 in #1047
- [RollingBatch] create request simulator to batch by @lanking520 in #1050
- [DeepSpeed] upgrade dependencies by @lanking520 in #1049
- [docker] Upgrades to inf2 2.13.2 version by @frankfliu in #1052
- add revision to handler by @lanking520 in #1056
- [docker] Change default OMP_NUM_THREADS back to 1 for GPU by @frankfliu in #1073
- Worker type by @zachgk in #1022
- [Handler] add dynamic batching to transformers neuronx by @lanking520 in #1076
- add Neuron RollingBatch implementation by @lanking520 in #1078
- [Neuron] upgrade to Neuron 2.14.0 SDK by @lanking520 in #1089
- [vLLM] add pyarrow dependency by @lanking520 in #1093
- [Handler] formalize all engines with same settings by @lanking520 in #1077
- Removes quick abort of python reader threads by @zachgk in #1095
- Adds adapter support by @zachgk in #1082
- Add unmerged lora support in HF handler by @rohithkrn in #1088
- Cleans some unused pieces of PyProcess by @zachgk in #1100
- Creates adapters by directory by @zachgk in #1094
- Use custom peft wheel by @rohithkrn in #1103
- [feature] Enable model sharding on seq_scheduler tested on gpt_neox_20B by @KexinFeng in #1086
- [vLLM] capture max_rolling_batch setting issues by @lanking520 in #1112
- [RollingBatch] add active requests and pending requests for skip tokens by @lanking520 in #1113
- Upgrade lmi_dist by @xyang16 in #1108
- [INF2][Handler] added optimization level per Neuron instruction by @lanking520 in #1107
- [Handler] add neuron int8 quantization by @lanking520 in #1115
- [Docker] upgrade dependencies version by @lanking520 in #1119
- Upgrade flash attention v2 version to 2.3.0 by @xyang16 in #1123
- [Handler] bump up vllm version and fix some bugs by @lanking520 in #1124
- Integrate with seq_scheduler wheel by @KexinFeng in #1122
- [INF2] remove neuron settings on cache hit for the folder by @lanking520 in #1126
- [python] Make rolling batch output not escape unicode characters by @xyang16 in #1135
- [vLLM][Handler] add quantization option for vLLM by @lanking520 in #1136
- [INF2][Handler] remove type conversion in Neuron by @lanking520 in #1134
- Update vllm_rolling_batch.py by @lanking520 in #1140
- Add support for stopwords in huggingface handler by @ydm-amazon in #1118 (see the request sketch after this list)
- Give a version of seq scheduler by @KexinFeng in #1146
- Support adapters by properties by @zachgk in #1148
- [serving] Allow model_id to point to the DJL model zoo by @frankfliu in #1150
- Assert local lora models in the handler by @rohithkrn in #1153
- Block remote adapter url and handler override by @zachgk in #1147
- Add feature flag for adapters by @zachgk in #1152
- [feat] Modify deepspeed handler to support SmoothQuant by @chen3933 in #1138
- add flash2 support for huggingface accelerate by @lanking520 in #1111
- Clarify error message with unsupported quantization algorithm, since … by @davidthomas426 in #1157
- [Handler] disable circular import by @lanking520 in #1158
- Add error message for quantization when using checkpoint loading. by @chen3933 in #1156
- When doing smoothquant calibration, pass tokenizer through in deepspe… by @davidthomas426 in #1159
- Update vllm wheel name by @xyang16 in #1161
- installing official vLLM into container by @lanking520 in #1162
- Update java dependencies by @zachgk in #1169
- [INF2] add neuron batch size default and support rolling batch configs by @lanking520 in #1168
- Faster in-memory weight transfer for transformers-neuronx by @Lokiiiiii in #1172
- [LMI][Handler] add more model support coverage by @lanking520 in #1176
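As a rough client-side illustration of the new stop-word support in the huggingface handler (#1118): the snippet posts the usual inputs/parameters payload to a locally running model endpoint. The model name and, in particular, the stop-word parameter key are assumptions rather than confirmed API; verify them against the handler before relying on this.

```python
# Hedged sketch of a request exercising HF stop-word support (#1118).
# "my_model" and the "stopwords" parameter key are assumptions; check
# the huggingface handler for the exact names before using this.
import requests

resp = requests.post(
    "http://127.0.0.1:8080/predictions/my_model",
    json={
        "inputs": "Write a haiku about autumn:",
        "parameters": {
            "max_new_tokens": 64,
            "stopwords": ["\n\n"],  # assumed parameter name
        },
    },
    timeout=60,
)
print(resp.json())
```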
Bug fixes
- [python] Clean up dangling process in java by @frankfliu in #983
- [fix] Fix the cpu unittests issue due to device_map = 'auto' by @KexinFeng in #970
- [serving] Fixes console log configuration by @frankfliu in #985
- [python] Fixes json output formatter by @frankfliu in #988
- Fix the rolling batch integration test by @xyang16 in #992
- Fix batch offset computation in FT handler by @rohithkrn in #999
- [fix] add no-code rename step to stop runners by @tosterberg in #1000
- [python] Fixes batch header key issue by @frankfliu in #1006
- [fix] Fix Kwargs in AutoConfig by @KexinFeng in #995
- fix the error typing by @lanking520 in #1011
- fix vllm inference error by @lanking520 in #1014
- add some fix to the error messages by @lanking520 in #1020
- fix logging bug in vllm by @lanking520 in #1034
- [python] Fixes tokenizer bug when using HuggingFace pipeline by @frankfliu in #1037
- Fixes OOM checker bug by @frankfliu in #1038
- [python] Avoid holding the lock while running inference by @frankfliu in #1045
- [INF2] disable checker for saved model by @lanking520 in #1058
- [fix] Device, format, and implementation optimizations by @KexinFeng in #1055
- [serving] Ensure JNI is extracted from jar files in the deps folder by @frankfliu in #1061
- [serving] Fixes wrong device mapping for non-tp mode by @frankfliu in #1067
- [Handler] fix device mapping issues by @lanking520 in #1065
- [fix] Fix device map by @KexinFeng in #1074
- Fixes API responses by @zachgk in #1080
- [serving] Fixes log rotation issue by @frankfliu in #1083
- [INF2] don't install linux headers by @lanking520 in #1087
- [handler] fix a few issues by @lanking520 in #1081
- [INF2] fix some bugs and remove old tests by @lanking520 in #1090
- fix some bugs in handler by @lanking520 in #1098
- Fix setting adapters arg by @rohithkrn in #1099
- Fix typo by @rohithkrn in #1104
- fix deepspeed bugs and have better logging by @lanking520 in #1105
- [INF2][Handler] fix none type check by @lanking520 in #1117
- [fix] Device and search config issue by @KexinFeng in #1120
- [fix] falcon test model failure in unittest by @KexinFeng in #1129
- [fix] Fix falcon in seq_scheduler by @KexinFeng in #1131
- Revert flash_attn v2 version back to 2.0.1 by @xyang16 in #1133
- [fix] fix hf transformer handler dependency by @KexinFeng in #1132
- [fix] GPTQ dependency by @KexinFeng in #1137
- [fix] Version fix by @KexinFeng in #1144
- [Handler] disable flash attention by default for now by @lanking520 in #1165
- [fix] add fast loading to partition test by @tosterberg in #1164
- Fix flash_attn import issue by @xyang16 in #1174
- [fix] update tp for dynamic llama2 test back to 4 by @tosterberg in #1175
- [bugfix] parsing waiting steps to integer by @lanking520 in #1178
Documentation and Examples
- Update docs to djl 0.23.0 by @sindhuvahinis in #954
- [docs] Update rolling batch document by @frankfliu in #966
- [docs] Updates rolling batch document by @frankfliu in #1003
- Adds streaming docs by @zachgk in #1017
- Adding docs for llm tuning params by @maaquib in #1026
- [docs] Adds log4j configuration document by @frankfliu in #1030
- [docs] Updates TENSOR_PARALLEL_DEGREE description by @frankfliu in #1032
- [docs] Document Python engine alias by @frankfliu in #1039
CI
- Add support for testing candidate release images in sagemaker tests by @siddvenk in #958
- [serving] Update djlbench snapcraft version to 0.23.0 by @xyang16 in #963
- [ci] Update client for llama inf2 and move hf testing off for opt performance testing by @tosterberg in #971
- [ci] Use smaller instance for inf2 bloom test by @tosterberg in #972
- [docker] Update install inf2 script dependencies by @tosterberg in #976
- [ci] move no-code models to s3 to avoid hub download failure by @tosterberg in #998
- Add FT llama integration test by @rohithkrn in #1028
- [ci] Upgrades gradle to 8.3 by @frankfliu in #1029
- add mpt and starcoder tests by @lanking520 in #1023
- [docker] add version labels for sagemaker by @tosterberg in #1040
- [rollingbatch] add standalone script to run by @lanking520 in #1041
- [ci] Fixes PMD warning by @frankfliu in #1062
- [ci] Fixes gradle deprecation warnings by @frankfliu in #1063
- [RollingBatch][CI] use tag for test and not hardcode by @lanking520 in #1070
- [CI] add vllm tests by @lanking520 in #1072
- [INF2] grant write permission by @lanking520 in #1091
- add no code testing for rollingbatch by @lanking520 in #1097
- [CI] fix the inf2 container build failure by @lanking520 in #1102
- Add unmerged lora integration test by @rohithkrn in #1110
- Unmerged lora correctness test by @rohithkrn in #1114
- Add dependency on stop-runners for lora correctness test by @rohithkrn in #1121
- Add rolling batch gptq integration test by @xyang16 in #1125
- [feature] Test llama-7b-gptq on scheduler_rolling_batch by @KexinFeng in #1101
- Adding smoothquant integ tests by @maaquib in #1139
- [CI][Neuron] add extra timeout time for gpt neox by @lanking520 in #1142
- [CI] allow inf2 instance to sleep longer by @lanking520 in #1143
- [INF2][CI] switch the model to pythia by @lanking520 in #1145
- Instant benchmark by @lanking520 in #1149
- Instant Benchmark Rev2 by @lanking520 in #1151
- [IB] remove empty lines by @lanking520 in #1155
- Enable adapters preview in llm_integration test by @zachgk in #1166
- Adding llama2 w/ SmoothQuant ci test by @maaquib in #1171
- [Docker] free disk space for docker build by @lanking520 in #1170
- [CI] change xgen to standard llama model by @lanking520 in #1177
New Contributors
- @ydm-amazon made their first contribution in #1118
- @chen3933 made their first contribution in #1138
- @davidthomas426 made their first contribution in #1157
- @Lokiiiiii made their first contribution in #1172
Full Changelog: v0.23.0...v0.24.0