DJLServing v0.25.0 Release
Key Changes
- TensorRT-LLM integration. DJLServing now supports the TensorRT-LLM backend for deploying Large Language Models.
  - See the documentation here
  - Llama2-13b using TRT-LLM example notebook
- SmoothQuant support in DeepSpeed
  - Llama2-13b using SmoothQuant with DeepSpeed example notebook
- Rolling batch support in DeepSpeed to boost throughput
- Updated documentation on using DJLServing to deploy LLMs
  - We have added documentation on the supported configurations per container, as well as many new examples
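The new backends above are selected through a model's `serving.properties` file. The following is a minimal sketch of what enabling the TensorRT-LLM backend with rolling batching might look like; the option names follow the LMI container conventions, and the model id, tensor parallel degree, and batch size are placeholder values, not recommendations:

```properties
# Hypothetical serving.properties for the TensorRT-LLM backend.
# model_id and numeric values are illustrative placeholders.
engine=MPI
option.model_id=meta-llama/Llama-2-13b-hf
option.tensor_parallel_degree=4
option.rolling_batch=trtllm
option.max_rolling_batch_size=64
```

Similarly, the new SmoothQuant support in DeepSpeed is expected to be opted into via a quantization option (e.g. `option.quantize=smoothquant` with `engine=DeepSpeed`); consult the configuration documentation linked above for the exact keys supported by each container.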
Enhancements
- Add context length estimate for Neuron handler by @lanking520 in #1184
- [INF2] allow neuron to load split model directly by @lanking520 in #1186
- Adding INF2 (transformers-neuronx) compilation latencies to SageMaker Health Metrics by @Lokiiiiii in #1185
- [serving] Auto detect XGBoost engine with .xgb extension by @frankfliu in #1196
- add memory checking in place to identify max by @lanking520 in #1191
- [python] Do not set default value for truncate by @xyang16 in #1193
- Add aiccl support by @maaquib in #1179
- Setting default datatype for deepspeed handlers by @sindhuvahinis in #1203
- add trtllm container build by @lanking520 in #1215
- Add TRTLLM TRT build from our managed source by @lanking520 in #1199
- [python] Remove generation_dict in lmi_dist_rolling_batch by @xyang16 in #1217
- install s5cmd to trtllm by @lanking520 in #1219
- Update mpirun options by @xyang16 in #1220
- [python] Optimize batch serialization by @frankfliu in #1223
- upgrade vllm by @lanking520 in #1238
- Supports docker build with local .deb by @zachgk in #1231
- Do warmup in multiple requests by @xyang16 in #1216
- [python] Update PublisherBytesSupplier API by @frankfliu in #1242
- remove tensorrt installation by @lanking520 in #1243
- Use CUDA runtime image instead of CUDA devel. by @chen3933 in #1201
- remove unused components by @lanking520 in #1245
- [DeepSpeed DLC] separate container build with multi-layers by @lanking520 in #1246
- New PR for tensorrt llm by @ydm-amazon in #1240
- [python] Buffer tokens for rolling batch by @frankfliu in #1249
- Add trt-llm engine build step during model initialization by @rohithkrn in #1235
- [serving] Adds token latency metric by @frankfliu in #1251
- install trtllm toolkit by @lanking520 in #1254
- [TRTLLM] some clean up on trtllm handler by @lanking520 in #1248
- [TRTLLM] use tensorrt wheel by @lanking520 in #1255
- Adds versions as labels in dockerfiles by @zachgk in #1160
- [TRTLLM] add trtllm with no deps by @lanking520 in #1256
- [TRT partition] add realtime stream reader for the conversion script by @lanking520 in #1259
- [TRTLLM] always setting request output length by @lanking520 in #1258
- Update trtllm toolkit path by @rohithkrn in #1260
- allow gpu detection by @lanking520 in #1261
- add trtllm cuda-compat by @lanking520 in #1247
- [feat] Add serving.properties parameter for compiled graph path inf2 by @tosterberg in #1262
- Inf2 properties refactoring using pydantic by @sindhuvahinis in #1252
- MME - deviceId while creating workers by @sindhuvahinis in #1257
- [serving] Refactor TensorRT-LLM partition code by @frankfliu in #1267
- [DS] Deepspeed rolling batch support by @maaquib in #1295
- Allow user to pass in max_batch_prefill_tokens by @xyang16 in #1320
- add smoothquant as options by @lanking520 in #1285
- Deepspeed configurations refactoring by @sindhuvahinis in #1280
- update smoothquant arg by @rohithkrn in #1291
- [python] Adds do_sample support for trtllm by @frankfliu in #1290
- [wlm] Supports model_id point to a local directory by @frankfliu in #1276
- [SageMaker Galactus developer experience] model load integration to DJL serving by @haNa-meister in #1230
- [feat] Better output format from seq-scheduler by @KexinFeng in #1305
- [serving] Upgrades AWSSDK version to 2.21.19 by @frankfliu in #1313
- [serving] Uses seconds for ChunkedBytesSupplier timeout by @frankfliu in #1311
- install datasets in trtllm container by @rohithkrn in #1270
- TensorRT configs refactoring by @sindhuvahinis in #1275
- [TRTLLM] fix corner case that model_id point to local path by @lanking520 in #1317
- Huggingface configurations refactoring by @sindhuvahinis in #1283
- Calculate max_seq_length in warmup dynamically by @xyang16 in #1298
- Increase memory limit for rolling batch integration octocoder model by @xyang16 in #1319
- [TRTLLM] remove default repetition penalty by @lanking520 in #1321
- [feat] Expose max sparse params by @KexinFeng in #1273
- [NeuronX] add attention mask porting from optimum-neuron by @lanking520 in #1206
- [partition] extract properties files by @sindhuvahinis in #1293
- add checkpoint to ds properties by @sindhuvahinis in #1296
- [vllm] standardize input parameters by @frankfliu in #1301
- [TRTLLM] format better for logging by @lanking520 in #1309
- Change default top_k and temperature parameters in TRTLLM rolling batch by @ydm-amazon in #1312
- Add tokenizer check for triton repo by @rohithkrn in #1274
- [SageMaker Galactus developer experience] use python backend when schema is customized by @haNa-meister in #1286
Bug Fixes
- [bug fix] add entrypoint camel case recovery by @lanking520 in #1181
- Fix max tensor_parallel_degree by @zachgk in #1182
- Fix lmi_dist garbage output issue by @xyang16 in #1187
- [fix] update context estimate interface by @tosterberg in #1194
- Check logs for aiccl usage in integ test by @maaquib in #1202
- [serving] Revert management URI matching regex by @frankfliu in #1209
- Update datasets version in deepspeed.Dockerfile by @maaquib in #1211
- [console] Fixes bug for docker port mapping case by @frankfliu in #1213
- [fix] use fast sample as default sample method for inf2 by @tosterberg in #1226
- [python] Fixes batch error handling. by @frankfliu in #1232
- [python] Make parse_input() backward compatible by @frankfliu in #1233
- [python] Fixes build error by @frankfliu in #1253
- [python] Fixes scheduler rolling batch device data type error by @frankfliu in #1271
- fix pydantic install bug by @sindhuvahinis in #1279
- fix the bug in the cuda compat settings by @lanking520 in #1278
- [CUDA Compat] fix the verlte settings on the script by @lanking520 in #1281
- [fix] quantization properties for lmi dist and hf acc by @sindhuvahinis in #1318
- [fix] Refactor device on the scheduler by @KexinFeng in #1302
- [python] Recover from python process crash by @frankfliu in #1308
- [serving] Fixes large concurrent clients performance issue by @frankfliu in #1207
- [python] Workaround bug in PublisherBytesSupplier by @frankfliu in #1277
- fix batch_size type conversion by @sindhuvahinis in #1299
- [TRTLLM] remove max_new_tokens since backend does not recognize it by @lanking520 in #1287
- Check whether env variable value is empty by @sindhuvahinis in #1288
- [TRTLLM] stop on any exception by @lanking520 in #1289
Docs
- [0.25.0] Fix rolling batch properties by @xyang16 in #1327
- [doc][0.25.0] lmi configurations readme by @sindhuvahinis in #1337
- [0.25.0][cherrypick][doc] Updating new TensorRT-LLM configurations by @sindhuvahinis in #1346
- Adds documentation for adapters by @zachgk in #1197
- Bump up DJL version to 0.25.0 by @zachgk in #1221
- [docs] Fixes serving document by @frankfliu in #1237
- [docs] Reorganize configuration docs by @zachgk in #1316
CI/CD
- Add CI performance test for deepspeed smoothquant. by @chen3933 in #1183
- [python] Reformat python code by @frankfliu in #1195
- [feat] fail continuous test if python is not formatted in CI by @tosterberg in #1198
- [telemetry] update nightly containers label to djl 0.25.0 by @tosterberg in #1200
- [ci] Fixes missing publish tasks by @frankfliu in #1205
- [ci] Clean up temporary changes in init.py by @frankfliu in #1210
- Adds a test for unencoded urls during registration by @zachgk in #1225
- [fix] gpt2 neuron support handler and ci by @tosterberg in #1229
- remove fastertransformer build and release in DJLServing by @lanking520 in #1241
- update lora model id in correctness test by @rohithkrn in #1263
- [ci] Update tnx_config with new property and unit tests by @tosterberg in #1265
- [CI] TnX load split model llama2 test and fix handler by @sindhuvahinis in #1294
- [CI] Add awq test case for vllm by @sindhuvahinis in #1300
- [CI] split lmi tests to run in parallel by @sindhuvahinis in #1303
- [CI] Adding one more machine for inf2 integration test by @sindhuvahinis in #1304
- [CI] Adding option to run only test we need by @sindhuvahinis in #1307
- [CI] Adds instance benchmark tabular recording by @zachgk in #1315
- [UnitTest] adding more properties testing by @sindhuvahinis in #1306
- formatPython by @rohithkrn in #1292
- [CVE] Patch for Netty by @lanking520 in #1284
- [TRTLLM] CVE fixes by @lanking520 in #1314
- Downgrade to pydantic version 1.10.13 by @sindhuvahinis in #1269
- Creates docker temporary instances by @zachgk in #1264
- [TRTLLM] split further for container layers by @lanking520 in #1297
New Contributors
- @haNa-meister made their first contribution in #1230
Full Changelog: v0.24.0...v0.25.0