DJLServing v0.25.0 Release
Key Changes
- TensorRT-LLM integration. DJLServing now supports the TensorRT-LLM backend for deploying Large Language Models.
  - See the documentation here
  - Llama2-13b using TRT-LLM example notebook
- SmoothQuant support in DeepSpeed
  - Llama2-13b using SmoothQuant with DeepSpeed example notebook
- Rolling batch support in DeepSpeed to boost throughput
- Updated documentation on using DJLServing to deploy LLMs
  - We have added documentation on the supported configurations per container, as well as many new examples
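The new backends above are selected through a model's `serving.properties` file. The following is a minimal sketch of what enabling the TensorRT-LLM backend with rolling batching might look like; the option names follow the LMI container conventions, and the model id, tensor parallel degree, and batch size are placeholder values, not recommendations:

```properties
# Hypothetical serving.properties for the TensorRT-LLM backend.
# model_id and numeric values are illustrative placeholders.
engine=MPI
option.model_id=meta-llama/Llama-2-13b-hf
option.tensor_parallel_degree=4
option.rolling_batch=trtllm
option.max_rolling_batch_size=64
```

Similarly, the new SmoothQuant support in DeepSpeed is expected to be opted into via a quantization option (e.g. `option.quantize=smoothquant` with `engine=DeepSpeed`); consult the configuration documentation linked above for the exact keys supported by each container.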
Enhancements
- Add context length estimate for Neuron handler by @lanking520 in #1184
- [INF2] allow neuron to load split model directly by @lanking520 in #1186
- Adding INF2 (transformers-neuronx) compilation latencies to SageMaker Health Metrics by @Lokiiiiii in #1185
- [serving] Auto detect XGBoost engine with .xgb extension by @frankfliu in #1196
- add memory checking in place to identify max by @lanking520 in #1191
- [python] Do not set default value for truncate by @xyang16 in #1193
- Add aiccl support by @maaquib in #1179
- Setting default datatype for deepspeed handlers by @sindhuvahinis in #1203
- add trtllm container build by @lanking520 in #1215
- Add TRTLLM TRT build from our managed source by @lanking520 in #1199
- [python] Remove generation_dict in lmi_dist_rolling_batch by @xyang16 in #1217
- install s5cmd to trtllm by @lanking520 in #1219
- Update mpirun options by @xyang16 in #1220
- [python] Optimize batch serialization by @frankfliu in #1223
- upgrade vllm by @lanking520 in #1238
- Supports docker build with local .deb by @zachgk in #1231
- Do warmup in multiple requests by @xyang16 in #1216
- [python] Update PublisherBytesSupplier API by @frankfliu in #1242
- remove tensorrt installation by @lanking520 in #1243
- Use CUDA runtime image instead of CUDA devel. by @chen3933 in #1201
- remove unused components by @lanking520 in #1245
- [DeepSpeed DLC] separate container build with multi-layers by @lanking520 in #1246
- New PR for tensorrt llm by @ydm-amazon in #1240
- [python] Buffer tokens for rolling batch by @frankfliu in #1249
- Add trt-llm engine build step during model initialization by @rohithkrn in #1235
- [serving] Adds token latency metric by @frankfliu in #1251
- install trtllm toolkit by @lanking520 in #1254
- [TRTLLM] some clean up on trtllm handler by @lanking520 in #1248
- [TRTLLM] use tensorrt wheel by @lanking520 in #1255
- Adds versions as labels in dockerfiles by @zachgk in #1160
- [TRTLLM] add trtllm with no deps by @lanking520 in #1256
- [TRT partition] add realtime stream reader for the conversion script by @lanking520 in #1259
- [TRTLLM] always setting request output length by @lanking520 in #1258
- Update trtllm toolkit path by @rohithkrn in #1260
- allow gpu detection by @lanking520 in #1261
- add trtllm cuda-compat by @lanking520 in #1247
- [feat] Add serving.properties parameter for compiled graph path inf2 by @tosterberg in #1262
- Inf2 properties refactoring using pydantic by @sindhuvahinis in #1252
- MME - deviceId while creating workers by @sindhuvahinis in #1257
- [serving] Refactor TensorRT-LLM partition code by @frankfliu in #1267
- [DS] Deepspeed rolling batch support by @maaquib in #1295
- Allow user to pass in max_batch_prefill_tokens by @xyang16 in #1320
- add smoothquant as options by @lanking520 in #1285
- Deepspeed configurations refactoring by @sindhuvahinis in #1280
- update smoothquant arg by @rohithkrn in #1291
- [python] Adds do_sample support for trtllm by @frankfliu in #1290
- [wlm] Supports model_id point to a local directory by @frankfliu in #1276
- [SageMaker Galactus developer experience] model load integration to DJL serving by @haNa-meister in #1230
- [feat] Better output format from seq-scheduler by @KexinFeng in #1305
- [serving] Upgrades AWSSDK version to 2.21.19 by @frankfliu in #1313
- [serving] Uses seconds for ChunkedBytesSupplier timeout by @frankfliu in #1311
- install datasets in trtllm container by @rohithkrn in #1270
- TensorRT configs refactoring by @sindhuvahinis in #1275
- [TRTLLM] fix corner case that model_id point to local path by @lanking520 in #1317
- Huggingface configurations refactoring by @sindhuvahinis in #1283
- Calculate max_seq_length in warmup dynamically by @xyang16 in #1298
- Increase memory limit for rolling batch integration octocoder model by @xyang16 in #1319
- [TRTLLM] remove default repetition penalty by @lanking520 in #1321
- [feat] Expose max sparse params by @KexinFeng in #1273
- [NeuronX] add attention mask porting from optimum-neuron by @lanking520 in #1206
- [partition] extract properties files by @sindhuvahinis in #1293
- add checkpoint to ds properties by @sindhuvahinis in #1296
- [vllm] standardize input parameters by @frankfliu in #1301
- [TRTLLM] format better for logging by @lanking520 in #1309
- Change default top_k and temperature parameters in TRTLLM rolling batch by @ydm-amazon in #1312
- Add tokenizer check for triton repo by @rohithkrn in #1274
- [SageMaker Galactus developer experience] use python backend when schema is customized by @haNa-meister in #1286
Bug Fixes
- [bug fix] add entrypoint camel case recovery by @lanking520 in #1181
- Fix max tensor_parallel_degree by @zachgk in #1182
- Fix lmi_dist garbage output issue by @xyang16 in #1187
- [fix] update context estimate interface by @tosterberg in #1194
- Check logs for aiccl usage in integ test by @maaquib in #1202
- [serving] Revert management URI matching regex by @frankfliu in #1209
- Update datasets version in deepspeed.Dockerfile by @maaquib in #1211
- [console] Fixes bug for docker port mapping case by @frankfliu in #1213
- [fix] use fast sample as default sample method for inf2 by @tosterberg in #1226
- [python] Fixes batch error handling. by @frankfliu in #1232
- [python] Make parse_input() backward compatible by @frankfliu in #1233
- [python] Fixes build error by @frankfliu in #1253
- [python] Fixes scheduler rolling batch device data type error by @frankfliu in #1271
- fix pydantic install bug by @sindhuvahinis in #1279
- fix the bug in the cuda compat settings by @lanking520 in #1278
- [CUDA Compat] fix the verlte settings on the script by @lanking520 in #1281
- [fix] quantization properties for lmi dist and hf acc by @sindhuvahinis in #1318
- [fix] Refactor device on the scheduler by @KexinFeng in #1302
- [python] Recover from python process crash by @frankfliu in #1308
- [serving] Fixes large concurrent clients performance issue by @frankfliu in #1207
- [python] Workaround bug in PublisherBytesSupplier by @frankfliu in #1277
- fix batch_size type conversion by @sindhuvahinis in #1299
- [TRTLLM] remove max_new_tokens since backend does not recognize it by @lanking520 in #1287
- Check whether env variable value is empty by @sindhuvahinis in #1288
- [TRTLLM] stop on any exception by @lanking520 in #1289
Docs
- [0.25.0] Fix rolling batch properties by @xyang16 in #1327
- [doc][0.25.0] lmi configurations readme by @sindhuvahinis in #1337
- [0.25.0][cherrypick][doc] Updating new TensorRT-LLM configurations by @sindhuvahinis in #1346
- Adds documentation for adapters by @zachgk in #1197
- Bump up DJL version to 0.25.0 by @zachgk in #1221
- [docs] Fixes serving document by @frankfliu in #1237
- [docs] Reorganize configuration docs by @zachgk in #1316
CI/CD
- Add CI performance test for deepspeed smoothquant. by @chen3933 in #1183
- [python] Reformat python code by @frankfliu in #1195
- [feat] fail continuous test if python is not formatted in CI by @tosterberg in #1198
- [telemetry] update nightly containers label to djl 0.25.0 by @tosterberg in #1200
- [ci] Fixes missing publish tasks by @frankfliu in #1205
- [ci] Clean up temporary changes in init.py by @frankfliu in #1210
- Adds a test for unencoded urls during registration by @zachgk in #1225
- [fix] gpt2 neuron support handler and ci by @tosterberg in #1229
- remove fastertransformer build and release in DJLServing by @lanking520 in #1241
- update lora model id in correctness test by @rohithkrn in #1263
- [ci] Update tnx_config with new property and unit tests by @tosterberg in #1265
- [CI] TnX load split model llama2 test and fix handler by @sindhuvahinis in #1294
- [CI] Add awq test case for vllm by @sindhuvahinis in #1300
- [CI] split lmi tests to run in parallel by @sindhuvahinis in #1303
- [CI] Adding one more machine for inf2 integration test by @sindhuvahinis in #1304
- [CI] Adding option to run only test we need by @sindhuvahinis in #1307
- [CI] Adds instance benchmark tabular recording by @zachgk in #1315
- [UnitTest] adding more properties testing by @sindhuvahinis in #1306
- formatPython by @rohithkrn in #1292
- [CVE] Patch for Netty by @lanking520 in #1284
- [TRTLLM] CVE fixes by @lanking520 in #1314
- Downgrade to pydantic version 1.10.13 by @sindhuvahinis in #1269
- Creates docker temporary instances by @zachgk in #1264
- [TRTLLM] split further for container layers by @lanking520 in #1297
New Contributors
- @haNa-meister made their first contribution in #1230
Full Changelog: v0.24.0...v0.25.0