DJLServing v0.23.0 release

@sindhuvahinis sindhuvahinis released this 18 Jul 15:22
· 1337 commits to master since this release
e69b646

Key Features

  • Introduces Rolling Batch
    • SeqBatchScheduler with rolling batch #803
    • Sampling SeqBatcher design #842
    • Max Seqbatcher number threshold api #843
    • Adds rolling batch support #828
    • Max new length #845
    • Rolling batch for huggingface handler #857
    • Compute kv cache utility function #863
    • Sampling decoding implementation #878
    • Uses multinomial to choose from topK samples and improves topP sampling #891
    • Falcon support #890
    • Unit test with random seed failure #909
    • KV cache support in default handler #929
  • Introduces LMI Dist library for rolling batch
    • Rolling batch support for flash models #865
    • Assign random seed for lmi dist #912
    • JSON format for rolling batch #899
    • Add quantization parameter for lmi_dist rolling batch backend for HF #888
  • Introduces vLLM library for rolling batch
    • [VLLM] add vllm rolling batch and add hazard handling #877
  • Introduces PEFT and LoRA support in handlers
    • Add peft to fastertransformer container #889
    • Add peft support to default deepspeed and huggingface handlers #884
    • Add lora support to ft default handler #932
  • Introduces streaming support to FasterTransformer
    • Add Streaming support #820
  • Introduces S3 Cache Engine
    • S3 Cache Engine #719
  • Upgrades component versions:
    • Upgrade PyTorch to 2.0.1 #804
    • Update Neuron to 2.10 #681
    • Upgrade deepspeed to 0.9.5 #804
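
The sampling items above (#878, #891) mention a multinomial draw over topK candidates combined with topP filtering. As a rough illustration of that general technique only, here is a minimal, standard-library sketch. It is not the DJLServing implementation; the function name `sample_top_k_top_p` and its parameters are hypothetical.

```python
import math
import random


def sample_top_k_top_p(logits, top_k=3, top_p=0.9, rng=None):
    """Sample one token index after top-K and top-P filtering (illustrative only)."""
    rng = rng or random.Random()

    # Softmax with max-subtraction for numerical stability.
    peak = max(logits)
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-K: keep the K most probable token indices, most probable first.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    ranked = ranked[:top_k]

    # Top-P: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Multinomial draw over the survivors; random.choices normalizes the weights.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


if __name__ == "__main__":
    # Hypothetical logits for a four-token vocabulary.
    print(sample_top_k_top_p([5.0, 1.0, 0.5, -2.0], top_k=2, top_p=0.95))
```

Applying top-K before top-P is one common ordering; a real backend would typically work on batched tensors rather than Python lists.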

Enhancement

Serving and python engine enhancements

  • Adds workflow model loading for SageMaker #661
  • Allows model being shared between workflows #665
  • Prints out error message if pip install failed #666
  • Install fixed version for transformers and accelerate #672
  • Add numpy fix #674
  • SM Training job changes for AOT #667
  • Creates model dir to prevent issues with no code experience in SageMaker #675
  • Don't mount model dir for no code tests #676
  • AOT upload checkpoints tests #678
  • Add stable diffusion support on INF2 #683
  • Unset omp thread to prevent CLIP model delay #688
  • Update ChunkedBytesSupplier API #692
  • Fixes log file charset issue in management console #693
  • Adds neuronx new feature for generation #694
  • [INF2] adding clip model support #696
  • [plugin] Include djl s3 extension in djl-serving distribution #699
  • [INF2] add bf16 support to SD #700
  • Adds support for streaming Seq2Seq models #698
  • Add SageMaker MCE support #706
  • [INF2] give better room for more tokens #710
  • [INF2] Bump up n positions #713
  • Refactor logic for supporting HF_MODEL_ID to support MME use case #712
  • Support load model from workflow directory #714
  • Add support for seq2seq model loading in HF handler #715
  • Load function from workflow directory #718
  • Add vision components for DeepSpeed and inf2 #725
  • Support pip install in offline mode #729
  • Add --no-index to pip install in offline mode #731
  • Adding llama model support #727
  • Change the .so dependencies for FasterTransformer #734
  • Adds text/plain content-type support #741
  • Skeleton structure for sequence batch scheduler #745
  • Handles torch.cuda.OutOfMemoryError #749
  • Improves model loading logging #750
  • Asynchronous with PublisherBytesSupplier #730
  • Renames env var DDB_TABLE_NAME to SERVING_DDB_TABLE_NAME #753
  • Sets default minWorkers to 1 for GPU python model #755
  • Fixes log message #765
  • Adds more logs to LMI engine detection #766
  • Uses predictable model name for HF model #771
  • Adds parallel loading support for Python engine #770
  • Updates management console UI: file input are not required in form data #773
  • Sets default maxWorkers based on OMP_NUM_THREADS #776
  • Support non-gpu models for huggingface #772
  • Use huggingface standard generation for tnx streaming #778
  • Add trust remote code option #781
  • Handles invalid return type case #790
  • Add application/jsonlines as content-type for streaming #791
  • Fixes trust_remote_code issue #793
  • Add einops for supporting falcon models #792
  • Adds content-type response for DeepSpeed and FasterTransformer handler #797
  • Sets default maxWorkers the same as earlier version #799
  • Add stream generation for huggingface streamer #801
  • Add server side batching #795
  • Add safetensors #808
  • Improvements in AOT UX #787
  • Add pytorch kernel cache default directory #810
  • Improves partition script error message #826
  • Add -XX:-UseContainerSupport flag only for SageMaker #868
  • Move TP detection logic to PyModel from LmiUtils #840
  • Set tensor_parallel_degree property when not specified #847
  • Add workflow dispatch #870
  • Create model level virtualenv #811
  • Refactor createVirtualEnv() #875
  • Add MPI Engine as generic name for distributed environment #882
  • Raise inference failure exceptions in default handlers #883
  • Increase default max_rolling_batch_size to 32 #893
  • Reformat python code #895
  • Add oom unit tests for load and invoke #898
  • Reformat python code #917
  • Refactor output formatter for rolling batch #916
  • Temporary workaround for rolling batch #922
  • Fixes huggingface logging bug #924
  • Adds batch size metric #925
  • Only override minWorkers when tp > 1 #930
  • Set default maxWorkers to 1 if not configured for TP #934
  • Send error message in json format #939
  • Add null check for prefill batch #938
  • Allow overriding truncate parameter in request #957
  • Add revision as part of the model inputs #947
  • Add revision in test #948
  • Add model revision environment variable #949
  • Disconnect client when streaming timed out #941

Docker enhancements

  • Fixes fastertransformer docker file #671
  • update fastertransformers build instruction #722
  • Uses the same convention as tritonserver #738
  • Pin bitsandbytes version to 0.38.1 #754
  • Avoid auto setting OMP_NUM_THREADS for GPU/INF docker images #774
  • Add llama support and integration tests #844
  • Add missing default argument to gpt2-xl sm endpoint test #846
  • Add protobuf to FT and TNX #850
  • Update netty for cve #859
  • Add 4 bits loading #867
  • Add flash attention installation and a few bug fixing #872
  • Allows mpi model load multiple times on the same GPU #894
  • Upgrade fastertransformer HF versions #911
  • Update deepspeed docker to nightly wheel #915
  • Bump bitsandbytes versions #936
  • Bump up bitsandbytes on its fixes #944
  • Update release version and wheels #956
  • Adding back S3Url for backward compatibility in pysdk #838
  • Add neuronx 2.11.0 support #848

Bug Fixes

  • Fix the start gpu bug #709
  • tokenizer bug fixes #732
  • Fixes bug in fastertransformer built-in handler #736
  • Fixes typo in fastertransformer handler #740
  • bump versions for new deepspeed wheel #733
  • Fix bitsandbytes pip install #758
  • Fix the stream generation #794
  • Fixes typo in transformers-neuronx.py #796
  • Fixes device id mismatch issue for multiple GPU case #800
  • Fixes device mismatch issue for streaming token #805
  • Fixes typo in sm workflow inputs #807
  • Fix input_data and device order for streaming #809
  • Fixes retry_threshold bug #812
  • Fixes huggingface device bugs #813
  • Fixes huggingface handler typo #815
  • Fixes invalid device issue #816
  • Fixes WorkerThread name #817
  • partition script: keep 'option' in properties #819
  • Fixes streaming token device mismatch bug #822
  • Extract .py files recursively #821
  • Fixes the device mapping issue if visible devices is set #707
  • Efficiency issue #841
  • Fix the type of max_seq_len #853
  • Remove option prefix when auto setting tensor_parallel_degree in properties #854
  • Fix T5 model not support INT8 issue on handler #856
  • Fix a few pipeline issues #876
  • Fix Kwargs in AutoConfig #885
  • Fix lmi-dist batch handling #887
  • Fix rolling batch type #897
  • Fix PublisherBytesSupplier #905
  • Fix skip_special_tokens flag #907
  • Fix skip special tokens in lmi-dist #908
  • Fix do sample type #920
  • Add current device for tp > 1 scenario on huggingface handler #927
  • Fix for empty tensor input #928
  • Fix boolean kwargs and typo in load_in_4_bit assignment #946
  • Fix some issues with remote code for lora #952
  • Fixes unittest in multi-GPU case #874
  • Fixes rolling batch error handling case #919
  • Fixes MPI engine workers detection #886
  • Fix repeated output for rolling batch #935
  • Fix the default value for rolling batch request parameters #943
  • Fixes logging bug #937
  • Fix remove models in step #834
  • Fix in runtime kv_cache #923

Documentation

  • Adding project diagrams link to architecture.md #742
  • Updates management api document #814
  • OOM management doc #926
  • Updates model configuration document #933
  • Adds s5cmd feature in document #945
  • Adds document about venv per model #951
  • Update docs to djl 0.23.0 #955

CI improvements

  • Fixes unit test for extra data type #673
  • Adds performance testing #558
  • Add small fixes #684
  • Add HuggingFace TGI publish and test pipeline #650
  • Add shared memory arg to docker launch command in README #685
  • Update github-slug-action to v4.4.1 #686
  • Change the bucket for different object #691
  • make performance tests run in parallel #690
  • Add more models to TGI test pipeline #695
  • Upgrade spotbugs to 5.0.14 #704
  • reconfigure performance test time and machines #711
  • Add unit test for empty model store initialization #716
  • Fix no code tests in lmi test suite #717
  • Refactor test code client.py #721
  • Add seq2seq streaming integ test #724
  • [test] Update transformers-neuronx gpt-j-b mode options #723
  • Remove TGI build and test pipeline #735
  • Upgrade jacoco to 0.8.8 to support JDK17+ #739
  • Avoid unit-test hang #744
  • update the wheel to have path fixed #747
  • Add SageMaker integration test #705
  • fix permissions for sm pysdk install script #751
  • SM AOT Tests #756
  • Add mme tests to sagemaker tests #763
  • add triton components in the nightly #767
  • fix typos with get default bucket prefix for sm session #768
  • Upload SM benchmark metrics to cloudwatch #769
  • Fixes integration test #779
  • [python] Adjusts mpi workers based on CUDA_VISIBLE_DEVICES #782
  • Option to run only the lmi tests needed #786
  • remove inf1 support and upgrade some package versions #785
  • Remove hardcoded version in Assertion error #789
  • Add support for testing nightly images in sagemaker endpoint tests #788
  • Check if input is empty #798
  • update gpu memory consumption and adding GPTNeoX, GPTJ #818
  • Remove flan-t5-xxl #829
  • Migrate sagemaker endpoint tests to us-west-2 #837
  • Rolling batch integration tests #866
  • Add lmi dist tests pipeline #869
  • Add deepspeed cpu build in the pipeline #873
  • Give longer time for building DeepSpeed container #880
  • Add lmi-dist integration tests #892
  • Add integration test for lmi-dist AutoModel #904
  • Add llama to performance testing #921
  • Lmi-dist model tests updates #918
  • Adding gpt-neox-20b-quantized to workflow #931
  • Remove oom tests for hf accelerate performance #940
  • Add lora tests for fastertransformer #942

Contributors

@alexkarezin
@frankfliu
@LanKing
@siddvenk
@xyang16
@tosterberg
@maaquib
@sindhuvahinis
@KexinFeng
@rohithkrn
@zachgk

Full Changelog: v0.22.1...v0.23.0