Parallel linear Lora #1092
Great, thanks for the PR. Before going through a full review, I have some points/questions: Could you please provide a bit of more context what users can expect when using this functionality?
It would be great if we could find a way to avoid that, I can also check later if I have any ideas. Finally, it would be great to have unit tests for the new feature, or at the very least an example to see it in action.
For LLaMA and other large models, plain DeepSpeed becomes hard to use once the model is too large. We currently train on our cluster with our own modified Megatron-DeepSpeed framework, so the model is built from Megatron's ParallelLinear layers. We now want to fine-tune with LoRA, so we want to extend LoRA to support ParallelLinear.
OK, I use this script to fine-tune LLaMA 7B on Alpaca with our own Megatron-DeepSpeed framework:
export CUDA_VISIBLE_DEVICES_=0,1,2,3
export ASCEND_RT_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES_}
ASCEND_RT_VISIBLE_DEVICES_ARRAY=(${CUDA_VISIBLE_DEVICES_//,/ })
echo "${ASCEND_RT_VISIBLE_DEVICES_ARRAY[@]}"
# the number of parameters is not aligned
export LD_LIBRARY_PATH=/usr/local/lib:/home/anaconda3/lib:$LD_LIBRARY_PATH
export HCCL_CONNECT_TIMEOUT=1200
export COMBINED_ENABLE=1
source /home/xxx/Ascend/set_env.sh
GPUS_PER_NODE=${#ASCEND_RT_VISIBLE_DEVICES_ARRAY[@]}
echo $GPUS_PER_NODE
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6001
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
TP=1
PP=1
DATA_PATH=xxx
LOAD_CHECKPOINT_PATH=xxx
SAVE_CHECKPOINT_PATH=xxx
TOKENIZER_PATH=xxx
DS_CONFIG=deepspeed_config_13B_1.json
ZERO_STAGE=2
MICRO_BATCH=4
GRADIENT_ACCUMULATION_STEP=8
GLOBAL_BATCH=$(($MICRO_BATCH * $GRADIENT_ACCUMULATION_STEP * $WORLD_SIZE))
EPOCH=2
TRAIN_ITERS=$((52000 / $GLOBAL_BATCH * $EPOCH))
echo $TRAIN_ITERS
SAVE_INTERVAL=$(($TRAIN_ITERS / 2))
echo $SAVE_INTERVAL
export HCCL_OP_BASE_FFTS_MODE_ENABLE=TRUE
cat <<EOT > $DS_CONFIG
{
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 8,
"hysteresis": 2,
"min_loss_scale": 1
},
"optimizer": {
"type": "Adam"
},
"zero_optimization": {
"stage": $ZERO_STAGE,
"allgather_partitions": true,
"allgather_bucket_size": 1e8,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 1e8,
"contiguous_gradients": true
},
"gradient_accumulation_steps": ${GRADIENT_ACCUMULATION_STEP},
"train_batch_size": $GLOBAL_BATCH,
"train_micro_batch_size_per_gpu":$MICRO_BATCH,
"zero_allow_untested_optimizer": true
}
EOT
ds_args=""
ds_args=" --deepspeed ${ds_args}"
ds_args=" --no-pipeline-parallel ${ds_args}"
ds_args=" --deepspeed_config=$DS_CONFIG ${ds_args}"
ds_args=" --zero-stage=$ZERO_STAGE ${ds_args}"
ds_args=" --deepspeed-activation-checkpointing ${ds_args}"
#deepspeed --master_port ${MASTER_PORT} --include localhost:${CUDA_VISIBLE_DEVICES_} pretrain_llama.py \
deepspeed --master_port ${MASTER_PORT} pretrain_llama.py \
--DDP-impl local \
--no-contiguous-buffers-in-local-ddp \
--tensor-model-parallel-size ${TP} \
--pipeline-model-parallel-size ${PP} \
--num-layers 32 \
--hidden-size 4096 \
--ffn-hidden-size 11008 \
--num-attention-heads 32 \
--micro-batch-size $MICRO_BATCH \
--global-batch-size $GLOBAL_BATCH \
--seq-length 1024 \
--max-position-embeddings 2048 \
--train-iters ${TRAIN_ITERS} \
--lr-decay-iters ${TRAIN_ITERS} \
--save $SAVE_CHECKPOINT_PATH \
--load $LOAD_CHECKPOINT_PATH \
--data-path $DATA_PATH \
--tokenizer-name-or-path $TOKENIZER_PATH \
--tokenizer-not-use-fast \
--data-impl mmap \
--split 949,50,1 \
--distributed-backend nccl \
--lr 2e-5 \
--lr-decay-style cosine \
--min-lr 0 \
--weight-decay 0. \
--clip-grad 1.0 \
--lr-warmup-iters 100 \
--checkpoint-activations \
--log-interval 1 \
--save-interval ${SAVE_INTERVAL} \
--eval-interval 1000 \
--eval-iters 10 \
--use-cpu-initialization \
--lora-target-modules query_key_value dense gate_proj up_proj down_proj \
--lora-r 16 \
--lora-alpha 32 \
--is-instruction-dataset \
--seed 42 \
$ds_args \
--optimizer fused_adam \
--fp16 | tee logs/train_7B_deepspeed.log
We compared it with the same model fine-tuned with LoRA on torch.nn.Linear layers, and the absolute difference in loss is less than 0.001. This is the inference result of our model:
In fact, if you have the environment, you can use the Megatron or Megatron-DeepSpeed framework to run a small model and apply LoRA to the model after the get_model() call. Of course, both repos currently need a change to the parent class initialization method in ParallelLinear. The PRs for the two repositories are as follows:
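For reference, the batch-size and iteration arithmetic in the script above can be reproduced in a few lines of Python. This is only a sketch: the function name `training_schedule` is made up for illustration, the world size of 4 assumes the four devices listed in CUDA_VISIBLE_DEVICES_ on a single node, and 52000 is the sample count hard-coded in the script.

```python
def training_schedule(micro_batch, grad_accum_steps, world_size, num_samples, epochs):
    """Mirror the shell arithmetic of the script; `//` matches bash integer division."""
    # GLOBAL_BATCH = MICRO_BATCH * GRADIENT_ACCUMULATION_STEP * WORLD_SIZE
    global_batch = micro_batch * grad_accum_steps * world_size
    # Bash evaluates 52000 / GLOBAL_BATCH * EPOCH left to right, truncating the division.
    train_iters = num_samples // global_batch * epochs
    # SAVE_INTERVAL = TRAIN_ITERS / 2
    save_interval = train_iters // 2
    return global_batch, train_iters, save_interval


# Assumed values: 4 devices on 1 node, the other numbers are taken from the script.
global_batch, train_iters, save_interval = training_schedule(
    micro_batch=4, grad_accum_steps=8, world_size=4, num_samples=52000, epochs=2
)
print(global_batch, train_iters, save_interval)  # 128 812 406
```

Because 52000 is not a multiple of 128, the truncating division means the run covers slightly fewer than two full passes over the data.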
@BenjaminBossan You can run it from the peft directory: pytest tests/test_lora_megatron.py
After applying these PRs in my local environment, the tests all passed:
@BenjaminBossan @pacman100 Sorry to bother you, but can you help me review it?
Thanks for making some changes to integrate this better with LoRA and adding tests. I still think we need to find a better way to configure how this feature can be used, I made a suggestion in the comments. What do you think?
Furthermore, could you please run make style?
Hi @zhangsheng377 I wanted to inform you that we merged #1106, which is a substantial refactor of PEFT and created some merge conflicts for your PR. The most notable change from that PR is that we refactored the adapter layers to take a |
Great, I think I need this base_layer. I'll take a closer look when I have time. (After all, it's already 9 pm here.)
@zhangsheng377 Thanks a lot for your continued work on this PR. The usability looks much nicer now! There is still a merge conflict left in |
Force-pushed from d7d8fac to 0ae52fe.
@BenjaminBossan Ha, thank you for your concern. In fact, I finished resolving the conflicts yesterday, but the changes on the main branch were quite large, so I trained the LLaMA model again to compare the loss and make sure there were no problems before pushing the code.
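A loss comparison like the one described here could be automated roughly as follows. This is a minimal sketch, not the check that was actually run: the helper names are hypothetical, the 0.001 tolerance is the figure quoted earlier in this thread, and the loss values are placeholders rather than real training results.

```python
def max_abs_diff(reference_losses, candidate_losses):
    """Largest per-step absolute difference between two loss curves."""
    if len(reference_losses) != len(candidate_losses):
        raise ValueError("loss curves must cover the same number of steps")
    return max(abs(r - c) for r, c in zip(reference_losses, candidate_losses))


def losses_match(reference_losses, candidate_losses, tol=1e-3):
    """True if every step agrees to within `tol`."""
    return max_abs_diff(reference_losses, candidate_losses) < tol


# Placeholder numbers only, standing in for per-step losses parsed from two logs
# (e.g. a torch.nn.Linear LoRA run and a ParallelLinear LoRA run).
baseline = [2.0, 1.5, 1.25]
parallel = [2.0004, 1.4995, 1.2502]
print(losses_match(baseline, parallel))
```

Comparing the maximum difference rather than the mean keeps a single diverging step from being averaged away.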
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. |
@BenjaminBossan Sorry, I forgot to run make style. The code has now been reformatted, so please trigger the workflows again. PS: Happy Thanksgiving.
@BenjaminBossan Sorry, I had neglected to install ruff this morning, so I mistakenly thought the style check was no longer a problem.
Thank you so much for updating the PR, it looks very good now. I'm not knowledgeable on Megatron, so my review focuses more on the integration with PEFT.
As you can see, I left a couple of comments, please take a look. Mostly, I think we need to make some changes to the config to make saving and loading possible. Also, I think we can simplify the class structure. Finally, I have some questions concerning the forward method.
I could not reproduce the config.py problem locally. If it continues to report errors, please help me find out where the problem is.
The issue is that the from types import should come before the from typing import.
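To illustrate the ordering, here is a small sketch. It assumes the style check sorts `from` imports alphabetically by module name, as isort-style rules usually do; the imported names are arbitrary examples and not the ones used in config.py.

```python
# "types" sorts before "typing" because 'e' < 'i' at the first differing character.
from types import SimpleNamespace
from typing import Optional

# The same ordering falls out of a plain alphabetical sort of the module names.
modules = ["typing", "types"]
print(sorted(modules))  # ['types', 'typing']
```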
PS: Happy Thanksgiving.
Thanks, but where I am, there is no Thanksgiving :) In case you celebrate it, happy Thanksgiving to you too.
I forgot to mention, it would be really great if we could have an example or even better an entry in the docs that shows how to use this feature, maybe highlighting why users should use it. That way, the feature is much easier to discover.
Thank you @zhangsheng377 for working on adding support for PEFT in Megatron. Overall, the PR is in great shape to be merged! 🚀
Thanks so much, I think the implementation is much cleaner now and easier to understand. From my point of view, there are only a few details left, otherwise we should be good to merge.
Force-pushed from f48fcc2 to e35d46d.
Thanks a lot for this PR. It looks pretty good now, I have a nit but it's no big deal.
Before merging, I just want to discuss the newly added tests. As is, they would not be run by the Github CI at any point, so in theory bugs could be introduced without us noticing. I have no personal experience, but the requirements for running megatron seem to be quite high, so I'm not sure if we could make it run on our GPU runners. I wonder if there is a way we can still make it work.
Well, I only used one GPU to run the newly added unit tests locally, and the resource requirements are not large. Alternatively, you should be able to install Megatron-DeepSpeed; after all, it is the base of our current heavily modified fork.
Unfortunately, I did not manage to successfully build APEX and it seems I'm not the only one, judging from all the open issues. Therefore, I couldn't test it locally. If you have a recipe to get this all to run, which we could use for a CI job, that would be great.
Heads up, there is a small merge conflict, should be easy to fix.
Yes, it's done.
My apex is version 0.1, and it seems that it should be installed directly via pip.
Could you tell me how to do that? The What I did is follow the instructions here. Several users reported issues with this. Some suggested solutions included checking out specific tags or commits, but none of those I tried worked for me. |
I think I did not install it from the source code. You can try |
I think this will install the wrong package. When checking on PyPI, the description says:
|
(xx) [root@localhost peft]# pip install apex
Maybe you can change the index-url, or specify a version?
Interesting. The host is unfortunately insecure (http) and I don't know who this is. Therefore, we cannot use this index for the CI, as it could start hosting malicious code at any point in the future. Anyway, even if we cannot find a way right now to make the tests work, we can still proceed. Hopefully, we can find a better way in the future, maybe Nvidia manages to provide a package that can be reliably installed soon.
@zhangsheng377 Thanks so much for this wonderful PR.
Adds option to use Megatron's ColumnParallelLinear and RowParallelLinear for LoRA linear layers, leading to improved performance when using LoRA with Megatron.
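As a toy illustration of why LoRA's low-rank update can be sharded the way column-parallel and row-parallel layers are, the sketch below simulates both splits in plain Python on a single process. It is not PEFT's or Megatron's implementation: the matrices, the choice of which LoRA matrix is sharded in each case, and all function names are assumptions made for the example, and the alpha/r scaling is left out.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


# Toy LoRA factors with rank 2: A maps 4 inputs to 2, B maps 2 to 4 outputs.
A = [[1, 0, 2, 1], [0, 1, -1, 2]]
B = [[1, 2], [0, 1], [3, 0], [-1, 1]]


def lora_delta(x):
    """Unsharded reference: B @ (A @ x)."""
    return matvec(B, matvec(A, x))


def column_parallel_delta(x, world_size):
    """Shard B by output rows; each rank produces a slice of the output."""
    hidden = matvec(A, x)  # A is replicated on every rank in this sketch
    rows_per_rank = len(B) // world_size
    gathered = []
    for rank in range(world_size):
        b_shard = B[rank * rows_per_rank:(rank + 1) * rows_per_rank]
        gathered.extend(matvec(b_shard, hidden))  # stands in for an all-gather
    return gathered


def row_parallel_delta(x, world_size):
    """Shard A by input columns; partial results are summed across ranks."""
    cols_per_rank = len(x) // world_size
    hidden = [0] * len(A)
    for rank in range(world_size):
        lo, hi = rank * cols_per_rank, (rank + 1) * cols_per_rank
        a_shard = [row[lo:hi] for row in A]
        partial = matvec(a_shard, x[lo:hi])
        hidden = [h + p for h, p in zip(hidden, partial)]  # stands in for an all-reduce
    return matvec(B, hidden)


x = [1, 2, 3, 4]
print(lora_delta(x))
print(column_parallel_delta(x, world_size=2))
print(row_parallel_delta(x, world_size=2))
```

All three calls produce the same vector, which is the property a parallel LoRA layer relies on: sharding changes where the work happens, not the result.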
@zhangsheng377 So can we now use the latest https://github.com/microsoft/Megatron-DeepSpeed with this version of PEFT?
@thincal Yes, you can.
Thanks for the information. Another question: when Megatron-DeepSpeed is used, what format do the input model and the resulting model have in this LoRA fine-tuning with PEFT? @zhangsheng377
@thincal In Megatron-DeepSpeed, the base model is Megatron's model; you can see my unit tests in this PR.
@zhangsheng377 Thank you for your great work. I'd like to try your code, so could you show me the |
Haha, You will need to adapt the Megatron code to PEFT. For example, modify the By the way, '--lora-target-modules' is a parameter I added to Megatron myself, and you can use your own adaptation process. Or, you can see: https://gitee.com/ascend/MindSpeed |
@zhangsheng377 Thank you. I added the conversion to the LoRA model at the end of |
Things to note when loading the model: 1. The original HF model may not have LoRA parameters. 2. If you are converting the LoRA adapter, you need to write the conversion code yourself.
I understand that there is no standard way to load weights in LoRA training. I will try converting it now. Thank you.
We implemented LoRA for Megatron's distributed layers ColumnParallelLinear and RowParallelLinear.
Because of the particular way Megatron creates distributed layers, the required Megatron information needs to be injected before applying LoRA:
It has been verified on the megatron and megatron-deepspeed frameworks.
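The injection step described above might look roughly like the following. This is a hedged sketch: it assumes the feature is exposed through `megatron_config` and `megatron_core` arguments on PEFT's `LoraConfig`, which this page does not show; the helper names are hypothetical; the rank, alpha and target modules are taken from the training script earlier in the thread; and the placeholder dict merely stands in for the real transformer config that Megatron would provide.

```python
def build_lora_kwargs(megatron_config, megatron_core="megatron.core"):
    """Collect LoRA settings plus the Megatron information to inject (hypothetical helper)."""
    return {
        "r": 16,            # --lora-r in the script
        "lora_alpha": 32,   # --lora-alpha in the script
        "target_modules": ["query_key_value", "dense", "gate_proj", "up_proj", "down_proj"],
        # Assumed argument names for the Megatron-specific fields:
        "megatron_config": megatron_config,
        "megatron_core": megatron_core,
    }


def make_lora_config(kwargs):
    """Build the actual config; needs peft installed, so the import is deferred."""
    from peft import LoraConfig

    return LoraConfig(**kwargs)


# Placeholder standing in for the transformer config obtained from Megatron.
placeholder_megatron_config = {"tensor_model_parallel_size": 1}
lora_kwargs = build_lora_kwargs(placeholder_megatron_config)
print(sorted(lora_kwargs))
```

In a real training script, the resulting config would then be applied to the model returned by Megatron's get_model(), as mentioned earlier in the conversation.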