Hi, I am a student interested in pipeline parallelism for LLM inference. I have successfully run the GPT example from the PyTorch documentation, so I wanted to modify it to run the Llama-2 model on a single server with multiple GPUs. Here is my script:
# Copyright (c) Meta Platforms, Inc. and affiliates
# Minimum effort to run this example:
# $ torchrun --nproc-per-node 4 pipeline_inference.py
import argparse
import os
import torch
import torch.distributed as dist
from torch.distributed.pipelining import pipeline, PipelineStage, ScheduleGPipe, SplitPoint
from transformers import AutoModelForCausalLM, AutoTokenizer
def run(args):
    # Grab the model
    llama = AutoModelForCausalLM.from_pretrained(
        "/zt/model/Llama-2-7b-chat-hf", low_cpu_mem_usage=True, local_files_only=True
    )
    # print(llama)
    tokenizer = AutoTokenizer.from_pretrained("/zt/model/Llama-2-7b-chat-hf", local_files_only=True)
    tokenizer.pad_token = tokenizer.eos_token

    mb_prompts = (
        "How do you", "I like to",
    )  # microbatch size = 2

    llama.to(args.device).eval()

    # Cut model by equal number of layers per rank
    layers_per_rank = llama.config.num_hidden_layers // args.world_size
    print(f"layers_per_rank = {layers_per_rank}")
    split_spec = {
        f"model.layers.{i * layers_per_rank}": SplitPoint.BEGINNING
        for i in range(1, args.world_size)
    }
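    # With Llama-2-7B (32 decoder layers) and world_size=4, this places split points
    # before model.layers.8, model.layers.16 and model.layers.24, i.e. 8 layers per stage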
    # Create a pipeline representation from the model
    mb_inputs = tokenizer(mb_prompts, return_tensors="pt", padding=True).to(args.device)
    pipe = pipeline(
        module=llama,
        mb_args=(mb_inputs["input_ids"],),
        split_spec=split_spec,
    )

    # Create a pipeline stage for this rank
    stage = pipe.build_stage(args.rank, device=args.device)

    # Run-time inputs
    full_batch_prompts = (
        "How do you", "I like to", "Can I help", "You need to",
        "The weather is", "I found a", "What is your", "You are so",
    )  # full batch size = 8
    inputs = tokenizer(full_batch_prompts, return_tensors="pt", padding=True).to(args.device)

    # Attach to a schedule
    # number of microbatches = 8 // 2 = 4
    num_mbs = 4
    schedule = ScheduleGPipe(stage, num_mbs)

    # Run
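    # Only the first stage feeds the real input_ids; later stages receive activations
    # from the previous stage, and only the last stage returns logits from step()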
    if args.rank == 0:
        tmp = inputs["input_ids"]
    else:
        tmp = None
    output = schedule.step(tmp)

    # Decode
    if output is not None:
        next_token_logits = output[0][:, -1, :]
        next_token = torch.argmax(next_token_logits, dim=-1)
        print(tokenizer.batch_decode(next_token))
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--world_size', type=int, default=int(os.getenv("WORLD_SIZE", 4)))
    parser.add_argument('--rank', type=int, default=int(os.getenv("RANK", -1)))
    parser.add_argument('--master_addr', type=str, default=os.getenv('MASTER_ADDR', 'localhost'))
    parser.add_argument('--master_port', type=str, default=os.getenv('MASTER_PORT', '29500'))
    parser.add_argument('--schedule', type=str, default="FillDrain")  # this may be related to the LLM scheduling policy
    parser.add_argument('--cuda', type=int, default=int(torch.cuda.is_available()))
    parser.add_argument("--chunks", type=int, default=4)
    parser.add_argument('--batch_size', type=int, default=4)
    parser.add_argument('--batches', type=int, default=1)
    args = parser.parse_args()

    if args.cuda:
        # For multi-GPU runs, map each process to an available GPU
        dev_id = args.rank % torch.cuda.device_count()
        args.device = torch.device(f"cuda:{dev_id}")
    else:
        args.device = torch.device("cpu")

    # Init process group
    backend = "nccl" if args.cuda else "gloo"
    dist.init_process_group(
        backend=backend,
        rank=args.rank,
        world_size=args.world_size,
    )

    run(args)

    # Destroy the process group
    dist.destroy_process_group()
The idea of my script is simply to combine the GPT and Llama-2 examples from the PyTorch documentation, but it ran into the bug below:
(pippy) root@678c7278cb2d:/zt/code/my_dev# torchrun --nproc-per-node 4 pipeline_inference.py
It seems that the NCCL communication times out, but I really did run the GPT example successfully. How can I fix this? Thank you!!
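One thing I am not sure about: I compute args.device for each rank but never call torch.cuda.set_device before creating the NCCL process group. A rough sketch of the variant I have in mind (reusing os, torch, dist and args from the script above; not tested, just an idea):

# Sketch: pin each rank to its GPU before initializing NCCL
dev_id = args.rank % torch.cuda.device_count()
args.device = torch.device(f"cuda:{dev_id}")
torch.cuda.set_device(args.device)  # make this process's default CUDA device explicit
dist.init_process_group(backend="nccl", rank=args.rank, world_size=args.world_size)

Would that kind of change matter here, or is the timeout caused by something else?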