Distributed Setup is taking up a huge amount of memory #1402

Open
bhuvan777 opened this issue Dec 7, 2024 · 1 comment
Labels: Distributed (Issues related to all things distributed), need-user-input (The issue needs more information from the reporter before moving forward)

Comments

@bhuvan777

Hello,

I am running a distributed setup to perform inference with an 8-billion-parameter Llama model. I expected the workload to fit on two machines (each with 16 GB of memory), but I had to use four machines to avoid running out of memory. Even after removing the KV-cache initialization, memory usage for some passes still exceeded 9 GB per machine.
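For context, here is the rough back-of-envelope sizing I had in mind (assuming bf16/fp16 weights at 2 bytes per parameter; the dtype torchchat actually loads may differ):

```python
# Rough sizing estimate: 8B parameters assumed to be stored in bf16 (2 bytes each).
params = 8e9
bytes_per_param = 2                      # bf16/fp16; fp32 would double this
total_weight_gib = params * bytes_per_param / 1024**3
print(f"total weights: {total_weight_gib:.1f} GiB")          # ~14.9 GiB

for machines in (2, 4):
    per_machine = total_weight_gib / machines
    print(f"{machines} machines -> ~{per_machine:.1f} GiB of weights each")
    # 2 machines -> ~7.5 GiB each, which is why I expected 16 GB per machine to be
    # enough, before accounting for activations, the KV cache, or any temporary
    # full-size copies made while loading/sharding the checkpoint.
```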

Could you please help identify potential reasons for this behavior, or let me know if there is something I might be overlooking in the setup?

Thank you!

@Jack-Khuu added the Distributed label on Dec 9, 2024
@Jack-Khuu (Contributor)

Hi @bhuvan777, can you provide a repro?

Are you using torchchat's distributed flag or handling the distributed aspect locally?
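If it helps, something like the sketch below (not torchchat's own instrumentation; it assumes your launch script can import torch and psutil) would let you attach per-rank memory numbers to the repro:

```python
import os
import psutil                      # assumption: available in your environment
import torch
import torch.distributed as dist

def log_memory(tag: str) -> None:
    """Print this rank's host RSS (and CUDA peak allocation, if applicable)."""
    rank = dist.get_rank() if dist.is_initialized() else int(os.environ.get("RANK", 0))
    rss_gib = psutil.Process().memory_info().rss / 1024**3
    line = f"[rank {rank}] {tag}: host RSS {rss_gib:.2f} GiB"
    if torch.cuda.is_available():
        line += f", CUDA peak {torch.cuda.max_memory_allocated() / 1024**3:.2f} GiB"
    print(line, flush=True)

# Example: call around the suspect phases of the run
# log_memory("after model load")
# log_memory("after first forward pass")
```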

@Jack-Khuu added the need-user-input label on Dec 9, 2024