[FEA] Consider using different default values for cluster configurations #348
Comments
I think having some of these defaults makes sense. In dask core
In [1]: import pynvml
In [2]: pynvml.nvmlInit()
In [3]: handle = pynvml.nvmlDeviceGetHandleByIndex(0)
In [4]: info = pynvml.nvmlDeviceGetMemoryInfo(handle)
In [5]: info.total
Out[6]: 34089730048
maybe
@mt-jones may remember why we had to set
While I understand how changing such defaults makes sense for TPCx-BB, I'm not so sure it makes sense for everyone. I don't think it will be difficult to find people running dask-cuda alongside other applications that also require memory (either device or host), and more aggressive defaults could be annoying for those users. Think, for example, of someone running this on a workstation where one of the GPUs also drives the display. That's not to say I'm totally against the idea, but I'm a bit concerned about the broader usability of dask-cuda. The

As for the communication variables, I think they were not for the workers (or perhaps not only for the workers) but for the scheduler. Am I mistaken? Changing the defaults for the scheduler would have to go into distributed upstream.

As for workers, if we change those values we should probably check what numbers would really be appropriate. I have a feeling they were just set arbitrarily high to make things work. Was that not the case? And when we scale, would we have to scale the numbers too?
I was just thinking after writing the comment above: maybe we should have some sort of "default recipes" for different use cases? E.g., the TPCx-BB case could use a "performance recipe", while today's defaults could become something like a "conservative recipe", and so on. I think this would reduce complexity across use cases while still providing defaults that work almost anywhere, with an easy switch between them.
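To make the "default recipes" idea concrete, here is a minimal sketch of what a named-recipe lookup could look like. Everything in it is hypothetical: the `Recipe` class, the `RECIPES` table, the `get_recipe` helper, and the field names are illustrative and not part of dask-cuda. The "performance" fractions are derived from the 25GB/30GB figures quoted for a 32GB card in the issue description, the timeout value is a placeholder, and `None` stands for "leave the library default untouched".

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Recipe:
    """Hypothetical bundle of cluster defaults; None means 'keep the library default'."""
    device_memory_limit_fraction: Optional[float]
    rmm_pool_fraction: Optional[float]
    comm_connect_timeout: Optional[str]


# Hypothetical recipe table. "conservative" changes nothing; "performance"
# uses the proportions suggested in the issue (25GB and 30GB on a 32GB card)
# and a placeholder timeout.
RECIPES = {
    "conservative": Recipe(None, None, None),
    "performance": Recipe(25 / 32, 30 / 32, "100s"),
}


def get_recipe(name: str = "conservative", **overrides) -> Recipe:
    """Look up a recipe by name and apply any per-field overrides."""
    try:
        base = RECIPES[name]
    except KeyError:
        known = ", ".join(sorted(RECIPES))
        raise ValueError(f"unknown recipe {name!r}; expected one of: {known}") from None
    return replace(base, **overrides)


if __name__ == "__main__":
    print(get_recipe())
    print(get_recipe("performance", rmm_pool_fraction=0.9))
```

Keeping "conservative" as the default means existing users would see no behavior change, and anyone who wants the benchmark-style settings opts in by name while still being able to override individual fields.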
I forgot to mention another related (potentially duplicate) issue: #334.
This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.
In our gpu-bdb benchmarking, we found we needed to configure quite a few bash environment variables when setting up a dask-cuda cluster for optimal performance.
For example, if you're not using UCX, GPU-worker-to-GPU-worker communication over TCP is likely to have higher latency at the very least, and may be slower overall. To avoid TCP connection failures we needed to reconfigure Dask's "COMM" settings.
I suspect a great many dask-cuda users will be using TCP and not UCX, so would benefit from dask-cuda automatically re-configuring Dask's TCP defaults.
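As a rough illustration of what reconfiguring the "COMM" settings through environment variables involves, the sketch below maps dotted Dask config keys to the `DASK_`-prefixed variable names Dask reads, where a double underscore marks nesting. The three keys are existing `distributed.comm` settings, but the values are placeholders chosen for illustration, not the ones used in the benchmark, and the helper names `to_dask_env_var` and `build_env` are hypothetical.

```python
def to_dask_env_var(config_key: str) -> str:
    """Translate a dotted Dask config key into its environment-variable name.

    Dask picks up variables prefixed with DASK_ and treats a double
    underscore as the separator between nested config levels.
    """
    return "DASK_" + config_key.upper().replace(".", "__")


# Placeholder values for illustration only; tune them for your cluster.
TCP_COMM_OVERRIDES = {
    "distributed.comm.timeouts.connect": "100s",
    "distributed.comm.timeouts.tcp": "100s",
    "distributed.comm.retry.count": "5",
}


def build_env(overrides: dict) -> dict:
    """Return a mapping of environment-variable names to string values."""
    return {to_dask_env_var(key): str(value) for key, value in overrides.items()}


if __name__ == "__main__":
    # Emit lines that can be pasted into a bash launch script.
    for name, value in build_env(TCP_COMM_OVERRIDES).items():
        print(f"export {name}={value}")
```

The variables have to be set in the environment of each process that should pick them up (scheduler, workers, or both) before that process starts.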
We also need to set --memory-limit, --device-memory-limit, and --rmm-pool-size values:

It would be nice if dask-cuda included logic for detecting GPU memory per card and setting default values for the device memory limit and RMM pool size. As a starting point, for a 32GB card, a 25GB device memory limit with a 30GB pool size works well for the 30 tpcx-bb queries. I believe that's a decent representation of typical workflows, and those values could be scaled proportionally for other card memory sizes as well.
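The proportional scaling described above could be sketched roughly as follows. This is not dask-cuda's implementation: the function names are hypothetical, the 25/32 and 30/32 ratios come straight from the 32GB example, and rounding down to a multiple of 256 bytes is an assumption about pool-size alignment. The pynvml calls are the same ones shown in the comment thread.

```python
# Ratios taken from the 32GB example: 25GB device limit, 30GB RMM pool.
DEVICE_LIMIT_NUM, POOL_NUM, DENOM = 25, 30, 32

# Assumed alignment for the pool size; adjust if your allocator differs.
ALIGNMENT = 256


def round_down(n_bytes: int, multiple: int = ALIGNMENT) -> int:
    """Round n_bytes down to the nearest multiple of `multiple`."""
    return (n_bytes // multiple) * multiple


def proportional_defaults(total_bytes: int) -> dict:
    """Scale the 32GB-card suggestion to a card with `total_bytes` of memory."""
    return {
        "device_memory_limit": round_down(total_bytes * DEVICE_LIMIT_NUM // DENOM),
        "rmm_pool_size": round_down(total_bytes * POOL_NUM // DENOM),
    }


def detect_total_memory(index: int = 0) -> int:
    """Query total device memory in bytes via pynvml (requires an NVIDIA GPU)."""
    import pynvml  # imported lazily so the pure arithmetic works without it

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        return pynvml.nvmlDeviceGetMemoryInfo(handle).total
    finally:
        pynvml.nvmlShutdown()


if __name__ == "__main__":
    # Uses the total reported in the thread instead of querying a real GPU.
    print(proportional_defaults(34089730048))
```

For the 34089730048-byte total reported in the thread, this yields a device memory limit of 26632601600 bytes and a pool size of 31959121920 bytes. Whether those ratios suit shared or display-attached GPUs is exactly the concern raised in the comments, so they are better treated as an opt-in starting point than as universal defaults.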