Enable Pytorch to share same memory pool as RMM via cli #1392
Conversation
Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
Looking good @VibhuJawa, left some minor requests on organization, but we should be good to go afterwards.
`dask_cuda/local_cuda_cluster.py` (outdated)

```python
if isinstance(rmm_allocator_external_lib_list, str):
    rmm_allocator_external_lib_list = []
```
So I added a type check here. The reason is that I expect lazy users like me to pass in the same config that they do in the CLI. For example, this is how the CLI looks right now:

```shell
dask-cuda-worker "tcp://10.33.227.161:8786" --set-rmm-allocator-for-libs "torch"
```

With the updated behavior we complain loudly (see example below):
```
cluster = LocalCUDACluster(rmm_allocator_external_lib_list="torch")
```

```
ValueError                                Traceback (most recent call last)
Cell In[2], line 1
----> 1 cluster = LocalCUDACluster(rmm_allocator_external_lib_list="torch")

File ~/dask-cuda/dask_cuda/local_cuda_cluster.py:275, in LocalCUDACluster.__init__(self, CUDA_VISIBLE_DEVICES, n_workers, threads_per_worker, memory_limit, device_memory_limit, enable_cudf_spill, cudf_spill_stats, data, local_directory, shared_filesystem, protocol, enable_tcp_over_ucx, enable_infiniband, enable_nvlink, enable_rdmacm, rmm_pool_size, rmm_maximum_pool_size, rmm_managed_memory, rmm_async, rmm_allocator_external_lib_list, rmm_release_threshold, rmm_log_directory, rmm_track_allocations, jit_unspill, log_spilling, worker_class, pre_import, **kwargs)
    272 raise ValueError("Number of workers cannot be less than 1.")

    274 if rmm_allocator_external_lib_list is not None and not isinstance(rmm_allocator_external_lib_list, list):
--> 275     raise ValueError(
    276         "rmm_allocator_external_lib_list must be a list of strings. "
    277         "Valid examples: ['torch'], ['cupy'], or ['torch', 'cupy']. "
    278         f"Received: {type(rmm_allocator_external_lib_list)} "
    279         f"with value: {rmm_allocator_external_lib_list}"
    280     )
    282 # Set nthreads=1 when parsing mem_limit since it only depends on n_workers
    283 logger = logging.getLogger(__name__)

ValueError: rmm_allocator_external_lib_list must be a list of strings. Valid examples: ['torch'], ['cupy'], or ['torch', 'cupy']. Received: <class 'str'> with value: torch
```
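The check in the traceback above can be sketched as a standalone helper (the function name is hypothetical; the condition and error message mirror the constructor code shown):

```python
def validate_lib_list(rmm_allocator_external_lib_list):
    # Hypothetical standalone version of the constructor check above:
    # anything other than None or a list is rejected with a loud error.
    if rmm_allocator_external_lib_list is not None and not isinstance(
        rmm_allocator_external_lib_list, list
    ):
        raise ValueError(
            "rmm_allocator_external_lib_list must be a list of strings. "
            "Valid examples: ['torch'], ['cupy'], or ['torch', 'cupy']. "
            f"Received: {type(rmm_allocator_external_lib_list)} "
            f"with value: {rmm_allocator_external_lib_list}"
        )
    return rmm_allocator_external_lib_list
```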
Makes sense, but in that case I think the amount of work/code to support a string is relatively similar; instead of raising the exception, should we just support passing a comma-separated string list as well?
Added it here: 2517874.
Let me know if you want me to change anything. Thanks for the suggestion; I agree it made sense.
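The agreed behavior, accepting either a list or a comma-separated string, could look roughly like this (a minimal sketch with a hypothetical helper name, not the actual dask-cuda code):

```python
def normalize_lib_list(value):
    # Hypothetical helper illustrating the agreed behavior: accept either
    # a list of library names or a comma-separated string like "torch,cupy",
    # and always return a list (or None).
    if value is None:
        return None
    if isinstance(value, str):
        return [item.strip() for item in value.split(",") if item.strip()]
    return list(value)
```

This lets `rmm_allocator_external_lib_list="torch"` and `rmm_allocator_external_lib_list=["torch"]` behave the same, matching what the CLI flag accepts.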
Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
LGTM, thanks @VibhuJawa !
/merge
This PR closes: #1281

Usage example:

Verify working:

Without it, it's `native`.

Context: This helps NeMo-Curator make more stable use of PyTorch + dask-cuda.

CC: @pentschev.
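For context, outside of dask-cuda the same sharing can be set up manually with RMM's pluggable allocator for PyTorch. A minimal sketch of that pairing, assuming a CUDA-capable machine with `rmm` and `torch` installed (this may not match exactly what dask-cuda configures on each worker):

```python
import rmm
import torch
from rmm.allocators.torch import rmm_torch_allocator

# Create an RMM memory pool, then route PyTorch's CUDA allocations
# through it so both libraries draw from the same pool instead of
# competing for device memory. Requires a GPU; cannot run CPU-only.
rmm.reinitialize(pool_allocator=True)
torch.cuda.memory.change_current_allocator(rmm_torch_allocator)
```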