-
Notifications
You must be signed in to change notification settings - Fork 410
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
adds dynamic shared mem allocation to cuda kernels #413
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,24 @@ | ||
#include <cuda_runtime.h> | ||
#include <cstdio> | ||
|
||
int main() { | ||
int device = 0; | ||
cudaDeviceProp prop; | ||
cudaGetDeviceProperties(&prop, device); | ||
|
||
size_t sharedMemPerBlock = prop.sharedMemPerBlock; | ||
int maxSharedMemPerBlockOptin; | ||
cudaDeviceGetAttribute(&maxSharedMemPerBlockOptin, cudaDevAttrMaxSharedMemoryPerBlockOptin, device); | ||
|
||
size_t maxSharedMem = (sharedMemPerBlock > (size_t)maxSharedMemPerBlockOptin) ? sharedMemPerBlock : (size_t)maxSharedMemPerBlockOptin; | ||
|
||
// Subtract 3KB (3072 bytes) from the max shared memory as is allocated somewhere else | ||
maxSharedMem -= 3072; | ||
|
||
// Calculate the hex value | ||
unsigned int hexValue = (unsigned int)(maxSharedMem / 12); | ||
|
||
printf("0x%X", hexValue); | ||
|
||
return 0; | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -126,9 +126,14 @@ struct RBag { | |
Pair lo_buf[RLEN]; | ||
}; | ||
|
||
// Default value for shared memory (96KB) | ||
#ifndef HVM_SHARED_MEM | ||
#define HVM_SHARED_MEM 0x2000 | ||
#endif | ||
|
||
// Local Net | ||
const u32 L_NODE_LEN = 0x2000; | ||
const u32 L_VARS_LEN = 0x2000; | ||
const u32 L_NODE_LEN = HVM_SHARED_MEM; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Aren't there places in the code that expect L_NODE_LEN to have this exact value? If you change it I think we'll start to get either memory leaks, out of bounds access or non local memory use. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Also, if this number is smaller then default, the programs compiled by bend won't run. If it's smaller the performance will also tank incredibly fast. (see the issue where someone halved this number and got worse performamce than in the cpu) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
i wouldn't say that it expects to have this value, but rather that it expects to have the numbers of a device with 8.9 of compute capability, but then again, this is not a general optimization of the CUDA version, this is just to make the allocation of shared mem on local net dynamic, so that users can install the hvm without having to manually change it on the
can you show me an example of what was ran and what was the device/numbers used when this happened?
i mean, that's of course, if you give it less memory, it will have less memory, besides the fact that the rest of the code is 'optimized' for a we can like, whenever someone installs it using a GPU with <99KB of max shared mem per block, we give a warning saying something along the lines of: There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
The number of nodes in the local buffer determines the maximum size of hvm definitions. If we decrease |
||
const u32 L_VARS_LEN = HVM_SHARED_MEM; | ||
struct LNet { | ||
Pair node_buf[L_NODE_LEN]; | ||
Port vars_buf[L_VARS_LEN]; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How did you compute this? I think this is from the automatic variables in some of the kernels. I think overall this is a fine approach for now but if we change the 3KB alloc in the runtime this will have to change too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the sum of the alloc size of some local shared arrays is roughly ~3KB, i know that this is not the perfect way to do this, but it works for now.