Unified Memory support #99
Comments
FYI: CUDArt is somewhat unmaintained; me / @timholy / @vchuravy occasionally check in small compatibility fixes and tag new releases, but my time at least is spent on CUDAdrv... That said...
In your first example, you pass a literal pointer (to global memory) to the kernel. This pointer itself is a bitstype (a primitive type in modern nomenclature), which means it is passed by value and resides in parameter space, a constant memory that doesn't require synchronization for thread accesses. Dereferencing the pointer, however, does need synchronization, as it points to global memory. Your second example passes a `UnifiedArray` instead, which used to be handled less efficiently. However, this should have been fixed recently (JuliaGPU/CUDAnative.jl#78), so I presume you were using an older version of CUDAnative?
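For illustration, here is a sketch of the kind of kernel that takes such a literal pointer (CUDAnative-style Julia; this kernel is made up for the example, it is not the code from the gist):

```julia
using CUDAnative

# The pointer argument is a plain bits value: it is passed by value and lives
# in parameter space, so reading `ptr` itself needs no synchronization.
# Only the unsafe_store! below dereferences it and touches global memory.
function init_kernel(ptr::Ptr{Float32}, val::Float32, n::Int)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= n
        unsafe_store!(ptr, val, i)   # global-memory write through the pointer
    end
    return nothing
end
```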
No, because up until very recently there was no benefit, except for programmability, which already was pretty seamless thanks to automatic conversion at the `@cuda` call.
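"Automatic conversion" here refers to host-side array handles being turned into their device-side counterparts at kernel launch; a minimal sketch, assuming the CUDAnative/CUDAdrv API of that era (launch syntax and constructors changed between versions):

```julia
using CUDAdrv, CUDAnative

function scale_kernel(a, factor)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(a)
        @inbounds a[i] *= factor
    end
    return nothing
end

d_a = CuArray(rand(Float32, 1024))   # host-side handle to device memory
# At the @cuda call, d_a is converted to a device-side CuDeviceArray, so the
# kernel itself only ever receives a lightweight bits type.
@cuda (4, 256) scale_kernel(d_a, 2f0)
```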
No, although I don't think it would be much work, at least for CUDAdrv (are you using CUDArt for a specific reason?).
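To give an idea of the amount of work involved, a wrapper for the prefetch call could be little more than a single ccall; a hypothetical sketch (not actual CUDAdrv code, and the "libcuda" library name is an assumption about the platform):

```julia
# Hypothetical wrapper around the raw driver-API call
# cuMemPrefetchAsync(devPtr, count, dstDevice, stream).
# Passing -1 (CU_DEVICE_CPU) as the device prefetches back to host memory.
function prefetch_async(ptr::Ptr, bytes::Integer; device::Integer = 0,
                        stream::Ptr{Cvoid} = C_NULL)
    status = ccall((:cuMemPrefetchAsync, "libcuda"), Cint,
                   (Ptr{Cvoid}, Csize_t, Cint, Ptr{Cvoid}),
                   ptr, bytes, Cint(device), stream)
    status == 0 || error("cuMemPrefetchAsync failed with status $status")
    return nothing
end
```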
Ah, yes, it was because of an old version; both examples run at the same speed now. I was using the runtime API because that's what the NVIDIA beginner's tutorial proposed, but I have now converted it to the driver API, which was in fact very easy. I'm not sure I'm the one to implement a new GPU array type at this point, considering I'm still taking baby steps with CUDA here :)
@barche, can you share the version using CUDAdrv.jl, please?
Thanks @barche, but I thought you had converted the "Unified Memory example" to CUDAdrv...
Doesn't it work by just changing the …?
I have just updated the gist (file …).
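For readers following along, here is roughly what a driver-API managed allocation looks like from Julia; this is an illustrative sketch, not the gist's actual code, and it assumes a context has already been created via CUDAdrv and that the driver library is available as "libcuda":

```julia
using CUDAdrv

const CU_MEM_ATTACH_GLOBAL = Cuint(0x1)

# Hypothetical helper: allocate unified (managed) memory through the raw
# driver API and return it as a host-usable pointer.
function alloc_managed(::Type{T}, n::Integer) where T
    ptr_ref = Ref{UInt64}(0)   # CUdeviceptr is a 64-bit integer handle
    status = ccall((:cuMemAllocManaged, "libcuda"), Cint,
                   (Ref{UInt64}, Csize_t, Cuint),
                   ptr_ref, n * sizeof(T), CU_MEM_ATTACH_GLOBAL)
    status == 0 || error("cuMemAllocManaged failed with status $status")
    return Ptr{T}(ptr_ref[])
end

ctx = CuContext(CuDevice(0))          # driver context via CUDAdrv
p = alloc_managed(Float32, 1 << 20)
unsafe_store!(p, 1f0, 1)              # host-side write through the unified pointer
```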
Ah, thank you very much @barche, that's exactly what I was looking for. 👍 Unfortunately, it is more or less what I was trying, and it throws the same error (ERROR_INVALID_VALUE) at the same call. I will try to follow @maleadt's suggestion and try to understand what's going on. Maybe there is something special about the platform I'm experimenting with, a Jetson TX1. I had already successfully used unified memory, but that was with C++/Jetson TX2/runtime API... I'll put additional questions on Discourse.
Ok, probably it is because "Maxwell architectures [..] support a more limited form of Unified Memory" 🤦♂️. |
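One way to check this from Julia is to query the concurrent-managed-access device attribute directly from the driver; a sketch with a raw ccall (the attribute value 89 is taken from cuda.h and worth double-checking against your toolkit, and the device ordinal is used directly as the CUdevice handle for simplicity):

```julia
# Pre-Pascal GPUs (such as the TX1's Maxwell) report 0 for this attribute,
# i.e. they lack the full, Pascal-style form of Unified Memory.
const CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS = Cint(89)

function has_concurrent_managed_access(device::Integer = 0)
    value = Ref{Cint}(0)
    status = ccall((:cuDeviceGetAttribute, "libcuda"), Cint,
                   (Ref{Cint}, Cint, Cint),
                   value, CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS, Cint(device))
    status == 0 || error("cuDeviceGetAttribute failed with status $status")
    return value[] != 0
end
```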
I'll test tonight on my machine at home to confirm it still works, it has been a while since I tried this. |
To confirm, I tried on my GTX 1060 and it still worked. |
The following code reproduces the Unified Memory example from NVIDIA in Julia:
https://gist.github.com/barche/9cc583ad85dd2d02782642af04f44dd7#file-add_cudart-jl
Kernel run time is the same as with the .cu version compiled with nvcc, according to the nvprof output.
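For context, the core of that example is a managed allocation that both the host and the device can touch through the same pointer; a rough sketch using raw runtime-API calls (the gist uses CUDArt's wrappers instead, and the "libcudart" library name is an assumption about the installation):

```julia
# Managed allocation as in the NVIDIA "add" tutorial, via ccall.
const cudaMemAttachGlobal = Cuint(0x01)

function cuda_malloc_managed(::Type{T}, n::Integer) where T
    ptr_ref = Ref{Ptr{Cvoid}}(C_NULL)
    status = ccall((:cudaMallocManaged, "libcudart"), Cint,
                   (Ref{Ptr{Cvoid}}, Csize_t, Cuint),
                   ptr_ref, n * sizeof(T), cudaMemAttachGlobal)
    status == 0 || error("cudaMallocManaged failed with status $status")
    return Ptr{T}(ptr_ref[])
end

n = 1 << 20
x = cuda_malloc_managed(Float32, n)
y = cuda_malloc_managed(Float32, n)
for i in 1:n                 # initialize on the host through the unified pointers
    unsafe_store!(x, 1f0, i)
    unsafe_store!(y, 2f0, i)
end
# ... launch the add kernel on x and y, then synchronize and free:
ccall((:cudaDeviceSynchronize, "libcudart"), Cint, ())
ccall((:cudaFree, "libcudart"), Cint, (Ptr{Cvoid},), x)
ccall((:cudaFree, "libcudart"), Cint, (Ptr{Cvoid},), y)
```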
I decided to attempt to make the interface a little nicer by creating a `UnifiedArray` type modeled after `CuDeviceArray`, presented in this file together with the test: https://gist.github.com/barche/9cc583ad85dd2d02782642af04f44dd7#file-unifiedarray-jl
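For readers who don't want to open the gist, a rough sketch of the shape such a wrapper takes (field names and methods here are assumptions; the real definition is in the linked file):

```julia
# An isbits wrapper around a unified-memory pointer, so it can be passed to a
# kernel by value, with plain AbstractArray indexing on both host and device.
struct UnifiedArray{T,N} <: AbstractArray{T,N}
    ptr::Ptr{T}
    dims::NTuple{N,Int}
end

Base.size(a::UnifiedArray) = a.dims
Base.IndexStyle(::Type{<:UnifiedArray}) = IndexLinear()
Base.getindex(a::UnifiedArray, i::Int) = unsafe_load(a.ptr, i)
Base.setindex!(a::UnifiedArray{T}, v, i::Int) where {T} =
    unsafe_store!(a.ptr, convert(T, v), i)
```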
Unfortunately, this runs significantly slower.
Comparing the `@code_llvm` output for the init kernel after the `if` shows noticeably different code for the first version and for the `UnifiedArray` version.

So now for the questions: is the slowdown of the `UnifiedArray` version expected, is Unified Memory already used or planned in these packages, and is there (or could there be) an interface to `cudaMemPrefetchAsync`?

p.s. great job on all these CUDA packages, this was a lot easier to set up than I had anticipated :)