CUDA Graphs #14

Open
ax3l opened this issue Aug 5, 2019 · 7 comments

@ax3l (Member) commented Aug 5, 2019

From the latest information on CUDA Graphs, the following rules of thumb apply:

  • always use CUDA Graphs to launch kernels; it will be at least as fast as launching without task graphs, even if you only schedule one or a few kernels
  • when starting an app with a new GPU context, allocate and free a large task graph first (e.g. the largest you expect to occur at runtime); this reduces initial latencies
  • when tracing graph performance, note that tasks are internally batched again in groups of ~8; one does not need to optimize for that, it just becomes visible in the traces

Official Docs: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs
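
A minimal warm-up sketch in the spirit of the second bullet; the no-op kernel, the node count and the helper name are made up for illustration, and stream capture is just one of several ways to build the throwaway graph:

```cuda
#include <cuda_runtime.h>

__global__ void noopKernel() {}

// Build, launch and free one deliberately large graph right after context
// creation so that CUDA allocates its internal graph resources up front.
void warmUpCudaGraphs(int numNodes) // e.g. the largest task count expected at runtime
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Record a long chain of trivial kernels into a graph via stream capture.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    for (int i = 0; i < numNodes; ++i)
        noopKernel<<<1, 1, 0, stream>>>();
    cudaGraph_t graph;
    cudaStreamEndCapture(stream, &graph);

    // Instantiating and launching once pays the allocation cost now,
    // reducing the latency of the first "real" graphs later on.
    cudaGraphExec_t graphExec;
    cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);
    cudaGraphLaunch(graphExec, stream);
    cudaStreamSynchronize(stream);

    // Free everything again; only the one-time warm-up effect is wanted.
    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
}
```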

@michaelsippel (Member)

Currently I see two ways of using CUDA Graphs in combination with redGrapes:

  • for static, GPU-only tasks, create the graph manually with the CUDA graph API and submit it inside a redGrapes task that gets handled by the cuda-scheduler, which in turn allocates a CUDA stream at the time it can be submitted.
    Generally, this would be exactly the same as with any async CUDA call (see the sketch after this list).

  • capture all CUDA calls with stream capture until a host task appears that creates a dependency; then end the capture and submit the resulting graph.
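
A rough sketch of what the first option could look like; the kernels, launch configuration and helper name are invented for illustration, and error handling is omitted:

```cuda
#include <cuda_runtime.h>

__global__ void stageA(float* data) { /* ... */ }
__global__ void stageB(float* data) { /* ... */ }

// Build a static, GPU-only graph once with the explicit CUDA graph API.
cudaGraphExec_t buildStaticGraph(float* devData)
{
    cudaGraph_t graph;
    cudaGraphCreate(&graph, 0);

    // First kernel node.
    void* argsA[] = { &devData };
    cudaKernelNodeParams paramsA = {};
    paramsA.func = (void*)stageA;
    paramsA.gridDim = dim3(1);
    paramsA.blockDim = dim3(256);
    paramsA.kernelParams = argsA;

    cudaGraphNode_t nodeA;
    cudaGraphAddKernelNode(&nodeA, graph, nullptr, 0, &paramsA);

    // Second kernel node depends on the first; the dependency lives in the
    // graph itself, no stream ordering or events are needed.
    void* argsB[] = { &devData };
    cudaKernelNodeParams paramsB = paramsA;
    paramsB.func = (void*)stageB;
    paramsB.kernelParams = argsB;

    cudaGraphNode_t nodeB;
    cudaGraphAddKernelNode(&nodeB, graph, &nodeA, 1, &paramsB);

    cudaGraphExec_t graphExec;
    cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);
    cudaGraphDestroy(graph); // the executable graph keeps its own copy
    return graphExec;
}

// The redGrapes task body would then only need to do something like
//   cudaGraphLaunch(graphExec, stream_provided_by_the_cuda_scheduler);
```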

@sbastrakov (Member) commented Jul 2, 2020

@michaelsippel do you prefer one way over the other, or what are your thoughts on this matter generally?

@michaelsippel (Member)

From the CUDA docs it seems that it should be possible to call cudaGraphLaunch() while capturing a graph, which means that both approaches could be used simultaneously. The question is rather whether this is useful at all, because we would always create a new CUDA graph and never reuse it. I don't really believe that it would be faster to wrap every CUDA call in a CUDA graph, because it would delay the submission of CUDA operations and also introduce more runtime overhead. As I understand it, you only really benefit if you use the CUDA graph as a template and resubmit the same graph multiple times, because that saves the CPU time needed to create the graph. If you want to do that, it can be done manually, but I don't know how to extract a CUDA-graph template out of a dynamic task graph with mixed host and device tasks.
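
The reuse described here would look roughly like the following: capture the work once, instantiate it, and relaunch the same executable graph every iteration. The kernel, loop count and function names are placeholders, not part of the actual redGrapes integration:

```cuda
#include <cuda_runtime.h>

__global__ void stepKernel(float* data) { /* ... */ }

void runManyIterations(float* devData, int numIterations)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Capture the per-iteration work once into a graph template.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    stepKernel<<<64, 256, 0, stream>>>(devData);
    cudaGraph_t graph;
    cudaStreamEndCapture(stream, &graph);

    cudaGraphExec_t graphExec;
    cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);

    // Relaunching the instantiated graph skips the per-call CPU cost of
    // rebuilding the launch sequence; this is where graphs actually pay off.
    for (int i = 0; i < numIterations; ++i)
        cudaGraphLaunch(graphExec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
}
```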

@psychocoderHPC (Member)

> From the CUDA docs it seems that it should be possible to call cudaGraphLaunch() while capturing a graph, which means that both approaches could be used simultaneously. The question is rather whether this is useful at all, because we would always create a new CUDA graph and never reuse it. I don't really believe that it would be faster to wrap every CUDA call in a CUDA graph, because it would delay the submission of CUDA operations and also introduce more runtime overhead. As I understand it, you only really benefit if you use the CUDA graph as a template and resubmit the same graph multiple times, because that saves the CPU time needed to create the graph. If you want to do that, it can be done manually, but I don't know how to extract a CUDA-graph template out of a dynamic task graph with mixed host and device tasks.

It is definitely useful if, for example, we always enqueue an event behind the task to track the state of the operation.
IMO it will also be useful if you know that you have more than two tasks ready to enqueue.

@michaelsippel (Member)

> It is definitely useful if, for example, we always enqueue an event behind the task to track the state of the operation.

You mean, because we have to do this anyway, we always have at least two CUDA operations (the actual call and the eventRecord) which can be recorded into the graph.

> IMO it will also be useful if you know that you have more than two tasks ready to enqueue.

Yes, that's what I meant with the second bullet. I will post an image that explains how I imagine this to work. But I don't yet believe that it will gain any performance.

@michaelsippel (Member) commented Jul 3, 2020

[Image: redGrapesCudaGraph]


 1. mgr.finish_task( Host Task 1 )
   1.1. scheduling_graph.task_end( Host Task 1 )
     1.1.1. scheduling_graph.reach_event( Event 2 )


   1.2. mgr.activate_followers( Host Task 1 )
     1.2.1. scheduler.activate_task( Host Task 2 )
         ...

     1.2.2. scheduler.activate_task( CUDA Task 3 )
       1.2.2.1. cuda_scheduler.activate_task( CUDA Task 3 )

         1.2.2.1.1. scheduling_graph.is_task_ready( CUDA Task 3 ) == true
         1.2.2.1.1.1. scheduling_graph.is_event_reached( Event 5 ) == true

         1.2.2.1.2. cuda_scheduler.is_recording == false
         1.2.2.1.2.1. cudaStreamBeginCapture()

         1.2.2.1.2.2. cuda_scheduler.dispatch_task( CUDA Task 3 )
           1.2.2.1.2.2.1. current_stream = select next cuda stream

           1.2.2.1.2.2.2. mgr.run_task( CUDA Task 3 )
           1.2.2.1.2.2.2.1. scheduling_graph.reach_event( 5 )

           1.2.2.1.2.2.3. cuda_event_1 = cudaEventRecord()

           1.2.2.1.2.2.4. Add cuda event as cuda-dependency in following cuda-tasks.
           1.2.2.1.2.2.4.1. [CUDA Task 4].properties.cuda_dependencies[ current_stream ] = cuda_event_1


           1.2.2.1.2.2.5. mgr.activate_followers( CUDA Task 3 )
             1.2.2.1.2.2.5.1. scheduler.activate_task( CUDA Task 4 )
             1.2.2.1.2.2.5.1.1. cuda_scheduler.activate_task( CUDA Task 4 )
               1.2.2.1.2.2.5.1.1.1. scheduling_graph.is_task_ready( CUDA Task 4 ) == true
               1.2.2.1.2.2.5.1.1.1.1. scheduling_graph.is_event_reached( Event 7 ) == true

               1.2.2.1.2.2.5.1.1.2. cuda_scheduler.is_recording == true
                1.2.2.1.2.2.5.1.1.2.1. cuda_scheduler.dispatch_task( CUDA Task 4 )
                  1.2.2.1.2.2.5.1.1.2.1.1. current_stream = select next cuda stream

                  1.2.2.1.2.2.5.1.1.2.1.2. cudaStreamWaitEvent( current_stream, cuda_event_1 )

                  1.2.2.1.2.2.5.1.1.2.1.3. mgr.run_task( CUDA Task 4 )
                  1.2.2.1.2.2.5.1.1.2.1.3.1. scheduling_graph.reach_event( 7 )

                 1.2.2.1.2.2.5.1.1.2.1.4. cuda_event_2 = cudaEventRecord()

                 1.2.2.1.2.2.5.1.1.2.1.5. Add cuda event as cuda-dependency in following cuda-tasks.
                 1.2.2.1.2.2.5.1.1.2.1.5.1. [CUDA Task 7].properties.cuda_dependencies[ current_stream ] = cuda_event_2

                 1.2.2.1.2.2.5.1.1.2.1.6. mgr.activate_followers( CUDA Task 4 )
                   1.2.2.1.2.2.5.1.1.2.1.6.1. scheduler.activate_task( Host Task 6 )
                   1.2.2.1.2.2.5.1.1.2.1.6.1.1. default_scheduler.activate_task( Host Task 6 )
                   1.2.2.1.2.2.5.1.1.2.1.6.1.1.1. scheduling_graph.is_task_ready( Host Task 6 ) == false
                   1.2.2.1.2.2.5.1.1.2.1.6.1.1.1.1. scheduling_graph.is_event_reached( Event 11 ) == false

                   1.2.2.1.2.2.5.1.1.2.1.6.2. scheduler.activate_task( CUDA Task 7 )
                   1.2.2.1.2.2.5.1.1.2.1.6.2.1. cuda_scheduler.activate_task( CUDA Task 7 )
                   1.2.2.1.2.2.5.1.1.2.1.6.2.1.1. scheduling_graph.is_task_ready( CUDA Task 7 ) == false
                   1.2.2.1.2.2.5.1.1.2.1.6.2.1.1.1. scheduling_graph.is_event_reached( Event 13 ) == false

             1.2.2.1.2.2.5.2. scheduler.activate_task( Host Task 5 )
             1.2.2.1.2.2.5.2.1. default_scheduler.activate_task( Host Task 5 )
               1.2.2.1.2.2.5.2.1.1. scheduling_graph.is_task_ready( Host Task 5 ) == false
               1.2.2.1.2.2.5.2.1.1.1. scheduling_graph.is_event_reached( Event 9 ) == false

         1.2.2.1.2.3. cudaStreamEndCapture()
         1.2.2.1.2.4. cudaGraphLaunch()

 2. cuda_scheduler.poll()
   2.1. cudaEventQuery() == cudaSuccess
    2.1.1. mgr.finish_task( CUDA Task 3 )
     2.1.1.1. scheduling_graph.task_end( CUDA Task 3 )
     2.1.1.1.1. scheduling_graph.reach_event( Event 6 )

     2.1.1.2. mgr.activate_followers( CUDA Task 3 )
       2.1.1.2.1. scheduler.activate_task( CUDA Task 4 )
       2.1.1.2.1.1. cuda_scheduler.activate_task( CUDA Task 4 )
         2.1.1.2.1.1.1. CUDA Task 4 is already submitted

       2.1.1.2.2. scheduler.activate_task( Host Task 5 )
       2.1.1.2.2.1. default_scheduler.activate_task( Host Task 5 )
         2.1.1.2.2.1.1. scheduling_graph.is_task_ready( Host Task 5 ) == true
         2.1.1.2.2.1.1.1. scheduling_graph.is_event_reached( Event 9 ) == true

         2.1.1.2.2.1.2. run Host Task 5 ...
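
The walkthrough above corresponds roughly to the following CUDA call sequence. This is only an illustrative sketch of the capture/launch/poll pattern without the redGrapes bookkeeping; the kernels, event names and the helper function are invented:

```cuda
#include <cuda_runtime.h>

__global__ void task3Kernel() { /* CUDA Task 3 */ }
__global__ void task4Kernel() { /* CUDA Task 4 */ }

void submitCapturedGraph()
{
    cudaStream_t streamA, streamB;
    cudaStreamCreate(&streamA);
    cudaStreamCreate(&streamB);
    cudaEvent_t cudaEvent1, joinEvent, doneEvent;
    cudaEventCreate(&cudaEvent1);
    cudaEventCreate(&joinEvent);
    cudaEventCreate(&doneEvent);

    // cudaStreamBeginCapture(): everything below becomes graph nodes.
    cudaStreamBeginCapture(streamA, cudaStreamCaptureModeGlobal);

    // CUDA Task 3 on the first stream, followed by an event record that
    // later tasks can use as their cuda-dependency.
    task3Kernel<<<1, 256, 0, streamA>>>();
    cudaEventRecord(cudaEvent1, streamA);

    // CUDA Task 4 on another stream; waiting on cuda_event_1 pulls streamB
    // into the same capture and encodes the task dependency as a graph edge.
    cudaStreamWaitEvent(streamB, cudaEvent1, 0);
    task4Kernel<<<1, 256, 0, streamB>>>();

    // Rejoin streamB into streamA so the capture can be ended cleanly.
    cudaEventRecord(joinEvent, streamB);
    cudaStreamWaitEvent(streamA, joinEvent, 0);

    // cudaStreamEndCapture() + cudaGraphLaunch() as in steps 1.2.2.1.2.3/4.
    cudaGraph_t graph;
    cudaStreamEndCapture(streamA, &graph);
    cudaGraphExec_t graphExec;
    cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);
    cudaGraphLaunch(graphExec, streamA);

    // Completion tracking for the poll() step: record an event behind the
    // launched graph and check it non-blockingly, as cuda_scheduler.poll() would.
    cudaEventRecord(doneEvent, streamA);
    while (cudaEventQuery(doneEvent) != cudaSuccess)
    {
        // poll() would return here and try again later instead of spinning.
    }

    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);
    cudaEventDestroy(cudaEvent1);
    cudaEventDestroy(joinEvent);
    cudaEventDestroy(doneEvent);
    cudaStreamDestroy(streamA);
    cudaStreamDestroy(streamB);
}
```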

@psychocoderHPC (Member) commented Jul 3, 2020

> You mean, because we have to do this anyway, we always have at least two CUDA operations (the actual call and the eventRecord) which can be recorded into the graph.

Yes, in that case we have two CUDA API calls per task.

> But I don't yet believe that it will gain any performance

Not sure whether by "performance" you mean latency.
We need to reduce the latency. In PIConGPU we have over 2k API calls per second, and each call takes around 10 µs, so we waste a lot of time blocking the CPU on CUDA API latency. The slowdown is not the full 20 ms per second (2k × 10 µs), because most API calls are performed in parallel to the execution of tasks.
