diff --git a/docs/programming_guide.rst b/docs/programming_guide.rst
index 7e34c6d421..c2767b53b8 100644
--- a/docs/programming_guide.rst
+++ b/docs/programming_guide.rst
@@ -30,17 +30,19 @@ supported by HIP is outlined in this chapter.
 In general, GPUs are made up of many so called Compute Units that excel at
 executing parallelizable, computationally intensive workloads without complex
 control-flow.
 
-Optimize the Thread and Block Sizes
+Increase Parallelism on Multiple Levels
 ================================================================================
-Correctly configuring the threads in the kernel launch configuration (e.g.,
-threads per block, blocks per grid) is crucial for maximizing GPU performance.
-Choose an optimal number of threads per block and blocks per grid based on the
-specific hardware capabilities (e.g., the number of streaming multiprocessors (SMs)
-and cores on the GPU). Ensure that the number of threads per block is a multiple
-of the warp size (typically 32 for most GPUs) for efficient execution. Test
-different configurations, as the best combination can vary depending on the
-specific problem size and hardware.
+To maximize performance and keep all system components fully utilized, the
+application should expose and efficiently manage as much parallelism as possible.
+:ref:`Parallel execution` can be achieved at the
+application, device, and multiprocessor levels.
+
+The application's host and device operations can achieve parallel execution
+through asynchronous calls, streams, or HIP graphs. At the device level,
+multiple kernels can execute concurrently when resources are available, and at
+the multiprocessor level, data transfers can be overlapped with computation to
+further improve performance.
 
 Data Management and Transfer Between CPU and GPU
 ================================================================================
@@ -54,8 +56,11 @@ performance critical, so it is important to know how to use them effectively.
 Memory Management on the GPU
 ================================================================================
 
-On-device GPU memory accesses from the threads in a kernel can be a performance bottleneck, depending on the workload. There are also some specifics concerning device memory accesses that have to be considered, compared to CPUs.
-GPUs also have different memory spaces, with different access levels and performance characteristics, that have specific use cases.
+On-device GPU memory accesses from the threads in a kernel can be a performance
+bottleneck, depending on the workload. Compared to CPUs, there are also some
+specifics of device memory accesses that have to be considered.
+GPUs also have different memory spaces, with different access levels and
+performance characteristics, each suited to specific use cases.
 
 Synchronize CPU and GPU Workloads
 ================================================================================
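
The new "Increase Parallelism on Multiple Levels" text mentions overlapping
data transfers with computation via streams; a minimal sketch of that pattern
could accompany it. The chunk sizes, the ``scale`` kernel, and the
``HIP_CHECK`` helper are illustrative, not part of the HIP API:

.. code-block:: cpp

   #include <hip/hip_runtime.h>
   #include <cstdio>
   #include <cstdlib>

   // Illustrative error-checking helper; not part of the HIP API.
   #define HIP_CHECK(expr)                                              \
       do {                                                             \
           hipError_t status_ = (expr);                                 \
           if (status_ != hipSuccess) {                                 \
               std::fprintf(stderr, "HIP error: %s\n",                  \
                            hipGetErrorString(status_));                \
               std::exit(EXIT_FAILURE);                                 \
           }                                                            \
       } while (0)

   // Trivial kernel standing in for real device work.
   __global__ void scale(float* data, float factor, size_t n)
   {
       size_t i = blockIdx.x * blockDim.x + threadIdx.x;
       if (i < n) data[i] *= factor;
   }

   int main()
   {
       constexpr size_t n     = 1 << 20;
       constexpr size_t chunk = n / 2;          // two chunks, one per stream
       constexpr size_t bytes = chunk * sizeof(float);

       // Pinned host memory, so hipMemcpyAsync can run truly asynchronously.
       float* host = nullptr;
       HIP_CHECK(hipHostMalloc(reinterpret_cast<void**>(&host),
                               n * sizeof(float), hipHostMallocDefault));
       for (size_t i = 0; i < n; ++i) host[i] = 1.0f;

       float* dev = nullptr;
       HIP_CHECK(hipMalloc(reinterpret_cast<void**>(&dev), n * sizeof(float)));

       hipStream_t streams[2];
       for (int s = 0; s < 2; ++s) HIP_CHECK(hipStreamCreate(&streams[s]));

       // Each stream copies its chunk in, processes it, and copies it back;
       // the copies of one chunk can overlap the kernel of the other chunk.
       for (int s = 0; s < 2; ++s) {
           size_t off = s * chunk;
           HIP_CHECK(hipMemcpyAsync(dev + off, host + off, bytes,
                                    hipMemcpyHostToDevice, streams[s]));
           dim3 block(256);
           dim3 grid((chunk + block.x - 1) / block.x);
           scale<<<grid, block, 0, streams[s]>>>(dev + off, 2.0f, chunk);
           HIP_CHECK(hipMemcpyAsync(host + off, dev + off, bytes,
                                    hipMemcpyDeviceToHost, streams[s]));
       }

       // The host blocks only when it actually needs the results.
       for (int s = 0; s < 2; ++s) HIP_CHECK(hipStreamSynchronize(streams[s]));

       for (int s = 0; s < 2; ++s) HIP_CHECK(hipStreamDestroy(streams[s]));
       HIP_CHECK(hipFree(dev));
       HIP_CHECK(hipHostFree(host));
       return 0;
   }

Pinned host memory (``hipHostMalloc``) matters here: ``hipMemcpyAsync`` from
pageable memory may fall back to a synchronous copy, defeating the overlap.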
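
The same paragraph also mentions HIP graphs. A short sketch of stream capture
could illustrate that option as well; error checking is omitted for brevity,
and ``scale`` is the kernel from the previous example:

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   __global__ void scale(float* data, float factor, size_t n);  // see above

   // Sketch only: records the work submitted to `stream` into a graph and
   // replays it, amortizing per-launch overhead across many iterations.
   void run_with_graph(hipStream_t stream, float* dev, size_t n)
   {
       hipGraph_t graph;
       hipGraphExec_t graph_exec;

       hipStreamBeginCapture(stream, hipStreamCaptureModeGlobal);
       dim3 block(256);
       dim3 grid((n + block.x - 1) / block.x);
       scale<<<grid, block, 0, stream>>>(dev, 2.0f, n);  // work being recorded
       hipStreamEndCapture(stream, &graph);

       hipGraphInstantiate(&graph_exec, graph, nullptr, nullptr, 0);

       for (int iter = 0; iter < 100; ++iter) {
           hipGraphLaunch(graph_exec, stream);  // cheap replay of the graph
       }
       hipStreamSynchronize(stream);

       hipGraphExecDestroy(graph_exec);
       hipGraphDestroy(graph);
   }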