Apply suggestions from code review
Co-authored-by: MKKnorr <[email protected]>
neon60 and MKKnorr committed Nov 6, 2024
1 parent 208dfd3 commit 9717a14
Showing 3 changed files with 54 additions and 62 deletions.
8 changes: 4 additions & 4 deletions docs/index.md
@@ -15,18 +15,18 @@ The HIP documentation is organized into the following categories:
* [Introduction](./programming_guide)
* {doc}`./understand/programming_model`
* {doc}`./understand/hardware_implementation`
* {doc}`./understand/amd_clr`
* {doc}`./understand/compilers`
* {doc}`./how-to/performance_guidelines`
* [Debugging with HIP](./how-to/debugging)
* {doc}`./how-to/logging`
* {doc}`./how-to/hip_runtime_api`
* {doc}`./how-to/hip_runtime_api/memory_management`
* {doc}`./how-to/hip_runtime_api/cooperative_groups`
* {doc}`./how-to/hip_runtime_api/hipgraph`
* [HIP porting guide](./how-to/hip_porting_guide)
* [HIP porting: driver API guide](./how-to/hip_porting_driver_api)
* {doc}`./how-to/hip_rtc`
* {doc}`./how-to/performance_guidelines`
* [Debugging with HIP](./how-to/debugging)
* {doc}`./how-to/logging`
* {doc}`./understand/amd_clr`

:::
:::{grid-item-card} Reference
106 changes: 48 additions & 58 deletions docs/programming_guide.rst
@@ -1,89 +1,79 @@
.. meta::
:description: HIP programming guide
:keywords: CU, CUs, number of CUs, compute units
:description: HIP programming guide introduction
:keywords: HIP programming guide introduction, HIP programming guide

.. _hip-programming-guide:

********************************************************************************
Programming in HIP
Programming guide introduction
********************************************************************************

When programming a heterogeneous application to run on a host CPU and offload
kernels to GPUs, the following are key steps and considerations to ensure
efficient execution and performance:

Understand the Target Architecture (CPU + GPU)
================================================================================

CPUs are designed to excel at executing a sequence of operations and control
logic as fast as possible, while GPUs excel at parallel execution of large
workloads across many threads. You must target specific tasks to the appropriate
architecture to optimize your application performance. Target computationally
intensive, parallelizable parts at the GPU, while running control-heavy and
sequential logic on the CPU. For more information, see :doc:`Hardware Implementation <hip:understand/hardware_implementation>`.
The following gives a short overview of the chapters of the HIP programming
guide, to help you navigate it and make the best use of HIP.

Write GPU Kernels for Parallel Execution
================================================================================

Efficient GPU kernels can greatly speed up computation by leveraging massive
parallelism. Write kernels that can take advantage of GPU SIMD (Single
Instruction, Multiple Data) architecture. Ensure that each thread operates on
independent memory locations to avoid memory contention. Avoid branching (e.g.,
if-else statements) inside kernels as much as possible, since it can lead to
divergence, which slows down parallel execution. For more information, see
:doc:`Programming Model <hip:understand/programming_model>`.
To make the most of the parallelism inherent to GPUs, a thorough understanding
of the :ref:`programming model <programming_model>` is helpful. HIP's
programming model is designed to make it easy to map data-parallel algorithms
to the architecture of GPUs. HIP employs the so-called SIMT model (Single
Instruction, Multiple Threads) with a multi-layered thread hierarchy for
efficient execution.
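
As an illustrative sketch (the kernel name and launch sizes are chosen only for
this example), a HIP kernel is written for a single thread and launched over a
grid of blocks, with each thread deriving its element index from the thread
hierarchy:

.. code-block:: cpp

   __global__ void scale(float* data, float factor, int n)
   {
       // Each thread handles one element; the index is derived from the
       // block and thread coordinates of the thread hierarchy.
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       if (i < n)
           data[i] *= factor;
   }

   // Launch with 256 threads per block and enough blocks to cover n elements:
   // scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);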

Optimize the Thread and Block Sizes
Understand the Target Architecture (CPU + GPU)
================================================================================

Correctly configuring the threads in the kernel launch configuration (e.g.,
threads per block, blocks per grid) is crucial for maximizing GPU performance.
Choose an optimal number of threads per block and blocks per grid based on the
specific hardware capabilities (e.g., the number of streaming multiprocessors (SMs)
and cores on the GPU). Ensure that the number of threads per block is a multiple
of the warp size (typically 32 for most GPUs) for efficient execution. Test
different configurations, as the best combination can vary depending on the
specific problem size and hardware.
The general :ref:`hardware implementation <hardware_implementation>` of GPUs
supported by HIP is outlined in this chapter. GPUs are made up of many
so-called compute units that excel at executing parallelizable, computationally
intensive workloads without complex control flow.
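
As a small, hedged example, the number of compute units of a device can be
queried at runtime through the device properties (assuming at least one HIP
device is present):

.. code-block:: cpp

   #include <hip/hip_runtime.h>
   #include <iostream>

   int main()
   {
       hipDeviceProp_t props;
       // Query the properties of device 0, including its compute unit count.
       if (hipGetDeviceProperties(&props, 0) == hipSuccess)
       {
           std::cout << props.name << " has " << props.multiProcessorCount
                     << " compute units" << std::endl;
       }
       return 0;
   }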

Data Management and Transfer Between CPU and GPU
Increase parallelism on multiple levels
================================================================================

GPUs have their own memory (device memory), separate from CPU memory
(host memory). Transferring data between the host CPU and the device GPU is one
of the most expensive operations. Managing data movement is crucial to optimize
performance. Minimize data transfers between the CPU and GPU by keeping data on
the GPU for as long as possible. Use asynchronous data transfer functions where
available, like ``hipMemcpyAsync()``, to overlap data transfer with kernel
execution. For more information, see :doc:`HIP Programming Manual <hip:how-to/hip_runtime_api/memory_management>`.
To maximize performance and keep all system components fully utilized, the
application should expose and efficiently manage as much parallelism as possible.
:ref:`Parallel execution <parallel execution>` can be achieved at the
application, device, and multiprocessor levels.

The application’s host and device operations can achieve parallel execution
through asynchronous calls, streams, or HIP graphs. On the device level,
multiple kernels can execute concurrently when resources are available, and at
the multiprocessor level, developers can overlap data transfers with
computations to further optimize performance.
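
The following sketch shows one possible way to express such overlap with two
streams; the buffer names, sizes and the ``process`` kernel are placeholders
for this example, and the host buffer is assumed to be pinned (allocated with
``hipHostMalloc``) so the copy can actually run asynchronously:

.. code-block:: cpp

   hipStream_t copyStream, computeStream;
   hipStreamCreate(&copyStream);
   hipStreamCreate(&computeStream);

   // The asynchronous copy in copyStream can overlap with the kernel
   // running in computeStream.
   hipMemcpyAsync(d_in, h_in, bytes, hipMemcpyHostToDevice, copyStream);
   process<<<blocks, threadsPerBlock, 0, computeStream>>>(d_other, n);

   hipStreamSynchronize(copyStream);
   hipStreamSynchronize(computeStream);

   hipStreamDestroy(copyStream);
   hipStreamDestroy(computeStream);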

Memory Management on the GPU
Memory management
================================================================================

GPU memory accesses can be a performance bottleneck if not handled correctly.
Use the different GPU memory types effectively (e.g., global, shared, constant,
and local memory). Shared memory is faster than global memory but limited in
size. Shared memory is ideal for reusing data across threads in a block. Ensure
memory accesses are coalesced (i.e., threads in a warp access consecutive memory
locations), as uncoalesced memory access patterns can significantly degrade
performance.
GPUs generally have their own distinct memory, also referred to as :ref:`device
memory <device_memory>`, separate from the :ref:`host memory <device_memory>`.
Device memory needs to be managed separately from the host memory, which
includes allocating the memory and transferring data between the host and the
device. These operations can be performance critical, so it is important to
know how to use them effectively.

For more information, see :ref:`memory management <memory_management>`.
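
A minimal sketch of this allocate, transfer and free pattern (the buffer names
and size are only illustrative):

.. code-block:: cpp

   const size_t bytes = n * sizeof(float);
   float* d_data = nullptr;

   hipMalloc(&d_data, bytes);                                // allocate device memory
   hipMemcpy(d_data, h_data, bytes, hipMemcpyHostToDevice);  // copy host -> device
   // ... launch kernels that operate on d_data ...
   hipMemcpy(h_data, d_data, bytes, hipMemcpyDeviceToHost);  // copy device -> host
   hipFree(d_data);                                          // release device memory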

Synchronize CPU and GPU Workloads
================================================================================

The host (CPU) and device (GPU) execute tasks asynchronously, but proper
synchronization is needed to ensure correct results. Use synchronization
functions like ``hipDeviceSynchronize()`` or ``hipStreamSynchronize()`` to
ensure that kernels have completed execution before using their results. Take
advantage of asynchronous execution to overlap data transfers, kernel execution,
and CPU tasks where possible.
Tasks on the host and devices run asynchronously, so proper synchronization is
needed when dependencies between those tasks exist. Asynchronous execution of
tasks is useful to fully utilize the available resources. Even when only a
single device is available, memory transfers and the execution of tasks can be
overlapped with asynchronous execution.
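
For example (a sketch; the kernel, stream and buffers are placeholders), work
can be issued asynchronously and the host only blocks once the results are
actually needed:

.. code-block:: cpp

   // Kernel launches and asynchronous copies return immediately on the host.
   compute<<<blocks, threadsPerBlock, 0, stream>>>(d_data, n);
   hipMemcpyAsync(h_result, d_data, bytes, hipMemcpyDeviceToHost, stream);

   // Block the host until all work queued in the stream has finished,
   // so h_result can safely be read afterwards.
   hipStreamSynchronize(stream);

   // Alternatively, wait for all outstanding work on the device:
   // hipDeviceSynchronize();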

Error Handling
================================================================================

Check for errors after memory transfers and kernel launches, for example
``hipGetLastError()``. Catch and handle errors to allow the application to
gracefully exit, with appropriate messaging. For more information, see
`Error Handling <https://rocm.docs.amd.com/projects/HIP/en/latest/how-to/error_handling.html>`_.
All functions in the HIP runtime API return an error value of type
:cpp:enum:`hipError_t`, which can be used to verify whether the function
executed successfully. It is therefore important to check these return values
in order to catch and handle errors where possible.
An exception to this are kernel launches, which do not return any value. Errors
from kernel launches can be caught with specific functions like
:cpp:func:`hipGetLastError()`.
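
A common pattern, shown here only as a sketch (the ``HIP_CHECK`` macro is an
arbitrary name chosen for this example, not provided by HIP), wraps runtime
calls in a checking macro and queries ``hipGetLastError()`` after a kernel
launch:

.. code-block:: cpp

   #include <hip/hip_runtime.h>
   #include <cstdio>
   #include <cstdlib>

   // HIP_CHECK is a name chosen for this example, not a HIP-provided macro.
   #define HIP_CHECK(expr)                                                   \
       do {                                                                  \
           hipError_t err = (expr);                                          \
           if (err != hipSuccess) {                                          \
               std::fprintf(stderr, "HIP error '%s' at %s:%d\n",             \
                            hipGetErrorString(err), __FILE__, __LINE__);     \
               std::exit(EXIT_FAILURE);                                      \
           }                                                                 \
       } while (0)

   // Usage:
   // HIP_CHECK(hipMalloc(&d_data, bytes));
   // kernel<<<blocks, threadsPerBlock>>>(d_data, n);
   // HIP_CHECK(hipGetLastError()); // catches errors from the kernel launch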

Multi-GPU and Load Balancing
================================================================================
2 changes: 2 additions & 0 deletions docs/understand/programming_model.rst
@@ -4,6 +4,8 @@
code, how it maps to the hardware.
:keywords: AMD, ROCm, HIP, CUDA, API design

.. _programming_model:

*******************************************************************************
HIP programming model
*******************************************************************************
