Apply suggestions from code review
Co-authored-by: MKKnorr <[email protected]>
neon60 and MKKnorr committed Nov 6, 2024
1 parent 208dfd3 commit 9717a14
Showing 3 changed files with 54 additions and 62 deletions.
8 changes: 4 additions & 4 deletions docs/index.md
@@ -15,18 +15,18 @@ The HIP documentation is organized into the following categories:
* [Introduction](./programming_guide)
* {doc}`./understand/programming_model`
* {doc}`./understand/hardware_implementation`
* {doc}`./understand/amd_clr`
* {doc}`./understand/compilers`
* {doc}`./how-to/performance_guidelines`
* [Debugging with HIP](./how-to/debugging)
* {doc}`./how-to/logging`
* {doc}`./how-to/hip_runtime_api`
* {doc}`./how-to/hip_runtime_api/memory_management`
* {doc}`./how-to/hip_runtime_api/cooperative_groups`
* {doc}`./how-to/hip_runtime_api/hipgraph`
* [HIP porting guide](./how-to/hip_porting_guide)
* [HIP porting: driver API guide](./how-to/hip_porting_driver_api)
* {doc}`./how-to/hip_rtc`
* {doc}`./how-to/performance_guidelines`
* [Debugging with HIP](./how-to/debugging)
* {doc}`./how-to/logging`
* {doc}`./understand/amd_clr`

:::
:::{grid-item-card} Reference
106 changes: 48 additions & 58 deletions docs/programming_guide.rst
@@ -1,89 +1,79 @@
.. meta::
:description: HIP programming guide
:keywords: CU, CUs, number of CUs, compute units
:description: HIP programming guide introduction
:keywords: HIP programming guide introduction, HIP programming guide

.. _hip-programming-guide:

********************************************************************************
Programming in HIP
Programming guide introduction
********************************************************************************

When programming a heterogeneous application to run on a host CPU and offload
kernels to GPUs, the following are key steps and considerations to ensure
efficient execution and performance:

Understand the Target Architecture (CPU + GPU)
================================================================================

CPUs are designed to excel at executing a sequence of operations and control
logic as fast as possible, while GPUs excel at parallel execution of large
workloads across many threads. You must target specific tasks to the appropriate
architecture to optimize your application performance. Target computationally
intensive, parallelizable parts at the GPU, while running control-heavy and
sequential logic on the CPU. For more information, see :doc:`Hardware Implementation <hip:understand/hardware_implementation>`.
The following gives a short overview of the chapters of the HIP programming
guide, to help you navigate it and make the best use of HIP.

Write GPU Kernels for Parallel Execution
================================================================================

Efficient GPU kernels can greatly speed up computation by leveraging massive
parallelism. Write kernels that can take advantage of GPU SIMD (Single
Instruction, Multiple Data) architecture. Ensure that each thread operates on
independent memory locations to avoid memory contention. Avoid branching (e.g.,
if-else statements) inside kernels as much as possible, since it can lead to
divergence, which slows down parallel execution. For more information, see
:doc:`Programming Model <hip:understand/programming_model>`.
To make the most of the parallelism inherent to GPUs, a thorough understanding
of the :ref:`programming model <programming_model>` is helpful. HIP's
programming model is designed to make it easy to map data-parallel algorithms
to the architecture of GPUs. HIP employs the so-called SIMT model (Single
Instruction, Multiple Threads) with a multi-layered thread hierarchy for
efficient execution.
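
As an illustrative sketch (the kernel name and launch sizes are chosen only for
this example), a HIP kernel is written for a single thread and launched over a
grid of blocks, with each thread deriving its element index from the thread
hierarchy:

.. code-block:: cpp

   __global__ void scale(float* data, float factor, int n)
   {
       // Each thread handles one element; the index is derived from the
       // block and thread coordinates of the thread hierarchy.
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       if (i < n)
           data[i] *= factor;
   }

   // Launch with 256 threads per block and enough blocks to cover n elements:
   // scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);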

Optimize the Thread and Block Sizes
Understand the Target Architecture (CPU + GPU)
================================================================================

Correctly configuring the threads in the kernel launch configuration (e.g.,
threads per block, blocks per grid) is crucial for maximizing GPU performance.
Choose an optimal number of threads per block and blocks per grid based on the
specific hardware capabilities (e.g., the number of streaming multiprocessors (SMs)
and cores on the GPU). Ensure that the number of threads per block is a multiple
of the warp size (typically 32 for most GPUs) for efficient execution. Test
different configurations, as the best combination can vary depending on the
specific problem size and hardware.
The general :ref:`hardware implementation <hardware_implementation>` of GPUs
supported by HIP is outlined in this chapter. GPUs are made up of many
so-called compute units that excel at executing parallelizable, computationally
intensive workloads without complex control flow.
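
As a small, hedged example, the number of compute units of a device can be
queried at runtime through the device properties (assuming at least one HIP
device is present):

.. code-block:: cpp

   #include <hip/hip_runtime.h>
   #include <iostream>

   int main()
   {
       hipDeviceProp_t props;
       // Query the properties of device 0, including its compute unit count.
       if (hipGetDeviceProperties(&props, 0) == hipSuccess)
       {
           std::cout << props.name << " has " << props.multiProcessorCount
                     << " compute units" << std::endl;
       }
       return 0;
   }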

Data Management and Transfer Between CPU and GPU
Increase parallelism on multiple levels
================================================================================

GPUs have their own memory (device memory), separate from CPU memory
(host memory). Transferring data between the host CPU and the device GPU is one
of the most expensive operations. Managing data movement is crucial to optimize
performance. Minimize data transfers between the CPU and GPU by keeping data on
the GPU for as long as possible. Use asynchronous data transfer functions where
available, like ``hipMemcpyAsync()``, to overlap data transfer with kernel
execution. For more information, see :doc:`HIP Programming Manual <hip:how-to/hip_runtime_api/memory_management>`.
To maximize performance and keep all system components fully utilized, the
application should expose and efficiently manage as much parallelism as possible.
:ref:`Parallel execution <parallel execution>` can be achieved at the
application, device, and multiprocessor levels.

The application’s host and device operations can achieve parallel execution
through asynchronous calls, streams, or HIP graphs. On the device level,
multiple kernels can execute concurrently when resources are available, and at
the multiprocessor level, developers can overlap data transfers with
computations to further optimize performance.
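
The following sketch shows one possible way to express such overlap with two
streams; the buffer names, sizes and the ``process`` kernel are placeholders
for this example, and the host buffer is assumed to be pinned (allocated with
``hipHostMalloc``) so the copy can actually run asynchronously:

.. code-block:: cpp

   hipStream_t copyStream, computeStream;
   hipStreamCreate(&copyStream);
   hipStreamCreate(&computeStream);

   // The asynchronous copy in copyStream can overlap with the kernel
   // running in computeStream.
   hipMemcpyAsync(d_in, h_in, bytes, hipMemcpyHostToDevice, copyStream);
   process<<<blocks, threadsPerBlock, 0, computeStream>>>(d_other, n);

   hipStreamSynchronize(copyStream);
   hipStreamSynchronize(computeStream);

   hipStreamDestroy(copyStream);
   hipStreamDestroy(computeStream);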

Memory Management on the GPU
Memory management
================================================================================

GPU memory accesses can be a performance bottleneck if not handled correctly.
Use the different GPU memory types effectively (e.g., global, shared, constant,
and local memory). Shared memory is faster than global memory but limited in
size. Shared memory is ideal for reusing data across threads in a block. Ensure
memory accesses are coalesced (i.e., threads in a warp access consecutive memory
locations), as uncoalesced memory access patterns can significantly degrade
performance.
GPUs generally have their own distinct memory, also referred to as :ref:`device
memory <device_memory>`, separate from the :ref:`host memory <device_memory>`.
Device memory needs to be managed separately from the host memory, which
includes allocating the memory and transferring data between the host and the
device. These operations can be performance critical, so it is important to
know how to use them effectively.

For more information, see :ref:`memory management <memory_management>`.
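
A minimal sketch of this allocate, transfer and free pattern (the buffer names
and size are only illustrative):

.. code-block:: cpp

   const size_t bytes = n * sizeof(float);
   float* d_data = nullptr;

   hipMalloc(&d_data, bytes);                                // allocate device memory
   hipMemcpy(d_data, h_data, bytes, hipMemcpyHostToDevice);  // copy host -> device
   // ... launch kernels that operate on d_data ...
   hipMemcpy(h_data, d_data, bytes, hipMemcpyDeviceToHost);  // copy device -> host
   hipFree(d_data);                                          // release device memory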

Synchronize CPU and GPU Workloads
================================================================================

The host (CPU) and device (GPU) execute tasks asynchronously, but proper
synchronization is needed to ensure correct results. Use synchronization
functions like ``hipDeviceSynchronize()`` or ``hipStreamSynchronize()`` to
ensure that kernels have completed execution before using their results. Take
advantage of asynchronous execution to overlap data transfers, kernel execution,
and CPU tasks where possible.
Tasks on the host and devices run asynchronously, so proper synchronization is
needed when dependencies between those tasks exist. Asynchronous execution of
tasks is useful to fully utilize the available resources. Even when only a
single device is available, memory transfers and the execution of tasks can be
overlapped with asynchronous execution.
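
For example (a sketch; the kernel, stream and buffers are placeholders), work
can be issued asynchronously and the host only blocks once the results are
actually needed:

.. code-block:: cpp

   // Kernel launches and asynchronous copies return immediately on the host.
   compute<<<blocks, threadsPerBlock, 0, stream>>>(d_data, n);
   hipMemcpyAsync(h_result, d_data, bytes, hipMemcpyDeviceToHost, stream);

   // Block the host until all work queued in the stream has finished,
   // so h_result can safely be read afterwards.
   hipStreamSynchronize(stream);

   // Alternatively, wait for all outstanding work on the device:
   // hipDeviceSynchronize();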

Error Handling
================================================================================

Check for errors after memory transfers and kernel launches, for example
``hipGetLastError()``. Catch and handle errors to allow the application to
gracefully exit, with appropriate messaging. For more information, see
`Error Handling <https://rocm.docs.amd.com/projects/HIP/en/latest/how-to/error_handling.html>`_.
All functions in the HIP runtime API return an error value of type
:cpp:enum:`hipError_t`, which can be used to verify whether the function
executed successfully. It is therefore important to check these return values
in order to catch and handle errors where possible.
An exception to this are kernel launches, which do not return any value. Errors
from kernel launches can be caught with specific functions like
:cpp:func:`hipGetLastError()`.
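
A common pattern, shown here only as a sketch (the ``HIP_CHECK`` macro is an
arbitrary name chosen for this example, not provided by HIP), wraps runtime
calls in a checking macro and queries ``hipGetLastError()`` after a kernel
launch:

.. code-block:: cpp

   #include <hip/hip_runtime.h>
   #include <cstdio>
   #include <cstdlib>

   // HIP_CHECK is a name chosen for this example, not a HIP-provided macro.
   #define HIP_CHECK(expr)                                                   \
       do {                                                                  \
           hipError_t err = (expr);                                          \
           if (err != hipSuccess) {                                          \
               std::fprintf(stderr, "HIP error '%s' at %s:%d\n",             \
                            hipGetErrorString(err), __FILE__, __LINE__);     \
               std::exit(EXIT_FAILURE);                                      \
           }                                                                 \
       } while (0)

   // Usage:
   // HIP_CHECK(hipMalloc(&d_data, bytes));
   // kernel<<<blocks, threadsPerBlock>>>(d_data, n);
   // HIP_CHECK(hipGetLastError()); // catches errors from the kernel launch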

Multi-GPU and Load Balancing
================================================================================
2 changes: 2 additions & 0 deletions docs/understand/programming_model.rst
@@ -4,6 +4,8 @@
code, how it maps to the hardware.
:keywords: AMD, ROCm, HIP, CUDA, API design

.. _programming_model:

*******************************************************************************
HIP programming model
*******************************************************************************
