
Commit

WIP
neon60 committed May 24, 2024
1 parent d426b03 commit 7dd32ef
Showing 4 changed files with 10 additions and 6 deletions.
4 changes: 4 additions & 0 deletions .wordlist.txt
@@ -10,7 +10,11 @@ enqueues
embeded
fatbinary
GPGPU
hipcc
Interoperation
latencies
Malloc
malloc
multicore
NDRange
Numa
2 changes: 1 addition & 1 deletion docs/how-to/hip_porting_driver_api.md
@@ -99,7 +99,7 @@ HIP-Clang will push primary context to context stack when it is empty. This can

#### `.hip_fatbin`

-hip-clang links device code from different translation units together. For each device target, a code object is generated. Code objects for different device targets are bundled by `clang-offload-bundler` as one fatbinary, which is embeded as a global symbol `__hip_fatbin` in the .hip_fatbin section of the ELF file of the executable or shared object.
+hip-clang links device code from different translation units together. For each device target, a code object is generated. Code objects for different device targets are bundled by `clang-offload-bundler` as one fatbinary, which is embeded as a global symbol `__hip_fatbin` in the `.hip_fatbin` section of the ELF file of the executable or shared object.

#### Initialization and Termination Functions

2 changes: 1 addition & 1 deletion docs/how-to/hip_rtc.md
@@ -231,7 +231,7 @@ HIPRTC provides a few HIPRTC specific flags
* `--gpu-architecture` : This flag can guide the code object generation for a specific gpu arch. Example: `--gpu-architecture=gfx906:sramecc+:xnack-`, its equivalent to `--offload-arch`.
* This option is compulsory if compilation is done on a system without AMD GPUs supported by HIP runtime.
* Otherwise, HIPRTC will load the hip runtime and gather the current device and its architecture info and use it as option.
-* `-fgpu-rdc` : This flag when provided during the hiprtcCompileProgram generates the bitcode (HIPRTC doesn't convert this bitcode into ISA and binary). This bitcode can later be fetched using hiprtcGetBitcode and hiprtcGetBitcodeSize APIs.
+* `-fgpu-rdc` : This flag when provided during the hiprtcCompileProgram generates the bitcode (HIPRTC doesn't convert this bitcode into ISA and binary). This bitcode can later be fetched using hiprtcGetBitcode and `hiprtcGetBitcodeSize` APIs.
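
The `--gpu-architecture` example in the list above (`gfx906:sramecc+:xnack-`) reads as a processor name followed by colon-separated features, each ending in `+` or `-`. As a minimal sketch of that reading only, the hypothetical helper below splits such a string. `TargetId` and `parse_target_id` are illustrative names, not part of HIPRTC, and the grammar is inferred from the single example in the text.

```cpp
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper (not a HIPRTC API): splits a target ID such as
// "gfx906:sramecc+:xnack-" into a processor name and on/off features.
struct TargetId {
    std::string processor;                 // e.g. "gfx906"
    std::map<std::string, bool> features;  // e.g. sramecc -> true, xnack -> false
};

inline TargetId parse_target_id(const std::string& id) {
    TargetId out;
    std::istringstream in(id);
    std::string field;
    bool first = true;
    while (std::getline(in, field, ':')) {
        if (first) {  // the first field is the processor name
            out.processor = field;
            first = false;
            continue;
        }
        if (field.size() < 2) continue;  // need a name plus a trailing sign
        const char sign = field.back();
        if (sign != '+' && sign != '-') continue;  // ignore fields without a sign
        out.features[field.substr(0, field.size() - 1)] = (sign == '+');
    }
    return out;
}
```

Calling `parse_target_id("gfx906:sramecc+:xnack-")` yields the processor `gfx906` with `sramecc` enabled and `xnack` disabled, which is one way to sanity-check an option string before passing it to the compiler.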

### Bitcode

8 changes: 4 additions & 4 deletions docs/how-to/programming_manual.md
@@ -34,7 +34,7 @@ Note, Numa policy is so far implemented on Linux, and under development on Windo

ROCm defines two coherency options for host memory:

-* Coherent memory : Supports fine-grain synchronization while the kernel is running. For example, a kernel can perform atomic operations that are visible to the host CPU or to other (peer) GPUs. Synchronization instructions include threadfence_system and C++11-style atomic operations.
+* Coherent memory : Supports fine-grain synchronization while the kernel is running. For example, a kernel can perform atomic operations that are visible to the host CPU or to other (peer) GPUs. Synchronization instructions include `threadfence_system` and C++11-style atomic operations.
In order to achieve this fine-grained coherence, many AMD GPUs use a limited cache policy, such as leaving these allocations uncached by the GPU, or making them read-only.

* Non-coherent memory : Can be cached by GPU, but cannot support synchronization while the kernel is running. Non-coherent memory can be optionally synchronized only at command (end-of-kernel or copy command) boundaries. This memory is appropriate for high-performance access when fine-grain synchronization is not required.
@@ -116,7 +116,7 @@ HIP supports Stream Memory Operations to enable direct synchronization between N
`hipStreamWriteValue64`
Note, CPU access to the semaphore's memory requires volatile keyword to disable CPU compiler's optimizations on memory access.
-For more details, please check the documentation HIP-API.pdf.
+For more details, please check the documentation `HIP-API.pdf`.
Please note, HIP stream does not guarantee concurrency on AMD hardware for the case of multiple (at least 6) long-running streams executing concurrently, using `hipStreamSynchronize(nullptr)` for synchronization.
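
The note in this hunk about `volatile` access to the semaphore's memory can be sketched on the host side as follows. This is a minimal illustration under stated assumptions: `wait_for_value` is a hypothetical helper, and a plain variable stands in for host-visible memory that a device would update, for example through `hipStreamWriteValue64`. No HIP call is made here.

```cpp
#include <cstdint>

// Hypothetical helper: polls a 64-bit semaphore until it equals `expected`
// or `max_polls` reads have been made. The pointer is volatile so that the
// compiler performs a real memory read on every iteration instead of
// hoisting the load out of the loop.
inline bool wait_for_value(const volatile std::uint64_t* semaphore,
                           std::uint64_t expected,
                           std::uint64_t max_polls) {
    for (std::uint64_t i = 0; i < max_polls; ++i) {
        if (*semaphore == expected) {
            return true;  // the awaited value has been observed
        }
    }
    return false;  // gave up after max_polls reads
}
```

The bounded poll count keeps the sketch from spinning forever. `volatile` only prevents the compiler from eliding the reads, so any further ordering between CPU and GPU would have to come from the stream operations themselves.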
@@ -184,13 +184,13 @@ can be contracted. Tolerance should be used for floating point comparisons.
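
The context line above recommends a tolerance for floating point comparisons. One common way to do that is a combined relative and absolute tolerance check, sketched below. `nearly_equal` and its default tolerances are illustrative assumptions, not values taken from the HIP documentation.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical helper: returns true when a and b differ by no more than a
// relative tolerance (scaled by the larger magnitude) or an absolute
// tolerance, whichever bound is larger.
inline bool nearly_equal(float a, float b,
                         float rel_tol = 1e-5f,
                         float abs_tol = 1e-8f) {
    const float diff = std::fabs(a - b);
    const float scale = std::max(std::fabs(a), std::fabs(b));
    return diff <= std::max(rel_tol * scale, abs_tol);
}
```

A check such as `nearly_equal(a * b + c, std::fma(a, b, c))` then accepts results that differ only because one side was contracted into a fused multiply-add, provided the difference stays within the chosen tolerance.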
## Math functions with special rounding modes
-Note: Currently, HIP only supports basic math functions with rounding modern (round to nearest). HIP does not support basic math functions with rounding modes ru (round up), rd (round down), and rz (round towards zero).
+Note: Currently, HIP only supports basic math functions with rounding modern (round to nearest). HIP does not support basic math functions with rounding modes `ru` (round up), `rd` (round down), and `rz` (round towards zero).
## Creating Static Libraries
HIP-Clang supports generating two types of static libraries. The first type of static library does not export device functions, and only exports and launches host functions within the same library. The advantage of this type is the ability to link with a non-hipcc compiler such as gcc. The second type exports device functions to be linked by other code objects. However, this requires using hipcc as the linker.
-In addition, the first type of library contains host objects with device code embedded as fat binaries. It is generated using the flag --emit-static-lib. The second type of library contains relocatable device objects and is generated using ar.
+In addition, the first type of library contains host objects with device code embedded as fat binaries. It is generated using the flag --emit-static-lib. The second type of library contains relocatable device objects and is generated using `ar`.
Here is an example to create and use static libraries:
