<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="robots" content="index, follow" />
<meta name="keywords" content="OpenCL portable OpenCL PoCL pocl Portable Computing Language" />
<meta name="description" content="PoCL - Portable Computing Language" />
<meta property="og:title" content="PoCL home page"/>
<meta property="og:site_name" content="PoCL"/>
<meta property="og:type" content="website"/>
<meta property="og:description" content="PoCL: a performance portable open source OpenCL implementation"/>
<meta property="og:url" content="http://portablecl.org"/>
<title>PoCL - Portable Computing Language | Portable Computing Language (PoCL) v4.0 released</title>
<link rel="stylesheet" type="text/css" href="pocl-style.css" />
</head>
<body>
<div id="page">
<div id="header">
<h1 id="title"><span style="height: 100%; vertical-align: middle;"></span>
<a href="http://portablecl.org"><img src="img/pocl-80x60.png" border="0" style="vertical-align: middle;"></a>
<span style="height: 100%; vertical-align: middle;"> Portable Computing Language | Portable Computing Language (PoCL) v4.0 released</span></h1>
</div>
<div id="navi">
<ul id="menu_item_list">
<li class="menu_item"><a href="index.html" class="menu_link">Main</a></li>
<li class="menu_item"><a href="download.html" class="menu_link">Download</a></li>
<li class="menu_item"><a href="docs/html" class="menu_link">Documentation</a></li>
<li class="menu_item"><a href="development.html" class="menu_link">Development</a></li>
<li class="menu_item"><a href="discussion.html" class="menu_link">Discussion</a></li>
<li class="menu_item"><a href="https://github.com/pocl/pocl/wiki" class="menu_link">Wiki</a></li>
<li class="menu_item"><a href="publications.html" class="menu_link">Publications</a></li>
</ul>
</div>
<div id="content">
<h1>June 22, 2023: Portable Computing Language (PoCL) v4.0 released</h1>
<div class="document" id="release-notes-for-pocl-4-0">
<h1 class="title">Release Notes for PoCL 4.0</h1>
<div class="section" id="major-new-features">
<h1>Major new features</h1>
<div class="section" id="support-for-clang-llvm-16-0">
<h2>Support for Clang/LLVM 16.0</h2>
<p>PoCL now supports Clang/LLVM from 10.0 to 16.0 inclusive. The most PoCL-relevant
change of the new 16.0 release is support for <a class="reference external" href="https://releases.llvm.org/16.0.0/tools/clang/docs/LanguageExtensions.html#half-precision-floating-point">_Float16 type on x86 and ARM targets.</a></p>
</div>
<div class="section" id="cpu-driver">
<h2>CPU driver</h2>
<div class="section" id="support-for-program-scope-variables">
<h3>Support for program-scope variables</h3>
<p>Program-scope global variables are now supported, along with static global
variables declared at function scope, for both OpenCL C source and SPIR-V compilation. The implementation passes
the <tt class="docutils literal">basic/test_basic</tt> test of the OpenCL-CTS, and has been tested with
client applications through chipStar.</p>
<pre class="code c literal-block">
<span class="name">global</span><span class="whitespace"> </span><span class="keyword type">float</span><span class="whitespace"> </span><span class="name">testGlobalVar</span><span class="punctuation">[</span><span class="literal number integer">128</span><span class="punctuation">];</span><span class="whitespace">
</span><span class="name">__kernel</span><span class="whitespace"> </span><span class="keyword type">void</span><span class="whitespace"> </span><span class="name">test1</span><span class="whitespace"> </span><span class="punctuation">(</span><span class="name">__global</span><span class="whitespace"> </span><span class="keyword">const</span><span class="whitespace"> </span><span class="keyword type">float</span><span class="whitespace"> </span><span class="operator">*</span><span class="name">a</span><span class="punctuation">)</span><span class="whitespace"> </span><span class="punctuation">{</span><span class="whitespace">
</span><span class="keyword type">size_t</span><span class="whitespace"> </span><span class="name">i</span><span class="whitespace"> </span><span class="operator">=</span><span class="whitespace"> </span><span class="name">get_global_id</span><span class="punctuation">(</span><span class="literal number integer">0</span><span class="punctuation">)</span><span class="whitespace"> </span><span class="operator">%</span><span class="whitespace"> </span><span class="literal number integer">128</span><span class="punctuation">;</span><span class="whitespace">
</span><span class="name">testGlobalVar</span><span class="punctuation">[</span><span class="name">i</span><span class="punctuation">]</span><span class="whitespace"> </span><span class="operator">+=</span><span class="whitespace"> </span><span class="name">a</span><span class="punctuation">[</span><span class="name">i</span><span class="punctuation">];</span><span class="whitespace">
</span><span class="punctuation">}</span><span class="whitespace">
</span><span class="name">__kernel</span><span class="whitespace"> </span><span class="keyword type">void</span><span class="whitespace"> </span><span class="name">test2</span><span class="whitespace"> </span><span class="punctuation">(</span><span class="name">__global</span><span class="whitespace"> </span><span class="keyword">const</span><span class="whitespace"> </span><span class="keyword type">float</span><span class="whitespace"> </span><span class="operator">*</span><span class="name">a</span><span class="punctuation">)</span><span class="whitespace"> </span><span class="punctuation">{</span><span class="whitespace">
</span><span class="keyword type">size_t</span><span class="whitespace"> </span><span class="name">i</span><span class="whitespace"> </span><span class="operator">=</span><span class="whitespace"> </span><span class="name">get_global_id</span><span class="punctuation">(</span><span class="literal number integer">0</span><span class="punctuation">)</span><span class="whitespace"> </span><span class="operator">%</span><span class="whitespace"> </span><span class="literal number integer">128</span><span class="punctuation">;</span><span class="whitespace">
</span><span class="name">testGlobalVar</span><span class="punctuation">[</span><span class="name">i</span><span class="punctuation">]</span><span class="whitespace"> </span><span class="operator">*=</span><span class="whitespace"> </span><span class="name">a</span><span class="punctuation">[</span><span class="name">i</span><span class="punctuation">];</span><span class="whitespace">
</span><span class="punctuation">}</span><span class="whitespace">
</span><span class="name">__kernel</span><span class="whitespace"> </span><span class="keyword type">void</span><span class="whitespace"> </span><span class="name">test3</span><span class="whitespace"> </span><span class="punctuation">(</span><span class="name">__global</span><span class="whitespace"> </span><span class="keyword type">float</span><span class="whitespace"> </span><span class="operator">*</span><span class="name">out</span><span class="punctuation">)</span><span class="whitespace"> </span><span class="punctuation">{</span><span class="whitespace">
</span><span class="keyword type">size_t</span><span class="whitespace"> </span><span class="name">i</span><span class="whitespace"> </span><span class="operator">=</span><span class="whitespace"> </span><span class="name">get_global_id</span><span class="punctuation">(</span><span class="literal number integer">0</span><span class="punctuation">)</span><span class="whitespace"> </span><span class="operator">%</span><span class="whitespace"> </span><span class="literal number integer">128</span><span class="punctuation">;</span><span class="whitespace">
</span><span class="name">out</span><span class="punctuation">[</span><span class="name">i</span><span class="punctuation">]</span><span class="whitespace"> </span><span class="operator">=</span><span class="whitespace"> </span><span class="name">testGlobalVar</span><span class="punctuation">[</span><span class="name">i</span><span class="punctuation">];</span><span class="whitespace">
</span><span class="punctuation">}</span>
</pre>
</div>
<div class="section" id="support-for-generic-address-space">
<h3>Support for generic address space</h3>
<p>Generic AS is now supported, for both OpenCL C source and SPIR-V compilation.
PoCL now passes the <tt class="docutils literal">generic_address_space/test_generic_address_space</tt> test
of the OpenCL-CTS, and has been tested with CUDA/HIP applications through chipStar.</p>
<pre class="code c literal-block">
<span class="keyword type">int</span><span class="whitespace"> </span><span class="name function">isOdd</span><span class="punctuation">(</span><span class="keyword type">int</span><span class="whitespace"> </span><span class="operator">*</span><span class="name">val</span><span class="punctuation">)</span><span class="whitespace"> </span><span class="punctuation">{</span><span class="whitespace">
</span><span class="keyword">return</span><span class="whitespace"> </span><span class="name">val</span><span class="punctuation">[</span><span class="literal number integer">0</span><span class="punctuation">]</span><span class="whitespace"> </span><span class="operator">%</span><span class="whitespace"> </span><span class="literal number integer">2</span><span class="punctuation">;</span><span class="whitespace">
</span><span class="punctuation">}</span><span class="whitespace">
</span><span class="name">__kernel</span><span class="whitespace"> </span><span class="keyword type">void</span><span class="whitespace"> </span><span class="name">test3</span><span class="whitespace"> </span><span class="punctuation">(</span><span class="name">__global</span><span class="whitespace"> </span><span class="keyword type">int</span><span class="whitespace"> </span><span class="operator">*</span><span class="name">in1</span><span class="punctuation">,</span><span class="whitespace"> </span><span class="name">__local</span><span class="whitespace"> </span><span class="keyword type">int</span><span class="whitespace"> </span><span class="operator">*</span><span class="name">in2</span><span class="punctuation">,</span><span class="whitespace"> </span><span class="name">__global</span><span class="whitespace"> </span><span class="keyword type">int</span><span class="whitespace"> </span><span class="operator">*</span><span class="name">out</span><span class="punctuation">)</span><span class="whitespace"> </span><span class="punctuation">{</span><span class="whitespace">
</span><span class="keyword type">size_t</span><span class="whitespace"> </span><span class="name">i</span><span class="whitespace"> </span><span class="operator">=</span><span class="whitespace"> </span><span class="name">get_global_id</span><span class="punctuation">(</span><span class="literal number integer">0</span><span class="punctuation">);</span><span class="whitespace">
</span><span class="name">out</span><span class="punctuation">[</span><span class="name">i</span><span class="punctuation">]</span><span class="whitespace"> </span><span class="operator">=</span><span class="whitespace"> </span><span class="name">isOdd</span><span class="punctuation">(</span><span class="name">in1</span><span class="operator">+</span><span class="name">i</span><span class="punctuation">)</span><span class="whitespace"> </span><span class="operator">+</span><span class="whitespace"> </span><span class="name">isOdd</span><span class="punctuation">(</span><span class="name">in2</span><span class="operator">+</span><span class="punctuation">(</span><span class="name">i</span><span class="whitespace"> </span><span class="operator">%</span><span class="whitespace"> </span><span class="literal number integer">128</span><span class="punctuation">));</span><span class="whitespace">
</span><span class="punctuation">}</span>
</pre>
</div>
<div class="section" id="initial-support-for-cl-khr-subgroups">
<h3>Initial support for cl_khr_subgroups</h3>
<p>By default there is a single subgroup, which always executes all work-items (WIs) of the X dimension.
Independent forward progress is not yet supported, but it is
not needed for CTS compliance, because of the corner case where only one subgroup (SG) is in flight.</p>
<p>Additionally, there are partial implementations of <tt class="docutils literal">cl_khr_subgroup_shuffle</tt>,
<tt class="docutils literal">cl_intel_subgroups</tt> and <tt class="docutils literal">cl_khr_subgroup_ballot</tt>, with caveats:</p>
<blockquote>
<ul class="simple">
<li><tt class="docutils literal">cl_khr_subgroup_shuffle</tt>: Passes the CTS, but only because the CTS doesn't test
non-uniform (lock-step) behavior, see:
<a class="reference external" href="https://github.com/KhronosGroup/OpenCL-CTS/issues/1236">https://github.com/KhronosGroup/OpenCL-CTS/issues/1236</a></li>
<li><tt class="docutils literal">cl_khr_subgroup_ballot</tt>: <tt class="docutils literal">sub_group_ballot()</tt> works for uniform calls; the remaining
functions are unimplemented.</li>
<li><tt class="docutils literal">cl_intel_subgroups</tt>: The block reads/writes are unimplemented.</li>
</ul>
</blockquote>
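<p>As a rough illustration of the core <tt class="docutils literal">cl_khr_subgroups</tt> built-ins, the following
kernel is a minimal sketch that sums the inputs of each subgroup. The kernel and argument names are hypothetical,
and whether a given built-in behaves as expected on a particular PoCL device should be checked against the caveats listed above.</p>
<pre class="code c literal-block">
#pragma OPENCL EXTENSION cl_khr_subgroups : enable

/* Hypothetical example kernel: each subgroup sums its work-items' inputs. */
__kernel void subgroup_sum (__global const int *in, __global int *out) {
  size_t gid = get_global_id(0);
  /* Every work-item in the subgroup receives the same total. */
  int total = sub_group_reduce_add(in[gid]);
  /* Only the first work-item of each subgroup stores the result. */
  if (get_sub_group_local_id() == 0)
    out[gid] = total;
}
</pre>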
</div>
<div class="section" id="initial-support-for-cl-intel-required-subgroup-size">
<h3>Initial support for cl_intel_required_subgroup_size</h3>
<p>This extension allows the programmer to specify the required subgroup size for
a kernel function, which can be important for algorithm correctness in some cases. chipStar uses it to implement fixed-width warps when needed. The programmer
can specify the size with a new kernel attribute:
<tt class="docutils literal"><span class="pre">__attribute__((intel_reqd_sub_group_size(&lt;int&gt;)))</span></tt></p>
<p>PoCL additionally implements the <tt class="docutils literal">CL_DEVICE_SUB_GROUP_SIZES_INTEL</tt> parameter of the <tt class="docutils literal">clGetDeviceInfo</tt> API;
however, <tt class="docutils literal">CL_KERNEL_SPILL_MEM_SIZE_INTEL</tt> and <tt class="docutils literal">CL_KERNEL_COMPILE_SUB_GROUP_SIZE_INTEL</tt> for the
<tt class="docutils literal">clGetKernelWorkGroupInfo</tt> API are not yet implemented.</p>
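<p>A minimal sketch of how the attribute might be attached to a kernel is shown below. The kernel name, its argument
and the size 8 are illustrative assumptions; the sizes a device actually accepts can be queried with
<tt class="docutils literal">CL_DEVICE_SUB_GROUP_SIZES_INTEL</tt>.</p>
<pre class="code c literal-block">
#pragma OPENCL EXTENSION cl_khr_subgroups : enable

/* Hypothetical kernel that asks for subgroups of exactly 8 work-items. */
__attribute__((intel_reqd_sub_group_size(8)))
__kernel void warp_like (__global int *out) {
  size_t gid = get_global_id(0);
  /* Records the subgroup size seen by each work-item; this should be 8
     when the work-group size is a multiple of 8 and the attribute is honoured. */
  out[gid] = (int)get_sub_group_size();
}
</pre>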
</div>
<div class="section" id="initial-support-for-cl-khr-fp16">
<h3>Initial support for cl_khr_fp16</h3>
<p>PoCL now has partial support for <tt class="docutils literal">cl_khr_fp16</tt> when compiled with Clang/LLVM 16+.
The implementation relies on Clang, and may result in emulation (promoting to
fp32) if the CPU does not support the required instruction set. In
Clang/LLVM 16+, the following targets have native fp16 support: 32-bit and
64-bit ARM (depending on vendor), and x86-64 with AVX512-FP16.
Support currently covers only part of the builtin library: those functions
that are implemented with either an expression or a Clang builtin.</p>
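<p>The kind of kernel this targets can be sketched as follows. The kernel and argument names are hypothetical, and the
example deliberately uses only plain arithmetic expressions on <tt class="docutils literal">half</tt> values rather than library builtins.</p>
<pre class="code c literal-block">
#pragma OPENCL EXTENSION cl_khr_fp16 : enable

/* Hypothetical kernel: y = a * x + y computed in half precision. */
__kernel void haxpy (float a, __global const half *x, __global half *y) {
  size_t i = get_global_id(0);
  /* Plain expression on half values; on CPUs without native fp16
     the compiler may emulate it by promoting to fp32. */
  y[i] = (half)a * x[i] + y[i];
}
</pre>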
</div>
</div>
<div class="section" id="level-zero-driver">
<h2>Level Zero driver</h2>
<p>This is a new experimental driver that supports devices accessible via the Level Zero API.</p>
<p>The driver has been tested with multiple devices (iGPU and dGPU),
and passes a large portion of the PoCL tests (87% of tests pass; 32 tests
fail out of 254). However, it is neither finished nor optimized yet,
and therefore cannot be considered production quality.</p>
<p>The driver supports the following OpenCL extensions, in addition to atomics:
cl_khr_il_program, cl_khr_3d_image_writes,
cl_khr_fp16, cl_khr_fp64, cl_khr_subgroups, cl_intel_unified_shared_memory.
In addition, Specialization Constants and SVM are supported.</p>
<p>We also intend to use the driver for prototyping features not found in
the official Intel Compute Runtime OpenCL drivers, and for experimenting
with asynchronous execution with other OpenCL devices in the same PoCL platform.
One such feature that is already implemented is JIT kernel compilation, which is
useful with programs that have thousands of kernels but only launch a few of
them (e.g. when using SPIR-V IL produced from heavily templated C++ code).
For details, see the full driver documentation in <cite>doc/sphinx/source/level0.rst</cite>.</p>
<div class="section" id="support-for-cl-intel-unified-shared-memory">
<h3>Support for cl_intel_unified_shared_memory</h3>
<p>This extension, together with SPIR-V support and other new features, allows
using PoCL as an OpenCL backend for SYCL runtimes. This works with both the
CPU driver (tested on x86-64 &amp; ARM64) and the Level Zero driver. Vincent A. Arcila
has contributed a guide for building PoCL as a SYCL runtime backend on ARM.</p>
<p>Additionally, there is a new testsuite integrated into PoCL for testing USM support,
<tt class="docutils literal"><span class="pre">intel-compute-samples</span></tt>. These are tests from <a class="reference external" href="https://github.com/intel/compute-samples">https://github.com/intel/compute-samples</a>,
and PoCL currently passes 78% of them (12 tests fail out of 54).</p>
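<p>For orientation, host code using <tt class="docutils literal">cl_intel_unified_shared_memory</tt> might look roughly like the sketch below.
It assumes that the installed OpenCL headers provide the USM typedefs in <tt class="docutils literal">CL/cl_ext.h</tt> and that the first platform
and its default device expose the extension; the buffer size is arbitrary and error handling is kept minimal.</p>
<pre class="code c literal-block">
#define CL_TARGET_OPENCL_VERSION 300
#include &lt;CL/cl.h&gt;
#include &lt;CL/cl_ext.h&gt;
#include &lt;stdio.h&gt;

int main (void) {
  cl_platform_id platform;
  cl_device_id device;
  cl_int err;

  /* Assumption: the first platform and its default device support USM. */
  if (clGetPlatformIDs (1, &amp;platform, NULL) != CL_SUCCESS) return 1;
  if (clGetDeviceIDs (platform, CL_DEVICE_TYPE_DEFAULT, 1, &amp;device, NULL) != CL_SUCCESS) return 1;
  cl_context ctx = clCreateContext (NULL, 1, &amp;device, NULL, NULL, &amp;err);
  if (err != CL_SUCCESS) return 1;

  /* USM entry points are extension functions and must be looked up at run time. */
  clSharedMemAllocINTEL_fn shared_alloc = (clSharedMemAllocINTEL_fn)
    clGetExtensionFunctionAddressForPlatform (platform, "clSharedMemAllocINTEL");
  clMemFreeINTEL_fn mem_free = (clMemFreeINTEL_fn)
    clGetExtensionFunctionAddressForPlatform (platform, "clMemFreeINTEL");
  if (shared_alloc == NULL || mem_free == NULL) {
    fprintf (stderr, "USM entry points not available\n");
    clReleaseContext (ctx);
    return 1;
  }

  /* A shared allocation is accessible from both the host and the device. */
  float *data = (float *) shared_alloc (ctx, device, NULL, 128 * sizeof (float), 0, &amp;err);
  if (err != CL_SUCCESS || data == NULL) {
    clReleaseContext (ctx);
    return 1;
  }

  /* The host can write through the pointer directly, without map/unmap calls. */
  for (int i = 0; i &lt; 128; ++i)
    data[i] = (float) i;

  mem_free (ctx, data);
  clReleaseContext (ctx);
  return 0;
}
</pre>
<p>To pass such a pointer to a kernel, the extension provides <tt class="docutils literal">clSetKernelArgMemPointerINTEL</tt>, which is looked up in the same way.</p>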
</div>
</div>
<div class="section" id="new-testsuites">
<h2>New testsuites</h2>
<p>There are also multiple new CTest testsuites in PoCL. For testing PoCL as a SYCL backend,
there are three new testsuites: <tt class="docutils literal"><span class="pre">dpcpp-book-samples</span></tt>, <tt class="docutils literal"><span class="pre">oneapi-samples</span></tt> and <tt class="docutils literal"><span class="pre">simple-sycl-samples</span></tt>.</p>
<ul class="simple">
<li><tt class="docutils literal"><span class="pre">dpcpp-book-samples</span></tt>: samples from <a class="reference external" href="https://github.com/Apress/data-parallel-CPP">https://github.com/Apress/data-parallel-CPP</a>.
PoCL currently passes 90 out of 95 tests.</li>
<li><tt class="docutils literal"><span class="pre">oneapi-samples</span></tt>: samples from <a class="reference external" href="https://github.com/oneapi-src/oneAPI-samples">https://github.com/oneapi-src/oneAPI-samples</a>.
Only a few have been enabled in PoCL for now, because each sample is a separate CMake project.</li>
<li><tt class="docutils literal"><span class="pre">simple-sycl-samples</span></tt>: samples from <a class="reference external" href="https://github.com/bashbaug/simple-sycl-samples">https://github.com/bashbaug/simple-sycl-samples</a>.
The suite currently contains only 8 samples, and PoCL passes all of them.</li>
</ul>
<p>For testing PoCL as chipStar's OpenCL backend, there is the <tt class="docutils literal">chipStar</tt> testsuite. It builds
the runtime and the tests from <a class="reference external" href="https://github.com/CHIP-SPV/chipStar">https://github.com/CHIP-SPV/chipStar</a>, and
runs a subset of the tests (approximately 800) with PoCL as chipStar's backend.</p>
</div>
<div class="section" id="mac-os-x-support">
<h2>Mac OS X support</h2>
<p>Thanks to the efforts of Isuru Fernando, who stepped up to become the official Mac OS X port maintainer, PoCL's CPU driver has again been fixed to work on Mac OS X.
The current 4.0 release has been tested on these configurations:</p>
<p>MacOS 10.13 (Intel Sandy Bridge), MacOS 11.7 (Intel Ivy Bridge) with Clang 15.</p>
<p>Additionally, there are now Github Actions for CI testing of PoCL with Mac OS X,
testing 4 different configurations: LLVM 15 and 16, with and without ICD loader.</p>
</div>
<div class="section" id="github-actions">
<h2>Github Actions</h2>
<p>The original CI used by PoCL authors (Python Buildbot, <a class="reference external" href="https://buildbot.net">https://buildbot.net</a>)
has been converted to a publicly accessible Github Actions CI. The actions are currently
set up to test PoCL rigorously with the last two LLVM versions, and to run basic tests with
older LLVM versions. The most tested driver is the CPU driver, with multiple
configurations enabling or testing different features: sanitizers, external
testsuites, SYCL support, OpenCL conformance, SPIR-V support. There are also
basic tests for other experimental/WiP/research-drivers in PoCL: OpenASIP, Vulkan, CUDA, and LevelZero.</p>
</div>
</div>
<div class="section" id="bugfixes-and-minor-features">
<h1>Bugfixes and minor features</h1>
<ul class="simple">
<li>CMake: it's now possible to disable libhwloc support even when it's present,
using the -DENABLE_HWLOC=0 CMake option.</li>
<li>AlmaIF's OpenASIP backend now supports a standalone mode.
It generates a standalone C program from a kernel launch, which
can then be compiled and executed with ttasim or RTL simulation.</li>
<li>Added a user environment variable POCL_BITCODE_FINALIZER that can be used to
call a custom script that manipulates the final bitcode before
it is passed to code generation.</li>
<li>New alternative work-group function mode for non-SPMD from Open SYCL:
Continuation-based synchronization is somewhat more general than the default one in PoCL's
current kernel compiler, but allows for fewer hand-rolled optimizations.
CBS is expected to work for kernels that PoCL's current kernel compiler
does not support. Currently, CBS can be manually enabled by setting
the environment variable <cite>POCL_WORK_GROUP_METHOD=cbs</cite>.</li>
<li>Linux/x86-64 only: the SIGFPE handler now skips instructions
causing division-by-zero only if the fault occurred in one of the CPU driver
threads, so division-by-zero errors in user threads are no longer hidden.</li>
<li>CUDA driver: POCL_CUDA_VERIFY_MODULE env variable has been replaced by POCL_LLVM_VERIFY</li>
<li>CUDA driver: compilation now defaults to <cite>-ffp-contract=fast</cite>, previously it was <cite>-ffp-contract=on</cite>.</li>
<li>CUDA driver: support for direct peer-to-peer buffer migrations.
This allows much better performance scaling in multi-GPU scenarios.</li>
<li>OpenCL C: <cite>-cl-fast-relaxed-math</cite> now defaults to <cite>-ffp-contract=fast</cite>, previously it was <cite>-ffp-contract=on</cite>.</li>
<li>CPU drivers: renamed 'basic' to 'cpu-minimal' and 'pthread' driver to 'cpu',
to reflect the hardware they're driving instead of implementation details.</li>
<li>CPU drivers: POCL_MAX_PTHREAD_COUNT renamed to POCL_CPU_MAX_CU_COUNT;
the old env. variable is deprecated but still works</li>
<li>CPU drivers: Added a new POCL_CPU_LOCAL_MEM_SIZE environment variable for overriding the
local memory size.</li>
<li>CPU drivers: OpenCL C printf() flushes output after each call instead of waiting
for the end of the kernel command. This makes it more useful for debugging
kernel segfaults.</li>
</ul>
</div>
</div>
</div>
<div id="footer">
<span style="height: 100%; vertical-align: middle;"></span>
<a href="http://portablecl.org"><img src="img/pocl-80x60.png" border="0" style="vertical-align: middle;"></a>
<span style="height: 100%; vertical-align: middle;">Portable Computing Language © 2010-2023 PoCL developers
</span>
</div>
<div class="g-plusone" data-annotation="inline" data-width="300"></div>
</div>
<script type="text/javascript">
var _gaq = _gaq || [];
_gaq.push(['_setAccount', 'UA-36911879-1']);
_gaq.push(['_trackPageview']);
(function() {
var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
})();
(function() {
var po = document.createElement('script'); po.type = 'text/javascript'; po.async = true;
po.src = 'https://apis.google.com/js/plusone.js';
var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(po, s);
})();
</script>
</body>
</html>