High latencies #18

harald-lang · 2015-12-14T09:38:29Z

I noticed very high latencies for kernel dispatches using AQL. Synchronous dispatches take up to 21 µs. Asynchronous (batch) dispatches help to hide latencies. However, kernel dispatching still takes 6 µs on average, which is still far too slow for fine-grained offloading.

In my experiments I set HSA_ENABLE_INTERRUPT to 0, which greatly improves robustness of the kernel offload times. With interrupts enabled, latencies vary from 6 to 15 microseconds.

System setup:

Kaveri APU (no dGPU)
Kernel 4.0.0-100002-generic #201511031149 SMP
kfd-v1.6.1 (7fb04c4 from git repo HSA-Drivers-Linux-AMD)
HSA-Runtime 1.0.3 (fa0ef7e from git repo HSA-Runtime-AMD)
CL offline compiler (CLOC) v0.9.8

harald-lang · 2015-12-17T17:41:27Z

just FYI... the same experiment with a discrete NVidia card connected via PCIe takes 2 microseconds.

harald-lang · 2015-12-18T16:07:47Z

Update:
Setting the iGPU's frequency from "auto" to 720 MHz reduced the dispatch time to ~ 3µs.
Installing a dGPU to connect the display doesn't have an impact at all.

aditya4d · 2016-02-04T21:52:24Z

Hi,
Interesting observation. Are you still facing this?

harald-lang · 2016-02-05T01:59:08Z

Hi,
yes, nothing changed so far.
I wonder if something is wrong with my system configuration.
It would be very interesting to know, if other HSA developers are also facing this.

aditya4d · 2016-02-05T02:56:56Z

Hi,
Can you try spawning a new thread (running an empty function) and timing it? The CPU may be the bottleneck.
You could also try the ROC drivers with an AMD discrete card, with the integrated graphics disabled in the BIOS.
Thank you for profiling!! (Y)
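
A minimal sketch of the thread-spawn measurement suggested above, assuming C++11 and `std::thread`. The names `empty_task` and `average_thread_spawn_us` are hypothetical and not taken from the HSA runtime or the test code in this thread. It times creating and joining a thread that does nothing, so the result is a rough estimate of CPU-side spawn overhead only.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Empty function: the thread does no work, so only the cost of
// creating and joining the thread is measured.
static void empty_task() {}

// Returns the average time in microseconds needed to create and join
// one thread, averaged over `iterations` runs.
double average_thread_spawn_us(int iterations) {
    assert(iterations > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        std::thread t(empty_task);
        t.join();
    }
    const auto end = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::micro> total = end - start;
    return total.count() / iterations;
}
```

Calling `average_thread_spawn_us(1000)` from `main` and printing the result gives a number that can be compared against the dispatch latencies reported earlier in the thread. On some toolchains the program has to be linked with `-pthread`.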

gstoner · 2016-02-05T03:04:06Z

The ROC driver only runs with a Haswell CPU and a Fiji-based GPU. I am having the team look into this to see if it is a regression, but it would help to know which NVIDIA GPU you are comparing the APU to (we need the model number), and which APU it is, including its model number.

aditya4d · 2016-02-05T03:14:22Z

Hi Greg,

How about disabling the integrated graphics on the APU?

gstoner · 2016-02-05T03:17:52Z

Are you trying to run a Fiji card on an APU? We are only testing Fiji cards with Xeon E5 v3, Xeon E3, i7, and i5 CPUs (Haswell or newer), since the ROC driver and runtime need PCIe Gen 3 platform atomics.

aditya4d · 2016-02-05T03:27:17Z

Hi Harald,
Can you try rebuilding the driver? My guess is that the performance hit comes from either a new thread spawn or the ISA loader. Can you try profiling executable.cpp in the loader directory?
Thanks!

aditya4d · 2016-02-05T04:17:19Z

Hi Greg,
How about the new A10-7890K?

harald-lang · 2016-02-05T11:32:46Z

Hi Gregory,
for the NVidia experiment we used a GTX 650 connected via PCIe 2.0.
It was an entirely different system (which is out of my control). If you need more information about the system, I'll contact my colleague.

Alternatively, I can plug in a GTX 970 into the APU system and re-run the measurements...

@adityaatluri I'm going to profile the system as requested. I'll post the results ASAP.

aditya4d · 2016-02-05T12:07:47Z

Hi Harald,
Quick question. What is your target setup? For which configuration do you
want to solve this issue? APU? Or AMD GPU? Or NVIDIA GPU?

gstoner · 2016-02-05T12:35:48Z

Can I get the test you're running? I can do an A/B test on the same hardware with Fiji vs. Titan X.

harald-lang · 2016-02-05T12:51:01Z

Hi Aditya,
my target setup is an APU system.

aditya4d · 2016-02-05T13:17:08Z

Hi Harald,
Can you please provide the code you are trying to profile, or code that reproduces the same behavior? A usual AQL dispatch consists of writing to queue memory and ringing the doorbell. Also, just to make sure: there are no other kernel dispatches going on in the system, right?
Thanks!
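
The "write to queue memory, then ring the doorbell" pattern can be sketched with a small mock. This is an illustration only and not the HSA runtime API: `MockPacket`, `MockQueue`, `mock_dispatch`, and the header constants are hypothetical stand-ins, and the mock skips the queue-full check that real code needs. In the HSA runtime, to my understanding, the corresponding steps use `hsa_queue_add_write_index_relaxed` and a store to `queue->doorbell_signal` via `hsa_signal_store_relaxed`.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical header values for this mock only (not the HSA encoding).
constexpr std::uint16_t kSlotFree = 0;
constexpr std::uint16_t kSlotDispatch = 1;

// Simplified stand-in for an AQL packet; a real packet has more fields.
struct MockPacket {
    std::atomic<std::uint16_t> header{kSlotFree};
    std::uint64_t kernel_object = 0;
    std::uint64_t kernarg_address = 0;
};

// Simplified stand-in for a user-mode queue: a ring buffer of packets,
// a write index, and a doorbell the consumer would watch.
struct MockQueue {
    std::vector<MockPacket> ring;
    std::atomic<std::uint64_t> write_index{0};
    std::atomic<std::uint64_t> doorbell{0};

    explicit MockQueue(std::size_t size) : ring(size) {
        // The index masking below requires a power-of-two size.
        assert(size > 0 && (size & (size - 1)) == 0);
    }
};

// Enqueues one packet and returns the index that was written.
// Note: no queue-full check, which a real dispatch path must perform.
std::uint64_t mock_dispatch(MockQueue& q, std::uint64_t kernel_object,
                            std::uint64_t kernarg_address) {
    // 1. Reserve a slot by advancing the write index.
    const std::uint64_t index =
        q.write_index.fetch_add(1, std::memory_order_relaxed);
    MockPacket& p = q.ring[index & (q.ring.size() - 1)];

    // 2. Fill in the packet body while the header still marks it as free.
    p.kernel_object = kernel_object;
    p.kernarg_address = kernarg_address;

    // 3. Publish the header last, with release ordering, so a consumer
    //    that sees the header also sees the body.
    p.header.store(kSlotDispatch, std::memory_order_release);

    // 4. Ring the doorbell with the index of the packet just written.
    q.doorbell.store(index, std::memory_order_release);
    return index;
}
```

In this mock the whole dispatch path is a handful of memory writes on the CPU side, which is why the CPU-side cost of a dispatch is normally expected to be small compared with the latencies discussed here.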

harald-lang · 2016-02-05T14:34:43Z

Hi Aditya and Gregory,

I pushed the code to https://github.com/harald-lang/hsa-lab.

Quickstart instructions:

1. Adjust the paths to the HSA runtime in env.sh.
2. Source the env file: . ./env.sh
3. Make sure that cloc is installed.
4. Build the project: make bin/tester
5. Run the performance tests: bin/tester --gtest_filter=*Perf*

The output is a little verbose. The important lines start with milliseconds/dispatch = ???.

The dispatch functions can be found in src/rts/hsa/HsaContext.hpp.

harald-lang · 2016-02-05T18:38:50Z

Update:
I ran the tests on a different machine (Godavari APU 7870K on an MSI A88XM mainboard), and it seems that the trick of setting the GPU frequency manually does not work there.
Dispatch times vary from 6 to 12 µs (sync) and from 3 to 6 µs (batch).

aditya4d · 2016-02-05T18:58:55Z

Hi Harald,
Thank you for the update. The code looks good. Can you try profiling the vector copy sample? Just to make sure that the system is working fine.
Thank you!

harald-lang · 2016-02-05T19:14:17Z

Hi Aditya,

the vector_copy sample runs without errors.

kaveri: ~/git/HSA-Runtime-AMD/sample$ ./vector_copy 
Initializing the hsa runtime succeeded.
Checking finalizer 1.0 extension support succeeded.
Generating function table for finalizer succeeded.
Getting a gpu agent succeeded.
Querying the agent name succeeded.
The agent name is Spectre.
Querying the agent maximum queue size succeeded.
The maximum queue size is 131072.
Creating the queue succeeded.
Create the program succeeded.
Adding the brig module to the program succeeded.
Query the agents isa succeeded.
Finalizing the program succeeded.
Destroying the program succeeded.
Create the executable succeeded.
Loading the code object succeeded.
Freeze the executable succeeded.
Extract the symbol from the executable succeeded.
Extracting the symbol from the executable succeeded.
Extracting the kernarg segment size from the executable succeeded.
Extracting the group segment size from the executable succeeded.
Extracting the private segment from the executable succeeded.
Creating a HSA signal succeeded.
Registering argument memory for input parameter succeeded.
Registering argument memory for output parameter succeeded.
Finding a kernarg memory region succeeded.
Allocating kernel argument memory buffer succeeded.
Dispatching the kernel succeeded.
Passed validation.
Freeing kernel argument memory buffer succeeded.
Destroying the signal succeeded.
Destroying the executable succeeded.
Destroying the code object succeeded.
Destroying the queue succeeded.
Shutting down the runtime succeeded.

On the Godavari APU, the output looks exactly the same.

The output of the profiler can be found here: https://gist.github.com/harald-lang/b132a4df7863ad4523f2

... by the way... thank you very much for your help! :)

harald-lang · 2016-02-11T12:10:25Z

Hi Aditya,

I profiled the vector_copy as you suggested. Please refer to https://gist.github.com/harald-lang/b132a4df7863ad4523f2

aditya4d · 2016-02-11T13:00:01Z

Hi,
Can you remove the check call between start_kernel and end_kernel = clock(), and re-run it? We don't want to profile stdio.
https://gist.github.com/harald-lang/b132a4df7863ad4523f2
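
A minimal sketch of keeping status checks and printing out of the timed region. It assumes C++11 and swaps in `std::chrono::steady_clock` in place of the `clock()` call used in the gist; `mean_region_us` and `report` are hypothetical helper names, not functions from the vector_copy sample.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>

// Times `iterations` calls of `region` and returns the mean duration in
// microseconds. Nothing inside the timed loop prints or checks status.
template <typename Fn>
double mean_region_us(int iterations, Fn&& region) {
    assert(iterations > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        region();
    }
    const auto end = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::micro> total = end - start;
    return total.count() / iterations;
}

// Reporting happens only after the clock has stopped, so stdio cost
// is not part of the measurement.
void report(const char* label, double us) {
    std::printf("%s: %.3f us\n", label, us);
}
```

The callable passed as `region` would contain only the dispatch and the wait on the completion signal. Any status check that prints can be stored in a variable inside the region and inspected after `mean_region_us` returns.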

harald-lang · 2016-02-11T16:48:26Z

Hi Aditya,
please, don't tell anyone ;)

I updated the profile at https://gist.github.com/harald-lang/b132a4df7863ad4523f2#gistcomment-1694953

Unfortunately, the results are approx. the same.

aditya4d · 2016-02-11T16:53:39Z

Hi Harald,
Sure. I am sorry to put you through this. I wanted to make sure that different applications show the same behavior. Now it's confirmed that the issue is with either the drivers or the GPU command processor speed. I'll get back to you. Thank you for the time and effort you put into this.

aditya4d · 2016-03-01T16:42:09Z

Hi Harald,
Check this comment: https://gist.github.com/harald-lang/b132a4df7863ad4523f2#gistcomment-1694953

Also, here are the numbers we measured on Titan and APU.
For Titan, a single dispatch takes 40 µs and a batch dispatch takes 11.7 µs.
For APU, a single dispatch takes 8 µs and a batch dispatch takes 3 µs.
