High latencies #18

Open
harald-lang opened this issue Dec 14, 2015 · 24 comments

Comments

@harald-lang

I noticed very high latencies for kernel dispatches using AQL. Synchronous dispatches take up to 21 µs. Asynchronous (batch) dispatches help to hide latencies. However, kernel dispatching still takes 6 µs on average, which is still far too slow for fine-grained offloading.

In my experiments I set HSA_ENABLE_INTERRUPT to 0, which makes the kernel offload times much more consistent. With interrupts enabled, latencies vary from 6 to 15 microseconds.

System setup:

  • Kaveri APU (no dGPU)
  • Kernel 4.0.0-100002-generic #201511031149 SMP
  • kfd-v1.6.1 (7fb04c4 from git repo HSA-Drivers-Linux-AMD)
  • HSA-Runtime 1.0.3 (fa0ef7e from git repo HSA-Runtime-AMD)
  • CL offline compiler (CLOC) v0.9.8
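
The difference between the synchronous and batch numbers above comes down to where the wait happens. A minimal sketch of such a timing harness follows, assuming the real submit and wait steps are supplied by the caller. `Dispatcher`, `time_sync_us`, and `time_batch_us` are hypothetical names for illustration and are not taken from the HSA runtime or from the test code discussed in this issue.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical stand-ins: in a real test, `submit` would enqueue one AQL
// packet and `wait` would block until everything submitted so far is done.
struct Dispatcher {
    std::function<void()> submit;
    std::function<void()> wait;
};

// Synchronous pattern: wait after every single submit.
// Returns the average time per dispatch in microseconds.
double time_sync_us(const Dispatcher& d, int n) {
    assert(n > 0);
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < n; ++i) {
        d.submit();
        d.wait();
    }
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(t1 - t0).count() / n;
}

// Batch pattern: submit n packets back to back, then wait once.
// Returns the average time per dispatch in microseconds.
double time_batch_us(const Dispatcher& d, int n) {
    assert(n > 0);
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < n; ++i) {
        d.submit();
    }
    d.wait();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(t1 - t0).count() / n;
}
```

Both functions divide the total elapsed time by the number of dispatches, so the batch figure is an amortized cost and not the latency of any individual kernel.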
@harald-lang
Author

Just FYI: the same experiment with a discrete NVIDIA card connected via PCIe takes 2 microseconds.

@harald-lang
Author

Update:
Setting the iGPU's frequency from "auto" to 720 MHz reduced the dispatch time to ~3 µs.
Installing a dGPU to connect the display doesn't have an impact at all.

@aditya4d

aditya4d commented Feb 4, 2016

Hi,
Interesting observation. Are you still facing this?

@harald-lang
Author

Hi,
yes, nothing changed so far.
I wonder if something is wrong with my system configuration.
It would be very interesting to know whether other HSA developers are also facing this.

@aditya4d

aditya4d commented Feb 5, 2016

Hi,
Can you try spawning a new thread (running an empty function) and timing it? The CPU may be the bottleneck.
You can also try the ROC drivers with an AMD discrete card, with the integrated graphics disabled in the BIOS.
Thank you for profiling!! (Y)
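
One way to run the suggested thread-spawn check is sketched below. It assumes that creating and joining a `std::thread` with an empty function is an acceptable proxy for the thread-creation cost in question. `empty_task` and `avg_thread_spawn_us` are hypothetical names chosen for this example.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Empty function: the thread does no work, so the measured time is
// dominated by thread creation, scheduling, and join overhead.
static void empty_task() {}

// Average time in microseconds to spawn and join one thread.
double avg_thread_spawn_us(int iterations) {
    assert(iterations > 0);
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        std::thread t(empty_task);
        t.join();
    }
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(t1 - t0).count() / iterations;
}
```

Calling `avg_thread_spawn_us(1000)` and comparing the result with the measured dispatch times gives a rough idea of whether thread creation could account for a meaningful share of the latency.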


@gstoner
Member

gstoner commented Feb 5, 2016

The ROC driver only runs with a Haswell CPU and a Fiji-based GPU. I am having the team look into this to see if it is a regression, but it would help to know which NVIDIA GPU (we need the model number) you are comparing the APU to. Which APU is it, and what is its model number?

greg

@aditya4d

aditya4d commented Feb 5, 2016

Hi Greg,

How about disabling integrated graphics on APU?


@gstoner
Member

gstoner commented Feb 5, 2016

Are you trying to run a Fiji card on an APU? We are only testing the Fiji card with Xeon E5 v3, Xeon E3, i7, and i5 (Haswell or newer), since we need PCIe Gen 3 platform atomics with the ROC driver and runtime.

greg

@aditya4d

aditya4d commented Feb 5, 2016

Hi Harald,
Can you try rebuilding the driver? My guess is that the performance hit comes from either the new thread spawn or the ISA loader. Can you try profiling executable.cpp in the loader directory?
Thanks!


@aditya4d

aditya4d commented Feb 5, 2016

Hi Greg,
How about the new A10-7890K?

@harald-lang
Author

Hi Gregory,
for the NVIDIA experiment we used a GTX 650 connected via PCIe 2.0.
It was an entirely different system (which is out of my control). If you need more information about the system, I'll contact my colleague.

Alternatively, I can plug a GTX 970 into the APU system and re-run the measurements...

@adityaatluri I'm going to profile the system as requested. I'll post the results ASAP.

@aditya4d

aditya4d commented Feb 5, 2016

Hi Harald,
Quick question: what is your target setup? For which configuration do you want to solve this issue: APU, AMD GPU, or NVIDIA GPU?

@gstoner
Member

gstoner commented Feb 5, 2016

Can I get the test you're running? I can do an A/B test on the same hardware with Fiji vs. Titan X.


@harald-lang
Author

Hi Aditya,
my target setup is an APU system.

@aditya4d

aditya4d commented Feb 5, 2016

Hi Harald,
Can you please provide the code you are trying to profile, or code that reproduces the same behavior? A usual AQL dispatch consists of writing to queue memory and ringing the doorbell. Also, just to make sure: there are no other kernel dispatches going on in the system, right?
Thanks!
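
The "write to queue memory, then ring the doorbell" pattern can be illustrated with a simplified, single-threaded model. This is a sketch only and does not use the HSA runtime API: `ModelPacket`, `ModelQueue`, `dispatch`, and `process` are hypothetical names, the header values are model constants, and `process` merely stands in for the GPU's packet processor.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Model constants, not the real AQL header encoding.
enum : uint16_t { kInvalid = 0, kDispatch = 1 };

struct ModelPacket {
    uint16_t header = kInvalid;  // written last, marks the slot as ready
    uint32_t payload = 0;        // stands in for the kernel dispatch fields
};

struct ModelQueue {
    std::vector<ModelPacket> ring;
    uint64_t write_index = 0;           // next slot the producer will claim
    uint64_t read_index = 0;            // next slot the consumer will process
    std::atomic<int64_t> doorbell{-1};  // last packet index announced
    explicit ModelQueue(std::size_t size) : ring(size) {
        assert(size != 0 && (size & (size - 1)) == 0);  // power of two
    }
};

// Producer: claim a slot, fill the body, publish the header, ring the doorbell.
// Returns false when the ring is full.
bool dispatch(ModelQueue& q, uint32_t payload) {
    if (q.write_index - q.read_index >= q.ring.size()) return false;
    uint64_t index = q.write_index++;
    ModelPacket& p = q.ring[index & (q.ring.size() - 1)];
    p.payload = payload;
    p.header = kDispatch;  // real hardware needs an atomic release store here
    q.doorbell.store(static_cast<int64_t>(index), std::memory_order_release);
    return true;
}

// Consumer: process every packet up to the last index written to the doorbell.
// Returns the number of packets processed.
std::size_t process(ModelQueue& q, std::vector<uint32_t>& out) {
    int64_t last = q.doorbell.load(std::memory_order_acquire);
    std::size_t n = 0;
    while (static_cast<int64_t>(q.read_index) <= last) {
        ModelPacket& p = q.ring[q.read_index & (q.ring.size() - 1)];
        if (p.header != kDispatch) break;
        out.push_back(p.payload);
        p.header = kInvalid;  // hand the slot back to the producer
        ++q.read_index;
        ++n;
    }
    return n;
}
```

In this model the producer side is only a few memory writes, which is why the CPU part of a dispatch is expected to be cheap and the observed latency is more likely to come from elsewhere.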

@harald-lang
Author

Hi Aditya and Gregory,

I pushed the code to https://github.com/harald-lang/hsa-lab.

Quickstart instructions:

  • adjust paths to the HSA runtime in env.sh
  • source the env file: . ./env.sh
  • make sure that cloc is installed
  • build the project: make bin/tester
  • run performance tests: bin/tester --gtest_filter=*Perf*

The output is a little verbose. The important lines start with milliseconds/dispatch = ???.

The dispatch functions can be found in src/rts/hsa/HsaContext.hpp.

@harald-lang
Author

Update:
I ran the tests on a different machine (Godavari APU 7870K on an MSI A88XM mainboard), and it seems that the trick of setting the GPU frequency manually does not work here.
Dispatch times vary from 6 to 12 µs (sync) and from 3 to 6 µs (batch).

@aditya4d
Copy link

aditya4d commented Feb 5, 2016

Hi Harald,
Thank you for the update. The code looks good. Can you try profiling the vector copy sample? Just to make sure that the system is working fine.
Thank you!

@harald-lang
Author

Hi Aditya,

the vector_copy sample runs without errors.

kaveri: ~/git/HSA-Runtime-AMD/sample$ ./vector_copy 
Initializing the hsa runtime succeeded.
Checking finalizer 1.0 extension support succeeded.
Generating function table for finalizer succeeded.
Getting a gpu agent succeeded.
Querying the agent name succeeded.
The agent name is Spectre.
Querying the agent maximum queue size succeeded.
The maximum queue size is 131072.
Creating the queue succeeded.
Create the program succeeded.
Adding the brig module to the program succeeded.
Query the agents isa succeeded.
Finalizing the program succeeded.
Destroying the program succeeded.
Create the executable succeeded.
Loading the code object succeeded.
Freeze the executable succeeded.
Extract the symbol from the executable succeeded.
Extracting the symbol from the executable succeeded.
Extracting the kernarg segment size from the executable succeeded.
Extracting the group segment size from the executable succeeded.
Extracting the private segment from the executable succeeded.
Creating a HSA signal succeeded.
Registering argument memory for input parameter succeeded.
Registering argument memory for output parameter succeeded.
Finding a kernarg memory region succeeded.
Allocating kernel argument memory buffer succeeded.
Dispatching the kernel succeeded.
Passed validation.
Freeing kernel argument memory buffer succeeded.
Destroying the signal succeeded.
Destroying the executable succeeded.
Destroying the code object succeeded.
Destroying the queue succeeded.
Shutting down the runtime succeeded.

On the Godavari APU, the output looks exactly the same.

The output of the profiler can be found here: https://gist.github.com/harald-lang/b132a4df7863ad4523f2

... by the way... thank you very much for your help! :)

@harald-lang
Author

Hi Aditya,

I profiled the vector_copy as you suggested. Please refer to https://gist.github.com/harald-lang/b132a4df7863ad4523f2

@aditya4d

Hi,
Can you remove the check call between start_kernel and end_kernel = clock(), and re-run it? We don't want to profile stdio.
https://gist.github.com/harald-lang/b132a4df7863ad4523f2
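
A minimal sketch of a timed region that keeps I/O and checks outside the measurement is shown below. It also records both wall-clock time and `std::clock()`, because `std::clock()` reports processor time consumed by the process and may therefore under-report a wait that blocks instead of polling. `Timing` and `time_region` are hypothetical names and are not part of the vector_copy sample.

```cpp
#include <cassert>
#include <chrono>
#include <ctime>
#include <thread>

struct Timing {
    double wall_ms;  // elapsed real time
    double cpu_ms;   // processor time used by this process
};

// Times `region` with a monotonic wall clock and with std::clock().
// Nothing is printed or checked inside the timed span; do that afterwards.
template <typename F>
Timing time_region(F&& region) {
    auto wall0 = std::chrono::steady_clock::now();
    std::clock_t cpu0 = std::clock();
    region();
    std::clock_t cpu1 = std::clock();
    auto wall1 = std::chrono::steady_clock::now();
    Timing t;
    t.wall_ms = std::chrono::duration<double, std::milli>(wall1 - wall0).count();
    t.cpu_ms = 1000.0 * static_cast<double>(cpu1 - cpu0) / CLOCKS_PER_SEC;
    return t;
}
```

If the two values diverge for the dispatch-and-wait region, the wall-clock figure is the one that reflects the latency an application would actually observe.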

@harald-lang
Author

Hi Aditya,
please, don't tell anyone ;)

I updated the profile at https://gist.github.com/harald-lang/b132a4df7863ad4523f2#gistcomment-1694953

Unfortunately, the results are approx. the same.

@aditya4d

Hi Harald,
Sure. I am sorry to put you through this. I wanted to make sure that different applications show the same behavior. Now it's confirmed that the issue is with either the drivers or the GPU command processor speed. I'll get back to you. Thank you for the time and effort you put into this.

@aditya4d

aditya4d commented Mar 1, 2016

Hi Harald,
Check this comment: https://gist.github.com/harald-lang/b132a4df7863ad4523f2#gistcomment-1694953

Also, here are the numbers we measured on Titan and APU.
For Titan, a single dispatch takes 40 µs and a batch dispatch takes 11.7 µs.
For APU, a single dispatch takes 8 µs and a batch dispatch takes 3 µs.
