Performance benchmarks #70
@vince- We have not run any benchmarks on aarch64 so far. Our focus has been Cortex-M; we are slowly adding Neon support, starting with aarch32. So some functions may unfortunately not show the expected performance improvement, because we haven't tested them yet. Your compilation options look right. Is your biquad long (several stages)? What blockSize value are you using? I suspect that the filter implementation adds too much overhead (compared to a C version) for small values of blockSize or a low number of stages, and that it becomes efficient only for larger values.
Hi Vince, please also note that DF1 is more "vector friendly" than DF2. Regards, Laurent.
@christophe0606 Thanks for the reply. My current test biquad has 4 stages and I'm running a 32-sample block size. It makes sense that the overhead might counter any potential gains for a small filter like this. @llefaucheur I did try the DF1; it appears to be worse than the DF2 on A57 as well.
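For readers following along, here is a minimal sketch of the kind of plain C baseline that a comparison like this is usually made against. It is not the code used in this thread. `ref_biquad_df2t_f32` is a hypothetical name, and the coefficient layout (`{b0, b1, b2, a1, a2}` per stage, with the feedback terms added) is assumed to follow the CMSIS-DSP convention.

```c
#include <stddef.h>

/* Plain C transposed direct form II (DF2T) biquad cascade.
 * Hypothetical reference helper, not a CMSIS-DSP function.
 *   coeffs: 5 floats per stage {b0, b1, b2, a1, a2}
 *   state : 2 floats per stage {d1, d2}, zeroed before the first call
 * Per sample:  y  = b0*x + d1
 *              d1 = b1*x + a1*y + d2
 *              d2 = b2*x + a2*y
 */
static void ref_biquad_df2t_f32(const float *coeffs, float *state,
                                size_t num_stages,
                                const float *src, float *dst,
                                size_t block_size)
{
    for (size_t s = 0; s < num_stages; ++s) {
        const float b0 = coeffs[5 * s + 0];
        const float b1 = coeffs[5 * s + 1];
        const float b2 = coeffs[5 * s + 2];
        const float a1 = coeffs[5 * s + 3];
        const float a2 = coeffs[5 * s + 4];
        float d1 = state[2 * s + 0];
        float d2 = state[2 * s + 1];

        /* The first stage reads src; later stages filter dst in place. */
        const float *in = (s == 0) ? src : dst;

        for (size_t n = 0; n < block_size; ++n) {
            const float x = in[n];
            const float y = b0 * x + d1;
            d1 = b1 * x + a1 * y + d2;
            d2 = b2 * x + a2 * y;
            dst[n] = y;
        }

        /* Keep the delay line so the next block continues seamlessly. */
        state[2 * s + 0] = d1;
        state[2 * s + 1] = d2;
    }
}
```

With 4 stages and 32 samples per call, the inner loop body runs only 128 times in total, so any fixed setup cost in a vectorized routine has little work to be amortized over.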
@vince- I tagged this issue as an enhancement so that we look at it when we have the bandwidth to start aarch64 benchmarking (not soon, unfortunately).
@vince- we have another repo for experiments with DF1 for NEON, with approximately Cycles(size, casc) = 4.125 × casc × size + 75 × casc (CA55). At some point this code will be merged into the mainline.
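To make the cycle model above concrete, here is a small sketch that evaluates it. It assumes `casc` is the number of cascaded stages and `size` is the block size, which is an interpretation rather than something the comment spells out. The helper names are hypothetical and not part of CMSIS-DSP.

```c
#include <stdint.h>

/* Estimated cycles for the experimental NEON DF1 biquad on a Cortex-A55,
 * using the model quoted in the comment above:
 *   cycles = 4.125 * casc * size + 75 * casc
 * Hypothetical helper, not a CMSIS-DSP function. */
static double biquad_df1_neon_cycles(uint32_t size, uint32_t casc)
{
    const double per_sample_per_stage = 4.125; /* cost that scales with size */
    const double per_stage_overhead   = 75.0;  /* fixed cost per stage, per call */
    return per_sample_per_stage * casc * size + per_stage_overhead * casc;
}

/* Share of the estimate taken by the fixed per-stage term (0.0 to 1.0). */
static double biquad_df1_neon_overhead_fraction(uint32_t size, uint32_t casc)
{
    const double total = biquad_df1_neon_cycles(size, casc);
    return total > 0.0 ? (75.0 * casc) / total : 0.0;
}
```

For the 4-stage, 32-sample case discussed in this thread, the model gives 528 + 300 = 828 cycles, so roughly 36% of the estimate is fixed overhead. At 256 samples per block the same filter comes to 4524 cycles, with the fixed part below 7%.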
Merci @llefaucheur! The DF1 for NEON indeed shows good gains. I appreciate the help and responsiveness from both of you on this issue.
Hello,
I'm wondering if any benchmarking has been done for functions in the library? I'm trying to run the NEON-enabled version of arm_biquad_cascade_df2T_f32 on a Cortex-A57 and I'm seeing performance that is roughly 80% slower than my standard optimized C code.
I've gone over the documentation and setup articles a bunch of times. I'm using GCC 7.3 and my Makefile sets the following options specifically for CMSIS-DSP:
Looking at the generated assembly, it appears that the compiler is indeed generating vectorized code, but this does not yield faster execution. Is it possible that the NEON routines are just comparatively slower on aarch64?
I have also tried the same test code compiled for A57 using GCC 8.3, and again natively on an Apple M1 machine using GCC 11 and Clang-14. The CMSIS-DSP cascaded biquad is always measurably slower.
Thanks!