llamafile_sgemm API - INT8 implementation #10912
Conversation
(force-pushed from 85c5280 to d70f5fc)
Hi @ggerganov,
We will need to merge #10714 first, since there may be some conflicts.
Sure. Please let me know once that PR is merged. I will fix any conflicts in mine and resubmit my PR.
I'll try to submit it today. 🤞
This change upstreams llamafile's cpu matrix multiplication kernels for ppc64le using MMA builtins for the quantised int8 datatype. It results in a 10%-70% improvement in total speed (i.e. all tokens/total time) across various batch sizes. The patch is tested with the Meta-Llama-3-8B, Mistral-7B, and Llama-2-7B-chat-hf models on an IBM POWER10 machine. Signed-off-by: Amrita H S <[email protected]>
(force-pushed from d70f5fc to a8d3700)
I have made the changes suggested by @Djip007 and pushed. @slaren / @Djip007 / @ggerganov Please review the changes.
These are quick personal comments; wait for @slaren / @ggerganov before making changes.
And I read it very quickly.
#define COMPUTE(ACC, c_idx, s_idx) \
    __builtin_mma_disassemble_acc(vec_C, ACC); \
    for (int i = 0; i < 4; i++) { \
        CA[i] = vec_splats((float)(((double)comparray[c_idx+i]) * -128.0)); \
        res[i] = vec_add(vec_ctf(vec_C[i], 0), CA[i]); \
        fin_res[s_idx+i] = vec_madd(res[i], vs[s_idx+i], fin_res[s_idx+i]); \
    }

#define SAVE_RES(ii, jj, RM, RN, idx) \
    for (int I = 0; I < RM; I++) { \
        for (int J = 0; J < RN; J++) { \
            *((float*)(C+ii+((jj+J)*ldc)+I)) = *((float*)&fin_res[idx+I]+J); \
        } \
    }

#define PERMUTE_VEC(IR1, IR2, IR3, IR4, vecOffset) \
    t1 = vec_perm(IR1, IR2, swiz1); \
    t2 = vec_perm(IR1, IR2, swiz2); \
    t3 = vec_perm(IR3, IR4, swiz1); \
    t4 = vec_perm(IR3, IR4, swiz2); \
    t5 = vec_perm(t1, t3, swiz3); \
    t6 = vec_perm(t1, t3, swiz4); \
    t7 = vec_perm(t2, t4, swiz3); \
    t8 = vec_perm(t2, t4, swiz4); \
    if (flip == true) { \
        t5 = vec_xor(t5, xor_vector); \
        t6 = vec_xor(t6, xor_vector); \
        t7 = vec_xor(t7, xor_vector); \
        t8 = vec_xor(t8, xor_vector); \
    } \
    vec_xst(t5, 0, vecOffset); \
    vec_xst(t6, 0, vecOffset+16); \
    vec_xst(t7, 0, vecOffset+32); \
    vec_xst(t8, 0, vecOffset+48);
In C++, I prefer not to use C macros, but rather simple templates.
@Djip007
I used macros just to avoid some repetition. I can replace them with functions or templates.
From a performance viewpoint, and considering the nature of the above macros, would functions be better than templates?
for example:
#define SAVE_RES(ii, jj, RM, RN, idx) \
for (int I = 0; I < RM; I++) { \
for (int J = 0; J < RN; J++) { \
*((float*)(C+ii+((jj+J)*ldc)+I)) = *((float*)&fin_res[idx+I]+J); \
} \
} \
I would use a template for RM/RN; it makes it more likely that the compiler will unroll the 2 loops for small values.
Templates may sometimes also be needed for various TYPEs.
But yes, no need for a template in most cases.
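For illustration, here is a minimal sketch of SAVE_RES as a function template (save_res is a hypothetical name; C and ldc are passed as parameters here only for self-containment, whereas in the PR they would presumably be tinyBLAS_Q0_PPC members):

#include <altivec.h>
#include <cstdint>

template <int RM, int RN>
static inline void save_res(float* C, int64_t ldc, int64_t ii, int64_t jj,
                            int idx, __vector float* fin_res) {
    // Same stores as the SAVE_RES macro; with RM/RN as compile-time
    // constants the compiler can fully unroll both loops.
    for (int I = 0; I < RM; I++) {
        for (int J = 0; J < RN; J++) {
            *((float*)(C + ii + ((jj + J) * ldc) + I)) = *((float*)&fin_res[idx + I] + J);
        }
    }
}

A call site would then read save_res<4,8>(C, ldc, ii, jj, idx, fin_res) instead of SAVE_RES(ii, jj, 4, 8, idx).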
        free(comparray);
    }

    void gemm_small(int64_t m0, int64_t m, int64_t n0, int64_t n, int RM, int RN) {
As for gemm, I would have made it a template:
template<int RM, int RN>
void gemm_small(int64_t m0, int64_t m, int64_t n0, int64_t n)
That way you can remove the malloc/free.
@Djip007
I intend to use
std::array<int, 4> comparray;
inside gemm_small as well.
So is it necessary to make gemm_small a template? I am just wondering whether making gemm_small a template function would have a minor performance impact.
Please suggest.
With RM/RN as template parameters, the for loops are simpler for the compiler to unroll or vectorise.
The only drawback is that the compiler will have to create as many functions as there are template parameter combinations, so the code will be a bit bigger.
Sometimes I use https://godbolt.org/ to see what compilers can do.
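Putting the two suggestions together, a minimal hedged sketch of the templated gemm_small (the kernel body is elided; sizing comparray by RM is an assumption matching the original malloc(RM * sizeof(int))):

#include <array>
#include <cstdint>

template <int RM, int RN>
void gemm_small(int64_t m0, int64_t m, int64_t n0, int64_t n) {
    // Stack array replaces the per-call malloc/free pair; with RM known at
    // compile time the tile loops also become trivially unrollable.
    std::array<int, RM> comparray{};
    // ... existing kernel body, indexing comparray[0..RM-1] ...
    (void)m0; (void)m; (void)n0; (void)n; (void)comparray;  // body elided in this sketch
}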
        }
    }
    __builtin_mma_disassemble_acc(vec_C, &acc_0);
    TA *aoffset = const_cast<TA*>(A+(ii*lda)+l);
is this const_cast needed?
@Djip007
When I try to cast it normally like this:
TA *aoffset = (TA*)(A+(ii*lda)+l);
I get this warning:
warning: cast from type ‘const block_q8_0*’ to type ‘block_q8_0*’ casts away qualifiers [-Wcast-qual]
Is there any other way to avoid this warning?
I was thinking more about:
const TA* aoffset = A+(ii*lda)+l;
i.e. no need to remove the const (and removing it is not a good idea for compiler optimisation). Or simply:
auto aoffset = A + ii*lda + l;
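A small illustration of the const-preserving form (row_offset is a hypothetical helper, not code from the PR):

#include <cstdint>

// Keeping the const qualifier means no cast at all, so -Wcast-qual has
// nothing to warn about.
template <typename TA>
static const TA* row_offset(const TA* A, int64_t lda, int64_t ii, int64_t l) {
    const TA* aoffset = A + (ii * lda) + l;  // const in, const out
    return aoffset;  // `auto aoffset = A + ii*lda + l;` would deduce const TA* too
}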
            mc = 8;
            nc = 8;
            gemm<8,8>(m0, m, n0, n);
        } else if (m_rem >= 8 && n_rem >= 8) {
I am not sure the first 2 ifs are needed.
Or are you missing gemm<16,8> / gemm<8,16>?
@Djip007
Yes, I had gemm<16,8> and gemm<8,16> earlier, but those had some performance problems, so I switched to gemm<8,8>.
I plan to work on solving the performance issues and then use the bigger kernels.
Is it ok if I retain this code as a placeholder for future use of gemm<16,8>/gemm<8,16>?
If not, I can remove the first 2 ifs now and add them back later if I solve the performance issues.
Please suggest.
If you want to keep it for now, you can also leave it commented out and add a note to say why, as in the sketch below.
Otherwise, when someone reads the code, there is always a doubt about whether it is an error ;)
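For instance, a hypothetical sketch of such a documented placeholder (the m_rem/n_rem thresholds are illustrative):

// NOTE: gemm<16,8> / gemm<8,16> showed performance problems compared to
// gemm<8,8> on POWER10; kept commented out until that is resolved.
// } else if (m_rem >= 16 && n_rem >= 8) {
//     gemm<16,8>(m0, m, n0, n);
// } else if (m_rem >= 8 && n_rem >= 16) {
//     gemm<8,16>(m0, m, n0, n);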
        int64_t ii = m0 + job / xtiles * RM;
        int64_t jj = n0 + job % xtiles * RN;
        //int *comparray = (int *)malloc(RM * sizeof(int), 4096);
        int *comparray = (int *)malloc(RM * sizeof(int));
Yes, if
template<int RM, int RN>
void gemm_small(int64_t m0, int64_t m, int64_t n0, int64_t n)
then you can use
std::array<int, RM> comparray;
        if (RM == 4 && RN == 8) {
            kernel = &tinyBLAS_Q0_PPC::KERNEL_4x8;
        } else if (RM == 8 && RN == 4) {
            kernel = &tinyBLAS_Q0_PPC::KERNEL_8x4;
        } else if (RM == 8 && RN == 8) {
            kernel = &tinyBLAS_Q0_PPC::KERNEL_8x8;
        }
In this case, I would have tried to make KERNEL_N_M a template; if that is not possible/simple, I would have done something like this, now that we can use C++17:
template<int RM, int RN>
inline void kernel(int64_t ii, int64_t jj) {
    if constexpr(RM == 4 && RN == 8) {
        KERNEL_4x8(ii,jj);
    } else if constexpr(RM == 8 && RN == 4) {
        KERNEL_8x4(ii,jj);
    } else if constexpr(RM == 8 && RN == 8) {
        KERNEL_8x8(ii,jj);
    } else {
        // condition depends on RM/RN so the assert only fires for
        // unsupported instantiations (static_assert(false) would be
        // ill-formed here before C++23)
        static_assert(RM == -1 && RN == -1, "RN/RM values not supported");
    }
}
and use:
kernel<RM,RN>(ii, jj);
This removes the need for the method pointer, and it does not compile if used with unsupported RM/RN.
    }

private:
    void (tinyBLAS_Q0_PPC::*kernel)(int64_t, int64_t);
I did not see it last time: this is dangerous, it is not thread safe.
@amritahs-ibm