add FP8 support to gguf/llama: #10055
base: master
Conversation
Well... I have many things to correct!
OK, now the build is in good shape.
The native FP8 types are interesting since AFAIK there is hardware support for these. What are the advantages of the FP8_Q types, for example compared to the existing Q8_0 type?
Makefile
Outdated
ggml/src/ggml-fp8.cpp \
ggml/src/ggml-fp8.h \
ggml/src/ggml-common.h
	$(CXX) $(CXXFLAGS) -std=c++17 -c $< -o $@
Should target -std=c++11
I use "static constexpr" in this file that is not allowed with c++11 (and may be more).
And because it is the default with gcc, I develop/test the FP8 template with it.
I see you have it use for GGML_SYCL build to.
So I can try to rewrite it with c++11 before merge request, but is it really needed? Do you have target that do not support it?
Yes, this is not a good reason to change the project standard.
You already have it for the GGML_SYCL build; is there a reason to keep that old standard, which was only the first draft of modern C++?
I'll have a try at rewriting it with only C++11 and see how it looks.
Backends are for the most part self-contained and don't affect the rest of the project. What you are trying to merge here would be part of the core ggml and would require changing the standard for the entire project. I would also prefer to use C++17, but that is a discussion for another time.
ggml/src/ggml-common.h
Outdated
// - fp8 simple type
typedef struct { uint8_t bits; } ggml_e5m2_t;
typedef struct { uint8_t bits; } ggml_e4m3_t;
typedef struct { uint8_t bits; } ggml_e3m4_t;
Seems unused.
Yes, my bad. I will remove it in the next commit.
ggml/src/ggml-fp8.cpp
Outdated
template<int N> constexpr float EXP2() {
    if constexpr (N==0) return 1;
    if constexpr (N>0) return EXP2<N-1>()*2;
    if constexpr (N<0) return EXP2<N+1>()/2;
}
C++11:
- template<int N> constexpr float EXP2() {
-     if constexpr (N==0) return 1;
-     if constexpr (N>0) return EXP2<N-1>()*2;
-     if constexpr (N<0) return EXP2<N+1>()/2;
- }
+ constexpr float exp2(int n) {
+     return n < 0 ? 1.0f / (1 << -n) : 1 << n;
+ }
That does not work for n > 32 or 64... we need it up to 127 ;)
constexpr float exp2(int n) {
return n == 0 ? 1 : n > 0 ? 2 * exp2(n - 1) : 0.5 * exp2(n + 1);
}
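A quick compile-time check of this recursive form (a sketch, C++11 only; the function is renamed here so it does not clash with the standard library exp2):

constexpr float exp2f_rec(int n) {
    // C++11 allows recursion in a single-return constexpr function, so the
    // exponent can go up to +/-127 without the integer-shift overflow problem.
    return n == 0 ? 1.0f
         : n >  0 ? 2.0f * exp2f_rec(n - 1)
         :          0.5f * exp2f_rec(n + 1);
}

static_assert(exp2f_rec(10)  == 1024.0f, "2^10");
static_assert(exp2f_rec(-3)  == 0.125f,  "2^-3");
static_assert(exp2f_rec(127) >  1e38f,   "2^127 is still a finite float");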
ggml/src/ggml-fp8.cpp
Outdated
using type = FP8<_E>;
static constexpr int E=_E;
static constexpr int M=7-_E;
static constexpr int E_BIAS=EXP2<_E-1>()-1;
The part that is not supported by C++11 is this "static constexpr".
static const int would do the same here.
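For illustration, a sketch of how those members could be written with C++11 only (the names approximate the snippet above; this is not the PR code):

template<int E_> struct fp8_traits {
    // In-class initialization of integral static const members is valid C++11,
    // so "static constexpr" is not strictly required for these.
    static const int E      = E_;
    static const int M      = 7 - E_;
    static const int E_BIAS = (1 << (E_ - 1)) - 1;  // 15 for E5M2, 7 for E4M3, 3 for E3M4
};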
Some possible advantages:
Some advantages over FP8 are:
Note: In fact I get good accuracy with blocks of 512 or 1024, but I prefer to keep the same block size that is used for the other Qn_K quants.
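For readers following the block-size discussion, a rough sketch of what such a Q8_0-style layout with 256-element groups could look like (field and type names here are assumptions, not the actual structs from this PR):

#include <stdint.h>

#define QK_FP8 256

typedef struct {
    uint16_t d;           // per-block scale (fp16), as in block_q8_0
    uint8_t  qs[QK_FP8];  // 256 FP8-encoded weights (E4M3_Q or E3M4_Q)
} block_fp8_q;

// 258 bytes per 256 weights is about 8.06 bits/weight, vs 8.5 bits/weight for Q8_0 (32-wide blocks).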
OK, I had a hard time with the macOS compiler... thanks @slaren @ggerganov
Got it. So the native
Rebased on latest master, with a small change to the fp32->fp8 rounding (added round-to-nearest-even).
Only the conversion code should be in ggml base, the vec dot code needs to be built in the CPU backend since that's the only code optimized for the CPU arch.
I think this could be useful to support base models released in FP8, and possibly for faster inference on Ada GPUs that support FP8, but I am not convinced that it is worth adding the new quant types based on FP8. From what I can tell, they perform worse than Q8_0, so there is no clear use case for them.
I am also not sure that we should add the OpenMP SIMD dependency. I don't really understand how it is supposed to work. Does the compiler really produce better code when it is enabled?
LOG("\n"); | ||
LOG("chunk PPL ln(PPL(Q)/PPL(base)) KL Divergence Δp RMS Same top p\n"); |
Is this change necessary?
Not for FP8, but in your case the header is written at every step, so it adds many lines to the output.
But it may be better to do this in another PR.
ggml/src/ggml-cpu.c
Outdated
@@ -426,6 +427,27 @@ static const struct ggml_type_traits_cpu type_traits_cpu[GGML_TYPE_COUNT] = {
        .vec_dot_type = GGML_TYPE_Q8_K,
        .nrows = 1,
    },
    // les FP8...
Suggested change:
- // les FP8...
@@ -26,7 +26,7 @@ function has_cmd {
 }

 if has_cmd wget; then
-    cmd="wget -q --show-progress -c -O %s/%s %s"
+    cmd="wget -q -c -O %s/%s %s"
Is it necessary to change this?
ggml/src/CMakeLists.txt
Outdated
check_cxx_compiler_flag("-fopenmp-simd" SUPPORTS_OPENMP_SIMD)
if (SUPPORTS_OPENMP_SIMD)
    # OpenMP_RUNTIME_MSVC=experimental / if (MSVC)
    message(STATUS "Using openmp_simd.")
Suggested change:
- message(STATUS "Using openmp_simd.")
+ message(STATUS "Using OPENMP_SIMD.")
So yes, it helps. But it is true that we can do much better than that with more optimized intrinsic code. So, as you like.
We can keep it until it is replaced with intrinsics then. It should still be limited to only the CPU backend, though.
Good catch, I did it before the CPU code was moved into a backend.
I will try to move this SIMD into the CPU backend too.
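To illustrate what the flag does (a sketch under assumptions, not the PR's actual code): -fopenmp-simd only honours "#pragma omp simd" hints and does not link the OpenMP runtime, and a per-row FP8 conversion loop is the kind of place where the hint would be applied.

#include <math.h>
#include <stdint.h>

// Hypothetical scalar decode for OCP E4M3 (bias 7, 3 mantissa bits; NaN handling omitted).
static inline float fp8_e4m3_to_f32(uint8_t b) {
    const float sign = (b & 0x80) ? -1.0f : 1.0f;
    const int   e    = (b >> 3) & 0x0F;
    const int   m    =  b       & 0x07;
    if (e == 0) {
        return sign * ldexpf((float) m / 8.0f, -6);        // subnormal
    }
    return sign * ldexpf(1.0f + (float) m / 8.0f, e - 7);  // normal
}

void fp8_e4m3_row_to_f32(const uint8_t * src, float * dst, int n) {
    #pragma omp simd
    for (int i = 0; i < n; ++i) {
        dst[i] = fp8_e4m3_to_f32(src[i]);
    }
}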
- correct local CI. - correct perplexity log
E5M2 & E4M3: for use with FP8-distributed models. E4M3_Q & E3M4_Q: for gguf-quantized models. The E5M2 and E4M3 types are used like native FP16 / BF16. E4M3_Q and E3M4_Q are defined like Q8_0, with a block size of 256 (like QK_K).
Sorry, I misunderstood what you meant before. Now I get it; apologies for my nonsense.
OK, rebased on a later master, and it is now possible to make use of C++17. I refactored the FP8 code so the dot product is now in the CPU backend. There are a few copies for now, but they may be removed later with optimized code.
At least the tg speed is in line with Q8 on my CPU. I hope I managed to take all/most comments and reviews into account.
As far as I can tell, this conversion is not implemented, so it is not actually possible to use FP8 models.
I have taken a look at the available FP8 models, and they use a single F32 scale. So as it is, this implementation cannot be used with FP8 models, and the FP8 Q types perform worse than Q8_0. And I don't think that having a different group size than Q8_0 is a good reason to add them either. I am not sure that there are any practical applications for this code as it is at the moment. I still think that it would be good to have support for FP8 models, but it needs more work to achieve that.
For this case we don't need to add it to the quantization schemes; we can add it at model load time ("just" add a mul op with the scale after the matmul, like with the bias)? See the sketch below.
FP8_Q is perfectly usable and gives good results.
(Yes, as you see, NVIDIA/INTEL/AMD were wrong not to add E3M4 in hardware.) A high-speed FP8 CPU kernel is WIP, but it needs PR #10446 for a clean/simple implementation. Do you want to wait for this kernel before adding FP8?
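A minimal sketch of that model-load idea, assuming one F32 scale per weight tensor as described above (the helper and variable names are hypothetical, not PR code):

#include "ggml.h"

// Fold the per-tensor scale into the graph right after the matmul, the same
// way a bias term is applied, instead of adding a new quantization scheme.
static struct ggml_tensor * fp8_mul_mat_scaled(
        struct ggml_context * ctx,
        struct ggml_tensor  * w_fp8,        // FP8 weights, no per-group scales
        struct ggml_tensor  * x,            // activations
        float                 weight_scale) // single F32 scale loaded with the tensor
{
    struct ggml_tensor * cur = ggml_mul_mat(ctx, w_fp8, x);
    cur = ggml_scale(ctx, cur, weight_scale);  // the "mul op ... like with bias"
    return cur;
}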
It would be ok for a proof of concept, but it would effectively require adding a special case to every matrix multiplication to support FP8. The best way would be to add support to ggml for quantization types that use one group/scale per row.
That's not good enough; it's still worse quality than Q8_0, and much slower. As it is, I don't see any motivation to use these types. They need to offer a meaningful improvement in some aspect to justify the maintenance cost of adding all this code, especially since it would be adding new file types. I suppose one use for the FP8_Q types would be to use them with FP8 models by using the same scale for every group in the row, until we are able to support quantization types with one group per row in ggml. That would still require adding support to perform this conversion in
I agree with @slaren's evaluation. We have to see more gains from these data types before we merge them.
I see your point. I have a kernel that uses BF16 for compute, with this speed:
But I need some time to clean it up for a PR and port it to the "cpu-extra-buffer". Do you think this is enough if I can get that? Note: I know how to make it faster, but I need more tests/time for that.
Any update on this? DeepSeek-V3 uses FP8 as its training format, so this might be useful.
With the advent of DeepSeek R1 also being FP8, this is looking even more important.
After making many tests with FP8, this is the first part of adding FP8 to llama.cpp.
I made some choices:
I implement support for 4 FP8 formats:
E5M2 & E4M3: for use with FP8-distributed models.
These are not for creating quantized models from FP16/BF16 models, but for FP8-distributed models like Llama-3.1-405B-FP8, so I only add the elements needed to run them.
E4M3_Q & E3M4_Q: for gguf-quantized models.
With these we can quantize models; I made tests with Mistral-Nemo / Mistral-7B and Llama-3.1-8B.
The weights use a simple quantization process like Q8_0, but with a block size of 256 (Q8_0 uses blocks of 32).
For now everything is in place for the CPU backend as a reference for any arch, so it only uses pure C++. I use openmp-simd for a little speed-up.
I add 2 files, gguf-fp8.h and gguf-fp8.cpp, to lower the changes to existing sources and keep the merge simple.
For now I updated the Makefile build (I need to figure out what to do with CMake).
I prefer to use C++ templates for FP8 so I can cover the 4 formats with little code; see the sketch below.
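A rough sketch of that template approach, assuming the member layout from the snippets quoted earlier in the thread (illustrative only, not the PR code):

#include <cstdint>

template<int E>
struct FP8 {
    uint8_t bits;                                    // one encoded FP8 value
    static constexpr int M    = 7 - E;               // mantissa bits
    static constexpr int BIAS = (1 << (E - 1)) - 1;  // exponent bias
};

using fp8_e5m2 = FP8<5>;  // wide range, used like native FP16/BF16 for FP8-distributed models
using fp8_e4m3 = FP8<4>;
using fp8_e3m4 = FP8<3>;  // more mantissa; the _Q block-quantized variants build on these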
So for now any comment is welcome.
The next step is to make a true merge request, before adding faster compute.
For those curious, I have a high-speed version that I use for simple tests here for Zen 4. And we have more discussion here.