It would be interesting to see the effect of each optimisation in isolation.
Some of the ideas may be easy to adopt in other inference engines while others are much harder to implement, so a per-feature breakdown of the gains would be very helpful.
Yes, we will present more results later. Actually, our vision for KTransformers is to serve as an experimental platform used to develop prototypes faster. The more mature ones can then be adopted into more popular inference engines like llama.cpp and vLLM.
I've linked this in the llama.cpp discussions: ggerganov/llama.cpp#8721
The "Arithmetic Intensity Guided Offloading" in particular is something they could likely adopt quite easily, and it should give a significant boost to MoE models.
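To make the idea concrete, here is a minimal sketch of arithmetic-intensity-guided placement: rank each operator by its arithmetic intensity (FLOPs per byte of memory traffic) and place the highest-intensity operators on the GPU until its memory budget runs out. All names and numbers below are illustrative, not taken from the KTransformers or llama.cpp code bases; the intuition is that dense attention reuses its weights heavily (high intensity) while decode-time MoE experts touch large weights for only a few tokens (low intensity), so the experts are the natural candidates to keep on CPU.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte of memory traffic."""
    return flops / bytes_moved

def plan_offload(ops, gpu_mem_budget: float):
    """Greedy placement by arithmetic intensity.

    ops: list of (name, flops, bytes_moved, weight_bytes) tuples.
    Returns (gpu_ops, cpu_ops) as lists of operator names.
    """
    ranked = sorted(ops, key=lambda op: arithmetic_intensity(op[1], op[2]),
                    reverse=True)
    gpu, cpu, used = [], [], 0.0
    for name, flops, bytes_moved, weight_bytes in ranked:
        if used + weight_bytes <= gpu_mem_budget:
            gpu.append(name)          # high intensity: worth GPU memory
            used += weight_bytes
        else:
            cpu.append(name)          # low intensity (or no room): stay on CPU
    return gpu, cpu

# Illustrative numbers for a single MoE transformer layer during decode.
ops = [
    ("attention",   2.0e12, 1.0e9,  4e9),   # ~2000 FLOPs/byte
    ("moe_experts", 4.0e11, 8.0e9, 20e9),   # ~50 FLOPs/byte
]
gpu, cpu = plan_offload(ops, gpu_mem_budget=8e9)
print(gpu, cpu)  # attention fits on the GPU; the experts stay on CPU
```

A real implementation would measure FLOPs and traffic per operator (and per batch size, since intensity grows with batch for weight-reusing ops), but the greedy ranking above captures the core of the heuristic.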