Hi there, I've been following this work for a few months and I think it's a really amazing idea to run LLMs over the Internet. I'm also trying to improve Petals' inference performance in my local environment. My view is that simply wrapping the Transformers library for inference is somewhat inefficient, since recent papers and projects have introduced many optimizations for LLM serving, for example Flash Attention, Paged Attention, and continuous batching. It would make sense for Petals to integrate one or a few of these. I wonder if the authors have any future plans for this. I'm personally trying to integrate vLLM with Petals, or in other words, to enable vLLM to run on different nodes over the Internet.
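For context, below is a minimal sketch of the single-node vLLM API that such an integration would presumably wrap: vLLM already provides PagedAttention and continuous batching within one engine, and the open question here is how to shard that engine's layers across Petals servers. The model name is just an example, and nothing in this snippet is Petals code.

```python
# Minimal single-node vLLM example (not Petals code): the engine internally
# handles PagedAttention KV-cache management and continuous batching.
from vllm import LLM, SamplingParams

# Example model; any HF-hosted causal LM supported by vLLM would work here.
llm = LLM(model="meta-llama/Llama-3.2-1B")

sampling_params = SamplingParams(temperature=0.7, max_tokens=64)

# vLLM schedules and batches these requests together under the hood.
outputs = llm.generate(
    ["What is distributed inference?", "Explain PagedAttention briefly."],
    sampling_params,
)

for out in outputs:
    print(out.outputs[0].text)
```

The hard part the issue is asking about is not this API, but splitting the model across untrusted, geographically distributed nodes while keeping vLLM's paged KV cache and scheduler intact on each shard.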
I'm interested in the answer to this as well. With the 1B and 3B Llama 3.2 models released recently, and quantized Llama 3.2 11B models already available on Hugging Face, I can't help but think there's still room for this project to help both local-network users and the wider network, so I'd like to know whether there are any future development plans. I'm imagining use cases for reusing old devices instead of sending them to landfills, for example installing Petals servers on them and giving them network access. Some users would contribute to the larger pool, others would prefer a private swarm, but by opening support for these other models and the approaches the OP mentioned, we could really unlock a lot of value by moving in that direction.