Add llama-run #452
Conversation
Reviewer's Guide by Sourcery

This pull request adds a llama-cpp-python server to the container images. It updates the build scripts to install the server and modifies the model execution command to use the new server. It also updates the llama.cpp and whisper.cpp dependencies.

Sequence diagram for the model execution flow changes:

```mermaid
sequenceDiagram
    participant User
    participant CLI
    participant Server
    participant Model
    User->>CLI: Execute command
    CLI->>CLI: Parse arguments
    CLI->>Server: Initialize llama-cpp-python server
    Note over Server: New server component
    Server->>Model: Load model
    Server->>Model: Process prompt
    Model-->>Server: Generate response
    Server-->>CLI: Return results
    CLI-->>User: Display output
```
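For readers unfamiliar with the flow sketched above, here is a minimal, hedged illustration of how a CLI could drive llama-cpp-python's bundled OpenAI-compatible server (the `llama_cpp.server` module); the model path, port, and wait time are placeholders rather than values taken from this PR.

```python
# Sketch only: start llama-cpp-python's OpenAI-compatible server and run one prompt.
# The model path and port are hypothetical; this is not ramalama's actual code.
import json
import subprocess
import time
import urllib.request

MODEL = "/models/example.gguf"   # placeholder model file
PORT = 8000

# CLI -> Server: launch the server as a child process.
server = subprocess.Popen(
    ["python3", "-m", "llama_cpp.server", "--model", MODEL, "--port", str(PORT)]
)
try:
    time.sleep(10)  # crude wait for the model to load; a real client would poll /v1/models

    # Server -> Model: send a prompt through the OpenAI-style chat endpoint.
    req = urllib.request.Request(
        f"http://127.0.0.1:{PORT}/v1/chat/completions",
        data=json.dumps(
            {
                # OpenAI-style APIs expect a model name; single-model servers
                # typically alias or ignore it.
                "model": "default",
                "messages": [{"role": "user", "content": "Say hello"}],
            }
        ).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["choices"][0]["message"]["content"])
finally:
    server.terminate()
```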
Hey @ericcurtin - I've reviewed your changes - here's some feedback:
Overall Comments:
- Please provide justification for changing the default runtime from llama.cpp to llama-cpp-python. What are the benefits that led to this decision?
- Consider updating any additional documentation to reflect the new runtime option and explain the differences between the available runtimes
Here's what I looked at during the review
- 🟢 General issues: all looks good
- 🟢 Security: all looks good
- 🟢 Testing: all looks good
- 🟢 Complexity: all looks good
- 🟢 Documentation: all looks good
We have skipped reviewing this pull request. It looks like we've already reviewed the commit 47fdf9a in this pull request.
Force-pushed c4f91bc → fe9c0ca
LGTM for the code. But I think I'm missing something: is llama-cpp-python already a dependency?

@ygalblum we probably need to push some container images before merging this, but once we do that, we should be all good.
Force-pushed 1d79a3c → 4c35af3
ramalama/cli.py (outdated):

```diff
@@ -103,7 +103,7 @@ def load_and_merge_config():
     )

     ramalama_config['carimage'] = ramalama_config.get('carimage', "registry.access.redhat.com/ubi9-micro:latest")
-    ramalama_config['runtime'] = ramalama_config.get('runtime', 'llama.cpp')
+    ramalama_config['runtime'] = ramalama_config.get('runtime', 'llama-cpp-python')
```
Looks like we have a different runtime for ramalama serve versus ramalama run?
For ramalama run, the only commands available to do that in a single process are the llama.cpp ones, like llama-cli (and llama-run, which I'll be adding in this PR now that the last few bits were merged a couple of hours ago). There are other possibilities for chatbots, but the others require us to set up a server process and a client process. I think llama-run is cleaner than setting up a client and a server.

As regards serving, there are three clear options: vLLM, llama.cpp, and llama-cpp-python (which is just a Python wrapper around llama.cpp). Although vLLM is, in my opinion, the best enterprise option for a server today, it lacks the diverse hardware support needed for local development, which leaves us the other two. The llama.cpp server is a very minimal implementation of an OpenAI-compatible server, as in it is barely OpenAI-compatible; last I tried it, it didn't work with OpenAI-compatible clients like https://github.com/open-webui/open-webui

The llama.cpp server might catch up, as it is actively being contributed to, but it's not there today. One key feature it is missing is multi-model support, where the client normally specifies the model it wants to use.

llama-cpp-python is further ahead in terms of its OpenAI-compatible server implementation; that's why I'm proposing we switch in this PR.
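To make the compatibility point concrete, here is a small, hedged sketch of what an OpenAI-style client such as open-webui roughly does: it lists /v1/models first and then names a model in each request, which is why missing multi-model support breaks those clients. The base URL and returned model names are placeholders.

```python
# Rough sketch of the requests an OpenAI-compatible client makes; the server
# address is a placeholder, not something this PR configures.
import json
import urllib.request

BASE = "http://127.0.0.1:8000/v1"

# 1. Discover available models; many clients refuse to start without this list.
with urllib.request.urlopen(BASE + "/models") as resp:
    model_ids = [m["id"] for m in json.loads(resp.read())["data"]]
print("served models:", model_ids)

# 2. Name the chosen model explicitly in each completion request.
req = urllib.request.Request(
    BASE + "/chat/completions",
    data=json.dumps(
        {
            "model": model_ids[0],
            "messages": [{"role": "user", "content": "Say hello"}],
        }
    ).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["choices"][0]["message"]["content"])
```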
My opinion here is that offering this choice is bad. We should only offer llama.cpp, and if the server tool llama-cpp-python is better, then default to using that; if a better one shows up in the future, we could move to it, as long as it implements the same API as llama-cpp-python. We are doing the same thing on the run side, moving from llama-cli to llama-run.

Bottom line: keep the selection to llama.cpp and vLLM (and whisper.cpp), and we can modify the tools used to work with the libraries under the covers. Eventually we could allow specifying the commands to run within the containers for run and serve, but that is a ways away.
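A rough illustration of the indirection being suggested (not ramalama's actual code, and the commands are placeholders): the user-facing runtime names stay fixed while the tool invoked under the covers can change.

```python
# Hypothetical mapping from user-facing --runtime values to the command used for
# `serve` inside the container; both names and commands are illustrative only.
SERVE_COMMANDS = {
    # "llama.cpp" stays the visible choice even if the serving tool underneath
    # is swapped (llama-server today, llama-cpp-python's server tomorrow, ...).
    "llama.cpp": ["python3", "-m", "llama_cpp.server", "--model"],
    "vllm": ["vllm", "serve"],
}

def build_serve_command(runtime: str, model_path: str) -> list[str]:
    """Return the serve command for a runtime; the user-visible choice set never grows."""
    if runtime not in SERVE_COMMANDS:
        raise ValueError(f"unknown runtime: {runtime}")
    return SERVE_COMMANDS[runtime] + [model_path]

if __name__ == "__main__":
    print(build_serve_command("llama.cpp", "/models/example.gguf"))
```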
Force-pushed acb465f → 678ca99
Adding a temperature option to llama-run here:
ramalama/cli.py (outdated):

```diff
@@ -205,8 +205,8 @@ def configure_arguments(parser):
     parser.add_argument(
         "--runtime",
         default=config.get("runtime"),
-        choices=["llama.cpp", "vllm"],
-        help="specify the runtime to use; valid options are 'llama.cpp' and 'vllm'",
+        choices=["llama-cpp-python", "llama.cpp", "vllm"],
```
As stated above, do not give this choice; just stick with llama.cpp and use the llama-cpp-python server for serve.
Force-pushed 07202c5 → ebf4d34
Force-pushed ebf4d34 → 4a5e4d8
I reduced the scope of this. One of the main drivers for the llama-cpp-python server is to attempt to find a server that's compatible with https://github.com/open-webui/open-webui. We probably need to explore a few options, like llama-server, llama-cpp-python, and https://github.com/mostlygeek/llama-swap
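As one hedged way to compare those candidates, a small probe like the following could check whether each server answers the /v1/models endpoint that open-webui relies on; the ports are placeholders for wherever each server happens to listen.

```python
# Illustrative probe of candidate OpenAI-compatible servers; the base URLs are
# made-up defaults, not addresses used by ramalama.
import json
import urllib.error
import urllib.request

CANDIDATES = {
    "llama-server": "http://127.0.0.1:8080/v1",
    "llama-cpp-python": "http://127.0.0.1:8000/v1",
    "llama-swap": "http://127.0.0.1:9090/v1",
}

for name, base in CANDIDATES.items():
    try:
        with urllib.request.urlopen(base + "/models", timeout=3) as resp:
            ids = [m["id"] for m in json.loads(resp.read()).get("data", [])]
        print(f"{name}: /v1/models OK, serving {ids}")
    except (urllib.error.URLError, KeyError, ValueError) as err:
        print(f"{name}: /v1/models failed ({err})")
```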
Force-pushed 19f64aa → 9ba28af
Hey @ericcurtin - I've reviewed your changes - here's some feedback:
Overall Comments:
- The removal of stderr redirection capability in exec_cmd() could make debugging more difficult. Consider keeping this functionality unless there's a specific reason for removing it.
Here's what I looked at during the review
- 🟢 General issues: all looks good
- 🟢 Security: all looks good
- 🟢 Testing: all looks good
- 🟢 Complexity: all looks good
- 🟢 Documentation: all looks good
Force-pushed 13ed742 → cba1655
Force-pushed 3c26215 → 6624389
Commit message: "And update llama.cpp and whisper.cpp." Signed-off-by: Eric Curtin <[email protected]>
Force-pushed 6624389 → fb45f23
This is ready to go @rhatdan
LGTM