common : support tag-based --hf-repo like on ollama #11195
Conversation
Copilot wasn't able to review any files in this pull request.
Files not reviewed (3)
- common/arg.cpp: Language not supported
- common/common.cpp: Language not supported
- common/common.h: Language not supported
@ggerganov I'm wondering, if we want to go a step further, we can also make the … But doing it this way, we should also take a bit of time to maintain GGUF on … What do you think about this?
common/common.h (Outdated)
#if defined(LLAMA_USE_CURL)
#include <curl/curl.h>
#include <curl/easy.h>
#include <future>
#endif
We should keep curl contained in a single source: `common/common.cpp`. Move the `common_get_hf_file` to `common/common.cpp` and keep the includes in the `.cpp` file.
Yup, that makes sense, I did it in 22927b1
Seems like a lot of work to manage the models in the organization and I am not sure we will have the resources to do it.
@ericcurtin Btw, I didn't notice that you added support for the registry in …

That being said, I appreciate the effort to bring good UX into …
I remember you wanted to look into this for ramalama. If we port this to python3 in ramalama it should work.
Haven't tested this yet, but how does it react to split models and my subfolder structure? Awesome QoL change though either way!
@ericcurtin Yes, it should be trivial to port to python, just need to set … And btw, without …

@bartowski1182 I've just tried with a split model and it works. It should work normally with a subfolder too; it should be exactly the same with …
@ngxson How does the API work with gated models? I'm working on adding support for this in … I've tried fetching …
Edit: sorry, I misunderstood the question. I'll discuss with the team to see if it's safe to let users read the manifest of a gated model (without giving access to the blobs).
Btw, what is the use case for that @giladgd? AFAIK you can technically get the full model info via …
@ngxson Currently, …

I wanted to add support for shorter URIs for a while now, but I wasn't sure what's the most stable and efficient approach to resolve the correct file for a given quant without having to read the metadata of all the repo's … There are various flows where model URIs are resolved in …

BTW, it would be nice if the API could also return the quant type/name when getting the …
I noticed today that the Windows binaries from the releases do not have CURL support. Should we start bundling …?
I have no idea if newer versions of Windows (10/11) already have curl built in.
Yup, I'm not sure what the best way is. I just noticed that the …

Think we should somehow enable it for the Windows builds only, because it is rather difficult to provide there compared to linux and macos. Either bundling a static build (if it is not too heavy on the binary sizes) or somehow using a system-wide library if it is available.
Distributing a libcurl.dll ourselves is slightly better than static linking, less duplication, better for updates. But static linking also works.
According to this, every copy of Windows 10/11 should have it built-in though:
The … I'll make a PR a bit later.
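For context, curl support is an opt-in build option, which is why release binaries may lack it. A minimal sketch of a build that enables it, assuming the `LLAMA_CURL` CMake option and a libcurl that CMake can find on the build machine:

```sh
# configure llama.cpp with libcurl so -hf / --hf-repo downloads work
cmake -B build -DLLAMA_CURL=ON
cmake --build build --config Release
```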
This change is related to these upstream PRs:
- ggerganov/llama.cpp#11195 allows using tag-based repo names like on ollama
- ggerganov/llama.cpp#11214 automatically turns on `--conversation` mode for models having a chat template

Example:

```sh
# for "instruct" model, conversation mode is enabled automatically
llama-cli -hf bartowski/Llama-3.2-1B-Instruct-GGUF

# for non-instruct model, it runs as completion
llama-cli -hf TheBloke/Llama-2-7B-GGUF -p "Once upon a time,"
```
This PR takes advantage of the Ollama support on Hugging Face. The backend can suggest the GGUF file to use automatically, based on the tag name.
For example (without tag):
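A minimal sketch of the untagged form, reusing the repo name from the example quoted elsewhere in this thread (any GGUF repo on the Hub behaves the same way):

```sh
# no tag: the backend suggests a default quant for the repo
llama-cli -hf bartowski/Llama-3.2-1B-Instruct-GGUF
```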
By default, it will take `Q4_K_M` if it is available. Otherwise, it checks `Q4*`, then falls back to the first file in the repo if none is found.

The user can also specify the quantization via the tag, for example:
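For illustration, assuming the ollama-style `repo:tag` syntax the title refers to, where the tag names a quant (the `Q8_0` here is just an example, not taken from the original post):

```sh
# the tag picks the quantization explicitly instead of the Q4_K_M default
llama-cli -hf bartowski/Llama-3.2-1B-Instruct-GGUF:Q8_0
```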
How it works
HF already has a registry-compatible API, which powers the ollama <> HF integration:
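For reference, that integration is what lets ollama pull GGUF repos directly from the Hub; a sketch of the user-facing side (repo and tag are illustrative):

```sh
# ollama resolving a GGUF repo and tag through the same registry API
ollama run hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF:Q4_K_M
```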
However, the API only returns the SHA256 of the GGUF file. To return the file name (which is used by llama.cpp), we had to add logic in the HF backend code base that returns a `ggufFile` object when the `User-Agent` header is set to `llama-cpp`.
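A rough sketch of what such a request could look like, assuming the manifest endpoint follows the OCI-style registry layout used by the ollama integration (the URL and tag here are illustrative, not taken from the original post):

```sh
# fetch the manifest for a tagged repo; the llama-cpp User-Agent asks the
# backend to include the ggufFile info (file name) alongside the SHA256 layers
curl -s -H "User-Agent: llama-cpp" \
  https://huggingface.co/v2/bartowski/Llama-3.2-1B-Instruct-GGUF/manifests/Q4_K_M
```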
Note

You may ask: why don't we just use the `blobId` in `layers`? Indeed, we definitely can, but this would break backward compatibility with existing cache files, so I decided to properly reuse the existing `hf_file` by getting the correct value for it.