
common : support tag-based --hf-repo like on ollama #11195

Merged: 9 commits into ggerganov:master on Jan 13, 2025

Conversation

ngxson (Collaborator) commented Jan 11, 2025

This PR takes advantage of the Ollama support on Hugging Face: the HF backend can automatically suggest which GGUF file to use, based on the tag name.

```sh
-hf <user>/<model>[:quant]
```

For example (without tag):

```sh
llama-cli -hf bartowski/Llama-3.2-1B-Instruct-GGUF
llama-cli -hf mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated-GGUF
llama-cli -hf arcee-ai/SuperNova-Medius-GGUF
llama-cli -hf bartowski/Humanish-LLama3-8B-Instruct-GGUF
```

By default, it will take Q4_K_M if it is available. Otherwise, it checks for a Q4* quant, then falls back to the first file in the repo if none is found.
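
The selection itself happens in the Hugging Face backend, not in llama.cpp, but the priority order can be sketched roughly like this in Python (the file list and helper name below are purely illustrative):

```python
# Illustrative sketch of the documented fallback order; the real selection is
# done server-side by the HF backend. `files` is a hypothetical repo listing.
def pick_default_quant(files: list[str]) -> str:
    for f in files:                       # 1. prefer Q4_K_M if available
        if "q4_k_m" in f.lower():
            return f
    for f in files:                       # 2. otherwise anything that looks like a Q4 quant
        if "q4" in f.lower():
            return f
    return files[0]                       # 3. fall back to the first file in the repo

files = [
    "Llama-3.2-1B-Instruct-IQ3_M.gguf",
    "Llama-3.2-1B-Instruct-Q4_K_M.gguf",
    "Llama-3.2-1B-Instruct-Q8_0.gguf",
]
print(pick_default_quant(files))  # -> Llama-3.2-1B-Instruct-Q4_K_M.gguf
```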

The user can also specify the quantization via the tag, for example:

```sh
llama-cli -hf bartowski/Llama-3.2-3B-Instruct-GGUF:IQ3_M
llama-cli -hf bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0

# the quantization name is case-insensitive, this will also work
llama-cli -hf bartowski/Llama-3.2-3B-Instruct-GGUF:iq3_m
```

How it works

HF already has a registry-compatible API, which powers the Ollama <> HF integration:

https://huggingface.co/v2/<user>/<model>/manifests/<tag>

However, the API only returns the SHA256 of the GGUF file. To also return the file name (which is what llama.cpp uses), we had to add logic to the HF backend code base that returns a ggufFile object when the User-Agent header is set to llama-cpp:

```json
{
    ...
    "ggufFile": {
        "rfilename": "Llama-3.2-3B-Instruct-Q4_K_M.gguf",
        "blobId": "sha256:6c1a2b41161032677be168d354123594c0e6e67d2b9227c84f296ad037c728ff",
        "size": 2019377696,
        "lfs": {
            "sha256": "6c1a2b41161032677be168d354123594c0e6e67d2b9227c84f296ad037c728ff",
            "size": 2019377696,
            "pointerSize": 135
        }
    }
}
```
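
For reference, a minimal sketch of calling this endpoint from outside llama.cpp (this is not the actual implementation, which uses libcurl in common.cpp; it assumes the third-party requests package):

```python
import requests

# Example repo and tag from this PR; any <user>/<model>[:quant] works the same way.
repo = "bartowski/Llama-3.2-3B-Instruct-GGUF"
tag = "Q4_K_M"  # use "latest" to get the default selection described above

resp = requests.get(
    f"https://huggingface.co/v2/{repo}/manifests/{tag}",
    # The ggufFile object is only included when this header is present.
    headers={"User-Agent": "llama-cpp"},
)
resp.raise_for_status()
gguf = resp.json()["ggufFile"]
print(gguf["rfilename"], gguf["size"])
```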

Note

You may ask: why don't we just use the blobId in layers? We definitely could, but this would break backward compatibility with existing cache files, so I decided to properly reuse the existing hf_file by getting the correct value for it.

@ngxson ngxson requested a review from Copilot January 12, 2025 12:41

Copilot wasn't able to review any files in this pull request.

Files not reviewed (3)
  • common/arg.cpp: Language not supported
  • common/common.cpp: Language not supported
  • common/common.h: Language not supported
@ngxson ngxson requested a review from ggerganov January 13, 2025 10:45
@ngxson ngxson marked this pull request as ready for review January 13, 2025 10:45
ngxson (Collaborator, Author) commented Jan 13, 2025

@ggerganov I'm wondering, if we want to go a step further, we could also make the <user> part in <user>/<model> optional. It could default to ggml-org, which would take models from https://huggingface.co/ggml-org

But if we go this way, we should also take some time to maintain GGUFs on ggml-org (we only have a few models for now)

What do you think about this?

common/common.h Outdated
Comment on lines 11 to 16
```cpp
#if defined(LLAMA_USE_CURL)
#include <curl/curl.h>
#include <curl/easy.h>
#include <future>
#endif
```

ggerganov (Owner) commented:

We should keep curl contained in a single source: common/common.cpp. Move the common_get_hf_file to common/common.cpp and keep the includes in the .cpp file.

ngxson (Collaborator, Author) replied:

Yup, that makes sense. I did it in 22927b1

ggerganov (Owner) commented:

> @ggerganov I'm wondering, if we want to go a step further, we could also make the <user> part in <user>/<model> optional. It could default to ggml-org, which would take models from huggingface.co/ggml-org
>
> But if we go this way, we should also take some time to maintain GGUFs on ggml-org (we only have a few models for now)
>
> What do you think about this?

Seems like a lot of work to manage the models in the organization and I am not sure we will have the resources to do it.

@ngxson ngxson merged commit 00b4c3d into ggerganov:master Jan 13, 2025
48 checks passed
ngxson (Collaborator, Author) commented Jan 13, 2025

@ericcurtin Btw, I hadn't noticed that you added registry support in llama-run. Some small things to note:

  • Not all models on the Ollama hub are compatible with llama.cpp, for example llama3.2-vision and phi-4 (Ollama has a patch on their side that handles the sliding window differently from llama.cpp). Also, it's tricky to add support for the mmproj file. That's the main reason why I don't plan to support the Ollama registry.
  • For the HF registry, in this PR I added the same support in common.cpp, which returns the value of --hf-file instead of the SHA256. This is just to make sure existing cache files continue to work as before (as they depend on --model, which is generated from --hf-file).

That being said, I appreciate the effort to bring good UX into llama-run. I just want to raise awareness that, as the code of llama-run grows, there will be a lot of duplication, so it may eventually be better to make llama-run a full product instead of an example.

ericcurtin (Collaborator) commented Jan 13, 2025

@swarajpande5

containers/ramalama#377

I remember you wanted to look into this for ramalama. If we port this to python3 in ramalama it should work.

bartowski1182 (Contributor) commented:

Haven't tested this yet, but how does it react to split models and my subfolder structure?

Awesome QoL change though either way!

ngxson (Collaborator, Author) commented Jan 13, 2025

@ericcurtin Yes, it should be trivial to port to Python: just set the User-Agent header to llama-cpp when sending the request, and the HF API will return the manifest (as normal) plus the ggufFile details. Feel free to ping me if you need help with the HF API (btw, I'm the maintainer of the hf.co registry)

And btw, without User-Agent: llama-cpp, the API will throw an error on multi-shard (split) files, because they are not supported by Ollama (but are supported by llama.cpp)

@bartowski1182 I've just tried with a split model and it works. It should also work normally with subfolders; it behaves exactly the same as --hf-file sub/folder/model.gguf

giladgd (Contributor) commented Jan 13, 2025

@ngxson How does the API work with gated models?
Since the list of files is always accessible on the website even on gated models (while the files themselves are still gated), I think it should be OK to be able to call the API on gated models without an access token.

I'm working on adding support for this in node-llama-cpp, and it would be nice to be able to resolve a model URI without an access token so it could be done separately from the download process itself.

I've tried fetching https://hf.co/v2/google/gemma-2b-it-GGUF/manifests/latest with User-Agent: llama-cpp without an access token and got 401 Unauthorized.

ngxson (Collaborator, Author) commented Jan 13, 2025

Edit: sorry, I misunderstood the question. I'll discuss with the team to see if it's safe to let users read the manifest of a gated model (without giving access to the blobs)

ngxson (Collaborator, Author) commented Jan 13, 2025

Btw, what is the use case for that, @giladgd? AFAIK you can technically get the full model info via the hf.co/api/models/google/gemma-2b-it-GGUF endpoint, but without being able to download the files
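
For illustration, a minimal sketch of that lookup (assuming the requests package; the siblings/rfilename field names follow the public HF Hub API):

```python
import requests

# Model-info endpoint mentioned above: it lists repo metadata and file names,
# but does not grant download access to the gated files themselves.
resp = requests.get("https://huggingface.co/api/models/google/gemma-2b-it-GGUF")
resp.raise_for_status()
info = resp.json()
print([s["rfilename"] for s in info.get("siblings", [])])
```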

giladgd (Contributor) commented Jan 13, 2025

@ngxson Currently, node-llama-cpp supports this kind of model URI:

hf:mradermacher/Meta-Llama-3.1-8B-Instruct-GGUF/Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf

I've wanted to add support for shorter URIs for a while now, but I wasn't sure what the most stable and efficient approach would be to resolve the correct file for a given quant without having to read the metadata of all the repo's .gguf files to find the right one.
Now that you've added an API that does it on HuggingFace's side, it makes it possible to do this efficiently using a single request, so now is the right time to improve the URI-resolving implementation.

There are various flows where model URIs are resolved in node-llama-cpp, and in some, it validates whether the remote file is the same as the existing local one, and if there's a difference, the user will be prompted whether to download the updated file or use the existing one (there's also an explicit option for this that can be specified).
It would be nice if that validation could be done even without an access token.
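
For illustration, such a check could compare a local file against the sha256 reported in the manifest's ggufFile (a minimal sketch using only hashlib; the local path below is a placeholder):

```python
import hashlib

def matches_remote(local_path: str, remote_sha256: str) -> bool:
    # Stream the local GGUF and compare its SHA-256 with the manifest value.
    h = hashlib.sha256()
    with open(local_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest() == remote_sha256.removeprefix("sha256:")

# sha256 taken from the example manifest in the PR description; the path is made up.
print(matches_remote(
    "models/Llama-3.2-3B-Instruct-Q4_K_M.gguf",
    "sha256:6c1a2b41161032677be168d354123594c0e6e67d2b9227c84f296ad037c728ff",
))
```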

BTW, it would be nice if the API could also return the quant type/name when resolving the latest tag.

ggerganov (Owner) commented:

I noticed today that the Windows binaries from the releases do not have CURL support. Should we start bundling libcurl statically into the executables?

ngxson (Collaborator, Author) commented Jan 14, 2025

I have no idea whether newer versions of Windows (10/11) already ship with libcurl.dll built in, but yes, we can bundle it just in case.

ggerganov (Owner) commented:

Yup, I'm not sure what is the best way. I just noticed that the llama-server.exe says it is built without libcurl support.

I think we should somehow enable it for the Windows builds only, because it is rather difficult to provide it there compared to Linux and macOS. Either bundle a static build (if it is not too heavy on the binary sizes) or somehow use a system-wide library if it is available.

ericcurtin (Collaborator) commented:

Distributing a libcurl.dll ourselves is slightly better than static linking: less duplication, better for updates. But static linking also works.

ericcurtin (Collaborator) commented Jan 14, 2025

According to this, every copy of Windows 10/11 should have it built in, though:

https://curl.se/windows/microsoft.html

ngxson (Collaborator, Author) commented Jan 14, 2025

The llama-server workflow already has a Windows build with -DLLAMA_CURL; we can just copy the setup step from that, which should be simple (the libcurl.dll is also downloaded in that setup step).

I'll make a PR a bit later.

ngxson added a commit to huggingface/huggingface.js that referenced this pull request Jan 17, 2025
This change is related to these upstream PRs:
- ggerganov/llama.cpp#11195 allows using tag-based repo names like on ollama
- ggerganov/llama.cpp#11214 automatically turns on `--conversation` mode for models that have a chat template

Example:

```sh
# for "instruct" model, conversation mode is enabled automatically
llama-cli -hf bartowski/Llama-3.2-1B-Instruct-GGUF

# for non-instruct model, it runs as completion
llama-cli -hf TheBloke/Llama-2-7B-GGUF -p "Once upon a time,"
```