
common : support tag-based --hf-repo like on ollama #11195

Merged: 9 commits into ggerganov:master on Jan 13, 2025

Conversation

ngxson (Collaborator) commented Jan 11, 2025

This PR takes advantage of the Ollama support on Hugging Face: the HF backend can automatically suggest which GGUF file to use, based on the tag name.

```sh
-hf <user>/<model>[:quant]
```

For example (without tag):

```sh
llama-cli -hf bartowski/Llama-3.2-1B-Instruct-GGUF
llama-cli -hf mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated-GGUF
llama-cli -hf arcee-ai/SuperNova-Medius-GGUF
llama-cli -hf bartowski/Humanish-LLama3-8B-Instruct-GGUF
```

By default, it will take Q4_K_M if it is available. Otherwise, it checks for a Q4* quant, then falls back to the first file in the repo if none is found.
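
The selection itself happens in the Hugging Face backend, not in llama.cpp, but the priority order can be sketched roughly like this in Python (the file list and helper name below are purely illustrative):

```python
# Illustrative sketch of the documented fallback order; the real selection is
# done server-side by the HF backend. `files` is a hypothetical repo listing.
def pick_default_quant(files: list[str]) -> str:
    for f in files:                       # 1. prefer Q4_K_M if available
        if "q4_k_m" in f.lower():
            return f
    for f in files:                       # 2. otherwise anything that looks like a Q4 quant
        if "q4" in f.lower():
            return f
    return files[0]                       # 3. fall back to the first file in the repo

files = [
    "Llama-3.2-1B-Instruct-IQ3_M.gguf",
    "Llama-3.2-1B-Instruct-Q4_K_M.gguf",
    "Llama-3.2-1B-Instruct-Q8_0.gguf",
]
print(pick_default_quant(files))  # -> Llama-3.2-1B-Instruct-Q4_K_M.gguf
```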

The user can also specify the quantization via the tag, for example:

```sh
llama-cli -hf bartowski/Llama-3.2-3B-Instruct-GGUF:IQ3_M
llama-cli -hf bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0

# the quantization name is case-insensitive, this will also work
llama-cli -hf bartowski/Llama-3.2-3B-Instruct-GGUF:iq3_m
```

How it works

HF already has a registry-compatible API, which powers the Ollama <> HF integration:

https://huggingface.co/v2/<user>/<model>/manifests/<tag>

However, the API only returns the SHA256 of the GGUF file. To also return the file name (which is what llama.cpp uses), we had to add logic to the HF backend code base that returns a ggufFile object when the User-Agent header is set to llama-cpp:

```json
{
    ...
    "ggufFile": {
        "rfilename": "Llama-3.2-3B-Instruct-Q4_K_M.gguf",
        "blobId": "sha256:6c1a2b41161032677be168d354123594c0e6e67d2b9227c84f296ad037c728ff",
        "size": 2019377696,
        "lfs": {
            "sha256": "6c1a2b41161032677be168d354123594c0e6e67d2b9227c84f296ad037c728ff",
            "size": 2019377696,
            "pointerSize": 135
        }
    }
}
```
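
For reference, a minimal sketch of calling this endpoint from outside llama.cpp (this is not the actual implementation, which uses libcurl in common.cpp; it assumes the third-party requests package):

```python
import requests

# Example repo and tag from this PR; any <user>/<model>[:quant] works the same way.
repo = "bartowski/Llama-3.2-3B-Instruct-GGUF"
tag = "Q4_K_M"  # use "latest" to get the default selection described above

resp = requests.get(
    f"https://huggingface.co/v2/{repo}/manifests/{tag}",
    # The ggufFile object is only included when this header is present.
    headers={"User-Agent": "llama-cpp"},
)
resp.raise_for_status()
gguf = resp.json()["ggufFile"]
print(gguf["rfilename"], gguf["size"])
```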

Note

You may ask: why don't we just use the blobId in layers? We definitely could, but this would break backward compatibility with existing cache files, so I decided to properly reuse the existing hf_file by getting the correct value for it.

@ngxson ngxson requested a review from Copilot January 12, 2025 12:41

Copilot wasn't able to review any files in this pull request.

Files not reviewed (3)
  • common/arg.cpp: Language not supported
  • common/common.cpp: Language not supported
  • common/common.h: Language not supported
@ngxson ngxson requested a review from ggerganov January 13, 2025 10:45
@ngxson ngxson marked this pull request as ready for review January 13, 2025 10:45
ngxson (Collaborator, Author) commented Jan 13, 2025

@ggerganov I'm wondering, if we want to go a step further, we could also make the <user> part in <user>/<model> optional. It could default to ggml-org, which would take models from https://huggingface.co/ggml-org

But if we go this way, we should also take some time to maintain GGUFs on ggml-org (we only have a few models for now)

What do you think about this?

common/common.h Outdated
Comment on lines 11 to 16
```cpp
#if defined(LLAMA_USE_CURL)
#include <curl/curl.h>
#include <curl/easy.h>
#include <future>
#endif
```

ggerganov (Owner) commented:

We should keep curl contained in a single source: common/common.cpp. Move the common_get_hf_file to common/common.cpp and keep the includes in the .cpp file.

ngxson (Collaborator, Author) replied:

Yup, that makes sense. I did it in 22927b1

ggerganov (Owner) commented:

> @ggerganov I'm wondering, if we want to go a step further, we could also make the <user> part in <user>/<model> optional. It could default to ggml-org, which would take models from huggingface.co/ggml-org
>
> But if we go this way, we should also take some time to maintain GGUFs on ggml-org (we only have a few models for now)
>
> What do you think about this?

Seems like a lot of work to manage the models in the organization and I am not sure we will have the resources to do it.

@ngxson ngxson merged commit 00b4c3d into ggerganov:master Jan 13, 2025
48 checks passed
ngxson (Collaborator, Author) commented Jan 13, 2025

@ericcurtin Btw, I hadn't noticed that you added registry support in llama-run. Some small things to note:

  • Not all models on the Ollama hub are compatible with llama.cpp, for example llama3.2-vision and phi-4 (Ollama has a patch on their side that handles the sliding window differently from llama.cpp). Also, it's tricky to add support for the mmproj file. That's the main reason why I don't plan to support the Ollama registry.
  • For the HF registry, in this PR I added the same support in common.cpp, which returns the value of --hf-file instead of the SHA256. This is just to make sure existing cache files continue to work as before (as they depend on --model, which is generated from --hf-file).

That being said, I appreciate the effort to bring good UX into llama-run. I just want to raise awareness that, as the code of llama-run grows, there will be a lot of duplication, so it may eventually be better to make llama-run a full product instead of an example.

ericcurtin (Collaborator) commented Jan 13, 2025

@swarajpande5

containers/ramalama#377

I remember you wanted to look into this for ramalama. If we port this to python3 in ramalama it should work.

bartowski1182 (Contributor) commented:

Haven't tested this yet, but how does it react to split models and my subfolder structure?

Awesome QoL change though either way!

ngxson (Collaborator, Author) commented Jan 13, 2025

@ericcurtin Yes, it should be trivial to port to Python: just set the User-Agent header to llama-cpp when sending the request, and the HF API will return the manifest (as normal) plus the ggufFile details. Feel free to ping me if you need help with the HF API (btw, I'm the maintainer of the hf.co registry)

And btw, without User-Agent: llama-cpp, the API will throw an error on multi-shard (split) files, because they are not supported by Ollama (but are supported by llama.cpp)

@bartowski1182 I've just tried with a split model and it works. It should also work normally with subfolders; it behaves exactly the same as --hf-file sub/folder/model.gguf

giladgd (Contributor) commented Jan 13, 2025

@ngxson How does the API work with gated models?
Since the list of files is always accessible on the website even on gated models (while the files themselves are still gated), I think it should be OK to be able to call the API on gated models without an access token.

I'm working on adding support for this in node-llama-cpp, and it would be nice to be able to resolve a model URI without an access token so it could be done separately from the download process itself.

I've tried fetching https://hf.co/v2/google/gemma-2b-it-GGUF/manifests/latest with User-Agent: llama-cpp without an access token and got 401 Unauthorized.

ngxson (Collaborator, Author) commented Jan 13, 2025

Edit: sorry, I misunderstood the question. I'll discuss with the team to see if it's safe to let users read the manifest of a gated model (without giving access to the blobs)

ngxson (Collaborator, Author) commented Jan 13, 2025

Btw, what is the use case for that, @giladgd? AFAIK you can technically get the full model info via the hf.co/api/models/google/gemma-2b-it-GGUF endpoint, but without being able to download the files
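
For illustration, a minimal sketch of that lookup (assuming the requests package; the siblings/rfilename field names follow the public HF Hub API):

```python
import requests

# Model-info endpoint mentioned above: it lists repo metadata and file names,
# but does not grant download access to the gated files themselves.
resp = requests.get("https://huggingface.co/api/models/google/gemma-2b-it-GGUF")
resp.raise_for_status()
info = resp.json()
print([s["rfilename"] for s in info.get("siblings", [])])
```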

giladgd (Contributor) commented Jan 13, 2025

@ngxson Currently, node-llama-cpp supports this kind of model URI:

hf:mradermacher/Meta-Llama-3.1-8B-Instruct-GGUF/Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf

I've wanted to add support for shorter URIs for a while now, but I wasn't sure what the most stable and efficient approach would be to resolve the correct file for a given quant without having to read the metadata of all the repo's .gguf files to find the right one.
Now that you've added an API that does it on HuggingFace's side, it makes it possible to do this efficiently using a single request, so now is the right time to improve the URI-resolving implementation.

There are various flows where model URIs are resolved in node-llama-cpp, and in some, it validates whether the remote file is the same as the existing local one, and if there's a difference, the user will be prompted whether to download the updated file or use the existing one (there's also an explicit option for this that can be specified).
It would be nice if that validation could be done even without an access token.
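
For illustration, such a check could compare a local file against the sha256 reported in the manifest's ggufFile (a minimal sketch using only hashlib; the local path below is a placeholder):

```python
import hashlib

def matches_remote(local_path: str, remote_sha256: str) -> bool:
    # Stream the local GGUF and compare its SHA-256 with the manifest value.
    h = hashlib.sha256()
    with open(local_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest() == remote_sha256.removeprefix("sha256:")

# sha256 taken from the example manifest in the PR description; the path is made up.
print(matches_remote(
    "models/Llama-3.2-3B-Instruct-Q4_K_M.gguf",
    "sha256:6c1a2b41161032677be168d354123594c0e6e67d2b9227c84f296ad037c728ff",
))
```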

BTW, it would be nice if the API could also return the quant type/name when resolving the latest tag.

ggerganov (Owner) commented:

I noticed today that the Windows binaries from the releases do not have CURL support. Should we start bundling libcurl statically into the executables?

ngxson (Collaborator, Author) commented Jan 14, 2025

I have no idea whether newer versions of Windows (10/11) already ship with libcurl.dll built in, but yes, we can bundle it just in case.

ggerganov (Owner) commented:

Yup, I'm not sure what is the best way. I just noticed that the llama-server.exe says it is built without libcurl support.

I think we should somehow enable it for the Windows builds only, because it is rather difficult to provide it there compared to Linux and macOS. Either bundle a static build (if it is not too heavy on the binary sizes) or somehow use a system-wide library if it is available.

ericcurtin (Collaborator) commented:

Distributing a libcurl.dll ourselves is slightly better than static linking: less duplication, better for updates. But static linking also works.

ericcurtin (Collaborator) commented Jan 14, 2025

According to this, every copy of Windows 10/11 should have it built in, though:

https://curl.se/windows/microsoft.html

ngxson (Collaborator, Author) commented Jan 14, 2025

The llama-server workflow already has a Windows build with -DLLAMA_CURL; we can just copy the setup step from that, which should be simple (the libcurl.dll is also downloaded in that setup step).

I'll make a PR a bit later.

ngxson added a commit to huggingface/huggingface.js that referenced this pull request Jan 17, 2025
This change is related to these upstream PRs:
- ggerganov/llama.cpp#11195 allows using tag-based repo names like on ollama
- ggerganov/llama.cpp#11214 automatically turns on `--conversation` mode for models that have a chat template

Example:

```sh
# for "instruct" model, conversation mode is enabled automatically
llama-cli -hf bartowski/Llama-3.2-1B-Instruct-GGUF

# for non-instruct model, it runs as completion
llama-cli -hf TheBloke/Llama-2-7B-GGUF -p "Once upon a time,"
```