
Run TEI model on CPU fails (says Cuda f16 and flash attention is required) #431

Open
Astlaan opened this issue Oct 25, 2024 · 1 comment

Comments

Astlaan commented Oct 25, 2024

System Info

OS: Windows 11
Rust version: cargo 1.75.0 (1d8b05cdd 2023-11-20)
Hardware: CPU AMD 6800HS

(text-generation-launcher --env didn't work)

Information

  • [ ] Docker
  • [x] The CLI directly

Tasks

  • [x] An officially supported command
  • [ ] My own modifications

Reproduction

Hi,
I am trying to run a model locally on the CPU, since I only have an AMD GPU, which is apparently not yet supported.

  1. I followed the instructions here: https://huggingface.co/docs/text-embeddings-inference/local_cpu
  2. I tried to run this:
text-embeddings-router --model-id dunzhang/stella_en_400M_v5 --port 8080
  3. I get this error:
2024-10-25T21:52:54.872449Z  INFO text_embeddings_router: router\src/main.rs:175: Args { model_id: "dun*****/******_**_***M_v5", revision: None, tokenization_workers: None, dtype: None, pooling: None, max_concurrent_requests: 512, max_batch_tokens: 16384, max_batch_requests: None, max_client_batch_size: 32, auto_truncate: false, default_prompt_name: None, default_prompt: None, hf_api_token: None, hostname: "0.0.0.0", port: 8080, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: None, payload_limit: 2000000, api_key: None, json_output: false, otlp_endpoint: None, otlp_service_name: "text-embeddings-inference.server", cors_allow_origin: None }
2024-10-25T21:52:54.875192Z  INFO hf_hub: C:\Users\user\.cargo\registry\src\index.crates.io-6f17d22bba15001f\hf-hub-0.3.2\src\lib.rs:55: Token file not found "C:\\Users\\user\\.cache\\huggingface\\token"
2024-10-25T21:52:54.875404Z  INFO download_pool_config: text_embeddings_core::download: core\src\download.rs:38: Downloading `1_Pooling/config.json`
2024-10-25T21:52:54.875746Z  INFO download_new_st_config: text_embeddings_core::download: core\src\download.rs:62: Downloading `config_sentence_transformers.json`
2024-10-25T21:52:54.875919Z  INFO download_artifacts: text_embeddings_core::download: core\src\download.rs:21: Starting download
2024-10-25T21:52:54.876003Z  INFO download_artifacts: text_embeddings_core::download: core\src\download.rs:23: Downloading `config.json`
2024-10-25T21:52:54.876215Z  INFO download_artifacts: text_embeddings_core::download: core\src\download.rs:26: Downloading `tokenizer.json`
2024-10-25T21:52:54.876393Z  INFO download_artifacts: text_embeddings_backend: backends\src\lib.rs:328: Downloading `model.safetensors`
2024-10-25T21:52:54.876567Z  INFO download_artifacts: text_embeddings_core::download: core\src\download.rs:32: Model artifacts downloaded in 647.4µs
2024-10-25T21:52:54.886413Z  INFO text_embeddings_router: router\src/lib.rs:206: Maximum number of tokens per request: 512
2024-10-25T21:52:54.886730Z  INFO text_embeddings_core::tokenization: core\src\tokenization.rs:28: Starting 16 tokenization workers
2024-10-25T21:52:54.930092Z  INFO text_embeddings_router: router\src/lib.rs:248: Starting model backend
Error: Could not create backend

Caused by:
    Could not start backend: GTE is only supported on Cuda devices in fp16 with flash attention enabled

It's asking for very specific GPU resources, even though I'm trying to run on the CPU.
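
For reference, the Args dump above shows dtype: None; pinning it explicitly is one thing to try, though per the error above it should still fail the same way on CPU, since the check keys on the GTE architecture rather than on the dtype default. A sketch of that attempt (assuming the router's --dtype flag accepts float32):

text-embeddings-router --model-id dunzhang/stella_en_400M_v5 --dtype float32 --port 8080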

Expected behavior

I would expect the model to load and run on the CPU. :)

kozistr (Contributor) commented Oct 26, 2024

Hi @Astlaan, this may be related to #375. For now, GTE is only supported on CUDA devices in fp16; support for the CPU version still needs to be implemented.
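
For illustration, here is a minimal, self-contained Rust sketch of the kind of device/dtype guard that would produce the error above. This is hypothetical, not TEI's actual source; all names in it (Device, DType, load_gte) are assumptions.

// Hypothetical sketch, not TEI's actual source: the kind of guard a model
// constructor can apply before starting the backend. All names here
// (Device, DType, load_gte) are illustrative assumptions.
#[derive(PartialEq)]
enum Device { Cpu, Cuda }

#[derive(PartialEq)]
enum DType { F16, F32 }

// Reject any configuration the GTE kernels cannot serve, mirroring the
// message in the log above.
fn load_gte(device: Device, dtype: DType, flash_attention: bool) -> Result<(), String> {
    if device != Device::Cuda || dtype != DType::F16 || !flash_attention {
        return Err(
            "GTE is only supported on Cuda devices in fp16 with flash attention enabled".to_string(),
        );
    }
    Ok(())
}

fn main() {
    // A CPU-only launch fails this check immediately, before any inference is attempted.
    match load_gte(Device::Cpu, DType::F32, false) {
        Ok(()) => println!("backend started"),
        Err(e) => eprintln!("Error: Could not start backend: {e}"),
    }
}

The point of such a guard is to fail fast at backend startup, which matches the log above: the model artifacts download successfully, and the failure only occurs at "Starting model backend".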
