
Run TEI model on CPU fails (says Cuda f16 and flash attention is required) #431

Open
Astlaan opened this issue Oct 25, 2024 · 1 comment

Comments

Astlaan commented Oct 25, 2024

System Info

OS: Windows 11
Rust version: cargo 1.75.0 (1d8b05cdd 2023-11-20)
Hardware: CPU AMD 6800HS

(text-generation-launcher --env didn't work)

Information

  • [ ] Docker
  • [x] The CLI directly

Tasks

  • [x] An officially supported command
  • [ ] My own modifications

Reproduction

Hi,
I am trying to run a model locally on the CPU, since I only have an AMD GPU, which is apparently not yet supported.

  1. I followed the instructions here: https://huggingface.co/docs/text-embeddings-inference/local_cpu
  2. I tried to run this:
text-embeddings-router --model-id dunzhang/stella_en_400M_v5 --port 8080
  3. I get this error:
2024-10-25T21:52:54.872449Z  INFO text_embeddings_router: router\src/main.rs:175: Args { model_id: "dun*****/******_**_***M_v5", revision: None, tokenization_workers: None, dtype: None, pooling: None, max_concurrent_requests: 512, max_batch_tokens: 16384, max_batch_requests: None, max_client_batch_size: 32, auto_truncate: false, default_prompt_name: None, default_prompt: None, hf_api_token: None, hostname: "0.0.0.0", port: 8080, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: None, payload_limit: 2000000, api_key: None, json_output: false, otlp_endpoint: None, otlp_service_name: "text-embeddings-inference.server", cors_allow_origin: None }
2024-10-25T21:52:54.875192Z  INFO hf_hub: C:\Users\user\.cargo\registry\src\index.crates.io-6f17d22bba15001f\hf-hub-0.3.2\src\lib.rs:55: Token file not found "C:\\Users\\user\\.cache\\huggingface\\token"
2024-10-25T21:52:54.875404Z  INFO download_pool_config: text_embeddings_core::download: core\src\download.rs:38: Downloading `1_Pooling/config.json`
2024-10-25T21:52:54.875746Z  INFO download_new_st_config: text_embeddings_core::download: core\src\download.rs:62: Downloading `config_sentence_transformers.json`
2024-10-25T21:52:54.875919Z  INFO download_artifacts: text_embeddings_core::download: core\src\download.rs:21: Starting download
2024-10-25T21:52:54.876003Z  INFO download_artifacts: text_embeddings_core::download: core\src\download.rs:23: Downloading `config.json`
2024-10-25T21:52:54.876215Z  INFO download_artifacts: text_embeddings_core::download: core\src\download.rs:26: Downloading `tokenizer.json`
2024-10-25T21:52:54.876393Z  INFO download_artifacts: text_embeddings_backend: backends\src\lib.rs:328: Downloading `model.safetensors`
2024-10-25T21:52:54.876567Z  INFO download_artifacts: text_embeddings_core::download: core\src\download.rs:32: Model artifacts downloaded in 647.4µs
2024-10-25T21:52:54.886413Z  INFO text_embeddings_router: router\src/lib.rs:206: Maximum number of tokens per request: 512
2024-10-25T21:52:54.886730Z  INFO text_embeddings_core::tokenization: core\src\tokenization.rs:28: Starting 16 tokenization workers
2024-10-25T21:52:54.930092Z  INFO text_embeddings_router: router\src/lib.rs:248: Starting model backend
Error: Could not create backend

Caused by:
    Could not start backend: GTE is only supported on Cuda devices in fp16 with flash attention enabled

It's asking for very specific GPU resources, even though I'm trying to run on the CPU.
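
For reference, the Args dump above shows dtype: None; pinning it explicitly is one thing to try, though per the error above it should still fail the same way on CPU, since the check keys on the GTE architecture rather than on the dtype default. A sketch of that attempt (assuming the router's --dtype flag accepts float32):

text-embeddings-router --model-id dunzhang/stella_en_400M_v5 --dtype float32 --port 8080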

Expected behavior

I would expect the model to load and run on the CPU. :)

kozistr (Contributor) commented Oct 26, 2024

Hi @Astlaan, this may be related to #375. For now, GTE is only supported on CUDA devices in fp16; support for the CPU version still needs to be implemented.
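
For illustration, here is a minimal, self-contained Rust sketch of the kind of device/dtype guard that would produce the error above. This is hypothetical, not TEI's actual source; all names in it (Device, DType, load_gte) are assumptions.

// Hypothetical sketch, not TEI's actual source: the kind of guard a model
// constructor can apply before starting the backend. All names here
// (Device, DType, load_gte) are illustrative assumptions.
#[derive(PartialEq)]
enum Device { Cpu, Cuda }

#[derive(PartialEq)]
enum DType { F16, F32 }

// Reject any configuration the GTE kernels cannot serve, mirroring the
// message in the log above.
fn load_gte(device: Device, dtype: DType, flash_attention: bool) -> Result<(), String> {
    if device != Device::Cuda || dtype != DType::F16 || !flash_attention {
        return Err(
            "GTE is only supported on Cuda devices in fp16 with flash attention enabled".to_string(),
        );
    }
    Ok(())
}

fn main() {
    // A CPU-only launch fails this check immediately, before any inference is attempted.
    match load_gte(Device::Cpu, DType::F32, false) {
        Ok(()) => println!("backend started"),
        Err(e) => eprintln!("Error: Could not start backend: {e}"),
    }
}

The point of such a guard is to fail fast at backend startup, which matches the log above: the model artifacts download successfully, and the failure only occurs at "Starting model backend".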
