
More Efficient Layer Distribution for DeepSeek Coder v2 on Multiple GPUs and CPUs #49

Open
BGFGB opened this issue Aug 21, 2024 · 4 comments


BGFGB commented Aug 21, 2024

Hi, I'm currently trying to run DeepSeek Coder v2 on each of the following single-node setups:

Node 1: Two A6000 GPUs (48GB each) and 192GB of RAM
Node 2: Two 4090 GPUs (24GB each) and 64GB of RAM

At present, with the default configuration, the model only fully utilizes a single GPU with 24GB of VRAM. Alternatively, it can be split across two GPUs, but only using around 12GB on each, which seems suboptimal given the available resources. Wouldn’t it be more efficient if I could fully utilize more GPUs?

I can modify the configuration to allocate more layers to the GPUs, but this has been a trial-and-error process. Is there a more systematic approach or calculation method that could help guide me in allocating layers more efficiently across the available GPUs? Are there any recommended strategies for balancing the model layers on GPUs with different VRAM capacities?

Any guidance on how to better utilize GPU resources for faster inference would be greatly appreciated.

Thanks!


cts2021 commented Aug 22, 2024

I've run into the same situation and will be following this closely.


ELigoP commented Aug 22, 2024

This is covered in #46; the tutorial is https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/injection_tutorial.md

For now you need to write a custom .yaml optimization rule for your case.


Azure-Tang commented Aug 22, 2024

Are you looking for detailed guidance on how to write a YAML configuration that maximizes GPU utilization? We will consider it.

Until we publish a detailed tutorial, a practical starting point is to assess the tensor sizes in your GGUF files. For instance, by examining the DeepseekV2 configuration, you can determine the shapes and data types of the tensors and estimate the VRAM they require.

Note that there are two things you have to pay attention to when calculating:

1. If you are using KExpertsTorch or KLinearTorch as your backend, the weights will be dequantized to the model's default dtype, which is bf16 for DeepseekV2.
2. If your backend is Marlin, the weights will be repacked to Q4 (you can also use Q8 by setting kwargs).
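
As a rough, hedged illustration of that tensor-size check (not part of ktransformers itself), here is a minimal Python sketch using the gguf package that ships with llama.cpp. The file path is hypothetical, and the GGUFReader attribute names should be verified against your installed gguf version.

# Hedged sketch: estimate per-layer VRAM from a GGUF file, assuming bf16 dequantization.
# Assumes the `gguf` Python package (pip install gguf); attribute names used below
# (name, n_elements, n_bytes) come from its GGUFReader and may differ across versions.
from collections import defaultdict
import re
from gguf import GGUFReader

BF16_BYTES = 2  # KExpertsTorch / KLinearTorch dequantize to the model dtype (bf16 for DeepseekV2)

reader = GGUFReader("DeepSeek-Coder-V2-Instruct-Q4_K_M.gguf")  # hypothetical path

per_layer = defaultdict(int)
for t in reader.tensors:
    m = re.match(r"blk\.(\d+)\.", t.name)              # decoder-layer tensors are named blk.<i>.*
    key = int(m.group(1)) if m else -1                 # -1 collects embeddings, norm, output head
    per_layer[key] += int(t.n_elements) * BF16_BYTES   # size if dequantized to bf16
    # For a Marlin (Q4) backend, t.n_bytes of a Q4 tensor is closer to the real footprint.

for layer, size in sorted(per_layer.items()):
    print(f"layer {layer:>3}: {size / 2**30:.2f} GiB (bf16)")
print(f"total: {sum(per_layer.values()) / 2**30:.2f} GiB")

Summing these per-layer sizes against each card's free VRAM gives a starting point for deciding how many layers to place on each GPU before fine-tuning by trial.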


sammcj commented Aug 29, 2024

This may not be quite right, but I'm thinking your config could look something like this:

  • Layer Distribution:
    • GPU 0 (cuda:0): Layers 0-29 (30 layers)
    • GPU 1 (cuda:1): Layers 30-59 (30 layers) + model.norm and lm_head
  • The model is evenly split between the two GPUs, each handling half of the layers. This should provide good balance and parallelism; a short sketch after this list shows how such a split could be computed when the GPUs' VRAM differs.
  • The transfer_map in the ^model$ match is set to transfer at layer 30, which is the midpoint of the model.
  • Both GPUs are utilized for generation and prefill operations, maximizing the use of available hardware.
  • The embedding tokens are kept on the CPU to save GPU memory.
  • Expert parallelism is maintained, with experts being computed on the CPU and then transferred to the respective GPU.
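
The even 30/30 split above fits two identical cards. For the original question about GPUs with different VRAM capacities, here is a small hedged sketch (plain Python, no ktransformers APIs) that splits DeepSeek Coder V2's 60 decoder layers in proportion to each GPU's usable VRAM; the 4 GiB headroom and the uniform per-layer cost are simplifying assumptions, to be refined with the per-layer estimate from the GGUF tensor sizes discussed earlier.

# Hedged sketch: split decoder layers across GPUs in proportion to usable VRAM.
# The reserve value and the assumption that every layer costs the same are simplifications.
def split_layers(num_layers, vram_gib_per_gpu, reserve_gib=4.0):
    """Return [(start, end_exclusive), ...] layer ranges, one per GPU."""
    usable = [max(v - reserve_gib, 0.0) for v in vram_gib_per_gpu]  # leave room for KV cache etc.
    total = sum(usable)
    counts, assigned = [], 0
    for i, u in enumerate(usable):
        last = (i == len(usable) - 1)
        c = num_layers - assigned if last else round(num_layers * u / total)  # last GPU takes the remainder
        counts.append(c)
        assigned += c
    ranges, start = [], 0
    for c in counts:
        ranges.append((start, start + c))
        start += c
    return ranges

# Example: a hypothetical mixed pair, one 48 GB A6000 and one 24 GB 4090.
print(split_layers(60, [48, 24]))  # -> [(0, 41), (41, 60)]

The resulting boundary would then drive both the layer regexes (which indices go to cuda:0 vs cuda:1) and the transfer_map entry in the ^model$ match below.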

Node 1: Two A6000 GPUs (48GB each) and 192GB of RAM:

- match:
    name: "^model.embed_tokens"
  replace:
    class: "default"
    kwargs:
        generate_device: "cuda:0"
        prefill_device: "cuda:0"

- match:
    name: "^model\\.layers\\.([0-2][0-9])\\."
    class: ktransformers.models.modeling_deepseek.DeepseekV2YarnRotaryEmbedding
  replace:
    class: ktransformers.operators.RoPE.YarnRotaryEmbedding
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
- match:
    name: "^model\\.layers\\.([3-5][0-9])\\."
    class: ktransformers.models.modeling_deepseek.DeepseekV2YarnRotaryEmbedding
  replace:
    class: ktransformers.operators.RoPE.YarnRotaryEmbedding
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"

- match:
    name: "^model\\.layers\\.([0-2][0-9])\\.(?!self_attn).*$"
    class: torch.nn.Linear
  replace:
    class: ktransformers.operators.linear.KTransformersLinear
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
      generate_op: "KLinearMarlin"
      prefill_op: "KLinearTorch"
- match:
    name: "^model\\.layers\\.([3-5][0-9])\\.(?!self_attn).*$"
    class: torch.nn.Linear
  replace:
    class: ktransformers.operators.linear.KTransformersLinear
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"
      generate_op: "KLinearMarlin"
      prefill_op: "KLinearTorch"

- match:
    name: "^model\\.layers\\.([0-2][0-9])\\.mlp$"
    class: ktransformers.models.modeling_deepseek.DeepseekV2MoE
  replace:
    class: ktransformers.operators.experts.KDeepseekV2MoE
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
- match:
    name: "^model\\.layers\\.([3-5][0-9])\\.mlp$"
    class: ktransformers.models.modeling_deepseek.DeepseekV2MoE
  replace:
    class: ktransformers.operators.experts.KDeepseekV2MoE
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"

- match:
    name: "^model\\.layers\\.([0-2][0-9])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      prefill_device: "cuda:0"
      prefill_op: "KExpertsTorch"
      generate_device: "cuda:0"
      generate_op:  "KExpertsTorch"
      out_device: "cuda:0"
  recursive: False
- match:
    name: "^model\\.layers\\.([3-5][0-9])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      prefill_device: "cuda:1"
      prefill_op: "KExpertsTorch"
      generate_device: "cuda:1"
      generate_op:  "KExpertsTorch"
      out_device: "cuda:1"
  recursive: False

- match:
    name: "^model\\.layers\\.([0-2][0-9])\\.self_attn$"
  replace:
    class: ktransformers.operators.attention.KDeepseekV2Attention
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
- match:
    name: "^model\\.layers\\.([3-5][0-9])\\.self_attn$"
  replace:
    class: ktransformers.operators.attention.KDeepseekV2Attention
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"

- match:
    name: "^model$"
  replace:
    class: "ktransformers.operators.models.KDeepseekV2Model"
    kwargs:
      per_layer_prefill_intput_threshold: 0
      transfer_map: 
        30: "cuda:1"

- match:
    name: "^model\\.layers\\.([0-2][0-9])\\."
  replace:
    class: "default"
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
- match:
    name: "(^model\\.layers\\.([3-5][0-9])\\.)|(model.norm)|(lm_head)"
  replace:
    class: "default"
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"

Node 2: Two 4090 GPUs (24GB each) and 64GB of RAM:

- match:
    name: "^model.embed_tokens"
  replace:
    class: "default"
    kwargs:
        generate_device: "cpu"
        prefill_device: "cpu"

- match:
    name: "^model\\.layers\\.([0-2][0-9])\\."
    class: ktransformers.models.modeling_deepseek.DeepseekV2YarnRotaryEmbedding
  replace:
    class: ktransformers.operators.RoPE.YarnRotaryEmbedding
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
- match:
    name: "^model\\.layers\\.([3-5][0-9])\\."
    class: ktransformers.models.modeling_deepseek.DeepseekV2YarnRotaryEmbedding
  replace:
    class: ktransformers.operators.RoPE.YarnRotaryEmbedding
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"

- match:
    name: "^model\\.layers\\.([0-2][0-9])\\.(?!self_attn).*$"
    class: torch.nn.Linear
  replace:
    class: ktransformers.operators.linear.KTransformersLinear
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
      generate_op: "KLinearMarlin"
      prefill_op: "KLinearTorch"
- match:
    name: "^model\\.layers\\.([3-5][0-9])\\.(?!self_attn).*$"
    class: torch.nn.Linear
  replace:
    class: ktransformers.operators.linear.KTransformersLinear
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"
      generate_op: "KLinearMarlin"
      prefill_op: "KLinearTorch"

- match:
    name: "^model\\.layers\\.([0-2][0-9])\\.mlp$"
    class: ktransformers.models.modeling_deepseek.DeepseekV2MoE
  replace:
    class: ktransformers.operators.experts.KDeepseekV2MoE
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
- match:
    name: "^model\\.layers\\.([3-5][0-9])\\.mlp$"
    class: ktransformers.models.modeling_deepseek.DeepseekV2MoE
  replace:
    class: ktransformers.operators.experts.KDeepseekV2MoE
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"

- match:
    name: "^model\\.layers\\.([0-2][0-9])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      prefill_device: "cuda:0"
      prefill_op: "KExpertsTorch"
      generate_device: "cpu"
      generate_op:  "KExpertsCPU"
      out_device: "cuda:0"
  recursive: False
- match:
    name: "^model\\.layers\\.([3-5][0-9])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      prefill_device: "cuda:1"
      prefill_op: "KExpertsTorch"
      generate_device: "cpu"
      generate_op:  "KExpertsCPU"
      out_device: "cuda:1"
  recursive: False

- match:
    name: "^model\\.layers\\.([0-2][0-9])\\.self_attn$"
  replace:
    class: ktransformers.operators.attention.KDeepseekV2Attention
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
- match:
    name: "^model\\.layers\\.([3-5][0-9])\\.self_attn$"
  replace:
    class: ktransformers.operators.attention.KDeepseekV2Attention
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"

- match:
    name: "^model$"
  replace:
    class: "ktransformers.operators.models.KDeepseekV2Model"
    kwargs:
      per_layer_prefill_intput_threshold: 0
      transfer_map: 
        30: "cuda:1"

- match:
    name: "^model\\.layers\\.([0-2][0-9])\\."
  replace:
    class: "default"
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
- match:
    name: "(^model\\.layers\\.([3-5][0-9])\\.)|(^model.norm)|(^lm_head)"
  replace:
    class: "default"
    kwargs:
      generate_device: "cuda:1"
      prefill_device: "cuda:1"

To keep the embedding tokens on the GPUs, I think you'd modify the configuration for the model.embed_tokens match. Instead of assigning it to the CPU, assign it to one of the GPUs, typically the first one (cuda:0). Here's how we can modify that part:

- match:
    name: "^model.embed_tokens"
  replace:
    class: "default"
    kwargs:
        generate_device: "cuda:0"
        prefill_device: "cuda:0"
