
Add marlin int4 kernel #315

Closed
wants to merge 9 commits into from

Conversation

@dacorvo (Collaborator) commented Sep 20, 2024

What does this PR do?

This adds a modified Marlin fp16/int4 kernel to the library and creates two new QTensor subclasses to use it:

  • MarlinInt4PackedTensor,
  • MarlinInt4WeightQBitsTensor.
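
As background on what a packed int4 tensor stores, the sketch below packs two unsigned 4-bit values per byte. This is only a naive illustration; the actual MarlinInt4PackedTensor uses a kernel-specific interleaved layout, not this scheme.

```python
import numpy as np

def pack_int4(q):
    """Pack unsigned int4 values (0..15) into uint8, two per byte.

    Illustrative only: the real Marlin packing interleaves values to
    match the kernel's memory access pattern.
    """
    q = np.asarray(q, dtype=np.uint8)
    assert q.size % 2 == 0 and q.max() < 16
    lo, hi = q[0::2], q[1::2]
    return lo | (hi << 4)

def unpack_int4(packed):
    """Inverse of pack_int4: recover the original int4 values."""
    packed = np.asarray(packed, dtype=np.uint8)
    out = np.empty(packed.size * 2, dtype=np.uint8)
    out[0::2] = packed & 0x0F
    out[1::2] = packed >> 4
    return out
```

The packed representation halves the memory footprint of the weights; the subclasses above wrap such storage behind the regular tensor interface.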

The AWQ kernel is still used by default because initial tests suggest the modified Marlin kernel either has accuracy issues or is not properly integrated (perplexity increases).

Note: during the integration, I tried to register the Marlin fp16/int4 gemm as a torch.library.custom_op, but it added extra latency (up to 50%), so I kept the legacy declaration (with define/impl).

@dacorvo dacorvo requested a review from SunMarc September 20, 2024 15:35
@dacorvo dacorvo force-pushed the add_marlin_int4_kernel branch 4 times, most recently from ffd984a to 5e5adbe on September 25, 2024 16:28
@dacorvo dacorvo marked this pull request as draft September 25, 2024 16:33
@dacorvo dacorvo force-pushed the add_marlin_int4_kernel branch from 5e5adbe to 19ee33e on September 25, 2024 19:55
dacorvo and others added 9 commits September 26, 2024 15:16
Original fix in vLLM project:

The crash was caused by inline PTX assembly that issued the async_copy with streaming behavior. The fix is to use the more standard PTX for async_copy, without the fractional L2 "evict_first" cache policy. There is no performance difference between the standard async_copy PTX and the previous version.
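
The two PTX variants described above can be sketched as device helpers. This is illustrative; the registers and surrounding pipeline code in the actual Marlin kernel differ.

```cuda
#include <cstdint>

// Previous version: async copy with a fractional L2 "evict_first"
// cache policy, which triggered the crash.
__device__ inline void cp_async_evict_first(void *smem, const void *gmem) {
    uint32_t smem_addr =
        static_cast<uint32_t>(__cvta_generic_to_shared(smem));
    asm volatile(
        "{\n"
        "  .reg .b64 p;\n"
        "  createpolicy.fractional.L2::evict_first.b64 p, 1.0;\n"
        "  cp.async.cg.shared.global.L2::cache_hint [%0], [%1], 16, p;\n"
        "}\n" ::"r"(smem_addr), "l"(gmem));
}

// Fixed version: standard 16-byte async copy, no L2 cache policy.
__device__ inline void cp_async_standard(void *smem, const void *gmem) {
    uint32_t smem_addr =
        static_cast<uint32_t>(__cvta_generic_to_shared(smem));
    asm volatile("cp.async.cg.shared.global [%0], [%1], 16;\n" ::"r"(
                     smem_addr),
                 "l"(gmem));
}
```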
This is to guarantee that the Marlin kernel's output matches the output obtained using dequantized weights.
@dacorvo dacorvo force-pushed the add_marlin_int4_kernel branch from 19ee33e to 5564b8f on September 26, 2024 15:23
@dacorvo (Collaborator, Author) commented Sep 27, 2024

Closing this as the modified kernel is definitely flawed: as soon as it processes more than 32 inputs (i.e. two blocks of 16), errors appear in the outputs starting from the 128th output feature. This points to a flaw in the weight/scale/zero-point readback as parallelization increases.
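
The kind of parity check that exposed this is sketched below, using numpy instead of the library's torch tensors and hypothetical helper names. It builds the dequantized-weight reference that a kernel output would be compared against.

```python
import numpy as np

def quantize_int4(w, group_size=128):
    """Symmetric per-group int4 quantization (hypothetical helper; the
    library's actual quantizer also handles zero-points)."""
    out_f, in_f = w.shape
    g = w.reshape(out_f, in_f // group_size, group_size)
    scales = np.abs(g).max(axis=-1, keepdims=True) / 7.0
    q = np.clip(np.round(g / scales), -8, 7).astype(np.int8)
    return q.reshape(out_f, in_f), scales.squeeze(-1)

def dequantize_int4(q, scales, group_size=128):
    out_f, in_f = q.shape
    g = q.reshape(out_f, in_f // group_size, group_size).astype(np.float32)
    return (g * scales[..., None]).reshape(out_f, in_f)

# Reference output: matmul against dequantized weights. A kernel parity
# test would assert np.allclose(kernel_out, reference, atol=...) for
# batch sizes above 32 and output features beyond 128, where the
# modified kernel diverged.
rng = np.random.default_rng(0)
x = rng.standard_normal((33, 256), dtype=np.float32)  # > 32 inputs
w = rng.standard_normal((256, 256), dtype=np.float32)
q, s = quantize_int4(w)
reference = x @ dequantize_int4(q, s).T
```

Sweeping the batch size across the 32-input boundary and inspecting which output features diverge is what localized the failure to the readback path.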

@dacorvo dacorvo closed this Sep 27, 2024