Add: Support for Sparse24Bitmask Compressed Models #12097
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Signed-off-by: Rahul Tuli <[email protected]>

Force-pushed from ab892d2 to 02ff821
Add a test file with an 8B 2of4 compressed model for lm_eval_harness in buildkite
@@ -481,6 +495,19 @@ def supports_cutlass_24(

    return weight_quant.num_bits == input_quant.num_bits == 8

def _get_model_compression_config(
seems like an unnecessary function break out
assert all(
    partition_size % 8 == 0
    for partition_size in output_partition_sizes
), "All partitions must be divisible by 8 for 2:4 compressed models"
maybe "for a 2:4 sparse compressed model"?
shape = BitMaskShapeParameter(data=torch.empty(
    2 * len(output_partition_sizes), 1, dtype=torch.uint64),
    weight_loader=weight_loader)
compressed = ModelWeightParameter(data=torch.empty(
nit: parameter name
new_tensor = tensor.view(-1, 4)
zero_counts = (new_tensor == 0).sum(dim=1)
return (zero_counts >= 2).all().item()

def _decompress_bitmask_compressed_weight(
docstring
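A self-contained version of the 2:4 validity check excerpted above, with a docstring, might look roughly like this (the function name is an assumption, since the excerpt only shows the body):

import torch

def _is_valid_24_sparsity(tensor: torch.Tensor) -> bool:
    """Check whether ``tensor`` follows the 2:4 structured-sparsity pattern.

    The tensor is viewed as groups of 4 contiguous elements; the pattern
    holds if every group contains at least 2 zeros.
    """
    groups = tensor.view(-1, 4)
    zero_counts = (groups == 0).sum(dim=1)
    return bool((zero_counts >= 2).all().item())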
    for partition_size in output_partition_sizes
), "All partitions must be divisible by 8 for 2:4 compressed models"

shape = BitMaskShapeParameter(data=torch.empty(
We don't need to shard the shape
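For reference, the shape parameter in question is only a tiny per-partition metadata tensor (two entries per output partition, assumed here to be the rows/cols of each partition's compressed weight), which is why sharding it along the output dimension is unnecessary. A rough sketch under that assumption:

import torch

# Rough sketch of the shape metadata being discussed; the meaning of the two
# entries per partition (assumed here to be rows/cols of the compressed
# weight) is not shown in the excerpt.
output_partition_sizes = [4096, 4096, 1024]   # example values
shape = torch.empty(2 * len(output_partition_sizes), 1, dtype=torch.uint64)
print(shape.shape)  # torch.Size([6, 1]) -- tiny, per-partition metadata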
This PR adds support for models compressed using Sparse24BitMaskCompressor to use the CUTLASS 2:4 kernels, introducing a BitMaskShapeParameter for the bitmask shape metadata.
This diff was manually tested on the following checkpoints (a minimal loading sketch follows the list):
nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-chnl_wts_per_tok_dyn_act_fp8-BitM
nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-chnl_wts_per_tok_dyn_act_int8-BitM
nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-tensor_wts_tensor_act_fp8-BitM
nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-tensor_wts_tensor_act_int8-BitM
nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-tensor_wts_per_tok_dyn_act_fp8-BitM
nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-tensor_wts_per_tok_dyn_act_int8-BitM
nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-chnl_wts_tensor_act_fp8-BitM
nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-chnl_wts_tensor_act_int8-BitM
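As a quick sanity check, any of the checkpoints above should load through the standard vLLM entry point; a minimal sketch (model name taken from the list above, assuming the checkpoint's compressed-tensors config is enough for vLLM to select the 2:4 path without extra flags):

from vllm import LLM, SamplingParams

# Sketch: load one of the 2:4 bitmask-compressed checkpoints listed above.
# Assumes the compressed-tensors config in the checkpoint lets vLLM pick the
# CUTLASS 2:4 path automatically, as this PR intends.
llm = LLM(model="nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-chnl_wts_per_tok_dyn_act_fp8-BitM")
outputs = llm.generate(["A quick sanity-check prompt:"],
                       SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)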
Also added unit tests for the compressed cases!!
Needs the following compressed-tensors PR to land:
Notion Doc: https://www.notion.so/SparseBitMask-24-work-15e863ebf65c80dcbc70e6317d552987