Hyper-compression uses a hyperfunction to represent the parameters of the target network. Notably, the hyperfunction is designed based on ergodic theory, which concerns the question of whether a low-dimensional dynamical system can eventually fill a high-dimensional space.
- Preferable compression ratio
- No post-hoc retraining
- Affordable inference time
- Short compression time
- Abstract
- How to Use
  - How to compress models
  - How to test compressed models
  - How to use compressed results for model inference
- More Details of Experimental Results
  - Experimental Details
  - Supplementary Experimental Results
  - Parameter Sensitivity
- Citation
The rapid growth of large models' size has far outpaced the growth of GPU memory. To bridge this gap, inspired by the succinct relationship between genotype and phenotype, we turn the model compression problem into a parameter representation problem and propose the so-called hyper-compression. Hyper-compression uses a hyperfunction to represent the parameters of the target network; notably, the hyperfunction is designed based on ergodic theory, which concerns the question of whether a low-dimensional dynamical system can eventually fill a high-dimensional space. Empirically, the proposed hyper-compression enjoys the following merits: 1) a preferable compression ratio; 2) no post-hoc retraining; 3) affordable inference time; and 4) short compression time. It compresses LLaMA2-7B within an hour and achieves close-to-int4-quantization performance, without retraining and with a performance drop of less than 1%. Our work has the potential to invigorate the field of model compression, towards a harmony between the scaling law and the stagnation of hardware upgrades.
We provide examples of compressing three models: UNet, MobileNetV3, and Sheared-LLaMA-1.3B. Of course, our method can be applied to any model, as it essentially compresses the model's parameters and is independent of the model structure. The UNet and MobileNetV3 models, along with the project files, are provided and can be run directly; the compression results are automatically saved in a newly created `compressed_results` folder. However, due to the large size of the Sheared-LLaMA-1.3B model, we recommend downloading it directly from Hugging Face and placing all model files in `Hyper-Compression/Sheared-Llamma-hyper-compression/Sheared-LlaMA-1.3B/`.
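One convenient way to fetch the model files is with the `huggingface_hub` package (recent versions provide `snapshot_download` with a `local_dir` argument). The snippet below is only a sketch: the repository id `princeton-nlp/Sheared-LLaMA-1.3B` and the local folder are assumptions, so adjust them to your setup.

```python
from huggingface_hub import snapshot_download

# Download all files of the model into the folder expected by this project.
snapshot_download(
    repo_id="princeton-nlp/Sheared-LLaMA-1.3B",  # assumed Hugging Face repo id
    local_dir="Hyper-Compression/Sheared-Llamma-hyper-compression/Sheared-LlaMA-1.3B",
)
```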
In our paper, we conducted multiple tests on the three models (UNet, MobileNetV3, and Sheared-LLaMA-1.3B) to demonstrate that our compression method has minimal impact on model performance.
- UNet is tested on the Carvana dataset using the Dice metric. You can run `Hyper-Compression/UNet-hyper-compression/decode_eval.py` to evaluate it.
- MobileNetV3 is tested on the CIFAR-10 dataset using accuracy. You can run `Hyper-Compression/MobileNetV3-hyper-compression/decode_eval.py` to evaluate it.
- Sheared-LLaMA-1.3B is tested on the wikitext-2-raw-v1 dataset for Perplexity (PPL) and evaluated on eight downstream tasks: 0-shot accuracy on SciQ, WinoGrande, and ARC-E, 25-shot ARC-C, 10-shot HellaSwag, 32-shot BoolQ, 32-shot NQ, and 5-shot MMLU, using the `lm-evaluation-harness` GitHub library. To run the eight downstream tasks, you first need to install `lm-evaluation-harness`:

```bash
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
pip install -e .
```
For example, to test 10-shot HellaSwag on Sheared-LLaMA-1.3B, we can use the following command:

```bash
lm_eval --model hf --model_args pretrained=/root/autodl-tmp/LLM-models/Sheared-LLaMA-1.3B --tasks hellaswag --device cuda:0 --batch_size auto:4 --num_fewshot 10
```
When we want to test the lossy model compressed with our method, we only need to replace `pytorch_model.bin` in the `Hyper-Compression/Sheared-Llamma-hyper-compression/Sheared-LLaMA-1.3B/` folder with `pytorch_model_back.bin` from `Hyper-Compression/Sheared-Llamma-hyper-compression/compressed_result/` (when replacing, remember to rename the file to `pytorch_model.bin`; a small helper script for this swap is shown after the perplexity code below). Then the test can be performed in the same way.

We can also use the following code to test the model's Perplexity (PPL). The wikitext-2-raw-v1 dataset can be downloaded from Hugging Face:
```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModelForCausalLM
import tqdm
from datasets import load_dataset


class Evaluator:
    def __init__(self, dataset, tokenizer, device, n_samples=40):
        self.dataset = dataset
        self.tokenizer = tokenizer
        self.device = device
        # Tokenize the whole test split into one long sequence of token ids.
        self.dataset = tokenizer(
            "\n\n".join(dataset["train"]["text"]), return_tensors="pt"
        ).input_ids.to(device)
        self.n_samples = n_samples

    @torch.no_grad()
    def evaluate(self, model):
        model.eval()
        nlls = []
        n_samples = self.n_samples if self.n_samples else self.dataset.size(1) // 2048
        for i in tqdm.tqdm(range(n_samples), desc="Evaluating..."):
            # Evaluate on consecutive 2048-token windows.
            batch = self.dataset[:, (i * 2048) : ((i + 1) * 2048)].to(model.device)
            with torch.no_grad():
                lm_logits = model(batch).logits
            shift_logits = lm_logits[:, :-1, :].contiguous().float()
            shift_labels = self.dataset[:, (i * 2048) : ((i + 1) * 2048)][:, 1:]
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(
                shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1)
            )
            neg_log_likelihood = loss.float() * 2048
            nlls.append(neg_log_likelihood)
        # Perplexity = exp(average negative log-likelihood per token).
        return torch.exp(torch.stack(nlls).sum() / (n_samples * 2048))


# dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
dataset = load_dataset("parquet", data_files="wiki2-test-00000-of-00001.parquet")
tokenizer = AutoTokenizer.from_pretrained(
    "Hyper-Compression/Sheared-Llamma-hyper-compression/Sheared-LLaMA-1.3B"
)
model = AutoModelForCausalLM.from_pretrained(
    "Hyper-Compression/Sheared-Llamma-hyper-compression/Sheared-LLaMA-1.3B"
).to("cuda")
evaluator = Evaluator(dataset, tokenizer, "cuda", n_samples=40)
ppl = evaluator.evaluate(model)
print(f"Perplexity: {ppl}")
```
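If you prefer to script the model-file swap described above instead of doing it by hand, here is a minimal sketch. The paths follow the repository layout mentioned above, and the backup file name is only a suggestion; adjust both to your checkout.

```python
import shutil

model_dir = "Hyper-Compression/Sheared-Llamma-hyper-compression/Sheared-LLaMA-1.3B"
compressed = "Hyper-Compression/Sheared-Llamma-hyper-compression/compressed_result/pytorch_model_back.bin"

# Keep a backup of the original weights (backup name chosen for illustration),
# then place the decompressed weights under the file name `from_pretrained` expects.
shutil.move(f"{model_dir}/pytorch_model.bin", f"{model_dir}/pytorch_model_original.bin")
shutil.copyfile(compressed, f"{model_dir}/pytorch_model.bin")
```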
We only provide the inference example for UNet; you can run `Hyper-Compression/UNet-hyper-compression/inference.py`.
In this section, we introduce the details of our experimental configurations. All experiments are conducted on a single NVIDIA RTX 4090 (24 GB) GPU using the PyTorch framework.
- Details of experiments related to UNet: We perform parameter compression on both UNet and the pruned UNet, and test the performance loss of the resulting models on the Carvana dataset.
  - UNet: U-Net is a deep convolutional neural network architecture that has proven highly effective for image segmentation tasks, especially in biomedical imaging. The network adopts an encoder-decoder structure: the encoder progressively downsamples feature maps to capture context, while the decoder upsamples these feature maps and concatenates them with the corresponding encoder feature maps via skip connections to precisely localize objects.
  - Pruned UNet: The pruned UNet refers to the UNet model whose parameters are removed using Taylor-Rank Pruning with FLOPs regularization (strength = 0.001), which is essentially a channel pruning method. The bar chart below illustrates the percentage of channels pruned in each layer of the pruned UNet relative to the original UNet.
  - Dataset: The Carvana Image Masking Challenge dataset is a large-scale dataset specifically designed for image segmentation tasks, particularly in the automotive industry. It consists of high-resolution images of cars along with pixel-level masks that delineate the precise boundaries of the vehicles. The Carvana dataset contains 5,088 images with label masks in the training set and images without masks in the test set. To evaluate model performance using the Dice metric, we randomly split 80% of the dataset into a training set and the remaining 20% into a validation set.
  - Evaluation: The Dice coefficient, also known as the Dice similarity index, is a statistical measure used to gauge the similarity between two sets. It is particularly popular in image segmentation for assessing the accuracy of models by comparing the overlap between the predicted segmentation and the ground truth (see the sketch after this list). Mathematically, the Dice coefficient is defined as

    $$\mathrm{Dice} = \frac{2|A \cap B|}{|A| + |B|},$$

    where $A$ represents the predicted segmentation and $B$ is the ground truth. The Dice coefficient ranges from 0 to 1, where 1 indicates perfect agreement between the two sets and 0 indicates no overlap at all.
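As a reference point, here is a minimal PyTorch sketch of the Dice coefficient for binary masks. It is illustrative only, not the repository's evaluation code.

```python
import torch


def dice_coefficient(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Dice = 2|A ∩ B| / (|A| + |B|) for two binary masks of the same shape."""
    pred = pred.float().flatten()
    target = target.float().flatten()
    intersection = (pred * target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)
```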
- Details of experiments related to MobileNetV3: We apply our compression method to both MobileNetV3 and the pruned MobileNetV3 obtained with the automated gradual pruning algorithm, and evaluate the top-1 accuracy on the CIFAR-10 dataset (see the sketch after this list).
  - MobileNetV3: MobileNetV3 is a state-of-the-art convolutional neural network architecture designed specifically for mobile devices. Building upon the success of its predecessors, MobileNetV3 introduces novel design principles and optimization techniques to achieve a more compelling balance between accuracy and efficiency.
  - Pruned MobileNetV3: We employ the Automated Gradual Pruner (AGP) technique to prune MobileNetV3, ultimately reducing the model size by approximately 2.68 times. The central idea of AGP is to prune the model's weights gradually over time, allowing the network to adapt to the loss of parameters through continuous retraining. This technique is particularly effective in minimizing the performance degradation associated with aggressive pruning strategies.
  - Dataset: The CIFAR-10 dataset is a well-known benchmark in machine learning and computer vision, consisting of 60,000 colored images belonging to 10 classes. The dataset is split into 50,000 training images and 10,000 testing images.
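For completeness, a minimal sketch of a top-1 accuracy evaluation on the CIFAR-10 test split. This is illustrative only: `model` stands for the original or decompressed MobileNetV3, and the preprocessing should match whatever was used for training.

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms


@torch.no_grad()
def top1_accuracy(model: torch.nn.Module, device: str = "cuda") -> float:
    """Top-1 accuracy of `model` on the CIFAR-10 test set."""
    tf = transforms.Compose([
        transforms.ToTensor(),
        # Commonly used CIFAR-10 channel statistics; adjust to your training pipeline.
        transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
    ])
    test_set = datasets.CIFAR10(root="data", train=False, download=True, transform=tf)
    loader = DataLoader(test_set, batch_size=256, shuffle=False)
    model.eval().to(device)
    correct, total = 0, 0
    for images, labels in loader:
        predictions = model(images.to(device)).argmax(dim=1).cpu()
        correct += (predictions == labels).sum().item()
        total += labels.numel()
    return correct / total
```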
- Details of experiments related to LLaMA2-7B: We apply our compression method to LLaMA2-7B and to three performant pruned variants of it: Sheared-LLaMA-1.3B, TinyLLaMA, and LiteLLaMA. We then examine the performance drop of these models on various metrics, including SciQ, WinoGrande, ARC-E, 25-shot ARC-C, 10-shot HellaSwag, 32-shot BoolQ, 32-shot NQ, 5-shot MMLU, and Perplexity, using the package `lm-evaluation-harness`.
  - LLaMA2-7B: A state-of-the-art language model with 7 billion parameters, made publicly available by Meta AI. Employing the Transformer architecture, the model has been pre-trained on a massive text corpus, enabling it to exhibit superior performance on a multitude of natural language processing tasks, including text generation, question answering, machine translation, and so on.
  - Other comparative models:
    - Sheared-LLaMA-1.3B: A variant of LLaMA2-7B obtained with a novel pruning algorithm, targeted structured pruning, which reduces a source model to a specified target architecture defined by the configurations of existing pre-trained models. The algorithm also introduces a dynamic batch loading strategy that proportionally loads training data from each domain based on its rate of loss reduction, thereby optimizing data utilization and facilitating overall performance improvements.
    - TinyLLaMA: A compact and efficient variant of the LLaMA2-7B model. It retains the same architecture and tokenizer as LLaMA2-7B but scales down to a decoder-only Transformer with 1.1 billion parameters. The model is trained on up to 3 trillion tokens, with a specific focus on the behavior of smaller models when exposed to a substantially larger quantity of training tokens than predicted by scaling laws.
    - LiteLLaMA: Another variant of LLaMA2-7B, with 460M parameters, trained over 1T tokens on data from the RedPajama dataset.
  - Evaluation: In this study, most evaluation metrics for large language models (LLMs) are computed with the open-source package `lm-evaluation-harness`, a versatile Python package for benchmarking the ability of LLMs across various tasks and datasets. The "($i$)" after the name of an evaluation metric means "$i$-shot".
    - SciQ: SciQ, a dataset of 13.7k multiple-choice science exam questions, is a benchmark specifically designed to evaluate a language model's ability to understand and reason about scientific questions. By presenting the model with carefully crafted questions that require a deep understanding of scientific concepts and principles, SciQ assesses a model's capacity to perform complex reasoning tasks.
    - WinoGrande: WinoGrande, a collection of 44k WSC-inspired problems, is a benchmark dataset specifically designed to evaluate a language model's ability to understand and reason about commonsense knowledge. By presenting the model with carefully constructed sentences requiring the selection of the correct noun to complete a given context, WinoGrande assesses a model's capacity to grasp nuanced semantic and pragmatic relationships.
    - ARC-E & ARC-C (25): ARC (AI2 Reasoning Challenge), comprising 7,787 natural, grade-school science questions, is a question set, text corpus, and set of baselines assembled to encourage AI research in advanced question answering. It is partitioned into a challenge set (ARC-C) and an easy set (ARC-E).
    - HellaSwag (10): HellaSwag, which stands for "Harder Endings, Longer contexts, and Low-shot Activities for Situations With Adversarial Generations," is designed to test commonsense natural language inference (NLI) in the context of physical situations.
    - BoolQ (32): BoolQ is a reading comprehension dataset of naturally occurring yes/no questions, which are challenging and require a wide range of inference abilities to solve.
    - NQ (32): Natural Questions (NQ), with 307,372 training examples and 7,830 development examples, contains real user questions issued to Google Search and answers found from Wikipedia by annotators. It is designed for the training and evaluation of automatic question answering systems.
    - MMLU (5): The Massive Multitask Language Understanding (MMLU) benchmark is designed to evaluate a language model's ability to generalize across a wide range of tasks and domains. By assessing performance on a diverse set of 57 tasks, including elementary mathematics, US history, computer science, law, and more, MMLU provides a comprehensive evaluation of a model's extensive world knowledge and problem-solving ability.
    - Perplexity: Perplexity (PPL) measures the quality of a probabilistic model in predicting samples; a model with lower perplexity predicts the samples better (see the definition below).
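For reference, the perplexity reported here follows the standard definition, which matches what the evaluation script earlier in this README computes: for a tokenized sequence $x_1, \dots, x_N$ and model $p_\theta$,

$$\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p_\theta\left(x_i \mid x_{<i}\right)\right),$$

i.e., the exponentiated average negative log-likelihood per token.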
- More detailed information on all the models used for comparative testing:

Model | Parameter Precision | Number of Parameters | File Size |
---|---|---|---|
UNet | FP32 | 13,395,329 | 52.4MB |
Pruned-UNet | FP32 | 5,326,021 | 20.3MB |
MobileNetV3 | FP32 | 1,528,106 | 5.95MB |
Pruned-MobileNetV3 | FP32 | 556,442 | 2.22MB |
LlaMA2-7B | FP16 | 6,738,415,616 | 12.50GB |
Sheared-LlaMA-1.3B | FP32 | 1,345,423,360 | 5.01GB |
TinyLlaMA | FP32 | 1,100,048,384 | 4.09GB |
LiteLlaMA | FP16 | 461,685,760 | 0.86GB |
In this section, we present supplementary experimental results that were not included in the main body due to page limits.

The table below compares our method with other compressed variants of LlaMA2-7B on eight downstream tasks: 0-shot accuracy on SciQ, WinoGrande, and ARC-E, 25-shot ARC-C, 10-shot HellaSwag, 32-shot BoolQ, 32-shot NQ, and 5-shot MMLU. "Average" denotes the average score over these downstream tasks. As a reference, the compression ratio of INT4 quantization is 4×.
Model | File Size (GB) | SciQ | WinoGrande | ARC-E | ARC-C | HellaSwag | BoolQ | NQ | MMLU | Average (%) |
---|---|---|---|---|---|---|---|---|---|---|
LlaMA2-7B | 12.50 | 94.00 | 68.98 | 76.30 | 52.39 | 78.94 | 81.90 | 28.67 | 45.86 | 65.88 |
Sheared-LlaMA-1.3B | 5.01 (2.50×) | 87.30 | 58.09 | 60.98 | 34.04 | 61.02 | 65.54 | 9.89 | 25.59 | 50.31 |
TinyLlaMA | 4.09 (3.06×) | 89.30 | 59.43 | 61.66 | 37.12 | 62.48 | 62.91 | 12.52 | 26.79 | 51.53 |
LiteLlaMA | 0.86 (14.53×) | 75.10 | 52.64 | 47.85 | 24.32 | 38.41 | 57.09 | 1.77 | 26.11 | 40.41 |
LlaMA2-7B + HF | 4.80 (2.60×) | 93.70 | 69.77 | 75.59 | 53.24 | 77.17 | 80.92 | 25.65 | 43.04 | 64.89 |
Sheared-LlaMA-1.3B + HF | 0.98 (12.76×) | 87.90 | 58.88 | 60.48 | 32.85 | 60.56 | 63.76 | 8.98 | 24.68 | 49.76 |
TinyLlaMA + HF | 0.78 (16.03×) | 89.50 | 58.96 | 61.11 | 36.43 | 61.92 | 58.41 | 11.58 | 27.28 | 50.65 |
LiteLlaMA + HF | 0.39 (32.05×) | 73.00 | 54.22 | 44.78 | 23.89 | 37.56 | 56.57 | 1.16 | 26.61 | 39.72 |
The table below compares the perplexity on the wikitext-2-raw-v1 dataset.
Model | Compression Ratio | Perplexity |
---|---|---|
LlaMA2-7B | - | 5.47 |
LlaMA2-7B + W8A8 SmoothQuant (int8) | 2.00× | 5.52 |
LlaMA2-7B + Asymmetric Quant (int8) | 2.00× | 5.65 |
LlaMA2-7B + GPTQ | 3.46× | 5.69 |
LlaMA2-7B + Asymmetric Quant (int4) | 4.00× | 26160.34 |
LlaMA2-7B + HF | 2.60× | 5.82 |
Sheared-LlaMA-1.3B | 2.50× | 8.13 |
Sheared-LlaMA-1.3B + HF | 12.76× | 8.37 |
TinyLlaMA | 3.06× | 7.71 |
TinyLlaMA + HF | 16.03× | 7.95 |
LiteLlaMA | 14.53× | 31.82 |
LiteLlaMA + HF | 32.05× | 37.85 |
Here, we show the compression performance of LLaMA2-7B and seven different variants based on LLaMA2-7B. First, we test eight important downstream tasks. Directly compressing LLaMA2-7B with our method (LLaMA2-7B + HF) achieves a compression ratio of 2.60× with only a 0.99% decrease in the average score. When our method is combined with the other LLaMA2-7B variants (Sheared-LLaMA-1.3B, TinyLLaMA, and LiteLLaMA), higher compression ratios can be achieved, with the decrease in the average score remaining within 1% compared to the pruned models. Moreover, Perplexity (PPL) is an important metric for evaluating the overall performance of large language models (LLMs). We conduct tests on the publicly available wikitext-2-raw-v1 dataset. The PPL of LLaMA2-7B + HF increases by only 0.35 compared to the original LLaMA2-7B model, whereas the PPL increases caused by the other three LLaMA2-7B variants (Sheared-LLaMA-1.3B, TinyLLaMA, LiteLLaMA) are 2.66, 2.24, and 26.35, respectively. This demonstrates that our method not only delivers a high compression ratio but also effectively maintains the model's overall performance. We also compare the perplexity with four quantization methods, as listed in the table above: W8A8 SmoothQuant (int8), asymmetric quantization (int8), GPTQ, and asymmetric quantization (int4).
As shown in the table below, we record how long it takes to compress eight models with our method. Compressing smaller models such as UNet and MobileNetV3 takes at most about 35 seconds, with Pruned-MobileNetV3 requiring only 8.83 seconds. The compression time for large language models is longer: compressing LLaMA2-7B takes 53 minutes, while LiteLLaMA is compressed in just 6 minutes. In general, the smaller the model, the faster the compression.
Model | Compression Time (s) |
---|---|
UNet + HF | 30.00 |
Pruned-UNet + HF | 34.60 |
MobileNetV3 + HF | 11.63 |
Pruned-MobileNetV3 + HF | 8.83 |

Model | Compression Time (min) |
---|---|
LlaMA2-7B + HF | 53 |
Sheared-LlaMA-1.3B + HF | 7 |
TinyLlaMA + HF | 11 |
LiteLlaMA + HF | 6 |
As shown in the table below, we conduct ablation experiments on UNet and MobileNetV3 for the two acceleration techniques (k-d trees and matrix operations) to demonstrate their effectiveness in enhancing computational speed.
Model | K-D Tree | Matrix Operations | Compression Time (min) |
---|---|---|---|
UNet | × | × | 231.6 |
UNet | √ | × | 176.9 |
UNet | × | √ | 54.8 |
UNet | √ | √ | 0.5 |
MobileNetV3 | × | × | 52.29 |
MobileNetV3 | √ | × | 3.13 |
MobileNetV3 | × | √ | 45.50 |
MobileNetV3 | √ | √ | 0.19 |
Furthermore, we show that the concurrent application of k-d trees and matrix operations leads to a substantial enhancement in computational efficiency.
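To give intuition for these two accelerations, here is a toy sketch (not the repository's implementation; the codebook size, group dimension, and data are made up) showing how a k-d tree answers all nearest-neighbor queries for parameter groups in one batched, vectorized call instead of a Python-level loop:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
codebook = rng.random((65536, 2))   # hypothetical candidate points, e.g. samples of a low-dimensional trajectory
groups = rng.random((100_000, 2))   # hypothetical 2-D parameter groups to be matched

# Build the tree once, then answer every nearest-neighbor query in a single
# batched ("matrix-style") call rather than looping over groups in Python.
tree = cKDTree(codebook)
distances, indices = tree.query(groups, k=1)

# indices[i] is the codebook entry closest to groups[i]; in a scheme like this,
# one would store the integer index per group instead of the original floats.
print(indices[:5], distances[:5])
```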
We test the inference time and the execution time of our method under different batch sizes on NVIDIA GeForce RTX 4060 and NVIDIA A40 graphics cards, respectively.
Batch Size | RTX 4060: Inference (s) | RTX 4060: Our Method (s) | A40: Inference (s) | A40: Our Method (s) |
---|---|---|---|---|
1 | 4.35 | 4.43 | 1.50 | 2.06 |
2 | 3.95 | 4.49 | 1.44 | 2.05 |
4 | 3.63 | 4.77 | 1.45 | 2.22 |
8 | 3.67 | 7.94 | 1.44 | 2.46 |
When performing the time test, the model is first initialized and loaded onto the GPU. Then the start time is recorded, and the model evaluation function is executed 10 times in a loop. After each iteration, the CUDA cache is manually cleared and Python's garbage collection is triggered to ensure that memory resources are fully released. After the loop finishes, the end time is recorded, and the average time per evaluation is calculated.

The Carvana dataset is used in the model evaluation function. This dataset targets car image segmentation tasks and is widely used in image segmentation benchmarks. Each time the model is evaluated, only one batch of data is used. We notice that as the batch size increases, the inference time of our compressed model gets closer to that of the original model, which aligns with our expectations: with a larger batch size, inference in each layer takes longer, which allows the weights of later layers to be decompressed in time.
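A minimal sketch of the timing protocol described above (illustrative only; `evaluate` stands in for the model evaluation function that runs one Carvana batch):

```python
import gc
import time

import torch


def average_eval_time(evaluate, n_runs: int = 10) -> float:
    """Run `evaluate` n_runs times and return the average wall-clock time per run,
    clearing the CUDA cache and forcing garbage collection after each run."""
    start = time.time()
    for _ in range(n_runs):
        evaluate()                 # one evaluation on a single batch
        torch.cuda.empty_cache()   # release cached GPU memory
        gc.collect()               # trigger Python garbage collection
    return (time.time() - start) / n_runs
```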
In this section, we use UNet as an example to conduct a parameter sensitivity test on three aforementioned hyperparameters:
If you want to cite this paper, please use the following BibTeX entry:
@article{fan2024hyper,
title={Hyper-Compression: Model Compression via Hyperfunction},
author={Fan, Fenglei and Fan, Juntong and Wang, Dayang and Zhang, Jingbo and Dong, Zelin and Zhang, Shijun and Wang, Ge and Zeng, Tieyong},
journal={arXiv preprint arXiv:2409.00592},
year={2024}
}