《Hyper-Compression: Model Compression via Hyperfunction》

[📄paper] [📍Github]

Hyper-compression uses a hyperfunction to represent the parameters of the target network. Notably, the hyperfunction is designed based on ergodic theory, which concerns whether a low-dimensional dynamical system can eventually fill a high-dimensional space.
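
As a piece of background intuition (not the paper's exact construction), Kronecker's classical theorem already exhibits such a space-filling, low-dimensional dynamic: if $1, \alpha_1, \dots, \alpha_d$ are linearly independent over the rationals, then the trajectory $$\theta \mapsto \left(\{\theta \alpha_1\}, \dots, \{\theta \alpha_d\}\right), \quad \theta = 1, 2, 3, \dots,$$ where $\{\cdot\}$ denotes the fractional part, is dense in the unit cube $[0,1)^d$. A single scalar $\theta$ can therefore index points arbitrarily close to any target vector in the cube, which is the sense in which a low-dimensional system can encode high-dimensional parameters.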

💡The proposed hyper-compression enjoys the following merits:

  • Preferable compression ratio

  • No post-hoc retraining

  • Affordable inference time

  • Short compression time

Contents

  • Abstract
  • How to Use
    • How to compress models
    • How to test compressed models
    • How to use compressed results for model inference
  • More Details of Experimental Results
    • Experimental Details
    • Supplementary Experimental Results
    • Parameter Sensitivity
  • Citation

Abstract

The rapid growth of large models' size has far outpaced that of GPU memory. To bridge this gap, inspired by the succinct relationship between genotype and phenotype, we turn the model compression problem into the problem of parameter representation, and propose the so-called hyper-compression. Hyper-compression uses a hyperfunction to represent the parameters of the target network; notably, the hyperfunction is designed based on ergodic theory, which concerns whether a low-dimensional dynamical system can eventually fill a high-dimensional space. Empirically, the proposed hyper-compression enjoys the following merits: 1) a preferable compression ratio; 2) no post-hoc retraining; 3) affordable inference time; and 4) short compression time. It compresses LLaMA2-7B in an hour and achieves close-to-int4-quantization performance, without retraining and with a performance drop of less than 1%. Our work has the potential to invigorate the field of model compression, towards a harmony between the scaling law and the stagnation of hardware upgrades.

How to Use

1. How to compress models:

We provide examples of compressing three models: UNet, MobileNetV3, and Sheared-LlaMA-1.3B. Our method can, of course, be applied to any model, since it compresses the model's parameters and is independent of the model architecture. The UNet and MobileNetV3 models, along with the project files, are included and can be run directly; the compression results are automatically saved in a newly created compressed_results folder. However, because the Sheared-LlaMA-1.3B model is large, we recommend downloading it directly from Hugging Face and placing all of its files in Hyper-Compression/Sheared-Llamma-hyper-compression/Sheared-LlaMA-1.3B/.
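
A minimal way to fetch the model files into that folder is the huggingface_hub library, as sketched below; the Hugging Face repository id (princeton-nlp/Sheared-LLaMA-1.3B) is an assumption and should be adjusted if you obtain the model from a different source:

    from huggingface_hub import snapshot_download

    # Download all files of Sheared-LLaMA-1.3B into the folder expected by this repo.
    # The repo id is an assumption; replace it if you use a different mirror.
    snapshot_download(
        repo_id="princeton-nlp/Sheared-LLaMA-1.3B",
        local_dir="Hyper-Compression/Sheared-Llamma-hyper-compression/Sheared-LlaMA-1.3B",
    )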

2. How to test compressed models:

In our paper, we conducted multiple tests on the three models—UNet, MobileNetV3, and Sheared-LlaMA-1.3B—to demonstrate that our compression method has minimal impact on model performance.

  • UNet is tested on the Carvana dataset using the Dice metric. You can run Hyper-Compression/UNet-hyper-compression/decode_eval.py to evaluate it.

  • MobileNetV3 is tested on the CIFAR-10 dataset using Accuracy. You can run Hyper-Compression/MobileNetV3-hyper-compression/decode_eval.py to evaluate it.

  • Sheared-LlaMA-1.3B is tested on the wikitext-2-raw-v1 dataset for Perplexity (PPL) and evaluated on eight downstream tasks: 0-shot accuracy on SciQ, WinoGrande, and ARC-E, 25-shot ARC-C, 10-shot HellaSwag, 32-shot BoolQ and NQ, and 5-shot MMLU, using the lm-evaluation-harness library. To run the eight downstream tasks, you first need to install lm-evaluation-harness:

    git clone https://github.com/EleutherAI/lm-evaluation-harness.git
    cd lm-evaluation-harness
    pip install -e .

    For example, to test 10-shot HellaSwag on Sheared-LlaMA-1.3B, we can use the following command:

    lm_eval --model hf --model_args pretrained=/root/autodl-tmp/LLM-models/Sheared-LLaMA-1.3B --tasks hellaswag --device cuda:0 --batch_size auto:4 --num_fewshot 10

    To test the lossy model compressed with our method, simply replace pytorch_model.bin in the Hyper-Compression/Sheared-Llamma-hyper-compression/Sheared-LLaMA-1.3B/ folder with pytorch_model_back.bin from Hyper-Compression/Sheared-Llamma-hyper-compression/compressed_result/ (when replacing, remember to rename the file to "pytorch_model.bin"). You can then run the same evaluation commands as above.
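
    For convenience, the replacement can also be scripted. The following is only a sketch that assumes the default relative paths of this repository:

    import shutil

    # Compressed checkpoint produced by our method.
    src = "Hyper-Compression/Sheared-Llamma-hyper-compression/compressed_result/pytorch_model_back.bin"
    # Overwrite the original checkpoint, using the file name expected by transformers.
    dst = "Hyper-Compression/Sheared-Llamma-hyper-compression/Sheared-LLaMA-1.3B/pytorch_model.bin"
    shutil.copyfile(src, dst)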

    At the same time, we can use the following code to test the model's Perplexity (PPL). The wikitext-2-raw-v1 dataset can be downloaded from Hugging Face:

    import torch
    import torch.nn as nn
    from transformers import AutoTokenizer, AutoModelForCausalLM
    import tqdm
    from datasets import load_dataset
    import argparse
    
    class Evaluator:
        def __init__(self, dataset, tokenizer, device, n_samples=40):
            self.dataset = dataset
            self.tokenizer = tokenizer
            self.device = device
    
            self.dataset = tokenizer(
                "\n\n".join(dataset['train']["text"]), return_tensors="pt"
            ).input_ids.to(device)
    
            self.n_samples = n_samples
    
        @torch.no_grad()
        def evaluate(self, model):
            model.eval()
            nlls = []
            n_samples = self.n_samples if self.n_samples else self.dataset.size(1) // 2048
            for i in tqdm.tqdm(range(n_samples), desc="Evaluating..."):
                batch = self.dataset[:, (i * 2048) : ((i + 1) * 2048)].to(model.device)
                with torch.no_grad():
                    lm_logits = model(batch).logits
                shift_logits = lm_logits[:, :-1, :].contiguous().float()
                shift_labels = self.dataset[:, (i * 2048) : ((i + 1) * 2048)][:, 1:]
                loss_fct = nn.CrossEntropyLoss()
                loss = loss_fct(
                    shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1)
                )
                neg_log_likelihood = loss.float() * 2048
                nlls.append(neg_log_likelihood)
    
            return torch.exp(torch.stack(nlls).sum() / (n_samples * 2048))
    
    
    
    # dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
    dataset = load_dataset('parquet', data_files='wiki2-test-00000-of-00001.parquet')
    
    tokenizer = AutoTokenizer.from_pretrained("Hyper-Compression/Sheared-Llamma-hyper-compression/Sheared-LLaMA-1.3B")
    model = AutoModelForCausalLM.from_pretrained("Hyper-Compression/Sheared-Llamma-hyper-compression/Sheared-LLaMA-1.3B").to('cuda')
    
    # Evaluate perplexity over 40 chunks of 2048 tokens each.
    evaluator = Evaluator(dataset, tokenizer, "cuda", n_samples=40)
    ppl = evaluator.evaluate(model)
    print(f"Perplexity: {ppl}")

3. How to use compressed results for model inference:

We provide an example only for UNet: run Hyper-Compression/UNet-hyper-compression/inference.py.

More Details of Experimental Results

1. Experimental Details

In this section, we describe the experimental configurations. All experiments are conducted on a single NVIDIA RTX 4090 (24 GB) GPU using the PyTorch framework.

  • Details of experiments related to UNet: We perform parameter compression on both UNet and the pruned UNet, and evaluate the resulting performance loss on the Carvana dataset.

    1. UNet: U-Net is a deep convolutional neural network architecture that has proven highly effective for image segmentation tasks, especially in biomedical imaging. The network adopts an encoder-decoder structure. The encoder progressively downsamples feature maps to capture context, while the decoder upsamples these feature maps and concatenates them with corresponding feature maps from the encoder via skip connections to precisely localize objects.

    2. Pruned UNet: Specifically, the pruned UNet refers to the UNet model whose parameters are removed using the Taylor-Rank Pruning with the FLOPs regularization (strength=0.001). This is essentially a channel pruning method. The bar chart below illustrates the percentage of channels pruned in each layer of the pruned UNet relative to the original UNet.

    3. Dataset: The Carvana Image Masking Challenge dataset is a large-scale dataset specifically designed for image segmentation tasks, particularly in the automotive industry. It consists of a collection of high-resolution images of cars, along with the corresponding pixel-level masks that delineate the precise boundaries of the vehicles. The Carvana dataset contains 5,088 images with label masks in the training set and images without masks in the test set. To evaluate model performance using the Dice metric, we randomly split 80% of the dataset as the training set and 20% as the validation set.

    4. Evaluation: The Dice coefficient, also known as the Dice similarity index, is a statistical measure used to gauge the similarity between two sets. It is particularly popular in image segmentation for assessing the accuracy of models by comparing the overlap between the predicted segmentation and the ground truth. Mathematically, the Dice coefficient is defined as $$\mathtt{Dice} = \frac{2|A \cap B|}{|A| + |B|},$$ where $A$ represents the predicted segmentation and $B$ is the ground truth. The Dice coefficient ranges from 0 to 1, where 1 indicates perfect agreement between the two sets and 0 indicates no overlap at all. A minimal sketch of this computation is given below.
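
    As an illustration (this is not the evaluation code in decode_eval.py), the Dice coefficient of two binary masks can be computed as follows:

    import torch

    def dice_coefficient(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> float:
        """Dice = 2|A ∩ B| / (|A| + |B|) for two binary masks of the same shape."""
        pred, target = pred.bool(), target.bool()
        intersection = (pred & target).sum().item()
        return (2.0 * intersection + eps) / (pred.sum().item() + target.sum().item() + eps)

    # Toy example: two 4x4 masks with 4 foreground pixels each, overlapping on 3.
    a = torch.zeros(4, 4); a[0, :2] = 1; a[1, :2] = 1
    b = torch.zeros(4, 4); b[0, :3] = 1; b[1, 0] = 1
    print(dice_coefficient(a, b))  # 2*3 / (4+4) = 0.75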

  • Details of experiments related to MobileNetV3: We apply our compression method to both MobileNetV3 and a pruned MobileNetV3 obtained with the automated gradual pruning (AGP) algorithm, and evaluate the top-1 accuracy on the CIFAR-10 dataset.

    1. MobileNetV3: MobileNetV3 is a state-of-the-art convolutional neural network architecture designed specifically for mobile devices. Building upon the success of its predecessors, MobileNetV3 introduces novel design principles and optimization techniques to achieve a more compelling balance between accuracy and efficiency.

    2. Pruned MobileNetV3: We employ the Automated Gradual Pruner (AGP) technique to prune MobileNetV3, ultimately reducing the model size by approximately 2.68 times. The central idea of Automated Gradual Pruner (AGP) is to prune the model's weights gradually over time, allowing the network to adapt to the loss of parameters through continuous retraining. This technique is particularly effective in minimizing the performance degradation associated with aggressive pruning strategies.

    3. Dataset: The CIFAR-10 dataset is a well-known benchmark in the field of machine learning and computer vision, consisting of 60,000 colored images belonging to 10 classes. The dataset is split into 50,000 training images and 10,000 testing images.
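
    For reference, the cubic sparsity schedule commonly used for AGP (Zhu & Gupta, 2017) can be sketched as follows; this is the generic schedule, not code taken from this repository:

    def agp_sparsity(step: int, s_initial: float, s_final: float,
                     start_step: int, pruning_steps: int) -> float:
        """Cubic AGP schedule: sparsity ramps from s_initial to s_final, then stays at s_final."""
        if step < start_step:
            return s_initial
        if step >= start_step + pruning_steps:
            return s_final
        progress = (step - start_step) / pruning_steps
        return s_final + (s_initial - s_final) * (1.0 - progress) ** 3

    # Example with assumed values: target 70% sparsity, reached over 1000 steps.
    for t in (0, 250, 500, 1000):
        print(t, round(agp_sparsity(t, s_initial=0.0, s_final=0.7, start_step=0, pruning_steps=1000), 4))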

  • Details of experiments related to LlaMA2-7B: We apply our compression method to LLaMA2-7B and to three performant pruned variants of LLaMA2-7B: Sheared-LLaMA-1.3B, TinyLLaMA, and LiteLLaMA. We then examine the performance drop of these models on various metrics, including 0-shot SciQ, WinoGrande, and ARC-E, 25-shot ARC-C, 10-shot HellaSwag, 32-shot BoolQ, 32-shot NQ, 5-shot MMLU, and Perplexity, evaluated with the $\texttt{lm-evaluation-harness}$ package.

    1. LlaMA2-7B: It is a state-of-the-art language model that has 7-billion parameters and was made publicly available by Meta AI. Employing the Transformer architecture, the model has been pre-trained on a massive text corpus, enabling it to exhibit superior performance on a multitude of natural language processing tasks, including text generation, question answering, machine translation, and so on.

    2. Other comparative models:

      • Sheared-LlaMA-1.3B: It is a variant of LlaMA2-7B that uses a novel pruning algorithm: targeted structured pruning which can reduce a source model to a specified target architecture defined by the configurations of the existing pre-trained models. Additionally, this algorithm also introduces a dynamic batch loading strategy that proportionally loads training data from each domain based on its rate of loss reduction, thereby optimizing data utilization and facilitating overall performance improvements.
      • TinyLlaMA: TinyLLaMA is a compact and efficient variant of the LLaMA2-7B model. It retains the same architecture and tokenizer as LLaMA2-7B but scales down to a Transformer decoder-only model with 1.1 billion parameters. This model is trained on up to 3 trillion tokens, with a specific focus on the behavior of smaller models when exposed to a substantially larger quantity of training tokens than those predicted by scaling laws.
      • LiteLlaMA: LiteLlaMA is another variant of LlaMA2-7B with 460M parameters which is trained over 1T tokens on data from the RedPajama dataset.
    3. Evaluation: In this study, most evaluation metrics for large language models (LLMs) are computed with the open-source $\texttt{lm-evaluation-harness}$ package, a versatile Python library for benchmarking the ability of LLMs across various tasks and datasets. The "($i$)" after the name of an evaluation metric means "$i$-shot".

      • SciQ: SciQ, a dataset of 13.7k multiple choice science exam questions, is a benchmark dataset specifically designed to evaluate a language model's ability to understand and reason about scientific questions. By presenting the model with carefully crafted questions that require a deep understanding of scientific concepts and principles, SciQ assesses a model's capacity to perform complex reasoning tasks.
      • WinoGrande: Winogrande, a collection of 44k WSC-inspired problems, is a benchmark dataset specifically designed to evaluate a language model's ability to understand and reason about common sense knowledge. By presenting the model with carefully constructed sentences requiring the selection of the correct noun to complete a given context, Winogrande assesses a model's capacity to grasp nuanced semantic and pragmatic relationships.
      • ARC-E & ARC-C (25): ARC (AI2 Reasoning Challenge), including totally 7,787 natural, grade-school science questions, is a question set, text corpus, and baselines assembled to encourage AI research in advanced question answering, which is partitioned into a challenge set (ARC-C) and an easy set (ARC-E).
      • HellaSwag (10): HellaSwag, which stands for "Harder Endings, Longer contexts, and Low-shot Activities for Situations With Adversarial Generations," is designed to test commonsense natural language inference (NLI) in the context of physical situations.
      • BoolQ (32): BoolQ is a reading comprehension dataset of naturally occurring yes/no questions which are challenging and require a wide range of inference abilities to solve.
      • NQ (32): Natural Questions (NQ), including 307,372 training examples and 7,830 examples for development, contains real user questions issued to Google search, and answers found from Wikipedia by annotators, designed for the training and evaluation of automatic question answering systems.
      • MMLU (5): Massive Multitask Language Understanding (MMLU) benchmark is designed to evaluate a language model's ability to generalize across a wide range of tasks and domains. By assessing performance on a diverse set of 57 tasks, including elementary mathematics, US history, computer science, law and more, MMLU provides a comprehensive evaluation of a model's extensive world knowledge and problem solving ability.
      • Perplexity: Perplexity (PPL) measures how well a probabilistic model predicts a sample; a model with lower perplexity predicts the samples better. The precise definition we use is given after the model table below.
    4. More detailed information on all the models used for comparative testing:

      | Model | Parameter Precision | Number of Parameters | File Size |
      |---|---|---|---|
      | UNet | FP32 | 13,395,329 | 52.4 MB |
      | Pruned-UNet | FP32 | 5,326,021 | 20.3 MB |
      | MobileNetV3 | FP32 | 1,528,106 | 5.95 MB |
      | Pruned-MobileNetV3 | FP32 | 556,442 | 2.22 MB |
      | LlaMA2-7B | FP16 | 6,738,415,616 | 12.50 GB |
      | Sheared-LlaMA-1.3B | FP32 | 1,345,423,360 | 5.01 GB |
      | TinyLlaMA | FP32 | 1,100,048,384 | 4.09 GB |
      | LiteLlaMA | FP16 | 461,685,760 | 0.86 GB |
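
    For completeness, the perplexity reported below follows the standard definition over a tokenized corpus $x_1, \dots, x_N$: $$\mathtt{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p\left(x_i \mid x_{<i}\right)\right),$$ which is essentially what the evaluation script above computes by averaging the token-level negative log-likelihood over 2048-token chunks and exponentiating the result.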

2. Supplementary Experimental Results

In this section, we present supplementary experimental results that are not included in the main body of the paper due to page limits.

The table below compares our method with other compressed variants of LlaMA2-7B on eight downstream tasks: 0-shot accuracy on SciQ, WinoGrande, and ARC-E, 25-shot ARC-C, 10-shot HellaSwag, 32-shot BoolQ and NQ, and 5-shot MMLU. "Average" denotes the average score over these downstream tasks. As a reference, the compression ratio of INT4 quantization is 4×.

| Model | File Size (GB) | SciQ | WinoGrande | ARC-E | ARC-C | HellaSwag | BoolQ | NQ | MMLU | Average (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| LlaMA2-7B | 12.50 | 94.00 | 68.98 | 76.30 | 52.39 | 78.94 | 81.90 | 28.67 | 45.86 | 65.88 |
| Sheared-LlaMA-1.3B | 5.01 (2.50×) | 87.30 | 58.09 | 60.98 | 34.04 | 61.02 | 65.54 | 9.89 | 25.59 | 50.31 |
| TinyLlaMA | 4.09 (3.06×) | 89.30 | 59.43 | 61.66 | 37.12 | 62.48 | 62.91 | 12.52 | 26.79 | 51.53 |
| LiteLlaMA | 0.86 (14.53×) | 75.10 | 52.64 | 47.85 | 24.32 | 38.41 | 57.09 | 1.77 | 26.11 | 40.41 |
| LlaMA2-7B + HF | 4.80 (2.60×) | 93.70 | 69.77 | 75.59 | 53.24 | 77.17 | 80.92 | 25.65 | 43.04 | 64.89 |
| Sheared-LlaMA-1.3B + HF | 0.98 (12.76×) | 87.90 | 58.88 | 60.48 | 32.85 | 60.56 | 63.76 | 8.98 | 24.68 | 49.76 |
| TinyLlaMA + HF | 0.78 (16.03×) | 89.50 | 58.96 | 61.11 | 36.43 | 61.92 | 58.41 | 11.58 | 27.28 | 50.65 |
| LiteLlaMA + HF | 0.39 (32.05×) | 73.00 | 54.22 | 44.78 | 23.89 | 37.56 | 56.57 | 1.16 | 26.61 | 39.72 |

The table below compares Perplexity on the wikitext-2-raw-v1 dataset.

| Model | Compression Ratio | Perplexity |
|---|---|---|
| LlaMA2-7B | - | 5.47 |
| LlaMA2-7B + W8A8 SmoothQuant (int8, $\alpha$=0.85) | 2.00× | 5.52 |
| LlaMA2-7B + Asymmetric Quant (int8) | 2.00× | 5.65 |
| LlaMA2-7B + GPTQ | 3.46× | 5.69 |
| LlaMA2-7B + Asymmetric Quant (int4) | 4.00× | 26160.34 |
| LlaMA2-7B + HF | 2.60× | 5.82 |
| Sheared-LlaMA-1.3B | 2.50× | 8.13 |
| Sheared-LlaMA-1.3B + HF | 12.76× | 8.37 |
| TinyLlaMA | 3.06× | 7.71 |
| TinyLlaMA + HF | 16.03× | 7.95 |
| LiteLlaMA | 14.53× | 31.82 |
| LiteLlaMA + HF | 32.05× | 37.85 |

A. Compression Performance:

Here, we show the compression performance of LLaMA2-7B and seven different variants based on LlaMA2-7B. First, we test eight important downstream tasks. Directly compressing LLaMA2-7B with our method (LLaMA2-7B + HF) achieves a compression ratio of 2.6$\times$ with a drop of only 0.99 points in the average score. When our method is combined with the other LLaMA2-7B variants (Sheared-LlaMA-1.3B, TinyLlaMA, and LiteLlaMA), even higher compression ratios can be reached, while the decrease in the average score relative to the pruned models remains within 1 point.

Moreover, Perplexity (PPL) is an important metric for evaluating the overall performance of large language models (LLMs). We conduct tests on the publicly available wikitext-2-raw-v1 dataset. The PPL of LLaMA2-7B + HF increases by only 0.35 compared to the original LLaMA2-7B, whereas the PPL increases caused by the other three LLaMA2-7B variants (Sheared-LlaMA-1.3B, TinyLlaMA, LiteLlaMA) are 2.66, 2.24, and 26.35, respectively. This demonstrates that our method not only improves the compression ratio but also effectively preserves the model's overall performance. We also compare the perplexity with four quantization methods: W8A8 SmoothQuant (int8, $\alpha$=0.85), Asymmetric Quant (int8), GPTQ (int4), and Asymmetric Quant (int4). Among them, both W8A8 SmoothQuant (int8, $\alpha$=0.85) and GPTQ (int4) require calibration sets during the quantization process, whereas our method involves no calibration.

The figures below present two examples comparing the text outputs generated by eight different LLMs for a given prompt. Overall, the experimental data demonstrate that our method achieves a preferable compression ratio while having minimal impact on the model's performance across various downstream tasks, and that it integrates effectively with other compression methods such as pruning.

B. Compression Time:

As shown in the table below, we record how long it takes to compress eight models with our method. The smaller models (UNet, Pruned-UNet, MobileNetV3, and Pruned-MobileNetV3) are compressed in at most about 35 seconds, with Pruned-MobileNetV3 requiring only 8.83 seconds. The compression time for large language models is longer: compressing LLaMA2-7B takes 53 minutes, while LiteLLaMA is compressed in just 6 minutes. In general, the smaller the model, the faster the compression.

| Model | Compression Time (s) |
|---|---|
| UNet + HF | 30.00 |
| Pruned-UNet + HF | 34.60 |
| MobileNetV3 + HF | 11.63 |
| Pruned-MobileNetV3 + HF | 8.83 |

| Model | Compression Time (min) |
|---|---|
| LlaMA2-7B + HF | 53 |
| Sheared-LlaMA-1.3B + HF | 7 |
| TinyLlaMA + HF | 11 |
| LiteLlaMA + HF | 6 |

As shown in the table below, we conduct ablation experiments on UNet and MobileNetV3 for the two acceleration techniques used in our method, the k-d tree and matrix operations, to demonstrate their effectiveness in enhancing computational speed.

| Model | K-D Tree | Matrix Operations | Compression Time (min) |
|---|---|---|---|
| UNet | × | × | 231.6 |
| UNet | × | ✓ | 176.9 |
| UNet | ✓ | × | 54.8 |
| UNet | ✓ | ✓ | 0.5 |
| MobileNetV3 | × | × | 52.29 |
| MobileNetV3 | × | ✓ | 3.13 |
| MobileNetV3 | ✓ | × | 45.50 |
| MobileNetV3 | ✓ | ✓ | 0.19 |

Furthermore, we show that the concurrent application of k-d trees and matrix operations leads to a substantial enhancement in computational efficiency.
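
To give a sense of how the k-d tree helps, the sketch below contrasts a brute-force nearest-neighbor scan with a k-d tree query over a set of candidate points; it is an illustrative toy with assumed sizes, not code from this repository:

    import numpy as np
    from scipy.spatial import cKDTree
    from scipy.spatial.distance import cdist

    rng = np.random.default_rng(0)
    candidates = rng.random((2048, 2))   # candidate points to match against (assumed size)
    groups = rng.random((10_000, 2))     # parameter groups to be matched (assumed size)

    # Brute force: compute all pairwise distances, then take the argmin per group.
    idx_brute = cdist(groups, candidates).argmin(axis=1)

    # k-d tree: build once over the candidates, then query every group at once.
    tree = cKDTree(candidates)
    _, idx_tree = tree.query(groups)

    assert np.array_equal(idx_brute, idx_tree)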

C. Inference Time:

We test the inference time of the original model ("Inference") and the execution time of our compressed model ("Our Method") under different batch sizes on NVIDIA GeForce RTX 4060 and NVIDIA A40 graphics cards.

| Batch Size | RTX 4060: Inference (s) | RTX 4060: Our Method (s) | A40: Inference (s) | A40: Our Method (s) |
|---|---|---|---|---|
| 1 | 4.35 | 4.43 | 1.50 | 2.06 |
| 2 | 3.95 | 4.49 | 1.44 | 2.05 |
| 4 | 3.63 | 4.77 | 1.45 | 2.22 |
| 8 | 3.67 | 7.94 | 1.44 | 2.46 |

When performing the timing test, the model is first initialized and loaded onto the GPU. The start time is then recorded, and the model evaluation function is executed 10 times in a loop. After each iteration, the CUDA cache is manually cleared and Python's garbage collector is triggered to ensure that memory resources are fully released. After the loop finishes, the end time is recorded and the average time per evaluation is calculated.
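
A minimal sketch of this timing protocol is given below; evaluate_model stands in for the actual evaluation function and is only a placeholder:

    import gc
    import time
    import torch

    def timed_evaluation(evaluate_model, n_runs: int = 10) -> float:
        """Average wall-clock time of evaluate_model over n_runs runs, clearing the
        CUDA cache and running garbage collection after each run."""
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(n_runs):
            evaluate_model()            # placeholder: one evaluation pass
            torch.cuda.empty_cache()    # manually clear the CUDA cache
            gc.collect()                # trigger Python garbage collection
        torch.cuda.synchronize()
        end = time.time()
        return (end - start) / n_runs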

The Carvana dataset, a car image segmentation dataset widely used in segmentation benchmarks, is used in the model evaluation function. Each evaluation uses only one batch of data. We notice that as the batch size increases, the inference time of our compressed model gets closer to that of the original model, which aligns with our expectations: with a larger batch size, each layer's forward pass takes longer, which leaves enough time for the weights of later layers to be decompressed before they are needed.

3. Parameter Sensitivity

In this section, we use UNet as an example to conduct a parameter sensitivity test on the three aforementioned hyperparameters of our compression algorithm: $M$, $U$, and $l$, where $M$ is the maximum number of categories that can be set for all layers of the model, $U$ is the list of the numbers of sample nodes in the center square, and $l$ is the side length of the center square. As shown in the table below, different values of $M$, $U$, and $l$ only moderately impact the model's performance and compression ratio, which indicates that our algorithm is robust to its hyperparameters.

Citation

If you find this work useful, please cite it in your publications:

@article{fan2024hyper,
  title={Hyper-Compression: Model Compression via Hyperfunction},
  author={Fan, Fenglei and Fan, Juntong and Wang, Dayang and Zhang, Jingbo and Dong, Zelin and Zhang, Shijun and Wang, Ge and Zeng, Tieyong},
  journal={arXiv preprint arXiv:2409.00592},
  year={2024}
}
