This repository is the official implementation of the ACL 2024 Findings paper *Pruning Large Language Models to Intra-module Low-rank Architecture with Transitional Activations*.
(Figures: the pruning architecture; latency on a Xiaomi 14 mobile phone.)
- Prepare the environment following `transact.dockerfile` (a Docker build sketch follows this list).
- Create symlinks to your data, models, outputs, and HuggingFace cache (mainly for large datasets):

  ```bash
  ln -s /path/to/data data
  ln -s /path/to/models models
  ln -s /path/to/outputs outputs
  ln -s /path/to/hf-cache hf-cache
  ```

- Tweak the training configs `train_config.yaml` and `deepspeed.json` (a config sketch also follows this list).
- Run the training script `run_trainer.sh`, for example:

  ```bash
  bash run_trainer.sh -m all \
      -a 768 -f 1536 \
      -n 128 -k 8 -p acts \
      -l 4096 -t 50 \
      -g 64 -b 4 \
      -d togethercomputer/RedPajama-Data-1T \
      -x llama -y llama2 -z 7B
  ```

  Run `bash run_trainer.sh -h` for help.

- Run the evaluation script `eval.sh`, for example:

  ```bash
  bash eval.sh -m all \
      -a 768 -f 1536 \
      -n 128 -k 8 -p acts \
      -l 4096 -t 50 \
      -d togethercomputer/RedPajama-Data-1T \
      -x llama -y llama2 -z 7B
  ```

  Run `bash eval.sh -h` for help. A wrapper combining the two scripts is sketched below.
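As a sketch of the environment step, `transact.dockerfile` can be built and entered with the standard Docker CLI. The image tag and the `/workspace` mount target below are illustrative placeholders, not names from the repository:

```bash
# Build the training image from the repository's dockerfile.
docker build -f transact.dockerfile -t transact:latest .

# Start an interactive container with GPU access; the mount target
# /workspace and the image tag are assumptions for illustration.
docker run --gpus all -it \
    -v "$(pwd)":/workspace -w /workspace \
    transact:latest bash
```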
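For the config step, here is a minimal `deepspeed.json` sketch. The repository's actual config is not reproduced here; every field value is an assumption, chosen to mirror the `-g 64 -b 4` flags in the training example under the guess that they set gradient accumulation and per-GPU micro-batch size:

```bash
# Write a minimal ZeRO stage-2 DeepSpeed config. All values are
# illustrative assumptions, not the repository's shipped deepspeed.json.
cat > deepspeed.json <<'EOF'
{
  "train_micro_batch_size_per_gpu": 4,
  "gradient_accumulation_steps": 64,
  "bf16": { "enabled": true },
  "zero_optimization": { "stage": 2 }
}
EOF
```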
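Since the `run_trainer.sh` and `eval.sh` examples above share most of their flags, one way to keep the two invocations in sync is a small wrapper. This is a convenience sketch, not a script shipped with the repository:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Flags shared by the training and evaluation examples above.
COMMON=(-m all -a 768 -f 1536 -n 128 -k 8 -p acts -l 4096 -t 50
        -d togethercomputer/RedPajama-Data-1T -x llama -y llama2 -z 7B)

# Train (the -g/-b flags appear only in the training example),
# then evaluate with the same settings.
bash run_trainer.sh "${COMMON[@]}" -g 64 -b 4
bash eval.sh "${COMMON[@]}"
```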
Please cite the paper if you find this repository useful.
```bibtex
@inproceedings{shen-etal-2024-pruning,
    title = "Pruning Large Language Models to Intra-module Low-rank Architecture with Transitional Activations",
    author = "Shen, Bowen and
      Lin, Zheng and
      Zha, Daren and
      Liu, Wei and
      Luan, Jian and
      Wang, Bin and
      Wang, Weiping",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2024",
    year = "2024",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-acl.582",
    doi = "10.18653/v1/2024.findings-acl.582",
    pages = "9781--9793",
}
```