To make it easy for everyone to reproduce our experimental results, we will release the evaluation code. We use mainstream open-source evaluation tasks (MMLU, CMMLU, C-Eval, etc.) to measure the performance of our model. We adopt OpenCompass as the main evaluation framework and made adaptive modifications on top of it.
- Environment Setup
```shell
conda create --name benchmark_env python=3.10 pytorch torchvision pytorch-cuda -c nvidia -c pytorch -y
conda activate benchmark_env
git clone llm_benchmark_repo_url llm_benchmark
cd llm_benchmark
pip install -e .
```
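Before moving on, it can help to confirm that the environment resolves PyTorch and sees the GPU (an optional sanity check, not part of the original setup):

```shell
# Optional: verify PyTorch installed correctly and CUDA is visible.
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```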
- Data Download
Please download the datasets manually from the URLs below and place them in the directories shown afterwards.
needlebench: https://github.com/open-compass/opencompass/releases/download/0.2.4.rc1/OpenCompassData-complete-20240325.zip
LongBench: https://huggingface.co/datasets/THUDM/LongBench/tree/main
LEval: https://huggingface.co/datasets/L4NLP/LEval/tree/main
The datasets should be placed in the following directory layout:
```text
data/
├── LongBench/
│   ├── LongBench.py
│   ├── README.md
│   └── data/
├── LEval/
│   ├── LEval.py
│   ├── README.md
│   ├── test_data.ipynb
│   └── LEval/
│       ├── Exam/
│       └── Generation/
└── needlebench/
    ├── PaulGrahamEssays.jsonl
    ├── multi_needle_reasoning_en.json
    ├── multi_needle_reasoning_zh.json
    ├── names.json
    ├── needles.jsonl
    ├── zh_finance.jsonl
    ├── zh_game.jsonl
    ├── zh_general.jsonl
    ├── zh_government.jsonl
    ├── zh_movie.jsonl
    └── zh_tech.jsonl
```
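The manual downloads above can also be scripted. The commands below are a sketch under two assumptions: the OpenCompass data archive unpacks the needlebench files into `data/`, and `huggingface-cli` is available (`pip install -U huggingface_hub`). Verify the result against the tree above.

```shell
mkdir -p data

# needlebench files are bundled in the OpenCompass data archive (assumed to unpack into data/).
wget https://github.com/open-compass/opencompass/releases/download/0.2.4.rc1/OpenCompassData-complete-20240325.zip
unzip OpenCompassData-complete-20240325.zip

# LongBench and LEval from the Hugging Face Hub.
huggingface-cli download THUDM/LongBench --repo-type dataset --local-dir data/LongBench
huggingface-cli download L4NLP/LEval --repo-type dataset --local-dir data/LEval
```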
- Evaluation

Run with python commands:
```shell
# mmlu_gen ceval_gen cmmlu_gen hellaswag_gen gsm8k_gen humaneval_gen
LLM_WORKER_MULTIPROC_METHOD=spawn CUDA_VISIBLE_DEVICES=0 python run.py \
    --datasets mmlu_gen ceval_gen cmmlu_gen hellaswag_gen gsm8k_gen humaneval_gen \
    --hf-path your_model_path/model_name \
    --model-kwargs device_map='auto' trust_remote_code=True torch_dtype=torch.bfloat16 \
    --tokenizer-kwargs padding_side='left' truncation='left' use_fast=False trust_remote_code=True \
    --max-seq-len 4096 \
    --batch-size 32 \
    --hf-num-gpus 1 \
    --mode all

# longbench
LLM_WORKER_MULTIPROC_METHOD=spawn CUDA_VISIBLE_DEVICES=0 python run.py \
    --datasets longbench \
    --summarizer longbench/summarizer \
    --hf-path your_model_path/model_name \
    --model-kwargs device_map='auto' trust_remote_code=True torch_dtype=torch.bfloat16 \
    --tokenizer-kwargs padding_side='left' truncation='left' use_fast=False trust_remote_code=True \
    --max-seq-len 32768 \
    --batch-size 1 \
    --hf-num-gpus 1 \
    --mode all

# leval
LLM_WORKER_MULTIPROC_METHOD=spawn CUDA_VISIBLE_DEVICES=0 python run.py \
    --datasets leval \
    --summarizer leval/summarizer \
    --hf-path your_model_path/model_name \
    --model-kwargs device_map='auto' trust_remote_code=True torch_dtype=torch.bfloat16 \
    --tokenizer-kwargs padding_side='left' truncation='left' use_fast=False trust_remote_code=True \
    --max-seq-len 32768 \
    --batch-size 1 \
    --hf-num-gpus 1 \
    --mode all

# needlebench
LLM_WORKER_MULTIPROC_METHOD=spawn CUDA_VISIBLE_DEVICES=0 python run.py \
    --datasets needlebench_single_32k \
    --summarizer needlebench/needlebench_32k_summarizer \
    --hf-path your_model_path/model_name \
    --model-kwargs device_map='auto' trust_remote_code=True torch_dtype=torch.bfloat16 \
    --tokenizer-kwargs padding_side='left' truncation='left' use_fast=False trust_remote_code=True \
    --max-seq-len 32768 \
    --batch-size 1 \
    --hf-num-gpus 1 \
    --mode all
```
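Each run writes predictions and summarized scores to an output directory. Assuming the stock OpenCompass layout (an assumption; our adaptations may relocate it), the most recent run can be found like this:

```shell
# Hypothetical inspection of the latest run, assuming OpenCompass's default outputs/ layout.
ls -t outputs/default/ | head -n 1
```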
Run with a config file:

Define the task_file in run_local_test.py, then run the following command:

```shell
./run_local_test.sh
```
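For reference, such a wrapper usually just forwards a config file to run.py. The sketch below is illustrative only; the variable name `task_file` and the config path are assumptions, and the shipped run_local_test.sh may differ:

```shell
#!/bin/bash
# Illustrative sketch of run_local_test.sh, not the shipped script.
task_file=configs/eval_local.py  # hypothetical config path; point this at your task file

LLM_WORKER_MULTIPROC_METHOD=spawn CUDA_VISIBLE_DEVICES=0 \
    python run.py "$task_file" --mode all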
- Get dataset config file

Use the following command to list available dataset configs:

```shell
# dataset names, e.g. mmlu or arc
python ./tools/list_configs.py mmlu arc
```
Thanks to the release of the following projects, which provided great help in quickly building a comparable benchmark: