[2023.12.22] We released our technical report 🔥🔥🔥 YAYI 2: Multilingual Open-Source Large Language Models.
YAYI 2 is the new generation of open-source large language models launched by Wenge Technology. It has been pretrained on 2.65 trillion tokens of high-quality multilingual data. The base model is aligned with human values through supervised fine-tuning on millions of instructions and reinforcement learning from human feedback (RLHF).
In this release, we open-source the pre-trained language model, YAYI2-30B. By open-sourcing the YAYI 2 model, we aim to contribute to the development of the Chinese open-source community for pre-trained large language models. Through open sourcing, we aspire to collaborate with every partner in building the YAYI large language model ecosystem.
For more technical details, please read our technical report 🔥YAYI 2: Multilingual Open-Source Large Language Models.
Model | Context Length | 🤗 HF Model Name | Download Links |
---|---|---|---|
YAYI2-30B | 4096 | wenge-research/yayi2-30b | download |
YAYI2-Chat-30B | 4096 | wenge-research/yayi2-chat-30b | Coming soon... |
We evaluated our model on standard benchmarks, including C-Eval, MMLU, CMMLU, AGIEval, GAOKAO-Bench, GSM8K, MATH, BBH, HumanEval, and MBPP. The goal is to assess the model's performance in language comprehension, knowledge comprehension, mathematical reasoning, logical reasoning, and code generation. YAYI 2 demonstrates exceptional performance among models of similar size.
The benchmarks fall into four groups: knowledge (C-Eval, MMLU, AGIEval, CMMLU, GAOKAO-Bench), math (GSM8K, MATH), logical reasoning (BBH), and code (HumanEval, MBPP).

Model | C-Eval(val)<br>5-shot | MMLU<br>5-shot | AGIEval<br>3/0-shot | CMMLU<br>5-shot | GAOKAO-Bench<br>0-shot | GSM8K<br>8/4-shot | MATH<br>4-shot | BBH<br>3-shot | HumanEval<br>0-shot | MBPP<br>3-shot |
---|---|---|---|---|---|---|---|---|---|---|
MPT-30B | - | 46.9 | 33.8 | - | - | 15.2 | 3.1 | 38.0 | 25.0 | 32.8 |
Falcon-40B | - | 55.4 | 37.0 | - | - | 19.6 | 5.5 | 37.1 | 0.6 | 29.8 |
LLaMA2-34B | - | 62.6 | 43.4 | - | - | 42.2 | 6.2 | 44.1 | 22.6 | 33.0 |
Baichuan2-13B | 59.0 | 59.5 | 37.4 | 61.3 | 45.6 | 52.6 | 10.1 | 49.0 | 17.1 | 30.8 |
Qwen-14B | 71.7 | 67.9 | 51.9 | 70.2 | 62.5 | 61.6 | 25.2 | 53.7 | 32.3 | 39.8 |
InternLM-20B | 58.8 | 62.1 | 44.6 | 59.0 | 45.5 | 52.6 | 7.9 | 52.5 | 25.6 | 35.6 |
Aquila2-34B | 98.5 | 76.0 | 43.8 | 78.5 | 37.8 | 50.0 | 17.8 | 42.5 | 0.0 | 41.0 |
Yi-34B | 81.8 | 76.3 | 56.5 | 82.6 | 68.3 | 67.6 | 15.9 | 66.4 | 26.2 | 38.2 |
YAYI2-30B | 80.9 | 80.5 | 62.0 | 84.0 | 64.4 | 71.2 | 14.8 | 54.5 | 53.1 | 45.8 |
We evaluate our model using the source code from the OpenCompass GitHub repository. Where available, we report results for comparison models assessed by OpenCompass, with the evaluation reference date set to Dec. 15th, 2023. For MPT, Falcon, and LLaMA, which have not been evaluated by OpenCompass, we use the results reported in the LLaMA 2 paper.
You can follow the steps below to run the YAYI 2 model with transformers.
- Clone this repository to the local environment:
git clone https://github.com/wenge-research/YAYI2.git
cd YAYI2
- Create a conda virtual environment:
conda create --name yayi_inference_env python=3.10
conda activate yayi_inference_env
Please note that this project requires Python 3.8 or higher.
- Install dependencies:
pip install -r requirements.txt
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("wenge-research/yayi2-30b", trust_remote_code=True)
>>> model = AutoModelForCausalLM.from_pretrained("wenge-research/yayi2-30b", device_map="auto", trust_remote_code=True)
>>> inputs = tokenizer('The winter in Beijing is', return_tensors='pt')
>>> inputs = inputs.to('cuda')
>>> pred = model.generate(
**inputs,
max_new_tokens=256,
eos_token_id=tokenizer.eos_token_id,
do_sample=True,
repetition_penalty=1.2,
temperature=0.4,
top_k=100,
top_p=0.8
)
>>> print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
Downloading and loading the model for the first time may take some time.
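If GPU memory is tight, the model can usually be loaded in half precision. The following is a minimal sketch assuming a standard transformers setup; the `torch_dtype` choice and generation settings are illustrative, not official recommendations from this project.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: loading in bfloat16 roughly halves memory versus float32;
# switch to float16 on GPUs without bf16 support.
tokenizer = AutoTokenizer.from_pretrained("wenge-research/yayi2-30b", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "wenge-research/yayi2-30b",
    torch_dtype=torch.bfloat16,   # illustrative; not an official recommendation
    device_map="auto",
    trust_remote_code=True,
)

inputs = tokenizer("The winter in Beijing is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```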
This project uses the `deepspeed` framework for model training. After setting up the environment, you can execute the corresponding scripts to train the model. Both full-parameter fine-tuning and LoRA fine-tuning are supported.
- Create a conda virtual environment:
conda create --name yayi_train_env python=3.10
conda activate yayi_train_env
- Install dependencies:
pip install -r requirements.txt
- Install accelerate:
pip install --upgrade accelerate
- Install FlashAttention:
pip install flash-attn==2.0.3 --no-build-isolation
pip install triton==2.0.0.dev20221202 --no-deps
- Data format: Refer to `data/yayi_train_example.json`, which is a standard JSON file. Each data entry consists of `"system"` and `"conversations"`. `"system"` contains global role-setting information and can be an empty string, while `"conversations"` contains multi-turn dialogue conducted alternately between the human and YAYI roles (a minimal example entry is sketched after the training commands below).
- Instructions: Running the following command will initiate full-parameter fine-tuning of the YAYI model. It is recommended to use a hardware configuration with 16 or more A100 GPUs (80 GB each).
deepspeed --hostfile config/hostfile \
--module training.trainer_yayi2 \
--report_to "tensorboard" \
--data_path "./data/yayi_train_example.json" \
--model_name_or_path "your_model_path" \
--output_dir "./output" \
--model_max_length 2048 \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 500 \
--save_total_limit 10 \
--learning_rate 5e-6 \
--warmup_steps 2000 \
--lr_scheduler_type cosine \
--logging_steps 1 \
--gradient_checkpointing True \
--deepspeed "./config/deepspeed.json" \
--bf16 True
Start training using the shell script:
bash scripts/start.sh
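For reference, below is a minimal sketch of what a single training entry might look like, based on the data-format description above. The `"system"` and `"conversations"` keys come from that description; the per-turn field names (`"from"`, `"value"`) and role labels are assumptions, so defer to `data/yayi_train_example.json` for the authoritative format.

```python
import json

# A hypothetical training entry following the "system" + "conversations"
# structure described above. Per-turn keys ("from", "value") and role names
# ("human", "yayi") are assumptions; check data/yayi_train_example.json.
example_entry = {
    "system": "You are a helpful multilingual assistant.",
    "conversations": [
        {"from": "human", "value": "Please briefly introduce the YAYI 2 model."},
        {"from": "yayi", "value": "YAYI 2 is a multilingual open-source large language model ..."},
    ],
}

# The training data file is a standard JSON file containing a list of such entries.
with open("my_train_data.json", "w", encoding="utf-8") as f:
    json.dump([example_entry], f, ensure_ascii=False, indent=2)
```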
- Data format: Same as above; refer to `data/yayi_train_example.json`.
- Running the following command will initiate LoRA fine-tuning of the YAYI model:
bash scripts/start_lora.sh
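Independently of `scripts/start_lora.sh`, the sketch below illustrates what LoRA fine-tuning wraps around a causal language model, using the Hugging Face peft library. The rank, alpha, and target module names are assumptions for illustration and may not match this repository's implementation.

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType

# Load the base model (half precision to reduce memory; illustrative only).
model = AutoModelForCausalLM.from_pretrained(
    "wenge-research/yayi2-30b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

# LoRA freezes the base weights and injects small trainable low-rank adapters.
# r, lora_alpha, and target_modules below are assumptions, not the repo's settings.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # hypothetical module names
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```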
- During the pre-training phase, we not only utilized internet data to train the model's language abilities but also incorporated curated general data and domain-specific information to enhance the model's expertise. Details of the data distribution are as follows:
- We established a comprehensive data processing pipeline to enhance data quality in all aspects. This pipeline comprises four modules: normalization, heuristic cleaning, multi-level deduplication, and toxicity filtering. We collected 240 terabytes of raw data for pre-training, of which only 10.6 terabytes of high-quality data remained after preprocessing. Details of the data processing pipeline are as follows:
- The YAYI 2 tokenizer adopts the Byte-Pair Encoding (BPE) algorithm from the SentencePiece library. The tokenizer is trained on a 500GB high-quality multilingual corpus covering over ten commonly used languages, such as Chinese, English, French, and Russian.
- We decompose numbers digit by digit for mathematical reasoning. We also manually added numerous HTML identifiers and common punctuation marks to the vocabulary to enhance tokenization accuracy. Additionally, we reserved 200 slots for potential future applications, such as incorporating identifiers during fine-tuning for specific directives.
- In our comprehensive evaluation of YAYI 2 tokenizer’s multilingual performance, we sample data with a uniform length of 10,000 tokens, covering Chinese, English, and various minor languages. The compression ratio is presented in the following table.
- A lower compression ratio indicates superior training and inference efficiency.
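As a quick sanity check, the sketch below tokenizes a sample text and reports tokens per UTF-8 byte, one common way to measure a compression ratio (the exact definition used in the technical report may differ), and shows the digit-by-digit treatment of numbers.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("wenge-research/yayi2-30b", trust_remote_code=True)

# Tokens per UTF-8 byte as a rough compression measure; fewer tokens per byte
# means better compression. This is one common definition, not necessarily the
# one used in the technical report.
text = "北京的冬天很冷。The winter in Beijing is cold."
tokens = tokenizer.encode(text, add_special_tokens=False)
num_bytes = len(text.encode("utf-8"))
print(f"{len(tokens)} tokens / {num_bytes} bytes = {len(tokens) / num_bytes:.3f}")

# Numbers are decomposed digit by digit, which helps mathematical reasoning.
print(tokenizer.tokenize("12345 + 678 = 13023"))
```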
The following figure shows the final pre-training loss of YAYI2-30B.
The code in this project is open-sourced under the Apache-2.0 license. The use of the YAYI series model weights and data must adhere to the YAYI 2 Community License. If you intend to use the YAYI 2 series models or their derivatives for commercial purposes, please complete the YAYI 2 Model Commercial Registration Information and send it to [email protected]. After receiving the email, we will conduct a review within 3 working days; once the review is passed, you will receive a commercial license. Please strictly comply with the YAYI 2 Model Commercial License Agreement during use. Thank you for your cooperation!
If you are using the resource for your work, please cite our paper.
@article{YAYI2,
  author = {Yin Luo and Qingchao Kong and Nan Xu and others},
title = {YAYI 2: Multilingual Open Source Large Language Models},
journal = {arXiv preprint arXiv:2312.14862},
url = {https://arxiv.org/abs/2312.14862},
year = {2023}
}