
Update

[2023.12.22] We released our technical report🔥🔥🔥YAYI 2: Multilingual Open-Source Large Language Models.

Introduction

YAYI 2 is the new generation of open-source large language models launched by Wenge Technology. It was pretrained on 2.65 trillion tokens of high-quality multilingual data. The base model is aligned with human values through supervised fine-tuning on millions of instructions and reinforcement learning from human feedback (RLHF).

In this release we open-source the pre-trained base model, YAYI2-30B. By open-sourcing the YAYI 2 model, we aim to contribute to the development of the Chinese open-source community for pre-trained large language models, and we aspire to collaborate with every partner in building the YAYI large language model ecosystem.

For more technical details, please read our technical report 🔥YAYI 2: Multilingual Open-Source Large Language Models.

Model download

| Model | Context Length | 🤗 HF Model Name | Download Links |
|:--|:--:|:--|:--|
| YAYI2-30B | 4096 | wenge-research/yayi2-30b | download |
| YAYI2-Chat-30B | 4096 | wenge-research/yayi2-chat-30b | Coming soon... |

Evaluation

We evaluated our model on standard benchmarks, including C-Eval, MMLU, CMMLU, AGIEval, GAOKAO-Bench, GSM8K, MATH, BBH, HumanEval, and MBPP. Our goal is to assess the model's performance in language comprehension, knowledge comprehension, mathematical reasoning, logical reasoning, and code generation. YAYI 2 demonstrates exceptional performance among models of similar size.

Benchmark categories: Knowledge (C-Eval, MMLU, AGIEval, CMMLU, GAOKAO-Bench), Math (GSM8K, MATH), Logic reasoning (BBH), Code (HumanEval, MBPP).

| Model | C-Eval(val)<br>5-shot | MMLU<br>5-shot | AGIEval<br>3/0-shot | CMMLU<br>5-shot | GAOKAO-Bench<br>0-shot | GSM8K<br>8/4-shot | MATH<br>4-shot | BBH<br>3-shot | HumanEval<br>0-shot | MBPP<br>3-shot |
|:--|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|
| MPT-30B | - | 46.9 | 33.8 | - | - | 15.2 | 3.1 | 38.0 | 25.0 | 32.8 |
| Falcon-40B | - | 55.4 | 37.0 | - | - | 19.6 | 5.5 | 37.1 | 0.6 | 29.8 |
| LLaMA2-34B | - | 62.6 | 43.4 | - | - | 42.2 | 6.2 | 44.1 | 22.6 | 33.0 |
| Baichuan2-13B | 59.0 | 59.5 | 37.4 | 61.3 | 45.6 | 52.6 | 10.1 | 49.0 | 17.1 | 30.8 |
| Qwen-14B | 71.7 | 67.9 | 51.9 | 70.2 | 62.5 | 61.6 | 25.2 | 53.7 | 32.3 | 39.8 |
| InternLM-20B | 58.8 | 62.1 | 44.6 | 59.0 | 45.5 | 52.6 | 7.9 | 52.5 | 25.6 | 35.6 |
| Aquila2-34B | 98.5 | 76.0 | 43.8 | 78.5 | 37.8 | 50.0 | 17.8 | 42.5 | 0.0 | 41.0 |
| Yi-34B | 81.8 | 76.3 | 56.5 | 82.6 | 68.3 | 67.6 | 15.9 | 66.4 | 26.2 | 38.2 |
| YAYI2-30B | 80.9 | 80.5 | 62.0 | 84.0 | 64.4 | 71.2 | 14.8 | 54.5 | 53.1 | 45.8 |

We evaluate our model using the source code from the OpenCompass GitHub repository. Where available, we report results for the comparison models as assessed by OpenCompass, with the evaluation reference date set to Dec. 15th, 2023. For MPT, Falcon, and LLaMA, which have not been evaluated by OpenCompass, we use the results reported in the LLaMA 2 paper.

Quick Start

You can follow the steps below to run the YAYI 2 model with transformers.

1. Clone this repository to the local environment:

```bash
git clone https://github.com/wenge-research/YAYI2.git
cd YAYI2
```

2. Create a conda virtual environment:

```bash
conda create --name yayi_inference_env python=3.10
conda activate yayi_inference_env
```

Please note that this project requires Python 3.8 or higher.

3. Install dependencies:

```bash
pip install -r requirements.txt
```
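
Optionally, you can verify that PyTorch and CUDA are visible from the new environment before loading the 30B model; this quick check is not part of the original instructions, just a convenience:

```python
# Sanity check: confirm PyTorch is installed and at least one GPU is visible.
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
```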

Pre-trained model inference

```python
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("wenge-research/yayi2-30b", trust_remote_code=True)
>>> model = AutoModelForCausalLM.from_pretrained("wenge-research/yayi2-30b", device_map="auto", trust_remote_code=True)
>>> inputs = tokenizer('The winter in Beijing is', return_tensors='pt')
>>> inputs = inputs.to('cuda')
>>> pred = model.generate(
        **inputs,
        max_new_tokens=256,
        eos_token_id=tokenizer.eos_token_id,
        do_sample=True,
        repetition_penalty=1.2,
        temperature=0.4,
        top_k=100,
        top_p=0.8
        )
>>> print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
```

Downloading and loading the model for the first time may take a while.
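
If GPU memory is tight, a common option with the transformers API is to load the weights in bfloat16 via the standard torch_dtype argument; a minimal sketch (this is a generic transformers usage pattern, not a repository-specific recipe):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("wenge-research/yayi2-30b", trust_remote_code=True)
# Load the weights in bfloat16 and let accelerate shard them across the available GPUs.
model = AutoModelForCausalLM.from_pretrained(
    "wenge-research/yayi2-30b",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
```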

Model fine-tuning

This project utilizes the DeepSpeed framework for model training. After setting up the environment, you can execute the corresponding scripts to train the model. Both full-parameter fine-tuning and LoRA fine-tuning are supported.

Set up environment

1. Create a conda virtual environment:

```bash
conda create --name yayi_train_env python=3.10
conda activate yayi_train_env
```

2. Install dependencies:

```bash
pip install -r requirements.txt
```

3. Install accelerate:

```bash
pip install --upgrade accelerate
```

4. Install flash-attention:

```bash
pip install flash-attn==2.0.3 --no-build-isolation
pip install triton==2.0.0.dev20221202 --no-deps
```

Full-parameter fine-tuning

- Data format: Refer to data/yayi_train_example.json, which is a standard JSON file. Each data entry consists of "system" and "conversations". "system" holds global role-setting information and may be an empty string; "conversations" holds the multi-turn dialogue, with the human and YAYI roles speaking alternately (an illustrative entry is sketched at the end of this subsection).
- Instructions: Running the following command will initiate full-parameter fine-tuning of the YAYI model. We recommend a hardware configuration of 16 or more A100 GPUs (80GB each).
```bash
deepspeed --hostfile config/hostfile \
    --module training.trainer_yayi2 \
    --report_to "tensorboard" \
    --data_path "./data/yayi_train_example.json" \
    --model_name_or_path "your_model_path" \
    --output_dir "./output" \
    --model_max_length 2048 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 500 \
    --save_total_limit 10 \
    --learning_rate 5e-6 \
    --warmup_steps 2000 \
    --lr_scheduler_type cosine \
    --logging_steps 1 \
    --gradient_checkpointing True \
    --deepspeed "./config/deepspeed.json" \
    --bf16 True
```
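
The --deepspeed flag above points to the DeepSpeed configuration shipped with the repository. Purely for orientation, a typical ZeRO-2 + bf16 configuration for the Hugging Face Trainer integration looks like the sketch below; the values are illustrative, and config/deepspeed.json in this repository remains the authoritative file:

```python
import json

# Illustrative DeepSpeed config; "auto" lets the HF Trainer fill in values from its own arguments.
example_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
}

# Written to a hypothetical path so the repository's own config is not overwritten.
with open("deepspeed_example.json", "w") as f:
    json.dump(example_config, f, indent=2)
```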

Start the training using shell scripts:

```bash
bash scripts/start.sh
```
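
The following sketch illustrates the training data format described above. The "system"/"conversations" structure and the alternating human/YAYI roles come from the description of data/yayi_train_example.json; the per-turn field names used here are assumptions for illustration, so treat the example file shipped with the repository as authoritative:

```python
import json

# Illustrative training sample; see data/yayi_train_example.json for the authoritative layout.
example = {
    "system": "You are YAYI, a helpful multilingual assistant.",  # may be an empty string
    "conversations": [
        # Human and YAYI turns alternate; the per-turn keys below are assumed, not confirmed.
        {"from": "human", "value": "Summarize the YAYI 2 technical report in one sentence."},
        {"from": "yayi", "value": "YAYI 2 is a multilingual open-source LLM pretrained on 2.65 trillion tokens."},
    ],
}

with open("data/my_train_data.json", "w", encoding="utf-8") as f:
    json.dump([example], f, ensure_ascii=False, indent=2)
```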

LoRA fine-tuning

- Data format: Same as above; refer to data/yayi_train_example.json.
- Instructions: Running the following command will initiate LoRA fine-tuning of the YAYI model.

```bash
bash scripts/start_lora.sh
```
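
For reference, LoRA fine-tuning of a causal language model is commonly configured with the peft library; the sketch below assumes that setup, and the rank, alpha, and target module names are illustrative guesses rather than the values used by scripts/start_lora.sh:

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

# Illustrative LoRA setup; hyperparameters and target modules are assumptions,
# not the ones used by this repository's training scripts.
base_model = AutoModelForCausalLM.from_pretrained("wenge-research/yayi2-30b", trust_remote_code=True)
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                       # low-rank dimension
    lora_alpha=32,             # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # hypothetical attention projection names
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapter weights are trainable
```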

Pre-Training Data

- During the pre-training phase, we not only utilized internet data to train the model's language abilities but also incorporated curated general data and domain-specific information to enhance the model's expertise. The data distribution is as follows:

(Figure: pre-training data distribution)

- We established a comprehensive data processing pipeline to enhance data quality in all aspects. The pipeline comprises four modules: normalizing, heuristic cleaning, multi-level deduplication, and toxicity filtering. We collected 240 terabytes of raw data for pre-training, of which only 10.6 terabytes of high-quality data remained after preprocessing. The data processing pipeline is as follows:

(Figure: data processing pipeline)
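
Purely as a toy illustration of the four stages named above (the real pipeline and its heuristics are described in the technical report; everything below is a simplified stand-in):

```python
import hashlib
import unicodedata

def normalize(text: str) -> str:
    # Stage 1: normalizing, e.g. Unicode normalization and whitespace cleanup.
    return " ".join(unicodedata.normalize("NFKC", text).split())

def passes_heuristics(text: str) -> bool:
    # Stage 2: heuristic cleaning, e.g. drop documents that are too short.
    return len(text) >= 50

_seen = set()

def is_new(text: str) -> bool:
    # Stage 3: multi-level deduplication; only exact-match hashing is shown here.
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    if digest in _seen:
        return False
    _seen.add(digest)
    return True

def is_non_toxic(text: str) -> bool:
    # Stage 4: toxicity filtering; a real system would use a trained classifier.
    blocked_terms = {"<placeholder-toxic-term>"}
    return not any(term in text for term in blocked_terms)

def process(corpus):
    for doc in corpus:
        doc = normalize(doc)
        if passes_heuristics(doc) and is_new(doc) and is_non_toxic(doc):
            yield doc
```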

Tokenizer

- The YAYI 2 tokenizer adopts the Byte-Pair Encoding (BPE) algorithm from the SentencePiece library. The tokenizer is trained on a 500GB high-quality multilingual corpus covering more than ten commonly used languages, such as Chinese, English, French, and Russian.
- We decompose numbers digit by digit to support mathematical reasoning. At the same time, we manually added numerous HTML identifiers and common punctuation marks to the vocabulary to improve tokenization accuracy. Additionally, we reserved 200 slots for potential future applications, such as adding special identifiers during fine-tuning for specific directives.
- In a comprehensive evaluation of the YAYI 2 tokenizer's multilingual performance, we sampled data with a uniform length of 10,000 tokens, covering Chinese, English, and various minor languages. The compression ratios are presented in the following table.

(Figure: compression ratios of the YAYI 2 tokenizer across languages)

- A lower compression ratio indicates superior training and inference efficiency; a rough way to measure it for any Hugging Face tokenizer is sketched below.
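
One common definition of compression ratio is tokens produced per UTF-8 byte of input (the exact definition used in the technical report may differ); a minimal sketch with the Hugging Face tokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("wenge-research/yayi2-30b", trust_remote_code=True)

def compression_ratio(text: str) -> float:
    # Tokens per UTF-8 byte: lower means each token covers more raw text.
    n_tokens = len(tokenizer(text)["input_ids"])
    n_bytes = len(text.encode("utf-8"))
    return n_tokens / n_bytes

print(compression_ratio("北京的冬天很冷。"))                    # Chinese sample
print(compression_ratio("The winter in Beijing is cold."))      # English sample
```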

Loss

The following figure shows the final pre-training loss of YAYI2-30B.

(Figure: pre-training loss curve of YAYI2-30B)

Related agreements

Open Source License

The code in this project is open-sourced under the Apache-2.0 license. The use of YAYI series model weights and data must adhere to the YAYI 2 Community License. If you intend to use the YAYI 2 series models or their derivatives for commercial purposes, please complete the YAYI 2 Model Commercial Registration Information and send it to [email protected]. After receiving the email, we will conduct a review within 3 working days; once the review is passed, you will receive a commercial license. Please strictly comply with the YAYI 2 Model Commercial License Agreement during use. Thank you for your cooperation!

Citation

If you use this resource in your work, please cite our paper:

```bibtex
@article{YAYI2,
  author  = {Yin Luo and Qingchao Kong and Nan Xu and others},
  title   = {YAYI 2: Multilingual Open Source Large Language Models},
  journal = {arXiv preprint arXiv:2312.14862},
  url     = {https://arxiv.org/abs/2312.14862},
  year    = {2023}
}
```

Star History

(Figure: star history chart)