
Update

[2023.12.22] We released our technical report🔥🔥🔥YAYI 2: Multilingual Open-Source Large Language Models.

Introduction

YAYI 2 is the new generation of open-source large language models launched by Wenge Technology. It was pretrained on 2.65 trillion tokens of high-quality multilingual data. The base model is aligned with human values through supervised fine-tuning on millions of instructions and reinforcement learning from human feedback (RLHF).

In this release we open-source the pre-trained base model, YAYI2-30B. By open-sourcing the YAYI 2 model, we aim to contribute to the development of the Chinese open-source community for pre-trained large language models, and we aspire to collaborate with every partner in building the YAYI large language model ecosystem.

For more technical details, please read our technical report 🔥YAYI 2: Multilingual Open-Source Large Language Models.

Model download

| Model | Context Length | 🤗 HF Model Name | Download Links |
|:--|:--:|:--|:--|
| YAYI2-30B | 4096 | wenge-research/yayi2-30b | download |
| YAYI2-Chat-30B | 4096 | wenge-research/yayi2-chat-30b | Coming soon... |

Evaluation

We evaluated our model on standard benchmarks, including C-Eval, MMLU, CMMLU, AGIEval, GAOKAO-Bench, GSM8K, MATH, BBH, HumanEval, and MBPP. Our goal is to assess the model's performance in language comprehension, knowledge comprehension, mathematical reasoning, logical reasoning, and code generation. YAYI 2 demonstrates exceptional performance among models of similar size.

Benchmark categories: Knowledge (C-Eval, MMLU, AGIEval, CMMLU, GAOKAO-Bench), Math (GSM8K, MATH), Logic reasoning (BBH), Code (HumanEval, MBPP).

| Model | C-Eval(val)<br>5-shot | MMLU<br>5-shot | AGIEval<br>3/0-shot | CMMLU<br>5-shot | GAOKAO-Bench<br>0-shot | GSM8K<br>8/4-shot | MATH<br>4-shot | BBH<br>3-shot | HumanEval<br>0-shot | MBPP<br>3-shot |
|:--|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|
| MPT-30B | - | 46.9 | 33.8 | - | - | 15.2 | 3.1 | 38.0 | 25.0 | 32.8 |
| Falcon-40B | - | 55.4 | 37.0 | - | - | 19.6 | 5.5 | 37.1 | 0.6 | 29.8 |
| LLaMA2-34B | - | 62.6 | 43.4 | - | - | 42.2 | 6.2 | 44.1 | 22.6 | 33.0 |
| Baichuan2-13B | 59.0 | 59.5 | 37.4 | 61.3 | 45.6 | 52.6 | 10.1 | 49.0 | 17.1 | 30.8 |
| Qwen-14B | 71.7 | 67.9 | 51.9 | 70.2 | 62.5 | 61.6 | 25.2 | 53.7 | 32.3 | 39.8 |
| InternLM-20B | 58.8 | 62.1 | 44.6 | 59.0 | 45.5 | 52.6 | 7.9 | 52.5 | 25.6 | 35.6 |
| Aquila2-34B | 98.5 | 76.0 | 43.8 | 78.5 | 37.8 | 50.0 | 17.8 | 42.5 | 0.0 | 41.0 |
| Yi-34B | 81.8 | 76.3 | 56.5 | 82.6 | 68.3 | 67.6 | 15.9 | 66.4 | 26.2 | 38.2 |
| YAYI2-30B | 80.9 | 80.5 | 62.0 | 84.0 | 64.4 | 71.2 | 14.8 | 54.5 | 53.1 | 45.8 |

We evaluate our model using the source code from the OpenCompass GitHub repository. Where available, we report results for the comparison models as assessed by OpenCompass, with the evaluation reference date set to Dec. 15th, 2023. For MPT, Falcon, and LLaMA, which have not been evaluated by OpenCompass, we use the results reported in the LLaMA 2 paper.

Quick Start

You can follow the steps below to run the YAYI 2 model with transformers.

1. Clone this repository to the local environment:

```bash
git clone https://github.com/wenge-research/YAYI2.git
cd YAYI2
```

2. Create a conda virtual environment:

```bash
conda create --name yayi_inference_env python=3.10
conda activate yayi_inference_env
```

Please note that this project requires Python 3.8 or higher.

3. Install dependencies:

```bash
pip install -r requirements.txt
```
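
Optionally, you can verify that PyTorch and CUDA are visible from the new environment before loading the 30B model; this quick check is not part of the original instructions, just a convenience:

```python
# Sanity check: confirm PyTorch is installed and at least one GPU is visible.
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
```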

Pre-trained model inference

```python
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("wenge-research/yayi2-30b", trust_remote_code=True)
>>> model = AutoModelForCausalLM.from_pretrained("wenge-research/yayi2-30b", device_map="auto", trust_remote_code=True)
>>> inputs = tokenizer('The winter in Beijing is', return_tensors='pt')
>>> inputs = inputs.to('cuda')
>>> pred = model.generate(
        **inputs,
        max_new_tokens=256,
        eos_token_id=tokenizer.eos_token_id,
        do_sample=True,
        repetition_penalty=1.2,
        temperature=0.4,
        top_k=100,
        top_p=0.8
        )
>>> print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
```

Downloading and loading the model for the first time may take a while.
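
If GPU memory is tight, a common option with the transformers API is to load the weights in bfloat16 via the standard torch_dtype argument; a minimal sketch (this is a generic transformers usage pattern, not a repository-specific recipe):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("wenge-research/yayi2-30b", trust_remote_code=True)
# Load the weights in bfloat16 and let accelerate shard them across the available GPUs.
model = AutoModelForCausalLM.from_pretrained(
    "wenge-research/yayi2-30b",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
```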

Model fine-tuning

This project utilizes the DeepSpeed framework for model training. After setting up the environment, you can execute the corresponding scripts to train the model. Both full-parameter fine-tuning and LoRA fine-tuning are supported.

Set up environment

1. Create a conda virtual environment:

```bash
conda create --name yayi_train_env python=3.10
conda activate yayi_train_env
```

2. Install dependencies:

```bash
pip install -r requirements.txt
```

3. Install accelerate:

```bash
pip install --upgrade accelerate
```

4. Install flash-attention:

```bash
pip install flash-attn==2.0.3 --no-build-isolation
pip install triton==2.0.0.dev20221202 --no-deps
```

Full-parameter fine-tuning

- Data format: Refer to data/yayi_train_example.json, which is a standard JSON file. Each data entry consists of "system" and "conversations". "system" holds global role-setting information and may be an empty string; "conversations" holds the multi-turn dialogue, with the human and YAYI roles speaking alternately (an illustrative entry is sketched at the end of this subsection).
- Instructions: Running the following command will initiate full-parameter fine-tuning of the YAYI model. We recommend a hardware configuration of 16 or more A100 GPUs (80GB each).
```bash
deepspeed --hostfile config/hostfile \
    --module training.trainer_yayi2 \
    --report_to "tensorboard" \
    --data_path "./data/yayi_train_example.json" \
    --model_name_or_path "your_model_path" \
    --output_dir "./output" \
    --model_max_length 2048 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 500 \
    --save_total_limit 10 \
    --learning_rate 5e-6 \
    --warmup_steps 2000 \
    --lr_scheduler_type cosine \
    --logging_steps 1 \
    --gradient_checkpointing True \
    --deepspeed "./config/deepspeed.json" \
    --bf16 True
```
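
The --deepspeed flag above points to the DeepSpeed configuration shipped with the repository. Purely for orientation, a typical ZeRO-2 + bf16 configuration for the Hugging Face Trainer integration looks like the sketch below; the values are illustrative, and config/deepspeed.json in this repository remains the authoritative file:

```python
import json

# Illustrative DeepSpeed config; "auto" lets the HF Trainer fill in values from its own arguments.
example_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
}

# Written to a hypothetical path so the repository's own config is not overwritten.
with open("deepspeed_example.json", "w") as f:
    json.dump(example_config, f, indent=2)
```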

Start the training using shell scripts:

```bash
bash scripts/start.sh
```
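
The following sketch illustrates the training data format described above. The "system"/"conversations" structure and the alternating human/YAYI roles come from the description of data/yayi_train_example.json; the per-turn field names used here are assumptions for illustration, so treat the example file shipped with the repository as authoritative:

```python
import json

# Illustrative training sample; see data/yayi_train_example.json for the authoritative layout.
example = {
    "system": "You are YAYI, a helpful multilingual assistant.",  # may be an empty string
    "conversations": [
        # Human and YAYI turns alternate; the per-turn keys below are assumed, not confirmed.
        {"from": "human", "value": "Summarize the YAYI 2 technical report in one sentence."},
        {"from": "yayi", "value": "YAYI 2 is a multilingual open-source LLM pretrained on 2.65 trillion tokens."},
    ],
}

with open("data/my_train_data.json", "w", encoding="utf-8") as f:
    json.dump([example], f, ensure_ascii=False, indent=2)
```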

LoRA fine-tuning

- Data format: Same as above; refer to data/yayi_train_example.json.
- Instructions: Running the following command will initiate LoRA fine-tuning of the YAYI model.

```bash
bash scripts/start_lora.sh
```
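
For reference, LoRA fine-tuning of a causal language model is commonly configured with the peft library; the sketch below assumes that setup, and the rank, alpha, and target module names are illustrative guesses rather than the values used by scripts/start_lora.sh:

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

# Illustrative LoRA setup; hyperparameters and target modules are assumptions,
# not the ones used by this repository's training scripts.
base_model = AutoModelForCausalLM.from_pretrained("wenge-research/yayi2-30b", trust_remote_code=True)
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                       # low-rank dimension
    lora_alpha=32,             # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # hypothetical attention projection names
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapter weights are trainable
```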

Pre-Training Data

- During the pre-training phase, we not only utilized internet data to train the model's language abilities but also incorporated curated general data and domain-specific information to enhance the model's expertise. The data distribution is as follows:

(Figure: pre-training data distribution)

- We established a comprehensive data processing pipeline to enhance data quality in all aspects. The pipeline comprises four modules: normalizing, heuristic cleaning, multi-level deduplication, and toxicity filtering. We collected 240 terabytes of raw data for pre-training, of which only 10.6 terabytes of high-quality data remained after preprocessing. The data processing pipeline is as follows:

(Figure: data processing pipeline)
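
Purely as a toy illustration of the four stages named above (the real pipeline and its heuristics are described in the technical report; everything below is a simplified stand-in):

```python
import hashlib
import unicodedata

def normalize(text: str) -> str:
    # Stage 1: normalizing, e.g. Unicode normalization and whitespace cleanup.
    return " ".join(unicodedata.normalize("NFKC", text).split())

def passes_heuristics(text: str) -> bool:
    # Stage 2: heuristic cleaning, e.g. drop documents that are too short.
    return len(text) >= 50

_seen = set()

def is_new(text: str) -> bool:
    # Stage 3: multi-level deduplication; only exact-match hashing is shown here.
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    if digest in _seen:
        return False
    _seen.add(digest)
    return True

def is_non_toxic(text: str) -> bool:
    # Stage 4: toxicity filtering; a real system would use a trained classifier.
    blocked_terms = {"<placeholder-toxic-term>"}
    return not any(term in text for term in blocked_terms)

def process(corpus):
    for doc in corpus:
        doc = normalize(doc)
        if passes_heuristics(doc) and is_new(doc) and is_non_toxic(doc):
            yield doc
```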

Tokenizer

- The YAYI 2 tokenizer adopts the Byte-Pair Encoding (BPE) algorithm from the SentencePiece library. The tokenizer is trained on a 500GB high-quality multilingual corpus covering more than ten commonly used languages, such as Chinese, English, French, and Russian.
- We decompose numbers digit by digit to support mathematical reasoning. At the same time, we manually added numerous HTML identifiers and common punctuation marks to the vocabulary to improve tokenization accuracy. Additionally, we reserved 200 slots for potential future applications, such as adding special identifiers during fine-tuning for specific directives.
- In a comprehensive evaluation of the YAYI 2 tokenizer's multilingual performance, we sampled data with a uniform length of 10,000 tokens, covering Chinese, English, and various minor languages. The compression ratios are presented in the following table.

(Figure: compression ratios of the YAYI 2 tokenizer across languages)

- A lower compression ratio indicates superior training and inference efficiency; a rough way to measure it for any Hugging Face tokenizer is sketched below.
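
One common definition of compression ratio is tokens produced per UTF-8 byte of input (the exact definition used in the technical report may differ); a minimal sketch with the Hugging Face tokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("wenge-research/yayi2-30b", trust_remote_code=True)

def compression_ratio(text: str) -> float:
    # Tokens per UTF-8 byte: lower means each token covers more raw text.
    n_tokens = len(tokenizer(text)["input_ids"])
    n_bytes = len(text.encode("utf-8"))
    return n_tokens / n_bytes

print(compression_ratio("北京的冬天很冷。"))                    # Chinese sample
print(compression_ratio("The winter in Beijing is cold."))      # English sample
```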

Loss

The following figure shows the final pre-training loss of YAYI2-30B.

(Figure: pre-training loss curve of YAYI2-30B)

Related agreements

Open Source License

The code in this project is open-sourced under the Apache-2.0 license. The use of YAYI series model weights and data must adhere to the YAYI 2 Community License. If you intend to use the YAYI 2 series models or their derivatives for commercial purposes, please complete the YAYI 2 Model Commercial Registration Information and send it to [email protected]. After receiving the email, we will conduct a review within 3 working days; once the review is passed, you will receive a commercial license. Please strictly comply with the YAYI 2 Model Commercial License Agreement during use. Thank you for your cooperation!

Citation

If you use this resource in your work, please cite our paper:

```bibtex
@article{YAYI2,
  author  = {Yin Luo and Qingchao Kong and Nan Xu and others},
  title   = {YAYI 2: Multilingual Open Source Large Language Models},
  journal = {arXiv preprint arXiv:2312.14862},
  url     = {https://arxiv.org/abs/2312.14862},
  year    = {2023}
}
```

Star History

(Figure: star history chart)