Concern on the language #2

zhhongzhi · 2023-03-31T03:00:25Z

Interesting project, but I have a concern about language coverage.
It is known that LLaMA's training data contains relatively few Chinese tokens, and that a single Chinese character is often split into several tokens, which makes generation inefficient. Will the project handle this, e.g. by adding new tokens and doing some additional pre-training?
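To illustrate the cost being described, here is a minimal sketch of a toy character-level tokenizer with UTF-8 byte fallback. It is not LLaMA's actual tokenizer; `count_tokens` and `toy_vocab` are hypothetical names, and the assumption is only that a character missing from the vocabulary is emitted as one token per UTF-8 byte.

```python
def count_tokens(text, vocab):
    """Count tokens for a toy tokenizer: one token per in-vocabulary
    character, otherwise one token per UTF-8 byte (byte fallback)."""
    total = 0
    for ch in text:
        if ch in vocab:
            total += 1
        else:
            # Assumption: unknown characters fall back to raw bytes.
            total += len(ch.encode("utf-8"))
    return total


# Hypothetical vocabulary with no Chinese characters at all.
toy_vocab = set("abcdefghijklmnopqrstuvwxyz ")
text = "你好世界"

before = count_tokens(text, toy_vocab)

# Extend the vocabulary with the characters that appear in the text.
extended_vocab = toy_vocab | set(text)
after = count_tokens(text, extended_vocab)

print(before, after)  # 12 4
```

In this toy setting, the four characters cost 12 tokens before the vocabulary is extended and 4 afterwards, which is the kind of saving that adding new tokens aims for.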

yuys0602 · 2023-03-31T03:09:46Z

Yes, we need Chinese support, and more Chinese corpus data.

victorsungo · 2023-03-31T03:17:06Z

Thanks for reaching out. This is a good question.
As shown in our proposed ten main research areas, multilingual capability is an important challenge for the first generation of LLaMA.
According to our analysis, a potentially thorough solution is to add more high-quality Chinese corpus data for additional pre-training, but we must always watch for the risk of the model forgetting its existing capabilities.
This needs systematic research, and we hope more people will join the discussion and help solve the problem thoroughly in the Llama-X community.

Let's keep this issue open; we look forward to more solid methods.

zhangbo2008 · 2023-03-31T05:29:30Z

I debugged the Alpaca code: its vocabulary is 30k, which is very small, and some Chinese characters are tokenized into 2 tokens, which is inefficient. If you want to use a new encoding for Chinese, you may need to delete the existing Chinese tokens and add new ones.
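Adding new tokens also means growing the model's embedding table before any further pre-training. The sketch below shows one possible way to do that in plain Python; `extend_embeddings` is a hypothetical helper, and initializing new rows with the mean of the existing rows is an assumed heuristic, not something proposed in this thread.

```python
import random


def extend_embeddings(embeddings, num_new_tokens):
    """Return a copy of the embedding table with extra rows appended.

    Each new row is set to the mean of the existing rows (an assumed
    initialization heuristic); the existing rows are left unchanged.
    """
    n = len(embeddings)
    dim = len(embeddings[0])
    mean_row = [sum(row[j] for row in embeddings) / n for j in range(dim)]
    old_rows = [list(row) for row in embeddings]
    new_rows = [list(mean_row) for _ in range(num_new_tokens)]
    return old_rows + new_rows


# Toy table: 10 tokens with 8-dimensional embeddings.
random.seed(0)
old = [[random.gauss(0.0, 0.02) for _ in range(8)] for _ in range(10)]

# Pretend 3 new Chinese tokens were added to the vocabulary.
new = extend_embeddings(old, 3)
print(len(old), "->", len(new))
```

The new rows would still need training on a Chinese corpus before they become useful, which is where the risk of forgetting mentioned above comes in.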
