Interesting project, but I have a concern about language coverage.
It is known that there are relatively few Chinese tokens in LLaMA's training data, and each Chinese character is tokenized into several tokens, which is inefficient at generation time. Will the project handle this, e.g. by adding new tokens and doing some additional pretraining?
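For illustration, here is a minimal sketch of how to check this, assuming the Hugging Face transformers LlamaTokenizer and a local copy of the LLaMA tokenizer files (the path below is a placeholder):

```python
from transformers import LlamaTokenizer

# placeholder path; point this at your local LLaMA tokenizer files
tokenizer = LlamaTokenizer.from_pretrained("path/to/llama-tokenizer")

text = "今天天气很好"  # "The weather is nice today"
ids = tokenizer(text, add_special_tokens=False)["input_ids"]
pieces = tokenizer.convert_ids_to_tokens(ids)

print(len(text), "characters ->", len(ids), "tokens")
# characters not covered by the 32k SentencePiece vocab fall back to
# multiple byte-level pieces, so the token count exceeds the character count
print(pieces)
```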
Thanks for reaching out. This is a good question.
As shown in our ten proposed main research areas, multilingual support is an important challenge for the 1st generation of LLaMA.
According to our analysis, a potentially thorough solution is to add a larger high-quality Chinese corpus for additional pre-training, but we should always pay attention to the risk of forgetting the model's existing capabilities.
We need systematic research here, and we hope more people will join the Llama-X community to discuss and solve this problem thoroughly together.
Let's keep this issue open and look forward to more solid methods.
I debugged the Alpaca code: its vocabulary is about 30k, which is very small, and some Chinese characters are tokenized into two tokens, which is inefficient. If you want a new encoding for Chinese, you would probably need to replace the existing Chinese tokens and add new ones.
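Not a definitive recipe, but a rough sketch of what "add new tokens" could look like with the Hugging Face transformers API; the paths and the token list are placeholders, and the new embedding rows would still need the additional Chinese pre-training discussed above:

```python
from transformers import LlamaForCausalLM, LlamaTokenizer

# placeholder paths
tokenizer = LlamaTokenizer.from_pretrained("path/to/llama-tokenizer")
model = LlamaForCausalLM.from_pretrained("path/to/llama-7b")

# e.g. frequent words/subwords mined from a Chinese corpus;
# adding (rather than deleting) keeps existing embedding rows aligned
new_tokens = ["今天", "天气", "中国"]
num_added = tokenizer.add_tokens(new_tokens)

# the newly added rows are randomly initialized and must be learned
# during additional pre-training on Chinese text
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens, new vocab size {len(tokenizer)}")
```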