Concern on the language #2

Open
zhhongzhi opened this issue Mar 31, 2023 · 3 comments

Comments

@zhhongzhi

Interesting project, but I have some concerns about the language.
It is well known that LLaMA's training data contains relatively few Chinese tokens, and that each Chinese character is split into several tokens, which makes generation inefficient. Will the project handle this, e.g. by adding new tokens and doing some additional pre-training?
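
To make the cost concrete, here is a minimal sketch of how byte fallback inflates the token count for characters missing from the vocabulary. It is a toy character-level model, not the actual LLaMA tokenizer: `count_tokens`, `toy_vocab`, and the sample strings are made up for illustration, and a real subword tokenizer would also merge frequent character sequences into single tokens.

```python
def count_tokens(text: str, vocab: set) -> int:
    """Count tokens under a toy character-level tokenizer with byte fallback.

    A character found in `vocab` costs one token. Any other character is
    split into its UTF-8 bytes and costs one token per byte.
    """
    total = 0
    for ch in text:
        if ch in vocab:
            total += 1
        else:
            total += len(ch.encode("utf-8"))
    return total


# Hypothetical vocabulary that only covers lowercase ASCII letters and space.
toy_vocab = set("abcdefghijklmnopqrstuvwxyz ")

english = "hello world"
chinese = "你好世界"

# Tokens per character before extending the vocabulary.
english_cost = count_tokens(english, toy_vocab) / len(english)
chinese_cost = count_tokens(chinese, toy_vocab) / len(chinese)

# Adding the Chinese characters as dedicated tokens removes the overhead.
extended_vocab = toy_vocab | set(chinese)
chinese_cost_extended = count_tokens(chinese, extended_vocab) / len(chinese)

print(english_cost, chinese_cost, chinese_cost_extended)
```

In this toy setup every uncovered Chinese character costs as many tokens as it has UTF-8 bytes, so the same generation budget produces far less Chinese text than English text until the vocabulary is extended.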

@yuys0602

Yes, we need Chinese support, and more Chinese corpus data.

@victorsungo
Collaborator

Thanks for reaching out. This is a good question.
As shown in our ten proposed main research areas, multilingual support is an important challenge for the first generation of LLaMA.
According to our analysis, a potentially thorough solution is to add more high-quality Chinese corpus data for additional pre-training, but we should always pay attention to the risk of the model forgetting its existing capabilities.
We need systematic research, and we hope more people will join the Llama-X community to discuss and solve this problem thoroughly together.

Let's keep this issue open; we look forward to more solid methods.

@zhangbo2008

I debugged the Alpaca code. Its vocabulary is about 30k, which is very small, and some Chinese characters are tokenized into 2 tokens, which is inefficient. If you want to use a new encoding for Chinese, you may need to delete the existing Chinese tokens and add new ones.
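
As a rough illustration of what "adding new tokens" involves on the model side, the sketch below appends tokens to a vocabulary and grows the embedding table to match. It only appends and does not delete existing tokens, so it is a simpler variant of the suggestion above. `extend_vocab`, the tiny vocabulary, and the mean initialization are assumptions for illustration, not the project's method; the new rows would still have to be trained during additional pre-training.

```python
import random


def extend_vocab(token_to_id, embeddings, new_tokens):
    """Append unseen tokens and give each one a new embedding row.

    Assumes token ids are contiguous and index rows of `embeddings`.
    New rows start at the mean of the existing rows (one common heuristic,
    chosen here as an assumption), so existing rows and ids are untouched.
    """
    dim = len(embeddings[0])
    count = len(embeddings)
    mean_row = [sum(row[i] for row in embeddings) / count for i in range(dim)]
    for tok in new_tokens:
        if tok in token_to_id:
            continue  # already covered, keep its existing id and row
        token_to_id[tok] = len(embeddings)
        embeddings.append(list(mean_row))
    return token_to_id, embeddings


# Hypothetical 3-token vocabulary with 4-dimensional embeddings.
random.seed(0)
token_to_id = {"a": 0, "b": 1, "c": 2}
embeddings = [[random.gauss(0.0, 0.02) for _ in range(4)] for _ in token_to_id]

extend_vocab(token_to_id, embeddings, ["你", "好", "a"])
print(len(token_to_id), len(embeddings))
```

With Hugging Face Transformers, the analogous steps are `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`. Deleting existing tokens, as suggested above, would additionally require remapping ids and retraining the tokenizer.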
