Support 100 translation languages with m2m-100 #5

Open
Bachstelze opened this issue Jun 18, 2023 · 8 comments
Labels
enhancement

Comments

@Bachstelze

We could support more translation directions with M2M-100 in CTranslate2, or use Easy-Translate.

@jncraton added the enhancement label Jun 18, 2023
@Rohith04MVK

Is this something currently being worked on? If not, I would love to contribute.

@Bachstelze
Author

In the long term, I am looking into better translation support via LLMs like Unbabel's Tower. However, it will take additional steps until we have general models with this enhancement.

@jncraton
Owner

@Rohith04MVK This is not actively being worked on, but if folks want this I'm happy for it to be added. I haven't thought about this deeply, but I would imagine this could be implemented as something like:

def translate(text, src_lang, dst_lang):
    """Translate `text` from `src_lang` to `dst_lang`"""
    ...

It should be a lot like the code function.
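For reference, a minimal sketch of how that signature could be backed by the ct2 M2M-100 model, assuming the int8-converted model directory and its sentencepiece.bpe.model are available locally (the directory name below is just a placeholder, not the package's actual layout):

import ctranslate2
import sentencepiece as spm

# Placeholder path; the converted model directory must contain sentencepiece.bpe.model
translator = ctranslate2.Translator("m2m100_418M-ct2-int8")
sp = spm.SentencePieceProcessor("m2m100_418M-ct2-int8/sentencepiece.bpe.model")

def translate(text, src_lang, dst_lang):
    """Translate `text` from `src_lang` to `dst_lang`"""
    # M2M-100 expects a language token such as __en__ before the source tokens
    source = [f"__{src_lang}__"] + sp.encode(text, out_type=str)
    target_prefix = [f"__{dst_lang}__"]
    results = translator.translate_batch([source], target_prefix=[target_prefix])
    # Drop the leading target-language token before decoding
    return sp.decode(results[0].hypotheses[0][1:])

print(translate("Hello world!", "en", "de"))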

@Rohith04MVK

I'd love to help! While I think the M2M-100 418M model with CTranslate2 (>512 MB) has potential, are there any other models or approaches we should consider before moving forward?

@jncraton
Owner

My approach has been to try to define the simplest possible interface without worrying too much about specific models. New and improved models are created regularly, and one of my goals for this project is to provide easy access to the current state-of-the-art model for its size without users of the package needing to keep track of the latest and greatest models.

There's a priority list of available models that is used to determine which model to use. The package searches through this list in order until a model is found that matches the current inference requirements (max RAM, license, tuning, etc.). I would hope that we would be able to do the same for translation models.
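As a toy illustration of that selection logic (the list entries, RAM figures, and field names here are hypothetical, not the package's actual internals):

# Hypothetical priority list, ordered from most to least preferred.
TRANSLATION_MODELS = [
    {"name": "nllb-200-distilled-600M", "ram_gb": 4.0, "license": "cc-by-nc-4.0"},
    {"name": "m2m100_418M", "ram_gb": 2.0, "license": "mit"},
]

def choose_translation_model(max_ram_gb, allowed_licenses):
    """Return the first model that fits the current inference requirements."""
    for model in TRANSLATION_MODELS:
        if model["ram_gb"] <= max_ram_gb and model["license"] in allowed_licenses:
            return model["name"]
    raise ValueError("No available translation model satisfies the requirements")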

m2m100 looks like a reasonable place to start from my point of view. I just uploaded the ct2 int8 quantized models.

@Bachstelze
Author

NLLB models are also supported by CTranslate2. They cover up to 200 languages but are an order of magnitude bigger.

@Rohith04MVK

Was the sentencepiece.bpe.model intentionally omitted from the repo?

@jncraton
Owner

That's an oversight on my part. I have a notebook that I use to quickly convert these models. I didn't see that this file needed to be added to the files copied by ct2-transformers-converter. I've added those files now.
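For anyone reproducing the conversion, here is a sketch of how those extra tokenizer files can be copied along with the converted model, assuming CTranslate2's Python converter API (the output directory name is a placeholder):

from ctranslate2.converters import TransformersConverter

# Convert facebook/m2m100_418M to the CTranslate2 format with int8 quantization,
# copying the tokenizer files that the converter does not generate itself.
converter = TransformersConverter(
    "facebook/m2m100_418M",
    copy_files=["sentencepiece.bpe.model", "tokenizer_config.json"],
)
converter.convert("m2m100_418M-ct2-int8", quantization="int8")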
