Support 100 translation languages with m2m-100 #5

Bachstelze · 2023-06-18T13:26:26Z

We could support more translation directions by running M2M-100 through CTranslate2, or by using Easy-Translate.

Rohith04MVK · 2024-01-19T05:31:32Z

Is this something currently being worked on? If not, I would love to contribute.

Bachstelze · 2024-01-19T09:02:42Z

In the long term, I am looking into better translation support from LLMs such as Unbabel's Tower. It will take additional steps, though, before we have general models with this enhancement.

jncraton · 2024-01-23T12:39:01Z

@Rohith04MVK This is not actively being worked on, but if folks want this I'm happy for it to be added. I haven't thought about this deeply, but I would imagine this could be implemented as something like:

def translate(text, src_lang, dst_lang):
    """Translate `text` from `src_lang` to `dst_lang`"""
    ...

It should look a lot like the existing code function.
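
For illustration only, here is a minimal sketch of how that stub might be filled in with M2M-100 and CTranslate2, following the usage pattern shown in the CTranslate2 documentation. The `model_dir` default and the `lang_token` helper are hypothetical names, not part of this package, and the sketch assumes a converted model already exists on disk and that the Hugging Face tokenizer for `facebook/m2m100_418M` is used.

```python
def lang_token(lang: str) -> str:
    """Return the M2M-100 language token for a code such as "en" or "de"."""
    return f"__{lang}__"


def translate(text, src_lang, dst_lang, model_dir="m2m100_418m_ct2"):
    """Translate `text` from `src_lang` to `dst_lang`"""
    # Imported lazily so the module still loads without the optional dependencies.
    import ctranslate2
    import transformers

    # `model_dir` is a hypothetical path to a CTranslate2-converted model.
    translator = ctranslate2.Translator(model_dir)
    tokenizer = transformers.AutoTokenizer.from_pretrained("facebook/m2m100_418M")

    # The tokenizer prepends the source language token when encoding.
    tokenizer.src_lang = src_lang
    source = tokenizer.convert_ids_to_tokens(tokenizer.encode(text))

    # The target language is selected by forcing its token as the first output token.
    results = translator.translate_batch(
        [source], target_prefix=[[lang_token(dst_lang)]]
    )

    # Drop the leading language token before decoding.
    target = results[0].hypotheses[0][1:]
    return tokenizer.decode(tokenizer.convert_tokens_to_ids(target))
```

A call would then look like `translate("Hello world!", "en", "de")`. Loading the model on every call is wasteful; a real implementation would presumably cache the translator and tokenizer.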

Rohith04MVK · 2024-01-23T14:17:46Z

I'd love to help! While I think the M2M-100 418M model with CTranslate2 (>512 MB) has potential, are there any other models or approaches we should consider before moving forward?

jncraton · 2024-01-23T15:22:03Z

My approach has been to try to define the simplest possible interface without worrying too much about specific models. New and improved models are created regularly, and one of my goals for this project is to provide easy access to the current state-of-the-art model for its size without users of the package needing to keep track of the latest and greatest models.

There's a priority list of available models that is used to determine which model to use. The package searches through this list in order until it finds a model that matches the current inference requirements (max RAM, license, tuning, etc.). I would hope that we would be able to do the same for translation models.
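
As a rough sketch of that kind of lookup (not the package's actual code), the idea could be expressed as a first-match search over an ordered list. The `ModelInfo` fields, the model names, and `select_model` are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass(frozen=True)
class ModelInfo:
    name: str
    ram_gb: float
    license: str
    task: str


# Hypothetical entries, ordered from most to least preferred.
MODELS = [
    ModelInfo("example-translate-large", 2.0, "cc-by-nc-4.0", "translate"),
    ModelInfo("example-translate-small", 0.5, "mit", "translate"),
    ModelInfo("example-instruct-small", 0.5, "apache-2.0", "instruct"),
]


def select_model(
    task: str,
    max_ram_gb: float,
    allowed_licenses: Optional[Set[str]] = None,
) -> Optional[ModelInfo]:
    """Return the first model in priority order that meets every requirement."""
    for model in MODELS:
        if model.task != task:
            continue
        if model.ram_gb > max_ram_gb:
            continue
        if allowed_licenses is not None and model.license not in allowed_licenses:
            continue
        return model
    # No model satisfies the constraints.
    return None
```

With this layout, adding a better translation model only means inserting it earlier in the list; callers keep asking for a task and a set of constraints.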

m2m100 looks like a reasonable place to start from my point of view. I just uploaded the ct2 int8 quantized models.

Bachstelze · 2024-01-23T15:48:29Z

NLLB models are also supported by CTranslate2. They support up to 200 languages but are an order of magnitude larger.

Rohith04MVK · 2024-01-24T14:53:16Z

Was the sentencepiece.bpe.model intentionally omitted from the repo?

jncraton · 2024-01-24T15:06:14Z

That's an oversight on my part. I have a notebook that I use to quickly convert these models, and I hadn't realized that this file needed to be added to the files copied by ct2-transformers-converter. I've added those files now.
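
For anyone converting a model themselves, the following is a sketch of how the tokenizer file could be carried along using the Python converter API in CTranslate2. It is not the notebook mentioned above; the function name, the output directory, and the default file list are assumptions for illustration.

```python
def convert_m2m100(
    model_name="facebook/m2m100_418M",
    output_dir="m2m100_418m_ct2",  # hypothetical output path
    extra_files=("sentencepiece.bpe.model",),
    quantization="int8",
):
    """Convert a Hugging Face M2M-100 checkpoint to CTranslate2 format."""
    # Imported lazily so the module still loads without ctranslate2 installed.
    from ctranslate2.converters import TransformersConverter

    # `copy_files` lists files from the source model to copy into the
    # converted model directory, such as the SentencePiece model.
    converter = TransformersConverter(model_name, copy_files=list(extra_files))
    return converter.convert(output_dir, quantization=quantization)
```

The command-line tool exposes the same option as `--copy_files`.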

jncraton added the "enhancement" label · Jun 18, 2023