About language normalization #17
-
Yes, I think that's a good idea. We can create a single dictionary mapping {alpha2: service_specific_code} and use it in both _language_normalize and _language_denormalize. I also think we need a list of the languages supported by each service and check the source/destination language against it at the BaseTranslator level. But what will we do with the languages which are not in the official ISO 639 list? For example Kazakh-Latin, Emoji, Uzbek-Cyrillic. Just leave these languages in the ISO 639 list?
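Something like this, as a minimal sketch of the idea (BaseTranslator, _language_normalize and _language_denormalize are the names from this thread; Language, _check_supported, ExampleTranslator and the dictionary contents are made up here for illustration):

```python
from dataclasses import dataclass


@dataclass
class Language:
    alpha2: str  # ISO 639-1 (alpha-2) code, e.g. "en"


class BaseTranslator:
    """Stub standing in for the real base class from the discussion."""

    # single {alpha2: service_specific_code} mapping; empty means
    # "the service uses plain alpha-2 codes"
    _LANGUAGE_MAP: dict = {}
    # languages the service supports, checked once at this level
    _SUPPORTED: set = set()

    def _check_supported(self, language: Language) -> None:
        if self._SUPPORTED and language.alpha2 not in self._SUPPORTED:
            raise ValueError(f"{language.alpha2} is not supported by this service")

    def _language_normalize(self, language: Language) -> str:
        # alpha-2 -> service-specific code (falls back to the alpha-2 code)
        return self._LANGUAGE_MAP.get(language.alpha2, language.alpha2)

    def _language_denormalize(self, code: str) -> Language:
        # service-specific code -> alpha-2, reusing the same single dictionary
        reverse = {v: k for k, v in self._LANGUAGE_MAP.items()}
        return Language(reverse.get(code, code))


class ExampleTranslator(BaseTranslator):
    # illustrative values only
    _LANGUAGE_MAP = {"zh": "zh-CN", "he": "iw"}
    _SUPPORTED = {"en", "fr", "zh", "he", "kk", "uz"}
```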
-
@ZhymabekRoman Just merged the …
-
Ahh, yes, look at GitHub Actions: for some reason 2 instances of Pylint have been idle for more than 4 hours, which will exhaust your entire free GitHub Actions plan.
-
@ZhymabekRoman I finally repaired … I'm just going to fix the issue with …
-
I do think that the current state is ready for v2 (unless I forgot something?). What do you think about releasing it, @ZhymabekRoman?
-
@ZhymabekRoman Do you really think we should have Google, Bing, Reverso, etc. specified in the dataset we have (calling by_google, for example)?
Maybe we should just have the language normalization and denormalization functions hold the changes that need to be made to the language name (with a dictionary, maybe), and make the service-specific language code accessible only through those functions.
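Roughly what I have in mind, as a sketch with the overrides kept out of the dataset (the module layout and override values are assumptions; only the normalize/denormalize idea comes from this thread):

```python
# Hypothetical per-service module: the Language dataset stays
# service-agnostic, and the service-specific codes live only here.

GOOGLE_OVERRIDES = {"he": "iw", "jv": "jw"}  # illustrative override table
_GOOGLE_REVERSE = {v: k for k, v in GOOGLE_OVERRIDES.items()}


def _language_normalize(alpha2: str) -> str:
    """Generic alpha-2 code -> the code this service expects."""
    return GOOGLE_OVERRIDES.get(alpha2, alpha2)


def _language_denormalize(code: str) -> str:
    """Service-specific code -> generic alpha-2 code."""
    return _GOOGLE_REVERSE.get(code, code)
```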