A rule-based checker for Khan Academy translations.
KATC is a set of scripts that validate Khan Academy translations against a predefined ruleset. The results are then rendered as a webpage.
KATC is written in pure Python 3 and requires a reasonably recent Python 3 interpreter. If none is available for your platform, I recommend Anaconda. To install the library dependencies, run pip3 install -r requirements.txt.
First, fill in your Crowdin credentials in crowdin-credentials.json (use crowdin-credentials.json as a template). I recommend not using your main Crowdin account. KATC performs no changes or translations with your account, but an account is required to download the POT files (see Architecture section below).
Then you can simply run the PO file update and the checker.
./UpdateAllFiles.py -j 64 # Creates 'de' directory
./check.py # Creates 'output' directory
Optionally, you can also update the YouTube subtitle mapping:
./YTSubtitles.py # Creates 'videos.json'
which will cause subtitles.html to be rendered by the next call to ./check.py.
Initially, KATC downloaded the ZIP file from Crowdin. However, it turned out that this file had not been updated for more than 6 months, and any update has to be triggered manually, which will happen infrequently at best and never at worst.
Therefore, KATC uses a request-based algorithm to automate the export/download of every single file. This step is performed with selectable concurrency. A very high concurrency of at least 32 is recommended.
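The concurrent download step could look roughly like the following minimal sketch. It is not the code from UpdateAllFiles.py: the names download_all and fetch are hypothetical, and fetch stands in for whatever request exports and downloads a single file, so that the sketch assumes nothing about the Crowdin API.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(filenames, fetch, concurrency=32):
    """Call fetch(filename) for every file using a pool of worker threads.

    `fetch` is a hypothetical callable that exports and downloads one file
    and returns its content. Threads are sufficient here because the work
    is dominated by waiting on the network.
    """
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # pool.map preserves input order, so results line up with filenames
        contents = list(pool.map(fetch, filenames))
    return dict(zip(filenames, contents))
```

Since each request spends most of its time waiting for the server, a high worker count mainly hides latency, which fits the recommendation of a concurrency of at least 32.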
KATC uses polib for parsing the downloaded PO file. As the parsing is quite slow, this step is performed in parallel on all CPUs.
After that, a static set of rules is computed in memory, whereby all regular expressions are pre-compiled.
Rules are defined as Python functors that express a specific call behaviour. Currently, most rules are simple regex searches. For example, a SimpleRegexRule emits a hit if the regex matches anywhere in the translated text. This hit is recorded in a list together with the hit string itself, the matching rule and the exact PO entry that caused the hit.
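To illustrate the functor idea, here is a minimal sketch of a regex rule and of how hits might be recorded. It is not the implementation in Rules.py: the constructor arguments, the Entry type and the apply_rule helper are assumptions made for this example. Only the attribute names msgid and msgstr are borrowed from polib entries.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for a parsed PO entry
Entry = namedtuple("Entry", ["msgid", "msgstr"])


class SimpleRegexRule:
    """Functor that reports a hit if the regex matches the translated text."""

    def __init__(self, name, regex):
        self.name = name
        self.regex = re.compile(regex)  # compiled once, reused for every entry

    def __call__(self, entry):
        match = self.regex.search(entry.msgstr)
        # Return the matched text, or None if the rule does not apply
        return match.group(0) if match else None


def apply_rule(rule, entries):
    """Collect (hit string, rule, entry) triples for all matching entries."""
    hits = []
    for entry in entries:
        hit = rule(entry)
        if hit is not None:
            hits.append((hit, rule, entry))
    return hits
```

For example, SimpleRegexRule("double space", r"  +") would flag any translated string that contains two or more consecutive spaces.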
Currently, only rules for the German translation exist. These rules are defined in de.py.
The PO files exported by Crowdin do not contain information on whether any given string is already translated or approved. Therefore, KATC assumes that strings that are equal to the msgid (i.e. the untranslated text) are not translated and must therefore be ignored.
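This heuristic can be sketched in a few lines. The Entry type and the function names below are hypothetical and not taken from check.py.

```python
from collections import namedtuple

# Hypothetical stand-in for a parsed PO entry
Entry = namedtuple("Entry", ["msgid", "msgstr"])


def is_translated(entry):
    """Treat an entry as translated only if msgstr differs from msgid."""
    return entry.msgstr != entry.msgid


def translated_entries(entries):
    """Keep only the entries that the heuristic considers translated."""
    return [entry for entry in entries if is_translated(entry)]
```

A side effect of this approach is that strings which are legitimately identical in both languages, such as a bare formula, are skipped as well.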
Although this is rarely used at the moment, there are rule wrappers that allow exceptions for a rule, and rules are fully combinable using boolean logic.
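One way to picture such wrappers is as small combinators over callables. The sketch below is an illustration only: all names are hypothetical, and rules are simplified to functions that take the translated text and return True or False, which is not necessarily how Rules.py models them.

```python
import re


def regex_rule(pattern):
    """Rule as a plain callable: True if the pattern matches the text."""
    compiled = re.compile(pattern)
    return lambda text: compiled.search(text) is not None


def all_of(*rules):
    """Boolean AND: hits only if every rule hits."""
    return lambda text: all(rule(text) for rule in rules)


def any_of(*rules):
    """Boolean OR: hits if at least one rule hits."""
    return lambda text: any(rule(text) for rule in rules)


def negate(rule):
    """Boolean NOT: hits exactly when the wrapped rule does not."""
    return lambda text: not rule(text)


def except_when(rule, exception):
    """Exception wrapper: the rule hits unless the exception also matches."""
    return all_of(rule, negate(exception))
```

With these, except_when(regex_rule(r"\d\.\d"), regex_rule(r"https?://")) would flag a decimal point in running text but stay silent when the string contains a URL.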
Rules are ordered by severity levels. These levels are not only used for coloring and
The hitlists are subsequently re-ordered into multiple distinct hit hierarchies (see check.py) that contain e.g. hits per rule per filename, hits per severity per file, hits per file per rule etc.
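The re-ordering can be thought of as building nested dictionaries from one flat hit list. The following sketch assumes a simplified hit representation of (rule, filename, hit) tuples and a hypothetical group_hits function. The actual structures in check.py may differ.

```python
from collections import defaultdict


def group_hits(hits):
    """Build two nested views from a flat list of (rule, filename, hit) tuples."""
    by_rule_then_file = defaultdict(lambda: defaultdict(list))
    by_file_then_rule = defaultdict(lambda: defaultdict(list))
    for rule, filename, hit in hits:
        # The same hit is indexed twice so each page can look it up directly
        by_rule_then_file[rule][filename].append(hit)
        by_file_then_rule[filename][rule].append(hit)
    return by_rule_then_file, by_file_then_rule
```

Keeping several pre-computed views trades memory for simplicity: each overview page can iterate over exactly the hierarchy it displays.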
KATC is a purely static page generator. This architecture was chosen for its simplicity, ease of integration and ease of use. All cached data is stored locally in the directory.
Note that a typical run with the German ruleset generates up to 400 MiB of static HTML data. Make sure that enough free space is available on the system. Also, do not run the updater on a mobile internet connection, as several hundred mebibytes of traffic might be generated.
The rendering is performed with Jinja2, using the templates stored in templates/. index.html is used to generate a total overview page and one overview page per file. For each rule (and each rule for each file individually), template.html is used to show all individual hits. jQuery Highlight is used to highlight the hits (only works for simple rules). subtitles.html is used if and only if videos.json (generated by YTSubtitles.py) is present. Currently, it only shows some basic information about each video and whether German subtitles are present.
Additionally, the filestats.json statistics API file is generated. This file is used by KALanguageReport.
KATC also contains an automatic lint report generator. This resolves the issue that the Khan Academy Lint CSV format contains newlines and is therefore hard to import into off-the-shelf tools like Excel.
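As a minimal illustration of why a proper CSV parser copes with such files, the sketch below reads CSV text with a line break inside a field using Python's csv module. The column names and sample data are made up, the function name read_lint_rows is hypothetical, and the sketch assumes that fields containing newlines are quoted.

```python
import csv
import io


def read_lint_rows(text):
    """Parse CSV text whose quoted fields may contain line breaks."""
    # newline="" disables newline translation so the csv module sees the
    # embedded line breaks unchanged
    return list(csv.reader(io.StringIO(text, newline="")))


# Made-up sample: the second record spans two physical lines
sample = 'id,message\n1,"first line\nsecond line"\n2,plain\n'
rows = read_lint_rows(sample)
```

Here the sample has four physical lines but parses into three records, because the quoted field keeps its line break instead of starting a new row.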
The current lint report can be fetched automatically from i18n-reports at Google Groups. Due to the complexity of the GWT-based Google Groups interface, the automation is done using Selenium. Due to unknown limitations, the group index can be fetched using PhantomJS, while the second step of fetching an individual post only works in actual browsers like Chrome or Firefox.
This means that, in order to update the lint file, you need to install Firefox. On servers, use xvfb-run to run the script. On desktops, you can try running it without any prefix:
xvfb-run ./LintReport.py
Note that running the lint report script only updates the lint CSV file and does not (re-)generate the lint HTML.
This process is entirely optional. The main HTML generator will automatically recognize whether the lint file exists and will only try to generate the lint page if it is present.
The report button uses the utils/report.php script, which sends me an e-mail if a user reports an entry as wrong. Remember to use your own e-mail address if you set up a customized instance of KATC.
Any contribution is highly appreciated. Filing issues and submitting pull requests on GitHub is preferred. Please only submit code that is compatible with the Apache License 2.0.
When trying to add or modify rules, have a look at de.py first. When trying to modify the rules code itself, look in Rules.py instead.
The main checker interface is coded in check.py. This file only needs to be modified for behavioural changes of the software.
Known members of KA translation teams and frequent contributors may gain direct push access to this repository upon request.
Copyright (C) 2015 Uli Köhler
KATC is distributed under the Apache 2.0 license.
Special thanks to alani1 (also the KALR author) and w0lfg, who helped make this a useful piece of software.