Commit

Mention the cleaner
jelmervdl committed Sep 21, 2023
1 parent 0311261 commit 174738f
Showing 1 changed file with 5 additions and 0 deletions.
5 changes: 5 additions & 0 deletions README.md
@@ -4,6 +4,10 @@ OpusCleaner is a machine translation/language model data cleaner and training sc
## Cleaner
The cleaner bit takes care of downloading and cleaning multiple different datasets and preparing them for translation.

```sh
opuscleaner-clean --parallel 4 data/train-parts/dataset.filter.json | gzip -c > clean.gz
```

### Installation for cleaning
If you just want to use OpusCleaner for cleaning, you can install it from PyPI, and then run it

@@ -42,6 +46,7 @@ Compare the dataset at different stages of filtering to see what the impact is o
- `data/train-parts` is scanned for datasets. You can change this by setting the `DATA_PATH` environment variable, the default is `data/train-parts/*.*.gz`.
- `filters` should contain filter json files. You can change the `FILTER_PATH` environment variable, the default is `<PYTHON_PACKAGE>/filters/*.json`.
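
To get a feel for which files the default `DATA_PATH` pattern `data/train-parts/*.*.gz` would pick up, here is a minimal shell sketch. It is not OpusCleaner's own code: it only illustrates plain shell glob semantics, and OpusCleaner may resolve the pattern differently internally. The helper `count_matches` and all file names are hypothetical.

```sh
# Sketch only: illustrates shell glob matching, not OpusCleaner internals.
# count_matches and every file name below are made up for this example.

# Print how many existing paths a glob pattern matches inside a directory.
# $1: directory to resolve the pattern in
# $2: glob pattern (expanded unquoted on purpose so the shell globs it)
count_matches() {
  (
    cd "$1" || { echo 0; exit; }
    set -- $2
    # If nothing matched, the pattern stays literal and does not exist.
    if [ -e "$1" ]; then echo "$#"; else echo 0; fi
  )
}

# Build a scratch tree with a few hypothetical files.
workdir=$(mktemp -d)
mkdir -p "$workdir/data/train-parts"
touch "$workdir/data/train-parts/example.en.gz" \
      "$workdir/data/train-parts/example.de.gz" \
      "$workdir/data/train-parts/single.gz" \
      "$workdir/data/train-parts/notes.txt"

# The documented default requires two dots before "gz",
# so single.gz and notes.txt are not counted.
count_matches "$workdir" 'data/train-parts/*.*.gz'
```

Under these assumptions, only names of the form `<name>.<something>.gz` are counted; a file with a single extension such as `single.gz` is skipped by the default pattern.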


### Installation for development
```sh
cd frontend