# nmt-wizard-docker

The goal of this project is to encapsulate MT frameworks in Docker containers and expose a single interface for preprocessing, training, translating, and serving models. The project targets advanced users who are looking to automate the production and validation of translation models.

The available Docker images extend the original frameworks with the following features:

* Data weighting, sampling, and preprocessing from raw training files.
* File synchronization from and to remote storages such as Amazon S3, Swift, or any server accessible via SSH.
* Metadata on model history, such as the parent model, the training data used, the training time, etc.
* Regular HTTP requests to a URL to report the running status.

It supports the OpenNMT-py and OpenNMT-tf training frameworks, and provides translate-only frameworks using the online translation APIs of DeepL, Google, Baidu, and others.

## Usage

We recommend using the Docker images that are available on Docker Hub.

The Docker image entrypoint is a Python script that exposes the same command line interface for all frameworks. For example, run the commands below to download the latest OpenNMT-tf image and list the available options:

```bash
docker pull nmtwizard/opennmt-tf
docker run nmtwizard/opennmt-tf -h
```

JSON configuration files are used to define the training data, the data preprocessing, and framework-specific options (model, hyperparameters, etc.).

As an example, let's train an English-German Transformer model with OpenNMT-tf.

**1. Prepare the data.**

For simplicity, we assume that the training data is already tokenized and that the current directory contains a `data/` subdirectory with the following structure:

```
$ tree data/
data/
├── corpus
│   ├── train.de
│   └── train.en
└── vocab
    └── shared-vocab.txt
```

where:

* `train.de` and `train.en` are tokenized training files,
* `shared-vocab.txt` contains one token per line and no special tokens.

**2. Define the configuration.**

The JSON configuration file describes where to read the data (`data`) and how to transform it (`preprocess`). The `options` block is specific to each framework and defines the model and training hyperparameters. See the Configuration section below for more details.

```json
{
    "source": "en",
    "target": "de",
    "data": {
        "sample_dist": [
            {
                "path": "/data/corpus",
                "distribution": [
                    ["train", "*"]
                ]
            }
        ]
    },
    "preprocess": [
        {
            "op": "tokenization",
            "source": {
                "mode": "space"
            },
            "target": {
                "mode": "space"
            }
        }
    ],
    "vocabulary": {
        "source": {
            "path": "/data/vocab/shared-vocab.txt"
        },
        "target": {
            "path": "/data/vocab/shared-vocab.txt"
        }
    },
    "options": {
        "model_type": "Transformer",
        "auto_config": true
    }
}
```

**3. Train the model.**

The `train` command is used to start the training:

```bash
cat config.json | docker run -i --gpus all \
    -v $PWD/data:/data -v $PWD/models:/models nmtwizard/opennmt-tf \
    --model_storage /models --task_id my_model_1 --config - train
```

This command runs the training for 1 epoch and produces the model `models/my_model_1`. The model contains the latest checkpoint, the JSON configuration, and all required resources (such as the vocabulary files).

You can run the next epoch by passing the model name as an argument:

```bash
docker run --gpus all -v $PWD/data:/data -v $PWD/models:/models nmtwizard/opennmt-tf \
    --model_storage /models --task_id my_model_2 --model my_model_1 train
```

Alternatively, you can run the full training in a single command by setting the training steps option available in the selected framework.

**4. Translate test files.**

Once a model is saved, you can start translating files with the `trans` command:

```bash
docker run --gpus all -v $PWD/data:/data -v $PWD/models:/models nmtwizard/opennmt-tf \
    --model_storage /models --model my_model_2 trans -i /data/test.en -o /data/output.de
```

This command translates `data/test.en` and saves the result to `data/output.de`.

**5. Serve the model.**

At some point, you may want to turn your trained model into a translation service with the `serve` command:

```bash
docker run -p 4000:4000 --gpus all -v $PWD/models:/models nmtwizard/opennmt-tf \
    --model_storage /models --model my_model_2 serve --host 0.0.0.0 --port 4000
```

This command starts a translation server that accepts HTTP requests:

```bash
curl -X POST http://localhost:4000/translate -d '{"src":[{"text": "Hello world !"}]}'
```

See the REST translation API for more details.
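
For instance, here is a minimal Python client for the server started above. This is only a sketch: it assumes the `requests` package is installed and that the response mirrors the request structure with a `tgt` field; check the REST translation API documentation for the exact schema.

```python
import requests

def translate(sentences, url="http://localhost:4000/translate"):
    """Translate a batch of pretokenized sentences via the HTTP server."""
    payload = {"src": [{"text": text} for text in sentences]}
    response = requests.post(url, json=payload, timeout=60)
    response.raise_for_status()
    # The response is assumed to contain a "tgt" entry with one list of
    # hypotheses per source sentence; see the REST API docs for details.
    return response.json()

print(translate(["Hello world !"]))
```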

To reduce the model size and loading latency, you can also release the model before serving it. Releasing removes training-only information and can run additional optimizations:

```bash
docker run -v $PWD/models:/models nmtwizard/opennmt-tf \
    --model_storage /models --model my_model_2 release
```

This command produces the serving-only model `models/my_model_2_release`.

## Remote storages

Files and directories can be automatically downloaded from remote storages such as Amazon S3, Swift, or any server via SSH. This includes training data, models, and resources used in the configuration.

Remote storages should be configured in a JSON file and passed to the `--storage_config` command line option:

```json
{
    "storage_id_1": {
        "type": "s3",
        "bucket": "model-catalog",
        "aws_credentials": {
            "access_key_id": "...",
            "secret_access_key": "...",
            "region_name": "..."
        }
    },
    "storage_id_2": {
        "type": "ssh",
        "server": "my-server.com",
        "basedir": "myrepo",
        "user": "root",
        "password": "root"
    }
}
```

See the supported services and their parameters in SYSTRAN/storages.

Paths on the command line and in the configuration can reference remote paths with the syntax `<storage_id>:<path>`, where:

* `<storage_id>` is a storage identifier from the JSON file above (e.g. `storage_id_1`),
* `<path>` is a file path on the remote storage.

Files are downloaded to the `/root/workspace/shared` directory inside the Docker container. To minimize download costs, you can mount this directory when running the Docker image: future runs will reuse the local copy of a file if the remote file has not changed.

## Configuration

The JSON configuration file contains all parameters necessary to train and run models. It has the following general structure:

```json
{
    "source": "string",
    "target": "string",
    "data": {},
    "preprocess": [
        {
            "op": "tokenization",
            "source": {
                "mode": "string"
            },
            "target": {
                "mode": "string"
            }
        }
    ],
    "vocabulary": {
        "source": {
            "path": "string"
        },
        "target": {
            "path": "string"
        }
    },
    "options": {},
    "serving": {}
}
```

### Description

#### `source` and `target`

They define the source and target languages (e.g. `"en"`, `"de"`, etc.).

#### `data`

The `data` section of the JSON configuration can be used to select training data based on file patterns. `sample_dist` is a list where each element contains:

* `path`: the path to a directory where the distribution applies,
* `distribution`: a list of filename patterns and weights.

For example, the configuration below will randomly select 10,000 training examples in the directory `/data/en_de/train` from files that have `News`, `IT`, or `Dialog` in their name:

```json
{
    "data": {
        "sample": 10000,
        "sample_dist": [
            {
                "path": "/data/en_de/train",
                "distribution": [
                    ["News", 0.7],
                    ["IT", 0.3],
                    ["Dialog", 0.5]
                ]
            }
        ]
    }
}
```

Weights define the relative proportion that each pattern should take in the final sampling; they do not need to sum to 1. The special weight `"*"` can be used to force using all the examples associated with the pattern.
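
To make the weighting concrete, here is how the weights above translate into per-pattern sample counts. This is only an illustration of the behavior described here, not the project's actual sampling code:

```python
# Weights are normalized to obtain the proportion of the final sample
# drawn from each pattern (using the configuration above).
weights = {"News": 0.7, "IT": 0.3, "Dialog": 0.5}
sample_size = 10000

total = sum(weights.values())  # 1.5: weights do not need to sum to 1
counts = {pattern: round(sample_size * w / total) for pattern, w in weights.items()}
print(counts)  # {'News': 4667, 'IT': 2000, 'Dialog': 3333}
```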

Source and target files should have the same base name and be suffixed with the language code (e.g. `train.en` and `train.de`).

#### `preprocess`

This block applies preprocessing operations, such as tokenization, to the data.

The tokenization operator accepts any tokenization options from OpenNMT/Tokenizer.

#### `vocabulary`

This block contains the vocabularies used by the translation framework.

The vocabulary file must have the following format:

* one token per line,
* no special tokens (such as `<s>`, `</s>`, `<blank>`, `<unk>`, etc.).

We plan to add a `buildvocab` command to automatically generate it from the data.
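
In the meantime, a vocabulary in this format can be generated with a few lines of Python. This is a sketch that assumes tokenized input files like the ones used above; the vocabulary size and paths are arbitrary:

```python
from collections import Counter

# Build a shared vocabulary from tokenized files: count token frequencies
# and keep the most frequent ones, one token per line, no special tokens.
counter = Counter()
for path in ("data/corpus/train.en", "data/corpus/train.de"):
    with open(path, encoding="utf-8") as f:
        for line in f:
            counter.update(line.split())

with open("data/vocab/shared-vocab.txt", "w", encoding="utf-8") as f:
    for token, _ in counter.most_common(50000):
        f.write(token + "\n")
```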

#### `options`

This block contains the parameters that are specific to the selected framework: the model architecture, the training parameters, etc.

See the `README.md` file of the selected framework in the `frameworks/` directory for more information.

#### `serving`

Serving reads additional values from the JSON configuration:

```json
{
    "serving": {
        "max_batch_size": 64
    }
}
```

where:

* `max_batch_size` is the maximum batch size to execute at once.

These values can be overridden for each request.

### Overriding the model configuration

When a model is set on the command line with `--model`, its configuration will be used. You can pass a partial configuration to `--config` in order to override some fields from the model configuration.

### Environment variables

Values in the configuration can use environment variables with the syntax `${VARIABLE_NAME}`.

This is especially useful to avoid hardcoding a remote storage identifier in the configuration. For example, one can define a data path as `${DATA_DIR}/en_de/corpus` in the configuration and then set the storage identifier when running the Docker image with `-e DATA_DIR=storage_id_1:`.
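
As an illustration of the substitution behavior (a sketch, not the project's actual code), such placeholders can be expanded from the process environment like this:

```python
import os
from string import Template

# Expand ${VARIABLE_NAME} placeholders in a configuration value using the
# process environment, mimicking the behavior described above.
os.environ["DATA_DIR"] = "storage_id_1:"
path = Template("${DATA_DIR}/en_de/corpus").substitute(os.environ)
print(path)  # storage_id_1:/en_de/corpus
```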

## Add or extend frameworks

This repository consists of a Python module, `nmtwizard`, that implements the shared interface and the extended features mentioned above. Each framework should then:

* extend the `nmtwizard.Framework` class and implement the logic that is specific to the framework,
* define a Python entrypoint script that invokes `Framework.run()`,
* define a `Dockerfile` that includes the framework, the `nmtwizard` module, and all dependencies.

Advanced users could extend existing frameworks to implement customized behavior.
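
The skeleton of a new framework could look like the sketch below. The method names and signatures are hypothetical; the actual abstract methods to override are defined by the `Framework` class in the `nmtwizard` module:

```python
from nmtwizard import Framework  # class referenced in this README

class MyFramework(Framework):
    """Minimal skeleton of a framework integration (hypothetical methods)."""

    def train(self, config, src_file, tgt_file, model_path):
        # Invoke the underlying toolkit's training here and write the
        # resulting checkpoint under model_path.
        raise NotImplementedError

    def trans(self, config, model_path, input_file, output_file):
        # Load the checkpoint from model_path and translate input_file
        # into output_file.
        raise NotImplementedError

if __name__ == "__main__":
    # The entrypoint script simply delegates to the shared CLI logic.
    MyFramework().run()
```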

## Run tests

```bash
# Install test requirements:
pip install -e .[tests]

# Run unit tests:
pytest test/

# Automatically reformat code:
black .

# Check code for errors:
flake8 .
```