In this exercise we will look at how to use Data Version Control (DVC) with Annif. DVC is a command-line tool designed to ease the version control of machine learning projects. It works on top of git.
DVC is included in the Virtualbox and Docker images of this tutorial; on your own computer it can be installed with pip or pipx from PyPI. See dvc.org for details.
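For example, on your own machine DVC can be installed from PyPI with either of these commands:
pip install dvc
pipx install dvc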
Note that when using the Docker image the Annif-tutorial-dvc/ directory is not persisted unless it is explicitly mounted to the host in the docker run command, so the git commits below won't work as given; for testing the basic DVC functionality the commits can simply be omitted.
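As a rough sketch, persisting the directory with a bind mount could look something like the following; the image name and container-side path are placeholders here and should be adjusted to match how you started the tutorial container:
docker run -it -v "$PWD/Annif-tutorial-dvc":/Annif-tutorial-dvc <annif-tutorial-image> bash  # placeholder image name and mount path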
DVC offers many functionalities for working with machine-learning models. In the case of Annif, perhaps the most valuable feature is the possibility to build data pipelines, which consist of separate stages. The whole pipeline can then be conveniently operated on with dvc commands.
For example, the stages of a pipeline can be
- load the vocabulary,
- train the model,
- evaluate the model.
This organization of steps could easily be achieved with a plain bash script as well. The benefit of using DVC is that it tracks the inputs and outputs of each stage and caches the results: on reruns of the pipeline, if the inputs of a stage have not changed, that stage is skipped and its output is taken from the cache.
The DVC tracking works together with git, which allows efficient version control of large datasets. DVC also eases tuning models through its experiment management features.
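Experiment management is not needed in this exercise, but as a pointer: once a pipeline exists, experiments can be run and compared with commands such as
dvc exp run
dvc exp show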
In this exercise a DVC pipeline is set up for some of the projects seen in the previous exercises.
It is best to create a new git repository Annif-tutorial-dvc
for this:
pwd # Check that you start in the Annif-tutorial directory
cd .. # And move one level up from there
git init Annif-tutorial-dvc
cd Annif-tutorial-dvc/
Run these commands to create the internal DVC directory (.dvc) and the ignore file (.dvcignore) and commit them:
dvc init
git commit -m "initialize DVC"
Enable DVC autostage, which avoids the need to run git add
commands:
dvc config core.autostage true
git commit .dvc/config -m "enable DVC autostage"
Create directories for the metric reports and training data to be used in the pipeline:
mkdir corpora reports
Import the vocabulary, short-text and full-text documents.
If you use the yso-nlf
data set, run these commands:
cd corpora
dvc import-url ../../Annif-tutorial/data-sets/yso-nlf/yso-skos.ttl skos.ttl
dvc import-url ../../Annif-tutorial/data-sets/yso-nlf/yso-finna.tsv.gz shorttext-docs.tsv.gz
dvc import-url ../../Annif-tutorial/data-sets/yso-nlf/docs/ docs
If you use the stw-zbw
data set, run these commands:
cd corpora
dvc import-url ../../Annif-tutorial/data-sets/stw-zbw/stw-skos.ttl skos.ttl
dvc import-url ../../Annif-tutorial/data-sets/stw-zbw/stw-econbiz.tsv.gz shorttext-docs.tsv.gz
dvc import-url ../../Annif-tutorial/data-sets/stw-zbw/docs/ docs
See what's happened:
dvc status
Unfreeze the "frozen" .dvc files (imported data is frozen by default, meaning dvc repro would skip checking it; unfreezing lets DVC verify it as part of the pipeline):
dvc unfreeze docs.dvc shorttext-docs.tsv.gz.dvc skos.ttl.dvc
Commit changes:
cd ..
git commit -am "import corpora"
A DVC pipeline is defined in a YAML file, named dvc.yaml by default.
DVC can also intelligently track (hyper)parameter changes in a parameters file, which in the case of Annif is the projects configuration file. However, DVC does not support the INI/CFG format for this, so the configuration file has to be in TOML format.
Create a projects configuration file projects.toml containing the configurations for the tfidf, MLLM, and ensemble projects.
Show configurations to use with the yso-nlf data set
[yso-tfidf-en]
name = "YSO TFIDF project"
language = "en"
backend = "tfidf"
vocab = "yso"
analyzer = "snowball(english)"
[yso-mllm-en]
name = "YSO MLLM project"
language = "en"
backend = "mllm"
vocab = "yso"
analyzer = "snowball(english)"
[yso-ensemble-en]
name = "YSO ensemble project"
language = "en"
backend = "ensemble"
vocab = "yso"
sources = "yso-tfidf-en,yso-mllm-en:2"
Show configurations to use with the stw-zbw
data set
[stw-tfidf-en]
name = "STW TFIDF project"
language = "en"
backend = "tfidf"
vocab = "stw"
analyzer = "snowball(english)"
[stw-mllm-en]
name = "STW MLLM project"
language = "en"
backend = "mllm"
vocab = "stw"
analyzer = "snowball(english)"
[stw-ensemble-en]
name = "STW ensemble project"
language = "en"
backend = "ensemble"
vocab = "stw"
sources = "stw-tfidf-en,stw-mllm-en:2"
Create dvc.yaml with the following contents, and uncomment the line defining the vocab variable depending on the data set you use:
vars:
  #- vocab: 'stw' # Uncomment this if you are using the STW dataset
  #- vocab: 'yso' # Uncomment this if you are using the YSO dataset
  - docs: 100 # Max. number of documents to use

stages:
  # Load vocabulary
  load-vocab:
    cmd: annif load-vocab ${vocab} corpora/skos.ttl --force
    deps:
      - corpora/skos.ttl
    outs:
      - data/vocabs/${vocab}

  # Train tfidf project
  train-tfidf:
    cmd: annif train ${vocab}-tfidf-en corpora/shorttext-docs.tsv.gz -d ${docs}
    deps:
      - corpora/shorttext-docs.tsv.gz
      - data/vocabs/${vocab}
    params:
      - projects.toml:
          - ${vocab}-tfidf-en
    outs:
      - data/projects/${vocab}-tfidf-en

  # Train MLLM project
  train-mllm:
    cmd: annif train ${vocab}-mllm-en corpora/docs/train -d ${docs}
    deps:
      - corpora/docs/train
      - data/vocabs/${vocab}
    params:
      - projects.toml:
          - ${vocab}-mllm-en
    outs:
      - data/projects/${vocab}-mllm-en

  # Dummy "train" ensemble, i.e. create the datadir needed as dependency in eval
  dummy-train-ensemble:
    cmd: mkdir -p data/projects/${vocab}-ensemble-en
    deps:
      - corpora/docs/train
      - data/vocabs/${vocab}
      - data/projects/${vocab}-tfidf-en
      - data/projects/${vocab}-mllm-en
    outs:
      - data/projects/${vocab}-ensemble-en

  # Evaluate projects in a loop
  eval-en:
    foreach:
      - ${vocab}-mllm-en
      - ${vocab}-tfidf-en
      - ${vocab}-ensemble-en
    do:
      cmd:
        - annif eval ${item} -m F1@5 -m NDCG --metrics-file reports/${item}.json corpora/docs/test/ -d ${docs}
      deps:
        - corpora/docs/test
        - data/projects/${item}
      params:
        - projects.toml:
            - ${item}
      metrics:
        - reports/${item}.json:
            cache: false
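If you want to inspect the stage graph DVC builds from this file, it can be printed with
dvc dag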
Now the pipeline is set up, and can be executed with one command:
dvc repro
Running the pipeline as such takes only about 10 minutes, because the docs variable limits the number of documents used in the stages to 100. If docs is increased, e.g. to 10000000 to cover all available (short-text) documents, reproducing the pipeline takes 40-50 minutes.
When finished, the evaluation reports can be shown using DVC:
dvc metrics show
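If you later retrain the projects, the new scores can be compared against the last committed ones with
dvc metrics diff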
Finally, commit the changes with git:
git add reports/
git commit -am "set up, train & evaluate projects"
Congratulations, you've completed Exercise 15: you now have a working DVC pipeline that provides an easy and reproducible way to train and evaluate Annif projects!