This page gives an overview of the Annif tutorial contents. Video-only lectures are prefixed with 🎞️, exercises marked with 💻 require some coding, and those marked with 📖 are for reading only.
In the diagram below, exercises drawn with thick borders and a blue background are core; the others are optional extras.
```mermaid
%%{init: {"flowchart": {"nodeSpacing": 20, "rankSpacing": 45, "useMaxWidth": true, "curve": "linear"}}}%%
flowchart TD
    classDef core fill:#ADD8E6,stroke:#000000,stroke-width:2px;
    classDef optional fill:#ffffff,stroke:#000000,stroke-width:1px;
    install([install]) --> tfidf([TFIDF])
    tfidf --> webui([Web UI])
    webui --> eval([evaluate])
    eval --> mllm([MLLM])
    mllm --> ensemble([ensemble])
    ensemble --> nn_ensemble([NN ensemble])
    ensemble --> custom([Custom corpus])
    ensemble --> dvc([DVC])
    mllm --> ft([Hogwarts/fastText])
    mllm --> lang_filter([Languages & filtering])
    webui --> rest([REST API])
    rest --> production([Production use])
    eval --> omikuji([Omikuji])
    omikuji --> classification([Classification])
    class install,tfidf,webui,eval,mllm,ensemble core
    class lang_filter,dvc,rest,production,omikuji,classification,ft,custom,nn_ensemble optional
```
Select your installation type. If you don’t know what to choose, we suggest using VirtualBox.
This tutorial provides two example data sets; choose one of them to use in the exercises.
The basic functionality of Annif is introduced by setting up and training a project using a TFIDF model.
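The TFIDF model is based on comparing TF-IDF weighted term vectors. The following is only a toy sketch of that underlying idea, not Annif's actual implementation:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute simple TF-IDF weight vectors for a list of token lists."""
    n = len(docs)
    df = Counter()  # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors given as dicts."""
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0
```

In this picture, each subject is represented by a vector built from its training documents, and new documents are assigned the subjects whose vectors they are most similar to.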
The principles of the algorithm types used by Annif models are presented.
Slides on associative algorithms for XMTC (by CSC's @jmakoske & @mvsjober):
The web user interface of Annif allows quick testing of projects.
The REST API of Annif can be used for integrating Annif with other systems.
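For instance, subject suggestions can be requested over HTTP. A minimal client sketch, assuming a local Annif instance at `http://localhost:5000` with the default `/v1` API root; the project id `tfidf-en` is a placeholder:

```python
import json
import urllib.parse
import urllib.request

API_ROOT = "http://localhost:5000/v1"  # assumed local Annif instance

def build_suggest_request(project_id, text, limit=10):
    """Build a POST request for a project's suggest endpoint."""
    url = f"{API_ROOT}/projects/{project_id}/suggest"
    data = urllib.parse.urlencode({"text": text, "limit": limit}).encode()
    return urllib.request.Request(url, data=data, method="POST")

def suggest(project_id, text, limit=10):
    """Send the request and return the suggested subjects (list of dicts)."""
    with urllib.request.urlopen(build_suggest_request(project_id, text, limit)) as resp:
        return json.load(resp)["results"]
```

See the interactive API documentation of your Annif instance for the exact endpoints and response fields.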
This section describes aspects to consider when moving from the testing and development phase to a production-ready deployment of Annif.
Quantitative testing and comparison of projects against standard metrics can be done using the `eval` command.
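Among the reported metrics are precision, recall, and F1 score. As a toy illustration of how F1 relates suggested subjects to a gold standard (not Annif's implementation):

```python
def f1_score(suggested, gold):
    """F1 score between a set of suggested subjects and the gold-standard set."""
    suggested, gold = set(suggested), set(gold)
    tp = len(suggested & gold)  # true positives: correct suggestions
    if tp == 0:
        return 0.0
    precision = tp / len(suggested)  # fraction of suggestions that are correct
    recall = tp / len(gold)          # fraction of gold subjects that were found
    return 2 * precision * recall / (precision + recall)
```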
Omikuji is a tree-based associative machine learning model that often produces very good results, but requires more resources than the TFIDF model. This exercise is optional, because training an Omikuji model on the full datasets can take around 40 minutes.
MLLM is a lexical algorithm for matching terms in document text to terms in a controlled vocabulary.
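MLLM itself is far more sophisticated (it lemmatizes, scores, and classifies match candidates), but the core lexical-matching idea can be caricatured like this, assuming a vocabulary given as a label-to-URI mapping:

```python
def match_terms(text, vocabulary):
    """Return URIs of vocabulary labels that occur in the text.

    Naive substring matching only; a real lexical algorithm would
    normalize word forms and score candidate matches.
    """
    text_lower = " ".join(text.lower().split())
    hits = []
    for label, uri in vocabulary.items():
        if label.lower() in text_lower:
            hits.append(uri)
    return hits
```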
Yet another algorithm you can try is fastText, which can also work at the level of individual characters.
The ability of Annif to process text in a given language depends on the choice of the analyzer, which performs text preprocessing. Sometimes it might be useful to filter out parts of the document that are not in the main language of the document.
An ensemble project combines results from the projects set up in previous exercises.
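A simple ensemble can combine the base projects' suggestion scores by (weighted) averaging; a toy sketch of that combination step:

```python
def average_scores(results, weights=None):
    """Combine per-subject score dicts from several projects by weighted mean.

    `results` is a list of {subject: score} dicts, one per base project.
    """
    if weights is None:
        weights = [1.0] * len(results)  # equal weights by default
    total = sum(weights)
    combined = {}
    for scores, weight in zip(results, weights):
        for subject, score in scores.items():
            combined[subject] = combined.get(subject, 0.0) + weight * score
    return {subject: value / total for subject, value in combined.items()}
```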
A neural network ensemble can be trained to intelligently combine the results from the base projects.
A big challenge in applying Annif to your own data is gathering documents and converting them into a corpus in a suitable format. In this exercise, metadata from arXiv articles is used to build a corpus that can be used to train Annif models.
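One corpus format Annif accepts for short documents is a TSV file where each line contains the document text, a tab, and the subject URIs in angle brackets. A small sketch of formatting records into such lines (check the Annif wiki for the authoritative format description):

```python
def make_tsv_corpus(records):
    """Format (text, uris) pairs as TSV corpus lines.

    Each output line holds the document text, a tab, and the
    subject URIs wrapped in angle brackets, separated by spaces.
    """
    lines = []
    for text, uris in records:
        clean = " ".join(text.split())  # lines must not contain tabs or newlines
        subjects = " ".join(f"<{uri}>" for uri in uris)
        lines.append(f"{clean}\t{subjects}")
    return "\n".join(lines)
```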
Data Version Control (DVC) makes maintaining machine learning projects easier. In this exercise, a DVC pipeline is used to set up, train, and evaluate Annif projects.
A summary of the material in the tutorial and some pointers to further information.