We are excited to share that we presented and published our PETGUI poster this year at the 69th annual conference of the "Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie" (GMDS) in Dresden, Germany:
Fig.1 - Our PETGUI poster at this year's 69th annual GMDS conference.

We present PETGUI (Pattern-Exploiting Training GUI), a user-friendly graphical user interface for training, testing and labeling with pre-trained masked language models using Pattern-Exploiting Training, a state-of-the-art machine learning framework for text classification tasks using few-shot learning and prompting. Concretely, PETGUI facilitates a multistep pipeline of training and testing on labeled data, followed by annotating unlabeled data, in a comprehensible and intuitive way. PETGUI also provides insights into the training, with statistics on label distribution and model performance. We envision our app as a pivotal use case of a simple machine learning application that is accessible to and manageable by users without domain-specific knowledge, in our case physicians in clinical routine.
- Pattern Exploiting Training
- 🧰 PETGUI Requirements
- ⚙️ PETGUI Setup
- 🛫 Start PETGUI
- 👟 Run PETGUI
- ➕ Features
- ➖ Limitations
- 🗃️ References
PET (Pattern-Exploiting Training) is a semi-supervised training strategy for language models. By reformulating input examples as cloze-style phrases, it has been shown to significantly outperform standard supervised training (Schick et al., 2021), which makes it especially valuable in low-resource settings such as the German clinical domain (Richter-Pechanski et al., 2023).
Fig.2 - Illustration of the PET workflow, see Schick et al., 2021.

In this illustration, the pattern "It was ___ ." is a cloze-style phrase that tells the model in natural language what the task is about, in this case sentiment classification.
PET works as follows: First, a pretrained language model is trained on each such pattern (1).
Secondly, an ensemble of these models annotates unlabeled training data (2).
Finally, a classifier is trained on the resulting soft-labeled dataset (3).
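The three steps can be illustrated with a small, model-free sketch. The pattern string, the verbalizer words and the numeric scores below are made up for illustration; the sketch combines the models by taking a uniform average of their scores and applying a softmax, and it omits refinements of the published method such as weighting individual models.

```python
import math

# Hypothetical pattern and verbalizer, modelled on the sentiment example above.
PATTERN = "{text} It was {mask}."
VERBALIZER = {"negative": "bad", "positive": "good"}


def to_cloze(text, mask_token="[MASK]"):
    """Reformulate an input example as a cloze-style phrase."""
    return PATTERN.format(text=text, mask=mask_token)


def softmax(scores):
    """Turn raw scores into a probability distribution."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_label(per_model_scores):
    """Average the scores of several pattern-specific models, then normalize."""
    n_models = len(per_model_scores)
    mean_scores = [sum(column) / n_models for column in zip(*per_model_scores)]
    return softmax(mean_scores)


# Made-up verbalizer-token scores from three models for one unlabeled example,
# ordered as (negative, positive).
scores = [[0.2, 1.9], [0.5, 1.1], [1.0, 1.2]]
print(to_cloze("Best pizza ever!"))
print(dict(zip(VERBALIZER, soft_label(scores))))
```

The resulting distribution over labels is what step (3) would use as a soft target when training the final classifier.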
- A Linux host system
- A connection to a remote Slurm cluster with GPUs, accessible via LDAP
- Docker=1.5-2
- Python=3.11
- Torch=2.1.1 (on the remote Slurm cluster)
To run PETGUI on your machine, you need:
- A working connection to a remote Slurm cluster.
- LDAP credentials for accessing the remote Slurm cluster.
- The CA certificate file for the remote Slurm cluster.
- In a terminal, git clone the repo and change directory to it.
- Adapt the Slurm configuration SBATCH lines of train.sh and predict.sh for your remote Slurm cluster:
#SBATCH --partition=gpu
#SBATCH --gres=gpu:pascal:1
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -c 2
#SBATCH --mem=16G
#SBATCH --job-name=petgui
- Adapt conf.yaml to the LDAP server specifications for your remote Slurm cluster:
"CLUSTER_NAME" : "cluster.ORGANISATION-NAME.org"
"LDAP_SERVER" : 'ldap://ldap2.ORGANISATION-NAME.org'
"CA_FILE" : 'ORGANISATION-NAME_CA.pem'
"USER_BASE" : 'dc=ORGANISATION-NAME,dc=org'
"LDAP_SEARCH_FILTER" : '({name_attribute}={name})'
- Move the certificate file of the server to the /conf directory (example reference file).
- Build the Docker image:
docker build . -t petgui
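The LDAP_SEARCH_FILTER entry in conf.yaml is a template with two placeholders. The sketch below shows one way such a template could be filled in. Whether PETGUI fills it exactly like this is an assumption, as are the attribute name "uid" and the helper names; the escaping follows the usual rules for special characters in LDAP filter values.

```python
# Template string as it appears in conf.yaml.
LDAP_SEARCH_FILTER = "({name_attribute}={name})"

# Characters with a special meaning inside LDAP filter values.
_ESCAPES = {"\\": r"\5c", "*": r"\2a", "(": r"\28", ")": r"\29", "\0": r"\00"}


def escape_filter_value(value):
    """Escape special characters so user input cannot alter the filter."""
    return "".join(_ESCAPES.get(ch, ch) for ch in value)


def build_search_filter(name, name_attribute="uid"):
    """Fill the template; "uid" is an assumed attribute name."""
    return LDAP_SEARCH_FILTER.format(
        name_attribute=name_attribute,
        name=escape_filter_value(name),
    )


print(build_search_filter("jdoe"))
```

If your directory identifies users by a different attribute, pass it as name_attribute.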
- Change directory to repository:
cd /PETGUI
- Run the docker container:
docker run --name petgui -p 89:89 --mount type=bind,source=./conf,target=/home/appuser/conf petgui
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:89 (Press CTRL+C to quit)
- Open http://localhost:89 in a browser.
You have successfully started PETGUI! To run it, follow the steps below.
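If the page does not load, it can help to check whether anything is listening on the published port. The following is a minimal sketch using only the standard library; the function name is hypothetical, and the defaults simply mirror the host and port from the docker run command above.

```python
import socket


def port_is_open(host="localhost", port=89, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False


if __name__ == "__main__":
    print("PETGUI reachable:", port_is_open())
```

A result of False usually means the container is not running or the port mapping differs from the one shown above.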
Steps | What you will see |
---|---|
1. Log in with the LDAP credentials for your remote Slurm cluster. | |
2. Enter the training parameters. For the German few-shot sample data, use SAMPLE: 1, LABEL: 0, TEMPLATE: Es war _ . (include the underscore character "_" as a separator in the template; click "+" to add more) and VERBALIZER: 1 schlecht, 2 gut. Choose one of the predefined language models: gbert-base or medbert. Click View Data to see statistics on your data as label distribution plots. | |
3. Click Submit to proceed, then Start Training to start the model training. You may Abort the process, which terminates training and takes you back to step 2. | |
4. Click Show Results to view the model results: accuracy per pattern, as well as precision, recall, F1 score, and support per label. | |
5. Choose either to re-train with new parameters (Run with new configuration) or to continue with the trained model for labeling unseen data (Annotate unseen data). | |
6. Test the model on evaluation data (sample data): upload the unlabeled data as a csv file and make sure that the first column is empty in every data line. Predict Labels Using PET Model starts the prediction process. When it is complete, click Download Predicted Data. | |
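
The file layout required in step 6 can be produced and checked with a few lines of Python. This is a sketch under assumptions: it places the text in the second column and writes no header row, neither of which is stated above, and both function names are hypothetical.

```python
import csv
import io


def to_unlabeled_csv(texts):
    """Serialize texts as CSV rows whose first column is left empty."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for text in texts:
        # Assumption: the text to classify goes into the second column.
        writer.writerow(["", text])
    return buffer.getvalue()


def first_column_is_empty(csv_text):
    """Check that every row exists and starts with an empty field."""
    rows = csv.reader(io.StringIO(csv_text))
    return all(bool(row) and row[0] == "" for row in rows)


sample = to_unlabeled_csv(["Der Film war toll.", "Essen, Trinken und mehr"])
print(sample)
print(first_column_is_empty(sample))
```

The csv module quotes fields that contain commas, so free text does not shift the columns.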
In the terminal, press Ctrl + C to stop the running "uvicorn" process:
^CINFO: Shutting down
INFO: Waiting for application shutdown.
INFO: Application shutdown complete.
INFO: Finished server process [1]
- To restart PETGUI, stop and remove the container in the terminal, then repeat the docker run... command from step 2 of Start PETGUI:
docker stop petgui
docker rm petgui
PETGUI provides an intuitive GUI for the PET workflow. Concretely, with PETGUI you can:
- Display statistics on label distribution of the training data
- Train either bert-base-cased or medbert-512 on a labeled dataset
- Display statistics on the model performance
- Test the trained model to generate predictions on unseen data
- Download the labeled file
In its current form, PETGUI is bound by the following requirements, which we may further simplify in future work:
- Connection to remote Slurm cluster: You must have a working connection to a remote Slurm cluster.
- File format and naming convention: The provided training data must be a tar.gz file containing train.csv, test.csv and unlabeled.csv, like our sample training data. The evaluation data must be a comma-separated .txt file with the first column empty throughout, like our sample test data.
- Verbalizer mapping: The tokenizer splits words into sub-words, e.g. "Langeswort" becomes "Langes" and "##wort".
Each verbalizer must map to a single input ID, so the user must provide a word or sub-word that the model vocabulary contains as a single token.
We plan to add user feedback to ensure correct input.
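The verbalizer constraint can be illustrated with a toy sub-word splitter. The sketch below uses a greedy longest-match split in the style of WordPiece, with the "##" continuation prefix common in BERT-style vocabularies, and a tiny made-up vocabulary. It is not the tokenizer of gbert-base or medbert, and the function names are hypothetical.

```python
# Hypothetical miniature vocabulary; a real model vocabulary is far larger.
TOY_VOCAB = {"gut", "schlecht", "Langes", "##wort"}


def wordpiece(word, vocab):
    """Greedy longest-match-first split of a word into sub-words."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            chunk = word[start:end]
            piece = chunk if start == 0 else "##" + chunk
            if piece in vocab:
                break
            end -= 1
        else:
            # No prefix of the remaining text is in the vocabulary.
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces


def is_valid_verbalizer(word, vocab):
    """A verbalizer is usable only if it maps to exactly one known token."""
    pieces = wordpiece(word, vocab)
    return len(pieces) == 1 and pieces[0] != "[UNK]"


for candidate in ("gut", "Langeswort", "prima"):
    print(candidate, wordpiece(candidate, TOY_VOCAB), is_valid_verbalizer(candidate, TOY_VOCAB))
```

With a real model, the same check would be run against the tokenizer of the chosen language model instead of a toy vocabulary.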
- Timo Schick and Hinrich Schütze. (2021). Exploiting Cloze Questions for Few-Shot Text Classification and Natural Language Inference. arXiv preprint arXiv:2001.07676.
- Timo Schick. (2023). Pattern-Exploiting Training (PET) GitHub repository
- Richter-Pechanski P, Wiesenbach P, Schwab DM, Kiriakou C, He M, Geis NA, Frank A, Dieterich C. Few-Shot and Prompt Training for Text Classification in German Doctor's Letters. Stud Health Technol Inform. 2023 May 18;302:819-820. doi: 10.3233/SHTI230275. PMID: 37203504.
- Gesundheit – gemeinsam. Kooperationstagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie (GMDS), Deutschen Gesellschaft für Sozialmedizin und Prävention (DGSMP), Deutschen Gesellschaft für Epidemiologie (DGEpi), Deutschen Gesellschaft für Medizinische Soziologie (DGMS) und der Deutschen Gesellschaft für Public Health (DGPH) 08.09. - 13.09.2024, Dresden