
GlossLM

A Massively Multilingual Corpus and Pretrained Model for Interlinear Glossed Text

📄 Paper | 📦 Models and data

Overview

GlossLM consists of a multilingual corpus of IGT and pretrained models for interlinear glossing.

Background

Interlinear Glossed Text (IGT) is a common format in language documentation projects. It looks something like this:

o sey x  -tok    r  -ixoqiil
o sea COM-buscar E3S-esposa
“O sea busca esposa.”

The three lines are as follows:

  • The transcription line is a sentence in the target language (possibly segmented into morphemes)
  • The gloss line gives a linguistic gloss for each morpheme in the transcription
  • The translation line is a translation into a higher-resource language

Creating consistent and large IGT datasets is time-consuming and error-prone. The goal of automated interlinear glossing is to aid in this task by predicting the gloss line given the transcription and translation lines.

Using the dataset

import datasets

glosslm_corpus = datasets.load_dataset("lecslab/glosslm-corpus")
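
To get a feel for what the corpus contains, you can inspect its splits and columns. The sketch below uses only generic datasets APIs; the split and column names it prints come from the dataset itself, so check the dataset card on the Hub for the authoritative schema.

import datasets

glosslm_corpus = datasets.load_dataset("lecslab/glosslm-corpus")

# Print each split with its size and column names
for split_name, split in glosslm_corpus.items():
    print(split_name, len(split), split.column_names)

# Look at a single example from the first split
first_split = next(iter(glosslm_corpus.values()))
print(first_split[0])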

Using the model

import transformers

# Your inputs
transcription = "o sey xtok rixoqiil"
translation = "O sea busca esposa."
lang = "Uspanteco"
metalang = "Spanish"
is_segmented = False

prompt = f"""Provide the glosses for the following transcription in {lang}.

Transcription in {lang}: {transcription}
Transcription segmented: {is_segmented}
Translation in {metalang}: {translation}\n
Glosses: 
"""

# Load the pretrained GlossLM model and the byte-level ByT5 tokenizer
model = transformers.T5ForConditionalGeneration.from_pretrained("lecslab/glosslm")
tokenizer = transformers.ByT5Tokenizer.from_pretrained("google/byt5-base", use_fast=False)

# Tokenize the prompt, generate, and decode the predicted gloss line
inputs = tokenizer(prompt, return_tensors="pt")
outputs = tokenizer.batch_decode(model.generate(**inputs, max_length=1024), skip_special_tokens=True)
print(outputs[0]) # o sea COM-buscar E3S-esposa
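
If you have several sentences to gloss (or a GPU available), the same prompt template can be batched. This is a minimal sketch: make_prompt is a hypothetical helper that just reimplements the template shown above, and the device/padding handling is generic transformers usage, not a workflow prescribed by the repo.

import torch
import transformers

def make_prompt(transcription, translation, lang, metalang, is_segmented):
    # Reimplements the prompt template from the snippet above
    return (
        f"Provide the glosses for the following transcription in {lang}.\n\n"
        f"Transcription in {lang}: {transcription}\n"
        f"Transcription segmented: {is_segmented}\n"
        f"Translation in {metalang}: {translation}\n\n"
        f"Glosses: \n"
    )

model = transformers.T5ForConditionalGeneration.from_pretrained("lecslab/glosslm")
tokenizer = transformers.ByT5Tokenizer.from_pretrained("google/byt5-base", use_fast=False)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()

prompts = [
    make_prompt("o sey xtok rixoqiil", "O sea busca esposa.", "Uspanteco", "Spanish", False),
    # ... add more sentences here
]

# Pad to a common length so the whole batch can be generated in one call
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(device)
with torch.no_grad():
    generated = model.generate(**inputs, max_length=1024)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))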

License

Citation

@inproceedings{ginn-etal-2024-glosslm,
    title = "{G}loss{LM}: A Massively Multilingual Corpus and Pretrained Model for Interlinear Glossed Text",
    author = "Ginn, Michael  and
      Tjuatja, Lindia  and
      He, Taiqi  and
      Rice, Enora  and
      Neubig, Graham  and
      Palmer, Alexis  and
      Levin, Lori",
    editor = "Al-Onaizan, Yaser  and
      Bansal, Mohit  and
      Chen, Yun-Nung",
    booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.emnlp-main.683",
    pages = "12267--12286",
}

Running Experiments

Interested in replicating or modifying our experiments? Pretraining, finetuning, and inference are handled with run.py, as follows:

python run.py -c some_config_file.cfg -o key1=val1 key2=val2

The run is defined by an INI-style config file. We have some examples for pretraining, finetuning, and inference. A config file looks something like:

[config]

mode = pretrain
exp_name = pretrain_base
pretrained_model = google/byt5-base

# Dataset
exclude_st_seg = false
use_translation = true
use_unimorph = false

# Training
max_epochs = 13
early_stopping_patience = 3
learning_rate = 5e-5
batch_size = 2

# Files
output_model_path = /projects/migi8081/glosslm/models/glosslm-pretrained-base

The full list of possible options is in experiment_config.py. In addition to the config file, you can specify any parameter overrides with -o key1=val1 key2=val2.
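
For example, to reuse a pretraining config but try a different learning rate and batch size (the config filename here is illustrative; the keys must be options defined in experiment_config.py):

python run.py -c pretrain_base.cfg -o learning_rate=1e-4 batch_size=4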

Note

If you'd like to run finetuning on a new language, you'll probably need to write your own training script. You can use run.py as a reference for how we implemented finetuning.
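
For orientation, here is a rough sketch of what such a script could look like using the standard Hugging Face Seq2SeqTrainer. The split name, glottocode filter, column names (glottocode, language, metalang, transcription, is_segmented, translation, glosses), and hyperparameters are assumptions for illustration, not the setup used in run.py; adapt them to the actual corpus schema and your language.

import datasets
import transformers

# Load the corpus and keep only the language of interest.
# NOTE: the "train" split, the "glottocode" column, and the value "uspa1245"
# (Uspanteko) are assumptions for illustration; check the dataset card.
corpus = datasets.load_dataset("lecslab/glosslm-corpus")
train = corpus["train"].filter(lambda ex: ex["glottocode"] == "uspa1245")

tokenizer = transformers.ByT5Tokenizer.from_pretrained("google/byt5-base", use_fast=False)
model = transformers.T5ForConditionalGeneration.from_pretrained("lecslab/glosslm")

def preprocess(example):
    # Build the same prompt format used for inference above; the column
    # names accessed here are assumed, not guaranteed.
    prompt = (
        f"Provide the glosses for the following transcription in {example['language']}.\n\n"
        f"Transcription in {example['language']}: {example['transcription']}\n"
        f"Transcription segmented: {example['is_segmented']}\n"
        f"Translation in {example['metalang']}: {example['translation']}\n\n"
        f"Glosses: \n"
    )
    model_inputs = tokenizer(prompt, truncation=True, max_length=1024)
    labels = tokenizer(example["glosses"], truncation=True, max_length=1024)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

train = train.map(preprocess, remove_columns=train.column_names)

args = transformers.Seq2SeqTrainingArguments(
    output_dir="glosslm-finetuned",
    learning_rate=5e-5,
    per_device_train_batch_size=2,
    num_train_epochs=3,
)
trainer = transformers.Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train,
    data_collator=transformers.DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()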
