Multilingual CORE

This repository contains a collection of web-crawled documents annotated with a hierarchical register scheme based on the CORE taxonomy, originally developed for English by Biber and Egbert (2018) and Laippala et al. (2022). This scheme has been applied to 16 languages, with five large subcorpora (English, Finnish, French, Swedish, Turkish) and eleven smaller ones (Arabic, Catalan, Spanish, Persian, Hindi, Indonesian, Japanese, Norwegian, Portuguese, Urdu, Chinese).

The data in this repository was used to train our multilingual register classifier, available on Hugging Face.

Usage

If you use this dataset in your research or projects, please cite the following paper (currently under review for publication):

@misc{henriksson2024untanglingunrestrictedwebautomatic,
      title={Untangling the Unrestricted Web: Automatic Identification of Multilingual Registers},
      author={Erik Henriksson and Amanda Myntti and Anni Eskelinen and Selcen Erten-Johansson and Saara Hellström and Veronika Laippala},
      year={2024},
      eprint={2406.19892},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2406.19892},
}

Register Label Scheme

The register annotation scheme used in this corpus is hierarchical, with the following categories:

MT: Machine translated or generated
LY: Lyrical
SP: Spoken
- it: Interview
ID: Interactive discussion
NA: Narrative
- ne: News report
- sr: Sports report
- nb: Narrative blog
HI: How-to or instructions
- re: Recipe
IN: Informational description
- en: Encyclopedia article
- ra: Research article
- dtp: Description of a thing or person
- fi: Frequently asked questions
- lt: Legal terms and conditions
OP: Opinion
- rv: Review
- ob: Opinion blog
- rs: Denominational religious blog or sermon
- av: Advice
IP: Informational persuasion
- ds: Description with intent to sell
- ed: News & opinion blog or editorial

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Multilingual CORE

Usage

Register Label Scheme

About

Releases

Packages

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
ar		ar
ca		ca
en		en
es		es
fa		fa
fi		fi
fr		fr
hi		hi
id		id
jp		jp
no		no
pt		pt
sv		sv
tr		tr
ur		ur
zh		zh
README.md		README.md

TurkuNLP/multilingual-CORE

Folders and files

Latest commit

History

Repository files navigation

Multilingual CORE

Usage

Register Label Scheme

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Packages