TurkuNLP/multilingual-CORE

Multilingual CORE

This repository contains a collection of web-crawled documents annotated with a hierarchical register scheme based on the CORE taxonomy, originally developed for English by Biber and Egbert (2018) and Laippala et al. (2022). This scheme has been applied to 16 languages, with five large subcorpora (English, Finnish, French, Swedish, Turkish) and eleven smaller ones (Arabic, Catalan, Spanish, Persian, Hindi, Indonesian, Japanese, Norwegian, Portuguese, Urdu, Chinese).

The data in this repository was used to train our multilingual register classifier, available on Hugging Face.

Usage

If you use this dataset in your research or projects, please cite the following paper (currently under review for publication):

@misc{henriksson2024untanglingunrestrictedwebautomatic,
      title={Untangling the Unrestricted Web: Automatic Identification of Multilingual Registers},
      author={Erik Henriksson and Amanda Myntti and Anni Eskelinen and Selcen Erten-Johansson and Saara Hellström and Veronika Laippala},
      year={2024},
      eprint={2406.19892},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2406.19892},
}

Register Label Scheme

The register annotation scheme used in this corpus is hierarchical, with the following categories:

  • MT: Machine translated or generated
  • LY: Lyrical
  • SP: Spoken
    • it: Interview
  • ID: Interactive discussion
  • NA: Narrative
    • ne: News report
    • sr: Sports report
    • nb: Narrative blog
  • HI: How-to or instructions
    • re: Recipe
  • IN: Informational description
    • en: Encyclopedia article
    • ra: Research article
    • dtp: Description of a thing or person
    • fi: Frequently asked questions
    • lt: Legal terms and conditions
  • OP: Opinion
    • rv: Review
    • ob: Opinion blog
    • rs: Denominational religious blog or sermon
    • av: Advice
  • IP: Informational persuasion
    • ds: Description with intent to sell
    • ed: News & opinion blog or editorial
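The hierarchy above can be written down as a small lookup table. The following Python sketch is only an illustration: it transcribes the label list into a dictionary and adds two hypothetical helpers, `parent_of` and `with_parents`, which are not part of this repository. It makes no assumption about how labels are stored in the data files.

```python
# Main registers (upper case) mapped to their sub-registers (lower case),
# transcribed from the label list above. Names below are illustrative only.
REGISTERS = {
    "MT": {},  # Machine translated or generated
    "LY": {},  # Lyrical
    "SP": {"it": "Interview"},
    "ID": {},  # Interactive discussion
    "NA": {"ne": "News report", "sr": "Sports report", "nb": "Narrative blog"},
    "HI": {"re": "Recipe"},
    "IN": {
        "en": "Encyclopedia article",
        "ra": "Research article",
        "dtp": "Description of a thing or person",
        "fi": "Frequently asked questions",
        "lt": "Legal terms and conditions",
    },
    "OP": {
        "rv": "Review",
        "ob": "Opinion blog",
        "rs": "Denominational religious blog or sermon",
        "av": "Advice",
    },
    "IP": {
        "ds": "Description with intent to sell",
        "ed": "News & opinion blog or editorial",
    },
}

# Reverse index: sub-register -> main register.
PARENT = {sub: main for main, subs in REGISTERS.items() for sub in subs}


def parent_of(label):
    """Return the main register of a sub-register, or None for a main register."""
    return PARENT.get(label)


def with_parents(labels):
    """Add the parent of every sub-register, dropping duplicates and keeping order."""
    out = []
    for label in labels:
        for item in (PARENT.get(label), label):
            if item is not None and item not in out:
                out.append(item)
    return out
```

For example, `with_parents(["re", "ob"])` places `HI` before `re` and `OP` before `ob`, which can be useful when a document carries only sub-register labels and the main registers are needed as well.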
