This corpus was created during my master's degree at ICMC-USP and was made possible by the Interinstitutional Center for Computational Linguistics (NILC, Núcleo Interinstitucional de Linguística Computacional), represented by my advisor, Dr. Sandra Maria Aluísio, and the linguistics specialist Dr. Magali Sanches Duran.
http://www.nilc.icmc.usp.br/nilc/index.php
@inproceedings{leal2018pss,
author = {Sidney Evaldo Leal and Magali Sanches Duran and Sandra Maria Aluísio},
title = {A Nontrivial Sentence Corpus for the Task of Sentence Readability Assessment in Portuguese},
booktitle = {Proceedings of the 27th International Conference on Computational Linguistics (COLING 2018)},
year = {2018},
pages = {401--413},
month = {August},
date = {20-26},
address = {Santa Fe, New Mexico, USA},
}
All files are in Tab Separated Values (TSV) format: fields are separated by a tab character (also known as char(9) or \t), and rows are separated by a newline (char(10) or \n).
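As a minimal sketch of reading this format, the snippet below parses tab-separated rows with Python's standard csv module. The sample rows are hypothetical and not taken from the corpus, and whether the real files start with a header row is not stated here, so check the first line before relying on column positions.

```python
import csv
import io


def read_tsv(handle):
    """Return every row of a TSV stream as a list of field strings."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal text
    # instead of treating them as CSV quoting.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader]


# Hypothetical two-row sample (assumption: not real corpus data).
sample = io.StringIO('1\tORI\tUma frase "citada".\n1\tNAT\tUma frase.\n', newline="")
rows = read_tsv(sample)
print(rows)
```

For a file on disk, the same function can be used with `open(path, encoding="utf-8", newline="")`; the UTF-8 encoding is an assumption.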
In this folder you'll find the source corpus used to extract the sentence pairs, already exported in TSV format:
- production_id: Each triplet of texts (original, natural, strong) has a unique id, called production_id.
- level: ORI (1 - Original), NAT (2 - Natural) or STR (3 - Strong).
- text_id: Unique id for each text.
- sentence_id: Unique id for each sentence.
- paragraph: Sequential id for the paragraph in text.
- sentence_text: The raw text from the sentence.
- production_id: See porsimples_sentences.tsv.
- level: Simplification level ORI->NAT or NAT->STR.
- text_id_from: Text id for the source side of the simplification.
- sentence_id_from: Sentence id for the source side of the simplification.
- text_id_to: Text id for target side of simplification.
- sentence_id_to: Sentence id for target side of simplification.
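The two layouts above can be joined through the sentence ids. The sketch below assumes that the columns appear in the order listed, that the files have no header row, and that sentence_id alone identifies a sentence; the helper names and the sample rows are hypothetical.

```python
import csv
import io

SENTENCE_FIELDS = ["production_id", "level", "text_id",
                   "sentence_id", "paragraph", "sentence_text"]
PAIR_FIELDS = ["production_id", "level", "text_id_from",
               "sentence_id_from", "text_id_to", "sentence_id_to"]


def read_records(handle, fieldnames):
    """Read TSV rows as dicts, naming the columns explicitly."""
    reader = csv.DictReader(handle, fieldnames=fieldnames,
                            delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


def resolve_pairs(sentences, pairs):
    """Replace the ids of each pair with the corresponding sentence texts."""
    text_by_id = {s["sentence_id"]: s["sentence_text"] for s in sentences}
    return [(text_by_id[p["sentence_id_from"]], text_by_id[p["sentence_id_to"]])
            for p in pairs]


# Hypothetical sample data (assumption: not real corpus rows).
sentences = read_records(io.StringIO(
    "1\tORI\t10\t100\t1\tFrase original longa.\n"
    "1\tNAT\t11\t200\t1\tFrase simples.\n", newline=""), SENTENCE_FIELDS)
pairs = read_records(io.StringIO(
    "1\tORI->NAT\t10\t100\t11\t200\n", newline=""), PAIR_FIELDS)
resolved = resolve_pairs(sentences, pairs)
print(resolved)
```

If the real files do contain a header row, drop the `fieldnames` argument so that `csv.DictReader` takes the names from the first line.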
In this folder are the files with aligned pairs from pss0 to pss3; they all have the same layout:
- production_id: See porsimples_sentences.tsv.
- level: Simplification level ORI->NAT, NAT->STR or ORI->STR.
- changed: Whether the sentence was changed at this simplification level.
- split: Whether the sentence was split at this simplification level.
- sentence_text_from: The raw text of the source sentence.
- sentence_text_to: The raw text of the target sentence.
Concatenates all resulting split sentences on the right side; may be useful for studying the simplification process.
- pss0_align_concat_ori_nat.tsv
- pss0_align_concat_nat_str.tsv
Repeats the left-side sentence for each resulting split sentence.
- pss1_align_all_splits_ori_nat.tsv
- pss1_align_all_splits_nat_str.tsv
- pss1_align_all_splits_ori_str.tsv
Keeps only the split sentence with the greatest length and the most token overlap. Repeats the left-side sentence when two resulting split sentences have the same size and overlap.
- pss2_align_length_ori_nat.tsv
- pss2_align_length_nat_str.tsv
- pss2_align_length_ori_str.tsv
Only the sentences that were not split.
- pss3_align_no_splits_ori_nat.tsv
- pss3_align_no_splits_nat_str.tsv
- pss3_align_no_splits_ori_str.tsv
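The four alignment strategies can be illustrated with a short sketch. This is not the code used to build the corpus: the function names, the whitespace tokenization, and the way length and overlap are combined for pss2 (compared as a tuple, length first) are all assumptions made for illustration.

```python
def tokens(sentence):
    """Lowercased whitespace tokens (assumption: the corpus tokenizer may differ)."""
    return sentence.lower().split()


def overlap(source, target):
    """Number of distinct tokens shared by source and target."""
    return len(set(tokens(source)) & set(tokens(target)))


def align(source, targets, strategy):
    """Build (source, target) pairs for one source sentence and its simplifications."""
    if strategy == "pss0":  # concatenate all split sentences on the right side
        return [(source, " ".join(targets))]
    if strategy == "pss1":  # repeat the source for every split sentence
        return [(source, t) for t in targets]
    if strategy == "pss2":  # keep the longest, most overlapping split; ties repeat the source
        def score(t):
            return (len(tokens(t)), overlap(source, t))
        best = max(score(t) for t in targets)
        return [(source, t) for t in targets if score(t) == best]
    if strategy == "pss3":  # keep only sentences that were not split
        return [(source, targets[0])] if len(targets) == 1 else []
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical example, without punctuation to keep tokenization trivial.
source = "O menino correu e a menina ficou em casa"
targets = ["O menino correu", "A menina ficou em casa"]
for name in ("pss0", "pss1", "pss2", "pss3"):
    print(name, align(source, targets, name))
```

With this example, pss1 yields two pairs, pss2 keeps only the longer second sentence, and pss3 yields nothing because the source was split.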
The file triplets_length.tsv contains sentences from the 3 levels, generated from the pss2_length pairs, in the following layout:
- production_id: See porsimples_sentences.tsv.
- level: Fixed - ORI->NAT->STR.
- changed_ori_nat: Whether the sentence changed from the original to the natural level.
- changed_nat_str: Whether the sentence changed from the natural to the strong level.
- original_text: The raw text of the original sentence.
- natural_text: The raw text of the natural sentence.
- strong_text: The raw text of the strong sentence.
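A triplet in this layout can be grouped by where it was simplified. Since the encoding of the changed_ori_nat and changed_nat_str columns is not described here, the sketch below infers the changes by comparing the three texts directly; that is an assumption and may not reproduce the flags in the file exactly. The category labels and sample triplets are hypothetical.

```python
from collections import Counter


def triplet_category(original_text, natural_text, strong_text):
    """Classify a triplet by the levels at which the text differs."""
    changed_ori_nat = original_text != natural_text
    changed_nat_str = natural_text != strong_text
    if changed_ori_nat and changed_nat_str:
        return "simplified at both levels"
    if changed_ori_nat:
        return "simplified only ORI->NAT"
    if changed_nat_str:
        return "simplified only NAT->STR"
    return "no simplification"


# Hypothetical triplets (assumption: not real corpus sentences).
triplets = [
    ("A B C", "A B", "A"),
    ("A B", "A B", "A B"),
    ("A B C", "A B", "A B"),
]
counts = Counter(triplet_category(*t) for t in triplets)
print(counts)
```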
Total sentences Original: 2907
Zero Hora: 2067
Caderno Ciencia FSP: 840
Total sentences Natural: 4066
Total sentences Strong: 4971
Total sentences ALL: 11944
Total sentences NO SIMPLIFICATION Original->Natural: 565
Total sentences NO SIMPLIFICATION Natural->Strong: 2619
Total sentences SPLIT Original->Natural: 826
Total sentences SPLIT Natural->Strong: 721
Total sentences Natural from split: 1990
Total sentences Strong from split: 1625
Total sentences SIMPLIFIED (no split) Original->Natural: 1515
Total sentences SIMPLIFIED (no split) Natural->Strong: 729
Total pairs simplified Original->Natural: 2340
Total pairs simplified Natural->Strong: 1450
Total pairs simplified Original->Strong: 1101
Total all pairs simplified: 4891
Total triplets NO SIMPLIFICATION 3 Levels: 393
Total triplets Simplified Only Original->Natural: 1297
Total triplets Simplified Only Natural->Strong: 181
Total triplets Simplified 3 Levels: 1099
Total triplets: 2970
Mean token size of sentences - simplified (no split) - Ori->Nat: 20
Min token size of sentences - simplified (no split) - Ori->Nat: 3
Max token size of sentences - simplified (no split) - Ori->Nat: 69
Mean token size of sentences - simplified (with split) - Ori->Nat: 33
Min token size of sentences - simplified (with split) - Ori->Nat: 6
Max token size of sentences - simplified (with split) - Ori->Nat: 54
Mean token size of sentences - simplified (no split) - Nat->Str: 22
Min token size of sentences - simplified (no split) - Nat->Str: 4
Max token size of sentences - simplified (no split) - Nat->Str: 57
Mean token size of sentences - simplified (with split) - Nat->Str: 24
Min token size of sentences - simplified (with split) - Nat->Str: 5
Max token size of sentences - simplified (with split) - Nat->Str: 49
Mean tokens size diff of sentences - Originals vs simplified (no split) - Ori->Nat: 6
Min tokens size diff of sentences - Originals vs simplified (no split) - Ori->Nat: 1
Max tokens size diff of sentences - Originals vs simplified (no split) - Ori->Nat: 26
Mean tokens size diff of sentences - Originals vs simplified (with split) - Ori->Nat: 9
Min tokens size diff of sentences - Originals vs simplified (with split) - Ori->Nat: 1
Max tokens size diff of sentences - Originals vs simplified (with split) - Ori->Nat: 64
Total PSS1 Original->Natural: 3504
Total PSS1 Natural->Strong: 2353
Total PSS1 Original->Strong: 2052
Grand total PSS1: 7909
Total PSS2 Original->Natural: 2370
Total PSS2 Natural->Strong: 1491
Total PSS2 Original->Strong: 1101
Grand total PSS2: 4962
Total PSS3 Original->Natural: 1515
Total PSS3 Natural->Strong: 729
Total PSS3 Original->Strong: 264
Grand total PSS3: 2508