This repository contains a lexical resource for English sentence-level relation extraction (RE) based on Wikidata. The resource is distributed as an SQLite database for easy querying and storage. It includes more than 47 million examples collected from several RE corpora and covers more than 1,000 unique Wikidata properties.
You can use, for example, this DB browser to read the database.
The database can be downloaded from osf.io (4.5 GB).
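Since the resource is a plain SQLite file, it can also be read with Python's built-in `sqlite3` module. The snippet below is a minimal sketch that lists every table and its columns. The helper name `list_tables_and_columns` is hypothetical, and the in-memory database with a reduced `corpora` table is only a stand-in so that the example runs without the download.

```python
import sqlite3


def list_tables_and_columns(conn):
    """Return {table_name: [column names]} for every table in the database."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    schema = {}
    for table in tables:
        # PRAGMA table_info yields one row per column; index 1 is the column name.
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        schema[table] = [column[1] for column in columns]
    return schema


# Stand-in for the downloaded file: replace ":memory:" with the local path
# of the database (this README does not fix a file name).
conn = sqlite3.connect(":memory:")
# Toy table with only three of the documented columns (illustration only).
conn.execute(
    "CREATE TABLE corpora (id INTEGER PRIMARY KEY, relation_id TEXT, sentence TEXT)"
)

schema = list_tables_and_columns(conn)
print(schema)
```

With the real file, the same call should report the full column list of the `corpora` table.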
This database accompanies the paper "Knowledge Extraction From Texts Based on Wikidata" (Shimorina et al., NAACL 2022).
The following six datasets were preprocessed for sentence-based relation extraction: FewRel, T-REx, DocRED, WikiFact, Wiki20m, WebRED.
- # of R types: how many unique Wikidata properties are in the dataset
- Negative examples are sentences with no relation detected or with an unknown relation between entities.
dataset | # of instances | # of R types | % of neg. examples | human checks | license |
---|---|---|---|---|---|
FewRel (Han et al., 2018) [data] | 56,000 | 80 | 0% | yes | MIT |
T-REx (Elsahar et al., 2018) [data] | 12,081,023 | 652 | 0% | no | CC BY-SA 4.0 |
DocRED (Yao et al., 2019) [data] | 778,914 | 96 | 0% | no (yes*) | MIT |
WikiFact (Goodrich et al., 2019) [data] | 33,628,338 | 934 | 92% | no | CC BY 4.0 |
Wiki20m (Han et al., 2020) [data] | 738,463 | 81 | 60% | no | MIT |
WebRED (Ormandi et al., 2021) [data] | 107,819 | 385 | 54% | yes | CC BY 4.0 |
our database (DB) | 47,390,557 | 1,022 | 66% | yes/no | CC BY-SA 4.0 |
* Part of DocRED was verified by humans.
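Figures of the kind shown in the table can in principle be recomputed with a single `GROUP BY` query over the `corpora` table and its `dataset` and `relation_id` columns, both described in the schema section. The sketch below makes two assumptions that this README does not confirm: that negative examples are exactly the rows labelled `P0` or `NA`, and that those two labels do not count as relation types. It runs on a four-row toy table, so its output is not the real statistics.

```python
import sqlite3

# Assumption: rows with relation_id 'P0' or 'NA' are the negative examples,
# and these two labels are excluded from the relation-type count.
STATS_SQL = """
SELECT dataset,
       COUNT(*) AS n_instances,
       COUNT(DISTINCT CASE WHEN relation_id NOT IN ('P0', 'NA')
                           THEN relation_id END) AS n_relation_types,
       ROUND(100.0 * SUM(relation_id IN ('P0', 'NA')) / COUNT(*), 1) AS pct_negative
FROM corpora
GROUP BY dataset
ORDER BY dataset
"""


def dataset_stats(conn):
    """Return (dataset, n_instances, n_relation_types, pct_negative) rows."""
    return conn.execute(STATS_SQL).fetchall()


# Toy stand-in for the real database (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE corpora (id INTEGER PRIMARY KEY, relation_id TEXT, dataset TEXT)"
)
conn.executemany(
    "INSERT INTO corpora (relation_id, dataset) VALUES (?, ?)",
    [("P131", "webred"), ("P0", "webred"), ("P17", "fewrel"), ("P131", "fewrel")],
)

rows = dataset_stats(conn)
for row in rows:
    print(row)
```

On the full database the query scans tens of millions of rows, so it may take a while unless an index on `dataset` is added.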
The DB has a single table, corpora, where all dataset instances are stored.
column name | explanation | example | comment |
---|---|---|---|
id | ID in db | INTEGER | |
relation_id | property ID from Wikidata (*) | P131 | |
relation_name | property label from Wikidata (*) | located in the administrative territorial entity | |
sentence | a sentence where subject and object are marked | SUBJ{Magnificent Mile}, a SUBJ{neighborhood} in OBJ{Chicago}, Illinois, U.S. | |
dataset | name of the dataset (e.g., webred/docred/wikifact) | trex | |
human_checks | Was this sentence manually verified by humans? | True/False | |
dataset_part | The dataset split the instance belongs to (train or dev). Test data is not included. | train/dev | |
source_name | subject text | Magnificent Mile | |
target_name | object text | Chicago | |
source_wikidata_id | subject ID from Wikidata | Q3056359 | only in FewRel, Wiki20m |
target_wikidata_id | object ID from Wikidata | Q1331049 | only in FewRel, Wiki20m |
sentence_tokenised | tokenised sentence | The Lakes is a locality in OBJ{Western Australia} within the SUBJ{Shire of Mundaring} . | only in FewRel, Wiki20m, DocRED |
title | title of the document where the sentence comes from | The Lakes, Western Australia | only in T-REx, DocRED |
sent_id | sentence ID from the document | INTEGER | only in T-REx, DocRED |
source_entity_type | subject entity type (e.g., LOC/PER/ORG/TIME/MISC) | ORG | only in DocRED |
target_entity_type | object entity type (e.g., LOC/PER/ORG/TIME/MISC) | LOC | only in DocRED |
triple_annotator | aligner that was used for aligning the triple with text (see T-REx docs) | Simple-Aligner | only in T-REx |
source_annotator | aligner that was used for aligning the subject with text (see T-REx docs) | Wikidata_Spotlight_Entity_Linker | only in T-REx |
target_annotator | aligner that was used for aligning the object with text (see T-REx docs) | Date_Linker | only in T-REx |
num_pos_raters | number of unique human raters who thought that the sentence expresses the given relation | INTEGER | only in WebRED |
num_raters | number of unique human raters who annotated the sentence-fact pair | INTEGER | only in WebRED |
url | document url where the sentence comes from | https://en.wikipedia.org/wiki/Miracle_Mile | only in WebRED |
(*) Relation IDs and names include P0 (absence of relation; name: "no_relation") and NA (name: "unknown").
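The `sentence` column marks entity mentions inline as `SUBJ{...}` and `OBJ{...}`, and a subject or object may be marked more than once, as in the example row of the table. Below is a minimal sketch of how these markers could be parsed with a regular expression. The function name `parse_marked_sentence` is hypothetical, and the pattern assumes that mentions contain no `}` character and that markers are not nested.

```python
import re

# Assumption: a mention never contains "}" and markers are never nested.
MARKER = re.compile(r"(SUBJ|OBJ)\{(.*?)\}")


def parse_marked_sentence(sentence):
    """Return (plain_text, subject_mentions, object_mentions)."""
    subjects, objects = [], []
    for match in MARKER.finditer(sentence):
        role, mention = match.group(1), match.group(2)
        (subjects if role == "SUBJ" else objects).append(mention)
    # Replace each marker with the bare mention text.
    plain = MARKER.sub(lambda m: m.group(2), sentence)
    return plain, subjects, objects


example = "SUBJ{Magnificent Mile}, a SUBJ{neighborhood} in OBJ{Chicago}, Illinois, U.S."
plain, subjects, objects = parse_marked_sentence(example)
print(plain)
print(subjects, objects)
```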
- Calculate rater confidence and assign P0 to the cases where it was low.
- Separate into train and dev (90/10).
- Delete instances where SUBJ and OBJ cover the same token(s).
- Delete instances with properties that no longer exist in Wikidata (e.g., P5130, P134, P1432, P1962, P1773).
- Create detokenised sentences with MosesDetokenizer from sacremoses.
- Create detokenised sentences with MosesDetokenizer from sacremoses.
- Don't include sentences that are present in FewRel.
- Add sentences that are present in FewRel but have a different relation (66 instances).
- Delete duplicate entries.
- Extract relations that have subject and object present in the same sentence.
- Create detokenised sentences with MosesDetokenizer from sacremoses.
- For DocRED-human: discard some examples, e.g., with missing subjects/objects, no evidence support (see this issue).
- Take all the alignments where word boundaries for subject/object are present.
- Delete duplicates among all the alignments for a text.
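The rater-confidence step in the list above can be illustrated with the two WebRED columns `num_pos_raters` and `num_raters`. The README gives neither the confidence formula nor the cut-off, so the following is only a sketch under stated assumptions: confidence is taken to be the share of raters who accepted the relation, and the threshold of 0.5 is a hypothetical value. The function name `relabel_by_confidence` is also hypothetical.

```python
def relabel_by_confidence(relation_id, num_pos_raters, num_raters, threshold=0.5):
    """Keep the relation if enough raters accepted it, otherwise return 'P0'.

    Assumptions (not stated in the README): confidence is
    num_pos_raters / num_raters, and 0.5 is an illustrative threshold.
    """
    if num_raters <= 0:
        raise ValueError("num_raters must be positive")
    confidence = num_pos_raters / num_raters
    return relation_id if confidence >= threshold else "P0"


print(relabel_by_confidence("P131", 3, 3))  # all raters agreed
print(relabel_by_confidence("P131", 1, 3))  # only one of three agreed
```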
We create our own train/dev split only for WebRED, since it is human-annotated; we do not split T-REx. All other corpora provide a validation set.
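A 90/10 split like the one applied to WebRED can be reproduced in a few lines. The README does not say how the split was drawn, so the sketch below assumes a simple seeded random shuffle without stratification by relation. The function name `train_dev_split` and the seed are hypothetical.

```python
import random


def train_dev_split(instance_ids, dev_fraction=0.1, seed=0):
    """Shuffle the IDs with a fixed seed and return (train_ids, dev_ids).

    Assumption: a plain random split; the original split procedure
    is not documented here.
    """
    ids = list(instance_ids)
    random.Random(seed).shuffle(ids)
    n_dev = round(len(ids) * dev_fraction)
    return ids[n_dev:], ids[:n_dev]


train_ids, dev_ids = train_dev_split(range(100))
print(len(train_ids), len(dev_ids))
```

Fixing the seed makes the split repeatable, which matters if the `dataset_part` labels need to be regenerated later.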
There exist other datasets that use Wikidata for RE. They were not included in the DB.
If you use the database, please cite each dataset you use.
@inproceedings{shimorina-etal-2022-knowledge,
title = "Knowledge Extraction From Texts Based on {W}ikidata",
author = "Shimorina, Anastasia and
Heinecke, Johannes and
Herledan, Fr{\'e}d{\'e}ric",
booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Track",
month = jul,
year = "2022",
address = "Hybrid: Seattle, Washington + Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.naacl-industry.33",
pages = "297--304",
abstract = "This paper presents an effort within our company of developing knowledge extraction pipeline for English, which can be further used for constructing an entreprise-specific knowledge base. We present a system consisting of entity detection and linking, coreference resolution, and relation extraction based on the Wikidata schema. We highlight existing challenges of knowledge extraction by evaluating the deployed pipeline on real-world data. We also make available a database, which can serve as a new resource for sentential relation extraction, and we underline the importance of having balanced data for training classification models.",
}
@inproceedings{yao-etal-2019-docred,
title = "{D}oc{RED}: A Large-Scale Document-Level Relation Extraction Dataset",
author = "Yao, Yuan and
Ye, Deming and
Li, Peng and
Han, Xu and
Lin, Yankai and
Liu, Zhenghao and
Liu, Zhiyuan and
Huang, Lixin and
Zhou, Jie and
Sun, Maosong",
booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2019",
address = "Florence, Italy",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/P19-1074",
doi = "10.18653/v1/P19-1074",
pages = "764--777",
abstract = "Multiple entities in a document generally exhibit complex inter-sentence relations, and cannot be well handled by existing relation extraction (RE) methods that typically focus on extracting intra-sentence relations for single entity pairs. In order to accelerate the research on document-level RE, we introduce DocRED, a new dataset constructed from Wikipedia and Wikidata with three features: (1) DocRED annotates both named entities and relations, and is the largest human-annotated dataset for document-level RE from plain text; (2) DocRED requires reading multiple sentences in a document to extract entities and infer their relations by synthesizing all information of the document; (3) along with the human-annotated data, we also offer large-scale distantly supervised data, which enables DocRED to be adopted for both supervised and weakly supervised scenarios. In order to verify the challenges of document-level RE, we implement recent state-of-the-art methods for RE and conduct a thorough evaluation of these methods on DocRED. Empirical results show that DocRED is challenging for existing RE methods, which indicates that document-level RE remains an open problem and requires further efforts. Based on the detailed analysis on the experiments, we discuss multiple promising directions for future research. We make DocRED and the code for our baselines publicly available at https://github.com/thunlp/DocRED.",
}
@inproceedings{han-etal-2018-fewrel,
title = "{F}ew{R}el: A Large-Scale Supervised Few-Shot Relation Classification Dataset with State-of-the-Art Evaluation",
author = "Han, Xu and
Zhu, Hao and
Yu, Pengfei and
Wang, Ziyun and
Yao, Yuan and
Liu, Zhiyuan and
Sun, Maosong",
booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing",
month = oct # "-" # nov,
year = "2018",
address = "Brussels, Belgium",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/D18-1514",
doi = "10.18653/v1/D18-1514",
pages = "4803--4809",
abstract = "We present a Few-Shot Relation Classification Dataset (dataset), consisting of 70, 000 sentences on 100 relations derived from Wikipedia and annotated by crowdworkers. The relation of each sentence is first recognized by distant supervision methods, and then filtered by crowdworkers. We adapt the most recent state-of-the-art few-shot learning methods for relation classification and conduct thorough evaluation of these methods. Empirical results show that even the most competitive few-shot learning models struggle on this task, especially as compared with humans. We also show that a range of different reasoning skills are needed to solve our task. These results indicate that few-shot relation classification remains an open problem and still requires further research. Our detailed analysis points multiple directions for future research.",
}
@inproceedings{elsahar-etal-2018-rex,
title = "{T}-{RE}x: A Large Scale Alignment of Natural Language with Knowledge Base Triples",
author = "Elsahar, Hady and
Vougiouklis, Pavlos and
Remaci, Arslen and
Gravier, Christophe and
Hare, Jonathon and
Laforest, Frederique and
Simperl, Elena",
booktitle = "Proceedings of the Eleventh International Conference on Language Resources and Evaluation ({LREC} 2018)",
month = may,
year = "2018",
address = "Miyazaki, Japan",
publisher = "European Language Resources Association (ELRA)",
url = "https://aclanthology.org/L18-1544",
}
@misc{ormandi2021webred,
title={WebRED: Effective Pretraining And Finetuning For Relation Extraction On The Web},
author={Robert Ormandi and Mohammad Saleh and Erin Winter and Vinay Rao},
year={2021},
eprint={2102.09681},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
@inproceedings{han-etal-2020-data,
title = "More Data, More Relations, More Context and More Openness: A Review and Outlook for Relation Extraction",
author = "Han, Xu and
Gao, Tianyu and
Lin, Yankai and
Peng, Hao and
Yang, Yaoliang and
Xiao, Chaojun and
Liu, Zhiyuan and
Li, Peng and
Zhou, Jie and
Sun, Maosong",
booktitle = "Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing",
month = dec,
year = "2020",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2020.aacl-main.75",
pages = "745--758",
abstract = "Relational facts are an important component of human knowledge, which are hidden in vast amounts of text. In order to extract these facts from text, people have been working on relation extraction (RE) for years. From early pattern matching to current neural networks, existing RE methods have achieved significant progress. Yet with explosion of Web text and emergence of new relations, human knowledge is increasing drastically, and we thus require {``}more{''} from RE: a more powerful RE system that can robustly utilize more data, efficiently learn more relations, easily handle more complicated context, and flexibly generalize to more open domains. In this paper, we look back at existing RE methods, analyze key challenges we are facing nowadays, and show promising directions towards more powerful RE. We hope our view can advance this field and inspire more efforts in the community.",
}
@inproceedings{goodrich2019wikifact,
author = {Goodrich, Ben and Rao, Vinay and Liu, Peter J. and Saleh, Mohammad},
title = {Assessing The Factual Accuracy of Generated Text},
year = {2019},
isbn = {9781450362016},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3292500.3330955},
doi = {10.1145/3292500.3330955},
abstract = {We propose a model-based metric to estimate the factual accuracy of generated text that is complementary to typical scoring schemes like ROUGE (Recall-Oriented Understudy for Gisting Evaluation) and BLEU (Bilingual Evaluation Understudy). We introduce and release a new large-scale dataset based on Wikipedia and Wikidata to train relation classifiers and end-to-end fact extraction models. The end-to-end models are shown to be able to extract complete sets of facts from datasets with full pages of text. We then analyse multiple models that estimate factual accuracy on a Wikipedia text summarization task, and show their efficacy compared to ROUGE and other model-free variants by conducting a human evaluation study.},
booktitle = {Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery \& Data Mining},
pages = {166--175},
numpages = {10},
keywords = {transformers, factual correctness, metric, deep learning, generative models},
location = {Anchorage, AK, USA},
series = {KDD '19}
}