A Multi-faceted Hierarchical Summarization Corpus of Large Heterogeneous Data

AIPHES/HierarchicalSummarization

Hierarchical Summarization

This repository contains the corpus created for the publication 'Beyond Generic Summarization: A Multi-faceted Hierarchical Summarization Corpus of Large Heterogeneous Data', the annotation tool developed for that purpose, and an example Amazon Mechanical Turk HIT.

The Corpus

The Corpus folder contains three subfolders. The SourceDocuments folder holds the .xml files of all source topics and a .txt file with the topic names. The AMTAllNuggets folder holds a tab-delimited csv file with all annotations from Amazon Mechanical Turk in the format worker [tab] annotation; the turker IDs have been hashed to anonymize them. The Trees folder holds the input documents for the tree annotation, the trees from three annotators, and the gold standard trees created from these trees.
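The worker [tab] annotation format can be read with a few lines of Python. The following is a minimal sketch, not part of the repository: the function name `group_nuggets_by_worker` is hypothetical, and it assumes one annotation per line, no header row, and that any additional tabs belong to the annotation text.

```python
import csv
from collections import defaultdict
from typing import Dict, Iterable, List


def group_nuggets_by_worker(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group annotations by hashed worker ID from `worker<TAB>annotation` rows.

    Assumptions (not confirmed by the README): no header row, and any
    extra tabs on a line are part of the annotation text.
    """
    grouped: Dict[str, List[str]] = defaultdict(list)
    # QUOTE_NONE keeps quotation marks inside annotations as literal text.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        worker = row[0]
        annotation = "\t".join(row[1:])
        grouped[worker].append(annotation)
    return dict(grouped)


if __name__ == "__main__":
    # Made-up rows for illustration only; these are not corpus data.
    sample = [
        "w1\tfirst nugget\n",
        "w2\tsecond nugget\n",
        "w1\tthird nugget\n",
    ]
    for worker, nuggets in group_nuggets_by_worker(sample).items():
        print(worker, len(nuggets))
```

To use it on the actual file, pass an open file handle instead of the sample list, for example `open(path, encoding="utf-8", newline="")`, where `path` points to the csv file inside the AMTAllNuggets folder.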

The Annotation tool

Included in the AnnotationTool folder is the Annotation tool as a Java archive as well as the source code and documentation of the tool.

The HIT-Template

Included in the HIT-Template folder is an example HIT along with its JavaScript and stylesheet.

Citation

If you find the corpus and/or annotation tool useful, please cite the following paper: Beyond Generic Summarization: A Multi-faceted Hierarchical Summarization Corpus of Large Heterogeneous Data

@inproceedings{Tauchmann.et.al.2018.LREC,
	author = {Tauchmann, Christopher and Arnold, Thomas and Hanselowski, Andreas and Meyer, Christian M. and Mieskes, Margot},
	title = {Beyond Generic Summarization: A Multi-faceted Hierarchical Summarization Corpus of Large Heterogeneous Data},
	booktitle = {Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC)},
	month = {May},
	year = {2018},
	pages = {3184--3191},
	location = {Miyazaki, Japan},
	url = {http://www.lrec-conf.org/proceedings/lrec2018/pdf/252.pdf}
}

Abstract: Automatic summarization has so far focused on datasets of ten to twenty rather short documents, typically news articles. But automatic systems could in theory analyze hundreds of documents from a wide range of sources and provide an overview to the interested reader. Such a summary would ideally present the most general issues of a given topic and allow for more in-depth information on specific aspects within said topic. In this paper, we present a new approach for creating hierarchical summarization corpora from large, heterogeneous document collections. We first extract relevant content using crowdsourcing and then ask trained annotators to order the relevant information hierarchically. This yields tree structures covering the specific facets discussed in a document collection. Our resulting corpus is freely available. It can be used to develop and evaluate hierarchical summarization systems.

Contact person: Christopher Tauchmann, [email protected]

https://www.aiphes.tu-darmstadt.de/

Don't hesitate to send us an e-mail or report an issue if something is broken (and it shouldn't be) or if you have further questions.

License
