Title: Scientific research papers typically contain two or three tables that capture quantitative and non-quantitative aspects of the work. In this project you will collect summaries of these tables to build a dataset for table summarization.
Design a pipeline in Python that does the following:
Section 1: Download and Preprocessing
Download 1000 unique scientific papers in PDF format. Write a Python program that converts the papers from PDF to text and XML formats. Divide the papers into folders by topic, e.g. Summarization, IOT, Sentiment Analysis, Translation. Rename the text and XML files in a uniform format and keep them in well-defined folders, e.g. document100.txt, document100.xml.
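The uniqueness check and the uniform naming can be sketched as follows. This is a minimal sketch, not a prescribed solution: the folder layout `pdf_root/<topic>/<name>.pdf`, the function names `file_digest` and `plan_documents`, and the choice of SHA-256 to detect byte-identical duplicates are all assumptions made for illustration. The PDF-to-text and PDF-to-XML conversion itself depends on whichever library the group chooses and is deliberately left out.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def plan_documents(pdf_root: Path) -> list:
    """Assign a uniform document number to every unique PDF under pdf_root.

    Assumed (hypothetical) layout: pdf_root/<topic>/<anything>.pdf
    """
    seen = set()
    plan = []
    for pdf in sorted(pdf_root.glob("*/*.pdf")):
        digest = file_digest(pdf)
        if digest in seen:  # byte-identical duplicate, skip it
            continue
        seen.add(digest)
        number = len(plan) + 1
        plan.append({
            "topic": pdf.parent.name,
            "pdf": pdf,
            "txt_name": f"document{number}.txt",
            "xml_name": f"document{number}.xml",
        })
    return plan


# Small demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "IOT").mkdir()
    (root / "Summarization").mkdir()
    (root / "IOT" / "a.pdf").write_bytes(b"paper A")
    (root / "Summarization" / "b.pdf").write_bytes(b"paper B")
    (root / "Summarization" / "c.pdf").write_bytes(b"paper A")  # duplicate of a.pdf
    demo_plan = plan_documents(root)
    demo_names = [(p["topic"], p["txt_name"], p["xml_name"]) for p in demo_plan]

print(demo_names)
```

Hashing only catches exact file duplicates; the same paper downloaded from two sources may differ byte for byte, so comparing titles as well may be worthwhile.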
Section 2: System Design
A summary can be of two types: abstractive and extractive. For each table in each scientific paper, carry out the following steps:
The abstractive summary is the caption of the table, and the extractive summary is the text in the paper that mentions the table by its number. For example, suppose that in a particular paper this table is given as Table 1.
And it is mentioned in the paper text like this:
So the abstractive summary of Table 1 is the caption of the table: Performance of the multi-label classifiers used throughout the BioASQ challenge semantic indexing task 4a, in terms of Micro-F and MacroF. Training set size was 1,000,000 documents and test set size 50,000 respectively.
The extractive summary is the text where the table is mentioned in the paper: In Table 1 we present the performance of the multi-label classifiers used in our ensembles, in terms of the Micro-F and Macro-F measures, for the training set (one million documents) and the validation set (fifty thousand documents) used throughout the challenge.
Note: If the table is mentioned multiple times, there will be multiple extractive summaries for that table.
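One possible way to pull captions and mentions out of the converted text is sketched below. It is a rough sketch under several assumptions: captions are taken to sit on a single line starting with "Table N:" or "Table N.", sentences are split naively on end punctuation (so abbreviations such as "e.g." will split too early), and the names `find_captions`, `find_mentions` and `split_sentences` are hypothetical. Real converted text will likely need more robust handling.

```python
import re

# Assumed caption shape: a line such as "Table 1: Performance of ..."
CAPTION_RE = re.compile(r"^Table\s+(\d+)\s*[:.]\s*(.+)$", re.MULTILINE)


def split_sentences(text: str) -> list:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    flat = " ".join(text.split())
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", flat) if s.strip()]


def find_captions(text: str) -> dict:
    """Map each table number to its caption text (abstractive summary)."""
    return {int(m.group(1)): m.group(2).strip() for m in CAPTION_RE.finditer(text)}


def find_mentions(text: str, table_number: int) -> list:
    """Return every sentence that mentions the table (extractive summaries)."""
    body = CAPTION_RE.sub(" ", text)  # drop caption lines so they are not counted
    # The trailing \b keeps "Table 1" from matching inside "Table 10".
    pattern = re.compile(rf"\bTable\s+{int(table_number)}\b")
    return [s for s in split_sentences(body) if pattern.search(s)]


sample = (
    "In Table 1 we present the performance of the multi-label classifiers. "
    "The ensembles are described next.\n"
    "Table 1: Performance of the multi-label classifiers\n"
    "Results in Table 10 are unrelated. As Table 1 shows, Micro-F improves."
)

captions = find_captions(sample)
mentions = find_mentions(sample, 1)
print(captions)
print(mentions)
```

Because a table can be mentioned more than once, `find_mentions` returns a list; an empty list corresponds to a table that is never referenced in the text.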
Section 3: Output File preparation
Prepare a single text file covering all 1000 papers. The format of the text file is as follows:
<Paper ID =1> <Abstractive Summary> =Table 1: Correlations between input features and average system performance for multi-document inputs of DUC 2001-2003, 2004G (generic task), 2004B (biographical task), All data (2002-2004) - UNnormalized and Normalized coverage scores</Abstractive Summary> <Extractive Summary> = NULL</Extractive Summary> </Paper ID =1>
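A record in this format could be assembled as sketched below. The sketch assumes that each closing tag in the sample has a matching opening tag of the same form, that a table with no mentions gets NULL as its extractive summary, and that a table with several mentions gets one Extractive Summary element per mention. None of these details is confirmed by the specification, and `format_record` is a hypothetical helper name.

```python
def format_record(paper_id: int, caption: str, mentions: list) -> str:
    """Build one output record for a single table of a single paper."""
    abstractive = caption if caption else "NULL"
    # Assumption: NULL marks a table that is never mentioned in the text.
    extractive_items = mentions if mentions else ["NULL"]
    parts = [f"<Paper ID ={paper_id}>"]
    parts.append(f"<Abstractive Summary> ={abstractive}</Abstractive Summary>")
    for item in extractive_items:
        # Assumption: one element per mention when there are several.
        parts.append(f"<Extractive Summary> = {item}</Extractive Summary>")
    parts.append(f"</Paper ID ={paper_id}>")
    return " ".join(parts)


record_without_mentions = format_record(1, "Table 1: Example caption", [])
record_with_mentions = format_record(2, "Table 3: Another caption", ["First.", "Second."])
print(record_without_mentions)
print(record_with_mentions)
```

Writing one record per line keeps the single output file easy to inspect and to split back into records later.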
Section 4: Submission checklist for all groups
The well-defined folders containing the 1000 scientific papers in PDF, TXT and XML formats. All the Python code used in all sections, with clear naming conventions. The output .txt file in the format described above.