Apply Non-Negative Matrix Factorization (NMF) topic modelling to COVID research articles.
Download the "search_results" zip file from COVID-19 Graph and unzip it in the coronavirus_twenty_years_of_research directory.
Install required packages by using requirements.txt.
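The unzip step can also be scripted. The following is a minimal sketch, assuming the archive was saved locally under a name such as `search_results.zip` and that it unpacks into a `search_results/` folder (both are assumptions; the function name `extract_search_results` is hypothetical). It extracts the archive and reports which of the three Step 1 input files are missing.

```python
import zipfile
from pathlib import Path

# Input files that Step 1 expects after extraction (names taken from this README).
EXPECTED_FILES = ("covid_19.json", "covid19.json", "sars_cov_2.json")


def extract_search_results(zip_path, target_dir="coronavirus_twenty_years_of_research"):
    """Unzip the archive into target_dir and return the expected files that are missing."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    # Assumption: the archive contains a top-level "search_results" folder.
    results_dir = target / "search_results"
    return [name for name in EXPECTED_FILES if not (results_dir / name).is_file()]
```

An empty list returned by `extract_search_results("search_results.zip")` would indicate that all three input files are in place.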
Step1. Data preprocessing.ipynb: Load the JSON files and preprocess the text data.
- INPUT:
- coronavirus_twenty_years_of_research/search_results/covid_19.json
- coronavirus_twenty_years_of_research/search_results/covid19.json
- coronavirus_twenty_years_of_research/search_results/sars_cov_2.json
- OUTPUT: coronavirus_twenty_years_of_research/technical_validation/merged_covid_articles.pkl
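What Step 1 does can be sketched roughly as follows. This is not the notebook's code. It assumes each JSON file holds a list of records with `title` and `abstract` fields (hypothetical field names), uses a simple regex tokenizer as a stand-in for the actual preprocessing, and drops records with duplicate titles on the assumption that the three search queries overlap.

```python
import json
import pickle
import re
from pathlib import Path

# Lowercase alphanumeric tokens, keeping internal hyphens (e.g. "covid-19").
TOKEN_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def preprocess(text):
    """Lowercase the text and split it into tokens."""
    return TOKEN_RE.findall(text.lower())


def merge_articles(json_paths, text_fields=("title", "abstract")):
    """Load several JSON files (assumed: each a list of dicts) and merge them.

    Records whose lowercased title was already seen are skipped.
    """
    merged, seen = [], set()
    for path in json_paths:
        with open(path, encoding="utf-8") as handle:
            records = json.load(handle)
        for record in records:
            key = (record.get("title") or "").strip().lower()
            if key:
                if key in seen:
                    continue
                seen.add(key)
            text = " ".join(record.get(field) or "" for field in text_fields)
            merged.append({**record, "tokens": preprocess(text)})
    return merged


def save_pickle(obj, out_path):
    """Pickle obj to out_path, creating parent folders if needed."""
    Path(out_path).parent.mkdir(parents=True, exist_ok=True)
    with open(out_path, "wb") as handle:
        pickle.dump(obj, handle)
```

Calling `merge_articles` on the three input paths and passing the result to `save_pickle` with the `merged_covid_articles.pkl` path would reproduce the input/output layout listed above.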
Step2. Extract Features.ipynb: Convert the preprocessed data into a TF-IDF matrix and train a word2vec model.
- INPUT: coronavirus_twenty_years_of_research/technical_validation/merged_covid_articles.pkl
- OUTPUT:
- coronavirus_twenty_years_of_research/technical_validation/covid_100d.model
- coronavirus_twenty_years_of_research/technical_validation/vocid_100d.txt
- coronavirus_twenty_years_of_research/technical_validation/covid_tfidf_d.pkl
- coronavirus_twenty_years_of_research/technical_validation/covid_tfidf_v.pkl
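A feature-extraction step of this kind might look like the sketch below, which uses scikit-learn's `TfidfVectorizer` and gensim's `Word2Vec`. Several points are assumptions, not facts from the notebook: that the documents are already tokenized, that "100d" in the file names means 100-dimensional vectors, that gensim 4.0 or later is installed (older versions name the parameter `size`), and that the `_d` and `_v` pickles hold the document-term matrix and the vocabulary respectively.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def identity_analyzer(tokens):
    """Pass pre-tokenized documents through unchanged."""
    return tokens


def build_tfidf(token_lists, min_df=1):
    """Return (sparse document-term TF-IDF matrix, vocabulary list in column order)."""
    vectorizer = TfidfVectorizer(analyzer=identity_analyzer, min_df=min_df)
    matrix = vectorizer.fit_transform(token_lists)
    # Sort terms by column index so vocabulary[i] names column i of the matrix.
    vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
    return matrix, vocabulary


def train_word2vec(token_lists, vector_size=100):
    """Train a word2vec model on the token lists (assumes gensim >= 4.0)."""
    from gensim.models import Word2Vec  # imported lazily; only this function needs gensim

    return Word2Vec(sentences=token_lists, vector_size=vector_size,
                    min_count=1, workers=1, seed=0)
```

The matrix and vocabulary can then be written with `pickle.dump`, and a trained model saved with `model.save(...)` for the `.model` file and `model.wv.save_word2vec_format(..., binary=False)` for the text file.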
Step3. Apply topic modelling (NMF).ipynb: Apply NMF topic modelling and save the results in CSV format.
- INPUT:
- coronavirus_twenty_years_of_research/technical_validation/merged_covid_articles.pkl
- coronavirus_twenty_years_of_research/technical_validation/covid_100d.model
- coronavirus_twenty_years_of_research/technical_validation/covid_tfidf_d.pkl
- coronavirus_twenty_years_of_research/technical_validation/covid_tfidf_v.pkl
- OUTPUT:
- coronavirus_twenty_years_of_research/technical_validation/NMF_topic_modelling_results(8clusters).csv
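The core of the NMF step can be illustrated with scikit-learn's `NMF` class. The sketch below factorizes the TF-IDF matrix into a document-topic matrix and a topic-term matrix, assigns each article to its strongest topic, and lists the top terms per topic. The eight topics match the "8clusters" file name; the initialization, iteration limit, and function name `fit_nmf_topics` are illustrative choices. How the notebook uses the word2vec model is not covered here.

```python
import numpy as np
from sklearn.decomposition import NMF


def fit_nmf_topics(tfidf_matrix, vocabulary, n_topics=8, n_top_terms=10, seed=0):
    """Factorize X ~ W @ H and return (cluster label per document, top terms per topic)."""
    model = NMF(n_components=n_topics, init="nndsvd", random_state=seed, max_iter=500)
    doc_topic = model.fit_transform(tfidf_matrix)  # W: documents x topics
    topic_term = model.components_                 # H: topics x terms
    # Each document is assigned to the topic with the largest weight in W.
    labels = doc_topic.argmax(axis=1)
    # For each topic, take the terms with the largest weights in H.
    top_terms = [
        [vocabulary[i] for i in np.argsort(row)[::-1][:n_top_terms]]
        for row in topic_term
    ]
    return labels, top_terms
```

Note that `init="nndsvd"` requires `n_topics` to be no larger than the number of documents or terms, which is not a concern for a corpus of this size.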
Step4. Produce outputs of topic modelling.ipynb: Read the topic modelling results and produce outputs in JSON format.
- INPUT: coronavirus_twenty_years_of_research/technical_validation/NMF_topic_modelling_results(8clusters).csv
- OUTPUT: coronavirus_twenty_years_of_research/clusters/cluster*/*.json
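The conversion from the results CSV to per-cluster JSON files could be done along these lines. The column name `cluster` and the output file name `articles.json` are hypothetical, since this README only specifies the `cluster*/*.json` pattern; adjust both to match the actual CSV header and naming scheme.

```python
import csv
import json
from collections import defaultdict
from pathlib import Path


def export_clusters(csv_path, out_dir, cluster_column="cluster"):
    """Group CSV rows by cluster and write one JSON file per cluster folder."""
    groups = defaultdict(list)
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            groups[row[cluster_column]].append(row)

    written = []
    for cluster_id, rows in sorted(groups.items()):
        folder = Path(out_dir) / f"cluster{cluster_id}"
        folder.mkdir(parents=True, exist_ok=True)
        target = folder / "articles.json"  # hypothetical file name
        with open(target, "w", encoding="utf-8") as handle:
            json.dump(rows, handle, ensure_ascii=False, indent=2)
        written.append(target)
    return written
```

The function returns the list of files it wrote, which makes it easy to check that all eight cluster folders were produced.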
- The 'search_results' folder contains the metadata of the extracted articles.
- The 'clusters' folder is the core of our dataset; it contains the articles classified into the eight clusters.
- The 'technical_validation' folder includes all CSV files used to create the figures in this paper.
- The 'time_lapse_visualization' folder contains a video that animates the cluster trends.