covid19-graph

Apply Non-Negative Matrix Factorization (NMF) topic modelling to COVID research articles.

Prerequisites

Download the "search_results" zip file from COVID-19 Graph and unzip it in the coronavirus_twenty_years_of_research directory.
Install the required packages listed in requirements.txt (for example, with pip install -r requirements.txt).
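
The unzip step can also be scripted. The following is a minimal sketch using only the Python standard library. It assumes the downloaded archive already contains the search_results folder at its top level; the function name and the archive file name in the usage note are illustrative, not part of this repository.

```python
import zipfile
from pathlib import Path


def extract_search_results(archive_path, target_dir="coronavirus_twenty_years_of_research"):
    """Unzip the downloaded archive into the dataset directory."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)
    # Return the extracted JSON files so the caller can check the result.
    return sorted(target.rglob("*.json"))
```

Calling extract_search_results("search_results.zip") (a hypothetical file name; use whatever the download is called) should list the JSON files that Step1 expects under search_results.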

About program process

Step1. Data preprocessing.ipynb: Load the JSON files and preprocess the text data.

INPUT:
- coronavirus_twenty_years_of_research/search_results/covid_19.json
- coronavirus_twenty_years_of_research/search_results/covid19.json
- coronavirus_twenty_years_of_research/search_results/sars_cov_2.json
OUTPUT: coronavirus_twenty_years_of_research/technical_validation/merged_covid_articles.pkl
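
As a rough illustration of what this step does, the sketch below merges several JSON files, tokenises the text, and pickles the result. It is not the notebook's code. It assumes each JSON file holds a list of records with "doi", "title", and "abstract" fields; those field names, the function names, and the tokenisation rule are all assumptions, and the real notebook may store a different object in the pickle.

```python
import json
import pickle
import re


def load_articles(json_paths):
    """Merge article records from several JSON files, dropping duplicate keys."""
    merged = {}
    for path in json_paths:
        with open(path, encoding="utf-8") as handle:
            for record in json.load(handle):
                # "doi" is an assumed unique key; the first occurrence wins.
                merged.setdefault(record["doi"], record)
    return list(merged.values())


def preprocess(text):
    """Lowercase the text and keep alphanumeric tokens longer than two characters."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if len(token) > 2]


def build_corpus(json_paths, output_path):
    """Tokenise title plus abstract for every article and pickle the result."""
    articles = load_articles(json_paths)
    for article in articles:
        text = f"{article.get('title', '')} {article.get('abstract', '')}"
        article["tokens"] = preprocess(text)
    with open(output_path, "wb") as handle:
        pickle.dump(articles, handle)
    return articles
```

Deduplicating on a key matters here because the three search result files may overlap, so the same article could otherwise be counted more than once in later steps.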

Step2. Extract Features.ipynb: Convert the preprocessed data into a TF-IDF matrix and train a word2vec model.

INPUT: coronavirus_twenty_years_of_research/technical_validation/merged_covid_articles.pkl
OUTPUT:
- coronavirus_twenty_years_of_research/technical_validation/covid_100d.model
- coronavirus_twenty_years_of_research/technical_validation/vocid_100d.txt
- coronavirus_twenty_years_of_research/technical_validation/covid_tfidf_d.pkl
- coronavirus_twenty_years_of_research/technical_validation/covid_tfidf_v.pkl
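
To make the TF-IDF part of this step concrete, here is a dependency-free sketch of how such a matrix can be computed from tokenised documents. It is an illustration only: the notebook most likely relies on a library vectoriser, the smoothed IDF formula below is one common variant chosen for this example, and the function name is hypothetical. Training the word2vec model is not shown.

```python
import math
from collections import Counter


def tfidf_matrix(tokenised_docs):
    """Return (vocabulary, rows); rows[i][j] is the TF-IDF weight of term j in document i."""
    vocabulary = sorted({token for doc in tokenised_docs for token in doc})
    n_docs = len(tokenised_docs)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(token for doc in tokenised_docs for token in set(doc))
    # Smoothed IDF: terms that occur in every document still get a non-zero factor.
    idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocabulary}
    rows = []
    for doc in tokenised_docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty documents
        rows.append([counts[term] / total * idf[term] for term in vocabulary])
    return vocabulary, rows
```

For example, tfidf_matrix([["virus", "vaccine"], ["virus", "mask"]]) gives "vaccine" a higher weight than "virus" in the first document, because "virus" appears in both documents and is therefore less distinctive.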

Step3. Apply topic modelling (NMF).ipynb: Apply NMF topic modelling and save the results in CSV format.

INPUT:
- coronavirus_twenty_years_of_research/technical_validation/merged_covid_articles.pkl
- coronavirus_twenty_years_of_research/technical_validation/covid_100d.model
- coronavirus_twenty_years_of_research/technical_validation/covid_tfidf_d.pkl
- coronavirus_twenty_years_of_research/technical_validation/covid_tfidf_v.pkl
OUTPUT:
- coronavirus_twenty_years_of_research/technical_validation/NMF_topic_modelling_results(8clusters).csv
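
NMF approximates a non-negative document-term matrix V as the product of a document-topic matrix W and a topic-term matrix H. The sketch below shows the idea with the classic multiplicative update rule in plain Python. It is a teaching example under stated assumptions, not the notebook's implementation, which presumably uses an optimised library routine; all function names and defaults here are hypothetical.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    b_cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rw in zip(v, wh) for x, y in zip(rv, rw))


def nmf(v, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factorise non-negative V (docs x terms) into W (docs x topics) and H (topics x terms)."""
    rng = random.Random(seed)
    n_docs, n_terms = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(n_topics)] for _ in range(n_docs)]
    h = [[rng.random() + eps for _ in range(n_terms)] for _ in range(n_topics)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_terms)]
             for i in range(n_topics)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_topics)]
             for i in range(n_docs)]
    return w, h


def dominant_topics(w):
    """Assign each document to the topic with the largest weight in its row of W."""
    return [max(range(len(row)), key=row.__getitem__) for row in w]
```

The output file name suggests eight topics, which would correspond to n_topics=8 in this sketch. Taking the largest entry in each row of W is one simple way to turn the factorisation into a single cluster label per article.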

Step4. Produce outputs of topic modelling.ipynb: Read the topic modelling results and produce outputs in JSON format.

INPUT: coronavirus_twenty_years_of_research/technical_validation/NMF_topic_modelling_results(8clusters).csv
OUTPUT: coronavirus_twenty_years_of_research/clusters/cluster*/*.json
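
The conversion from the results CSV to per-cluster JSON files could look roughly like the sketch below. The CSV column name "cluster", the output file name "articles.json", and the function name are assumptions made for illustration; only the clusters/cluster*/ directory layout is taken from the paths above.

```python
import csv
import json
from collections import defaultdict
from pathlib import Path


def export_clusters(csv_path, clusters_dir, cluster_column="cluster"):
    """Group CSV rows by cluster label and write one JSON file per cluster."""
    groups = defaultdict(list)
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            groups[row[cluster_column]].append(row)
    written = []
    for label, rows in sorted(groups.items()):
        # Mirrors the clusters/cluster*/*.json layout; "articles.json" is a made-up name.
        out_path = Path(clusters_dir) / f"cluster{label}" / "articles.json"
        out_path.parent.mkdir(parents=True, exist_ok=True)
        out_path.write_text(json.dumps(rows, indent=2), encoding="utf-8")
        written.append(out_path)
    return written
```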

About data directories

The 'search_results' folder contains the extracted article metadata.
The 'clusters' folder is the core of our dataset and contains the articles classified into eight clusters.
The 'technical_validation' folder includes all CSV files used to create the figures in this paper.
The 'time_lapse_visualization' folder contains a video that animates the cluster trends.
