# Duck News Reporters

This repository contains the source code for the programs written in this project. This README guides the reader through our file structure and gives general hints on how to run the code.

Also see ATTRIBUTIONS.md for a list of attributions. A plaintext version is included at ATTRIBUTIONS.txt for those without a markdown renderer.

## Directory structure

All code is contained within the `src` directory.
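
For orientation, the layout looks roughly like this (only paths referenced below are shown):

```
src/
├── data/Horne2017_FakeNewsData/Buzzfeed/   # dataset and cached intermediates
│   ├── context.csv
│   ├── context/
│   ├── features_*.csv
│   └── tf_model/
├── classification.py
├── dataset.py
├── eda.py
├── non_latent_features.py
├── preprocess.py
├── sentiment.py
├── similarity.py
├── main.py
├── install
└── *.ipynb                                 # pre-run notebooks
```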

### Dataset

We include the dataset in `src/data/Horne2017_FakeNewsData/Buzzfeed`. Along with the raw dataset, this directory also contains intermediate cached files holding our contributions. These are loaded automatically by our pipeline; a sketch of the caching pattern follows the list below.

- `context.csv` - The URLs manually scraped for each input article, along with the summary we used.
- `context/` - A folder containing all of the downloaded article content. It is not intended for user browsing; it is cached here so we don't have to re-fetch the articles, and its contents are joined automatically when the dataset is loaded.
- `features_*.csv` - All of our features after running the pipeline. These act as a cache so we don't have to run the whole pipeline multiple times.
- `tf_model/` - The saved model for our neural-network classifier. It is loaded by `main.py` when running our pipeline.
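
A minimal sketch of the caching pattern, not the actual API of `dataset.py`; the function names and the `filename` column are illustrative assumptions:

```python
from pathlib import Path

import pandas as pd

DATA_DIR = Path("data/Horne2017_FakeNewsData/Buzzfeed")

def load_context() -> pd.DataFrame:
    """Load the scraped context URLs and summaries, joining in the
    cached article bodies from context/ instead of re-fetching them."""
    df = pd.read_csv(DATA_DIR / "context.csv")
    df["body"] = [
        (DATA_DIR / "context" / name).read_text(encoding="utf-8")
        for name in df["filename"]  # assumed column name
    ]
    return df

def load_cached_features(cache: Path) -> pd.DataFrame | None:
    """Return cached features if a previous pipeline run saved them."""
    return pd.read_csv(cache) if cache.exists() else None
```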

### Software

We include both helper code in `*.py` files and notebooks in `*.ipynb` files. The notebooks have been pre-run, so they can be viewed by simply opening them. The Python files are structured as follows (a sketch of how they compose appears after the list):

- `classification.py` - Helpers for our classification models and a pipeline that loads all features.
- `dataset.py` - Helpers to load the dataset from the `data/` folder.
- `eda.py` - Helper functions for exploratory data analysis.
- `non_latent_features.py` - Helpers used to extract non-latent features.
- `preprocess.py` - The helper to preprocess and tokenize input.
- `sentiment.py` - A legacy helper used to extract sentiment; it has been replaced by code in `non_latent_features.py`.
- `similarity.py` - The summary extractor and similarity model.

- `main.py` - Walks the user through a run of our pipeline: it fetches an article, prompts the user to enter three context articles based on a summary shown to them, builds the features, and performs inference. Note: this is currently broken; it runs to the end but gives nonsensical results.
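
A hedged outline of how these modules might compose in `main.py`; every function name below is an assumption based on the module descriptions above, not the real API:

```python
# Hypothetical outline of main.py's flow; names marked "assumed"
# do not necessarily match the real signatures in this repository.
import classification
import non_latent_features
import preprocess
import similarity

def run_pipeline(article_text: str, context_texts: list[str]) -> float:
    # 1. Preprocess and tokenize the input article (preprocess.py).
    tokens = preprocess.tokenize(article_text)                       # assumed name

    # 2. Extract non-latent features (non_latent_features.py,
    #    which superseded the legacy sentiment.py helper).
    features = non_latent_features.extract(tokens)                   # assumed name

    # 3. Score the article against the user-supplied context
    #    articles (similarity.py).
    features.append(similarity.score(article_text, context_texts))  # assumed name

    # 4. Load the saved classifier (data/.../tf_model) and run
    #    inference (classification.py).
    model = classification.load_model()                              # assumed name
    return model.predict([features])[0]
```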

## Running code

You will need Python >= 3.10 installed; no guarantees are made about GPU support. The ducknewsreporters team uses a combination of Google Colab and local WSL2 systems with CUDA installed to run the software. We believe our code should automatically fall back to the CPU.
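
Assuming `tf_model/` implies a TensorFlow backend, the fallback can be verified with a standard snippet (not part of our codebase); TensorFlow places ops on a GPU when one is visible and silently uses the CPU otherwise:

```python
import tensorflow as tf

# Report which device TensorFlow will use for the saved model.
gpus = tf.config.list_physical_devices("GPU")
print(f"GPUs visible: {len(gpus)}" if gpus else "No GPU found; running on CPU")
```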

To run the code:

```sh
# Make sure you have python>=3.10 installed
$ cd src
# This will install all the dependencies. (May take a while)
$ ./install
$ python3 main.py
```