A classifier that distinguishes political from non-political news articles.

lukasgebhard/Political-News-Filter

Political News Filter

Political News Filter classifies English news articles by whether they cover policy topics.

It uses a broad characterization of politics: politics is about "who gets what, when, and how" (Lasswell, 1936). As a result, Political News Filter may classify business or tech news as political, depending on the actual content.

Requirements

  • Python 3.6
  • Pandas 0.24.1
  • NumPy 1.18.1
  • Keras 2.3.1
  • TensorFlow 2.1.0

Political News Filter supports both CPU and GPU processing. The latter is faster but requires a CUDA-capable graphics card and the CUDA toolkit.
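To see which of the two modes applies on a given machine, a minimal sketch such as the following may help. It assumes TensorFlow 2.1 or later, where `tf.config.list_physical_devices` is available. The helper name `gpu_available` is hypothetical and not part of Political News Filter.

```python
def gpu_available():
    """Return True if TensorFlow reports at least one GPU device.

    Hypothetical helper for illustration; not part of Political News Filter.
    """
    try:
        import tensorflow as tf  # imported lazily, since it may not be installed
    except ImportError:
        return False
    return len(tf.config.list_physical_devices("GPU")) > 0


if __name__ == "__main__":
    print("GPU processing available:", gpu_available())
```

If this prints `False` on a machine with a CUDA-capable card, the CUDA toolkit installation is a likely place to look first.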

Setup

  1. Clone this repository:

    $ git clone https://github.com/lukasgebhard/Political-News-Filter.git
    $ cd Political-News-Filter
  2. Download and extract pon_classifier.zip into the repository folder. Once extracted, it occupies 1.1 GB.

  3. Install Python dependencies. For example, create a virtual environment:

    $ virtualenv --python=python3.6 venv
    $ source venv/bin/activate
    $ pip install -r requirements.txt
  4. Verify the installation was successful:

    $ ./check_installation.sh
    Hooray! Political News Filter is properly installed and ready to use.

Usage Demo

Start a Python session:

$ python3

Create example articles:

>>> political_article = '''White House declares war against terror. The US government officially announced a ''' \
                        '''large-scale military offensive against terrorism. Today, the Senate agreed to spend an ''' \
                        '''additional 300 billion dollars on the advancement of combat drones to be used against ''' \
                        '''global terrorism. Opposition members sharply criticize the government. ''' \
                        '''"War leads to fear and suffering. ''' \
                        '''Fear and suffering is the ideal breeding ground for terrorism. So talking about a ''' \
                        '''war against terror is cynical. It's actually a war supporting terror."'''
>>> nonpolitical_article = '''Table tennis world cup 2025 takes place in South Korea. ''' \
                           '''The 2025 world cup in table tennis will be hosted by South Korea, ''' \
                           '''the Table Tennis World Committee announced yesterday. ''' \
                           '''Three-time world champion, Hu Ho Han, did not pass the qualification round, ''' \
                           '''to the advantage of underdog Bob Bobby who has been playing outstanding matches ''' \
                           '''in the National Table Tennis League this year.'''

To filter a list of news articles, call filter_news:

>>> from political_news_filter import filter_news
>>> political_article == filter_news([political_article, nonpolitical_article])[0]
True

If you need more flexibility, you can directly call the underlying classifier:

>>> from political_news_filter import Classifier
>>> classifier = Classifier()
>>> probabilities = classifier.estimate([political_article, nonpolitical_article])
>>> probabilities[0] > 0.99
True
>>> probabilities[1] < 0.01
True

Please read the docstrings for further information.
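One way to use the probabilities returned by `classifier.estimate` is to apply a custom decision threshold. The sketch below assumes each value is the estimated probability that the corresponding article is political, as the demo output suggests. The function `select_political`, the default threshold of 0.5, and the stand-in values are hypothetical; whether `filter_news` uses a similar cutoff internally is not stated here.

```python
def select_political(articles, probabilities, threshold=0.5):
    """Keep the articles whose estimated probability reaches the threshold.

    Hypothetical helper; `threshold` is an illustrative parameter.
    """
    if len(articles) != len(probabilities):
        raise ValueError("articles and probabilities must have the same length")
    return [a for a, p in zip(articles, probabilities) if p >= threshold]


# Stand-in values for illustration, not real classifier output.
articles = ["article A", "article B", "article C"]
probabilities = [0.97, 0.40, 0.08]

strict = select_political(articles, probabilities, threshold=0.9)
lenient = select_political(articles, probabilities, threshold=0.3)
print(strict)
print(lenient)
```

A higher threshold trades recall for precision, and a lower one does the opposite.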

Runtime Performance

The following benchmarks were measured on a notebook with 6 CPU cores @ 2.6 GHz, a GPU with 4 GB of memory and CUDA compute capability 7.5, 32 GB RAM, and a PCIe SSD:

Task                                On CPU     On GPU
One-time initialization             30 sec     15 sec
Classification of 1,000 articles    1.8 sec    1.3 sec
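As a rough guide, the benchmark figures can be turned into a runtime estimate for larger batches. The sketch below assumes classification time scales linearly with the number of articles, which the benchmarks do not confirm. The function name `estimated_seconds` is hypothetical.

```python
# Benchmark figures from the table above, in seconds.
INIT_SECONDS = {"cpu": 30.0, "gpu": 15.0}
SECONDS_PER_1000 = {"cpu": 1.8, "gpu": 1.3}


def estimated_seconds(n_articles, device):
    """Estimate total runtime, assuming linear scaling (an assumption)."""
    return INIT_SECONDS[device] + n_articles / 1000 * SECONDS_PER_1000[device]


for device in ("cpu", "gpu"):
    throughput = 1000 / SECONDS_PER_1000[device]  # articles per second
    total = estimated_seconds(100000, device)
    print(f"{device}: ~{throughput:.0f} articles/sec, "
          f"~{total:.0f} sec for 100,000 articles")
```

Under this assumption, the one-time initialization dominates for small batches, while the per-article cost dominates for large ones.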

Architecture

The classifier is based on a model by Heng Zheng submitted to Kaggle under the Apache 2.0 license. It is a convolutional neural network with a 100-dimensional GloVe embedding layer, three convolutional layers, each followed by a ReLU layer and a pooling layer, and a final softmax output layer. Training minimizes a cross-entropy loss, with dropout used for regularization.
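To illustrate the last two pieces of this description, here is a minimal pure-Python sketch of a softmax output and the cross-entropy loss computed from it. It is not the project's code. The two-class layout, the class order, and the logit values are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probabilities, true_class):
    """Negative log-probability assigned to the true class."""
    return -math.log(probabilities[true_class])


# Hypothetical scores for the classes [political, non-political].
logits = [2.0, -1.0]
probs = softmax(logits)

loss_if_political = cross_entropy(probs, 0)      # the model favors this class
loss_if_nonpolitical = cross_entropy(probs, 1)   # the model would be wrong
print(probs, loss_if_political, loss_if_nonpolitical)
```

The loss is small when the model assigns high probability to the true class and large otherwise, which is what training drives down.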

Training & Evaluation

I created a labeled set of 0.57M news articles, selected from:

After fitting the classifier on 87.5 % of the articles, testing it on the remaining 12.5 % yields:

  • F1 = 94.4
  • Precision = 95.6
  • Recall = 93.2
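As a quick consistency check, F1 is the harmonic mean of precision and recall, so the reported value can be recomputed from the other two. The sketch below assumes the metrics are percentages. The split sizes are approximate, since the total of 0.57M is itself rounded.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


f1 = f1_score(95.6, 93.2)
print(round(f1, 1))  # 94.4

# Approximate split sizes, based on the rounded total of 0.57M articles.
total_articles = 570000
train_size = int(total_articles * 0.875)
test_size = total_articles - train_size
print(train_size, test_size)  # 498750 71250
```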

How to Cite

If you use Political News Filter, please cite our poster:

@InProceedings{POLUSA,
  author     = {Gebhard, Lukas and Hamborg, Felix},
  title      = {The POLUSA Dataset: 0.9M Political News Articles Balanced by Time and Outlet Popularity},
  year       = {2020},
  month      = {August},
  booktitle  = {Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020 (JCDL '20)},
  venue      = {Virtual event, China},
  publisher  = {Association for Computing Machinery},
  doi        = {10.1145/3383583.3398567}
}