The 22 datasets in this repository were carefully selected from a larger, more extensive collection according to several criteria.
Below, you can see the process diagram of the creation of MBIB. For more details about the selection process, please see our paper.
Each task is represented by a set of bias-related datasets. Their distribution and sizes are shown in the following table.
| Bias Type | Dataset | Data Points |
|---|---|---|
| Linguistic Bias | Wikipedia NPOV | 11,945 |
| | BABE | 3,673 |
| | Wiki Neutrality Corpus | 362,991 |
| | UsVsThem | 6,863 |
| | RedditBias | 10,583 |
| | Media Frames Corpus | 37,622 |
| | BASIL | 1,726 |
| | Biased Sentences | 842 |
| | Σ | 433,677 |
| Cognitive Bias | BIGNEWS | 2,331,552 |
| | Liar Dataset | 12,835 |
| | Σ | 2,344,387 |
| Text-Level Context | Contextual Abuse Dataset | 26,235 |
| | Multidimensional Dataset | 2,094 |
| | Σ | 28,329 |
| Hate Speech | Kaggle Jigsaw | 1,999,516 |
| | HateXplain | 20,148 |
| | RedditBias | 10,583 |
| | Online Harassment Corpus | 20,427 |
| | Σ | 2,050,674 |
| Gender Bias | RedditBias | 3,000 |
| | RtGender | 15,351 |
| | Workplace Sexism | 1,136 |
| | CMSB | 13,634 |
| | Σ | 33,121 |
| Racial Bias | RedditBias | 2,620 |
| | RacialBias | 751 |
| | Σ | 2,371 |
| Fake News | Liar Dataset | 12,835 |
| | PHEME | 5,222 |
| | FakeNewsNet | 6,337 |
| | Σ | 24,394 |
| Political Bias | UsVsThem | 6,863 |
| | BIGNEWS | 2,331,552 |
| | SemEval | 9,783 |
| | Σ | 2,348,198 |
`data_utils.py`
- Contains the `TweetLoader` class, which fetches batches or single tweets from the Twitter API.
- Contains the `MBIBDataLoader` class, which aggregates the processed datasets into the 8 tasks. This class is used by the `create_corpus.py` script to create the final corpus (see the sketch below).
- Contains
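For illustration, here is a minimal sketch of the aggregation idea behind `MBIBDataLoader` and `create_corpus.py`: per-dataset files are concatenated into a single task-level corpus. This is not the actual class API; the abbreviated task-to-dataset mapping, the `datasets/processed/<dataset>.csv` layout, and the `text`/`label` columns are assumptions made only for this example.

```python
# Illustrative sketch only -- not the MBIBDataLoader implementation.
# Assumes processed datasets live in CSV files with "text" and "label" columns.
from pathlib import Path

import pandas as pd

# Abbreviated, hypothetical mapping from task name to its processed datasets.
TASKS = {
    "linguistic-bias": ["wikipedia_npov", "babe", "wnc"],
    "gender-bias": ["redditbias_gender", "rtgender", "cmsb"],
}


def build_task_corpus(task: str, processed_dir: Path) -> pd.DataFrame:
    """Concatenate all processed datasets that belong to one task."""
    frames = []
    for dataset in TASKS[task]:
        df = pd.read_csv(processed_dir / f"{dataset}.csv")
        df["dataset"] = dataset  # keep provenance of every data point
        frames.append(df[["text", "label", "dataset"]])
    return pd.concat(frames, ignore_index=True)


if __name__ == "__main__":
    corpus = build_task_corpus("gender-bias", Path("datasets/processed"))
    corpus.to_csv("mbib_gender_bias.csv", index=False)
```

Keeping a `dataset` column for provenance makes it easy to trace every data point in the final corpus back to its source dataset, which matters when the same dataset (e.g., RedditBias) contributes to more than one task.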