To complete the course, successful completion of the project described here is required. The project affects grading as follows:
- Completed project is required to pass the course
- Project completed with the bonus task (see below) gives +1 to your grade (you still need to pass the exam; the project bonus cannot lift you from fail to 1)
Use this template for your project.
The project can be completed either individually or as a group of max two people. Group projects have some additional requirements; see below.
The project involves creating a machine learning-based method for a text classification task of your choice. Specifically, you should
- Select a text classification corpus to work on. (see below)
- Read the paper presenting the corpus, as well as any other relevant published materials about it, to ensure you understand the data
- Identify the state-of-the-art performance for this corpus (i.e. the best published results). Some corpora (but not all) have public leaderboards that can help here.
- Write code to
- download the corpus
- perform any necessary preprocessing
- train a machine learning method on the training set, evaluating performance on the validation set
- perform hyperparameter optimization
- evaluate your final model on the test set
- Write a summary of
- what you learned about the corpus (referring also to the paper which you read)
- your results
- how your results relate to the state of the art
- if completed as a group project, a section detailing who did what
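The coding steps above could be sketched roughly as follows. This is a minimal illustration of the train/validate/optimize/test workflow using scikit-learn on toy in-memory data; for the actual project you would instead load your chosen corpus (e.g. with the Hugging Face `datasets` library) and likely use a stronger model. All texts and labels below are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Toy stand-ins for the corpus splits (hypothetical data).
train_texts = ["great movie", "awful film", "loved it", "terrible acting"]
train_labels = [1, 0, 1, 0]
val_texts = ["wonderful story", "boring plot"]
val_labels = [1, 0]
test_texts = ["fantastic film", "dreadful movie"]
test_labels = [1, 0]

# Preprocessing: turn raw text into TF-IDF feature vectors.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_val = vectorizer.transform(val_texts)
X_test = vectorizer.transform(test_texts)

# Hyperparameter optimization: choose the regularization strength C
# that gives the best accuracy on the validation set.
best_acc, best_model = -1.0, None
for C in [0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(C=C)
    model.fit(X_train, train_labels)
    acc = accuracy_score(val_labels, model.predict(X_val))
    if acc > best_acc:
        best_acc, best_model = acc, model

# Final evaluation on the test set, done only once with the chosen model.
test_acc = accuracy_score(test_labels, best_model.predict(X_test))
print(f"test accuracy: {test_acc:.2f}")
```

Note that the test set is touched only once, after hyperparameter selection is finished; selecting hyperparameters against the test set would inflate your reported results.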
Choose one of the following corpora to work on. (If you'd prefer to suggest a corpus not listed here, get in touch with us on Discord!)
Corpus name | Labels | Subset sizes | Description |
---|---|---|---|
imdb | neg pos | train:25000 unsupervised:50000 test:25000 | Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well. |
amazon_polarity | negative positive | train:3600000 test:400000 | The Amazon reviews dataset consists of reviews from Amazon. The data span a period of 18 years, including ~35 million reviews up to March 2013. Reviews include product and user information, ratings, and a plaintext review. |
rotten_tomatoes | neg pos | train:8530 validation:1066 test:1066 | Movie Review Dataset. This is a dataset containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. This data was first used in Bo Pang and Lillian Lee, "Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales.", Proceedings of the ACL, 2005. |
sst2 | negative positive | train:67349 validation:872 test:1821 | The Stanford Sentiment Treebank consists of sentences from movie reviews and human annotations of their sentiment. The task is to predict the sentiment of a given sentence. We use the two-way (positive/negative) class split, and use only sentence-level labels. |
emo | angry happy others sad | train:30160 test:5509 | In this dataset, given a textual dialogue i.e. an utterance along with two previous turns of context, the goal was to infer the underlying emotion of the utterance by choosing from four emotion classes - Happy, Sad, Angry and Others. |
emotion | anger fear joy love sadness surprise | train:16000 validation:2000 test:2000 | Emotion is a dataset of English Twitter messages with six basic emotions: anger, fear, joy, love, sadness, and surprise. For more detailed information please refer to the paper. |
Note: if you choose a corpus that does not have a validation set, you should split off a random portion of the training data to use for validation.
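A validation split can be made in one line. With the Hugging Face `datasets` library, `dataset["train"].train_test_split(test_size=0.1, seed=42)` does this directly; the sketch below shows the same idea with scikit-learn on plain lists (the documents and labels are hypothetical placeholders).

```python
from sklearn.model_selection import train_test_split

texts = [f"document {i}" for i in range(100)]   # hypothetical documents
labels = [i % 2 for i in range(100)]            # hypothetical binary labels

# Hold out 10% of the training data for validation; stratify so the
# label distribution is preserved, and fix the seed for reproducibility.
train_texts, val_texts, train_labels, val_labels = train_test_split(
    texts, labels, test_size=0.1, random_state=42, stratify=labels
)
print(len(train_texts), len(val_texts))  # 90 10
```

Fixing the random seed matters here: your validation split should stay the same across runs so that hyperparameter comparisons are fair.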
As an optional bonus task, you may also do the following. Completing this bonus task in addition to the basic project gives +1 to your grade.
- Choose 50 documents (or 100 documents if group project) that do not represent the text domain of the corpus (e.g. online discussion vs. news)
- Annotate these documents in the same way the original corpus was annotated
- Convert your annotations into a dataset
- Evaluate your method on this additional out-of-domain test set
- Include your annotated data and the results of this evaluation in your report
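Converting your annotations into a dataset can be as simple as reading them from a CSV file into parallel text/label lists. The sketch below assumes a two-column (text, label) format; the example rows and label names are hypothetical, and in practice you would read from a file such as your own annotations CSV rather than an inline string.

```python
import csv
import io

# Stand-in for the contents of your annotations file (hypothetical rows).
annotated = """text,label
This forum thread is really helpful,pos
Worst customer service ever,neg
"""

ood_texts, ood_labels = [], []
for row in csv.DictReader(io.StringIO(annotated)):
    ood_texts.append(row["text"])
    ood_labels.append(row["label"])

print(len(ood_texts))  # 2
```

These lists can then be vectorized and scored exactly like the in-domain test set, which makes the in-domain and out-of-domain results directly comparable in your report.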
Return your project as a Python notebook (following this template) that includes both execution results and the descriptions detailed above.