To complete the course, successful completion of the project described here is required. The project affects grading as follows:
- Completed project is required to pass the course
- Project completed with the bonus task (see below) gives +1 to your grade (you still need to pass the exam; the project bonus cannot lift you from fail to 1)
Use this template for your project.
The project can be completed either individually or as a group of max two people. Group projects have some additional requirements; see below.
The project involves creating a machine learning-based method for a text classification task of your choice. Specifically, you should
- Select a text classification corpus to work on. (see below)
- Read the paper presenting the corpus, as well as any other relevant published materials about it, to ensure you understand the data
- Identify the state-of-the-art performance for this corpus (i.e. the best published results). Some corpora (but not all) have public leaderboards that can help here.
- Write code to
- download the corpus
- perform any necessary preprocessing
- train a machine learning method on the training set, evaluating performance on the validation set
- perform hyperparameter optimization
- evaluate your final model on the test set
- Write a summary of
- what you learned about the corpus (referring also to the paper which you read)
- your results
- how your results relate to the state of the art
- if completed as a group project, a section detailing who did what
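The coding steps above could be sketched roughly as follows. This is a minimal illustration of the train/validate/optimize/test workflow using scikit-learn on toy in-memory data; for the actual project you would instead load your chosen corpus (e.g. with the Hugging Face `datasets` library) and likely use a stronger model. All texts and labels below are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Toy stand-ins for the corpus splits (hypothetical data).
train_texts = ["great movie", "awful film", "loved it", "terrible acting"]
train_labels = [1, 0, 1, 0]
val_texts = ["wonderful story", "boring plot"]
val_labels = [1, 0]
test_texts = ["fantastic film", "dreadful movie"]
test_labels = [1, 0]

# Preprocessing: turn raw text into TF-IDF feature vectors.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_val = vectorizer.transform(val_texts)
X_test = vectorizer.transform(test_texts)

# Hyperparameter optimization: choose the regularization strength C
# that gives the best accuracy on the validation set.
best_acc, best_model = -1.0, None
for C in [0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(C=C)
    model.fit(X_train, train_labels)
    acc = accuracy_score(val_labels, model.predict(X_val))
    if acc > best_acc:
        best_acc, best_model = acc, model

# Final evaluation on the test set, done only once with the chosen model.
test_acc = accuracy_score(test_labels, best_model.predict(X_test))
print(f"test accuracy: {test_acc:.2f}")
```

Note that the test set is touched only once, after hyperparameter selection is finished; selecting hyperparameters against the test set would inflate your reported results.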
Choose one of the following corpora to work on. (If you'd prefer to suggest a corpus not listed here, get in touch with us on Discord!)
Corpus name | Labels | Subset sizes | Description |
---|---|---|---|
imdb | neg pos | train:25000 unsupervised:50000 test:25000 | Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well. |
amazon_polarity | negative positive | train:3600000 test:400000 | The Amazon reviews dataset consists of reviews from Amazon. The data span a period of 18 years, including ~35 million reviews up to March 2013. Reviews include product and user information, ratings, and a plaintext review. |
rotten_tomatoes | neg pos | train:8530 validation:1066 test:1066 | Movie Review Dataset. This is a dataset containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. This data was first used in Bo Pang and Lillian Lee, "Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales.", Proceedings of the ACL, 2005. |
sst2 | negative positive | train:67349 validation:872 test:1821 | The Stanford Sentiment Treebank consists of sentences from movie reviews and human annotations of their sentiment. The task is to predict the sentiment of a given sentence. We use the two-way (positive/negative) class split, and use only sentence-level labels. |
emo | angry happy others sad | train:30160 test:5509 | In this dataset, given a textual dialogue i.e. an utterance along with two previous turns of context, the goal was to infer the underlying emotion of the utterance by choosing from four emotion classes - Happy, Sad, Angry and Others. |
emotion | anger fear joy love sadness surprise | train:16000 validation:2000 test:2000 | Emotion is a dataset of English Twitter messages with six basic emotions: anger, fear, joy, love, sadness, and surprise. For more detailed information please refer to the paper. |
Note: if you choose a corpus that does not have a validation set, you should split off a random portion of the training data to use for validation.
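A validation split can be made in one line. With the Hugging Face `datasets` library, `dataset["train"].train_test_split(test_size=0.1, seed=42)` does this directly; the sketch below shows the same idea with scikit-learn on plain lists (the documents and labels are hypothetical placeholders).

```python
from sklearn.model_selection import train_test_split

texts = [f"document {i}" for i in range(100)]   # hypothetical documents
labels = [i % 2 for i in range(100)]            # hypothetical binary labels

# Hold out 10% of the training data for validation; stratify so the
# label distribution is preserved, and fix the seed for reproducibility.
train_texts, val_texts, train_labels, val_labels = train_test_split(
    texts, labels, test_size=0.1, random_state=42, stratify=labels
)
print(len(train_texts), len(val_texts))  # 90 10
```

Fixing the random seed matters here: your validation split should stay the same across runs so that hyperparameter comparisons are fair.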
As an optional bonus task, you may also do the following. Completing this bonus task in addition to the basic project gives +1 to your grade.
- Choose 50 documents (or 100 documents if group project) that do not represent the text domain of the corpus (e.g. online discussion vs. news)
- Annotate these documents in the same way the original corpus was annotated
- Convert your annotations into a dataset
- Evaluate your method on this additional out-of-domain test set
- Include your annotated data and the results of this evaluation in your report
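Converting your annotations into a dataset can be as simple as reading them from a CSV file into parallel text/label lists. The sketch below assumes a two-column (text, label) format; the example rows and label names are hypothetical, and in practice you would read from a file such as your own annotations CSV rather than an inline string.

```python
import csv
import io

# Stand-in for the contents of your annotations file (hypothetical rows).
annotated = """text,label
This forum thread is really helpful,pos
Worst customer service ever,neg
"""

ood_texts, ood_labels = [], []
for row in csv.DictReader(io.StringIO(annotated)):
    ood_texts.append(row["text"])
    ood_labels.append(row["label"])

print(len(ood_texts))  # 2
```

These lists can then be vectorized and scored exactly like the in-domain test set, which makes the in-domain and out-of-domain results directly comparable in your report.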
Return your project as a Python notebook (following this template) that includes both execution results and the descriptions detailed above.