This project scrapes custom IMDB data for 10,000 movies and shows released between 2020 and 2021, sorted by number of votes, fine-tunes a pre-trained DistilBERT Transformer using transfer learning, and then saves the trained model so it can be reloaded for later use.
- DistilBERT Transformer
- TensorFlow
- NumPy and pandas
- Selenium, BeautifulSoup4 and requests
- Accuracy achieved: 81.3492%
- ROC_AUC_Score achieved: 0.7217
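For readers unfamiliar with the two metrics above, here is a minimal pure-Python sketch of how accuracy and ROC AUC can be computed from true labels and predicted scores. It is an illustration only: the function names and the toy data are hypothetical, and the project itself may well compute these figures with a library function instead.

```python
def accuracy(y_true, y_pred):
    """Fraction of predicted labels that match the true labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """ROC AUC as the probability that a random positive example
    receives a higher score than a random negative one (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical, not taken from the project).
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
predicted = [1 if s >= 0.5 else 0 for s in scores]

print(accuracy(labels, predicted))  # 0.75
print(roc_auc(labels, scores))      # 0.75
```

Accuracy depends on the chosen decision threshold (0.5 here, an assumption), while ROC AUC only depends on how the scores rank the examples, which is why the two numbers can differ noticeably on imbalanced data.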
1) Ensure Python and Jupyter Notebook are installed. Optionally, a Conda environment can be used instead.
- Install the required modules using
pip install -r requirements.txt
or
conda install --file requirements.txt
or, on Google Colab,
!pip install -r requirements.txt
- Selenium requires browser-specific drivers. Guides for Chrome and Firefox are linked below. This step can be skipped if the notebook is run on Google Colab.
Chrome: https://chromedriver.chromium.org/getting-started
Firefox: https://www.lambdatest.com/blog/selenium-firefox-driver-tutorial/
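Before opening the notebooks, it can help to confirm that the setup steps above actually took effect. The following is a minimal sketch, not part of the project: the helper names are hypothetical, and the import names (for example `bs4` for BeautifulSoup4) are assumed from the packages listed in this README rather than read from requirements.txt.

```python
import importlib.util
import shutil

# Import names assumed for the packages named in this README.
REQUIRED_MODULES = ("tensorflow", "numpy", "pandas", "selenium", "bs4", "requests")

# Executable names of the Chrome and Firefox WebDriver binaries.
DRIVER_EXECUTABLES = ("chromedriver", "geckodriver")


def missing_modules(names=REQUIRED_MODULES):
    """Return the top-level modules that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def locate_drivers(names=DRIVER_EXECUTABLES):
    """Map each driver name to its path on PATH, or None if it is not found."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    print("Missing modules:", missing_modules() or "none")
    for driver, path in locate_drivers().items():
        print(f"{driver}: {path or 'not found on PATH'}")
```

Only one of the two drivers is needed, depending on the browser used for scraping.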
1) (Optional) Run the IMDB Web scraper. This regenerates the CSV file and the imdb_movies pickle file that are already provided.
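As an illustration of the last part of the scraping step, the sketch below writes a list of scraped records to a CSV file and a pickle file, sorted by number of votes. Everything specific here is an assumption: the column names, the file names and extensions, and the use of the standard library instead of pandas are hypothetical and may not match what the notebook does.

```python
import csv
import pickle
from pathlib import Path

# Hypothetical column names; the real CSV may use different ones.
FIELDS = ("title", "description", "rating", "votes")


def save_records(records, out_dir, csv_name="imdb_movies.csv",
                 pickle_name="imdb_movies.pkl"):
    """Sort records by vote count (descending) and save them as CSV and pickle."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    ordered = sorted(records, key=lambda r: r["votes"], reverse=True)

    csv_path = out_dir / csv_name
    with csv_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(ordered)

    pickle_path = out_dir / pickle_name
    with pickle_path.open("wb") as fh:
        pickle.dump(ordered, fh)

    return csv_path, pickle_path
```

A call such as `save_records(rows, "data")` would create both files under a `data` directory (a hypothetical location), where each element of `rows` is a dict with exactly the keys in `FIELDS`.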
- Run the IMDB Web scraper in an environment that has GPU acceleration. Here it is run on Google Colab, which allocates an Nvidia Tesla T4 or an Nvidia Tesla K80.
Training time: roughly 20-25 minutes
Epochs: 10
Training batch size: 8
Max length of each sentence: 512
A Movie_prediction_model directory is created, containing a config.json file (provided) and a tf_model.h5 file (not provided due to space constraints).
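The training setup described above (10 epochs, batch size 8, maximum length 512, output in Movie_prediction_model) could be sketched roughly as follows. This is a hedged outline, not the notebook's code: it assumes the Hugging Face `transformers` library with its TensorFlow classes is what produces config.json and tf_model.h5, and the base checkpoint `distilbert-base-uncased`, the learning rate, and the names `TrainConfig`, `steps_per_epoch`, and `fine_tune` are all assumptions introduced for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    epochs: int = 10                                   # from this README
    batch_size: int = 8                                # from this README
    max_length: int = 512                              # from this README
    model_dir: str = "Movie_prediction_model"          # from this README
    base_checkpoint: str = "distilbert-base-uncased"   # assumption
    learning_rate: float = 5e-5                        # assumption


def steps_per_epoch(n_examples, cfg):
    """Number of batches needed to see every training example once."""
    return math.ceil(n_examples / cfg.batch_size)


def fine_tune(texts, labels, cfg=None):
    """Fine-tune DistilBERT for binary classification and save it to cfg.model_dir.

    Heavy dependencies are imported lazily so the helpers above work without them.
    """
    import tensorflow as tf
    from transformers import (DistilBertTokenizerFast,
                              TFDistilBertForSequenceClassification)

    cfg = cfg or TrainConfig()
    tokenizer = DistilBertTokenizerFast.from_pretrained(cfg.base_checkpoint)
    encodings = tokenizer(list(texts), truncation=True, padding=True,
                          max_length=cfg.max_length)
    dataset = tf.data.Dataset.from_tensor_slices((dict(encodings), list(labels)))
    dataset = dataset.shuffle(len(labels)).batch(cfg.batch_size)

    model = TFDistilBertForSequenceClassification.from_pretrained(
        cfg.base_checkpoint, num_labels=2)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=cfg.learning_rate),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )
    model.fit(dataset, epochs=cfg.epochs)
    model.save_pretrained(cfg.model_dir)  # writes config.json and tf_model.h5
    return model
```

With a hypothetical training split of 8,000 descriptions, `steps_per_epoch(8000, TrainConfig())` gives 1,000 batches per epoch, which is why GPU acceleration matters for the training time quoted above.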
1) Ensure the model has been created inside the Movie_prediction_model directory.
- Run the Python file using
python DistilBERT_Movie_Classifier.py
- Enter the description of the movie or TV show you want a prediction for. The output is a binary prediction of success based on IMDB ratings.
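The prediction step can be pictured as: load the saved model, tokenize the description, and turn the two output logits into a binary label. The sketch below is an assumption-laden illustration rather than the contents of DistilBERT_Movie_Classifier.py: it assumes the Hugging Face `transformers` TensorFlow classes, a tokenizer loaded from `distilbert-base-uncased`, and that class index 1 means "success". The function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_success(logits):
    """Return (label, probability of class 1) from two-class logits.

    Assumes index 1 is the 'success' class.
    """
    prob_success = softmax(logits)[1]
    return (1 if prob_success >= 0.5 else 0), prob_success


def classify_description(description, model_dir="Movie_prediction_model",
                         tokenizer_name="distilbert-base-uncased"):
    """Load the saved model and classify one description (lazy heavy imports)."""
    from transformers import (DistilBertTokenizerFast,
                              TFDistilBertForSequenceClassification)

    tokenizer = DistilBertTokenizerFast.from_pretrained(tokenizer_name)
    model = TFDistilBertForSequenceClassification.from_pretrained(model_dir)
    inputs = tokenizer(description, truncation=True, max_length=512,
                       return_tensors="tf")
    logits = model(**inputs).logits.numpy()[0].tolist()
    return predict_success(logits)


if __name__ == "__main__":
    text = input("Enter the description of the movie or TV show: ")
    label, prob = classify_description(text)
    print(f"Predicted success: {label} (probability {prob:.3f})")
```

Separating `predict_success` from the model call keeps the decision rule easy to check by hand, for example `predict_success([0.2, 1.5])` yields label 1 with a probability of about 0.786.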