This project demonstrates how natural language processing can be used to automatically identify the specific messages customers send to product developers about their products through unsolicited feedback. Specifically, it applies topic modeling to a large corpus of prose-text product reviews.
The data set used in the project is the Amazon Fine Food Reviews set, composed of 568,454 reviews from 256,059 reviewers covering nearly 75,000 different products, primarily foods.
The project performs unsupervised learning on this large corpus of prose-text reviews, producing a structural model that abstracts the latent properties of the corpus. That structure is then used to identify the specific subjects mentioned in the reviews. The topic model quantifies the prevalence of each concept in the underlying corpus, so a user not only knows what is being said but can also gauge, numerically, how significant each subject is. It also enables relating one subject to another, for example determining whether two subjects are similar and to what degree.
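As a minimal, self-contained sketch of those two measurements, assuming the scikit-learn stack this project uses (the toy corpus and parameter values below are illustrative, not the notebook's):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-in corpus; the notebook uses the cleaned review texts.
docs = [
    "great coffee rich flavor and fast shipping",
    "the coffee arrived stale with a weak flavor",
    "my dog loves these treats healthy ingredients",
    "these treats made my dog sick awful ingredients",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)  # each row is one document's topic mixture

# Prevalence: the average weight of each topic across all documents.
prevalence = doc_topic.mean(axis=0)
print("topic prevalence:", np.round(prevalence, 2))

# Similarity: cosine similarity between the topics' word distributions.
topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
print("topic similarity:\n", np.round(cosine_similarity(topic_word), 2))
```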
The methodology also lets a user, for a specific subject, compare and contrast highly positive and highly negative points of view on that same subject. This makes it quick to determine what customers like or dislike about any of the products covered by the underlying corpus.
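One hedged way to realize that comparison, assuming the data set's "Score" column (star ratings) and a document-topic matrix from a fitted LDA model; the values below are toy stand-ins:

```python
import numpy as np
import pandas as pd

# Toy stand-ins: in practice `scores` comes from the data set's "Score"
# column and `doc_topic` from LatentDirichletAllocation.fit_transform().
scores = pd.Series([5, 5, 1, 1], name="Score")
doc_topic = np.array([[0.9, 0.1], [0.7, 0.3], [0.2, 0.8], [0.4, 0.6]])

topics = pd.DataFrame(doc_topic, columns=["topic_0", "topic_1"])

# Average topic mix among positive (4-5 star) vs. negative (1-3 star) reviews.
by_sentiment = topics.groupby(scores >= 4).mean()
by_sentiment.index = ["negative (1-3 stars)", "positive (4-5 stars)"]
print(by_sentiment)
```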
Below are diagrams of the NLP pipeline used to perform the topic modeling:
- TFIDF vectorization
- LDA and PCA to identify topics and visualize document relationships, overlaid with per-document topic assignments
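Condensed into code, the pipeline in those diagrams looks roughly like the sketch below (a toy corpus and illustrative parameters, not the notebook's settings):

```python
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, PCA

# Toy corpus standing in for the cleaned review texts.
docs = [
    "great coffee rich flavor",
    "stale coffee weak flavor",
    "dog treats healthy ingredients",
    "dog treats awful ingredients",
    "delicious tea fast shipping",
    "tea arrived late poor packaging",
]

X = TfidfVectorizer().fit_transform(docs)              # TFIDF vectorization
doc_topic = LatentDirichletAllocation(
    n_components=3, random_state=0
).fit_transform(X)                                     # LDA topic mixtures

coords = PCA(n_components=2).fit_transform(doc_topic)  # 2-D projection
plt.scatter(coords[:, 0], coords[:, 1], c=doc_topic.argmax(axis=1))
plt.title("Documents in topic space, colored by dominant topic")
plt.show()
```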
The project uses the following tools and libraries:

- Kaggle: initial data set
- python `re`: regular expressions to clean HTML and other strings that would have hampered downstream analysis of the text corpus
- pandas: cleaning and analysis of topic-modeling results
- numpy: analysis of topic-modeling results
- NLTK: tokenization (splitting into words) and lemmatization of the original text corpus, plus cleaning (removal of non-English words); see the sketch after this list
- sklearn: TFIDF vectorization of the documents, LDA of the vectorized documents, and the PCA that enabled visualization of the documents, topics, and their relationships to the underlying text
- matplotlib: visualization of the underlying text corpus, the calculated topics, and their relationships
- seaborn: visualization
- pyLDAvis: several visualizations of the resulting topic model in relation to the text corpus
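A hedged sketch of that cleaning step (not the notebook's exact code), combining `re`-based HTML stripping with NLTK tokenization, lemmatization, and an English-vocabulary filter:

```python
import re
import nltk
from nltk.corpus import words
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download("punkt")    # tokenizer model
nltk.download("wordnet")  # lemmatizer data
nltk.download("words")    # English vocabulary list

english_vocab = set(w.lower() for w in words.words())
lemmatizer = WordNetLemmatizer()

def clean_review(text):
    text = re.sub(r"<[^>]+>", " ", text)   # strip HTML tags such as <br />
    tokens = word_tokenize(text.lower())   # split into words
    lemmas = [lemmatizer.lemmatize(t) for t in tokens if t.isalpha()]
    return [t for t in lemmas if t in english_vocab]  # English words only

print(clean_review("Great<br />coffee, smooth and rich!"))
# ['great', 'coffee', 'smooth', 'and', 'rich']
```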
The repository contains:

- `Source/Project4.ipynb`: the Jupyter notebook for the project
- `Documents/Link_to_Presentation.md`: Markdown file with a link to the presentation
- `Documents/Images`: images used in this file and other files for documentation purposes
To run the project:

- Clone this repo
- Run this command: `conda env create -f Source/project4_venv.yml`
- Once that's done and you have activated the virtual environment, you should be able to start JupyterLab with this command: `jupyter lab`
- In JupyterLab, open the notebook `Source/Project4.ipynb`
- You should be able to run the cells. Please note that the notebook's default settings require .pkl files of the cleaned data; contact me if you want that data. Depending on the option settings in the notebook, you can instead re-run the cells to reproduce everything from scratch, in which case you will have to download the CSV file associated with the data set first. A sketch of that load-or-rebuild pattern follows.
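Purely as illustration of the pattern the option settings imply (the file names here are assumptions, not the notebook's actual paths):

```python
import os
import pandas as pd

CLEANED_PKL = "cleaned_reviews.pkl"  # hypothetical cache of the cleaned data
RAW_CSV = "Reviews.csv"              # hypothetical name of the Kaggle CSV

if os.path.exists(CLEANED_PKL):
    reviews = pd.read_pickle(CLEANED_PKL)  # fast path: reuse cached cleaning
else:
    reviews = pd.read_csv(RAW_CSV)         # slow path: rebuild from scratch
    # ... cleaning steps (see the NLTK sketch above) would run here ...
    reviews.to_pickle(CLEANED_PKL)         # cache for subsequent runs
```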