---
title: "Data 607 - Project 3 - Research"
author: "Glen Dale Davis, Coco Donovan, Alex Khaykin, Mohamed Hassan-El Serafi, Eddie Xu"
date: "2023-02-21"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Load the Required Packages:
Below, the packages required for data analysis and visualization are loaded.
```{r packages, warning = FALSE, message = FALSE}
library(tidyverse)
library(magrittr)
library(DBI)
library(dbplyr)
library(RMariaDB)
library(data.table)
library(stopwords)
library(tidytext)
library(RColorBrewer)
library(DT)
library(MCDA)
library(ggwordcloud)
library(hrbrthemes)
library(tidylo)
library(arsenal)
library(cowplot)
```
## Research Question:
W. Edwards Deming said, “In God we trust, all others must bring data.” Below, we have used data to answer the question, **“Which are the most valued data science skills?”**
## Motivation and Approach:
To determine the most valued data science skills, we have analyzed the frequency of single words, two-word phrases (bigrams), and three-word phrases (trigrams) in online job listings from a variety of data sources. For present-day analysis, the sites we collected recent job listings from were:
* [ai-jobs.net](https://ai-jobs.net/)
* [CareerCast IT & Engineering](https://it.careercast.com/)
* [Open Data Science Job Portal](https://jobs.opendatascience.com/)
* [Jobs for R-Users](https://www.r-users.com/)
* [MLconf Job Board](https://mlconf.com/jobs/)
* [Python Job Board](https://www.python.org/jobs/)
* [Indeed](https://www.indeed.com/)
* [LinkedIn](https://www.linkedin.com/jobs/)
We also wanted to do a historical analysis for the sake of comparison, so we analyzed an archival data set of job listings from the site below:
* [Glassdoor](https://www.glassdoor.com/member/home/index.htm)
We stored the job listings we reviewed in a MySQL database hosted on Google Cloud Platform. We connect to it below and display a small subset of the data we've collected.
## Connect to the SQL Database:
```{r db_con1}
con <- DBI::dbConnect(
RMariaDB::MariaDB(),
dbname = "dat_sci_jobs",
username = "root",
password = as.character(read.table("sql_db.txt", header = FALSE)),
host = "35.227.102.234")
```
## Read from the SQL Database and Disconnect:
```{r db_write1}
jobs_df <- dbReadTable(con, "_Jobs")
datatable(head(jobs_df[, -3], n = 250), options = list(pageLength = 25))
dbDisconnect(con)
```
## Create a Data Frame of Jobs Found via RSS Feed:
Many of the sources of our present-day job listings were accessible via RSS feed, and we set up a [Feedbin](https://feedbin.com/) RSS reader account to collect data science job listings from them:
* [ai-jobs.net](https://ai-jobs.net/)
* [CareerCast IT & Engineering](https://it.careercast.com/)
* [Open Data Science Job Portal](https://jobs.opendatascience.com/)
* [Jobs for R-Users](https://www.r-users.com/)
* [MLconf Job Board](https://mlconf.com/jobs/)
* [Python Job Board](https://www.python.org/jobs/)
* [Indeed](https://www.indeed.com/)
We then called the Feedbin API every few days to retrieve new entries and saved the results as CSV files.
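A minimal sketch of one such retrieval is shown below. It assumes Feedbin's v2 `entries.json` endpoint with HTTP basic authentication; the credentials file name and the `since` cutoff are illustrative placeholders, not the exact values we used.
```{r feedbin_sketch, eval = FALSE}
# Minimal sketch (not our exact script) of pulling recent entries from Feedbin,
# assuming the v2 entries endpoint and HTTP basic auth; the credentials file and
# the `since` cutoff below are illustrative placeholders.
library(httr)
library(jsonlite)
creds <- readLines("feedbin_credentials.txt") # hypothetical: line 1 = email, line 2 = password
resp <- GET("https://api.feedbin.com/v2/entries.json",
            authenticate(creds[1], creds[2]),
            query = list(per_page = 100, since = "2023-02-01T00:00:00.000000Z"))
stop_for_status(resp)
entries <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
# Keep the identifying fields and save a dated CSV like the "_feeds_" files read below
feeds_csv <- data.frame(id = entries$id, feed_id = entries$feed_id,
                        title = entries$title, url = entries$url)
write.csv(feeds_csv, paste0(Sys.Date(), "_feeds_example.csv"))
```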
```{r jobs_df}
completed_files <- readLines("completed_files.txt")
url_base <- "https://raw.githubusercontent.com/geedoubledee/data607_project3/main/"
new_jobs_df <- as.data.frame(matrix(nrow = 0, ncol = 11))
files <- list.files(pattern = "_feeds_.*csv$")
if (length(files) > 0){
for (i in 1:length(files)){
if (!files[i] %in% completed_files){
file <- paste(url_base, files[i], sep = "")
csv <- read.csv(file = file, header = TRUE)
new_jobs_df <- rbind(new_jobs_df, csv)
completed_files <- append(completed_files, files[i])
}
}
file.copy(from = paste0(getwd(), "/", files),
to = paste0(getwd(), "/feeds-csv/", files))
file.remove(from = paste0(getwd(), "/", files))
}
new_jobs_df2 <- as.data.frame(matrix(nrow = 0, ncol = 10))
files <- list.files(pattern = "_linkedin_.*csv$")
if (length(files) > 0){
for (i in 1:length(files)){
if (!files[i] %in% completed_files){
file <- paste(url_base, files[i], sep = "")
csv <- read.csv(file = file, header = TRUE)
new_jobs_df2 <- rbind(new_jobs_df2, csv)
completed_files <- append(completed_files, files[i])
}
}
file.copy(from = paste0(getwd(), "/", files),
to = paste0(getwd(), "/feeds-csv/", files))
file.remove(from = paste0(getwd(), "/", files))
}
writeLines(completed_files, "completed_files.txt")
new_jobs_df <- new_jobs_df[!duplicated(new_jobs_df), ]
new_jobs_df2 <- new_jobs_df2[!duplicated(new_jobs_df2), ]
```
## Cleaning Up the Jobs Data Frame:
We removed duplicates, removed unnecessary columns, and rearranged the remaining columns to combine the CSV files into one big Jobs data frame. We gave unique **Job_id** values to all listings based on their row numbers.
```{r jobs_df_minimize}
if (nrow(new_jobs_df) > 0){
new_jobs_df <- subset(new_jobs_df, select = -c(X, author, summary,
content, extracted_content_url, published, created_at))
cols <- c("Job_id", "Site_id", "Job_title", "Job_url")
colnames(new_jobs_df) <- cols
Job_company <- as.data.frame(matrix(NA, nrow = nrow(new_jobs_df),
ncol = 1))
colnames(Job_company) <- "Job_company"
Job_location <- as.data.frame(matrix(NA, nrow = nrow(new_jobs_df),
ncol = 1))
colnames(Job_location) <- "Job_location"
new_jobs_df <- cbind(new_jobs_df, Job_company, Job_location)
rownames(new_jobs_df) <- NULL
new_jobs_df <- new_jobs_df[c("Job_id", "Job_title", "Job_url",
"Job_company", "Job_location", "Site_id")]
new_jobs_df %<>%
mutate(Job_complete = 0)
}
if (nrow(new_jobs_df2) > 0){
new_jobs_df2 <- subset(new_jobs_df2, select = -c(X, job_url, company_url,
linkedin_company_url_cleaned, posted_date, normalized_company_name))
cols <- c("Job_url", "Job_company", "Job_title", "Job_location")
colnames(new_jobs_df2) <- cols
Job_id <- as.data.frame(matrix(NA, nrow = nrow(new_jobs_df2),
ncol = 1))
colnames(Job_id) <- "Job_id"
Site_id <- as.data.frame(matrix(1001, nrow = nrow(new_jobs_df2),
ncol = 1))
colnames(Site_id) <- "Site_id"
new_jobs_df2 <- cbind(new_jobs_df2, Job_id, Site_id)
rownames(new_jobs_df2) <- NULL
new_jobs_df2 <- new_jobs_df2[c("Job_id", "Job_title", "Job_url",
"Job_company", "Job_location", "Site_id")]
new_jobs_df2 %<>%
mutate(Job_complete = 0)
}
if (nrow(new_jobs_df) > 0){
jobs_df <- rbindlist(list(jobs_df, new_jobs_df))[!duplicated(Job_url)]
}
if (nrow(new_jobs_df2) > 0){
jobs_df <- rbindlist(list(jobs_df, new_jobs_df2))[!duplicated(Job_url)]
}
jobs_df %<>%
mutate(Job_id = row_number())
```
## Scrape Each Unique Job Listing URL:
We visited the URL for each unique job listing and retrieved the HTML at that address. Depending on the site, we extracted the specific HTML container holding the job description, converted the HTML to text, and wrote the resulting lines to individual TXT files. If visiting the site produced an error, or we were unable to retrieve a job description for any other reason, we recorded that as -1 in our **Job_complete** field. A value of 0 meant the URL had never been visited, and a value of 1 meant the job description was successfully retrieved. (Although Indeed was accessible via RSS feed, its feed stopped refreshing after a while, and we were not able to scrape the few job listings we retrieved that way, so those listings all received a **Job_complete** value of -1.)
```{r scrape, warning = FALSE, message = FALSE}
for (i in 1:nrow(jobs_df)){
httr::user_agent("Glen Davis")
if (jobs_df[i, 7] == 0){
dat <- try(xml2::read_html(jobs_df$Job_url[[i]]), silent = TRUE)
if (inherits(dat, "try-error", which = FALSE)){
jobs_df[i, 7] <- -1
next
}
}else{
next
}
if (jobs_df[i, 6] == 2594160){ #ai-jobs.net is source
desc <- xml2::xml_find_all(
dat, "//div[contains(@id, 'job-description')]")
}
else if (jobs_df[i, 6] == 977141){ #python.org is source
desc <- xml2::xml_find_all(
dat, "//div[contains(@class, 'job-description')]")
}
else if (jobs_df[i, 6] == 2594162){ #careercast it & eng is source
desc <- xml2::xml_find_all(
dat, "//div[contains(@class, 'arDetailDescriptionRow')]")
}
else if (jobs_df[i, 6] == 1378327){ #jobs for r-users is source
desc <- xml2::xml_find_all(
dat, "//div[contains(@class, 'section_content')]")
}
else if (jobs_df[i, 6] == 2593879){ #Indeed is source
jobs_df[i, 7] <- -1
next
}
else if (jobs_df[i, 6] == 2594166){ #Open Data Science is source
desc <- xml2::xml_find_all(
dat, "//div[contains(@class, 'job-desc')]")
}
else if (jobs_df[i, 6] == 2594174){ #MLconf is source
desc <- xml2::xml_find_all(
dat, "//div[contains(@class, 'job_description')]")
}
else if (jobs_df[i, 6] == 1001){ #Linkedin is source
jobs_df[i, 7] <- -1
next
}
desc <- xml2::xml_text(desc)
fn <- paste(jobs_df[i, 1], ".txt", sep = "")
writeLines(desc, con = fn)
jobs_df[i, 7] <- 1
}
```
## Manual Data Collection for LinkedIn:
LinkedIn job listings were not accessible via RSS, so we used a [RapidAPI alternative](https://rapidapi.com/jaypat87/api/linkedin-jobs-search) to retrieve them. We were unable to scrape these listings automatically, so we assigned them a **Job_complete** value of -1, as we did for Indeed. We were, however, able to download the LinkedIn job descriptions manually.
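For context, a RapidAPI request is just an HTTP call with the `X-RapidAPI-Key` and `X-RapidAPI-Host` headers attached. A hypothetical sketch of such a call follows; the endpoint path and request-body field names are illustrative guesses rather than details taken from this report.
```{r rapidapi_sketch, eval = FALSE}
# Hypothetical sketch of querying a RapidAPI LinkedIn jobs search endpoint; the
# RapidAPI key/host headers are the standard convention, but the endpoint path
# and body fields (search_terms, location, page) are illustrative placeholders.
library(httr)
library(jsonlite)
rapidapi_key <- readLines("rapidapi_key.txt") # hypothetical credentials file
resp <- POST("https://linkedin-jobs-search.p.rapidapi.com/",
             add_headers("X-RapidAPI-Key" = rapidapi_key,
                         "X-RapidAPI-Host" = "linkedin-jobs-search.p.rapidapi.com"),
             content_type_json(),
             body = toJSON(list(search_terms = "data scientist",
                                location = "United States",
                                page = "1"), auto_unbox = TRUE))
stop_for_status(resp)
linkedin_jobs <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
write.csv(linkedin_jobs, paste0(Sys.Date(), "_linkedin_example.csv"))
```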
```{r manual1}
manual <- jobs_df %>%
filter(Job_complete == -1 & Site_id == 1001)
write.csv(manual, "manual.csv", row.names = FALSE)
```
We saved all the LinkedIn job listings we retrieved in a file called **manual.csv**. We then pasted the job descriptions we were able to find into column eight of that file and saved the result as **manual_edited.csv**.
If we found a job description, we changed the **Job_complete** value for that listing to 1; if we didn't, we deleted the row. If a previous file existed, we saved over it. We then uploaded the saved **manual_edited.csv** file to GitHub before continuing.
```{r manual2}
file <- "https://raw.githubusercontent.com/geedoubledee/data607_project3/main/manual_edited.csv"
manual_edited <- read.csv(file = file, header = TRUE)
for (i in 1:nrow(manual_edited)){
job_id <- manual_edited[i, 1]
if (jobs_df[job_id, 7] != 1){
job_desc <- manual_edited[i, 8]
jobs_df[job_id, 7] <- manual_edited[i, 7]
fn <- paste(job_id, ".txt", sep = "")
writeLines(job_desc, con = fn)
}
}
```
## Add Previously Scraped Data for Recent Indeed Job Listings:
Since we had trouble gathering data for Indeed job listings via any other method, we used a [recent Kaggle dataset of Indeed job listings](https://www.kaggle.com/datasets/yusufolonade/data-science-job-postings-indeed-usa) to supplement our data. This data set contains roughly as many observations as we collected through all other means combined, so we felt we ended up with a good mix of sources. The data was also gathered in November 2022, so we considered it recent enough to include in our present-day analysis.
```{r kaggle_indeed}
completed_files <- readLines("completed_files.txt")
if (!("data_science_jobs_indeed_usa.csv" %in% completed_files)){
file <- "https://raw.githubusercontent.com/geedoubledee/data607_project3/main/data_science_jobs_indeed_usa.csv"
kaggle_indeed <- read.csv(file = file, header = TRUE)
kaggle_indeed <- subset(kaggle_indeed, select = -c(1, 5, 6, 7, 8))
cols <- c("Job_title", "Job_company", "Job_location", "Job_url",
"Job_description")
colnames(kaggle_indeed) <- cols
ids <- seq((nrow(jobs_df) + 1), (nrow(jobs_df) + nrow(kaggle_indeed)))
Job_id <- as.data.frame(matrix(ids, nrow = nrow(kaggle_indeed),
ncol = 1))
Site_id <- as.data.frame(matrix(2593879, nrow = nrow(kaggle_indeed),
ncol = 1))
Job_complete <- as.data.frame(matrix(0, nrow = nrow(kaggle_indeed),
ncol = 1))
colnames(Site_id) <- "Site_id"
colnames(Job_id) <- "Job_id"
colnames(Job_complete) <- "Job_complete"
kaggle_indeed <- cbind(kaggle_indeed, Job_id, Site_id, Job_complete)
rownames(kaggle_indeed) <- NULL
kaggle_indeed <- kaggle_indeed[c("Job_id", "Job_title", "Job_url",
"Job_company", "Job_location",
"Site_id", "Job_complete",
"Job_description")]
jobs_df <- rbind(jobs_df, subset(kaggle_indeed, select = 1:7))
for (i in 1:nrow(kaggle_indeed)){
job_id <- kaggle_indeed[i, 1]
job_desc <- kaggle_indeed[i, 8]
jobs_df[job_id, 7] <- 1
fn <- paste(job_id, ".txt", sep = "")
writeLines(job_desc, con = fn)
}
write("data_science_jobs_indeed_usa.csv", file = "completed_files.txt",
append = TRUE)
file.copy(from = paste0(getwd(), "/data_science_jobs_indeed_usa.csv"),
to = paste0(getwd(), "/feeds-csv/data_science_jobs_indeed_usa.csv"))
file.remove(from = paste0(getwd(), "/data_science_jobs_indeed_usa.csv"))
}
```
## Connect to the SQL Database:
```{r db_con2}
con <- DBI::dbConnect(
RMariaDB::MariaDB(),
dbname = "dat_sci_jobs",
username = "root",
password = as.character(read.table("sql_db.txt", header = FALSE)),
host = "35.227.102.234")
```
## Write to the SQL Database Only If There Were Changes and Disconnect:
```{r db_write2}
copy <- dbReadTable(con, "_Jobs")
if (!identical(jobs_df, copy)){
dbWriteTable(con, "_Jobs", jobs_df, overwrite = TRUE)
}
dbDisconnect(con)
```
## Set Up a Text Data Frame from the TXT Job Description Files:
Once we had all the job description TXT files we needed, we created a single Text data frame containing every **Line** of each job description, keyed by **Job_id**.
```{r tidytext_setup}
files <- list.files(pattern = "^[1-9]+.*txt$")
if (length(files) > 0){
file.copy(from = paste0(getwd(), "/", files),
to = paste0(getwd(), "/jobs-txt/", files))
file.remove(from = paste0(getwd(), "/", files))
}
files <- list.files(path = paste0(getwd(), "/jobs-txt/"),
pattern = "^[1-9]+.*txt$")
cols <- c("Text", "Job_id", "Line")
completed_txt_files <- readLines("completed_txt_files.txt")
if (length(completed_txt_files) == 0){
text_df <- as.data.frame(matrix(nrow = 0, ncol = 3))
colnames(text_df) <- cols
}else{
my_url <- "https://raw.githubusercontent.com/geedoubledee/data607_project3/main/text_df.csv"
text_df <- read.csv(file = my_url, header = TRUE,
fileEncoding = "UTF-8")
}
url_base <- "https://raw.githubusercontent.com/geedoubledee/data607_project3/main/jobs-txt/"
for (i in 1:length(files)){
if (!files[i] %in% completed_txt_files){
file <- paste(url_base, files[i], sep = "")
job_id <- str_replace(files[i], ".txt", "")
lines <- readLines(file)
col2 <- matrix(job_id, nrow = length(lines), ncol = 1)
col3 <- matrix(1:length(lines), nrow = length(lines), ncol = 1)
addition <- cbind(lines, col2, col3)
colnames(addition) <- cols
text_df <- rbind(text_df, addition)
write(files[i], file = "completed_txt_files.txt", append = TRUE)
}
}
rownames(text_df) <- NULL
write.csv(text_df, "text_df.csv", row.names = FALSE)
```
## Analyze the Text Data Frame with Tidytext:
We calculated how many times each word, bigram, or trigram occurred across all job listings, as well as how many job listings it appeared in. We called the first metric **term_freq** and the second **doc_count**. We rescaled both metrics to a 0-to-1 range and recorded a third metric, **tf_dc_norm_sum**, as the sum of those rescaled values. This third metric let us judge a term's importance using both its overall frequency and the number of job listings it appeared in, rather than either one alone. It allowed us to sort and rank terms sensibly and reduced the clutter we had to sift through when selecting the skills that are most valuable in data science. (We abandoned a previous ad hoc metric, **tf_dc_score**, in favor of this approach.)
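Concretely, assuming the `rescaling` option of `normalizePerformanceTable()` is standard min-max scaling (consistent with the 0-to-1 range described above), the metrics for a term $t$ are:

$$\text{tf\_norm}_t = \frac{\text{term\_freq}_t - \min(\text{term\_freq})}{\max(\text{term\_freq}) - \min(\text{term\_freq})}, \qquad
\text{dc\_norm}_t = \frac{\text{doc\_count}_t - \min(\text{doc\_count})}{\max(\text{doc\_count}) - \min(\text{doc\_count})},$$

$$\text{tf\_dc\_norm\_sum}_t = \text{tf\_norm}_t + \text{dc\_norm}_t.$$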
```{r tidytext_analysis}
text_df_clean <- text_df
text_df_clean$Text <- tolower(text_df_clean$Text)
text_df_clean$Text <- sapply(text_df_clean$Text, function(x) str_replace_all(
x,"[^[:alnum:]]", " "))
text_df_clean$Text <- sapply(text_df_clean$Text, function(x) str_replace_all(
x, "[[:punct:]]", " "))
text_df_clean$Text <- sapply(text_df_clean$Text, function(x) str_replace_all(
x, "[^A-Za-z]", " "))
text_df_clean$Text <- trimws(text_df_clean$Text)
text_df_clean %<>%
filter(Text != "" & !is.na(Text))
tidy_text_df_words <- text_df_clean %>%
unnest_tokens(word, Text)
tidy_text_words_analysis <- tidy_text_df_words %>%
anti_join(get_stopwords()) %>%
group_by(word) %>%
summarize(term_freq = n(),
doc_count = n_distinct(Job_id),
tf_dc_score = round((term_freq * doc_count / 1000000), 3))
performanceTable <- tidy_text_words_analysis[, 2:3]
normalizationTypes = c("rescaling", "rescaling")
norm <- as.data.frame(normalizePerformanceTable(performanceTable,
normalizationTypes))
cols <- c("tf_norm", "dc_norm")
colnames(norm) <- cols
norm$tf_norm = round(norm$tf_norm, 3)
norm$dc_norm = round(norm$dc_norm, 3)
norm %<>%
mutate(tf_dc_norm_sum = tf_norm + dc_norm)
tidy_text_words_analysis <- cbind(tidy_text_words_analysis, norm)
tidy_text_words_analysis <- tidy_text_words_analysis[c("word", "term_freq",
"tf_norm",
"doc_count",
"dc_norm",
"tf_dc_score",
"tf_dc_norm_sum")]
tidy_text_words_analysis %<>%
arrange(desc(tf_dc_norm_sum))
datatable(head(tidy_text_words_analysis[, -6], n = 250), options = list(pageLength = 25))
tidy_text_df_bigrams <- text_df_clean %>%
unnest_tokens(bigram, Text, token = "ngrams", n = 2)
tidy_text_bigrams_analysis <- tidy_text_df_bigrams %>%
separate(bigram, into = c("first","second"),
sep = " ", remove = FALSE) %>%
anti_join(stop_words, by = c("first" = "word")) %>%
anti_join(stop_words, by = c("second" = "word")) %>%
group_by(bigram) %>%
summarize(term_freq = n(),
doc_count = n_distinct(Job_id),
tf_dc_score = round((term_freq * doc_count / 1000000), 3)) %>%
filter(!is.na(bigram))
performanceTable <- tidy_text_bigrams_analysis[, 2:3]
normalizationTypes = c("rescaling", "rescaling")
norm <- as.data.frame(normalizePerformanceTable(performanceTable,
normalizationTypes))
cols <- c("tf_norm", "dc_norm")
colnames(norm) <- cols
norm$tf_norm = round(norm$tf_norm, 3)
norm$dc_norm = round(norm$dc_norm, 3)
norm %<>%
mutate(tf_dc_norm_sum = tf_norm + dc_norm)
tidy_text_bigrams_analysis <- cbind(tidy_text_bigrams_analysis, norm)
tidy_text_bigrams_analysis <- tidy_text_bigrams_analysis[c("bigram",
"term_freq",
"tf_norm",
"doc_count",
"dc_norm",
"tf_dc_score",
"tf_dc_norm_sum")]
tidy_text_bigrams_analysis %<>%
arrange(desc(tf_dc_norm_sum))
datatable(head(tidy_text_bigrams_analysis[, -6], n = 250), options = list(pageLength = 25))
tidy_text_df_trigrams <- text_df_clean %>%
unnest_tokens(trigram, Text, token = "ngrams", n = 3)
tidy_text_trigrams_analysis <- tidy_text_df_trigrams %>%
separate(trigram, into = c("first","second","third"),
sep = " ", remove = FALSE) %>%
anti_join(stop_words, by = c("first" = "word")) %>%
anti_join(stop_words, by = c("third" = "word")) %>%
group_by(trigram) %>%
summarize(term_freq = n(),
doc_count = n_distinct(Job_id),
tf_dc_score = round((term_freq * doc_count / 1000000), 3)) %>%
filter(!is.na(trigram))
performanceTable <- tidy_text_trigrams_analysis[, 2:3]
normalizationTypes = c("rescaling", "rescaling")
norm <- as.data.frame(normalizePerformanceTable(performanceTable,
normalizationTypes))
cols <- c("tf_norm", "dc_norm")
colnames(norm) <- cols
norm$tf_norm = round(norm$tf_norm, 3)
norm$dc_norm = round(norm$dc_norm, 3)
norm %<>%
mutate(tf_dc_norm_sum = tf_norm + dc_norm)
tidy_text_trigrams_analysis <- cbind(tidy_text_trigrams_analysis, norm)
tidy_text_trigrams_analysis <- tidy_text_trigrams_analysis[c("trigram",
"term_freq",
"tf_norm",
"doc_count",
"dc_norm",
"tf_dc_score",
"tf_dc_norm_sum")]
tidy_text_trigrams_analysis %<>%
arrange(desc(tf_dc_norm_sum))
datatable(head(tidy_text_trigrams_analysis[, -6], n = 250), options = list(pageLength = 25))
```
## Selecting Important Words, Bigrams, and Trigrams:
After sorting our data in descending order by **tf_dc_norm_sum**, we could see which terms rank as most important in data science job listings. Since we are specifically interested in terms related to skills a job applicant would need, we selected only those terms that appear tied to a specific skill.
```{r important_words_bigrams_trigrams}
single_words <- c('sql','python','product','models','communication','cloud','research','customer','database','stakeholders','modeling','ml','r','programming','clients','ai', 'statistics', 'reporting','aws','leadership', 'operations','collaborate', 'algorithms','marketing', 'bi','visualization','tableau','mathematics','dashboards','optimization','datasets','java','oracle')
two_words <- c('machine learning', 'communication skills','business intelligence','data analysis','data engineering', 'data driven','software development','deep learning', 'data visualizations','data warehouse','data pipelines','sql server', 'data management','power bi', 'learning models','data quality', 'artificial intelligence', 'software engineering', 'data modeling', 'project management','data processing','data mining','programming languages', 'written communication','data models','ci cd','financial services','information technology', 'natural language', 'cloud based', 'product development', 'language processing', 'visualization tools', 'verbal communication', 'information systems', 'statistical analysis', 'data collection','scikit learn', 'data warehousing', 'predictive models', 'product management', 'relational databases', 'interpersonal skills', 'ml models', 'team player', 'data architecture')
three_words <- c('machine learning models','attention to detail','written communication skills', 'natural language processing', 'machine learning algorithms', 'machine learning techniques','fast paced environment', 'verbal communication skills', 'design and implement', 'data visualization tools', 'excellent communication skills','business intelligence tools','computer science mathematics', 'data driven decisions','subject matter expert', 'ability to write', 'computer science statistics', 'data driven insights', 'design and develop', 'design and implementation', 'development and implementation', 'experience with agile','experience with aws', 'experience with python', 'knowledge of sql', 'microsoft sql server')
```
We then arranged the terms into six topics based on what kind of skill those terms referred to: **programming**, **presentation**, **theoretical**, **soft_skills**, **data_management**, and **business_interest**.
```{r topics}
programming <- c('python', 'r', 'java', 'programming')
programming <- append(programming ,c('scikit learn', 'software development',
'software engineering', 'data modeling',
'data mining', 'programming languages',
'data model', 'data models', 'ci cd'))
programming <- append(programming, c('experience with python'))
presentation <- c('clients','customer','communication', 'reporting','dashboards',
'tableau', 'visualization', 'bi', 'stakeholders')
presentation <- append(presentation, c('communication skills',
'data visualizations','power bi',
'visualization tools',
'verbal communication'))
presentation <- append(presentation, c('verbal communication skills',
'data visualization tools',
'excellent communication skills'))
theoretical <- c('statistics', 'mathematics', 'algorithms', 'ai', 'optimization',
'research', 'modeling', 'models', 'ml')
theoretical <- append(theoretical, c('machine learning', 'deep learning',
'learning models', 'artificial intelligence',
'natural language', 'language processing',
'information systems', 'statistical analysis',
'predictive models', 'ml models'))
theoretical <- append(theoretical, c('machine learning models',
'natural language processing',
'machine learning algorithms',
'machine learning techniques',
'computer science mathematics',
'computer science statistics'))
soft_skills <- c('collaborate', 'leadership', 'product')
soft_skills <- append(soft_skills, c('written communication',
'interpersonal skills', 'team player'))
soft_skills <- append(soft_skills, c('attention to detail',
'written communication skills',
'fast paced environment', 'ability to write'))
data_management <- c('cloud', 'aws', 'datasets', 'database', 'oracle', 'sql')
data_management <- append(data_management, c('experience with aws',
'microsoft sql server',
'knowledge of sql'))
data_management <- append(data_management, c('data engineering', 'data warehouse',
'data pipelines', 'sql server',
'data management', 'data quality',
'data processing','cloud based',
'data collection', 'data warehousing',
'relational databases',
'data architecture'))
business_interest <- c('marketing', 'product', 'operations')
business_interest <- append(business_interest, c('business intelligence',
'data analysis', 'data driven',
'project management',
'financial services',
'information technology',
'product development',
'product management'))
business_interest <- append(business_interest, c('design and implement',
'business intelligence tools',
'data driven decisions',
'subject matter expert',
'data driven insights',
'design and develop',
'design and implementation',
'development and implementation',
'experience with agile'))
topics <- list(programming, presentation, theoretical, soft_skills,
data_management, business_interest)
names(topics) <- c("programming", "presentation", "theoretical", "soft_skills",
"data_management", "business_interest")
```
## Visualizing Important Words, Bigrams, and Trigrams:
We created wordclouds to display the important terms we selected as being related to particular skills. Words with higher **tf_dc_norm_sum** values are displayed as larger in size relative to words with lower values. We used the topic each term fell under as a key to color the terms. Please see the legend accompanying each wordcloud for a mapping of topic to color.
```{r wordclouds, fig.align="center"}
important_single_words <- tidy_text_words_analysis %>%
filter(word %in% single_words)
important_single_words$topic <- replicate(nrow(important_single_words), '')
for (i in 1:length(topics)){
for (j in 1:length(topics[[i]])){
top <- names(topics)[i]
wd <- topics[[i]][j]
row <- match(wd, important_single_words$word)
if (!is.na(row)){
important_single_words[row, 8] <- top
}
}
}
important_bigrams <- tidy_text_bigrams_analysis %>%
filter(bigram %in% two_words)
important_bigrams$topic <- replicate(nrow(important_bigrams), '')
for (i in 1:length(topics)){
for (j in 1:length(topics[[i]])){
top <- names(topics)[i]
wd <- topics[[i]][j]
row <- match(wd, important_bigrams$bigram)
if (!is.na(row)){
important_bigrams[row, 8] <- top
}
}
}
important_trigrams <- tidy_text_trigrams_analysis %>%
filter(trigram %in% three_words)
important_trigrams$topic <- replicate(nrow(important_trigrams), '')
for (i in 1:length(topics)){
for (j in 1:length(topics[[i]])){
top <- names(topics)[i]
wd <- topics[[i]][j]
row <- match(wd, important_trigrams$trigram)
if (!is.na(row)){
important_trigrams[row, 8] <- top
}
}
}
important_single_words %<>%
mutate(angle = 90 * sample(c(0, 1), n(), replace = TRUE, prob = c(60, 40)))
important_bigrams %<>%
mutate(angle = 45 * sample(c(0, 1), n(), replace = TRUE, prob = c(70, 30)))
important_trigrams %<>%
mutate(angle = 0)
ggplot(important_single_words, aes(label = word, size = tf_dc_norm_sum,
color = topic, angle = angle)) +
geom_text_wordcloud_area(show.legend = TRUE) +
scale_size_area(max_size = 16) +
scale_fill_binned(type = "viridis") +
guides(size = "none")
ggplot(important_bigrams, aes(label = bigram, size = tf_dc_norm_sum,
color = topic, angle = angle)) +
geom_text_wordcloud_area(show.legend = TRUE) +
scale_size_area(max_size = 14) +
scale_fill_binned(type = "viridis") +
guides(size = "none")
ggplot(important_trigrams, aes(label = trigram, size = tf_dc_norm_sum,
color = topic, angle = angle)) +
geom_text_wordcloud_area(show.legend = TRUE) +
scale_size_area(max_size = 10) +
scale_fill_binned(type = "viridis") +
guides(size = "none")
```
## Historical Analysis:
We wanted to compare how important particular data science skill terms are in present-day job listings with how important they were in historical job listings. So we obtained a [dataset of data science job descriptions scraped from Glassdoor in 2019](https://www.kaggle.com/datasets/andrewmvd/data-scientist-jobs) from Kaggle and applied the same cleaning and processing procedures to its job descriptions as we did to the present-day data.
```{r historical_analysis1}
historical_df <- read.csv("https://raw.githubusercontent.com/geedoubledee/data607_project3/main/DataScientist.csv")
historical_text_df <- subset(historical_df, select = c(Job.Description,
index))
colnames(historical_text_df) <- c("Text", "Job_id")
historical_text_df_clean <- historical_text_df
historical_text_df_clean$Text <- tolower(historical_text_df_clean$Text)
historical_text_df_clean$Text <- sapply(historical_text_df_clean$Text,
function(x) str_replace_all(
x,"[^[:alnum:]]", " "))
historical_text_df_clean$Text <- sapply(historical_text_df_clean$Text,
function(x) str_replace_all(
x, "[[:punct:]]", " "))
historical_text_df_clean$Text <- sapply(historical_text_df_clean$Text,
function(x) str_replace_all(
x, "[^A-Za-z]", " "))
historical_text_df_clean$Text <- trimws(historical_text_df_clean$Text)
historical_text_df_clean %<>%
filter(Text != "" & !is.na(Text))
historical_text_words <- historical_text_df_clean %>%
unnest_tokens(word, Text)
historical_text_words_analysis <- historical_text_words %>%
anti_join(get_stopwords()) %>%
group_by(word) %>%
summarize(term_freq = n(),
doc_count = n_distinct(Job_id),
tf_dc_score = round((term_freq * doc_count / 1000000), 3))
performanceTable <- historical_text_words_analysis[, 2:3]
normalizationTypes = c("rescaling", "rescaling")
norm <- as.data.frame(normalizePerformanceTable(performanceTable,
normalizationTypes))
cols <- c("tf_norm", "dc_norm")
colnames(norm) <- cols
norm$tf_norm = round(norm$tf_norm, 3)
norm$dc_norm = round(norm$dc_norm, 3)
norm %<>%
mutate(tf_dc_norm_sum = tf_norm + dc_norm)
historical_text_words_analysis <- cbind(historical_text_words_analysis, norm)
historical_text_words_analysis <- historical_text_words_analysis[
c("word", "term_freq", "tf_norm", "doc_count", "dc_norm",
"tf_dc_score", "tf_dc_norm_sum")]
historical_text_words_analysis %<>%
arrange(desc(tf_dc_norm_sum))
datatable(head(historical_text_words_analysis[, -6], n = 250),
          options = list(pageLength = 25))
historical_text_bigrams <- historical_text_df_clean %>%
unnest_tokens(bigram, Text, token = "ngrams", n = 2)
historical_text_bigrams_analysis <- historical_text_bigrams %>%
separate(bigram, into = c("first","second"),
sep = " ", remove = FALSE) %>%
anti_join(stop_words, by = c("first" = "word")) %>%
anti_join(stop_words, by = c("second" = "word")) %>%
group_by(bigram) %>%
summarize(term_freq = n(),
doc_count = n_distinct(Job_id),
tf_dc_score = round((term_freq * doc_count / 1000000), 3)) %>%
filter(!is.na(bigram))
performanceTable <- historical_text_bigrams_analysis[, 2:3]
normalizationTypes = c("rescaling", "rescaling")
norm <- as.data.frame(normalizePerformanceTable(performanceTable,
normalizationTypes))
cols <- c("tf_norm", "dc_norm")
colnames(norm) <- cols
norm$tf_norm = round(norm$tf_norm, 3)
norm$dc_norm = round(norm$dc_norm, 3)
norm %<>%
mutate(tf_dc_norm_sum = tf_norm + dc_norm)
historical_text_bigrams_analysis <- cbind(historical_text_bigrams_analysis,
norm)
historical_text_bigrams_analysis <- historical_text_bigrams_analysis[
c("bigram", "term_freq", "tf_norm", "doc_count", "dc_norm",
"tf_dc_score", "tf_dc_norm_sum")]
historical_text_bigrams_analysis %<>%
arrange(desc(tf_dc_norm_sum))
datatable(head(historical_text_bigrams_analysis[, -6], n = 250),
          options = list(pageLength = 25))
historical_text_trigrams <- historical_text_df_clean %>%
unnest_tokens(trigram, Text, token = "ngrams", n = 3)
historical_text_trigrams_analysis <- historical_text_trigrams %>%
separate(trigram, into = c("first","second","third"),
sep = " ", remove = FALSE) %>%
anti_join(stop_words, by = c("first" = "word")) %>%
anti_join(stop_words, by = c("third" = "word")) %>%
group_by(trigram) %>%
summarize(term_freq = n(),
doc_count = n_distinct(Job_id),
tf_dc_score = round((term_freq * doc_count / 1000000), 3)) %>%
filter(!is.na(trigram))
performanceTable <- historical_text_trigrams_analysis[, 2:3]
normalizationTypes = c("rescaling", "rescaling")
norm <- as.data.frame(normalizePerformanceTable(performanceTable,
normalizationTypes))
cols <- c("tf_norm", "dc_norm")
colnames(norm) <- cols
norm$tf_norm = round(norm$tf_norm, 3)
norm$dc_norm = round(norm$dc_norm, 3)
norm %<>%
mutate(tf_dc_norm_sum = tf_norm + dc_norm)
historical_text_trigrams_analysis <- cbind(historical_text_trigrams_analysis,
norm)
historical_text_trigrams_analysis <- historical_text_trigrams_analysis[
c("trigram", "term_freq", "tf_norm", "doc_count", "dc_norm",
"tf_dc_score", "tf_dc_norm_sum")]
historical_text_trigrams_analysis %<>%
arrange(desc(tf_dc_norm_sum))
datatable(head(historical_text_trigrams_analysis[, -6], n = 250),
          options = list(pageLength = 25))
```
That processing included organizing the terms into the same six topics we used in the present-day analysis.
```{r historical_analysis2}
historical_important_single_words <- historical_text_words_analysis %>%
filter(word %in% single_words)
historical_important_single_words$topic <- replicate(
nrow(historical_important_single_words), '')
for (i in 1:length(topics)){
for (j in 1:length(topics[[i]])){
top <- names(topics)[i]
wd <- topics[[i]][j]
row <- match(wd, historical_important_single_words$word)
if (!is.na(row)){
historical_important_single_words[row, 8] <- top
}
}
}
historical_important_bigrams <- historical_text_bigrams_analysis %>%
filter(bigram %in% two_words)
historical_important_bigrams$topic <- replicate(
nrow(historical_important_bigrams), '')
for (i in 1:length(topics)){
for (j in 1:length(topics[[i]])){
top <- names(topics)[i]
wd <- topics[[i]][j]
row <- match(wd, historical_important_bigrams$bigram)
if (!is.na(row)){
historical_important_bigrams[row, 8] <- top
}
}
}
historical_important_trigrams <- historical_text_trigrams_analysis %>%
filter(trigram %in% three_words)
historical_important_trigrams$topic <- replicate(
nrow(historical_important_trigrams), '')
for (i in 1:length(topics)){
for (j in 1:length(topics[[i]])){
top <- names(topics)[i]
wd <- topics[[i]][j]
row <- match(wd, historical_important_trigrams$trigram)
if (!is.na(row)){
historical_important_trigrams[row, 8] <- top
}
}
}
```
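The topic-assignment loop above now appears three times between the present-day and historical analyses. A small helper along the following lines could replace all three; the `assign_topics()` function is a hypothetical sketch, not something defined elsewhere in this report, and because each term appears at most once per table, using `%in%` here gives the same result as the `match()`-based loops.
```{r assign_topics_sketch, eval = FALSE}
# Hypothetical helper consolidating the repeated topic-assignment loops;
# `term_col` names the column holding the word, bigram, or trigram.
assign_topics <- function(df, term_col, topics){
    df$topic <- ""
    for (i in seq_along(topics)){
        hits <- df[[term_col]] %in% topics[[i]]
        df$topic[hits] <- names(topics)[i]
    }
    df
}
# e.g.: historical_important_bigrams <- assign_topics(historical_important_bigrams,
#                                                     "bigram", topics)
```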
Then we were ready to look at how these data science skill terms had changed in importance since 2019. First, we calculated the difference between their current **tf_dc_norm_sum** values and their historical values. Then we plotted the 15 terms that had the largest increases in importance, as well as the 15 terms that had the largest decreases in importance, on two separate bar charts.
```{r historical_analysis3}
# Align current and historical values by term (rather than by row position)
# before taking the difference; terms absent from the historical data get NA.
idx <- match(important_single_words$word, historical_important_single_words$word)
important_single_words$diff_in_norm_sum <- important_single_words$tf_dc_norm_sum -
    historical_important_single_words$tf_dc_norm_sum[idx]
names(important_single_words)[names(important_single_words) == "word"] <- "term"
idx <- match(important_bigrams$bigram, historical_important_bigrams$bigram)
important_bigrams$diff_in_norm_sum <- important_bigrams$tf_dc_norm_sum -
    historical_important_bigrams$tf_dc_norm_sum[idx]
names(important_bigrams)[names(important_bigrams) == "bigram"] <- "term"
idx <- match(important_trigrams$trigram, historical_important_trigrams$trigram)
important_trigrams$diff_in_norm_sum <- important_trigrams$tf_dc_norm_sum -
    historical_important_trigrams$tf_dc_norm_sum[idx]
names(important_trigrams)[names(important_trigrams) == "trigram"] <- "term"
important_terms <- rbind(important_single_words, important_bigrams,
important_trigrams)
important_terms %<>%
arrange(desc(diff_in_norm_sum))
important_terms_bottom15 <- important_terms %>%
top_n(-15)
important_terms_top15 <- important_terms %>%
top_n(15)
important_terms_big <- rbind(important_terms_bottom15,
important_terms_top15)
ggplot(important_terms_bottom15, aes(x = reorder(term, diff_in_norm_sum),
y = diff_in_norm_sum, fill = topic)) +
geom_bar(stat="identity", show.legend = TRUE) +
scale_fill_viridis_d() +
labs(title="Top 15 Decreases in Term Importance",
y="Difference", x="Terms") +
coord_flip()
ggplot(important_terms_top15, aes(x = reorder(term, -diff_in_norm_sum),
y = diff_in_norm_sum, fill = topic)) +
geom_bar(stat="identity", show.legend = TRUE) +
scale_fill_viridis_d() +
labs(title="Top 15 Increases in Term Importance",
y="Difference", x="Terms") +
coord_flip()
```
We have also plotted these increases and decreases in importance on a single divergent plot so that they can easily be visually analyzed all at once.
```{r historical_analysis4}
color <- ifelse(important_terms_big$diff_in_norm_sum < 0, "pink", "lightgreen")
ggplot(important_terms_big, aes(x = reorder(term, diff_in_norm_sum),
y = diff_in_norm_sum)) +
geom_bar(stat = "identity", show.legend = FALSE,
fill = color, color = "white") +
geom_hline(yintercept = 0, color = 1, lwd = 0.2) +
geom_text(aes(label = important_terms_big$term,
hjust = ifelse(diff_in_norm_sum <0, 1.15, -.15),
vjust = 0.25),
size = 2.25) +
labs(x = "Word", y = "Difference", title = "Biggest Changes for Terms") +
scale_y_continuous(breaks = seq(-.5, .5, by = .25), limits =
c(min(important_terms_big$diff_in_norm_sum) - 0.1,
max(important_terms_big$diff_in_norm_sum) + 0.1)) +
coord_flip() +
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
panel.grid.major.y = element_blank())
```
## Conclusions:
It's interesting that the phrase "data analysis" has experienced such a decrease in importance. Perhaps the phrase has fallen out of use, or hiring managers now list more specific skills in its place. The increases for Python and SQL suggest that proficiency in both languages is more highly valued than ever. The increases for data engineering and data warehousing point to a growing need for proficiency in data storage and maintenance.
We also find it particularly interesting that the "ability to write" has surged in importance. This could mean that companies now expect more from their data scientists than technical skills alone. Data science may be developing into a more interdisciplinary field than it once was; hence the increase in the relative importance of writing and communication.