final_draft.qmd

---
title: "Analyzing NBA Data Draft 3"
author: "Gavin Daves, Grant Thompson, Evan Brown, Seyaul Kim"
date: 'February 22, 2024'
format:
  pdf: default
  html: default
pdf-engine: pdflatex
editor: visual
fontsize: 10pt
geometry: margin=1in
toc: true                   # add table of contents at the beginning
toc-depth: 2                # Only titles that start with # or ##
---

# Analyzing how the Game of Basketball Developed Over Time

The National Basketball Association (NBA) has been in commission since 1946, and has become one of the most well-known sports organizations in history. Over 65,000 games have been played since the birth of the league, and we will be doing an analysis of play-by-play data from all of these games. For every event since the NBA's commission, this play-by-play data set has notable columns such as detailed play descriptions, score margins, involved players, and time remaining in the game.

Ultimately, our group seeks to analyze, quantify, and visualize two major questions: 1. How has the play style of the NBA evolved throughout its existence? 2. Which players have excelled in the sport and stood above the rest? Throughout this paper, we will use the play-by-play data, a variety of statistics, from shot distances to nationality to adjusted player efficiency ratings, to aim to answer these questions.

```{r setup, echo=FALSE, include=FALSE}
## The following line is an overall configuration
## asking that the R code is displayed.
## Set to FALSE to avoid showing the code by default
## (required for your final project, where you are not supposed
##  to show code)
knitr::opts_chunk$set(echo = FALSE)
knitr::opts_chunk$set(message = FALSE)
knitr::opts_chunk$set(warning = FALSE)
```

```{r, echo=FALSE, include=FALSE, warnings=FALSE}
packages <- c("data.table", "dplyr", "ggplot2", "tidyr", "ggalt", "knitr")

# Install missing packages
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages) > 0) {
  install.packages(new_packages)
}

library(data.table)
library(dplyr)
library(ggplot2)
library(tidyr)
library(ggalt)
library(knitr)
```

### Cleaning the Primary Data Set / Extracting Information

Our play-by-play data set has 13 million rows and 34 columns, with each row representing a different game event. In cleaning our data, we dropped a few columns that we didn't have use for, such as the "video_available_flag" column. In addition, we used REGEX to extract valuable information from the play description columns. In doing so, we were able to get new columns like shot distance.

```{r echo=FALSE, include=FALSE}
library(data.table)
library(dplyr)

# reads in play-by-play data
pbp <- fread("play_by_play.csv")

# reads in game-by-game data
game <- fread("game.csv")

# Changes the dates to just their years
game$game_date <- as.integer((format(game$game_date, "%Y")))
game <- game[, c("game_id", "game_date")]

# Removes unnecessary columns
pbp[, c("wctimestring", "player1_team_city", "player1_team_nickname", "person1type", "player2_team_city", "player2_team_nickname", "person2type", "player3_team_city", "player3_team_nickname", "person3type", "video_available_flag")] <- NULL

# Ensures all vital columns are not empty/NA
no_missing <- complete.cases(pbp$"game_id", pbp$"eventnum", pbp$"eventmsgtype", pbp$"eventmsgactiontype",
                             pbp$"period", pbp$"wctimestring", pbp$"pctimestring", pbp$"score")
pbp <- pbp[no_missing, ]

# Clears up memory space
rm(no_missing)

# Removes duplicate rows
pbp <- unique(pbp)

# Merge with game-by-game data to get year information for each event (row)
pbp <- merge(pbp, game, by = "game_id", all.x=TRUE)
head(pbp)
```

#### Extracting Shot Distances

One of the data points we are extracting is the jumpshot distance for made shots. In doing so, we can analyze average shot distance over time, find the make percentage per jumpshot distance, and analyze what the best jumpshot is.

```{r echo=FALSE, include=FALSE}
# parses descriptions to get shot distance
dist <- function(str, pattern = "\\b(\\d+)'") {
  pos <- regexpr(pattern, str)
  if (pos[1] <= 0){
    return(0)
  }
  sub <- regmatches(str, pos)
  num <- as.integer(substring(sub, 1, nchar(sub)-1))
  return (num)
}

# initialize the column
pbp$shot_dist <- 0

# Runs the parser on each row, parsing different columns depending on what is empty or not
pbp <- pbp %>%
  mutate(
    shot_dist = apply(pbp[, c("homedescription", "visitordescription")], 1, function(row) {
      dist_value <- dist(row["homedescription"])
      if (is.na(dist_value) || dist_value <= 0) {
        dist_value <- dist(row["visitordescription"])
      }
      return(dist_value)
    })
  )
```

### Cleaning the Secondary Data Set / Extracting Information

Here, we are making the data a bit more suitable for analysis by dropping unnecessary rows (for instance, we don't need international league stats, as we are only focused on the NBA). We are also adding columns that are based on other columns values for ease of use later in our analysis (ex. PTS / GP = PPG).

```{r, echo=FALSE}
player_stats <- read.csv("player_stats.csv")
player_stats <- player_stats[player_stats$Stage != "International", ]
player_stats$FieldGoalPercentage <- player_stats$FGM / player_stats$FGA * 100
player_stats$ThreePointPercentage <- player_stats$X3PM / player_stats$X3PA * 100
player_stats$ThreePointPercentage[is.nan(player_stats$ThreePointPercentage) | is.na(player_stats$ThreePointPercentage)] <- 0

player_stats$Season <- substring(player_stats$Season, 1,4)

player_stats["MPG"] <- player_stats$MIN/player_stats$GP
player_stats["PPG"] <- player_stats$PTS/player_stats$GP

player_stats["missed_shots"] <- player_stats$FGA - player_stats$FGM


# Creating subsets of the data for different years to analyze how certain aspects
# of the NBA have changed over time.

player_stats_99 <- subset(player_stats, Season == "1999")
player_stats_04 <- subset(player_stats, Season == "2004")
player_stats_09 <- subset(player_stats, Season == "2009")
player_stats_14 <- subset(player_stats, Season == "2014")
player_stats_19 <- subset(player_stats, Season == "2019")

```

## Evolution of the NBA

We will start by analyzing how the NBA has evolved since its creation. We will be analyzing and visualizing shot distance, points per game, adjusted player rating, and more to attempt to understand the league's evolution.

### Average Shot Distances By Year

Trends and play styles change constantly in the NBA as different players and teams use different strategies to win games. This leads to occasional 'revolutions' where people realize that different aspects of the sport, from speed of play to 3-point shots, are more important than others. This graph analyzes how far from the basket on average players were attempting shots every year to attempt to visualize any potential shooting revolutions.

```{r, echo=FALSE}
# Graph specific
# made separate data file after cleaning to speed up runtime
shot_only <- fread("shot_dist.csv", verbose = FALSE)
shot_only <- shot_only[, c("game_date", "shot_dist")]
# Groups by year and find average shot distance for each year
shot_only$game_date <- as.factor(shot_only$game_date)

avg_dist <- data.frame(
  game_date = levels(shot_only$game_date),
  avg_shot_dist = numeric(length(levels(shot_only$game_date)))
)

for (i in 1:length(levels(shot_only$game_date))) {
  year <- levels(shot_only$game_date)[i]
  avg_dist$avg_shot_dist[i] <- mean(shot_only$shot_dist[shot_only$game_date == year], na.rm = TRUE)
}
```

```{r, echo=FALSE}
# Plots as a line graph
plot(avg_dist$game_date, avg_dist$avg_shot_dist,
     type="l", xlab="Year", ylab="Average Shot Distance",
     main="Average Distance of Attempted Shots in the NBA By Year",
     col="black")

tick_marks_x <- seq(min(avg_dist$game_date), max(avg_dist$game_date), by = 1)
labels_x <- ifelse(tick_marks_x %% 5 == 0, as.character(tick_marks_x), "")

axis(1, at = tick_marks_x, labels = labels_x, tick = TRUE)

tick_marks_y <- seq(floor(min(avg_dist$avg_shot_dist)), ceiling(max(avg_dist$avg_shot_dist)), by = 1)
labels_y <- ifelse(tick_marks_y %% 5 == 0, as.character(tick_marks_y), "")

axis(2, at = tick_marks_y, labels = labels_y, tick = TRUE)
```

We can see that the distance of shots increased up until around 2010, when they had a massive drop. This could be for a number of reasons, with one likely one being related to a play style revolution. It was around this time that teams realized how much more valuable 3-point shots were, and how much of a waste long distance 2-point shots (ten plus feet away from the basket) were. Stephen Curry is often accredited with leading this 'revolution', as he was one of the first to adopt this play style and excel with it. Before this revolution, players often ignored 3-point shots to move a few feet closer. After this revolution, players either took a 3pt shot or moved much closer to the basket.

At the beginning of this revolution, it is possible that players realized that long distance 2-point shots were not worth it, leading to a complete disregard for these shots, even when they were actually worth it. Players took mostly one or two foot shots instead. All things balance out though, as players started to move back to taking some long distance 2-point shots if they had to as time went on, but the focus was still noticeably on taking less long distance 2-point shots.

### Density Graph of All Shot Distances Prior to and After 2010

This plot shows the density of all shot distances in the offensive half. It focuses on comparing shots before and after 2010, when the supposed shooting 'revolution' took place, as was discussed earlier. With this, we can see at what distance shots are most frequently taken at and what trends might come with that. It also displays the distance of the 3-point line in red.

```{r, echo=FALSE}
shots <- fread("shot_dist.csv", verbose = FALSE)
shots <- shots[, c("game_date", "shot_dist", "player1_team_abbreviation")]
shots <- shots[shot_dist <= 47]
shots$distance_group <- cut(shots$shot_dist, 
                            breaks = seq(0, max(shots$shot_dist) + 3, by = 3), 
                            include.lowest = TRUE, 
                            right = FALSE)
shots$pivot <- ifelse(shots$game_date < 2010, "Before 2010", "After 2010")
shots$pivot <- factor(shots$pivot, levels = c("Before 2010", "After 2010"))
ggplot(data = shots) + 
  aes(x = 1, y = shot_dist) + 
  geom_violin(fill = "darkblue", color = "black", alpha = 0.7) + 
  labs(title = "Density of All Shot Distances in the Offensive Half", 
       x = "", 
       y = "Shot Distance") + 
  geom_abline(slope = 0, intercept = 23, color = "red") + 
  theme(axis.text.x = element_blank(), 
        axis.ticks.x = element_blank(), 
        plot.title = element_text(hjust = 0.5)) + 
  facet_wrap(~pivot, scales = "free_y") + 
  annotate("text", x = 1.375, y = 21, label = "3pt Line", color = "black", size=3.5)
```

This plot further reinforces the earlier idea that the shooting 'revolution' in 2010 truly did change where players took shots from. As we can see in the plot, a significant proportion of shots were taken from 2-point mid-distance. After this revolution though, we can see that the proportion of shots taken from this distance significantly dropped, with a heavy increase in layups and a slight increase in longer distance 3 point shots. Players started to realize that taking a shot from mid-distance 2-point range didn't have much value, and took the ball towards the net or moved to 3-point range much more often.

### Minutes per game vs Points per game using Linear Regression

Our group found that player minutes vs. player points from the 1999 season seemed to have a positive correlation that was indicative of a linear relationship. We were interested in finding if this was the case for other seasons as well, so we decided to create models for the 1999, 2004, 2009, 2014, and 2019 seasons and plot their line of best fit to assess the possible linear relationships. After, we will plot all of the lines of best fit together to see how the lines vary over time.

```{r, echo=FALSE}

model_99 <- lm(player_stats_99$PPG ~ player_stats_99$MPG)
model_04 <- lm(player_stats_04$PPG ~ player_stats_04$MPG)
model_09 <- lm(player_stats_09$PPG ~ player_stats_09$MPG)
model_14 <- lm(player_stats_14$PPG ~ player_stats_14$MPG)
model_19 <- lm(player_stats_19$PPG ~ player_stats_19$MPG)

plot(player_stats_99$MPG, player_stats_99$PPG, xlab = "Minutes per Game", 
         ylab = "Points per Game", main = "1999 Season MPG vs. PPG",)
abline(model_99, col = 'red')
plot(player_stats_04$MPG, player_stats_04$PPG, xlab = "Minutes per Game", 
         ylab = "Points per Game", main = "2004 Season MPG vs. PPG",)
abline(model_04, col = 'blue')
plot(player_stats_09$MPG, player_stats_09$PPG, xlab = "Minutes per Game", 
         ylab = "Points per Game", main = "2009 Season MPG vs. PPG",)
abline(model_09, col = 'green')
plot(player_stats_14$MPG, player_stats_14$PPG, xlab = "Minutes per Game", 
         ylab = "Points per Game", main = "2014 Season MPG vs. PPG",)
abline(model_14, col = 'purple')
plot(player_stats_19$MPG, player_stats_19$PPG, xlab = "Minutes per Game", 
         ylab = "Points per Game", main = "2019 Season MPG vs. PPG",)
abline(model_19, col = 'lightsalmon')


xrange <- c(0,50)
yrange <- c(0, 30)

plot(xrange, yrange, type = "n", xlab = "Minutes per Game", 
         ylab = "Points per Game", main = "MPG vs. PPG", col = 'white')
abline(model_99, col = 'red')
abline(model_04, col = 'blue')
abline(model_09, col = 'green')
abline(model_14, col = 'purple')
abline(model_19, col = 'lightsalmon')
legend("topleft", legend=c("1999", "2004", "2009", "2014", "2019"), col=c("red", "blue", "green", "purple","lightsalmon"), pch=c(19, 18), lty=1, cex = .75)
```

#### Summary of Regression Model:

In our regression models, we can see our regression lines adhering to the data quite well. Our line hits many of our data points, as well as having a similar number of points above and below the line. However, for each year there is a trend of our data veering far from the regression line as minutes per game increases. Based on the structure of our points, they seem almost exponential. When a player gets between 0 and 30 MPG, their PPG appears to be linearly related to the MPG. However, once a player starts playing more than 30 MPG, their PPG increases rapidly. This phenomenon could be answered by some qualitative observations. The players that play the most MPG are often the best players on the team. Therefore, the steep and sudden increase in PPG after 30 MPG makes sense, it is because the players who play more than 30 MPG are the best players in the league, known for their scoring abilities..

As we can see from the plots with their line of best fit, each year has a positive linear relationship with a line of best fit that adequately predicts points per game in respect to minutes per game. However, when analyzing the lines of best fit plotted together, all seasons have a quite similar intercept, and all have a very close slope to one another aside from the 2019 season. When going back and looking at the data for 2019 once again, we found that towards the higher end of minutes per game (33+), the spread of the data becomes increasingly noticeable, especially towards the higher end of the points per game spectrum as we approach 40+ minutes per game. This indicates that there could be bad leverage points in our plots.

### Diagnostic tests on our Regression models
```{r, echo=FALSE, fig.height = 6, fig.width = 11}
par(mfrow=c(2, 2)) 
plot(model_99, main = "1999 Diagnostic Data") 
par(mfrow=c(2, 2)) 
plot(model_04, main = "2004 Diagnostic Data") 
par(mfrow=c(2, 2)) 
plot(model_09, main = "2009 Diagnostic Data") 
par(mfrow=c(2, 2)) 
plot(model_14, main = "2014 Diagnostic Data") 
par(mfrow=c(2, 2)) 
plot(model_19, main = "2019 Diagnostic Data") 
#paste(player_stats_14[247, "Player"],":",player_stats_14[247, "MPG"], "Minutes per game,", player_stats_14[247, "PPG"], "Points per game" ) 
#paste(player_stats_14[206, "Player"],":",player_stats_14[206, "MPG"], "Minutes per game,", player_stats_14[206, "PPG"], "Points per game" )
```

As we can see from the diagnostic plots from each of the observed seasons, there are certainly outliers, but no bad leverage points that affect the fit of the model aside from 2014. Looking into the 2014 season, the bad leverage points were LeBron James and Anthony Davis, who overperformed their expected play.

### Investigating the data of 30+ Minutes per game using Linear Regression

```{r}
library(viridis)
library(ggplot2)

player_stats_33_plus_min<- subset(player_stats, MPG >= 33)

model_33_plus_min <- lm(player_stats_33_plus_min$PPG ~ player_stats_33_plus_min$MPG)

player_stats_33_plus_min$fitted_values <- predict(model_33_plus_min)

ggplot(player_stats_33_plus_min, aes(x = MPG, y = PPG, color = as.factor(Season))) + 
  geom_point() + 
  geom_line(aes(y = fitted_values), color = "red", linewidth = 1) + 
  scale_color_viridis(discrete = TRUE, option = "D") + 
  labs(title = "PPG vs. MPG for Players with 33+ Min/Game", x = "Minutes Per Game (MPG)", y = "Points Per Game (PPG)", color = "Year") + 
  theme_minimal() 

log_33_model <- lm(log(player_stats_33_plus_min$PPG) ~ player_stats_33_plus_min$MPG)

player_stats_33_plus_min$log_PPG <- log(player_stats_33_plus_min$PPG)

player_stats_33_plus_min$log_ppg_fitted_values <- predict(log_33_model) 

ggplot(player_stats_33_plus_min, aes(x = MPG, y = log_PPG, color = as.factor(Season))) + 
  geom_point() + 
  geom_line(aes(y = log_ppg_fitted_values), color = "red", linewidth = 1) + scale_color_viridis(discrete = TRUE, option = "D") + 
  labs(title = "Log PPG vs. MPG for Players with 33+ Min/Game", x = "Minutes Per Game (MPG)", y = "Log Points Per Game (LPPG)", color = "Year") + 
  theme_minimal()
```
When running a linear regression model on the 33+ minutes per game data, it is clear that the model will not be incredibly accurate, as there is high variation in the data. So, we decided to transform the points per game to see if there was a better metric for linear regression. We chose to take the log of the Points per Game (LPPG). In our new scatterplot, we see that this didn't make much of a difference, as the data is still quite scattered. Ultimately, there is bound to be a large variance in performance when players are tired (or heated up) from being on the court for so long.

### Histogram of Heights over time

Our group was interested to see how the heights of NBA players have changed (or not) over time. So, we decided to create histograms, all with the same limits 160 centimeters is considered very short for an NBA player, and 230 centimeters is considered very tall).

```{r, echo=FALSE}
# Set common y-axis limits and number of breaks
common_ylim <- c(0, 140)
common_breaks <- 14

# Plot histograms with a common y-axis and number of breaks
hist(player_stats_99$height_cm, breaks = common_breaks, xlim = c(160, 230), ylim = common_ylim, xlab = "Height (cm)", main = "Heights of players in the 1999 Season")
hist(player_stats_04$height_cm, breaks = common_breaks, xlim = c(160, 230), ylim = common_ylim, xlab = "Height (cm)", main = "Heights of players in the 2004 Season")
hist(player_stats_09$height_cm, breaks = common_breaks, xlim = c(160, 230), ylim = common_ylim, xlab = "Height (cm)", main = "Heights of players in the 2009 Season")
hist(player_stats_14$height_cm, breaks = common_breaks, xlim = c(160, 230), ylim = common_ylim, xlab = "Height (cm)", main = "Heights of players in the 2014 Season")
hist(player_stats_19$height_cm, breaks = common_breaks, xlim = c(160, 230), ylim = common_ylim, xlab = "Height (cm)", main = "Heights of players in the 2019 Season")

```

From the data, we can see that from the 1999 to 2019, seasons the most common range for heights in all years was 200-210 centimeters. Also, the extreme heights (short and tall) seen in the 1999 and 2004 seasons do not appear in the future seasons,

### Shots Made to Shots Missed Pie Chart over time

Our group was interested in seeing the trends of shots missed to shots made in a season over time. We decided to create a pie chart to visualize the proportion of the 1999, 2004, 2009, 2014, and 2019 seasons.

From the pie charts, we can see that shot accuracy has stayed relatively constant over the period of time, as the percentages of shots made and shots missed are close to the same for every season sampled.

```{r, echo=FALSE, fig.height = 6, fig.width = 11}
# Combine data into a single data frame
seasons <- c("1999", "2004", "2009", "2014", "2019")

data <- data.frame(
  Season = rep(seasons, each = 2),
  Category = rep(c("Shots Missed", "Shots Made"), times = length(seasons)),
  Value = c(
    sum(player_stats_99$missed_shots, na.rm = TRUE),
    sum(player_stats_99$FGM, na.rm = TRUE),
    sum(player_stats_04$missed_shots, na.rm = TRUE),
    sum(player_stats_04$FGM, na.rm = TRUE),
    sum(player_stats_09$missed_shots, na.rm = TRUE),
    sum(player_stats_09$FGM, na.rm = TRUE),
    sum(player_stats_14$missed_shots, na.rm = TRUE),
    sum(player_stats_14$FGM, na.rm = TRUE),
    sum(player_stats_19$missed_shots, na.rm = TRUE),
    sum(player_stats_19$FGM, na.rm = TRUE)
  )
)

# Calculate percentages and labels
data$Percentage <- data$Value / tapply(data$Value, data$Season, sum)[data$Season] * 100
data$Label <- paste(data$Category, sprintf("%d%%", round(data$Percentage)))

# Create faceted pie chart with custom colors
ggplot(data, aes(x = "", y = Percentage, fill = Category)) +
  geom_bar(stat = "identity", width = 1, color = "white") +
  coord_polar("y") +
  theme_void() +
  facet_wrap(~Season) +
  geom_text(aes(label = Label), position = position_stack(vjust = 0.5), size = 3) +  # Adjust size parameter
  scale_fill_manual(values = c("darkorchid1", "thistle3")) +  # Set custom colors
  labs(title = "Shots Made to Shots Missed by Season")

```

### Barchart of Average 3-Point Shooting Percentages of NBA Big-Men By Season

In recent years, the NBA has seen a historic rise in the three point shot. Teams are relying on it more than ever. At the same time, the NBA has placed a focus on the drafting and development of skilled big-men. Gone are the days of the back-to-the-basket big, who uses their pure size and strength to win down low in the post. Players like Nikola Jokic, Victor Wembanyama, and Karl-Anthony Towns are known for their ability to space the floor and hit the three point shot. We sought out to analyze this trend, to see if the data would back up our qualitative observations.

```{r, echo=FALSE}
# Convert height_cm to numeric
player_stats$height_cm <- as.numeric(player_stats$height_cm)

# Filter tall players
tall_players <- player_stats[player_stats$height_cm > 203, ]

# Filter tall players w ThreePointPercentage >= 30
tall_players_30plus <- tall_players[tall_players$ThreePointPercentage >= 30, ]

# Count the number of tall players who shot 30% or over by season
num_tall_players <- table(tall_players_30plus$Season)

# Convert the result to a data frame
result <- data.frame(Season = names(num_tall_players), NumTallPlayers = as.vector(num_tall_players))

# Set graphical parameters
par(cex.main = 0.9, cex.lab =0.9)

# Create bar plot with adjusted title and y-axis label size
barplot(result$NumTallPlayers, names.arg = result$Season, col = "blue",
        xlab = "Season", ylab = "Frequency",
        main = "Number of 6'8\"+ Players who shot 30% or over from 3PT by Season")
grid()
```

We chose to go with the height 6'8" as it is the average height for a power forward. In this plot, we are aiming to examine the number of power forwards and centers that shoot \>30% from the three point line. From our graph, we can see that from 1999-2014 there are about 40 NBA big-men who shoot at least 30% from three. In recent years, namely since 2016, this number has doubled, with a massive jump from the 2014 season to the 2016 season. This graph lines up with our qualitative observations, that the NBA has observed a sudden and big jump in skilled big-men with shooting abilities.

### Heat Map of NBA Players By Nationality and Season

A growing trend in the NBA is their global outreach programs. In doing so, they have spread the game of basketball across the world, inspiring young athletes in many countries. In recent years, we have seen foreign NBA players like Nikola Jokic and Giannis Antetokounmpo dominate the league. We wanted to explore this trend, and provide evidence for this phenomenon of an increasing foreign presence in the NBA.

```{r, echo=FALSE}
# Extract the year from the 'Season' column
player_stats$Year <- as.numeric(sub(" - .*", "", player_stats$Season))

nationality_counts <- player_stats %>%
  group_by(nationality) %>%
  summarise(UniquePlayers = n_distinct(Player)) %>%
  ungroup()

# Filter out the United States and nationalities with 10 or fewer players
nationalities_with_more_than_10 <- nationality_counts %>%
  filter(UniquePlayers > 8 & nationality != "United States") %>%
  pull(nationality) # Get a vector of the nationalities

# Filter the original data for these nationalities
filtered_player_stats <- player_stats %>%
  filter(nationality %in% nationalities_with_more_than_10)

annual_nationality_counts <- filtered_player_stats %>%
  group_by(Year, nationality) %>%
  summarise(PlayerCount = n_distinct(Player), .groups = 'drop')

# Plot
p <- ggplot(annual_nationality_counts, aes(x = Year, y = PlayerCount, group = nationality, color = nationality)) +
  geom_line(linewidth = 1.2) +  # Updated to use `linewidth`
  geom_point(data = annual_nationality_counts %>% 
               group_by(nationality) %>% 
               filter(Year == min(Year) | Year == max(Year)),  # Points at beginning and end
             size = 3) +
  labs(title = "Number of NBA Players by Nationality Over Years",
       x = "Year", y = "Number of Players", color = "Nationality") +
  theme_minimal() +
  theme(legend.position = "bottom",
        panel.border = element_rect(colour = "black", fill=NA, linewidth = 1),  # Updated to use `linewidth`
        axis.text.x = element_text(size = 10))  # Change the size as needed

p
annual_nationality_counts_filtered <- annual_nationality_counts %>%
  group_by(nationality) %>%
  filter(Year == min(Year) | Year == max(Year)) %>%
  ungroup()


# Plot with only the beginning and end points for each nationality
q <- ggplot(annual_nationality_counts_filtered, aes(x = Year, y = PlayerCount, group = nationality, color = nationality)) +
  geom_line(size = 1.2) +  # Make lines thicker
  geom_point(size = 3) +   # Add points
  labs(title = "Overall Change in Number of NBA Players by Nationality",
       x = "Year", y = "Number of Players", color = "Nationality") +
  theme_minimal() +
  theme(legend.position = "bottom",
        panel.border = element_rect(colour = "black", fill=NA, linewidth=1))  # Add black border

# Print the plot
print(q)
```
The line graph plots the number of players from Australia, Canada, France, Spain, and Serbia across all the years that we have on the data set. We chose these specific countries as they represent the most foreign NBA players in the league today. Based on the results from the graph, we can see that even in the past 2 decades, there were not many foreign players that joined the NBA. However, as we get closer to the current decade, it becomes obvious that the influence of the NBA has reached outside of the US because more people from all included countries in the graph have a higher density of people that join the NBA. This aligns with the qualitative observation that the NBA is becoming an increasingly popular sport that reaches well outside the confines of the US. The second graph is a graph that shows the beginning number and the end number, and it shows that without fail, across all the countries with the highest density of NBA players, the number of players increase.

### Adjusted Player Rating

We decided to use a modified version of the PER (Player Efficiency Rating) to assign numerical values to player's individual seasons. This is calculated by the sum of their points, assists, rebounds, steals, blocks, made field goals, and made free throws. We then subtract field goal attempts, free throw attempts, and turnovers. We then divide this by the number of games played. 

```{r, echo=FALSE}
player_stats$Season_numeric <- as.integer(substr(player_stats$Season, 1, 4)) - 1998

# Filter data for Regular Season
player_stats_regular_season <- player_stats[player_stats$Stage == "Regular_Season", ]

calculate_PER <- function(player) {
  season_index <- player[["Season_numeric"]]

  # Convert relevant columns to numeric
  numeric_columns <- c("PTS", "AST", "REB", "STL", "BLK", "FGA", "FGM", "GP", "MIN", "FTM", "FTA", "TOV")
  player[numeric_columns] <- lapply(player[numeric_columns], as.numeric)

  # PER calculation
  per <- ((player[["PTS"]] + player[["AST"]] + player[["REB"]] + player[["STL"]] + player[["BLK"]] - (player[["FGA"]] - player[["FGM"]]) - (player[["FTA"]] - player[["FTM"]]) - player[["TOV"]]) / (player[["GP"]]))


  return(per)
}

player_stats_regular_season$Adjusted_player_rating <- apply(player_stats_regular_season, 1, calculate_PER)

player_stats_total <- player_stats
player_stats_total$Adjusted_player_rating <- apply(player_stats_total, 1, calculate_PER)
```

### Density Plot of Adjusted Player Rating By Draft Pick

In NBA dialogue, there has been a recurring theme that "the skill in the NBA is higher than it has ever been." We wanted to start to unpack this qualitative observation with some quantitative values. To do so, we utilized the "Adjusted Player Rating" stat, and plotted the densities by position in the draft. In the NBA, a "lottery pick" refers to the first 14 picks. A late first round pick is a first round pick that occurs after the lottery. In this graph, we wanted to explore the different skill levels across positions in the draft, and explore how these skill levels have changed over time.

```{r, echo=FALSE, fig.height = 8, fig.width = 12}
player_stats_regular_season$draft_pick <- sprintf("%d.%02d", player_stats_regular_season$draft_round, player_stats_regular_season$draft_pick)

# Drop the original "draft_round" column
player_stats_regular_season <- subset(player_stats_regular_season, select = -c(draft_round))

# Convert Draft_pick to numeric for comparison
player_stats_regular_season$Draft_numeric <- as.numeric(player_stats_regular_season$draft_pick)

# Create draft categories based on the specified ranges
player_stats_regular_season$Draft_category[player_stats_regular_season$Draft_numeric >= 1.01 & player_stats_regular_season$Draft_numeric <= 1.14] <- "Lottery Pick"
player_stats_regular_season$Draft_category[player_stats_regular_season$Draft_numeric >= 1.15 & player_stats_regular_season$Draft_numeric <= 1.30] <- "Late First Round"
player_stats_regular_season$Draft_category[player_stats_regular_season$Draft_numeric >= 2.01 & player_stats_regular_season$Draft_numeric <= 2.30] <- "Second Round"
player_stats_regular_season$Draft_category[player_stats_regular_season$Draft_numeric > 2.30] <- "Later Round Pick (Pre 1989 Pick)"
player_stats_regular_season$Draft_category[is.na(player_stats_regular_season$Draft_numeric)] <- "Undrafted"

# Specify the order of Draft categories
order_draft_categories <- c("Lottery Pick", "Late First Round", "Second Round", "Undrafted")
player_stats_regular_season$Draft_category <- factor(player_stats_regular_season$Draft_category, levels = order_draft_categories)

# Calculate average Adjusted_player_rating for each season
average_ratings <- aggregate(Adjusted_player_rating ~ Season, data = player_stats_regular_season, mean, na.rm = TRUE)

# Rename the Adjusted_player_rating column to "season_rating"
names(average_ratings)[names(average_ratings) == "Adjusted_player_rating"] <- "season_rating"

# Merge the average ratings back to the original data frame
player_stats_regular_season <- merge(player_stats_regular_season, average_ratings, by = "Season", all.x = TRUE)

filtered_data <- player_stats_regular_season[player_stats_regular_season$Draft_category != "Later Round Pick (Pre 1989 Pick)", ]

selected_years <- c(1999, 2009, 2019)
ggplot(filtered_data[filtered_data$Season %in% selected_years, ], aes(x = Adjusted_player_rating, fill = Draft_category)) +
  geom_density(alpha = 0.8) +
  labs(title = "Density Plot of Adjusted Player Rating",
       subtitle = "Grouped by Draft Category",
       caption = "Source: Your Data Source",
       x = "Adjusted Player Rating",
       fill = "Draft Category") +
  theme_minimal() +
  facet_grid(Season ~ Draft_category, scales = "free_y", space = "free_y") +  # Facet by season and Draft_category with free y-axes
  theme(
    axis.text.x = element_text(size = 8),  # Adjust x-axis text size
    axis.text.y = element_text(size = 8),  # Adjust y-axis text size
    strip.text = element_text(size = 8),    # Adjust facet label text size
  )
```

We can see quite a few interesting patterns in this plot. As predicted, the lottery picks consist of more "top players" (players with high adjusted player ratings) than any other draft position. In addition, we can see that the late first round and second round are rather similar in density of player rating. This trend breaks down in 2019, where we can see many more second round picks at a player rating of 4-8 than the late first round, as well as many more players with a player rating of 10-12. Over the years, we can see the curves shift right, towards more talented players. For example, any draft category in 2019 has significantly more players in the 10-12 player rating range than any previous draft category. This may imply that NBA teams have improved at draft prospect analysis and drafting. However, we can also notice that the skill of undrafted players has always hovered around a 4-8 player rating. But, in 2019, there is a noticeable spike in players with a rating above 8. This may imply that the skill of NBA prospects in general has increased over time, supporting the qualitative observations mentioned previously.

## The Ones Above The Rest

To close out our analysis, we will attempt to find which players have truly stood above the rest throughout the years of the NBA. NBA dialouge has alwasy revolved around player comparison. Rivalries like Lebron v. Jordan or Magic v. Bird has sparked debates over who the "GOAT" (Greatest of All Time) of basketball is. We are looking to provide visualizations to help demonstrate some of the best players in the league of the last 20 years.

### Player Rating vs. Minutes Played, per Season

Our first look at finding the best players utilized our adjusted player rating metric. We wanted to compare the best individual seasons across the last 20 years, hoping to find correlations with the highest adjusted player ratings and seasons said player won MVP.
```{r, echo=FALSE, fig.height = 6, fig.width = 11}
#install.packages("ggalt")

calculate_PER <- function(player) {
  season_index <- player[["Season_numeric"]]

  # Convert relevant columns to numeric
  numeric_columns <- c("PTS", "AST", "REB", "STL", "BLK", "FGA", "FGM", "GP", "MIN", "FTM", "FTA", "TOV")
  player[numeric_columns] <- lapply(player[numeric_columns], as.numeric)

  # PER calculation
  per <- ((player[["PTS"]] + player[["AST"]] + player[["REB"]] + player[["STL"]] + player[["BLK"]] - (player[["FGA"]] - player[["FGM"]]) - (player[["FTA"]] - player[["FTM"]]) - player[["TOV"]]) / (player[["GP"]]))


  return(per)
}


test <- read.csv("player_stats.csv")
test$Season <- substr(test$Season, 1, 4)
test$Season_numeric <- as.integer(test$Season) - 1998
reg_szn <- test[test$Stage == "Regular_Season", ]
reg_szn$efficiency <- apply(reg_szn, 1, calculate_PER)

# Calculate the mean and standard deviation of "Adjusted_player_rating"
mean_rating <- mean(reg_szn$efficiency)
sd_rating <- sd(reg_szn$efficiency)

# Create a scatter plot with encircling for points greater than 1
# Create a scatter plot with encircling for points greater than 1
ggplot(reg_szn, aes(x = MIN, y = efficiency, label = Player, fill = Season)) +
  geom_point(alpha = 0.6, shape = 21, size = 2, color = "black") +
  geom_encircle(aes(x = MIN, y = efficiency), 
                data = subset(reg_szn, efficiency > 3.5 * sd_rating + mean_rating),
                color = "red", 
                size = 1.5,
                expand = 0.1, show.legend = FALSE, fill = NA) +
  geom_text(data = subset(reg_szn, efficiency > 3.5 * sd_rating + mean_rating),
            aes(x = MIN, y = efficiency, label = Player),
            vjust = 1.5, hjust = -0.1, size = 2) +
  labs(title = "Plotting Adjusted Player Rating to Minutes Played",
       x = "Minutes Played",
       y = "Adjusted Player Rating") +
  scale_fill_manual(name = "Season", 
                    values = c("1999" = "blue", "2000" = "green", "2001" = "red", "2002" = "purple", "2003" = "orange",
                               "2004" = "brown", "2005" = "pink", "2006" = "cyan", "2007" = "grey", "2008" = "yellow",
                               "2009" = "magenta", "2010" = "darkgreen", "2011" = "violet", "2012" = "lightblue",
                               "2013" = "darkred", "2014" = "darkblue", "2015" = "darkorange", "2016" = "darkcyan",
                               "2017" = "darkgrey", "2018" = "black", "2019" = "darkmagenta"))

```

We can see quite a few interesting patterns in this plot. As predicted, the lottery picks consist of more "top players" (players with high adjusted player ratings) than any other draft position. In addition, we can see that the late first round and second round are rather similar in density of player rating. This trend breaks down in 2009, where we can see many more second round picks at a player rating of 15-20 than the late first round, as well as many more players with a player rating of 10. Over the years, we can see the curves shift right, towards more talented players. For example, any draft category in 2019 has significantly more players above 20 than any previous draft category. This may imply that NBA teams have improved at draft prospect analysis and drafting. However, we can also notice that the skill of undrafted players has always hovered around a 8-10 player rating. But, in 2019, there is a noticeable spike in players with a rating above 15. This may imply that the skill of NBA prospects in general has increased over time, supporting the qualitative observations mentioned previously.

### Playoff Performance

Among NBA discourse, playoff performance holds significantly more weight than regular season performance. In tandem, many players are labeled "playoff risers" and "playoff chokers", performing better or worse than usual in the playoffs, where each game truly matters. To explore this phenomenon, we wanted to compare players average Adjusted Player Rating throughout their careers, split between their regular season performances and their playoffs performances. In doing so, we can visualize the best playoff performers, the best regular season performers, and those who best elevate their game when it matters.


```{r}
library(ggplot2)
library(ggrepel)
average_ratings_team <- player_stats_regular_season %>%
  group_by(Season, Team) %>%
  summarize(Average_Player_Rating = mean(Adjusted_player_rating, na.rm = TRUE))

reg_season <- player_stats_total[player_stats_total$Stage == "Regular_Season", ]

average_ratings_player_regular <- reg_season %>%
  group_by(Player) %>%
  summarize(Average_Player_Rating = mean(Adjusted_player_rating, na.rm = TRUE))

player_stats_playoffs <- player_stats[player_stats$Stage == "Playoffs", ]
player_stats_playoffs$Adjusted_player_rating <- apply(player_stats_playoffs, 1, calculate_PER)

average_ratings_playoffs <- player_stats_playoffs %>%
  group_by(Player) %>%
  summarize(Average_Player_Rating = mean(Adjusted_player_rating, na.rm = TRUE))


merged_team_data <- merge(average_ratings_player_regular, average_ratings_playoffs, by = "Player", suffixes = c("_regular", "_playoffs"))


ggplot(merged_team_data, aes(x = Average_Player_Rating_regular, y = Average_Player_Rating_playoffs, label = Player)) +
  geom_point(aes(color = ifelse(Average_Player_Rating_playoffs > predict(lm(Average_Player_Rating_playoffs ~ Average_Player_Rating_regular, data = merged_team_data)), "green", "red"))) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  geom_text_repel(data = subset(merged_team_data, abs(Average_Player_Rating_playoffs - predict(lm(Average_Player_Rating_playoffs ~ Average_Player_Rating_regular, data = merged_team_data))) >= 3),
                 aes(label = Player),
                 box.padding = 0.5, point.padding = 0.5, segment.color = 'red',
                 direction = 'both', min.segment.length = 0.5, nudge_x = 0.5, nudge_y = 0.5,
                 size = 2) +
  labs(title = "Scatter Plot of Regular Season and Playoffs Ratings",
       x = "Regular Season Average Rating",
       y = "Playoffs Average Rating") +
  scale_color_manual(values = c("green", "red")) +  # Set colors for above and below the line
  theme_minimal() +
  guides(color = FALSE)  # Remove the legend for color
```

In our plot, players labeled with red dots symbolize individuals who perform worse in the playoffs than the regular season, whereas green is the opposite. We also performed a regression and plotted the line. As we can see in our regression line, playoff rating and regular season rating is nearly one-to-one. However, the slope of the regression line is slightly below one, indicating that, on average, players play worse in the playoffs than they do in the regular season. This makes sense, they are now playing against the best teams in the league, where the competition is significantly elevated. So, we compared playoff "risers" and "chokers" according to this basis. We can see that players like Luka Doncic, Anthony Davis, and Nikola Jokic are not only elite in the regular season, but surpass this level in the playoffs. More "average" players like Jamal Murray and Caris LeVert take thier game to an elite level in the playoffs. In addition, we can see very good players like Karl-Anthony Towns, DeMarcus Cousins, and Kevin Love have historically underperformed in the playoffs, a mark of a "Regular Season Merchant."

All in all, the difference between a good player and a great player is simple. A great player shows up and elevates their game in the playoffs. In our chart, we can see that Luka, Jokic, LeBron, and Anthony Davis are not only elite in the regular season, but elevate their games to never before seen levels in the playoffs.

### Stacked Barchart of 2-Point and 3-Point Shots Made By Top 10 NBA Players Each Year

Seeing how talented NBA players are naturally raises the question of who the best player is. This can be approximated by measuring the number of points a player makes. We aggregated the points each player made per season to see if we could see any trends regarding who the top scorers were.

```{r}
shots_data <- fread("shot_dist.csv")
shots_data[, score := as.character(score)]
shots_data$scoremargin <- as.numeric(gsub("TIE", "0", shots_data$scoremargin))
shots_data <- shots_data[score != ""]

shots_data <- shots_data[order(shots_data$game_id, shots_data$eventnum), ]

common_player_info <- fread("common_player_info.csv")
player_id_name_mapping <- common_player_info[, .(person_id, display_last_comma_first)]
shots_data <- merge(shots_data, player_id_name_mapping, by.x = "player1_id", by.y = "person_id", all.x = TRUE)

shots_data[, player1_name := fifelse(is.na(player1_name) | player1_name == "", display_last_comma_first, player1_name)]

# Convert scoremargin to numeric
shots_data$scoremargin <- as.numeric(as.character(shots_data$scoremargin))

# Handle cases where conversion to numeric introduces NAs due to conversion issues
shots_data$scoremargin[is.na(shots_data$scoremargin)] <- 0

# Calculate points
shots_data$points <- rep(0, nrow(shots_data))

shots_data$points <- c(0, diff(shots_data$scoremargin))

# Identify the start of a new game
is_new_game <- c(TRUE, diff(shots_data$game_id) != 0)

# Correct points at the start of a new game
shots_data$points[is_new_game] <- shots_data$scoremargin[is_new_game]

shots_data$points <- pmin(pmax(shots_data$points, 0), 3)

# Determine shot type based on points
shots_data$shot_type <- ifelse(shots_data$points == 3, "3-pointer", "2-pointer")

shots_data[, shot_count := 1]  # Add a new column to count each shot

# Aggregate the count of shots by player, year, and shot type
shots_data_aggregated <- shots_data[, .(shots_made = .N), by = .(game_date, player1_name, shot_type)]

# Calculate total shots made for each player per year, across all shot types
shots_data_aggregated[, total_shots := sum(shots_made), by = .(game_date, player1_name)]

top_shooters_by_year <- shots_data_aggregated[order(-total_shots), ][, head(.SD, 20), by = .(game_date)]

# Ensure the aggregated shot counts and shot types are correct for the top players, including total_shots
top_shooters_by_year <- top_shooters_by_year[, .(shots_made = sum(shots_made)), by = .(game_date, player1_name, shot_type, total_shots)]

years_of_interest <- c(1998, 2008, 2021)

# Create stacked barplots only for the specified years
for (year in years_of_interest) {
  # Check if the year is present in the dataset
  plot_data <- subset(top_shooters_by_year, game_date == year)
  
  p <- ggplot(plot_data, aes(x = reorder(player1_name, total_shots), y = shots_made, fill = shot_type)) +
    geom_bar(stat = "identity") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    labs(title = paste("Top Shooters in", year), x = "Player", y = "Number of Shots Made") +
    scale_fill_manual(values = c("2-pointer" = "blue", "3-pointer" = "green")) +
    guides(fill = guide_legend(title = "Shot Type")) +
    scale_y_continuous(limits = c(0, 1100))  # Set y-axis limit
    
  print(p)
}

```

In the above R script, the top 10 scorers in the 3 most interesting years in our current decade, past decade, and past 2 decades are represented in a stacked bar plot. The notion of 'most interesting' was determined by the years with the most number of points made. The number of 2 pointers and 3 pointers that the players made are the y-axis and the top 10 players ranked by the number of points scored is the x-axis. We chose to only show the top 10 players since modelling every player would obfuscate the details of the graph too much. From the data, we can see the players that scored the most points, what proportion of those shots were 2 pointers versus 3 pointers, and how the player did compared to other top scorers. A specific observation that we can make is LeBron James' dominance in the point scoring scene, which aligns with the qualitative observation that we can make regarding LeBron's status as one of the most prolific, if not the most prolific, scorers of all time.