---
title: "DATA 605 - Final"
author: "Eddie Xu"
date: "`r Sys.Date()`"
output: html_document
---
### House Prices: Advanced Regression Techniques
Your final is due by the end of day on 19 May. This project will show off your ability to understand the elements of the class. You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition, [Kaggle Link](https://www.kaggle.com/c/house-prices-advanced-regression-techniques). I want you to do the following.
#### Load Dependencies
```{r load_dep_data, message=FALSE, warning=FALSE}
# load libraries
library(tidyverse)
library(reactable)
library(DescTools)
library(reshape2)
library(MASS) # loaded after tidyverse, so dplyr::select must be called explicitly below
library(devtools)
library(ggbiplot)
library(zoo)
# load data
training_data <- read.csv("https://raw.githubusercontent.com/eddiexunyc/DATA605_FINAL/main/Resources/train.csv")
test_data <- read.csv("https://raw.githubusercontent.com/eddiexunyc/DATA605_FINAL/main/Resources/test.csv")
# view train data
reactable(training_data)
```
Pick one of the quantitative independent variables from the training data set (train.csv), and define that variable as X. Make sure this variable is skewed to the right! Pick the dependent variable and define it as Y.
*For the quantitative independent variable, the lot size in square feet (`LotArea`) is defined as variable X; for the dependent variable, the sale price (`SalePrice`) is defined as variable Y.*
```{r variable, message=FALSE, warning=FALSE}
# histogram
variable_x_histogram <- ggplot(training_data, aes(x = LotArea)) +
geom_histogram(colour = 4, fill = "white") +
theme_minimal()
# set the Lot Area as x and Sale Price as y
variable_x <- training_data$LotArea
variable_y <- training_data$SalePrice
# histogram plot
variable_x_histogram
```
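The histogram shows a long right tail. As a quick numerical confirmation (a short sketch using `Skew` from the already-loaded DescTools package), a sample skewness well above zero indicates right skew:
```{r skewness_check, message=FALSE, warning=FALSE}
# sample skewness of LotArea; values > 0 indicate a right-skewed distribution
Skew(variable_x)
```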
#### Probability
Calculate as a minimum the below probabilities a through c. Assume the small letter "x" is estimated as the third quartile of the X variable, and the small letter "y" is estimated as the second quartile of the Y variable. Interpret the meaning of all probabilities. In addition, make a table of counts as shown below.
```{r summary_variable, message=FALSE}
variable_x_summary <- summary(variable_x)
variable_y_summary <- summary(variable_y)
# 3rd quartile of x and 2nd quartile of y
x_prob_quartile <- round(variable_x_summary["3rd Qu."])
y_prob_quartile <- round(variable_y_summary["Median"])
# print
variable_x_summary
variable_y_summary
```
*Based on both summaries, the third quartile of variable X is 11602 and the second quartile (the median) of variable Y is 163000.*
```{r count_table, message=FALSE, warning=FALSE}
# counts below the 3rd quartile of X (strict inequalities, so boundary ties drop out of every cell)
both_less_qrt <- sum(variable_x < x_prob_quartile & variable_y < y_prob_quartile)
y_greater_qrt <- sum(variable_x < x_prob_quartile & variable_y > y_prob_quartile)
# calculation on more than 3rd quartile
both_great_qrt <- sum(variable_x > x_prob_quartile & variable_y > y_prob_quartile)
y_less_qrt <- sum(variable_x > x_prob_quartile & variable_y < y_prob_quartile)
# calculation on total
less_3qrt_total <- both_less_qrt + y_greater_qrt
great_3qrt_total <- both_great_qrt + y_less_qrt
less_2qrt_total <- both_less_qrt + y_less_qrt
great_2qrt_total <- both_great_qrt + y_greater_qrt
grand_total <- less_3qrt_total + great_3qrt_total
# input values into the table matrix
table_matrix <- matrix(c(both_less_qrt, y_greater_qrt, less_3qrt_total,
y_less_qrt, both_great_qrt, great_3qrt_total,
less_2qrt_total, great_2qrt_total, grand_total),
nrow = 3, ncol = 3, byrow = TRUE)
colnames(table_matrix) <- c("less than 2nd quartile", "greater than 2nd quartile", "Total")
rownames(table_matrix) <- c("less than 3rd quartile", "greater than 3rd quartile", "Total")
table_matrix
```
*Based on the table, 1456 observations are counted rather than the full 1460: the strict inequalities drop the four observations that fall exactly on a quartile boundary. The counts below and above the 2nd quartile of Y are identical (728 each), while far more observations fall below the 3rd quartile of X (1091) than above it (365), consistent with a 75/25 split. By following the formula for conditional probability, each probability below can be computed from these counts.*
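In symbols, each probability below is an instance of the conditional probability formula, estimated by cell counts over the appropriate total:
$$P(X > x \mid Y > y) = \frac{P(X > x,\; Y > y)}{P(Y > y)} \approx \frac{n_{X>x,\,Y>y}}{n_{Y>y}}$$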
```{r prob_func, message=FALSE, warning=FALSE}
# probability a: P(X > x | Y > y)
prob_a_func <- function(x, y, xp, yp){
  result <- sum(x > xp & y > yp) / sum(y > yp)
  return(result)
}
# probability b: P(X > x, Y > y)
prob_b_func <- function(x, y, xp, yp){
  result <- sum(x > xp & y > yp) / length(x)
  return(result)
}
# probability c: P(X < x | Y > y)
prob_c_func <- function(x, y, xp, yp){
  result <- sum(x < xp & y > yp) / sum(y > yp)
  return(result)
}
```
- P(X>x | Y>y): the probability that lot area exceeds its 3rd quartile, given that sale price exceeds its median.
```{r}
prob_a_value <- prob_a_func(variable_x, variable_y, x_prob_quartile, y_prob_quartile)
prob_a_value
```
- P(X>x, Y>y): the joint probability that lot area exceeds its 3rd quartile and sale price exceeds its median.
```{r}
prob_b_value <- prob_b_func(variable_x, variable_y, x_prob_quartile, y_prob_quartile)
prob_b_value
```
- P(X<x | Y>y): the probability that lot area falls below its 3rd quartile, given that sale price exceeds its median.
```{r}
prob_c_value <- prob_c_func(variable_x, variable_y, x_prob_quartile, y_prob_quartile)
prob_c_value
```
Does splitting the training data in this fashion make them independent? Let A be the new variable counting those observations above the 3rd quartile for X, and let B be the new variable counting those observations above the 2nd quartile for Y. Does P(A|B)=P(A)P(B)? Check mathematically, and then evaluate by running a Chi Square test for association.
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \qquad A, B \text{ independent} \iff P(A \cap B) = P(A)\,P(B) \iff P(A \mid B) = P(A)$$
*So the identity to check is P(A∩B) = P(A)P(B), or equivalently P(A|B) = P(A); the stated P(A|B) = P(A)P(B) would hold only in the degenerate case P(B) = 1. Whether independence holds for this split is an empirical question.*
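As a direct numerical check (a minimal sketch reusing the quartile cut-offs defined above), the estimated joint probability can be compared with the product of the marginals; under independence the two should be approximately equal:
```{r independence_check, message=FALSE, warning=FALSE}
# compare P(A and B) with P(A) * P(B) using the quartile indicators
n <- length(variable_x)
p_A <- sum(variable_x > x_prob_quartile) / n   # P(A) = P(X > x)
p_B <- sum(variable_y > y_prob_quartile) / n   # P(B) = P(Y > y)
p_AB <- sum(variable_x > x_prob_quartile & variable_y > y_prob_quartile) / n
c(joint = p_AB, product = p_A * p_B)
```
The Chi-Square test then evaluates the association formally, with hypotheses: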
$$H_0: \text{Sale Price and Lot Size are independent} \\ H_1: \text{Sale Price and Lot Size are dependent}$$
```{r chi_test_comparison, message=FALSE, warning=FALSE}
# test association between the quartile indicators A and B
chi_result <- chisq.test(table(variable_x > x_prob_quartile, variable_y > y_prob_quartile))
chi_result
```
*At the conventional 0.05 significance level, the p-value is far smaller, so the null hypothesis of independence is rejected: splitting the data at these quartiles does not make Lot Size and Sale Price independent.*
#### Descriptive and Inferential Statistics
Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatter plot of X and Y. Provide a 95% CI for the difference in the mean of the variables. Derive a correlation matrix for two of the quantitative variables you selected. Test the hypothesis that the correlation between these variables is 0 and provide a 99% confidence interval. Discuss the meaning of your analysis.
```{r scatter_plot_xy, message=FALSE, warning=FALSE}
ggplot(training_data, aes(x = variable_x, y = variable_y, color = variable_y)) +
geom_point() +
geom_smooth(method=lm) +
theme_minimal() +
xlab("Lot size in square feet ") +
ylab("Sale Price") +
ggtitle("Scatter plot between Lot size in square feet & Sale Price")
```
*As shown in the plot, the fitted line indicates a weak positive relationship between lot size and sale price rather than independence; a handful of extreme lot-size outliers weaken the apparent linear association.*
```{r infer_ci, message=FALSE, warning=FALSE}
# inference on variable X based on the confidence level of 95%
MeanDiffCI(variable_x, variable_y, conf.level = 0.95)
```
*Given the 95% confidence interval for the difference in the means (X − Y), the lower bound is -174515 and the upper bound is -166294. The interval is entirely negative simply because the mean lot area (roughly 10,500 square feet) is far smaller than the mean sale price (roughly $181,000).*
```{r corr_matrix_map, message=FALSE, warning=FALSE}
# new dataset with variable x and y
corr_var <- cbind(variable_x, variable_y) %>%
as.matrix()
# create a cormat and melted cormat
housing_cormat <- round(cor(corr_var), 2)
melted_housing_cormat <- melt(housing_cormat)
# create a heat map
ggplot(data = melted_housing_cormat, aes(x=Var1, y=Var2, fill=value)) +
geom_tile(color = "white",
lwd = 1.5,
linetype = 1) +
geom_text(aes(label = round(value, 2)), color = "white", size = 4) +
theme_minimal()
```
*The heat map shows a correlation of 0.26 between the two variables, a weak positive association rather than independence. To test formally whether the population correlation is zero, consider the hypotheses:*
$$H_0: \rho_{\text{LotArea, SalePrice}} = 0 \\ H_1: \rho_{\text{LotArea, SalePrice}} \neq 0$$
```{r}
corr_test <- cor.test(training_data$LotArea, training_data$SalePrice, method = "pearson", conf.level = 0.99)
corr_test
```
*Using the Pearson method to measure the degree of linear relationship between LotArea and SalePrice, the p-value is less than 2.2e-16, effectively zero and far below the 0.01 significance level implied by the 99% confidence interval. The null hypothesis of zero correlation is therefore rejected: LotArea and SalePrice are significantly, though modestly, correlated, with a correlation coefficient of 0.264.*
#### Linear Algebra and Correlation
Invert your correlation matrix. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct principal components analysis (research this!) and interpret. Discuss.
```{r precision_matrix, message=FALSE, warning=FALSE}
# inverted correlation matrix
precision_corr <- solve(housing_cormat)
melted_precision_cormat <- melt(precision_corr)
# create a heat map
ggplot(data = melted_precision_cormat, aes(x=Var1, y=Var2, fill=value)) +
geom_tile(color = "white",
lwd = 1.5,
linetype = 1) +
geom_text(aes(label = round(value, 2)), color = "white", size = 4) +
theme_minimal()
```
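For a 2×2 correlation matrix, the diagonal of the precision matrix has the closed form $\frac{1}{1 - r^2}$, which is exactly the variance inflation factor. A quick check (a sketch reusing `housing_cormat` and `precision_corr` from above):
```{r vif_check, message=FALSE, warning=FALSE}
# the diagonal of a 2x2 precision matrix equals 1 / (1 - r^2), i.e., the VIF
r <- housing_cormat[1, 2]
c(diagonal = precision_corr[1, 1], closed_form = 1 / (1 - r^2))
```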
```{r multiple1, message=FALSE, warning=FALSE}
# multiplication of correlation matrix and precision matrix
corr_prec1_matrix <- housing_cormat %*% precision_corr
melted_corr_prec1 <- melt(corr_prec1_matrix)
# create a heat map
ggplot(data = melted_corr_prec1, aes(x=Var1, y=Var2, fill=value)) +
geom_tile(color = "white",
lwd = 1.5,
linetype = 1) +
geom_text(aes(label = round(value, 3)), color = "white", size = 4) +
theme_minimal()
```
```{r multiple2, message=FALSE, warning=FALSE}
# multiplication of precision matrix and correlation matrix
corr_prec2_matrix <- precision_corr %*% housing_cormat
melted_corr_prec2 <- melt(corr_prec2_matrix)
# create a heat map
ggplot(data = melted_corr_prec2, aes(x=Var1, y=Var2, fill=value)) +
geom_tile(color = "white",
lwd = 1.5,
linetype = 1) +
geom_text(aes(label = round(value, 3)), color = "white", size = 4) +
theme_minimal()
```
*Both heat maps show the 2×2 identity matrix (up to rounding), as expected: a matrix multiplied by its inverse yields the identity on either side. Since the two variables are only weakly correlated, principal components analysis (PCA) is used to examine how the variability splits across orthogonal directions. PCA supports predictive modeling and dimensionality reduction by projecting each data point onto the few principal components that capture most of the variance. [R-blogger Link](https://www.r-bloggers.com/2021/05/principal-component-analysis-pca-in-r/) is used as a reference.*
```{r pca_analysis, message=FALSE, warning=FALSE}
# principle components analysis
pca_var <- prcomp(corr_var, center = TRUE, scale. = TRUE)
summary(pca_var)
```
*The PCA summary shows that PC1 captures 63% of the variability, more than half, while PC2 captures the remaining 37%.*
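With only two standardized variables, these proportions have a closed form: the eigenvalues of the 2×2 correlation matrix are $1 + r$ and $1 - r$, so the variance shares are $(1 \pm r)/2$. A quick check (a sketch assuming the same two variables):
```{r pca_eigen_check, message=FALSE, warning=FALSE}
# for two standardized variables, the PC variance shares are (1 +- r) / 2
r <- cor(variable_x, variable_y)
c(PC1 = (1 + r) / 2, PC2 = (1 - r) / 2)
```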
```{r pca_biplot1, message=FALSE, warning=FALSE}
pca_plot1 <- ggscreeplot(pca_var) +
theme_minimal()
pca_plot1
```
*The scree plot shows that PC1 alone accounts for most of the variance.*
```{r pca_biplot2, message=FALSE, warning=FALSE}
pca_plot2 <- ggbiplot(pca_var, obs.scale = 1, var.scale = 1,
var.factor = 1.2,
ellipse = TRUE, ellipse.level = 0.5, ellipse.alpha = 0.1,
circle = TRUE, varname.size = 3, varname.color = "lightblue") +
theme(legend.direction = 'horizontal', legend.position = 'top') +
theme_minimal()
pca_plot2
```
*The PCA biplot shows how variables X and Y load on the two principal components. The arrows sit fairly far apart, closer to orthogonal than aligned, which is consistent with the weak correlation found earlier.*
#### Calculus-Based Probability & Statistics
Many times, it makes sense to fit a closed form distribution to data. For your variable that is skewed to the right, shift it so that the minimum value is above zero. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See [MASS](https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html)). Find the optimal value for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
```{r exp_fit, message=FALSE, warning=FALSE}
# LotArea's minimum (1,300 sq ft) is already above zero, so no shift is needed
# run fitdistr and extract the lambda (rate) estimate
exp_variable_x <- fitdistr(variable_x, "exponential")
lambda_x <- exp_variable_x$estimate
# 1000 samples from the exponential distribution
sample_size <- 1000
sample_x <- rexp(sample_size, rate = lambda_x) %>%
as.data.frame()
colnames(sample_x) <- c("sample_value")
# exponential histogram
exp_variable_x_histogram <- ggplot(sample_x,aes(x = sample_value)) +
geom_histogram(colour = 4, fill = "white") +
theme_minimal()
# compare the original distribution with the simulated exponential draws
# (par(mfrow) has no effect on ggplot objects, so the two plots print in sequence)
variable_x_histogram + ggtitle("Original Distribution")
exp_variable_x_histogram + ggtitle("Exponential Distribution")
```
```{r cdf_quantile, message=FALSE, warning=FALSE}
# empirical 5th and 95th percentiles of the 1,000 simulated draws
cdf_quantile <- quantile(sample_x$sample_value, probs = c(0.05, 0.95))
cdf_quantile
```
*From the simulated exponential draws, the 5th and 95th percentiles are 445.33 and 31512.02 respectively (these values vary from run to run, since no random seed is set).*
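Because the quantiles above come from random draws, they only approximate the fitted distribution. The exact percentiles can be read directly off the exponential CDF, $F(q) = 1 - e^{-\lambda q}$, so $q_p = -\ln(1 - p)/\lambda$; a sketch using the fitted `lambda_x`:
```{r theoretical_percentiles, message=FALSE, warning=FALSE}
# closed-form 5th and 95th percentiles of the fitted exponential distribution
qexp(c(0.05, 0.95), rate = lambda_x)
```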
```{r empirical_ci, message=FALSE, warning=FALSE}
# 95% CI for the mean based on the empirical data, assuming normality:
# xbar +/- z_{0.975} * (s / sqrt(n)), with z_{0.975} ~= 1.96
mean_var_x <- mean(variable_x)
std_var_x <- sd(variable_x)
size_var_x <- length(variable_x)
z <- qnorm(1 - 0.05/2)
# identify the upper and lower bound
lower_ci_var_x <- mean_var_x - z * (std_var_x / sqrt(size_var_x))
upper_ci_var_x <- mean_var_x + z * (std_var_x / sqrt(size_var_x))
# quantile on variable x
empirical_quantile <- quantile(variable_x, probs=c(0.05, 0.95))
round(lower_ci_var_x)
round(upper_ci_var_x)
empirical_quantile
```
*Based on the empirical data, the lower and upper CI bounds for the mean are 10005 and 11029 respectively. The empirical 5th and 95th percentiles are 3311.7 and 17401.15, much narrower than the exponential model's (roughly 445 to 31512): the exponential distribution puts far too much mass near zero and in the extreme right tail, so it is a poor fit for LotArea.*
#### Modeling
Build some type of regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
```{r linear_regression, message=FALSE, warning=FALSE}
# simple linear regression of sale price (Y) on lot area (X)
training_lm <- lm(variable_y ~ variable_x, data = training_data)
summary(training_lm)
```
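As a sanity check on this simple regression, $R^2$ should equal the squared Pearson correlation computed earlier ($0.264^2 \approx 0.07$), meaning lot area alone explains only about 7% of the variation in sale price:
```{r r_squared_check, message=FALSE, warning=FALSE}
# in simple linear regression, R^2 equals the squared correlation of X and Y
c(r_squared = summary(training_lm)$r.squared,
  cor_squared = cor(variable_x, variable_y)^2)
```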
##### Regression Model
```{r lm_plots, message=FALSE, warning=FALSE}
# histogram of the model residuals (avoid $ inside aes())
lm_histogram <- ggplot(data.frame(residuals = resid(training_lm)), aes(x = residuals)) +
geom_histogram(colour = 4, fill = "white") +
xlab("Residual on Linear Regression") +
theme_minimal()
par(mfrow=c(2,2))
plot(training_lm)
lm_histogram
```
```{r num_train_table, message=FALSE, warning=FALSE}
# pull in all columns with numeric values
training_data_num <- training_data %>%
dplyr::select(where(is.numeric), -Id)
# summary
summary(training_data_num)
```
*Pulling in every numeric column gives a broader view of the training set through appropriate plots. Here, box plots and histograms are the most practical way to survey all of these variables.*
##### Box Plot Part 1
```{r var_boxplot, message=FALSE, warning=FALSE}
# boxplot
par(mfrow=c(2,3))
boxplot(training_data_num$MSSubClass, xlab = "The building class")
boxplot(training_data_num$LotFrontage, xlab = "Linear feet of street connected to property")
boxplot(training_data_num$LotArea, xlab = "Lot size in square feet")
boxplot(training_data_num$OverallQual, xlab = "Overall material and finish quality")
boxplot(training_data_num$OverallCond, xlab = "Overall condition rating")
boxplot(training_data_num$MasVnrArea, xlab = "Masonry veneer area in square feet")
```
##### Box Plot Part 2
```{r var_boxplot2, message=FALSE, warning=FALSE}
# box plot part 2
par(mfrow=c(2,3))
boxplot(training_data_num$BsmtFinSF1, xlab = "Type 1 finished square feet")
boxplot(training_data_num$BsmtFinSF2, xlab = "Type 2 finished square feet")
boxplot(training_data_num$BsmtUnfSF, xlab = "Unfinished square feet of basement area")
boxplot(training_data_num$TotalBsmtSF, xlab = "Total square feet of basement area")
boxplot(training_data_num$X1stFlrSF, xlab = "First Floor square feet")
boxplot(training_data_num$X2ndFlrSF, xlab = "Second Floor square feet")
```
##### Box Plot Part 3
```{r var_boxplot3, message=FALSE, warning=FALSE}
# box plot part 3
par(mfrow=c(2,3))
boxplot(training_data_num$LowQualFinSF, xlab = "Low quality finished square feet (all floors)")
boxplot(training_data_num$GrLivArea, xlab = "Above grade (ground) living area square feet")
boxplot(training_data_num$BsmtFullBath, xlab = "Basement full bathrooms")
boxplot(training_data_num$BsmtHalfBath, xlab = "Basement half bathrooms")
boxplot(training_data_num$FullBath, xlab = "Full bathrooms above grade")
boxplot(training_data_num$HalfBath, xlab = "Half baths above grade")
```
##### Box Plot Part 4
```{r var_boxplot4, message=FALSE, warning=FALSE}
# boxplot part 4
par(mfrow=c(2,3))
boxplot(training_data_num$BedroomAbvGr, xlab = "Number of bedrooms above basement level")
boxplot(training_data_num$KitchenAbvGr, xlab = "Number of kitchens")
boxplot(training_data_num$TotRmsAbvGrd, xlab = "Total rooms above grade (does not include bathrooms)")
boxplot(training_data_num$Fireplaces, xlab = "Number of fireplaces")
boxplot(training_data_num$GarageCars, xlab = "Size of garage in car capacity")
boxplot(training_data_num$GarageArea, xlab = "Size of garage in square feet")
```
##### Box Plot Part 5
```{r var_boxplot5, message=FALSE, warning=FALSE}
# boxplot part 5
par(mfrow=c(2,3))
boxplot(training_data_num$WoodDeckSF, xlab = "Wood deck area in square feet")
boxplot(training_data_num$OpenPorchSF, xlab = "Open porch area in square feet")
boxplot(training_data_num$EnclosedPorch, xlab = "Enclosed porch area in square feet")
boxplot(training_data_num$X3SsnPorch, xlab = "Three season porch area in square feet")
boxplot(training_data_num$ScreenPorch, xlab = "Screen porch area in square feet")
boxplot(training_data_num$PoolArea, xlab = "Pool area in square feet")
```
##### Box Plot Part 6
```{r var_boxplot6, message=FALSE, warning=FALSE}
# boxplot part 6
par(mfrow=c(1,2))
boxplot(training_data_num$MiscVal, xlab = "$ Value of miscellaneous feature")
boxplot(training_data_num$SalePrice, xlab = "Sale Price")
```
##### Histogram
```{r var_histogram, message=FALSE, warning=FALSE}
# histogram
par(mfrow=c(2,3))
hist(training_data_num$YearBuilt, border=F , col=rgb(0.2,0.2,0.8,0.7), xlab = "Original construction date")
hist(training_data_num$YearRemodAdd, border=F , col=rgb(0.2,0.2,0.8,0.7), xlab = "Remodel date")
hist(training_data_num$GarageYrBlt, border=F , col=rgb(0.2,0.2,0.8,0.7), xlab = "Year garage was built")
hist(training_data_num$MoSold, border=F , col=rgb(0.2,0.2,0.8,0.7), xlab = "Month Sold")
hist(training_data_num$YrSold, border=F , col=rgb(0.2,0.2,0.8,0.7), xlab = "Year Sold")
```
##### Multiple Linear Regression
```{r multi_lm, message=FALSE, warning=FALSE}
# replace NA values in the numeric training data with the column mean
training_data_revised <- training_data_num %>%
  mutate(across(everything(), ~ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x)))
# multiple linear regression with numeric values
training_lm2 <- lm(SalePrice ~ LotFrontage + OverallQual + OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF + BsmtFullBath + BsmtHalfBath + FullBath + HalfBath + BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd + Fireplaces + GarageYrBlt + GarageCars + GarageArea + WoodDeckSF + OpenPorchSF + EnclosedPorch + ScreenPorch + PoolArea + MiscVal + MoSold + YrSold + TotalBsmtSF + GrLivArea , data = training_data_revised)
# summary
summary(training_lm2)
```
*The multiple regression model accounts for an estimated 80% of the variance. Some coefficients are returned as NA because their predictors are exact linear combinations of other columns (for example, TotalBsmtSF is the sum of the basement square-footage fields), and several others have p-values above the 0.05 threshold; those variables are removed and the model is retrained.*
```{r}
# multiple linear regression with revised variables
training_lm3 <- lm(SalePrice ~ OverallQual + OverallCond + YearBuilt + MasVnrArea + BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF + X1stFlrSF + X2ndFlrSF + BsmtFullBath + BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd + Fireplaces + GarageYrBlt + GarageCars + WoodDeckSF + ScreenPorch, data = training_data_revised)
# summary
summary(training_lm3)
```
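Manual pruning by p-value is one approach; an alternative worth noting (shown here as an unevaluated sketch, since it would search over all terms of `training_lm2`) is stepwise selection by AIC with `stepAIC` from the already-loaded MASS package:
```{r stepwise_sketch, eval=FALSE}
# alternative to manual pruning: stepwise selection by AIC
training_lm_step <- stepAIC(training_lm2, direction = "both", trace = FALSE)
summary(training_lm_step)
```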
##### Multiple Linear Regression Plots
```{r multi_lm_plot, message=FALSE, warning=FALSE}
par(mfrow=c(2,2))
plot(training_lm3)
```
*With the refined model, the test data is used to predict Sale Price. The test set first receives the same wrangling: only numeric columns are kept, and NA values are replaced with the column mean.*
```{r test_data, message=FALSE, warning=FALSE}
# data wrangling on test data: keep numeric columns first, then impute NAs with the column mean
test_data_revised <- test_data %>%
  dplyr::select(where(is.numeric)) %>%
  mutate(across(everything(), ~ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x)))
```
##### Sale Price Predictions and write the CSV file
```{r prediction_write_csv, message=FALSE, warning=FALSE}
# sale price prediction
kaggle_submission <- test_data_revised %>%
  mutate(predict_value = predict(training_lm3, newdata = test_data_revised)) %>%
  dplyr::select(Id, predict_value)
summary(kaggle_submission)
```
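As a rough internal check (a sketch; note that Kaggle's leaderboard scores RMSLE on the held-out test set, so this in-sample figure is not directly comparable to the public score), the training-set RMSE of the final model follows from its residuals:
```{r train_rmse, message=FALSE, warning=FALSE}
# in-sample root mean squared error of the final model
sqrt(mean(resid(training_lm3)^2))
```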
```{r csv_export, warning=FALSE, message=FALSE}
# export csv
kaggle_submission <- rename(kaggle_submission, SalePrice=predict_value)
write.csv(kaggle_submission, "Resources/kaggle_submission.csv", row.names = FALSE)
```
##### Result
*My Kaggle username is eddiexunyc and the lowest public score I was able to get is 0.33918.*
![](Kaggle Score.PNG)