---
title: "DATA 605 - Final"
author: "Eddie Xu"
date: "`r Sys.Date()`"
output: html_document
---
### House Prices: Advanced Regression Techniques
Your final is due by the end of day on 19 May. This project will show off your ability to understand the elements of the class. You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition, [Kaggle Link](https://www.kaggle.com/c/house-prices-advanced-regression-techniques). I want you to do the following.
#### Load Dependencies
```{r load_dep_data, message=FALSE, warning=FALSE}
# load libraries
library(tidyverse)
library(reactable)
library(DescTools)
library(reshape2)
library(MASS) # loaded after tidyverse, so dplyr::select must be called explicitly below
library(devtools)
library(ggbiplot)
library(zoo)
# load data
training_data <- read.csv("https://raw.githubusercontent.com/eddiexunyc/DATA605_FINAL/main/Resources/train.csv")
test_data <- read.csv("https://raw.githubusercontent.com/eddiexunyc/DATA605_FINAL/main/Resources/test.csv")
# view train data
reactable(training_data)
```
Pick one of the quantitative independent variables from the training data set (train.csv), and define that variable as X. Make sure this variable is skewed to the right! Pick the dependent variable and define it as Y.
*For the quantitative independent variable, the lot size in square feet (`LotArea`) is defined as variable X; for the dependent variable, the sale price (`SalePrice`) is defined as variable Y.*
```{r variable, message=FALSE, warning=FALSE}
# histogram
variable_x_histogram <- ggplot(training_data, aes(x = LotArea)) +
geom_histogram(colour = 4, fill = "white") +
theme_minimal()
# set the Lot Area as x and Sale Price as y
variable_x <- training_data$LotArea
variable_y <- training_data$SalePrice
# histogram plot
variable_x_histogram
```
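The histogram shows a long right tail. As a quick numerical confirmation (a short sketch using `Skew` from the already-loaded DescTools package), a sample skewness well above zero indicates right skew:
```{r skewness_check, message=FALSE, warning=FALSE}
# sample skewness of LotArea; values > 0 indicate a right-skewed distribution
Skew(variable_x)
```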
#### Probability
Calculate as a minimum the below probabilities a through c. Assume the small letter "x" is estimated as the third quartile of the X variable, and the small letter "y" is estimated as the second quartile of the Y variable. Interpret the meaning of all probabilities. In addition, make a table of counts as shown below.
```{r summary_variable, message=FALSE}
variable_x_summary <- summary(variable_x)
variable_y_summary <- summary(variable_y)
# 3rd quartile of x and 2nd quartile of y
x_prob_quartile <- round(variable_x_summary["3rd Qu."])
y_prob_quartile <- round(variable_y_summary["Median"])
# print
variable_x_summary
variable_y_summary
```
*Based on both summaries, the third quartile of variable X is 11602 and the second quartile (the median) of variable Y is 163000.*
```{r count_table, message=FALSE, warning=FALSE}
# counts below the 3rd quartile of X (strict inequalities, so boundary ties drop out of every cell)
both_less_qrt <- sum(variable_x < x_prob_quartile & variable_y < y_prob_quartile)
y_greater_qrt <- sum(variable_x < x_prob_quartile & variable_y > y_prob_quartile)
# calculation on more than 3rd quartile
both_great_qrt <- sum(variable_x > x_prob_quartile & variable_y > y_prob_quartile)
y_less_qrt <- sum(variable_x > x_prob_quartile & variable_y < y_prob_quartile)
# calculation on total
less_3qrt_total <- both_less_qrt + y_greater_qrt
great_3qrt_total <- both_great_qrt + y_less_qrt
less_2qrt_total <- both_less_qrt + y_less_qrt
great_2qrt_total <- both_great_qrt + y_greater_qrt
grand_total <- less_3qrt_total + great_3qrt_total
# input values into the table matrix
table_matrix <- matrix(c(both_less_qrt, y_greater_qrt, less_3qrt_total,
y_less_qrt, both_great_qrt, great_3qrt_total,
less_2qrt_total, great_2qrt_total, grand_total),
nrow = 3, ncol = 3, byrow = TRUE)
colnames(table_matrix) <- c("less than 2nd quartile", "greater than 2nd quartile", "Total")
rownames(table_matrix) <- c("less than 3rd quartile", "greater than 3rd quartile", "Total")
table_matrix
```
*Based on the table, 1456 observations are counted rather than the full 1460: the strict inequalities drop the four observations that fall exactly on a quartile boundary. The counts below and above the 2nd quartile of Y are identical (728 each), while far more observations fall below the 3rd quartile of X (1091) than above it (365), consistent with a 75/25 split. By following the formula for conditional probability, each probability below can be computed from these counts.*
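In symbols, each probability below is an instance of the conditional probability formula, estimated by cell counts over the appropriate total:
$$P(X > x \mid Y > y) = \frac{P(X > x,\; Y > y)}{P(Y > y)} \approx \frac{n_{X>x,\,Y>y}}{n_{Y>y}}$$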
```{r prob_func, message=FALSE, warning=FALSE}
# probability a: P(X > x | Y > y)
prob_a_func <- function(x, y, xp, yp){
  result <- sum(x > xp & y > yp) / sum(y > yp)
  return(result)
}
# probability b: P(X > x, Y > y)
prob_b_func <- function(x, y, xp, yp){
  result <- sum(x > xp & y > yp) / length(x)
  return(result)
}
# probability c: P(X < x | Y > y)
prob_c_func <- function(x, y, xp, yp){
  result <- sum(x < xp & y > yp) / sum(y > yp)
  return(result)
}
```
- P(X>x | Y>y): the probability that lot area exceeds its 3rd quartile, given that sale price exceeds its median.
```{r}
prob_a_value <- prob_a_func(variable_x, variable_y, x_prob_quartile, y_prob_quartile)
prob_a_value
```
- P(X>x, Y>y): the joint probability that lot area exceeds its 3rd quartile and sale price exceeds its median.
```{r}
prob_b_value <- prob_b_func(variable_x, variable_y, x_prob_quartile, y_prob_quartile)
prob_b_value
```
- P(X<x | Y>y): the probability that lot area falls below its 3rd quartile, given that sale price exceeds its median.
```{r}
prob_c_value <- prob_c_func(variable_x, variable_y, x_prob_quartile, y_prob_quartile)
prob_c_value
```
Does splitting the training data in this fashion make them independent? Let A be the new variable counting those observations above the 3rd quartile for X, and let B be the new variable counting those observations above the 2nd quartile for Y. Does P(A|B)=P(A)P(B)? Check mathematically, and then evaluate by running a Chi Square test for association.
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \qquad A, B \text{ independent} \iff P(A \cap B) = P(A)\,P(B) \iff P(A \mid B) = P(A)$$
*So the identity to check is P(A∩B) = P(A)P(B), or equivalently P(A|B) = P(A); the stated P(A|B) = P(A)P(B) would hold only in the degenerate case P(B) = 1. Whether independence holds for this split is an empirical question.*
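As a direct numerical check (a minimal sketch reusing the quartile cut-offs defined above), the estimated joint probability can be compared with the product of the marginals; under independence the two should be approximately equal:
```{r independence_check, message=FALSE, warning=FALSE}
# compare P(A and B) with P(A) * P(B) using the quartile indicators
n <- length(variable_x)
p_A <- sum(variable_x > x_prob_quartile) / n   # P(A) = P(X > x)
p_B <- sum(variable_y > y_prob_quartile) / n   # P(B) = P(Y > y)
p_AB <- sum(variable_x > x_prob_quartile & variable_y > y_prob_quartile) / n
c(joint = p_AB, product = p_A * p_B)
```
The Chi-Square test then evaluates the association formally, with hypotheses: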
$$H_0: \text{Sale Price and Lot Size are independent} \\ H_1: \text{Sale Price and Lot Size are dependent}$$
```{r chi_test_comparison, message=FALSE, warning=FALSE}
# test association between the quartile indicators A and B
chi_result <- chisq.test(table(variable_x > x_prob_quartile, variable_y > y_prob_quartile))
chi_result
```
*At the conventional 0.05 significance level, the p-value is far smaller, so the null hypothesis of independence is rejected: splitting the data at these quartiles does not make Lot Size and Sale Price independent.*
#### Descriptive and Inferential Statistics
Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatter plot of X and Y. Provide a 95% CI for the difference in the mean of the variables. Derive a correlation matrix for two of the quantitative variables you selected. Test the hypothesis that the correlation between these variables is 0 and provide a 99% confidence interval. Discuss the meaning of your analysis.
```{r scatter_plot_xy, message=FALSE, warning=FALSE}
ggplot(training_data, aes(x = variable_x, y = variable_y, color = variable_y)) +
geom_point() +
geom_smooth(method=lm) +
theme_minimal() +
xlab("Lot size in square feet ") +
ylab("Sale Price") +
ggtitle("Scatter plot between Lot size in square feet & Sale Price")
```
*As shown in the plot, the fitted line indicates a weak positive relationship between lot size and sale price rather than independence; a handful of extreme lot-size outliers weaken the apparent linear association.*
```{r infer_ci, message=FALSE, warning=FALSE}
# inference on variable X based on the confidence level of 95%
MeanDiffCI(variable_x, variable_y, conf.level = 0.95)
```
*Given the 95% confidence interval for the difference in the means (X − Y), the lower bound is -174515 and the upper bound is -166294. The interval is entirely negative simply because the mean lot area (roughly 10,500 square feet) is far smaller than the mean sale price (roughly $181,000).*
```{r corr_matrix_map, message=FALSE, warning=FALSE}
# new dataset with variable x and y
corr_var <- cbind(variable_x, variable_y) %>%
as.matrix()
# create a cormat and melted cormat
housing_cormat <- round(cor(corr_var), 2)
melted_housing_cormat <- melt(housing_cormat)
# create a heat map
ggplot(data = melted_housing_cormat, aes(x=Var1, y=Var2, fill=value)) +
geom_tile(color = "white",
lwd = 1.5,
linetype = 1) +
geom_text(aes(label = round(value, 2)), color = "white", size = 4) +
theme_minimal()
```
*The heat map shows a correlation of 0.26 between the two variables, a weak positive association rather than independence. To test formally whether the population correlation is zero, consider the hypotheses:*
$$H_0: \rho_{\text{LotArea, SalePrice}} = 0 \\ H_1: \rho_{\text{LotArea, SalePrice}} \neq 0$$
```{r}
corr_test <- cor.test(training_data$LotArea, training_data$SalePrice, method = "pearson", conf.level = 0.99)
corr_test
```
*Using the Pearson method to measure the degree of linear relationship between LotArea and SalePrice, the p-value is less than 2.2e-16, effectively zero and far below the 0.01 significance level implied by the 99% confidence interval. The null hypothesis of zero correlation is therefore rejected: LotArea and SalePrice are significantly, though modestly, correlated, with a correlation coefficient of 0.264.*
#### Linear Algebra and Correlation
Invert your correlation matrix. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct principal components analysis (research this!) and interpret. Discuss.
```{r precision_matrix, message=FALSE, warning=FALSE}
# inverted correlation matrix
precision_corr <- solve(housing_cormat)
melted_precision_cormat <- melt(precision_corr)
# create a heat map
ggplot(data = melted_precision_cormat, aes(x=Var1, y=Var2, fill=value)) +
geom_tile(color = "white",
lwd = 1.5,
linetype = 1) +
geom_text(aes(label = round(value, 2)), color = "white", size = 4) +
theme_minimal()
```
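For a 2×2 correlation matrix, the diagonal of the precision matrix has the closed form $\frac{1}{1 - r^2}$, which is exactly the variance inflation factor. A quick check (a sketch reusing `housing_cormat` and `precision_corr` from above):
```{r vif_check, message=FALSE, warning=FALSE}
# the diagonal of a 2x2 precision matrix equals 1 / (1 - r^2), i.e., the VIF
r <- housing_cormat[1, 2]
c(diagonal = precision_corr[1, 1], closed_form = 1 / (1 - r^2))
```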
```{r multiple1, message=FALSE, warning=FALSE}
# multiplication of correlation matrix and precision matrix
corr_prec1_matrix <- housing_cormat %*% precision_corr
melted_corr_prec1 <- melt(corr_prec1_matrix)
# create a heat map
ggplot(data = melted_corr_prec1, aes(x=Var1, y=Var2, fill=value)) +
geom_tile(color = "white",
lwd = 1.5,
linetype = 1) +
geom_text(aes(label = round(value, 3)), color = "white", size = 4) +
theme_minimal()
```
```{r multiple2, message=FALSE, warning=FALSE}
# multiplication of precision matrix and correlation matrix
corr_prec2_matrix <- precision_corr %*% housing_cormat
melted_corr_prec2 <- melt(corr_prec2_matrix)
# create a heat map
ggplot(data = melted_corr_prec2, aes(x=Var1, y=Var2, fill=value)) +
geom_tile(color = "white",
lwd = 1.5,
linetype = 1) +
geom_text(aes(label = round(value, 3)), color = "white", size = 4) +
theme_minimal()
```
*Both heat maps show the 2×2 identity matrix (up to rounding), as expected: a matrix multiplied by its inverse yields the identity on either side. Since the two variables are only weakly correlated, principal components analysis (PCA) is used to examine how the variability splits across orthogonal directions. PCA supports predictive modeling and dimensionality reduction by projecting each data point onto the few principal components that capture most of the variance. [R-blogger Link](https://www.r-bloggers.com/2021/05/principal-component-analysis-pca-in-r/) is used as a reference.*
```{r pca_analysis, message=FALSE, warning=FALSE}
# principle components analysis
pca_var <- prcomp(corr_var, center = TRUE, scale. = TRUE)
summary(pca_var)
```
*The PCA summary shows that PC1 captures 63% of the variability, more than half, while PC2 captures the remaining 37%.*
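With only two standardized variables, these proportions have a closed form: the eigenvalues of the 2×2 correlation matrix are $1 + r$ and $1 - r$, so the variance shares are $(1 \pm r)/2$. A quick check (a sketch assuming the same two variables):
```{r pca_eigen_check, message=FALSE, warning=FALSE}
# for two standardized variables, the PC variance shares are (1 +- r) / 2
r <- cor(variable_x, variable_y)
c(PC1 = (1 + r) / 2, PC2 = (1 - r) / 2)
```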
```{r pca_biplot1, message=FALSE, warning=FALSE}
pca_plot1 <- ggscreeplot(pca_var) +
theme_minimal()
pca_plot1
```
*The scree plot shows that PC1 alone accounts for most of the variance.*
```{r pca_biplot2, message=FALSE, warning=FALSE}
pca_plot2 <- ggbiplot(pca_var, obs.scale = 1, var.scale = 1,
var.factor = 1.2,
ellipse = TRUE, ellipse.level = 0.5, ellipse.alpha = 0.1,
circle = TRUE, varname.size = 3, varname.color = "lightblue") +
theme(legend.direction = 'horizontal', legend.position = 'top') +
theme_minimal()
pca_plot2
```
*The PCA biplot shows how variables X and Y load on the two principal components. The arrows sit fairly far apart, closer to orthogonal than aligned, which is consistent with the weak correlation found earlier.*
#### Calculus-Based Probability & Statistics
Many times, it makes sense to fit a closed form distribution to data. For your variable that is skewed to the right, shift it so that the minimum value is above zero. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See [MASS](https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html)). Find the optimal value for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
```{r exp_fit, message=FALSE, warning=FALSE}
# LotArea's minimum (1,300 sq ft) is already above zero, so no shift is needed
# run fitdistr and extract the lambda (rate) estimate
exp_variable_x <- fitdistr(variable_x, "exponential")
lambda_x <- exp_variable_x$estimate
# 1000 samples from the exponential distribution
sample_size <- 1000
sample_x <- rexp(sample_size, rate = lambda_x) %>%
as.data.frame()
colnames(sample_x) <- c("sample_value")
# exponential histogram
exp_variable_x_histogram <- ggplot(sample_x,aes(x = sample_value)) +
geom_histogram(colour = 4, fill = "white") +
theme_minimal()
# compare the original distribution with the simulated exponential draws
# (par(mfrow) has no effect on ggplot objects, so the two plots print in sequence)
variable_x_histogram + ggtitle("Original Distribution")
exp_variable_x_histogram + ggtitle("Exponential Distribution")
```
```{r cdf_quantile, message=FALSE, warning=FALSE}
# empirical 5th and 95th percentiles of the 1,000 simulated draws
cdf_quantile <- quantile(sample_x$sample_value, probs = c(0.05, 0.95))
cdf_quantile
```
*From the simulated exponential draws, the 5th and 95th percentiles are 445.33 and 31512.02 respectively (these values vary from run to run, since no random seed is set).*
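Because the quantiles above come from random draws, they only approximate the fitted distribution. The exact percentiles can be read directly off the exponential CDF, $F(q) = 1 - e^{-\lambda q}$, so $q_p = -\ln(1 - p)/\lambda$; a sketch using the fitted `lambda_x`:
```{r theoretical_percentiles, message=FALSE, warning=FALSE}
# closed-form 5th and 95th percentiles of the fitted exponential distribution
qexp(c(0.05, 0.95), rate = lambda_x)
```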
```{r empirical_ci, message=FALSE, warning=FALSE}
# 95% CI for the mean based on the empirical data, assuming normality:
# xbar +/- z_{0.975} * (s / sqrt(n)), with z_{0.975} ~= 1.96
mean_var_x <- mean(variable_x)
std_var_x <- sd(variable_x)
size_var_x <- length(variable_x)
z <- qnorm(1 - 0.05/2)
# identify the upper and lower bound
lower_ci_var_x <- mean_var_x - z * (std_var_x / sqrt(size_var_x))
upper_ci_var_x <- mean_var_x + z * (std_var_x / sqrt(size_var_x))
# quantile on variable x
empirical_quantile <- quantile(variable_x, probs=c(0.05, 0.95))
round(lower_ci_var_x)
round(upper_ci_var_x)
empirical_quantile
```
*Based on the empirical data, the lower and upper CI bounds for the mean are 10005 and 11029 respectively. The empirical 5th and 95th percentiles are 3311.7 and 17401.15, much narrower than the exponential model's (roughly 445 to 31512): the exponential distribution puts far too much mass near zero and in the extreme right tail, so it is a poor fit for LotArea.*
#### Modeling
Build some type of regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
```{r linear_regression, message=FALSE, warning=FALSE}
# simple linear regression of sale price (Y) on lot area (X)
training_lm <- lm(variable_y ~ variable_x, data = training_data)
summary(training_lm)
```
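As a sanity check on this simple regression, $R^2$ should equal the squared Pearson correlation computed earlier ($0.264^2 \approx 0.07$), meaning lot area alone explains only about 7% of the variation in sale price:
```{r r_squared_check, message=FALSE, warning=FALSE}
# in simple linear regression, R^2 equals the squared correlation of X and Y
c(r_squared = summary(training_lm)$r.squared,
  cor_squared = cor(variable_x, variable_y)^2)
```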
##### Regression Model
```{r lm_plots, message=FALSE, warning=FALSE}
# histogram of the model residuals (avoid $ inside aes())
lm_histogram <- ggplot(data.frame(residuals = resid(training_lm)), aes(x = residuals)) +
geom_histogram(colour = 4, fill = "white") +
xlab("Residual on Linear Regression") +
theme_minimal()
par(mfrow=c(2,2))
plot(training_lm)
lm_histogram
```
```{r num_train_table, message=FALSE, warning=FALSE}
# pull in all columns with numeric values
training_data_num <- training_data %>%
dplyr::select(where(is.numeric), -Id)
# summary
summary(training_data_num)
```
*Pulling in every numeric column gives a broader view of the training set through appropriate plots. Here, box plots and histograms are the most practical way to survey all of these variables.*
##### Box Plot Part 1
```{r var_boxplot, message=FALSE, warning=FALSE}
# boxplot
par(mfrow=c(2,3))
boxplot(training_data_num$MSSubClass, xlab = "The building class")
boxplot(training_data_num$LotFrontage, xlab = "Linear feet of street connected to property")
boxplot(training_data_num$LotArea, xlab = "Lot size in square feet")
boxplot(training_data_num$OverallQual, xlab = "Overall material and finish quality")
boxplot(training_data_num$OverallCond, xlab = "Overall condition rating")
boxplot(training_data_num$MasVnrArea, xlab = "Masonry veneer area in square feet")
```
##### Box Plot Part 2
```{r var_boxplot2, message=FALSE, warning=FALSE}
# box plot part 2
par(mfrow=c(2,3))
boxplot(training_data_num$BsmtFinSF1, xlab = "Type 1 finished square feet")
boxplot(training_data_num$BsmtFinSF2, xlab = "Type 2 finished square feet")
boxplot(training_data_num$BsmtUnfSF, xlab = "Unfinished square feet of basement area")
boxplot(training_data_num$TotalBsmtSF, xlab = "Total square feet of basement area")
boxplot(training_data_num$X1stFlrSF, xlab = "First Floor square feet")
boxplot(training_data_num$X2ndFlrSF, xlab = "Second Floor square feet")
```
##### Box Plot Part 3
```{r var_boxplot3, message=FALSE, warning=FALSE}
# box plot part 3
par(mfrow=c(2,3))
boxplot(training_data_num$LowQualFinSF, xlab = "Low quality finished square feet (all floors)")
boxplot(training_data_num$GrLivArea, xlab = "Above grade (ground) living area square feet")
boxplot(training_data_num$BsmtFullBath, xlab = "Basement full bathrooms")
boxplot(training_data_num$BsmtHalfBath, xlab = "Basement half bathrooms")
boxplot(training_data_num$FullBath, xlab = "Full bathrooms above grade")
boxplot(training_data_num$HalfBath, xlab = "Half baths above grade")
```
##### Box Plot Part 4
```{r var_boxplot4, message=FALSE, warning=FALSE}
# boxplot part 4
par(mfrow=c(2,3))
boxplot(training_data_num$BedroomAbvGr, xlab = "Number of bedrooms above basement level")
boxplot(training_data_num$KitchenAbvGr, xlab = "Number of kitchens")
boxplot(training_data_num$TotRmsAbvGrd, xlab = "Total rooms above grade (does not include bathrooms)")
boxplot(training_data_num$Fireplaces, xlab = "Number of fireplaces")
boxplot(training_data_num$GarageCars, xlab = "Size of garage in car capacity")
boxplot(training_data_num$GarageArea, xlab = "Size of garage in square feet")
```
##### Box Plot Part 5
```{r var_boxplot5, message=FALSE, warning=FALSE}
# boxplot part 5
par(mfrow=c(2,3))
boxplot(training_data_num$WoodDeckSF, xlab = "Wood deck area in square feet")
boxplot(training_data_num$OpenPorchSF, xlab = "Open porch area in square feet")
boxplot(training_data_num$EnclosedPorch, xlab = "Enclosed porch area in square feet")
boxplot(training_data_num$X3SsnPorch, xlab = "Three season porch area in square feet")
boxplot(training_data_num$ScreenPorch, xlab = "Screen porch area in square feet")
boxplot(training_data_num$PoolArea, xlab = "Pool area in square feet")
```
##### Box Plot Part 6
```{r var_boxplot6, message=FALSE, warning=FALSE}
# boxplot part 6
par(mfrow=c(1,2))
boxplot(training_data_num$MiscVal, xlab = "$ Value of miscellaneous feature")
boxplot(training_data_num$SalePrice, xlab = "Sale Price")
```
##### Histogram
```{r var_histogram, message=FALSE, warning=FALSE}
# histogram
par(mfrow=c(2,3))
hist(training_data_num$YearBuilt, border=F , col=rgb(0.2,0.2,0.8,0.7), xlab = "Original construction date")
hist(training_data_num$YearRemodAdd, border=F , col=rgb(0.2,0.2,0.8,0.7), xlab = "Remodel date")
hist(training_data_num$GarageYrBlt, border=F , col=rgb(0.2,0.2,0.8,0.7), xlab = "Year garage was built")
hist(training_data_num$MoSold, border=F , col=rgb(0.2,0.2,0.8,0.7), xlab = "Month Sold")
hist(training_data_num$YrSold, border=F , col=rgb(0.2,0.2,0.8,0.7), xlab = "Year Sold")
```
##### Multiple Linear Regression
```{r multi_lm, message=FALSE, warning=FALSE}
# replace NA values in the numeric training data with the column mean
training_data_revised <- training_data_num %>%
  mutate(across(everything(), ~ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x)))
# multiple linear regression with numeric values
training_lm2 <- lm(SalePrice ~ LotFrontage + OverallQual + OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF + BsmtFullBath + BsmtHalfBath + FullBath + HalfBath + BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd + Fireplaces + GarageYrBlt + GarageCars + GarageArea + WoodDeckSF + OpenPorchSF + EnclosedPorch + ScreenPorch + PoolArea + MiscVal + MoSold + YrSold + TotalBsmtSF + GrLivArea , data = training_data_revised)
# summary
summary(training_lm2)
```
*The multiple regression model accounts for an estimated 80% of the variance. Some coefficients are returned as NA because their predictors are exact linear combinations of other columns (for example, TotalBsmtSF is the sum of the basement square-footage fields), and several others have p-values above the 0.05 threshold; those variables are removed and the model is retrained.*
```{r}
# multiple linear regression with revised variables
training_lm3 <- lm(SalePrice ~ OverallQual + OverallCond + YearBuilt + MasVnrArea + BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF + X1stFlrSF + X2ndFlrSF + BsmtFullBath + BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd + Fireplaces + GarageYrBlt + GarageCars + WoodDeckSF + ScreenPorch, data = training_data_revised)
# summary
summary(training_lm3)
```
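Manual pruning by p-value is one approach; an alternative worth noting (shown here as an unevaluated sketch, since it would search over all terms of `training_lm2`) is stepwise selection by AIC with `stepAIC` from the already-loaded MASS package:
```{r stepwise_sketch, eval=FALSE}
# alternative to manual pruning: stepwise selection by AIC
training_lm_step <- stepAIC(training_lm2, direction = "both", trace = FALSE)
summary(training_lm_step)
```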
##### Multiple Linear Regression Plots
```{r multi_lm_plot, message=FALSE, warning=FALSE}
par(mfrow=c(2,2))
plot(training_lm3)
```
*With the refined model, the test data is used to predict Sale Price. The test set first receives the same wrangling: only numeric columns are kept, and NA values are replaced with the column mean.*
```{r test_data, message=FALSE, warning=FALSE}
# data wrangling on test data: keep numeric columns first, then impute NAs with the column mean
test_data_revised <- test_data %>%
  dplyr::select(where(is.numeric)) %>%
  mutate(across(everything(), ~ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x)))
```
##### Sale Price Predictions and write the CSV file
```{r prediction_write_csv, message=FALSE, warning=FALSE}
# sale price prediction
kaggle_submission <- test_data_revised %>%
  mutate(predict_value = predict(training_lm3, newdata = test_data_revised)) %>%
  dplyr::select(Id, predict_value)
summary(kaggle_submission)
```
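As a rough internal check (a sketch; note that Kaggle's leaderboard scores RMSLE on the held-out test set, so this in-sample figure is not directly comparable to the public score), the training-set RMSE of the final model follows from its residuals:
```{r train_rmse, message=FALSE, warning=FALSE}
# in-sample root mean squared error of the final model
sqrt(mean(resid(training_lm3)^2))
```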
```{r csv_export, warning=FALSE, message=FALSE}
# export csv
kaggle_submission <- rename(kaggle_submission, SalePrice=predict_value)
write.csv(kaggle_submission, "Resources/kaggle_submission.csv", row.names = FALSE)
```
##### Result
*My Kaggle username is eddiexunyc and the lowest public score I was able to get is 0.33918.*
![](Kaggle Score.PNG)