diff --git a/_quarto.yml b/_quarto.yml index 2fff5cf3..4d6c32c6 100644 --- a/_quarto.yml +++ b/_quarto.yml @@ -6,6 +6,8 @@ project: - "!quizzes/" - "!assignments/" - "next-time*" + - "week5/tutorial*" + - "!data/ACIC23-competition/" output-dir: docs resources: - "assets/*" diff --git a/docs/images/p_forest_detourr.png b/docs/images/p_forest_detourr.png new file mode 100644 index 00000000..5d608b54 Binary files /dev/null and b/docs/images/p_forest_detourr.png differ diff --git a/docs/search.json b/docs/search.json index 162c5aa3..66cab839 100644 --- a/docs/search.json +++ b/docs/search.json @@ -35,914 +35,1082 @@ "text": "Assignments\n\nAssignment 3 is due on Friday 26 April." }, { - "objectID": "week6/index.html", - "href": "week6/index.html", - "title": "Week 6: Neural networks and deep learning", + "objectID": "week6/tutorial.html", + "href": "week6/tutorial.html", + "title": "ETC3250/5250 Tutorial 6", "section": "", - "text": "ISLR 10.1-10.3, 10.7" + "text": "Load the libraries and avoid conflicts\n# Load libraries used everywhere\nlibrary(tidyverse)\nlibrary(tidymodels)\nlibrary(patchwork)\nlibrary(mulgar)\nlibrary(palmerpenguins)\nlibrary(GGally)\nlibrary(tourr)\nlibrary(MASS)\nlibrary(discrim)\nlibrary(classifly)\nlibrary(detourr)\nlibrary(crosstalk)\nlibrary(plotly)\nlibrary(viridis)\nlibrary(colorspace)\nlibrary(randomForest)\nlibrary(geozoo)\nlibrary(ggbeeswarm)\nlibrary(conflicted)\nconflicts_prefer(dplyr::filter)\nconflicts_prefer(dplyr::select)\nconflicts_prefer(dplyr::slice)\nconflicts_prefer(palmerpenguins::penguins)\nconflicts_prefer(viridis::viridis_pal)\n\noptions(digits=2)\np_tidy <- penguins |>\n select(species, bill_length_mm:body_mass_g) |>\n rename(bl=bill_length_mm,\n bd=bill_depth_mm,\n fl=flipper_length_mm,\n bm=body_mass_g) |>\n filter(!is.na(bl)) |>\n arrange(species) |>\n na.omit()\np_tidy_std <- p_tidy |>\n mutate_if(is.numeric, function(x) (x-mean(x))/sd(x))" }, { - "objectID": "week6/index.html#main-reference", - "href": "week6/index.html#main-reference", - "title": "Week 6: Neural networks and deep learning", - "section": "", - "text": "ISLR 10.1-10.3, 10.7" + "objectID": "week6/tutorial.html#objectives", + "href": "week6/tutorial.html#objectives", + "title": "ETC3250/5250 Tutorial 6", + "section": "๐ŸŽฏ Objectives", + "text": "๐ŸŽฏ Objectives\nThe goal for this week is learn to fit, diagnose, assess assumptions, and predict from classification tree and random forest models." }, { - "objectID": "week6/index.html#what-you-will-learn-this-week", - "href": "week6/index.html#what-you-will-learn-this-week", - "title": "Week 6: Neural networks and deep learning", - "section": "What you will learn this week", - "text": "What you will learn this week\n\nStructure of a neural network\nFitting neural networks\nDiagnosing the fit" + "objectID": "week6/tutorial.html#preparation", + "href": "week6/tutorial.html#preparation", + "title": "ETC3250/5250 Tutorial 6", + "section": "๐Ÿ”ง Preparation", + "text": "๐Ÿ”ง Preparation\n\nMake sure you have all the necessary libraries installed. There are a few new ones this week!" }, { - "objectID": "week6/index.html#assignments", - "href": "week6/index.html#assignments", - "title": "Week 6: Neural networks and deep learning", - "section": "Assignments", - "text": "Assignments\n\nAssignment 2 is due on Friday 12 April.\nAssignment 3 is due on Friday 26 April." 
+ "objectID": "week6/tutorial.html#exercises", + "href": "week6/tutorial.html#exercises", + "title": "ETC3250/5250 Tutorial 6", + "section": "Exercises:", + "text": "Exercises:\nOpen your project for this unit called iml.Rproj. For all the work we will use the penguins data. Start with splitting it into a training and test set, as follows.\n\nset.seed(1156)\np_sub <- p_tidy_std |>\n filter(species != \"Gentoo\") |>\n mutate(species = factor(species)) |>\n select(species, bl, bm)\np_split <- initial_split(p_sub, 2/3, strata = species)\np_tr <- training(p_split)\np_ts <- testing(p_split)\n\n\n1. Becoming a car mechanic - looking under the hood at the tree algoriithm\n\nWrite down the equation for the Gini measure of impurity, for two groups, and the parameter \\(p\\) which is the proportion of observations in class 1. Specify the domain of the function, and determine the value of \\(p\\) which gives the maximum value, and report what that maximum function value is.\n\n\nFor two groups, how would the impurity of a split be measured? Give the equation.\n\n\nBelow is an R function to compute the Gini impurity for a particular split on a single variable. Work through the code of the function, and document what each step does. Make sure to include a not on what the minsplit parameter, does to prevent splitting on the edges fewer than the specified number of observations.\n\n\n# This works for two classes, and one variable\nmygini <- function(p) {\n g <- 0\n if (p>0 && p<1) {\n g <- 2*p*(1-p)\n }\n\n return(g)\n}\n\nmysplit <- function(x, spl, cl, minsplit=5) {\n # Assumes x is sorted\n # Count number of observations\n n <- length(x)\n \n # Check number of classes\n cl_unique <- unique(cl)\n \n # Split into two subsets on the given value\n left <- x[x<spl]\n cl_left <- cl[x<spl]\n n_l <- length(left)\n\n right <- x[x>=spl]\n cl_right <- cl[x>=spl]\n n_r <- length(right)\n \n # Don't calculate is either set is less than minsplit\n if ((n_l < minsplit) | (n_r < minsplit)) \n impurity = NA\n else {\n # Compute the Gini value for the split\n p_l <- length(cl_left[cl_left == cl_unique[1]])/n_l\n p_r <- length(cl_right[cl_right == cl_unique[1]])/n_r\n if (is.na(p_l)) p_l<-0.5\n if (is.na(p_r)) p_r<-0.5\n impurity <- (n_l/n)*mygini(p_l) + (n_r/n)*mygini(p_r)\n }\n return(impurity)\n}\n\n\nApply the function to compute the value for all possible splits for the body mass (bm), setting minsplit to be 1, so that all possible splits will be evaluated. Make a plot of these values vs the variable.\n\n\nUse your function to compute the first two steps of a classification tree model for separating Adelie from Chinstrap penguins, after setting minsplit to be 5. Make a scatterplot of the two variables that would be used in the splits, with points coloured by species, and the splits as line segments." 
}, { - "objectID": "week5/slides.html#overview", - "href": "week5/slides.html#overview", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Overview", - "text": "Overview\nWe will cover:\n\nClassification trees, algorithm, stopping rules\nDifference between algorithm and parametric methods, especially trees vs LDA\nForests: ensembles of bagged trees\nDiagnostics: vote matrix, variable importance, proximity\nBoosted trees" + "objectID": "week6/tutorial.html#digging-deeper-into-diagnosing-an-error", + "href": "week6/tutorial.html#digging-deeper-into-diagnosing-an-error", + "title": "ETC3250/5250 Tutorial 6", + "section": "Digging deeper into diagnosing an error", + "text": "Digging deeper into diagnosing an error\n\nFit the random forest model to the full penguins data.\n\n\nReport the confusion matrix.\n\n\nUse linked brushing to learn which was the Gentoo penguin that the model was confused about. When we looked at the data in a tour, there was one Gentoo penguin that was an outlier, appearing to be away from the other Gentoos and closer to the Chinstrap group. We would expect this to be the penguin that the forest model is confused about. Is it?\n\n\nHave a look at the other misclassifications, to understand whether they are ones weโ€™d expect to misclassify, or whether the model is not well constructed.\n\np_cl <- p_tr2 |>\n mutate(pspecies = p_fit_rf$fit$predicted) |>\n dplyr::select(bl:bm, species, pspecies) |>\n mutate(sp_jit = jitter(as.numeric(species)),\n psp_jit = jitter(as.numeric(pspecies)))\np_cl_shared <- SharedData$new(p_cl)\n\ndetour_plot <- detour(p_cl_shared, tour_aes(\n projection = bl:bm,\n colour = species)) |>\n tour_path(grand_tour(2),\n max_bases=50, fps = 60) |>\n show_scatter(alpha = 0.9, axes = FALSE,\n width = \"100%\", height = \"450px\")\n\nconf_mat <- plot_ly(p_cl_shared,\n x = ~psp_jit,\n y = ~sp_jit,\n color = ~species,\n colors = viridis_pal(option = \"D\")(3),\n height = 450) |>\n highlight(on = \"plotly_selected\",\n off = \"plotly_doubleclick\") |>\n add_trace(type = \"scatter\",\n mode = \"markers\")\n\nbscols(\n detour_plot, conf_mat,\n widths = c(5, 6)\n)" }, { - "objectID": "week5/slides.html#trees", - "href": "week5/slides.html#trees", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Trees", - "text": "Trees\nNice explanation of trees, forests, boosted trees" + "objectID": "week6/tutorial.html#deciding-on-variables-in-a-large-data-problem", + "href": "week6/tutorial.html#deciding-on-variables-in-a-large-data-problem", + "title": "ETC3250/5250 Tutorial 6", + "section": "Deciding on variables in a large data problem", + "text": "Deciding on variables in a large data problem\n\nFit a random forest to the bushfire data. You can read more about the bushfire data at https://dicook.github.io/mulgar_book/A2-data.html. Examine the votes matrix using a tour. 
What do you learn about the confusion between fire causes?\n\nThis code might help:\n\ndata(bushfires)\n\nbushfires_sub <- bushfires[,c(5, 8:45, 48:55, 57:60)] |>\n mutate(cause = factor(cause))\n\nset.seed(1239)\nbf_split <- initial_split(bushfires_sub, 3/4, strata=cause)\nbf_tr <- training(bf_split)\nbf_ts <- testing(bf_split)\n\nrf_spec <- rand_forest(mtry=5, trees=1000) |>\n set_mode(\"classification\") |>\n set_engine(\"ranger\", probability = TRUE, \n importance=\"permutation\")\nbf_fit_rf <- rf_spec |> \n fit(cause~., data = bf_tr)\n\n# Create votes matrix data\nbf_rf_votes <- bf_fit_rf$fit$predictions |>\n as_tibble() |>\n mutate(cause = bf_tr$cause)\n\n# Project 4D into 3D\nproj <- t(geozoo::f_helmert(4)[-1,])\nbf_rf_v_p <- as.matrix(bf_rf_votes[,1:4]) %*% proj\ncolnames(bf_rf_v_p) <- c(\"x1\", \"x2\", \"x3\")\nbf_rf_v_p <- bf_rf_v_p |>\n as.data.frame() |>\n mutate(cause = bf_tr$cause)\n \n# Add simplex\nsimp <- simplex(p=3)\nsp <- data.frame(simp$points)\ncolnames(sp) <- c(\"x1\", \"x2\", \"x3\")\nsp$cause = \"\"\nbf_rf_v_p_s <- bind_rows(sp, bf_rf_v_p) |>\n mutate(cause = factor(cause))\nlabels <- c(\"accident\" , \"arson\", \n \"burning_off\", \"lightning\", \n rep(\"\", nrow(bf_rf_v_p)))\n\n\n# Examine votes matrix with bounding simplex\nanimate_xy(bf_rf_v_p_s[,1:3], col = bf_rf_v_p_s$cause, \n axes = \"off\", half_range = 1.3,\n edges = as.matrix(simp$edges),\n obs_labels = labels)\n\n\nCheck the variable importance. Plot the most important variables.\n\nThis code might help:\n\nbf_fit_rf$fit$variable.importance |> \n as_tibble() |> \n rename(imp=value) |>\n mutate(var = colnames(bf_tr)[1:50]) |>\n select(var, imp) |>\n arrange(desc(imp)) |> \n print(n=50)" }, { - "objectID": "week5/slides.html#algorithm-growing-a-tree", - "href": "week5/slides.html#algorithm-growing-a-tree", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Algorithm: growing a tree", - "text": "Algorithm: growing a tree\n\n\n\nAll observations in a single set\nSort values on first variable\nCompute the chosen split criteria for all possible splits into two sets\nChoose the best split on this variable. Save this info.\nRepeat 2-4 for all other variables\nChoose the best variable to split on, based on the best split. Your data is now in two sets.\nRepeat 1-6 on each subset.\nStop when stopping rule that decides that the best classification model is achieved.\n\n\n\nPros and cons:\n\nTrees are a very flexible way to fit a classifier.\nThey can\n\nutilise different types of predictor variables\nignore missing values\nhandle different units or scales on variables\ncapture intricate patterns\n\nHowever, they operate on a per variable basis, and do not effectively model separation when a combination of variables is needed." + "objectID": "week6/tutorial.html#can-boosting-better-detect-bushfire-case", + "href": "week6/tutorial.html#can-boosting-better-detect-bushfire-case", + "title": "ETC3250/5250 Tutorial 6", + "section": "Can boosting better detect bushfire case?", + "text": "Can boosting better detect bushfire case?\nFit a boosted tree model using xgboost to the bushfires data. You can use the code below. 
Compute the confusion tables and the balanced accuracy for the test data for both the forest model and the boosted tree model, to make the comparison.\n\nset.seed(121)\nbf_spec2 <- boost_tree() |>\n set_mode(\"classification\") |>\n set_engine(\"xgboost\")\nbf_fit_bt <- bf_spec2 |> \n fit(cause~., data = bf_tr)" }, { - "objectID": "week5/slides.html#common-split-criteria", - "href": "week5/slides.html#common-split-criteria", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Common split criteria", - "text": "Common split criteria\n\n\nClassification\n\nThe Gini index measures is defined as: \\[G = \\sum_{k =1}^K \\widehat{p}_{mk}(1 - \\widehat{p}_{mk})\\]\nEntropy is defined as \\[D = - \\sum_{k =1}^K \\widehat{p}_{mk} log(\\widehat{p}_{mk})\\] What corresponds to a high value, and what corresponds to a low value?\n\n\nRegression\nDefine\n\\[\\mbox{MSE} = \\frac{1}{n}\\sum_{i=1}^{n} (y_i - \\widehat{y}_i)^2\\]\nSplit the data where combining MSE for left bucket (MSE_L) and right bucket (MSE_R), makes the biggest reduction from the overall MSE." + "objectID": "week6/tutorial.html#finishing-up", + "href": "week6/tutorial.html#finishing-up", + "title": "ETC3250/5250 Tutorial 6", + "section": "๐Ÿ‘‹ Finishing up", + "text": "๐Ÿ‘‹ Finishing up\nMake sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult." }, { - "objectID": "week5/slides.html#illustration-12", - "href": "week5/slides.html#illustration-12", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Illustration (1/2)", - "text": "Illustration (1/2)\n\n\n\n\n\n\n\nx\ncl\n\n\n\n\n11\nA\n\n\n33\nA\n\n\n39\nB\n\n\n44\nA\n\n\n50\nA\n\n\n56\nB\n\n\n70\nB\n\n\n\n\n\n\n\nNote: x is sorted from lowest to highest!\n\n\nAll possible splits shown by vertical lines\n\n\n\n\n\n\n\n\n\nWhat do you think is the best split? 2, 3 or 5??" 
+ "objectID": "week6/tutorialsol.html", + "href": "week6/tutorialsol.html", + "title": "ETC3250/5250 Tutorial 6", + "section": "", + "text": "Load the libraries and avoid conflicts\n# Load libraries used everywhere\nlibrary(tidyverse)\nlibrary(tidymodels)\nlibrary(patchwork)\nlibrary(mulgar)\nlibrary(palmerpenguins)\nlibrary(GGally)\nlibrary(tourr)\nlibrary(MASS)\nlibrary(discrim)\nlibrary(classifly)\nlibrary(detourr)\nlibrary(crosstalk)\nlibrary(plotly)\nlibrary(viridis)\nlibrary(colorspace)\nlibrary(randomForest)\nlibrary(geozoo)\nlibrary(ggbeeswarm)\nlibrary(conflicted)\nconflicts_prefer(dplyr::filter)\nconflicts_prefer(dplyr::select)\nconflicts_prefer(dplyr::slice)\nconflicts_prefer(palmerpenguins::penguins)\nconflicts_prefer(viridis::viridis_pal)\n\noptions(digits=2)\np_tidy <- penguins |>\n select(species, bill_length_mm:body_mass_g) |>\n rename(bl=bill_length_mm,\n bd=bill_depth_mm,\n fl=flipper_length_mm,\n bm=body_mass_g) |>\n filter(!is.na(bl)) |>\n arrange(species) |>\n na.omit()\np_tidy_std <- p_tidy |>\n mutate_if(is.numeric, function(x) (x-mean(x))/sd(x))" }, { - "objectID": "week5/slides.html#illustration-22", - "href": "week5/slides.html#illustration-22", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Illustration (2/2)", - "text": "Illustration (2/2)\n\n\nCalculate the impurity for split 5\nThe left bucket is\n\n\n\n\n\nx\ncl\n\n\n\n\n11\nA\n\n\n33\nA\n\n\n39\nB\n\n\n44\nA\n\n\n50\nA\n\n\n\n\n\n\n\nand the right bucket is\n\n\n\n\n\nx\ncl\n\n\n\n\n56\nB\n\n\n70\nB\n\n\n\n\n\n\n\n\nUsing Gini \\(G = \\sum_{k =1}^K \\widehat{p}_{mk}(1 - \\widehat{p}_{mk})\\)\nLeft bucket:\n\\[\\widehat{p}_{LA} = 4/5, \\widehat{p}_{LB} = 1/5, ~~ p_L = 5/7\\]\n\\[G_L=0.8(1-0.8)+0.2(1-0.2) = 0.32\\]\nRight bucket:\n\\[\\widehat{p}_{RA} = 0/2, \\widehat{p}_{RB} = 2/2, ~~ p_R = 2/7\\]\n\\[G_R=0(1-0)+1(1-1) = 0\\] Combine with weighted sum to get impurity for the split:\n\\[5/7G_L + 2/7G_R=0.32\\]\n Your turn: Compute the impurity for split 2." + "objectID": "week6/tutorialsol.html#objectives", + "href": "week6/tutorialsol.html#objectives", + "title": "ETC3250/5250 Tutorial 6", + "section": "๐ŸŽฏ Objectives", + "text": "๐ŸŽฏ Objectives\nThe goal for this week is learn to fit, diagnose, assess assumptions, and predict from classification tree and random forest models." }, { - "objectID": "week5/slides.html#section", - "href": "week5/slides.html#section", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "", - "text": "Splits on categorical variables\n\n\n\n\n\n\n\n\n\nPossible best split would be if koala then assign to Vic else assign to WA, because Vic has more koalas but and WA has more emus and roos.\n\nDealing with missing values on predictors\n\n\n\n\n\nx1\nx2\nx3\nx4\ny\n\n\n\n\n19\n-8\n22\n-24\nA\n\n\nNA\n-10\n26\n-26\nA\n\n\n15\nNA\n32\n-27\nB\n\n\n17\n-6\n27\n-25\nA\n\n\n18\n-5\nNA\n-23\nA\n\n\n13\n-3\n37\nNA\nB\n\n\n12\n-1\n35\n-30\nB\n\n\n11\n-7\n24\n-31\nB\n\n\n\n\n\n\n\n50% of cases have missing values. Trees ignore missings only on a single variable.\n\nEvery other method ignores a full observation if missing on any variable. That is, would only be able to use half the data." + "objectID": "week6/tutorialsol.html#preparation", + "href": "week6/tutorialsol.html#preparation", + "title": "ETC3250/5250 Tutorial 6", + "section": "๐Ÿ”ง Preparation", + "text": "๐Ÿ”ง Preparation\n\nMake sure you have all the necessary libraries installed. There are a few new ones this week!" 
}, { - "objectID": "week5/slides.html#example-penguins-13", - "href": "week5/slides.html#example-penguins-13", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Example: penguins 1/3", - "text": "Example: penguins 1/3\n\n\n\n\n\n\n\n\n\n\n\n\n\nset.seed(1156)\np_split <- initial_split(p_sub, 2/3, strata=species)\np_tr <- training(p_split)\np_ts <- testing(p_split)\n\ntree_spec <- decision_tree() |>\n set_mode(\"classification\") |>\n set_engine(\"rpart\")\n\np_fit_tree <- tree_spec |>\n fit(species~., data=p_tr)\n\np_fit_tree\n\nparsnip model object\n\nn= 145 \n\nnode), split, n, loss, yval, (yprob)\n * denotes terminal node\n\n1) root 145 45 Adelie (0.690 0.310) \n 2) bl< 43 99 2 Adelie (0.980 0.020) *\n 3) bl>=43 46 3 Chinstrap (0.065 0.935) *\n\n\n\n Can you draw the tree?" + "objectID": "week6/tutorialsol.html#exercises", + "href": "week6/tutorialsol.html#exercises", + "title": "ETC3250/5250 Tutorial 6", + "section": "Exercises:", + "text": "Exercises:\nOpen your project for this unit called iml.Rproj. For all the work we will use the penguins data. Start with splitting it into a training and test set, as follows.\n\nset.seed(1156)\np_sub <- p_tidy_std |>\n filter(species != \"Gentoo\") |>\n mutate(species = factor(species)) |>\n select(species, bl, bm)\np_split <- initial_split(p_sub, 2/3, strata = species)\np_tr <- training(p_split)\np_ts <- testing(p_split)\n\n\n1. Becoming a car mechanic - looking under the hood at the tree algoriithm\n\nWrite down the equation for the Gini measure of impurity, for two groups, and the parameter \\(p\\) which is the proportion of observations in class 1. Specify the domain of the function, and determine the value of \\(p\\) which gives the maximum value, and report what that maximum function value is.\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\\(G = p(1-p)\\) where \\(p\\) is the proportion of class 1 in the subset of data. The domain is \\([0, 1]\\) and the maximum value of \\(0.25\\) is at \\(p=0.5\\).\n\n\n\n\n\nFor two groups, how would the impurity of a split be measured? Give the equation.\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\\[p_L(p_{L1}(1-p_{L1})) + p_R(p_{R1}(1-p_{R1}))\\] where \\(p_L\\) is the proportion of observations to the left of the split, \\(p_{L1}\\) is the proportion of observations of class 1 to the left of the split, and \\(p_{R1}\\) indicates the equivalent quantities for observations to the right of the split.\n\n\n\n\n\nBelow is an R function to compute the Gini impurity for a particular split on a single variable. Work through the code of the function, and document what each step does. 
Make sure to include a not on what the minsplit parameter, does to prevent splitting on the edges fewer than the specified number of observations.\n\n\n# This works for two classes, and one variable\nmygini <- function(p) {\n g <- 0\n if (p>0 && p<1) {\n g <- 2*p*(1-p)\n }\n\n return(g)\n}\n\nmysplit <- function(x, spl, cl, minsplit=5) {\n # Assumes x is sorted\n # Count number of observations\n n <- length(x)\n \n # Check number of classes\n cl_unique <- unique(cl)\n \n # Split into two subsets on the given value\n left <- x[x<spl]\n cl_left <- cl[x<spl]\n n_l <- length(left)\n\n right <- x[x>=spl]\n cl_right <- cl[x>=spl]\n n_r <- length(right)\n \n # Don't calculate is either set is less than minsplit\n if ((n_l < minsplit) | (n_r < minsplit)) \n impurity = NA\n else {\n # Compute the Gini value for the split\n p_l <- length(cl_left[cl_left == cl_unique[1]])/n_l\n p_r <- length(cl_right[cl_right == cl_unique[1]])/n_r\n if (is.na(p_l)) p_l<-0.5\n if (is.na(p_r)) p_r<-0.5\n impurity <- (n_l/n)*mygini(p_l) + (n_r/n)*mygini(p_r)\n }\n return(impurity)\n}\n\n\nApply the function to compute the value for all possible splits for the body mass (bm), setting minsplit to be 1, so that all possible splits will be evaluated. Make a plot of these values vs the variable.\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\nx <- p_tr |> \n select(species, bm) |>\n arrange(bm)\nunique_splits <- unique(x$bm)\nnsplits <- length(unique_splits)-1\nsplits <- (unique_splits[1:nsplits] + unique_splits[2:(nsplits+1)])/2\nimp <- NULL;\nfor (i in 1:length(splits)) {\n s <- splits[i]\n a <- mysplit(x$bm, s, x$species, minsplit=1)\n imp <- c(imp, a)\n}\nd_impurity <- tibble(splits, imp)\nd_impurity_bm <- d_impurity[which.min(d_impurity$imp),]\nggplot() + geom_line(data=d_impurity, aes(x=splits, y=imp)) +\n geom_rug(data=x, aes(x=bm, colour=species), alpha=0.3) + \n ylab(\"Gini impurity\") +\n xlab(\"bm\") +\n scale_color_brewer(\"\", palette=\"Dark2\")\n\n\n\n\n\n\n\n\n\n\n\n\n\nUse your function to compute the first two steps of a classification tree model for separating Adelie from Chinstrap penguins, after setting minsplit to be 5. 
Make a scatterplot of the two variables that would be used in the splits, with points coloured by species, and the splits as line segments.\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\n# bl: this is the only one needed for the first split\n# because it is so better separated than any others\nx <- p_tr |> \n select(species, bl) |>\n arrange(bl)\nunique_splits <- unique(x$bl)\nnsplits <- length(unique_splits)-1\nsplits <- (unique_splits[1:nsplits] + unique_splits[2:(nsplits+1)])/2\nimp <- NULL;\nfor (i in 1:length(splits)) {\n s <- splits[i]\n a <- mysplit(x$bl, s, x$species, minsplit=1)\n imp <- c(imp, a)\n}\nd_impurity <- tibble(splits, imp)\nd_impurity_bl <- d_impurity[which.min(d_impurity$imp),]\n\nggplot() + \n geom_line(data=d_impurity, aes(x=splits, y=imp)) +\n geom_rug(data=x, aes(x=bl, colour=species), alpha=0.3) + \n ylab(\"Gini impurity\") +\n xlab(\"bl\") +\n scale_color_brewer(\"\", palette=\"Dark2\")\n\n\n\n\n\n\n\np_tr_L <- p_tr |>\n filter(bl < d_impurity_bl$splits)\n\np_tr_R <- p_tr |>\n filter(bl > d_impurity_bl$splits)\n\n# Make a function to make calculations easier\nbest_split <- function(x, cl, minsplit=5) {\n unique_splits <- unique(x)\n nsplits <- length(unique_splits)-1\n splits <- (unique_splits[1:nsplits] + unique_splits[2:(nsplits+1)])/2\n imp <- NULL;\n for (i in 1:length(splits)) {\n s <- splits[i]\n a <- mysplit(x, s, cl, minsplit)\n imp <- c(imp, a)\n }\n d_impurity <- tibble(splits, imp)\n d_impurity_best <- d_impurity[which.min(d_impurity$imp),]\n return(d_impurity_best)\n}\n\ns1 <- best_split(p_tr$bl, p_tr$species, minsplit=5)\ns2 <- best_split(p_tr_R$bm, p_tr_R$species, minsplit=5)\n\nggplot(p_tr, aes(x=bl, y=bm, colour=species)) +\n geom_point() +\n geom_vline(xintercept=s1$splits) +\n annotate(\"segment\", x = s1$splits,\n xend = max(p_tr$bl),\n y = s2$splits, \n yend = s2$splits) +\n scale_colour_brewer(\"\", palette=\"Dark2\") +\n theme(aspect.ratio = 1)" }, { - "objectID": "week5/slides.html#stopping-rules", - "href": "week5/slides.html#stopping-rules", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Stopping rules", - "text": "Stopping rules\n\nMinimum split: number of observations in a node, in order for a split to be made\nMinimum bucket: Minimum number of observations allowed in a terminal node\nComplexity parameter: minimum difference between impurity values required to continue splitting" + "objectID": "week6/tutorialsol.html#digging-deeper-into-diagnosing-an-error", + "href": "week6/tutorialsol.html#digging-deeper-into-diagnosing-an-error", + "title": "ETC3250/5250 Tutorial 6", + "section": "Digging deeper into diagnosing an error", + "text": "Digging deeper into diagnosing an error\n\nFit the random forest model to the full penguins data.\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\nset.seed(923)\np_split2 <- initial_split(p_tidy_std, 2/3,\n strata=species)\np_tr2 <- training(p_split2)\np_ts2 <- testing(p_split2)\n\nrf_spec <- rand_forest(mtry=2, trees=1000) |>\n set_mode(\"classification\") |>\n set_engine(\"randomForest\")\np_fit_rf <- rf_spec |> \n fit(species ~ ., data = p_tr2)\n\n\n\n\n\n\nReport the confusion matrix.\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\np_fit_rf\n\nparsnip model object\n\n\nCall:\n randomForest(x = maybe_data_frame(x), y = y, ntree = ~1000, mtry = min_cols(~2, x)) \n Type of random forest: classification\n Number of trees: 1000\nNo. 
of variables tried at each split: 2\n\n OOB estimate of error rate: 2.6%\nConfusion matrix:\n Adelie Chinstrap Gentoo class.error\nAdelie 97 2 1 0.030\nChinstrap 2 43 0 0.044\nGentoo 0 1 81 0.012\n\n\n\n\n\n\n\nUse linked brushing to learn which was the Gentoo penguin that the model was confused about. When we looked at the data in a tour, there was one Gentoo penguin that was an outlier, appearing to be away from the other Gentoos and closer to the Chinstrap group. We would expect this to be the penguin that the forest model is confused about. Is it?\n\n\nHave a look at the other misclassifications, to understand whether they are ones weโ€™d expect to misclassify, or whether the model is not well constructed.\n\np_cl <- p_tr2 |>\n mutate(pspecies = p_fit_rf$fit$predicted) |>\n dplyr::select(bl:bm, species, pspecies) |>\n mutate(sp_jit = jitter(as.numeric(species)),\n psp_jit = jitter(as.numeric(pspecies)))\np_cl_shared <- SharedData$new(p_cl)\n\ndetour_plot <- detour(p_cl_shared, tour_aes(\n projection = bl:bm,\n colour = species)) |>\n tour_path(grand_tour(2),\n max_bases=50, fps = 60) |>\n show_scatter(alpha = 0.9, axes = FALSE,\n width = \"100%\", height = \"450px\")\n\nconf_mat <- plot_ly(p_cl_shared,\n x = ~psp_jit,\n y = ~sp_jit,\n color = ~species,\n colors = viridis_pal(option = \"D\")(3),\n height = 450) |>\n highlight(on = \"plotly_selected\",\n off = \"plotly_doubleclick\") |>\n add_trace(type = \"scatter\",\n mode = \"markers\")\n\nbscols(\n detour_plot, conf_mat,\n widths = c(5, 6)\n)" }, { - "objectID": "week5/slides.html#example-penguins-23", - "href": "week5/slides.html#example-penguins-23", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Example: penguins 2/3", - "text": "Example: penguins 2/3\n\n\nDefaults for rpart are:\n\nrpart.control(minsplit = 20, \n minbucket = round(minsplit/3), \n cp = 0.01, \n maxcompete = 4, \n maxsurrogate = 5, \n usesurrogate = 2, \n xval = 10,\n surrogatestyle = 0, maxdepth = 30, \n ...)\n\n\ntree_spec <- decision_tree() |>\n set_mode(\"classification\") |>\n set_engine(\"rpart\",\n control = rpart.control(minsplit = 10), \n model=TRUE)\n\np_fit_tree <- tree_spec |>\n fit(species~., data=p_tr)\n\np_fit_tree\n\nparsnip model object\n\nn= 145 \n\nnode), split, n, loss, yval, (yprob)\n * denotes terminal node\n\n 1) root 145 45 Adelie (0.690 0.310) \n 2) bl< 43 99 2 Adelie (0.980 0.020) \n 4) bl< 41 75 0 Adelie (1.000 0.000) *\n 5) bl>=41 24 2 Adelie (0.917 0.083) \n 10) bm>=3.4e+03 21 0 Adelie (1.000 0.000) *\n 11) bm< 3.4e+03 3 1 Chinstrap (0.333 0.667) *\n 3) bl>=43 46 3 Chinstrap (0.065 0.935) \n 6) bl< 46 10 3 Chinstrap (0.300 0.700) \n 12) bm>=3.8e+03 3 0 Adelie (1.000 0.000) *\n 13) bm< 3.8e+03 7 0 Chinstrap (0.000 1.000) *\n 7) bl>=46 36 0 Chinstrap (0.000 1.000) *" + "objectID": "week6/tutorialsol.html#deciding-on-variables-in-a-large-data-problem", + "href": "week6/tutorialsol.html#deciding-on-variables-in-a-large-data-problem", + "title": "ETC3250/5250 Tutorial 6", + "section": "Deciding on variables in a large data problem", + "text": "Deciding on variables in a large data problem\n\nFit a random forest to the bushfire data. You can read more about the bushfire data at https://dicook.github.io/mulgar_book/A2-data.html. Examine the votes matrix using a tour. 
What do you learn about the confusion between fire causes?\n\nThis code might help:\n\ndata(bushfires)\n\nbushfires_sub <- bushfires[,c(5, 8:45, 48:55, 57:60)] |>\n mutate(cause = factor(cause))\n\nset.seed(1239)\nbf_split <- initial_split(bushfires_sub, 3/4, strata=cause)\nbf_tr <- training(bf_split)\nbf_ts <- testing(bf_split)\n\nrf_spec <- rand_forest(mtry=5, trees=1000) |>\n set_mode(\"classification\") |>\n set_engine(\"ranger\", probability = TRUE, \n importance=\"permutation\")\nbf_fit_rf <- rf_spec |> \n fit(cause~., data = bf_tr)\n\n# Create votes matrix data\nbf_rf_votes <- bf_fit_rf$fit$predictions |>\n as_tibble() |>\n mutate(cause = bf_tr$cause)\n\n# Project 4D into 3D\nproj <- t(geozoo::f_helmert(4)[-1,])\nbf_rf_v_p <- as.matrix(bf_rf_votes[,1:4]) %*% proj\ncolnames(bf_rf_v_p) <- c(\"x1\", \"x2\", \"x3\")\nbf_rf_v_p <- bf_rf_v_p |>\n as.data.frame() |>\n mutate(cause = bf_tr$cause)\n \n# Add simplex\nsimp <- simplex(p=3)\nsp <- data.frame(simp$points)\ncolnames(sp) <- c(\"x1\", \"x2\", \"x3\")\nsp$cause = \"\"\nbf_rf_v_p_s <- bind_rows(sp, bf_rf_v_p) |>\n mutate(cause = factor(cause))\nlabels <- c(\"accident\" , \"arson\", \n \"burning_off\", \"lightning\", \n rep(\"\", nrow(bf_rf_v_p)))\n\n\n# Examine votes matrix with bounding simplex\nanimate_xy(bf_rf_v_p_s[,1:3], col = bf_rf_v_p_s$cause, \n axes = \"off\", half_range = 1.3,\n edges = as.matrix(simp$edges),\n obs_labels = labels)\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\nThe pattern is that points are bunched at the vertex corresponding to lightning, extending along the edge leading to accident. We could also say that the points do extend on the face corresponding to lightning, accident and arson, too. The primary confusion for each of the other classes is with lightning. Few points are predicted to be burning_off because this is typically only occurring outside of fire season.\nPart of the reason that the forest predicts predominantly to the lightning class is because it is a highly imbalanced problem. One approach is to change the weights for each class, to give the lightning class a lower priority. This will change the model predictions to be more often the other three classes.\n\n\n\n\n\nCheck the variable importance. Plot the most important variables.\n\nThis code might help:\n\nbf_fit_rf$fit$variable.importance |> \n as_tibble() |> \n rename(imp=value) |>\n mutate(var = colnames(bf_tr)[1:50]) |>\n select(var, imp) |>\n arrange(desc(imp)) |> \n print(n=50)\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\np1 <- ggplot(bf_tr, aes(x=cause, y=log_dist_road)) +\n geom_quasirandom(alpha=0.5) +\n stat_summary(aes(group = cause), \n fun = median, \n fun.min = median, \n fun.max = median, \n geom = \"crossbar\", \n color = \"orange\", \n width = 0.7, \n lwd = 0.5) +\n xlab(\"\") +\n coord_flip() \np2 <- ggplot(bf_tr, aes(x=cause, y=arf360)) +\n geom_quasirandom(alpha=0.5) +\n stat_summary(aes(group = cause), \n fun = median, \n fun.min = median, \n fun.max = median, \n geom = \"crossbar\", \n color = \"orange\", \n width = 0.7, \n lwd = 0.5) +\n xlab(\"\") +\n coord_flip()\np3 <- ggplot(bf_tr, aes(x=cause, y=log_dist_cfa)) +\n geom_quasirandom(alpha=0.5) +\n stat_summary(aes(group = cause), \n fun = median, \n fun.min = median, \n fun.max = median, \n geom = \"crossbar\", \n color = \"orange\", \n width = 0.7, \n lwd = 0.5) +\n xlab(\"\") +\n coord_flip()\np1 + p2 + p3 + plot_layout(ncol=3)\n\n\n\n\n\n\n\n\nEach of these variables has some difference in median value between the classes, but none shows any separation between them. 
If the three most important variables show little separation, it indicates the difficulty in distinguishing between these classes. However, it looks like if the distance from a road, or CFA station is bigger, the chance of the cause being a lightning start is higher. This makes sense, because these would be locations further from human activity, and thus the fire is less likely to started by people. The arf360 relates to rain from a year ago. It also appears that if the rainfall was higher a year ago, lightning is more likely the cause. This also makes some sense, because with more rain in the previous year, there should be more vegetation. Particularly, if recent months have been dry, then there is likely a lot of dry vegetation which is combustible. Ideally we would create a new variable (feature engineering) that looks at difference in rainfall from the previous year to just before the current yearโ€™s fire season, to model these types of conditions." }, { - "objectID": "week5/slides.html#example-penguins-33", - "href": "week5/slides.html#example-penguins-33", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Example: penguins 3/3", - "text": "Example: penguins 3/3\n\n\n\n\n\n\n\n\n\n\n\n\n\np_fit_tree |>\n extract_fit_engine() |>\n rpart.plot(type=3, extra=1)" + "objectID": "week6/tutorialsol.html#can-boosting-better-detect-bushfire-case", + "href": "week6/tutorialsol.html#can-boosting-better-detect-bushfire-case", + "title": "ETC3250/5250 Tutorial 6", + "section": "Can boosting better detect bushfire case?", + "text": "Can boosting better detect bushfire case?\nFit a boosted tree model using xgboost to the bushfires data. You can use the code below. Compute the confusion tables and the balanced accuracy for the test data for both the forest model and the boosted tree model, to make the comparison.\n\nset.seed(121)\nbf_spec2 <- boost_tree() |>\n set_mode(\"classification\") |>\n set_engine(\"xgboost\")\nbf_fit_bt <- bf_spec2 |> \n fit(cause~., data = bf_tr)\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\nThe results for the random forest are:\n\nbf_ts_rf_pred <- bf_ts |>\n mutate(pcause = predict(bf_fit_rf, bf_ts)$.pred_class)\nbal_accuracy(bf_ts_rf_pred, cause, pcause)\n\n# A tibble: 1 ร— 3\n .metric .estimator .estimate\n <chr> <chr> <dbl>\n1 bal_accuracy macro 0.638\n\nbf_ts_rf_pred |>\n count(cause, pcause) |>\n group_by(cause) |>\n mutate(Accuracy = n[cause==pcause]/sum(n)) |>\n pivot_wider(names_from = \"pcause\", \n values_from = n, values_fill = 0) |>\n select(cause, accident, arson, burning_off, lightning, Accuracy)\n\n# A tibble: 4 ร— 6\n# Groups: cause [4]\n cause accident arson burning_off lightning Accuracy\n <fct> <int> <int> <int> <int> <dbl>\n1 accident 14 0 0 19 0.424 \n2 arson 2 1 0 10 0.0769\n3 burning_off 0 0 1 3 0.25 \n4 lightning 0 0 0 206 1 \n\n\nand for the boosted tree are:\n\nbf_ts_bt_pred <- bf_ts |>\n mutate(pcause = predict(bf_fit_bt, \n bf_ts)$.pred_class)\nbal_accuracy(bf_ts_bt_pred, cause, pcause)\n\n# A tibble: 1 ร— 3\n .metric .estimator .estimate\n <chr> <chr> <dbl>\n1 bal_accuracy macro 0.765\n\nbf_ts_bt_pred |>\n count(cause, pcause) |>\n group_by(cause) |>\n mutate(Accuracy = n[cause==pcause]/sum(n)) |>\n pivot_wider(names_from = \"pcause\", \n values_from = n, values_fill = 0) |>\n select(cause, accident, arson, burning_off, lightning, Accuracy)\n\n# A tibble: 4 ร— 6\n# Groups: cause [4]\n cause accident arson burning_off lightning Accuracy\n <fct> <int> <int> <int> <int> <dbl>\n1 accident 19 1 0 13 0.576\n2 arson 4 6 0 3 0.462\n3 
burning_off 0 0 2 2 0.5 \n4 lightning 3 1 0 202 0.981\n\n\nThe boosted tree does improve the balanced accuracy." }, { - "objectID": "week5/slides.html#example-penguins-34", - "href": "week5/slides.html#example-penguins-34", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Example: penguins 3/4", - "text": "Example: penguins 3/4\n\n\nModel fit summary\n\n\n# A tibble: 1 ร— 3\n .metric .estimator .estimate\n <chr> <chr> <dbl>\n1 accuracy binary 0.946\n\n\n# A tibble: 2 ร— 4\n# Groups: species [2]\n species Adelie Chinstrap Accuracy\n <fct> <int> <int> <dbl>\n1 Adelie 50 1 0.980\n2 Chinstrap 3 20 0.870\n\n\n# A tibble: 1 ร— 3\n .metric .estimator .estimate\n <chr> <chr> <dbl>\n1 bal_accuracy binary 0.925\n\n\n\nCan you see the misclassified test cases?\n\nModel-in-the-data-space" + "objectID": "week6/tutorialsol.html#finishing-up", + "href": "week6/tutorialsol.html#finishing-up", + "title": "ETC3250/5250 Tutorial 6", + "section": "๐Ÿ‘‹ Finishing up", + "text": "๐Ÿ‘‹ Finishing up\nMake sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult." }, { - "objectID": "week5/slides.html#comparison-with-lda", - "href": "week5/slides.html#comparison-with-lda", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Comparison with LDA", - "text": "Comparison with LDA\n\n\n\nTree model\n\n\n\n\n\n\n\n\n\n\n\nData-driven, only split on single variables\n\n\n\nLDA model\n\n\n\n\n\n\n\n\n\n\n\nAssume normal, equal VC, oblique splits" + "objectID": "week5/tutorial.html", + "href": "week5/tutorial.html", + "title": "ETC3250/5250 Tutorial 5", + "section": "", + "text": "Load the libraries and avoid conflicts\n# Load libraries used everywhere\nlibrary(tidyverse)\nlibrary(tidymodels)\nlibrary(patchwork)\nlibrary(mulgar)\nlibrary(palmerpenguins)\nlibrary(GGally)\nlibrary(tourr)\nlibrary(MASS)\nlibrary(discrim)\nlibrary(classifly)\nlibrary(detourr)\nlibrary(crosstalk)\nlibrary(plotly)\nlibrary(viridis)\nlibrary(colorspace)\nlibrary(conflicted)\nconflicts_prefer(dplyr::filter)\nconflicts_prefer(dplyr::select)\nconflicts_prefer(dplyr::slice)\nconflicts_prefer(palmerpenguins::penguins)\nconflicts_prefer(viridis::viridis_pal)\n\noptions(digits=2)\np_tidy <- penguins |>\n select(species, bill_length_mm:body_mass_g) |>\n rename(bl=bill_length_mm,\n bd=bill_depth_mm,\n fl=flipper_length_mm,\n bm=body_mass_g) |>\n filter(!is.na(bl)) |>\n arrange(species) |>\n na.omit()\np_tidy_std <- p_tidy |>\n mutate_if(is.numeric, function(x) (x-mean(x))/sd(x))" }, { - "objectID": "week5/slides.html#random-forests", - "href": "week5/slides.html#random-forests", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Random forests", - "text": "Random forests" + "objectID": "week5/tutorial.html#objectives", + "href": "week5/tutorial.html#objectives", + "title": "ETC3250/5250 Tutorial 5", + "section": "๐ŸŽฏ Objectives", + "text": "๐ŸŽฏ Objectives\nThe goal for this week is learn to fit, diagnose, assess assumptions, and predict from logistic regression models, and linear discriminant analysis models." }, { - "objectID": "week5/slides.html#overview-1", - "href": "week5/slides.html#overview-1", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Overview", - "text": "Overview\nA random forest is an ensemble classifier, built from fitting multiple trees to different subsets of the training data." 
+ "objectID": "week5/tutorial.html#preparation", + "href": "week5/tutorial.html#preparation", + "title": "ETC3250/5250 Tutorial 5", + "section": "๐Ÿ”ง Preparation", + "text": "๐Ÿ”ง Preparation\n\nMake sure you have all the necessary libraries installed. There are a few new ones this week!" }, { - "objectID": "week5/slides.html#bagging-and-variable-sampling", - "href": "week5/slides.html#bagging-and-variable-sampling", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Bagging and variable sampling", - "text": "Bagging and variable sampling\n\n\n\nTake \\(B\\) different bootstrapped training sets: \\(D_1, D_2, \\dots, D_B\\), each using a sample of variables.\nBuild a separate prediction model using each \\(D_{(\\cdot)}\\): \\[\\widehat{f}_1(x), \\widehat{f}_2(x), \\dots, \\widehat{f}_B(x)\\]\nPredict the out-of-bag cases for each tree, compute proportion of trees a case was predicted to be each class.\nPredicted value for each observation is the class with the highest proportion.\n\n\n\n\nEach individual tree has high variance.\nAggregating the results from \\(B\\) trees reduces the variance." + "objectID": "week5/tutorial.html#exercises", + "href": "week5/tutorial.html#exercises", + "title": "ETC3250/5250 Tutorial 5", + "section": "Exercises:", + "text": "Exercises:\nOpen your project for this unit called iml.Rproj. For all the work we will use the penguins data. Start with splitting it into a training and test set, as follows.\n\nset.seed(1148)\np_split <- initial_split(p_tidy_std, 2/3, strata = species)\np_tr <- training(p_split)\np_ts <- testing(p_split)\n\n\n1. LDA\nThis problem uses linear discriminant analysis on the penguins data.\n\nIs the assumption of equal variance-covariance reasonable to make for this data?\n\n\nFit the LDA model to the training data, using this code\n\n\nlda_spec <- discrim_linear() |>\n set_mode(\"classification\") |>\n set_engine(\"MASS\", prior = c(1/3, 1/3, 1/3))\nlda_fit <- lda_spec |> \n fit(species ~ ., data = p_tr)\n\n\nCompute the confusion matrices for training and test sets, and thus the error for the test set.\n\n\nPlot the training and test data in the discriminant space, using symbols to indicate which set. See if you can mark the misclassified cases, too.\n\n\nRe-do the plot of the discriminant space, to examine the boundary between groups. Youโ€™ll need to generate a set of random points in the domain of the data, predict their class, and projection into the discriminant space. The explore() in the classifly package can help you generate the box of random points.\n\n\nWhat happens to the boundary, if you change the prior probabilities? And why does this happen? Change the prior probabilities to be 1.999/3, 0.001/3, 1/3 for Adelie, Chinstrap, Gentoo, respectively. Re-do the plot of the boundaries in the discriminant space.\n\n\n\n2. Logistic\n\nFit a logistic discriminant model to the training set. You can use this code:\n\n\nlog_fit <- multinom_reg() |> \n fit(species ~ ., \n data = p_tr)\n\n\nCompute the confusion matrices for training and test sets, and thus the error for the test set. You can use this code to make the predictions.\n\n\np_tr_pred <- log_fit |> \n augment(new_data = p_tr) |>\n rename(pspecies = .pred_class)\np_ts_pred <- log_fit |> \n augment(new_data = p_ts) |>\n rename(pspecies = .pred_class)\n\n\nCheck the boundaries produced by logistic regression, and how they differ from those of LDA. 
Using the 2D projection produced by the LDA rule (using equal priors) predict the your set of random points using the logistic model.\n\n\n\n3. Interactively explore misclassifications\nHere you are going to use interactive graphics to explore the misclassifications from the linear discriminant analysis. Weโ€™ll need to use detourr to accomplish this. The code below makes a scatterplot of the confusion matrix, where points corresponding to a class have been spread apart by jittering. This plot is linked to a tour plot. Try:\n\nSelecting penguins that have been misclassified, from the display of the confusion matrix. Observe where they are in the data space. Are they in an area where it is hard to distinguish the groups?\nSelecting neighbouring points in the tour, and examine where they are in the confusion matrix.\n\n\np_cl <- p_tidy_std |>\n mutate(pspecies = predict(lda_fit$fit, p_tidy_std)$class) |>\n dplyr::select(bl:bm, species, pspecies) |>\n mutate(sp_jit = jitter(as.numeric(species)),\n psp_jit = jitter(as.numeric(pspecies)))\np_cl_shared <- SharedData$new(p_cl)\n\ndetour_plot <- detour(p_cl_shared, tour_aes(\n projection = bl:bm,\n colour = species)) |>\n tour_path(grand_tour(2), \n max_bases=50, fps = 60) |>\n show_scatter(alpha = 0.9, axes = FALSE,\n width = \"100%\", height = \"450px\")\n\nconf_mat <- plot_ly(p_cl_shared, \n x = ~psp_jit,\n y = ~sp_jit,\n color = ~species,\n colors = viridis_pal(option = \"D\")(3),\n height = 450) |>\n highlight(on = \"plotly_selected\", \n off = \"plotly_doubleclick\") %>%\n add_trace(type = \"scatter\", \n mode = \"markers\")\n \nbscols(\n detour_plot, conf_mat,\n widths = c(5, 6)\n ) \n\n\n\n4. Exploring the math\nSlide 23 of the lecture notes has the steps to go from Bayes rule to the discriminant functions. Explain what was done at each step to get to the next one." }, { - "objectID": "week5/slides.html#comparison-with-a-single-tree-and-lda", - "href": "week5/slides.html#comparison-with-a-single-tree-and-lda", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Comparison with a single tree and LDA", - "text": "Comparison with a single tree and LDA\n\n\n\nTree model\n\n\n\n\n\n\n\n\n\n\n\nData-driven, only split on single variables\n\n\n\nRandom forest\n\n\n\n\n\n\n\n\n\n\n\nData-driven, multiple trees gives non-linear fit\n\n\n\nLDA model\n\n\n\n\n\n\n\n\n\n\n\nAssume normal, equal VC, oblique splits" + "objectID": "week5/tutorial.html#finishing-up", + "href": "week5/tutorial.html#finishing-up", + "title": "ETC3250/5250 Tutorial 5", + "section": "๐Ÿ‘‹ Finishing up", + "text": "๐Ÿ‘‹ Finishing up\nMake sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult." }, { - "objectID": "week5/slides.html#random-forest-fit-and-predicted-values", - "href": "week5/slides.html#random-forest-fit-and-predicted-values", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Random forest fit and predicted values", - "text": "Random forest fit and predicted values\n\n\nFit\n\nrf_spec <- rand_forest(mtry=2, trees=1000) |>\n set_mode(\"classification\") |>\n set_engine(\"randomForest\")\np_fit_rf <- rf_spec |> \n fit(species ~ ., data = p_tr)\n\n\n\nparsnip model object\n\n\nCall:\n randomForest(x = maybe_data_frame(x), y = y, ntree = ~1000, mtry = min_cols(~2, x)) \n Type of random forest: classification\n Number of trees: 1000\nNo. 
of variables tried at each split: 2\n\n OOB estimate of error rate: 4.8%\nConfusion matrix:\n Adelie Chinstrap class.error\nAdelie 96 4 0.040\nChinstrap 3 42 0.067\n\n\n\nPredicted values\n\n\n# A tibble: 1 ร— 3\n .metric .estimator .estimate\n <chr> <chr> <dbl>\n1 accuracy binary 0.973\n\n\n# A tibble: 2 ร— 4\n# Groups: species [2]\n species Adelie Chinstrap Accuracy\n <fct> <int> <int> <dbl>\n1 Adelie 51 0 1 \n2 Chinstrap 2 21 0.913\n\n\n# A tibble: 1 ร— 3\n .metric .estimator .estimate\n <chr> <chr> <dbl>\n1 bal_accuracy binary 0.957\n\n\n\n Warning: Donโ€™t use the predict() on the training set, youโ€™ll always get 0 error. The object p_fit_rf$fit$predict has the fitted values." + "objectID": "week5/tutorialsol.html", + "href": "week5/tutorialsol.html", + "title": "ETC3250/5250 Tutorial 5", + "section": "", + "text": "Load the libraries and avoid conflicts\n# Load libraries used everywhere\nlibrary(tidyverse)\nlibrary(tidymodels)\nlibrary(patchwork)\nlibrary(mulgar)\nlibrary(palmerpenguins)\nlibrary(GGally)\nlibrary(tourr)\nlibrary(MASS)\nlibrary(discrim)\nlibrary(classifly)\nlibrary(detourr)\nlibrary(crosstalk)\nlibrary(plotly)\nlibrary(viridis)\nlibrary(colorspace)\nlibrary(conflicted)\nconflicts_prefer(dplyr::filter)\nconflicts_prefer(dplyr::select)\nconflicts_prefer(dplyr::slice)\nconflicts_prefer(palmerpenguins::penguins)\nconflicts_prefer(viridis::viridis_pal)\n\noptions(digits=2)\np_tidy <- penguins |>\n select(species, bill_length_mm:body_mass_g) |>\n rename(bl=bill_length_mm,\n bd=bill_depth_mm,\n fl=flipper_length_mm,\n bm=body_mass_g) |>\n filter(!is.na(bl)) |>\n arrange(species) |>\n na.omit()\np_tidy_std <- p_tidy |>\n mutate_if(is.numeric, function(x) (x-mean(x))/sd(x))" }, { - "objectID": "week5/slides.html#diagnostics", - "href": "week5/slides.html#diagnostics", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Diagnostics", - "text": "Diagnostics\n\nError is computed automatically on the out-of-bag cases.\nVote matrix, \\(n\\times K\\): Proportion of times a case is predicted to the class \\(k\\). Also consider these to be predictive probabilities.\nVariable importance: uses permutation!\nProximities, \\(n\\times n\\): Closeness of cases measured by how often they are in the same terminal node." + "objectID": "week5/tutorialsol.html#objectives", + "href": "week5/tutorialsol.html#objectives", + "title": "ETC3250/5250 Tutorial 5", + "section": "๐ŸŽฏ Objectives", + "text": "๐ŸŽฏ Objectives\nThe goal for this week is learn to fit, diagnose, assess assumptions, and predict from logistic regression models, and linear discriminant analysis models." 
}, { - "objectID": "week5/slides.html#vote-matrix", - "href": "week5/slides.html#vote-matrix", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Vote Matrix", - "text": "Vote Matrix\n\n\n\nProportion of trees the case is predicted to be each class, ranges between 0-1\nCan be used to identify troublesome cases.\nUsed with plots of the actual data can help determine if it is the record itself that is the problem, or if method is biased.\nUnderstand the difference in accuracy of prediction for different classes.\n\n\n\np_fit_rf$fit$votes\n\n Adelie Chinstrap\n1 1.0000 0.0000\n2 1.0000 0.0000\n3 0.9807 0.0193\n4 1.0000 0.0000\n5 1.0000 0.0000\n6 1.0000 0.0000\n7 1.0000 0.0000\n8 0.3982 0.6018\n9 1.0000 0.0000\n10 1.0000 0.0000\n11 1.0000 0.0000\n12 0.8274 0.1726\n13 0.3425 0.6575\n14 1.0000 0.0000\n15 1.0000 0.0000\n16 0.7931 0.2069\n17 1.0000 0.0000\n18 1.0000 0.0000\n19 0.9973 0.0027\n20 1.0000 0.0000\n21 0.7622 0.2378\n22 1.0000 0.0000\n23 0.9459 0.0541\n24 1.0000 0.0000\n25 1.0000 0.0000\n26 0.8568 0.1432\n27 1.0000 0.0000\n28 1.0000 0.0000\n29 1.0000 0.0000\n30 1.0000 0.0000\n31 1.0000 0.0000\n32 1.0000 0.0000\n33 1.0000 0.0000\n34 1.0000 0.0000\n35 1.0000 0.0000\n36 1.0000 0.0000\n37 1.0000 0.0000\n38 1.0000 0.0000\n39 1.0000 0.0000\n40 1.0000 0.0000\n41 1.0000 0.0000\n42 1.0000 0.0000\n43 1.0000 0.0000\n44 1.0000 0.0000\n45 0.2773 0.7227\n46 1.0000 0.0000\n47 0.9821 0.0179\n48 1.0000 0.0000\n49 0.9973 0.0027\n50 1.0000 0.0000\n51 1.0000 0.0000\n52 1.0000 0.0000\n53 1.0000 0.0000\n54 1.0000 0.0000\n55 1.0000 0.0000\n56 1.0000 0.0000\n57 1.0000 0.0000\n58 1.0000 0.0000\n59 1.0000 0.0000\n60 1.0000 0.0000\n61 0.9833 0.0167\n62 1.0000 0.0000\n63 0.9113 0.0887\n64 1.0000 0.0000\n65 1.0000 0.0000\n66 1.0000 0.0000\n67 1.0000 0.0000\n68 1.0000 0.0000\n69 0.9912 0.0088\n70 1.0000 0.0000\n71 0.9535 0.0465\n72 0.9914 0.0086\n73 1.0000 0.0000\n74 0.9676 0.0324\n75 1.0000 0.0000\n76 1.0000 0.0000\n77 1.0000 0.0000\n78 1.0000 0.0000\n79 1.0000 0.0000\n80 1.0000 0.0000\n81 1.0000 0.0000\n82 0.9973 0.0027\n83 1.0000 0.0000\n84 1.0000 0.0000\n85 1.0000 0.0000\n86 0.4624 0.5376\n87 0.6160 0.3840\n88 1.0000 0.0000\n89 1.0000 0.0000\n90 1.0000 0.0000\n91 1.0000 0.0000\n92 0.9948 0.0052\n93 1.0000 0.0000\n94 0.9972 0.0028\n95 1.0000 0.0000\n96 1.0000 0.0000\n97 1.0000 0.0000\n98 1.0000 0.0000\n99 1.0000 0.0000\n100 1.0000 0.0000\n101 0.0000 1.0000\n102 0.0000 1.0000\n103 0.0055 0.9945\n104 0.0653 0.9347\n105 0.0000 1.0000\n106 0.0000 1.0000\n107 0.0000 1.0000\n108 0.0000 1.0000\n109 0.0000 1.0000\n110 0.1935 0.8065\n111 0.0000 1.0000\n112 0.0000 1.0000\n113 0.0159 0.9841\n114 0.0000 1.0000\n115 0.0000 1.0000\n116 0.2074 0.7926\n117 0.0000 1.0000\n118 0.0000 1.0000\n119 0.0117 0.9883\n120 1.0000 0.0000\n121 0.0529 0.9471\n122 0.9536 0.0464\n123 0.0027 0.9973\n124 0.0000 1.0000\n125 0.0000 1.0000\n126 0.0163 0.9837\n127 0.0000 1.0000\n128 0.0000 1.0000\n129 0.0111 0.9889\n130 0.0000 1.0000\n131 0.0694 0.9306\n132 0.0000 1.0000\n133 0.0137 0.9863\n134 0.0000 1.0000\n135 0.0000 1.0000\n136 0.0052 0.9948\n137 0.0000 1.0000\n138 0.0000 1.0000\n139 0.0135 0.9865\n140 0.0000 1.0000\n141 0.0140 0.9860\n142 0.0000 1.0000\n143 0.6325 0.3675\n144 0.0000 1.0000\n145 0.0000 1.0000\nattr(,\"class\")\n[1] \"matrix\" \"array\" \"votes\"" + "objectID": "week5/tutorialsol.html#preparation", + "href": "week5/tutorialsol.html#preparation", + "title": "ETC3250/5250 Tutorial 5", + "section": "๐Ÿ”ง Preparation", + "text": "๐Ÿ”ง Preparation\n\nMake sure you have all the necessary libraries installed. 
There are a few new ones this week!" }, { - "objectID": "week5/slides.html#curious", - "href": "week5/slides.html#curious", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Curious", - "text": "Curious\n\n\nWhere are the Adelie penguins in the training set that are misclassified?\n\n\nparsnip model object\n\n\nCall:\n randomForest(x = maybe_data_frame(x), y = y, ntree = ~1000, mtry = min_cols(~2, x)) \n Type of random forest: classification\n Number of trees: 1000\nNo. of variables tried at each split: 2\n\n OOB estimate of error rate: 4.8%\nConfusion matrix:\n Adelie Chinstrap class.error\nAdelie 96 4 0.040\nChinstrap 3 42 0.067\n\n\n\nJoin data containing true, predicted and predictive probabilities, to diagnose.\n\n\n\n\n\n# A tibble: 7 ร— 6\n species bl bm pspecies Adelie Chinstrap\n <fct> <dbl> <int> <fct> <dbl> <dbl>\n1 Adelie 41.1 3200 Chinstrap 0.398 0.602 \n2 Adelie 46 4200 Chinstrap 0.342 0.658 \n3 Adelie 45.8 4150 Chinstrap 0.277 0.723 \n4 Adelie 44.1 4000 Chinstrap 0.462 0.538 \n5 Chinstrap 40.9 3200 Adelie 1 0 \n6 Chinstrap 42.5 3350 Adelie 0.954 0.0464\n7 Chinstrap 43.5 3400 Adelie 0.632 0.368" + "objectID": "week5/tutorialsol.html#exercises", + "href": "week5/tutorialsol.html#exercises", + "title": "ETC3250/5250 Tutorial 5", + "section": "Exercises:", + "text": "Exercises:\nOpen your project for this unit called iml.Rproj. For all the work we will use the penguins data. Start with splitting it into a training and test set, as follows.\n\nset.seed(1148)\np_split <- initial_split(p_tidy_std, 2/3, strata = species)\np_tr <- training(p_split)\np_ts <- testing(p_split)\n\n\n1. LDA\nThis problem uses linear discriminant analysis on the penguins data.\n\nIs the assumption of equal variance-covariance reasonable to make for this data?\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\nYou need to look at the data in a tour, using:\n\nanimate_xy(p_tidy_std[,2:5], col=p_tidy$species)\n\nUse the standardised data, because the measurements are in different sizes, and this is not relevant for this data.\n\n\n\n\n\nFit the LDA model to the training data, using this code\n\n\nlda_spec <- discrim_linear() |>\n set_mode(\"classification\") |>\n set_engine(\"MASS\", prior = c(1/3, 1/3, 1/3))\nlda_fit <- lda_spec |> \n fit(species ~ ., data = p_tr)\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\n\nparsnip model object\n\nCall:\nlda(species ~ ., data = data, prior = ~c(1/3, 1/3, 1/3))\n\nPrior probabilities of groups:\n Adelie Chinstrap Gentoo \n 0.33 0.33 0.33 \n\nGroup means:\n bl bd fl bm\nAdelie -0.94 0.65 -0.79 -0.59\nChinstrap 0.92 0.64 -0.37 -0.62\nGentoo 0.70 -1.08 1.19 1.16\n\nCoefficients of linear discriminants:\n LD1 LD2\nbl -0.34 -2.251\nbd 2.02 0.035\nfl -1.13 -0.170\nbm -1.18 1.376\n\nProportion of trace:\n LD1 LD2 \n0.82 0.18 \n\n\n\n\n\n\n\nCompute the confusion matrices for training and test sets, and thus the error for the test set.\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\n\n# A tibble: 3 ร— 5\n# Groups: species [3]\n species Adelie Chinstrap Gentoo cl_acc\n <fct> <int> <int> <int> <dbl>\n1 Adelie 100 0 0 1 \n2 Chinstrap 1 44 0 0.978\n3 Gentoo 0 0 82 1 \n\n\n# A tibble: 3 ร— 5\n# Groups: species [3]\n species Adelie Chinstrap Gentoo cl_acc\n <fct> <int> <int> <int> <dbl>\n1 Adelie 49 2 0 0.961\n2 Chinstrap 1 22 0 0.957\n3 Gentoo 0 0 41 1 \n\n\n[1] 0.97\n\n\n\n\n\n\n\nPlot the training and test data in the discriminant space, using symbols to indicate which set. 
See if you can mark the misclassified cases, too.\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nRe-do the plot of the discriminant space, to examine the boundary between groups. Youโ€™ll need to generate a set of random points in the domain of the data, predict their class, and projection into the discriminant space. The explore() in the classifly package can help you generate the box of random points.\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nWhat happens to the boundary, if you change the prior probabilities? And why does this happen? Change the prior probabilities to be 1.999/3, 0.001/3, 1/3 for Adelie, Chinstrap, Gentoo, respectively. Re-do the plot of the boundaries in the discriminant space.\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nIf the prior probabilities are unequal, it gives more importance to some classes. Here the importance of the Adelie penguins has been increased to the detriment of the Chinstrap. So the boundary moves away from the Adelie, which means more often a new penguin would be classified as an Adelie.\n\n\n\n\n\n\n2. Logistic\n\nFit a logistic discriminant model to the training set. You can use this code:\n\n\nlog_fit <- multinom_reg() |> \n fit(species ~ ., \n data = p_tr)\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\nlog_fit\n\nparsnip model object\n\nCall:\nnnet::multinom(formula = species ~ ., data = data, trace = FALSE)\n\nCoefficients:\n (Intercept) bl bd fl bm\nChinstrap 18.0 84 -42 4.7 -25\nGentoo 7.4 38 -69 33.7 25\n\nResidual Deviance: 0.00024 \nAIC: 20 \n\n\n\n\n\n\n\nCompute the confusion matrices for training and test sets, and thus the error for the test set. You can use this code to make the predictions.\n\n\np_tr_pred <- log_fit |> \n augment(new_data = p_tr) |>\n rename(pspecies = .pred_class)\np_ts_pred <- log_fit |> \n augment(new_data = p_ts) |>\n rename(pspecies = .pred_class)\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\np_tr_pred |> count(species, pspecies) |>\n group_by(species) |>\n mutate(cl_acc = n[pspecies==species]/sum(n)) |>\n pivot_wider(names_from = pspecies, \n values_from = n, values_fill=0) |>\n select(species, Adelie, Chinstrap, Gentoo, cl_acc)\n\n# A tibble: 3 ร— 5\n# Groups: species [3]\n species Adelie Chinstrap Gentoo cl_acc\n <fct> <int> <int> <int> <dbl>\n1 Adelie 100 0 0 1\n2 Chinstrap 0 45 0 1\n3 Gentoo 0 0 82 1\n\np_ts_pred |> count(species, pspecies) |>\n group_by(species) |>\n mutate(cl_acc = n[pspecies==species]/sum(n)) |>\n pivot_wider(names_from = pspecies, \n values_from = n, values_fill=0) |>\n select(species, Adelie, Chinstrap, Gentoo, cl_acc)\n\n# A tibble: 3 ร— 5\n# Groups: species [3]\n species Adelie Chinstrap Gentoo cl_acc\n <fct> <int> <int> <int> <dbl>\n1 Adelie 49 2 0 0.961\n2 Chinstrap 0 23 0 1 \n3 Gentoo 0 0 41 1 \n\naccuracy(p_ts_pred, species, pspecies)$.estimate\n\n[1] 0.98\n\n\n\n\n\n\n\nCheck the boundaries produced by logistic regression, and how they differ from those of LDA. 
Using the 2D projection produced by the LDA rule (using equal priors) predict the your set of random points using the logistic model.\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\np_log_bnd_ds <- log_fit |> \n augment(new_data = p_bnd) |>\n rename(pspecies = .pred_class)\n\nggplot() +\n geom_point(\n data=p_log_bnd_ds[p_log_bnd_ds$.TYPE == \"simulated\",], \n aes(x=LD1, y=LD2, \n colour=pspecies), shape=46, alpha=0.8) + \n scale_color_discrete_divergingx(\"Zissou 1\") +\n geom_point(data=p_log_bnd_ds[p_log_bnd_ds$.TYPE == \"actual\",],\n aes(x=LD1, y=LD2, \n colour=species), shape=16, alpha=0.8) \n\n\n\n\n\n\n\n\nOne thing that you can notice is that the boundaries are not โ€œcrispโ€, that there is overlap of the coloured points marking the classification regions. This means that the separation from the logistic regression model is not accomplished in the same 2D space as LDA.\n\n\n\n\n\n\n3. Interactively explore misclassifications\nHere you are going to use interactive graphics to explore the misclassifications from the linear discriminant analysis. Weโ€™ll need to use detourr to accomplish this. The code below makes a scatterplot of the confusion matrix, where points corresponding to a class have been spread apart by jittering. This plot is linked to a tour plot. Try:\n\nSelecting penguins that have been misclassified, from the display of the confusion matrix. Observe where they are in the data space. Are they in an area where it is hard to distinguish the groups?\nSelecting neighbouring points in the tour, and examine where they are in the confusion matrix.\n\n\np_cl <- p_tidy_std |>\n mutate(pspecies = predict(lda_fit$fit, p_tidy_std)$class) |>\n dplyr::select(bl:bm, species, pspecies) |>\n mutate(sp_jit = jitter(as.numeric(species)),\n psp_jit = jitter(as.numeric(pspecies)))\np_cl_shared <- SharedData$new(p_cl)\n\ndetour_plot <- detour(p_cl_shared, tour_aes(\n projection = bl:bm,\n colour = species)) |>\n tour_path(grand_tour(2), \n max_bases=50, fps = 60) |>\n show_scatter(alpha = 0.9, axes = FALSE,\n width = \"100%\", height = \"450px\")\n\nconf_mat <- plot_ly(p_cl_shared, \n x = ~psp_jit,\n y = ~sp_jit,\n color = ~species,\n colors = viridis_pal(option = \"D\")(3),\n height = 450) |>\n highlight(on = \"plotly_selected\", \n off = \"plotly_doubleclick\") %>%\n add_trace(type = \"scatter\", \n mode = \"markers\")\n \nbscols(\n detour_plot, conf_mat,\n widths = c(5, 6)\n ) \n\n\n\n4. Exploring the math\nSlide 23 of the lecture notes has the steps to go from Bayes rule to the discriminant functions. Explain what was done at each step to get to the next one." }, { - "objectID": "week5/slides.html#variable-importance-12", - "href": "week5/slides.html#variable-importance-12", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Variable importance (1/2)", - "text": "Variable importance (1/2)\n\nFor every tree predict the oob cases and count the number of votes cast for the correct class.\n\n\n\nRandomly permute the values on a variable in the oob cases and predict the class for these cases.\n\n\n\n3.Difference the votes for the correct class in the variable-permuted oob cases and the real oob cases. Average this number over all trees in the forest. If the value is large, then the variable is very important.\n\n\n Alternatively, Gini importance adds up the difference in impurity value of the descendant nodes with the parent node. 
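A small sketch (not from the slides) of how the permutation measure can be requested alongside the Gini measure, assuming the p_tr training split used for the forest above; importance = TRUE is passed through to randomForest():

rf_perm_spec <- rand_forest(mtry = 2, trees = 1000) |>
  set_mode(\"classification\") |>
  set_engine(\"randomForest\", importance = TRUE)
p_fit_rf_perm <- rf_perm_spec |> fit(species ~ ., data = p_tr)
# importance now has a MeanDecreaseAccuracy column (the permutation
# measure) next to MeanDecreaseGini
p_fit_rf_perm$fit$importance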
Quick to compute.\n\n\n Read a fun explanation by Harriet Mason" + "objectID": "week5/tutorialsol.html#finishing-up", + "href": "week5/tutorialsol.html#finishing-up", + "title": "ETC3250/5250 Tutorial 5", + "section": "๐Ÿ‘‹ Finishing up", + "text": "๐Ÿ‘‹ Finishing up\nMake sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult." }, { - "objectID": "week5/slides.html#variable-importance-22", - "href": "week5/slides.html#variable-importance-22", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Variable importance (2/2)", - "text": "Variable importance (2/2)\n\n\n\np_fit_rf$fit$importance\n\n MeanDecreaseGini\nbl 57.2\nbm 4.5\n\n\n\n\n\n\n\n\n\n\n\n\n\np_tr_perm <- p_tr |>\n mutate(bl = sample(bl))\nggplot(p_tr_perm, aes(x=bl, y=bm, colour=species)) +\n geom_point() +\n scale_color_discrete_divergingx(palette = \"Zissou 1\") +\n ggtitle(\"Permuted bl\") +\n theme(legend.position=\"none\")\n\n\n\n\n\n\n\n\n\nVotes will be close to 0.5 for both classes." + "objectID": "week5/index.html", + "href": "week5/index.html", + "title": "Week 5: Trees and forests", + "section": "", + "text": "ISLR 8.1, 8.2" }, { - "objectID": "week5/slides.html#proximities", - "href": "week5/slides.html#proximities", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Proximities", - "text": "Proximities\n\nMeasure how each pair of observations land in the forest\nRun both in- and out-of-bag cases down the tree, and increase proximity value of cases \\(i, j\\) by 1 each time they are in the same terminal node.\nNormalize by dividing by \\(B\\).\n\nThis creates a similarity matrix between all pairs of observations.\n\nUse this for cluster analysis of the data for further diagnosing unusual observations, and model inadequacies." 
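A sketch of one way to use the proximities for this, assuming the two-class p_tr training split from these slides: refit asking randomForest() to return them, then cluster the observations on 1 - proximity.

p_rf_prox <- randomForest(species ~ ., data = p_tr,
                          ntree = 1000, proximity = TRUE)
# proximity is an n x n similarity matrix, so 1 - proximity is a dissimilarity
p_prox_hc <- hclust(as.dist(1 - p_rf_prox$proximity), method = \"ward.D2\")
plot(p_prox_hc)
# observations that join the dendrogram late, or sit on their own branch,
# are candidates for unusual cases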
+ "objectID": "week5/index.html#main-reference", + "href": "week5/index.html#main-reference", + "title": "Week 5: Trees and forests", + "section": "", + "text": "ISLR 8.1, 8.2" }, { - "objectID": "week5/slides.html#utilising-diagnostics-13", - "href": "week5/slides.html#utilising-diagnostics-13", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Utilising diagnostics (1/3)", - "text": "Utilising diagnostics (1/3)\n\n\nThe votes matrix yields more information than the confusion matrix, about the confidence that the model has in the prediction for each observation, in the training set.\nIt is a \\(K\\)-D object, but lives in \\((K-1)\\)-D because the rows add to 1.\nLetโ€™s re-fit the random forest model to the three species of the penguins.\n\n\n\n\n\np_ternary" + "objectID": "week5/index.html#what-you-will-learn-this-week", + "href": "week5/index.html#what-you-will-learn-this-week", + "title": "Week 5: Trees and forests", + "section": "What you will learn this week", + "text": "What you will learn this week\n\nClassification trees, algorithm, stopping rules\nDifference between algorithm and parametric methods, especially trees vs LDA\nForests: ensembles of bagged trees\nDiagnostics: vote matrix, variable importance, proximity\nBoosted trees" }, { - "objectID": "week5/slides.html#utilising-diagnostics-23", - "href": "week5/slides.html#utilising-diagnostics-23", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Utilising diagnostics (2/3)", - "text": "Utilising diagnostics (2/3)\nDEMO: Use interactivity to investigate the uncertainty in the predictions.\n\nlibrary(detourr)\nlibrary(crosstalk)\nlibrary(plotly)\nlibrary(viridis)\np_tr2_std <- p_tr2 |>\n mutate_if(is.numeric, function(x) (x-mean(x))/sd(x))\np_tr2_v <- bind_cols(p_tr2_std, p_rf_v_p[,1:2]) \np_tr2_v_shared <- SharedData$new(p_tr2_v)\n\ndetour_plot <- detour(p_tr2_v_shared, tour_aes(\n projection = bl:bm,\n colour = species)) |>\n tour_path(grand_tour(2), \n max_bases=50, fps = 60) |>\n show_scatter(alpha = 0.9, axes = FALSE,\n width = \"100%\", \n height = \"450px\",\n palette = hcl.colors(3,\n palette=\"Zissou 1\"))\n\nvot_mat <- plot_ly(p_tr2_v_shared, \n x = ~x1,\n y = ~x2,\n color = ~species,\n colors = hcl.colors(3,\n palette=\"Zissou 1\"),\n height = 450) |>\n highlight(on = \"plotly_selected\", \n off = \"plotly_doubleclick\") %>%\n add_trace(type = \"scatter\", \n mode = \"markers\")\n \nbscols(\n detour_plot, vot_mat,\n widths = c(5, 6)\n )" + "objectID": "week5/index.html#lecture-slides", + "href": "week5/index.html#lecture-slides", + "title": "Week 5: Trees and forests", + "section": "Lecture slides", + "text": "Lecture slides\n\nhtml\npdf\nqmd\nR" }, { - "objectID": "week5/slides.html#utilising-diagnostics-33", - "href": "week5/slides.html#utilising-diagnostics-33", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Utilising diagnostics (3/3)", - "text": "Utilising diagnostics (3/3)\n\n\nVariable importance can help with variable selection.\n\n\np_fit_rf2$fit$importance\n\n MeanDecreaseGini\nbl 58\nbd 28\nfl 45\nbm 12\n\n\nTop two variables are bl and fl. \nEspecially useful when you have many more variables." 
+ "objectID": "week5/index.html#tutorial-instructions", + "href": "week5/index.html#tutorial-instructions", + "title": "Week 5: Trees and forests", + "section": "Tutorial instructions", + "text": "Tutorial instructions\nInstructions:\n\nhtml\nqmd" }, { - "objectID": "week5/slides.html#boosted-trees-13", - "href": "week5/slides.html#boosted-trees-13", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Boosted trees (1/3)", - "text": "Boosted trees (1/3)\nRandom forests build an ensemble of independent trees, while boosted trees build an ensemble from shallow trees in a sequence with each tree learning and improving on the previous one, by re-weighting observations to give mistakes more importance.\n\n\n\nSource: Boehmke (2020) Hands on Machine Learning with R" + "objectID": "week5/index.html#assignments", + "href": "week5/index.html#assignments", + "title": "Week 5: Trees and forests", + "section": "Assignments", + "text": "Assignments\n\nAssignment 2 is due on Friday 12 April." }, { - "objectID": "week5/slides.html#boosted-trees-23", - "href": "week5/slides.html#boosted-trees-23", + "objectID": "week4/slides.html#overview", + "href": "week4/slides.html#overview", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Boosted trees (2/3)", - "text": "Boosted trees (2/3)\nBoosting iteratively fits multiple trees, sequentially putting more weight on observations that have predicted inaccurately.\n\nSet weights (probabilities) for all observations in training set ( according to class sample sizes using log odds ratio). Fit a tree with fixed \\(d\\) splits ( \\(d+1\\) terminal nodes).\nFor b=1, 2, โ€ฆ, B, repeat:\n\nCompute fitted values \nCompute pseudo-residuals \nFit the tree to the residuals \nCompute new weights (probabilities)\n\nAggregate predictions from all trees.\n\nThis StatQuest video by Josh Starmer, is the best explanation!\nAnd this is a fun explanation of boosting by Harriet Mason." 
+ "section": "Overview", + "text": "Overview\nWe will cover:\n\nFitting a categorical response using logistic curves\nMultivariate summary statistics\nLinear discriminant analysis, assuming samples are elliptically shaped and equal in size\nQuadratic discriminant analysis, assuming samples are elliptically shaped and different in size\nDiscriminant space: making a low-dimensional visual summary" }, { - "objectID": "week5/slides.html#boosted-trees-33", - "href": "week5/slides.html#boosted-trees-33", + "objectID": "week4/slides.html#logistic-regression", + "href": "week4/slides.html#logistic-regression", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Boosted trees (3/3)", - "text": "Boosted trees (3/3)\n\nset.seed(1110)\nbt_spec <- boost_tree() |>\n set_mode(\"classification\") |>\n set_engine(\"xgboost\")\np_fit_bt <- bt_spec |> \n fit(species ~ ., data = p_tr2)\n\n\n\n# A tibble: 1 ร— 3\n .metric .estimator .estimate\n <chr> <chr> <dbl>\n1 accuracy multiclass 0.991\n\n\n# A tibble: 3 ร— 4\n# Groups: species [3]\n species Adelie Chinstrap Accuracy\n <fct> <int> <int> <dbl>\n1 Adelie 50 1 0.980\n2 Chinstrap 0 23 1 \n3 Gentoo 0 0 1" + "section": "Logistic regression", + "text": "Logistic regression" }, { - "objectID": "week5/slides.html#limitations-of-trees", - "href": "week5/slides.html#limitations-of-trees", + "objectID": "week4/slides.html#when-linear-regression-is-not-appropriate", + "href": "week4/slides.html#when-linear-regression-is-not-appropriate", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Limitations of trees", - "text": "Limitations of trees\n\nMost implementations only splits on a single variable, not combinations.\nThere are versions that build trees on combinations, eg PPTreeViz and PPforest, but you lose interpretability, and fitting is more difficult.\nSees only splits, but not gaps. (See support vector machines, in a few weeks.)\nAlgorithm takes variables in order, and splits in order, and will use first as best.\nNeed tuning and cross-validation." + "section": "When linear regression is not appropriate", + "text": "When linear regression is not appropriate\n\n\n Consider the following data Default in the ISLR R package (textbook) which looks at the default status based on credit balance.\n\nlibrary(ISLR)\ndata(Default)\nsimcredit <- Default |>\n mutate(default_bin = ifelse(default==\"Yes\", 1, 0))\n\n Why is a linear model less than ideal for this data?" }, { - "objectID": "week5/slides.html#next-neural-networks-and-deep-learning", - "href": "week5/slides.html#next-neural-networks-and-deep-learning", + "objectID": "week4/slides.html#modelling-binary-responses", + "href": "week4/slides.html#modelling-binary-responses", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Next: Neural networks and deep learning", - "text": "Next: Neural networks and deep learning\n\n\n\nETC3250/5250 Lecture 4 | iml.numbat.space" + "section": "Modelling binary responses", + "text": "Modelling binary responses\n\n\n\n\n\n\n\n\n\n\n\n\n\nOrange line (logistic model fit) is similar to computing a running average of the 0s/1s. Itโ€™s much better than the linear fit, because it remains between 0 and 1, and can be interpreted as proportion of 1s.\nWhat is a logistic function?" 
}, { - "objectID": "week4/tutorial.html", - "href": "week4/tutorial.html", - "title": "ETC3250/5250 Tutorial 4", - "section": "", - "text": "Load the libraries and avoid conflicts\n# Load libraries used everywhere\nlibrary(tidyverse)\nlibrary(tidymodels)\nlibrary(conflicted)\nlibrary(patchwork)\nlibrary(mulgar)\nlibrary(mvtnorm)\nlibrary(boot)\nlibrary(nullabor)\nlibrary(palmerpenguins)\nlibrary(GGally)\nconflicts_prefer(dplyr::filter)\nconflicts_prefer(dplyr::select)\nconflicts_prefer(dplyr::slice)\nconflicts_prefer(palmerpenguins::penguins)\n\noptions(digits=2)\np_tidy <- penguins |>\n select(species, bill_length_mm:body_mass_g) |>\n rename(bl=bill_length_mm,\n bd=bill_depth_mm,\n fl=flipper_length_mm,\n bm=body_mass_g) |>\n filter(!is.na(bl)) |>\n arrange(species)" + "objectID": "week4/slides.html#the-logistic-function", + "href": "week4/slides.html#the-logistic-function", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "The logistic function", + "text": "The logistic function\n\n\nInstead of predicting the outcome directly, we instead predict the probability of being class 1, given the (linear combination of) predictors, using the logistic function.\n\\[ p(y=1|\\beta_0 + \\beta_1 x) = f(x) \\] where\n\\[f(x) = \\frac{e^{\\beta_0+\\beta_1x}}{1+e^{\\beta_0+\\beta_1x}}\\]" }, { - "objectID": "week4/tutorial.html#objectives", - "href": "week4/tutorial.html#objectives", - "title": "ETC3250/5250 Tutorial 4", - "section": "๐ŸŽฏ Objectives", - "text": "๐ŸŽฏ Objectives\nThe goal for this week is for you to practice resampling methods, in order to tune models, assess model variance, and determine importance of variables." + "objectID": "week4/slides.html#logistic-function", + "href": "week4/slides.html#logistic-function", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Logistic function", + "text": "Logistic function\n\n\nTransform the function:\n\\[~~~~y = \\frac{e^{\\beta_0+\\beta_1x}}{1+e^{\\beta_0+\\beta_1x}}\\]\n\\(\\longrightarrow y = \\frac{1}{1/e^{\\beta_0+\\beta_1x}+1}\\)\n\\(\\longrightarrow 1/y = 1/e^{\\beta_0+\\beta_1x}+1\\)\n\\(\\longrightarrow 1/y - 1 = 1/e^{\\beta_0+\\beta_1x}\\)\n\\(\\longrightarrow \\frac{1}{1/y - 1} = e^{\\beta_0+\\beta_1x}\\)\n\\(\\longrightarrow \\frac{y}{1 - y} = e^{\\beta_0+\\beta_1x}\\)\n\\(\\longrightarrow \\log_e\\frac{y}{1 - y} = \\beta_0+\\beta_1x\\)\n\n\n \nTransforming the response \\(\\log_e\\frac{y}{1 - y}\\) makes it possible to use a linear model fit.\n \n\nThe left-hand side, \\(\\log_e\\frac{y}{1 - y}\\), is known as the log-odds ratio or logit." 
}, { - "objectID": "week4/tutorial.html#preparation", - "href": "week4/tutorial.html#preparation", - "title": "ETC3250/5250 Tutorial 4", - "section": "๐Ÿ”ง Preparation", - "text": "๐Ÿ”ง Preparation\n\nComplete the quiz\nDo the reading related to week 3" + "objectID": "week4/slides.html#the-logistic-regression-model", + "href": "week4/slides.html#the-logistic-regression-model", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "The logistic regression model", + "text": "The logistic regression model\nThe fitted model, where \\(P(Y=0|X) = 1 - P(Y=1|X)\\), is then written as:\n\n\\(\\log_e\\frac{P(Y=1|X)}{1 - P(Y=1|X)} = \\beta_0+\\beta_1X\\)\n\n When there are more than two categories:\n\nthe formula can be extended, using dummy variables.\nfollows from the above, extended to provide probabilities for each level/category, and the last category is 1-sum of the probabilities of other categories.\nthe sum of all probabilities has to be 1." }, { - "objectID": "week4/tutorial.html#exercises", - "href": "week4/tutorial.html#exercises", - "title": "ETC3250/5250 Tutorial 4", - "section": "Exercises:", - "text": "Exercises:\nOpen your project for this unit called iml.Rproj.\n\n1. Assess the significance of PC coefficients using bootstrap\nIn the lecture, we used bootstrap to examine the significance of the coefficients for the second principal component from the womensโ€™ track PCA. Do this computation for PC1. The question for you to answer is: Can we consider all of the coefficients to be equal?\nThe data can be read using:\n\ntrack <- read_csv(\"https://raw.githubusercontent.com/numbats/iml/master/data/womens_track.csv\")\n\n\n\n2. Using simulation to assess results when there is no structure\nThe ggscree function in the mulgar package computes PCA on multivariate standard normal samples, to learn what the largest eigenvalue might be when there the covariance between variables is 0.\n\nWhat is the mean and covariance matrix of a multivariate standard normal distribution?\n\n\nSimulate a sample of 55 observations from a 7D standard multivariate normal distribution. Compute the sample mean and covariance. (Question: Why 55 observations? Why 7D?)\n\n\nCompute PCA on your sample, and note the variance of the first PC. How does this compare with variance of the first PC of the womenโ€™s track data?\n\n\n\n3. Making a lineup plot to assess the dependence between variables\nPermutation samples is used to significance assess relationships and importance of variables. Here we will use it to assess the strength of a non-linear relationship.\n\nGenerate a sample of data that has a strong non-linear relationship but no correlation, as follows:\n\n\nset.seed(908)\nn <- 205\ndf <- tibble(x1 = runif(n)-0.5, x2 = x1^2 + rnorm(n)*0.01)\n\nand then use permutation to generate another 19 plots where x1 is permuted. You can do this with the nullabor package as follows:\n\nset.seed(912)\ndf_l <- lineup(null_permute('x1'), df)\n\nand make all 20 plots as follows:\n\nggplot(df_l, aes(x=x1, y=x2)) + \n geom_point() + \n facet_wrap(~.sample)\n\nIs the data plot recognisably different from the plots of permuted data?\n\nRepeat this with a sample simulated with no relationship between the two variables. Can the data be distinguished from the permuted data?\n\n\n\n4. 
Computing \\(k\\)-folds for cross-validation\nFor the penguins data, compute 5-fold cross-validation sets, stratified by species.\n\nList the observations in each sample, so that you can see there is no overlap.\n\n\nMake a scatterplot matrix for each fold, coloured by species. Do the samples look similar?\n\n\n\n5. What was the easiest part of this tutorial to understand, and what was the hardest?" + "objectID": "week4/slides.html#connection-to-generalised-linear-models", + "href": "week4/slides.html#connection-to-generalised-linear-models", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Connection to generalised linear models", + "text": "Connection to generalised linear models\n\nTo model binary data, we need to link our predictors to our response using a link function. Another way to think about it is that we will transform \\(Y\\), to convert it to a proportion, and then build the linear model on the transformed response.\nThere are many different types of link functions we could use, but for a binary response we typically use the logistic link function." }, { - "objectID": "week4/tutorial.html#finishing-up", - "href": "week4/tutorial.html#finishing-up", - "title": "ETC3250/5250 Tutorial 4", - "section": "๐Ÿ‘‹ Finishing up", - "text": "๐Ÿ‘‹ Finishing up\nMake sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult." + "objectID": "week4/slides.html#interpretation", + "href": "week4/slides.html#interpretation", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Interpretation", + "text": "Interpretation\n\nLinear regression\n\n\\(\\beta_1\\) gives the average change in \\(Y\\) associated with a one-unit increase in \\(X\\)\n\nLogistic regression\n\nBecause the model is not linear in \\(X\\), \\(\\beta_1\\) does not correspond to the change in response associated with a one-unit increase in \\(X\\).\nHowever, increasing \\(X\\) by one unit changes the log odds by \\(\\beta_1\\), or equivalently it multiplies the odds by \\(e^{\\beta_1}\\)" }, { - "objectID": "week4/tutorialsol.html", - "href": "week4/tutorialsol.html", - "title": "ETC3250/5250 Tutorial 4", - "section": "", - "text": "Load the libraries and avoid conflicts\n# Load libraries used everywhere\nlibrary(tidyverse)\nlibrary(tidymodels)\nlibrary(conflicted)\nlibrary(patchwork)\nlibrary(mulgar)\nlibrary(mvtnorm)\nlibrary(boot)\nlibrary(nullabor)\nlibrary(palmerpenguins)\nlibrary(GGally)\nconflicts_prefer(dplyr::filter)\nconflicts_prefer(dplyr::select)\nconflicts_prefer(dplyr::slice)\nconflicts_prefer(palmerpenguins::penguins)\n\noptions(digits=2)\np_tidy <- penguins |>\n select(species, bill_length_mm:body_mass_g) |>\n rename(bl=bill_length_mm,\n bd=bill_depth_mm,\n fl=flipper_length_mm,\n bm=body_mass_g) |>\n filter(!is.na(bl)) |>\n arrange(species)" + "objectID": "week4/slides.html#maximum-likelihood-estimation", + "href": "week4/slides.html#maximum-likelihood-estimation", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Maximum Likelihood Estimation", + "text": "Maximum Likelihood Estimation\nGiven the logistic \\(p(x_i) = \\frac{1}{e^{-(\\beta_0+\\beta_1x_i)}+1}\\) choose parameters \\(\\beta_0, \\beta_1\\) to maximize the likelihood:\n\\[\\mathcal{l}_n(\\beta_0, \\beta_1) = \\prod_{i=1}^n p(x_i)^{y_i}(1-p(x_i))^{1-y_i}.\\]\nIt is more convenient to maximize the log-likelihood:\n\\[\\begin{align*}\n\\log l_n(\\beta_0, \\beta_1) &= \\sum_{i = 1}^n \\big( y_i\\log p(x_i) + 
(1-y_i)\\log(1-p(x_i))\\big)\\\\\n&= \\sum_{i=1}^n\\big(y_i(\\beta_0+\\beta_1x_i)-\\log{(1+e^{\\beta_0+\\beta_1x_i})}\\big)\n\\end{align*}\\]" }, { - "objectID": "week4/tutorialsol.html#objectives", - "href": "week4/tutorialsol.html#objectives", - "title": "ETC3250/5250 Tutorial 4", - "section": "๐ŸŽฏ Objectives", - "text": "๐ŸŽฏ Objectives\nThe goal for this week is for you to practice resampling methods, in order to tune models, assess model variance, and determine importance of variables." + "objectID": "week4/slides.html#making-predictions", + "href": "week4/slides.html#making-predictions", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Making predictions", + "text": "Making predictions\n\n\nWith estimates from the model fit, \\(\\hat{\\beta}_0, \\hat{\\beta}_1\\), we can predict the probability of belonging to class 1 using:\n\\[p(y=1|\\hat{\\beta}_0 + \\hat{\\beta}_1 x) = \\frac{e^{\\hat{\\beta}_0+ \\hat{\\beta}_1x}}{1+e^{\\hat{\\beta}_0+ \\hat{\\beta}_1x}}\\] \nRound to 0 or 1 for class prediction.\n\nfit <- glm(default~balance, \n data=simcredit, family=\"binomial\") \nsimcredit_fit <- augment(fit, simcredit,\n type.predict=\"response\")\n\n\n\n\n\n\n\n\n\n\n\nOrange points are fitted values, \\(\\hat{y}_i\\). Black points are observed response, \\(y_i\\) (either 0 or 1)." }, { - "objectID": "week4/tutorialsol.html#preparation", - "href": "week4/tutorialsol.html#preparation", - "title": "ETC3250/5250 Tutorial 4", - "section": "๐Ÿ”ง Preparation", - "text": "๐Ÿ”ง Preparation\n\nComplete the quiz\nDo the reading related to week 3" + "objectID": "week4/slides.html#fitting-credit-data-in-r", + "href": "week4/slides.html#fitting-credit-data-in-r", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Fitting credit data in R", + "text": "Fitting credit data in R\n\n\nWe can use the glm function in R to fit a logistic regression model. The glm function can support many response types, so we specify family=\"binomial\" to let R know that our response is binary.\n\nfit <- glm(default~balance, \n data=simcredit, family=\"binomial\") \nsimcredit_fit <- augment(fit, simcredit,\n type.predict=\"response\")\n\n\n \nSame calculation but written in tidymodels style\n\nlogistic_mod <- logistic_reg() |> \n set_engine(\"glm\") |> \n set_mode(\"classification\") |> \n translate()\n\nlogistic_fit <- \n logistic_mod |> \n fit(default ~ balance, \n data = simcredit)" }, { - "objectID": "week4/tutorialsol.html#exercises", - "href": "week4/tutorialsol.html#exercises", - "title": "ETC3250/5250 Tutorial 4", - "section": "Exercises:", - "text": "Exercises:\nOpen your project for this unit called iml.Rproj.\n\n1. Assess the significance of PC coefficients using bootstrap\nIn the lecture, we used bootstrap to examine the significance of the coefficients for the second principal component from the womensโ€™ track PCA. Do this computation for PC1. 
The question for you to answer is: Can we consider all of the coefficients to be equal?\nThe data can be read using:\n\ntrack <- read_csv(\"https://raw.githubusercontent.com/numbats/iml/master/data/womens_track.csv\")\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\ncompute_PC1 <- function(data, index) {\n pc1 <- prcomp(data[index,], center=TRUE, scale=TRUE)$rotation[,1]\n # Coordinate signs\n if (sign(pc1[1]) < 0) \n pc1 <- -pc1 \n return(pc1)\n}\n# Make sure sign of first PC element is positive\nPC1_boot <- boot(data=track[,1:7], compute_PC1, R=1000)\ncolnames(PC1_boot$t) <- colnames(track[,1:7])\nPC1_boot_ci <- as_tibble(PC1_boot$t) %>%\n gather(var, coef) %>% \n mutate(var = factor(var, levels=c(\"m100\", \"m200\", \"m400\", \"m800\", \"m1500\", \"m3000\", \"marathon\"))) %>%\n group_by(var) %>%\n summarise(q2.5 = quantile(coef, 0.025), \n q5 = median(coef),\n q97.5 = quantile(coef, 0.975)) %>%\n mutate(t0 = PC1_boot$t0) \n \n# The red horizontal line indicates the null value \n# of the coefficient when all are equal.\nggplot(PC1_boot_ci, aes(x=var, y=t0)) + \n geom_hline(yintercept=1/sqrt(7), linetype=2, colour=\"red\") +\n geom_point() +\n geom_errorbar(aes(ymin=q2.5, ymax=q97.5), width=0.1) +\n #geom_hline(yintercept=0, linewidth=3, colour=\"white\") +\n xlab(\"\") + ylab(\"coefficient\") \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n2. Using simulation to assess results when there is no structure\nThe ggscree function in the mulgar package computes PCA on multivariate standard normal samples, to learn what the largest eigenvalue might be when there the covariance between variables is 0.\n\nWhat is the mean and covariance matrix of a multivariate standard normal distribution?\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\nThe mean is a \\(p\\)-dimensional vector of 0, and the covariance is a \\(p\\)-dimensional variance-covariance matrix.\n\n\n\n\n\nSimulate a sample of 55 observations from a 7D standard multivariate normal distribution. Compute the sample mean and covariance. (Question: Why 55 observations? Why 7D?)\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\nset.seed(854)\nd <- rmvnorm(55, mean = rep(0, 7), sigma = diag(7))\napply(d, 2, mean)\n\n[1] 0.271 0.125 0.054 -0.076 -0.012 -0.141 -0.055\n\ncov(d)\n\n [,1] [,2] [,3] [,4] [,5] [,6] [,7]\n[1,] 0.8162 -0.126 0.0102 -0.030 0.244 -0.0932 0.0097\n[2,] -0.1263 0.915 -0.0050 -0.051 -0.092 -0.1128 -0.0242\n[3,] 0.0102 -0.005 1.1710 0.077 0.387 -0.0019 0.1609\n[4,] -0.0298 -0.051 0.0766 0.659 0.027 0.1862 0.0463\n[5,] 0.2438 -0.092 0.3872 0.027 0.917 -0.1307 0.0143\n[6,] -0.0932 -0.113 -0.0019 0.186 -0.131 0.8257 0.0120\n[7,] 0.0097 -0.024 0.1609 0.046 0.014 0.0120 0.8046\n\n\n\n\n\n\n\nCompute PCA on your sample, and note the variance of the first PC. How does this compare with variance of the first PC of the womenโ€™s track data?\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\nd_pca <- prcomp(d, center=FALSE, scale=FALSE)\nd_pca$sdev^2\n\n[1] 1.55 1.15 1.04 0.77 0.68 0.56 0.48\n\n\nThe variance of the first PC of the womensโ€™ track data is 5.8, which is much higher than that from this sample. It says that there is substantially more variance explained by PC 1 of the womensโ€™s track data than would be expected if there was no association between any variables.\nYou should repeat generating the multivariate normal samples and computing the variance of PC 1 a few more times to learn what is the largest that would be observed.\n\n\n\n\n\n\n3. 
Making a lineup plot to assess the dependence between variables\nPermutation samples is used to significance assess relationships and importance of variables. Here we will use it to assess the strength of a non-linear relationship.\n\nGenerate a sample of data that has a strong non-linear relationship but no correlation, as follows:\n\n\nset.seed(908)\nn <- 205\ndf <- tibble(x1 = runif(n)-0.5, x2 = x1^2 + rnorm(n)*0.01)\n\nand then use permutation to generate another 19 plots where x1 is permuted. You can do this with the nullabor package as follows:\n\nset.seed(912)\ndf_l <- lineup(null_permute('x1'), df)\n\nand make all 20 plots as follows:\n\nggplot(df_l, aes(x=x1, y=x2)) + \n geom_point() + \n facet_wrap(~.sample)\n\nIs the data plot recognisably different from the plots of permuted data?\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\nThe data and the permuted data are very different. The permutation breaks any relationship between the two variables, so we know that there is NO relationship in any of the permuted data examples. This says that the relationship seen in the data is strongly statistically significant.\n\n\n\n\n\nRepeat this with a sample simulated with no relationship between the two variables. Can the data be distinguished from the permuted data?\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\nset.seed(916)\nn <- 205\ndf <- tibble(x1 = runif(n)-0.5, x2 = rnorm(n)*0.1)\ndf_l <- lineup(null_permute('x1'), df)\nggplot(df_l, aes(x=x1, y=x2)) + \n geom_point() + \n facet_wrap(~.sample)\n\n\n\n\n\n\n\n\nThe data cannot be distinguished from the permuted data, so there is no statistically significant relatiomship between the two variables.\n\n\n\n\n\n\n4. Computing \\(k\\)-folds for cross-validation\nFor the penguins data, compute 5-fold cross-validation sets, stratified by species.\n\nList the observations in each sample, so that you can see there is no overlap.\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\nset.seed(929)\np_folds <- vfold_cv(p_tidy, 5, strata=species)\nc(1:nrow(p_tidy))[-p_folds$splits[[1]]$in_id]\n\n [1] 3 6 31 36 42 44 51 53 59 62 65 66 67 79 85 88 93 96 103\n[20] 104 105 107 108 113 114 118 122 128 141 143 144 155 157 158 163 170 177 179\n[39] 182 194 195 202 204 211 213 221 222 224 226 239 246 248 256 258 264 265 275\n[58] 280 287 292 295 296 297 307 322 327 328 335 336 339\n\nc(1:nrow(p_tidy))[-p_folds$splits[[2]]$in_id]\n\n [1] 1 8 13 17 19 21 24 29 41 50 54 56 78 86 87 89 97 100 101\n[20] 112 117 121 123 129 130 132 133 139 149 150 152 159 166 167 168 169 171 189\n[39] 190 191 193 198 212 215 225 228 231 241 244 249 250 259 260 262 266 268 269\n[58] 270 271 272 276 282 283 284 288 321 331 337 342\n\nc(1:nrow(p_tidy))[-p_folds$splits[[3]]$in_id]\n\n [1] 4 9 10 15 25 30 32 35 37 39 43 47 48 55 57 64 69 71 80\n[20] 82 91 109 111 116 124 127 134 136 140 147 162 176 178 180 186 199 200 203\n[39] 207 208 210 216 218 219 220 229 232 236 240 243 247 252 254 261 267 277 279\n[58] 286 290 299 300 303 306 308 312 320 325 326 329\n\nc(1:nrow(p_tidy))[-p_folds$splits[[4]]$in_id]\n\n [1] 5 11 18 20 22 23 27 28 33 34 52 70 72 73 75 77 81 90 92\n[20] 94 95 106 110 119 125 137 138 142 145 151 154 156 160 161 165 174 181 183\n[39] 187 192 196 206 214 223 227 234 237 238 245 255 257 274 281 285 289 293 294\n[58] 298 302 313 314 315 317 324 330 332 338\n\nc(1:nrow(p_tidy))[-p_folds$splits[[5]]$in_id]\n\n [1] 2 7 12 14 16 26 38 40 45 46 49 58 60 61 63 68 74 76 83\n[20] 84 98 99 102 115 120 126 131 135 146 148 153 164 172 173 175 184 185 188\n[39] 197 201 205 209 217 230 233 235 242 251 253 263 273 278 
291 301 304 305 309\n[58] 310 311 316 318 319 323 333 334 340 341\n\n\n\n\n\n\n\nMake a scatterplot matrix for each fold, coloured by species. Do the samples look similar?\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\np_sub <- p_tidy[-p_folds$splits[[1]]$in_id, ]\nggscatmat(p_sub, columns=2:5, color=\"species\") +\n theme(legend.position=\"none\",\n axis.text = element_blank())\n\n\n\n\n\n\n\np_sub <- p_tidy[-p_folds$splits[[2]]$in_id, ]\nggscatmat(p_sub, columns=2:5, color=\"species\") +\n theme(legend.position=\"none\",\n axis.text = element_blank())\n\n\n\n\n\n\n\np_sub <- p_tidy[-p_folds$splits[[3]]$in_id, ]\nggscatmat(p_sub, columns=2:5, color=\"species\") +\n theme(legend.position=\"none\",\n axis.text = element_blank())\n\n\n\n\n\n\n\np_sub <- p_tidy[-p_folds$splits[[4]]$in_id, ]\nggscatmat(p_sub, columns=2:5, color=\"species\") +\n theme(legend.position=\"none\",\n axis.text = element_blank())\n\n\n\n\n\n\n\np_sub <- p_tidy[-p_folds$splits[[5]]$in_id, ]\nggscatmat(p_sub, columns=2:5, color=\"species\") +\n theme(legend.position=\"none\",\n axis.text = element_blank())\n\n\n\n\n\n\n\n\nThe folds are similar but there are some noticeable differences that might lead to variation in the statistics that are calculated from each other. However, one should consider this variation something that might generally occur if we had different samples.\n\n\n\n\n\n\n5. What was the easiest part of this tutorial to understand, and what was the hardest?" + "objectID": "week4/slides.html#examine-the-fit", + "href": "week4/slides.html#examine-the-fit", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Examine the fit", + "text": "Examine the fit\n\n\n\ntidy(logistic_fit) \n\n# A tibble: 2 ร— 5\n term estimate std.error statistic p.value\n <chr> <dbl> <dbl> <dbl> <dbl>\n1 (Intercept) -10.7 0.361 -29.5 3.62e-191\n2 balance 0.00550 0.000220 25.0 1.98e-137\n\nglance(logistic_fit) \n\n# A tibble: 1 ร— 8\n null.deviance df.null logLik AIC BIC deviance\n <dbl> <int> <dbl> <dbl> <dbl> <dbl>\n1 2921. 9999 -798. 1600. 1615. 1596.\n# โ„น 2 more variables: df.residual <int>, nobs <int>\n\n\n\n\n\nParameter estimates\n\\(\\widehat{\\beta}_0 =\\) -10.65\n\\(\\widehat{\\beta}_1 =\\) 0.01\nCan you write out the model?\n\n\nModel fit summary\nNull model deviance 2920.6 (error for model with no predictors)\nModel deviance 1596.5 (error from fitted model)\nHow good is the model?" }, { - "objectID": "week4/tutorialsol.html#finishing-up", - "href": "week4/tutorialsol.html#finishing-up", - "title": "ETC3250/5250 Tutorial 4", - "section": "๐Ÿ‘‹ Finishing up", - "text": "๐Ÿ‘‹ Finishing up\nMake sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult." 
+ "objectID": "week4/slides.html#check-the-model-performance", + "href": "week4/slides.html#check-the-model-performance", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Check the model performance", + "text": "Check the model performance\n\n\n\nsimcredit_fit <- augment(logistic_fit, simcredit) \nsimcredit_fit |> \n count(default, .pred_class) |>\n group_by(default) |>\n mutate(Accuracy = n[.pred_class==default]/sum(n)) |>\n pivot_wider(names_from = \".pred_class\", values_from = n) |>\n select(default, No, Yes, Accuracy)\n\n# A tibble: 2 ร— 4\n# Groups: default [2]\n default No Yes Accuracy\n <fct> <int> <int> <dbl>\n1 No 9625 42 0.996\n2 Yes 233 100 0.300\n\n\nCompute the balanced accuracy.\nUnbalanced data set, with very different performance on each class.\n\nHow good is this model?\n\n\n\nExplains about half of the variation in the response, which would normally be reasonable.\nGets most of the smaller but important class wrong.\nNot a very useful model." }, { - "objectID": "week4/index.html", - "href": "week4/index.html", - "title": "Week 4: Logistic regression and discriminant analysis", - "section": "", - "text": "ISLR 4.3, 4.4" + "objectID": "week4/slides.html#a-warning-for-using-glms", + "href": "week4/slides.html#a-warning-for-using-glms", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "A warning for using GLMs!", + "text": "A warning for using GLMs!\n\n\n\n\nLogistic regression model fitting fails when the data is perfectly separated.\n\nMLE fit will try and fit a step-wise function to this graph, pushing coefficients sizes towards infinity and produce large standard errors.\nPay attention to warnings!\n\n\n\n\n\n\n\n\n\n\n\nlogistic_fit <- \n logistic_mod |> \n fit(default_new ~ balance, \n data = simcredit)\n\nWarning: glm.fit: algorithm did not converge\n\n\nWarning: glm.fit: fitted probabilities numerically 0 or 1\noccurred" }, { - "objectID": "week4/index.html#main-reference", - "href": "week4/index.html#main-reference", - "title": "Week 4: Logistic regression and discriminant analysis", - "section": "", - "text": "ISLR 4.3, 4.4" + "objectID": "week4/slides.html#discriminant-analysis", + "href": "week4/slides.html#discriminant-analysis", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Discriminant Analysis", + "text": "Discriminant Analysis" }, { - "objectID": "week4/index.html#what-you-will-learn-this-week", - "href": "week4/index.html#what-you-will-learn-this-week", - "title": "Week 4: Logistic regression and discriminant analysis", - "section": "What you will learn this week", - "text": "What you will learn this week\n\nFitting a categorical response using logistic curves\nMultivariate summary statistics\nLinear discriminant analysis, assuming samples are elliptically shaped and equal in size\nQuadratic discriminant analysis, assuming samples are elliptically shaped and different in size\nDiscriminant space: making a low-dimensional visual summary" + "objectID": "week4/slides.html#linear-discriminant-analysis", + "href": "week4/slides.html#linear-discriminant-analysis", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Linear Discriminant Analysis", + "text": "Linear Discriminant Analysis\n\n\n\n\n\n\n\n\n\n\n\nWhere would you draw a line to create a boundary separating Adelie and Gentoo penguins?\n\n\n\nWhere are the sample means?\nWhat is the shape of the sample variance-covariance?\n\n\n\nLinear discriminant analysis assumes the distribution of the predictors is a 
multivariate normal, with the same variance-covariance matrix, separately for each class." }, { - "objectID": "week4/index.html#lecture-slides", - "href": "week4/index.html#lecture-slides", - "title": "Week 4: Logistic regression and discriminant analysis", - "section": "Lecture slides", - "text": "Lecture slides\n\nhtml\npdf\nqmd\nR" + "objectID": "week4/slides.html#assumptions-underlie-lda", + "href": "week4/slides.html#assumptions-underlie-lda", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Assumptions underlie LDA", + "text": "Assumptions underlie LDA\n\n\n\n\nSource: https://xkcd.com\n\n\n\n\nAll samples come from normal populations\nwith the same population variance-covariance matrix" }, { - "objectID": "week4/index.html#tutorial-instructions", - "href": "week4/index.html#tutorial-instructions", - "title": "Week 4: Logistic regression and discriminant analysis", - "section": "Tutorial instructions", - "text": "Tutorial instructions\nInstructions:\n\nhtml\nqmd" + "objectID": "week4/slides.html#lda-with-p1-predictors-14", + "href": "week4/slides.html#lda-with-p1-predictors-14", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "LDA with \\(p=1\\) predictors 1/4", + "text": "LDA with \\(p=1\\) predictors 1/4\n\n\nIf \\(K = 2\\) (two classes labelled A and B) and each group has the same prior probability, the LDA rule is to assign the new observation \\(x_0\\) to class A if\n\n\\[\nx_0 > \\frac{\\bar{x}_A + \\bar{x}_B}{2}\n\\]\n\n\n\nItโ€™s a really intuitive rule, eh?\nIt also matters which of the two classes is considered to be A!!!\nSo maybe easier to think about as โ€œassign the new observation to the group with the closest meanโ€.\nHow does this rule arise from the assumptions?" }, { - "objectID": "week4/index.html#assignments", - "href": "week4/index.html#assignments", - "title": "Week 4: Logistic regression and discriminant analysis", - "section": "Assignments", - "text": "Assignments" + "objectID": "week4/slides.html#bayes-theorem-24", + "href": "week4/slides.html#bayes-theorem-24", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Bayes Theorem 2/4", + "text": "Bayes Theorem 2/4\nLet \\(f_k(x)\\) be the density function for predictor \\(x\\) for class \\(k\\). If \\(f\\) is large, the probability that \\(x\\) belongs to class \\(k\\) is large, or if \\(f\\) is small it is unlikely that \\(x\\) belongs to class \\(k\\).\nAccording to Bayes theorem (for \\(K\\) classes) the probability that \\(x\\) belongs to class \\(k\\) is:\n\\[P(Y = k|X = x) = p_k(x) = \\frac{\\pi_kf_k(x)}{\\sum_{i=1}^K \\pi_kf_k(x)}\\]\nwhere \\(\\pi_k\\) is the prior probability that an observation comes from class \\(k\\)." }, { - "objectID": "week4/index.html#assignments-1", - "href": "week4/index.html#assignments-1", - "title": "Week 4: Logistic regression and discriminant analysis", - "section": "Assignments", - "text": "Assignments\n\nAssignment 1 is due on Friday 22 March.\nAssignment 2 is due on Friday 12 April." 
+ "objectID": "week4/slides.html#lda-with-p1-predictors-34", + "href": "week4/slides.html#lda-with-p1-predictors-34", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "LDA with \\(p=1\\) predictors 3/4", + "text": "LDA with \\(p=1\\) predictors 3/4\n\n\nThe density function \\(f_k(x)\\) of a univariate normal (Gaussian) is\n\\[\nf_k(x) = \\frac{1}{\\sqrt{2 \\pi} \\sigma_k} \\text{exp}~ \\left( - \\frac{1}{2 \\sigma^2_k} (x - \\mu_k)^2 \\right)\n\\]\nwhere \\(\\mu_k\\) and \\(\\sigma^2_k\\) are the mean and variance parameters for the \\(k\\)th class. We also assume that \\(\\sigma_1^2 = \\sigma_2^2 = \\dots = \\sigma_K^2\\); then the conditional probabilities are\n\\[\np_k(x) = \\frac{\\pi_k \\frac{1}{\\sqrt{2 \\pi} \\sigma} \\text{exp}~ \\left( - \\frac{1}{2 \\sigma^2} (x - \\mu_k)^2 \\right) }{ \\sum_{l = 1}^K \\pi_l \\frac{1}{\\sqrt{2 \\pi} \\sigma} \\text{exp}~ \\left( - \\frac{1}{2 \\sigma^2} (x - \\mu_l)^2 \\right) }\n\\]" }, { - "objectID": "week3/slides.html#overview", - "href": "week3/slides.html#overview", + "objectID": "week4/slides.html#lda-with-p1-predictors-44", + "href": "week4/slides.html#lda-with-p1-predictors-44", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Overview", - "text": "Overview\nWe will cover:\n\nCommon re-sampling methods: bootstrap, cross-validation, permutation, simulation.\nCross-validation for checking generalisability of model fit, parameter tuning, variable selection.\nBootstrapping for understanding variance of parameter estimates.\nPermutation to understand significance of associations between variables, and variable importance.\nSimulation can be used to assess what might happen with samples from known distributions.\nWhat can go wrong in high-d, and how to adjust using regularisation methods." 
+ "section": "LDA with \\(p=1\\) predictors 4/4", + "text": "LDA with \\(p=1\\) predictors 4/4\n\n\nA simplification of \\(p_k(x_0)\\) yields the discriminant functions, \\(\\delta_k(x_0)\\):\n\\[\\delta_k(x_0) = x_0 \\frac{\\mu_k}{\\sigma^2} - \\frac{\\mu_k^2}{2 \\sigma^2} + log(\\pi_k)\\] from which the LDA rule will assign \\(x_0\\) to the class \\(k\\) with the largest value.\n\nLet \\(K=2\\), then the rule reduces to assign \\(x_0\\) to class A if\n\\[\\begin{align*}\n& \\frac{\\pi_A \\frac{1}{\\sqrt{2 \\pi} \\sigma} \\text{exp}~ \\left( - \\frac{1}{2 \\sigma^2} (x - \\mu_A)^2 \\right) }{ \\sum_{l = 1}^2 \\pi_l \\frac{1}{\\sqrt{2 \\pi} \\sigma} \\text{exp}~ \\left( - \\frac{1}{2 \\sigma^2} (x - \\mu_l)^2 \\right) } > \\frac{\\pi_B \\frac{1}{\\sqrt{2 \\pi} \\sigma} \\text{exp}~ \\left( - \\frac{1}{2 \\sigma^2} (x - \\mu_B)^2 \\right) }{ \\sum_{l = 1}^2 \\pi_l \\frac{1}{\\sqrt{2 \\pi} \\sigma} \\text{exp}~ \\left( - \\frac{1}{2 \\sigma^2} (x - \\mu_l)^2 \\right) }\\\\\n &\\longrightarrow \\pi_A \\frac{1}{\\sqrt{2 \\pi} \\sigma} \\text{exp}~ \\left( - \\frac{1}{2 \\sigma^2} (x_0 - \\mu_A)^2 \\right) > \\pi_B \\frac{1}{\\sqrt{2 \\pi} \\sigma} \\text{exp}~ \\left( - \\frac{1}{2 \\sigma^2} (x_0 - \\mu_B)^2 \\right)\\\\\n &\\longrightarrow \\pi_A \\text{exp}~ \\left( - \\frac{1}{2 \\sigma^2} (x_0 - \\mu_A)^2 \\right) > \\pi_B \\text{exp}~ \\left( - \\frac{1}{2 \\sigma^2} (x_0 - \\mu_B)^2 \\right) \\\\\n &\\longrightarrow \\log \\pi_A - \\frac{1}{2 \\sigma^2} (x_0 - \\mu_A)^2 > \\log \\pi_B - \\frac{1}{2 \\sigma^2} (x_0 - \\mu_B)^2\\\\\n &\\longrightarrow \\log \\pi_A - \\frac{1}{2 \\sigma^2} (x_0^2 - 2x_0\\mu_A + \\mu_A^2) > \\log \\pi_B - \\frac{1}{2 \\sigma^2} (x_0^2 - 2x_0\\mu_B + \\mu_B^2) \\\\\n &\\longrightarrow \\log \\pi_A - \\frac{1}{2 \\sigma^2} (-2x_0\\mu_A + \\mu_A^2) > \\log \\pi_B - \\frac{1}{2 \\sigma^2} (-2x_0\\mu_B + \\mu_B^2) \\\\\n &\\longrightarrow \\log \\pi_A + \\frac{x_0\\mu_A}{\\sigma^2} - \\frac{\\mu_A^2}{\\sigma^2} > \\log \\pi_B + \\frac{x_0\\mu_B}{\\sigma^2} - \\frac{\\mu_B^2}{\\sigma^2} \\\\\n &\\longrightarrow \\underbrace{x_0\\frac{\\mu_A}{\\sigma^2} - \\frac{\\mu_A^2}{\\sigma^2} + \\log \\pi_A}_{\\text{Discriminant function for class A}} > \\underbrace{x_0\\frac{\\mu_B}{\\sigma^2} - \\frac{\\mu_B^2}{\\sigma^2} + \\log \\pi_B}_{\\text{Discriminant function for class B}}\n\\end{align*}\\]" }, { - "objectID": "week3/slides.html#model-development-and-choice", - "href": "week3/slides.html#model-development-and-choice", + "objectID": "week4/slides.html#multivariate-lda-p1", + "href": "week4/slides.html#multivariate-lda-p1", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Model development and choice", - "text": "Model development and choice" + "section": "Multivariate LDA, \\(p>1\\)", + "text": "Multivariate LDA, \\(p>1\\)\nA \\(p\\)-dimensional random variable \\(X\\) has a multivariate Gaussian distribution with mean \\(\\mu\\) and variance-covariance \\(\\Sigma\\), we write \\(X \\sim N(\\mu, \\Sigma)\\).\nThe multivariate normal density function is:\n\\[f(x) = \\frac{1}{(2\\pi)^{p/2}|\\Sigma|^{1/2}} \\exp\\{-\\frac{1}{2}(x-\\mu)^\\top\\Sigma^{-1}(x-\\mu)\\}\\]\nwith \\(x, \\mu\\) are \\(p\\)-dimensional vectors, \\(\\Sigma\\) is a \\(p\\times p\\) variance-covariance matrix." 
}, { - "objectID": "week3/slides.html#how-do-you-get-new-data", - "href": "week3/slides.html#how-do-you-get-new-data", + "objectID": "week4/slides.html#multivariate-lda-k2", + "href": "week4/slides.html#multivariate-lda-k2", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "How do you get new data?", - "text": "How do you get new data?" + "section": "Multivariate LDA, \\(K=2\\)", + "text": "Multivariate LDA, \\(K=2\\)\nThe discriminant functions are:\n\\[\\delta_k(x) = x^\\top\\Sigma^{-1}\\mu_k - \\frac{1}{2}\\mu_k^\\top\\Sigma^{-1}\\mu_k + \\log(\\pi_k)\\]\nand Bayes classifier is assign a new observation \\(x_0\\) to the class with the highest \\(\\delta_k(x_0)\\).\nWhen \\(K=2\\) and \\(\\pi_A=\\pi_B\\) this reduces to\nAssign observation \\(x_0\\) to class A if\n\\[x_0^\\top\\underbrace{\\Sigma^{-1}(\\mu_A-\\mu_B)}_{dimension~reduction} > \\frac{1}{2}(\\mu_A+\\mu_B)^\\top\\underbrace{\\Sigma^{-1}(\\mu_A-\\mu_B)}_{dimension~reduction}\\]\nNOTE: Class A and B need to be mapped to the classes in the your data. The class โ€œto the rightโ€ on the reduced dimension will correspond to class A in this equation." }, { - "objectID": "week3/slides.html#common-re-sampling-methods", - "href": "week3/slides.html#common-re-sampling-methods", + "objectID": "week4/slides.html#computation", + "href": "week4/slides.html#computation", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Common re-sampling methods", - "text": "Common re-sampling methods\n\n\n\n\nCross-validation: Splitting the data into multiple samples.\nBootstrap: Sampling with replacement\nPermutation: Re-order the values of one or more variables\n\n\n\n\nCross-validation: This is used to gain some understanding of the variance (as in bias-variance trade-off ) of models, and how parameter or algorithm choices affect the performance of the model on future samples.\n\n\n\n\n\nBootstrap: Compute confidence intervals for model parameters, or the model fit statistics. can be used similarly to cross-validation samples but avoids the complication of smaller sample size that may affect interpretation of cross-validation samples.\nPermutation: Used to assess significance of relationships, especially to assess the importance of individual variables or combinations of variables for a fitted model." + "section": "Computation", + "text": "Computation\n Use sample mean \\(\\bar{x}_k\\) to estimate \\(\\mu_k\\), and\n\nto estimate \\(\\Sigma\\) use the pooled variance-covariance:\n\\[\nS = \\frac{n_1S_1 + n_2S_2+ \\dots +n_kS_k}{n_1+n_2+\\dots +n_k}\n\\]" }, { - "objectID": "week3/slides.html#cross-validation", - "href": "week3/slides.html#cross-validation", + "objectID": "week4/slides.html#example-penguins-13", + "href": "week4/slides.html#example-penguins-13", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Cross-validation", - "text": "Cross-validation\n\nTraining/test split: make one split of your data, keeping one purely for assessing future performance.\n\nAfter making that split, we would use these methods on the training sample:\n\nLeave-one-out: make \\(n\\) splits, fitting multiple models and using left-out observation for assessing variability.\n\\(k\\)-fold: break data into \\(k\\) subsets, fitting multiple models with one group left out each time." + "section": "Example: penguins 1/3", + "text": "Example: penguins 1/3\n\n\nSummary statistics\n\n\n# A tibble: 2 ร— 3\n species bm bd\n <fct> <dbl> <dbl>\n1 Adelie 3701. 18.3\n2 Gentoo 5076. 
15.0\n\n\n bm bd\nbm 210283 321.4\nbd 321 1.5\n\n\n bm bd\nbm 254133 355.69\nbd 356 0.96\n\n\n\n\n\n\n\n\n\n\n\nlibrary(discrim)\nlda_spec <- discrim_linear() |>\n set_mode(\"classification\") |>\n set_engine(\"MASS\", prior = c(0.5, 0.5))\nlda_fit <- lda_spec |> \n fit(species ~ bm + bd, data = p_sub)\n\nlda_fit\n\nparsnip model object\n\nCall:\nlda(species ~ bm + bd, data = data, prior = ~c(0.5, 0.5))\n\nPrior probabilities of groups:\nAdelie Gentoo \n 0.5 0.5 \n\nGroup means:\n bm bd\nAdelie 3701 18\nGentoo 5076 15\n\nCoefficients of linear discriminants:\n LD1\nbm 0.0024\nbd -1.0444\n\n\n\nRecommendation: standardise the variables before fitting model, even though it is not necessary for LDA." }, { - "objectID": "week3/slides.html#trainingtest-split-13", - "href": "week3/slides.html#trainingtest-split-13", + "objectID": "week4/slides.html#example-penguins-23", + "href": "week4/slides.html#example-penguins-23", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Training/test split (1/3)", - "text": "Training/test split (1/3)\n \nA set of \\(n\\) observations are randomly split into a training set (blue, containing observations 7, 22, 13, โ€ฆ) and a test set (yellow, all other observations not in training set).\n\nNeed to stratify the sampling to ensure training and test groups are appropriately balanced.\nOnly one split of data made, may have a lucky or unlucky split, accurately estimating test error relies on the one sample.\n\n (Chapter5/5.1.pdf)" + "section": "Example: penguins 2/3", + "text": "Example: penguins 2/3\n\n\nSummary statistics\n\n\n# A tibble: 2 ร— 3\n species bm bd\n <fct> <dbl> <dbl>\n1 Adelie -0.739 0.750\n2 Gentoo 0.907 -0.921\n\n\n bm bd\nbm 0.30 0.19\nbd 0.19 0.37\n\n\n bm bd\nbm 0.36 0.21\nbd 0.21 0.24\n\n\n\n\n\n\n\n\n\n\n\nlibrary(discrim)\nlda_spec <- discrim_linear() |>\n set_mode(\"classification\") |>\n set_engine(\"MASS\", prior = c(0.5, 0.5))\nlda_fit <- lda_spec |> \n fit(species ~ bm + bd, data = p_sub)\n\nlda_fit\n\nparsnip model object\n\nCall:\nlda(species ~ bm + bd, data = data, prior = ~c(0.5, 0.5))\n\nPrior probabilities of groups:\nAdelie Gentoo \n 0.5 0.5 \n\nGroup means:\n bm bd\nAdelie -0.74 0.75\nGentoo 0.91 -0.92\n\nCoefficients of linear discriminants:\n LD1\nbm 2.0\nbd -2.1\n\n\nEasier to see that both variables contribute almost equally to the classification." }, { - "objectID": "week3/slides.html#trainingtest-split-23", - "href": "week3/slides.html#trainingtest-split-23", + "objectID": "week4/slides.html#example-penguins-33", + "href": "week4/slides.html#example-penguins-33", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Training/test split (2/3)", - "text": "Training/test split (2/3)\n\n\nWith tidymodels, the function initial_split() creates the indexes of observations to be allocated into training or test samples. 
To generate these samples use training() and test() functions.\n\nd_bal <- tibble(y=c(rep(\"A\", 6), rep(\"B\", 6)),\n x=c(runif(12)))\nd_bal$y\n\n [1] \"A\" \"A\" \"A\" \"A\" \"A\" \"A\" \"B\" \"B\" \"B\" \"B\" \"B\" \"B\"\n\nset.seed(130)\nd_bal_split <- initial_split(d_bal, prop = 0.70)\ntraining(d_bal_split)$y\n\n[1] \"A\" \"A\" \"B\" \"A\" \"B\" \"A\" \"B\" \"A\"\n\ntesting(d_bal_split)$y\n\n[1] \"A\" \"B\" \"B\" \"B\"\n\n\n\nHow do you ensure that you get 0.70 in each class?\n\n\n\nStratify the sampling\n\nd_bal$y\n\n [1] \"A\" \"A\" \"A\" \"A\" \"A\" \"A\" \"B\" \"B\" \"B\" \"B\" \"B\" \"B\"\n\nset.seed(1225)\nd_bal_split <- initial_split(d_bal, \n prop = 0.70, \n strata=y)\ntraining(d_bal_split)$y\n\n[1] \"A\" \"A\" \"A\" \"A\" \"B\" \"B\" \"B\" \"B\"\n\ntesting(d_bal_split)$y\n\n[1] \"A\" \"A\" \"B\" \"B\"\n\n\nNow the test set has 2 Aโ€™s and 2 Bโ€™2. This is best practice!" + "section": "Example: penguins 3/3", + "text": "Example: penguins 3/3\n\n\n\\[\nS^{-1}(\\bar{x}_A - \\bar{x}_B)\n\\]\n\nS1 <- cov(p_sub[p_sub$species == \"Adelie\",-1])\nS2 <- cov(p_sub[p_sub$species == \"Gentoo\",-1])\nSp <- (S1+S2)/2\nSp\n\n bm bd\nbm 0.33 0.2\nbd 0.20 0.3\n\nSpinv <- solve(Sp)\nSpinv\n\n bm bd\nbm 5.1 -3.4\nbd -3.4 5.6\n\nm1 <- as.matrix(lda_fit$fit$means[1,], ncol=1)\nm1\n\n [,1]\nbm -0.74\nbd 0.75\n\nm2 <- as.matrix(lda_fit$fit$means[2,], ncol=1)\nm2\n\n [,1]\nbm 0.91\nbd -0.92\n\nSpinv %*% (m1-m2)\n\n [,1]\nbm -14\nbd 15\n\n\n\n\\[\nx_0 S^{-1}(\\bar{x}_A - \\bar{x}_B) > \\frac{\\bar{x}_A + \\bar{x}_B}{2} S^{-1}(\\bar{x}_A - \\bar{x}_B)\n\\]\n\n(m1 + m2)/2\n\n [,1]\nbm 0.084\nbd -0.085\n\nmatrix((m1 + m2)/2, ncol=2) %*% Spinv %*% (m1-m2)\n\n [,1]\n[1,] -2.4\n\n\nIf \\(x_0\\) is -0.68, 0.93, what species is it?\n\n\nas.matrix(p_sub[1,-1]) %*% Spinv %*% (m1-m2)\n\n [,1]\n[1,] 23\n\n\nIs Adelie class A or is Gentoo class A?\n\n\nCheck by plugging in the means\n\nt(m1) %*% Spinv %*% (m1-m2)\n\n [,1]\n[1,] 21\n\n\n\n\n\npredict(lda_fit, p_sub[1,-1])$.pred_class\n\n[1] Adelie\nLevels: Adelie Gentoo" }, { - "objectID": "week3/slides.html#trainingtest-split-33", - "href": "week3/slides.html#trainingtest-split-33", + "objectID": "week4/slides.html#dimension-reduction", + "href": "week4/slides.html#dimension-reduction", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Training/test split (3/3)", - "text": "Training/test split (3/3)\n\n\nNot stratifying can cause major problems with unbalanced samples.\n\nd_unb <- tibble(y=c(rep(\"A\", 2), rep(\"B\", 10)),\n x=c(runif(12)))\nd_unb$y\n\n [1] \"A\" \"A\" \"B\" \"B\" \"B\" \"B\" \"B\" \"B\" \"B\" \"B\" \"B\" \"B\"\n\nset.seed(132)\nd_unb_split <- initial_split(d_unb, prop = 0.70)\ntraining(d_unb_split)$y\n\n[1] \"B\" \"B\" \"A\" \"B\" \"B\" \"A\" \"B\" \"B\"\n\ntesting(d_unb_split)$y\n\n[1] \"B\" \"B\" \"B\" \"B\"\n\n\nThe test set is missing one entire class!\n\n\n\nAlways stratify splitting by sub-groups, especially response variable classes, and possibly other variables too.\n\n\nd_unb_strata <- initial_split(d_unb, \n prop = 0.70, \n strata=y)\ntraining(d_unb_strata)$y\n\n[1] \"A\" \"B\" \"B\" \"B\" \"B\" \"B\" \"B\" \"B\"\n\ntesting(d_unb_strata)$y\n\n[1] \"A\" \"B\" \"B\" \"B\"\n\n\nNow there is an A in the test set!" 
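A short sketch of such a check (the p_tr and p_ts names are assumed here, standing for any stratified training/test split like the ones above): tabulate the counts and proportions of each class in the two sets and compare.

bind_rows(train = count(p_tr, species),
          test = count(p_ts, species),
          .id = \"set\") |>
  group_by(set) |>
  mutate(prop = n / sum(n)) |>
  ungroup()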
- }, - { - "objectID": "week3/slides.html#checking-the-trainingtest-split-response", - "href": "week3/slides.html#checking-the-trainingtest-split-response", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Checking the training/test split: response", - "text": "Checking the training/test split: response\n\n\n\nGOOD\n\n\n\n\n\n\n\n\n\n\n\n\nBAD\n\n\n\n\n\n\n\n\n\n\n\n Check the class proportions of the response by computing counts and proportions in each class, and tabulating or plotting the result. Itโ€™s good if there are similar numbers of each class in both sets." - }, - { - "objectID": "week3/slides.html#checking-the-trainingtest-split-predictors", - "href": "week3/slides.html#checking-the-trainingtest-split-predictors", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Checking the training/test split: predictors", - "text": "Checking the training/test split: predictors\n\n\n\nGOOD\n\n\nMake a training/test variable and plot the predictors. Need to have similar distributions.\n\n\nLooks good\n\n\n\n\n\n\n\n\n\n\nOn the response training and test sets have similar proportions of each class so looks good BUT itโ€™s not\n\n\nBut BAD\n\n\nTest set has smaller penguins on at least two of the variables." - }, - { - "objectID": "week3/slides.html#cross-validation-1", - "href": "week3/slides.html#cross-validation-1", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Cross-validation", - "text": "Cross-validation" + "section": "Dimension reduction", + "text": "Dimension reduction" }, { - "objectID": "week3/slides.html#k-fold-cross-validation-14", - "href": "week3/slides.html#k-fold-cross-validation-14", + "objectID": "week4/slides.html#dimension-reduction-via-lda", + "href": "week4/slides.html#dimension-reduction-via-lda", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "k-fold cross validation (1/4)", - "text": "k-fold cross validation (1/4)\n\n\n\nDivide the data set into \\(k\\) different parts.\nRemove one part, fit the model on the remaining \\(k โˆ’ 1\\) parts, and compute the statistic of interest on the omitted part.\nRepeat \\(k\\) times taking out a different part each time" + "section": "Dimension reduction via LDA", + "text": "Dimension reduction via LDA\nDiscriminant space: LDA also provides a low-dimensional projection of the \\(p\\)-dimensional space, where the groups are the most separated. For \\(K=2\\), this is\n\n\\[\n\\Sigma^{-1}(\\mu_A-\\mu_B)\n\\]\nThe distance between means relative to the variance-covariance, ie Mahalanobis distance." 
}, { - "objectID": "week3/slides.html#k-fold-cross-validation-24", - "href": "week3/slides.html#k-fold-cross-validation-24", + "objectID": "week4/slides.html#discriminant-space", + "href": "week4/slides.html#discriminant-space", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "k-fold cross validation (2/4)", - "text": "k-fold cross validation (2/4)\n\n\n\nDivide the data set into \\(k\\) different parts.\nRemove one part, fit the model on the remaining \\(k โˆ’ 1\\) parts, and compute the statistic of interest on the omitted part.\nRepeat \\(k\\) times taking out a different part each time\n\n\nHere are the row numbers for \\(k=5\\) folds:\n\np_folds <- vfold_cv(p_sub, 5, strata=species)\nc(1:nrow(p_sub))[-p_folds$splits[[1]]$in_id]\n\n [1] 5 6 8 12 16 23 28 31 43 44 45 53 57 58 70 73 74 77\n\nc(1:nrow(p_sub))[-p_folds$splits[[2]]$in_id]\n\n [1] 2 9 10 11 13 17 22 25 39 48 50 51 55 61 65 69 75 78\n\nc(1:nrow(p_sub))[-p_folds$splits[[3]]$in_id]\n\n [1] 1 3 14 18 20 26 33 41 42 49 56 67 72 81 83 84\n\nc(1:nrow(p_sub))[-p_folds$splits[[4]]$in_id]\n\n [1] 4 19 29 32 34 35 36 40 46 52 63 64 66 76 79 80\n\nc(1:nrow(p_sub))[-p_folds$splits[[5]]$in_id]\n\n [1] 7 15 21 24 27 30 37 38 47 54 59 60 62 68 71 82" + "section": "Discriminant space", + "text": "Discriminant space\nThe dashed lines are the Bayes decision boundaries. Ellipses that contain 95% of the probability for each of the three classes are shown. Solid line corresponds to the class boundaries from the LDA model fit to the sample.\n\n \n\n(Chapter4/4.6.pdf)" }, { - "objectID": "week3/slides.html#k-fold-cross-validation-34", - "href": "week3/slides.html#k-fold-cross-validation-34", + "objectID": "week4/slides.html#discriminant-space-using-sample-statistics", + "href": "week4/slides.html#discriminant-space-using-sample-statistics", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "k-fold cross validation (3/4)", - "text": "k-fold cross validation (3/4)\n\n\n\nDivide the data set into \\(k\\) different parts.\nRemove one part, fit the model on the remaining \\(k โˆ’ 1\\) parts, and compute the statistic of interest on the omitted part.\nRepeat \\(k\\) times taking out a different part each time\n\n\nFit the model to the \\(k-1\\) set, and compute the statistic on the \\(k\\)-fold, that was not used in the model fit.\nHere we use the accuracy as the statistic of interest.\nValue for fold 1 is:\n\n\n# A tibble: 1 ร— 3\n .metric .estimator .estimate\n <chr> <chr> <dbl>\n1 accuracy multiclass 0.889" + "section": "Discriminant space: using sample statistics", + "text": "Discriminant space: using sample statistics\n\nDiscriminant space: is the low-dimensional space (\\((K-1)\\)-dimensional) where the class means are the furthest apart relative to the common variance-covariance.\n\nThe discriminant space is provided by the eigenvectors after making an eigen-decomposition of \\(W^{-1}B\\), where\n\\[\nB = \\frac{1}{K}\\sum_{i=1}^{K} (\\bar{x}_i-\\bar{x})(\\bar{x}_i-\\bar{x})^\\top\n~~~\\text{and}~~~\nW = \\frac{1}{K}\\sum_{k=1}^K\\frac{1}{n_k}\\sum_{i=1}^{n_k} (x_i-\\bar{x}_k)(x_i-\\bar{x}_k)^\\top\n\\]\nNote \\(W\\) is the (unweighted) pooled variance-covariance matrix." 
}, { - "objectID": "week3/slides.html#k-fold-cross-validation-44", - "href": "week3/slides.html#k-fold-cross-validation-44", + "objectID": "week4/slides.html#section", + "href": "week4/slides.html#section", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "k-fold cross validation (4/4)", - "text": "k-fold cross validation (4/4)\n\n\n\nDivide the data set into \\(k\\) different parts.\nRemove one part, fit the model on the remaining \\(k โˆ’ 1\\) parts, and compute the statistic of interest on the omitted part.\nRepeat \\(k\\) times taking out a different part each time\n\n\nHere is the accuracy computed for each of the \\(k=5\\) folds. Remember, this means that the model was fitted to the rest of the data, and accuracy was calculate on the observations in this fold.\n\n\n[1] 0.89 0.89 1.00 0.88 1.00\n\n\n\n\nRecommended reading: Alison Hillโ€™s Take a Sad Script & Make it Better: Tidymodels Edition" + "section": "", + "text": "Mahalanobis distance\nFor two \\(p\\)-dimensional vectors, Euclidean distance is\n\\[d(x,y) = \\sqrt{(x-y)^\\top(x-y)}\\] and Mahalanobs distance is\n\\[d(x,y) = \\sqrt{(x-y)^\\top\\Sigma^{-1}(x-y)}\\]\nWhich points are closest according to Euclidean distance? Which points are closest relative to the variance-covariance?" }, { - "objectID": "week3/slides.html#loocv", - "href": "week3/slides.html#loocv", + "objectID": "week4/slides.html#discriminant-space-1", + "href": "week4/slides.html#discriminant-space-1", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "LOOCV", - "text": "LOOCV\n\nLeave-one-out (LOOCV) is a special case of \\(k\\)-fold cross-validation, where \\(k=n\\). There are \\(n\\) CV sets, each with ONE observation left out.\n\nBenefits:\n\nUseful when sample size is very small.\nSome statistics can be calculated algebraically, without having to do computation for each fold." + "section": "Discriminant space", + "text": "Discriminant space\nIn the means of scenarios 1 and 2 are the same, but the variance-covariances are different. The calculated discriminant space is different for different variance-covariances.\n\nNotice: Means for groups are different, and variance-covariance for each group are the same." }, { - "objectID": "week3/slides.html#where-is-cross-validation-used", - "href": "week3/slides.html#where-is-cross-validation-used", + "objectID": "week4/slides.html#quadratic-discriminant-analysis", + "href": "week4/slides.html#quadratic-discriminant-analysis", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Where is cross-validation used?", - "text": "Where is cross-validation used?\n\n\nModel evaluation and selection, by estimating the generalisability on future data.\nParameter tuning: finding optimal choice of parameters or control variables, like number of trees or branches, or polynomial terms to generate the best model fit.\nVariable selection: which variables are more or less important for the best model fit. Possibly some variables can be dropped from the model." 
+ "section": "Quadratic Discriminant Analysis", + "text": "Quadratic Discriminant Analysis\nIf the groups have different variance-covariance matrices, but still come from a normal distribution" }, { - "objectID": "week3/slides.html#bootstrap", - "href": "week3/slides.html#bootstrap", + "objectID": "week4/slides.html#quadratic-da-qda", + "href": "week4/slides.html#quadratic-da-qda", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Bootstrap", - "text": "Bootstrap" + "section": "Quadratic DA (QDA)", + "text": "Quadratic DA (QDA)\nIf the variance-covariance matrices for the groups are NOT EQUAL, then the discriminant functions are:\n\\[\\delta_k(x) = x^\\top\\Sigma_k^{-1}x + x^\\top\\Sigma_k^{-1}\\mu_k - \\frac12\\mu_k^\\top\\Sigma_k^{-1}\\mu_k - \\frac12 \\log{|\\Sigma_k|} + \\log(\\pi_k)\\]\nwhere \\(\\Sigma_k\\) is the population variance-covariance for class \\(k\\), estimated by the sample variance-covariance \\(S_k\\), and \\(\\mu_k\\) is the population mean vector for class \\(k\\), estimated by the sample mean \\(\\bar{x}_k\\)." }, { - "objectID": "week3/slides.html#bootstrap-15", - "href": "week3/slides.html#bootstrap-15", + "objectID": "week4/slides.html#quadratic-da-qda-1", + "href": "week4/slides.html#quadratic-da-qda-1", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Bootstrap (1/5)", - "text": "Bootstrap (1/5)\nA bootstrap sample is a sample that is the same size as the original data set that is made using replacement. This results in analysis samples that have multiple replicates of some of the original rows of the data. The assessment set is defined as the rows of the original data that were not included in the bootstrap sample, referred to as the out-of-bag (OOB) sample.\n\nset.seed(115)\ndf <- tibble(id = 1:26, \n cl = c(rep(\"A\", 12), rep(\"B\", 14)))\ndf_b <- bootstraps(df, times = 100, strata = cl)\nt(df_b$splits[[1]]$data[df_b$splits[[1]]$in_id,])\n\n [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]\nid \" 1\" \" 2\" \" 2\" \" 2\" \" 2\" \" 5\" \" 6\" \" 7\" \" 9\" \"11\" \"12\" \ncl \"A\" \"A\" \"A\" \"A\" \"A\" \"A\" \"A\" \"A\" \"A\" \"A\" \"A\" \n [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20]\nid \"12\" \"14\" \"14\" \"18\" \"18\" \"18\" \"18\" \"18\" \"19\" \ncl \"A\" \"B\" \"B\" \"B\" \"B\" \"B\" \"B\" \"B\" \"B\" \n [,21] [,22] [,23] [,24] [,25] [,26]\nid \"21\" \"21\" \"21\" \"22\" \"25\" \"25\" \ncl \"B\" \"B\" \"B\" \"B\" \"B\" \"B\" \n\n\nWhich observations are out-of-bag in bootstrap sample 1?" + "section": "Quadratic DA (QDA)", + "text": "Quadratic DA (QDA)\nA quadratic boundary is obtained by relaxing the assumption of equal variance-covariance, and assume that \\(\\Sigma_k \\neq \\Sigma_l, ~~k\\neq l, k,l=1,...,K\\)\n\n \n\ntrue, LDA, QDA.\n(Chapter4/4.9.pdf)" }, { - "objectID": "week3/slides.html#bootstrap-25", - "href": "week3/slides.html#bootstrap-25", + "objectID": "week4/slides.html#qda-olive-oils-example", + "href": "week4/slides.html#qda-olive-oils-example", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Bootstrap (2/5)", - "text": "Bootstrap (2/5)\n\n\nBootstrap is preferable to cross-validation when the sample size is small, or if the structure in the data being modelled is complex.\nIt is commonly used for estimating the variance of parameter estimates, especially when the data is non-normal." + "section": "QDA: Olive oils example", + "text": "QDA: Olive oils example\n\n\nEven if the population is NOT normally distributed, QDA might do reasonably. 
On this data, region 3 has a โ€œbanana-shapedโ€ variance-covariance, and region 2 has two separate clusters. The quadratic boundary though does well to carve the space into neat sections dividing the two regions." }, { - "objectID": "week3/slides.html#bootstrap-35", - "href": "week3/slides.html#bootstrap-35", + "objectID": "week4/slides.html#checking-the-assumptions-for-lda-and-qda-12", + "href": "week4/slides.html#checking-the-assumptions-for-lda-and-qda-12", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Bootstrap (3/5)", - "text": "Bootstrap (3/5)\nIn dimension reduction it can be used to assess if the coefficients of a PC (the eigenvectors) are significantly different from ZERO. The 95% bootstrap confidence intervals can be computed by:\n\nGenerating B bootstrap samples of the data\nCompute PCA, record the loadings\nRe-orient the loadings, by choosing one variable with large coefficient to be the direction base\nIf B=1000, 25th and 975th sorted values yields the lower and upper bounds for confidence interval for each PC." + "section": "Checking the assumptions for LDA and QDA 1/2", + "text": "Checking the assumptions for LDA and QDA 1/2\nCheck the shape of the variability of each group could be considered to be elliptical, and the size is same for LDA but different to use QDA.\n\n\n\nGOOD\n\n\n\n\nBAD\n\n\n\n\nfrom Cook and Laa (2024)" }, { - "objectID": "week3/slides.html#bootstrap-45", - "href": "week3/slides.html#bootstrap-45", + "objectID": "week4/slides.html#checking-the-assumptions-for-lda-and-qda-22", + "href": "week4/slides.html#checking-the-assumptions-for-lda-and-qda-22", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Bootstrap (4/5)", - "text": "Bootstrap (4/5)\nAssessing the loadings for PC 2 of PCA on the womens track data. Remember the summary: \n\n\nStandard deviations (1, .., p=7):\n[1] 2.41 0.81 0.55 0.35 0.23 0.20 0.15\n\nRotation (n x k) = (7 x 7):\n PC1 PC2 PC3 PC4 PC5 PC6 PC7\nm100 0.37 0.49 -0.286 0.319 0.231 0.6198 0.052\nm200 0.37 0.54 -0.230 -0.083 0.041 -0.7108 -0.109\nm400 0.38 0.25 0.515 -0.347 -0.572 0.1909 0.208\nm800 0.38 -0.16 0.585 -0.042 0.620 -0.0191 -0.315\nm1500 0.39 -0.36 0.013 0.430 0.030 -0.2312 0.693\nm3000 0.39 -0.35 -0.153 0.363 -0.463 0.0093 -0.598\nmarathon 0.37 -0.37 -0.484 -0.672 0.131 0.1423 0.070\n\n\n Should we consider m800, m400 contributing to PC2 or not?" + "section": "Checking the assumptions for LDA and QDA 2/2", + "text": "Checking the assumptions for LDA and QDA 2/2\nThis can also be done for \\(p>2\\).\n\n\n\nDATA\n\n\n\n\nPOINTS ON SURFACE OF ELLIPSES\n\n\n\n\nfrom Cook and Laa (2024)" }, { - "objectID": "week3/slides.html#bootstrap-55", - "href": "week3/slides.html#bootstrap-55", + "objectID": "week4/slides.html#plotting-the-model", + "href": "week4/slides.html#plotting-the-model", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Bootstrap (5/5)", - "text": "Bootstrap (5/5)\n\n\nWe said that PC2 is a contrast between short distance events and long distance events, particularly 100m, 200m vs 1500m, 3000m, marathon. 
How reliably can we state this?\n\n\nCode\nlibrary(boot)\ncompute_PC2 <- function(data, index) {\n pc2 <- prcomp(data[index,], center=TRUE, scale=TRUE)$rotation[,2]\n # Coordinate signs: make m100 always positive\n if (sign(pc2[1]) < 0) \n pc2 <- -pc2 \n return(pc2)\n}\n# Make sure sign of first PC element is positive\nset.seed(201)\nPC2_boot <- boot(data=track[,1:7], compute_PC2, R=1000)\ncolnames(PC2_boot$t) <- colnames(track[,1:7])\nPC2_boot_ci <- as_tibble(PC2_boot$t) %>%\n gather(var, coef) %>% \n mutate(var = factor(var, levels=c(\"m100\", \"m200\", \"m400\", \"m800\", \"m1500\", \"m3000\", \"marathon\"))) %>%\n group_by(var) %>%\n summarise(q2.5 = quantile(coef, 0.025), \n q5 = median(coef),\n q97.5 = quantile(coef, 0.975)) %>%\n mutate(t0 = PC2_boot$t0) \npb <- ggplot(PC2_boot_ci, aes(x=var, y=t0)) + \n geom_hline(yintercept=0, linetype=2, colour=\"red\") +\n geom_point() +\n geom_errorbar(aes(ymin=q2.5, ymax=q97.5), width=0.1) +\n xlab(\"\") + ylab(\"coefficient\") \n\n\nConfidence intervals for m400 and m800 cross ZERO, hence zero is a plausible value for the population coefficient corresponding to this estimate." + "section": "Plotting the model", + "text": "Plotting the model\n\n\n\nData-in-the-model-space\n\n\n\n\nModel-in-the-data-space\n\n\n\n\nfrom Cook and Laa (2024)" }, { - "objectID": "week3/slides.html#permutation", - "href": "week3/slides.html#permutation", + "objectID": "week4/slides.html#next-trees-and-forests", + "href": "week4/slides.html#next-trees-and-forests", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Permutation", - "text": "Permutation" + "section": "Next: Trees and forests", + "text": "Next: Trees and forests\n\n\n\nETC3250/5250 Lecture 4 | iml.numbat.space" }, { - "objectID": "week3/slides.html#permutation-13", - "href": "week3/slides.html#permutation-13", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Permutation (1/3)", - "text": "Permutation (1/3)\n\n\nPermutation breaks relationships, and is often used for conducting statistical hypothesis tests, without requiring too many assumptions.\n\nDATA\n\n\n# A tibble: 10 ร— 2\n x cl \n <dbl> <chr>\n 1 0.281 A \n 2 0.330 A \n 3 0.708 A \n 4 0.463 A \n 5 3.37 A \n 6 0.528 B \n 7 0.852 B \n 8 5.58 B \n 9 0.685 B \n10 3.28 B \n\n\n\n\nPERMUTE cl\n\n\n# A tibble: 10 ร— 2\n x cl \n <dbl> <chr>\n 1 0.281 A \n 2 0.330 B \n 3 0.708 A \n 4 0.463 B \n 5 3.37 B \n 6 0.528 A \n 7 0.852 B \n 8 5.58 B \n 9 0.685 A \n10 3.28 A \n\n\n\n\nIs there a difference in the medians of the groups?" 
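A minimal sketch of that permutation idea (the tibble d, the seed and the 1000 permutations are all illustrative choices, mimicking the small two-class example above):

# Permutation test for a difference in group medians (sketch)
set.seed(1045)
d <- tibble(x = c(rexp(5), rexp(5, rate = 0.5)),
            cl = rep(c("A", "B"), each = 5))
obs <- with(d, median(x[cl == "B"]) - median(x[cl == "A"]))
perm <- replicate(1000, {
  cl_p <- sample(d$cl)              # permuting the labels breaks the association
  with(d, median(x[cl_p == "B"]) - median(x[cl_p == "A"]))
})
mean(abs(perm) >= abs(obs))         # proportion of permuted differences as extreme

If this proportion is small, the observed difference in medians is unlikely to be due to chance alone.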
+ "objectID": "week3/tutorial.html", + "href": "week3/tutorial.html", + "title": "ETC3250/5250 Tutorial 3", + "section": "", + "text": "Load the libraries and avoid conflicts\n# Load libraries used everywhere\nlibrary(tidyverse)\nlibrary(tidymodels)\nlibrary(conflicted)\nlibrary(colorspace)\nlibrary(patchwork)\nlibrary(MASS)\nlibrary(randomForest)\nlibrary(gridExtra)\nlibrary(GGally)\nlibrary(geozoo)\nlibrary(mulgar)\nconflicts_prefer(dplyr::filter)\nconflicts_prefer(dplyr::select)\nconflicts_prefer(dplyr::slice)\nconflicts_prefer(palmerpenguins::penguins)\nconflicts_prefer(tourr::flea)" }, { - "objectID": "week3/slides.html#permutation-23", - "href": "week3/slides.html#permutation-23", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Permutation (2/3)", - "text": "Permutation (2/3)\n\n\nIs there a difference in the medians of the groups?\n\n\n\n\n\n\n\n\n\n\nGenerate \\(k\\) permutation samples, compute the medians for each, and compare the difference with original." + "objectID": "week3/tutorial.html#objectives", + "href": "week3/tutorial.html#objectives", + "title": "ETC3250/5250 Tutorial 3", + "section": "๐ŸŽฏ Objectives", + "text": "๐ŸŽฏ Objectives\nThe goal for this week is for you to learn and practice visualising high-dimensional data." }, { - "objectID": "week3/slides.html#permutation-33", - "href": "week3/slides.html#permutation-33", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Permutation (3/3)", - "text": "Permutation (3/3)\n\nCaution: permuting small numbers, especially classes may return very similar samples to the original data.\nStay tuned for random forest models, where permutation is used to help assess the importance of all the variables." + "objectID": "week3/tutorial.html#preparation", + "href": "week3/tutorial.html#preparation", + "title": "ETC3250/5250 Tutorial 3", + "section": "๐Ÿ”ง Preparation", + "text": "๐Ÿ”ง Preparation\n\nComplete the quiz\nDo the reading related to week 2" }, { - "objectID": "week3/slides.html#simulation", - "href": "week3/slides.html#simulation", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Simulation", - "text": "Simulation" + "objectID": "week3/tutorial.html#exercises", + "href": "week3/tutorial.html#exercises", + "title": "ETC3250/5250 Tutorial 3", + "section": "Exercises:", + "text": "Exercises:\nOpen your project for this unit called iml.Rproj.\n\n1. The sparseness of high dimensions\nRandomly generate data points that are uniformly distributed in a hyper-cube of 3, 5 and 10 dimensions, with 500 points in each sample, using the cube.solid.random function of the geozoo package. What differences do we expect to see? Now visualise each set in a grand tour and describe how they differ, and whether this matched your expectations?\nThe code to generate and view the cubes is:\n\n\nCode to generate the data and show in a tour\nlibrary(tourr)\nlibrary(geozoo)\nset.seed(1234)\ncube3 <- cube.solid.random(3, 500)$points\ncube5 <- cube.solid.random(5, 500)$points\ncube10 <- cube.solid.random(10, 500)$points\n\nanimate_xy(cube3, axes=\"bottomleft\")\nanimate_xy(cube5, axes=\"bottomleft\")\nanimate_xy(cube10, axes=\"bottomleft\")\n\n\n\n\n2. Detecting clusters\nFor the data sets, c1, c3 from the mulgar package, use the grand tour to view and try to identify structure (outliers, clusters, non-linear relationships).\n\n\nCode to show in a tour\nanimate_xy(c1)\nanimate_xy(c3)\n\n\n\n\n3. 
Effect of covariance\nExamine 5D multivariate normal samples drawn from populations with a range of variance-covariance matrices. (You can use the mvtnorm package to do the sampling, for example.) Examine the data using a grand tour. What changes when you change the correlation from close to zero to close to 1? Can you see a difference between strong positive correlation and strong negative correlation?\n\n\nCode to generate the samples\nlibrary(mvtnorm)\nset.seed(501)\n\ns1 <- diag(5)\ns2 <- diag(5)\ns2[3,4] <- 0.7\ns2[4,3] <- 0.7\ns3 <- s2\ns3[1,2] <- -0.7\ns3[2,1] <- -0.7\n\ns1\n\n\n [,1] [,2] [,3] [,4] [,5]\n[1,] 1 0 0 0 0\n[2,] 0 1 0 0 0\n[3,] 0 0 1 0 0\n[4,] 0 0 0 1 0\n[5,] 0 0 0 0 1\n\n\nCode to generate the samples\ns2\n\n\n [,1] [,2] [,3] [,4] [,5]\n[1,] 1 0 0.0 0.0 0\n[2,] 0 1 0.0 0.0 0\n[3,] 0 0 1.0 0.7 0\n[4,] 0 0 0.7 1.0 0\n[5,] 0 0 0.0 0.0 1\n\n\nCode to generate the samples\ns3\n\n\n [,1] [,2] [,3] [,4] [,5]\n[1,] 1.0 -0.7 0.0 0.0 0\n[2,] -0.7 1.0 0.0 0.0 0\n[3,] 0.0 0.0 1.0 0.7 0\n[4,] 0.0 0.0 0.7 1.0 0\n[5,] 0.0 0.0 0.0 0.0 1\n\n\nCode to generate the samples\nset.seed(1234)\nd1 <- as.data.frame(rmvnorm(500, sigma = s1))\nd2 <- as.data.frame(rmvnorm(500, sigma = s2))\nd3 <- as.data.frame(rmvnorm(500, sigma = s3))\n\n\n\n\n4. Principal components analysis on the simulated data\n๐Ÿง For data sets d2 and d3 what would you expect would be the number of PCs suggested by PCA?\n๐Ÿ‘จ๐Ÿฝโ€๐Ÿ’ป๐Ÿ‘ฉโ€๐Ÿ’ปConduct the PCA. Report the variances (eigenvalues), and cumulative proportions of total variance, make a scree plot, and the PC coefficients.\n๐ŸคฏOften, the selected number of PCs are used in future work. For both d3 and d4, think about the pros and cons of using 4 PCs and 3 PCs, respectively.\n\n\n5. PCA on cross-currency time series\nThe rates.csv data has 152 currencies relative to the USD for the period of Nov 1, 2019 through to Mar 31, 2020. Treating the dates as variables, conduct a PCA to examine how the cross-currencies vary, focusing on this subset: ARS, AUD, BRL, CAD, CHF, CNY, EUR, FJD, GBP, IDR, INR, ISK, JPY, KRW, KZT, MXN, MYR, NZD, QAR, RUB, SEK, SGD, UYU, ZAR.\n\nrates <- read_csv(\"https://raw.githubusercontent.com/numbats/iml/master/data/rates_Nov19_Mar20.csv\") |>\n select(date, ARS, AUD, BRL, CAD, CHF, CNY, EUR, FJD, GBP, IDR, INR, ISK, JPY, KRW, KZT, MXN, MYR, NZD, QAR, RUB, SEK, SGD, UYU, ZAR)\n\n\nStandardise the currency columns to each have mean 0 and variance 1. Explain why this is necessary prior to doing the PCA or is it? Use this data to make a time series plot overlaying all of the cross-currencies.\n\n\n\nCode to standardise currencies\nlibrary(plotly)\nrates_std <- rates |>\n mutate_if(is.numeric, function(x) (x-mean(x))/sd(x))\nrownames(rates_std) <- rates_std$date\np <- rates_std |>\n pivot_longer(cols=ARS:ZAR, \n names_to = \"currency\", \n values_to = \"rate\") |>\n ggplot(aes(x=date, y=rate, \n group=currency, label=currency)) +\n geom_line() \nggplotly(p, width=400, height=300)\n\n\n\nConduct a PCA. Make a scree plot, and summarise proportion of the total variance. 
Summarise these values and the coefficients for the first five PCs, nicely.\n\n\n\nCode to do PCA and screeplot\nrates_pca <- prcomp(rates_std[,-1], scale=FALSE)\nmulgar::ggscree(rates_pca, q=24)\noptions(digits=2)\nsummary(rates_pca)\n\n\n\n\nCode to make a nice summary\n# Summarise the coefficients nicely\nrates_pca_smry <- tibble(evl=rates_pca$sdev^2) |>\n mutate(p = evl/sum(evl), \n cum_p = cumsum(evl/sum(evl))) |> \n t() |>\n as.data.frame()\ncolnames(rates_pca_smry) <- colnames(rates_pca$rotation)\nrates_pca_smry <- bind_rows(as.data.frame(rates_pca$rotation),\n rates_pca_smry)\nrownames(rates_pca_smry) <- c(rownames(rates_pca$rotation),\n \"Variance\", \"Proportion\", \n \"Cum. prop\")\nrates_pca_smry[,1:5]\n\n\n\nMake a biplot of the first two PCs. Explain what you learn.\n\n\n\nBiplot code\nlibrary(ggfortify)\nautoplot(rates_pca, loadings = TRUE, \n loadings.label = TRUE) \n\n\n\nMake a time series plot of PC1 and PC2. Explain why this is useful to do for this data.\n\n\n\nCode to plot PCs\nrates_pca$x |>\n as.data.frame() |>\n mutate(date = rates_std$date) |>\n ggplot(aes(x=date, y=PC1)) + geom_line()\n\nrates_pca$x |>\n as.data.frame() |>\n mutate(date = rates_std$date) |>\n ggplot(aes(x=date, y=PC2)) + geom_line()\n\n\n\nYouโ€™ll want to drill down deeper to understand what the PCA tells us about the movement of the various currencies, relative to the USD, over the volatile period of the COVID pandemic. Plot the first two PCs again, but connect the dots in order of time. Make it interactive with plotly, where the dates are the labels. What does following the dates tell us about the variation captured in the first two principal components?\n\n\n\nCode to use interaction of the PC plot\nlibrary(plotly)\np2 <- rates_pca$x |>\n as.data.frame() |>\n mutate(date = rates_std$date) |>\n ggplot(aes(x=PC1, y=PC2, label=date)) +\n geom_point() +\n geom_path()\nggplotly(p2, width=400, height=400)\n\n\n\n\n6. Write a simple question about the weekโ€™s material and test your neighbour, or your tutor." }, { - "objectID": "week3/slides.html#simulation-12", - "href": "week3/slides.html#simulation-12", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Simulation (1/2)", - "text": "Simulation (1/2)\nSimulation from known statistical distributions allows us to check data and calculations against what is known is controlled conditions.\nFor example, how likely is it to see the extreme a value if my data is a sample from a normal distribution?" + "objectID": "week3/tutorial.html#finishing-up", + "href": "week3/tutorial.html#finishing-up", + "title": "ETC3250/5250 Tutorial 3", + "section": "๐Ÿ‘‹ Finishing up", + "text": "๐Ÿ‘‹ Finishing up\nMake sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult." }, { - "objectID": "week3/slides.html#simulation-22", - "href": "week3/slides.html#simulation-22", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Simulation (2/2)", - "text": "Simulation (2/2)\n\n\n\n\n\n\n\n\n\n\n\n\nGrey line is a guide line, computed by doing PCA on 100 samples from a standard \\(p\\)-dimensional normal distribution.\nThat is a comparison of the correlation matrix of the track data with a correlation matrix that is the identity matrix, where there is no association between variables.\nThe largest variance we expect is under 2. The observed variance for PC 1 is much higher. 
Much larger than expected, very important for capturing the variability in the data!\nWhy is there a difference in variance, when there is no difference in variance?" + "objectID": "week3/tutorialsol.html", + "href": "week3/tutorialsol.html", + "title": "ETC3250/5250 Tutorial 3", + "section": "", + "text": "Load the libraries and avoid conflicts\n# Load libraries used everywhere\nlibrary(tidyverse)\nlibrary(tidymodels)\nlibrary(conflicted)\nlibrary(colorspace)\nlibrary(patchwork)\nlibrary(MASS)\nlibrary(randomForest)\nlibrary(gridExtra)\nlibrary(GGally)\nlibrary(geozoo)\nlibrary(mulgar)\nconflicts_prefer(dplyr::filter)\nconflicts_prefer(dplyr::select)\nconflicts_prefer(dplyr::slice)\nconflicts_prefer(palmerpenguins::penguins)\nconflicts_prefer(tourr::flea)" }, { - "objectID": "week3/slides.html#what-can-go-wrong-in-high-dimensions", - "href": "week3/slides.html#what-can-go-wrong-in-high-dimensions", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "What can go wrong in high-dimensions", - "text": "What can go wrong in high-dimensions" + "objectID": "week3/tutorialsol.html#objectives", + "href": "week3/tutorialsol.html#objectives", + "title": "ETC3250/5250 Tutorial 3", + "section": "๐ŸŽฏ Objectives", + "text": "๐ŸŽฏ Objectives\nThe goal for this week is for you to learn and practice visualising high-dimensional data." }, { - "objectID": "week3/slides.html#space-is-huge", - "href": "week3/slides.html#space-is-huge", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Space is huge!", - "text": "Space is huge!\n\n\n\nset.seed(357)\nmy_sparse_data <- tibble(cl = c(rep(\"A\", 12), \n rep(\"B\", 9)),\n x1 = rnorm(21),\n x2 = rnorm(21), \n x3 = rnorm(21),\n x4 = rnorm(21),\n x5 = rnorm(21), \n x6 = rnorm(21), \n x7 = rnorm(21), \n x8 = rnorm(21), \n x9 = rnorm(21), \n x10 = rnorm(21), \n x11 = rnorm(21), \n x12 = rnorm(21), \n x13 = rnorm(21), \n x14 = rnorm(21), \n x15 = rnorm(21)) |>\n mutate(cl = factor(cl)) |>\n mutate_if(is.numeric, function(x) (x-mean(x))/sd(x))\n\n Do we agree that there is no REAL difference between A and B?\n\n\n\n\n\n\n\nDifference is due to having insufficient data with too many variables." + "objectID": "week3/tutorialsol.html#preparation", + "href": "week3/tutorialsol.html#preparation", + "title": "ETC3250/5250 Tutorial 3", + "section": "๐Ÿ”ง Preparation", + "text": "๐Ÿ”ง Preparation\n\nComplete the quiz\nDo the reading related to week 2" }, { - "objectID": "week3/slides.html#regularisation", - "href": "week3/slides.html#regularisation", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Regularisation", - "text": "Regularisation\n The fitting criteria has an added penalty term with the effect being that some parameter estimates are forced to ZERO. This effectively reduces the dimensionality by removing noise, and variability in the sample that is consistent with what would be expected if it was purely noise.\n Stay tuned for examples in various methods!" + "objectID": "week3/tutorialsol.html#exercises", + "href": "week3/tutorialsol.html#exercises", + "title": "ETC3250/5250 Tutorial 3", + "section": "Exercises:", + "text": "Exercises:\nOpen your project for this unit called iml.Rproj.\n\n1. The sparseness of high dimensions\nRandomly generate data points that are uniformly distributed in a hyper-cube of 3, 5 and 10 dimensions, with 500 points in each sample, using the cube.solid.random function of the geozoo package. What differences do we expect to see? 
Now visualise each set in a grand tour and describe how they differ, and whether this matched your expectations?\nThe code to generate and view the cubes is:\n\n\nCode to generate the data and show in a tour\nlibrary(tourr)\nlibrary(geozoo)\nset.seed(1234)\ncube3 <- cube.solid.random(3, 500)$points\ncube5 <- cube.solid.random(5, 500)$points\ncube10 <- cube.solid.random(10, 500)$points\n\nanimate_xy(cube3, axes=\"bottomleft\")\nanimate_xy(cube5, axes=\"bottomleft\")\nanimate_xy(cube10, axes=\"bottomleft\")\n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\nEach of the projections has a boxy shape, which gets less distinct as the dimension increases.\nAs the dimension increases, the points tend to concentrate in the centre of the plot window, with a smattering of points in the edges.\n\n\n\n\n\n\n2. Detecting clusters\nFor the data sets, c1, c3 from the mulgar package, use the grand tour to view and try to identify structure (outliers, clusters, non-linear relationships).\n\n\nCode to show in a tour\nanimate_xy(c1)\nanimate_xy(c3)\n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\nThe first data set c1 has 6 clusters, 4 small ones, and two big ones. The two big ones look like planes because they have no variation in some dimensions.\nThe second data set c3 has a triangular prism shape, which itself is divided into several smaller triangular prisms. It also has several dimensions with no variation, because the points collapse into a line in some projections.\n\n\n\n\n\n\n3. Effect of covariance\nExamine 5D multivariate normal samples drawn from populations with a range of variance-covariance matrices. (You can use the mvtnorm package to do the sampling, for example.) Examine the data using a grand tour. What changes when you change the correlation from close to zero to close to 1? Can you see a difference between strong positive correlation and strong negative correlation?\n\n\nCode to generate the samples\nlibrary(mvtnorm)\nset.seed(501)\n\ns1 <- diag(5)\ns2 <- diag(5)\ns2[3,4] <- 0.7\ns2[4,3] <- 0.7\ns3 <- s2\ns3[1,2] <- -0.7\ns3[2,1] <- -0.7\n\ns1\n\n\n [,1] [,2] [,3] [,4] [,5]\n[1,] 1 0 0 0 0\n[2,] 0 1 0 0 0\n[3,] 0 0 1 0 0\n[4,] 0 0 0 1 0\n[5,] 0 0 0 0 1\n\n\nCode to generate the samples\ns2\n\n\n [,1] [,2] [,3] [,4] [,5]\n[1,] 1 0 0.0 0.0 0\n[2,] 0 1 0.0 0.0 0\n[3,] 0 0 1.0 0.7 0\n[4,] 0 0 0.7 1.0 0\n[5,] 0 0 0.0 0.0 1\n\n\nCode to generate the samples\ns3\n\n\n [,1] [,2] [,3] [,4] [,5]\n[1,] 1.0 -0.7 0.0 0.0 0\n[2,] -0.7 1.0 0.0 0.0 0\n[3,] 0.0 0.0 1.0 0.7 0\n[4,] 0.0 0.0 0.7 1.0 0\n[5,] 0.0 0.0 0.0 0.0 1\n\n\nCode to generate the samples\nset.seed(1234)\nd1 <- as.data.frame(rmvnorm(500, sigma = s1))\nd2 <- as.data.frame(rmvnorm(500, sigma = s2))\nd3 <- as.data.frame(rmvnorm(500, sigma = s3))\n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\nanimate_xy(d1)\nanimate_xy(d2)\nanimate_xy(d3)\n\nThe points in data d1 are pretty spread in every projection. For the data d2, d3 have some projections where the data is concentrated along a line. This should be seen to be when variables 3 and 4 are contributing to the projection in d2, and when variables 1, 2, 3, 4 contributing to the projection in d3.\n\n\n\n\n\n\n4. Principal components analysis on the simulated data\n๐Ÿง For data sets d2 and d3 what would you expect would be the number of PCs suggested by PCA?\n๐Ÿ‘จ๐Ÿฝโ€๐Ÿ’ป๐Ÿ‘ฉโ€๐Ÿ’ปConduct the PCA. Report the variances (eigenvalues), and cumulative proportions of total variance, make a scree plot, and the PC coefficients.\n๐ŸคฏOften, the selected number of PCs are used in future work. 
For both d3 and d4, think about the pros and cons of using 4 PCs and 3 PCs, respectively.\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\nThinking about it: In d2 there is strong correlation between variables 3 and 4, which means probably only 4PC s would be needed. In d3 there is strong correlation also between variables 1 and 2 which would mean that only 3 PCs would be needed.\n\nd2_pca <- prcomp(d2, scale=TRUE)\nd2_pca\n\nStandard deviations (1, .., p=5):\n[1] 1.2944925 1.0120246 0.9995775 0.9840652 0.5766767\n\nRotation (n x k) = (5 x 5):\n PC1 PC2 PC3 PC4 PC5\nV1 0.009051897 0.60982755 0.600760775 0.51637067 -0.02182300\nV2 0.042039564 0.44070702 -0.798335151 0.40808929 0.01158053\nV3 0.702909484 0.03224989 0.034228444 -0.06034512 0.70715280\nV4 0.702411571 0.03021836 0.002269932 -0.08050218 -0.70655437\nV5 0.103377852 -0.65721722 0.023890154 0.74612487 -0.01027051\n\nd2_pca$sdev^2/5\n\n[1] 0.33514216 0.20483875 0.19983102 0.19367686 0.06651121\n\nmulgar::ggscree(d2_pca, q=5)\n\n\n\n\n\n\n\n\nFour PCs explain 93% of the variation. PC1 is the combination of variables 3 and 4, which captures this reduced dimension.\n\nd3_pca <- prcomp(d3, scale=TRUE)\nd3_pca\n\nStandard deviations (1, .., p=5):\n[1] 1.3262816 1.2831152 0.9984103 0.5561311 0.5371102\n\nRotation (n x k) = (5 x 5):\n PC1 PC2 PC3 PC4 PC5\nV1 0.47372917 0.52551030 0.007091154 -0.55745578 0.434295265\nV2 -0.49362867 -0.50367594 -0.047544823 -0.58444458 0.398503844\nV3 -0.50057768 0.49960926 0.030888892 -0.40488840 -0.578726039\nV4 -0.52968729 0.46318477 0.073441704 0.42649507 0.563559684\nV5 0.02765464 -0.07745919 0.995661287 -0.04283613 -0.007678753\n\nd3_pca$sdev^2/5\n\n[1] 0.35180458 0.32927695 0.19936462 0.06185637 0.05769748\n\nmulgar::ggscree(d3_pca, q=5)\n\n\n\n\n\n\n\n\nThree PCs explain 88% of the variation, and the last two PCs have much smaller variance than the others. PC 1 and 2 are combinations of variables 1, 2, 3 and 4, which captures this reduced dimension, and PC 3 is primarily variable 5.\nThe PCs are awkward combinations of the original variables. For d2, it would make sense to use PC1 (or equivalently and equal combination of V3, V4), and then keep the original variables V1, V2, V5.\nFor d3 itโ€™s harder to make this call because the first two PCs are combinations of four variables. Its hard to see from this that the ideal solution would be to use an equal combination of V1, V2, and equal combination of V3, V4 and V5 on its own.\nOften understanding the variance that is explained by the PCs is hard to interpret.\n\n\n\n\n\n\n5. PCA on cross-currency time series\nThe rates.csv data has 152 currencies relative to the USD for the period of Nov 1, 2019 through to Mar 31, 2020. Treating the dates as variables, conduct a PCA to examine how the cross-currencies vary, focusing on this subset: ARS, AUD, BRL, CAD, CHF, CNY, EUR, FJD, GBP, IDR, INR, ISK, JPY, KRW, KZT, MXN, MYR, NZD, QAR, RUB, SEK, SGD, UYU, ZAR.\n\nrates <- read_csv(\"https://raw.githubusercontent.com/numbats/iml/master/data/rates_Nov19_Mar20.csv\") |>\n select(date, ARS, AUD, BRL, CAD, CHF, CNY, EUR, FJD, GBP, IDR, INR, ISK, JPY, KRW, KZT, MXN, MYR, NZD, QAR, RUB, SEK, SGD, UYU, ZAR)\n\n\nStandardise the currency columns to each have mean 0 and variance 1. Explain why this is necessary prior to doing the PCA or is it? 
Use this data to make a time series plot overlaying all of the cross-currencies.\n\n\n\nCode to standardise currencies\nlibrary(plotly)\nrates_std <- rates |>\n mutate_if(is.numeric, function(x) (x-mean(x))/sd(x))\nrownames(rates_std) <- rates_std$date\np <- rates_std |>\n pivot_longer(cols=ARS:ZAR, \n names_to = \"currency\", \n values_to = \"rate\") |>\n ggplot(aes(x=date, y=rate, \n group=currency, label=currency)) +\n geom_line() \nggplotly(p, width=400, height=300)\n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\n\n\n\n\n\nIt isnโ€™t necessary to standardise the variables before using the prcomp function because we can set scale=TRUE to have it done as part of the PCA computation. However, it is useful to standardise the variables to make the time series plot where all the currencies are drawn. This is useful for interpreting the principal components.\n\n\n\n\n\nConduct a PCA. Make a scree plot, and summarise proportion of the total variance. Summarise these values and the coefficients for the first five PCs, nicely.\n\n\n\nCode to do PCA and screeplot\nrates_pca <- prcomp(rates_std[,-1], scale=FALSE)\nmulgar::ggscree(rates_pca, q=24)\noptions(digits=2)\nsummary(rates_pca)\n\n\n\n\nCode to make a nice summary\n# Summarise the coefficients nicely\nrates_pca_smry <- tibble(evl=rates_pca$sdev^2) |>\n mutate(p = evl/sum(evl), \n cum_p = cumsum(evl/sum(evl))) |> \n t() |>\n as.data.frame()\ncolnames(rates_pca_smry) <- colnames(rates_pca$rotation)\nrates_pca_smry <- bind_rows(as.data.frame(rates_pca$rotation),\n rates_pca_smry)\nrownames(rates_pca_smry) <- c(rownames(rates_pca$rotation),\n \"Variance\", \"Proportion\", \n \"Cum. prop\")\nrates_pca_smry[,1:5]\n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nImportance of components:\n PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8\nStandard deviation 4.193 1.679 1.0932 0.9531 0.7358 0.5460 0.38600 0.33484\nProportion of Variance 0.733 0.118 0.0498 0.0379 0.0226 0.0124 0.00621 0.00467\nCumulative Proportion 0.733 0.850 0.8999 0.9377 0.9603 0.9727 0.97893 0.98360\n PC9 PC10 PC11 PC12 PC13 PC14 PC15\nStandard deviation 0.30254 0.25669 0.25391 0.17893 0.16189 0.15184 0.14260\nProportion of Variance 0.00381 0.00275 0.00269 0.00133 0.00109 0.00096 0.00085\nCumulative Proportion 0.98741 0.99016 0.99284 0.99418 0.99527 0.99623 0.99708\n PC16 PC17 PC18 PC19 PC20 PC21 PC22\nStandard deviation 0.11649 0.10691 0.09923 0.09519 0.08928 0.07987 0.07222\nProportion of Variance 0.00057 0.00048 0.00041 0.00038 0.00033 0.00027 0.00022\nCumulative Proportion 0.99764 0.99812 0.99853 0.99891 0.99924 0.99950 0.99972\n PC23 PC24\nStandard deviation 0.05985 0.05588\nProportion of Variance 0.00015 0.00013\nCumulative Proportion 0.99987 1.00000\n\n\n\n\n PC1 PC2 PC3 PC4 PC5\nARS 0.215 -0.121 0.19832 0.181 -0.2010\nAUD 0.234 0.013 0.11466 0.018 0.0346\nBRL 0.229 -0.108 0.10513 0.093 -0.0526\nCAD 0.235 -0.025 -0.02659 -0.037 0.0337\nCHF -0.065 0.505 -0.33521 -0.188 -0.0047\nCNY 0.144 0.237 -0.45337 -0.238 -0.5131\nEUR 0.088 0.495 0.24474 0.245 -0.1416\nFJD 0.234 0.055 0.04470 0.028 0.0330\nGBP 0.219 0.116 -0.00915 -0.073 0.3059\nIDR 0.218 -0.022 -0.24905 -0.117 0.2362\nINR 0.223 -0.147 -0.00734 -0.014 0.0279\nISK 0.230 -0.016 0.10979 0.093 0.1295\nJPY -0.022 0.515 0.14722 0.234 0.3388\nKRW 0.214 0.063 0.17488 0.059 -0.3404\nKZT 0.217 0.013 -0.23244 -0.119 0.3304\nMXN 0.229 -0.059 -0.13804 -0.102 0.2048\nMYR 0.227 0.040 -0.13970 -0.115 -0.2009\nNZD 0.230 0.061 0.04289 -0.056 -0.0354\nQAR -0.013 0.111 0.55283 -0.807 0.0078\nRUB 0.233 -0.102 -0.05863 -0.042 0.0063\nSEK 0.205 0.240 
0.07570 0.085 0.0982\nSGD 0.227 0.057 0.14225 0.115 -0.2424\nUYU 0.231 -0.101 0.00064 -0.053 0.0957\nZAR 0.232 -0.070 -0.00328 0.042 -0.0443\nVariance 17.582 2.820 1.19502 0.908 0.5413\nProportion 0.733 0.118 0.04979 0.038 0.0226\nCum. prop 0.733 0.850 0.89989 0.938 0.9603\n\n\n\nThe first two principal components explain 85% of the total variation.\nPC1 is a combination of all of the currencies except for CHF, EUR, JPY, QAR.\nPC2 is a combination of CHF, EUR, JPY.\n\n\n\n\n\n\nMake a biplot of the first two PCs. Explain what you learn.\n\n\n\nBiplot code\nlibrary(ggfortify)\nautoplot(rates_pca, loadings = TRUE, \n loadings.label = TRUE) \n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nMost of the currencies contribute substantially to PC1. Only three contribute strongly to PC2: CHF, JPY, EUR. Similar to what is learned from the summary table (made in b).\nThe pattern of the points is most unusual! It has a curious S shape. Principal components are supposed to be a random scattering of values, with no obvious structure. This is a very strong pattern.\n\n\n\n\n\n\nMake a time series plot of PC1 and PC2. Explain why this is useful to do for this data.\n\n\n\nCode to plot PCs\nrates_pca$x |>\n as.data.frame() |>\n mutate(date = rates_std$date) |>\n ggplot(aes(x=date, y=PC1)) + geom_line()\n\nrates_pca$x |>\n as.data.frame() |>\n mutate(date = rates_std$date) |>\n ggplot(aes(x=date, y=PC2)) + geom_line()\n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nBecause there is a strong pattern in the first two PCs, it could be useful to understand if this is related to the temporal context of the data.\nHere we might expect that the PCs extract the main temporal patterns. We see this is the case.\nPC1 reflects the large group of currencies that greatly increase in mid-March.\nPC2 reflects the few currencies that decrease at the start of March.\n\nNote that: increase here means that the value of the currency declines relative to the USD and a decrease indicates stronger relative to the USD. Is this correct?\n\n\n\n\n\nYouโ€™ll want to drill down deeper to understand what the PCA tells us about the movement of the various currencies, relative to the USD, over the volatile period of the COVID pandemic. Plot the first two PCs again, but connect the dots in order of time. Make it interactive with plotly, where the dates are the labels. What does following the dates tell us about the variation captured in the first two principal components?\n\n\n\nCode to use interaction of the PC plot\nlibrary(plotly)\np2 <- rates_pca$x |>\n as.data.frame() |>\n mutate(date = rates_std$date) |>\n ggplot(aes(x=PC1, y=PC2, label=date)) +\n geom_point() +\n geom_path()\nggplotly(p2, width=400, height=400)\n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\n\n\n\n\n\nThe pattern in PC1 vs PC2 follows time. Prior to the pandemic there is a tangle of values on the left. Towards the end of February, when the world was starting to realise that COVID was a major health threat, there is a dramatic reaction from the world currencies, at least in relation to the USD. Currencies such as EUR, JPY, CHF reacted first, gaining strength relative to USD, and then they lost that strength. Most other currencies reacted later, losing value relative to the USD.\n\n\n\n\n\n\n6. Write a simple question about the weekโ€™s material and test your neighbour, or your tutor." 
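One way to check the note queried above (a sketch reusing rates_std and rates_pca from the solution code; picking AUD, and assuming the rates are quoted as units of currency per USD, are illustrative choices):

# Correlate the PC1 scores with one currency series as a sanity check (sketch)
cor(rates_pca$x[, "PC1"], rates_std$AUD)
plot(rates_std$date, rates_pca$x[, "PC1"], type = "l")   # PC1 over time, base graphics

A strong positive correlation with AUD, which rose (weakened) against the USD in March 2020, supports reading an increase in PC1 as currencies losing value relative to the USD.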
}, { - "objectID": "week3/slides.html#next-logistic-regression-and-discriminant-analysis", - "href": "week3/slides.html#next-logistic-regression-and-discriminant-analysis", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Next: Logistic regression and discriminant analysis", - "text": "Next: Logistic regression and discriminant analysis\n\n\n\nETC3250/5250 Lecture 3 | iml.numbat.space" + "objectID": "week3/tutorialsol.html#finishing-up", + "href": "week3/tutorialsol.html#finishing-up", + "title": "ETC3250/5250 Tutorial 3", + "section": "๐Ÿ‘‹ Finishing up", + "text": "๐Ÿ‘‹ Finishing up\nMake sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult." }, { - "objectID": "week2/tutorial.html", - "href": "week2/tutorial.html", - "title": "ETC3250/5250 Tutorial 2", + "objectID": "week3/index.html", + "href": "week3/index.html", + "title": "Week 3: Re-sampling and regularisation", "section": "", - "text": "The goal for this week is for you to learn and practice some of the basics of machine learning." + "text": "ISLR 5.1, 5.2, 6.2, 6.4" }, { - "objectID": "week2/tutorial.html#objectives", - "href": "week2/tutorial.html#objectives", - "title": "ETC3250/5250 Tutorial 2", + "objectID": "week3/index.html#main-reference", + "href": "week3/index.html#main-reference", + "title": "Week 3: Re-sampling and regularisation", "section": "", - "text": "The goal for this week is for you to learn and practice some of the basics of machine learning." + "text": "ISLR 5.1, 5.2, 6.2, 6.4" }, { - "objectID": "week2/tutorial.html#preparation", - "href": "week2/tutorial.html#preparation", - "title": "ETC3250/5250 Tutorial 2", - "section": "๐Ÿ”ง Preparation", - "text": "๐Ÿ”ง Preparation\n\nComplete the quiz\nDo the reading related to week 1" + "objectID": "week3/index.html#what-you-will-learn-this-week", + "href": "week3/index.html#what-you-will-learn-this-week", + "title": "Week 3: Re-sampling and regularisation", + "section": "What you will learn this week", + "text": "What you will learn this week\n\nCommon re-sampling methods: bootstrap, cross-validation, permutation, simulation.\nCross-validation for checking generalisability of model fit, parameter tuning, variable selection.\nBootstrapping for understanding variance of parameter estimates.\nPermutation to understand significance of associations between variables, and variable importance.\nSimulation can be used to assess what might happen with samples from known distributions.\nWhat can go wrong in high-d, and how to adjust using regularisation methods." }, { - "objectID": "week2/tutorial.html#exercises", - "href": "week2/tutorial.html#exercises", - "title": "ETC3250/5250 Tutorial 2", - "section": "Exercises:", - "text": "Exercises:\nOpen your project for this unit called iml.Rproj.\n\n1. 
Answer the following questions for this data matrix,\n\\[\\begin{align*}\n{\\mathbf X} = \\left[\\begin{array}{rrrrr}\n2 & -2 & -8 & 6 & -7 \\\\\n6 & 6 & -4 & 9 & 6 \\\\\n5 & 4 & 3 & -7 & 8 \\\\\n1 & -7 & 6 & 7 & -1\n\\end{array}\\right]\n\\end{align*}\\]\n\nWhat is \\(X_1\\) (variable 1)?\n\n\nWhat is observation 3?\n\n\nWhat is \\(n\\)?\n\n\nWhat is \\(p\\)?\n\n\nWhat is \\(X^\\top\\)?\n\n\nWrite a projection matrix which would generate a 2D projection where the first data projection has variables 1 and 4 combined equally, and the second data projection has one third of variable 2 and two thirds of 5.\n\n\nWhy canโ€™t the following matrix considered a projection matrix?\n\n\\[\\begin{align*}\n{\\mathbf A} = \\left[\\begin{array}{rr}\n-1/\\sqrt{2} & 1/\\sqrt{3} \\\\\n0 & 0 \\\\\n1/\\sqrt{2} & 0 \\\\\n0 & \\sqrt{2}/\\sqrt{3} \\\\\n\\end{array}\\right]\n\\end{align*}\\]\n\n\n2. Which of these statements is the most accurate? And which is the most precise?\nA. It is almost certain to rain in the next week.\nB. It is 90% likely to get at least 10mm of rain tomorrow.\n\n\n3. For the following data, make an appropriate training test split of 60:40. The response variable is cause. Deomstrate that you have made an appropriate split.\n\nlibrary(readr)\nlibrary(dplyr)\nlibrary(rsample)\n\nbushfires <- read_csv(\"https://raw.githubusercontent.com/dicook/mulgar_book/pdf/data/bushfires_2019-2020.csv\")\nbushfires |> count(cause)\n\n# A tibble: 4 ร— 2\n cause n\n <chr> <int>\n1 accident 138\n2 arson 37\n3 burning_off 9\n4 lightning 838\n\n\n\n\n4. In the lecture slides from week 1 on bias vs variance, these four images were shown.\n \n \nMark the images with the labels โ€œtrue modelโ€, โ€œfitted modelโ€, โ€œbiasโ€. Then explain in your own words why the different model shown in each has (potentially) large bias or small bias, and small variance or large variance.\n\n\n5. The following data contains true class and predictive probabilities for a model fit. Answer the questions below for this data.\n\npred_data <- read_csv(\"https://raw.githubusercontent.com/numbats/iml/master/data/tutorial_pred_data.csv\") |>\n mutate(true = factor(true))\n\n\nHow many classes?\n\n\nCompute the confusion table, using the maximum predictive probability to label the observation.\n\n\nCompute the accuracy, and accuracy if all observations were classified as Adelie. Why is the accuracy almost as good when all observations are predicted to be the majority class?\n\n\nCompute the balanced accuracy, by averaging the class errors. Why is it lower than the overall accuracy? Which is the better accuracy to use to reflect the ability to classify this data?\n\n\n\n6. This question relates to feature engineering, creating better variables on which to build your model.\n\nThe following spam data has a heavily skewed distribution for the size of the email message. 
How would you transform this variable to better see differences between spam and ham emails?\n\n\nlibrary(ggplot2)\nlibrary(ggbeeswarm)\nspam <- read_csv(\"http://ggobi.org/book/data/spam.csv\")\nggplot(spam, aes(x=spam, y=size.kb, colour=spam)) +\n geom_quasirandom() +\n scale_color_brewer(\"\", palette = \"Dark2\") + \n coord_flip() +\n theme(legend.position=\"none\")\n\n\n\n\n\n\n\n\n\nFor the following data, how would you construct a new single variable which would capture the difference between the two classes using a linear model?\n\n\nolive <- read_csv(\"http://ggobi.org/book/data/olive.csv\") |>\n dplyr::filter(region != 1) |>\n dplyr::select(region, arachidic, linoleic) |>\n mutate(region = factor(region))\nggplot(olive, aes(x=linoleic, \n y=arachidic, \n colour=region)) +\n geom_point() +\n scale_color_brewer(\"\", palette = \"Dark2\") + \n theme(legend.position=\"none\", \n aspect.ratio=1)\n\n\n\n\n\n\n\n\n\n\n7. Discuss with your neighbour, what you found the most difficult part of last weekโ€™s content. Find some material (from resources or googling) together that gives alternative explanations that make it clearer." + "objectID": "week3/index.html#lecture-slides", + "href": "week3/index.html#lecture-slides", + "title": "Week 3: Re-sampling and regularisation", + "section": "Lecture slides", + "text": "Lecture slides\n\nhtml\npdf\nqmd\nR" }, { - "objectID": "week2/tutorial.html#finishing-up", - "href": "week2/tutorial.html#finishing-up", - "title": "ETC3250/5250 Tutorial 2", - "section": "๐Ÿ‘‹ Finishing up", - "text": "๐Ÿ‘‹ Finishing up\nMake sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult." + "objectID": "week3/index.html#tutorial-instructions", + "href": "week3/index.html#tutorial-instructions", + "title": "Week 3: Re-sampling and regularisation", + "section": "Tutorial instructions", + "text": "Tutorial instructions\nInstructions:\n\nhtml\nqmd" }, { - "objectID": "week2/tutorialsol.html", - "href": "week2/tutorialsol.html", - "title": "ETC3250/5250 Tutorial 2", - "section": "", - "text": "The goal for this week is for you to learn and practice some of the basics of machine learning." + "objectID": "week3/index.html#assignments", + "href": "week3/index.html#assignments", + "title": "Week 3: Re-sampling and regularisation", + "section": "Assignments", + "text": "Assignments" }, { - "objectID": "week2/tutorialsol.html#objectives", - "href": "week2/tutorialsol.html#objectives", - "title": "ETC3250/5250 Tutorial 2", - "section": "", - "text": "The goal for this week is for you to learn and practice some of the basics of machine learning." + "objectID": "week3/index.html#assignments-1", + "href": "week3/index.html#assignments-1", + "title": "Week 3: Re-sampling and regularisation", + "section": "Assignments", + "text": "Assignments\n\nAssignment 1 is due on Friday 22 March." 
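As a pocket illustration of one idea in the week 3 material, bootstrapping to understand the variance of a parameter estimate; everything here (data, seed, number of resamples) is illustrative only:

# Bootstrap the variance of a sample median (sketch)
set.seed(303)
x <- rexp(50)                          # a skewed sample, for illustration
boot_med <- replicate(1000, median(sample(x, replace = TRUE)))
var(boot_med)                          # bootstrap estimate of the median's variance
quantile(boot_med, c(0.025, 0.975))    # simple percentile interval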
}, { - "objectID": "week2/tutorialsol.html#preparation", - "href": "week2/tutorialsol.html#preparation", - "title": "ETC3250/5250 Tutorial 2", - "section": "๐Ÿ”ง Preparation", - "text": "๐Ÿ”ง Preparation\n\nComplete the quiz\nDo the reading related to week 1" + "objectID": "week2/slides.html#overview", + "href": "week2/slides.html#overview", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Overview", + "text": "Overview\nIn this week we will cover:\n\nConceptual framing for visualisation\nCommon methods: scatterplot matrix, parallel coordinates, tours\nDetails on using tours for examining clustering and class structure\nDimension reduction\n\nLinear: principal component analysis\nNon-linear: multidimensional scaling, t-stochastic neighbour embedding (t-SNE), uniform manifold approximation and projection (UMAP)\n\nUsing tours to assess dimension reduction" }, { - "objectID": "week2/tutorialsol.html#exercises", - "href": "week2/tutorialsol.html#exercises", - "title": "ETC3250/5250 Tutorial 2", - "section": "Exercises:", - "text": "Exercises:\nOpen your project for this unit called iml.Rproj.\n\n1. Answer the following questions for this data matrix,\n\\[\\begin{align*}\n{\\mathbf X} = \\left[\\begin{array}{rrrrr}\n2 & -2 & -8 & 6 & -7 \\\\\n6 & 6 & -4 & 9 & 6 \\\\\n5 & 4 & 3 & -7 & 8 \\\\\n1 & -7 & 6 & 7 & -1\n\\end{array}\\right]\n\\end{align*}\\]\n\nWhat is \\(X_1\\) (variable 1)?\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\\(X_1 = (2 ~6 ~5 ~1)\\)\n\n\n\n\n\nWhat is observation 3?\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\\(5 ~ 4 ~ 3 ~ -7 ~ 8\\)\n\n\n\n\n\nWhat is \\(n\\)?\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\\(4\\)\n\n\n\n\n\nWhat is \\(p\\)?\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\\(5\\)\n\n\n\n\n\nWhat is \\(X^\\top\\)?\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\\[\\begin{align*}\n{\\mathbf X}^\\top = \\left[\\begin{array}{rrrr}\n2 & 6 & 5 & 1\\\\\n-2 & 6 & 4 & -7\\\\\n-8 & -4 & 3 & 6 \\\\\n6 & 9 & -7 & 7 \\\\\n-7 & 6 & 8 & -1\n\\end{array}\\right]\n\\end{align*}\\]\n\n\n\n\n\nWrite a projection matrix which would generate a 2D projection where the first data projection has variables 1 and 4 combined equally, and the second data projection has one third of variable 2 and two thirds of 5.\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\\[\\begin{align*}\n{\\mathbf A} = \\left[\\begin{array}{rr}\n\\frac{1}{\\sqrt{2}} & 0 \\\\\n0 & \\frac{1}{\\sqrt{3}} \\\\\n0 & 0 \\\\\n\\frac{1}{\\sqrt{2}} & 0 \\\\\n0 & \\frac{\\sqrt{2}}{\\sqrt{3}} \\\\\n\\end{array}\\right]\n\\end{align*}\\]\n\n\n\n\n\nWhy canโ€™t the following matrix considered a projection matrix?\n\n\\[\\begin{align*}\n{\\mathbf A} = \\left[\\begin{array}{rr}\n-1/\\sqrt{2} & 1/\\sqrt{3} \\\\\n0 & 0 \\\\\n1/\\sqrt{2} & 0 \\\\\n0 & \\sqrt{2}/\\sqrt{3} \\\\\n\\end{array}\\right]\n\\end{align*}\\]\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\nThe columns are not orthonormal. The cross-product is not equal to 0.\n\n\n\n\n\n\n2. Which of these statements is the most accurate? And which is the most precise?\nA. It is almost certain to rain in the next week.\nB. It is 90% likely to get at least 10mm of rain tomorrow.\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\nA is more accurate, but B is more precise.\n\n\n\n\n\n\n3. For the following data, make an appropriate training test split of 60:40. The response variable is cause. 
Demonstrate that you have made an appropriate split.\n\nlibrary(readr)\nlibrary(dplyr)\nlibrary(rsample)\n\nbushfires <- read_csv(\"https://raw.githubusercontent.com/dicook/mulgar_book/pdf/data/bushfires_2019-2020.csv\")\nbushfires |> count(cause)\n\n# A tibble: 4 × 2\n cause n\n <chr> <int>\n1 accident 138\n2 arson 37\n3 burning_off 9\n4 lightning 838\n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\nThe data is unbalanced, so it is especially important to stratify the sampling by the response variable. Without stratifying, the test set is likely to be missing observations in the burning_off category.\n\nset.seed(1156)\nbushfires_split <- initial_split(bushfires, prop = 0.60, strata=cause)\ntraining(bushfires_split) |> count(cause)\n\n# A tibble: 4 × 2\n cause n\n <chr> <int>\n1 accident 84\n2 arson 21\n3 burning_off 5\n4 lightning 502\n\ntesting(bushfires_split) |> count(cause)\n\n# A tibble: 4 × 2\n cause n\n <chr> <int>\n1 accident 54\n2 arson 16\n3 burning_off 4\n4 lightning 336\n\n\n\n\n\n\n\n4. In the lecture slides from week 1 on bias vs variance, these four images were shown.\n \n \nMark the images with the labels “true model”, “fitted model”, “bias”. Then explain in your own words why the different model shown in each has (potentially) large bias or small bias, and small variance or large variance.\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\nThe linear model will be very similar regardless of the training sample, so it has small variance. But because it misses the curved nature of the true model, it has large bias, missing critical parts of the two classes that are different.\nThe non-parametric model which captures the curves thus has small bias, but the fitted model might vary a lot from one training sample to another, which would result in it being considered to have large variance.\n \n\n\n\n\n\n\n5. The following data contains true class and predictive probabilities for a model fit. Answer the questions below for this data.\n\npred_data <- read_csv(\"https://raw.githubusercontent.com/numbats/iml/master/data/tutorial_pred_data.csv\") |>\n mutate(true = factor(true))\n\n\nHow many classes?\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\npred_data |> count(true)\n\n# A tibble: 2 × 2\n true n\n <fct> <int>\n1 Adelie 30\n2 Chinstrap 5\n\n\n\n\n\n\n\nCompute the confusion table, using the maximum predictive probability to label the observation.\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\nlibrary(tidyr)\npred_data <- pred_data |>\n mutate(pred = levels(pred_data$true)[apply(pred_data[,-1], 1, which.max)])\npred_data |> count(true, pred) |> \n group_by(true) |>\n mutate(cl_err = n[pred==true]/sum(n)) |>\n pivot_wider(names_from = pred, \n values_from = n,\n values_fill = 0) |>\n dplyr::select(true, Adelie, Chinstrap, cl_err)\n\n# A tibble: 2 × 4\n# Groups: true [2]\n true Adelie Chinstrap cl_err\n <fct> <int> <int> <dbl>\n1 Adelie 30 0 1 \n2 Chinstrap 2 3 0.6\n\n\n\n\n\n\n\nCompute the accuracy, and accuracy if all observations were classified as Adelie. Why is the accuracy almost as good when all observations are predicted to be the majority class?\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\nAccuracy = 33/35 = 0.94\nAccuracy when all predicted to be Adelie = 30/35 = 0.86\nThere are only 5 observations in the Chinstrap class. So accuracy remains high if we simply ignore this class.\n\n\n\n\n\nCompute the balanced accuracy, by averaging the class errors. Why is it lower than the overall accuracy?
Which is the better accuracy to use to reflect the ability to classify this data?\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\nThe balanced accuracy is 0.8. This is a better reflection on the predictive ability of the model for this data because it reflects the difficulty in predicting the Chinstrap group.\n\n\n\n\n\n\n6. This question relates to feature engineering, creating better variables on which to build your model.\n\nThe following spam data has a heavily skewed distribution for the size of the email message. How would you transform this variable to better see differences between spam and ham emails?\n\n\nlibrary(ggplot2)\nlibrary(ggbeeswarm)\nspam <- read_csv(\"http://ggobi.org/book/data/spam.csv\")\nggplot(spam, aes(x=spam, y=size.kb, colour=spam)) +\n geom_quasirandom() +\n scale_color_brewer(\"\", palette = \"Dark2\") + \n coord_flip() +\n theme(legend.position=\"none\")\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\nggplot(spam, aes(x=spam, y=size.kb, colour=spam)) +\n geom_quasirandom() +\n scale_color_brewer(\"\", palette = \"Dark2\") + \n coord_flip() +\n theme(legend.position=\"none\") +\n scale_y_log10()\n\n\n\n\n\n\n\n\n\n\n\n\n\nFor the following data, how would you construct a new single variable which would capture the difference between the two classes using a linear model?\n\n\nolive <- read_csv(\"http://ggobi.org/book/data/olive.csv\") |>\n dplyr::filter(region != 1) |>\n dplyr::select(region, arachidic, linoleic) |>\n mutate(region = factor(region))\nggplot(olive, aes(x=linoleic, \n y=arachidic, \n colour=region)) +\n geom_point() +\n scale_color_brewer(\"\", palette = \"Dark2\") + \n theme(legend.position=\"none\", \n aspect.ratio=1)\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\nolive <- olive |>\n mutate(linoarch = 0.377 * linoleic + \n 0.926 * arachidic)\nggplot(olive, aes(x=region, \n y=linoarch, \n colour=region)) +\n geom_quasirandom() +\n scale_color_brewer(\"\", palette = \"Dark2\") + \n coord_flip() +\n theme(legend.position=\"none\") \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n7. Discuss with your neighbour, what you found the most difficult part of last weekโ€™s content. Find some material (from resources or googling) together that gives alternative explanations that make it clearer." + "objectID": "week2/slides.html#concepts", + "href": "week2/slides.html#concepts", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Concepts", + "text": "Concepts" }, { - "objectID": "week2/tutorialsol.html#finishing-up", - "href": "week2/tutorialsol.html#finishing-up", - "title": "ETC3250/5250 Tutorial 2", - "section": "๐Ÿ‘‹ Finishing up", - "text": "๐Ÿ‘‹ Finishing up\nMake sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult." + "objectID": "week2/slides.html#model-in-the-data-space", + "href": "week2/slides.html#model-in-the-data-space", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Model-in-the-data-space", + "text": "Model-in-the-data-space\n\n\n\n\n\nFrom XKCD\n\n\n\n We plot the model on the data to assess whether it fits or is a misfit!\n\n\nDoing this in high-dimensions is considered difficult!\n\n\nSo it is common to only plot the data-in-the-model-space." 
}, { - "objectID": "week2/index.html", - "href": "week2/index.html", - "title": "Week 2: Visualising your data and models", - "section": "", - "text": "Cook and Laa Ch 1, 3, 4, 5, 6, 13" + "objectID": "week2/slides.html#data-in-the-model-space", + "href": "week2/slides.html#data-in-the-model-space", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Data-in-the-model-space", + "text": "Data-in-the-model-space\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nPredictive probabilities are aspects of the model. It is useful to plot. What do we learn here?\n\nBut it doesnโ€™t tell you why there is a difference." }, { - "objectID": "week2/index.html#main-reference", - "href": "week2/index.html#main-reference", - "title": "Week 2: Visualising your data and models", - "section": "", - "text": "Cook and Laa Ch 1, 3, 4, 5, 6, 13" + "objectID": "week2/slides.html#model-in-the-data-space-1", + "href": "week2/slides.html#model-in-the-data-space-1", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Model-in-the-data-space", + "text": "Model-in-the-data-space\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nModel is displayed, as a grid of predicted points in the original variable space. Data is overlaid, using text labels. What do you learn?\n\nOne model has a linear boundary, and the other has the highly non-linear boundary, which matches the class cluster better. Also โ€ฆ" }, { - "objectID": "week2/index.html#what-you-will-learn-this-week", - "href": "week2/index.html#what-you-will-learn-this-week", - "title": "Week 2: Visualising your data and models", - "section": "What you will learn this week", - "text": "What you will learn this week\n\nDimension reduction methods: linear and non-linear\nVisualising high-dimensions using animations of linear projections\nScatterplot matrices\nParallel coordinate plots\nConcept of model-in-the-data-space, relative to data-in-the-moel-space" + "objectID": "week2/slides.html#how-do-you-visualise-beyond-2d", + "href": "week2/slides.html#how-do-you-visualise-beyond-2d", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "How do you visualise beyond 2D?", + "text": "How do you visualise beyond 2D?" }, { - "objectID": "week2/index.html#lecture-slides", - "href": "week2/index.html#lecture-slides", - "title": "Week 2: Visualising your data and models", - "section": "Lecture slides", - "text": "Lecture slides\n\nhtml\npdf\nqmd\nR" + "objectID": "week2/slides.html#scatterplot-matrix", + "href": "week2/slides.html#scatterplot-matrix", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Scatterplot matrix", + "text": "Scatterplot matrix\n\n\n Start simply! Make static plots that organise the variables on a page. \nPlot all the pairs of variables. When laid out in a matrix format this is called a scatterplot matrix.\n Here, we see linear association, clumping and clustering, potentially some outliers." 
}, { - "objectID": "week2/index.html#tutorial-instructions", - "href": "week2/index.html#tutorial-instructions", - "title": "Week 2: Visualising your data and models", - "section": "Tutorial instructions", - "text": "Tutorial instructions\nInstructions:\n\nhtml\nqmd" + "objectID": "week2/slides.html#scatterplot-matrix-drawbacks", + "href": "week2/slides.html#scatterplot-matrix-drawbacks", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Scatterplot matrix: drawbacks", + "text": "Scatterplot matrix: drawbacks\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nThere is an outlier in the data on the right, like the one in the left, but it is hidden in a combination of variables. Itโ€™s not visible in any pair of variables." }, { - "objectID": "week2/index.html#assignments", - "href": "week2/index.html#assignments", - "title": "Week 2: Visualising your data and models", - "section": "Assignments", - "text": "Assignments\n\nAssignment 1 is due on Friday 22 March." + "objectID": "week2/slides.html#perception", + "href": "week2/slides.html#perception", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Perception", + "text": "Perception\n\nAspect ratio for scatterplots needs to be equal, or square!\n\nWhen you make a scatterplot of two variables from a multivariate data set, most software renders it with an unequal aspect ratio, as a rectangle. You need to over-ride this and force the square aspect ratio. Why?\n\n\n\nBecause it adversely affects the perception of correlation and association between variables." }, { - "objectID": "week11/index.html", - "href": "week11/index.html", - "title": "Week 11: Evaluating your clustering model", - "section": "", - "text": "Cook and Laa Ch 12" + "objectID": "week2/slides.html#parallel-coordinate-plot", + "href": "week2/slides.html#parallel-coordinate-plot", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Parallel coordinate plot", + "text": "Parallel coordinate plot\n\n\n\nggparcoord(p_tidy, columns = 2:5, alphaLines = 0.5) + \n xlab(\"\") + ylab(\"\") + \n theme(aspect.ratio=0.8)\n\n\n\n\n\n\n\n\n\n Parallel coordinate plots are side-by-side dotplots with values from a row connected with a line.\nExamine the direction and orientation of lines to perceive multivariate relationships.\nCrossing lines indicate negative association. Lines with same slope indicate positive association. Outliers have a different up/down pattern to other points. Groups of lines with same pattern indicate clustering." }, { - "objectID": "week11/index.html#main-reference", - "href": "week11/index.html#main-reference", - "title": "Week 11: Evaluating your clustering model", - "section": "", - "text": "Cook and Laa Ch 12" + "objectID": "week2/slides.html#parallel-coordinate-plot-drawbacks", + "href": "week2/slides.html#parallel-coordinate-plot-drawbacks", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Parallel coordinate plot: drawbacks", + "text": "Parallel coordinate plot: drawbacks\n\n\n\nHard to follow lines - need interactivity\nOrder of variables\nScaling of variables\n\n\n\nBut the advantage is that you can pack a lot of variables into the single page." 
+ }, + { + "objectID": "week2/slides.html#parallel-coordinate-plot-effect-of-scaling", + "href": "week2/slides.html#parallel-coordinate-plot-effect-of-scaling", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Parallel coordinate plot: effect of scaling", + "text": "Parallel coordinate plot: effect of scaling\n\n\n\nggparcoord(p_tidy, columns = 2:5, alphaLines = 0.5,\n scale=\"globalminmax\") + \n xlab(\"\") + ylab(\"\") + \n theme(aspect.ratio=0.8)\n\n\n\n\n\n\n\n\n\n\nggparcoord(p_tidy, columns = 2:5, alphaLines = 0.5,\n scale=\"uniminmax\") + \n xlab(\"\") + ylab(\"\") + \n theme(aspect.ratio=0.8)" + }, + { + "objectID": "week2/slides.html#parallel-coordinate-plot-effect-of-ordering", + "href": "week2/slides.html#parallel-coordinate-plot-effect-of-ordering", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Parallel coordinate plot: effect of ordering", + "text": "Parallel coordinate plot: effect of ordering\n\n\n\nggparcoord(p_tidy, columns = 2:5, alphaLines = 0.5,\n groupColumn = 1) + \n scale_color_discrete_divergingx(palette = \"Zissou 1\") +\n xlab(\"\") + ylab(\"\") +\n theme(legend.position=\"none\", aspect.ratio=0.8)\n\n\n\n\n\n\n\n\n\n\nggparcoord(p_tidy, columns = 2:5, alphaLines = 0.5,\n groupColumn = 1, order=c(4, 2, 5, 3)) + \n scale_color_discrete_divergingx(palette = \"Zissou 1\") +\n xlab(\"\") + ylab(\"\") +\n theme(legend.position=\"none\", aspect.ratio=0.8)" + }, + { + "objectID": "week2/slides.html#adding-interactivity-to-static-plots-scatterplot-matrix", + "href": "week2/slides.html#adding-interactivity-to-static-plots-scatterplot-matrix", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Adding interactivity to static plots: scatterplot matrix", + "text": "Adding interactivity to static plots: scatterplot matrix\n\n\n\nlibrary(plotly)\ng <- ggpairs(p_tidy, columns=2:5) +\n theme(axis.text = element_blank()) \n\n Selecting points, using plotly, allows you to see where this observation lies in the other plots (pairs of variables).\n\n\nggplotly(g, width=600, height=600)" + }, + { + "objectID": "week2/slides.html#adding-interactivity-to-static-plots-parallel-coordinates", + "href": "week2/slides.html#adding-interactivity-to-static-plots-parallel-coordinates", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Adding interactivity to static plots: parallel coordinates", + "text": "Adding interactivity to static plots: parallel coordinates\n\n\n\np_pcp <- p_tidy |>\n na.omit() |>\n plot_ly(type = 'parcoords',\n line = list(),\n dimensions = list(\n list(range = c(172, 231),\n label = 'fl', values = ~fl),\n list(range = c(32, 60),\n label = 'bl', values = ~bl),\n list(range = c(2700, 6300),\n label = 'bm', values = ~bm),\n list(range = c(13, 22),\n label = 'bd', values = ~bd)\n )\n )\n\n\n\np_pcp" + }, + { + "objectID": "week2/slides.html#what-is-high-dimensions", + "href": "week2/slides.html#what-is-high-dimensions", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "What is high-dimensions?", + "text": "What is high-dimensions?" 
+ }, + { + "objectID": "week2/slides.html#high-dimensions-in-statistics", + "href": "week2/slides.html#high-dimensions-in-statistics", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "High-dimensions in statistics", + "text": "High-dimensions in statistics\n\n\n\nIncreasing dimension adds an additional orthogonal axis.\n\nIf you want more high-dimensional shapes there is an R package, geozoo, which will generate cubes, spheres, simplices, mobius strips, torii, boy surface, klein bottles, cones, various polytopes, โ€ฆ\nAnd read or watch Flatland: A Romance of Many Dimensions (1884) Edwin Abbott." + }, + { + "objectID": "week2/slides.html#remember", + "href": "week2/slides.html#remember", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Remember", + "text": "Remember\nData\n\\[\\begin{eqnarray*}\nX_{~n\\times p} =\n[X_{~1}~X_{~2}~\\dots~X_{~p}]_{~n\\times p} = \\left[ \\begin{array}{cccc}\nx_{~11} & x_{~12} & \\dots & x_{~1p} \\\\\nx_{~21} & x_{~22} & \\dots & x_{~2p}\\\\\n\\vdots & \\vdots & & \\vdots \\\\\nx_{~n1} & x_{~n2} & \\dots & x_{~np} \\end{array} \\right]_{~n\\times p}\n\\end{eqnarray*}\\]" + }, + { + "objectID": "week2/slides.html#remember-1", + "href": "week2/slides.html#remember-1", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Remember", + "text": "Remember\nProjection\n\\[\\begin{eqnarray*}\nA_{~p\\times d} = \\left[ \\begin{array}{cccc}\na_{~11} & a_{~12} & \\dots & a_{~1d} \\\\\na_{~21} & a_{~22} & \\dots & a_{~2d}\\\\\n\\vdots & \\vdots & & \\vdots \\\\\na_{~p1} & a_{~p2} & \\dots & a_{~pd} \\end{array} \\right]_{~p\\times d}\n\\end{eqnarray*}\\]" + }, + { + "objectID": "week2/slides.html#remember-2", + "href": "week2/slides.html#remember-2", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Remember", + "text": "Remember\nProjected data\n\\[\\begin{eqnarray*}\nY_{~n\\times d} = XA = \\left[ \\begin{array}{cccc}\ny_{~11} & y_{~12} & \\dots & y_{~1d} \\\\\ny_{~21} & y_{~22} & \\dots & y_{~2d}\\\\\n\\vdots & \\vdots & & \\vdots \\\\\ny_{~n1} & y_{~n2} & \\dots & y_{~nd} \\end{array} \\right]_{~n\\times d}\n\\end{eqnarray*}\\]" + }, + { + "objectID": "week2/slides.html#tours-of-linear-projections", + "href": "week2/slides.html#tours-of-linear-projections", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Tours of linear projections", + "text": "Tours of linear projections\n\n\n\n\nData is 2D: \\(~~p=2\\)\nProjection is 1D: \\(~~d=1\\)\n\n\\[\\begin{eqnarray*}\nA_{~2\\times 1} = \\left[ \\begin{array}{c}\na_{~11} \\\\\na_{~21}\\\\\n\\end{array} \\right]_{~2\\times 1}\n\\end{eqnarray*}\\]\n\n\n Notice that the values of \\(A\\) change between (-1, 1). All possible values being shown during the tour.\n \n \\[\\begin{eqnarray*}\nA = \\left[ \\begin{array}{c}\n1 \\\\\n0\\\\\n\\end{array} \\right]\n~~~~~~~~~~~~~~~~\nA = \\left[ \\begin{array}{c}\n0.7 \\\\\n0.7\\\\\n\\end{array} \\right]\n~~~~~~~~~~~~~~~~\nA = \\left[ \\begin{array}{c}\n0.7 \\\\\n-0.7\\\\\n\\end{array} \\right]\n\n\\end{eqnarray*}\\]\n\n\n watching the 1D shadows we can see:\n\nunimodality\nbimodality, there are two clusters.\n\n\n\n What does the 2D data look like? Can you sketch it?" 
+ }, + { + "objectID": "week2/slides.html#tours-of-linear-projections-1", + "href": "week2/slides.html#tours-of-linear-projections-1", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Tours of linear projections", + "text": "Tours of linear projections\n\n\n\n\n\n\n\n\n\n\n\n\n โŸต The 2D data" + }, + { + "objectID": "week2/slides.html#tours-of-linear-projections-2", + "href": "week2/slides.html#tours-of-linear-projections-2", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Tours of linear projections", + "text": "Tours of linear projections\n\n\n\nData is 3D: \\(p=3\\)\nProjection is 2D: \\(d=2\\)\n\\[\\begin{eqnarray*}\nA_{~3\\times 2} = \\left[ \\begin{array}{cc}\na_{~11} & a_{~12} \\\\\na_{~21} & a_{~22}\\\\\na_{~31} & a_{~32}\\\\\n\\end{array} \\right]_{~3\\times 2}\n\\end{eqnarray*}\\]\n\n\n Notice that the values of \\(A\\) change between (-1, 1). All possible values being shown during the tour.\n\n\nSee:\n\ncircular shapes\nsome transparency, reveals middle\nhole in in some projections\nno clustering" + }, + { + "objectID": "week2/slides.html#tours-of-linear-projections-3", + "href": "week2/slides.html#tours-of-linear-projections-3", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Tours of linear projections", + "text": "Tours of linear projections\n\n\n\nData is 4D: \\(p=4\\)\nProjection is 2D: \\(d=2\\)\n\\[\\begin{eqnarray*}\nA_{~4\\times 2} = \\left[ \\begin{array}{cc}\na_{~11} & a_{~12} \\\\\na_{~21} & a_{~22}\\\\\na_{~31} & a_{~32}\\\\\na_{~41} & a_{~42}\\\\\n\\end{array} \\right]_{~4\\times 2}\n\\end{eqnarray*}\\]\n\n How many clusters do you see?\n\n\nthree, right?\none separated, and two very close,\nand they each have an elliptical shape.\n\n\n\n\ndo you also see an outlier or two?" + }, + { + "objectID": "week2/slides.html#intuitively-tours-are-like", + "href": "week2/slides.html#intuitively-tours-are-like", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Intuitively, tours are like โ€ฆ", + "text": "Intuitively, tours are like โ€ฆ" + }, + { + "objectID": "week2/slides.html#and-help-to-see-the-datamodel-as-a-whole", + "href": "week2/slides.html#and-help-to-see-the-datamodel-as-a-whole", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "And help to see the data/model as a whole", + "text": "And help to see the data/model as a whole\n\n\nAvoid misinterpretation โ€ฆ\n\n\n\n\n\n\nโ€ฆ see the bigger picture!\n\n\n\n\n\n\n\n\nImage: Sketchplanations." + }, + { + "objectID": "week2/slides.html#anomaly-is-no-longer-hidden", + "href": "week2/slides.html#anomaly-is-no-longer-hidden", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Anomaly is no longer hidden", + "text": "Anomaly is no longer hidden\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nWait for it!" 
+ }, + { + "objectID": "week2/slides.html#how-to-use-a-tour-in-r", + "href": "week2/slides.html#how-to-use-a-tour-in-r", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "How to use a tour in R", + "text": "How to use a tour in R\n\n\nThis is a basic tour, which will run in your RStudio plot window.\n\nlibrary(tourr)\nanimate_xy(flea[, 1:6], rescale=TRUE)\n\n\n This data has a class variable, species.\n\n\nflea |> slice_head(n=3)\n\n species tars1 tars2 head aede1 aede2 aede3\n1 Concinna 191 131 53 150 15 104\n2 Concinna 185 134 50 147 13 105\n3 Concinna 200 137 52 144 14 102\n\n\n\nUse this to colour points with:\n\nanimate_xy(flea[, 1:6], \n col = flea$species, \n rescale=TRUE)\n\n\n\n\nYou can specifically guide the tour choice of projections using\n\nanimate_xy(flea[, 1:6], \n tour_path = guided_tour(holes()), \n col = flea$species, \n rescale = TRUE, \n sphere = TRUE)\n\n\n\n and you can manually choose a variable to control with:\n\nset.seed(915)\nanimate_xy(flea[, 1:6], \n radial_tour(basis_random(6, 2), \n mvar = 6), \n rescale = TRUE,\n col = flea$species)" + }, + { + "objectID": "week2/slides.html#how-to-save-a-tour", + "href": "week2/slides.html#how-to-save-a-tour", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "How to save a tour", + "text": "How to save a tour\n\n\n\n\n\n\nTo save as an animated gif:\n\nset.seed(645)\nrender_gif(penguins_sub[,1:4],\n grand_tour(),\n display_xy(col=\"#EC5C00\",\n half_range=3.8, \n axes=\"bottomleft\", cex=2.5),\n gif_file = \"../gifs/penguins1.gif\",\n apf = 1/60,\n frames = 1500,\n width = 500, \n height = 400)" + }, + { + "objectID": "week2/slides.html#dimension-reduction", + "href": "week2/slides.html#dimension-reduction", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Dimension reduction", + "text": "Dimension reduction" + }, + { + "objectID": "week2/slides.html#pca", + "href": "week2/slides.html#pca", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "PCA", + "text": "PCA\n\n\nFor this 2D data, sketch a line or a direction that if you squashed the data into it would provide most of the information.\n\n\n\n\n\n\n\n\n\n\n\n What about this data?" + }, + { + "objectID": "week2/slides.html#pca-1", + "href": "week2/slides.html#pca-1", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "PCA", + "text": "PCA\n\nPrincipal component analysis (PCA) produces a low-dimensional representation of a dataset. It finds a sequence of linear combinations of the variables that have maximal variance, and are mutually uncorrelated. It is an unsupervised learning method.\n\nUse it, when:\n\nYou have too many predictors for a regression. Instead, we can use the first few principal components.\nNeed to understand relationships between variables.\nTo make plots summarising the variation in a large number of variables." + }, + { + "objectID": "week2/slides.html#first-principal-component", + "href": "week2/slides.html#first-principal-component", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "First principal component", + "text": "First principal component\nThe first principal component is a new variable created from a linear combination\n\\[z_1 = \\phi_{11}x_1 + \\phi_{21} x_2 + \\dots + \\phi_{p1} x_p\\]\nof the original \\(x_1, x_2, \\dots, x_p\\) that has the largest variance. 
The elements \\(\\phi_{11},\\dots,\\phi_{p1}\\) are the loadings of the first principal component and are constrained by:\n\\[\n\\displaystyle\\sum_{j=1}^p \\phi^2_{j1} = 1\n\\]" + }, + { + "objectID": "week2/slides.html#calculation", + "href": "week2/slides.html#calculation", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Calculation", + "text": "Calculation\n\nThe loading vector \\(\\phi_1 = [\\phi_{11},\\dots,\\phi_{p1}]^\\top\\) defines direction in feature space along which data vary most.\nIf we project the \\(n\\) data points \\({x}_1,\\dots,{x}_n\\) onto this direction, the projected values are the principal component scores \\(z_{11},\\dots,z_{n1}\\).\n\n\n\n\nThe second principal component is the linear combination \\(z_{i2} = \\phi_{12}x_{i1} + \\phi_{22}x_{i2} + \\dots + \\phi_{p2}x_{ip}\\) that has maximal variance among all linear combinations that are uncorrelated with \\(z_1\\).\nEquivalent to constraining \\(\\phi_2\\) to be orthogonal (perpendicular) to \\(\\phi_1\\). And so on.\nThere are at most \\(\\min(n - 1, p)\\) PCs." }, { - "objectID": "week11/index.html#what-you-will-learn-this-week", - "href": "week11/index.html#what-you-will-learn-this-week", - "title": "Week 11: Evaluating your clustering model", - "section": "What you will learn this week", - "text": "What you will learn this week\n\nConfusion tables\nCluster metrics" + "objectID": "week2/slides.html#example", + "href": "week2/slides.html#example", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Example", + "text": "Example\n\n \n\nIf you think of the first few PCs like a linear model fit, and the others as the error, it is like regression, except that errors are orthogonal to model.\n(Chapter6/6.15.pdf)" }, { - "objectID": "week11/index.html#assignments", - "href": "week11/index.html#assignments", - "title": "Week 11: Evaluating your clustering model", - "section": "Assignments", - "text": "Assignments\n\nProject is due on Friday 17 May." + "objectID": "week2/slides.html#geometry", + "href": "week2/slides.html#geometry", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Geometry", + "text": "Geometry\nPCA can be thought of as fitting an \\(n\\)-dimensional ellipsoid to the data, where each axis of the ellipsoid represents a principal component. The new variables produced by principal components correspond to rotating and scaling the ellipse into a circle. It spheres the data." }, { - "objectID": "week1/tutorial.html", - "href": "week1/tutorial.html", - "title": "ETC53250/5250 Tutorial 1", - "section": "", - "text": "The goal for this week is for you to get up and running with the computing environment needed to successfully complete this unit." 
+ "objectID": "week2/slides.html#computation", + "href": "week2/slides.html#computation", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Computation", + "text": "Computation\nSuppose we have a \\(n\\times p\\) data set \\(X = [x_{ij}]\\).\n\nCentre each of the variables to have mean zero (i.e., the column means of \\({X}\\) are zero).\nLet \\(z_{i1} = \\phi_{11}x_{i1} + \\phi_{21} x_{i2} + \\dots + \\phi_{p1} x_{ip}\\)\nCompute sample variance of \\(z_{i1}\\) is \\(\\displaystyle\\frac1n\\sum_{i=1}^n z_{i1}^2\\).\nEstimate \\(\\phi_{j1}\\)\n\n\\[\n\\mathop{\\text{maximize}}_{\\phi_{11},\\dots,\\phi_{p1}} \\frac{1}{n}\\sum_{i=1}^n\n\\left(\\sum_{j=1}^p \\phi_{j1}x_{ij}\\right)^{\\!\\!\\!2} \\text{ subject to }\n\\sum_{j=1}^p \\phi^2_{j1} = 1\n\\]\nRepeat optimisation to estimate \\(\\phi_{jk}\\), with additional constraint that \\(\\sum_{j=1, k<k'}^p \\phi_{jk}\\phi_{jk'} = 0\\) (next vector is orthogonal to previous eigenvector)." }, { - "objectID": "week1/tutorial.html#objectives", - "href": "week1/tutorial.html#objectives", - "title": "ETC53250/5250 Tutorial 1", - "section": "", - "text": "The goal for this week is for you to get up and running with the computing environment needed to successfully complete this unit." + "objectID": "week2/slides.html#alternative-forumulations", + "href": "week2/slides.html#alternative-forumulations", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Alternative forumulations", + "text": "Alternative forumulations\n\n\nEigen-decomposition\n\nCompute the covariance matrix (after centering the columns of \\({X}\\)) \\[S = {X}^T{X}\\]\nFind eigenvalues (diagonal elements of \\(D\\)) and eigenvectors ( \\(V\\) ): \\[{S}={V}{D}{V}^T\\] where columns of \\({V}\\) are orthonormal (i.e., \\({V}^T{V}={I}\\))\n\n\nSingular Value Decomposition\n\\[X = U\\Lambda V^T\\]\n\n\\(X\\) is an \\(n\\times p\\) matrix\n\\(U\\) is \\(n \\times r\\) matrix with orthonormal columns ( \\(U^TU=I\\) )\n\\(\\Lambda\\) is \\(r \\times r\\) diagonal matrix with non-negative elements. (Square root of the eigenvalues.)\n\\(V\\) is \\(p \\times r\\) matrix with orthonormal columns (These are the eigenvectors, and \\(V^TV=I\\) ).\n\nIt is always possible to uniquely decompose a matrix in this way." }, { - "objectID": "week1/tutorial.html#preparation", - "href": "week1/tutorial.html#preparation", - "title": "ETC53250/5250 Tutorial 1", - "section": "๐Ÿ”ง Preparation", - "text": "๐Ÿ”ง Preparation\n\nInstall the latest versions of R and RStudio on your computer\n\n\ninstall.packages(c(\"tidyverse\", \"tidymodels\", \"tourr\", \"geozoo\", \"mulgar\", \"ggpcp\", \"plotly\", \"detourr\", \"langevitour\", \"ggbeeswarm\", \"MASS\", \"GGally\", \"ISLR\", \"mvtnorm\", \"rpart\", \"rpart.plot\", \"randomForest\", \"e1071\", \"xgboost\", \"Rtsne\", \"classifly\", \"penalizedLDA\", \"nnet\", \"kernelshap\", \"shapviz\", \"iml\", \"DALEX\", \"cxhull\", \"fpc\", \"mclust\", \"ggdendro\", \"kohonen\", \"aweSOM\", \"patchwork\", \"ggthemes\", \"colorspace\", \"palmerpenguins\"), dependencies = TRUE)\n\n\nCreate a project for this unit called iml.Rproj. All of your tutorial work and assignments should be completed in this workspace." 
+ "objectID": "week2/slides.html#total-variance", + "href": "week2/slides.html#total-variance", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Total variance", + "text": "Total variance\nRemember, PCA is trying to summarise the variance in the data.\nTotal variance (TV) in data (assuming variables centered at 0):\n\\[\n\\text{TV} = \\sum_{j=1}^p \\text{Var}(x_j) = \\sum_{j=1}^p \\frac{1}{n}\\sum_{i=1}^n x_{ij}^2\n\\]\nIf variables are standardised, TV=number of variables.\n\nVariance explained by mโ€™th PC: \\(V_m = \\text{Var}(z_m) = \\frac{1}{n}\\sum_{i=1}^n z_{im}^2\\)\n\\[\n\\text{TV} = \\sum_{m=1}^M V_m \\text{ where }M=\\min(n-1,p).\n\\]" }, { - "objectID": "week1/tutorial.html#exercises", - "href": "week1/tutorial.html#exercises", - "title": "ETC53250/5250 Tutorial 1", - "section": "Exercises:", - "text": "Exercises:\n\n1. The materials at https://learnr.numbat.space are an especially good way to check your R skills are ready for the unit. Regardless how advanced you are, at some point you will need help. How you ask for help is a big factor in getting your problem fixed. The following code generates an error.\n\nlibrary(dplyr)\nlibrary(MASS)\nlibrary(palmerpenguins)\np_sub <- penguins |>\n select(species, flipper_length_mm) |>\n filter(species == \"Adelie\")\n\n\nCan you work out why?\nUse the reprex package to create a text where the code and error are visible, and can be shared with someone that might be able to help.\n\n\n\n2. Your turn to write some code that generates an error. Create a reprex, and share with your tutor or neighbour, to see if they can fix the error.\n\n\n3. Follow the guidelines at https://tensorflow.rstudio.com/install/ to setup python and tensorflow on your computer. Then test your installation by following the beginner tutorial.\n\n\n4. Download the slides.qmd file for week 1 lecture.\n\nUse knitr::purl() to extract the R code for the class.\nOpen the resulting slides.R file in your RStudio file browser. What code is in the setup.R file that is sourced at the top?\n\n\nRun the rest of the code in small chunks. Does it all work for you? Do you get any errors? Do you have any suggestions on making it easier to run or understand the code?" + "objectID": "week2/slides.html#how-to-choose-k", + "href": "week2/slides.html#how-to-choose-k", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "How to choose \\(k\\)?", + "text": "How to choose \\(k\\)?\n\nPCA is a useful dimension reduction technique for large datasets, but deciding on how many dimensions to keep isnโ€™t often clear.\n\nHow do we know how many principal components to choose?" }, { - "objectID": "week1/tutorial.html#finishing-up", - "href": "week1/tutorial.html#finishing-up", - "title": "ETC53250/5250 Tutorial 1", - "section": "๐Ÿ‘‹ Finishing up", - "text": "๐Ÿ‘‹ Finishing up\nMake sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult." 
+ "objectID": "week2/slides.html#how-to-choose-k-1", + "href": "week2/slides.html#how-to-choose-k-1", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "How to choose \\(k\\)?", + "text": "How to choose \\(k\\)?\n\n\nProportion of variance explained:\n\\[\\text{PVE}_m = \\frac{V_m}{TV}\\]\nChoosing the number of PCs that adequately summarises the variation in \\(X\\), is achieved by examining the cumulative proportion of variance explained.\n\n\nCumulative proportion of variance explained:\n\\[\\text{CPVE}_k = \\sum_{m=1}^k\\frac{V_m}{TV}\\]" }, { - "objectID": "week1/tutorialsol.html", - "href": "week1/tutorialsol.html", - "title": "ETC53250/5250 Tutorial 1", - "section": "", - "text": "The goal for this week is for you to get up and running with the computing environment needed to successfully complete this unit." + "objectID": "week2/slides.html#how-to-choose-k-2", + "href": "week2/slides.html#how-to-choose-k-2", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "How to choose \\(k\\)?", + "text": "How to choose \\(k\\)?\n\n\n\nScree plot: Plot of variance explained by each component vs number of component." }, { - "objectID": "week1/tutorialsol.html#objectives", - "href": "week1/tutorialsol.html#objectives", - "title": "ETC53250/5250 Tutorial 1", - "section": "", - "text": "The goal for this week is for you to get up and running with the computing environment needed to successfully complete this unit." + "objectID": "week2/slides.html#how-to-choose-k-3", + "href": "week2/slides.html#how-to-choose-k-3", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "How to choose \\(k\\)?", + "text": "How to choose \\(k\\)?\n\n\n\nScree plot: Plot of variance explained by each component vs number of component." }, { - "objectID": "week1/tutorialsol.html#preparation", - "href": "week1/tutorialsol.html#preparation", - "title": "ETC53250/5250 Tutorial 1", - "section": "๐Ÿ”ง Preparation", - "text": "๐Ÿ”ง Preparation\n\nInstall the latest versions of R and RStudio on your computer\n\n\ninstall.packages(c(\"tidyverse\", \"tidymodels\", \"tourr\", \"geozoo\", \"mulgar\", \"ggpcp\", \"plotly\", \"detourr\", \"langevitour\", \"ggbeeswarm\", \"MASS\", \"GGally\", \"ISLR\", \"mvtnorm\", \"rpart\", \"rpart.plot\", \"randomForest\", \"e1071\", \"xgboost\", \"Rtsne\", \"classifly\", \"penalizedLDA\", \"nnet\", \"kernelshap\", \"shapviz\", \"iml\", \"DALEX\", \"cxhull\", \"fpc\", \"mclust\", \"ggdendro\", \"kohonen\", \"aweSOM\", \"patchwork\", \"ggthemes\", \"colorspace\", \"palmerpenguins\"), dependencies = TRUE)\n\n\nCreate a project for this unit called iml.Rproj. All of your tutorial work and assignments should be completed in this workspace." 
+ "objectID": "week2/slides.html#example---track-records", + "href": "week2/slides.html#example---track-records", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Example - track records", + "text": "Example - track records\nThe data on national track records for women (as at 1984).\n\ntrack <- read_csv(here::here(\"data/womens_track.csv\"))\nglimpse(track)\n\nRows: 55\nColumns: 8\n$ m100 <dbl> 12, 11, 11, 11, 11, 11, 12, 11, 12, 12, 1โ€ฆ\n$ m200 <dbl> 23, 22, 23, 23, 23, 23, 24, 22, 25, 24, 2โ€ฆ\n$ m400 <dbl> 54, 51, 51, 52, 53, 53, 55, 50, 55, 55, 5โ€ฆ\n$ m800 <dbl> 2.1, 2.0, 2.0, 2.0, 2.2, 2.1, 2.2, 2.0, 2โ€ฆ\n$ m1500 <dbl> 4.4, 4.1, 4.2, 4.1, 4.6, 4.5, 4.5, 4.1, 4โ€ฆ\n$ m3000 <dbl> 9.8, 9.1, 9.3, 8.9, 9.8, 9.8, 9.5, 8.8, 9โ€ฆ\n$ marathon <dbl> 179, 152, 159, 158, 170, 169, 191, 149, 1โ€ฆ\n$ country <chr> \"argentin\", \"australi\", \"austria\", \"belgiโ€ฆ\n\n\nSource: Johnson and Wichern, Applied multivariate analysis" }, { - "objectID": "week1/tutorialsol.html#exercises", - "href": "week1/tutorialsol.html#exercises", - "title": "ETC53250/5250 Tutorial 1", - "section": "Exercises:", - "text": "Exercises:\n\n1. The materials at https://learnr.numbat.space are an especially good way to check your R skills are ready for the unit. Regardless how advanced you are, at some point you will need help. How you ask for help is a big factor in getting your problem fixed. The following code generates an error.\n\nlibrary(dplyr)\nlibrary(MASS)\nlibrary(palmerpenguins)\np_sub <- penguins |>\n select(species, flipper_length_mm) |>\n filter(species == \"Adelie\")\n\n\nCan you work out why?\nUse the reprex package to create a text where the code and error are visible, and can be shared with someone that might be able to help.\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\nThe error is\nError in select(penguins, species, flipper_length_mm) : \n unused arguments (species, flipper_length_mm)\nand is caused by a conflict in functions between the dplyr and MASS packages. If you read the warning messages when the packages were loaded you might have been aware of this before trying to run code.\nYou can fix it by:\n\nPrefacing functions that have conflicts with their package name, eg dplyr::select()\nUse the conflicted package to set your preferences at the start of any document.\n\nTo make the reprex, copy the code to clipboard, and run reprex(). This will generate:\n\n\n\n\n\n\n\n2. Your turn to write some code that generates an error. Create a reprex, and share with your tutor or neighbour, to see if they can fix the error.\n\n\n3. Follow the guidelines at https://tensorflow.rstudio.com/install/ to setup python and tensorflow on your computer. Then test your installation by following the beginner tutorial.\n\n\n4. Download the slides.qmd file for week 1 lecture.\n\nUse knitr::purl() to extract the R code for the class.\nOpen the resulting slides.R file in your RStudio file browser. What code is in the setup.R file that is sourced at the top?\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\nLibraries are loaded.\nThere are some global options for slides set, and styling of plots.\nConflicts for some common functions are resolved with preferences.\n\n\n\n\n\n\nRun the rest of the code in small chunks. Does it all work for you? Do you get any errors? Do you have any suggestions on making it easier to run or understand the code?" 
+ "objectID": "week2/slides.html#explore-the-data-scatterplot-matrix", + "href": "week2/slides.html#explore-the-data-scatterplot-matrix", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Explore the data: scatterplot matrix", + "text": "Explore the data: scatterplot matrix\n\n\n\n\n\n\n\n\n\n\n\n\nWhat do you learn?\n\n\nLinear relationships between most variables\nOutliers in long distance events, and in 400m vs 100m, 200m\nNon-linear relationship between marathon and 400m, 800m" }, { - "objectID": "week1/tutorialsol.html#finishing-up", - "href": "week1/tutorialsol.html#finishing-up", - "title": "ETC53250/5250 Tutorial 1", - "section": "๐Ÿ‘‹ Finishing up", - "text": "๐Ÿ‘‹ Finishing up\nMake sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult." + "objectID": "week2/slides.html#explore-the-data-tour", + "href": "week2/slides.html#explore-the-data-tour", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Explore the data: tour", + "text": "Explore the data: tour\n\n\n\n\n\n\nWhat do you learn?\n\nMostly like a very slightly curved pencil\nSeveral outliers, in different directions" }, { - "objectID": "week1/index.html", - "href": "week1/index.html", - "title": "Week 1: Foundations of machine learning", - "section": "", - "text": "ISLR 2.1, 2.2" + "objectID": "week2/slides.html#compute-pca", + "href": "week2/slides.html#compute-pca", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Compute PCA", + "text": "Compute PCA\n\noptions(digits=2)\n\n\ntrack_pca <- prcomp(track[,1:7], center=TRUE, scale=TRUE)\ntrack_pca\n\nStandard deviations (1, .., p=7):\n[1] 2.41 0.81 0.55 0.35 0.23 0.20 0.15\n\nRotation (n x k) = (7 x 7):\n PC1 PC2 PC3 PC4 PC5 PC6 PC7\nm100 0.37 0.49 -0.286 0.319 0.231 0.6198 0.052\nm200 0.37 0.54 -0.230 -0.083 0.041 -0.7108 -0.109\nm400 0.38 0.25 0.515 -0.347 -0.572 0.1909 0.208\nm800 0.38 -0.16 0.585 -0.042 0.620 -0.0191 -0.315\nm1500 0.39 -0.36 0.013 0.430 0.030 -0.2312 0.693\nm3000 0.39 -0.35 -0.153 0.363 -0.463 0.0093 -0.598\nmarathon 0.37 -0.37 -0.484 -0.672 0.131 0.1423 0.070" }, { - "objectID": "week1/index.html#main-reference", - "href": "week1/index.html#main-reference", - "title": "Week 1: Foundations of machine learning", - "section": "", - "text": "ISLR 2.1, 2.2" + "objectID": "week2/slides.html#summarise", + "href": "week2/slides.html#summarise", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Summarise", + "text": "Summarise\nSummary of the principal components:\n\n\n\n\n\n\nPC1\nPC2\nPC3\nPC4\nPC5\nPC6\nPC7\n\n\n\n\nVariance\n5.81\n0.65\n0.30\n0.13\n0.05\n0.04\n0.02\n\n\nProportion\n0.83\n0.09\n0.04\n0.02\n0.01\n0.01\n0.00\n\n\nCum. prop\n0.83\n0.92\n0.97\n0.98\n0.99\n1.00\n1.00\n\n\n\n\n\n\n\nIncrease in variance explained large until \\(k=3\\) PCs, and then tapers off. A choice of 3 PCs would explain 97% of the total variance." 
}, { - "objectID": "week1/index.html#what-you-will-learn-this-week", - "href": "week1/index.html#what-you-will-learn-this-week", - "title": "Week 1: Foundations of machine learning", - "section": "What you will learn this week", - "text": "What you will learn this week\n\nFraming the problems\nNotation and math\nBias variance-tradeoff\nFitting your models: training/test splits, optimisation\nMeasuring fit: accuracy, loss\nDiagnostics: residuals\nFeature engineering: combining variables to better match purpose and help the model fitting" + "objectID": "week2/slides.html#decide", + "href": "week2/slides.html#decide", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Decide", + "text": "Decide\n\n\n\nScree plot: Where is the elbow?\n\n At \\(k=2\\), thus the scree plot suggests 2 PCs would be sufficient to explain the variability." }, { - "objectID": "week1/index.html#lecture-slides", - "href": "week1/index.html#lecture-slides", - "title": "Week 1: Foundations of machine learning", - "section": "Lecture slides", - "text": "Lecture slides\n\nhtml\npdf\nqmd\nR" + "objectID": "week2/slides.html#assess-data-in-the-model-space", + "href": "week2/slides.html#assess-data-in-the-model-space", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Assess: Data-in-the-model-space", + "text": "Assess: Data-in-the-model-space\n\n\n\nVisualise model using a biplot: Plot the principal component scores, and also the contribution of the original variables to the principal component.\n\nA biplot is like a single projection from a tour." }, { - "objectID": "week1/index.html#tutorial-instructions", - "href": "week1/index.html#tutorial-instructions", - "title": "Week 1: Foundations of machine learning", - "section": "Tutorial instructions", - "text": "Tutorial instructions\nInstructions:\n\nhtml\nqmd" + "objectID": "week2/slides.html#interpret", + "href": "week2/slides.html#interpret", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Interpret", + "text": "Interpret\n\nPC1 measures overall magnitude, the strength of the athletics program. High positive values indicate poor programs with generally slow times across events.\nPC2 measures the contrast in the program between short and long distance events. Some countries have relatively stronger long distance atheletes, while others have relatively stronger short distance athletes.\nThere are several outliers visible in this plot, wsamoa, cookis, dpkorea. PCA, because it is computed using the variance in the data, can be affected by outliers. It may be better to remove these countries, and re-run the PCA.\nPC3, may or may not be useful to keep. The interpretation would that this variable summarises countries with different middle distance performance." 
}, { - "objectID": "week1/index.html#assignments", - "href": "week1/index.html#assignments", - "title": "Week 1: Foundations of machine learning", - "section": "Assignments", - "text": "Assignments" + "objectID": "week2/slides.html#assess-model-in-the-data-space", + "href": "week2/slides.html#assess-model-in-the-data-space", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Assess: Model-in-the-data-space", + "text": "Assess: Model-in-the-data-space\n\n\n\ntrack_std <- track |> \n mutate_if(is.numeric, function(x) (x-\n mean(x, na.rm=TRUE))/\n sd(x, na.rm=TRUE))\ntrack_std_pca <- prcomp(track_std[,1:7], \n scale = FALSE, \n retx=TRUE)\ntrack_model <- pca_model(track_std_pca, d=2, s=2)\ntrack_all <- rbind(track_model$points, track_std[,1:7])\nanimate_xy(track_all, edges=track_model$edges,\n edges.col=\"#E7950F\", \n edges.width=3, \n axes=\"off\")\nrender_gif(track_all, \n grand_tour(), \n display_xy(\n edges=track_model$edges, \n edges.col=\"#E7950F\", \n edges.width=3, \n axes=\"off\"),\n gif_file=\"gifs/track_model.gif\",\n frames=500,\n width=400,\n height=400,\n loop=FALSE)\n\nMostly captures the variance in the data. Seems to slightly miss the non-linear relationship." }, { - "objectID": "index.html", - "href": "index.html", + "objectID": "week2/slides.html#delectable-details", + "href": "week2/slides.html#delectable-details", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "", - "text": "Professor Di Cook\n\nEmail: etc3250.clayton-x@monash.edu\nConsultation: Thu 9:00-10:30 (zoom only)" + "section": "Delectable details", + "text": "Delectable details\n\n\n๐Ÿคญ\n\nSometimes the lowest PCs show the interesting patterns, like non-linear relationships, or clusters.\n\n\n\n\nPCA summarises linear relationships, and might not see other interesting dependencies. Projection pursuit is a generalisation that can find other interesting patterns.\nOutliers can affect results, because direction of outliers will appear to have larger variance\nScaling of variables matters, and typically you would first standardise each variable to have mean 0 and variance 1. Otherwise, PCA might simply report the variables with the largest variance, which we already know." 
}, { - "objectID": "index.html#lecturerchief-examiner", - "href": "index.html#lecturerchief-examiner", + "objectID": "week2/slides.html#non-linear-dimension-reduction", + "href": "week2/slides.html#non-linear-dimension-reduction", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "", - "text": "Professor Di Cook\n\nEmail: etc3250.clayton-x@monash.edu\nConsultation: Thu 9:00-10:30 (zoom only)" + "section": "Non-linear dimension reduction", + "text": "Non-linear dimension reduction" }, { - "objectID": "index.html#tutors", - "href": "index.html#tutors", + "objectID": "week2/slides.html#common-approaches", + "href": "week2/slides.html#common-approaches", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Tutors", - "text": "Tutors\n\nPatrick Li\n\nTutorials: Mon 15:00 (LTB 323), Fri 11:00 (CL_33 Innovation Walk, FG04 Bldg 73P)\nConsultation: Thu 10:30-12:00 (W9.20)\n\nHarriet Mason\n\nTutorials: Wed 18:00 (LTB G60), Fri 12:30 (CL_33 Innovation Walk, FG04 Bldg 73P)\nConsultation: Thu 3:00-4:30 (zoom only)\n\nJayani Lakshika\n\nTutorials: Wed 8:00, 9:30 (CL_33 Innovation Walk, FG04 Bldg 73P)\nConsultation: Thu 12:00-1:30 (W9.20)\n\nKrisanat Anukarnsakulchularp\n\nTutorials: Mon 12:00, 13:30 (LTB 323)\nConsultation: Fri 9:30-11:00 (W9.20)" + "section": "Common approaches", + "text": "Common approaches\n\n\nFind some low-dimensional layout of points which approximates the distance between points in high-dimensions, with the purpose being to have a useful representation that reveals high-dimensional patterns, like clusters.\nMultidimensional scaling (MDS) is the original approach:\n\\[\n\\mbox{Stress}_D(x_1, ..., x_n) = \\left(\\sum_{i, j=1; i\\neq j}^n (d_{ij} - d_k(i,j))^2\\right)^{1/2}\n\\] where \\(D\\) is an \\(n\\times n\\) matrix of distances \\((d_{ij})\\) between all pairs of points, and \\(d_k(i,j)\\) is the distance between the points in the low-dimensional space.\nPCA is a special case of MDS. The result from PCA is a linear projection, but generally MDS can provide some non-linear transformation.\n\n\nMany variations being developed:\n\nt-stochastic neighbourhood embedding (t-SNE): compares interpoint distances with a standard probability distribution (eg \\(t\\)-distribution) to exaggerate local neighbourhood differences.\nuniform manifold approximation and projection (UMAP): compares the interpoint distances with what might be expected if the data was uniformly distributed in the high-dimensions.\n\n\nNLDR can be useful but it can also make some misleading representations." 
}, { - "objectID": "index.html#weekly-schedule", - "href": "index.html#weekly-schedule", + "objectID": "week2/slides.html#umap-12", + "href": "week2/slides.html#umap-12", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Weekly schedule", - "text": "Weekly schedule\n\nLecture: Wed 1:05-2:45pm\nTutorial: 1.5 hours\nWeekly learning quizzes due Mondays 9am\n\n\n\n\nWeek\nTopic\nReference\nAssessments\n\n\n\n\n26 Feb\nFoundations of machine learning\nISLR 2.1, 2.2\n\n\n\n04 Mar\nVisualising your data and models\nCook and Laa Ch 1, 3, 4, 5, 6, 13\n\n\n\n11 Mar\nRe-sampling and regularisation\nISLR 5.1, 5.2, 6.2, 6.4\n\n\n\n18 Mar\nLogistic regression and discriminant analysis\nISLR 4.3, 4.4\nAssignment 1\n\n\n25 Mar\nTrees and forests\nISLR 8.1, 8.2\n\n\n\n01 Apr\nMid-semester break\n\n\n\n\n08 Apr\nNeural networks and deep learning\nISLR 10.1-10.3, 10.7\nAssignment 2\n\n\n15 Apr\nExplainable artificial intelligence (XAI)\nMolnar 8.1, 8.5, 9.2-9.6\n\n\n\n22 Apr\nSupport vector machines and nearest neighbours\nISLR 9.1-9.3\nAssignment 3\n\n\n29 Apr\nK-nearest neighbours and hierarchical clustering\nHOML Ch 20, 21\n\n\n\n06 May\nModel-based clustering and self-organising maps\nHOML Ch 22\n\n\n\n13 May\nEvaluating your clustering model\nCook and Laa Ch 12\nProject\n\n\n20 May\nProject presentations by Masters students" + "section": "UMAP (1/2)", + "text": "UMAP (1/2)\n\n\n\nUMAP 2D representation\n\n\n\n\n\n\n\n\n\n\n\nlibrary(uwot)\nset.seed(253)\np_tidy_umap <- umap(p_tidy_std[,2:5], init = \"spca\")\n\n\n\nTour animation" }, { - "objectID": "index.html#assessments", - "href": "index.html#assessments", + "objectID": "week2/slides.html#umap-22", + "href": "week2/slides.html#umap-22", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Assessments", - "text": "Assessments\n\nWeekly learning quizzes: 3%\nAssignment 1: Instructions, Submit to moodle (9%)\nAssignment 2: Instructions, Submit to moodle (9%)\nAssignment 3: Instructions, Submit to moodle (9%)\nProject: 10%\nFinal exam: 60%" + "section": "UMAP (2/2)", + "text": "UMAP (2/2)\n\n\n\nUMAP 2D representation\n\n\n\n\n\n\n\n\n\nTour animation" }, { - "objectID": "index.html#software", - "href": "index.html#software", + "objectID": "week2/slides.html#next-re-sampling-and-regularisation", + "href": "week2/slides.html#next-re-sampling-and-regularisation", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Software", - "text": "Software\nWe will be using the latest versions of R and RStudio.\nHere is the code to install (most of) the R packages we will be using in this unit.\ninstall.packages(c(\"tidyverse\", \"tidymodels\", \"tourr\", \"geozoo\", \"mulgar\", \"ggpcp\", \"plotly\", \"detourr\", \"langevitour\", \"ggbeeswarm\", \"MASS\", \"GGally\", \"ISLR\", \"mvtnorm\", \"rpart\", \"rpart.plot\", \"randomForest\", \"e1071\", \"xgboost\", \"Rtsne\", \"classifly\", \"penalizedLDA\", \"nnet\", \"kernelshap\", \"shapviz\", \"iml\", \"DALEX\", \"cxhull\", \"fpc\", \"mclust\", \"ggdendro\", \"kohonen\", \"aweSOM\", \"patchwork\", \"ggthemes\", \"colorspace\", \"palmerpenguins\"), dependencies = TRUE)\nIf you run into problems completing the full install, the likely culprits are tidyverse and tidymodels. These are bundles of packages, and might fail at individual packages. 
To resolve the problems, install each package from the bundle individually, and donโ€™t install any that fail on your system.\nIn addition, follow these instructions to set up tensorflow and keras, which requires having python installed.\nIf you are relatively new to R, working through the materials at https://learnr.numbat.space is an excellent way to up-skill. You are epsecially encouraged to work through Chapter 3, on Troubleshooting and asking for help, because at some point you will need help with your coding, and how you go about this matters and impacts the ability of others to help you.\nThe ISLR book also comes with python code, and you are welcome to do most of your work with python instead of R. However, what you submit for marking must be done with R." + "section": "Next: Re-sampling and regularisation", + "text": "Next: Re-sampling and regularisation\n\n\n\nETC3250/5250 Lecture 2 | iml.numbat.space" + }, + { + "objectID": "week12/index.html#presentations-from-masters-students", + "href": "week12/index.html#presentations-from-masters-students", + "title": "Week 12: Project presentations by Masters students", + "section": "Presentations from Masters students", + "text": "Presentations from Masters students" + }, + { + "objectID": "week10/index.html", + "href": "week10/index.html", + "title": "Week 10: Model-based clustering and self-organising maps", + "section": "", + "text": "HOML Ch 22" + }, + { + "objectID": "week10/index.html#main-reference", + "href": "week10/index.html#main-reference", + "title": "Week 10: Model-based clustering and self-organising maps", + "section": "", + "text": "HOML Ch 22" + }, + { + "objectID": "week10/index.html#what-you-will-learn-this-week", + "href": "week10/index.html#what-you-will-learn-this-week", + "title": "Week 10: Model-based clustering and self-organising maps", + "section": "What you will learn this week", + "text": "What you will learn this week\n\nModels of multimodality using Gaussian mixtures\nFitting model-based clustering\nDiagnostics for the model fit\nSelf-organising maps and dimension reduction" }, { - "objectID": "resources.html", - "href": "resources.html", - "title": "ETC3250/5250 Resources", - "section": "", - "text": "Books and articles\n\nAn Introduction to Statistical Learning (ISLR)\n\nThis book by James, Witten, Hastie and Tibshirani contains the primary content for the unit. It has the explanations for different methodology, practical labs, and a range of exercises to work through. Use the second edition, with Applications in R.\n\nHands-On Machine Learning with R\n\nThis book by Boehmke & Greenwell is an accessible and practical guide to many aspects of machine learning. Itโ€™s coverage of unsupervised classification is very good.\n\nTidy Modeling with R\n\nMachine learning is an active area of research across several disciplines, primarily statistics and computer science. Perhaps because of this there are many ways to define and fit models. The tidy modeling approach coordinates these into a consistent and understandable workflow. It doesnโ€™t interface to all software, but getting started with machine learning using this mind-set helps you get organised despite the fragmented landscape. 
This book accompanies the software tidymodels.\n\nISLR tidymodels labs\n\nThis book contains the code to do most of the exercises from ISLR using the tidymodels thinking and coding style.\n\nInteractively exploring high-dimensional data and models in R\n\nThis book by Cook and Laa is the primary resource for learning how to visualise high-dimensions, how to explore the data, and to visually examine and diagnose models.\n\nInterpretable Machine Learning\n\nThis book by Christoph Molnar serves as a guide for making black box models explainable. It is an excellent resource for developing your understanding of the different types of models and how to diagnose and interpret them\n\nFeature Engineering A-Z\n\nWritten by Emil Hvitfeldt to cover creating new variables as broadly as possibly. Has classical methods such as dummy variables and box-cox transformations, temporal and spatial data and missing value imputation.\n\n\nUseful links\n\nTensorFlow for R\nA gentle introduction to deep learning in R using Keras\n(M+C)ยฒ Blog" + "objectID": "week10/index.html#assignments", + "href": "week10/index.html#assignments", + "title": "Week 10: Model-based clustering and self-organising maps", + "section": "Assignments", + "text": "Assignments\n\nProject is due on Friday 17 May." }, { "objectID": "week1/slides.html#welcome-meet-the-teaching-team", @@ -1232,970 +1400,914 @@ "text": "Next: Visualisation\n\n\n\nETC3250/5250 Lecture 1 | iml.numbat.space" }, { - "objectID": "week10/index.html", - "href": "week10/index.html", - "title": "Week 10: Model-based clustering and self-organising maps", - "section": "", - "text": "HOML Ch 22" - }, - { - "objectID": "week10/index.html#main-reference", - "href": "week10/index.html#main-reference", - "title": "Week 10: Model-based clustering and self-organising maps", + "objectID": "resources.html", + "href": "resources.html", + "title": "ETC3250/5250 Resources", "section": "", - "text": "HOML Ch 22" - }, - { - "objectID": "week10/index.html#what-you-will-learn-this-week", - "href": "week10/index.html#what-you-will-learn-this-week", - "title": "Week 10: Model-based clustering and self-organising maps", - "section": "What you will learn this week", - "text": "What you will learn this week\n\nModels of multimodality using Gaussian mixtures\nFitting model-based clustering\nDiagnostics for the model fit\nSelf-organising maps and dimension reduction" - }, - { - "objectID": "week10/index.html#assignments", - "href": "week10/index.html#assignments", - "title": "Week 10: Model-based clustering and self-organising maps", - "section": "Assignments", - "text": "Assignments\n\nProject is due on Friday 17 May." 
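The week 10 topics above (Gaussian mixtures, model-based clustering) can be previewed with a minimal sketch like the one below. This is my own illustration rather than unit code: it assumes the standardised penguins data p_tidy_std from the tutorial setup and uses the mclust package from the install list, letting BIC choose among 1 to 9 clusters.

```r
# Sketch: model-based (Gaussian mixture) clustering of the four
# standardised penguin measurements, with the number of clusters and
# covariance structure chosen by BIC.
library(mclust)
p_mc <- Mclust(p_tidy_std[, 2:5], G = 1:9)
summary(p_mc)            # selected covariance model and number of clusters
plot(p_mc, what = "BIC") # BIC across covariance models and G
table(p_mc$classification, p_tidy_std$species)  # compare clusters to species
```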
- }, - { - "objectID": "week12/index.html#presentations-from-masters-students", - "href": "week12/index.html#presentations-from-masters-students", - "title": "Week 12: Project presentations by Masters students", - "section": "Presentations from Masters students", - "text": "Presentations from Masters students" - }, - { - "objectID": "week2/slides.html#overview", - "href": "week2/slides.html#overview", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Overview", - "text": "Overview\nIn this week we will cover:\n\nConceptual framing for visualisation\nCommon methods: scatterplot matrix, parallel coordinates, tours\nDetails on using tours for examining clustering and class structure\nDimension reduction\n\nLinear: principal component analysis\nNon-linear: multidimensional scaling, t-stochastic neighbour embedding (t-SNE), uniform manifold approximation and projection (UMAP)\n\nUsing tours to assess dimension reduction" - }, - { - "objectID": "week2/slides.html#concepts", - "href": "week2/slides.html#concepts", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Concepts", - "text": "Concepts" - }, - { - "objectID": "week2/slides.html#model-in-the-data-space", - "href": "week2/slides.html#model-in-the-data-space", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Model-in-the-data-space", - "text": "Model-in-the-data-space\n\n\n\n\n\nFrom XKCD\n\n\n\n We plot the model on the data to assess whether it fits or is a misfit!\n\n\nDoing this in high-dimensions is considered difficult!\n\n\nSo it is common to only plot the data-in-the-model-space." - }, - { - "objectID": "week2/slides.html#data-in-the-model-space", - "href": "week2/slides.html#data-in-the-model-space", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Data-in-the-model-space", - "text": "Data-in-the-model-space\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nPredictive probabilities are aspects of the model. It is useful to plot. What do we learn here?\n\nBut it doesnโ€™t tell you why there is a difference." + "text": "Books and articles\n\nAn Introduction to Statistical Learning (ISLR)\n\nThis book by James, Witten, Hastie and Tibshirani contains the primary content for the unit. It has the explanations for different methodology, practical labs, and a range of exercises to work through. Use the second edition, with Applications in R.\n\nHands-On Machine Learning with R\n\nThis book by Boehmke & Greenwell is an accessible and practical guide to many aspects of machine learning. Itโ€™s coverage of unsupervised classification is very good.\n\nTidy Modeling with R\n\nMachine learning is an active area of research across several disciplines, primarily statistics and computer science. Perhaps because of this there are many ways to define and fit models. The tidy modeling approach coordinates these into a consistent and understandable workflow. It doesnโ€™t interface to all software, but getting started with machine learning using this mind-set helps you get organised despite the fragmented landscape. 
This book accompanies the software tidymodels.\n\nISLR tidymodels labs\n\nThis book contains the code to do most of the exercises from ISLR using the tidymodels thinking and coding style.\n\nInteractively exploring high-dimensional data and models in R\n\nThis book by Cook and Laa is the primary resource for learning how to visualise high-dimensions, how to explore the data, and to visually examine and diagnose models.\n\nInterpretable Machine Learning\n\nThis book by Christoph Molnar serves as a guide for making black box models explainable. It is an excellent resource for developing your understanding of the different types of models and how to diagnose and interpret them\n\nFeature Engineering A-Z\n\nWritten by Emil Hvitfeldt to cover creating new variables as broadly as possibly. Has classical methods such as dummy variables and box-cox transformations, temporal and spatial data and missing value imputation.\n\n\nUseful links\n\nTensorFlow for R\nA gentle introduction to deep learning in R using Keras\n(M+C)ยฒ Blog" }, { - "objectID": "week2/slides.html#model-in-the-data-space-1", - "href": "week2/slides.html#model-in-the-data-space-1", + "objectID": "index.html", + "href": "index.html", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Model-in-the-data-space", - "text": "Model-in-the-data-space\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nModel is displayed, as a grid of predicted points in the original variable space. Data is overlaid, using text labels. What do you learn?\n\nOne model has a linear boundary, and the other has the highly non-linear boundary, which matches the class cluster better. Also โ€ฆ" + "section": "", + "text": "Professor Di Cook\n\nEmail: etc3250.clayton-x@monash.edu\nConsultation: Thu 9:00-10:30 (zoom only)" }, { - "objectID": "week2/slides.html#how-do-you-visualise-beyond-2d", - "href": "week2/slides.html#how-do-you-visualise-beyond-2d", + "objectID": "index.html#lecturerchief-examiner", + "href": "index.html#lecturerchief-examiner", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "How do you visualise beyond 2D?", - "text": "How do you visualise beyond 2D?" + "section": "", + "text": "Professor Di Cook\n\nEmail: etc3250.clayton-x@monash.edu\nConsultation: Thu 9:00-10:30 (zoom only)" }, { - "objectID": "week2/slides.html#scatterplot-matrix", - "href": "week2/slides.html#scatterplot-matrix", + "objectID": "index.html#tutors", + "href": "index.html#tutors", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Scatterplot matrix", - "text": "Scatterplot matrix\n\n\n Start simply! Make static plots that organise the variables on a page. \nPlot all the pairs of variables. When laid out in a matrix format this is called a scatterplot matrix.\n Here, we see linear association, clumping and clustering, potentially some outliers." 
+ "section": "Tutors", + "text": "Tutors\n\nPatrick Li\n\nTutorials: Mon 15:00 (LTB 323), Fri 11:00 (CL_33 Innovation Walk, FG04 Bldg 73P)\nConsultation: Thu 10:30-12:00 (W9.20)\n\nHarriet Mason\n\nTutorials: Wed 18:00 (LTB G60), Fri 12:30 (CL_33 Innovation Walk, FG04 Bldg 73P)\nConsultation: Thu 3:00-4:30 (zoom only)\n\nJayani Lakshika\n\nTutorials: Wed 8:00, 9:30 (CL_33 Innovation Walk, FG04 Bldg 73P)\nConsultation: Thu 12:00-1:30 (W9.20)\n\nKrisanat Anukarnsakulchularp\n\nTutorials: Mon 12:00, 13:30 (LTB 323)\nConsultation: Fri 9:30-11:00 (W9.20)" }, { - "objectID": "week2/slides.html#scatterplot-matrix-drawbacks", - "href": "week2/slides.html#scatterplot-matrix-drawbacks", + "objectID": "index.html#weekly-schedule", + "href": "index.html#weekly-schedule", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Scatterplot matrix: drawbacks", - "text": "Scatterplot matrix: drawbacks\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nThere is an outlier in the data on the right, like the one in the left, but it is hidden in a combination of variables. Itโ€™s not visible in any pair of variables." + "section": "Weekly schedule", + "text": "Weekly schedule\n\nLecture: Wed 1:05-2:45pm\nTutorial: 1.5 hours\nWeekly learning quizzes due Mondays 9am\n\n\n\n\nWeek\nTopic\nReference\nAssessments\n\n\n\n\n26 Feb\nFoundations of machine learning\nISLR 2.1, 2.2\n\n\n\n04 Mar\nVisualising your data and models\nCook and Laa Ch 1, 3, 4, 5, 6, 13\n\n\n\n11 Mar\nRe-sampling and regularisation\nISLR 5.1, 5.2, 6.2, 6.4\n\n\n\n18 Mar\nLogistic regression and discriminant analysis\nISLR 4.3, 4.4\nAssignment 1\n\n\n25 Mar\nTrees and forests\nISLR 8.1, 8.2\n\n\n\n01 Apr\nMid-semester break\n\n\n\n\n08 Apr\nNeural networks and deep learning\nISLR 10.1-10.3, 10.7\nAssignment 2\n\n\n15 Apr\nExplainable artificial intelligence (XAI)\nMolnar 8.1, 8.5, 9.2-9.6\n\n\n\n22 Apr\nSupport vector machines and nearest neighbours\nISLR 9.1-9.3\nAssignment 3\n\n\n29 Apr\nK-nearest neighbours and hierarchical clustering\nHOML Ch 20, 21\n\n\n\n06 May\nModel-based clustering and self-organising maps\nHOML Ch 22\n\n\n\n13 May\nEvaluating your clustering model\nCook and Laa Ch 12\nProject\n\n\n20 May\nProject presentations by Masters students" }, { - "objectID": "week2/slides.html#perception", - "href": "week2/slides.html#perception", + "objectID": "index.html#assessments", + "href": "index.html#assessments", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Perception", - "text": "Perception\n\nAspect ratio for scatterplots needs to be equal, or square!\n\nWhen you make a scatterplot of two variables from a multivariate data set, most software renders it with an unequal aspect ratio, as a rectangle. You need to over-ride this and force the square aspect ratio. Why?\n\n\n\nBecause it adversely affects the perception of correlation and association between variables." 
+ "section": "Assessments", + "text": "Assessments\n\nWeekly learning quizzes: 3%\nAssignment 1: Instructions, Submit to moodle (9%)\nAssignment 2: Instructions, Submit to moodle (9%)\nAssignment 3: Instructions, Submit to moodle (9%)\nProject: 10%\nFinal exam: 60%" }, { - "objectID": "week2/slides.html#parallel-coordinate-plot", - "href": "week2/slides.html#parallel-coordinate-plot", + "objectID": "index.html#software", + "href": "index.html#software", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Parallel coordinate plot", - "text": "Parallel coordinate plot\n\n\n\nggparcoord(p_tidy, columns = 2:5, alphaLines = 0.5) + \n xlab(\"\") + ylab(\"\") + \n theme(aspect.ratio=0.8)\n\n\n\n\n\n\n\n\n\n Parallel coordinate plots are side-by-side dotplots with values from a row connected with a line.\nExamine the direction and orientation of lines to perceive multivariate relationships.\nCrossing lines indicate negative association. Lines with same slope indicate positive association. Outliers have a different up/down pattern to other points. Groups of lines with same pattern indicate clustering." + "section": "Software", + "text": "Software\nWe will be using the latest versions of R and RStudio.\nHere is the code to install (most of) the R packages we will be using in this unit.\ninstall.packages(c(\"tidyverse\", \"tidymodels\", \"tourr\", \"geozoo\", \"mulgar\", \"ggpcp\", \"plotly\", \"detourr\", \"langevitour\", \"ggbeeswarm\", \"MASS\", \"GGally\", \"ISLR\", \"mvtnorm\", \"rpart\", \"rpart.plot\", \"randomForest\", \"e1071\", \"xgboost\", \"Rtsne\", \"classifly\", \"penalizedLDA\", \"nnet\", \"kernelshap\", \"shapviz\", \"iml\", \"DALEX\", \"cxhull\", \"fpc\", \"mclust\", \"ggdendro\", \"kohonen\", \"aweSOM\", \"patchwork\", \"ggthemes\", \"colorspace\", \"palmerpenguins\"), dependencies = TRUE)\nIf you run into problems completing the full install, the likely culprits are tidyverse and tidymodels. These are bundles of packages, and might fail at individual packages. To resolve the problems, install each package from the bundle individually, and donโ€™t install any that fail on your system.\nIn addition, follow these instructions to set up tensorflow and keras, which requires having python installed.\nIf you are relatively new to R, working through the materials at https://learnr.numbat.space is an excellent way to up-skill. You are epsecially encouraged to work through Chapter 3, on Troubleshooting and asking for help, because at some point you will need help with your coding, and how you go about this matters and impacts the ability of others to help you.\nThe ISLR book also comes with python code, and you are welcome to do most of your work with python instead of R. However, what you submit for marking must be done with R." }, { - "objectID": "week2/slides.html#parallel-coordinate-plot-drawbacks", - "href": "week2/slides.html#parallel-coordinate-plot-drawbacks", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Parallel coordinate plot: drawbacks", - "text": "Parallel coordinate plot: drawbacks\n\n\n\nHard to follow lines - need interactivity\nOrder of variables\nScaling of variables\n\n\n\nBut the advantage is that you can pack a lot of variables into the single page." 
+ "objectID": "week1/index.html", + "href": "week1/index.html", + "title": "Week 1: Foundations of machine learning", + "section": "", + "text": "ISLR 2.1, 2.2" }, { - "objectID": "week2/slides.html#parallel-coordinate-plot-effect-of-scaling", - "href": "week2/slides.html#parallel-coordinate-plot-effect-of-scaling", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Parallel coordinate plot: effect of scaling", - "text": "Parallel coordinate plot: effect of scaling\n\n\n\nggparcoord(p_tidy, columns = 2:5, alphaLines = 0.5,\n scale=\"globalminmax\") + \n xlab(\"\") + ylab(\"\") + \n theme(aspect.ratio=0.8)\n\n\n\n\n\n\n\n\n\n\nggparcoord(p_tidy, columns = 2:5, alphaLines = 0.5,\n scale=\"uniminmax\") + \n xlab(\"\") + ylab(\"\") + \n theme(aspect.ratio=0.8)" + "objectID": "week1/index.html#main-reference", + "href": "week1/index.html#main-reference", + "title": "Week 1: Foundations of machine learning", + "section": "", + "text": "ISLR 2.1, 2.2" }, { - "objectID": "week2/slides.html#parallel-coordinate-plot-effect-of-ordering", - "href": "week2/slides.html#parallel-coordinate-plot-effect-of-ordering", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Parallel coordinate plot: effect of ordering", - "text": "Parallel coordinate plot: effect of ordering\n\n\n\nggparcoord(p_tidy, columns = 2:5, alphaLines = 0.5,\n groupColumn = 1) + \n scale_color_discrete_divergingx(palette = \"Zissou 1\") +\n xlab(\"\") + ylab(\"\") +\n theme(legend.position=\"none\", aspect.ratio=0.8)\n\n\n\n\n\n\n\n\n\n\nggparcoord(p_tidy, columns = 2:5, alphaLines = 0.5,\n groupColumn = 1, order=c(4, 2, 5, 3)) + \n scale_color_discrete_divergingx(palette = \"Zissou 1\") +\n xlab(\"\") + ylab(\"\") +\n theme(legend.position=\"none\", aspect.ratio=0.8)" + "objectID": "week1/index.html#what-you-will-learn-this-week", + "href": "week1/index.html#what-you-will-learn-this-week", + "title": "Week 1: Foundations of machine learning", + "section": "What you will learn this week", + "text": "What you will learn this week\n\nFraming the problems\nNotation and math\nBias variance-tradeoff\nFitting your models: training/test splits, optimisation\nMeasuring fit: accuracy, loss\nDiagnostics: residuals\nFeature engineering: combining variables to better match purpose and help the model fitting" }, { - "objectID": "week2/slides.html#adding-interactivity-to-static-plots-scatterplot-matrix", - "href": "week2/slides.html#adding-interactivity-to-static-plots-scatterplot-matrix", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Adding interactivity to static plots: scatterplot matrix", - "text": "Adding interactivity to static plots: scatterplot matrix\n\n\n\nlibrary(plotly)\ng <- ggpairs(p_tidy, columns=2:5) +\n theme(axis.text = element_blank()) \n\n Selecting points, using plotly, allows you to see where this observation lies in the other plots (pairs of variables).\n\n\nggplotly(g, width=600, height=600)" + "objectID": "week1/index.html#lecture-slides", + "href": "week1/index.html#lecture-slides", + "title": "Week 1: Foundations of machine learning", + "section": "Lecture slides", + "text": "Lecture slides\n\nhtml\npdf\nqmd\nR" }, { - "objectID": "week2/slides.html#adding-interactivity-to-static-plots-parallel-coordinates", - "href": "week2/slides.html#adding-interactivity-to-static-plots-parallel-coordinates", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Adding interactivity to static plots: parallel coordinates", - "text": "Adding 
interactivity to static plots: parallel coordinates\n\n\n\np_pcp <- p_tidy |>\n na.omit() |>\n plot_ly(type = 'parcoords',\n line = list(),\n dimensions = list(\n list(range = c(172, 231),\n label = 'fl', values = ~fl),\n list(range = c(32, 60),\n label = 'bl', values = ~bl),\n list(range = c(2700, 6300),\n label = 'bm', values = ~bm),\n list(range = c(13, 22),\n label = 'bd', values = ~bd)\n )\n )\n\n\n\np_pcp" + "objectID": "week1/index.html#tutorial-instructions", + "href": "week1/index.html#tutorial-instructions", + "title": "Week 1: Foundations of machine learning", + "section": "Tutorial instructions", + "text": "Tutorial instructions\nInstructions:\n\nhtml\nqmd" }, { - "objectID": "week2/slides.html#what-is-high-dimensions", - "href": "week2/slides.html#what-is-high-dimensions", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "What is high-dimensions?", - "text": "What is high-dimensions?" + "objectID": "week1/index.html#assignments", + "href": "week1/index.html#assignments", + "title": "Week 1: Foundations of machine learning", + "section": "Assignments", + "text": "Assignments" }, { - "objectID": "week2/slides.html#high-dimensions-in-statistics", - "href": "week2/slides.html#high-dimensions-in-statistics", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "High-dimensions in statistics", - "text": "High-dimensions in statistics\n\n\n\nIncreasing dimension adds an additional orthogonal axis.\n\nIf you want more high-dimensional shapes there is an R package, geozoo, which will generate cubes, spheres, simplices, mobius strips, torii, boy surface, klein bottles, cones, various polytopes, โ€ฆ\nAnd read or watch Flatland: A Romance of Many Dimensions (1884) Edwin Abbott." + "objectID": "week1/tutorialsol.html", + "href": "week1/tutorialsol.html", + "title": "ETC53250/5250 Tutorial 1", + "section": "", + "text": "The goal for this week is for you to get up and running with the computing environment needed to successfully complete this unit." }, { - "objectID": "week2/slides.html#remember", - "href": "week2/slides.html#remember", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Remember", - "text": "Remember\nData\n\\[\\begin{eqnarray*}\nX_{~n\\times p} =\n[X_{~1}~X_{~2}~\\dots~X_{~p}]_{~n\\times p} = \\left[ \\begin{array}{cccc}\nx_{~11} & x_{~12} & \\dots & x_{~1p} \\\\\nx_{~21} & x_{~22} & \\dots & x_{~2p}\\\\\n\\vdots & \\vdots & & \\vdots \\\\\nx_{~n1} & x_{~n2} & \\dots & x_{~np} \\end{array} \\right]_{~n\\times p}\n\\end{eqnarray*}\\]" + "objectID": "week1/tutorialsol.html#objectives", + "href": "week1/tutorialsol.html#objectives", + "title": "ETC53250/5250 Tutorial 1", + "section": "", + "text": "The goal for this week is for you to get up and running with the computing environment needed to successfully complete this unit." 
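To make the projection notation used in the week 2 slides concrete, here is a small sketch (my own example, not from the unit materials) that projects the standardised penguin measurements onto a hand-written 2D basis; p_tidy_std is assumed from the tutorial setup.

```r
# Sketch: a 2D linear projection Y = X A. The columns of A are unit
# length and orthogonal, so A is a valid projection (basis) matrix.
X <- as.matrix(p_tidy_std[, 2:5])
A <- cbind(c(1, 0, 1,  0) / sqrt(2),
           c(0, 1, 0, -1) / sqrt(2))
round(t(A) %*% A, 10)   # identity matrix: columns are orthonormal
Y <- X %*% A            # n x 2 matrix of projected data
head(Y)
```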
}, { - "objectID": "week2/slides.html#remember-1", - "href": "week2/slides.html#remember-1", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Remember", - "text": "Remember\nProjection\n\\[\\begin{eqnarray*}\nA_{~p\\times d} = \\left[ \\begin{array}{cccc}\na_{~11} & a_{~12} & \\dots & a_{~1d} \\\\\na_{~21} & a_{~22} & \\dots & a_{~2d}\\\\\n\\vdots & \\vdots & & \\vdots \\\\\na_{~p1} & a_{~p2} & \\dots & a_{~pd} \\end{array} \\right]_{~p\\times d}\n\\end{eqnarray*}\\]" + "objectID": "week1/tutorialsol.html#preparation", + "href": "week1/tutorialsol.html#preparation", + "title": "ETC53250/5250 Tutorial 1", + "section": "๐Ÿ”ง Preparation", + "text": "๐Ÿ”ง Preparation\n\nInstall the latest versions of R and RStudio on your computer\n\n\ninstall.packages(c(\"tidyverse\", \"tidymodels\", \"tourr\", \"geozoo\", \"mulgar\", \"ggpcp\", \"plotly\", \"detourr\", \"langevitour\", \"ggbeeswarm\", \"MASS\", \"GGally\", \"ISLR\", \"mvtnorm\", \"rpart\", \"rpart.plot\", \"randomForest\", \"e1071\", \"xgboost\", \"Rtsne\", \"classifly\", \"penalizedLDA\", \"nnet\", \"kernelshap\", \"shapviz\", \"iml\", \"DALEX\", \"cxhull\", \"fpc\", \"mclust\", \"ggdendro\", \"kohonen\", \"aweSOM\", \"patchwork\", \"ggthemes\", \"colorspace\", \"palmerpenguins\"), dependencies = TRUE)\n\n\nCreate a project for this unit called iml.Rproj. All of your tutorial work and assignments should be completed in this workspace." }, { - "objectID": "week2/slides.html#remember-2", - "href": "week2/slides.html#remember-2", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Remember", - "text": "Remember\nProjected data\n\\[\\begin{eqnarray*}\nY_{~n\\times d} = XA = \\left[ \\begin{array}{cccc}\ny_{~11} & y_{~12} & \\dots & y_{~1d} \\\\\ny_{~21} & y_{~22} & \\dots & y_{~2d}\\\\\n\\vdots & \\vdots & & \\vdots \\\\\ny_{~n1} & y_{~n2} & \\dots & y_{~nd} \\end{array} \\right]_{~n\\times d}\n\\end{eqnarray*}\\]" + "objectID": "week1/tutorialsol.html#exercises", + "href": "week1/tutorialsol.html#exercises", + "title": "ETC53250/5250 Tutorial 1", + "section": "Exercises:", + "text": "Exercises:\n\n1. The materials at https://learnr.numbat.space are an especially good way to check your R skills are ready for the unit. Regardless how advanced you are, at some point you will need help. How you ask for help is a big factor in getting your problem fixed. The following code generates an error.\n\nlibrary(dplyr)\nlibrary(MASS)\nlibrary(palmerpenguins)\np_sub <- penguins |>\n select(species, flipper_length_mm) |>\n filter(species == \"Adelie\")\n\n\nCan you work out why?\nUse the reprex package to create a text where the code and error are visible, and can be shared with someone that might be able to help.\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\nThe error is\nError in select(penguins, species, flipper_length_mm) : \n unused arguments (species, flipper_length_mm)\nand is caused by a conflict in functions between the dplyr and MASS packages. If you read the warning messages when the packages were loaded you might have been aware of this before trying to run code.\nYou can fix it by:\n\nPrefacing functions that have conflicts with their package name, eg dplyr::select()\nUse the conflicted package to set your preferences at the start of any document.\n\nTo make the reprex, copy the code to clipboard, and run reprex(). This will generate:\n\n\n\n\n\n\n\n2. Your turn to write some code that generates an error. Create a reprex, and share with your tutor or neighbour, to see if they can fix the error.\n\n\n3. 
Follow the guidelines at https://tensorflow.rstudio.com/install/ to setup python and tensorflow on your computer. Then test your installation by following the beginner tutorial.\n\n\n4. Download the slides.qmd file for week 1 lecture.\n\nUse knitr::purl() to extract the R code for the class.\nOpen the resulting slides.R file in your RStudio file browser. What code is in the setup.R file that is sourced at the top?\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\nLibraries are loaded.\nThere are some global options for slides set, and styling of plots.\nConflicts for some common functions are resolved with preferences.\n\n\n\n\n\n\nRun the rest of the code in small chunks. Does it all work for you? Do you get any errors? Do you have any suggestions on making it easier to run or understand the code?" }, { - "objectID": "week2/slides.html#tours-of-linear-projections", - "href": "week2/slides.html#tours-of-linear-projections", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Tours of linear projections", - "text": "Tours of linear projections\n\n\n\n\nData is 2D: \\(~~p=2\\)\nProjection is 1D: \\(~~d=1\\)\n\n\\[\\begin{eqnarray*}\nA_{~2\\times 1} = \\left[ \\begin{array}{c}\na_{~11} \\\\\na_{~21}\\\\\n\\end{array} \\right]_{~2\\times 1}\n\\end{eqnarray*}\\]\n\n\n Notice that the values of \\(A\\) change between (-1, 1). All possible values being shown during the tour.\n \n \\[\\begin{eqnarray*}\nA = \\left[ \\begin{array}{c}\n1 \\\\\n0\\\\\n\\end{array} \\right]\n~~~~~~~~~~~~~~~~\nA = \\left[ \\begin{array}{c}\n0.7 \\\\\n0.7\\\\\n\\end{array} \\right]\n~~~~~~~~~~~~~~~~\nA = \\left[ \\begin{array}{c}\n0.7 \\\\\n-0.7\\\\\n\\end{array} \\right]\n\n\\end{eqnarray*}\\]\n\n\n watching the 1D shadows we can see:\n\nunimodality\nbimodality, there are two clusters.\n\n\n\n What does the 2D data look like? Can you sketch it?" + "objectID": "week1/tutorialsol.html#finishing-up", + "href": "week1/tutorialsol.html#finishing-up", + "title": "ETC53250/5250 Tutorial 1", + "section": "๐Ÿ‘‹ Finishing up", + "text": "๐Ÿ‘‹ Finishing up\nMake sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult." }, { - "objectID": "week2/slides.html#tours-of-linear-projections-1", - "href": "week2/slides.html#tours-of-linear-projections-1", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Tours of linear projections", - "text": "Tours of linear projections\n\n\n\n\n\n\n\n\n\n\n\n\n โŸต The 2D data" + "objectID": "week1/tutorial.html", + "href": "week1/tutorial.html", + "title": "ETC53250/5250 Tutorial 1", + "section": "", + "text": "The goal for this week is for you to get up and running with the computing environment needed to successfully complete this unit." }, { - "objectID": "week2/slides.html#tours-of-linear-projections-2", - "href": "week2/slides.html#tours-of-linear-projections-2", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Tours of linear projections", - "text": "Tours of linear projections\n\n\n\nData is 3D: \\(p=3\\)\nProjection is 2D: \\(d=2\\)\n\\[\\begin{eqnarray*}\nA_{~3\\times 2} = \\left[ \\begin{array}{cc}\na_{~11} & a_{~12} \\\\\na_{~21} & a_{~22}\\\\\na_{~31} & a_{~32}\\\\\n\\end{array} \\right]_{~3\\times 2}\n\\end{eqnarray*}\\]\n\n\n Notice that the values of \\(A\\) change between (-1, 1). 
All possible values being shown during the tour.\n\n\nSee:\n\ncircular shapes\nsome transparency, reveals middle\nhole in in some projections\nno clustering" + "objectID": "week1/tutorial.html#objectives", + "href": "week1/tutorial.html#objectives", + "title": "ETC53250/5250 Tutorial 1", + "section": "", + "text": "The goal for this week is for you to get up and running with the computing environment needed to successfully complete this unit." }, { - "objectID": "week2/slides.html#tours-of-linear-projections-3", - "href": "week2/slides.html#tours-of-linear-projections-3", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Tours of linear projections", - "text": "Tours of linear projections\n\n\n\nData is 4D: \\(p=4\\)\nProjection is 2D: \\(d=2\\)\n\\[\\begin{eqnarray*}\nA_{~4\\times 2} = \\left[ \\begin{array}{cc}\na_{~11} & a_{~12} \\\\\na_{~21} & a_{~22}\\\\\na_{~31} & a_{~32}\\\\\na_{~41} & a_{~42}\\\\\n\\end{array} \\right]_{~4\\times 2}\n\\end{eqnarray*}\\]\n\n How many clusters do you see?\n\n\nthree, right?\none separated, and two very close,\nand they each have an elliptical shape.\n\n\n\n\ndo you also see an outlier or two?" + "objectID": "week1/tutorial.html#preparation", + "href": "week1/tutorial.html#preparation", + "title": "ETC53250/5250 Tutorial 1", + "section": "๐Ÿ”ง Preparation", + "text": "๐Ÿ”ง Preparation\n\nInstall the latest versions of R and RStudio on your computer\n\n\ninstall.packages(c(\"tidyverse\", \"tidymodels\", \"tourr\", \"geozoo\", \"mulgar\", \"ggpcp\", \"plotly\", \"detourr\", \"langevitour\", \"ggbeeswarm\", \"MASS\", \"GGally\", \"ISLR\", \"mvtnorm\", \"rpart\", \"rpart.plot\", \"randomForest\", \"e1071\", \"xgboost\", \"Rtsne\", \"classifly\", \"penalizedLDA\", \"nnet\", \"kernelshap\", \"shapviz\", \"iml\", \"DALEX\", \"cxhull\", \"fpc\", \"mclust\", \"ggdendro\", \"kohonen\", \"aweSOM\", \"patchwork\", \"ggthemes\", \"colorspace\", \"palmerpenguins\"), dependencies = TRUE)\n\n\nCreate a project for this unit called iml.Rproj. All of your tutorial work and assignments should be completed in this workspace." }, { - "objectID": "week2/slides.html#intuitively-tours-are-like", - "href": "week2/slides.html#intuitively-tours-are-like", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Intuitively, tours are like โ€ฆ", - "text": "Intuitively, tours are like โ€ฆ" + "objectID": "week1/tutorial.html#exercises", + "href": "week1/tutorial.html#exercises", + "title": "ETC53250/5250 Tutorial 1", + "section": "Exercises:", + "text": "Exercises:\n\n1. The materials at https://learnr.numbat.space are an especially good way to check your R skills are ready for the unit. Regardless how advanced you are, at some point you will need help. How you ask for help is a big factor in getting your problem fixed. The following code generates an error.\n\nlibrary(dplyr)\nlibrary(MASS)\nlibrary(palmerpenguins)\np_sub <- penguins |>\n select(species, flipper_length_mm) |>\n filter(species == \"Adelie\")\n\n\nCan you work out why?\nUse the reprex package to create a text where the code and error are visible, and can be shared with someone that might be able to help.\n\n\n\n2. Your turn to write some code that generates an error. Create a reprex, and share with your tutor or neighbour, to see if they can fix the error.\n\n\n3. Follow the guidelines at https://tensorflow.rstudio.com/install/ to setup python and tensorflow on your computer. Then test your installation by following the beginner tutorial.\n\n\n4. 
Download the slides.qmd file for week 1 lecture.\n\nUse knitr::purl() to extract the R code for the class.\nOpen the resulting slides.R file in your RStudio file browser. What code is in the setup.R file that is sourced at the top?\n\n\nRun the rest of the code in small chunks. Does it all work for you? Do you get any errors? Do you have any suggestions on making it easier to run or understand the code?" }, { - "objectID": "week2/slides.html#and-help-to-see-the-datamodel-as-a-whole", - "href": "week2/slides.html#and-help-to-see-the-datamodel-as-a-whole", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "And help to see the data/model as a whole", - "text": "And help to see the data/model as a whole\n\n\nAvoid misinterpretation โ€ฆ\n\n\n\n\n\n\nโ€ฆ see the bigger picture!\n\n\n\n\n\n\n\n\nImage: Sketchplanations." + "objectID": "week1/tutorial.html#finishing-up", + "href": "week1/tutorial.html#finishing-up", + "title": "ETC53250/5250 Tutorial 1", + "section": "๐Ÿ‘‹ Finishing up", + "text": "๐Ÿ‘‹ Finishing up\nMake sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult." }, { - "objectID": "week2/slides.html#anomaly-is-no-longer-hidden", - "href": "week2/slides.html#anomaly-is-no-longer-hidden", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Anomaly is no longer hidden", - "text": "Anomaly is no longer hidden\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nWait for it!" + "objectID": "week11/index.html", + "href": "week11/index.html", + "title": "Week 11: Evaluating your clustering model", + "section": "", + "text": "Cook and Laa Ch 12" }, { - "objectID": "week2/slides.html#how-to-use-a-tour-in-r", - "href": "week2/slides.html#how-to-use-a-tour-in-r", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "How to use a tour in R", - "text": "How to use a tour in R\n\n\nThis is a basic tour, which will run in your RStudio plot window.\n\nlibrary(tourr)\nanimate_xy(flea[, 1:6], rescale=TRUE)\n\n\n This data has a class variable, species.\n\n\nflea |> slice_head(n=3)\n\n species tars1 tars2 head aede1 aede2 aede3\n1 Concinna 191 131 53 150 15 104\n2 Concinna 185 134 50 147 13 105\n3 Concinna 200 137 52 144 14 102\n\n\n\nUse this to colour points with:\n\nanimate_xy(flea[, 1:6], \n col = flea$species, \n rescale=TRUE)\n\n\n\n\nYou can specifically guide the tour choice of projections using\n\nanimate_xy(flea[, 1:6], \n tour_path = guided_tour(holes()), \n col = flea$species, \n rescale = TRUE, \n sphere = TRUE)\n\n\n\n and you can manually choose a variable to control with:\n\nset.seed(915)\nanimate_xy(flea[, 1:6], \n radial_tour(basis_random(6, 2), \n mvar = 6), \n rescale = TRUE,\n col = flea$species)" + "objectID": "week11/index.html#main-reference", + "href": "week11/index.html#main-reference", + "title": "Week 11: Evaluating your clustering model", + "section": "", + "text": "Cook and Laa Ch 12" }, { - "objectID": "week2/slides.html#how-to-save-a-tour", - "href": "week2/slides.html#how-to-save-a-tour", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "How to save a tour", - "text": "How to save a tour\n\n\n\n\n\n\nTo save as an animated gif:\n\nset.seed(645)\nrender_gif(penguins_sub[,1:4],\n grand_tour(),\n display_xy(col=\"#EC5C00\",\n half_range=3.8, \n axes=\"bottomleft\", cex=2.5),\n gif_file = \"../gifs/penguins1.gif\",\n apf = 1/60,\n frames = 1500,\n width = 500, \n height = 400)" + "objectID": 
"week11/index.html#what-you-will-learn-this-week", + "href": "week11/index.html#what-you-will-learn-this-week", + "title": "Week 11: Evaluating your clustering model", + "section": "What you will learn this week", + "text": "What you will learn this week\n\nConfusion tables\nCluster metrics" }, { - "objectID": "week2/slides.html#dimension-reduction", - "href": "week2/slides.html#dimension-reduction", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Dimension reduction", - "text": "Dimension reduction" + "objectID": "week11/index.html#assignments", + "href": "week11/index.html#assignments", + "title": "Week 11: Evaluating your clustering model", + "section": "Assignments", + "text": "Assignments\n\nProject is due on Friday 17 May." }, { - "objectID": "week2/slides.html#pca", - "href": "week2/slides.html#pca", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "PCA", - "text": "PCA\n\n\nFor this 2D data, sketch a line or a direction that if you squashed the data into it would provide most of the information.\n\n\n\n\n\n\n\n\n\n\n\n What about this data?" + "objectID": "week2/index.html", + "href": "week2/index.html", + "title": "Week 2: Visualising your data and models", + "section": "", + "text": "Cook and Laa Ch 1, 3, 4, 5, 6, 13" }, { - "objectID": "week2/slides.html#pca-1", - "href": "week2/slides.html#pca-1", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "PCA", - "text": "PCA\n\nPrincipal component analysis (PCA) produces a low-dimensional representation of a dataset. It finds a sequence of linear combinations of the variables that have maximal variance, and are mutually uncorrelated. It is an unsupervised learning method.\n\nUse it, when:\n\nYou have too many predictors for a regression. Instead, we can use the first few principal components.\nNeed to understand relationships between variables.\nTo make plots summarising the variation in a large number of variables." + "objectID": "week2/index.html#main-reference", + "href": "week2/index.html#main-reference", + "title": "Week 2: Visualising your data and models", + "section": "", + "text": "Cook and Laa Ch 1, 3, 4, 5, 6, 13" }, { - "objectID": "week2/slides.html#first-principal-component", - "href": "week2/slides.html#first-principal-component", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "First principal component", - "text": "First principal component\nThe first principal component is a new variable created from a linear combination\n\\[z_1 = \\phi_{11}x_1 + \\phi_{21} x_2 + \\dots + \\phi_{p1} x_p\\]\nof the original \\(x_1, x_2, \\dots, x_p\\) that has the largest variance. 
The elements \\(\\phi_{11},\\dots,\\phi_{p1}\\) are the loadings of the first principal component and are constrained by:\n\\[\n\\displaystyle\\sum_{j=1}^p \\phi^2_{j1} = 1\n\\]" + "objectID": "week2/index.html#what-you-will-learn-this-week", + "href": "week2/index.html#what-you-will-learn-this-week", + "title": "Week 2: Visualising your data and models", + "section": "What you will learn this week", + "text": "What you will learn this week\n\nDimension reduction methods: linear and non-linear\nVisualising high-dimensions using animations of linear projections\nScatterplot matrices\nParallel coordinate plots\nConcept of model-in-the-data-space, relative to data-in-the-moel-space" }, { - "objectID": "week2/slides.html#calculation", - "href": "week2/slides.html#calculation", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Calculation", - "text": "Calculation\n\nThe loading vector \\(\\phi_1 = [\\phi_{11},\\dots,\\phi_{p1}]^\\top\\) defines direction in feature space along which data vary most.\nIf we project the \\(n\\) data points \\({x}_1,\\dots,{x}_n\\) onto this direction, the projected values are the principal component scores \\(z_{11},\\dots,z_{n1}\\).\n\n\n\n\nThe second principal component is the linear combination \\(z_{i2} = \\phi_{12}x_{i1} + \\phi_{22}x_{i2} + \\dots + \\phi_{p2}x_{ip}\\) that has maximal variance among all linear combinations that are uncorrelated with \\(z_1\\).\nEquivalent to constraining \\(\\phi_2\\) to be orthogonal (perpendicular) to \\(\\phi_1\\). And so on.\nThere are at most \\(\\min(n - 1, p)\\) PCs." + "objectID": "week2/index.html#lecture-slides", + "href": "week2/index.html#lecture-slides", + "title": "Week 2: Visualising your data and models", + "section": "Lecture slides", + "text": "Lecture slides\n\nhtml\npdf\nqmd\nR" }, { - "objectID": "week2/slides.html#example", - "href": "week2/slides.html#example", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Example", - "text": "Example\n\n \n\nIf you think of the first few PCs like a linear model fit, and the others as the error, it is like regression, except that errors are orthogonal to model.\n(Chapter6/6.15.pdf)" + "objectID": "week2/index.html#tutorial-instructions", + "href": "week2/index.html#tutorial-instructions", + "title": "Week 2: Visualising your data and models", + "section": "Tutorial instructions", + "text": "Tutorial instructions\nInstructions:\n\nhtml\nqmd" }, { - "objectID": "week2/slides.html#geometry", - "href": "week2/slides.html#geometry", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Geometry", - "text": "Geometry\nPCA can be thought of as fitting an \\(n\\)-dimensional ellipsoid to the data, where each axis of the ellipsoid represents a principal component. The new variables produced by principal components correspond to rotating and scaling the ellipse into a circle. It spheres the data." + "objectID": "week2/index.html#assignments", + "href": "week2/index.html#assignments", + "title": "Week 2: Visualising your data and models", + "section": "Assignments", + "text": "Assignments\n\nAssignment 1 is due on Friday 22 March." 
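As a companion to the principal component loadings and scores defined in the week 2 slides above, the sketch below (my own illustration, assuming p_tidy_std from the tutorial setup) shows where each quantity lives in the output of prcomp().

```r
# Sketch: PCA of the four penguin measurements. The rotation matrix
# holds the loading vectors (phi), x holds the scores (z), and
# summary() reports the (cumulative) proportion of variance explained.
p_pca <- prcomp(p_tidy_std[, 2:5], center = TRUE, scale. = TRUE)
p_pca$rotation                     # loadings
head(p_pca$x)                      # principal component scores
summary(p_pca)                     # PVE and cumulative PVE
screeplot(p_pca, type = "lines")   # scree plot for choosing k
```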
}, { - "objectID": "week2/slides.html#computation", - "href": "week2/slides.html#computation", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Computation", - "text": "Computation\nSuppose we have a \\(n\\times p\\) data set \\(X = [x_{ij}]\\).\n\nCentre each of the variables to have mean zero (i.e., the column means of \\({X}\\) are zero).\nLet \\(z_{i1} = \\phi_{11}x_{i1} + \\phi_{21} x_{i2} + \\dots + \\phi_{p1} x_{ip}\\)\nCompute sample variance of \\(z_{i1}\\) is \\(\\displaystyle\\frac1n\\sum_{i=1}^n z_{i1}^2\\).\nEstimate \\(\\phi_{j1}\\)\n\n\\[\n\\mathop{\\text{maximize}}_{\\phi_{11},\\dots,\\phi_{p1}} \\frac{1}{n}\\sum_{i=1}^n\n\\left(\\sum_{j=1}^p \\phi_{j1}x_{ij}\\right)^{\\!\\!\\!2} \\text{ subject to }\n\\sum_{j=1}^p \\phi^2_{j1} = 1\n\\]\nRepeat optimisation to estimate \\(\\phi_{jk}\\), with additional constraint that \\(\\sum_{j=1, k<k'}^p \\phi_{jk}\\phi_{jk'} = 0\\) (next vector is orthogonal to previous eigenvector)." + "objectID": "week2/tutorialsol.html", + "href": "week2/tutorialsol.html", + "title": "ETC3250/5250 Tutorial 2", + "section": "", + "text": "The goal for this week is for you to learn and practice some of the basics of machine learning." }, { - "objectID": "week2/slides.html#alternative-forumulations", - "href": "week2/slides.html#alternative-forumulations", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Alternative forumulations", - "text": "Alternative forumulations\n\n\nEigen-decomposition\n\nCompute the covariance matrix (after centering the columns of \\({X}\\)) \\[S = {X}^T{X}\\]\nFind eigenvalues (diagonal elements of \\(D\\)) and eigenvectors ( \\(V\\) ): \\[{S}={V}{D}{V}^T\\] where columns of \\({V}\\) are orthonormal (i.e., \\({V}^T{V}={I}\\))\n\n\nSingular Value Decomposition\n\\[X = U\\Lambda V^T\\]\n\n\\(X\\) is an \\(n\\times p\\) matrix\n\\(U\\) is \\(n \\times r\\) matrix with orthonormal columns ( \\(U^TU=I\\) )\n\\(\\Lambda\\) is \\(r \\times r\\) diagonal matrix with non-negative elements. (Square root of the eigenvalues.)\n\\(V\\) is \\(p \\times r\\) matrix with orthonormal columns (These are the eigenvectors, and \\(V^TV=I\\) ).\n\nIt is always possible to uniquely decompose a matrix in this way." + "objectID": "week2/tutorialsol.html#objectives", + "href": "week2/tutorialsol.html#objectives", + "title": "ETC3250/5250 Tutorial 2", + "section": "", + "text": "The goal for this week is for you to learn and practice some of the basics of machine learning." 
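The week 2 slides also describe an SVD formulation of PCA; it can be checked numerically with the sketch below, under the same assumptions as before (p_tidy_std from the tutorial setup). Agreement is up to sign changes of the columns.

```r
# Sketch: for column-centred X, the right singular vectors V match the
# prcomp() rotation (up to sign), and the singular values divided by
# sqrt(n - 1) match the PC standard deviations.
X <- scale(as.matrix(p_tidy_std[, 2:5]), center = TRUE, scale = FALSE)
sv <- svd(X)
p_pca <- prcomp(X)
head(sv$v)
head(p_pca$rotation)
sv$d / sqrt(nrow(X) - 1)
p_pca$sdev
```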
}, { - "objectID": "week2/slides.html#total-variance", - "href": "week2/slides.html#total-variance", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Total variance", - "text": "Total variance\nRemember, PCA is trying to summarise the variance in the data.\nTotal variance (TV) in data (assuming variables centered at 0):\n\\[\n\\text{TV} = \\sum_{j=1}^p \\text{Var}(x_j) = \\sum_{j=1}^p \\frac{1}{n}\\sum_{i=1}^n x_{ij}^2\n\\]\nIf variables are standardised, TV=number of variables.\n\nVariance explained by mโ€™th PC: \\(V_m = \\text{Var}(z_m) = \\frac{1}{n}\\sum_{i=1}^n z_{im}^2\\)\n\\[\n\\text{TV} = \\sum_{m=1}^M V_m \\text{ where }M=\\min(n-1,p).\n\\]" + "objectID": "week2/tutorialsol.html#preparation", + "href": "week2/tutorialsol.html#preparation", + "title": "ETC3250/5250 Tutorial 2", + "section": "๐Ÿ”ง Preparation", + "text": "๐Ÿ”ง Preparation\n\nComplete the quiz\nDo the reading related to week 1" }, { - "objectID": "week2/slides.html#how-to-choose-k", - "href": "week2/slides.html#how-to-choose-k", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "How to choose \\(k\\)?", - "text": "How to choose \\(k\\)?\n\nPCA is a useful dimension reduction technique for large datasets, but deciding on how many dimensions to keep isnโ€™t often clear.\n\nHow do we know how many principal components to choose?" + "objectID": "week2/tutorialsol.html#exercises", + "href": "week2/tutorialsol.html#exercises", + "title": "ETC3250/5250 Tutorial 2", + "section": "Exercises:", + "text": "Exercises:\nOpen your project for this unit called iml.Rproj.\n\n1. Answer the following questions for this data matrix,\n\\[\\begin{align*}\n{\\mathbf X} = \\left[\\begin{array}{rrrrr}\n2 & -2 & -8 & 6 & -7 \\\\\n6 & 6 & -4 & 9 & 6 \\\\\n5 & 4 & 3 & -7 & 8 \\\\\n1 & -7 & 6 & 7 & -1\n\\end{array}\\right]\n\\end{align*}\\]\n\nWhat is \\(X_1\\) (variable 1)?\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\\(X_1 = (2 ~6 ~5 ~1)\\)\n\n\n\n\n\nWhat is observation 3?\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\\(5 ~ 4 ~ 3 ~ -7 ~ 8\\)\n\n\n\n\n\nWhat is \\(n\\)?\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\\(4\\)\n\n\n\n\n\nWhat is \\(p\\)?\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\\(5\\)\n\n\n\n\n\nWhat is \\(X^\\top\\)?\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\\[\\begin{align*}\n{\\mathbf X}^\\top = \\left[\\begin{array}{rrrr}\n2 & 6 & 5 & 1\\\\\n-2 & 6 & 4 & -7\\\\\n-8 & -4 & 3 & 6 \\\\\n6 & 9 & -7 & 7 \\\\\n-7 & 6 & 8 & -1\n\\end{array}\\right]\n\\end{align*}\\]\n\n\n\n\n\nWrite a projection matrix which would generate a 2D projection where the first data projection has variables 1 and 4 combined equally, and the second data projection has one third of variable 2 and two thirds of 5.\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\\[\\begin{align*}\n{\\mathbf A} = \\left[\\begin{array}{rr}\n\\frac{1}{\\sqrt{2}} & 0 \\\\\n0 & \\frac{1}{\\sqrt{3}} \\\\\n0 & 0 \\\\\n\\frac{1}{\\sqrt{2}} & 0 \\\\\n0 & \\frac{\\sqrt{2}}{\\sqrt{3}} \\\\\n\\end{array}\\right]\n\\end{align*}\\]\n\n\n\n\n\nWhy canโ€™t the following matrix considered a projection matrix?\n\n\\[\\begin{align*}\n{\\mathbf A} = \\left[\\begin{array}{rr}\n-1/\\sqrt{2} & 1/\\sqrt{3} \\\\\n0 & 0 \\\\\n1/\\sqrt{2} & 0 \\\\\n0 & \\sqrt{2}/\\sqrt{3} \\\\\n\\end{array}\\right]\n\\end{align*}\\]\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\nThe columns are not orthonormal. The cross-product is not equal to 0.\n\n\n\n\n\n\n2. Which of these statements is the most accurate? And which is the most precise?\nA. It is almost certain to rain in the next week.\nB. 
It is 90% likely to get at least 10mm of rain tomorrow.\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\nA is more accurate, but B is more precise.\n\n\n\n\n\n\n3. For the following data, make an appropriate training test split of 60:40. The response variable is cause. Deomstrate that you have made an appropriate split.\n\nlibrary(readr)\nlibrary(dplyr)\nlibrary(rsample)\n\nbushfires <- read_csv(\"https://raw.githubusercontent.com/dicook/mulgar_book/pdf/data/bushfires_2019-2020.csv\")\nbushfires |> count(cause)\n\n# A tibble: 4 ร— 2\n cause n\n <chr> <int>\n1 accident 138\n2 arson 37\n3 burning_off 9\n4 lightning 838\n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\nThe data is unbalanced, so it is especially important to stratify the sampling by the response variable. Without stratifying the test set is likely missing observations in the burning_off category.\n\nset.seed(1156)\nbushfires_split <- initial_split(bushfires, prop = 0.60, strata=cause)\ntraining(bushfires_split) |> count(cause)\n\n# A tibble: 4 ร— 2\n cause n\n <chr> <int>\n1 accident 84\n2 arson 21\n3 burning_off 5\n4 lightning 502\n\ntesting(bushfires_split) |> count(cause)\n\n# A tibble: 4 ร— 2\n cause n\n <chr> <int>\n1 accident 54\n2 arson 16\n3 burning_off 4\n4 lightning 336\n\n\n\n\n\n\n\n\n4. In the lecture slides from week 1 on bias vs variance, these four images were shown.\n \n \nMark the images with the labels โ€œtrue modelโ€, โ€œfitted modelโ€, โ€œbiasโ€. Then explain in your own words why the different model shown in each has (potentially) large bias or small bias, and small variance or large variance.\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\nThe linear model will be very similar regardless of the training sample, so it has small variance. But because it misses the curved nature of the true model, it has large bias, missing critical parts of the two classes that are different.\nThe non-parametric model which captures the curves thus has small bias, but the fitted model might vary a lot from one training sample to another which would result in it being considered to have large variance.\n \n\n\n\n\n\n\n5. The following data contains true class and predictive probabilities for a model fit. Answer the questions below for this data.\n\npred_data <- read_csv(\"https://raw.githubusercontent.com/numbats/iml/master/data/tutorial_pred_data.csv\") |>\n mutate(true = factor(true))\n\n\nHow many classes?\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\npred_data |> count(true)\n\n# A tibble: 2 ร— 2\n true n\n <fct> <int>\n1 Adelie 30\n2 Chinstrap 5\n\n\n\n\n\n\n\nCompute the confusion table, using the maximum predictive probability to label the observation.\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\nlibrary(tidyr)\npred_data <- pred_data |>\n mutate(pred = levels(pred_data$true)[apply(pred_data[,-1], 1, which.max)])\npred_data |> count(true, pred) |> \n group_by(true) |>\n mutate(cl_err = n[pred==true]/sum(n)) |>\n pivot_wider(names_from = pred, \n values_from = n,\n values_fill = 0) |>\n dplyr::select(true, Adelie, Chinstrap, cl_err)\n\n# A tibble: 2 ร— 4\n# Groups: true [2]\n true Adelie Chinstrap cl_err\n <fct> <int> <int> <dbl>\n1 Adelie 30 0 1 \n2 Chinstrap 2 3 0.6\n\n\n\n\n\n\n\nCompute the accuracy, and accuracy if all observations were classified as Adelie. Why is the accuracy almost as good when all observations are predicted to be the majority class?\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\nAccuracy = 33/35 = 0.94\nAccuracy when all predicted to be Adelie = 30/35 = 0.86\nThere are only 5 observations in the Chinstrap class. 
So accuracy remains high, if we simply ignore this class.\n\n\n\n\n\nCompute the balanced accuracy, by averaging the class errors. Why is it lower than the overall accuracy? Which is the better accuracy to use to reflect the ability to classify this data?\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\nThe balanced accuracy is 0.8. This is a better reflection on the predictive ability of the model for this data because it reflects the difficulty in predicting the Chinstrap group.\n\n\n\n\n\n\n6. This question relates to feature engineering, creating better variables on which to build your model.\n\nThe following spam data has a heavily skewed distribution for the size of the email message. How would you transform this variable to better see differences between spam and ham emails?\n\n\nlibrary(ggplot2)\nlibrary(ggbeeswarm)\nspam <- read_csv(\"http://ggobi.org/book/data/spam.csv\")\nggplot(spam, aes(x=spam, y=size.kb, colour=spam)) +\n geom_quasirandom() +\n scale_color_brewer(\"\", palette = \"Dark2\") + \n coord_flip() +\n theme(legend.position=\"none\")\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\nggplot(spam, aes(x=spam, y=size.kb, colour=spam)) +\n geom_quasirandom() +\n scale_color_brewer(\"\", palette = \"Dark2\") + \n coord_flip() +\n theme(legend.position=\"none\") +\n scale_y_log10()\n\n\n\n\n\n\n\n\n\n\n\n\n\nFor the following data, how would you construct a new single variable which would capture the difference between the two classes using a linear model?\n\n\nolive <- read_csv(\"http://ggobi.org/book/data/olive.csv\") |>\n dplyr::filter(region != 1) |>\n dplyr::select(region, arachidic, linoleic) |>\n mutate(region = factor(region))\nggplot(olive, aes(x=linoleic, \n y=arachidic, \n colour=region)) +\n geom_point() +\n scale_color_brewer(\"\", palette = \"Dark2\") + \n theme(legend.position=\"none\", \n aspect.ratio=1)\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\nolive <- olive |>\n mutate(linoarch = 0.377 * linoleic + \n 0.926 * arachidic)\nggplot(olive, aes(x=region, \n y=linoarch, \n colour=region)) +\n geom_quasirandom() +\n scale_color_brewer(\"\", palette = \"Dark2\") + \n coord_flip() +\n theme(legend.position=\"none\") \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n7. Discuss with your neighbour, what you found the most difficult part of last weekโ€™s content. Find some material (from resources or googling) together that gives alternative explanations that make it clearer." }, { - "objectID": "week2/slides.html#how-to-choose-k-1", - "href": "week2/slides.html#how-to-choose-k-1", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "How to choose \\(k\\)?", - "text": "How to choose \\(k\\)?\n\n\nProportion of variance explained:\n\\[\\text{PVE}_m = \\frac{V_m}{TV}\\]\nChoosing the number of PCs that adequately summarises the variation in \\(X\\), is achieved by examining the cumulative proportion of variance explained.\n\n\nCumulative proportion of variance explained:\n\\[\\text{CPVE}_k = \\sum_{m=1}^k\\frac{V_m}{TV}\\]" + "objectID": "week2/tutorialsol.html#finishing-up", + "href": "week2/tutorialsol.html#finishing-up", + "title": "ETC3250/5250 Tutorial 2", + "section": "๐Ÿ‘‹ Finishing up", + "text": "๐Ÿ‘‹ Finishing up\nMake sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult." 
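The class-wise errors and balanced accuracy computed by hand in the tutorial solutions above can also be obtained with yardstick (part of tidymodels); this sketch assumes pred_data with the true and pred columns as constructed in that solution.

```r
# Sketch: confusion matrix, accuracy and balanced accuracy via
# yardstick, assuming pred_data$true (factor) and pred_data$pred
# (character) exist as built in the solution above.
library(dplyr)
library(yardstick)
pred_tab <- pred_data |>
  mutate(pred = factor(pred, levels = levels(true)))
conf_mat(pred_tab, truth = true, estimate = pred)
accuracy(pred_tab, truth = true, estimate = pred)
bal_accuracy(pred_tab, truth = true, estimate = pred)
```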
}, { - "objectID": "week2/slides.html#how-to-choose-k-2", - "href": "week2/slides.html#how-to-choose-k-2", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "How to choose \\(k\\)?", - "text": "How to choose \\(k\\)?\n\n\n\nScree plot: Plot of variance explained by each component vs number of component." + "objectID": "week2/tutorial.html", + "href": "week2/tutorial.html", + "title": "ETC3250/5250 Tutorial 2", + "section": "", + "text": "The goal for this week is for you to learn and practice some of the basics of machine learning." }, { - "objectID": "week2/slides.html#how-to-choose-k-3", - "href": "week2/slides.html#how-to-choose-k-3", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "How to choose \\(k\\)?", - "text": "How to choose \\(k\\)?\n\n\n\nScree plot: Plot of variance explained by each component vs number of component." + "objectID": "week2/tutorial.html#objectives", + "href": "week2/tutorial.html#objectives", + "title": "ETC3250/5250 Tutorial 2", + "section": "", + "text": "The goal for this week is for you to learn and practice some of the basics of machine learning." }, { - "objectID": "week2/slides.html#example---track-records", - "href": "week2/slides.html#example---track-records", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Example - track records", - "text": "Example - track records\nThe data on national track records for women (as at 1984).\n\ntrack <- read_csv(here::here(\"data/womens_track.csv\"))\nglimpse(track)\n\nRows: 55\nColumns: 8\n$ m100 <dbl> 12, 11, 11, 11, 11, 11, 12, 11, 12, 12, 1โ€ฆ\n$ m200 <dbl> 23, 22, 23, 23, 23, 23, 24, 22, 25, 24, 2โ€ฆ\n$ m400 <dbl> 54, 51, 51, 52, 53, 53, 55, 50, 55, 55, 5โ€ฆ\n$ m800 <dbl> 2.1, 2.0, 2.0, 2.0, 2.2, 2.1, 2.2, 2.0, 2โ€ฆ\n$ m1500 <dbl> 4.4, 4.1, 4.2, 4.1, 4.6, 4.5, 4.5, 4.1, 4โ€ฆ\n$ m3000 <dbl> 9.8, 9.1, 9.3, 8.9, 9.8, 9.8, 9.5, 8.8, 9โ€ฆ\n$ marathon <dbl> 179, 152, 159, 158, 170, 169, 191, 149, 1โ€ฆ\n$ country <chr> \"argentin\", \"australi\", \"austria\", \"belgiโ€ฆ\n\n\nSource: Johnson and Wichern, Applied multivariate analysis" + "objectID": "week2/tutorial.html#preparation", + "href": "week2/tutorial.html#preparation", + "title": "ETC3250/5250 Tutorial 2", + "section": "๐Ÿ”ง Preparation", + "text": "๐Ÿ”ง Preparation\n\nComplete the quiz\nDo the reading related to week 1" }, { - "objectID": "week2/slides.html#explore-the-data-scatterplot-matrix", - "href": "week2/slides.html#explore-the-data-scatterplot-matrix", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Explore the data: scatterplot matrix", - "text": "Explore the data: scatterplot matrix\n\n\n\n\n\n\n\n\n\n\n\n\nWhat do you learn?\n\n\nLinear relationships between most variables\nOutliers in long distance events, and in 400m vs 100m, 200m\nNon-linear relationship between marathon and 400m, 800m" + "objectID": "week2/tutorial.html#exercises", + "href": "week2/tutorial.html#exercises", + "title": "ETC3250/5250 Tutorial 2", + "section": "Exercises:", + "text": "Exercises:\nOpen your project for this unit called iml.Rproj.\n\n1. 
Answer the following questions for this data matrix,\n\\[\\begin{align*}\n{\\mathbf X} = \\left[\\begin{array}{rrrrr}\n2 & -2 & -8 & 6 & -7 \\\\\n6 & 6 & -4 & 9 & 6 \\\\\n5 & 4 & 3 & -7 & 8 \\\\\n1 & -7 & 6 & 7 & -1\n\\end{array}\\right]\n\\end{align*}\\]\n\nWhat is \\(X_1\\) (variable 1)?\n\n\nWhat is observation 3?\n\n\nWhat is \\(n\\)?\n\n\nWhat is \\(p\\)?\n\n\nWhat is \\(X^\\top\\)?\n\n\nWrite a projection matrix which would generate a 2D projection where the first data projection has variables 1 and 4 combined equally, and the second data projection has one third of variable 2 and two thirds of 5.\n\n\nWhy canโ€™t the following matrix considered a projection matrix?\n\n\\[\\begin{align*}\n{\\mathbf A} = \\left[\\begin{array}{rr}\n-1/\\sqrt{2} & 1/\\sqrt{3} \\\\\n0 & 0 \\\\\n1/\\sqrt{2} & 0 \\\\\n0 & \\sqrt{2}/\\sqrt{3} \\\\\n\\end{array}\\right]\n\\end{align*}\\]\n\n\n2. Which of these statements is the most accurate? And which is the most precise?\nA. It is almost certain to rain in the next week.\nB. It is 90% likely to get at least 10mm of rain tomorrow.\n\n\n3. For the following data, make an appropriate training test split of 60:40. The response variable is cause. Deomstrate that you have made an appropriate split.\n\nlibrary(readr)\nlibrary(dplyr)\nlibrary(rsample)\n\nbushfires <- read_csv(\"https://raw.githubusercontent.com/dicook/mulgar_book/pdf/data/bushfires_2019-2020.csv\")\nbushfires |> count(cause)\n\n# A tibble: 4 ร— 2\n cause n\n <chr> <int>\n1 accident 138\n2 arson 37\n3 burning_off 9\n4 lightning 838\n\n\n\n\n4. In the lecture slides from week 1 on bias vs variance, these four images were shown.\n \n \nMark the images with the labels โ€œtrue modelโ€, โ€œfitted modelโ€, โ€œbiasโ€. Then explain in your own words why the different model shown in each has (potentially) large bias or small bias, and small variance or large variance.\n\n\n5. The following data contains true class and predictive probabilities for a model fit. Answer the questions below for this data.\n\npred_data <- read_csv(\"https://raw.githubusercontent.com/numbats/iml/master/data/tutorial_pred_data.csv\") |>\n mutate(true = factor(true))\n\n\nHow many classes?\n\n\nCompute the confusion table, using the maximum predictive probability to label the observation.\n\n\nCompute the accuracy, and accuracy if all observations were classified as Adelie. Why is the accuracy almost as good when all observations are predicted to be the majority class?\n\n\nCompute the balanced accuracy, by averaging the class errors. Why is it lower than the overall accuracy? Which is the better accuracy to use to reflect the ability to classify this data?\n\n\n\n6. This question relates to feature engineering, creating better variables on which to build your model.\n\nThe following spam data has a heavily skewed distribution for the size of the email message. 
How would you transform this variable to better see differences between spam and ham emails?\n\n\nlibrary(ggplot2)\nlibrary(ggbeeswarm)\nspam <- read_csv(\"http://ggobi.org/book/data/spam.csv\")\nggplot(spam, aes(x=spam, y=size.kb, colour=spam)) +\n geom_quasirandom() +\n scale_color_brewer(\"\", palette = \"Dark2\") + \n coord_flip() +\n theme(legend.position=\"none\")\n\n\n\n\n\n\n\n\n\nFor the following data, how would you construct a new single variable which would capture the difference between the two classes using a linear model?\n\n\nolive <- read_csv(\"http://ggobi.org/book/data/olive.csv\") |>\n dplyr::filter(region != 1) |>\n dplyr::select(region, arachidic, linoleic) |>\n mutate(region = factor(region))\nggplot(olive, aes(x=linoleic, \n y=arachidic, \n colour=region)) +\n geom_point() +\n scale_color_brewer(\"\", palette = \"Dark2\") + \n theme(legend.position=\"none\", \n aspect.ratio=1)\n\n\n\n\n\n\n\n\n\n\n7. Discuss with your neighbour, what you found the most difficult part of last weekโ€™s content. Find some material (from resources or googling) together that gives alternative explanations that make it clearer." }, { - "objectID": "week2/slides.html#explore-the-data-tour", - "href": "week2/slides.html#explore-the-data-tour", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Explore the data: tour", - "text": "Explore the data: tour\n\n\n\n\n\n\nWhat do you learn?\n\nMostly like a very slightly curved pencil\nSeveral outliers, in different directions" + "objectID": "week2/tutorial.html#finishing-up", + "href": "week2/tutorial.html#finishing-up", + "title": "ETC3250/5250 Tutorial 2", + "section": "๐Ÿ‘‹ Finishing up", + "text": "๐Ÿ‘‹ Finishing up\nMake sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult." }, { - "objectID": "week2/slides.html#compute-pca", - "href": "week2/slides.html#compute-pca", + "objectID": "week3/slides.html#overview", + "href": "week3/slides.html#overview", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Compute PCA", - "text": "Compute PCA\n\noptions(digits=2)\n\n\ntrack_pca <- prcomp(track[,1:7], center=TRUE, scale=TRUE)\ntrack_pca\n\nStandard deviations (1, .., p=7):\n[1] 2.41 0.81 0.55 0.35 0.23 0.20 0.15\n\nRotation (n x k) = (7 x 7):\n PC1 PC2 PC3 PC4 PC5 PC6 PC7\nm100 0.37 0.49 -0.286 0.319 0.231 0.6198 0.052\nm200 0.37 0.54 -0.230 -0.083 0.041 -0.7108 -0.109\nm400 0.38 0.25 0.515 -0.347 -0.572 0.1909 0.208\nm800 0.38 -0.16 0.585 -0.042 0.620 -0.0191 -0.315\nm1500 0.39 -0.36 0.013 0.430 0.030 -0.2312 0.693\nm3000 0.39 -0.35 -0.153 0.363 -0.463 0.0093 -0.598\nmarathon 0.37 -0.37 -0.484 -0.672 0.131 0.1423 0.070" + "section": "Overview", + "text": "Overview\nWe will cover:\n\nCommon re-sampling methods: bootstrap, cross-validation, permutation, simulation.\nCross-validation for checking generalisability of model fit, parameter tuning, variable selection.\nBootstrapping for understanding variance of parameter estimates.\nPermutation to understand significance of associations between variables, and variable importance.\nSimulation can be used to assess what might happen with samples from known distributions.\nWhat can go wrong in high-d, and how to adjust using regularisation methods." 
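As a small sketch of the proportion and cumulative proportion of variance explained used in the "How to choose k?" slides, computed with prcomp() on simulated data rather than the womens track records; the seed and variable setup are made up for illustration.

set.seed(202)
X <- matrix(rnorm(200 * 5), ncol = 5)
X[, 2] <- X[, 1] + rnorm(200, sd = 0.3)   # induce some correlation

pc <- prcomp(X, center = TRUE, scale. = TRUE)
V <- pc$sdev^2        # variance explained by each PC
pve <- V / sum(V)     # proportion of variance explained (PVE)
cumsum(pve)           # cumulative proportion (CPVE), used to choose k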
}, { - "objectID": "week2/slides.html#summarise", - "href": "week2/slides.html#summarise", + "objectID": "week3/slides.html#model-development-and-choice", + "href": "week3/slides.html#model-development-and-choice", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Summarise", - "text": "Summarise\nSummary of the principal components:\n\n\n\n\n\n\nPC1\nPC2\nPC3\nPC4\nPC5\nPC6\nPC7\n\n\n\n\nVariance\n5.81\n0.65\n0.30\n0.13\n0.05\n0.04\n0.02\n\n\nProportion\n0.83\n0.09\n0.04\n0.02\n0.01\n0.01\n0.00\n\n\nCum. prop\n0.83\n0.92\n0.97\n0.98\n0.99\n1.00\n1.00\n\n\n\n\n\n\n\nIncrease in variance explained large until \\(k=3\\) PCs, and then tapers off. A choice of 3 PCs would explain 97% of the total variance." + "section": "Model development and choice", + "text": "Model development and choice" }, { - "objectID": "week2/slides.html#decide", - "href": "week2/slides.html#decide", + "objectID": "week3/slides.html#how-do-you-get-new-data", + "href": "week3/slides.html#how-do-you-get-new-data", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Decide", - "text": "Decide\n\n\n\nScree plot: Where is the elbow?\n\n At \\(k=2\\), thus the scree plot suggests 2 PCs would be sufficient to explain the variability." + "section": "How do you get new data?", + "text": "How do you get new data?" }, { - "objectID": "week2/slides.html#assess-data-in-the-model-space", - "href": "week2/slides.html#assess-data-in-the-model-space", + "objectID": "week3/slides.html#common-re-sampling-methods", + "href": "week3/slides.html#common-re-sampling-methods", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Assess: Data-in-the-model-space", - "text": "Assess: Data-in-the-model-space\n\n\n\nVisualise model using a biplot: Plot the principal component scores, and also the contribution of the original variables to the principal component.\n\nA biplot is like a single projection from a tour." + "section": "Common re-sampling methods", + "text": "Common re-sampling methods\n\n\n\n\nCross-validation: Splitting the data into multiple samples.\nBootstrap: Sampling with replacement\nPermutation: Re-order the values of one or more variables\n\n\n\n\nCross-validation: This is used to gain some understanding of the variance (as in bias-variance trade-off ) of models, and how parameter or algorithm choices affect the performance of the model on future samples.\n\n\n\n\n\nBootstrap: Compute confidence intervals for model parameters, or the model fit statistics. can be used similarly to cross-validation samples but avoids the complication of smaller sample size that may affect interpretation of cross-validation samples.\nPermutation: Used to assess significance of relationships, especially to assess the importance of individual variables or combinations of variables for a fitted model." }, { - "objectID": "week2/slides.html#interpret", - "href": "week2/slides.html#interpret", + "objectID": "week3/slides.html#cross-validation", + "href": "week3/slides.html#cross-validation", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Interpret", - "text": "Interpret\n\nPC1 measures overall magnitude, the strength of the athletics program. High positive values indicate poor programs with generally slow times across events.\nPC2 measures the contrast in the program between short and long distance events. 
Some countries have relatively stronger long distance atheletes, while others have relatively stronger short distance athletes.\nThere are several outliers visible in this plot, wsamoa, cookis, dpkorea. PCA, because it is computed using the variance in the data, can be affected by outliers. It may be better to remove these countries, and re-run the PCA.\nPC3, may or may not be useful to keep. The interpretation would that this variable summarises countries with different middle distance performance." + "section": "Cross-validation", + "text": "Cross-validation\n\nTraining/test split: make one split of your data, keeping one purely for assessing future performance.\n\nAfter making that split, we would use these methods on the training sample:\n\nLeave-one-out: make \\(n\\) splits, fitting multiple models and using left-out observation for assessing variability.\n\\(k\\)-fold: break data into \\(k\\) subsets, fitting multiple models with one group left out each time." }, { - "objectID": "week2/slides.html#assess-model-in-the-data-space", - "href": "week2/slides.html#assess-model-in-the-data-space", + "objectID": "week3/slides.html#trainingtest-split-13", + "href": "week3/slides.html#trainingtest-split-13", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Assess: Model-in-the-data-space", - "text": "Assess: Model-in-the-data-space\n\n\n\ntrack_std <- track |> \n mutate_if(is.numeric, function(x) (x-\n mean(x, na.rm=TRUE))/\n sd(x, na.rm=TRUE))\ntrack_std_pca <- prcomp(track_std[,1:7], \n scale = FALSE, \n retx=TRUE)\ntrack_model <- pca_model(track_std_pca, d=2, s=2)\ntrack_all <- rbind(track_model$points, track_std[,1:7])\nanimate_xy(track_all, edges=track_model$edges,\n edges.col=\"#E7950F\", \n edges.width=3, \n axes=\"off\")\nrender_gif(track_all, \n grand_tour(), \n display_xy(\n edges=track_model$edges, \n edges.col=\"#E7950F\", \n edges.width=3, \n axes=\"off\"),\n gif_file=\"gifs/track_model.gif\",\n frames=500,\n width=400,\n height=400,\n loop=FALSE)\n\nMostly captures the variance in the data. Seems to slightly miss the non-linear relationship." + "section": "Training/test split (1/3)", + "text": "Training/test split (1/3)\n \nA set of \\(n\\) observations are randomly split into a training set (blue, containing observations 7, 22, 13, โ€ฆ) and a test set (yellow, all other observations not in training set).\n\nNeed to stratify the sampling to ensure training and test groups are appropriately balanced.\nOnly one split of data made, may have a lucky or unlucky split, accurately estimating test error relies on the one sample.\n\n (Chapter5/5.1.pdf)" }, { - "objectID": "week2/slides.html#delectable-details", - "href": "week2/slides.html#delectable-details", + "objectID": "week3/slides.html#trainingtest-split-23", + "href": "week3/slides.html#trainingtest-split-23", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Delectable details", - "text": "Delectable details\n\n\n๐Ÿคญ\n\nSometimes the lowest PCs show the interesting patterns, like non-linear relationships, or clusters.\n\n\n\n\nPCA summarises linear relationships, and might not see other interesting dependencies. Projection pursuit is a generalisation that can find other interesting patterns.\nOutliers can affect results, because direction of outliers will appear to have larger variance\nScaling of variables matters, and typically you would first standardise each variable to have mean 0 and variance 1. 
Otherwise, PCA might simply report the variables with the largest variance, which we already know." + "section": "Training/test split (2/3)", + "text": "Training/test split (2/3)\n\n\nWith tidymodels, the function initial_split() creates the indexes of observations to be allocated into training or test samples. To generate these samples use training() and test() functions.\n\nd_bal <- tibble(y=c(rep(\"A\", 6), rep(\"B\", 6)),\n x=c(runif(12)))\nd_bal$y\n\n [1] \"A\" \"A\" \"A\" \"A\" \"A\" \"A\" \"B\" \"B\" \"B\" \"B\" \"B\" \"B\"\n\nset.seed(130)\nd_bal_split <- initial_split(d_bal, prop = 0.70)\ntraining(d_bal_split)$y\n\n[1] \"A\" \"A\" \"B\" \"A\" \"B\" \"A\" \"B\" \"A\"\n\ntesting(d_bal_split)$y\n\n[1] \"A\" \"B\" \"B\" \"B\"\n\n\n\nHow do you ensure that you get 0.70 in each class?\n\n\n\nStratify the sampling\n\nd_bal$y\n\n [1] \"A\" \"A\" \"A\" \"A\" \"A\" \"A\" \"B\" \"B\" \"B\" \"B\" \"B\" \"B\"\n\nset.seed(1225)\nd_bal_split <- initial_split(d_bal, \n prop = 0.70, \n strata=y)\ntraining(d_bal_split)$y\n\n[1] \"A\" \"A\" \"A\" \"A\" \"B\" \"B\" \"B\" \"B\"\n\ntesting(d_bal_split)$y\n\n[1] \"A\" \"A\" \"B\" \"B\"\n\n\nNow the test set has 2 Aโ€™s and 2 Bโ€™2. This is best practice!" }, { - "objectID": "week2/slides.html#non-linear-dimension-reduction", - "href": "week2/slides.html#non-linear-dimension-reduction", + "objectID": "week3/slides.html#trainingtest-split-33", + "href": "week3/slides.html#trainingtest-split-33", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Non-linear dimension reduction", - "text": "Non-linear dimension reduction" + "section": "Training/test split (3/3)", + "text": "Training/test split (3/3)\n\n\nNot stratifying can cause major problems with unbalanced samples.\n\nd_unb <- tibble(y=c(rep(\"A\", 2), rep(\"B\", 10)),\n x=c(runif(12)))\nd_unb$y\n\n [1] \"A\" \"A\" \"B\" \"B\" \"B\" \"B\" \"B\" \"B\" \"B\" \"B\" \"B\" \"B\"\n\nset.seed(132)\nd_unb_split <- initial_split(d_unb, prop = 0.70)\ntraining(d_unb_split)$y\n\n[1] \"B\" \"B\" \"A\" \"B\" \"B\" \"A\" \"B\" \"B\"\n\ntesting(d_unb_split)$y\n\n[1] \"B\" \"B\" \"B\" \"B\"\n\n\nThe test set is missing one entire class!\n\n\n\nAlways stratify splitting by sub-groups, especially response variable classes, and possibly other variables too.\n\n\nd_unb_strata <- initial_split(d_unb, \n prop = 0.70, \n strata=y)\ntraining(d_unb_strata)$y\n\n[1] \"A\" \"B\" \"B\" \"B\" \"B\" \"B\" \"B\" \"B\"\n\ntesting(d_unb_strata)$y\n\n[1] \"A\" \"B\" \"B\" \"B\"\n\n\nNow there is an A in the test set!" }, { - "objectID": "week2/slides.html#common-approaches", - "href": "week2/slides.html#common-approaches", + "objectID": "week3/slides.html#checking-the-trainingtest-split-response", + "href": "week3/slides.html#checking-the-trainingtest-split-response", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Common approaches", - "text": "Common approaches\n\n\nFind some low-dimensional layout of points which approximates the distance between points in high-dimensions, with the purpose being to have a useful representation that reveals high-dimensional patterns, like clusters.\nMultidimensional scaling (MDS) is the original approach:\n\\[\n\\mbox{Stress}_D(x_1, ..., x_n) = \\left(\\sum_{i, j=1; i\\neq j}^n (d_{ij} - d_k(i,j))^2\\right)^{1/2}\n\\] where \\(D\\) is an \\(n\\times n\\) matrix of distances \\((d_{ij})\\) between all pairs of points, and \\(d_k(i,j)\\) is the distance between the points in the low-dimensional space.\nPCA is a special case of MDS. 
The result from PCA is a linear projection, but generally MDS can provide some non-linear transformation.\n\n\nMany variations being developed:\n\nt-stochastic neighbourhood embedding (t-SNE): compares interpoint distances with a standard probability distribution (eg \\(t\\)-distribution) to exaggerate local neighbourhood differences.\nuniform manifold approximation and projection (UMAP): compares the interpoint distances with what might be expected if the data was uniformly distributed in the high-dimensions.\n\n\nNLDR can be useful but it can also make some misleading representations." + "section": "Checking the training/test split: response", + "text": "Checking the training/test split: response\n\n\n\nGOOD\n\n\n\n\n\n\n\n\n\n\n\n\nBAD\n\n\n\n\n\n\n\n\n\n\n\n Check the class proportions of the response by computing counts and proportions in each class, and tabulating or plotting the result. Itโ€™s good if there are similar numbers of each class in both sets." }, { - "objectID": "week2/slides.html#umap-12", - "href": "week2/slides.html#umap-12", + "objectID": "week3/slides.html#checking-the-trainingtest-split-predictors", + "href": "week3/slides.html#checking-the-trainingtest-split-predictors", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "UMAP (1/2)", - "text": "UMAP (1/2)\n\n\n\nUMAP 2D representation\n\n\n\n\n\n\n\n\n\n\n\nlibrary(uwot)\nset.seed(253)\np_tidy_umap <- umap(p_tidy_std[,2:5], init = \"spca\")\n\n\n\nTour animation" + "section": "Checking the training/test split: predictors", + "text": "Checking the training/test split: predictors\n\n\n\nGOOD\n\n\nMake a training/test variable and plot the predictors. Need to have similar distributions.\n\n\nLooks good\n\n\n\n\n\n\n\n\n\n\nOn the response training and test sets have similar proportions of each class so looks good BUT itโ€™s not\n\n\nBut BAD\n\n\nTest set has smaller penguins on at least two of the variables." 
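To make the training/test checking concrete, here is a hedged sketch of comparing predictor distributions across a stratified split, using the palmerpenguins data. This is not the exact code behind the slide figures, just one way to do the check with rsample and ggplot2.

library(tidyverse)
library(rsample)
library(palmerpenguins)

set.seed(1156)
p_split <- penguins |>
  drop_na(bill_length_mm) |>
  initial_split(prop = 2/3, strata = species)
p_check <- bind_rows(
  training(p_split) |> mutate(set = "train"),
  testing(p_split)  |> mutate(set = "test"))

# Similar shapes and locations for the two sets suggest a reasonable split
ggplot(p_check, aes(x = bill_length_mm, colour = set)) +
  geom_density()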
}, { - "objectID": "week2/slides.html#umap-22", - "href": "week2/slides.html#umap-22", + "objectID": "week3/slides.html#cross-validation-1", + "href": "week3/slides.html#cross-validation-1", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "UMAP (2/2)", - "text": "UMAP (2/2)\n\n\n\nUMAP 2D representation\n\n\n\n\n\n\n\n\n\nTour animation" + "section": "Cross-validation", + "text": "Cross-validation" }, { - "objectID": "week2/slides.html#next-re-sampling-and-regularisation", - "href": "week2/slides.html#next-re-sampling-and-regularisation", + "objectID": "week3/slides.html#k-fold-cross-validation-14", + "href": "week3/slides.html#k-fold-cross-validation-14", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Next: Re-sampling and regularisation", - "text": "Next: Re-sampling and regularisation\n\n\n\nETC3250/5250 Lecture 2 | iml.numbat.space" + "section": "k-fold cross validation (1/4)", + "text": "k-fold cross validation (1/4)\n\n\n\nDivide the data set into \\(k\\) different parts.\nRemove one part, fit the model on the remaining \\(k โˆ’ 1\\) parts, and compute the statistic of interest on the omitted part.\nRepeat \\(k\\) times taking out a different part each time" }, { - "objectID": "week3/index.html", - "href": "week3/index.html", - "title": "Week 3: Re-sampling and regularisation", - "section": "", - "text": "ISLR 5.1, 5.2, 6.2, 6.4" + "objectID": "week3/slides.html#k-fold-cross-validation-24", + "href": "week3/slides.html#k-fold-cross-validation-24", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "k-fold cross validation (2/4)", + "text": "k-fold cross validation (2/4)\n\n\n\nDivide the data set into \\(k\\) different parts.\nRemove one part, fit the model on the remaining \\(k โˆ’ 1\\) parts, and compute the statistic of interest on the omitted part.\nRepeat \\(k\\) times taking out a different part each time\n\n\nHere are the row numbers for \\(k=5\\) folds:\n\np_folds <- vfold_cv(p_sub, 5, strata=species)\nc(1:nrow(p_sub))[-p_folds$splits[[1]]$in_id]\n\n [1] 5 6 8 12 16 23 28 31 43 44 45 53 57 58 70 73 74 77\n\nc(1:nrow(p_sub))[-p_folds$splits[[2]]$in_id]\n\n [1] 2 9 10 11 13 17 22 25 39 48 50 51 55 61 65 69 75 78\n\nc(1:nrow(p_sub))[-p_folds$splits[[3]]$in_id]\n\n [1] 1 3 14 18 20 26 33 41 42 49 56 67 72 81 83 84\n\nc(1:nrow(p_sub))[-p_folds$splits[[4]]$in_id]\n\n [1] 4 19 29 32 34 35 36 40 46 52 63 64 66 76 79 80\n\nc(1:nrow(p_sub))[-p_folds$splits[[5]]$in_id]\n\n [1] 7 15 21 24 27 30 37 38 47 54 59 60 62 68 71 82" }, { - "objectID": "week3/index.html#main-reference", - "href": "week3/index.html#main-reference", - "title": "Week 3: Re-sampling and regularisation", - "section": "", - "text": "ISLR 5.1, 5.2, 6.2, 6.4" + "objectID": "week3/slides.html#k-fold-cross-validation-34", + "href": "week3/slides.html#k-fold-cross-validation-34", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "k-fold cross validation (3/4)", + "text": "k-fold cross validation (3/4)\n\n\n\nDivide the data set into \\(k\\) different parts.\nRemove one part, fit the model on the remaining \\(k โˆ’ 1\\) parts, and compute the statistic of interest on the omitted part.\nRepeat \\(k\\) times taking out a different part each time\n\n\nFit the model to the \\(k-1\\) set, and compute the statistic on the \\(k\\)-fold, that was not used in the model fit.\nHere we use the accuracy as the statistic of interest.\nValue for fold 1 is:\n\n\n# A tibble: 1 ร— 3\n .metric .estimator .estimate\n <chr> <chr> <dbl>\n1 accuracy 
multiclass 0.889" }, { - "objectID": "week3/index.html#what-you-will-learn-this-week", - "href": "week3/index.html#what-you-will-learn-this-week", - "title": "Week 3: Re-sampling and regularisation", - "section": "What you will learn this week", - "text": "What you will learn this week\n\nCommon re-sampling methods: bootstrap, cross-validation, permutation, simulation.\nCross-validation for checking generalisability of model fit, parameter tuning, variable selection.\nBootstrapping for understanding variance of parameter estimates.\nPermutation to understand significance of associations between variables, and variable importance.\nSimulation can be used to assess what might happen with samples from known distributions.\nWhat can go wrong in high-d, and how to adjust using regularisation methods." + "objectID": "week3/slides.html#k-fold-cross-validation-44", + "href": "week3/slides.html#k-fold-cross-validation-44", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "k-fold cross validation (4/4)", + "text": "k-fold cross validation (4/4)\n\n\n\nDivide the data set into \\(k\\) different parts.\nRemove one part, fit the model on the remaining \\(k โˆ’ 1\\) parts, and compute the statistic of interest on the omitted part.\nRepeat \\(k\\) times taking out a different part each time\n\n\nHere is the accuracy computed for each of the \\(k=5\\) folds. Remember, this means that the model was fitted to the rest of the data, and accuracy was calculate on the observations in this fold.\n\n\n[1] 0.89 0.89 1.00 0.88 1.00\n\n\n\n\nRecommended reading: Alison Hillโ€™s Take a Sad Script & Make it Better: Tidymodels Edition" }, { - "objectID": "week3/index.html#lecture-slides", - "href": "week3/index.html#lecture-slides", - "title": "Week 3: Re-sampling and regularisation", - "section": "Lecture slides", - "text": "Lecture slides\n\nhtml\npdf\nqmd\nR" + "objectID": "week3/slides.html#loocv", + "href": "week3/slides.html#loocv", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "LOOCV", + "text": "LOOCV\n\nLeave-one-out (LOOCV) is a special case of \\(k\\)-fold cross-validation, where \\(k=n\\). There are \\(n\\) CV sets, each with ONE observation left out.\n\nBenefits:\n\nUseful when sample size is very small.\nSome statistics can be calculated algebraically, without having to do computation for each fold." }, { - "objectID": "week3/index.html#tutorial-instructions", - "href": "week3/index.html#tutorial-instructions", - "title": "Week 3: Re-sampling and regularisation", - "section": "Tutorial instructions", - "text": "Tutorial instructions\nInstructions:\n\nhtml\nqmd" + "objectID": "week3/slides.html#where-is-cross-validation-used", + "href": "week3/slides.html#where-is-cross-validation-used", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Where is cross-validation used?", + "text": "Where is cross-validation used?\n\n\nModel evaluation and selection, by estimating the generalisability on future data.\nParameter tuning: finding optimal choice of parameters or control variables, like number of trees or branches, or polynomial terms to generate the best model fit.\nVariable selection: which variables are more or less important for the best model fit. Possibly some variables can be dropped from the model." 
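The cross-validation uses listed above can be illustrated with a bare-bones fold loop. This is a sketch only, not the tidymodels workflow from the unit materials: p_sub2 is a hypothetical stand-in for the penguins subset used in the slides, and the model is an LDA fit from MASS.

library(tidyverse)
library(rsample)
library(MASS)
library(palmerpenguins)

p_sub2 <- penguins |>
  drop_na(bill_length_mm, body_mass_g) |>
  dplyr::select(species, bill_length_mm, body_mass_g)

set.seed(1110)
folds <- vfold_cv(p_sub2, v = 5, strata = species)

# For each fold: fit on the analysis set, assess on the held-out fold
fold_acc <- purrr::map_dbl(folds$splits, function(s) {
  fit <- lda(species ~ ., data = analysis(s))
  pred <- predict(fit, assessment(s))$class
  mean(pred == assessment(s)$species)
})
fold_acc
mean(fold_acc)

Averaging the per-fold accuracies gives the cross-validated estimate used when comparing models or tuning parameters.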
}, { - "objectID": "week3/index.html#assignments", - "href": "week3/index.html#assignments", - "title": "Week 3: Re-sampling and regularisation", - "section": "Assignments", - "text": "Assignments" + "objectID": "week3/slides.html#bootstrap", + "href": "week3/slides.html#bootstrap", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Bootstrap", + "text": "Bootstrap" }, { - "objectID": "week3/index.html#assignments-1", - "href": "week3/index.html#assignments-1", - "title": "Week 3: Re-sampling and regularisation", - "section": "Assignments", - "text": "Assignments\n\nAssignment 1 is due on Friday 22 March." + "objectID": "week3/slides.html#bootstrap-15", + "href": "week3/slides.html#bootstrap-15", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Bootstrap (1/5)", + "text": "Bootstrap (1/5)\nA bootstrap sample is a sample that is the same size as the original data set that is made using replacement. This results in analysis samples that have multiple replicates of some of the original rows of the data. The assessment set is defined as the rows of the original data that were not included in the bootstrap sample, referred to as the out-of-bag (OOB) sample.\n\nset.seed(115)\ndf <- tibble(id = 1:26, \n cl = c(rep(\"A\", 12), rep(\"B\", 14)))\ndf_b <- bootstraps(df, times = 100, strata = cl)\nt(df_b$splits[[1]]$data[df_b$splits[[1]]$in_id,])\n\n [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]\nid \" 1\" \" 2\" \" 2\" \" 2\" \" 2\" \" 5\" \" 6\" \" 7\" \" 9\" \"11\" \"12\" \ncl \"A\" \"A\" \"A\" \"A\" \"A\" \"A\" \"A\" \"A\" \"A\" \"A\" \"A\" \n [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20]\nid \"12\" \"14\" \"14\" \"18\" \"18\" \"18\" \"18\" \"18\" \"19\" \ncl \"A\" \"B\" \"B\" \"B\" \"B\" \"B\" \"B\" \"B\" \"B\" \n [,21] [,22] [,23] [,24] [,25] [,26]\nid \"21\" \"21\" \"21\" \"22\" \"25\" \"25\" \ncl \"B\" \"B\" \"B\" \"B\" \"B\" \"B\" \n\n\nWhich observations are out-of-bag in bootstrap sample 1?" }, { - "objectID": "week3/tutorialsol.html", - "href": "week3/tutorialsol.html", - "title": "ETC3250/5250 Tutorial 3", - "section": "", - "text": "Load the libraries and avoid conflicts\n# Load libraries used everywhere\nlibrary(tidyverse)\nlibrary(tidymodels)\nlibrary(conflicted)\nlibrary(colorspace)\nlibrary(patchwork)\nlibrary(MASS)\nlibrary(randomForest)\nlibrary(gridExtra)\nlibrary(GGally)\nlibrary(geozoo)\nlibrary(mulgar)\nconflicts_prefer(dplyr::filter)\nconflicts_prefer(dplyr::select)\nconflicts_prefer(dplyr::slice)\nconflicts_prefer(palmerpenguins::penguins)\nconflicts_prefer(tourr::flea)" + "objectID": "week3/slides.html#bootstrap-25", + "href": "week3/slides.html#bootstrap-25", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Bootstrap (2/5)", + "text": "Bootstrap (2/5)\n\n\nBootstrap is preferable to cross-validation when the sample size is small, or if the structure in the data being modelled is complex.\nIt is commonly used for estimating the variance of parameter estimates, especially when the data is non-normal." }, { - "objectID": "week3/tutorialsol.html#objectives", - "href": "week3/tutorialsol.html#objectives", - "title": "ETC3250/5250 Tutorial 3", - "section": "๐ŸŽฏ Objectives", - "text": "๐ŸŽฏ Objectives\nThe goal for this week is for you to learn and practice visualising high-dimensional data." 
+ "objectID": "week3/slides.html#bootstrap-35", + "href": "week3/slides.html#bootstrap-35", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Bootstrap (3/5)", + "text": "Bootstrap (3/5)\nIn dimension reduction it can be used to assess if the coefficients of a PC (the eigenvectors) are significantly different from ZERO. The 95% bootstrap confidence intervals can be computed by:\n\nGenerating B bootstrap samples of the data\nCompute PCA, record the loadings\nRe-orient the loadings, by choosing one variable with large coefficient to be the direction base\nIf B=1000, 25th and 975th sorted values yields the lower and upper bounds for confidence interval for each PC." }, { - "objectID": "week3/tutorialsol.html#preparation", - "href": "week3/tutorialsol.html#preparation", - "title": "ETC3250/5250 Tutorial 3", - "section": "๐Ÿ”ง Preparation", - "text": "๐Ÿ”ง Preparation\n\nComplete the quiz\nDo the reading related to week 2" + "objectID": "week3/slides.html#bootstrap-45", + "href": "week3/slides.html#bootstrap-45", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Bootstrap (4/5)", + "text": "Bootstrap (4/5)\nAssessing the loadings for PC 2 of PCA on the womens track data. Remember the summary: \n\n\nStandard deviations (1, .., p=7):\n[1] 2.41 0.81 0.55 0.35 0.23 0.20 0.15\n\nRotation (n x k) = (7 x 7):\n PC1 PC2 PC3 PC4 PC5 PC6 PC7\nm100 0.37 0.49 -0.286 0.319 0.231 0.6198 0.052\nm200 0.37 0.54 -0.230 -0.083 0.041 -0.7108 -0.109\nm400 0.38 0.25 0.515 -0.347 -0.572 0.1909 0.208\nm800 0.38 -0.16 0.585 -0.042 0.620 -0.0191 -0.315\nm1500 0.39 -0.36 0.013 0.430 0.030 -0.2312 0.693\nm3000 0.39 -0.35 -0.153 0.363 -0.463 0.0093 -0.598\nmarathon 0.37 -0.37 -0.484 -0.672 0.131 0.1423 0.070\n\n\n Should we consider m800, m400 contributing to PC2 or not?" }, { - "objectID": "week3/tutorialsol.html#exercises", - "href": "week3/tutorialsol.html#exercises", - "title": "ETC3250/5250 Tutorial 3", - "section": "Exercises:", - "text": "Exercises:\nOpen your project for this unit called iml.Rproj.\n\n1. The sparseness of high dimensions\nRandomly generate data points that are uniformly distributed in a hyper-cube of 3, 5 and 10 dimensions, with 500 points in each sample, using the cube.solid.random function of the geozoo package. What differences do we expect to see? Now visualise each set in a grand tour and describe how they differ, and whether this matched your expectations?\nThe code to generate and view the cubes is:\n\n\nCode to generate the data and show in a tour\nlibrary(tourr)\nlibrary(geozoo)\nset.seed(1234)\ncube3 <- cube.solid.random(3, 500)$points\ncube5 <- cube.solid.random(5, 500)$points\ncube10 <- cube.solid.random(10, 500)$points\n\nanimate_xy(cube3, axes=\"bottomleft\")\nanimate_xy(cube5, axes=\"bottomleft\")\nanimate_xy(cube10, axes=\"bottomleft\")\n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\nEach of the projections has a boxy shape, which gets less distinct as the dimension increases.\nAs the dimension increases, the points tend to concentrate in the centre of the plot window, with a smattering of points in the edges.\n\n\n\n\n\n\n2. Detecting clusters\nFor the data sets, c1, c3 from the mulgar package, use the grand tour to view and try to identify structure (outliers, clusters, non-linear relationships).\n\n\nCode to show in a tour\nanimate_xy(c1)\nanimate_xy(c3)\n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\nThe first data set c1 has 6 clusters, 4 small ones, and two big ones. 
The two big ones look like planes because they have no variation in some dimensions.\nThe second data set c3 has a triangular prism shape, which itself is divided into several smaller triangular prisms. It also has several dimensions with no variation, because the points collapse into a line in some projections.\n\n\n\n\n\n\n3. Effect of covariance\nExamine 5D multivariate normal samples drawn from populations with a range of variance-covariance matrices. (You can use the mvtnorm package to do the sampling, for example.) Examine the data using a grand tour. What changes when you change the correlation from close to zero to close to 1? Can you see a difference between strong positive correlation and strong negative correlation?\n\n\nCode to generate the samples\nlibrary(mvtnorm)\nset.seed(501)\n\ns1 <- diag(5)\ns2 <- diag(5)\ns2[3,4] <- 0.7\ns2[4,3] <- 0.7\ns3 <- s2\ns3[1,2] <- -0.7\ns3[2,1] <- -0.7\n\ns1\n\n\n [,1] [,2] [,3] [,4] [,5]\n[1,] 1 0 0 0 0\n[2,] 0 1 0 0 0\n[3,] 0 0 1 0 0\n[4,] 0 0 0 1 0\n[5,] 0 0 0 0 1\n\n\nCode to generate the samples\ns2\n\n\n [,1] [,2] [,3] [,4] [,5]\n[1,] 1 0 0.0 0.0 0\n[2,] 0 1 0.0 0.0 0\n[3,] 0 0 1.0 0.7 0\n[4,] 0 0 0.7 1.0 0\n[5,] 0 0 0.0 0.0 1\n\n\nCode to generate the samples\ns3\n\n\n [,1] [,2] [,3] [,4] [,5]\n[1,] 1.0 -0.7 0.0 0.0 0\n[2,] -0.7 1.0 0.0 0.0 0\n[3,] 0.0 0.0 1.0 0.7 0\n[4,] 0.0 0.0 0.7 1.0 0\n[5,] 0.0 0.0 0.0 0.0 1\n\n\nCode to generate the samples\nset.seed(1234)\nd1 <- as.data.frame(rmvnorm(500, sigma = s1))\nd2 <- as.data.frame(rmvnorm(500, sigma = s2))\nd3 <- as.data.frame(rmvnorm(500, sigma = s3))\n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\nanimate_xy(d1)\nanimate_xy(d2)\nanimate_xy(d3)\n\nThe points in data d1 are pretty spread in every projection. For the data d2, d3 have some projections where the data is concentrated along a line. This should be seen to be when variables 3 and 4 are contributing to the projection in d2, and when variables 1, 2, 3, 4 contributing to the projection in d3.\n\n\n\n\n\n\n4. Principal components analysis on the simulated data\n๐Ÿง For data sets d2 and d3 what would you expect would be the number of PCs suggested by PCA?\n๐Ÿ‘จ๐Ÿฝโ€๐Ÿ’ป๐Ÿ‘ฉโ€๐Ÿ’ปConduct the PCA. Report the variances (eigenvalues), and cumulative proportions of total variance, make a scree plot, and the PC coefficients.\n๐ŸคฏOften, the selected number of PCs are used in future work. For both d3 and d4, think about the pros and cons of using 4 PCs and 3 PCs, respectively.\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\nThinking about it: In d2 there is strong correlation between variables 3 and 4, which means probably only 4PC s would be needed. In d3 there is strong correlation also between variables 1 and 2 which would mean that only 3 PCs would be needed.\n\nd2_pca <- prcomp(d2, scale=TRUE)\nd2_pca\n\nStandard deviations (1, .., p=5):\n[1] 1.2944925 1.0120246 0.9995775 0.9840652 0.5766767\n\nRotation (n x k) = (5 x 5):\n PC1 PC2 PC3 PC4 PC5\nV1 0.009051897 0.60982755 0.600760775 0.51637067 -0.02182300\nV2 0.042039564 0.44070702 -0.798335151 0.40808929 0.01158053\nV3 0.702909484 0.03224989 0.034228444 -0.06034512 0.70715280\nV4 0.702411571 0.03021836 0.002269932 -0.08050218 -0.70655437\nV5 0.103377852 -0.65721722 0.023890154 0.74612487 -0.01027051\n\nd2_pca$sdev^2/5\n\n[1] 0.33514216 0.20483875 0.19983102 0.19367686 0.06651121\n\nmulgar::ggscree(d2_pca, q=5)\n\n\n\n\n\n\n\n\nFour PCs explain 93% of the variation. 
PC1 is the combination of variables 3 and 4, which captures this reduced dimension.\n\nd3_pca <- prcomp(d3, scale=TRUE)\nd3_pca\n\nStandard deviations (1, .., p=5):\n[1] 1.3262816 1.2831152 0.9984103 0.5561311 0.5371102\n\nRotation (n x k) = (5 x 5):\n PC1 PC2 PC3 PC4 PC5\nV1 0.47372917 0.52551030 0.007091154 -0.55745578 0.434295265\nV2 -0.49362867 -0.50367594 -0.047544823 -0.58444458 0.398503844\nV3 -0.50057768 0.49960926 0.030888892 -0.40488840 -0.578726039\nV4 -0.52968729 0.46318477 0.073441704 0.42649507 0.563559684\nV5 0.02765464 -0.07745919 0.995661287 -0.04283613 -0.007678753\n\nd3_pca$sdev^2/5\n\n[1] 0.35180458 0.32927695 0.19936462 0.06185637 0.05769748\n\nmulgar::ggscree(d3_pca, q=5)\n\n\n\n\n\n\n\n\nThree PCs explain 88% of the variation, and the last two PCs have much smaller variance than the others. PC 1 and 2 are combinations of variables 1, 2, 3 and 4, which captures this reduced dimension, and PC 3 is primarily variable 5.\nThe PCs are awkward combinations of the original variables. For d2, it would make sense to use PC1 (or equivalently and equal combination of V3, V4), and then keep the original variables V1, V2, V5.\nFor d3 itโ€™s harder to make this call because the first two PCs are combinations of four variables. Its hard to see from this that the ideal solution would be to use an equal combination of V1, V2, and equal combination of V3, V4 and V5 on its own.\nOften understanding the variance that is explained by the PCs is hard to interpret.\n\n\n\n\n\n\n5. PCA on cross-currency time series\nThe rates.csv data has 152 currencies relative to the USD for the period of Nov 1, 2019 through to Mar 31, 2020. Treating the dates as variables, conduct a PCA to examine how the cross-currencies vary, focusing on this subset: ARS, AUD, BRL, CAD, CHF, CNY, EUR, FJD, GBP, IDR, INR, ISK, JPY, KRW, KZT, MXN, MYR, NZD, QAR, RUB, SEK, SGD, UYU, ZAR.\n\nrates <- read_csv(\"https://raw.githubusercontent.com/numbats/iml/master/data/rates_Nov19_Mar20.csv\") |>\n select(date, ARS, AUD, BRL, CAD, CHF, CNY, EUR, FJD, GBP, IDR, INR, ISK, JPY, KRW, KZT, MXN, MYR, NZD, QAR, RUB, SEK, SGD, UYU, ZAR)\n\n\nStandardise the currency columns to each have mean 0 and variance 1. Explain why this is necessary prior to doing the PCA or is it? Use this data to make a time series plot overlaying all of the cross-currencies.\n\n\n\nCode to standardise currencies\nlibrary(plotly)\nrates_std <- rates |>\n mutate_if(is.numeric, function(x) (x-mean(x))/sd(x))\nrownames(rates_std) <- rates_std$date\np <- rates_std |>\n pivot_longer(cols=ARS:ZAR, \n names_to = \"currency\", \n values_to = \"rate\") |>\n ggplot(aes(x=date, y=rate, \n group=currency, label=currency)) +\n geom_line() \nggplotly(p, width=400, height=300)\n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\n\n\n\n\n\nIt isnโ€™t necessary to standardise the variables before using the prcomp function because we can set scale=TRUE to have it done as part of the PCA computation. However, it is useful to standardise the variables to make the time series plot where all the currencies are drawn. This is useful for interpreting the principal components.\n\n\n\n\n\nConduct a PCA. Make a scree plot, and summarise proportion of the total variance. 
Summarise these values and the coefficients for the first five PCs, nicely.\n\n\n\nCode to do PCA and screeplot\nrates_pca <- prcomp(rates_std[,-1], scale=FALSE)\nmulgar::ggscree(rates_pca, q=24)\noptions(digits=2)\nsummary(rates_pca)\n\n\n\n\nCode to make a nice summary\n# Summarise the coefficients nicely\nrates_pca_smry <- tibble(evl=rates_pca$sdev^2) |>\n mutate(p = evl/sum(evl), \n cum_p = cumsum(evl/sum(evl))) |> \n t() |>\n as.data.frame()\ncolnames(rates_pca_smry) <- colnames(rates_pca$rotation)\nrates_pca_smry <- bind_rows(as.data.frame(rates_pca$rotation),\n rates_pca_smry)\nrownames(rates_pca_smry) <- c(rownames(rates_pca$rotation),\n \"Variance\", \"Proportion\", \n \"Cum. prop\")\nrates_pca_smry[,1:5]\n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nImportance of components:\n PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8\nStandard deviation 4.193 1.679 1.0932 0.9531 0.7358 0.5460 0.38600 0.33484\nProportion of Variance 0.733 0.118 0.0498 0.0379 0.0226 0.0124 0.00621 0.00467\nCumulative Proportion 0.733 0.850 0.8999 0.9377 0.9603 0.9727 0.97893 0.98360\n PC9 PC10 PC11 PC12 PC13 PC14 PC15\nStandard deviation 0.30254 0.25669 0.25391 0.17893 0.16189 0.15184 0.14260\nProportion of Variance 0.00381 0.00275 0.00269 0.00133 0.00109 0.00096 0.00085\nCumulative Proportion 0.98741 0.99016 0.99284 0.99418 0.99527 0.99623 0.99708\n PC16 PC17 PC18 PC19 PC20 PC21 PC22\nStandard deviation 0.11649 0.10691 0.09923 0.09519 0.08928 0.07987 0.07222\nProportion of Variance 0.00057 0.00048 0.00041 0.00038 0.00033 0.00027 0.00022\nCumulative Proportion 0.99764 0.99812 0.99853 0.99891 0.99924 0.99950 0.99972\n PC23 PC24\nStandard deviation 0.05985 0.05588\nProportion of Variance 0.00015 0.00013\nCumulative Proportion 0.99987 1.00000\n\n\n\n\n PC1 PC2 PC3 PC4 PC5\nARS 0.215 -0.121 0.19832 0.181 -0.2010\nAUD 0.234 0.013 0.11466 0.018 0.0346\nBRL 0.229 -0.108 0.10513 0.093 -0.0526\nCAD 0.235 -0.025 -0.02659 -0.037 0.0337\nCHF -0.065 0.505 -0.33521 -0.188 -0.0047\nCNY 0.144 0.237 -0.45337 -0.238 -0.5131\nEUR 0.088 0.495 0.24474 0.245 -0.1416\nFJD 0.234 0.055 0.04470 0.028 0.0330\nGBP 0.219 0.116 -0.00915 -0.073 0.3059\nIDR 0.218 -0.022 -0.24905 -0.117 0.2362\nINR 0.223 -0.147 -0.00734 -0.014 0.0279\nISK 0.230 -0.016 0.10979 0.093 0.1295\nJPY -0.022 0.515 0.14722 0.234 0.3388\nKRW 0.214 0.063 0.17488 0.059 -0.3404\nKZT 0.217 0.013 -0.23244 -0.119 0.3304\nMXN 0.229 -0.059 -0.13804 -0.102 0.2048\nMYR 0.227 0.040 -0.13970 -0.115 -0.2009\nNZD 0.230 0.061 0.04289 -0.056 -0.0354\nQAR -0.013 0.111 0.55283 -0.807 0.0078\nRUB 0.233 -0.102 -0.05863 -0.042 0.0063\nSEK 0.205 0.240 0.07570 0.085 0.0982\nSGD 0.227 0.057 0.14225 0.115 -0.2424\nUYU 0.231 -0.101 0.00064 -0.053 0.0957\nZAR 0.232 -0.070 -0.00328 0.042 -0.0443\nVariance 17.582 2.820 1.19502 0.908 0.5413\nProportion 0.733 0.118 0.04979 0.038 0.0226\nCum. prop 0.733 0.850 0.89989 0.938 0.9603\n\n\n\nThe first two principal components explain 85% of the total variation.\nPC1 is a combination of all of the currencies except for CHF, EUR, JPY, QAR.\nPC2 is a combination of CHF, EUR, JPY.\n\n\n\n\n\n\nMake a biplot of the first two PCs. Explain what you learn.\n\n\n\nBiplot code\nlibrary(ggfortify)\nautoplot(rates_pca, loadings = TRUE, \n loadings.label = TRUE) \n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nMost of the currencies contribute substantially to PC1. Only three contribute strongly to PC2: CHF, JPY, EUR. Similar to what is learned from the summary table (made in b).\nThe pattern of the points is most unusual! It has a curious S shape. 
Principal components are supposed to be a random scattering of values, with no obvious structure. This is a very strong pattern.\n\n\n\n\n\n\nMake a time series plot of PC1 and PC2. Explain why this is useful to do for this data.\n\n\n\nCode to plot PCs\nrates_pca$x |>\n as.data.frame() |>\n mutate(date = rates_std$date) |>\n ggplot(aes(x=date, y=PC1)) + geom_line()\n\nrates_pca$x |>\n as.data.frame() |>\n mutate(date = rates_std$date) |>\n ggplot(aes(x=date, y=PC2)) + geom_line()\n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nBecause there is a strong pattern in the first two PCs, it could be useful to understand if this is related to the temporal context of the data.\nHere we might expect that the PCs extract the main temporal patterns. We see this is the case.\nPC1 reflects the large group of currencies that greatly increase in mid-March.\nPC2 reflects the few currencies that decrease at the start of March.\n\nNote that: increase here means that the value of the currency declines relative to the USD and a decrease indicates stronger relative to the USD. Is this correct?\n\n\n\n\n\nYouโ€™ll want to drill down deeper to understand what the PCA tells us about the movement of the various currencies, relative to the USD, over the volatile period of the COVID pandemic. Plot the first two PCs again, but connect the dots in order of time. Make it interactive with plotly, where the dates are the labels. What does following the dates tell us about the variation captured in the first two principal components?\n\n\n\nCode to use interaction of the PC plot\nlibrary(plotly)\np2 <- rates_pca$x |>\n as.data.frame() |>\n mutate(date = rates_std$date) |>\n ggplot(aes(x=PC1, y=PC2, label=date)) +\n geom_point() +\n geom_path()\nggplotly(p2, width=400, height=400)\n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\n\n\n\n\n\nThe pattern in PC1 vs PC2 follows time. Prior to the pandemic there is a tangle of values on the left. Towards the end of February, when the world was starting to realise that COVID was a major health threat, there is a dramatic reaction from the world currencies, at least in relation to the USD. Currencies such as EUR, JPY, CHF reacted first, gaining strength relative to USD, and then they lost that strength. Most other currencies reacted later, losing value relative to the USD.\n\n\n\n\n\n\n6. Write a simple question about the weekโ€™s material and test your neighbour, or your tutor." + "objectID": "week3/slides.html#bootstrap-55", + "href": "week3/slides.html#bootstrap-55", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Bootstrap (5/5)", + "text": "Bootstrap (5/5)\n\n\nWe said that PC2 is a contrast between short distance events and long distance events, particularly 100m, 200m vs 1500m, 3000m, marathon. 
How reliably can we state this?\n\n\nCode\nlibrary(boot)\ncompute_PC2 <- function(data, index) {\n pc2 <- prcomp(data[index,], center=TRUE, scale=TRUE)$rotation[,2]\n # Coordinate signs: make m100 always positive\n if (sign(pc2[1]) < 0) \n pc2 <- -pc2 \n return(pc2)\n}\n# Make sure sign of first PC element is positive\nset.seed(201)\nPC2_boot <- boot(data=track[,1:7], compute_PC2, R=1000)\ncolnames(PC2_boot$t) <- colnames(track[,1:7])\nPC2_boot_ci <- as_tibble(PC2_boot$t) %>%\n gather(var, coef) %>% \n mutate(var = factor(var, levels=c(\"m100\", \"m200\", \"m400\", \"m800\", \"m1500\", \"m3000\", \"marathon\"))) %>%\n group_by(var) %>%\n summarise(q2.5 = quantile(coef, 0.025), \n q5 = median(coef),\n q97.5 = quantile(coef, 0.975)) %>%\n mutate(t0 = PC2_boot$t0) \npb <- ggplot(PC2_boot_ci, aes(x=var, y=t0)) + \n geom_hline(yintercept=0, linetype=2, colour=\"red\") +\n geom_point() +\n geom_errorbar(aes(ymin=q2.5, ymax=q97.5), width=0.1) +\n xlab(\"\") + ylab(\"coefficient\") \n\n\nConfidence intervals for m400 and m800 cross ZERO, hence zero is a plausible value for the population coefficient corresponding to this estimate." }, { - "objectID": "week3/tutorialsol.html#finishing-up", - "href": "week3/tutorialsol.html#finishing-up", - "title": "ETC3250/5250 Tutorial 3", - "section": "๐Ÿ‘‹ Finishing up", - "text": "๐Ÿ‘‹ Finishing up\nMake sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult." + "objectID": "week3/slides.html#permutation", + "href": "week3/slides.html#permutation", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Permutation", + "text": "Permutation" }, { - "objectID": "week3/tutorial.html", - "href": "week3/tutorial.html", - "title": "ETC3250/5250 Tutorial 3", - "section": "", - "text": "Load the libraries and avoid conflicts\n# Load libraries used everywhere\nlibrary(tidyverse)\nlibrary(tidymodels)\nlibrary(conflicted)\nlibrary(colorspace)\nlibrary(patchwork)\nlibrary(MASS)\nlibrary(randomForest)\nlibrary(gridExtra)\nlibrary(GGally)\nlibrary(geozoo)\nlibrary(mulgar)\nconflicts_prefer(dplyr::filter)\nconflicts_prefer(dplyr::select)\nconflicts_prefer(dplyr::slice)\nconflicts_prefer(palmerpenguins::penguins)\nconflicts_prefer(tourr::flea)" + "objectID": "week3/slides.html#permutation-13", + "href": "week3/slides.html#permutation-13", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Permutation (1/3)", + "text": "Permutation (1/3)\n\n\nPermutation breaks relationships, and is often used for conducting statistical hypothesis tests, without requiring too many assumptions.\n\nDATA\n\n\n# A tibble: 10 ร— 2\n x cl \n <dbl> <chr>\n 1 0.281 A \n 2 0.330 A \n 3 0.708 A \n 4 0.463 A \n 5 3.37 A \n 6 0.528 B \n 7 0.852 B \n 8 5.58 B \n 9 0.685 B \n10 3.28 B \n\n\n\n\nPERMUTE cl\n\n\n# A tibble: 10 ร— 2\n x cl \n <dbl> <chr>\n 1 0.281 A \n 2 0.330 B \n 3 0.708 A \n 4 0.463 B \n 5 3.37 B \n 6 0.528 A \n 7 0.852 B \n 8 5.58 B \n 9 0.685 A \n10 3.28 A \n\n\n\n\nIs there a difference in the medians of the groups?" }, { - "objectID": "week3/tutorial.html#objectives", - "href": "week3/tutorial.html#objectives", - "title": "ETC3250/5250 Tutorial 3", - "section": "๐ŸŽฏ Objectives", - "text": "๐ŸŽฏ Objectives\nThe goal for this week is for you to learn and practice visualising high-dimensional data." 
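Alongside the boot package code above, the same resampling idea can be sketched with rsample::bootstraps(), which appeared in the earlier bootstrap slides. The toy sample and the statistic (a mean) are hypothetical, chosen only to keep the example short.

library(tidyverse)
library(rsample)

set.seed(424)
x_df <- tibble(x = rexp(40, rate = 1))   # small, skewed toy sample

boots <- bootstraps(x_df, times = 1000)
boot_means <- purrr::map_dbl(boots$splits, ~ mean(analysis(.x)$x))

# 95% percentile bootstrap interval for the mean
quantile(boot_means, c(0.025, 0.975))

If such an interval for a coefficient crosses zero, as for the m400 and m800 loadings above, zero remains a plausible population value.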
+ "objectID": "week3/slides.html#permutation-23", + "href": "week3/slides.html#permutation-23", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Permutation (2/3)", + "text": "Permutation (2/3)\n\n\nIs there a difference in the medians of the groups?\n\n\n\n\n\n\n\n\n\n\nGenerate \\(k\\) permutation samples, compute the medians for each, and compare the difference with original." }, { - "objectID": "week3/tutorial.html#preparation", - "href": "week3/tutorial.html#preparation", - "title": "ETC3250/5250 Tutorial 3", - "section": "๐Ÿ”ง Preparation", - "text": "๐Ÿ”ง Preparation\n\nComplete the quiz\nDo the reading related to week 2" + "objectID": "week3/slides.html#permutation-33", + "href": "week3/slides.html#permutation-33", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Permutation (3/3)", + "text": "Permutation (3/3)\n\nCaution: permuting small numbers, especially classes may return very similar samples to the original data.\nStay tuned for random forest models, where permutation is used to help assess the importance of all the variables." }, { - "objectID": "week3/tutorial.html#exercises", - "href": "week3/tutorial.html#exercises", - "title": "ETC3250/5250 Tutorial 3", - "section": "Exercises:", - "text": "Exercises:\nOpen your project for this unit called iml.Rproj.\n\n1. The sparseness of high dimensions\nRandomly generate data points that are uniformly distributed in a hyper-cube of 3, 5 and 10 dimensions, with 500 points in each sample, using the cube.solid.random function of the geozoo package. What differences do we expect to see? Now visualise each set in a grand tour and describe how they differ, and whether this matched your expectations?\nThe code to generate and view the cubes is:\n\n\nCode to generate the data and show in a tour\nlibrary(tourr)\nlibrary(geozoo)\nset.seed(1234)\ncube3 <- cube.solid.random(3, 500)$points\ncube5 <- cube.solid.random(5, 500)$points\ncube10 <- cube.solid.random(10, 500)$points\n\nanimate_xy(cube3, axes=\"bottomleft\")\nanimate_xy(cube5, axes=\"bottomleft\")\nanimate_xy(cube10, axes=\"bottomleft\")\n\n\n\n\n2. Detecting clusters\nFor the data sets, c1, c3 from the mulgar package, use the grand tour to view and try to identify structure (outliers, clusters, non-linear relationships).\n\n\nCode to show in a tour\nanimate_xy(c1)\nanimate_xy(c3)\n\n\n\n\n3. Effect of covariance\nExamine 5D multivariate normal samples drawn from populations with a range of variance-covariance matrices. (You can use the mvtnorm package to do the sampling, for example.) Examine the data using a grand tour. What changes when you change the correlation from close to zero to close to 1? 
Can you see a difference between strong positive correlation and strong negative correlation?\n\n\nCode to generate the samples\nlibrary(mvtnorm)\nset.seed(501)\n\ns1 <- diag(5)\ns2 <- diag(5)\ns2[3,4] <- 0.7\ns2[4,3] <- 0.7\ns3 <- s2\ns3[1,2] <- -0.7\ns3[2,1] <- -0.7\n\ns1\n\n\n [,1] [,2] [,3] [,4] [,5]\n[1,] 1 0 0 0 0\n[2,] 0 1 0 0 0\n[3,] 0 0 1 0 0\n[4,] 0 0 0 1 0\n[5,] 0 0 0 0 1\n\n\nCode to generate the samples\ns2\n\n\n [,1] [,2] [,3] [,4] [,5]\n[1,] 1 0 0.0 0.0 0\n[2,] 0 1 0.0 0.0 0\n[3,] 0 0 1.0 0.7 0\n[4,] 0 0 0.7 1.0 0\n[5,] 0 0 0.0 0.0 1\n\n\nCode to generate the samples\ns3\n\n\n [,1] [,2] [,3] [,4] [,5]\n[1,] 1.0 -0.7 0.0 0.0 0\n[2,] -0.7 1.0 0.0 0.0 0\n[3,] 0.0 0.0 1.0 0.7 0\n[4,] 0.0 0.0 0.7 1.0 0\n[5,] 0.0 0.0 0.0 0.0 1\n\n\nCode to generate the samples\nset.seed(1234)\nd1 <- as.data.frame(rmvnorm(500, sigma = s1))\nd2 <- as.data.frame(rmvnorm(500, sigma = s2))\nd3 <- as.data.frame(rmvnorm(500, sigma = s3))\n\n\n\n\n4. Principal components analysis on the simulated data\n๐Ÿง For data sets d2 and d3 what would you expect would be the number of PCs suggested by PCA?\n๐Ÿ‘จ๐Ÿฝโ€๐Ÿ’ป๐Ÿ‘ฉโ€๐Ÿ’ปConduct the PCA. Report the variances (eigenvalues), and cumulative proportions of total variance, make a scree plot, and the PC coefficients.\n๐ŸคฏOften, the selected number of PCs are used in future work. For both d3 and d4, think about the pros and cons of using 4 PCs and 3 PCs, respectively.\n\n\n5. PCA on cross-currency time series\nThe rates.csv data has 152 currencies relative to the USD for the period of Nov 1, 2019 through to Mar 31, 2020. Treating the dates as variables, conduct a PCA to examine how the cross-currencies vary, focusing on this subset: ARS, AUD, BRL, CAD, CHF, CNY, EUR, FJD, GBP, IDR, INR, ISK, JPY, KRW, KZT, MXN, MYR, NZD, QAR, RUB, SEK, SGD, UYU, ZAR.\n\nrates <- read_csv(\"https://raw.githubusercontent.com/numbats/iml/master/data/rates_Nov19_Mar20.csv\") |>\n select(date, ARS, AUD, BRL, CAD, CHF, CNY, EUR, FJD, GBP, IDR, INR, ISK, JPY, KRW, KZT, MXN, MYR, NZD, QAR, RUB, SEK, SGD, UYU, ZAR)\n\n\nStandardise the currency columns to each have mean 0 and variance 1. Explain why this is necessary prior to doing the PCA or is it? Use this data to make a time series plot overlaying all of the cross-currencies.\n\n\n\nCode to standardise currencies\nlibrary(plotly)\nrates_std <- rates |>\n mutate_if(is.numeric, function(x) (x-mean(x))/sd(x))\nrownames(rates_std) <- rates_std$date\np <- rates_std |>\n pivot_longer(cols=ARS:ZAR, \n names_to = \"currency\", \n values_to = \"rate\") |>\n ggplot(aes(x=date, y=rate, \n group=currency, label=currency)) +\n geom_line() \nggplotly(p, width=400, height=300)\n\n\n\nConduct a PCA. Make a scree plot, and summarise proportion of the total variance. Summarise these values and the coefficients for the first five PCs, nicely.\n\n\n\nCode to do PCA and screeplot\nrates_pca <- prcomp(rates_std[,-1], scale=FALSE)\nmulgar::ggscree(rates_pca, q=24)\noptions(digits=2)\nsummary(rates_pca)\n\n\n\n\nCode to make a nice summary\n# Summarise the coefficients nicely\nrates_pca_smry <- tibble(evl=rates_pca$sdev^2) |>\n mutate(p = evl/sum(evl), \n cum_p = cumsum(evl/sum(evl))) |> \n t() |>\n as.data.frame()\ncolnames(rates_pca_smry) <- colnames(rates_pca$rotation)\nrates_pca_smry <- bind_rows(as.data.frame(rates_pca$rotation),\n rates_pca_smry)\nrownames(rates_pca_smry) <- c(rownames(rates_pca$rotation),\n \"Variance\", \"Proportion\", \n \"Cum. prop\")\nrates_pca_smry[,1:5]\n\n\n\nMake a biplot of the first two PCs. 
Explain what you learn.\n\n\n\nBiplot code\nlibrary(ggfortify)\nautoplot(rates_pca, loadings = TRUE, \n loadings.label = TRUE) \n\n\n\nMake a time series plot of PC1 and PC2. Explain why this is useful to do for this data.\n\n\n\nCode to plot PCs\nrates_pca$x |>\n as.data.frame() |>\n mutate(date = rates_std$date) |>\n ggplot(aes(x=date, y=PC1)) + geom_line()\n\nrates_pca$x |>\n as.data.frame() |>\n mutate(date = rates_std$date) |>\n ggplot(aes(x=date, y=PC2)) + geom_line()\n\n\n\nYouโ€™ll want to drill down deeper to understand what the PCA tells us about the movement of the various currencies, relative to the USD, over the volatile period of the COVID pandemic. Plot the first two PCs again, but connect the dots in order of time. Make it interactive with plotly, where the dates are the labels. What does following the dates tell us about the variation captured in the first two principal components?\n\n\n\nCode to use interaction of the PC plot\nlibrary(plotly)\np2 <- rates_pca$x |>\n as.data.frame() |>\n mutate(date = rates_std$date) |>\n ggplot(aes(x=PC1, y=PC2, label=date)) +\n geom_point() +\n geom_path()\nggplotly(p2, width=400, height=400)\n\n\n\n\n6. Write a simple question about the weekโ€™s material and test your neighbour, or your tutor." + "objectID": "week3/slides.html#simulation", + "href": "week3/slides.html#simulation", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Simulation", + "text": "Simulation" }, { - "objectID": "week3/tutorial.html#finishing-up", - "href": "week3/tutorial.html#finishing-up", - "title": "ETC3250/5250 Tutorial 3", - "section": "๐Ÿ‘‹ Finishing up", - "text": "๐Ÿ‘‹ Finishing up\nMake sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult." + "objectID": "week3/slides.html#simulation-12", + "href": "week3/slides.html#simulation-12", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Simulation (1/2)", + "text": "Simulation (1/2)\nSimulation from known statistical distributions allows us to check data and calculations against what is known is controlled conditions.\nFor example, how likely is it to see the extreme a value if my data is a sample from a normal distribution?" }, { - "objectID": "week4/slides.html#overview", - "href": "week4/slides.html#overview", + "objectID": "week3/slides.html#simulation-22", + "href": "week3/slides.html#simulation-22", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Overview", - "text": "Overview\nWe will cover:\n\nFitting a categorical response using logistic curves\nMultivariate summary statistics\nLinear discriminant analysis, assuming samples are elliptically shaped and equal in size\nQuadratic discriminant analysis, assuming samples are elliptically shaped and different in size\nDiscriminant space: making a low-dimensional visual summary" + "section": "Simulation (2/2)", + "text": "Simulation (2/2)\n\n\n\n\n\n\n\n\n\n\n\n\nGrey line is a guide line, computed by doing PCA on 100 samples from a standard \\(p\\)-dimensional normal distribution.\nThat is a comparison of the correlation matrix of the track data with a correlation matrix that is the identity matrix, where there is no association between variables.\nThe largest variance we expect is under 2. The observed variance for PC 1 is much higher. Much larger than expected, very important for capturing the variability in the data!\nWhy is there a difference in variance, when there is no difference in variance?" 
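A minimal sketch of the simulation idea discussed in the nearby slides: how often would a standard normal sample of a given size contain a value at least as extreme as one we observed? The sample size and observed value below are hypothetical.

set.seed(509)
n <- 55          # hypothetical sample size
obs_max <- 3.4   # hypothetical observed extreme value

sims <- replicate(1000, max(abs(rnorm(n))))
mean(sims >= obs_max)   # proportion of null samples at least this extreme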
}, { - "objectID": "week4/slides.html#logistic-regression", - "href": "week4/slides.html#logistic-regression", + "objectID": "week3/slides.html#what-can-go-wrong-in-high-dimensions", + "href": "week3/slides.html#what-can-go-wrong-in-high-dimensions", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Logistic regression", - "text": "Logistic regression" + "section": "What can go wrong in high-dimensions", + "text": "What can go wrong in high-dimensions" }, { - "objectID": "week4/slides.html#when-linear-regression-is-not-appropriate", - "href": "week4/slides.html#when-linear-regression-is-not-appropriate", + "objectID": "week3/slides.html#space-is-huge", + "href": "week3/slides.html#space-is-huge", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "When linear regression is not appropriate", - "text": "When linear regression is not appropriate\n\n\n Consider the following data Default in the ISLR R package (textbook) which looks at the default status based on credit balance.\n\nlibrary(ISLR)\ndata(Default)\nsimcredit <- Default |>\n mutate(default_bin = ifelse(default==\"Yes\", 1, 0))\n\n Why is a linear model less than ideal for this data?" + "section": "Space is huge!", + "text": "Space is huge!\n\n\n\nset.seed(357)\nmy_sparse_data <- tibble(cl = c(rep(\"A\", 12), \n rep(\"B\", 9)),\n x1 = rnorm(21),\n x2 = rnorm(21), \n x3 = rnorm(21),\n x4 = rnorm(21),\n x5 = rnorm(21), \n x6 = rnorm(21), \n x7 = rnorm(21), \n x8 = rnorm(21), \n x9 = rnorm(21), \n x10 = rnorm(21), \n x11 = rnorm(21), \n x12 = rnorm(21), \n x13 = rnorm(21), \n x14 = rnorm(21), \n x15 = rnorm(21)) |>\n mutate(cl = factor(cl)) |>\n mutate_if(is.numeric, function(x) (x-mean(x))/sd(x))\n\n Do we agree that there is no REAL difference between A and B?\n\n\n\n\n\n\n\nDifference is due to having insufficient data with too many variables." }, { - "objectID": "week4/slides.html#modelling-binary-responses", - "href": "week4/slides.html#modelling-binary-responses", + "objectID": "week3/slides.html#regularisation", + "href": "week3/slides.html#regularisation", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Modelling binary responses", - "text": "Modelling binary responses\n\n\n\n\n\n\n\n\n\n\n\n\n\nOrange line (logistic model fit) is similar to computing a running average of the 0s/1s. Itโ€™s much better than the linear fit, because it remains between 0 and 1, and can be interpreted as proportion of 1s.\nWhat is a logistic function?" + "section": "Regularisation", + "text": "Regularisation\n The fitting criteria has an added penalty term with the effect being that some parameter estimates are forced to ZERO. This effectively reduces the dimensionality by removing noise, and variability in the sample that is consistent with what would be expected if it was purely noise.\n Stay tuned for examples in various methods!" 
}, { - "objectID": "week4/slides.html#the-logistic-function", - "href": "week4/slides.html#the-logistic-function", + "objectID": "week3/slides.html#next-logistic-regression-and-discriminant-analysis", + "href": "week3/slides.html#next-logistic-regression-and-discriminant-analysis", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "The logistic function", - "text": "The logistic function\n\n\nInstead of predicting the outcome directly, we instead predict the probability of being class 1, given the (linear combination of) predictors, using the logistic function.\n\\[ p(y=1|\\beta_0 + \\beta_1 x) = f(x) \\] where\n\\[f(x) = \\frac{e^{\\beta_0+\\beta_1x}}{1+e^{\\beta_0+\\beta_1x}}\\]" + "section": "Next: Logistic regression and discriminant analysis", + "text": "Next: Logistic regression and discriminant analysis\n\n\n\nETC3250/5250 Lecture 3 | iml.numbat.space" }, { - "objectID": "week4/slides.html#logistic-function", - "href": "week4/slides.html#logistic-function", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Logistic function", - "text": "Logistic function\n\n\nTransform the function:\n\\[~~~~y = \\frac{e^{\\beta_0+\\beta_1x}}{1+e^{\\beta_0+\\beta_1x}}\\]\n\\(\\longrightarrow y = \\frac{1}{1/e^{\\beta_0+\\beta_1x}+1}\\)\n\\(\\longrightarrow 1/y = 1/e^{\\beta_0+\\beta_1x}+1\\)\n\\(\\longrightarrow 1/y - 1 = 1/e^{\\beta_0+\\beta_1x}\\)\n\\(\\longrightarrow \\frac{1}{1/y - 1} = e^{\\beta_0+\\beta_1x}\\)\n\\(\\longrightarrow \\frac{y}{1 - y} = e^{\\beta_0+\\beta_1x}\\)\n\\(\\longrightarrow \\log_e\\frac{y}{1 - y} = \\beta_0+\\beta_1x\\)\n\n\n \nTransforming the response \\(\\log_e\\frac{y}{1 - y}\\) makes it possible to use a linear model fit.\n \n\nThe left-hand side, \\(\\log_e\\frac{y}{1 - y}\\), is known as the log-odds ratio or logit." + "objectID": "week4/index.html", + "href": "week4/index.html", + "title": "Week 4: Logistic regression and discriminant analysis", + "section": "", + "text": "ISLR 4.3, 4.4" }, { - "objectID": "week4/slides.html#the-logistic-regression-model", - "href": "week4/slides.html#the-logistic-regression-model", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "The logistic regression model", - "text": "The logistic regression model\nThe fitted model, where \\(P(Y=0|X) = 1 - P(Y=1|X)\\), is then written as:\n\n\\(\\log_e\\frac{P(Y=1|X)}{1 - P(Y=1|X)} = \\beta_0+\\beta_1X\\)\n\n When there are more than two categories:\n\nthe formula can be extended, using dummy variables.\nfollows from the above, extended to provide probabilities for each level/category, and the last category is 1-sum of the probabilities of other categories.\nthe sum of all probabilities has to be 1." + "objectID": "week4/index.html#main-reference", + "href": "week4/index.html#main-reference", + "title": "Week 4: Logistic regression and discriminant analysis", + "section": "", + "text": "ISLR 4.3, 4.4" }, { - "objectID": "week4/slides.html#connection-to-generalised-linear-models", - "href": "week4/slides.html#connection-to-generalised-linear-models", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Connection to generalised linear models", - "text": "Connection to generalised linear models\n\nTo model binary data, we need to link our predictors to our response using a link function. 
Another way to think about it is that we will transform \\(Y\\), to convert it to a proportion, and then build the linear model on the transformed response.\nThere are many different types of link functions we could use, but for a binary response we typically use the logistic link function." + "objectID": "week4/index.html#what-you-will-learn-this-week", + "href": "week4/index.html#what-you-will-learn-this-week", + "title": "Week 4: Logistic regression and discriminant analysis", + "section": "What you will learn this week", + "text": "What you will learn this week\n\nFitting a categorical response using logistic curves\nMultivariate summary statistics\nLinear discriminant analysis, assuming samples are elliptically shaped and equal in size\nQuadratic discriminant analysis, assuming samples are elliptically shaped and different in size\nDiscriminant space: making a low-dimensional visual summary" }, { - "objectID": "week4/slides.html#interpretation", - "href": "week4/slides.html#interpretation", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Interpretation", - "text": "Interpretation\n\nLinear regression\n\n\\(\\beta_1\\) gives the average change in \\(Y\\) associated with a one-unit increase in \\(X\\)\n\nLogistic regression\n\nBecause the model is not linear in \\(X\\), \\(\\beta_1\\) does not correspond to the change in response associated with a one-unit increase in \\(X\\).\nHowever, increasing \\(X\\) by one unit changes the log odds by \\(\\beta_1\\), or equivalently it multiplies the odds by \\(e^{\\beta_1}\\)" + "objectID": "week4/index.html#lecture-slides", + "href": "week4/index.html#lecture-slides", + "title": "Week 4: Logistic regression and discriminant analysis", + "section": "Lecture slides", + "text": "Lecture slides\n\nhtml\npdf\nqmd\nR" }, { - "objectID": "week4/slides.html#maximum-likelihood-estimation", - "href": "week4/slides.html#maximum-likelihood-estimation", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Maximum Likelihood Estimation", - "text": "Maximum Likelihood Estimation\nGiven the logistic \\(p(x_i) = \\frac{1}{e^{-(\\beta_0+\\beta_1x_i)}+1}\\) choose parameters \\(\\beta_0, \\beta_1\\) to maximize the likelihood:\n\\[\\mathcal{l}_n(\\beta_0, \\beta_1) = \\prod_{i=1}^n p(x_i)^{y_i}(1-p(x_i))^{1-y_i}.\\]\nIt is more convenient to maximize the log-likelihood:\n\\[\\begin{align*}\n\\log l_n(\\beta_0, \\beta_1) &= \\sum_{i = 1}^n \\big( y_i\\log p(x_i) + (1-y_i)\\log(1-p(x_i))\\big)\\\\\n&= \\sum_{i=1}^n\\big(y_i(\\beta_0+\\beta_1x_i)-\\log{(1+e^{\\beta_0+\\beta_1x_i})}\\big)\n\\end{align*}\\]" + "objectID": "week4/index.html#tutorial-instructions", + "href": "week4/index.html#tutorial-instructions", + "title": "Week 4: Logistic regression and discriminant analysis", + "section": "Tutorial instructions", + "text": "Tutorial instructions\nInstructions:\n\nhtml\nqmd" }, { - "objectID": "week4/slides.html#making-predictions", - "href": "week4/slides.html#making-predictions", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Making predictions", - "text": "Making predictions\n\n\nWith estimates from the model fit, \\(\\hat{\\beta}_0, \\hat{\\beta}_1\\), we can predict the probability of belonging to class 1 using:\n\\[p(y=1|\\hat{\\beta}_0 + \\hat{\\beta}_1 x) = \\frac{e^{\\hat{\\beta}_0+ \\hat{\\beta}_1x}}{1+e^{\\hat{\\beta}_0+ \\hat{\\beta}_1x}}\\] \nRound to 0 or 1 for class prediction.\n\nfit <- glm(default~balance, \n data=simcredit, family=\"binomial\") \nsimcredit_fit <- 
augment(fit, simcredit,\n type.predict=\"response\")\n\n\n\n\n\n\n\n\n\n\n\nOrange points are fitted values, \\(\\hat{y}_i\\). Black points are observed response, \\(y_i\\) (either 0 or 1)." + "objectID": "week4/index.html#assignments", + "href": "week4/index.html#assignments", + "title": "Week 4: Logistic regression and discriminant analysis", + "section": "Assignments", + "text": "Assignments" }, { - "objectID": "week4/slides.html#fitting-credit-data-in-r", - "href": "week4/slides.html#fitting-credit-data-in-r", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Fitting credit data in R", - "text": "Fitting credit data in R\n\n\nWe can use the glm function in R to fit a logistic regression model. The glm function can support many response types, so we specify family=\"binomial\" to let R know that our response is binary.\n\nfit <- glm(default~balance, \n data=simcredit, family=\"binomial\") \nsimcredit_fit <- augment(fit, simcredit,\n type.predict=\"response\")\n\n\n \nSame calculation but written in tidymodels style\n\nlogistic_mod <- logistic_reg() |> \n set_engine(\"glm\") |> \n set_mode(\"classification\") |> \n translate()\n\nlogistic_fit <- \n logistic_mod |> \n fit(default ~ balance, \n data = simcredit)" + "objectID": "week4/index.html#assignments-1", + "href": "week4/index.html#assignments-1", + "title": "Week 4: Logistic regression and discriminant analysis", + "section": "Assignments", + "text": "Assignments\n\nAssignment 1 is due on Friday 22 March.\nAssignment 2 is due on Friday 12 April." }, { - "objectID": "week4/slides.html#examine-the-fit", - "href": "week4/slides.html#examine-the-fit", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Examine the fit", - "text": "Examine the fit\n\n\n\ntidy(logistic_fit) \n\n# A tibble: 2 ร— 5\n term estimate std.error statistic p.value\n <chr> <dbl> <dbl> <dbl> <dbl>\n1 (Intercept) -10.7 0.361 -29.5 3.62e-191\n2 balance 0.00550 0.000220 25.0 1.98e-137\n\nglance(logistic_fit) \n\n# A tibble: 1 ร— 8\n null.deviance df.null logLik AIC BIC deviance\n <dbl> <int> <dbl> <dbl> <dbl> <dbl>\n1 2921. 9999 -798. 1600. 1615. 1596.\n# โ„น 2 more variables: df.residual <int>, nobs <int>\n\n\n\n\n\nParameter estimates\n\\(\\widehat{\\beta}_0 =\\) -10.65\n\\(\\widehat{\\beta}_1 =\\) 0.01\nCan you write out the model?\n\n\nModel fit summary\nNull model deviance 2920.6 (error for model with no predictors)\nModel deviance 1596.5 (error from fitted model)\nHow good is the model?" 
+ "objectID": "week4/tutorialsol.html", + "href": "week4/tutorialsol.html", + "title": "ETC3250/5250 Tutorial 4", + "section": "", + "text": "Load the libraries and avoid conflicts\n# Load libraries used everywhere\nlibrary(tidyverse)\nlibrary(tidymodels)\nlibrary(conflicted)\nlibrary(patchwork)\nlibrary(mulgar)\nlibrary(mvtnorm)\nlibrary(boot)\nlibrary(nullabor)\nlibrary(palmerpenguins)\nlibrary(GGally)\nconflicts_prefer(dplyr::filter)\nconflicts_prefer(dplyr::select)\nconflicts_prefer(dplyr::slice)\nconflicts_prefer(palmerpenguins::penguins)\n\noptions(digits=2)\np_tidy <- penguins |>\n select(species, bill_length_mm:body_mass_g) |>\n rename(bl=bill_length_mm,\n bd=bill_depth_mm,\n fl=flipper_length_mm,\n bm=body_mass_g) |>\n filter(!is.na(bl)) |>\n arrange(species)" }, { - "objectID": "week4/slides.html#check-the-model-performance", - "href": "week4/slides.html#check-the-model-performance", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Check the model performance", - "text": "Check the model performance\n\n\n\nsimcredit_fit <- augment(logistic_fit, simcredit) \nsimcredit_fit |> \n count(default, .pred_class) |>\n group_by(default) |>\n mutate(Accuracy = n[.pred_class==default]/sum(n)) |>\n pivot_wider(names_from = \".pred_class\", values_from = n) |>\n select(default, No, Yes, Accuracy)\n\n# A tibble: 2 ร— 4\n# Groups: default [2]\n default No Yes Accuracy\n <fct> <int> <int> <dbl>\n1 No 9625 42 0.996\n2 Yes 233 100 0.300\n\n\nCompute the balanced accuracy.\nUnbalanced data set, with very different performance on each class.\n\nHow good is this model?\n\n\n\nExplains about half of the variation in the response, which would normally be reasonable.\nGets most of the smaller but important class wrong.\nNot a very useful model." + "objectID": "week4/tutorialsol.html#objectives", + "href": "week4/tutorialsol.html#objectives", + "title": "ETC3250/5250 Tutorial 4", + "section": "๐ŸŽฏ Objectives", + "text": "๐ŸŽฏ Objectives\nThe goal for this week is for you to practice resampling methods, in order to tune models, assess model variance, and determine importance of variables." 
}, { - "objectID": "week4/slides.html#a-warning-for-using-glms", - "href": "week4/slides.html#a-warning-for-using-glms", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "A warning for using GLMs!", - "text": "A warning for using GLMs!\n\n\n\n\nLogistic regression model fitting fails when the data is perfectly separated.\n\nMLE fit will try and fit a step-wise function to this graph, pushing coefficients sizes towards infinity and produce large standard errors.\nPay attention to warnings!\n\n\n\n\n\n\n\n\n\n\n\nlogistic_fit <- \n logistic_mod |> \n fit(default_new ~ balance, \n data = simcredit)\n\nWarning: glm.fit: algorithm did not converge\n\n\nWarning: glm.fit: fitted probabilities numerically 0 or 1\noccurred" + "objectID": "week4/tutorialsol.html#preparation", + "href": "week4/tutorialsol.html#preparation", + "title": "ETC3250/5250 Tutorial 4", + "section": "๐Ÿ”ง Preparation", + "text": "๐Ÿ”ง Preparation\n\nComplete the quiz\nDo the reading related to week 3" }, { - "objectID": "week4/slides.html#discriminant-analysis", - "href": "week4/slides.html#discriminant-analysis", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Discriminant Analysis", - "text": "Discriminant Analysis" + "objectID": "week4/tutorialsol.html#exercises", + "href": "week4/tutorialsol.html#exercises", + "title": "ETC3250/5250 Tutorial 4", + "section": "Exercises:", + "text": "Exercises:\nOpen your project for this unit called iml.Rproj.\n\n1. Assess the significance of PC coefficients using bootstrap\nIn the lecture, we used bootstrap to examine the significance of the coefficients for the second principal component from the womensโ€™ track PCA. Do this computation for PC1. The question for you to answer is: Can we consider all of the coefficients to be equal?\nThe data can be read using:\n\ntrack <- read_csv(\"https://raw.githubusercontent.com/numbats/iml/master/data/womens_track.csv\")\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\ncompute_PC1 <- function(data, index) {\n pc1 <- prcomp(data[index,], center=TRUE, scale=TRUE)$rotation[,1]\n # Coordinate signs\n if (sign(pc1[1]) < 0) \n pc1 <- -pc1 \n return(pc1)\n}\n# Make sure sign of first PC element is positive\nPC1_boot <- boot(data=track[,1:7], compute_PC1, R=1000)\ncolnames(PC1_boot$t) <- colnames(track[,1:7])\nPC1_boot_ci <- as_tibble(PC1_boot$t) %>%\n gather(var, coef) %>% \n mutate(var = factor(var, levels=c(\"m100\", \"m200\", \"m400\", \"m800\", \"m1500\", \"m3000\", \"marathon\"))) %>%\n group_by(var) %>%\n summarise(q2.5 = quantile(coef, 0.025), \n q5 = median(coef),\n q97.5 = quantile(coef, 0.975)) %>%\n mutate(t0 = PC1_boot$t0) \n \n# The red horizontal line indicates the null value \n# of the coefficient when all are equal.\nggplot(PC1_boot_ci, aes(x=var, y=t0)) + \n geom_hline(yintercept=1/sqrt(7), linetype=2, colour=\"red\") +\n geom_point() +\n geom_errorbar(aes(ymin=q2.5, ymax=q97.5), width=0.1) +\n #geom_hline(yintercept=0, linewidth=3, colour=\"white\") +\n xlab(\"\") + ylab(\"coefficient\") \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n2. 
Using simulation to assess results when there is no structure\nThe ggscree function in the mulgar package computes PCA on multivariate standard normal samples, to learn what the largest eigenvalue might be when there the covariance between variables is 0.\n\nWhat is the mean and covariance matrix of a multivariate standard normal distribution?\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\nThe mean is a \\(p\\)-dimensional vector of 0, and the covariance is a \\(p\\)-dimensional variance-covariance matrix.\n\n\n\n\n\nSimulate a sample of 55 observations from a 7D standard multivariate normal distribution. Compute the sample mean and covariance. (Question: Why 55 observations? Why 7D?)\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\nset.seed(854)\nd <- rmvnorm(55, mean = rep(0, 7), sigma = diag(7))\napply(d, 2, mean)\n\n[1] 0.271 0.125 0.054 -0.076 -0.012 -0.141 -0.055\n\ncov(d)\n\n [,1] [,2] [,3] [,4] [,5] [,6] [,7]\n[1,] 0.8162 -0.126 0.0102 -0.030 0.244 -0.0932 0.0097\n[2,] -0.1263 0.915 -0.0050 -0.051 -0.092 -0.1128 -0.0242\n[3,] 0.0102 -0.005 1.1710 0.077 0.387 -0.0019 0.1609\n[4,] -0.0298 -0.051 0.0766 0.659 0.027 0.1862 0.0463\n[5,] 0.2438 -0.092 0.3872 0.027 0.917 -0.1307 0.0143\n[6,] -0.0932 -0.113 -0.0019 0.186 -0.131 0.8257 0.0120\n[7,] 0.0097 -0.024 0.1609 0.046 0.014 0.0120 0.8046\n\n\n\n\n\n\n\nCompute PCA on your sample, and note the variance of the first PC. How does this compare with variance of the first PC of the womenโ€™s track data?\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\nd_pca <- prcomp(d, center=FALSE, scale=FALSE)\nd_pca$sdev^2\n\n[1] 1.55 1.15 1.04 0.77 0.68 0.56 0.48\n\n\nThe variance of the first PC of the womensโ€™ track data is 5.8, which is much higher than that from this sample. It says that there is substantially more variance explained by PC 1 of the womensโ€™s track data than would be expected if there was no association between any variables.\nYou should repeat generating the multivariate normal samples and computing the variance of PC 1 a few more times to learn what is the largest that would be observed.\n\n\n\n\n\n\n3. Making a lineup plot to assess the dependence between variables\nPermutation samples is used to significance assess relationships and importance of variables. Here we will use it to assess the strength of a non-linear relationship.\n\nGenerate a sample of data that has a strong non-linear relationship but no correlation, as follows:\n\n\nset.seed(908)\nn <- 205\ndf <- tibble(x1 = runif(n)-0.5, x2 = x1^2 + rnorm(n)*0.01)\n\nand then use permutation to generate another 19 plots where x1 is permuted. You can do this with the nullabor package as follows:\n\nset.seed(912)\ndf_l <- lineup(null_permute('x1'), df)\n\nand make all 20 plots as follows:\n\nggplot(df_l, aes(x=x1, y=x2)) + \n geom_point() + \n facet_wrap(~.sample)\n\nIs the data plot recognisably different from the plots of permuted data?\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\nThe data and the permuted data are very different. The permutation breaks any relationship between the two variables, so we know that there is NO relationship in any of the permuted data examples. This says that the relationship seen in the data is strongly statistically significant.\n\n\n\n\n\nRepeat this with a sample simulated with no relationship between the two variables. 
Can the data be distinguished from the permuted data?\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\nset.seed(916)\nn <- 205\ndf <- tibble(x1 = runif(n)-0.5, x2 = rnorm(n)*0.1)\ndf_l <- lineup(null_permute('x1'), df)\nggplot(df_l, aes(x=x1, y=x2)) + \n geom_point() + \n facet_wrap(~.sample)\n\n\n\n\n\n\n\n\nThe data cannot be distinguished from the permuted data, so there is no statistically significant relatiomship between the two variables.\n\n\n\n\n\n\n4. Computing \\(k\\)-folds for cross-validation\nFor the penguins data, compute 5-fold cross-validation sets, stratified by species.\n\nList the observations in each sample, so that you can see there is no overlap.\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\nset.seed(929)\np_folds <- vfold_cv(p_tidy, 5, strata=species)\nc(1:nrow(p_tidy))[-p_folds$splits[[1]]$in_id]\n\n [1] 3 6 31 36 42 44 51 53 59 62 65 66 67 79 85 88 93 96 103\n[20] 104 105 107 108 113 114 118 122 128 141 143 144 155 157 158 163 170 177 179\n[39] 182 194 195 202 204 211 213 221 222 224 226 239 246 248 256 258 264 265 275\n[58] 280 287 292 295 296 297 307 322 327 328 335 336 339\n\nc(1:nrow(p_tidy))[-p_folds$splits[[2]]$in_id]\n\n [1] 1 8 13 17 19 21 24 29 41 50 54 56 78 86 87 89 97 100 101\n[20] 112 117 121 123 129 130 132 133 139 149 150 152 159 166 167 168 169 171 189\n[39] 190 191 193 198 212 215 225 228 231 241 244 249 250 259 260 262 266 268 269\n[58] 270 271 272 276 282 283 284 288 321 331 337 342\n\nc(1:nrow(p_tidy))[-p_folds$splits[[3]]$in_id]\n\n [1] 4 9 10 15 25 30 32 35 37 39 43 47 48 55 57 64 69 71 80\n[20] 82 91 109 111 116 124 127 134 136 140 147 162 176 178 180 186 199 200 203\n[39] 207 208 210 216 218 219 220 229 232 236 240 243 247 252 254 261 267 277 279\n[58] 286 290 299 300 303 306 308 312 320 325 326 329\n\nc(1:nrow(p_tidy))[-p_folds$splits[[4]]$in_id]\n\n [1] 5 11 18 20 22 23 27 28 33 34 52 70 72 73 75 77 81 90 92\n[20] 94 95 106 110 119 125 137 138 142 145 151 154 156 160 161 165 174 181 183\n[39] 187 192 196 206 214 223 227 234 237 238 245 255 257 274 281 285 289 293 294\n[58] 298 302 313 314 315 317 324 330 332 338\n\nc(1:nrow(p_tidy))[-p_folds$splits[[5]]$in_id]\n\n [1] 2 7 12 14 16 26 38 40 45 46 49 58 60 61 63 68 74 76 83\n[20] 84 98 99 102 115 120 126 131 135 146 148 153 164 172 173 175 184 185 188\n[39] 197 201 205 209 217 230 233 235 242 251 253 263 273 278 291 301 304 305 309\n[58] 310 311 316 318 319 323 333 334 340 341\n\n\n\n\n\n\n\nMake a scatterplot matrix for each fold, coloured by species. 
Do the samples look similar?\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\np_sub <- p_tidy[-p_folds$splits[[1]]$in_id, ]\nggscatmat(p_sub, columns=2:5, color=\"species\") +\n theme(legend.position=\"none\",\n axis.text = element_blank())\n\n\n\n\n\n\n\np_sub <- p_tidy[-p_folds$splits[[2]]$in_id, ]\nggscatmat(p_sub, columns=2:5, color=\"species\") +\n theme(legend.position=\"none\",\n axis.text = element_blank())\n\n\n\n\n\n\n\np_sub <- p_tidy[-p_folds$splits[[3]]$in_id, ]\nggscatmat(p_sub, columns=2:5, color=\"species\") +\n theme(legend.position=\"none\",\n axis.text = element_blank())\n\n\n\n\n\n\n\np_sub <- p_tidy[-p_folds$splits[[4]]$in_id, ]\nggscatmat(p_sub, columns=2:5, color=\"species\") +\n theme(legend.position=\"none\",\n axis.text = element_blank())\n\n\n\n\n\n\n\np_sub <- p_tidy[-p_folds$splits[[5]]$in_id, ]\nggscatmat(p_sub, columns=2:5, color=\"species\") +\n theme(legend.position=\"none\",\n axis.text = element_blank())\n\n\n\n\n\n\n\n\nThe folds are similar but there are some noticeable differences that might lead to variation in the statistics that are calculated from each other. However, one should consider this variation something that might generally occur if we had different samples.\n\n\n\n\n\n\n5. What was the easiest part of this tutorial to understand, and what was the hardest?" }, { - "objectID": "week4/slides.html#linear-discriminant-analysis", - "href": "week4/slides.html#linear-discriminant-analysis", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Linear Discriminant Analysis", - "text": "Linear Discriminant Analysis\n\n\n\n\n\n\n\n\n\n\n\nWhere would you draw a line to create a boundary separating Adelie and Gentoo penguins?\n\n\n\nWhere are the sample means?\nWhat is the shape of the sample variance-covariance?\n\n\n\nLinear discriminant analysis assumes the distribution of the predictors is a multivariate normal, with the same variance-covariance matrix, separately for each class." + "objectID": "week4/tutorialsol.html#finishing-up", + "href": "week4/tutorialsol.html#finishing-up", + "title": "ETC3250/5250 Tutorial 4", + "section": "๐Ÿ‘‹ Finishing up", + "text": "๐Ÿ‘‹ Finishing up\nMake sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult." 
}, { - "objectID": "week4/slides.html#assumptions-underlie-lda", - "href": "week4/slides.html#assumptions-underlie-lda", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Assumptions underlie LDA", - "text": "Assumptions underlie LDA\n\n\n\n\nSource: https://xkcd.com\n\n\n\n\nAll samples come from normal populations\nwith the same population variance-covariance matrix" + "objectID": "week4/tutorial.html", + "href": "week4/tutorial.html", + "title": "ETC3250/5250 Tutorial 4", + "section": "", + "text": "Load the libraries and avoid conflicts\n# Load libraries used everywhere\nlibrary(tidyverse)\nlibrary(tidymodels)\nlibrary(conflicted)\nlibrary(patchwork)\nlibrary(mulgar)\nlibrary(mvtnorm)\nlibrary(boot)\nlibrary(nullabor)\nlibrary(palmerpenguins)\nlibrary(GGally)\nconflicts_prefer(dplyr::filter)\nconflicts_prefer(dplyr::select)\nconflicts_prefer(dplyr::slice)\nconflicts_prefer(palmerpenguins::penguins)\n\noptions(digits=2)\np_tidy <- penguins |>\n select(species, bill_length_mm:body_mass_g) |>\n rename(bl=bill_length_mm,\n bd=bill_depth_mm,\n fl=flipper_length_mm,\n bm=body_mass_g) |>\n filter(!is.na(bl)) |>\n arrange(species)" }, { - "objectID": "week4/slides.html#lda-with-p1-predictors-14", - "href": "week4/slides.html#lda-with-p1-predictors-14", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "LDA with \\(p=1\\) predictors 1/4", - "text": "LDA with \\(p=1\\) predictors 1/4\n\n\nIf \\(K = 2\\) (two classes labelled A and B) and each group has the same prior probability, the LDA rule is to assign the new observation \\(x_0\\) to class A if\n\n\\[\nx_0 > \\frac{\\bar{x}_A + \\bar{x}_B}{2}\n\\]\n\n\n\nItโ€™s a really intuitive rule, eh?\nIt also matters which of the two classes is considered to be A!!!\nSo maybe easier to think about as โ€œassign the new observation to the group with the closest meanโ€.\nHow does this rule arise from the assumptions?" + "objectID": "week4/tutorial.html#objectives", + "href": "week4/tutorial.html#objectives", + "title": "ETC3250/5250 Tutorial 4", + "section": "๐ŸŽฏ Objectives", + "text": "๐ŸŽฏ Objectives\nThe goal for this week is for you to practice resampling methods, in order to tune models, assess model variance, and determine importance of variables." }, { - "objectID": "week4/slides.html#bayes-theorem-24", - "href": "week4/slides.html#bayes-theorem-24", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Bayes Theorem 2/4", - "text": "Bayes Theorem 2/4\nLet \\(f_k(x)\\) be the density function for predictor \\(x\\) for class \\(k\\). If \\(f\\) is large, the probability that \\(x\\) belongs to class \\(k\\) is large, or if \\(f\\) is small it is unlikely that \\(x\\) belongs to class \\(k\\).\nAccording to Bayes theorem (for \\(K\\) classes) the probability that \\(x\\) belongs to class \\(k\\) is:\n\\[P(Y = k|X = x) = p_k(x) = \\frac{\\pi_kf_k(x)}{\\sum_{i=1}^K \\pi_kf_k(x)}\\]\nwhere \\(\\pi_k\\) is the prior probability that an observation comes from class \\(k\\)." 
+ "objectID": "week4/tutorial.html#preparation", + "href": "week4/tutorial.html#preparation", + "title": "ETC3250/5250 Tutorial 4", + "section": "๐Ÿ”ง Preparation", + "text": "๐Ÿ”ง Preparation\n\nComplete the quiz\nDo the reading related to week 3" }, { - "objectID": "week4/slides.html#lda-with-p1-predictors-34", - "href": "week4/slides.html#lda-with-p1-predictors-34", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "LDA with \\(p=1\\) predictors 3/4", - "text": "LDA with \\(p=1\\) predictors 3/4\n\n\nThe density function \\(f_k(x)\\) of a univariate normal (Gaussian) is\n\\[\nf_k(x) = \\frac{1}{\\sqrt{2 \\pi} \\sigma_k} \\text{exp}~ \\left( - \\frac{1}{2 \\sigma^2_k} (x - \\mu_k)^2 \\right)\n\\]\nwhere \\(\\mu_k\\) and \\(\\sigma^2_k\\) are the mean and variance parameters for the \\(k\\)th class. We also assume that \\(\\sigma_1^2 = \\sigma_2^2 = \\dots = \\sigma_K^2\\); then the conditional probabilities are\n\\[\np_k(x) = \\frac{\\pi_k \\frac{1}{\\sqrt{2 \\pi} \\sigma} \\text{exp}~ \\left( - \\frac{1}{2 \\sigma^2} (x - \\mu_k)^2 \\right) }{ \\sum_{l = 1}^K \\pi_l \\frac{1}{\\sqrt{2 \\pi} \\sigma} \\text{exp}~ \\left( - \\frac{1}{2 \\sigma^2} (x - \\mu_l)^2 \\right) }\n\\]" + "objectID": "week4/tutorial.html#exercises", + "href": "week4/tutorial.html#exercises", + "title": "ETC3250/5250 Tutorial 4", + "section": "Exercises:", + "text": "Exercises:\nOpen your project for this unit called iml.Rproj.\n\n1. Assess the significance of PC coefficients using bootstrap\nIn the lecture, we used bootstrap to examine the significance of the coefficients for the second principal component from the womensโ€™ track PCA. Do this computation for PC1. The question for you to answer is: Can we consider all of the coefficients to be equal?\nThe data can be read using:\n\ntrack <- read_csv(\"https://raw.githubusercontent.com/numbats/iml/master/data/womens_track.csv\")\n\n\n\n2. Using simulation to assess results when there is no structure\nThe ggscree function in the mulgar package computes PCA on multivariate standard normal samples, to learn what the largest eigenvalue might be when there the covariance between variables is 0.\n\nWhat is the mean and covariance matrix of a multivariate standard normal distribution?\n\n\nSimulate a sample of 55 observations from a 7D standard multivariate normal distribution. Compute the sample mean and covariance. (Question: Why 55 observations? Why 7D?)\n\n\nCompute PCA on your sample, and note the variance of the first PC. How does this compare with variance of the first PC of the womenโ€™s track data?\n\n\n\n3. Making a lineup plot to assess the dependence between variables\nPermutation samples is used to significance assess relationships and importance of variables. Here we will use it to assess the strength of a non-linear relationship.\n\nGenerate a sample of data that has a strong non-linear relationship but no correlation, as follows:\n\n\nset.seed(908)\nn <- 205\ndf <- tibble(x1 = runif(n)-0.5, x2 = x1^2 + rnorm(n)*0.01)\n\nand then use permutation to generate another 19 plots where x1 is permuted. You can do this with the nullabor package as follows:\n\nset.seed(912)\ndf_l <- lineup(null_permute('x1'), df)\n\nand make all 20 plots as follows:\n\nggplot(df_l, aes(x=x1, y=x2)) + \n geom_point() + \n facet_wrap(~.sample)\n\nIs the data plot recognisably different from the plots of permuted data?\n\nRepeat this with a sample simulated with no relationship between the two variables. 
Can the data be distinguished from the permuted data?\n\n\n\n4. Computing \\(k\\)-folds for cross-validation\nFor the penguins data, compute 5-fold cross-validation sets, stratified by species.\n\nList the observations in each sample, so that you can see there is no overlap.\n\n\nMake a scatterplot matrix for each fold, coloured by species. Do the samples look similar?\n\n\n\n5. What was the easiest part of this tutorial to understand, and what was the hardest?" }, { - "objectID": "week4/slides.html#lda-with-p1-predictors-44", - "href": "week4/slides.html#lda-with-p1-predictors-44", - "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "LDA with \\(p=1\\) predictors 4/4", - "text": "LDA with \\(p=1\\) predictors 4/4\n\n\nA simplification of \\(p_k(x_0)\\) yields the discriminant functions, \\(\\delta_k(x_0)\\):\n\\[\\delta_k(x_0) = x_0 \\frac{\\mu_k}{\\sigma^2} - \\frac{\\mu_k^2}{2 \\sigma^2} + log(\\pi_k)\\] from which the LDA rule will assign \\(x_0\\) to the class \\(k\\) with the largest value.\n\nLet \\(K=2\\), then the rule reduces to assign \\(x_0\\) to class A if\n\\[\\begin{align*}\n& \\frac{\\pi_A \\frac{1}{\\sqrt{2 \\pi} \\sigma} \\text{exp}~ \\left( - \\frac{1}{2 \\sigma^2} (x - \\mu_A)^2 \\right) }{ \\sum_{l = 1}^2 \\pi_l \\frac{1}{\\sqrt{2 \\pi} \\sigma} \\text{exp}~ \\left( - \\frac{1}{2 \\sigma^2} (x - \\mu_l)^2 \\right) } > \\frac{\\pi_B \\frac{1}{\\sqrt{2 \\pi} \\sigma} \\text{exp}~ \\left( - \\frac{1}{2 \\sigma^2} (x - \\mu_B)^2 \\right) }{ \\sum_{l = 1}^2 \\pi_l \\frac{1}{\\sqrt{2 \\pi} \\sigma} \\text{exp}~ \\left( - \\frac{1}{2 \\sigma^2} (x - \\mu_l)^2 \\right) }\\\\\n &\\longrightarrow \\pi_A \\frac{1}{\\sqrt{2 \\pi} \\sigma} \\text{exp}~ \\left( - \\frac{1}{2 \\sigma^2} (x_0 - \\mu_A)^2 \\right) > \\pi_B \\frac{1}{\\sqrt{2 \\pi} \\sigma} \\text{exp}~ \\left( - \\frac{1}{2 \\sigma^2} (x_0 - \\mu_B)^2 \\right)\\\\\n &\\longrightarrow \\pi_A \\text{exp}~ \\left( - \\frac{1}{2 \\sigma^2} (x_0 - \\mu_A)^2 \\right) > \\pi_B \\text{exp}~ \\left( - \\frac{1}{2 \\sigma^2} (x_0 - \\mu_B)^2 \\right) \\\\\n &\\longrightarrow \\log \\pi_A - \\frac{1}{2 \\sigma^2} (x_0 - \\mu_A)^2 > \\log \\pi_B - \\frac{1}{2 \\sigma^2} (x_0 - \\mu_B)^2\\\\\n &\\longrightarrow \\log \\pi_A - \\frac{1}{2 \\sigma^2} (x_0^2 - 2x_0\\mu_A + \\mu_A^2) > \\log \\pi_B - \\frac{1}{2 \\sigma^2} (x_0^2 - 2x_0\\mu_B + \\mu_B^2) \\\\\n &\\longrightarrow \\log \\pi_A - \\frac{1}{2 \\sigma^2} (-2x_0\\mu_A + \\mu_A^2) > \\log \\pi_B - \\frac{1}{2 \\sigma^2} (-2x_0\\mu_B + \\mu_B^2) \\\\\n &\\longrightarrow \\log \\pi_A + \\frac{x_0\\mu_A}{\\sigma^2} - \\frac{\\mu_A^2}{\\sigma^2} > \\log \\pi_B + \\frac{x_0\\mu_B}{\\sigma^2} - \\frac{\\mu_B^2}{\\sigma^2} \\\\\n &\\longrightarrow \\underbrace{x_0\\frac{\\mu_A}{\\sigma^2} - \\frac{\\mu_A^2}{\\sigma^2} + \\log \\pi_A}_{\\text{Discriminant function for class A}} > \\underbrace{x_0\\frac{\\mu_B}{\\sigma^2} - \\frac{\\mu_B^2}{\\sigma^2} + \\log \\pi_B}_{\\text{Discriminant function for class B}}\n\\end{align*}\\]" + "objectID": "week4/tutorial.html#finishing-up", + "href": "week4/tutorial.html#finishing-up", + "title": "ETC3250/5250 Tutorial 4", + "section": "๐Ÿ‘‹ Finishing up", + "text": "๐Ÿ‘‹ Finishing up\nMake sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult." 
}, { - "objectID": "week4/slides.html#multivariate-lda-p1", - "href": "week4/slides.html#multivariate-lda-p1", + "objectID": "week5/slides.html#overview", + "href": "week5/slides.html#overview", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Multivariate LDA, \\(p>1\\)", - "text": "Multivariate LDA, \\(p>1\\)\nA \\(p\\)-dimensional random variable \\(X\\) has a multivariate Gaussian distribution with mean \\(\\mu\\) and variance-covariance \\(\\Sigma\\), we write \\(X \\sim N(\\mu, \\Sigma)\\).\nThe multivariate normal density function is:\n\\[f(x) = \\frac{1}{(2\\pi)^{p/2}|\\Sigma|^{1/2}} \\exp\\{-\\frac{1}{2}(x-\\mu)^\\top\\Sigma^{-1}(x-\\mu)\\}\\]\nwith \\(x, \\mu\\) are \\(p\\)-dimensional vectors, \\(\\Sigma\\) is a \\(p\\times p\\) variance-covariance matrix." + "section": "Overview", + "text": "Overview\nWe will cover:\n\nClassification trees, algorithm, stopping rules\nDifference between algorithm and parametric methods, especially trees vs LDA\nForests: ensembles of bagged trees\nDiagnostics: vote matrix, variable importance, proximity\nBoosted trees" }, { - "objectID": "week4/slides.html#multivariate-lda-k2", - "href": "week4/slides.html#multivariate-lda-k2", + "objectID": "week5/slides.html#trees", + "href": "week5/slides.html#trees", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Multivariate LDA, \\(K=2\\)", - "text": "Multivariate LDA, \\(K=2\\)\nThe discriminant functions are:\n\\[\\delta_k(x) = x^\\top\\Sigma^{-1}\\mu_k - \\frac{1}{2}\\mu_k^\\top\\Sigma^{-1}\\mu_k + \\log(\\pi_k)\\]\nand Bayes classifier is assign a new observation \\(x_0\\) to the class with the highest \\(\\delta_k(x_0)\\).\nWhen \\(K=2\\) and \\(\\pi_A=\\pi_B\\) this reduces to\nAssign observation \\(x_0\\) to class A if\n\\[x_0^\\top\\underbrace{\\Sigma^{-1}(\\mu_A-\\mu_B)}_{dimension~reduction} > \\frac{1}{2}(\\mu_A+\\mu_B)^\\top\\underbrace{\\Sigma^{-1}(\\mu_A-\\mu_B)}_{dimension~reduction}\\]\nNOTE: Class A and B need to be mapped to the classes in the your data. The class โ€œto the rightโ€ on the reduced dimension will correspond to class A in this equation." + "section": "Trees", + "text": "Trees\nNice explanation of trees, forests, boosted trees" }, { - "objectID": "week4/slides.html#computation", - "href": "week4/slides.html#computation", + "objectID": "week5/slides.html#algorithm-growing-a-tree", + "href": "week5/slides.html#algorithm-growing-a-tree", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Computation", - "text": "Computation\n Use sample mean \\(\\bar{x}_k\\) to estimate \\(\\mu_k\\), and\n\nto estimate \\(\\Sigma\\) use the pooled variance-covariance:\n\\[\nS = \\frac{n_1S_1 + n_2S_2+ \\dots +n_kS_k}{n_1+n_2+\\dots +n_k}\n\\]" + "section": "Algorithm: growing a tree", + "text": "Algorithm: growing a tree\n\n\n\nAll observations in a single set\nSort values on first variable\nCompute the chosen split criteria for all possible splits into two sets\nChoose the best split on this variable. Save this info.\nRepeat 2-4 for all other variables\nChoose the best variable to split on, based on the best split. 
Your data is now in two sets.\nRepeat 1-6 on each subset.\nStop when stopping rule that decides that the best classification model is achieved.\n\n\n\nPros and cons:\n\nTrees are a very flexible way to fit a classifier.\nThey can\n\nutilise different types of predictor variables\nignore missing values\nhandle different units or scales on variables\ncapture intricate patterns\n\nHowever, they operate on a per variable basis, and do not effectively model separation when a combination of variables is needed." }, { - "objectID": "week4/slides.html#example-penguins-13", - "href": "week4/slides.html#example-penguins-13", + "objectID": "week5/slides.html#common-split-criteria", + "href": "week5/slides.html#common-split-criteria", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Example: penguins 1/3", - "text": "Example: penguins 1/3\n\n\nSummary statistics\n\n\n# A tibble: 2 ร— 3\n species bm bd\n <fct> <dbl> <dbl>\n1 Adelie 3701. 18.3\n2 Gentoo 5076. 15.0\n\n\n bm bd\nbm 210283 321.4\nbd 321 1.5\n\n\n bm bd\nbm 254133 355.69\nbd 356 0.96\n\n\n\n\n\n\n\n\n\n\n\nlibrary(discrim)\nlda_spec <- discrim_linear() |>\n set_mode(\"classification\") |>\n set_engine(\"MASS\", prior = c(0.5, 0.5))\nlda_fit <- lda_spec |> \n fit(species ~ bm + bd, data = p_sub)\n\nlda_fit\n\nparsnip model object\n\nCall:\nlda(species ~ bm + bd, data = data, prior = ~c(0.5, 0.5))\n\nPrior probabilities of groups:\nAdelie Gentoo \n 0.5 0.5 \n\nGroup means:\n bm bd\nAdelie 3701 18\nGentoo 5076 15\n\nCoefficients of linear discriminants:\n LD1\nbm 0.0024\nbd -1.0444\n\n\n\nRecommendation: standardise the variables before fitting model, even though it is not necessary for LDA." + "section": "Common split criteria", + "text": "Common split criteria\n\n\nClassification\n\nThe Gini index measures is defined as: \\[G = \\sum_{k =1}^K \\widehat{p}_{mk}(1 - \\widehat{p}_{mk})\\]\nEntropy is defined as \\[D = - \\sum_{k =1}^K \\widehat{p}_{mk} log(\\widehat{p}_{mk})\\] What corresponds to a high value, and what corresponds to a low value?\n\n\nRegression\nDefine\n\\[\\mbox{MSE} = \\frac{1}{n}\\sum_{i=1}^{n} (y_i - \\widehat{y}_i)^2\\]\nSplit the data where combining MSE for left bucket (MSE_L) and right bucket (MSE_R), makes the biggest reduction from the overall MSE." }, { - "objectID": "week4/slides.html#example-penguins-23", - "href": "week4/slides.html#example-penguins-23", + "objectID": "week5/slides.html#illustration-12", + "href": "week5/slides.html#illustration-12", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Example: penguins 2/3", - "text": "Example: penguins 2/3\n\n\nSummary statistics\n\n\n# A tibble: 2 ร— 3\n species bm bd\n <fct> <dbl> <dbl>\n1 Adelie -0.739 0.750\n2 Gentoo 0.907 -0.921\n\n\n bm bd\nbm 0.30 0.19\nbd 0.19 0.37\n\n\n bm bd\nbm 0.36 0.21\nbd 0.21 0.24\n\n\n\n\n\n\n\n\n\n\n\nlibrary(discrim)\nlda_spec <- discrim_linear() |>\n set_mode(\"classification\") |>\n set_engine(\"MASS\", prior = c(0.5, 0.5))\nlda_fit <- lda_spec |> \n fit(species ~ bm + bd, data = p_sub)\n\nlda_fit\n\nparsnip model object\n\nCall:\nlda(species ~ bm + bd, data = data, prior = ~c(0.5, 0.5))\n\nPrior probabilities of groups:\nAdelie Gentoo \n 0.5 0.5 \n\nGroup means:\n bm bd\nAdelie -0.74 0.75\nGentoo 0.91 -0.92\n\nCoefficients of linear discriminants:\n LD1\nbm 2.0\nbd -2.1\n\n\nEasier to see that both variables contribute almost equally to the classification." 
+ "section": "Illustration (1/2)", + "text": "Illustration (1/2)\n\n\n\n\n\n\n\nx\ncl\n\n\n\n\n11\nA\n\n\n33\nA\n\n\n39\nB\n\n\n44\nA\n\n\n50\nA\n\n\n56\nB\n\n\n70\nB\n\n\n\n\n\n\n\nNote: x is sorted from lowest to highest!\n\n\nAll possible splits shown by vertical lines\n\n\n\n\n\n\n\n\n\nWhat do you think is the best split? 2, 3 or 5??" }, { - "objectID": "week4/slides.html#example-penguins-33", - "href": "week4/slides.html#example-penguins-33", + "objectID": "week5/slides.html#illustration-22", + "href": "week5/slides.html#illustration-22", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Example: penguins 3/3", - "text": "Example: penguins 3/3\n\n\n\\[\nS^{-1}(\\bar{x}_A - \\bar{x}_B)\n\\]\n\nS1 <- cov(p_sub[p_sub$species == \"Adelie\",-1])\nS2 <- cov(p_sub[p_sub$species == \"Gentoo\",-1])\nSp <- (S1+S2)/2\nSp\n\n bm bd\nbm 0.33 0.2\nbd 0.20 0.3\n\nSpinv <- solve(Sp)\nSpinv\n\n bm bd\nbm 5.1 -3.4\nbd -3.4 5.6\n\nm1 <- as.matrix(lda_fit$fit$means[1,], ncol=1)\nm1\n\n [,1]\nbm -0.74\nbd 0.75\n\nm2 <- as.matrix(lda_fit$fit$means[2,], ncol=1)\nm2\n\n [,1]\nbm 0.91\nbd -0.92\n\nSpinv %*% (m1-m2)\n\n [,1]\nbm -14\nbd 15\n\n\n\n\\[\nx_0 S^{-1}(\\bar{x}_A - \\bar{x}_B) > \\frac{\\bar{x}_A + \\bar{x}_B}{2} S^{-1}(\\bar{x}_A - \\bar{x}_B)\n\\]\n\n(m1 + m2)/2\n\n [,1]\nbm 0.084\nbd -0.085\n\nmatrix((m1 + m2)/2, ncol=2) %*% Spinv %*% (m1-m2)\n\n [,1]\n[1,] -2.4\n\n\nIf \\(x_0\\) is -0.68, 0.93, what species is it?\n\n\nas.matrix(p_sub[1,-1]) %*% Spinv %*% (m1-m2)\n\n [,1]\n[1,] 23\n\n\nIs Adelie class A or is Gentoo class A?\n\n\nCheck by plugging in the means\n\nt(m1) %*% Spinv %*% (m1-m2)\n\n [,1]\n[1,] 21\n\n\n\n\n\npredict(lda_fit, p_sub[1,-1])$.pred_class\n\n[1] Adelie\nLevels: Adelie Gentoo" + "section": "Illustration (2/2)", + "text": "Illustration (2/2)\n\n\nCalculate the impurity for split 5\nThe left bucket is\n\n\n\n\n\nx\ncl\n\n\n\n\n11\nA\n\n\n33\nA\n\n\n39\nB\n\n\n44\nA\n\n\n50\nA\n\n\n\n\n\n\n\nand the right bucket is\n\n\n\n\n\nx\ncl\n\n\n\n\n56\nB\n\n\n70\nB\n\n\n\n\n\n\n\n\nUsing Gini \\(G = \\sum_{k =1}^K \\widehat{p}_{mk}(1 - \\widehat{p}_{mk})\\)\nLeft bucket:\n\\[\\widehat{p}_{LA} = 4/5, \\widehat{p}_{LB} = 1/5, ~~ p_L = 5/7\\]\n\\[G_L=0.8(1-0.8)+0.2(1-0.2) = 0.32\\]\nRight bucket:\n\\[\\widehat{p}_{RA} = 0/2, \\widehat{p}_{RB} = 2/2, ~~ p_R = 2/7\\]\n\\[G_R=0(1-0)+1(1-1) = 0\\] Combine with weighted sum to get impurity for the split:\n\\[5/7G_L + 2/7G_R=0.23\\]\n Your turn: Compute the impurity for split 2." }, { - "objectID": "week4/slides.html#dimension-reduction", - "href": "week4/slides.html#dimension-reduction", + "objectID": "week5/slides.html#section", + "href": "week5/slides.html#section", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Dimension reduction", - "text": "Dimension reduction" + "section": "", + "text": "Splits on categorical variables\n\n\n\n\n\n\n\n\n\nPossible best split would be if koala then assign to Vic else assign to WA, because Vic has more koalas but and WA has more emus and roos.\n\nDealing with missing values on predictors\n\n\n\n\n\nx1\nx2\nx3\nx4\ny\n\n\n\n\n19\n-8\n22\n-24\nA\n\n\nNA\n-10\n26\n-26\nA\n\n\n15\nNA\n32\n-27\nB\n\n\n17\n-6\n27\n-25\nA\n\n\n18\n-5\nNA\n-23\nA\n\n\n13\n-3\n37\nNA\nB\n\n\n12\n-1\n35\n-30\nB\n\n\n11\n-7\n24\n-31\nB\n\n\n\n\n\n\n\n50% of cases have missing values. Trees ignore missings only on a single variable.\n\nEvery other method ignores a full observation if missing on any variable. That is, would only be able to use half the data." 
}, { - "objectID": "week4/slides.html#dimension-reduction-via-lda", - "href": "week4/slides.html#dimension-reduction-via-lda", + "objectID": "week5/slides.html#example-penguins-13", + "href": "week5/slides.html#example-penguins-13", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Dimension reduction via LDA", - "text": "Dimension reduction via LDA\nDiscriminant space: LDA also provides a low-dimensional projection of the \\(p\\)-dimensional space, where the groups are the most separated. For \\(K=2\\), this is\n\n\\[\n\\Sigma^{-1}(\\mu_A-\\mu_B)\n\\]\nThe distance between means relative to the variance-covariance, ie Mahalanobis distance." + "section": "Example: penguins 1/3", + "text": "Example: penguins 1/3\n\n\n\n\n\n\n\n\n\n\n\n\n\nset.seed(1156)\np_split <- initial_split(p_sub, 2/3, strata=species)\np_tr <- training(p_split)\np_ts <- testing(p_split)\n\ntree_spec <- decision_tree() |>\n set_mode(\"classification\") |>\n set_engine(\"rpart\")\n\np_fit_tree <- tree_spec |>\n fit(species~., data=p_tr)\n\np_fit_tree\n\nparsnip model object\n\nn= 145 \n\nnode), split, n, loss, yval, (yprob)\n * denotes terminal node\n\n1) root 145 45 Adelie (0.690 0.310) \n 2) bl< 43 99 2 Adelie (0.980 0.020) *\n 3) bl>=43 46 3 Chinstrap (0.065 0.935) *\n\n\n\n Can you draw the tree?" }, { - "objectID": "week4/slides.html#discriminant-space", - "href": "week4/slides.html#discriminant-space", + "objectID": "week5/slides.html#stopping-rules", + "href": "week5/slides.html#stopping-rules", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Discriminant space", - "text": "Discriminant space\nThe dashed lines are the Bayes decision boundaries. Ellipses that contain 95% of the probability for each of the three classes are shown. Solid line corresponds to the class boundaries from the LDA model fit to the sample.\n\n \n\n(Chapter4/4.6.pdf)" + "section": "Stopping rules", + "text": "Stopping rules\n\nMinimum split: number of observations in a node, in order for a split to be made\nMinimum bucket: Minimum number of observations allowed in a terminal node\nComplexity parameter: minimum difference between impurity values required to continue splitting" }, { - "objectID": "week4/slides.html#discriminant-space-using-sample-statistics", - "href": "week4/slides.html#discriminant-space-using-sample-statistics", + "objectID": "week5/slides.html#example-penguins-23", + "href": "week5/slides.html#example-penguins-23", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Discriminant space: using sample statistics", - "text": "Discriminant space: using sample statistics\n\nDiscriminant space: is the low-dimensional space (\\((K-1)\\)-dimensional) where the class means are the furthest apart relative to the common variance-covariance.\n\nThe discriminant space is provided by the eigenvectors after making an eigen-decomposition of \\(W^{-1}B\\), where\n\\[\nB = \\frac{1}{K}\\sum_{i=1}^{K} (\\bar{x}_i-\\bar{x})(\\bar{x}_i-\\bar{x})^\\top\n~~~\\text{and}~~~\nW = \\frac{1}{K}\\sum_{k=1}^K\\frac{1}{n_k}\\sum_{i=1}^{n_k} (x_i-\\bar{x}_k)(x_i-\\bar{x}_k)^\\top\n\\]\nNote \\(W\\) is the (unweighted) pooled variance-covariance matrix." 
+ "section": "Example: penguins 2/3", + "text": "Example: penguins 2/3\n\n\nDefaults for rpart are:\n\nrpart.control(minsplit = 20, \n minbucket = round(minsplit/3), \n cp = 0.01, \n maxcompete = 4, \n maxsurrogate = 5, \n usesurrogate = 2, \n xval = 10,\n surrogatestyle = 0, maxdepth = 30, \n ...)\n\n\ntree_spec <- decision_tree() |>\n set_mode(\"classification\") |>\n set_engine(\"rpart\",\n control = rpart.control(minsplit = 10), \n model=TRUE)\n\np_fit_tree <- tree_spec |>\n fit(species~., data=p_tr)\n\np_fit_tree\n\nparsnip model object\n\nn= 145 \n\nnode), split, n, loss, yval, (yprob)\n * denotes terminal node\n\n 1) root 145 45 Adelie (0.690 0.310) \n 2) bl< 43 99 2 Adelie (0.980 0.020) \n 4) bl< 41 75 0 Adelie (1.000 0.000) *\n 5) bl>=41 24 2 Adelie (0.917 0.083) \n 10) bm>=3.4e+03 21 0 Adelie (1.000 0.000) *\n 11) bm< 3.4e+03 3 1 Chinstrap (0.333 0.667) *\n 3) bl>=43 46 3 Chinstrap (0.065 0.935) \n 6) bl< 46 10 3 Chinstrap (0.300 0.700) \n 12) bm>=3.8e+03 3 0 Adelie (1.000 0.000) *\n 13) bm< 3.8e+03 7 0 Chinstrap (0.000 1.000) *\n 7) bl>=46 36 0 Chinstrap (0.000 1.000) *" }, { - "objectID": "week4/slides.html#section", - "href": "week4/slides.html#section", + "objectID": "week5/slides.html#example-penguins-33", + "href": "week5/slides.html#example-penguins-33", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "", - "text": "Mahalanobis distance\nFor two \\(p\\)-dimensional vectors, Euclidean distance is\n\\[d(x,y) = \\sqrt{(x-y)^\\top(x-y)}\\] and Mahalanobs distance is\n\\[d(x,y) = \\sqrt{(x-y)^\\top\\Sigma^{-1}(x-y)}\\]\nWhich points are closest according to Euclidean distance? Which points are closest relative to the variance-covariance?" + "section": "Example: penguins 3/3", + "text": "Example: penguins 3/3\n\n\n\n\n\n\n\n\n\n\n\n\n\np_fit_tree |>\n extract_fit_engine() |>\n rpart.plot(type=3, extra=1)" }, { - "objectID": "week4/slides.html#discriminant-space-1", - "href": "week4/slides.html#discriminant-space-1", + "objectID": "week5/slides.html#example-penguins-34", + "href": "week5/slides.html#example-penguins-34", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Discriminant space", - "text": "Discriminant space\nIn the means of scenarios 1 and 2 are the same, but the variance-covariances are different. The calculated discriminant space is different for different variance-covariances.\n\nNotice: Means for groups are different, and variance-covariance for each group are the same." 
+ "section": "Example: penguins 3/4", + "text": "Example: penguins 3/4\n\n\nModel fit summary\n\n\n# A tibble: 1 ร— 3\n .metric .estimator .estimate\n <chr> <chr> <dbl>\n1 accuracy binary 0.946\n\n\n# A tibble: 2 ร— 4\n# Groups: species [2]\n species Adelie Chinstrap Accuracy\n <fct> <int> <int> <dbl>\n1 Adelie 50 1 0.980\n2 Chinstrap 3 20 0.870\n\n\n# A tibble: 1 ร— 3\n .metric .estimator .estimate\n <chr> <chr> <dbl>\n1 bal_accuracy binary 0.925\n\n\n\nCan you see the misclassified test cases?\n\nModel-in-the-data-space" }, { - "objectID": "week4/slides.html#quadratic-discriminant-analysis", - "href": "week4/slides.html#quadratic-discriminant-analysis", + "objectID": "week5/slides.html#comparison-with-lda", + "href": "week5/slides.html#comparison-with-lda", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Quadratic Discriminant Analysis", - "text": "Quadratic Discriminant Analysis\nIf the groups have different variance-covariance matrices, but still come from a normal distribution" + "section": "Comparison with LDA", + "text": "Comparison with LDA\n\n\n\nTree model\n\n\n\n\n\n\n\n\n\n\n\nData-driven, only split on single variables\n\n\n\nLDA model\n\n\n\n\n\n\n\n\n\n\n\nAssume normal, equal VC, oblique splits" }, { - "objectID": "week4/slides.html#quadratic-da-qda", - "href": "week4/slides.html#quadratic-da-qda", + "objectID": "week5/slides.html#random-forests", + "href": "week5/slides.html#random-forests", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Quadratic DA (QDA)", - "text": "Quadratic DA (QDA)\nIf the variance-covariance matrices for the groups are NOT EQUAL, then the discriminant functions are:\n\\[\\delta_k(x) = x^\\top\\Sigma_k^{-1}x + x^\\top\\Sigma_k^{-1}\\mu_k - \\frac12\\mu_k^\\top\\Sigma_k^{-1}\\mu_k - \\frac12 \\log{|\\Sigma_k|} + \\log(\\pi_k)\\]\nwhere \\(\\Sigma_k\\) is the population variance-covariance for class \\(k\\), estimated by the sample variance-covariance \\(S_k\\), and \\(\\mu_k\\) is the population mean vector for class \\(k\\), estimated by the sample mean \\(\\bar{x}_k\\)." + "section": "Random forests", + "text": "Random forests" }, { - "objectID": "week4/slides.html#quadratic-da-qda-1", - "href": "week4/slides.html#quadratic-da-qda-1", + "objectID": "week5/slides.html#overview-1", + "href": "week5/slides.html#overview-1", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Quadratic DA (QDA)", - "text": "Quadratic DA (QDA)\nA quadratic boundary is obtained by relaxing the assumption of equal variance-covariance, and assume that \\(\\Sigma_k \\neq \\Sigma_l, ~~k\\neq l, k,l=1,...,K\\)\n\n \n\ntrue, LDA, QDA.\n(Chapter4/4.9.pdf)" + "section": "Overview", + "text": "Overview\nA random forest is an ensemble classifier, built from fitting multiple trees to different subsets of the training data." }, { - "objectID": "week4/slides.html#qda-olive-oils-example", - "href": "week4/slides.html#qda-olive-oils-example", + "objectID": "week5/slides.html#bagging-and-variable-sampling", + "href": "week5/slides.html#bagging-and-variable-sampling", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "QDA: Olive oils example", - "text": "QDA: Olive oils example\n\n\nEven if the population is NOT normally distributed, QDA might do reasonably. On this data, region 3 has a โ€œbanana-shapedโ€ variance-covariance, and region 2 has two separate clusters. The quadratic boundary though does well to carve the space into neat sections dividing the two regions." 
+ "section": "Bagging and variable sampling", + "text": "Bagging and variable sampling\n\n\n\nTake \\(B\\) different bootstrapped training sets: \\(D_1, D_2, \\dots, D_B\\), each using a sample of variables.\nBuild a separate prediction model using each \\(D_{(\\cdot)}\\): \\[\\widehat{f}_1(x), \\widehat{f}_2(x), \\dots, \\widehat{f}_B(x)\\]\nPredict the out-of-bag cases for each tree, compute proportion of trees a case was predicted to be each class.\nPredicted value for each observation is the class with the highest proportion.\n\n\n\n\nEach individual tree has high variance.\nAggregating the results from \\(B\\) trees reduces the variance." }, { - "objectID": "week4/slides.html#checking-the-assumptions-for-lda-and-qda-12", - "href": "week4/slides.html#checking-the-assumptions-for-lda-and-qda-12", + "objectID": "week5/slides.html#comparison-with-a-single-tree-and-lda", + "href": "week5/slides.html#comparison-with-a-single-tree-and-lda", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Checking the assumptions for LDA and QDA 1/2", - "text": "Checking the assumptions for LDA and QDA 1/2\nCheck the shape of the variability of each group could be considered to be elliptical, and the size is same for LDA but different to use QDA.\n\n\n\nGOOD\n\n\n\n\nBAD\n\n\n\n\nfrom Cook and Laa (2024)" + "section": "Comparison with a single tree and LDA", + "text": "Comparison with a single tree and LDA\n\n\n\nTree model\n\n\n\n\n\n\n\n\n\n\n\nData-driven, only split on single variables\n\n\n\nRandom forest\n\n\n\n\n\n\n\n\n\n\n\nData-driven, multiple trees gives non-linear fit\n\n\n\nLDA model\n\n\n\n\n\n\n\n\n\n\n\nAssume normal, equal VC, oblique splits" }, { - "objectID": "week4/slides.html#checking-the-assumptions-for-lda-and-qda-22", - "href": "week4/slides.html#checking-the-assumptions-for-lda-and-qda-22", + "objectID": "week5/slides.html#random-forest-fit-and-predicted-values", + "href": "week5/slides.html#random-forest-fit-and-predicted-values", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Checking the assumptions for LDA and QDA 2/2", - "text": "Checking the assumptions for LDA and QDA 2/2\nThis can also be done for \\(p>2\\).\n\n\n\nDATA\n\n\n\n\nPOINTS ON SURFACE OF ELLIPSES\n\n\n\n\nfrom Cook and Laa (2024)" + "section": "Random forest fit and predicted values", + "text": "Random forest fit and predicted values\n\n\nFit\n\nrf_spec <- rand_forest(mtry=2, trees=1000) |>\n set_mode(\"classification\") |>\n set_engine(\"randomForest\")\np_fit_rf <- rf_spec |> \n fit(species ~ ., data = p_tr)\n\n\n\nparsnip model object\n\n\nCall:\n randomForest(x = maybe_data_frame(x), y = y, ntree = ~1000, mtry = min_cols(~2, x)) \n Type of random forest: classification\n Number of trees: 1000\nNo. of variables tried at each split: 2\n\n OOB estimate of error rate: 4.8%\nConfusion matrix:\n Adelie Chinstrap class.error\nAdelie 96 4 0.040\nChinstrap 3 42 0.067\n\n\n\nPredicted values\n\n\n# A tibble: 1 ร— 3\n .metric .estimator .estimate\n <chr> <chr> <dbl>\n1 accuracy binary 0.973\n\n\n# A tibble: 2 ร— 4\n# Groups: species [2]\n species Adelie Chinstrap Accuracy\n <fct> <int> <int> <dbl>\n1 Adelie 51 0 1 \n2 Chinstrap 2 21 0.913\n\n\n# A tibble: 1 ร— 3\n .metric .estimator .estimate\n <chr> <chr> <dbl>\n1 bal_accuracy binary 0.957\n\n\n\n Warning: Donโ€™t use the predict() on the training set, youโ€™ll always get 0 error. The object p_fit_rf$fit$predict has the fitted values." 
}, { - "objectID": "week4/slides.html#plotting-the-model", - "href": "week4/slides.html#plotting-the-model", + "objectID": "week5/slides.html#diagnostics", + "href": "week5/slides.html#diagnostics", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Plotting the model", - "text": "Plotting the model\n\n\n\nData-in-the-model-space\n\n\n\n\nModel-in-the-data-space\n\n\n\n\nfrom Cook and Laa (2024)" + "section": "Diagnostics", + "text": "Diagnostics\n\nError is computed automatically on the out-of-bag cases.\nVote matrix, \\(n\\times K\\): Proportion of times a case is predicted to the class \\(k\\). Also consider these to be predictive probabilities.\nVariable importance: uses permutation!\nProximities, \\(n\\times n\\): Closeness of cases measured by how often they are in the same terminal node." }, { - "objectID": "week4/slides.html#next-trees-and-forests", - "href": "week4/slides.html#next-trees-and-forests", + "objectID": "week5/slides.html#vote-matrix", + "href": "week5/slides.html#vote-matrix", "title": "ETC3250/5250 Introduction to Machine Learning", - "section": "Next: Trees and forests", - "text": "Next: Trees and forests\n\n\n\nETC3250/5250 Lecture 4 | iml.numbat.space" + "section": "Vote Matrix", + "text": "Vote Matrix\n\n\n\nProportion of trees the case is predicted to be each class, ranges between 0-1\nCan be used to identify troublesome cases.\nUsed with plots of the actual data can help determine if it is the record itself that is the problem, or if method is biased.\nUnderstand the difference in accuracy of prediction for different classes.\n\n\n\np_fit_rf$fit$votes\n\n Adelie Chinstrap\n1 1.0000 0.0000\n2 1.0000 0.0000\n3 0.9807 0.0193\n4 1.0000 0.0000\n5 1.0000 0.0000\n6 1.0000 0.0000\n7 1.0000 0.0000\n8 0.3982 0.6018\n9 1.0000 0.0000\n10 1.0000 0.0000\n11 1.0000 0.0000\n12 0.8274 0.1726\n13 0.3425 0.6575\n14 1.0000 0.0000\n15 1.0000 0.0000\n16 0.7931 0.2069\n17 1.0000 0.0000\n18 1.0000 0.0000\n19 0.9973 0.0027\n20 1.0000 0.0000\n21 0.7622 0.2378\n22 1.0000 0.0000\n23 0.9459 0.0541\n24 1.0000 0.0000\n25 1.0000 0.0000\n26 0.8568 0.1432\n27 1.0000 0.0000\n28 1.0000 0.0000\n29 1.0000 0.0000\n30 1.0000 0.0000\n31 1.0000 0.0000\n32 1.0000 0.0000\n33 1.0000 0.0000\n34 1.0000 0.0000\n35 1.0000 0.0000\n36 1.0000 0.0000\n37 1.0000 0.0000\n38 1.0000 0.0000\n39 1.0000 0.0000\n40 1.0000 0.0000\n41 1.0000 0.0000\n42 1.0000 0.0000\n43 1.0000 0.0000\n44 1.0000 0.0000\n45 0.2773 0.7227\n46 1.0000 0.0000\n47 0.9821 0.0179\n48 1.0000 0.0000\n49 0.9973 0.0027\n50 1.0000 0.0000\n51 1.0000 0.0000\n52 1.0000 0.0000\n53 1.0000 0.0000\n54 1.0000 0.0000\n55 1.0000 0.0000\n56 1.0000 0.0000\n57 1.0000 0.0000\n58 1.0000 0.0000\n59 1.0000 0.0000\n60 1.0000 0.0000\n61 0.9833 0.0167\n62 1.0000 0.0000\n63 0.9113 0.0887\n64 1.0000 0.0000\n65 1.0000 0.0000\n66 1.0000 0.0000\n67 1.0000 0.0000\n68 1.0000 0.0000\n69 0.9912 0.0088\n70 1.0000 0.0000\n71 0.9535 0.0465\n72 0.9914 0.0086\n73 1.0000 0.0000\n74 0.9676 0.0324\n75 1.0000 0.0000\n76 1.0000 0.0000\n77 1.0000 0.0000\n78 1.0000 0.0000\n79 1.0000 0.0000\n80 1.0000 0.0000\n81 1.0000 0.0000\n82 0.9973 0.0027\n83 1.0000 0.0000\n84 1.0000 0.0000\n85 1.0000 0.0000\n86 0.4624 0.5376\n87 0.6160 0.3840\n88 1.0000 0.0000\n89 1.0000 0.0000\n90 1.0000 0.0000\n91 1.0000 0.0000\n92 0.9948 0.0052\n93 1.0000 0.0000\n94 0.9972 0.0028\n95 1.0000 0.0000\n96 1.0000 0.0000\n97 1.0000 0.0000\n98 1.0000 0.0000\n99 1.0000 0.0000\n100 1.0000 0.0000\n101 0.0000 1.0000\n102 0.0000 1.0000\n103 0.0055 0.9945\n104 0.0653 0.9347\n105 0.0000 1.0000\n106 0.0000 
1.0000\n107 0.0000 1.0000\n108 0.0000 1.0000\n109 0.0000 1.0000\n110 0.1935 0.8065\n111 0.0000 1.0000\n112 0.0000 1.0000\n113 0.0159 0.9841\n114 0.0000 1.0000\n115 0.0000 1.0000\n116 0.2074 0.7926\n117 0.0000 1.0000\n118 0.0000 1.0000\n119 0.0117 0.9883\n120 1.0000 0.0000\n121 0.0529 0.9471\n122 0.9536 0.0464\n123 0.0027 0.9973\n124 0.0000 1.0000\n125 0.0000 1.0000\n126 0.0163 0.9837\n127 0.0000 1.0000\n128 0.0000 1.0000\n129 0.0111 0.9889\n130 0.0000 1.0000\n131 0.0694 0.9306\n132 0.0000 1.0000\n133 0.0137 0.9863\n134 0.0000 1.0000\n135 0.0000 1.0000\n136 0.0052 0.9948\n137 0.0000 1.0000\n138 0.0000 1.0000\n139 0.0135 0.9865\n140 0.0000 1.0000\n141 0.0140 0.9860\n142 0.0000 1.0000\n143 0.6325 0.3675\n144 0.0000 1.0000\n145 0.0000 1.0000\nattr(,\"class\")\n[1] \"matrix\" \"array\" \"votes\"" }, { - "objectID": "week5/index.html", - "href": "week5/index.html", - "title": "Week 5: Trees and forests", - "section": "", - "text": "ISLR 8.1, 8.2" + "objectID": "week5/slides.html#curious", + "href": "week5/slides.html#curious", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Curious", + "text": "Curious\n\n\nWhere are the Adelie penguins in the training set that are misclassified?\n\n\nparsnip model object\n\n\nCall:\n randomForest(x = maybe_data_frame(x), y = y, ntree = ~1000, mtry = min_cols(~2, x)) \n Type of random forest: classification\n Number of trees: 1000\nNo. of variables tried at each split: 2\n\n OOB estimate of error rate: 4.8%\nConfusion matrix:\n Adelie Chinstrap class.error\nAdelie 96 4 0.040\nChinstrap 3 42 0.067\n\n\n\nJoin data containing true, predicted and predictive probabilities, to diagnose.\n\n\n\n\n\n# A tibble: 7 ร— 6\n species bl bm pspecies Adelie Chinstrap\n <fct> <dbl> <int> <fct> <dbl> <dbl>\n1 Adelie 41.1 3200 Chinstrap 0.398 0.602 \n2 Adelie 46 4200 Chinstrap 0.342 0.658 \n3 Adelie 45.8 4150 Chinstrap 0.277 0.723 \n4 Adelie 44.1 4000 Chinstrap 0.462 0.538 \n5 Chinstrap 40.9 3200 Adelie 1 0 \n6 Chinstrap 42.5 3350 Adelie 0.954 0.0464\n7 Chinstrap 43.5 3400 Adelie 0.632 0.368" }, { - "objectID": "week5/index.html#main-reference", - "href": "week5/index.html#main-reference", - "title": "Week 5: Trees and forests", - "section": "", - "text": "ISLR 8.1, 8.2" + "objectID": "week5/slides.html#variable-importance-12", + "href": "week5/slides.html#variable-importance-12", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Variable importance (1/2)", + "text": "Variable importance (1/2)\n\nFor every tree predict the oob cases and count the number of votes cast for the correct class.\n\n\n\nRandomly permute the values on a variable in the oob cases and predict the class for these cases.\n\n\n\n3.Difference the votes for the correct class in the variable-permuted oob cases and the real oob cases. Average this number over all trees in the forest. If the value is large, then the variable is very important.\n\n\n Alternatively, Gini importance adds up the difference in impurity value of the descendant nodes with the parent node. 
Quick to compute.\n\n\n Read a fun explanation by Harriet Mason" }, { - "objectID": "week5/index.html#what-you-will-learn-this-week", - "href": "week5/index.html#what-you-will-learn-this-week", - "title": "Week 5: Trees and forests", - "section": "What you will learn this week", - "text": "What you will learn this week\n\nClassification trees, algorithm, stopping rules\nDifference between algorithm and parametric methods, especially trees vs LDA\nForests: ensembles of bagged trees\nDiagnostics: vote matrix, variable importance, proximity\nBoosted trees" + "objectID": "week5/slides.html#variable-importance-22", + "href": "week5/slides.html#variable-importance-22", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Variable importance (2/2)", + "text": "Variable importance (2/2)\n\n\n\np_fit_rf$fit$importance\n\n MeanDecreaseGini\nbl 57.2\nbm 4.5\n\n\n\n\n\n\n\n\n\n\n\n\n\np_tr_perm <- p_tr |>\n mutate(bl = sample(bl))\nggplot(p_tr_perm, aes(x=bl, y=bm, colour=species)) +\n geom_point() +\n scale_color_discrete_divergingx(palette = \"Zissou 1\") +\n ggtitle(\"Permuted bl\") +\n theme(legend.position=\"none\")\n\n\n\n\n\n\n\n\n\nVotes will be close to 0.5 for both classes." }, { - "objectID": "week5/index.html#lecture-slides", - "href": "week5/index.html#lecture-slides", - "title": "Week 5: Trees and forests", - "section": "Lecture slides", - "text": "Lecture slides\n\nhtml\npdf\nqmd\nR" + "objectID": "week5/slides.html#proximities", + "href": "week5/slides.html#proximities", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Proximities", + "text": "Proximities\n\nMeasure how each pair of observations land in the forest\nRun both in- and out-of-bag cases down the tree, and increase proximity value of cases \\(i, j\\) by 1 each time they are in the same terminal node.\nNormalize by dividing by \\(B\\).\n\nThis creates a similarity matrix between all pairs of observations.\n\nUse this for cluster analysis of the data for further diagnosing unusual observations, and model inadequacies." }, { - "objectID": "week5/index.html#tutorial-instructions", - "href": "week5/index.html#tutorial-instructions", - "title": "Week 5: Trees and forests", - "section": "Tutorial instructions", - "text": "Tutorial instructions\nInstructions:\n\nhtml\nqmd" + "objectID": "week5/slides.html#utilising-diagnostics-13", + "href": "week5/slides.html#utilising-diagnostics-13", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Utilising diagnostics (1/3)", + "text": "Utilising diagnostics (1/3)\n\n\nThe votes matrix yields more information than the confusion matrix, about the confidence that the model has in the prediction for each observation, in the training set.\nIt is a \\(K\\)-D object, but lives in \\((K-1)\\)-D because the rows add to 1.\nLetโ€™s re-fit the random forest model to the three species of the penguins.\n\n\n\n\n\np_ternary" }, { - "objectID": "week5/index.html#assignments", - "href": "week5/index.html#assignments", - "title": "Week 5: Trees and forests", - "section": "Assignments", - "text": "Assignments\n\nAssignment 2 is due on Friday 12 April." 
+ "objectID": "week5/slides.html#utilising-diagnostics-23", + "href": "week5/slides.html#utilising-diagnostics-23", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Utilising diagnostics (2/3)", + "text": "Utilising diagnostics (2/3)\nDEMO: Use interactivity to investigate the uncertainty in the predictions.\n\nlibrary(detourr)\nlibrary(crosstalk)\nlibrary(plotly)\nlibrary(viridis)\np_tr2_std <- p_tr2 |>\n mutate_if(is.numeric, function(x) (x-mean(x))/sd(x))\np_tr2_v <- bind_cols(p_tr2_std, p_rf_v_p[,1:2]) \np_tr2_v_shared <- SharedData$new(p_tr2_v)\n\ndetour_plot <- detour(p_tr2_v_shared, tour_aes(\n projection = bl:bm,\n colour = species)) |>\n tour_path(grand_tour(2), \n max_bases=50, fps = 60) |>\n show_scatter(alpha = 0.9, axes = FALSE,\n width = \"100%\", \n height = \"450px\",\n palette = hcl.colors(3,\n palette=\"Zissou 1\"))\n\nvot_mat <- plot_ly(p_tr2_v_shared, \n x = ~x1,\n y = ~x2,\n color = ~species,\n colors = hcl.colors(3,\n palette=\"Zissou 1\"),\n height = 450) |>\n highlight(on = \"plotly_selected\", \n off = \"plotly_doubleclick\") %>%\n add_trace(type = \"scatter\", \n mode = \"markers\")\n \nbscols(\n detour_plot, vot_mat,\n widths = c(5, 6)\n )" }, { - "objectID": "week5/tutorialsol.html", - "href": "week5/tutorialsol.html", - "title": "ETC3250/5250 Tutorial 5", - "section": "", - "text": "Load the libraries and avoid conflicts\n# Load libraries used everywhere\nlibrary(tidyverse)\nlibrary(tidymodels)\nlibrary(patchwork)\nlibrary(mulgar)\nlibrary(palmerpenguins)\nlibrary(GGally)\nlibrary(tourr)\nlibrary(MASS)\nlibrary(discrim)\nlibrary(classifly)\nlibrary(detourr)\nlibrary(crosstalk)\nlibrary(plotly)\nlibrary(viridis)\nlibrary(colorspace)\nlibrary(conflicted)\nconflicts_prefer(dplyr::filter)\nconflicts_prefer(dplyr::select)\nconflicts_prefer(dplyr::slice)\nconflicts_prefer(palmerpenguins::penguins)\nconflicts_prefer(viridis::viridis_pal)\n\noptions(digits=2)\np_tidy <- penguins |>\n select(species, bill_length_mm:body_mass_g) |>\n rename(bl=bill_length_mm,\n bd=bill_depth_mm,\n fl=flipper_length_mm,\n bm=body_mass_g) |>\n filter(!is.na(bl)) |>\n arrange(species) |>\n na.omit()\np_tidy_std <- p_tidy |>\n mutate_if(is.numeric, function(x) (x-mean(x))/sd(x))" + "objectID": "week5/slides.html#utilising-diagnostics-33", + "href": "week5/slides.html#utilising-diagnostics-33", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Utilising diagnostics (3/3)", + "text": "Utilising diagnostics (3/3)\n\n\nVariable importance can help with variable selection.\n\n\np_fit_rf2$fit$importance\n\n MeanDecreaseGini\nbl 58\nbd 28\nfl 45\nbm 12\n\n\nTop two variables are bl and fl. \nEspecially useful when you have many more variables." }, { - "objectID": "week5/tutorialsol.html#objectives", - "href": "week5/tutorialsol.html#objectives", - "title": "ETC3250/5250 Tutorial 5", - "section": "๐ŸŽฏ Objectives", - "text": "๐ŸŽฏ Objectives\nThe goal for this week is learn to fit, diagnose, assess assumptions, and predict from logistic regression models, and linear discriminant analysis models." 
+ "objectID": "week5/slides.html#boosted-trees-13", + "href": "week5/slides.html#boosted-trees-13", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Boosted trees (1/3)", + "text": "Boosted trees (1/3)\nRandom forests build an ensemble of independent trees, while boosted trees build an ensemble from shallow trees in a sequence with each tree learning and improving on the previous one, by re-weighting observations to give mistakes more importance.\n\n\n\nSource: Boehmke (2020) Hands on Machine Learning with R" }, { - "objectID": "week5/tutorialsol.html#preparation", - "href": "week5/tutorialsol.html#preparation", - "title": "ETC3250/5250 Tutorial 5", - "section": "๐Ÿ”ง Preparation", - "text": "๐Ÿ”ง Preparation\n\nMake sure you have all the necessary libraries installed. There are a few new ones this week!" + "objectID": "week5/slides.html#boosted-trees-23", + "href": "week5/slides.html#boosted-trees-23", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Boosted trees (2/3)", + "text": "Boosted trees (2/3)\nBoosting iteratively fits multiple trees, sequentially putting more weight on observations that have predicted inaccurately.\n\nSet weights (probabilities) for all observations in training set ( according to class sample sizes using log odds ratio). Fit a tree with fixed \\(d\\) splits ( \\(d+1\\) terminal nodes).\nFor b=1, 2, โ€ฆ, B, repeat:\n\nCompute fitted values \nCompute pseudo-residuals \nFit the tree to the residuals \nCompute new weights (probabilities)\n\nAggregate predictions from all trees.\n\nThis StatQuest video by Josh Starmer, is the best explanation!\nAnd this is a fun explanation of boosting by Harriet Mason." }, { - "objectID": "week5/tutorialsol.html#exercises", - "href": "week5/tutorialsol.html#exercises", - "title": "ETC3250/5250 Tutorial 5", - "section": "Exercises:", - "text": "Exercises:\nOpen your project for this unit called iml.Rproj. For all the work we will use the penguins data. Start with splitting it into a training and test set, as follows.\n\nset.seed(1148)\np_split <- initial_split(p_tidy_std, 2/3, strata = species)\np_tr <- training(p_split)\np_ts <- testing(p_split)\n\n\n1. 
LDA\nThis problem uses linear discriminant analysis on the penguins data.\n\nIs the assumption of equal variance-covariance reasonable to make for this data?\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\nYou need to look at the data in a tour, using:\n\nanimate_xy(p_tidy_std[,2:5], col=p_tidy$species)\n\nUse the standardised data, because the measurements are in different sizes, and this is not relevant for this data.\n\n\n\n\n\nFit the LDA model to the training data, using this code\n\n\nlda_spec <- discrim_linear() |>\n set_mode(\"classification\") |>\n set_engine(\"MASS\", prior = c(1/3, 1/3, 1/3))\nlda_fit <- lda_spec |> \n fit(species ~ ., data = p_tr)\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\n\nparsnip model object\n\nCall:\nlda(species ~ ., data = data, prior = ~c(1/3, 1/3, 1/3))\n\nPrior probabilities of groups:\n Adelie Chinstrap Gentoo \n 0.33 0.33 0.33 \n\nGroup means:\n bl bd fl bm\nAdelie -0.94 0.65 -0.79 -0.59\nChinstrap 0.92 0.64 -0.37 -0.62\nGentoo 0.70 -1.08 1.19 1.16\n\nCoefficients of linear discriminants:\n LD1 LD2\nbl -0.34 -2.251\nbd 2.02 0.035\nfl -1.13 -0.170\nbm -1.18 1.376\n\nProportion of trace:\n LD1 LD2 \n0.82 0.18 \n\n\n\n\n\n\n\nCompute the confusion matrices for training and test sets, and thus the error for the test set.\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\n\n# A tibble: 3 ร— 5\n# Groups: species [3]\n species Adelie Chinstrap Gentoo cl_acc\n <fct> <int> <int> <int> <dbl>\n1 Adelie 100 0 0 1 \n2 Chinstrap 1 44 0 0.978\n3 Gentoo 0 0 82 1 \n\n\n# A tibble: 3 ร— 5\n# Groups: species [3]\n species Adelie Chinstrap Gentoo cl_acc\n <fct> <int> <int> <int> <dbl>\n1 Adelie 49 2 0 0.961\n2 Chinstrap 1 22 0 0.957\n3 Gentoo 0 0 41 1 \n\n\n[1] 0.97\n\n\n\n\n\n\n\nPlot the training and test data in the discriminant space, using symbols to indicate which set. See if you can mark the misclassified cases, too.\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nRe-do the plot of the discriminant space, to examine the boundary between groups. Youโ€™ll need to generate a set of random points in the domain of the data, predict their class, and projection into the discriminant space. The explore() in the classifly package can help you generate the box of random points.\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nWhat happens to the boundary, if you change the prior probabilities? And why does this happen? Change the prior probabilities to be 1.999/3, 0.001/3, 1/3 for Adelie, Chinstrap, Gentoo, respectively. Re-do the plot of the boundaries in the discriminant space.\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nIf the prior probabilities are unequal, it gives more importance to some classes. Here the importance of the Adelie penguins has been increased to the detriment of the Chinstrap. So the boundary moves away from the Adelie, which means more often a new penguin would be classified as an Adelie.\n\n\n\n\n\n\n2. Logistic\n\nFit a logistic discriminant model to the training set. You can use this code:\n\n\nlog_fit <- multinom_reg() |> \n fit(species ~ ., \n data = p_tr)\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\nlog_fit\n\nparsnip model object\n\nCall:\nnnet::multinom(formula = species ~ ., data = data, trace = FALSE)\n\nCoefficients:\n (Intercept) bl bd fl bm\nChinstrap 18.0 84 -42 4.7 -25\nGentoo 7.4 38 -69 33.7 25\n\nResidual Deviance: 0.00024 \nAIC: 20 \n\n\n\n\n\n\n\nCompute the confusion matrices for training and test sets, and thus the error for the test set. 
You can use this code to make the predictions.\n\n\np_tr_pred <- log_fit |> \n augment(new_data = p_tr) |>\n rename(pspecies = .pred_class)\np_ts_pred <- log_fit |> \n augment(new_data = p_ts) |>\n rename(pspecies = .pred_class)\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\np_tr_pred |> count(species, pspecies) |>\n group_by(species) |>\n mutate(cl_acc = n[pspecies==species]/sum(n)) |>\n pivot_wider(names_from = pspecies, \n values_from = n, values_fill=0) |>\n select(species, Adelie, Chinstrap, Gentoo, cl_acc)\n\n# A tibble: 3 ร— 5\n# Groups: species [3]\n species Adelie Chinstrap Gentoo cl_acc\n <fct> <int> <int> <int> <dbl>\n1 Adelie 100 0 0 1\n2 Chinstrap 0 45 0 1\n3 Gentoo 0 0 82 1\n\np_ts_pred |> count(species, pspecies) |>\n group_by(species) |>\n mutate(cl_acc = n[pspecies==species]/sum(n)) |>\n pivot_wider(names_from = pspecies, \n values_from = n, values_fill=0) |>\n select(species, Adelie, Chinstrap, Gentoo, cl_acc)\n\n# A tibble: 3 ร— 5\n# Groups: species [3]\n species Adelie Chinstrap Gentoo cl_acc\n <fct> <int> <int> <int> <dbl>\n1 Adelie 49 2 0 0.961\n2 Chinstrap 0 23 0 1 \n3 Gentoo 0 0 41 1 \n\naccuracy(p_ts_pred, species, pspecies)$.estimate\n\n[1] 0.98\n\n\n\n\n\n\n\nCheck the boundaries produced by logistic regression, and how they differ from those of LDA. Using the 2D projection produced by the LDA rule (using equal priors) predict the your set of random points using the logistic model.\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\np_log_bnd_ds <- log_fit |> \n augment(new_data = p_bnd) |>\n rename(pspecies = .pred_class)\n\nggplot() +\n geom_point(\n data=p_log_bnd_ds[p_log_bnd_ds$.TYPE == \"simulated\",], \n aes(x=LD1, y=LD2, \n colour=pspecies), shape=46, alpha=0.8) + \n scale_color_discrete_divergingx(\"Zissou 1\") +\n geom_point(data=p_log_bnd_ds[p_log_bnd_ds$.TYPE == \"actual\",],\n aes(x=LD1, y=LD2, \n colour=species), shape=16, alpha=0.8) \n\n\n\n\n\n\n\n\nOne thing that you can notice is that the boundaries are not โ€œcrispโ€, that there is overlap of the coloured points marking the classification regions. This means that the separation from the logistic regression model is not accomplished in the same 2D space as LDA.\n\n\n\n\n\n\n3. Interactively explore misclassifications\nHere you are going to use interactive graphics to explore the misclassifications from the linear discriminant analysis. Weโ€™ll need to use detourr to accomplish this. The code below makes a scatterplot of the confusion matrix, where points corresponding to a class have been spread apart by jittering. This plot is linked to a tour plot. Try:\n\nSelecting penguins that have been misclassified, from the display of the confusion matrix. Observe where they are in the data space. 
Are they in an area where it is hard to distinguish the groups?\nSelecting neighbouring points in the tour, and examine where they are in the confusion matrix.\n\n\np_cl <- p_tidy_std |>\n mutate(pspecies = predict(lda_fit$fit, p_tidy_std)$class) |>\n dplyr::select(bl:bm, species, pspecies) |>\n mutate(sp_jit = jitter(as.numeric(species)),\n psp_jit = jitter(as.numeric(pspecies)))\np_cl_shared <- SharedData$new(p_cl)\n\ndetour_plot <- detour(p_cl_shared, tour_aes(\n projection = bl:bm,\n colour = species)) |>\n tour_path(grand_tour(2), \n max_bases=50, fps = 60) |>\n show_scatter(alpha = 0.9, axes = FALSE,\n width = \"100%\", height = \"450px\")\n\nconf_mat <- plot_ly(p_cl_shared, \n x = ~psp_jit,\n y = ~sp_jit,\n color = ~species,\n colors = viridis_pal(option = \"D\")(3),\n height = 450) |>\n highlight(on = \"plotly_selected\", \n off = \"plotly_doubleclick\") %>%\n add_trace(type = \"scatter\", \n mode = \"markers\")\n \nbscols(\n detour_plot, conf_mat,\n widths = c(5, 6)\n ) \n\n\n\n4. Exploring the math\nSlide 23 of the lecture notes has the steps to go from Bayes rule to the discriminant functions. Explain what was done at each step to get to the next one." + "objectID": "week5/slides.html#boosted-trees-33", + "href": "week5/slides.html#boosted-trees-33", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Boosted trees (3/3)", + "text": "Boosted trees (3/3)\n\nset.seed(1110)\nbt_spec <- boost_tree() |>\n set_mode(\"classification\") |>\n set_engine(\"xgboost\")\np_fit_bt <- bt_spec |> \n fit(species ~ ., data = p_tr2)\n\n\n\n# A tibble: 1 ร— 3\n .metric .estimator .estimate\n <chr> <chr> <dbl>\n1 accuracy multiclass 0.991\n\n\n# A tibble: 3 ร— 4\n# Groups: species [3]\n species Adelie Chinstrap Accuracy\n <fct> <int> <int> <dbl>\n1 Adelie 50 1 0.980\n2 Chinstrap 0 23 1 \n3 Gentoo 0 0 1" }, { - "objectID": "week5/tutorialsol.html#finishing-up", - "href": "week5/tutorialsol.html#finishing-up", - "title": "ETC3250/5250 Tutorial 5", - "section": "๐Ÿ‘‹ Finishing up", - "text": "๐Ÿ‘‹ Finishing up\nMake sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult." + "objectID": "week5/slides.html#limitations-of-trees", + "href": "week5/slides.html#limitations-of-trees", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Limitations of trees", + "text": "Limitations of trees\n\nMost implementations only splits on a single variable, not combinations.\nThere are versions that build trees on combinations, eg PPTreeViz and PPforest, but you lose interpretability, and fitting is more difficult.\nSees only splits, but not gaps. (See support vector machines, in a few weeks.)\nAlgorithm takes variables in order, and splits in order, and will use first as best.\nNeed tuning and cross-validation." 
}, { - "objectID": "week5/tutorial.html", - "href": "week5/tutorial.html", - "title": "ETC3250/5250 Tutorial 5", - "section": "", - "text": "Load the libraries and avoid conflicts\n# Load libraries used everywhere\nlibrary(tidyverse)\nlibrary(tidymodels)\nlibrary(patchwork)\nlibrary(mulgar)\nlibrary(palmerpenguins)\nlibrary(GGally)\nlibrary(tourr)\nlibrary(MASS)\nlibrary(discrim)\nlibrary(classifly)\nlibrary(detourr)\nlibrary(crosstalk)\nlibrary(plotly)\nlibrary(viridis)\nlibrary(colorspace)\nlibrary(conflicted)\nconflicts_prefer(dplyr::filter)\nconflicts_prefer(dplyr::select)\nconflicts_prefer(dplyr::slice)\nconflicts_prefer(palmerpenguins::penguins)\nconflicts_prefer(viridis::viridis_pal)\n\noptions(digits=2)\np_tidy <- penguins |>\n select(species, bill_length_mm:body_mass_g) |>\n rename(bl=bill_length_mm,\n bd=bill_depth_mm,\n fl=flipper_length_mm,\n bm=body_mass_g) |>\n filter(!is.na(bl)) |>\n arrange(species) |>\n na.omit()\np_tidy_std <- p_tidy |>\n mutate_if(is.numeric, function(x) (x-mean(x))/sd(x))" + "objectID": "week5/slides.html#next-neural-networks-and-deep-learning", + "href": "week5/slides.html#next-neural-networks-and-deep-learning", + "title": "ETC3250/5250 Introduction to Machine Learning", + "section": "Next: Neural networks and deep learning", + "text": "Next: Neural networks and deep learning\n\n\n\nETC3250/5250 Lecture 5 | iml.numbat.space" }, { - "objectID": "week5/tutorial.html#objectives", - "href": "week5/tutorial.html#objectives", - "title": "ETC3250/5250 Tutorial 5", - "section": "๐ŸŽฏ Objectives", - "text": "๐ŸŽฏ Objectives\nThe goal for this week is learn to fit, diagnose, assess assumptions, and predict from logistic regression models, and linear discriminant analysis models." + "objectID": "week6/index.html", + "href": "week6/index.html", + "title": "Week 6: Neural networks and deep learning", + "section": "", + "text": "ISLR 10.1-10.3, 10.7" }, { - "objectID": "week5/tutorial.html#preparation", - "href": "week5/tutorial.html#preparation", - "title": "ETC3250/5250 Tutorial 5", - "section": "๐Ÿ”ง Preparation", - "text": "๐Ÿ”ง Preparation\n\nMake sure you have all the necessary libraries installed. There are a few new ones this week!" + "objectID": "week6/index.html#main-reference", + "href": "week6/index.html#main-reference", + "title": "Week 6: Neural networks and deep learning", + "section": "", + "text": "ISLR 10.1-10.3, 10.7" }, { - "objectID": "week5/tutorial.html#exercises", - "href": "week5/tutorial.html#exercises", - "title": "ETC3250/5250 Tutorial 5", - "section": "Exercises:", - "text": "Exercises:\nOpen your project for this unit called iml.Rproj. For all the work we will use the penguins data. Start with splitting it into a training and test set, as follows.\n\nset.seed(1148)\np_split <- initial_split(p_tidy_std, 2/3, strata = species)\np_tr <- training(p_split)\np_ts <- testing(p_split)\n\n\n1. LDA\nThis problem uses linear discriminant analysis on the penguins data.\n\nIs the assumption of equal variance-covariance reasonable to make for this data?\n\n\nFit the LDA model to the training data, using this code\n\n\nlda_spec <- discrim_linear() |>\n set_mode(\"classification\") |>\n set_engine(\"MASS\", prior = c(1/3, 1/3, 1/3))\nlda_fit <- lda_spec |> \n fit(species ~ ., data = p_tr)\n\n\nCompute the confusion matrices for training and test sets, and thus the error for the test set.\n\n\nPlot the training and test data in the discriminant space, using symbols to indicate which set. 
See if you can mark the misclassified cases, too.\n\n\nRe-do the plot of the discriminant space, to examine the boundary between groups. Youโ€™ll need to generate a set of random points in the domain of the data, predict their class, and projection into the discriminant space. The explore() in the classifly package can help you generate the box of random points.\n\n\nWhat happens to the boundary, if you change the prior probabilities? And why does this happen? Change the prior probabilities to be 1.999/3, 0.001/3, 1/3 for Adelie, Chinstrap, Gentoo, respectively. Re-do the plot of the boundaries in the discriminant space.\n\n\n\n2. Logistic\n\nFit a logistic discriminant model to the training set. You can use this code:\n\n\nlog_fit <- multinom_reg() |> \n fit(species ~ ., \n data = p_tr)\n\n\nCompute the confusion matrices for training and test sets, and thus the error for the test set. You can use this code to make the predictions.\n\n\np_tr_pred <- log_fit |> \n augment(new_data = p_tr) |>\n rename(pspecies = .pred_class)\np_ts_pred <- log_fit |> \n augment(new_data = p_ts) |>\n rename(pspecies = .pred_class)\n\n\nCheck the boundaries produced by logistic regression, and how they differ from those of LDA. Using the 2D projection produced by the LDA rule (using equal priors) predict the your set of random points using the logistic model.\n\n\n\n3. Interactively explore misclassifications\nHere you are going to use interactive graphics to explore the misclassifications from the linear discriminant analysis. Weโ€™ll need to use detourr to accomplish this. The code below makes a scatterplot of the confusion matrix, where points corresponding to a class have been spread apart by jittering. This plot is linked to a tour plot. Try:\n\nSelecting penguins that have been misclassified, from the display of the confusion matrix. Observe where they are in the data space. Are they in an area where it is hard to distinguish the groups?\nSelecting neighbouring points in the tour, and examine where they are in the confusion matrix.\n\n\np_cl <- p_tidy_std |>\n mutate(pspecies = predict(lda_fit$fit, p_tidy_std)$class) |>\n dplyr::select(bl:bm, species, pspecies) |>\n mutate(sp_jit = jitter(as.numeric(species)),\n psp_jit = jitter(as.numeric(pspecies)))\np_cl_shared <- SharedData$new(p_cl)\n\ndetour_plot <- detour(p_cl_shared, tour_aes(\n projection = bl:bm,\n colour = species)) |>\n tour_path(grand_tour(2), \n max_bases=50, fps = 60) |>\n show_scatter(alpha = 0.9, axes = FALSE,\n width = \"100%\", height = \"450px\")\n\nconf_mat <- plot_ly(p_cl_shared, \n x = ~psp_jit,\n y = ~sp_jit,\n color = ~species,\n colors = viridis_pal(option = \"D\")(3),\n height = 450) |>\n highlight(on = \"plotly_selected\", \n off = \"plotly_doubleclick\") %>%\n add_trace(type = \"scatter\", \n mode = \"markers\")\n \nbscols(\n detour_plot, conf_mat,\n widths = c(5, 6)\n ) \n\n\n\n4. Exploring the math\nSlide 23 of the lecture notes has the steps to go from Bayes rule to the discriminant functions. Explain what was done at each step to get to the next one." 
+ "objectID": "week6/index.html#what-you-will-learn-this-week", + "href": "week6/index.html#what-you-will-learn-this-week", + "title": "Week 6: Neural networks and deep learning", + "section": "What you will learn this week", + "text": "What you will learn this week\n\nStructure of a neural network\nFitting neural networks\nDiagnosing the fit" }, { - "objectID": "week5/tutorial.html#finishing-up", - "href": "week5/tutorial.html#finishing-up", - "title": "ETC3250/5250 Tutorial 5", - "section": "๐Ÿ‘‹ Finishing up", - "text": "๐Ÿ‘‹ Finishing up\nMake sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult." + "objectID": "week6/index.html#assignments", + "href": "week6/index.html#assignments", + "title": "Week 6: Neural networks and deep learning", + "section": "Assignments", + "text": "Assignments\n\nAssignment 2 is due on Friday 12 April.\nAssignment 3 is due on Friday 26 April." }, { "objectID": "week7/index.html", diff --git a/docs/sitemap.xml b/docs/sitemap.xml index 7ac8aed3..0c301db2 100644 --- a/docs/sitemap.xml +++ b/docs/sitemap.xml @@ -9,108 +9,116 @@ 2024-02-05T21:28:13.682Z - https://iml.numbat.space/week6/index.html - 2024-02-05T21:26:36.255Z + https://iml.numbat.space/week6/tutorial.html + 2024-03-29T01:27:47.522Z - https://iml.numbat.space/week5/slides.html - 2024-03-25T00:16:08.487Z + https://iml.numbat.space/week6/tutorialsol.html + 2024-03-29T01:27:47.522Z - https://iml.numbat.space/week4/tutorial.html - 2024-03-18T01:23:19.973Z + https://iml.numbat.space/week5/tutorial.html + 2024-03-22T04:38:32.315Z - https://iml.numbat.space/week4/tutorialsol.html - 2024-03-18T01:23:19.973Z + https://iml.numbat.space/week5/tutorialsol.html + 2024-03-22T04:38:32.315Z - https://iml.numbat.space/week4/index.html - 2024-03-15T04:09:22.102Z + https://iml.numbat.space/week5/index.html + 2024-03-25T00:29:39.940Z - https://iml.numbat.space/week3/slides.html - 2024-03-10T05:23:01.954Z + https://iml.numbat.space/week4/slides.html + 2024-03-21T01:23:36.055Z - https://iml.numbat.space/week2/tutorial.html - 2024-03-08T07:27:48.745Z + https://iml.numbat.space/week3/tutorial.html + 2024-03-18T01:22:38.289Z - https://iml.numbat.space/week2/tutorialsol.html - 2024-03-08T07:27:48.745Z + https://iml.numbat.space/week3/tutorialsol.html + 2024-03-18T01:22:38.289Z - https://iml.numbat.space/week2/index.html - 2024-03-03T07:54:29.938Z + https://iml.numbat.space/week3/index.html + 2024-03-10T04:30:28.678Z - https://iml.numbat.space/week11/index.html - 2024-02-05T21:31:46.270Z + https://iml.numbat.space/week2/slides.html + 2024-03-19T04:48:10.743Z - https://iml.numbat.space/week1/tutorial.html - 2024-02-21T23:44:11.489Z + https://iml.numbat.space/week12/index.html + 2024-02-05T21:32:06.286Z - https://iml.numbat.space/week1/tutorialsol.html - 2024-02-21T23:44:11.489Z + https://iml.numbat.space/week10/index.html + 2024-02-05T21:30:53.696Z - https://iml.numbat.space/week1/index.html - 2024-02-18T03:40:04.710Z + https://iml.numbat.space/week1/slides.html + 2024-03-16T20:01:47.880Z + + + https://iml.numbat.space/resources.html + 2024-02-18T01:55:28.455Z https://iml.numbat.space/index.html 2024-03-25T00:28:25.606Z - https://iml.numbat.space/resources.html - 2024-02-18T01:55:28.455Z + https://iml.numbat.space/week1/index.html + 2024-02-18T03:40:04.710Z - https://iml.numbat.space/week1/slides.html - 2024-03-16T20:01:47.880Z + https://iml.numbat.space/week1/tutorialsol.html + 2024-02-21T23:44:11.489Z - https://iml.numbat.space/week10/index.html - 
2024-02-05T21:30:53.696Z + https://iml.numbat.space/week1/tutorial.html + 2024-02-21T23:44:11.489Z - https://iml.numbat.space/week12/index.html - 2024-02-05T21:32:06.286Z + https://iml.numbat.space/week11/index.html + 2024-02-05T21:31:46.270Z - https://iml.numbat.space/week2/slides.html - 2024-03-19T04:48:10.743Z + https://iml.numbat.space/week2/index.html + 2024-03-03T07:54:29.938Z - https://iml.numbat.space/week3/index.html - 2024-03-10T04:30:28.678Z + https://iml.numbat.space/week2/tutorialsol.html + 2024-03-08T07:27:48.745Z - https://iml.numbat.space/week3/tutorialsol.html - 2024-03-18T01:22:38.289Z + https://iml.numbat.space/week2/tutorial.html + 2024-03-08T07:27:48.745Z - https://iml.numbat.space/week3/tutorial.html - 2024-03-18T01:22:38.289Z + https://iml.numbat.space/week3/slides.html + 2024-03-10T05:23:01.954Z - https://iml.numbat.space/week4/slides.html - 2024-03-21T01:23:36.055Z + https://iml.numbat.space/week4/index.html + 2024-03-15T04:09:22.102Z - https://iml.numbat.space/week5/index.html - 2024-03-25T00:29:39.940Z + https://iml.numbat.space/week4/tutorialsol.html + 2024-03-18T01:23:19.973Z - https://iml.numbat.space/week5/tutorialsol.html - 2024-03-22T04:38:32.315Z + https://iml.numbat.space/week4/tutorial.html + 2024-03-18T01:23:19.973Z - https://iml.numbat.space/week5/tutorial.html - 2024-03-22T04:38:32.315Z + https://iml.numbat.space/week5/slides.html + 2024-03-27T04:02:40.366Z + + + https://iml.numbat.space/week6/index.html + 2024-02-05T21:26:36.255Z https://iml.numbat.space/week7/index.html diff --git a/docs/week2/images/slides.rmarkdown/data-in-model-space1-1.png b/docs/week2/images/slides.rmarkdown/data-in-model-space1-1.png index 450dedde..10b9196a 100644 Binary files a/docs/week2/images/slides.rmarkdown/data-in-model-space1-1.png and b/docs/week2/images/slides.rmarkdown/data-in-model-space1-1.png differ diff --git a/docs/week2/images/slides.rmarkdown/model-in-the-data-space1-1.png b/docs/week2/images/slides.rmarkdown/model-in-the-data-space1-1.png index 11c810e5..a6b0fca0 100644 Binary files a/docs/week2/images/slides.rmarkdown/model-in-the-data-space1-1.png and b/docs/week2/images/slides.rmarkdown/model-in-the-data-space1-1.png differ diff --git a/docs/week2/images/slides.rmarkdown/model-in-the-data-space2-1.png b/docs/week2/images/slides.rmarkdown/model-in-the-data-space2-1.png index 9925ad6e..9cde5591 100644 Binary files a/docs/week2/images/slides.rmarkdown/model-in-the-data-space2-1.png and b/docs/week2/images/slides.rmarkdown/model-in-the-data-space2-1.png differ diff --git a/docs/week2/slides.html b/docs/week2/slides.html index 7306cc1e..089d6eab 100644 --- a/docs/week2/slides.html +++ b/docs/week2/slides.html @@ -719,7 +719,7 @@

Adding interactivity to static plots: scatterplot matrix

ggplotly(g, width=600, height=600)
- +
@@ -751,7 +751,7 @@

Adding interactivity to static plots: parallel coordinates

p_pcp
- +
@@ -1583,7 +1583,7 @@

UMAP (2/2)

- +
diff --git a/docs/week3/tutorialsol.html b/docs/week3/tutorialsol.html index cda8c8a3..3845a844 100644 --- a/docs/week3/tutorialsol.html +++ b/docs/week3/tutorialsol.html @@ -7116,7 +7116,7 @@

5. PCA o
- +

It isnโ€™t necessary to standardise the variables before using the prcomp function because we can set scale=TRUE to have it done as part of the PCA computation. However, it is useful to standardise the variables to make the time series plot where all the currencies are drawn. This is useful for interpreting the principal components.

@@ -7366,7 +7366,7 @@

5. PCA o
- +

The pattern in PC1 vs PC2 follows time. Prior to the pandemic there is a tangle of values on the left. Towards the end of February, when the world was starting to realise that COVID was a major health threat, there is a dramatic reaction from the world currencies, at least in relation to the USD. Currencies such as EUR, JPY, CHF reacted first, gaining strength relative to USD, and then they lost that strength. Most other currencies reacted later, losing value relative to the USD.

diff --git a/docs/week4/tutorialsol.html b/docs/week4/tutorialsol.html index 007d5a0e..138c6a0a 100644 --- a/docs/week4/tutorialsol.html +++ b/docs/week4/tutorialsol.html @@ -5108,7 +5108,7 @@

-

+

diff --git a/docs/week5/slides.html b/docs/week5/slides.html index b5cf991a..1748836b 100644 --- a/docs/week5/slides.html +++ b/docs/week5/slides.html @@ -1655,7 +1655,7 @@

Calculate the impurity for split 5Right bucket:

\[\widehat{p}_{RA} = 0/2, \widehat{p}_{RB} = 2/2, ~~ p_R = 2/7\]

\[G_R=0(1-0)+1(1-1) = 0\] Combine with weighted sum to get impurity for the split:

-

\[5/7G_L + 2/7G_R=0.32\]

+

\[5/7G_L + 2/7G_R=0.23\]



Your turn: Compute the impurity for split 2.

@@ -2589,7 +2589,7 @@

Next: Neural networks and deep learning

diff --git a/docs/week5/slides.qmd b/docs/week5/slides.qmd index bb9ffd7e..96b5ac78 100644 --- a/docs/week5/slides.qmd +++ b/docs/week5/slides.qmd @@ -8,7 +8,7 @@ author: - name: "Professor Di Cook" email: "etc3250.clayton-x@monash.edu" institute: "Department of Econometrics and Business Statistics" -footer: "ETC3250/5250 Lecture 4 | [iml.numbat.space](iml.numbat.space)" +footer: "ETC3250/5250 Lecture 5 | [iml.numbat.space](iml.numbat.space)" format: revealjs: multiplex: false @@ -195,7 +195,7 @@ $$\widehat{p}_{RA} = 0/2, \widehat{p}_{RB} = 2/2, ~~ p_R = 2/7$$ $$G_R=0(1-0)+1(1-1) = 0$$ Combine with weighted sum to get [impurity for the split]{.monash-orange2}: -$$5/7G_L + 2/7G_R=0.32$$ +$$5/7G_L + 2/7G_R=0.23$$

[**Your turn**: Compute the impurity for split 2.]{.monash-blue2} diff --git a/docs/week6/tutorial.html b/docs/week6/tutorial.html new file mode 100644 index 00000000..e471c2b7 --- /dev/null +++ b/docs/week6/tutorial.html @@ -0,0 +1,5660 @@ + + + + + + + + + + + +ETC3250/5250 Introduction to Machine Learning - ETC3250/5250 Tutorial 6 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+
+ +
+ +
+ + + + +
+ +
+
+

ETC3250/5250 Tutorial 6

+

Trees and forests

+
+ + + +
+ +
+
Author
+
+

Prof. Di Cook

+
+
+ +
+
Published
+
+

8 April 2024

+
+
+ + +
+ + + +
+ + +
+
+Load the libraries and avoid conflicts +
# Load libraries used everywhere
+library(tidyverse)
+library(tidymodels)
+library(patchwork)
+library(mulgar)
+library(palmerpenguins)
+library(GGally)
+library(tourr)
+library(MASS)
+library(discrim)
+library(classifly)
+library(detourr)
+library(crosstalk)
+library(plotly)
+library(viridis)
+library(colorspace)
+library(randomForest)
+library(geozoo)
+library(ggbeeswarm)
+library(conflicted)
+conflicts_prefer(dplyr::filter)
+conflicts_prefer(dplyr::select)
+conflicts_prefer(dplyr::slice)
+conflicts_prefer(palmerpenguins::penguins)
+conflicts_prefer(viridis::viridis_pal)
+
+options(digits=2)
+p_tidy <- penguins |>
+  select(species, bill_length_mm:body_mass_g) |>
+  rename(bl=bill_length_mm,
+         bd=bill_depth_mm,
+         fl=flipper_length_mm,
+         bm=body_mass_g) |>
+  filter(!is.na(bl)) |>
+  arrange(species) |>
+  na.omit()
+p_tidy_std <- p_tidy |>
+    mutate_if(is.numeric, function(x) (x-mean(x))/sd(x))
+
+
+
+

🎯 Objectives

+

The goal for this week is to learn to fit, diagnose, assess assumptions, and predict from classification tree and random forest models.

+
+
+

🔧 Preparation

+
    +
  • Make sure you have all the necessary libraries installed. There are a few new ones this week! A possible install command is sketched below.
  • +
+
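A possible install command follows. It is only a guess at what is new, based on the library() calls above and the model engines used later in the exercises:

# Assumption: these are the packages most likely to be missing this week
install.packages(c("randomForest", "ranger", "xgboost",
                   "geozoo", "ggbeeswarm"))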
+
+

Exercises:

+

Open your project for this unit called iml.Rproj. For all the work we will use the penguins data. Start with splitting it into a training and test set, as follows.

+
+
set.seed(1156)
+p_sub <- p_tidy_std |>
+  filter(species != "Gentoo") |>
+  mutate(species = factor(species)) |>
+  select(species, bl, bm)
+p_split <- initial_split(p_sub, 2/3, strata = species)
+p_tr <- training(p_split)
+p_ts <- testing(p_split)
+
+
+

1. Becoming a car mechanic - looking under the hood at the tree algorithm

+
    +
  1. Write down the equation for the Gini measure of impurity for two groups, in terms of the parameter \(p\), the proportion of observations in class 1. Specify the domain of the function, determine the value of \(p\) that gives the maximum value, and report what that maximum value is.
  2. +
+
    +
  1. For two groups, how would the impurity of a split be measured? Give the equation.
  2. +
+
    +
  1. Below is an R function to compute the Gini impurity for a particular split on a single variable. Work through the code of the function, and document what each step does. Make sure to include a note on what the minsplit parameter does: it prevents splits that would leave fewer than the specified number of observations on either side. A small usage sketch follows the function code below.
  2. +
+
+
# This works for two classes, and one variable
+mygini <- function(p) {
+  g <- 0
+  if (p>0 && p<1) {
+    g <- 2*p*(1-p)
+  }
+
+  return(g)
+}
+
+mysplit <- function(x, spl, cl, minsplit=5) {
+  # Assumes x is sorted
+  # Count number of observations
+  n <- length(x)
+  
+  # Check number of classes
+  cl_unique <- unique(cl)
+  
+  # Split into two subsets on the given value
+  left <- x[x<spl]
+  cl_left <- cl[x<spl]
+  n_l <- length(left)
+
+  right <- x[x>=spl]
+  cl_right <- cl[x>=spl]
+  n_r <- length(right)
+  
+  # Don't calculate if either subset has fewer than minsplit observations
+  if ((n_l < minsplit) | (n_r < minsplit)) 
+    impurity = NA
+  else {
+    # Compute the Gini value for the split
+    p_l <- length(cl_left[cl_left == cl_unique[1]])/n_l
+    p_r <- length(cl_right[cl_right == cl_unique[1]])/n_r
+    if (is.na(p_l)) p_l<-0.5
+    if (is.na(p_r)) p_r<-0.5
+    impurity <- (n_l/n)*mygini(p_l) + (n_r/n)*mygini(p_r)
+  }
+  return(impurity)
+}
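This small usage sketch might help with documenting the functions. The vector names x_toy and cl_toy are made up for illustration, and the expected values in the comments are worked out from the function definitions above.

# Toy check of the helpers (mysplit() assumes x is sorted)
x_toy  <- c(1, 2, 3, 10, 11, 12)
cl_toy <- c("A", "A", "A", "B", "B", "B")
mygini(0.5)                                      # 0.5, the most impure a single node can be
mysplit(x_toy, spl=6.5, cl=cl_toy, minsplit=1)   # clean split, impurity 0
mysplit(x_toy, spl=2.5, cl=cl_toy, minsplit=1)   # poorer split, impurity 0.25
mysplit(x_toy, spl=2.5, cl=cl_toy, minsplit=5)   # NA: left side has fewer than 5 observations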
+
+
    +
  1. Apply the function to compute the value for all possible splits for the body mass (bm), setting minsplit to be 1 so that all possible splits will be evaluated. Make a plot of these values against the variable; a starter sketch follows this question.
  2. +
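This starter sketch might help; it is only one possible approach, and the names p_bm, bm_splits and bm_imp are illustrative. Candidate split values are midpoints between adjacent sorted values of bm, and mysplit() is applied to each of them (remember it assumes the data are sorted on the variable).

# Sketch: evaluate every candidate split on bm (assumes p_tr from above)
p_bm <- p_tr |> arrange(bm)
bm_vals <- unique(p_bm$bm)
bm_splits <- (bm_vals[-length(bm_vals)] + bm_vals[-1]) / 2
bm_imp <- sapply(bm_splits,
                 function(s) mysplit(p_bm$bm, s, p_bm$species, minsplit = 1))
ggplot(tibble(split = bm_splits, impurity = bm_imp),
       aes(x = split, y = impurity)) +
  geom_line()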
+
    +
  1. Use your function to compute the first two steps of a classification tree model for separating Adelie from Chinstrap penguins, after setting minsplit to be 5. Make a scatterplot of the two variables that would be used in the splits, with points coloured by species, and the splits as line segments. A sketch for cross-checking your splits against rpart follows this question.
  2. +
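If you want to cross-check the splits you compute by hand, one option (not required by the question, and an assumption in that the rpart package is not in the library list above) is to fit a small tree with rpart and compare the split values. rpart's minbucket argument is the closest match to the minsplit parameter of mysplit().

# Sketch: compare hand-computed splits with an rpart tree (assumes rpart is installed)
library(rpart)
p_rp <- rpart(species ~ bl + bm, data = p_tr,
              control = rpart.control(minbucket = 5, maxdepth = 2))
p_rp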
+
+
+
+

Digging deeper into diagnosing an error

+
    +
  1. Fit the random forest model to the full penguins data. One possible setup, using object names that match the code chunk further below, is sketched after this question.
  2. +
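The question does not supply code, so this is only one possible setup. The object names (p_split2, p_tr2, p_fit_rf) are chosen to match the linked-brushing chunk further below, and the model specification follows the pattern used in the lecture slides.

# Sketch: fit a random forest to all three species
set.seed(923)
p_split2 <- initial_split(p_tidy_std, 2/3, strata = species)
p_tr2 <- training(p_split2)
p_ts2 <- testing(p_split2)

rf_spec <- rand_forest(mtry = 2, trees = 1000) |>
  set_mode("classification") |>
  set_engine("randomForest")
p_fit_rf <- rf_spec |>
  fit(species ~ ., data = p_tr2)
p_fit_rf   # printing the fit shows the OOB error and confusion matrix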
+
    +
  1. Report the confusion matrix.
  2. +
+
    +
  1. Use linked brushing to learn which Gentoo penguin the model was confused about. When we looked at the data in a tour, there was one Gentoo penguin that was an outlier, appearing to be away from the other Gentoos and closer to the Chinstrap group. We would expect this to be the penguin that the forest model is confused about. Is it?
  2. +
+

+

Have a look at the other misclassifications to understand whether they are ones we'd expect to misclassify, or whether the model is not well constructed.

+
+
p_cl <- p_tr2 |>
+  mutate(pspecies = p_fit_rf$fit$predicted) |>
+  dplyr::select(bl:bm, species, pspecies) |>
+  mutate(sp_jit = jitter(as.numeric(species)),
+         psp_jit = jitter(as.numeric(pspecies)))
+p_cl_shared <- SharedData$new(p_cl)
+
+detour_plot <- detour(p_cl_shared, tour_aes(
+  projection = bl:bm,
+  colour = species)) |>
+  tour_path(grand_tour(2),
+            max_bases=50, fps = 60) |>
+  show_scatter(alpha = 0.9, axes = FALSE,
+               width = "100%", height = "450px")
+
+conf_mat <- plot_ly(p_cl_shared,
+                    x = ~psp_jit,
+                    y = ~sp_jit,
+                    color = ~species,
+                    colors = viridis_pal(option = "D")(3),
+                    height = 450) |>
+  highlight(on = "plotly_selected",
+            off = "plotly_doubleclick") |>
+  add_trace(type = "scatter",
+            mode = "markers")
+
+bscols(
+  detour_plot, conf_mat,
+  widths = c(5, 6)
+)
+
+
+
+

Deciding on variables in a large data problem

+
    +
  1. Fit a random forest to the bushfire data. You can read more about the bushfire data at https://dicook.github.io/mulgar_book/A2-data.html. Examine the votes matrix using a tour. What do you learn about the confusion between fire causes?
  2. +
+

This code might help:

+
+
data(bushfires)
+
+bushfires_sub <- bushfires[,c(5, 8:45, 48:55, 57:60)] |>
+  mutate(cause = factor(cause))
+
+set.seed(1239)
+bf_split <- initial_split(bushfires_sub, 3/4, strata=cause)
+bf_tr <- training(bf_split)
+bf_ts <- testing(bf_split)
+
+rf_spec <- rand_forest(mtry=5, trees=1000) |>
+  set_mode("classification") |>
+  set_engine("ranger", probability = TRUE, 
+             importance="permutation")
+bf_fit_rf <- rf_spec |> 
+  fit(cause~., data = bf_tr)
+
+# Create votes matrix data
+bf_rf_votes <- bf_fit_rf$fit$predictions |>
+  as_tibble() |>
+  mutate(cause = bf_tr$cause)
+
+# Project 4D into 3D
+proj <- t(geozoo::f_helmert(4)[-1,])
+bf_rf_v_p <- as.matrix(bf_rf_votes[,1:4]) %*% proj
+colnames(bf_rf_v_p) <- c("x1", "x2", "x3")
+bf_rf_v_p <- bf_rf_v_p |>
+  as.data.frame() |>
+  mutate(cause = bf_tr$cause)
+  
+# Add simplex
+simp <- simplex(p=3)
+sp <- data.frame(simp$points)
+colnames(sp) <- c("x1", "x2", "x3")
+sp$cause = ""
+bf_rf_v_p_s <- bind_rows(sp, bf_rf_v_p) |>
+  mutate(cause = factor(cause))
+labels <- c("accident" , "arson", 
+                "burning_off", "lightning", 
+                rep("", nrow(bf_rf_v_p)))
+
+
+
# Examine votes matrix with bounding simplex
+animate_xy(bf_rf_v_p_s[,1:3], col = bf_rf_v_p_s$cause, 
+           axes = "off", half_range = 1.3,
+           edges = as.matrix(simp$edges),
+           obs_labels = labels)
+
+
    +
  1. Check the variable importance. Plot the most important variables.
  2. +
+

This code might help:

+
+
bf_fit_rf$fit$variable.importance |> 
+  as_tibble() |> 
+  rename(imp=value) |>
+  mutate(var = colnames(bf_tr)[1:50]) |>
+  select(var, imp) |>
+  arrange(desc(imp)) |> 
+  print(n=50)
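The helper above prints the importance values; to plot the most important ones, one possible sketch is below. It uses enframe() so the variable names stay attached to the ranger importance vector; the name bf_imp and the choice of the top 10 are illustrative.

# Sketch: plot the ten most important variables from the ranger fit
bf_imp <- bf_fit_rf$fit$variable.importance |>
  tibble::enframe(name = "var", value = "imp") |>
  arrange(desc(imp)) |>
  slice_head(n = 10)
ggplot(bf_imp, aes(x = fct_reorder(var, imp), y = imp)) +
  geom_col() +
  coord_flip() +
  labs(x = "", y = "Permutation importance")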
+
+
+
+

Can boosting better detect bushfire cause?

+

Fit a boosted tree model using xgboost to the bushfires data. You can use the code below. Compute the confusion tables and the balanced accuracy for the test data for both the forest model and the boosted tree model, to make the comparison.

+
+
set.seed(121)
+bf_spec2 <- boost_tree() |>
+  set_mode("classification") |>
+  set_engine("xgboost")
+bf_fit_bt <- bf_spec2 |> 
+  fit(cause~., data = bf_tr)
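The comparison itself is not coded above, so this is only a sketch. It assumes bf_fit_rf and bf_ts from the earlier chunk are still available, and uses augment() and bal_accuracy() from tidymodels/yardstick; the names bf_ts_rf and bf_ts_bt are illustrative.

# Sketch: test-set confusion tables and balanced accuracy for both models
bf_ts_rf <- bf_fit_rf |> augment(new_data = bf_ts)
bf_ts_bt <- bf_fit_bt |> augment(new_data = bf_ts)

bf_ts_rf |> count(cause, .pred_class) |>
  pivot_wider(names_from = .pred_class, values_from = n, values_fill = 0)
bf_ts_bt |> count(cause, .pred_class) |>
  pivot_wider(names_from = .pred_class, values_from = n, values_fill = 0)

bal_accuracy(bf_ts_rf, cause, .pred_class)
bal_accuracy(bf_ts_bt, cause, .pred_class)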
+
+
+
+

👋 Finishing up

+

Make sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult.

+ + +
+ +
+ +
+ + + + + \ No newline at end of file diff --git a/docs/week6/tutorial.qmd b/docs/week6/tutorial.qmd new file mode 100644 index 00000000..5c87d4ad --- /dev/null +++ b/docs/week6/tutorial.qmd @@ -0,0 +1,522 @@ +--- +title: "ETC3250/5250 Tutorial 6" +subtitle: "Trees and forests" +author: "Prof. Di Cook" +date: "2024-04-08" +quarto-required: ">=1.3.0" +format: + unilur-html: + output-file: tutorial.html + embed-resources: true + css: "../assets/tutorial.css" + unilur-html+solution: + output-file: tutorialsol.html + embed-resources: true + css: "../assets/tutorial.css" +unilur-solution: true +--- + +```{r echo=FALSE} +# Set up chunk for all slides +knitr::opts_chunk$set( + fig.width = 4, + fig.height = 4, + fig.align = "center", + out.width = "60%", + code.line.numbers = FALSE, + fig.retina = 3, + echo = TRUE, + message = FALSE, + warning = FALSE, + cache = FALSE, + dev.args = list(pointsize = 11) +) +``` + +```{r} +#| echo: true +#| code-fold: true +#| code-summary: "Load the libraries and avoid conflicts" +# Load libraries used everywhere +library(tidyverse) +library(tidymodels) +library(patchwork) +library(mulgar) +library(palmerpenguins) +library(GGally) +library(tourr) +library(MASS) +library(discrim) +library(classifly) +library(detourr) +library(crosstalk) +library(plotly) +library(viridis) +library(colorspace) +library(randomForest) +library(geozoo) +library(ggbeeswarm) +library(conflicted) +conflicts_prefer(dplyr::filter) +conflicts_prefer(dplyr::select) +conflicts_prefer(dplyr::slice) +conflicts_prefer(palmerpenguins::penguins) +conflicts_prefer(viridis::viridis_pal) + +options(digits=2) +p_tidy <- penguins |> + select(species, bill_length_mm:body_mass_g) |> + rename(bl=bill_length_mm, + bd=bill_depth_mm, + fl=flipper_length_mm, + bm=body_mass_g) |> + filter(!is.na(bl)) |> + arrange(species) |> + na.omit() +p_tidy_std <- p_tidy |> + mutate_if(is.numeric, function(x) (x-mean(x))/sd(x)) +``` + +```{r} +#| echo: false +# Set plot theme +theme_set(theme_bw(base_size = 14) + + theme( + aspect.ratio = 1, + plot.background = element_rect(fill = 'transparent', colour = NA), + plot.title.position = "plot", + plot.title = element_text(size = 24), + panel.background = element_rect(fill = 'transparent', colour = NA), + legend.background = element_rect(fill = 'transparent', colour = NA), + legend.key = element_rect(fill = 'transparent', colour = NA) + ) +) +``` + +## `r emo::ji("target")` Objectives + +The goal for this week is learn to fit, diagnose, assess assumptions, and predict from classification tree and random forest models. + +## `r emo::ji("wrench")` Preparation + +- Make sure you have all the necessary libraries installed. There are a few new ones this week! + +## Exercises: + +Open your project for this unit called `iml.Rproj`. For all the work we will use the penguins data. Start with splitting it into a training and test set, as follows. + +```{r} +set.seed(1156) +p_sub <- p_tidy_std |> + filter(species != "Gentoo") |> + mutate(species = factor(species)) |> + select(species, bl, bm) +p_split <- initial_split(p_sub, 2/3, strata = species) +p_tr <- training(p_split) +p_ts <- testing(p_split) +``` + +#### 1. Becoming a car mechanic - looking under the hood at the tree algoriithm + +a. Write down the equation for the Gini measure of impurity, for two groups, and the parameter $p$ which is the proportion of observations in class 1. Specify the domain of the function, and determine the value of $p$ which gives the maximum value, and report what that maximum function value is. 
+ +::: unilur-solution +$G = p(1-p)$ where $p$ is the proportion of class 1 in the subset of data. The domain is $[0, 1]$ and the maximum value of $0.25$ is at $p=0.5$. +::: + +b. For two groups, how would the impurity of a **split** be measured? Give the equation. + +::: unilur-solution + +$$p_L(p_{L1}(1-p_{L1})) + p_R(p_{R1}(1-p_{R1}))$$ +where $p_L$ is the proportion of observations to the left of the split, $p_{L1}$ is the proportion of observations of class 1 to the left of the split, and $p_{R1}$ indicates the equivalent quantities for observations to the right of the split. + +::: + +c. Below is an R function to compute the Gini impurity for a particular split on a single variable. Work through the code of the function, and document what each step does. Make sure to include a not on what the `minsplit` parameter, does to prevent splitting on the edges fewer than the specified number of observations. + + +```{r echo=TRUE} +# This works for two classes, and one variable +mygini <- function(p) { + g <- 0 + if (p>0 && p<1) { + g <- 2*p*(1-p) + } + + return(g) +} + +mysplit <- function(x, spl, cl, minsplit=5) { + # Assumes x is sorted + # Count number of observations + n <- length(x) + + # Check number of classes + cl_unique <- unique(cl) + + # Split into two subsets on the given value + left <- x[x=spl] + cl_right <- cl[x>=spl] + n_r <- length(right) + + # Don't calculate is either set is less than minsplit + if ((n_l < minsplit) | (n_r < minsplit)) + impurity = NA + else { + # Compute the Gini value for the split + p_l <- length(cl_left[cl_left == cl_unique[1]])/n_l + p_r <- length(cl_right[cl_right == cl_unique[1]])/n_r + if (is.na(p_l)) p_l<-0.5 + if (is.na(p_r)) p_r<-0.5 + impurity <- (n_l/n)*mygini(p_l) + (n_r/n)*mygini(p_r) + } + return(impurity) +} +``` + + +d. Apply the function to compute the value for all possible splits for the body mass (`bm`), setting `minsplit` to be 1, so that all possible splits will be evaluated. Make a plot of these values vs the variable. + +::: unilur-solution + +```{r} +x <- p_tr |> + select(species, bm) |> + arrange(bm) +unique_splits <- unique(x$bm) +nsplits <- length(unique_splits)-1 +splits <- (unique_splits[1:nsplits] + unique_splits[2:(nsplits+1)])/2 +imp <- NULL; +for (i in 1:length(splits)) { + s <- splits[i] + a <- mysplit(x$bm, s, x$species, minsplit=1) + imp <- c(imp, a) +} +d_impurity <- tibble(splits, imp) +d_impurity_bm <- d_impurity[which.min(d_impurity$imp),] +ggplot() + geom_line(data=d_impurity, aes(x=splits, y=imp)) + + geom_rug(data=x, aes(x=bm, colour=species), alpha=0.3) + + ylab("Gini impurity") + + xlab("bm") + + scale_color_brewer("", palette="Dark2") +``` + +::: + +e. Use your function to compute the first two steps of a classification tree model for separating Adelie from Chinstrap penguins, after setting `minsplit` to be 5. Make a scatterplot of the two variables that would be used in the splits, with points coloured by species, and the splits as line segments. 
+ +::: unilur-solution + +```{r results='hide'} +# bl: this is the only one needed for the first split +# because it is so better separated than any others +x <- p_tr |> + select(species, bl) |> + arrange(bl) +unique_splits <- unique(x$bl) +nsplits <- length(unique_splits)-1 +splits <- (unique_splits[1:nsplits] + unique_splits[2:(nsplits+1)])/2 +imp <- NULL; +for (i in 1:length(splits)) { + s <- splits[i] + a <- mysplit(x$bl, s, x$species, minsplit=1) + imp <- c(imp, a) +} +d_impurity <- tibble(splits, imp) +d_impurity_bl <- d_impurity[which.min(d_impurity$imp),] + +ggplot() + + geom_line(data=d_impurity, aes(x=splits, y=imp)) + + geom_rug(data=x, aes(x=bl, colour=species), alpha=0.3) + + ylab("Gini impurity") + + xlab("bl") + + scale_color_brewer("", palette="Dark2") + +p_tr_L <- p_tr |> + filter(bl < d_impurity_bl$splits) + +p_tr_R <- p_tr |> + filter(bl > d_impurity_bl$splits) + +# Make a function to make calculations easier +best_split <- function(x, cl, minsplit=5) { + unique_splits <- unique(x) + nsplits <- length(unique_splits)-1 + splits <- (unique_splits[1:nsplits] + unique_splits[2:(nsplits+1)])/2 + imp <- NULL; + for (i in 1:length(splits)) { + s <- splits[i] + a <- mysplit(x, s, cl, minsplit) + imp <- c(imp, a) + } + d_impurity <- tibble(splits, imp) + d_impurity_best <- d_impurity[which.min(d_impurity$imp),] + return(d_impurity_best) +} + +s1 <- best_split(p_tr$bl, p_tr$species, minsplit=5) +s2 <- best_split(p_tr_R$bm, p_tr_R$species, minsplit=5) + +ggplot(p_tr, aes(x=bl, y=bm, colour=species)) + + geom_point() + + geom_vline(xintercept=s1$splits) + + annotate("segment", x = s1$splits, + xend = max(p_tr$bl), + y = s2$splits, + yend = s2$splits) + + scale_colour_brewer("", palette="Dark2") + + theme(aspect.ratio = 1) +``` + + +::: + +## Digging deeper into diagnosing an error + +a. Fit the random forest model to the full penguins data. + +::: unilur-solution +```{r} +set.seed(923) +p_split2 <- initial_split(p_tidy_std, 2/3, + strata=species) +p_tr2 <- training(p_split2) +p_ts2 <- testing(p_split2) + +rf_spec <- rand_forest(mtry=2, trees=1000) |> + set_mode("classification") |> + set_engine("randomForest") +p_fit_rf <- rf_spec |> + fit(species ~ ., data = p_tr2) +``` +::: + +b. Report the confusion matrix. + +::: unilur-solution +```{r} +p_fit_rf +``` +::: + +c. Use linked brushing to learn which was the Gentoo penguin that the model was confused about. When we looked at the data in a tour, there was one Gentoo penguin that was an outlier, appearing to be away from the other Gentoos and closer to the Chinstrap group. We would expect this to be the penguin that the forest model is confused about. Is it? + +![](../images/p_forest_detourr.png) + +Have a look at the other misclassifications, to understand whether they are ones we'd expect to misclassify, or whether the model is not well constructed. 
+ +```{r eval=FALSE} +p_cl <- p_tr2 |> + mutate(pspecies = p_fit_rf$fit$predicted) |> + dplyr::select(bl:bm, species, pspecies) |> + mutate(sp_jit = jitter(as.numeric(species)), + psp_jit = jitter(as.numeric(pspecies))) +p_cl_shared <- SharedData$new(p_cl) + +detour_plot <- detour(p_cl_shared, tour_aes( + projection = bl:bm, + colour = species)) |> + tour_path(grand_tour(2), + max_bases=50, fps = 60) |> + show_scatter(alpha = 0.9, axes = FALSE, + width = "100%", height = "450px") + +conf_mat <- plot_ly(p_cl_shared, + x = ~psp_jit, + y = ~sp_jit, + color = ~species, + colors = viridis_pal(option = "D")(3), + height = 450) |> + highlight(on = "plotly_selected", + off = "plotly_doubleclick") |> + add_trace(type = "scatter", + mode = "markers") + +bscols( + detour_plot, conf_mat, + widths = c(5, 6) +) +``` + +## Deciding on variables in a large data problem + +a. Fit a random forest to the bushfire data. You can read more about the bushfire data at https://dicook.github.io/mulgar_book/A2-data.html. Examine the votes matrix using a tour. What do you learn about the confusion between fire causes? + +This code might help: + +```{r} +data(bushfires) + +bushfires_sub <- bushfires[,c(5, 8:45, 48:55, 57:60)] |> + mutate(cause = factor(cause)) + +set.seed(1239) +bf_split <- initial_split(bushfires_sub, 3/4, strata=cause) +bf_tr <- training(bf_split) +bf_ts <- testing(bf_split) + +rf_spec <- rand_forest(mtry=5, trees=1000) |> + set_mode("classification") |> + set_engine("ranger", probability = TRUE, + importance="permutation") +bf_fit_rf <- rf_spec |> + fit(cause~., data = bf_tr) + +# Create votes matrix data +bf_rf_votes <- bf_fit_rf$fit$predictions |> + as_tibble() |> + mutate(cause = bf_tr$cause) + +# Project 4D into 3D +proj <- t(geozoo::f_helmert(4)[-1,]) +bf_rf_v_p <- as.matrix(bf_rf_votes[,1:4]) %*% proj +colnames(bf_rf_v_p) <- c("x1", "x2", "x3") +bf_rf_v_p <- bf_rf_v_p |> + as.data.frame() |> + mutate(cause = bf_tr$cause) + +# Add simplex +simp <- simplex(p=3) +sp <- data.frame(simp$points) +colnames(sp) <- c("x1", "x2", "x3") +sp$cause = "" +bf_rf_v_p_s <- bind_rows(sp, bf_rf_v_p) |> + mutate(cause = factor(cause)) +labels <- c("accident" , "arson", + "burning_off", "lightning", + rep("", nrow(bf_rf_v_p))) +``` + +```{r eval=FALSE} +# Examine votes matrix with bounding simplex +animate_xy(bf_rf_v_p_s[,1:3], col = bf_rf_v_p_s$cause, + axes = "off", half_range = 1.3, + edges = as.matrix(simp$edges), + obs_labels = labels) +``` + +::: unilur-solution +The pattern is that points are bunched at the vertex corresponding to lightning, extending along the edge leading to accident. We could also say that the points do extend on the face corresponding to lightning, accident and arson, too. The primary confusion for each of the other classes is with lightning. Few points are predicted to be `burning_off` because this is typically only occurring outside of fire season. + +Part of the reason that the forest predicts predominantly to the lightning class is because it is a highly imbalanced problem. One approach is to change the weights for each class, to give the lightning class a lower priority. This will change the model predictions to be more often the other three classes. +::: + +b. Check the variable importance. Plot the most important variables. 
+ +This code might help: + +```{r eval=FALSE} +bf_fit_rf$fit$variable.importance |> + as_tibble() |> + rename(imp=value) |> + mutate(var = colnames(bf_tr)[1:50]) |> + select(var, imp) |> + arrange(desc(imp)) |> + print(n=50) +``` + +::: unilur-solution + +```{r} +#| fig-width: 10 +#| fig-height: 5 +#| out-width: 100% +p1 <- ggplot(bf_tr, aes(x=cause, y=log_dist_road)) + + geom_quasirandom(alpha=0.5) + + stat_summary(aes(group = cause), + fun = median, + fun.min = median, + fun.max = median, + geom = "crossbar", + color = "orange", + width = 0.7, + lwd = 0.5) + + xlab("") + + coord_flip() +p2 <- ggplot(bf_tr, aes(x=cause, y=arf360)) + + geom_quasirandom(alpha=0.5) + + stat_summary(aes(group = cause), + fun = median, + fun.min = median, + fun.max = median, + geom = "crossbar", + color = "orange", + width = 0.7, + lwd = 0.5) + + xlab("") + + coord_flip() +p3 <- ggplot(bf_tr, aes(x=cause, y=log_dist_cfa)) + + geom_quasirandom(alpha=0.5) + + stat_summary(aes(group = cause), + fun = median, + fun.min = median, + fun.max = median, + geom = "crossbar", + color = "orange", + width = 0.7, + lwd = 0.5) + + xlab("") + + coord_flip() +p1 + p2 + p3 + plot_layout(ncol=3) +``` + +Each of these variables has some difference in median value between the classes, but none shows any separation between them. If the three most important variables show little separation, it indicates the difficulty in distinguishing between these classes. However, it looks like if the distance from a road, or CFA station is bigger, the chance of the cause being a lightning start is higher. This makes sense, because these would be locations further from human activity, and thus the fire is less likely to started by people. The arf360 relates to rain from a year ago. It also appears that if the rainfall was higher a year ago, lightning is more likely the cause. This also makes some sense, because with more rain in the previous year, there should be more vegetation. Particularly, if recent months have been dry, then there is likely a lot of dry vegetation which is combustible. Ideally we would create a new variable (feature engineering) that looks at difference in rainfall from the previous year to just before the current year's fire season, to model these types of conditions. +::: + +## Can boosting better detect bushfire case? + +Fit a boosted tree model using `xgboost` to the bushfires data. You can use the code below. Compute the confusion tables and the balanced accuracy for the test data for both the forest model and the boosted tree model, to make the comparison. 
+ +```{r} +set.seed(121) +bf_spec2 <- boost_tree() |> + set_mode("classification") |> + set_engine("xgboost") +bf_fit_bt <- bf_spec2 |> + fit(cause~., data = bf_tr) +``` + +::: unilur-solution + +The results for the random forest are: + +```{r} +bf_ts_rf_pred <- bf_ts |> + mutate(pcause = predict(bf_fit_rf, bf_ts)$.pred_class) +bal_accuracy(bf_ts_rf_pred, cause, pcause) +bf_ts_rf_pred |> + count(cause, pcause) |> + group_by(cause) |> + mutate(Accuracy = n[cause==pcause]/sum(n)) |> + pivot_wider(names_from = "pcause", + values_from = n, values_fill = 0) |> + select(cause, accident, arson, burning_off, lightning, Accuracy) +``` + +and for the boosted tree are: + +```{r} +bf_ts_bt_pred <- bf_ts |> + mutate(pcause = predict(bf_fit_bt, + bf_ts)$.pred_class) +bal_accuracy(bf_ts_bt_pred, cause, pcause) +bf_ts_bt_pred |> + count(cause, pcause) |> + group_by(cause) |> + mutate(Accuracy = n[cause==pcause]/sum(n)) |> + pivot_wider(names_from = "pcause", + values_from = n, values_fill = 0) |> + select(cause, accident, arson, burning_off, lightning, Accuracy) +``` + +The boosted tree does improve the balanced accuracy. +::: + +## `r emo::ji("wave")` Finishing up + +Make sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult. diff --git a/docs/week6/tutorialsol.html b/docs/week6/tutorialsol.html new file mode 100644 index 00000000..217cefa7 --- /dev/null +++ b/docs/week6/tutorialsol.html @@ -0,0 +1,6068 @@ + + + + + + + + + + + +ETC3250/5250 Introduction to Machine Learning - ETC3250/5250 Tutorial 6 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+
+ +
+ +
+ + + + +
+ +
+
+

ETC3250/5250 Tutorial 6

+

Trees and forests

+
+ + + +
+ +
+
Author
+
+

Prof. Di Cook

+
+
+ +
+
Published
+
+

8 April 2024

+
+
+ + +
+ + + +
+ + +
+
+Load the libraries and avoid conflicts +
# Load libraries used everywhere
+library(tidyverse)
+library(tidymodels)
+library(patchwork)
+library(mulgar)
+library(palmerpenguins)
+library(GGally)
+library(tourr)
+library(MASS)
+library(discrim)
+library(classifly)
+library(detourr)
+library(crosstalk)
+library(plotly)
+library(viridis)
+library(colorspace)
+library(randomForest)
+library(geozoo)
+library(ggbeeswarm)
+library(conflicted)
+conflicts_prefer(dplyr::filter)
+conflicts_prefer(dplyr::select)
+conflicts_prefer(dplyr::slice)
+conflicts_prefer(palmerpenguins::penguins)
+conflicts_prefer(viridis::viridis_pal)
+
+options(digits=2)
+p_tidy <- penguins |>
+  select(species, bill_length_mm:body_mass_g) |>
+  rename(bl=bill_length_mm,
+         bd=bill_depth_mm,
+         fl=flipper_length_mm,
+         bm=body_mass_g) |>
+  filter(!is.na(bl)) |>
+  arrange(species) |>
+  na.omit()
+p_tidy_std <- p_tidy |>
+    mutate_if(is.numeric, function(x) (x-mean(x))/sd(x))
+
+
+
+

🎯 Objectives

+

The goal for this week is to learn to fit, diagnose, assess assumptions, and predict from classification tree and random forest models.

+
+
+

🔧 Preparation

+
    +
  • Make sure you have all the necessary libraries installed. There are a few new ones this week! One way to install the likely additions is sketched below.
  • +
+
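A minimal install sketch (an assumption about which packages might be new for you, including the ranger and xgboost engines used later; skip any that are already installed):

```r
# Sketch only: likely-new packages for this week's tutorial
install.packages(c("randomForest", "ranger", "xgboost", "geozoo",
                   "ggbeeswarm", "detourr", "crosstalk", "plotly"))
```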
+
+

Exercises:

+

Open your project for this unit called iml.Rproj. For all the work we will use the penguins data. Start with splitting it into a training and test set, as follows.

+
+
set.seed(1156)
+p_sub <- p_tidy_std |>
+  filter(species != "Gentoo") |>
+  mutate(species = factor(species)) |>
+  select(species, bl, bm)
+p_split <- initial_split(p_sub, 2/3, strata = species)
+p_tr <- training(p_split)
+p_ts <- testing(p_split)
+
+
+

1. Becoming a car mechanic - looking under the hood at the tree algorithm

+
    +
  1. Write down the equation for the Gini measure of impurity, for two groups, and the parameter \(p\) which is the proportion of observations in class 1. Specify the domain of the function, and determine the value of \(p\) which gives the maximum value, and report what that maximum function value is.
  2. +
+
+ +
+
+
+

\(G = p(1-p)\) where \(p\) is the proportion of class 1 in the subset of data. The domain is \([0, 1]\) and the maximum value of \(0.25\) is at \(p=0.5\).

+
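A quick numerical check of this answer (a sketch using base R only; note that the `mygini()` function used later in the tutorial computes the scaled form 2p(1-p), which peaks at 0.5, but the maximising value p = 0.5 is the same either way):

```r
# Evaluate the impurity curve on a fine grid and locate its maximum
p <- seq(0, 1, by = 0.001)
G <- p * (1 - p)   # use 2 * p * (1 - p) for the scaling in mygini()
p[which.max(G)]    # 0.5
max(G)             # 0.25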
+
+
+
+
    +
  1. For two groups, how would the impurity of a split be measured? Give the equation.
  2. +
+
+ +
+
+
+

\[p_L(p_{L1}(1-p_{L1})) + p_R(p_{R1}(1-p_{R1}))\] where \(p_L\) is the proportion of observations to the left of the split, \(p_{L1}\) is the proportion of observations of class 1 to the left of the split, and \(p_{R1}\) indicates the equivalent quantities for observations to the right of the split.

+
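As a worked example of this formula (illustrative numbers, not from the tutorial data): if 5 of 7 observations fall to the left of a split, with \(p_{L1} = 4/5\), and both observations on the right are from class 2 so that \(p_{R1} = 0\), the impurity of the split is \((5/7)(4/5)(1/5) + (2/7)(0) \approx 0.11\).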
+
+
+
+
    +
  1. Below is an R function to compute the Gini impurity for a particular split on a single variable. Work through the code of the function, and document what each step does. Make sure to include a note on what the minsplit parameter does to prevent splitting when either side of the split has fewer than the specified number of observations. (A small usage sketch follows the code chunk below.)
  2. +
+
+
# This works for two classes, and one variable
+mygini <- function(p) {
+  g <- 0
+  if (p>0 && p<1) {
+    g <- 2*p*(1-p)
+  }
+
+  return(g)
+}
+
+mysplit <- function(x, spl, cl, minsplit=5) {
+  # Assumes x is sorted
+  # Count number of observations
+  n <- length(x)
+  
+  # Check number of classes
+  cl_unique <- unique(cl)
+  
+  # Split into two subsets on the given value
+  left <- x[x<spl]
+  cl_left <- cl[x<spl]
+  n_l <- length(left)
+
+  right <- x[x>=spl]
+  cl_right <- cl[x>=spl]
+  n_r <- length(right)
+  
+  # Don't calculate if either subset has fewer observations than minsplit
+  if ((n_l < minsplit) | (n_r < minsplit)) 
+    impurity = NA
+  else {
+    # Compute the Gini value for the split
+    p_l <- length(cl_left[cl_left == cl_unique[1]])/n_l
+    p_r <- length(cl_right[cl_right == cl_unique[1]])/n_r
+    if (is.na(p_l)) p_l<-0.5
+    if (is.na(p_r)) p_r<-0.5
+    impurity <- (n_l/n)*mygini(p_l) + (n_r/n)*mygini(p_r)
+  }
+  return(impurity)
+}
+
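As a check that your documentation matches the behaviour, here is a minimal sketch of calling `mysplit()` on a toy sorted vector (values made up for illustration), which also shows what `minsplit` does:

```r
# Toy example: 6 sorted values, two pure classes, split proposed at 3.5
x_toy  <- c(1, 2, 3, 4, 5, 6)
cl_toy <- c("A", "A", "A", "B", "B", "B")
mysplit(x_toy, 3.5, cl_toy, minsplit = 1)  # 0: both sides are pure
mysplit(x_toy, 3.5, cl_toy, minsplit = 5)  # NA: each side has fewer than 5 obs
```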
+
    +
  1. Apply the function to compute the value for all possible splits for the body mass (bm), setting minsplit to be 1, so that all possible splits will be evaluated. Make a plot of these values vs the variable.
  2. +
+
+ +
+
+
+
+
x <- p_tr |> 
+  select(species, bm) |>
+  arrange(bm)
+unique_splits <- unique(x$bm)
+nsplits <- length(unique_splits)-1
+splits <- (unique_splits[1:nsplits] + unique_splits[2:(nsplits+1)])/2
+imp <- NULL;
+for (i in 1:length(splits)) {
+  s <- splits[i]
+  a <- mysplit(x$bm, s, x$species, minsplit=1)
+  imp <- c(imp, a)
+}
+d_impurity <- tibble(splits, imp)
+d_impurity_bm <- d_impurity[which.min(d_impurity$imp),]
+ggplot() + geom_line(data=d_impurity, aes(x=splits, y=imp)) +
+  geom_rug(data=x, aes(x=bm, colour=species), alpha=0.3) + 
+  ylab("Gini impurity") +
+  xlab("bm") +
+  scale_color_brewer("", palette="Dark2")
+
+
+
+

+
+
+
+
+
+
+
+
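It can also help to print the single best split found, since it is re-used in the next part (a small follow-up, not in the original solution):

```r
# The bm value with the lowest impurity, from the search above
d_impurity_bm
```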
+
    +
  1. Use your function to compute the first two steps of a classification tree model for separating Adelie from Chinstrap penguins, after setting minsplit to be 5. Make a scatterplot of the two variables that would be used in the splits, with points coloured by species, and the splits as line segments.
  2. +
+
+ +
+
+
+
+
# bl: this is the only one needed for the first split
+# because it is so much better separated than any of the others
+x <- p_tr |> 
+  select(species, bl) |>
+  arrange(bl)
+unique_splits <- unique(x$bl)
+nsplits <- length(unique_splits)-1
+splits <- (unique_splits[1:nsplits] + unique_splits[2:(nsplits+1)])/2
+imp <- NULL;
+for (i in 1:length(splits)) {
+  s <- splits[i]
+  a <- mysplit(x$bl, s, x$species, minsplit=1)
+  imp <- c(imp, a)
+}
+d_impurity <- tibble(splits, imp)
+d_impurity_bl <- d_impurity[which.min(d_impurity$imp),]
+
+ggplot() + 
+  geom_line(data=d_impurity, aes(x=splits, y=imp)) +
+  geom_rug(data=x, aes(x=bl, colour=species), alpha=0.3) + 
+  ylab("Gini impurity") +
+  xlab("bl") +
+  scale_color_brewer("", palette="Dark2")
+
+
+
+

+
+
+
+
p_tr_L <- p_tr |>
+  filter(bl < d_impurity_bl$splits)
+
+p_tr_R <- p_tr |>
+  filter(bl > d_impurity_bl$splits)
+
+# Make a function to make calculations easier
+best_split <- function(x, cl, minsplit=5) {
+  unique_splits <- unique(x)
+  nsplits <- length(unique_splits)-1
+  splits <- (unique_splits[1:nsplits] + unique_splits[2:(nsplits+1)])/2
+  imp <- NULL;
+  for (i in 1:length(splits)) {
+    s <- splits[i]
+    a <- mysplit(x, s, cl, minsplit)
+    imp <- c(imp, a)
+  }
+  d_impurity <- tibble(splits, imp)
+  d_impurity_best <- d_impurity[which.min(d_impurity$imp),]
+  return(d_impurity_best)
+}
+
+s1 <- best_split(p_tr$bl, p_tr$species, minsplit=5)
+s2 <- best_split(p_tr_R$bm, p_tr_R$species, minsplit=5)
+
+ggplot(p_tr, aes(x=bl, y=bm, colour=species)) +
+  geom_point() +
+  geom_vline(xintercept=s1$splits) +
+  annotate("segment", x = s1$splits,
+                xend = max(p_tr$bl),
+                y = s2$splits, 
+                yend = s2$splits) +
+  scale_colour_brewer("", palette="Dark2") +
+  theme(aspect.ratio = 1)
+
+
+
+

+
+
+
+
+
+
+
+
+
+
+
+
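As an optional sanity check (not part of the original solution), the hand-computed splits can be compared with what `rpart` chooses on the same training data; the cut points may differ slightly, but the first split should be on `bl` near `s1$splits`:

```r
# Compare with rpart (sketch; assumes the rpart package is installed)
library(rpart)
p_rp <- rpart(species ~ bl + bm, data = p_tr,
              control = rpart.control(minsplit = 5))
p_rp
```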

Digging deeper into diagnosing an error

+
    +
  1. Fit the random forest model to the full penguins data.
  2. +
+
+ +
+
+
+
+
set.seed(923)
+p_split2 <- initial_split(p_tidy_std, 2/3,
+                          strata=species)
+p_tr2 <- training(p_split2)
+p_ts2 <- testing(p_split2)
+
+rf_spec <- rand_forest(mtry=2, trees=1000) |>
+  set_mode("classification") |>
+  set_engine("randomForest")
+p_fit_rf <- rf_spec |> 
+  fit(species ~ ., data = p_tr2)
+
+
+
+
+
+
    +
  1. Report the confusion matrix.
  2. +
+
+ +
+
+
+
+
p_fit_rf
+
+
parsnip model object
+
+
+Call:
+ randomForest(x = maybe_data_frame(x), y = y, ntree = ~1000, mtry = min_cols(~2,      x)) 
+               Type of random forest: classification
+                     Number of trees: 1000
+No. of variables tried at each split: 2
+
+        OOB estimate of  error rate: 2.6%
+Confusion matrix:
+          Adelie Chinstrap Gentoo class.error
+Adelie        97         2      1       0.030
+Chinstrap      2        43      0       0.044
+Gentoo         0         1     81       0.012
+
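The confusion matrix above is the out-of-bag one reported by `randomForest`. A sketch of how a test-set confusion matrix could also be computed with yardstick, as a complement (not part of the original solution):

```r
# Test-set confusion matrix for the forest (sketch)
p_ts2 |>
  mutate(pspecies = predict(p_fit_rf, p_ts2)$.pred_class) |>
  conf_mat(species, pspecies)
```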
+
+
+
+
+
+
    +
  1. Use linked brushing to learn which was the Gentoo penguin that the model was confused about. When we looked at the data in a tour, there was one Gentoo penguin that was an outlier, appearing to be away from the other Gentoos and closer to the Chinstrap group. We would expect this to be the penguin that the forest model is confused about. Is it?
  2. +
+

+

Have a look at the other misclassifications, to understand whether they are ones we'd expect to misclassify, or whether the model is not well constructed.

+
+
p_cl <- p_tr2 |>
+  mutate(pspecies = p_fit_rf$fit$predicted) |>
+  dplyr::select(bl:bm, species, pspecies) |>
+  mutate(sp_jit = jitter(as.numeric(species)),
+         psp_jit = jitter(as.numeric(pspecies)))
+p_cl_shared <- SharedData$new(p_cl)
+
+detour_plot <- detour(p_cl_shared, tour_aes(
+  projection = bl:bm,
+  colour = species)) |>
+  tour_path(grand_tour(2),
+            max_bases=50, fps = 60) |>
+  show_scatter(alpha = 0.9, axes = FALSE,
+               width = "100%", height = "450px")
+
+conf_mat <- plot_ly(p_cl_shared,
+                    x = ~psp_jit,
+                    y = ~sp_jit,
+                    color = ~species,
+                    colors = viridis_pal(option = "D")(3),
+                    height = 450) |>
+  highlight(on = "plotly_selected",
+            off = "plotly_doubleclick") |>
+  add_trace(type = "scatter",
+            mode = "markers")
+
+bscols(
+  detour_plot, conf_mat,
+  widths = c(5, 6)
+)
+
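If the interactive linked brushing is not available, a quick non-interactive check is possible, because the out-of-bag predictions are stored in the fitted object (a sketch):

```r
# Which Gentoo penguin does the forest misclassify?
p_tr2 |>
  mutate(pspecies = p_fit_rf$fit$predicted) |>
  filter(species == "Gentoo", pspecies != "Gentoo")
```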
+
+
+

Deciding on variables in a large data problem

+
    +
  1. Fit a random forest to the bushfire data. You can read more about the bushfire data at https://dicook.github.io/mulgar_book/A2-data.html. Examine the votes matrix using a tour. What do you learn about the confusion between fire causes?
  2. +
+

This code might help:

+
+
data(bushfires)
+
+bushfires_sub <- bushfires[,c(5, 8:45, 48:55, 57:60)] |>
+  mutate(cause = factor(cause))
+
+set.seed(1239)
+bf_split <- initial_split(bushfires_sub, 3/4, strata=cause)
+bf_tr <- training(bf_split)
+bf_ts <- testing(bf_split)
+
+rf_spec <- rand_forest(mtry=5, trees=1000) |>
+  set_mode("classification") |>
+  set_engine("ranger", probability = TRUE, 
+             importance="permutation")
+bf_fit_rf <- rf_spec |> 
+  fit(cause~., data = bf_tr)
+
+# Create votes matrix data
+bf_rf_votes <- bf_fit_rf$fit$predictions |>
+  as_tibble() |>
+  mutate(cause = bf_tr$cause)
+
+# Project 4D into 3D
+proj <- t(geozoo::f_helmert(4)[-1,])
+bf_rf_v_p <- as.matrix(bf_rf_votes[,1:4]) %*% proj
+colnames(bf_rf_v_p) <- c("x1", "x2", "x3")
+bf_rf_v_p <- bf_rf_v_p |>
+  as.data.frame() |>
+  mutate(cause = bf_tr$cause)
+  
+# Add simplex
+simp <- simplex(p=3)
+sp <- data.frame(simp$points)
+colnames(sp) <- c("x1", "x2", "x3")
+sp$cause = ""
+bf_rf_v_p_s <- bind_rows(sp, bf_rf_v_p) |>
+  mutate(cause = factor(cause))
+labels <- c("accident" , "arson", 
+                "burning_off", "lightning", 
+                rep("", nrow(bf_rf_v_p)))
+
+
+
# Examine votes matrix with bounding simplex
+animate_xy(bf_rf_v_p_s[,1:3], col = bf_rf_v_p_s$cause, 
+           axes = "off", half_range = 1.3,
+           edges = as.matrix(simp$edges),
+           obs_labels = labels)
+
+
+ +
+
+
+

The pattern is that points are bunched at the vertex corresponding to lightning, extending along the edge leading to accident. The points also spread across the face corresponding to lightning, accident and arson. The primary confusion for each of the other classes is with lightning. Few points are predicted to be burning_off, because burning off typically occurs only outside of fire season.

+

Part of the reason that the forest predicts predominantly the lightning class is that the problem is highly imbalanced. One approach is to change the weights for each class, to give the lightning class a lower priority. This would shift the model predictions more often towards the other three classes; a sketch of one way to do this is given below.

+
+
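A sketch of class weighting with the ranger engine, via its `class.weights` argument; the weight values are illustrative only, and the vector must be in the order of the factor levels of `cause` (accident, arson, burning_off, lightning):

```r
# Down-weight the dominant lightning class (illustrative weights only)
rf_spec_wt <- rand_forest(mtry = 5, trees = 1000) |>
  set_mode("classification") |>
  set_engine("ranger", probability = TRUE,
             importance = "permutation",
             class.weights = c(1, 1, 1, 0.25))
bf_fit_rf_wt <- rf_spec_wt |>
  fit(cause ~ ., data = bf_tr)
```

Whether this actually helps should then be judged on the test set, for example with `bal_accuracy` as in the boosting comparison later.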
+
+
+
    +
  1. Check the variable importance. Plot the most important variables.
  2. +
+

This code might help:

+
+
bf_fit_rf$fit$variable.importance |> 
+  as_tibble() |> 
+  rename(imp=value) |>
+  mutate(var = colnames(bf_tr)[1:50]) |>
+  select(var, imp) |>
+  arrange(desc(imp)) |> 
+  print(n=50)
+
+
+ +
+
+
+
+
p1 <- ggplot(bf_tr, aes(x=cause, y=log_dist_road)) +
+  geom_quasirandom(alpha=0.5) +
+  stat_summary(aes(group = cause), 
+               fun = median, 
+               fun.min = median, 
+               fun.max = median, 
+               geom = "crossbar", 
+               color = "orange", 
+               width = 0.7, 
+               lwd = 0.5) +
+  xlab("") +
+  coord_flip() 
+p2 <- ggplot(bf_tr, aes(x=cause, y=arf360)) +
+  geom_quasirandom(alpha=0.5) +
+  stat_summary(aes(group = cause), 
+               fun = median, 
+               fun.min = median, 
+               fun.max = median, 
+               geom = "crossbar", 
+               color = "orange", 
+               width = 0.7, 
+               lwd = 0.5) +
+  xlab("") +
+  coord_flip()
+p3 <- ggplot(bf_tr, aes(x=cause, y=log_dist_cfa)) +
+  geom_quasirandom(alpha=0.5) +
+  stat_summary(aes(group = cause), 
+               fun = median, 
+               fun.min = median, 
+               fun.max = median, 
+               geom = "crossbar", 
+               color = "orange", 
+               width = 0.7, 
+               lwd = 0.5) +
+  xlab("") +
+  coord_flip()
+p1 + p2 + p3 + plot_layout(ncol=3)
+
+
+
+

+
+
+
+
+

Each of these variables has some difference in median value between the classes, but none shows any clear separation between them. If even the three most important variables show little separation, it indicates how difficult it is to distinguish between these classes. However, it looks like the larger the distance from a road or a CFA station, the higher the chance that the cause is a lightning start. This makes sense, because these would be locations further from human activity, and thus the fire is less likely to be started by people. The arf360 variable relates to rain from a year ago. It also appears that if the rainfall was higher a year ago, lightning is more likely the cause. This also makes some sense, because with more rain in the previous year, there should be more vegetation. In particular, if recent months have been dry, then there is likely a lot of dry vegetation which is combustible. Ideally we would create a new variable (feature engineering) that looks at the difference in rainfall between the previous year and just before the current year's fire season, to model these types of conditions; a sketch of this is given below.

+
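A sketch of that engineered feature, assuming the data also contains a shorter-window rainfall average; the column name `arf28` is an assumption, so check the data dictionary linked above before using it:

```r
# Hypothetical feature: change in average rainfall from last year to the
# window just before the current fire season (arf28 is an assumed column name)
bf_tr_fe <- bf_tr |>
  mutate(rain_change = arf360 - arf28)
```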
+
+
+
+
+
+

Can boosting better detect bushfire cause?

+

Fit a boosted tree model using xgboost to the bushfires data. You can use the code below. Compute the confusion tables and the balanced accuracy for the test data for both the forest model and the boosted tree model, to make the comparison.

+
+
set.seed(121)
+bf_spec2 <- boost_tree() |>
+  set_mode("classification") |>
+  set_engine("xgboost")
+bf_fit_bt <- bf_spec2 |> 
+  fit(cause~., data = bf_tr)
+
+
+ +
+
+
+

The results for the random forest are:

+
+
bf_ts_rf_pred <- bf_ts |>
+  mutate(pcause = predict(bf_fit_rf, bf_ts)$.pred_class)
+bal_accuracy(bf_ts_rf_pred, cause, pcause)
+
+
# A tibble: 1 × 3
+  .metric      .estimator .estimate
+  <chr>        <chr>          <dbl>
+1 bal_accuracy macro          0.638
+
+
bf_ts_rf_pred |>
+  count(cause, pcause) |>
+  group_by(cause) |>
+  mutate(Accuracy = n[cause==pcause]/sum(n)) |>
+  pivot_wider(names_from = "pcause", 
+              values_from = n, values_fill = 0) |>
+  select(cause, accident, arson, burning_off, lightning, Accuracy)
+
+
# A tibble: 4 × 6
+# Groups:   cause [4]
+  cause       accident arson burning_off lightning Accuracy
+  <fct>          <int> <int>       <int>     <int>    <dbl>
+1 accident          14     0           0        19   0.424 
+2 arson              2     1           0        10   0.0769
+3 burning_off        0     0           1         3   0.25  
+4 lightning          0     0           0       206   1     
+
+
+

and for the boosted tree are:

+
+
bf_ts_bt_pred <- bf_ts |>
+  mutate(pcause = predict(bf_fit_bt, 
+                            bf_ts)$.pred_class)
+bal_accuracy(bf_ts_bt_pred, cause, pcause)
+
+
# A tibble: 1 × 3
+  .metric      .estimator .estimate
+  <chr>        <chr>          <dbl>
+1 bal_accuracy macro          0.765
+
+
bf_ts_bt_pred |>
+  count(cause, pcause) |>
+  group_by(cause) |>
+  mutate(Accuracy = n[cause==pcause]/sum(n)) |>
+  pivot_wider(names_from = "pcause", 
+              values_from = n, values_fill = 0) |>
+  select(cause, accident, arson, burning_off, lightning, Accuracy)
+
+
# A tibble: 4 × 6
+# Groups:   cause [4]
+  cause       accident arson burning_off lightning Accuracy
+  <fct>          <int> <int>       <int>     <int>    <dbl>
+1 accident          19     1           0        13    0.576
+2 arson              4     6           0         3    0.462
+3 burning_off        0     0           2         2    0.5  
+4 lightning          3     1           0       202    0.981
+
+
+

The boosted tree does improve the balanced accuracy (the macro-average of each class's (sensitivity + specificity)/2), mainly by recovering more of the accident and arson fires while giving up very little accuracy on lightning.

+
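If time permits, the xgboost defaults could also be tuned before settling on a final comparison. The sketch below uses illustrative resampling and grid settings, and is not part of the original solution:

```r
# Tune tree count and learning rate, judged by balanced accuracy (sketch)
set.seed(131)
bf_folds <- vfold_cv(bf_tr, v = 5, strata = cause)
bf_spec2_tune <- boost_tree(trees = tune(), learn_rate = tune()) |>
  set_mode("classification") |>
  set_engine("xgboost")
bf_tune_res <- tune_grid(
  workflow() |> add_model(bf_spec2_tune) |> add_formula(cause ~ .),
  resamples = bf_folds,
  grid = 10,
  metrics = metric_set(bal_accuracy)
)
show_best(bf_tune_res, metric = "bal_accuracy")
```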
+
+
+
+
+
+

👋 Finishing up

+

Make sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult.

+ + +
+ +
+ +
+ + + + + \ No newline at end of file diff --git a/setup.R b/setup.R index 0df13a8c..1c7352a1 100644 --- a/setup.R +++ b/setup.R @@ -68,6 +68,7 @@ conflicts_prefer(palmerpenguins::penguins) conflicts_prefer(tourr::flea) conflicts_prefer(viridis::viridis_pal) conflicts_prefer(latex2exp::TeX) +conflicts_prefer(geozoo::simplex) p_tidy <- penguins |> select(species, bill_length_mm:body_mass_g) |> diff --git a/week2/images/slides.rmarkdown/data-in-model-space1-1.png b/week2/images/slides.rmarkdown/data-in-model-space1-1.png index 450dedde..10b9196a 100644 Binary files a/week2/images/slides.rmarkdown/data-in-model-space1-1.png and b/week2/images/slides.rmarkdown/data-in-model-space1-1.png differ diff --git a/week2/images/slides.rmarkdown/model-in-the-data-space1-1.png b/week2/images/slides.rmarkdown/model-in-the-data-space1-1.png index 11c810e5..a6b0fca0 100644 Binary files a/week2/images/slides.rmarkdown/model-in-the-data-space1-1.png and b/week2/images/slides.rmarkdown/model-in-the-data-space1-1.png differ diff --git a/week2/images/slides.rmarkdown/model-in-the-data-space2-1.png b/week2/images/slides.rmarkdown/model-in-the-data-space2-1.png index 9925ad6e..9cde5591 100644 Binary files a/week2/images/slides.rmarkdown/model-in-the-data-space2-1.png and b/week2/images/slides.rmarkdown/model-in-the-data-space2-1.png differ diff --git a/week5/slides.qmd b/week5/slides.qmd index bb9ffd7e..96b5ac78 100644 --- a/week5/slides.qmd +++ b/week5/slides.qmd @@ -8,7 +8,7 @@ author: - name: "Professor Di Cook" email: "etc3250.clayton-x@monash.edu" institute: "Department of Econometrics and Business Statistics" -footer: "ETC3250/5250 Lecture 4 | [iml.numbat.space](iml.numbat.space)" +footer: "ETC3250/5250 Lecture 5 | [iml.numbat.space](iml.numbat.space)" format: revealjs: multiplex: false @@ -195,7 +195,7 @@ $$\widehat{p}_{RA} = 0/2, \widehat{p}_{RB} = 2/2, ~~ p_R = 2/7$$ $$G_R=0(1-0)+1(1-1) = 0$$ Combine with weighted sum to get [impurity for the split]{.monash-orange2}: -$$5/7G_L + 2/7G_R=0.32$$ +$$5/7G_L + 2/7G_R=0.23$$

[**Your turn**: Compute the impurity for split 2.]{.monash-blue2} diff --git a/week6/tutorial.qmd b/week6/tutorial.qmd new file mode 100644 index 00000000..5c87d4ad --- /dev/null +++ b/week6/tutorial.qmd @@ -0,0 +1,522 @@ +--- +title: "ETC3250/5250 Tutorial 6" +subtitle: "Trees and forests" +author: "Prof. Di Cook" +date: "2024-04-08" +quarto-required: ">=1.3.0" +format: + unilur-html: + output-file: tutorial.html + embed-resources: true + css: "../assets/tutorial.css" + unilur-html+solution: + output-file: tutorialsol.html + embed-resources: true + css: "../assets/tutorial.css" +unilur-solution: true +--- + +```{r echo=FALSE} +# Set up chunk for all slides +knitr::opts_chunk$set( + fig.width = 4, + fig.height = 4, + fig.align = "center", + out.width = "60%", + code.line.numbers = FALSE, + fig.retina = 3, + echo = TRUE, + message = FALSE, + warning = FALSE, + cache = FALSE, + dev.args = list(pointsize = 11) +) +``` + +```{r} +#| echo: true +#| code-fold: true +#| code-summary: "Load the libraries and avoid conflicts" +# Load libraries used everywhere +library(tidyverse) +library(tidymodels) +library(patchwork) +library(mulgar) +library(palmerpenguins) +library(GGally) +library(tourr) +library(MASS) +library(discrim) +library(classifly) +library(detourr) +library(crosstalk) +library(plotly) +library(viridis) +library(colorspace) +library(randomForest) +library(geozoo) +library(ggbeeswarm) +library(conflicted) +conflicts_prefer(dplyr::filter) +conflicts_prefer(dplyr::select) +conflicts_prefer(dplyr::slice) +conflicts_prefer(palmerpenguins::penguins) +conflicts_prefer(viridis::viridis_pal) + +options(digits=2) +p_tidy <- penguins |> + select(species, bill_length_mm:body_mass_g) |> + rename(bl=bill_length_mm, + bd=bill_depth_mm, + fl=flipper_length_mm, + bm=body_mass_g) |> + filter(!is.na(bl)) |> + arrange(species) |> + na.omit() +p_tidy_std <- p_tidy |> + mutate_if(is.numeric, function(x) (x-mean(x))/sd(x)) +``` + +```{r} +#| echo: false +# Set plot theme +theme_set(theme_bw(base_size = 14) + + theme( + aspect.ratio = 1, + plot.background = element_rect(fill = 'transparent', colour = NA), + plot.title.position = "plot", + plot.title = element_text(size = 24), + panel.background = element_rect(fill = 'transparent', colour = NA), + legend.background = element_rect(fill = 'transparent', colour = NA), + legend.key = element_rect(fill = 'transparent', colour = NA) + ) +) +``` + +## `r emo::ji("target")` Objectives + +The goal for this week is learn to fit, diagnose, assess assumptions, and predict from classification tree and random forest models. + +## `r emo::ji("wrench")` Preparation + +- Make sure you have all the necessary libraries installed. There are a few new ones this week! + +## Exercises: + +Open your project for this unit called `iml.Rproj`. For all the work we will use the penguins data. Start with splitting it into a training and test set, as follows. + +```{r} +set.seed(1156) +p_sub <- p_tidy_std |> + filter(species != "Gentoo") |> + mutate(species = factor(species)) |> + select(species, bl, bm) +p_split <- initial_split(p_sub, 2/3, strata = species) +p_tr <- training(p_split) +p_ts <- testing(p_split) +``` + +#### 1. Becoming a car mechanic - looking under the hood at the tree algoriithm + +a. Write down the equation for the Gini measure of impurity, for two groups, and the parameter $p$ which is the proportion of observations in class 1. 
Specify the domain of the function, and determine the value of $p$ which gives the maximum value, and report what that maximum function value is.
+
+::: unilur-solution
+$G = p(1-p)$ where $p$ is the proportion of class 1 in the subset of data. The domain is $[0, 1]$ and the maximum value of $0.25$ is at $p=0.5$.
+:::
+
+b. For two groups, how would the impurity of a **split** be measured? Give the equation.
+
+::: unilur-solution
+
+$$p_L(p_{L1}(1-p_{L1})) + p_R(p_{R1}(1-p_{R1}))$$
+where $p_L$ is the proportion of observations to the left of the split, $p_{L1}$ is the proportion of observations of class 1 to the left of the split, and $p_{R1}$ indicates the equivalent quantities for observations to the right of the split.
+
+:::
+
+c. Below is an R function to compute the Gini impurity for a particular split on a single variable. Work through the code of the function, and document what each step does. Make sure to include a note on what the `minsplit` parameter does to prevent splitting when either side of the split has fewer than the specified number of observations.
+
+
+```{r echo=TRUE}
+# This works for two classes, and one variable
+mygini <- function(p) {
+  g <- 0
+  if (p>0 && p<1) {
+    g <- 2*p*(1-p)
+  }
+
+  return(g)
+}
+
+mysplit <- function(x, spl, cl, minsplit=5) {
+  # Assumes x is sorted
+  # Count number of observations
+  n <- length(x)
+
+  # Check number of classes
+  cl_unique <- unique(cl)
+
+  # Split into two subsets on the given value
+  left <- x[x<spl]
+  cl_left <- cl[x<spl]
+  n_l <- length(left)
+
+  right <- x[x>=spl]
+  cl_right <- cl[x>=spl]
+  n_r <- length(right)
+
+  # Don't calculate if either subset has fewer observations than minsplit
+  if ((n_l < minsplit) | (n_r < minsplit))
+    impurity = NA
+  else {
+    # Compute the Gini value for the split
+    p_l <- length(cl_left[cl_left == cl_unique[1]])/n_l
+    p_r <- length(cl_right[cl_right == cl_unique[1]])/n_r
+    if (is.na(p_l)) p_l<-0.5
+    if (is.na(p_r)) p_r<-0.5
+    impurity <- (n_l/n)*mygini(p_l) + (n_r/n)*mygini(p_r)
+  }
+  return(impurity)
+}
+```
+
+
+d. Apply the function to compute the value for all possible splits for the body mass (`bm`), setting `minsplit` to be 1, so that all possible splits will be evaluated. Make a plot of these values vs the variable.
+
+::: unilur-solution
+
+```{r}
+x <- p_tr |>
+  select(species, bm) |>
+  arrange(bm)
+unique_splits <- unique(x$bm)
+nsplits <- length(unique_splits)-1
+splits <- (unique_splits[1:nsplits] + unique_splits[2:(nsplits+1)])/2
+imp <- NULL;
+for (i in 1:length(splits)) {
+  s <- splits[i]
+  a <- mysplit(x$bm, s, x$species, minsplit=1)
+  imp <- c(imp, a)
+}
+d_impurity <- tibble(splits, imp)
+d_impurity_bm <- d_impurity[which.min(d_impurity$imp),]
+ggplot() + geom_line(data=d_impurity, aes(x=splits, y=imp)) +
+  geom_rug(data=x, aes(x=bm, colour=species), alpha=0.3) +
+  ylab("Gini impurity") +
+  xlab("bm") +
+  scale_color_brewer("", palette="Dark2")
+```
+
+:::
+
+e. Use your function to compute the first two steps of a classification tree model for separating Adelie from Chinstrap penguins, after setting `minsplit` to be 5. Make a scatterplot of the two variables that would be used in the splits, with points coloured by species, and the splits as line segments.
+ +::: unilur-solution + +```{r results='hide'} +# bl: this is the only one needed for the first split +# because it is so better separated than any others +x <- p_tr |> + select(species, bl) |> + arrange(bl) +unique_splits <- unique(x$bl) +nsplits <- length(unique_splits)-1 +splits <- (unique_splits[1:nsplits] + unique_splits[2:(nsplits+1)])/2 +imp <- NULL; +for (i in 1:length(splits)) { + s <- splits[i] + a <- mysplit(x$bl, s, x$species, minsplit=1) + imp <- c(imp, a) +} +d_impurity <- tibble(splits, imp) +d_impurity_bl <- d_impurity[which.min(d_impurity$imp),] + +ggplot() + + geom_line(data=d_impurity, aes(x=splits, y=imp)) + + geom_rug(data=x, aes(x=bl, colour=species), alpha=0.3) + + ylab("Gini impurity") + + xlab("bl") + + scale_color_brewer("", palette="Dark2") + +p_tr_L <- p_tr |> + filter(bl < d_impurity_bl$splits) + +p_tr_R <- p_tr |> + filter(bl > d_impurity_bl$splits) + +# Make a function to make calculations easier +best_split <- function(x, cl, minsplit=5) { + unique_splits <- unique(x) + nsplits <- length(unique_splits)-1 + splits <- (unique_splits[1:nsplits] + unique_splits[2:(nsplits+1)])/2 + imp <- NULL; + for (i in 1:length(splits)) { + s <- splits[i] + a <- mysplit(x, s, cl, minsplit) + imp <- c(imp, a) + } + d_impurity <- tibble(splits, imp) + d_impurity_best <- d_impurity[which.min(d_impurity$imp),] + return(d_impurity_best) +} + +s1 <- best_split(p_tr$bl, p_tr$species, minsplit=5) +s2 <- best_split(p_tr_R$bm, p_tr_R$species, minsplit=5) + +ggplot(p_tr, aes(x=bl, y=bm, colour=species)) + + geom_point() + + geom_vline(xintercept=s1$splits) + + annotate("segment", x = s1$splits, + xend = max(p_tr$bl), + y = s2$splits, + yend = s2$splits) + + scale_colour_brewer("", palette="Dark2") + + theme(aspect.ratio = 1) +``` + + +::: + +## Digging deeper into diagnosing an error + +a. Fit the random forest model to the full penguins data. + +::: unilur-solution +```{r} +set.seed(923) +p_split2 <- initial_split(p_tidy_std, 2/3, + strata=species) +p_tr2 <- training(p_split2) +p_ts2 <- testing(p_split2) + +rf_spec <- rand_forest(mtry=2, trees=1000) |> + set_mode("classification") |> + set_engine("randomForest") +p_fit_rf <- rf_spec |> + fit(species ~ ., data = p_tr2) +``` +::: + +b. Report the confusion matrix. + +::: unilur-solution +```{r} +p_fit_rf +``` +::: + +c. Use linked brushing to learn which was the Gentoo penguin that the model was confused about. When we looked at the data in a tour, there was one Gentoo penguin that was an outlier, appearing to be away from the other Gentoos and closer to the Chinstrap group. We would expect this to be the penguin that the forest model is confused about. Is it? + +![](../images/p_forest_detourr.png) + +Have a look at the other misclassifications, to understand whether they are ones we'd expect to misclassify, or whether the model is not well constructed. 
+ +```{r eval=FALSE} +p_cl <- p_tr2 |> + mutate(pspecies = p_fit_rf$fit$predicted) |> + dplyr::select(bl:bm, species, pspecies) |> + mutate(sp_jit = jitter(as.numeric(species)), + psp_jit = jitter(as.numeric(pspecies))) +p_cl_shared <- SharedData$new(p_cl) + +detour_plot <- detour(p_cl_shared, tour_aes( + projection = bl:bm, + colour = species)) |> + tour_path(grand_tour(2), + max_bases=50, fps = 60) |> + show_scatter(alpha = 0.9, axes = FALSE, + width = "100%", height = "450px") + +conf_mat <- plot_ly(p_cl_shared, + x = ~psp_jit, + y = ~sp_jit, + color = ~species, + colors = viridis_pal(option = "D")(3), + height = 450) |> + highlight(on = "plotly_selected", + off = "plotly_doubleclick") |> + add_trace(type = "scatter", + mode = "markers") + +bscols( + detour_plot, conf_mat, + widths = c(5, 6) +) +``` + +## Deciding on variables in a large data problem + +a. Fit a random forest to the bushfire data. You can read more about the bushfire data at https://dicook.github.io/mulgar_book/A2-data.html. Examine the votes matrix using a tour. What do you learn about the confusion between fire causes? + +This code might help: + +```{r} +data(bushfires) + +bushfires_sub <- bushfires[,c(5, 8:45, 48:55, 57:60)] |> + mutate(cause = factor(cause)) + +set.seed(1239) +bf_split <- initial_split(bushfires_sub, 3/4, strata=cause) +bf_tr <- training(bf_split) +bf_ts <- testing(bf_split) + +rf_spec <- rand_forest(mtry=5, trees=1000) |> + set_mode("classification") |> + set_engine("ranger", probability = TRUE, + importance="permutation") +bf_fit_rf <- rf_spec |> + fit(cause~., data = bf_tr) + +# Create votes matrix data +bf_rf_votes <- bf_fit_rf$fit$predictions |> + as_tibble() |> + mutate(cause = bf_tr$cause) + +# Project 4D into 3D +proj <- t(geozoo::f_helmert(4)[-1,]) +bf_rf_v_p <- as.matrix(bf_rf_votes[,1:4]) %*% proj +colnames(bf_rf_v_p) <- c("x1", "x2", "x3") +bf_rf_v_p <- bf_rf_v_p |> + as.data.frame() |> + mutate(cause = bf_tr$cause) + +# Add simplex +simp <- simplex(p=3) +sp <- data.frame(simp$points) +colnames(sp) <- c("x1", "x2", "x3") +sp$cause = "" +bf_rf_v_p_s <- bind_rows(sp, bf_rf_v_p) |> + mutate(cause = factor(cause)) +labels <- c("accident" , "arson", + "burning_off", "lightning", + rep("", nrow(bf_rf_v_p))) +``` + +```{r eval=FALSE} +# Examine votes matrix with bounding simplex +animate_xy(bf_rf_v_p_s[,1:3], col = bf_rf_v_p_s$cause, + axes = "off", half_range = 1.3, + edges = as.matrix(simp$edges), + obs_labels = labels) +``` + +::: unilur-solution +The pattern is that points are bunched at the vertex corresponding to lightning, extending along the edge leading to accident. We could also say that the points do extend on the face corresponding to lightning, accident and arson, too. The primary confusion for each of the other classes is with lightning. Few points are predicted to be `burning_off` because this is typically only occurring outside of fire season. + +Part of the reason that the forest predicts predominantly to the lightning class is because it is a highly imbalanced problem. One approach is to change the weights for each class, to give the lightning class a lower priority. This will change the model predictions to be more often the other three classes. +::: + +b. Check the variable importance. Plot the most important variables. 
+ +This code might help: + +```{r eval=FALSE} +bf_fit_rf$fit$variable.importance |> + as_tibble() |> + rename(imp=value) |> + mutate(var = colnames(bf_tr)[1:50]) |> + select(var, imp) |> + arrange(desc(imp)) |> + print(n=50) +``` + +::: unilur-solution + +```{r} +#| fig-width: 10 +#| fig-height: 5 +#| out-width: 100% +p1 <- ggplot(bf_tr, aes(x=cause, y=log_dist_road)) + + geom_quasirandom(alpha=0.5) + + stat_summary(aes(group = cause), + fun = median, + fun.min = median, + fun.max = median, + geom = "crossbar", + color = "orange", + width = 0.7, + lwd = 0.5) + + xlab("") + + coord_flip() +p2 <- ggplot(bf_tr, aes(x=cause, y=arf360)) + + geom_quasirandom(alpha=0.5) + + stat_summary(aes(group = cause), + fun = median, + fun.min = median, + fun.max = median, + geom = "crossbar", + color = "orange", + width = 0.7, + lwd = 0.5) + + xlab("") + + coord_flip() +p3 <- ggplot(bf_tr, aes(x=cause, y=log_dist_cfa)) + + geom_quasirandom(alpha=0.5) + + stat_summary(aes(group = cause), + fun = median, + fun.min = median, + fun.max = median, + geom = "crossbar", + color = "orange", + width = 0.7, + lwd = 0.5) + + xlab("") + + coord_flip() +p1 + p2 + p3 + plot_layout(ncol=3) +``` + +Each of these variables has some difference in median value between the classes, but none shows any separation between them. If the three most important variables show little separation, it indicates the difficulty in distinguishing between these classes. However, it looks like if the distance from a road, or CFA station is bigger, the chance of the cause being a lightning start is higher. This makes sense, because these would be locations further from human activity, and thus the fire is less likely to started by people. The arf360 relates to rain from a year ago. It also appears that if the rainfall was higher a year ago, lightning is more likely the cause. This also makes some sense, because with more rain in the previous year, there should be more vegetation. Particularly, if recent months have been dry, then there is likely a lot of dry vegetation which is combustible. Ideally we would create a new variable (feature engineering) that looks at difference in rainfall from the previous year to just before the current year's fire season, to model these types of conditions. +::: + +## Can boosting better detect bushfire case? + +Fit a boosted tree model using `xgboost` to the bushfires data. You can use the code below. Compute the confusion tables and the balanced accuracy for the test data for both the forest model and the boosted tree model, to make the comparison. 
+ +```{r} +set.seed(121) +bf_spec2 <- boost_tree() |> + set_mode("classification") |> + set_engine("xgboost") +bf_fit_bt <- bf_spec2 |> + fit(cause~., data = bf_tr) +``` + +::: unilur-solution + +The results for the random forest are: + +```{r} +bf_ts_rf_pred <- bf_ts |> + mutate(pcause = predict(bf_fit_rf, bf_ts)$.pred_class) +bal_accuracy(bf_ts_rf_pred, cause, pcause) +bf_ts_rf_pred |> + count(cause, pcause) |> + group_by(cause) |> + mutate(Accuracy = n[cause==pcause]/sum(n)) |> + pivot_wider(names_from = "pcause", + values_from = n, values_fill = 0) |> + select(cause, accident, arson, burning_off, lightning, Accuracy) +``` + +and for the boosted tree are: + +```{r} +bf_ts_bt_pred <- bf_ts |> + mutate(pcause = predict(bf_fit_bt, + bf_ts)$.pred_class) +bal_accuracy(bf_ts_bt_pred, cause, pcause) +bf_ts_bt_pred |> + count(cause, pcause) |> + group_by(cause) |> + mutate(Accuracy = n[cause==pcause]/sum(n)) |> + pivot_wider(names_from = "pcause", + values_from = n, values_fill = 0) |> + select(cause, accident, arson, burning_off, lightning, Accuracy) +``` + +The boosted tree does improve the balanced accuracy. +::: + +## `r emo::ji("wave")` Finishing up + +Make sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult.