diff --git a/docs/search.json b/docs/search.json
index 4a11ab5b..5e6f3e0c 100644
--- a/docs/search.json
+++ b/docs/search.json
@@ -305,7 +305,7 @@
"href": "week7/tutorialsol.html#exercises",
"title": "ETC3250/5250 Tutorial 7",
"section": "Exercises:",
- "text": "Exercises:\nOpen your project for this unit called iml.Rproj. We will be working through the tutorial at TensorFlow for R for fitting and predicting the fashion MNIST image data.\n\n1. Get the data\nWe use the Fashion MNIST dataset which contains 70,000 grayscale images in 10 categories of articles sold on Zalando’s multi-brand, digital platform for fashion, beauty, and lifestyle.\n\n# download the data\nfashion_mnist <- dataset_fashion_mnist()\n\n# split into input variables and response\nc(train_images, train_labels) %<-% fashion_mnist$train\nc(test_images, test_labels) %<-% fashion_mnist$test\n\n# for interpretation we also define the category names\nclass_names = c('T-shirt/top',\n 'Trouser',\n 'Pullover',\n 'Dress',\n 'Coat',\n 'Sandal',\n 'Shirt',\n 'Sneaker',\n 'Bag',\n 'Ankle boot')\n\n\n\n2. What’s in the data?\nCheck how many observations are in the training and test sets, and plot some of the images.\n\ndim(train_images)\ndim(train_labels)\ndim(test_images)\ndim(test_labels)\n\n# Choose an image randomly\nimg <- as.data.frame(train_images[sample(1:60000, 1), , ])\ncolnames(img) <- seq_len(ncol(img))\nimg$y <- seq_len(nrow(img))\nimg <- img |>\n pivot_longer(cols = -y,\n names_to=\"x\", \n values_to=\"value\") |>\n mutate(x = as.integer(x))\n\nggplot(img, aes(x = x, y = y, fill = value)) +\n geom_tile() +\n scale_fill_gradient(low = \"white\", \n high = \"black\", \n na.value = NA) +\n scale_y_reverse() +\n theme_map() +\n theme(legend.position = \"none\")\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\n\n[1] 60000 28 28\n\n\n[1] 60000\n\n\n[1] 10000 28 28\n\n\n[1] 10000\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n3. Pre-process the data\nIt may not be necessary, says Patrick, but we’ll scale the data to 0-1, before modeling.\n\ntrain_images <- train_images / 255\ntest_images <- test_images / 255\n\n\n\n4. Set up the model\nThe model architecture will have:\n\na flatten layer to turn the images into vectors\none hidden layer with 128 nodes with (rectified) linear activation\nfinal layer with 10 nodes and logistic activation\n\nWhy 10 nodes in the last layer? Why 128 nodes in the hidden layer?\n\nmodel_fashion_mnist <- keras_model_sequential()\nmodel_fashion_mnist |>\n # flatten the image data into a long vector\n layer_flatten(input_shape = c(28, 28)) |>\n # hidden layer with 128 units\n layer_dense(units = 128, activation = 'relu') |>\n # output layer for 10 categories\n layer_dense(units = 10, activation = 'softmax')\n\nSet the optimizer to be adam, loss function to be sparse_categorical_crossentropy and accuracy as the metric. What other optimizers could be used? What is the sparse_catgorical_crossentropy?\n\nmodel_fashion_mnist |> compile(\n optimizer = 'adam',\n loss = 'sparse_categorical_crossentropy',\n metrics = c('accuracy')\n)\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\nThere are 10 classes, so we need 10 nodes in the final layer.\nThe choice of 128 nodes in the hidden layer is arbitrary. It means that we are reducing the dimension down from 784 to 128 at this point.\nSparse categorical cross-entropy is an extension of the categorical cross-entropy loss function that is used when the output labels are represented in a sparse matrix format. The labels are represented as a single index value rather than a binary matrix.\nhttps://keras.io/api/optimizers/ has a list of optimizers available.\n\n\n\n\n\n\n5. Fit the model\n\nmodel_fashion_mnist |> fit(train_images,\n train_labels,\n epochs = 5,\n verbose = 0)\n\n\n\n6. Evaluate the model\n\nfmnist_score <- model_fashion_mnist |> \n evaluate(test_images, test_labels, verbose = 0)\n\nfmnist_score\n\nCheck with other people in class. Do you get the same result? If not, why would this be?\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\n\n loss accuracy \n0.5240766 0.8290000 \n\n\nEach person has started the optimizer with a different random seed, since we didn’t set one. You could try to set the seed using tensorflow::set_random_seed(), and have your neighbour do the same, to check if you get the same result. You will need to clean your environment before attempting this because if you fit the model again it will update the current one rather than starting afresh.\n\n\n\n\n\n\n7. Predict the test set\nWhich classes are most often confused?\n\ntest_tags <- factor(class_names[test_labels + 1],\n levels = class_names)\n\nfashion_test_pred <- predict(model_fashion_mnist,\n test_images, verbose = 0)\nfashion_test_pred_cat <- levels(test_tags)[\n apply(fashion_test_pred, 1,\n which.max)]\npredicted <- factor(\n fashion_test_pred_cat,\n levels=levels(test_tags)) |>\n as.numeric() - 1\nobserved <- as.numeric(test_tags) -1\ntable(observed, predicted)\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\n\n predicted\nobserved 0 1 2 3 4 5 6 7 8 9\n 0 856 4 7 41 11 2 67 0 11 1\n 1 1 961 2 17 11 0 7 0 1 0\n 2 32 3 551 5 281 0 121 0 7 0\n 3 35 28 10 808 74 1 35 0 9 0\n 4 0 3 26 17 897 0 51 0 6 0\n 5 0 0 0 0 0 959 0 33 2 6\n 6 181 2 64 28 194 1 504 0 26 0\n 7 0 0 0 0 0 46 0 940 0 14\n 8 15 1 3 7 8 9 21 7 929 0\n 9 0 0 0 0 0 56 1 58 0 885\n\n\nThere are several classes that have some confusion with other classes, particularly 6 with 0, 2, 4. But other classes are most often confused with at least one other. Classes 1, 5, 7, 8, 9 are rarely confused.\n\n\n\n\n\n\n8. Compute metrics\nCompute the accuracy of the model on the test set. How does this compare with the accuracy reported when you fitted the model?\nIf the model equally accurate on all classes? If not, which class(es) is(are) poorly fitted?\n\nfashion_test_pred <- fashion_test_pred |>\n cbind(observed, predicted)\n\nfashion_test_pred <- fashion_test_pred |>\n as.tibble() |>\n mutate(label = test_tags,\n plabel = factor(class_names[predicted+1], \n levels = levels(test_tags)))\n\naccuracy(fashion_test_pred, label, plabel)\nbal_accuracy(fashion_test_pred, label, plabel)\nfashion_test_pred |>\n count(label, plabel) |>\n group_by(label) |>\n mutate(Accuracy = ifelse(sum(n)>0, n[plabel==label]/sum(n), 0)) |>\n pivot_wider(names_from = \"plabel\", \n values_from = n, \n values_fill = 0) |>\n select(label, Accuracy)\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\n\n# A tibble: 1 × 3\n .metric .estimator .estimate\n <chr> <chr> <dbl>\n1 accuracy multiclass 0.829\n\n\n# A tibble: 1 × 3\n .metric .estimator .estimate\n <chr> <chr> <dbl>\n1 bal_accuracy macro 0.905\n\n\n# A tibble: 10 × 2\n# Groups: label [10]\n label Accuracy\n <fct> <dbl>\n 1 T-shirt/top 0.856\n 2 Trouser 0.961\n 3 Pullover 0.551\n 4 Dress 0.808\n 5 Coat 0.897\n 6 Sandal 0.959\n 7 Shirt 0.504\n 8 Sneaker 0.94 \n 9 Bag 0.929\n10 Ankle boot 0.885\n\n\n\n\n\n\n\n\n9. Investigating the results\nThis section is motivated by the examples in Cook and Laa (2024). Focus on the test data to investigate the fit, and lack of fit.\n\nPCA can be used to reduce the dimension down from 784, to a small number of PCS, to examine the nature of differences between the classes. Compute the scree plot to decide on a reasonable number that can be examined in a tour. Plot the first two statically. Explain how the class structure matches any clustering.\n\n\ntest_images_flat <- test_images\ndim(test_images_flat) <- c(nrow(test_images_flat), 784)\nimages_pca <- prcomp(as.data.frame(test_images_flat))\nimages_pc <- as.data.frame(images_pca$x)\nggscree(images_pca, q=20, guide=FALSE)\nggplot(images_pc,\n aes(PC1, PC2, color = test_tags)) +\n geom_point(size = 0.1) +\n scale_color_discrete_qualitative(palette = \"Dynamic\") +\n theme(legend.title = element_blank())\n\n\nanimate_xy(images_pc[,1:5], col = test_tags,\n cex=0.2, palette = \"Dynamic\")\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nThere isn’t much separation between classes in the PCs. There is some difference between classes, with overlap between them. It looks less separated than what the confusion matrix would suggest.\n\n\n\n\n\nUMAP can also be used to understand the class structure. Make a 2D UMAP representation and explain how the class structure matches cluster structure.\n\n\nset.seed(253)\nfashion_umap <- umap(test_images_flat, init = \"spca\")\nfashion_umap_df <- fashion_umap |>\n as_tibble() |>\n rename(UMAP1 = V1, UMAP2 = V2) |>\n mutate(label = test_tags)\nggplot(fashion_umap_df, aes(x = UMAP1, \n y = UMAP2,\n colour = label)) +\n geom_point(size = 0.3, alpha=0.5) +\n scale_color_discrete_qualitative(palette = \"Dynamic\") +\n theme(legend.title = element_blank())\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nThere are multiple well-separated clusters in the representation. Mostly these are mixtures of several classes. Only one cluster mostly matches an article, Trouser.\n\n\n\n\n\nInterestingly, the nodes in the hidden layer can be thought of as 128 new variables which are linear combinations of the original 784 variables. This is too many to visualise but we can again use PCA to reduce their dimension again, and make plots.\n\n\nactivations_model_fashion <- keras_model(\n inputs = model_fashion_mnist$input,\n outputs = model_fashion_mnist$layers[[2]]$output\n)\nactivations_fashion <- predict(\n activations_model_fashion,\n test_images, verbose = 0)\n\n# PCA for activations\nactivations_pca <- prcomp(activations_fashion)\nactivations_pc <- as.data.frame(activations_pca$x)\n\nggscree(activations_pca, q=20, guide=FALSE)\n\nggplot(activations_pc,\n aes(PC1, PC2, color = test_tags)) +\n geom_point(size = 0.1) +\n ggtitle(\"Activations\") +\n scale_color_discrete_qualitative(palette = \"Dynamic\") \n\n\nanimate_xy(activations_pc[,1:5], col = test_tags,\n cex=0.2, palette = \"Dynamic\")\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nThere substantial separation between classes in the PCs of these new variables. It looks now reasonable that the classes are distinguishable as the confusion matrix suggests.\n\n\n\n\n\nSimilarly, we can general a 2D representation using UMAP of these new variables.\n\n\nset.seed(253)\nfashion_umap <- umap(activations_fashion, init = \"spca\")\nfashion_umap_df <- fashion_umap |>\n as_tibble() |>\n rename(UMAP1 = V1, UMAP2 = V2) |>\n mutate(label = test_tags)\nggplot(fashion_umap_df, aes(x = UMAP1, \n y = UMAP2,\n colour = label)) +\n geom_point(size = 0.5, alpha=0.5) +\n scale_color_discrete_qualitative(palette = \"Dynamic\")\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nThere is a lot of clustering in this view, but it mostly doesn’t match the classes. Trouser is the only class that appears to be primarily in one cluster.\n\n\n\n\n\nLast task is to explain on what was learned from the confusion matrix to examine the uncertainty in predictions from the predictive probabilities. Because there are 10 classes, these will fall in a 9D simplex. Each vertex is the spot where the model is completely certain about the prediction. Points along an edge indicate confusion only between two classes. Points on a triangular face indicate confusion between three classes. The code below will create the visualisation of the predictive probabilities, focusing on four of the 10 classes to make it a little simpler to digest.\n\n\n# Generate the projection to 9D\nproj <- t(geozoo::f_helmert(10)[-1,])\nf_nn_v_p <- as.matrix(fashion_test_pred[,1:10]) %*% proj\ncolnames(f_nn_v_p) <- c(\"x1\", \"x2\", \"x3\", \"x4\", \"x5\", \"x6\", \"x7\", \"x8\", \"x9\")\n\nf_nn_v_p <- f_nn_v_p %>%\n as.data.frame() %>%\n mutate(class = test_tags)\n\nsimp <- geozoo::simplex(p=9)\nsp <- data.frame(simp$points)\ncolnames(sp) <- c(\"x1\", \"x2\", \"x3\", \"x4\", \"x5\", \"x6\", \"x7\", \"x8\", \"x9\")\nsp$class = \"\"\nf_nn_v_p_s <- bind_rows(sp, f_nn_v_p) %>%\n mutate(class = ifelse(class %in% c(\"T-shirt/top\",\n \"Pullover\",\n \"Shirt\",\n \"Coat\"), class, \"Other\")) %>%\n mutate(class = factor(class, levels=c(\"T-shirt/top\",\n \"Pullover\",\n \"Shirt\",\n \"Coat\",\n \"Other\"))) \n\n\nanimate_xy(f_nn_v_p_s[,1:9], col = f_nn_v_p_s$class, \n axes = \"off\", cex=0.2,\n edges = as.matrix(simp$edges),\n edges.width = 0.05,\n palette = \"Viridis\")\n\n\n\n\n10. Ways to improve the model\nThere are many ways to improve neural networks. If you have time, read through the approaches taken in the HOML book. It used the digits data, but the approaches will be suitable to apply to the fashion data. Try:\n\nincreasing the number of epochs (don’t think this helps here)\ntry adding batch processing using batch size\nuse a validation set split\ntry a different number of nodes in the hidden layer\nexpand the number of layers\nadd batch normalisation\nuse regularisation at each layer\nadd dropout for each layer\nexperiment with the learning rate"
+ "text": "Exercises:\nOpen your project for this unit called iml.Rproj. We will be working through the tutorial at TensorFlow for R for fitting and predicting the fashion MNIST image data.\n\n1. Get the data\nWe use the Fashion MNIST dataset which contains 70,000 grayscale images in 10 categories of articles sold on Zalando’s multi-brand, digital platform for fashion, beauty, and lifestyle.\n\n# download the data\nfashion_mnist <- dataset_fashion_mnist()\n\n# split into input variables and response\nc(train_images, train_labels) %<-% fashion_mnist$train\nc(test_images, test_labels) %<-% fashion_mnist$test\n\n# for interpretation we also define the category names\nclass_names = c('T-shirt/top',\n 'Trouser',\n 'Pullover',\n 'Dress',\n 'Coat',\n 'Sandal',\n 'Shirt',\n 'Sneaker',\n 'Bag',\n 'Ankle boot')\n\n\n\n2. What’s in the data?\nCheck how many observations are in the training and test sets, and plot some of the images.\n\ndim(train_images)\ndim(train_labels)\ndim(test_images)\ndim(test_labels)\n\n# Choose an image randomly\nimg <- as.data.frame(train_images[sample(1:60000, 1), , ])\ncolnames(img) <- seq_len(ncol(img))\nimg$y <- seq_len(nrow(img))\nimg <- img |>\n pivot_longer(cols = -y,\n names_to=\"x\", \n values_to=\"value\") |>\n mutate(x = as.integer(x))\n\nggplot(img, aes(x = x, y = y, fill = value)) +\n geom_tile() +\n scale_fill_gradient(low = \"white\", \n high = \"black\", \n na.value = NA) +\n scale_y_reverse() +\n theme_map() +\n theme(legend.position = \"none\")\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\n\n[1] 60000 28 28\n\n\n[1] 60000\n\n\n[1] 10000 28 28\n\n\n[1] 10000\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n3. Pre-process the data\nIt may not be necessary, says Patrick, but we’ll scale the data to 0-1, before modeling.\n\ntrain_images <- train_images / 255\ntest_images <- test_images / 255\n\n\n\n4. Set up the model\nThe model architecture will have:\n\na flatten layer to turn the images into vectors\none hidden layer with 128 nodes with (rectified) linear activation\nfinal layer with 10 nodes and logistic activation\n\nWhy 10 nodes in the last layer? Why 128 nodes in the hidden layer?\n\nmodel_fashion_mnist <- keras_model_sequential()\nmodel_fashion_mnist |>\n # flatten the image data into a long vector\n layer_flatten(input_shape = c(28, 28)) |>\n # hidden layer with 128 units\n layer_dense(units = 128, activation = 'relu') |>\n # output layer for 10 categories\n layer_dense(units = 10, activation = 'softmax')\n\nSet the optimizer to be adam, loss function to be sparse_categorical_crossentropy and accuracy as the metric. What other optimizers could be used? What is the sparse_catgorical_crossentropy?\n\nmodel_fashion_mnist |> compile(\n optimizer = 'adam',\n loss = 'sparse_categorical_crossentropy',\n metrics = c('accuracy')\n)\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\nThere are 10 classes, so we need 10 nodes in the final layer.\nThe choice of 128 nodes in the hidden layer is arbitrary. It means that we are reducing the dimension down from 784 to 128 at this point.\nSparse categorical cross-entropy is an extension of the categorical cross-entropy loss function that is used when the output labels are represented in a sparse matrix format. The labels are represented as a single index value rather than a binary matrix.\nhttps://keras.io/api/optimizers/ has a list of optimizers available.\n\n\n\n\n\n\n5. Fit the model\n\nmodel_fashion_mnist |> fit(train_images,\n train_labels,\n epochs = 5,\n verbose = 0)\n\n\n\n6. Evaluate the model\n\nfmnist_score <- model_fashion_mnist |> \n evaluate(test_images, test_labels, verbose = 0)\n\nfmnist_score\n\nCheck with other people in class. Do you get the same result? If not, why would this be?\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\n\n loss accuracy \n0.516321 0.831100 \n\n\nEach person has started the optimizer with a different random seed, since we didn’t set one. You could try to set the seed using tensorflow::set_random_seed(), and have your neighbour do the same, to check if you get the same result. You will need to clean your environment before attempting this because if you fit the model again it will update the current one rather than starting afresh.\n\n\n\n\n\n\n7. Predict the test set\nWhich classes are most often confused?\n\ntest_tags <- factor(class_names[test_labels + 1],\n levels = class_names)\n\nfashion_test_pred <- predict(model_fashion_mnist,\n test_images, verbose = 0)\nfashion_test_pred_cat <- levels(test_tags)[\n apply(fashion_test_pred, 1,\n which.max)]\npredicted <- factor(\n fashion_test_pred_cat,\n levels=levels(test_tags)) |>\n as.numeric() - 1\nobserved <- as.numeric(test_tags) -1\ntable(observed, predicted)\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\n\n predicted\nobserved 0 1 2 3 4 5 6 7 8 9\n 0 759 4 24 36 6 2 156 0 13 0\n 1 4 954 0 31 6 0 3 0 2 0\n 2 18 18 722 7 194 1 34 0 6 0\n 3 33 6 13 807 69 1 67 0 4 0\n 4 0 10 107 12 827 0 41 0 3 0\n 5 0 1 0 0 0 940 0 39 1 19\n 6 93 9 182 28 184 0 482 0 22 0\n 7 0 0 0 0 0 38 0 924 0 38\n 8 6 2 8 12 3 4 13 6 946 0\n 9 1 0 0 0 0 14 0 34 1 950\n\n\nThere are several classes that have some confusion with other classes, particularly 6 with 0, 2, 4. But other classes are most often confused with at least one other. Classes 1, 5, 7, 8, 9 are rarely confused.\n\n\n\n\n\n\n8. Compute metrics\nCompute the accuracy of the model on the test set. How does this compare with the accuracy reported when you fitted the model?\nIf the model equally accurate on all classes? If not, which class(es) is(are) poorly fitted?\n\nfashion_test_pred <- fashion_test_pred |>\n cbind(observed, predicted)\n\nfashion_test_pred <- fashion_test_pred |>\n as.tibble() |>\n mutate(label = test_tags,\n plabel = factor(class_names[predicted+1], \n levels = levels(test_tags)))\n\naccuracy(fashion_test_pred, label, plabel)\nbal_accuracy(fashion_test_pred, label, plabel)\nfashion_test_pred |>\n count(label, plabel) |>\n group_by(label) |>\n mutate(Accuracy = ifelse(sum(n)>0, n[plabel==label]/sum(n), 0)) |>\n pivot_wider(names_from = \"plabel\", \n values_from = n, \n values_fill = 0) |>\n select(label, Accuracy)\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\n\n# A tibble: 1 × 3\n .metric .estimator .estimate\n <chr> <chr> <dbl>\n1 accuracy multiclass 0.831\n\n\n# A tibble: 1 × 3\n .metric .estimator .estimate\n <chr> <chr> <dbl>\n1 bal_accuracy macro 0.906\n\n\n# A tibble: 10 × 2\n# Groups: label [10]\n label Accuracy\n <fct> <dbl>\n 1 T-shirt/top 0.759\n 2 Trouser 0.954\n 3 Pullover 0.722\n 4 Dress 0.807\n 5 Coat 0.827\n 6 Sandal 0.94 \n 7 Shirt 0.482\n 8 Sneaker 0.924\n 9 Bag 0.946\n10 Ankle boot 0.95 \n\n\n\n\n\n\n\n\n9. Investigating the results\nThis section is motivated by the examples in Cook and Laa (2024). Focus on the test data to investigate the fit, and lack of fit.\n\nPCA can be used to reduce the dimension down from 784, to a small number of PCS, to examine the nature of differences between the classes. Compute the scree plot to decide on a reasonable number that can be examined in a tour. Plot the first two statically. Explain how the class structure matches any clustering.\n\n\ntest_images_flat <- test_images\ndim(test_images_flat) <- c(nrow(test_images_flat), 784)\nimages_pca <- prcomp(as.data.frame(test_images_flat))\nimages_pc <- as.data.frame(images_pca$x)\nggscree(images_pca, q=20, guide=FALSE)\nggplot(images_pc,\n aes(PC1, PC2, color = test_tags)) +\n geom_point(size = 0.1) +\n scale_color_discrete_qualitative(palette = \"Dynamic\") +\n theme(legend.title = element_blank())\n\n\nanimate_xy(images_pc[,1:5], col = test_tags,\n cex=0.2, palette = \"Dynamic\")\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nThere isn’t much separation between classes in the PCs. There is some difference between classes, with overlap between them. It looks less separated than what the confusion matrix would suggest.\n\n\n\n\n\nUMAP can also be used to understand the class structure. Make a 2D UMAP representation and explain how the class structure matches cluster structure.\n\n\nset.seed(253)\nfashion_umap <- umap(test_images_flat, init = \"spca\")\nfashion_umap_df <- fashion_umap |>\n as_tibble() |>\n rename(UMAP1 = V1, UMAP2 = V2) |>\n mutate(label = test_tags)\nggplot(fashion_umap_df, aes(x = UMAP1, \n y = UMAP2,\n colour = label)) +\n geom_point(size = 0.3, alpha=0.5) +\n scale_color_discrete_qualitative(palette = \"Dynamic\") +\n theme(legend.title = element_blank())\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nThere are multiple well-separated clusters in the representation. Mostly these are mixtures of several classes. Only one cluster mostly matches an article, Trouser.\n\n\n\n\n\nInterestingly, the nodes in the hidden layer can be thought of as 128 new variables which are linear combinations of the original 784 variables. This is too many to visualise but we can again use PCA to reduce their dimension again, and make plots.\n\n\nactivations_model_fashion <- keras_model(\n inputs = model_fashion_mnist$input,\n outputs = model_fashion_mnist$layers[[2]]$output\n)\nactivations_fashion <- predict(\n activations_model_fashion,\n test_images, verbose = 0)\n\n# PCA for activations\nactivations_pca <- prcomp(activations_fashion)\nactivations_pc <- as.data.frame(activations_pca$x)\n\nggscree(activations_pca, q=20, guide=FALSE)\n\nggplot(activations_pc,\n aes(PC1, PC2, color = test_tags)) +\n geom_point(size = 0.1) +\n ggtitle(\"Activations\") +\n scale_color_discrete_qualitative(palette = \"Dynamic\") \n\n\nanimate_xy(activations_pc[,1:5], col = test_tags,\n cex=0.2, palette = \"Dynamic\")\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nThere substantial separation between classes in the PCs of these new variables. It looks now reasonable that the classes are distinguishable as the confusion matrix suggests.\n\n\n\n\n\nSimilarly, we can general a 2D representation using UMAP of these new variables.\n\n\nset.seed(253)\nfashion_umap <- umap(activations_fashion, init = \"spca\")\nfashion_umap_df <- fashion_umap |>\n as_tibble() |>\n rename(UMAP1 = V1, UMAP2 = V2) |>\n mutate(label = test_tags)\nggplot(fashion_umap_df, aes(x = UMAP1, \n y = UMAP2,\n colour = label)) +\n geom_point(size = 0.5, alpha=0.5) +\n scale_color_discrete_qualitative(palette = \"Dynamic\")\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nThere is a lot of clustering in this view, but it mostly doesn’t match the classes. Trouser is the only class that appears to be primarily in one cluster.\n\n\n\n\n\nLast task is to explain on what was learned from the confusion matrix to examine the uncertainty in predictions from the predictive probabilities. Because there are 10 classes, these will fall in a 9D simplex. Each vertex is the spot where the model is completely certain about the prediction. Points along an edge indicate confusion only between two classes. Points on a triangular face indicate confusion between three classes. The code below will create the visualisation of the predictive probabilities, focusing on four of the 10 classes to make it a little simpler to digest.\n\n\n# Generate the projection to 9D\nproj <- t(geozoo::f_helmert(10)[-1,])\nf_nn_v_p <- as.matrix(fashion_test_pred[,1:10]) %*% proj\ncolnames(f_nn_v_p) <- c(\"x1\", \"x2\", \"x3\", \"x4\", \"x5\", \"x6\", \"x7\", \"x8\", \"x9\")\n\nf_nn_v_p <- f_nn_v_p %>%\n as.data.frame() %>%\n mutate(class = test_tags)\n\nsimp <- geozoo::simplex(p=9)\nsp <- data.frame(simp$points)\ncolnames(sp) <- c(\"x1\", \"x2\", \"x3\", \"x4\", \"x5\", \"x6\", \"x7\", \"x8\", \"x9\")\nsp$class = \"\"\nf_nn_v_p_s <- bind_rows(sp, f_nn_v_p) %>%\n mutate(class = ifelse(class %in% c(\"T-shirt/top\",\n \"Pullover\",\n \"Shirt\",\n \"Coat\"), class, \"Other\")) %>%\n mutate(class = factor(class, levels=c(\"T-shirt/top\",\n \"Pullover\",\n \"Shirt\",\n \"Coat\",\n \"Other\"))) \n\n\nanimate_xy(f_nn_v_p_s[,1:9], col = f_nn_v_p_s$class, \n axes = \"off\", cex=0.2,\n edges = as.matrix(simp$edges),\n edges.width = 0.05,\n palette = \"Viridis\")\n\n\n\n\n10. Ways to improve the model\nThere are many ways to improve neural networks. If you have time, read through the approaches taken in the HOML book. It used the digits data, but the approaches will be suitable to apply to the fashion data. Try:\n\nincreasing the number of epochs (don’t think this helps here)\ntry adding batch processing using batch size\nuse a validation set split\ntry a different number of nodes in the hidden layer\nexpand the number of layers\nadd batch normalisation\nuse regularisation at each layer\nadd dropout for each layer\nexperiment with the learning rate"
},
{
"objectID": "week7/tutorialsol.html#finishing-up",
@@ -1414,11 +1414,102 @@
"text": "Next: Re-sampling and regularisation\n\n\n\nETC3250/5250 Lecture 2 | iml.numbat.space"
},
{
- "objectID": "week12/index.html#presentations-from-masters-students",
- "href": "week12/index.html#presentations-from-masters-students",
- "title": "Week 12: Project presentations by Masters students",
- "section": "Presentations from Masters students",
- "text": "Presentations from Masters students"
+ "objectID": "week12/tutorial.html",
+ "href": "week12/tutorial.html",
+ "title": "ETC3250/5250 Tutorial 12",
+ "section": "",
+ "text": "Load the libraries and avoid conflicts, and prepare data\n# Load libraries used everywhere\nlibrary(tidyverse)\nlibrary(tidymodels)\nlibrary(tidyclust)\nlibrary(patchwork)\nlibrary(mulgar)\nlibrary(tourr)\nlibrary(geozoo)\nlibrary(colorspace)\nlibrary(visdat)\nlibrary(missForest)\nlibrary(ggthemes)\nlibrary(conflicted)\nconflicts_prefer(dplyr::filter)\nconflicts_prefer(dplyr::select)\nconflicts_prefer(dplyr::slice)"
+ },
+ {
+ "objectID": "week12/tutorial.html#objectives",
+ "href": "week12/tutorial.html#objectives",
+ "title": "ETC3250/5250 Tutorial 12",
+ "section": "🎯 Objectives",
+ "text": "🎯 Objectives\nThe goal for this week is learn to about fitting support vector machine models."
+ },
+ {
+ "objectID": "week12/tutorial.html#preparation",
+ "href": "week12/tutorial.html#preparation",
+ "title": "ETC3250/5250 Tutorial 12",
+ "section": "🔧 Preparation",
+ "text": "🔧 Preparation\n\nMake sure you have all the necessary libraries installed."
+ },
+ {
+ "objectID": "week12/tutorial.html#exercises",
+ "href": "week12/tutorial.html#exercises",
+ "title": "ETC3250/5250 Tutorial 12",
+ "section": "Exercises:",
+ "text": "Exercises:\n\n1. Re-do Julia Silge’s post on hierarchical clustering\nHave a read through Julia’s post on clustering Health Care Indicators. This activity is to repeat her analysis, handling missing values prior to clustering and examining the results. However the R code is a bit old style, so we will update it, too.\n\nlibrary(RSocrata)\nhealth_all <- read.socrata(\"https://opendata.utah.gov/resource/qmsu-gki4.csv\")\n#health_all[,3:67] <- lapply(health_all[,3:67], as.numeric)\n#health_all <- health_all[c(-1),]\nhealth_all <- make_clean_names(health_all)\n\n# Notice the missing values\nvis_dat(health_all)\n\n# Clean names, standardise\nhealth <- health_all[,c(2,4:5,18,22,24,27,31,34,\n 36,38,42,44,48,51,55,\n 60,63,64)]\ncolnames(health) <- c(\"County\", \n \"PercentUnder18\",\n \"PercentOver65\",\n \"DiabeticRate\", \n \"HIVRate\",\n \"PrematureMortalityRate\",\n \"InfantMortalityRate\",\n \"ChildMortalityRate\",\n \"LimitedAccessToFood\",\n \"FoodInsecure\", \n \"MotorDeathRate\",\n \"DrugDeathRate\",\n \"Uninsured\", \n \"UninsuredChildren\",\n \"HealthCareCosts\", \n \"CouldNotSeeDr\",\n \"MedianIncome\",\n \"ChildrenFreeLunch\",\n \"HomicideRate\")\nhealth_std <- health |>\n mutate_if(is.numeric, \n function(x) (x-mean(x, na.rm=TRUE))/\n sd(x, na.rm=TRUE))\n\n# Impute missing values\nhealth_std_imputed <- missForest(health_std[,-1])"
+ },
+ {
+ "objectID": "week12/tutorial.html#finishing-up",
+ "href": "week12/tutorial.html#finishing-up",
+ "title": "ETC3250/5250 Tutorial 12",
+ "section": "👋 Finishing up",
+ "text": "👋 Finishing up\nMake sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult."
+ },
+ {
+ "objectID": "week12/tutorialsol.html",
+ "href": "week12/tutorialsol.html",
+ "title": "ETC3250/5250 Tutorial 12",
+ "section": "",
+ "text": "Load the libraries and avoid conflicts, and prepare data\n# Load libraries used everywhere\nlibrary(tidyverse)\nlibrary(tidymodels)\nlibrary(tidyclust)\nlibrary(patchwork)\nlibrary(mulgar)\nlibrary(tourr)\nlibrary(geozoo)\nlibrary(colorspace)\nlibrary(visdat)\nlibrary(missForest)\nlibrary(ggthemes)\nlibrary(conflicted)\nconflicts_prefer(dplyr::filter)\nconflicts_prefer(dplyr::select)\nconflicts_prefer(dplyr::slice)"
+ },
+ {
+ "objectID": "week12/tutorialsol.html#objectives",
+ "href": "week12/tutorialsol.html#objectives",
+ "title": "ETC3250/5250 Tutorial 12",
+ "section": "🎯 Objectives",
+ "text": "🎯 Objectives\nThe goal for this week is learn to about fitting support vector machine models."
+ },
+ {
+ "objectID": "week12/tutorialsol.html#preparation",
+ "href": "week12/tutorialsol.html#preparation",
+ "title": "ETC3250/5250 Tutorial 12",
+ "section": "🔧 Preparation",
+ "text": "🔧 Preparation\n\nMake sure you have all the necessary libraries installed."
+ },
+ {
+ "objectID": "week12/tutorialsol.html#exercises",
+ "href": "week12/tutorialsol.html#exercises",
+ "title": "ETC3250/5250 Tutorial 12",
+ "section": "Exercises:",
+ "text": "Exercises:\n\n1. Re-do Julia Silge’s post on hierarchical clustering\nHave a read through Julia’s post on clustering Health Care Indicators. This activity is to repeat her analysis, handling missing values prior to clustering and examining the results. However the R code is a bit old style, so we will update it, too.\n\nlibrary(RSocrata)\nhealth_all <- read.socrata(\"https://opendata.utah.gov/resource/qmsu-gki4.csv\")\n#health_all[,3:67] <- lapply(health_all[,3:67], as.numeric)\n#health_all <- health_all[c(-1),]\nhealth_all <- make_clean_names(health_all)\n\n# Notice the missing values\nvis_dat(health_all)\n\n# Clean names, standardise\nhealth <- health_all[,c(2,4:5,18,22,24,27,31,34,\n 36,38,42,44,48,51,55,\n 60,63,64)]\ncolnames(health) <- c(\"County\", \n \"PercentUnder18\",\n \"PercentOver65\",\n \"DiabeticRate\", \n \"HIVRate\",\n \"PrematureMortalityRate\",\n \"InfantMortalityRate\",\n \"ChildMortalityRate\",\n \"LimitedAccessToFood\",\n \"FoodInsecure\", \n \"MotorDeathRate\",\n \"DrugDeathRate\",\n \"Uninsured\", \n \"UninsuredChildren\",\n \"HealthCareCosts\", \n \"CouldNotSeeDr\",\n \"MedianIncome\",\n \"ChildrenFreeLunch\",\n \"HomicideRate\")\nhealth_std <- health |>\n mutate_if(is.numeric, \n function(x) (x-mean(x, na.rm=TRUE))/\n sd(x, na.rm=TRUE))\n\n# Impute missing values\nhealth_std_imputed <- missForest(health_std[,-1])"
+ },
+ {
+ "objectID": "week12/tutorialsol.html#finishing-up",
+ "href": "week12/tutorialsol.html#finishing-up",
+ "title": "ETC3250/5250 Tutorial 12",
+ "section": "👋 Finishing up",
+ "text": "👋 Finishing up\nMake sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult."
+ },
+ {
+ "objectID": "week11/index.html",
+ "href": "week11/index.html",
+ "title": "Week 11: Evaluating your clustering model",
+ "section": "",
+ "text": "Cook and Laa Ch 12"
+ },
+ {
+ "objectID": "week11/index.html#main-reference",
+ "href": "week11/index.html#main-reference",
+ "title": "Week 11: Evaluating your clustering model",
+ "section": "",
+ "text": "Cook and Laa Ch 12"
+ },
+ {
+ "objectID": "week11/index.html#what-you-will-learn-this-week",
+ "href": "week11/index.html#what-you-will-learn-this-week",
+ "title": "Week 11: Evaluating your clustering model",
+ "section": "What you will learn this week",
+ "text": "What you will learn this week\n\nConfusion tables\nCluster metrics"
+ },
+ {
+ "objectID": "week11/index.html#assignments",
+ "href": "week11/index.html#assignments",
+ "title": "Week 11: Evaluating your clustering model",
+ "section": "Assignments",
+ "text": "Assignments\n\nProject is due on Friday 17 May."
},
{
"objectID": "week10/index.html",
@@ -1441,6 +1532,20 @@
"section": "What you will learn this week",
"text": "What you will learn this week\n\nModels of multimodality using Gaussian mixtures\nFitting model-based clustering\nDiagnostics for the model fit\nSelf-organising maps and dimension reduction"
},
+ {
+ "objectID": "week10/index.html#lecture-slides",
+ "href": "week10/index.html#lecture-slides",
+ "title": "Week 10: Model-based clustering and self-organising maps",
+ "section": "Lecture slides",
+ "text": "Lecture slides\n\nhtml\npdf\nqmd\nR"
+ },
+ {
+ "objectID": "week10/index.html#tutorial-instructions",
+ "href": "week10/index.html#tutorial-instructions",
+ "title": "Week 10: Model-based clustering and self-organising maps",
+ "section": "Tutorial instructions",
+ "text": "Tutorial instructions\nInstructions:\n\nhtml\nqmd"
+ },
{
"objectID": "week10/index.html#assignments",
"href": "week10/index.html#assignments",
@@ -1946,32 +2051,81 @@
"text": "👋 Finishing up\nMake sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult."
},
{
- "objectID": "week11/index.html",
- "href": "week11/index.html",
- "title": "Week 11: Evaluating your clustering model",
+ "objectID": "week10/tutorialsol.html",
+ "href": "week10/tutorialsol.html",
+ "title": "ETC3250/5250 Tutorial 10",
"section": "",
- "text": "Cook and Laa Ch 12"
+ "text": "Load the libraries and avoid conflicts, and prepare data\n# Load libraries used everywhere\nlibrary(tidyverse)\nlibrary(tidymodels)\nlibrary(tidyclust)\nlibrary(purrr)\nlibrary(ggdendro)\nlibrary(fpc)\nlibrary(patchwork)\nlibrary(mulgar)\nlibrary(tourr)\nlibrary(geozoo)\nlibrary(ggbeeswarm)\nlibrary(colorspace)\nlibrary(plotly)\nlibrary(ggthemes)\nlibrary(conflicted)\nconflicts_prefer(dplyr::filter)\nconflicts_prefer(dplyr::select)\nconflicts_prefer(dplyr::slice)\nconflicts_prefer(purrr::map)"
},
{
- "objectID": "week11/index.html#main-reference",
- "href": "week11/index.html#main-reference",
- "title": "Week 11: Evaluating your clustering model",
+ "objectID": "week10/tutorialsol.html#objectives",
+ "href": "week10/tutorialsol.html#objectives",
+ "title": "ETC3250/5250 Tutorial 10",
+ "section": "🎯 Objectives",
+ "text": "🎯 Objectives\nThe goal for this week is learn to about fitting support vector machine models."
+ },
+ {
+ "objectID": "week10/tutorialsol.html#preparation",
+ "href": "week10/tutorialsol.html#preparation",
+ "title": "ETC3250/5250 Tutorial 10",
+ "section": "🔧 Preparation",
+ "text": "🔧 Preparation\n\nMake sure you have all the necessary libraries installed."
+ },
+ {
+ "objectID": "week10/tutorialsol.html#exercises",
+ "href": "week10/tutorialsol.html#exercises",
+ "title": "ETC3250/5250 Tutorial 10",
+ "section": "Exercises:",
+ "text": "Exercises:\n\n1. How would you cluster this data?\n\nHow would you cluster this data?\n\n\n\n\n\n\n\n\n\n\n\nDerive a distance metric that will capture your clusters. Provide some evidence that it satisfies the four distance rules.\nCompute your rule on the data, and establish that it does indeed capture your clusters.\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\nMy distance will use the radial distance from \\((0, 0)\\). You first convert each point into polar coordinates, but only take the radius. This will always be positive. The distance between two points will be the absolute value of the difference between these values.\n\nmydist <- function(x1, x2) {\n d <- abs(sqrt(sum(x1^2)) - sqrt(sum(x2^2)))\n return(d)\n}\nch_dist <- matrix(0, nrow(challenge), nrow(challenge)) \nfor (i in 1:(nrow(challenge)-1)) {\n for (j in (i+1):nrow(challenge)) {\n ch_dist[i,j] <- mydist(challenge[i,], challenge[j,])\n ch_dist[j,i] <- ch_dist[i,j]\n }\n}\n#x <- as.vector(ch_dist)\n#hist(x, 30) # you can see it is bimodal\nch_dist <- as.dist(ch_dist)\nch_hc <- hclust(ch_dist, method=\"ward.D2\")\nch_cl <- challenge |>\n mutate(cl = factor(cutree(ch_hc, 2)))\nggplot(ch_cl, aes(x=V1, y=V2, colour=cl)) + \n geom_point() +\n scale_color_discrete_divergingx(palette=\"Zissou 1\")\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n2. Clustering spotify data with k-means\nThis exercise is motivated by this blog post on using \\(k\\)-means to identify anomalies.\nYou can read the data with this code. And because for clustering you need to first standardise the data the code will also do this. Variables mode and time_signature are removed because they have just a few values.\n\n# https://towardsdatascience.com/unsupervised-anomaly-detection-on-spotify-data-k-means-vs-local-outlier-factor-f96ae783d7a7\nspotify <- read_csv(\"https://raw.githubusercontent.com/isaacarroyov/spotify_anomalies_kmeans-lof/main/data/songs_atributtes_my_top_100_2016-2021.csv\")\nspotify_std <- spotify |>\n mutate_if(is.numeric, function(x) (x-mean(x))/sd(x)) |>\n select(-c(mode, time_signature)) # variables with few values\n\n\nMake a plot of all of the variables. This could be a density or a jittered dotplot (beeswarm::quasirandom). Many of the variables have skewed distributions. For cluster analysis, why might this be a problem? From the blog post, are any of the anomalies reported ones that can be seen as outliers in a single skewed variable?\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\np <- spotify_std |>\n pivot_longer(danceability:artist_popularity,\n names_to=\"var\", values_to=\"value\") |>\n ggplot(aes(x=var, y=value, label=name)) +\n geom_quasirandom() + \n coord_flip() +\n xlab(\"\") \nggplotly(p)\n\n\n\n\n\nThe skewed variables are speechiness, liveliness, instrumentalness, artist_popularity, accousticness and possible we could mark duration_ms as having some skewness but also some low anomalies.\nYes, for example “Free Bird” and “Sparkle” could be found by simply examining a single variable.\n\n\n\n\n\nTransform the skewed variables to be as symmetric as possible, and then fit a \\(k=3\\)-means clustering. Extract and report these metrics: totss, tot.withinss, betweenss. What is the ratio of within to between SS?\n\n\n# Transforming some variables: imperfect\nspotify_tf <- spotify |>\n mutate(speechiness = log10(speechiness),\n liveness = log10(liveness),\n duration_ms = log10(duration_ms),\n danceability = danceability^2,\n artist_popularity = artist_popularity^2,\n acousticness = log10(acousticness)) |>\n select(-c(mode, time_signature, instrumentalness)) |>\n mutate_if(is.numeric, function(x) (x-mean(x))/sd(x)) \n\nspotify_tf |>\n pivot_longer(danceability:artist_popularity,\n names_to=\"var\", values_to=\"value\") |>\n ggplot(aes(x=var, y=value, label=name)) +\n geom_quasirandom() + \n coord_flip() +\n xlab(\"\") \n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\n# Check that it clusters\nset.seed(131)\nspotify_km3 <- kmeans(spotify_tf[,5:15], 3)\n# Summarise means of each cluster\ntidy(spotify_km3)\n\n# A tibble: 3 × 14\n danceability energy key loudness speechiness acousticness liveness valence\n <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>\n1 0.788 0.128 0.223 0.203 0.359 0.225 -0.139 0.668\n2 -0.409 0.667 -0.192 0.504 -0.168 -0.793 0.317 -0.157\n3 -0.666 -1.19 -0.0761 -1.07 -0.330 0.807 -0.243 -0.846\n# ℹ 6 more variables: tempo <dbl>, duration_ms <dbl>, artist_popularity <dbl>,\n# size <int>, withinss <dbl>, cluster <fct>\n\nglance(spotify_km3)\n\n# A tibble: 1 × 4\n totss tot.withinss betweenss iter\n <dbl> <dbl> <dbl> <int>\n1 5566. 4377. 1189. 3\n\nspotify_kcl3 <- augment(spotify_km3, spotify_tf)\nspotify_kcl3 |>\n pivot_longer(danceability:artist_popularity,\n names_to=\"var\", values_to=\"value\") |>\n ggplot(aes(x=.cluster, y=value)) +\n geom_quasirandom() +\n facet_wrap(~var, ncol=4)\n\n\n\n\n\n\n\n\nThe differences between clusters is mostly in acousticness, danceability, energy, loudness, valence.\n\n\n\n\n\nNow the algorithm \\(k=1, ..., 20\\). Extract the metrics, and plot the ratio of within SS to between SS against \\(k\\). What would be suggested as the best model?\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\n# Run many k\nspotify_km <- \n tibble(k = 1:20) %>%\n mutate(\n kclust = purrr::map(k, ~kmeans(spotify_tf[,5:15], .x)),\n tidied = purrr::map(kclust, tidy),\n glanced = purrr::map(kclust, glance),\n augmented = purrr::map(kclust, augment, spotify_tf)\n )\n\n# Plot statistics\nspotify_cl <- \n spotify_km %>%\n unnest(cols = c(glanced))\nggplot(spotify_cl, aes(x=k, y=tot.withinss)) +\n geom_point() +\n geom_line()\n\n\n\n\n\n\n\n\nMaybe 11, but the within SS decays very gradually so it is hard to tell.\n\n\n\n\n\nDivide the data into 11 clusters, and examine the number of songs in each. Using plotly, mouse over the resulting plot and explore songs belonging to a cluster. (I don’t know much about these songs, but if you are a music fan maybe discussing with other class members and your tutor about the groupings, like which ones are grouped in clusters with high liveness, high tempo or danceability could be fun.)\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\n# Extract data with different classes labelled\nspotify_assign <- \n spotify_km %>% \n unnest(cols = c(augmented))\nspotify_assign_df <- spotify_assign |>\n select(name:`artist_popularity`, k, .cluster) |>\n pivot_wider(names_from=k, values_from=.cluster)\n\nspotify_assign_df |>\n select(name:`artist_popularity`, `7`) |>\n pivot_longer(danceability:artist_popularity,\n names_to=\"var\", values_to=\"value\") |>\n ggplot(aes(x=`7`, y=value, label=name)) +\n geom_quasirandom() +\n facet_wrap(~var, ncol=4) +\n xlab(\"\") + ylab(\"\")\n\n\n\n\n\n\n\n# ggplotly()\n\n\n\n\n\n\n\n3. Clustering several simulated data sets with know cluster structure\n\nIn tutorial of week 3 you used the tour to visualise the data sets c1 and c3 provided with the mulgar package. Review what you said about the structure in these data sets, and write down your expectations for how a cluster analysis would divide the data.\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\n\nanimate_xy(c1)\nanimate_xy(c3)\n\nWe said:\n\nc1 has 6 clusters, 4 small ones, and two big ones. The two big ones look like planes because they have no variation in some dimensions. We would expect that a clustering analysis divides the data into these 6 clusters.\nc3 has a triangular prism shape, which itself is divided into several smaller triangular prisms. It also has several dimensions with no variation, because the points collapse into a line in some projections. We would expect the clustering analysis to divide the data into four clusters corresponding mostly to the four vertices.\n\n\n\n\n\n\nCompute \\(k\\)-means and hierarchical clustering on these two data sets, without standardising them. Use a variety of \\(k\\), linkage methods and check the resulting clusters using the cluster metrics. What method produces the best result, relative to what you said in a. (NOTE: Although we said that we should always standardise variables before doing clustering, you should not do this for c3. Why?)\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\n\nThe scale of the variables is meaningful for these data sets. Even though some variables have smaller variance than others we would treat them to be measured on the same scale.\n\\(k\\)-means for c1:\n\nc1_km <- \n tibble(k = 1:10) |>\n mutate(\n kclust = map(k, ~kmeans(c1, .x)),\n tidied = map(kclust, tidy),\n glanced = map(kclust, glance),\n augmented = map(kclust, augment, c1)\n )\n\n# Plot statistics\nc1_cl <- \n c1_km %>%\n unnest(cols = c(glanced))\nggplot(c1_cl, aes(x=k, y=tot.withinss)) +\n geom_point() +\n geom_line()\n\n\n\n\n\n\n\n# Extract data with different classes labelled\nc1_assign <- \n c1_km %>% \n unnest(cols = c(augmented))\nc1_assign_df <- c1_assign |>\n select(x1:x6, k, .cluster) |>\n pivot_wider(names_from=k, values_from=.cluster)\n\nThe statistics suggest that 3 clusters is the best solution. If we look at the 6 clusters solution, it is different from what we would expect, one large cluster is divided, as is a small cluster.\n\nanimate_xy(c1_assign_df[,1:6], col=c1_assign_df$`3`)\nanimate_xy(c1_assign_df[,1:6], col=c1_assign_df$`6`)\n\nHierarchical for c1:\n\nc1_hc <- hclust(dist(c1), method=\"ward.D2\")\nggdendrogram(c1_hc, size = 4)\n\n\n\n\n\n\n\nf.hc <- NULL; f.hc.stats <- NULL\nfor (i in 2:10) {\n cl <- cutree(c1_hc, i)\n x <- cluster.stats(dist(c1), cl)\n f.hc <- cbind(f.hc, cl)\n f.hc.stats <- rbind(f.hc.stats, \n c(x$within.cluster.ss, \n x$wb.ratio, x$ch,\n x$pearsongamma, x$dunn, \n x$dunn2))\n}\ncolnames(f.hc.stats) <- c(\"within.cluster.ss\",\n \"wb.ratio\", \"ch\",\n \"pearsongamma\", \"dunn\",\n \"dunn2\")\nf.hc.stats <- data.frame(f.hc.stats)\nf.hc.stats$cl <- 2:10\nf.hc.stats.m <- f.hc.stats %>% \n pivot_longer(`within.cluster.ss`:dunn2,\n names_to=\"stat\", \n values_to=\"value\") |>\n mutate(stat = factor(stat,\n levels=colnames(f.hc.stats)))\nggplot(data=f.hc.stats.m) + \n geom_line(aes(x=cl, y=value)) + xlab(\"# clusters\") + ylab(\"\") +\n facet_wrap(~stat, ncol=3, scales = \"free_y\") \n\n\n\n\n\n\n\n\nThe dendrogram suggests 3 clusters, and the statistics all agree and suggest 3 clusters. Although, ch possibly suggests 10!\n\ncolnames(f.hc) <- paste0(\"hc\", 2:10)\nf.hc <- data.frame(f.hc) |>\n mutate_all(as.factor)\nc1_assign_df <- bind_cols(c1_assign_df, f.hc)\nanimate_xy(c1_assign_df[,1:6], col=c1_assign_df$hc3)\n\nThe solution is the same as for \\(k\\)-means. The algorithms all treat the three small clusters as a single cluster.\nHierarchical for c3:\nIt’s not really clear how the data should be divided so we’ll start with using hierarchical.\n\nc3_hc <- hclust(dist(c3), method=\"ward.D2\")\nggdendrogram(c3_hc, size = 4)\n\n\n\n\n\n\n\nf.hc <- NULL; f.hc.stats <- NULL\nfor (i in 2:10) {\n cl <- cutree(c3_hc, i)\n x <- cluster.stats(dist(c3), cl)\n f.hc <- cbind(f.hc, cl)\n f.hc.stats <- rbind(f.hc.stats, \n c(x$within.cluster.ss, \n x$wb.ratio, x$ch,\n x$pearsongamma, x$dunn, \n x$dunn2))\n}\ncolnames(f.hc.stats) <- c(\"within.cluster.ss\",\n \"wb.ratio\", \"ch\",\n \"pearsongamma\", \"dunn\",\n \"dunn2\")\nf.hc.stats <- data.frame(f.hc.stats)\nf.hc.stats$cl <- 2:10\nf.hc.stats.m <- f.hc.stats %>% \n pivot_longer(`within.cluster.ss`:dunn2,\n names_to=\"stat\", \n values_to=\"value\") |>\n mutate(stat = factor(stat,\n levels=colnames(f.hc.stats)))\nggplot(data=f.hc.stats.m) + \n geom_line(aes(x=cl, y=value)) + xlab(\"# clusters\") + ylab(\"\") +\n facet_wrap(~stat, ncol=3, scales = \"free_y\") \n\n\n\n\n\n\n\n\nThe dendrogram suggests 4 clusters, and many of the statistics suggest 4 clusters.\n\ncolnames(f.hc) <- paste0(\"hc\", 2:10)\nf.hc <- data.frame(f.hc) |>\n mutate_all(as.factor)\nc3_assign_df <- bind_cols(c3, f.hc)\nanimate_xy(c3_assign_df[,1:6], col=c3_assign_df$hc4)\n\nThe four cluster solution is almost what we would expect, that they divide the data on the four vertices of the tetrahedron. One of the vertices is a little confused with one other.\nIf you run \\(k\\)-means, I’d expect it does similarly to this solution. If you choose a different linkage method likely the results will change a lot.\n\n\n\n\n\nThere are five other data sets in the mulgar package. Choose one or two or more to examine how they would be clustered. (I particularly would like to see how c4 is clustered.)"
+ },
+ {
+ "objectID": "week10/tutorialsol.html#finishing-up",
+ "href": "week10/tutorialsol.html#finishing-up",
+ "title": "ETC3250/5250 Tutorial 10",
+ "section": "👋 Finishing up",
+ "text": "👋 Finishing up\nMake sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult."
+ },
+ {
+ "objectID": "week10/tutorial.html",
+ "href": "week10/tutorial.html",
+ "title": "ETC3250/5250 Tutorial 10",
"section": "",
- "text": "Cook and Laa Ch 12"
+ "text": "Load the libraries and avoid conflicts, and prepare data\n# Load libraries used everywhere\nlibrary(tidyverse)\nlibrary(tidymodels)\nlibrary(tidyclust)\nlibrary(purrr)\nlibrary(ggdendro)\nlibrary(fpc)\nlibrary(patchwork)\nlibrary(mulgar)\nlibrary(tourr)\nlibrary(geozoo)\nlibrary(ggbeeswarm)\nlibrary(colorspace)\nlibrary(plotly)\nlibrary(ggthemes)\nlibrary(conflicted)\nconflicts_prefer(dplyr::filter)\nconflicts_prefer(dplyr::select)\nconflicts_prefer(dplyr::slice)\nconflicts_prefer(purrr::map)"
},
{
- "objectID": "week11/index.html#what-you-will-learn-this-week",
- "href": "week11/index.html#what-you-will-learn-this-week",
- "title": "Week 11: Evaluating your clustering model",
- "section": "What you will learn this week",
- "text": "What you will learn this week\n\nConfusion tables\nCluster metrics"
+ "objectID": "week10/tutorial.html#objectives",
+ "href": "week10/tutorial.html#objectives",
+ "title": "ETC3250/5250 Tutorial 10",
+ "section": "🎯 Objectives",
+ "text": "🎯 Objectives\nThe goal for this week is learn to about fitting support vector machine models."
},
{
- "objectID": "week11/index.html#assignments",
- "href": "week11/index.html#assignments",
- "title": "Week 11: Evaluating your clustering model",
- "section": "Assignments",
- "text": "Assignments\n\nProject is due on Friday 17 May."
+ "objectID": "week10/tutorial.html#preparation",
+ "href": "week10/tutorial.html#preparation",
+ "title": "ETC3250/5250 Tutorial 10",
+ "section": "🔧 Preparation",
+ "text": "🔧 Preparation\n\nMake sure you have all the necessary libraries installed."
+ },
+ {
+ "objectID": "week10/tutorial.html#exercises",
+ "href": "week10/tutorial.html#exercises",
+ "title": "ETC3250/5250 Tutorial 10",
+ "section": "Exercises:",
+ "text": "Exercises:\n\n1. How would you cluster this data?\n\nHow would you cluster this data?\n\n\n\n\n\n\n\n\n\n\n\nDerive a distance metric that will capture your clusters. Provide some evidence that it satisfies the four distance rules.\nCompute your rule on the data, and establish that it does indeed capture your clusters.\n\n\n\n2. Clustering spotify data with k-means\nThis exercise is motivated by this blog post on using \\(k\\)-means to identify anomalies.\nYou can read the data with this code. And because for clustering you need to first standardise the data the code will also do this. Variables mode and time_signature are removed because they have just a few values.\n\n# https://towardsdatascience.com/unsupervised-anomaly-detection-on-spotify-data-k-means-vs-local-outlier-factor-f96ae783d7a7\nspotify <- read_csv(\"https://raw.githubusercontent.com/isaacarroyov/spotify_anomalies_kmeans-lof/main/data/songs_atributtes_my_top_100_2016-2021.csv\")\nspotify_std <- spotify |>\n mutate_if(is.numeric, function(x) (x-mean(x))/sd(x)) |>\n select(-c(mode, time_signature)) # variables with few values\n\n\nMake a plot of all of the variables. This could be a density or a jittered dotplot (beeswarm::quasirandom). Many of the variables have skewed distributions. For cluster analysis, why might this be a problem? From the blog post, are any of the anomalies reported ones that can be seen as outliers in a single skewed variable?\n\n\nTransform the skewed variables to be as symmetric as possible, and then fit a \\(k=3\\)-means clustering. Extract and report these metrics: totss, tot.withinss, betweenss. What is the ratio of within to between SS?\n\n\n# Transforming some variables: imperfect\nspotify_tf <- spotify |>\n mutate(speechiness = log10(speechiness),\n liveness = log10(liveness),\n duration_ms = log10(duration_ms),\n danceability = danceability^2,\n artist_popularity = artist_popularity^2,\n acousticness = log10(acousticness)) |>\n select(-c(mode, time_signature, instrumentalness)) |>\n mutate_if(is.numeric, function(x) (x-mean(x))/sd(x)) \n\nspotify_tf |>\n pivot_longer(danceability:artist_popularity,\n names_to=\"var\", values_to=\"value\") |>\n ggplot(aes(x=var, y=value, label=name)) +\n geom_quasirandom() + \n coord_flip() +\n xlab(\"\") \n\n\n\n\n\n\n\n\n\nNow the algorithm \\(k=1, ..., 20\\). Extract the metrics, and plot the ratio of within SS to between SS against \\(k\\). What would be suggested as the best model?\n\n\nDivide the data into 11 clusters, and examine the number of songs in each. Using plotly, mouse over the resulting plot and explore songs belonging to a cluster. (I don’t know much about these songs, but if you are a music fan maybe discussing with other class members and your tutor about the groupings, like which ones are grouped in clusters with high liveness, high tempo or danceability could be fun.)\n\n\n\n3. Clustering several simulated data sets with know cluster structure\n\nIn tutorial of week 3 you used the tour to visualise the data sets c1 and c3 provided with the mulgar package. Review what you said about the structure in these data sets, and write down your expectations for how a cluster analysis would divide the data.\n\n\nCompute \\(k\\)-means and hierarchical clustering on these two data sets, without standardising them. Use a variety of \\(k\\), linkage methods and check the resulting clusters using the cluster metrics. What method produces the best result, relative to what you said in a. (NOTE: Although we said that we should always standardise variables before doing clustering, you should not do this for c3. Why?)\n\n\nThere are five other data sets in the mulgar package. Choose one or two or more to examine how they would be clustered. (I particularly would like to see how c4 is clustered.)"
+ },
+ {
+ "objectID": "week10/tutorial.html#finishing-up",
+ "href": "week10/tutorial.html#finishing-up",
+ "title": "ETC3250/5250 Tutorial 10",
+ "section": "👋 Finishing up",
+ "text": "👋 Finishing up\nMake sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult."
+ },
+ {
+ "objectID": "week12/index.html#presentations-from-masters-students",
+ "href": "week12/index.html#presentations-from-masters-students",
+ "title": "Week 12: Project presentations by Masters students",
+ "section": "Presentations from Masters students",
+ "text": "Presentations from Masters students"
},
{
"objectID": "week2/index.html",
@@ -3056,7 +3210,7 @@
"href": "week9/slides.html#seeing-the-clusters-using-spin-and-brush",
"title": "ETC3250/5250 Introduction to Machine Learning",
"section": "Seeing the clusters using spin-and-brush",
- "text": "Seeing the clusters using spin-and-brush\n\nlibrary(detourr)\nset.seed(645)\ndetour(p_std[,1:4], \n tour_aes(projection = bl:bm)) |>\n tour_path(grand_tour(2), fps = 60, \n max_bases=40) |>\n show_scatter(alpha = 0.7, \n axes = FALSE)"
+ "text": "Seeing the clusters using spin-and-brush\n\nlibrary(detourr)\nset.seed(645)\ndetour(p_std[,2:5], \n tour_aes(projection = bl:bm)) |>\n tour_path(grand_tour(2), fps = 60, \n max_bases=40) |>\n show_scatter(alpha = 0.7, \n axes = FALSE)"
},
{
"objectID": "week9/slides.html#how-do-you-measure-close",
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index 74f2dfd2..94051aaf 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -65,12 +65,20 @@
2024-03-19T04:48:10.743Z
- https://iml.numbat.space/week12/index.html
- 2024-02-05T21:32:06.286Z
+ https://iml.numbat.space/week12/tutorial.html
+ 2024-05-02T21:43:55.598Z
+
+
+ https://iml.numbat.space/week12/tutorialsol.html
+ 2024-05-02T21:43:55.598Z
+
+
+ https://iml.numbat.space/week11/index.html
+ 2024-02-05T21:31:46.270Zhttps://iml.numbat.space/week10/index.html
- 2024-02-05T21:30:53.696Z
+ 2024-05-02T22:55:00.669Zhttps://iml.numbat.space/week1/slides.html
@@ -101,8 +109,16 @@
2024-02-21T23:44:11.489Z
- https://iml.numbat.space/week11/index.html
- 2024-02-05T21:31:46.270Z
+ https://iml.numbat.space/week10/tutorialsol.html
+ 2024-05-02T23:28:15.869Z
+
+
+ https://iml.numbat.space/week10/tutorial.html
+ 2024-05-02T23:28:15.869Z
+
+
+ https://iml.numbat.space/week12/index.html
+ 2024-02-05T21:32:06.286Zhttps://iml.numbat.space/week2/index.html
@@ -166,6 +182,6 @@
https://iml.numbat.space/week9/slides.html
- 2024-04-30T22:47:18.109Z
+ 2024-05-01T02:54:57.040Z
diff --git a/docs/week1/slides.html b/docs/week1/slides.html
index 3a1e502e..6a4299f6 100644
--- a/docs/week1/slides.html
+++ b/docs/week1/slides.html
@@ -2013,7 +2013,7 @@
Receiver Operator Curves (ROC)
The balance of getting it right, without predicting everything as positive.
The goal for this week is learn to about fitting support vector machine models.
+
+
+
🔧 Preparation
+
+
Make sure you have all the necessary libraries installed.
+
+
+
+
Exercises:
+
+
1. How would you cluster this data?
+
+
How would you cluster this data?
+
+
+
+
+
+
+
+
+
+
Derive a distance metric that will capture your clusters. Provide some evidence that it satisfies the four distance rules.
+
Compute your rule on the data, and establish that it does indeed capture your clusters.
+
+
+
+
2. Clustering spotify data with k-means
+
This exercise is motivated by this blog post on using \(k\)-means to identify anomalies.
+
You can read the data with this code. And because for clustering you need to first standardise the data the code will also do this. Variables mode and time_signature are removed because they have just a few values.
+
+
# https://towardsdatascience.com/unsupervised-anomaly-detection-on-spotify-data-k-means-vs-local-outlier-factor-f96ae783d7a7
+spotify <-read_csv("https://raw.githubusercontent.com/isaacarroyov/spotify_anomalies_kmeans-lof/main/data/songs_atributtes_my_top_100_2016-2021.csv")
+spotify_std <- spotify |>
+mutate_if(is.numeric, function(x) (x-mean(x))/sd(x)) |>
+select(-c(mode, time_signature)) # variables with few values
+
+
+
Make a plot of all of the variables. This could be a density or a jittered dotplot (beeswarm::quasirandom). Many of the variables have skewed distributions. For cluster analysis, why might this be a problem? From the blog post, are any of the anomalies reported ones that can be seen as outliers in a single skewed variable?
+
+
+
Transform the skewed variables to be as symmetric as possible, and then fit a \(k=3\)-means clustering. Extract and report these metrics: totss, tot.withinss, betweenss. What is the ratio of within to between SS?
Now the algorithm \(k=1, ..., 20\). Extract the metrics, and plot the ratio of within SS to between SS against \(k\). What would be suggested as the best model?
+
+
+
Divide the data into 11 clusters, and examine the number of songs in each. Using plotly, mouse over the resulting plot and explore songs belonging to a cluster. (I don’t know much about these songs, but if you are a music fan maybe discussing with other class members and your tutor about the groupings, like which ones are grouped in clusters with high liveness, high tempo or danceability could be fun.)
+
+
+
+
3. Clustering several simulated data sets with know cluster structure
+
+
In tutorial of week 3 you used the tour to visualise the data sets c1 and c3 provided with the mulgar package. Review what you said about the structure in these data sets, and write down your expectations for how a cluster analysis would divide the data.
+
+
+
Compute \(k\)-means and hierarchical clustering on these two data sets, without standardising them. Use a variety of \(k\), linkage methods and check the resulting clusters using the cluster metrics. What method produces the best result, relative to what you said in a. (NOTE: Although we said that we should always standardise variables before doing clustering, you should not do this for c3. Why?)
+
+
+
There are five other data sets in the mulgar package. Choose one or two or more to examine how they would be clustered. (I particularly would like to see how c4 is clustered.)
+
+
+
+
+
👋 Finishing up
+
Make sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult.
+
+
+
+
+
+
+
+
+
+
+
+
\ No newline at end of file
diff --git a/docs/week10/tutorial.qmd b/docs/week10/tutorial.qmd
new file mode 100644
index 00000000..9eda276f
--- /dev/null
+++ b/docs/week10/tutorial.qmd
@@ -0,0 +1,437 @@
+---
+title: "ETC3250/5250 Tutorial 10"
+subtitle: "K-means and hierarchical clustering"
+author: "Prof. Di Cook"
+date: "2024-05-06"
+quarto-required: ">=1.3.0"
+format:
+ unilur-html:
+ output-file: tutorial.html
+ embed-resources: true
+ css: "../assets/tutorial.css"
+ unilur-html+solution:
+ output-file: tutorialsol.html
+ embed-resources: true
+ css: "../assets/tutorial.css"
+unilur-solution: true
+---
+
+```{r echo=FALSE}
+# Set up chunk for all slides
+knitr::opts_chunk$set(
+ fig.width = 4,
+ fig.height = 4,
+ fig.align = "center",
+ out.width = "100%",
+ code.line.numbers = FALSE,
+ fig.retina = 3,
+ echo = TRUE,
+ message = FALSE,
+ warning = FALSE,
+ cache = FALSE,
+ dev.args = list(pointsize = 11)
+)
+```
+
+```{r}
+#| echo: true
+#| code-fold: true
+#| code-summary: "Load the libraries and avoid conflicts, and prepare data"
+# Load libraries used everywhere
+library(tidyverse)
+library(tidymodels)
+library(tidyclust)
+library(purrr)
+library(ggdendro)
+library(fpc)
+library(patchwork)
+library(mulgar)
+library(tourr)
+library(geozoo)
+library(ggbeeswarm)
+library(colorspace)
+library(plotly)
+library(ggthemes)
+library(conflicted)
+conflicts_prefer(dplyr::filter)
+conflicts_prefer(dplyr::select)
+conflicts_prefer(dplyr::slice)
+conflicts_prefer(purrr::map)
+```
+
+```{r}
+#| echo: false
+# Set plot theme
+theme_set(theme_bw(base_size = 14) +
+ theme(
+ aspect.ratio = 1,
+ plot.background = element_rect(fill = 'transparent', colour = NA),
+ plot.title.position = "plot",
+ plot.title = element_text(size = 24),
+ panel.background = element_rect(fill = 'transparent', colour = NA),
+ legend.background = element_rect(fill = 'transparent', colour = NA),
+ legend.key = element_rect(fill = 'transparent', colour = NA)
+ )
+)
+
+```
+
+## `r emo::ji("target")` Objectives
+
+The goal for this week is learn to about fitting support vector machine models.
+
+## `r emo::ji("wrench")` Preparation
+
+- Make sure you have all the necessary libraries installed.
+
+## Exercises:
+
+#### 1. How would you cluster this data?
+
+a. How would you cluster this data?
+
+```{r}
+#| echo: false
+set.seed(840)
+challenge <- sphere.hollow(p=2, n=200)$points |> as_tibble()
+challenge[1:100, ] <- challenge[1:100, ] * 2.2
+challenge <- challenge |>
+ mutate(V1 = V1+rnorm(200, 0, 0.15),
+ V2 = V2+rnorm(200, 0, 0.15))
+ggplot(challenge, aes(x=V1, y=V2)) + geom_point()
+```
+
+b. Derive a distance metric that will capture your clusters. Provide some evidence that it satisfies the four distance rules.
+
+c. Compute your rule on the data, and establish that it does indeed capture your clusters.
+
+
+::: unilur-solution
+
+My distance will use the radial distance from $(0, 0)$. You first convert each point into polar coordinates, but only take the radius. This will always be positive. The distance between two points will be the absolute value of the difference between these values.
+
+
+```{r}
+mydist <- function(x1, x2) {
+ d <- abs(sqrt(sum(x1^2)) - sqrt(sum(x2^2)))
+ return(d)
+}
+ch_dist <- matrix(0, nrow(challenge), nrow(challenge))
+for (i in 1:(nrow(challenge)-1)) {
+ for (j in (i+1):nrow(challenge)) {
+ ch_dist[i,j] <- mydist(challenge[i,], challenge[j,])
+ ch_dist[j,i] <- ch_dist[i,j]
+ }
+}
+#x <- as.vector(ch_dist)
+#hist(x, 30) # you can see it is bimodal
+ch_dist <- as.dist(ch_dist)
+ch_hc <- hclust(ch_dist, method="ward.D2")
+ch_cl <- challenge |>
+ mutate(cl = factor(cutree(ch_hc, 2)))
+ggplot(ch_cl, aes(x=V1, y=V2, colour=cl)) +
+ geom_point() +
+ scale_color_discrete_divergingx(palette="Zissou 1")
+```
+
+:::
+
+#### 2. Clustering spotify data with k-means
+
+This exercise is motivated by [this blog post](https://towardsdatascience.com/unsupervised-anomaly-detection-on-spotify-data-k-means-vs-local-outlier-factor-f96ae783d7a7) on using $k$-means to identify anomalies.
+
+You can read the data with this code. And because for clustering you need to first standardise the data the code will also do this. Variables `mode` and `time_signature` are removed because they have just a few values.
+
+```{r}
+# https://towardsdatascience.com/unsupervised-anomaly-detection-on-spotify-data-k-means-vs-local-outlier-factor-f96ae783d7a7
+spotify <- read_csv("https://raw.githubusercontent.com/isaacarroyov/spotify_anomalies_kmeans-lof/main/data/songs_atributtes_my_top_100_2016-2021.csv")
+spotify_std <- spotify |>
+ mutate_if(is.numeric, function(x) (x-mean(x))/sd(x)) |>
+ select(-c(mode, time_signature)) # variables with few values
+```
+
+a. Make a plot of all of the variables. This could be a density or a jittered dotplot (`beeswarm::quasirandom`).
+Many of the variables have skewed distributions. For cluster analysis, why might this be a problem? From the blog post, are any of the anomalies reported ones that can be seen as outliers in a single skewed variable?
+
+::: unilur-solution
+
+```{r}
+p <- spotify_std |>
+ pivot_longer(danceability:artist_popularity,
+ names_to="var", values_to="value") |>
+ ggplot(aes(x=var, y=value, label=name)) +
+ geom_quasirandom() +
+ coord_flip() +
+ xlab("")
+ggplotly(p)
+
+```
+
+The skewed variables are `speechiness`, `liveliness`, `instrumentalness`, `artist_popularity`, `accousticness` and possible we could mark `duration_ms` as having some skewness but also some low anomalies.
+
+Yes, for example "Free Bird" and "Sparkle" could be found by simply examining a single variable.
+:::
+
+b. Transform the skewed variables to be as symmetric as possible, and then fit a $k=3$-means clustering. Extract and report these metrics: `totss`, `tot.withinss`, `betweenss`. What is the ratio of within to between SS?
+
+```{r}
+# Transforming some variables: imperfect
+spotify_tf <- spotify |>
+ mutate(speechiness = log10(speechiness),
+ liveness = log10(liveness),
+ duration_ms = log10(duration_ms),
+ danceability = danceability^2,
+ artist_popularity = artist_popularity^2,
+ acousticness = log10(acousticness)) |>
+ select(-c(mode, time_signature, instrumentalness)) |>
+ mutate_if(is.numeric, function(x) (x-mean(x))/sd(x))
+
+spotify_tf |>
+ pivot_longer(danceability:artist_popularity,
+ names_to="var", values_to="value") |>
+ ggplot(aes(x=var, y=value, label=name)) +
+ geom_quasirandom() +
+ coord_flip() +
+ xlab("")
+```
+
+::: unilur-solution
+
+```{r}
+# Check that it clusters
+set.seed(131)
+spotify_km3 <- kmeans(spotify_tf[,5:15], 3)
+# Summarise means of each cluster
+tidy(spotify_km3)
+glance(spotify_km3)
+spotify_kcl3 <- augment(spotify_km3, spotify_tf)
+spotify_kcl3 |>
+ pivot_longer(danceability:artist_popularity,
+ names_to="var", values_to="value") |>
+ ggplot(aes(x=.cluster, y=value)) +
+ geom_quasirandom() +
+ facet_wrap(~var, ncol=4)
+```
+
+The differences between clusters is mostly in `acousticness`, `danceability`, `energy`, `loudness`, `valence`.
+
+:::
+
+c. Now the algorithm $k=1, ..., 20$. Extract the metrics, and plot the ratio of within SS to between SS against $k$. What would be suggested as the best model?
+
+::: unilur-solution
+
+```{r}
+# Run many k
+spotify_km <-
+ tibble(k = 1:20) %>%
+ mutate(
+ kclust = purrr::map(k, ~kmeans(spotify_tf[,5:15], .x)),
+ tidied = purrr::map(kclust, tidy),
+ glanced = purrr::map(kclust, glance),
+ augmented = purrr::map(kclust, augment, spotify_tf)
+ )
+
+# Plot statistics
+spotify_cl <-
+ spotify_km %>%
+ unnest(cols = c(glanced))
+ggplot(spotify_cl, aes(x=k, y=tot.withinss)) +
+ geom_point() +
+ geom_line()
+
+```
+
+Maybe 11, but the within SS decays very gradually so it is hard to tell.
+:::
+
+d. Divide the data into 11 clusters, and examine the number of songs in each. Using plotly, mouse over the resulting plot and explore songs belonging to a cluster. (I don't know much about these songs, but if you are a music fan maybe discussing with other class members and your tutor about the groupings, like which ones are grouped in clusters with high `liveness`, high `tempo` or `danceability` could be fun.)
+
+::: unilur-solution
+```{r}
+# Extract data with different classes labelled
+spotify_assign <-
+ spotify_km %>%
+ unnest(cols = c(augmented))
+spotify_assign_df <- spotify_assign |>
+ select(name:`artist_popularity`, k, .cluster) |>
+ pivot_wider(names_from=k, values_from=.cluster)
+
+spotify_assign_df |>
+ select(name:`artist_popularity`, `7`) |>
+ pivot_longer(danceability:artist_popularity,
+ names_to="var", values_to="value") |>
+ ggplot(aes(x=`7`, y=value, label=name)) +
+ geom_quasirandom() +
+ facet_wrap(~var, ncol=4) +
+ xlab("") + ylab("")
+# ggplotly()
+```
+:::
+
+#### 3. Clustering several simulated data sets with know cluster structure
+
+a. In tutorial of week 3 you used the tour to visualise the data sets `c1` and `c3` provided with the `mulgar` package. Review what you said about the structure in these data sets, and write down your expectations for how a cluster analysis would divide the data.
+
+::: unilur-solution
+
+
+```{r}
+#| eval: false
+animate_xy(c1)
+animate_xy(c3)
+```
+
+We said:
+
+- `c1` has 6 clusters, 4 small ones, and two big ones. The two big ones look like planes because they have no variation in some dimensions. We would expect that a clustering analysis divides the data into these 6 clusters.
+- `c3` has a triangular prism shape, which itself is divided into several smaller triangular prisms. It also has several dimensions with no variation, because the points collapse into a line in some projections. We would expect the clustering analysis to divide the data into four clusters corresponding mostly to the four vertices.
+:::
+
+b. Compute $k$-means and hierarchical clustering on these two data sets, without standardising them. Use a variety of $k$, linkage methods and check the resulting clusters using the cluster metrics. What method produces the best result, relative to what you said in a. (NOTE: Although we said that we should always standardise variables before doing clustering, you should not do this for `c3`. Why?)
+
+::: unilur-solution
+The scale of the variables is meaningful for these data sets. Even though some variables have smaller variance than others we would treat them to be measured on the same scale.
+
+**$k$-means for `c1`:**
+
+```{r}
+c1_km <-
+ tibble(k = 1:10) |>
+ mutate(
+ kclust = map(k, ~kmeans(c1, .x)),
+ tidied = map(kclust, tidy),
+ glanced = map(kclust, glance),
+ augmented = map(kclust, augment, c1)
+ )
+
+# Plot statistics
+c1_cl <-
+ c1_km %>%
+ unnest(cols = c(glanced))
+ggplot(c1_cl, aes(x=k, y=tot.withinss)) +
+ geom_point() +
+ geom_line()
+
+# Extract data with different classes labelled
+c1_assign <-
+ c1_km %>%
+ unnest(cols = c(augmented))
+c1_assign_df <- c1_assign |>
+ select(x1:x6, k, .cluster) |>
+ pivot_wider(names_from=k, values_from=.cluster)
+```
+
+The statistics suggest that 3 clusters is the best solution. If we look at the 6 clusters solution, it is different from what we would expect, one large cluster is divided, as is a small cluster.
+
+```{r}
+#| eval: false
+animate_xy(c1_assign_df[,1:6], col=c1_assign_df$`3`)
+animate_xy(c1_assign_df[,1:6], col=c1_assign_df$`6`)
+```
+
+**Hierarchical for `c1`:**
+
+```{r}
+c1_hc <- hclust(dist(c1), method="ward.D2")
+ggdendrogram(c1_hc, size = 4)
+
+f.hc <- NULL; f.hc.stats <- NULL
+for (i in 2:10) {
+ cl <- cutree(c1_hc, i)
+ x <- cluster.stats(dist(c1), cl)
+ f.hc <- cbind(f.hc, cl)
+ f.hc.stats <- rbind(f.hc.stats,
+ c(x$within.cluster.ss,
+ x$wb.ratio, x$ch,
+ x$pearsongamma, x$dunn,
+ x$dunn2))
+}
+colnames(f.hc.stats) <- c("within.cluster.ss",
+ "wb.ratio", "ch",
+ "pearsongamma", "dunn",
+ "dunn2")
+f.hc.stats <- data.frame(f.hc.stats)
+f.hc.stats$cl <- 2:10
+f.hc.stats.m <- f.hc.stats %>%
+ pivot_longer(`within.cluster.ss`:dunn2,
+ names_to="stat",
+ values_to="value") |>
+ mutate(stat = factor(stat,
+ levels=colnames(f.hc.stats)))
+ggplot(data=f.hc.stats.m) +
+ geom_line(aes(x=cl, y=value)) + xlab("# clusters") + ylab("") +
+ facet_wrap(~stat, ncol=3, scales = "free_y")
+```
+
+The dendrogram suggests 3 clusters, and the statistics all agree and suggest 3 clusters. Although, `ch` possibly suggests 10!
+
+```{r}
+#| eval: false
+colnames(f.hc) <- paste0("hc", 2:10)
+f.hc <- data.frame(f.hc) |>
+ mutate_all(as.factor)
+c1_assign_df <- bind_cols(c1_assign_df, f.hc)
+animate_xy(c1_assign_df[,1:6], col=c1_assign_df$hc3)
+```
+
+The solution is the same as for $k$-means. The algorithms all treat the three small clusters as a single cluster.
+
+**Hierarchical for `c3`:**
+
+It's not really clear how the data should be divided so we'll start with using hierarchical.
+
+```{r}
+c3_hc <- hclust(dist(c3), method="ward.D2")
+ggdendrogram(c3_hc, size = 4)
+
+f.hc <- NULL; f.hc.stats <- NULL
+for (i in 2:10) {
+ cl <- cutree(c3_hc, i)
+ x <- cluster.stats(dist(c3), cl)
+ f.hc <- cbind(f.hc, cl)
+ f.hc.stats <- rbind(f.hc.stats,
+ c(x$within.cluster.ss,
+ x$wb.ratio, x$ch,
+ x$pearsongamma, x$dunn,
+ x$dunn2))
+}
+colnames(f.hc.stats) <- c("within.cluster.ss",
+ "wb.ratio", "ch",
+ "pearsongamma", "dunn",
+ "dunn2")
+f.hc.stats <- data.frame(f.hc.stats)
+f.hc.stats$cl <- 2:10
+f.hc.stats.m <- f.hc.stats %>%
+ pivot_longer(`within.cluster.ss`:dunn2,
+ names_to="stat",
+ values_to="value") |>
+ mutate(stat = factor(stat,
+ levels=colnames(f.hc.stats)))
+ggplot(data=f.hc.stats.m) +
+ geom_line(aes(x=cl, y=value)) + xlab("# clusters") + ylab("") +
+ facet_wrap(~stat, ncol=3, scales = "free_y")
+```
+
+The dendrogram suggests 4 clusters, and many of the statistics suggest 4 clusters.
+
+```{r}
+#| eval: false
+colnames(f.hc) <- paste0("hc", 2:10)
+f.hc <- data.frame(f.hc) |>
+ mutate_all(as.factor)
+c3_assign_df <- bind_cols(c3, f.hc)
+animate_xy(c3_assign_df[,1:6], col=c3_assign_df$hc4)
+```
+
+The four cluster solution is almost what we would expect, that they divide the data on the four vertices of the tetrahedron. One of the vertices is a little confused with one other.
+
+If you run $k$-means, I'd expect it does similarly to this solution. If you choose a different linkage method likely the results will change a lot.
+:::
+
+c. There are five other data sets in the `mulgar` package. Choose one or two or more to examine how they would be clustered. (I particularly would like to see how `c4` is clustered.)
+
+
+## `r emo::ji("wave")` Finishing up
+
+Make sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult.
diff --git a/docs/week10/tutorialsol.html b/docs/week10/tutorialsol.html
new file mode 100644
index 00000000..4b11a9f4
--- /dev/null
+++ b/docs/week10/tutorialsol.html
@@ -0,0 +1,7897 @@
+
+
+
+
+
+
+
+
+
+
+
+ETC3250/5250 Introduction to Machine Learning - ETC3250/5250 Tutorial 10
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
ETC3250/5250 Tutorial 10
+
K-means and hierarchical clustering
+
+
+
+
+
+
+
+
Author
+
+
Prof. Di Cook
+
+
+
+
+
Published
+
+
6 May 2024
+
+
+
+
+
+
+
+
+
+
+
+
+
+Load the libraries and avoid conflicts, and prepare data
+
The goal for this week is learn to about fitting support vector machine models.
+
+
+
🔧 Preparation
+
+
Make sure you have all the necessary libraries installed.
+
+
+
+
Exercises:
+
+
1. How would you cluster this data?
+
+
How would you cluster this data?
+
+
+
+
+
+
+
+
+
+
Derive a distance metric that will capture your clusters. Provide some evidence that it satisfies the four distance rules.
+
Compute your rule on the data, and establish that it does indeed capture your clusters.
+
+
+
+
+
+
+
+Solution
+
+
+
+
+
+
+
My distance will use the radial distance from \((0, 0)\). You first convert each point into polar coordinates, but only take the radius. This will always be positive. The distance between two points will be the absolute value of the difference between these values.
+
+
mydist <-function(x1, x2) {
+ d <-abs(sqrt(sum(x1^2)) -sqrt(sum(x2^2)))
+return(d)
+}
+ch_dist <-matrix(0, nrow(challenge), nrow(challenge))
+for (i in1:(nrow(challenge)-1)) {
+for (j in (i+1):nrow(challenge)) {
+ ch_dist[i,j] <-mydist(challenge[i,], challenge[j,])
+ ch_dist[j,i] <- ch_dist[i,j]
+ }
+}
+#x <- as.vector(ch_dist)
+#hist(x, 30) # you can see it is bimodal
+ch_dist <-as.dist(ch_dist)
+ch_hc <-hclust(ch_dist, method="ward.D2")
+ch_cl <- challenge |>
+mutate(cl =factor(cutree(ch_hc, 2)))
+ggplot(ch_cl, aes(x=V1, y=V2, colour=cl)) +
+geom_point() +
+scale_color_discrete_divergingx(palette="Zissou 1")
+
+
+
+
+
+
+
+
+
+
+
+
+
2. Clustering spotify data with k-means
+
This exercise is motivated by this blog post on using \(k\)-means to identify anomalies.
+
You can read the data with this code. And because for clustering you need to first standardise the data the code will also do this. Variables mode and time_signature are removed because they have just a few values.
+
+
# https://towardsdatascience.com/unsupervised-anomaly-detection-on-spotify-data-k-means-vs-local-outlier-factor-f96ae783d7a7
+spotify <-read_csv("https://raw.githubusercontent.com/isaacarroyov/spotify_anomalies_kmeans-lof/main/data/songs_atributtes_my_top_100_2016-2021.csv")
+spotify_std <- spotify |>
+mutate_if(is.numeric, function(x) (x-mean(x))/sd(x)) |>
+select(-c(mode, time_signature)) # variables with few values
+
+
+
Make a plot of all of the variables. This could be a density or a jittered dotplot (beeswarm::quasirandom). Many of the variables have skewed distributions. For cluster analysis, why might this be a problem? From the blog post, are any of the anomalies reported ones that can be seen as outliers in a single skewed variable?
The skewed variables are speechiness, liveliness, instrumentalness, artist_popularity, accousticness and possible we could mark duration_ms as having some skewness but also some low anomalies.
+
Yes, for example “Free Bird” and “Sparkle” could be found by simply examining a single variable.
+
+
+
+
+
+
Transform the skewed variables to be as symmetric as possible, and then fit a \(k=3\)-means clustering. Extract and report these metrics: totss, tot.withinss, betweenss. What is the ratio of within to between SS?
The differences between clusters is mostly in acousticness, danceability, energy, loudness, valence.
+
+
+
+
+
+
Now the algorithm \(k=1, ..., 20\). Extract the metrics, and plot the ratio of within SS to between SS against \(k\). What would be suggested as the best model?
Maybe 11, but the within SS decays very gradually so it is hard to tell.
+
+
+
+
+
+
Divide the data into 11 clusters, and examine the number of songs in each. Using plotly, mouse over the resulting plot and explore songs belonging to a cluster. (I don’t know much about these songs, but if you are a music fan maybe discussing with other class members and your tutor about the groupings, like which ones are grouped in clusters with high liveness, high tempo or danceability could be fun.)
3. Clustering several simulated data sets with know cluster structure
+
+
In tutorial of week 3 you used the tour to visualise the data sets c1 and c3 provided with the mulgar package. Review what you said about the structure in these data sets, and write down your expectations for how a cluster analysis would divide the data.
+
+
+
+
+
+
+
+Solution
+
+
+
+
+
+
+
+
animate_xy(c1)
+animate_xy(c3)
+
+
We said:
+
+
c1 has 6 clusters, 4 small ones, and two big ones. The two big ones look like planes because they have no variation in some dimensions. We would expect that a clustering analysis divides the data into these 6 clusters.
+
c3 has a triangular prism shape, which itself is divided into several smaller triangular prisms. It also has several dimensions with no variation, because the points collapse into a line in some projections. We would expect the clustering analysis to divide the data into four clusters corresponding mostly to the four vertices.
+
+
+
+
+
+
+
Compute \(k\)-means and hierarchical clustering on these two data sets, without standardising them. Use a variety of \(k\), linkage methods and check the resulting clusters using the cluster metrics. What method produces the best result, relative to what you said in a. (NOTE: Although we said that we should always standardise variables before doing clustering, you should not do this for c3. Why?)
+
+
+
+
+
+
+
+Solution
+
+
+
+
+
+
+
The scale of the variables is meaningful for these data sets. Even though some variables have smaller variance than others we would treat them to be measured on the same scale.
# Extract data with different classes labelled
+c1_assign <-
+ c1_km %>%
+unnest(cols =c(augmented))
+c1_assign_df <- c1_assign |>
+select(x1:x6, k, .cluster) |>
+pivot_wider(names_from=k, values_from=.cluster)
+
+
The statistics suggest that 3 clusters is the best solution. If we look at the 6 clusters solution, it is different from what we would expect, one large cluster is divided, as is a small cluster.
The four cluster solution is almost what we would expect, that they divide the data on the four vertices of the tetrahedron. One of the vertices is a little confused with one other.
+
If you run \(k\)-means, I’d expect it does similarly to this solution. If you choose a different linkage method likely the results will change a lot.
+
+
+
+
+
+
There are five other data sets in the mulgar package. Choose one or two or more to examine how they would be clustered. (I particularly would like to see how c4 is clustered.)
+
+
+
+
+
👋 Finishing up
+
Make sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult.
The goal for this week is learn to about fitting support vector machine models.
+
+
+
🔧 Preparation
+
+
Make sure you have all the necessary libraries installed.
+
+
+
+
Exercises:
+
+
1. Re-do Julia Silge’s post on hierarchical clustering
+
Have a read through Julia’s post on clustering Health Care Indicators. This activity is to repeat her analysis, handling missing values prior to clustering and examining the results. However the R code is a bit old style, so we will update it, too.
The goal for this week is learn to about fitting support vector machine models.
+
+
+
🔧 Preparation
+
+
Make sure you have all the necessary libraries installed.
+
+
+
+
Exercises:
+
+
1. Re-do Julia Silge’s post on hierarchical clustering
+
Have a read through Julia’s post on clustering Health Care Indicators. This activity is to repeat her analysis, handling missing values prior to clustering and examining the results. However the R code is a bit old style, so we will update it, too.
It isn’t necessary to standardise the variables before using the prcomp function because we can set scale=TRUE to have it done as part of the PCA computation. However, it is useful to standardise the variables to make the time series plot where all the currencies are drawn. This is useful for interpreting the principal components.
@@ -7366,7 +7366,7 @@
5. PCA o
-
+
The pattern in PC1 vs PC2 follows time. Prior to the pandemic there is a tangle of values on the left. Towards the end of February, when the world was starting to realise that COVID was a major health threat, there is a dramatic reaction from the world currencies, at least in relation to the USD. Currencies such as EUR, JPY, CHF reacted first, gaining strength relative to USD, and then they lost that strength. Most other currencies reacted later, losing value relative to the USD.
Each person has started the optimizer with a different random seed, since we didn’t set one. You could try to set the seed using tensorflow::set_random_seed(), and have your neighbour do the same, to check if you get the same result. You will need to clean your environment before attempting this because if you fit the model again it will update the current one rather than starting afresh.
There are several classes that have some confusion with other classes, particularly 6 with 0, 2, 4. But other classes are most often confused with at least one other. Classes 1, 5, 7, 8, 9 are rarely confused.