There are 5 different projects in this repository, each covering a different area of the data mining world, from preprocessing and analyzing data to classification, clustering, etc.
A series of data mining sub-projects is implemented in this repository. The goal was to practice using different libraries and to work with different algorithms in order to gain knowledge and experience in this field. A full explanation of each sub-project is given in the following sections.
This project is aimed at preprocessing the Iris dataset using two major preprocessing libraries in machine learning: Pandas and scikit-learn.
The Iris dataset contains information on three types of flowers (iris-setosa, iris-versicolor, iris-virginica). The data collected for each of these categories contains four columns:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
For preprocessing purposes, the missing values were deleted from the dataset. Also, the non-numeric (categorical) data was encoded using label encoding. Although label encoding is a suitable approach to this problem, it can only be applied to ordered data; for encoding data with no inherent order, one-hot encoding is a feasible alternative.
Normalization was also applied to this dataset in order to get better results in the training phase.
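A minimal sketch of these preprocessing steps, assuming the Iris data is loaded from a CSV file; the file name and column names are placeholders and may differ from the ones used in the notebook:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Placeholder file and column names; adjust to the actual dataset layout.
df = pd.read_csv("iris.csv")

# Drop rows with missing values.
df = df.dropna()

# Encode the categorical species column with label encoding.
df["species"] = LabelEncoder().fit_transform(df["species"])

# Normalize the four numeric feature columns to the [0, 1] range.
feature_cols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
df[feature_cols] = MinMaxScaler().fit_transform(df[feature_cols])
```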
Visualizing the data before applying any ML algorithm helps in understanding and analyzing it better. Visualizing data with more than three dimensions is neither easy nor practical, so PCA was applied to reduce the data to two dimensions for visualization purposes.
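A rough sketch of this PCA step, continuing from the preprocessing sketch above (`df` and `feature_cols` are the placeholder names introduced there):

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Project the four Iris features onto two principal components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(df[feature_cols])

# Scatter plot of the 2-D projection, colored by the encoded species label.
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=df["species"])
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()
```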
The main goal of this project was to practice finding and using the best classification algorithms and functions for each type of data distribution. Different parameters, including learning_rate, were analyzed to obtain the best results and accuracy.
In the last section of this project, the fashion_mnist dataset was imported to test the classification algorithm.
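As an illustration (not the exact notebook code), the sketch below assumes the Keras copy of fashion_mnist and uses scikit-learn's SGDClassifier to compare a few learning-rate values; the classifier choice, subsample size, and learning rates are assumptions:

```python
from tensorflow.keras.datasets import fashion_mnist
from sklearn.linear_model import SGDClassifier

# Load fashion_mnist and flatten the 28x28 images into 784-dimensional vectors.
(x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()
x_train = x_train.reshape(len(x_train), -1) / 255.0
x_test = x_test.reshape(len(x_test), -1) / 255.0

# Compare a few learning rates (illustrative values only).
for lr in (0.001, 0.01, 0.1):
    clf = SGDClassifier(learning_rate="constant", eta0=lr, random_state=0)
    clf.fit(x_train[:10000], y_train[:10000])  # subsample to keep the sketch fast
    print(f"learning rate {lr}: test accuracy {clf.score(x_test, y_test):.3f}")
```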
In this project, some clustering algorithms were used to cluster data.
First, a set of random data was generated and then clustered using the k-means algorithm. The drawback of k-means is that the number of clusters has to be set manually. To overcome this, I used the elbow method: the cost function was calculated for 1 to 10 clusters to find the optimal number of clusters.
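A sketch of this elbow-method step on randomly generated blobs; the data and parameter values are illustrative:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate a random 2-D dataset.
X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# Compute the k-means cost (inertia) for 1 to 10 clusters.
costs = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=0, n_init=10).fit(X)
    costs.append(km.inertia_)

# The "elbow" of this curve suggests the optimal number of clusters.
plt.plot(range(1, 11), costs, marker="o")
plt.xlabel("number of clusters")
plt.ylabel("cost (inertia)")
plt.show()
```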
In this section, the digits dataset from scikit-learn was used. The k-means algorithm was applied to group the samples into 10 clusters, and then the centroids were computed. As shown below, the centroids almost accurately represent the labels of their clusters:
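A rough sketch of how these centroids can be obtained and displayed; parameter values are illustrative:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

# Cluster the 64-dimensional digit images into 10 groups.
digits = load_digits()
km = KMeans(n_clusters=10, random_state=0, n_init=10).fit(digits.data)

# Each centroid is a 64-dimensional vector; reshape it into an 8x8 image.
fig, axes = plt.subplots(2, 5, figsize=(8, 3))
for ax, centroid in zip(axes.ravel(), km.cluster_centers_):
    ax.imshow(centroid.reshape(8, 8), cmap="gray")
    ax.axis("off")
plt.show()
```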
In this section, the k-means algorithm was used to compress images. To achieve this, clustering was performed on the colors of the image: the image was reshaped into a matrix of shape (rows*cols, 3), and each pixel vector was grouped into one of 4 clusters using k-means. The image was then reshaped back to its original shape and saved. The original and compressed images are shown below:
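A minimal sketch of this compression step, assuming the image is read with matplotlib; the file names are placeholders, and the use of 4 clusters follows the description above:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Read an RGB(A) image and keep only the three color channels.
img = plt.imread("image.png")[:, :, :3]
rows, cols, _ = img.shape

# Cluster the pixel colors into 4 groups.
pixels = img.reshape(rows * cols, 3)
km = KMeans(n_clusters=4, random_state=0, n_init=10).fit(pixels)

# Replace every pixel with its cluster centroid and restore the original shape.
compressed = km.cluster_centers_[km.labels_].reshape(rows, cols, 3)
plt.imsave("compressed.png", compressed)
```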
The DBSCAN algorithm was used in this section to cluster more complex data distributions that cannot be properly clustered with k-means. To use this algorithm, it is necessary to set MinPts and eps properly. To find the optimal eps, the KNN algorithm was used (a sketch of this procedure is given after the list below).
However, MinPts was set using the following rules and some experimentation:
- The larger the dataset, the larger the value of MinPts should be.
- If the dataset is noisier, choose a larger value of MinPts.
- Generally, MinPts should be greater than or equal to the dimensionality of the dataset.
- For 2-dimensional data, use DBSCAN's default value of MinPts = 4 (Ester et al., 1996).
- If your data has more than 2 dimensions, choose MinPts = 2*dim, where dim is the dimensionality of your dataset (Sander et al., 1998).
Also, once the optimal eps is known, MinPts can be fine-tuned by testing different values.
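A sketch of this procedure on an illustrative two-moons dataset; the dataset, the eps value, and the MinPts = 2*dim rule of thumb are assumptions made for the example:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

# Illustrative non-convex data that k-means handles poorly.
X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

# MinPts chosen with the 2*dim rule of thumb (4 for 2-D data).
min_pts = 2 * X.shape[1]

# k-distance plot: for every point, take the distance to the farthest of its
# MinPts nearest neighbors and sort these distances; the "knee" of the curve
# is a reasonable choice for eps.
neighbors = NearestNeighbors(n_neighbors=min_pts).fit(X)
distances, _ = neighbors.kneighbors(X)
k_distances = np.sort(distances[:, -1])
plt.plot(k_distances)
plt.ylabel(f"distance to {min_pts}-th nearest neighbor")
plt.show()

# Run DBSCAN with the chosen parameters (eps value here is illustrative).
labels = DBSCAN(eps=0.3, min_samples=min_pts).fit_predict(X)
```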
Association rules represent relationships and interdependencies between large sets of data items.
A common example of association rule discovery is "shopping cart analysis": based on the various items that customers put in their shopping carts, their buying habits and behavior are analyzed, and by identifying relationships between products, recurring shopping patterns can be found.
Three important parameters:
- Support shows the popularity of a set of items according to the frequency of transactions.
- Confidence shows the probability of buying item y if item x is bought (x -> y).
- Lift is a combination of the above two parameters.

To implement association rules in this project, we use the Apriori algorithm, which is one of the most popular and efficient algorithms in this field.
The lift value calculates the probability of an item occurring if another item has occurred, while also considering the frequency of each of the two items.
The amount of lift can be calculated using the following equation:
lift = confidence / expected_confidence = confidence / ( s(body) * s(head) / s(body) ) = confidence / s(head)
The lift value can range from 0 to infinity.
Three different scenarios can happen:
- If the lift value is greater than 1, it indicates that the body and head of the rule appear together more than expected, meaning that the body event has a positive effect on the head event.
- If the lift value is less than 1, it means that the body and the head of the rule appeared together less than expected, and in this way, the occurrence of the body has a negative effect on the probability of the occurrence of the head.
- If the lift value is close to 1, it shows that the body and the head occur together almost as expected, meaning that the body event will not affect the head event.
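As a small worked example: if s(body) = 0.2, s(head) = 0.25, and s(body and head together) = 0.1, then confidence = 0.1 / 0.2 = 0.5 and lift = 0.5 / 0.25 = 2, so the body makes the head about twice as likely as its baseline frequency.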
The algorithm works by setting a minimum support threshold and iteratively building frequent itemsets. Itemsets whose support falls below the threshold are removed (along with their supersets, which cannot be frequent either), and this process continues until no further pruning is possible.
First, I prepared the dataset in the form of a sparse matrix with the purchased products as columns and the purchase order number as the index. For convenience, the purchased products were coded as 0 and 1 in each column. Then, using the TransactionEncoder function from mlxtend.preprocessing, the transactions were encoded. Finally, the Apriori algorithm was applied with a min_support of 0.07.
Then the implemented extract_rules function was used with dynamic lift and confidence values to extract the association rules.
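This pipeline can be sketched with mlxtend as follows; extract_rules is the project's own helper, so mlxtend's association_rules function is used here as a stand-in, and the sample transactions and thresholds are placeholders:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# One list of purchased products per order (placeholder data).
transactions = [["bread", "milk"], ["bread", "butter"], ["milk", "butter", "bread"]]

# Encode the transactions into a 0/1 matrix with one column per product.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

# Mine frequent itemsets with the same min_support as above.
frequent_itemsets = apriori(onehot, min_support=0.07, use_colnames=True)

# Extract rules, filtering by lift and confidence thresholds.
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.0)
rules = rules[rules["confidence"] >= 0.5]
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```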
The description of this project can be found here