This project uses Python to implement a convex hull algorithm with the divide and conquer strategy and uses it to visualize the linear separability of a dataset. It was created to fulfill Task 2 of the IF2211 Algorithm Strategy course.
The Linear Separability Dataset is a testing utility that checks whether the classes in a dataset overlap by analyzing the convex hull of each target class for each pair of features. The library contains a self-implemented convex hull algorithm based on divide and conquer, specifically QuickHull, together with classes that visualize the linear separability of a dataset.
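As a rough illustration of the QuickHull strategy mentioned above, the following minimal sketch picks the leftmost and rightmost points, splits the remaining points by the side of the dividing line they fall on, and recurses on the point farthest from each line. The names cross and quickhull, and the choice to return vertex indices in clockwise order, are assumptions made for this example and are not this library's actual API.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def cross(a: Point, b: Point, p: Point) -> float:
    """Twice the signed area of triangle (a, b, p); positive if p is left of a->b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def _hull_side(points: Sequence[Point], a: int, b: int,
               candidates: List[int]) -> List[int]:
    """Hull vertices strictly left of the line a->b, ordered from a to b."""
    if not candidates:
        return []
    # Divide: the candidate farthest from line a->b is always a hull vertex.
    far = max(candidates, key=lambda i: cross(points[a], points[b], points[i]))
    # Points inside triangle (a, far, b) are discarded; keep only the outside ones.
    left_1 = [i for i in candidates if cross(points[a], points[far], points[i]) > 0]
    left_2 = [i for i in candidates if cross(points[far], points[b], points[i]) > 0]
    # Conquer: solve both sub-problems and join them around the far point.
    return _hull_side(points, a, far, left_1) + [far] + _hull_side(points, far, b, left_2)


def quickhull(points: Sequence[Point]) -> List[int]:
    """Return indices of the hull vertices in clockwise order from the leftmost point."""
    if len(points) < 2:
        return list(range(len(points)))
    order = sorted(range(len(points)), key=lambda i: points[i])
    a, b = order[0], order[-1]  # leftmost and rightmost points
    if points[a] == points[b]:  # every point is the same point
        return [a]
    upper = [i for i in order if cross(points[a], points[b], points[i]) > 0]
    lower = [i for i in order if cross(points[b], points[a], points[i]) > 0]
    return [a] + _hull_side(points, a, b, upper) + [b] + _hull_side(points, b, a, lower)


square = [(0, 0), (4, 0), (4, 4), (0, 4), (2, 2), (1, 3)]
print(quickhull(square))  # [0, 3, 2, 1]
```

Points that lie exactly on a hull edge are left out because the comparisons are strict. With floating-point input, nearly collinear points can still end up on either side of that comparison.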
- Python >= 3.7
Installing the package automatically installs the other required modules, so this library's dependencies do not need to be installed manually.
- [RECOMMENDED] Use a new Python virtual environment.
- Change directory to this project folder.
- Install the package with the following command:
pip install .
For development, run
pip install -e .
to install the package in editable mode.
- [OPTIONAL] To load datasets from sklearn.datasets, you must use the datasets extra:
pip install .[datasets]
There is also a tests extra for running the unit tests, which requires scipy and the sklearn datasets:
pip install .[tests]
The package can be used as a Python module that can be imported by other programs.
There are two main classes, namely ConvexHull and LinearSeparabilityDataset.
- ConvexHull can be used to find the convex hull of points in 2 dimensions.
- LinearSeparabilityDataset can be used to load data, collect targets, and visualize the convex hull of each feature pair for each target.
For further documentation, refer to the docstrings of each class/function that will be used.
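Two classes are linearly separable in a feature pair exactly when their convex hulls do not intersect. The sketch below shows one possible way to check this for two hulls with the separating axis test. The helper hulls_overlap is hypothetical and not part of this package, which only visualizes the hulls. It assumes each hull is a list of at least three non-collinear (x, y) vertices in boundary order.

```python
from typing import Iterator, List, Sequence, Tuple

Point = Tuple[float, float]


def _edge_normals(hull: Sequence[Point]) -> Iterator[Point]:
    """Yield one normal vector per edge of the polygon."""
    n = len(hull)
    for i in range(n):
        (x1, y1), (x2, y2) = hull[i], hull[(i + 1) % n]
        yield (y1 - y2, x2 - x1)


def _project(hull: Sequence[Point], axis: Point) -> Tuple[float, float]:
    """Project every vertex onto the axis and return the covered interval."""
    dots = [x * axis[0] + y * axis[1] for x, y in hull]
    return min(dots), max(dots)


def hulls_overlap(hull_a: Sequence[Point], hull_b: Sequence[Point]) -> bool:
    """Return False if some edge normal separates the two convex polygons."""
    axes: List[Point] = list(_edge_normals(hull_a)) + list(_edge_normals(hull_b))
    for axis in axes:
        if axis == (0, 0):  # skip degenerate edges made of repeated vertices
            continue
        min_a, max_a = _project(hull_a, axis)
        min_b, max_b = _project(hull_b, axis)
        if max_a < min_b or max_b < min_a:
            return False  # a separating line exists along this axis
    return True  # no separating axis found; touching hulls count as overlapping


class_a = [(0, 0), (2, 0), (2, 2), (0, 2)]
class_b = [(3, 0), (5, 0), (5, 2), (3, 2)]
class_c = [(1, 1), (3, 1), (3, 3), (1, 3)]
print(hulls_overlap(class_a, class_b))  # False
print(hulls_overlap(class_a, class_c))  # True
```

Degenerate hulls such as a single point or a line segment would need extra axes and are deliberately left out of this sketch.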
Here are some sample programs that can be used as a reference:
- Getting the convex hull of a set of points.
from myConvexHull.lib import ConvexHull as MyConvexHull
hull = MyConvexHull([
    (4.3, 3.0),
    (4.6, 3.6),
    (4.4, 3.2)
])
print(hull.simplices) # [(0, 1)]
print(hull.vertices) # [0, 1]
- Visualize the linear separability of the wine dataset for the first and second features as well as the third and fourth features.
from myConvexHull.lib import LinearSeparabilityDataset
from sklearn import datasets
data = datasets.load_wine(as_frame=True)
data = LinearSeparabilityDataset(
    frame=data.frame,
    target_names=data.target_names,
)
data.visualize(0, 1)
data.visualize(2, 3)
- Visualize the convex hull for random numbers.
import numpy as np
import pandas as pd
from myConvexHull.lib import LinearSeparabilityDataset
data = LinearSeparabilityDataset(
    frame=pd.DataFrame(
        {
            'X': np.random.rand(100) * 50,
            'Y': np.random.rand(100) * 25,
            'target': np.random.randint(0, 4, 100),
        },
    ),
    target_names=['A', 'B', 'C', 'D'],
)
data.visualize('X', 'Y')
This package also comes with a main driver program that can be executed from the command line. To see the complete list of arguments, run the following command:
python -m myConvexHull -h
Here are the complete arguments of python -m myConvexHull:
usage: __main__.py [-h] [-f FILE] [-tk TARGET_KEY]
                   [-tn TARGET_NAMES [TARGET_NAMES ...]] [-n DATASET_NAME]
                   -fp FEATURE_PAIR FEATURE_PAIR [-s SIZE SIZE] [-nc]

Main driver of linear separability dataset visualizer. It will generate a plot of convex hull given a dataset.

options:
  -h, --help            show this help message and exit

File Dataset Input:
  -f FILE, --file FILE  Input datasets file. Should have minimum 3 columns: 2 features and a target.
  -tk TARGET_KEY, --target_key TARGET_KEY
                        Target column name.
  -tn TARGET_NAMES [TARGET_NAMES ...], --target_names TARGET_NAMES [TARGET_NAMES ...]
                        Target name list, separated by space.

Sklearn Dataset Input:
  -n DATASET_NAME, --dataset_name DATASET_NAME
                        Name of the dataset.

Visualization Options:
  -fp FEATURE_PAIR FEATURE_PAIR, --feature_pair FEATURE_PAIR FEATURE_PAIR
                        Feature pair to plot. Should be separated by space. You can supply multiple pair of feature.
  -s SIZE SIZE, --size SIZE SIZE
                        Figure size (width, height) of the plot.
  -nc, --no_captions    Disable captions (title, x/y label).
Here are some examples of executable commands:
- Visualize the linear separability of the breast_cancer dataset for the first and second features:
python -m myConvexHull -n breast_cancer -fp 0 1
- Visualize more than one feature pair in the iris dataset. For instance, displaying the visualization for the first and second features, the second and third features, and the features "sepal length (cm)" and "sepal width (cm)":
python -m myConvexHull -n iris -fp 0 1 -fp 1 2 -fp "sepal length (cm)" "sepal width (cm)"
In addition to using indices, features can also be specified using column names in the dataset. Make sure that the names you write correspond to columns in the dataset.
- Visualize data from the file datasets/wine_data.csv, assuming the current working directory is at the root of this package:
python -m myConvexHull -f "datasets/wine_data.csv" -tn Class0 Class1 Class2 -fp 0 1
- Visualize data from the water potability dataset from the provided link. The following example command displays the convex hull for the feature pairs pH and Hardness, as well as Sulfate and Conductivity:
python -m myConvexHull -f "datasets/water_potability.csv" -tn "Not Potable" "Potable" -tk "Potability" -fp 0 1 -fp Sulfate Conductivity
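The commands above pass each feature either as a zero-based index or as a column name. The following minimal sketch shows one way such a value could be mapped to a column name. The function resolve_feature is hypothetical and written only for illustration; the actual driver may resolve features differently, for example when a column is literally named "0".

```python
from typing import Sequence, Union


def resolve_feature(spec: Union[int, str], columns: Sequence[str]) -> str:
    """Map a feature given as an index or a column name to a column name."""
    text = str(spec)
    if text.isdecimal():
        # Assumption: an all-digit value is treated as a zero-based column index.
        index = int(text)
        if index >= len(columns):
            raise IndexError(f"feature index {index} is out of range")
        return columns[index]
    if text not in columns:
        raise KeyError(f"no column named {text!r} in the dataset")
    return text


iris_columns = [
    "sepal length (cm)",
    "sepal width (cm)",
    "petal length (cm)",
    "petal width (cm)",
]
print(resolve_feature("1", iris_columns))                  # sepal width (cm)
print(resolve_feature("sepal length (cm)", iris_columns))  # sepal length (cm)
```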
When using the file input mode, ensure that all of the following conditions are met:
- The file must be in CSV format.
- The file must start with column names/headers, followed by rows containing data for each column.
- Make sure there is a target column, which by default is named target (case-sensitive). If the target column name is different, add the argument -tk TARGET_KEY to the command, where TARGET_KEY is the name of the target column.
- The target column must contain non-negative integer data without gaps (for example, if there are 3 rows of data, and the first row has a target of 1, the second row has a target of 3, and the third row has a target of 0, then this data is invalid because it skips the number 2).
- The label values (-tn or --target_names) for the target must be arranged in ascending order starting from the label for target = 0.
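The conditions above can be checked before running the driver. The sketch below is one possible pre-check using only the standard library. The function check_dataset is hypothetical and not part of this package, and the rule that the number of labels must equal the number of target values is an assumption drawn from the last condition.

```python
import csv
import io
from typing import List, Optional, Sequence


def check_dataset(csv_text: str, target_key: str = "target",
                  target_names: Optional[Sequence[str]] = None) -> List[str]:
    """Return a list of problems found; an empty list means every check passed."""
    reader = csv.DictReader(io.StringIO(csv_text))  # the first row is the header
    columns = reader.fieldnames or []
    if target_key not in columns:  # the comparison is case-sensitive
        return [f"missing target column {target_key!r}"]
    problems: List[str] = []
    if len(columns) < 3:
        problems.append("need at least 2 feature columns and a target column")
    targets = set()
    for line_no, row in enumerate(reader, start=2):
        value = (row[target_key] or "").strip()
        if value.isdecimal():
            targets.add(int(value))
        else:
            problems.append(f"line {line_no}: target {value!r} is not a non-negative integer")
    if targets:
        # Every integer from 0 up to the largest target must appear at least once.
        missing = sorted(set(range(max(targets) + 1)) - targets)
        if missing:
            problems.append(f"target values skip {missing}")
        # Assumption: one label per target value, listed from target = 0 upward.
        if target_names is not None and len(target_names) != max(targets) + 1:
            problems.append("number of target names does not match the target values")
    return problems


sample = "a,b,target\n1,2,1\n3,4,3\n5,6,0\n"
print(check_dataset(sample))  # ['target values skip [2]']
```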
To maintain quality during development, this package includes unit tests. They consist of library tests that compare the results of this library's ConvexHull with the ConvexHull from SciPy, and utility tests that ensure some example inputs produce the correct values. Run them with:
python -m unittest discover -s tests
Make sure to install the package with the tests extra before running the tests.
Amar Fadil [13520103]
Hello, I'm Amar Fadil, a computer science student with the student ID 13520103. I am a software engineer who loves to tinker with computer graphics, computer security, and competitive programming (maybe). Pursuing a degree in Computer Science (IF) at the School of Electrical Engineering and Informatics (STEI) of the Bandung Institute of Technology (ITB), I aspire to develop the creative digital industry in Indonesia :D