Visualizer for linear separability dataset by finding the convex hull of each category classifier.

marfgold1/LinearSeparabilityDataset

Linear Separability Dataset

This project implements a convex hull algorithm in Python using the divide and conquer strategy, and uses it to visualize the linear separability of a dataset. It was created to fulfill Task 2 of the IF2211 Algorithm Strategy course.

Description

The Linear Separability Dataset is a testing utility for datasets. It checks that classes do not overlap by analyzing the convex hull of each target class for each pair of features. The library is an example of a self-implemented convex hull using the divide and conquer strategy, specifically QuickHull. It also provides classes that visualize the linear separability of a dataset.
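
To make the divide and conquer idea concrete, here is a minimal QuickHull sketch in pure Python. It is not this library's implementation: the names `cross`, `_hull_side`, and `quickhull` are hypothetical, the input is assumed to contain distinct points, and points lying exactly on a hull edge are not reported as vertices.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def cross(a: Point, b: Point, p: Point) -> float:
    """Twice the signed area of triangle a-b-p; positive when p is left of a->b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def _hull_side(points: Sequence[Point], a: int, b: int,
               candidates: List[int]) -> List[int]:
    """Hull vertices strictly left of the line a->b, ordered from a to b."""
    if not candidates:
        return []
    # The candidate farthest from line a-b must be a hull vertex.
    far = max(candidates, key=lambda i: cross(points[a], points[b], points[i]))
    # Divide: only points outside triangle (a, far, b) can still be on the hull.
    left_af = [i for i in candidates
               if cross(points[a], points[far], points[i]) > 0]
    left_fb = [i for i in candidates
               if cross(points[far], points[b], points[i]) > 0]
    # Conquer: solve both sub-problems and join them around `far`.
    return (_hull_side(points, a, far, left_af) + [far]
            + _hull_side(points, far, b, left_fb))


def quickhull(points: Sequence[Point]) -> List[int]:
    """Indices of the hull vertices, clockwise, starting at the leftmost point."""
    if len(points) < 2:
        return list(range(len(points)))
    order = sorted(range(len(points)), key=lambda i: points[i])
    lo, hi = order[0], order[-1]  # leftmost and rightmost points
    everyone = range(len(points))
    upper = [i for i in everyone if cross(points[lo], points[hi], points[i]) > 0]
    lower = [i for i in everyone if cross(points[hi], points[lo], points[i]) > 0]
    return ([lo] + _hull_side(points, lo, hi, upper)
            + [hi] + _hull_side(points, hi, lo, lower))


square_with_center = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]
print(quickhull(square_with_center))  # [0, 3, 2, 1]
```

The interior point at index 4 is discarded in the first split because it lies on the line between the leftmost and rightmost points, so it never enters either recursive call.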

Requirements

  1. Python >= 3.7

Setup

Installing the package also installs the modules it requires, so the library's dependencies do not need to be installed manually.

  1. [RECOMMENDED] Use a new Python virtual environment.
  2. Change directory to this project folder.
  3. Install the package with the following command:
    pip install .
    

    For development, run pip install -e . to install the package in editable mode.

  4. [OPTIONAL] To load datasets from sklearn.datasets, install the datasets extra.
    pip install .[datasets]
    
    There is also a tests extra for unit testing, which requires scipy and the sklearn datasets.
    pip install .[tests]
    

Usage

A. Library

The package can be used as a Python module that can be imported by other programs. There are two main classes, namely ConvexHull and LinearSeparabilityDataset.

  1. ConvexHull can be used to find the convex hull of points in 2 dimensions.
  2. LinearSeparabilityDataset can be used to load data, collect targets, and visualize the convex hull of each feature pair for each target.

For further documentation, refer to the docstrings of each class or function you intend to use.

Here are some sample programs that can be used as a reference:

  1. Getting the convex hull of a set of points.
    from myConvexHull.lib import ConvexHull as MyConvexHull
    hull = MyConvexHull([
        (4.3, 3.0),
        (4.6, 3.6),
        (4.4, 3.2)
    ])
    print(hull.simplices) # [(0, 1)]
    print(hull.vertices) # [0, 1]
  2. Visualize the linear separability of the wine dataset for two feature pairs: the first and second features, and the third and fourth features.
    from myConvexHull.lib import LinearSeparabilityDataset
    from sklearn import datasets
    data = datasets.load_wine(as_frame=True)
    data = LinearSeparabilityDataset(
        frame=data.frame,
        target_names=data.target_names,
    )
    data.visualize(0, 1)
    data.visualize(2, 3)
  3. Visualize the convex hull for random numbers.
    import numpy as np
    import pandas as pd
    from myConvexHull.lib import LinearSeparabilityDataset
    data = LinearSeparabilityDataset(
        frame=pd.DataFrame(
            {
                'X': np.random.rand(100) * 50,
                'Y': np.random.rand(100) * 25,
                'target': np.random.randint(0, 4, 100),
            },
        ),
        target_names=['A', 'B', 'C', 'D'],
    )
    data.visualize('X', 'Y')

B. Driver / Main Program

This package also comes with a main driver program that can be executed from the command line. To see the complete list of arguments, run the following command:

python -m myConvexHull -h

Here are the complete arguments to run python -m myConvexHull:

usage: __main__.py [-h] [-f FILE] [-tk TARGET_KEY] [-tn TARGET_NAMES [TARGET_NAMES ...]] [-n DATASET_NAME] -fp FEATURE_PAIR FEATURE_PAIR [-s SIZE SIZE] [-nc]

Main driver of linear separability dataset visualizer. It will generate a plot of convex hull given a dataset.

options:
  -h, --help            show this help message and exit

File Dataset Input:
  -f FILE, --file FILE  Input datasets file. Should have minimum 3 columns: 2 features and a target.
  -tk TARGET_KEY, --target_key TARGET_KEY
                        Target column name.
  -tn TARGET_NAMES [TARGET_NAMES ...], --target_names TARGET_NAMES [TARGET_NAMES ...]
                        Target name list, separated by space.

Sklearn Dataset Input:
  -n DATASET_NAME, --dataset_name DATASET_NAME
                        Name of the dataset.

Visualization Options:
  -fp FEATURE_PAIR FEATURE_PAIR, --feature_pair FEATURE_PAIR FEATURE_PAIR
                        Feature pair to plot. Should be separated by space. You can supply multiple pair of feature.
  -s SIZE SIZE, --size SIZE SIZE
                        Figure size (width, height) of the plot.
  -nc, --no_captions    Disable captions (title, x/y label).
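
For readers who want to see how such a command line could be declared, the following is a rough argparse sketch reconstructed only from the help text above. It is an assumption, not the package's actual `__main__.py`: the function name `build_parser`, the `float` type for `--size`, and the shortened help strings are all hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the options listed in the help output."""
    parser = argparse.ArgumentParser(
        prog="myConvexHull",
        description="Main driver of linear separability dataset visualizer.",
    )

    file_group = parser.add_argument_group("File Dataset Input")
    file_group.add_argument("-f", "--file", help="Input dataset file (CSV).")
    # The README states that the target column is named "target" by default.
    file_group.add_argument("-tk", "--target_key", default="target",
                            help="Target column name.")
    file_group.add_argument("-tn", "--target_names", nargs="+",
                            help="Target name list, separated by space.")

    sklearn_group = parser.add_argument_group("Sklearn Dataset Input")
    sklearn_group.add_argument("-n", "--dataset_name",
                               help="Name of the dataset.")

    vis_group = parser.add_argument_group("Visualization Options")
    # nargs=2 with action="append" lets -fp be repeated, one pair per use.
    vis_group.add_argument("-fp", "--feature_pair", nargs=2, action="append",
                           required=True, help="Feature pair to plot.")
    # Assumption: sizes are parsed as floats.
    vis_group.add_argument("-s", "--size", nargs=2, type=float,
                           help="Figure size (width, height) of the plot.")
    vis_group.add_argument("-nc", "--no_captions", action="store_true",
                           help="Disable captions (title, x/y label).")
    return parser


args = build_parser().parse_args(
    ["-n", "iris", "-fp", "0", "1", "-fp", "sepal length (cm)", "sepal width (cm)"]
)
print(args.feature_pair)  # [['0', '1'], ['sepal length (cm)', 'sepal width (cm)']]
```

Feature values arrive as strings, so a driver built this way would still need to decide whether each one is a column index or a column name.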

Here are some examples of executable commands:

  1. Visualize the linear separability dataset for the first and second feature pairs from the breast_cancer dataset:
    python -m myConvexHull -n breast_cancer -fp 0 1
  2. Visualize more than one feature pair in the iris dataset. For instance, displaying the visualization for the first and second features, the second and third features, and the features "sepal length (cm)" and "sepal width (cm)":
    python -m myConvexHull -n iris -fp 0 1 -fp 1 2 -fp "sepal length (cm)" "sepal width (cm)"

    In addition to using indices, features can also be specified using column names in the dataset. Make sure that the names you write correspond to columns in the dataset.

  3. Visualize data from the file datasets/wine_data.csv assuming the current working directory is at the root of this package:
    python -m myConvexHull -f "datasets/wine_data.csv" -tn Class0 Class1 Class2 -fp 0 1
  4. Visualize data from the water potability dataset from the provided link. The following example command displays the convex hull for the feature pairs pH and Hardness, as well as Sulfate and Conductivity:
    python -m myConvexHull -f "datasets/water_potability.csv" -tn "Not Potable" "Potable" -tk "Potability" -fp 0 1 -fp Sulfate Conductivity

When using the file input mode, ensure that all of the following conditions are met:

  • The file must be in CSV format.
  • The file must start with column names/headers, followed by rows containing data for each column.
  • Make sure there is a target column, which by default is named target (case-sensitive). If the target column name is different, add the argument -tk TARGET_KEY to the command, where TARGET_KEY is the name of the target column.
  • The target column must contain non-negative integers with no gaps. For example, three rows with targets 1, 3, and 0 are invalid because the value 2 is skipped.
  • The label values (-tn or --target_names) for the target must be arranged in ascending order starting from the label for target = 0.
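
The conditions above can be checked before running the driver. The sketch below is a standalone helper written for illustration, using only the standard library; `check_targets` is a hypothetical name and is not part of this package.

```python
import csv
import io
from typing import List


def check_targets(csv_text: str, target_key: str = "target") -> List[str]:
    """Return the problems found in the CSV text; an empty list means none."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    if target_key not in header:
        return [f"missing target column {target_key!r}"]
    if len(header) < 3:
        return ["need at least 2 feature columns plus the target column"]

    problems = []
    seen = set()
    for row_no, row in enumerate(reader, start=1):
        value = (row[target_key] or "").strip()
        # Accept only plain ASCII digits, i.e. non-negative integers.
        if value.isascii() and value.isdigit():
            seen.add(int(value))
        else:
            problems.append(
                f"row {row_no}: target {value!r} is not a non-negative integer"
            )

    # Targets must cover 0..max without skipping a value.
    if seen:
        missing = sorted(set(range(max(seen) + 1)) - seen)
        if missing:
            problems.append(f"target values skip {missing}")
    return problems


# The invalid example from the list above: targets 1, 3, 0 skip the value 2.
print(check_targets("a,b,target\n1,2,1\n3,4,3\n5,6,0\n"))  # ['target values skip [2]']
```

To check a file on disk, read it into a string first and pass the name of the target column if it differs from `target`.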

C. Test

To maintain quality during development, this package includes unit tests. They are split into libraries, which compares this library's results against ConvexHull from SciPy, and utils, which checks that some example inputs produce correct values.

python -m unittest discover -s tests

Make sure to install the package with the tests extra before running the tests.

Author

Amar Fadil [13520103]

Hello, I'm Amar Fadil, a computer science student with the student ID 13520103. I am a software engineer who loves to tinker with computer graphics, computer security, and competitive programming (maybe). Pursuing a degree in Computer Science (IF) at the School of Electrical Engineering and Informatics (STEI) in Bandung Institute of Technology (ITB), I aspire to develop the creative digital industry in Indonesia :D
