Merge pull request #8 from AbsaOSS/doc/fill-readme-description-and-contributing

Doc/fill readme description and contributing

OlivieFranklova authored Jun 6, 2024
2 parents 1db2b50 + 3f1abc4 commit b4d0733
Showing 41 changed files with 35,032 additions and 1,578 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/py_test.yml
@@ -34,7 +34,7 @@ jobs:
python-tests:
env:
-      TEST_FILES: test/test_types.py test/test_metadata.py test/test_comparator.py
+      TEST_FILES: test/test_types.py test/test_metadata.py test/test_comparator.py test/test_column2VecCache.py
name: Run Python Tests
runs-on: ubuntu-latest
steps:
@@ -56,7 +56,7 @@ jobs:
run: coverage run --source='similarity,column2Vec' -m pytest $TEST_FILES

- name: Show coverage
-        run: coverage report -m
+        run: coverage report -m --omit=".*.ipynb"

- name: Create coverage file
if: github.event_name == 'pull_request'
5 changes: 4 additions & 1 deletion .gitignore
@@ -1,2 +1,5 @@
__pycache__/
.idea
.idea
fingerprints/
.coverage
coverage.xml
31 changes: 31 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,31 @@
# How to contribute to datasets-similarity

## **Did you find a bug?**

* **Ensure the bug has not already been reported** by searching our **[GitHub Issues](https://github.com/AbsaOSS/datasets-similarity/issues)**.
* If you are unable to find an open issue describing the problem, use the **Bug report** template to open a new one. Tag it with the **bug** label.

## **Do you want to request a new feature?**

* **Ensure the feature has not already been requested** by searching our **[GitHub Issues](https://github.com/AbsaOSS/datasets-similarity/issues)**.
* If you are unable to find the feature request, create a new one. Tag it with the **request** label.

## **Do you want to implement a new feature or fix a bug?**

* Check the _Issues_ log for the feature/bug and check whether someone is already working on it.
* If the feature/bug is not yet filed, please write it up first:
* **"Life, the universe and everything"**
* Fork the repository.
* We follow the [**GitFlow**](https://nvie.com/posts/a-successful-git-branching-model/) branching strategy:
* Cut your branch from `master`, add the _GitHub Issue_ in the branch name:
* **feature/42-life-universe-everything**
* **bugfix/42-life-universe-everything**
* Code away. Ask away. Work with us.
* Commit messages should start with a reference to the GitHub Issue and provide a brief description in the imperative mood:
* **"#42 Answer the ultimate question"**
* Don't forget to write tests for your work.
* After finishing everything, push to your forked repo and open a Pull Request to our `master` branch:
* Pull Request titles should start with the GitHub Issue number:
* **"42 Life, the universe and everything"**
* Ensure the Pull Request description clearly describes the solution.
* Connect the PR to the _Issue_.
193 changes: 193 additions & 0 deletions README.md
@@ -0,0 +1,193 @@
# Dataset Similarity
<!-- toc -->
- [What is Datasets Similarity?](#what-is-datasets-similarity)
- [Approach](#approach)
- [Column2Vec](#column2vec)
- [Types and Kinds](#types-and-kinds)
- [Applicability](#applicability)
- [Structure](#structure)
- [How to run](#how-to-run)
- [How to run tests](#how-to-run-tests)
- [How to contribute](#how-to-contribute)
<!-- tocstop -->

## What is Datasets Similarity?
The Dataset Similarity project deals with the
problem of comparing tabular datasets.
The idea of the project is that we have a set of
datasets that we want to compare with each other
to determine their similarity or distance.
This project mainly focuses on comparing two tables.
The final similarity is calculated from
the similarity of individual columns based on their metadata.
Columns are compared by type and by content.

For testing, we have prepared two sets of data:
the main (training) set, on which the program is
tuned, and a validation set for validating the results.

#### Definition of table similarity:
![img_1.png](images/similarity_def.png)
>Parameter **important columns** is user input.
>
>Parameter **k** is also user input.

### Approach
You can see two options for the implementation in the pictures below.
This implementation only compares two tables.
In both implementations, we first create metadata for each table.
MetadataCreator creates the metadata, and the implementation of the creator
is modular.
After metadata is created for both tables, it is used as
input for the Comparator.
The Comparator compares the metadata and computes a distance.
We should test which option is better.

1. ![img_2.png](images/pipeline1.png)
2. ![img_3.png](images/pipeline2.png)
#### Metadata creator
MetadataCreator has:
- a **constructor** that fills the fields:
   - size
   - column_names
   - column_names_clean (lowercase, only numbers and letters)
   - column_incomplete (if a column has more than 30 % missing values, it is marked as incomplete)
- **methods for setting up the creator**:
   - set_model: sets the word transformer model
   - compute_column_names_embeddings: computes embeddings for the clean column names
   - compute_column_kind: computes the kind
   - compute_basic_types: computes types at the top level
   - compute_advanced_types: computes types at the middle level
   - compute_advanced_structural_types: computes types at the most detailed level (the user should pick only one of these three)
   - compute_correlation: computes the correlation between columns
   - create_column_embeddings: creates embeddings for the columns
- *getters*:
   - get_column_by_type: returns the names of all columns with a specified type
   - get_numerical_columns: returns the names of all numerical columns
   - get_metadata: the main method; it returns the created metadata

> **Usage**:
> first we call the constructor, then we can call any of the
> set methods (but for types we should pick just one),
> and then we can get the metadata.
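The clean-name and incompleteness rules from the constructor can be sketched like this (a minimal illustration of the stated rules, not the project's actual code; the function names are assumptions):

```python
import re

def clean_column_name(name: str) -> str:
    # column_names_clean: lowercase, keep only letters and digits.
    return re.sub(r"[^a-z0-9]", "", name.lower())

def is_incomplete(values: list) -> bool:
    # column_incomplete: more than 30 % missing values marks the column incomplete.
    missing = sum(1 for v in values if v is None)
    return missing / len(values) > 0.3
```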
#### Comparator picture 1
This comparator creates several matrices; each matrix represents a
comparison of two columns of the same type.
The matrices can represent different aspects.

For example, for type int we will create:
- a matrix comparing column names
- a matrix comparing max values
- a matrix comparing ranges
- ...

For type string we will create:
- a matrix comparing column names
- a matrix comparing embeddings
- a matrix comparing the most used words

Then we will create one matrix for string and one matrix for int by using
a built-in function to unite the matrices.

From each of these two matrices we will compute a distance number.
Then these distances will be merged.
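A toy sketch of this per-type pipeline (the aspect metric and all function names here are illustrative assumptions, not the project's API):

```python
from difflib import SequenceMatcher

def name_matrix(cols_a, cols_b):
    # One aspect: pairwise similarity of column names.
    return [[SequenceMatcher(None, a, b).ratio() for b in cols_b] for a in cols_a]

def unite(*matrices):
    # Unite the aspect matrices for one type by element-wise averaging.
    return [[sum(v) / len(v) for v in zip(*rows)] for rows in zip(*matrices)]

def to_distance(matrix):
    # Collapse one united similarity matrix into a single distance in [0, 1].
    flat = [v for row in matrix for v in row]
    return 1.0 - sum(flat) / len(flat)
```

The merged table distance could then be, for example, the average of the int distance and the string distance.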
#### Comparator picture 2
This comparator creates one big matrix for all columns regardless of type.
Each element of the matrix is computed from several aspects
(for int: column names, max value, range ...).
Then we compute one number from this big matrix, which is the distance
between the two tables.
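A minimal sketch of the single-matrix variant (the aspect functions are placeholders for the name/max/range comparisons mentioned above, not the project's implementation):

```python
def table_distance(cols_a, cols_b, aspect_fns):
    # One big matrix over all column pairs; each element averages several aspects.
    matrix = [[sum(f(a, b) for f in aspect_fns) / len(aspect_fns)
               for b in cols_b]
              for a in cols_a]
    flat = [v for row in matrix for v in row]
    # One number from the whole matrix: 1 minus the mean similarity.
    return 1.0 - sum(flat) / len(flat)
```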
### Column2Vec
Column2Vec is a module in which we implement word2vec-based functionality for columns.
It computes embeddings for columns so that we can compare them.
More about this module can be found [here](column2Vec/README.md).
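Once each column has an embedding, two columns can be compared by cosine similarity; a minimal sketch (the embedding vectors themselves would come from the column2Vec module):

```python
import math

def cosine_similarity(u, v):
    # Higher means the two column embeddings are more alike.
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)
```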
### Types and Kinds
We have decided to split columns by type. We can compute either types or kinds for each column.
Types define the real type of a column. Some you may know from programming languages (int, float, string)
and some are specific (human generated, word, sentence ...).
Kinds represent a higher level of categorization.

Types form a hierarchy, as you can see in picture 3.
In the previous lines we named the levels: top level, middle level, smaller level.
Explanation of some types:
- human generated: a number with more than three digits after the decimal point; all other numbers are computer generated
- word: a string without a space
- sentence: a string starting with an upper-case letter and ending with a full stop (or ! ?); it contains only one full stop
- phrase: a string with more than one word
- multiple: a string that represents non-atomic or structured data
- article: a string with more than one sentence
3. ![img.png](images/types.png)
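The string-type rules above can be sketched as a simple classifier (an illustration of the stated rules, not the project's implementation):

```python
import re

def string_type(s: str) -> str:
    # Apply the word / sentence / phrase / article rules described above.
    if " " not in s:
        return "word"
    terminators = re.findall(r"[.!?]", s)
    if len(terminators) > 1:
        return "article"          # more than one sentence
    if s[0].isupper() and s.rstrip()[-1] in ".!?":
        return "sentence"         # one capitalised, terminated sentence
    return "phrase"               # several words, no sentence structure
```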
Kinds have only four "types" plus undefined. You can see all kinds in picture 4.
Explanation of the kinds:
- A column is marked as **Id** if it contains only unique values.
- A column is marked as **Bool** if it contains only two unique values.
- A column is marked as **Constant** if it contains only one unique value.
- A column is marked as **Categorical** if it contains categories: the number of unique values is less than a threshold percentage of the total number of rows. The threshold differs for small and big datasets.
4. ![img.png](images/kind.png)
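The kind rules can be sketched as follows (the 10 % categorical threshold is an assumed example value; as noted above, the real threshold differs for small and big datasets):

```python
def column_kind(values, categorical_threshold=0.10):
    # Id / Bool / Constant / Categorical, otherwise undefined.
    n_unique = len(set(values))
    if n_unique == len(values):
        return "Id"               # only unique values
    if n_unique == 1:
        return "Constant"         # a single unique value
    if n_unique == 2:
        return "Bool"             # exactly two unique values
    if n_unique / len(values) < categorical_threshold:
        return "Categorical"      # few categories relative to the row count
    return "undefined"
```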
### Applicability
- merging teams
- mergers of companies
- finding out which data are duplicated
- finding similar or different data
## Structure
- **Source code** is in the folder [similarity](similarity). More about the similarity folder structure is in its [README.md](similarity/README.md).
- **Source code for column2Vec** is in the folder [column2Vec](column2Vec).
- **Tests** are in the folder [test](test).
- **Data** are stored in the folders [**data**](data) and [**data_validation**](data_validation).
- The **main folder** contains: the folder .github and the files .gitignore, CONTRIBUTING.md, LICENSE, README.md, requirements.txt, constants.py and main.py.
- The folder **images** contains images for README.md.

---
**.github** folder contains GitHub workflows.

**column2Vec** folder contains all files for the [column2Vec](#column2vec) feature.
More about the structure of this folder can be found [here](column2Vec/README.md/#structure).

**Datasets** for testing are stored in [**data**](data) and [**data_validation**](data_validation).
The corresponding link, name and optional description for each dataset are
stored in DatasetDescription.md in the corresponding folder ([**data**](data/DatasetDescription.md), [**data_validation**](data_validation/DatasetDescription.md)).
Both folders also contain a file DataShow.md with metadata information for each dataset ([**data**](data/DataShow.md), [**data_validation**](data_validation/DataShow.md)).

## How to run
You can compare two or more tables by running main.py.
You can use either comparator or comparatorByColumn; change the comparator in compare_datasets.
The result will be the distance between the tables.
```bash
python main.py # for the default files
python main.py data/imdb_top_1000.csv data/netflix_titles.csv # for specific files
```
You can disable or enable warnings in main. To disable them, add these lines:
```python
warning_enable.change_status(False)
warning_enable.disable_timezone_warn()
```
Enable by:
```python
warning_enable.change_status(True)
warning_enable.enable_timezone_warn()
```
### DataShow
DataShow.md is generated by the notebook [Dataset_description](similarity/Datasets_Description.ipynb).
## How to run tests
> Tests are in the folder [*test*](test).

To run a test, switch to the test folder and then run the test using pytest.
```bash
cd test

pytest types_test.py # name of the test file to run
```

Or you can run all the tests by running this:
```bash
python -m unittest
# or
pytest
```
**Please be aware that some tests in the test_column2Vec
module may take a long time.**

## How to contribute
Please see our [**Contribution Guidelines**](CONTRIBUTING.md).