Merge pull request #8 from AbsaOSS/doc/fill-readme-description-and-contributing

Doc/fill readme description and contributing

OlivieFranklova authored Jun 6, 2024
2 parents 1db2b50 + 3f1abc4 commit b4d0733
Showing 41 changed files with 35,032 additions and 1,578 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/py_test.yml
@@ -34,7 +34,7 @@ jobs:
python-tests:
env:
-      TEST_FILES: test/test_types.py test/test_metadata.py test/test_comparator.py
+      TEST_FILES: test/test_types.py test/test_metadata.py test/test_comparator.py test/test_column2VecCache.py
name: Run Python Tests
runs-on: ubuntu-latest
steps:
@@ -56,7 +56,7 @@ jobs:
run: coverage run --source='similarity,column2Vec' -m pytest $TEST_FILES

- name: Show coverage
-        run: coverage report -m
+        run: coverage report -m --omit=".*.ipynb"

- name: Create coverage file
if: github.event_name == 'pull_request'
5 changes: 4 additions & 1 deletion .gitignore
@@ -1,2 +1,5 @@
__pycache__/
.idea
.idea
fingerprints/
.coverage
coverage.xml
31 changes: 31 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,31 @@
# How to contribute to datasets-similarity

## **Did you find a bug?**

* **Ensure the bug has not already been reported** by searching our **[GitHub Issues](https://github.com/AbsaOSS/datasets-similarity/issues)**.
* If you are unable to find an open issue describing the problem, use the **Bug report** template to open a new one. Tag it with the **bug** label.

## **Do you want to request a new feature?**

* **Ensure the feature has not already been requested** by searching our **[GitHub Issues](https://github.com/AbsaOSS/datasets-similarity/issues)**.
* If you are unable to find the feature request, create a new one. Tag it with the **request** label.

## **Do you want to implement a new feature or fix a bug?**

* Check the _Issues_ log for the feature/bug and check whether someone is already working on it.
* If the feature/bug is not yet filed, please write it up first:
* **"Life, the universe and everything"**
* Fork the repository.
* We follow the [**GitFlow**](https://nvie.com/posts/a-successful-git-branching-model/) branching strategy:
* Cut your branch from `master`, add the _GitHub Issue_ in the branch name:
* **feature/42-life-universe-everything**
* **bugfix/42-life-universe-everything**
* Code away. Ask away. Work with us.
* Commit messages should start with a reference to the GitHub Issue and provide a brief description in the imperative mood:
* **"#42 Answer the ultimate question"**
* Don't forget to write tests for your work.
* After finishing everything, push to your forked repo and open a Pull Request to our `master` branch:
* Pull Request titles should start with the GitHub Issue number:
* **"42 Life, the universe and everything"**
* Ensure the Pull Request description clearly describes the solution.
* Connect the PR to the _Issue_.
193 changes: 193 additions & 0 deletions README.md
@@ -0,0 +1,193 @@
# Dataset Similarity
<!-- toc -->
- [What is Datasets Similarity?](#what-is-datasets-similarity)
- [Approach](#approach)
- [Column2Vec](#column2vec)
- [Types and Kinds](#types-and-kinds)
- [Applicability](#applicability)
- [Structure](#structure)
- [How to run](#how-to-run)
- [How to run tests](#how-to-run-tests)
- [How to contribute](#how-to-contribute)
<!-- tocstop -->

## What is Datasets Similarity?
The Dataset Similarity project deals with the
problem of comparing tabular datasets.
The idea of the project is that we have a set of
datasets that we want to compare with each other
to determine their similarity or distance.
This project mainly focuses on comparing two tables.
The final similarity is calculated from
the similarity of individual columns based on their metadata.
Columns are compared by type and by content.

For testing, we have prepared two sets of data:
the main (training) set, on which the program is
tuned, and a validation set for validating the results.

#### Definition of table similarity:
![img_1.png](images/similarity_def.png)
>Parameter **important columns** is user input.
>
>Parameter **k** is also user input.

### Approach
You can see two options for the implementation in the pictures below.
This implementation only compares two tables.
In both implementations, we first create metadata for each table.
MetadataCreator creates the metadata, and the implementation of the creator
is modular.
After metadata is created for both tables, it is used as
input for the Comparator.
The Comparator compares the metadata and computes a distance.
We should test which option is better.

1. ![img_2.png](images/pipeline1.png)
2. ![img_3.png](images/pipeline2.png)
#### Metadata creator
MetadataCreator has:
- a **constructor** that fills the fields:
   - size
   - column_names
   - column_names_clean (lowercase, only numbers and letters)
   - column_incomplete (if a column has more than 30 % missing values, it is marked as incomplete)
- **methods for setting up the creator**:
   - set_model: sets the word transformer model
   - compute_column_names_embeddings: computes embeddings for the clean column names
   - compute_column_kind: computes the kind
   - compute_basic_types: computes types at the top level
   - compute_advanced_types: computes types at the middle level
   - compute_advanced_structural_types: computes types at the most detailed level (the user should pick only one of these three)
   - compute_correlation: computes the correlation between columns
   - create_column_embeddings: creates embeddings for the columns
- *getters*:
   - get_column_by_type: returns the names of all columns with a specified type
   - get_numerical_columns: returns the names of all numerical columns
   - get_metadata: the main method; it returns the created metadata

> **Usage**:
> first we call the constructor, then we can call any of the
> set methods (but for types we should pick just one),
> and then we can get the metadata.
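The clean-name and incompleteness rules from the constructor can be sketched like this (a minimal illustration of the stated rules, not the project's actual code; the function names are assumptions):

```python
import re

def clean_column_name(name: str) -> str:
    # column_names_clean: lowercase, keep only letters and digits.
    return re.sub(r"[^a-z0-9]", "", name.lower())

def is_incomplete(values: list) -> bool:
    # column_incomplete: more than 30 % missing values marks the column incomplete.
    missing = sum(1 for v in values if v is None)
    return missing / len(values) > 0.3
```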
#### Comparator picture 1
This comparator creates several matrices; each matrix represents a
comparison of two columns of the same type.
The matrices can represent different aspects.

For example, for type int we will create:
- a matrix comparing column names
- a matrix comparing max values
- a matrix comparing ranges
- ...

For type string we will create:
- a matrix comparing column names
- a matrix comparing embeddings
- a matrix comparing the most used words

Then we will create one matrix for string and one matrix for int by using
a built-in function to unite the matrices.

From each of these two matrices we will compute a distance number.
Then these distances will be merged.
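A toy sketch of this per-type pipeline (the aspect metric and all function names here are illustrative assumptions, not the project's API):

```python
from difflib import SequenceMatcher

def name_matrix(cols_a, cols_b):
    # One aspect: pairwise similarity of column names.
    return [[SequenceMatcher(None, a, b).ratio() for b in cols_b] for a in cols_a]

def unite(*matrices):
    # Unite the aspect matrices for one type by element-wise averaging.
    return [[sum(v) / len(v) for v in zip(*rows)] for rows in zip(*matrices)]

def to_distance(matrix):
    # Collapse one united similarity matrix into a single distance in [0, 1].
    flat = [v for row in matrix for v in row]
    return 1.0 - sum(flat) / len(flat)
```

The merged table distance could then be, for example, the average of the int distance and the string distance.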
#### Comparator picture 2
This comparator creates one big matrix for all columns regardless of type.
Each element of the matrix is computed from several aspects
(for int: column names, max value, range ...).
Then we compute one number from this big matrix, which is the distance
between the two tables.
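A minimal sketch of the single-matrix variant (the aspect functions are placeholders for the name/max/range comparisons mentioned above, not the project's implementation):

```python
def table_distance(cols_a, cols_b, aspect_fns):
    # One big matrix over all column pairs; each element averages several aspects.
    matrix = [[sum(f(a, b) for f in aspect_fns) / len(aspect_fns)
               for b in cols_b]
              for a in cols_a]
    flat = [v for row in matrix for v in row]
    # One number from the whole matrix: 1 minus the mean similarity.
    return 1.0 - sum(flat) / len(flat)
```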
### Column2Vec
Column2Vec is a module in which we implement word2vec-based functionality for columns.
It computes embeddings for columns so that we can compare them.
More about this module can be found [here](column2Vec/README.md).
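Once each column has an embedding, two columns can be compared by cosine similarity; a minimal sketch (the embedding vectors themselves would come from the column2Vec module):

```python
import math

def cosine_similarity(u, v):
    # Higher means the two column embeddings are more alike.
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)
```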
### Types and Kinds
We have decided to split columns by type. We can compute either types or kinds for each column.
Types define the real type of a column. Some you may know from programming languages (int, float, string)
and some are specific (human generated, word, sentence ...).
Kinds represent a higher level of categorization.

Types form a hierarchy, as you can see in picture 3.
In the previous lines we named the levels: top level, middle level, smaller level.
Explanation of some types:
- human generated: a number with more than three digits after the decimal point; all other numbers are computer generated
- word: a string without a space
- sentence: a string starting with an upper-case letter and ending with a full stop (or ! ?); it contains only one full stop
- phrase: a string with more than one word
- multiple: a string that represents non-atomic or structured data
- article: a string with more than one sentence
3. ![img.png](images/types.png)
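The string-type rules above can be sketched as a simple classifier (an illustration of the stated rules, not the project's implementation):

```python
import re

def string_type(s: str) -> str:
    # Apply the word / sentence / phrase / article rules described above.
    if " " not in s:
        return "word"
    terminators = re.findall(r"[.!?]", s)
    if len(terminators) > 1:
        return "article"          # more than one sentence
    if s[0].isupper() and s.rstrip()[-1] in ".!?":
        return "sentence"         # one capitalised, terminated sentence
    return "phrase"               # several words, no sentence structure
```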
Kinds have only four "types" plus undefined. You can see all kinds in picture 4.
Explanation of the kinds:
- A column is marked as **Id** if it contains only unique values.
- A column is marked as **Bool** if it contains only two unique values.
- A column is marked as **Constant** if it contains only one unique value.
- A column is marked as **Categorical** if it contains categories: the number of unique values is less than a threshold percentage of the total number of rows. The threshold differs for small and big datasets.
4. ![img.png](images/kind.png)
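The kind rules can be sketched as follows (the 10 % categorical threshold is an assumed example value; as noted above, the real threshold differs for small and big datasets):

```python
def column_kind(values, categorical_threshold=0.10):
    # Id / Bool / Constant / Categorical, otherwise undefined.
    n_unique = len(set(values))
    if n_unique == len(values):
        return "Id"               # only unique values
    if n_unique == 1:
        return "Constant"         # a single unique value
    if n_unique == 2:
        return "Bool"             # exactly two unique values
    if n_unique / len(values) < categorical_threshold:
        return "Categorical"      # few categories relative to the row count
    return "undefined"
```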
### Applicability
- merging teams
- mergers of companies
- finding out which data are duplicated
- finding similar or different data
## Structure
- **Source code** is in the folder [similarity](similarity). More about the similarity folder structure is in its [README.md](similarity/README.md).
- **Source code for column2Vec** is in the folder [column2Vec](column2Vec).
- **Tests** are in the folder [test](test).
- **Data** are stored in the folders [**data**](data) and [**data_validation**](data_validation).
- The **main folder** contains: the folder .github and the files .gitignore, CONTRIBUTING.md, LICENSE, README.md, requirements.txt, constants.py and main.py.
- The folder **images** contains images for README.md.

---
**.github** folder contains GitHub workflows.

**column2Vec** folder contains all files for the [column2Vec](#column2vec) feature.
More about the structure of this folder can be found [here](column2Vec/README.md/#structure).

**Datasets** for testing are stored in [**data**](data) and [**data_validation**](data_validation).
The corresponding link, name and optional description for each dataset are
stored in DatasetDescription.md in the corresponding folder ([**data**](data/DatasetDescription.md), [**data_validation**](data_validation/DatasetDescription.md)).
Both folders also contain a file DataShow.md with metadata information for each dataset ([**data**](data/DataShow.md), [**data_validation**](data_validation/DataShow.md)).

## How to run
You can compare two or more tables by running main.py.
You can use either comparator or comparatorByColumn; change the comparator in compare_datasets.
The result will be the distance between the tables.
```bash
python main.py # for the default files
python main.py data/imdb_top_1000.csv data/netflix_titles.csv # for specific files
```
You can disable or enable warnings in main. To disable them, add these lines:
```python
warning_enable.change_status(False)
warning_enable.disable_timezone_warn()
```
Enable by:
```python
warning_enable.change_status(True)
warning_enable.enable_timezone_warn()
```
### DataShow
DataShow.md is generated by the notebook [Dataset_description](similarity/Datasets_Description.ipynb).
## How to run tests
> Tests are in the folder [*test*](test).

To run a test, switch to the test folder and then run the test using pytest.
```bash
cd test

pytest types_test.py # name of the test file to run
```

Or you can run all the tests by running this:
```bash
python -m unittest
# or
pytest
```
**Please be aware that some tests in the test_column2Vec
module may take a long time.**

## How to contribute
Please see our [**Contribution Guidelines**](CONTRIBUTING.md).