MetricJoin Datasets

This repository contains instructions to access the experimental data used in the paper "MetricJoin: Leveraging Metric Properties for Robust Exact Set Similarity Joins".

Datasets

For the datasets AOL, BMS-POS, KOSARAK, NETFLIX, and ORKUT, and the corresponding preprocessing instructions, we refer to "An Empirical Evaluation of Set Similarity Join Techniques - Compilation Instructions". Since BMS-POS and NETFLIX are small in their original form, we scale them by a factor of 10 in our experiments using the technique described by Vernica et al. in "Efficient parallel set-similarity joins using MapReduce". The script to scale the datasets is located in scripts/blowup.py. After scaling a dataset, we recommend preprocessing it again, because the global token frequencies may have changed and duplicates may have been introduced.
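To give a rough idea of what scaling by token shifting can look like, here is a minimal sketch. It is not the code in scripts/blowup.py and may differ from it in important details. It assumes that each additional copy of a record is created by replacing every token with the token a fixed number of positions later in the frequency-sorted token order, that the shift wraps around at the end of the order, and that ties in frequency are broken by token value. The function name blow_up and its parameters are hypothetical.

```python
from collections import Counter


def blow_up(records, factor):
    """Return `factor` copies of `records`, each copy with shifted tokens.

    Hypothetical helper for illustration only; the wrap-around and the
    tie-breaking rule are assumptions, not taken from scripts/blowup.py.
    """
    # Global token frequencies over the whole collection.
    freq = Counter(tok for rec in records for tok in rec)
    # Token order: ascending frequency, ties broken by token value.
    order = sorted(freq, key=lambda t: (freq[t], t))
    pos = {tok: i for i, tok in enumerate(order)}
    n = len(order)

    scaled = []
    for shift in range(factor):
        for rec in records:
            # shift == 0 reproduces the original record unchanged.
            scaled.append([order[(pos[t] + shift) % n] for t in rec])
    return scaled


if __name__ == "__main__":
    data = [["B", "A", "C", "E"], ["D", "F"]]
    for rec in blow_up(data, 2):
        print(" ".join(rec))
```

Because shifted copies can collide with existing sets and alter token frequencies, the output of such a step would still need to be preprocessed again, as noted above.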
CELONIS is a proprietary dataset provided by Celonis SE. We are not allowed to publish it.
The datasets DBLP-V12, PUBCHEM, and TWITTER are already preprocessed, i.e., the set elements are ordered by their inverse global token frequencies, the sets are sorted by their size, and duplicates have been removed. To download them, please run sh scripts/download-data.sh. Licensing information can be found in the respective README.md files.
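The three preprocessing properties named above (token order, set order, no duplicates) can be illustrated with a short sketch. It is not the pipeline used to produce the published files. It assumes that "inverse global token frequency" means rarest tokens first, that tokens are replaced by their integer rank in that order, that ties are broken by token value, and that empty sets are dropped. The function name preprocess is hypothetical.

```python
from collections import Counter


def preprocess(raw_sets):
    """Order tokens by ascending global frequency, dedupe, sort sets by size.

    Hypothetical helper for illustration only; the tie-breaking rule and
    the integer encoding are assumptions.
    """
    # Treat every record as a set, so repeated tokens within it collapse.
    records = [set(r) for r in raw_sets]

    # Global frequency: number of sets in which each token occurs.
    freq = Counter(tok for rec in records for tok in rec)

    # Rank tokens from rarest to most frequent (ties broken by token value).
    order = sorted(freq, key=lambda t: (freq[t], t))
    rank = {tok: i for i, tok in enumerate(order)}

    # Encode each set as a sorted tuple of ranks; the outer set removes
    # duplicate records, and empty records are skipped.
    encoded = {tuple(sorted(rank[t] for t in rec)) for rec in records if rec}

    # Sort the sets by size, then lexicographically for a stable order.
    return sorted(encoded, key=lambda rec: (len(rec), rec))


if __name__ == "__main__":
    sample = [["a", "b"], ["b", "a"], ["b", "c", "d"], ["b"]]
    for rec in preprocess(sample):
        print(" ".join(map(str, rec)))
```

Placing rare tokens first is the usual choice for prefix-filter-based joins, since short prefixes of rare tokens produce fewer candidate pairs.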

DatabaseGroup/metric-join-datasets
