DatabaseGroup/metric-join-datasets

MetricJoin Datasets

This repository contains instructions to access the experimental data used in the paper "MetricJoin: Leveraging Metric Properties for Robust Exact Set Similarity Joins".

Datasets

  • For the datasets AOL, BMS-POS, KOSARAK, NETFLIX, and ORKUT, and the corresponding preprocessing instructions, see "An Empirical Evaluation of Set Similarity Join Techniques - Compilation Instructions". Because BMS-POS and NETFLIX are small, we scale them by a factor of 10 in our experiments using the technique described by Vernica et al. in "Efficient parallel set-similarity joins using MapReduce". The scaling script is located at scripts/blowup.py. After scaling a dataset, we recommend preprocessing it again, since scaling may change the global token frequencies and introduce duplicates.

  • CELONIS is a proprietary dataset provided by Celonis SE. We are not allowed to publish it.

  • The datasets DBLP-V12, PUBCHEM, and TWITTER are already preprocessed: the set elements are ordered by their inverse global token frequencies, the sets are sorted by their size, and duplicates have been removed. To download them, run sh scripts/download-data.sh. Licensing information can be found in the respective README.md files.
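
The preprocessing steps named above (ordering tokens by global frequency, sorting sets by size, removing duplicates) can be illustrated with a minimal Python sketch. This is not the preprocessing code used for the paper. The function name `preprocess` is hypothetical, and the sketch assumes that "inverse global token frequency" means rarest tokens first, that a token's frequency is the number of sets containing it, and that ties are broken by token value.

```python
from collections import Counter


def preprocess(sets):
    """Illustrative preprocessing sketch (hypothetical helper, not the paper's code).

    Maps tokens to integer ranks by ascending global frequency (assumed
    meaning of "inverse global token frequency"), removes duplicate sets,
    and sorts the remaining sets by size.
    """
    # Treat every record as a set: repeated tokens within a record are dropped.
    records = [set(s) for s in sets]

    # Global frequency of a token = number of records that contain it.
    freq = Counter(tok for rec in records for tok in rec)

    # Rarest token gets rank 0; ties are broken by token value (an assumption).
    ordered_tokens = sorted(freq, key=lambda t: (freq[t], t))
    rank = {tok: i for i, tok in enumerate(ordered_tokens)}

    # Encode each record as a sorted tuple of ranks; the surrounding set
    # comprehension removes duplicate records.
    encoded = {tuple(sorted(rank[t] for t in rec)) for rec in records}

    # Sort by set size, then lexicographically for a deterministic order.
    return sorted(encoded, key=lambda r: (len(r), r))


if __name__ == "__main__":
    example = [["a", "b", "c"], ["b", "c"], ["c", "b"], ["c"]]
    for record in preprocess(example):
        print(record)
```

Running the same kind of pass again after scaling a dataset addresses both effects mentioned above: token ranks are recomputed from the new frequencies, and duplicate sets introduced by scaling collapse into one.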
