This repository contains instructions to access the experimental data used in the paper "MetricJoin: Leveraging Metric Properties for Robust Exact Set Similarity Joins".
- For the datasets `AOL`, `BMS-POS`, `KOSARAK`, `NETFLIX`, and `ORKUT`, and the corresponding preprocessing instructions, we refer to "An Empirical Evaluation of Set Similarity Join Techniques - Compilation Instructions". Since `BMS-POS` and `NETFLIX` have small initial sizes, we scale them by a factor of 10 in our experiments using the technique described by Vernica et al. in "Efficient parallel set-similarity joins using MapReduce". The script to scale the datasets is located in `scripts/blowup.py`. After scaling a dataset, please consider re-preprocessing the dataset (the global token frequencies may have changed and duplicates may be introduced).
- `CELONIS` is a proprietary dataset provided by Celonis SE. We are not allowed to publish it.
- The datasets `DBLP-V12`, `PUBCHEM`, and `TWITTER` are already preprocessed, i.e., the set elements are ordered by their inverse global token frequencies, the sets are sorted by their size, and duplicates have been removed. To download them, please run `sh scripts/download-data.sh`. Licensing information can be found in the respective `README.md` files.
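
For readers who want to see what the preprocessing described above amounts to, here is a minimal in-memory sketch. It is not the preprocessing code referenced by this repository. The function name `preprocess` is hypothetical, the input is assumed to be a collection of token lists rather than any particular file format, and "inverse global token frequency" is assumed to mean that the rarest tokens come first.

```python
from collections import Counter


def preprocess(sets):
    """Illustrative preprocessing (hypothetical helper, not the repository's code).

    `sets` is an iterable of iterables of mutually comparable tokens
    (e.g. all ints or all strings).
    """
    # Remove repeated tokens inside each set.
    unique = [set(s) for s in sets]

    # Global frequency of a token = number of sets that contain it.
    freq = Counter(tok for s in unique for tok in s)

    # Assumption: rarest token gets the smallest id; ties are broken by
    # token value so that the result is deterministic.
    ordered_tokens = sorted(freq, key=lambda t: (freq[t], t))
    rank = {tok: i for i, tok in enumerate(ordered_tokens)}

    # Re-encode every set as a sorted tuple of ids; collecting the tuples
    # in a Python set removes duplicate sets.
    encoded = {tuple(sorted(rank[t] for t in s)) for s in unique}

    # Sort the sets by size, then lexicographically for a stable order.
    return sorted(encoded, key=lambda r: (len(r), r))


if __name__ == "__main__":
    raw = [["b", "a"], ["a", "c", "a"], ["a", "b"], ["a"]]
    for record in preprocess(raw):
        print(record)
```

Running the same steps again after scaling a dataset would recompute the token ids from the new frequencies and drop any duplicate sets that the scaling introduced.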