diff --git a/README.md b/README.md index 407182a..c51c124 100644 --- a/README.md +++ b/README.md @@ -12,22 +12,90 @@ The system has several key attributes that lead to its highly and easily customi Quick links to this file: +* [Competing libraries](#competing-libraries) * [Prerequisites](#prerequisites) * [Running](#running) * [Directory Structure](#directory-structure) * [Getting the datasets](#getting-the-datasets) * [Configuration](#configuration) -* [Competing libraries](#competing-libraries) * [Citation details](#citation-details) +## Competing libraries + +Machine learning toolkits: +* [mlpack](http://mlpack.org) +* [Shogun-toolbox](http://shogun-toolbox.org) +* [scikit-learn](http://scikit-learn.org) +* [MATLAB](http://mathworks.com) +* [Weka](http://cs.waikato.ac.nz/ml/weka/) +* [elki](https://elki-project.github.io/) +* [mlpy](http://mlpy.sourceforge.net) +* [dlibml](http://dlib.net/ml.html) +* [milk](https://github.com/luispedro/milk/) +* [R](https://www.r-project.org/) + +Nearest Neighbour Algorithms: +* [ANN](http://www.cs.umd.edu/~mount/ANN/) +* [FLANN](http://www.cs.ubc.ca/research/flann/) +* [nearpy](http://pixelogik.github.io/NearPy/) +* [annoy](https://github.com/spotify/annoy) +* [mrpt](https://github.com/vioshyvo/mrpt) + +Inactive toolkits: +* [HLearn](https://github.com/mikeizbicki/HLearn) +NOTE: `HLearn` is not currently being benchmarked by this repository. + ## Prerequisites * **[Python 3.3+](http://www.python.org "Python Website")**: The main benchmark script is written with the programming language python: The benchmark script by default uses the version of Python on your path. +* **[numpy](https://www.numpy.org/)**: Numpy provides a powerful N-dimensional array object and sophisticated (broadcasting) functions useful for handling and transforming data. * **[Python-yaml](http://pyyaml.org "Python-yaml Website")**: PyYAML is a YAML parser and emitter for Python. 
We've picked YAML as the configuration file format for specifying the structure for the project. * **[SQLite](http://www.sqlite.org "SQLite Website")** (**Optional**): SQLite is a lightweight disk-based database that doesn't require a separate server process. We use the python built-in SQLite database to save the benchmark results. -* **[Valgrind](http://valgrind.org "Valgrind Website")** (**Optional**): Valgrind is a suite of tools for debugging and profiling. This package is only needed if you want to run the memory benchmarks. * **[python-xmlrunner](https://github.com/lamby/pkg-python-xmlrunner "python-xmlrunner github")** (**Optional**): The xmlrunner module is a unittest test runner that can save test results to XML files. This package is only needed if you want to run the tests. +### Prerequisites for Setting up Competing Libraries +All of the following prerequisite packages must be installed before running the `make setup` command (see the next [section](#running)): +**FLANN library:** +* [hdf5](https://www.hdfgroup.org/solutions/hdf5/): This is a high-performance data software library. +* [gtest](https://github.com/google/googletest): This package is used for writing C++ tests. + +**mlpack:** +* [Armadillo](http://arma.sourceforge.net/download.html): This package is a C++ library for linear algebra and scientific computing. +* [Boost C++](https://www.boost.org/): This package is required for compiling mlpack from source. + +**mlpy:** +* [scipy](https://www.scipy.org/): Python-based ecosystem of open-source software for mathematics, science, and engineering. +* [GSL](https://www.gnu.org/software/gsl/): This is a numerical library for C and C++ programmers. + +**scikit-learn:** +* [scipy](https://www.scipy.org/): Python-based ecosystem of open-source software for mathematics, science, and engineering. +* [joblib](https://joblib.readthedocs.io/): Joblib is a set of tools to provide lightweight pipelining in Python.
+* [Cython-0.25.2](https://cython.org/): C extensions for Python. Required for compiling scikit-learn from source. + +**Nearpy:** +* [scipy](https://www.scipy.org/): Python-based ecosystem of open-source software for mathematics, science, and engineering. +* [redis](https://redislabs.com/lp/python-redis/): The Python interface to the Redis key-value store. + +**shogun:** +* [swig](https://github.com/swig/swig): SWIG is a compiler that integrates C and C++ with languages including Perl, Python, Tcl, Ruby, PHP, Java, C#, D, Go, Lua, Octave, R, Scheme (Guile, MzScheme/Racket), Scilab, OCaml. SWIG can also export its parse tree into XM + +**weka:** +* [java](https://www.java.com/en/): Java is the programming language on which Weka is based. + +**elki:** +* [java](https://www.java.com/en/): Java is the programming language on which ELKI is based. + +**milk:** +* [Eigen3](http://eigen.tuxfamily.org/index.php?title=Main_Page): Eigen is a C++ template library for linear algebra: matrices, vectors, numerical solvers, and related algorithms. + +**R:** +* [gfortran](https://gcc.gnu.org/wiki/GFortran): A free Fortran 95/2003/2008 compiler for GCC. +* [readline](https://www.gnu.org/software/readline/): This library provides a set of functions for use by applications that allow users to edit command lines as they are typed in. +* [libbz2-dev](https://www.sourceware.org/bzip2/): This is a freely available, patent-free, high-quality data compressor. Header files of this software are required. +* [liblzma-dev](https://tukaani.org/xz/): XZ Utils is free general-purpose data compression software with a high compression ratio. Header files of this software are required. +* [libcurl4](https://curl.haxx.se/libcurl/): libcurl is a free and easy-to-use client-side URL transfer library. + + ## Running Benchmarks are run with the `make` command. @@ -234,18 +302,6 @@ methods: In this case we benchmark the pca method located in methods/mlpack/pca.py with the isolet and the cities dataset.
The pca method scales the data before running the pca method. The benchmark performs twice for each dataset. Additionally the pca.py script supports the following file formats txt, csv, hdf5 and bin. If the data isn't available in this particular case the format will be generated. -## Competing libraries - -* http://mlpack.org -* http://mathworks.com -* http://shogun-toolbox.org -* http://cs.waikato.ac.nz/ml/weka/ -* https://elki-project.github.io/ -* http://scikit-learn.org -* http://mlpy.sourceforge.net -* http://www.cs.umd.edu/~mount/ANN/ -* http://www.cs.ubc.ca/research/flann/ - ## Citation details If you use the benchmarks in your work, we'd really appreciate it if you could cite the following paper (given in BiBTeX format): diff --git a/libraries/ann_install.sh b/libraries/ann_install.sh index b50ab08..983884d 100755 --- a/libraries/ann_install.sh +++ b/libraries/ann_install.sh @@ -8,14 +8,14 @@ # One ann.tar.gz file should be located in this directory containing the # source code of the desired mlpack version. The first argument is the number # of cores to use for build. -if [ "$1" -eq "" ]; then +if [ "$1" == "" ]; then cores="1"; else cores="$1"; fi tars=`ls ann.tar.gz | wc -l`; -if [ "$tars" -eq "0" ]; +if [ "$tars" == "0" ]; then echo "No source ann.tar.gz found in libraries/!" exit 1 diff --git a/libraries/annoy_install.sh b/libraries/annoy_install.sh index 626eb93..9f8a2cb 100755 --- a/libraries/annoy_install.sh +++ b/libraries/annoy_install.sh @@ -7,7 +7,7 @@ # # One annoy.tar.gz file should be located in this directory. tars=`ls annoy.tar.gz | wc -l`; -if [ "$tars" -eq "0" ]; +if [ "$tars" == "0" ]; then echo "No source annoy.tar.gz found in libraries/!" exit 1 diff --git a/libraries/dlibml_install.sh b/libraries/dlibml_install.sh index 3a9efca..afefb25 100755 --- a/libraries/dlibml_install.sh +++ b/libraries/dlibml_install.sh @@ -8,7 +8,7 @@ # One dlibml.tar.gz file should be located in this directory. 
tars=`ls dlibml.tar.gz | wc -l`; -if [ "$tars" -eq "0" ]; +if [ "$tars" == "0" ]; then echo "No source dlibml.tar.gz found in libraries/!" exit 1 diff --git a/libraries/download_packages.sh b/libraries/download_packages.sh index c26d4f2..beabf1e 100755 --- a/libraries/download_packages.sh +++ b/libraries/download_packages.sh @@ -12,7 +12,7 @@ do wget $url -O $name.$ext; - if [ "$?" -ne "0" ]; then + if [ "$?" != "0" ]; then echo "Failure downloading $name!"; exit 1; fi diff --git a/libraries/dtimeout_install.sh b/libraries/dtimeout_install.sh index 28d55ea..a175d25 100755 --- a/libraries/dtimeout_install.sh +++ b/libraries/dtimeout_install.sh @@ -7,7 +7,7 @@ # # One dtimeout.tar.gz file should be located in this directory. tars=`ls dtimeout.tar.gz | wc -l`; -if [ "$tars" -eq "0" ]; +if [ "$tars" == "0" ]; then echo "No source dtimeout.tar.gz found in libraries/!" exit 1 diff --git a/libraries/flann_install.sh b/libraries/flann_install.sh index adad570..240f49e 100755 --- a/libraries/flann_install.sh +++ b/libraries/flann_install.sh @@ -8,14 +8,14 @@ # One flann.tar.gz file should be located in this directory containing the # source code of the desired mlpack version. The first argument is the number # of cores to use for build. -if [ "$1" -eq "" ]; then +if [ "$1" == "" ]; then cores="1"; else cores="$1"; fi tars=`ls flann.tar.gz | wc -l`; -if [ "$tars" -eq "0" ]; +if [ "$tars" == "0" ]; then echo "No source flann.tar.gz found in libraries/!" exit 1 diff --git a/libraries/hlearn_install.sh b/libraries/hlearn_install.sh index 4b70a69..62c4c15 100755 --- a/libraries/hlearn_install.sh +++ b/libraries/hlearn_install.sh @@ -12,14 +12,14 @@ # HLearn doesn't work with ghc 8, so this installation is commented out. #tars=`ls HLearn.tar.gz | wc -l`; -#if [ "$tars" -eq "0" ]; +#if [ "$tars" == "0" ]; #then # echo "No source HLearn.tar.gz found in libraries/!" 
# exit 1 #fi #subhasktars=`ls subhask.tar.gz | wc -l`; -#if [ "$tars" -eq "0" ]; +#if [ "$tars" == "0" ]; #then # echo "No source subhask.tar.gz found in libraries/!" # exit 1 diff --git a/libraries/install_all.sh b/libraries/install_all.sh index 829145b..5aec139 100755 --- a/libraries/install_all.sh +++ b/libraries/install_all.sh @@ -11,88 +11,93 @@ mkdir include/ mkdir debug/ ./ann_install.sh $1 -if [ "$?" -ne "0" ]; then +if [ "$?" != "0" ]; then echo "Error installing ANN!"; exit 1; fi ./flann_install.sh $1 -if [ "$?" -ne "0" ]; then +if [ "$?" != "0" ]; then echo "Error installing FLANN!"; exit 1; fi +# HLearn is not supported on ghc 8. So it's commented out #./hlearn_install.sh $1 -#if [ "$?" -ne "0" ]; then +#if [ "$?" != "0" ]; then # echo "Error installing HLearn!"; # exit 1; #fi ./matlab_install.sh $1 -if [ "$?" -ne "0" ]; then +if [ "$?" != "0" ]; then echo "Error checking for MATLAB!"; exit 1; fi ./mlpack_install.sh $1 -if [ "$?" -ne "0" ]; then +if [ "$?" != "0" ]; then echo "Error installing mlpack!"; exit 1; fi ./mlpy_install.sh $1 -if [ "$?" -ne "0" ]; then +if [ "$?" != "0" ]; then echo "Error installing mlpy!"; exit 1; fi ./scikit_install.sh $1 -if [ "$?" -ne "0" ]; then +if [ "$?" != "0" ]; then echo "Error installing scikit-learn!"; exit 1; fi ./nearpy_install.sh $1 -if [ "$?" -ne "0" ]; then +if [ "$?" != "0" ]; then echo "Error installing nearpy!"; exit 1; fi ./annoy_install.sh $1 -if [ "$?" -ne "0" ]; then +if [ "$?" != "0" ]; then echo "Error installing annoy!"; exit 1; fi ./shogun_install.sh $1 -if [ "$?" -ne "0" ]; then +if [ "$?" != "0" ]; then echo "Error installing shogun!"; exit 1; fi ./weka_install.sh $1 -if [ "$?" -ne "0" ]; then +if [ "$?" != "0" ]; then echo "Error installing Weka!"; exit 1; fi ./elki_install.sh $1 -if [ "$?" -ne "0" ]; then +if [ "$?" != "0" ]; then echo "Error installing ELKI!"; exit 1; fi ./mrpt_install.sh $1 -if [ "$?" -ne "0" ]; then +if [ "$?" 
!= "0" ]; then echo "Error installing MRPT!"; exit 1; fi ./milk_install.sh $1 -if [ "$?" -ne "0" ]; then +if [ "$?" != "0" ]; then echo "Error installing Milk!"; exit 1; fi ./dlibml_install.sh $1 -if [ "$?" -ne "0" ]; then +if [ "$?" != "0" ]; then echo "Error installing dlib!"; exit 1; fi ./r_install.sh $1 -if [ "$?" -ne "0" ]; then +if [ "$?" != "0" ]; then echo "Error installing R!"; exit 1; fi - ./dtimeout_install.sh $1 -if [ "$?" -ne "0" ]; then - echo "Error installing R!"; +if [ "$?" != "0" ]; then + echo "Error installing dtimeout!"; + exit 1; +fi +./mprofiler_install.sh $1 +if [ "$?" != "0" ]; then + echo "Error installing mprofiler!"; exit 1; fi diff --git a/libraries/milk_install.sh b/libraries/milk_install.sh index f9de71f..88f29d0 100755 --- a/libraries/milk_install.sh +++ b/libraries/milk_install.sh @@ -7,7 +7,7 @@ # # One milk.tar.gz file should be located in this directory. tars=`ls milk.tar.gz | wc -l`; -if [ "$tars" -eq "0" ]; +if [ "$tars" == "0" ]; then echo "No source milk.tar.gz found in libraries/!" exit 1 diff --git a/libraries/mlpack_install.sh b/libraries/mlpack_install.sh index 7aca88b..55e0bb3 100755 --- a/libraries/mlpack_install.sh +++ b/libraries/mlpack_install.sh @@ -8,14 +8,14 @@ # One mlpack.tar.gz file should be located in this directory containing the # source code of the desired mlpack version. The first argument is the number # of cores to use during build. -if [ "$1" -eq "" ]; then +if [ "$1" == "" ]; then cores="1"; else cores="$1"; fi tars=`ls mlpack.tar.gz | wc -l`; -if [ "$tars" -eq "0" ]; +if [ "$tars" == "0" ]; then echo "No source mlpack.tar.gz found in libraries/!" 
exit 1 @@ -26,6 +26,7 @@ rm -rf mlpack/ mkdir mlpack/ tar -xzpf mlpack.tar.gz --strip-components=1 -C mlpack/ +# Install mlpack cd mlpack/ mkdir build/ cd build/ diff --git a/libraries/mlpy_install.sh b/libraries/mlpy_install.sh index fae4c06..0cd4a38 100755 --- a/libraries/mlpy_install.sh +++ b/libraries/mlpy_install.sh @@ -7,7 +7,7 @@ # # One mlpy.tar.gz file should be located in this directory. tars=`ls mlpy.tar.gz | wc -l`; -if [ "$tars" -eq "0" ]; +if [ "$tars" == "0" ]; then echo "No source mlpy.tar.gz found in libraries/!" exit 1 diff --git a/libraries/mprofiler_install.sh b/libraries/mprofiler_install.sh new file mode 100644 index 0000000..0798b22 --- /dev/null +++ b/libraries/mprofiler_install.sh @@ -0,0 +1,25 @@ +#!/bin/bash +# +# Wrapper script to unpack and build mprofiler. +# +# Include files will be installed to ../include/. +# Library files will be installed to ../lib/. +# +# One mprofiler.tar.gz file should be located in this directory. +tars=`ls mprofiler.tar.gz | wc -l`; +if [ "$tars" == "0" ]; +then + echo "No source mprofiler.tar.gz found in libraries/!" + exit 1 +fi + +# Remove any old directory +rm -rf mprofiler/ +mkdir mprofiler/ +tar -xzpf mprofiler.tar.gz --strip-components=1 -C mprofiler/ + +cd mprofiler/ +python3 setup.py build +PYVER=`python3 -c 'import sys; print("python%d.%d" % sys.version_info[:2])'`; +mkdir -p ../lib/$PYVER/site-packages/ +PYTHONPATH=../lib/$PYVER/site-packages/ python3 setup.py install --prefix=../ -O2 diff --git a/libraries/mrpt_install.sh b/libraries/mrpt_install.sh index 2e0e0ca..334420b 100755 --- a/libraries/mrpt_install.sh +++ b/libraries/mrpt_install.sh @@ -7,7 +7,7 @@ # # One mrpt.tar.gz file should be located in this directory. tars=`ls mrpt.tar.gz | wc -l`; -if [ "$tars" -eq "0" ]; +if [ "$tars" == "0" ]; then echo "No source mrpt.tar.gz found in libraries/!"
exit 1 diff --git a/libraries/nearpy_install.sh b/libraries/nearpy_install.sh index 8b0664a..73a8e86 100755 --- a/libraries/nearpy_install.sh +++ b/libraries/nearpy_install.sh @@ -7,12 +7,12 @@ # # One Nearpy*.tar.gz file should be located in this directory. tars=`ls nearpy.tar.gz | wc -l`; -if [ "$tars" -eq "0" ]; +if [ "$tars" == "0" ]; then echo "No Nearpy source .tar.gz found in libraries/!" exit 1 fi -if [ "$tars" -ne "1" ]; +if [ "$tars" != "1" ]; then echo "More than one Nearpy source .tar.gz found." echo "Ensure only one is present in libraries/!" diff --git a/libraries/package-urls.txt b/libraries/package-urls.txt index 4d26e65..9e3d638 100644 --- a/libraries/package-urls.txt +++ b/libraries/package-urls.txt @@ -1,11 +1,12 @@ dtimeout https://github.com/pnpnpn/timeout-decorator/archive/master.tar.gz +mprofiler https://github.com/pythonprofilers/memory_profiler/archive/master.tar.gz ann https://www.cs.umd.edu/~mount/ANN/Files/1.1.2/ann_1.1.2.tar.gz flann https://github.com/mariusmuja/flann/archive/1.9.1.tar.gz HLearn https://github.com/mikeizbicki/HLearn/archive/2.0.0.0.tar.gz subhask https://github.com/mikeizbicki/subhask/archive/subhask-0.1.tar.gz mlpack http://www.mlpack.org/files/mlpack-3.0.3.tar.gz mlpy http://freefr.dl.sourceforge.net/project/mlpy/mlpy%203.5.0/mlpy-3.5.0.tar.gz -scikit https://github.com/scikit-learn/scikit-learn/archive/0.18.1.tar.gz +scikit https://github.com/scikit-learn/scikit-learn/archive/0.19.2.tar.gz shogun https://github.com/shogun-toolbox/shogun/archive/shogun_6.1.3.tar.gz shogun-gpl https://github.com/shogun-toolbox/shogun-gpl/archive/v6.1.3.tar.gz weka http://prdownloads.sourceforge.net/weka/weka-3-8-1.zip diff --git a/libraries/r_install.sh b/libraries/r_install.sh index 92bc1d6..7e80126 100755 --- a/libraries/r_install.sh +++ b/libraries/r_install.sh @@ -7,7 +7,7 @@ # # One R.tar.gz file should be located in this directory. 
tars=`ls R.tar.gz | wc -l`; -if [ "$tars" -eq "0" ]; +if [ "$tars" == "0" ]; then echo "No source R.tar.gz found in libraries/!" exit 1 diff --git a/libraries/scikit_install.sh b/libraries/scikit_install.sh index 903af98..8cf2995 100755 --- a/libraries/scikit_install.sh +++ b/libraries/scikit_install.sh @@ -7,7 +7,7 @@ # # One scikit.tar.gz file should be located in this directory. tars=`ls scikit.tar.gz | wc -l`; -if [ "$tars" -eq "0" ]; +if [ "$tars" == "0" ]; then echo "No source scikit.tar.gz found in libraries/!" exit 1 diff --git a/libraries/shogun_install.sh b/libraries/shogun_install.sh index 26ffeeb..6f0d53a 100755 --- a/libraries/shogun_install.sh +++ b/libraries/shogun_install.sh @@ -7,21 +7,21 @@ # # One shogun.tar.gz file should be located in this directory. The first argument # is the number of cores to use for build. -if [ "$1" -eq "" ]; then +if [ "$1" == "" ]; then cores="1"; else cores="$1"; fi tars=`ls shogun.tar.gz | wc -l`; -if [ "$tars" -eq "0" ]; +if [ "$tars" == "0" ]; then echo "No source shogun.tar.gz found in libraries/!" exit 1 fi gpl_tars=`ls shogun-gpl.tar.gz | wc -l`; -if [ "$gpl_tars" -eq "0" ]; +if [ "$gpl_tars" == "0" ]; then echo "No gpl source shogun-gpl.tar.gz found in libraries/!" 
exit 1 @@ -38,9 +38,11 @@ tar -xzpf shogun-gpl.tar.gz --strip-components=1 -C shogun/src/gpl cd shogun/ mkdir build/ cd build/ -cmake -DPYTHON_INCLUDE_DIR=/usr/include/python3.5 \ +PYVER=`python3 -c 'import sys; print("python%d.%d" % sys.version_info[:2])'`; +mkdir -p ../../lib/$PYVER/dist-packages/; +cmake -DPYTHON_INCLUDE_DIR=/usr/include/$PYVER \ -DPYTHON_EXECUTABLE:FILEPATH=/usr/bin/python3 \ - -DPYTHON_PACKAGES_PATH=../../lib/python3.5/dist-packages \ + -DPYTHON_PACKAGES_PATH=../../lib/$PYVER/dist-packages \ -DINTERFACE_PYTHON=ON \ -DBUILD_META_EXAMPLES=OFF \ -DCMAKE_BUILD_TYPE=Release \ diff --git a/libraries/weka_install.sh b/libraries/weka_install.sh index e112ff9..55cb7db 100755 --- a/libraries/weka_install.sh +++ b/libraries/weka_install.sh @@ -7,7 +7,7 @@ # # One weka.zip file should be located in this directory. tars=`ls weka.zip | wc -l`; -if [ "$tars" -eq "0" ]; +if [ "$tars" == "0" ]; then echo "No source weka.zip found in libraries/!" exit 1 diff --git a/run.py b/run.py index 1fa5dc2..b10ca21 100644 --- a/run.py +++ b/run.py @@ -15,6 +15,7 @@ sys.path.insert(0, cmd_subfolder) from util import * +from memory_profiler import memory_usage def run(config, library, methods, loglevel): # Configure logging. @@ -77,7 +78,13 @@ def run_timeout_wrapper(): logging.info('Run: %s' % (str(instance))) # Run the metric method. - result = instance.metric() + mem_use, result = memory_usage(instance.metric, + interval=1e-3, + include_children=True, + max_usage=True, + retval=True, + backend="psutil") + result["memory"] = str(format(mem_use, ".2f")) + "MB" logging.info('Metric: %s' % (str(result))) # Pass the result to the driver.
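
The shogun and mprofiler install scripts in this change derive a `pythonX.Y` directory name from the interpreter version. The following is a minimal sketch of that computation, not code from the repository: the helper names `python_dir_name` and `python_dir_name_by_slicing` are hypothetical, and it assumes reading `sys.version_info` is acceptable in place of slicing the version string.

```python
import sys


def python_dir_name(version_info=sys.version_info):
    """Return 'pythonX.Y' built from the (major, minor, ...) version tuple."""
    major, minor = version_info[:2]
    return "python%d.%d" % (major, minor)


def python_dir_name_by_slicing(version_string):
    """Return 'python' plus the first three characters of the version string.

    Hypothetical helper, kept only to compare against the tuple-based form.
    """
    return "python" + version_string[0:3]


if __name__ == "__main__":
    # Name for the interpreter running this script.
    print(python_dir_name())
    # Comparison on a two-digit minor version (illustrative input).
    print(python_dir_name((3, 10, 4)))
    print(python_dir_name_by_slicing("3.10.4"))
```

Taking the first three characters of the version string truncates a two-digit minor version, so the two helpers agree on Python 3.5 but differ on Python 3.10. Building the name from the version tuple avoids pointing `PYTHONPATH` or `PYTHON_INCLUDE_DIR` at a directory that does not match the interpreter.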