Add memory benchmarks #142

Open: wants to merge 5 commits into base: master
84 changes: 70 additions & 14 deletions README.md
@@ -12,22 +12,90 @@ The system has several key attributes that lead to its highly and easily customi


Quick links to this file:
* [Competing libraries](#competing-libraries)
* [Prerequisites](#prerequisites)
* [Running](#running)
* [Directory Structure](#directory-structure)
* [Getting the datasets](#getting-the-datasets)
* [Configuration](#configuration)
* [Competing libraries](#competing-libraries)
* [Citation details](#citation-details)

## Competing libraries

Machine learning toolkits:
* [mlpack](http://mlpack.org)
* [Shogun-toolbox](http://shogun-toolbox.org)
* [scikit-learn](http://scikit-learn.org)
* [MATLAB](http://mathworks.com)
* [Weka](http://cs.waikato.ac.nz/ml/weka/)
* [elki](https://elki-project.github.io/)
* [mlpy](http://mlpy.sourceforge.net)
* [dlibml](http://dlib.net/ml.html)
* [milk](https://github.com/luispedro/milk/)
* [R](https://www.r-project.org/)

Nearest Neighbour Algorithms:
* [ANN](http://www.cs.umd.edu/~mount/ANN/)
* [FLANN](http://www.cs.ubc.ca/research/flann/)
* [nearpy](http://pixelogik.github.io/NearPy/)
* [annoy](https://github.com/spotify/annoy)
* [mrpt](https://github.com/vioshyvo/mrpt)

Inactive toolkits:
* [HLearn](https://github.com/mikeizbicki/HLearn)
NOTE: `HLearn` is not currently being benchmarked by this repository.

## Prerequisites

* **[Python 3.3+](http://www.python.org "Python Website")**: The main benchmark script is written in Python. By default, the benchmark script uses the version of Python on your path.
* **[numpy](https://www.numpy.org/)**: Numpy provides a powerful N-dimensional array object and sophisticated (broadcasting) functions useful for handling and transforming data.
* **[Python-yaml](http://pyyaml.org "Python-yaml Website")**: PyYAML is a YAML parser and emitter for Python. We've picked YAML as the configuration file format for specifying the structure for the project.
* **[SQLite](http://www.sqlite.org "SQLite Website")** (**Optional**): SQLite is a lightweight disk-based database that doesn't require a separate server process. We use Python's built-in SQLite support to save the benchmark results.
* **[Valgrind](http://valgrind.org "Valgrind Website")** (**Optional**): Valgrind is a suite of tools for debugging and profiling. This package is only needed if you want to run the memory benchmarks.
* **[python-xmlrunner](https://github.com/lamby/pkg-python-xmlrunner "python-xmlrunner github")** (**Optional**): The xmlrunner module is a unittest test runner that can save test results to XML files. This package is only needed if you want to run the tests.
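
As a quick sanity check before running anything, the sketch below reports whether the command-line prerequisites listed above are available on the `PATH`. The helper name `check_tool` and the choice of tools to check are illustrative assumptions; they are not part of the benchmark scripts.

```shell
#!/bin/bash
# Hypothetical helper (not part of this repository): report whether a
# command-line tool can be found on the PATH.
check_tool() {
  if command -v "$1" > /dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1"
    return 1
  fi
}

# python3 is required; valgrind is only needed for the memory benchmarks.
check_tool python3 || echo "python3 is required to run the benchmarks"
check_tool valgrind || echo "valgrind is optional (memory benchmarks only)"
```

The Python packages (numpy, PyYAML) are not covered by this check, since they are imported by Python rather than found on the `PATH`.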

### Prerequisites for Setting up Competing Libraries
All of the following prerequisite packages must be installed before running the `make setup` command (see the next [section](#running)):
**FLANN library:**
* [hdf5](https://www.hdfgroup.org/solutions/hdf5/): This is a high performance data software library.
* [gtest](https://github.com/google/googletest): This package is used for writing C++ tests.

**mlpack:**
* [Armadillo](http://arma.sourceforge.net/download.html): This package is a C++ library for linear algebra and scientific computing.
* [Boost C++](https://www.boost.org/): This package is required for compiling mlpack from source.

**mlpy:**
* [scipy](https://www.scipy.org/): Python-based ecosystem of open-source software for mathematics, science, and engineering
* [GSL](https://www.gnu.org/software/gsl/): This is a numerical library for C and C++ programmers.

**scikit-learn:**
* [scipy](https://www.scipy.org/): Python-based ecosystem of open-source software for mathematics, science, and engineering
* [joblib](https://joblib.readthedocs.io/): Joblib is a set of tools to provide lightweight pipelining in Python.
* [Cython-0.25.2](https://cython.org/): C extensions for Python. Required for compiling scikit-learn from source.

**Nearpy:**
* [scipy](https://www.scipy.org/): Python-based ecosystem of open-source software for mathematics, science, and engineering
* [redis](https://redislabs.com/lp/python-redis/): The Python interface to the Redis key-value store.

**shogun:**
* [swig](https://github.com/swig/swig): SWIG is a compiler that integrates C and C++ with languages including Perl, Python, Tcl, Ruby, PHP, Java, C#, D, Go, Lua, Octave, R, Scheme (Guile, MzScheme/Racket), Scilab, Ocaml. SWIG can also export its parse tree into XM

**weka:**
* [java](https://www.java.com/en/): Java is a programming language on which weka is based.

**elki:**
* [java](https://www.java.com/en/): Java is a programming language on which elki is based.

**milk:**
* [Eigen3](http://eigen.tuxfamily.org/index.php?title=Main_Page): Eigen is a C++ template library for linear algebra: matrices, vectors, numerical solvers, and related algorithms.

**R:**
* [gfortran](https://gcc.gnu.org/wiki/GFortran): A free Fortran 95/2003/2008 compiler for GCC
* [readline](https://www.gnu.org/software/readline/): This library provides a set of functions for use by applications that allow users to edit command lines as they are typed in.
* [libbz2-dev](https://www.sourceware.org/bzip2/): This is a freely available, patent free, high-quality data compressor. Header files of this software are required.
* [liblzma-dev](https://tukaani.org/xz/): XZ Utils is free general-purpose data compression software with a high compression ratio. Header files of this software are required.
* [libcurl4](https://curl.haxx.se/libcurl/): libcurl is a free and easy-to-use client-side URL transfer library


## Running

Benchmarks are run with the `make` command.
@@ -234,18 +302,6 @@ methods:

In this case we benchmark the pca method located in methods/mlpack/pca.py with the isolet and the cities datasets. The pca method scales the data before running PCA. The benchmark runs twice for each dataset. Additionally, the pca.py script supports the following file formats: txt, csv, hdf5 and bin. If a dataset isn't available in the required format, that format will be generated.

## Competing libraries

* http://mlpack.org
* http://mathworks.com
* http://shogun-toolbox.org
* http://cs.waikato.ac.nz/ml/weka/
* https://elki-project.github.io/
* http://scikit-learn.org
* http://mlpy.sourceforge.net
* http://www.cs.umd.edu/~mount/ANN/
* http://www.cs.ubc.ca/research/flann/

## Citation details

If you use the benchmarks in your work, we'd really appreciate it if you could cite the following paper (given in BibTeX format):

4 changes: 2 additions & 2 deletions libraries/ann_install.sh
@@ -8,14 +8,14 @@
# One ann.tar.gz file should be located in this directory containing the
# source code of the desired ANN version. The first argument is the number
# of cores to use for build.
if [ "$1" -eq "" ]; then
if [ "$1" == "" ]; then
cores="1";
else
cores="$1";
fi

tars=`ls ann.tar.gz | wc -l`;
if [ "$tars" -eq "0" ];
if [ "$tars" == "0" ];
then
echo "No source ann.tar.gz found in libraries/!"
exit 1
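
The switch from `-eq` to `==` in the `$1` check above matters because `-eq` compares integers, so testing against an empty string makes `[` report an error rather than a match. The sketch below illustrates the difference and shows parameter expansion with a default as another idiom. The function names are illustrative assumptions, not code from this pull request.

```shell
#!/bin/bash
# Illustrative only: three ways to default the core count to 1 when no
# argument is given.

# 1. Integer comparison against an empty string. `[` reports an error
#    (non-zero status), so the "then" branch is not taken.
cores_with_eq() {
  if [ "$1" -eq "" ] 2> /dev/null; then
    echo "1"
  else
    echo "$1"
  fi
}

# 2. String comparison. The pull request spells this `==`, which bash
#    accepts as a synonym for the POSIX `=` used here.
cores_with_string_test() {
  if [ "$1" = "" ]; then
    echo "1"
  else
    echo "$1"
  fi
}

# 3. Parameter expansion with a default value.
cores_with_default() {
  echo "${1:-1}"
}

echo "eq:      '$(cores_with_eq)'"
echo "string:  '$(cores_with_string_test)'"
echo "default: '$(cores_with_default)'"
```

Under bash, the first variant leaves the core count empty when no argument is passed, while the second and third fall back to `1`.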

2 changes: 1 addition & 1 deletion libraries/annoy_install.sh
@@ -7,7 +7,7 @@
#
# One annoy.tar.gz file should be located in this directory.
tars=`ls annoy.tar.gz | wc -l`;
if [ "$tars" -eq "0" ];
if [ "$tars" == "0" ];
then
echo "No source annoy.tar.gz found in libraries/!"
exit 1

2 changes: 1 addition & 1 deletion libraries/dlibml_install.sh
@@ -8,7 +8,7 @@
# One dlibml.tar.gz file should be located in this directory.

tars=`ls dlibml.tar.gz | wc -l`;
if [ "$tars" -eq "0" ];
if [ "$tars" == "0" ];
then
echo "No source dlibml.tar.gz found in libraries/!"
exit 1

2 changes: 1 addition & 1 deletion libraries/download_packages.sh
@@ -12,7 +12,7 @@ do

wget $url -O $name.$ext;

if [ "$?" -ne "0" ]; then
if [ "$?" != "0" ]; then
echo "Failure downloading $name!";
exit 1;
fi

2 changes: 1 addition & 1 deletion libraries/dtimeout_install.sh
@@ -7,7 +7,7 @@
#
# One dtimeout.tar.gz file should be located in this directory.
tars=`ls dtimeout.tar.gz | wc -l`;
if [ "$tars" -eq "0" ];
if [ "$tars" == "0" ];
then
echo "No source dtimeout.tar.gz found in libraries/!"
exit 1

4 changes: 2 additions & 2 deletions libraries/flann_install.sh
@@ -8,14 +8,14 @@
# One flann.tar.gz file should be located in this directory containing the
# source code of the desired FLANN version. The first argument is the number
# of cores to use for build.
if [ "$1" -eq "" ]; then
if [ "$1" == "" ]; then
cores="1";
else
cores="$1";
fi

tars=`ls flann.tar.gz | wc -l`;
if [ "$tars" -eq "0" ];
if [ "$tars" == "0" ];
then
echo "No source flann.tar.gz found in libraries/!"
exit 1

4 changes: 2 additions & 2 deletions libraries/hlearn_install.sh
@@ -12,14 +12,14 @@
# HLearn doesn't work with ghc 8, so this installation is commented out.

#tars=`ls HLearn.tar.gz | wc -l`;
#if [ "$tars" -eq "0" ];
#if [ "$tars" == "0" ];
#then
# echo "No source HLearn.tar.gz found in libraries/!"
# exit 1
#fi

#subhasktars=`ls subhask.tar.gz | wc -l`;
#if [ "$tars" -eq "0" ];
#if [ "$tars" == "0" ];
#then
# echo "No source subhask.tar.gz found in libraries/!"
# exit 1

43 changes: 24 additions & 19 deletions libraries/install_all.sh
@@ -11,88 +11,93 @@ mkdir include/
mkdir debug/

./ann_install.sh $1
if [ "$?" -ne "0" ]; then
if [ "$?" != "0" ]; then
echo "Error installing ANN!";
exit 1;
fi
./flann_install.sh $1
if [ "$?" -ne "0" ]; then
if [ "$?" != "0" ]; then
echo "Error installing FLANN!";
exit 1;
fi
# HLearn is not supported on ghc 8. So it's commented out
#./hlearn_install.sh $1
#if [ "$?" -ne "0" ]; then
#if [ "$?" != "0" ]; then
# echo "Error installing HLearn!";
# exit 1;
#fi
./matlab_install.sh $1
if [ "$?" -ne "0" ]; then
if [ "$?" != "0" ]; then
echo "Error checking for MATLAB!";
exit 1;
fi
./mlpack_install.sh $1
if [ "$?" -ne "0" ]; then
if [ "$?" != "0" ]; then
echo "Error installing mlpack!";
exit 1;
fi
./mlpy_install.sh $1
if [ "$?" -ne "0" ]; then
if [ "$?" != "0" ]; then
echo "Error installing mlpy!";
exit 1;
fi
./scikit_install.sh $1
if [ "$?" -ne "0" ]; then
if [ "$?" != "0" ]; then
echo "Error installing scikit-learn!";
exit 1;
fi
./nearpy_install.sh $1
if [ "$?" -ne "0" ]; then
if [ "$?" != "0" ]; then
echo "Error installing nearpy!";
exit 1;
fi
./annoy_install.sh $1
if [ "$?" -ne "0" ]; then
if [ "$?" != "0" ]; then
echo "Error installing annoy!";
exit 1;
fi
./shogun_install.sh $1
if [ "$?" -ne "0" ]; then
if [ "$?" != "0" ]; then
echo "Error installing shogun!";
exit 1;
fi
./weka_install.sh $1
if [ "$?" -ne "0" ]; then
if [ "$?" != "0" ]; then
echo "Error installing Weka!";
exit 1;
fi
./elki_install.sh $1
if [ "$?" -ne "0" ]; then
if [ "$?" != "0" ]; then
echo "Error installing ELKI!";
exit 1;
fi
./mrpt_install.sh $1
if [ "$?" -ne "0" ]; then
if [ "$?" != "0" ]; then
echo "Error installing MRPT!";
exit 1;
fi
./milk_install.sh $1
if [ "$?" -ne "0" ]; then
if [ "$?" != "0" ]; then
echo "Error installing Milk!";
exit 1;
fi
./dlibml_install.sh $1
if [ "$?" -ne "0" ]; then
if [ "$?" != "0" ]; then
echo "Error installing dlib!";
exit 1;
fi
./r_install.sh $1
if [ "$?" -ne "0" ]; then
if [ "$?" != "0" ]; then
echo "Error installing R!";
exit 1;
fi

./dtimeout_install.sh $1
if [ "$?" -ne "0" ]; then
echo "Error installing R!";
if [ "$?" != "0" ]; then
echo "Error installing dtimeout!";
exit 1;
fi
./mprofiler_install.sh $1
if [ "$?" != "0" ]; then
echo "Error installing mprofiler!";
exit 1;
fi
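
The repeated run-then-check blocks above could also be written as a loop. The sketch below shows one possible alternative structure and is not what this pull request does. The function name `run_installers` is hypothetical, and the error message uses the script's base name rather than the per-library spellings above.

```shell
#!/bin/bash
# Hypothetical alternative to the repeated blocks: run each
# <name>_install.sh in order and stop at the first failure.
# Usage: run_installers <cores> <name>...
run_installers() {
  local cores="$1"
  shift
  local name
  for name in "$@"; do
    # Pass the core count only when it is non-empty, mirroring the
    # unquoted $1 used in install_all.sh.
    if ! "./${name}_install.sh" ${cores:+"$cores"}; then
      echo "Error installing ${name}!"
      return 1
    fi
  done
}

# Example call (commented out so that the sketch has no side effects):
# run_installers "$1" ann flann mlpack dtimeout mprofiler || exit 1
```

With this structure, adding a new library such as mprofiler means adding one name to the list instead of another five-line block.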
2 changes: 1 addition & 1 deletion libraries/milk_install.sh
@@ -7,7 +7,7 @@
#
# One milk.tar.gz file should be located in this directory.
tars=`ls milk.tar.gz | wc -l`;
if [ "$tars" -eq "0" ];
if [ "$tars" == "0" ];
then
echo "No source milk.tar.gz found in libraries/!"
exit 1

5 changes: 3 additions & 2 deletions libraries/mlpack_install.sh
@@ -8,14 +8,14 @@
# One mlpack.tar.gz file should be located in this directory containing the
# source code of the desired mlpack version. The first argument is the number
# of cores to use during build.
if [ "$1" -eq "" ]; then
if [ "$1" == "" ]; then
cores="1";
else
cores="$1";
fi

tars=`ls mlpack.tar.gz | wc -l`;
if [ "$tars" -eq "0" ];
if [ "$tars" == "0" ];
then
echo "No source mlpack.tar.gz found in libraries/!"
exit 1
@@ -26,6 +26,7 @@ rm -rf mlpack/
mkdir mlpack/
tar -xzpf mlpack.tar.gz --strip-components=1 -C mlpack/

# Install mlpack
cd mlpack/
mkdir build/
cd build/

2 changes: 1 addition & 1 deletion libraries/mlpy_install.sh
@@ -7,7 +7,7 @@
#
# One mlpy.tar.gz file should be located in this directory.
tars=`ls mlpy.tar.gz | wc -l`;
if [ "$tars" -eq "0" ];
if [ "$tars" == "0" ];
then
echo "No source mlpy.tar.gz found in libraries/!"
exit 1

25 changes: 25 additions & 0 deletions libraries/mprofiler_install.sh
@@ -0,0 +1,25 @@
#!/bin/bash
#
# Wrapper script to unpack and build mprofiler.
#
# Include files will be installed to ../include/.
# Library files will be installed to ../lib/.
#
# One mprofiler.tar.gz file should be located in this directory.
tars=`ls mprofiler.tar.gz | wc -l`;
if [ "$tars" == "0" ];
then
echo "No source mprofiler.tar.gz found in libraries/!"
exit 1
fi

# Remove any old directory
rm -rf mprofiler/
mkdir mprofiler/
tar -xzpf mprofiler.tar.gz --strip-components=1 -C mprofiler/

cd mprofiler/
python3 setup.py build
PYVER=`python3 -c 'import sys; print("python%d.%d" % sys.version_info[:2])'`;
mkdir -p ../lib/$PYVER/site-packages/
PYTHONPATH=../lib/$PYVER/site-packages/ python3 setup.py install --prefix=../ -O2
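
The install scripts in this directory count the output of `ls <name>.tar.gz | wc -l` to decide whether the source archive is present. A file test expresses the same check more directly and avoids the `ls` error message when the archive is missing. The sketch below is an illustrative alternative; the function name `require_tarball` is an assumption and does not appear in this pull request.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if <name>.tar.gz exists in the
# current directory; otherwise print the same message as the install
# scripts and return a non-zero status.
require_tarball() {
  local name="$1"
  if [ ! -f "${name}.tar.gz" ]; then
    echo "No source ${name}.tar.gz found in libraries/!"
    return 1
  fi
}

# Example call (commented out so that the sketch has no side effects):
# require_tarball mprofiler || exit 1
```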
2 changes: 1 addition & 1 deletion libraries/mrpt_install.sh
@@ -7,7 +7,7 @@
#
# One mrpt.tar.gz file should be located in this directory.
tars=`ls mrpt.tar.gz | wc -l`;
if [ "$tars" -eq "0" ];
if [ "$tars" == "0" ];
then
echo "No source mrpt.tar.gz found in libraries/!"
exit 1