- Deploying on AWS EC2 Cluster
- Deploying in a Cluster
- Deploying on a single multicore machine
- Benchmarking on AWS EC2
- Fine tuning GraphLab PowerGraph performance
- You should have an Amazon EC2 account eligible to run in the us-east-1a zone.
- Using the Amazon AWS console, find your AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY (under your account name in the top right corner -> Security Credentials -> Access Keys).
- You should have a keypair attached to the zone you are running in (in our example us-east-1a), as explained [here](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html). You will need to know your keypair name (graphlabkey in our example) and the location of the private key (~/.ssh/graphlab.pem in our example).
- Install boto, the AWS Python client. To install, run:
sudo pip install boto
- Download and install GraphLab PowerGraph using the instructions in the README.md.
Edit your .bashrc, .bash_profile, or .profile file (remember to source it after editing, using the bash command “source <filename>”):
export AWS_ACCESS_KEY_ID=[ Your access key ]
export AWS_SECRET_ACCESS_KEY=[ Your access key secret ]
cd ~/graphlabapi/scripts/ec2
./gl-ec2 -i ~/.ssh/graphlab.pem -k graphlabkey -s 1 launch launchtest
(In the above command, we created a 2-node cluster in the us-east-1a zone: -s is the number of slaves, launch is the action, and launchtest is the name of the cluster.)
./gl-ec2 -i ~/.ssh/graphlab.pem -k graphlabkey update launchtest
This step runs ALS (alternating least squares) in a cluster using a small Netflix subset. It first downloads the data from the web (http://www.select.cs.cmu.edu/code/graphlab/datasets/smallnetflix_mm.train and http://www.select.cs.cmu.edu/code/graphlab/datasets/smallnetflix_mm.validate), copies it into HDFS, and runs 5 alternating least squares iterations:
./gl-ec2 -i ~/.ssh/graphlab.pem -k graphlabkey als_demo launchtest
After the run completes, log in to the master node and view the output files in the folder ~/graphlabapi/release/toolkits/collaborative_filtering/. The algorithm and exact output format are explained in the API docs.
./gl-ec2 -i ~/.ssh/graphlab.pem -k graphlabkey destroy launchtest
Log in to the master node using
./gl-ec2 -i ~/.ssh/graphlab.pem -s 1 login launchtest
Install GraphLab PowerGraph on your master node (one of your cluster machines), using the instructions in the README.md.
- Create a file called “machines” in your home directory with the names of all the MPI nodes participating in the computation.
For example:
cat ~/machines
mynode1.some.random.domain
mynode2.some.random.domain
...
mynode18.some.random.domain
- Verify that the machines file from the previous step is present in the home directory on all of the machines.
-
You will need to set up passwordless SSH between the master node and all the other machines. Verify that it is possible to ssh without a password between any pair of machines (a minimal key-setup sketch is shown below the example).
Before proceeding, verify that this is set up correctly; check that the following connects to the remote machine without prompting for a password:
# from machine mynode1.some.random.domain
ssh mynode2.some.random.domain
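If the test above prompts for a password, here is a minimal setup sketch, assuming OpenSSH and the same username on every machine (adapt hostnames to your cluster):
# on the master node: generate a key pair (skip if ~/.ssh/id_rsa already exists)
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
# append the public key to authorized_keys on every machine listed in ~/machines
for host in $(cat ~/machines); do ssh-copy-id "$host"; done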
- On the node you installed GraphLab on, run the following commands to copy GraphLab files to the rest of the machines:
cd ~/graphlab/release/toolkits
~/graphlab/scripts/mpirsync
cd ~/graphlab/deps/local
~/graphlab/scripts/mpirsync
This step will only work if the file you created in step 1 was named "machines" and located in your home directory.
In order for mpirsync to run properly, all machines must have all network ports open.
This step runs the PageRank algorithm on a synthetically generated graph of 100,000 nodes. It spawns two GraphLab MPI instances (-n 2).
mpiexec -n 2 -hostfile ~/machines /path/to/pagerank --powerlaw=100000
This step runs ALS (alternating least squares) in a cluster using a small Netflix subset. It first downloads an anonymized, synthetic Netflix-like dataset from the web (http://www.select.cs.cmu.edu/code/graphlab/datasets/smallnetflix_mm.train and http://www.select.cs.cmu.edu/code/graphlab/datasets/smallnetflix_mm.validate) and runs 5 alternating least squares iterations. After the run completes, you can log in to any of the nodes and view the output files in the folder ~/graphlab/release/toolkits/collaborative_filtering/
cd /some/ns/folder/
mkdir smallnetflix
cd smallnetflix/
wget http://www.select.cs.cmu.edu/code/graphlab/datasets/smallnetflix_mm.train
wget http://www.select.cs.cmu.edu/code/graphlab/datasets/smallnetflix_mm.validate
Now run GraphLab:
mpiexec -n 2 -hostfile ~/machines /path/to/als --matrix /some/ns/folder/smallnetflix/ --max_iter=3 --ncpus=1 --minval=1 --maxval=5 --predictions=out_file
Where -n is the number of MPI nodes, and --ncpus is the number of deployed cores on each MPI node.
machines is a file listing the machines you would like to deploy on (one machine per line).
Note: this section assumes you have a network storage (ns) folder where the input can be stored. Alternatively, you can split the input into several disjoint files, and store the subsets on the cluster machines.
Note: Don’t forget to change /path/to/als and /some/ns/folder to your actual folder path!
Note: For mpich2, use -f instead of -hostfile.
Fine tuning the GraphLab deployment: common errors and solutions.
/mnt/info/home/daroczyb/als: error while loading shared libraries: libevent_pthreads-2.0.so.5: cannot open shared object file: No such file or directory
Solution:
Define LD_LIBRARY_PATH to point to the location of libevent_pthreads; this is done with mpiexec's -x flag, for example:
mpiexec --hostfile machines -x LD_LIBRARY_PATH=/home/daroczyb/graphlab/deps/local/lib/ /mnt/info/home/daroczyb/als /mnt/info/home/daroczyb/smallnetflix_mm.train
/mnt/info/home/daroczyb/als: error while loading shared libraries: libjvm.so: cannot open shared object file: No such file or directory
Solution:
Point LD_LIBRARY_PATH to the location of libjvm.so using mpiexec's -x flag:
mpiexec --hostfile machines -x LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/daroczyb/graphlab/deps/local/lib/:/usr/lib/jvm/java-7-openjdk-amd64/jre/lib/amd64/server/ /mnt/info/home/daroczyb/als /mnt/info/home/daroczyb/smallnetflix_mm.train
problem with execution of /graphlab/release/toolkits/collaborative_filtering/als on debian1: [Errno 2] No such file or directory
Solution:
Verify that the executable is found at the same path on all machines.
A prompt asking for a password when running mpiexec.
Solution: Set up passwordless SSH with a public/private key pair (no password), as described in the SSH setup section above.
Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: hdfs://[domain]:9000/user/[user_name]/data.txt, expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:381)
at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:55)
at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:307)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:842)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:867)
at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:487)
Call to org.apache.hadoop.fs.FileSystem::listStatus failed!
WARNING: distributed_graph.hpp(load_from_hdfs:1889): No files found matching hdfs://[domain]:9000/user/[user_name]/data.txt
Solution: Verify that your CLASSPATH includes all of the required Hadoop folders.
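For example, a sketch assuming Hadoop is installed under /usr/lib/hadoop (a hypothetical path; adjust to your installation):
# hypothetical Hadoop location; adjust to your installation
export HADOOP_HOME=/usr/lib/hadoop
# put the Hadoop configuration folder and all Hadoop jars on the classpath
export CLASSPATH=$CLASSPATH:$HADOOP_HOME/conf:$HADOOP_HOME/*:$HADOOP_HOME/lib/*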
Just after TCP Communication layer is constructed:
BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES, EXITCODE: 11, CLEANING UP REMAINING PROCESSES, YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
or:
[xyzserver:22296] *** Process received signal *** mpiexec noticed that process rank 0 with PID 22296 on node xyzserver exited on signal 11 (Segmentation fault).
Solution:
Check that all machines have access to, and are running, the same binary.
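One quick way to check is to compare checksums across the cluster; a sketch assuming the ~/machines file from the cluster setup above:
# print the binary's checksum on every machine; all lines should be identical
for host in $(cat ~/machines); do ssh "$host" md5sum /path/to/als; done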
Install GraphLab PowerGraph on your machine using the instructions in the README.md, except invoke the configure script with the ‘--no_mpi’ flag. Don’t forget to use
./configure --no_mpi
when configuring GraphLab.
This step runs ALS (alternating least squares) using a small Netflix subset. It first downloads the data from the web and runs 5 alternating least squares iterations. After the run completes, the output files will be created in the running folder (graphlab/release/toolkits/collaborative_filtering/):
cd graphlab/release/toolkits/collaborative_filtering/
mkdir smallnetflix
cd smallnetflix/
wget http://www.select.cs.cmu.edu/code/graphlab/datasets/smallnetflix_mm.train
wget http://www.select.cs.cmu.edu/code/graphlab/datasets/smallnetflix_mm.validate
cd ..
Now run GraphLab:
./als --matrix ./smallnetflix/ --max_iter=5 --ncpus=1 --predictions=out_file
Where --ncpus is the number of deployed cores.
A common recurring task is evaluating GraphLab's performance and scaling properties on a cluster. To help jump-start benchmarking, we have created this tutorial.
- You should have an Amazon EC2 account eligible to run in the us-west zone.
- Using the Amazon AWS console, find your AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY (under your account name in the top right corner -> Security Credentials -> Access Keys).
- You should have a keypair attached to the zone you are running in (in our example us-west), as explained [here](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html). You will need to know your keypair name (amazonec2 in our example) and the location of the private key (~/.ssh/amazonec2.pem in our example).
- Install boto, the AWS Python client. To install, run:
sudo pip install boto
- Download and install GraphLab using the instructions in the README.md.
We recommend using high performance computing instances (like cc2.8xlarge), since we observed significantly improved performance, especially with respect to variation in cluster load and network utilization. The scripts also allow using regular instances.
To smooth out unexpected EC2 load variation, we recommend repeating each experiment a few times and computing the average.
Edit your .bashrc, .bash_profile, or .profile file (remember to source it after editing, using the bash command “source <filename>”):
export AWS_ACCESS_KEY_ID=[ Your access key ]
export AWS_SECRET_ACCESS_KEY=[ Your access key secret ]
Edit the benchmark_ec2.sh script found under graphlab/scripts/ec2:
1. Select the requested algorithms from the following options (setting an algorithm to 0 will disable its run):
ALS=1 # alternating least squares
SVD=1 # singular value decomposition
PR=1 # pagerank
2. Select the number of slaves (any number between 0 and n) by setting the MAX_SLAVES variable.
3. Select the number of experiment repeats by setting the MAX_RETRY variable.
The benchmarking script spawns an EC2 cluster of n machines and then tests the requested algorithms using 0, 1, ..., n-1 slaves. Each experiment is repeated MAX_RETRY times. An example of the edited variables is sketched below.
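For example, the edited variables might look like this (a sketch; the variable names come from the script, the values are illustrative):
# run ALS and PR but not SVD, on up to 4 slaves, repeating each run 3 times
ALS=1
SVD=0
PR=1
MAX_SLAVES=4
MAX_RETRY=3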
cd ~/graphlabapi/scripts/ec2
./benchmark_ec2.sh
It is advised to redirect the benchmarking output to a file, for example in bash:
./benchmark_ec2.sh > output 2>&1
To find the final runtime for ALS/SVD:
grep "Runtime" output
To find the final runtime for PR:
grep "Finished Running" output
You will need to manually compute the average runtime for each case (a sketch follows below). A recommended metric is the “speedup” curve: the time to execute on a single machine divided by the time to execute on k machines. The optimal result is linear speedup, namely that running on k machines speeds up the algorithm by a factor of k relative to a single machine.
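A sketch for computing the average runtime, assuming the runtime is the last whitespace-separated field of each matching line:
# average the runtimes reported in the output file
grep "Runtime" output | awk '{ sum += $NF; n++ } END { if (n > 0) print sum / n }'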
Here is a more detailed explanation of the benchmarking process. The benchmark_ec2.sh script calls the gl-ec2 script, which in turn calls the gl_ec2.py script. It uses:
- The “launch” command to start a GraphLab cluster with X machines.
- The “update” command to get the latest version of GraphLab from git, recompile it, and disseminate the binary to the slaves.
- The “als_demo”, “svd_demo”, and “pagerank_demo” commands to benchmark the ALS/SVD/PR algorithms. Each first downloads a dataset from the web and then calls GraphLab with the right command line to run on the downloaded dataset. For PR we use the LiveJournal dataset. For ALS/SVD we use a Netflix-like synthetic sample.
- If you would like to benchmark a different dataset, you can edit the dataset URL in the gl_ec2.py script.
- If you would like to benchmark a different algorithm, you can add an additional youralgo_demo section to the gl_ec2.py script.
- If you would like to benchmark a regular instance, simply change the following line in gl_ec2.py from
./gl-ec2 -i ~/.ssh/amazonec2.pem -k amazonec2 -a hpc -s $MAX_SLAVES -t cc2.8xlarge launch hpctest
to:
./gl-ec2 -i ~/.ssh/amazonec2.pem -k amazonec2 -s $MAX_SLAVES -t m1.xlarge launch hpctest
If you would like to work in a different EC2 region (the default is us-west):
For the us-east region, these are the provided AMIs:
Standard: ami-31360458; high performance: ami-39360450.
You should:
- add the following line to gl_ec2.py:
opts.ami = "ami-31360458"
- run with the additional command line argument:
-r us-east-1
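Putting the pieces together, a launch command for us-east-1 might look like this (a sketch based on the flags shown above):
./gl-ec2 -i ~/.ssh/amazonec2.pem -k amazonec2 -r us-east-1 -s $MAX_SLAVES -t cc2.8xlarge launch hpctest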
If you encounter any problem when trying to run this benchmark, feel free to post on forum.graphlab.com.
This section contains tips and examples on how to set up GraphLab properly on your cluster and how to squeeze out performance.
Verify that you compiled GraphLab in the release subfolder (and not in the debug subfolder). Compiling in release mode may speed up execution by up to 10x!
Tip: Always compile in release when testing performance.
GraphLab PowerGraph has built-in parallel loading of the input graph. However, for efficient parallel loading, the input file should be split into multiple disjoint sub-files. When a single input file is used, graph loading becomes serial (which is bad!).
Each MPI instance has a single loader of the input graph attached to it (no matter how many CPUs are used by that MPI instance).
Tip: Always split your input file into at least as many pieces as the number of MPI processes you are using (see the sketch below).
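For example, to prepare an input for 8 MPI processes (a sketch assuming GNU coreutils; graph.txt and the graph_part_ prefix are hypothetical names):
# split by whole lines into 8 chunks so no edge line is cut in the middle
split -n l/8 graph.txt graph_part_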
You can test your MPI setup as follows:
- Compile the release/demoapps/rpc subfolder (using “cd release/demoapps/rpc/; make”). Copy the files generated by the compilation to all machines.
- Run:
mpiexec -n 2 --hostfile ~/machines /home/ubuntu/graphlab/release/demoapps/rpc/rpc_example1
As part of the output, you should see something like this:
TCP Communication layer constructed.
TCP Communication layer constructed.
10
5 plus 1 is : 6
11 plus 1 is : 12
If you get something else, please report an error as explained below.
Prior to program execution, the graph is first loaded into memory and partitioned across the different cluster machines. It is possible to try different partitioning strategies. This is done using the following flags:
--graph_opts="ingress=oblivious"
or
--graph_opts="ingress=grid" # works for power-of-2 sized clusters, i.e. 2, 4, 8, ... machines
For different graphs, different partitioning methods may give different performance gains.
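For example, combining these flags with the PageRank demo from earlier (a sketch; adjust the path and machine count to your setup):
mpiexec -n 4 -hostfile ~/machines /path/to/pagerank --powerlaw=100000 --graph_opts="ingress=grid"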
The --ncpus option lets you set the number of cores used to perform computation. Prior to 2.1.4644 this defaulted to 2. After 2.1.4644, it defaults to #cores - 2. When running in the distributed setting, the maximum this should be set to is #cores - 2, since two cores should be reserved for communication.
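For example, on hypothetical 8-core machines, reserving 2 cores for communication:
mpiexec -n 2 -hostfile ~/machines /path/to/pagerank --powerlaw=100000 --ncpus=6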