We will install and configure Spark 2.4. This tutorial assumes that:
- You have 3 machines (1 master, 2 slaves).
- You have installed Ubuntu 18.04.4 LTS on your local (master) machine and on the slaves.
- You have updated and upgraded all packages on all machines.
- You have a working internet connection on the master and the slaves.
- There is a user with the same name on all machines, such as `spark-user`.
- You have passwordless SSH access to the slaves from the master (see the quick checks after this list).
- The hosts file (`/etc/hosts`) is configured on all machines, as shown below:

| Host Name | IP Address | Info |
|---|---|---|
| master | 192.168.10.107 | Local Ubuntu 18 Machine |
| slave-1 | 192.168.10.140 | Virtual Ubuntu 18 Machine |
| slave-2 | 192.168.10.141 | Virtual Ubuntu 18 Machine |
- Java 8 is installed.
- Hadoop, HDFS, and YARN are installed and configured.
Note: If any of these criteria are not met, please check the first guide, Virtual Machine Installation & Configuration Guide, and the second guide, HDFS & Yarn Installation Guide.
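For reference, and to verify the last few assumptions quickly, the sketch below shows the expected `/etc/hosts` entries (per the table above) and a couple of checks you can run from the master; adjust host names and IPs to your own network:

# Expected /etc/hosts entries on every machine (per the table above):
#   192.168.10.107 master
#   192.168.10.140 slave-1
#   192.168.10.141 slave-2
# Quick checks from the master:
getent hosts master slave-1 slave-2      # Confirms name resolution via /etc/hosts
ssh slave-1 'hostname && java -version'  # Confirms passwordless SSH and Java on slave-1
ssh slave-2 'hostname && java -version'  # Confirms passwordless SSH and Java on slave-2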
This section applies to all machines (1 master, 2 slaves). Perform all steps on every machine.
- Install Scala on all machines:
sudo apt install scala
- Confirm that Scala is installed correctly on all machines:
scala -version
The output should look like this on all machines:
Scala code runner version 2.11.12 -- Copyright 2002-2017, LAMP/EPFL
- Go to https://downloads.apache.org/spark and find the current 2.4 release. At the time this document was written (June 25, 2020), version 2.4 had two available sub-releases, 2.4.5 and 2.4.6. We will use version 2.4.5 in this tutorial.
- We will download Spark on the master first and then transfer the Spark files to the slaves. On the master machine, download, untar, and move the Spark files from the command line, then copy them to the remote slaves:
cd ~ # Go to home directory
wget https://downloads.apache.org/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
tar -xvf spark-2.4.5-bin-hadoop2.7.tgz # Untar Spark files
rm ./spark-2.4.5-bin-hadoop2.7.tgz # Delete redundant archive file
scp -r ./spark-2.4.5-bin-hadoop2.7 spark-user@slave-1:~/ # Copy files to slave-1
scp -r ./spark-2.4.5-bin-hadoop2.7 spark-user@slave-2:~/ # Copy files to slave-2
Note: We will use the binary distribution built with Hadoop 2.7, since we installed Hadoop 2.7 before. Make sure that your download is not the src (source) package and that it matches your installed Hadoop version.
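Optionally, you can verify the integrity of the downloaded archive (before the rm step above): Apache publishes a `.sha512` checksum file alongside each release. A quick sketch, assuming the checksum file is available at the same download location:

wget https://downloads.apache.org/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz.sha512 # Download the published checksum
sha512sum spark-2.4.5-bin-hadoop2.7.tgz # Compare this digest with the contents of the .sha512 file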
- On all machines, we will move the Spark files under the `/opt` directory for maintainability and set the permissions:
# For master Machine
sudo mv spark-2.4.5-bin-hadoop2.7 /opt/ # Move Spark files to /opt/ directory
sudo ln -sf /opt/spark-2.4.5-bin-hadoop2.7 /opt/spark # Create symbolic link for abstraction
sudo chown spark-user:root /opt/spark* -R # Change user:spark-user, group:root
sudo chmod g+rwx /opt/spark* -R # Allow group to read-write-execute
# For slave-1 Machine
ssh slave-1
sudo mv spark-2.4.5-bin-hadoop2.7 /opt/ # Move Spark files to /opt/ directory
sudo ln -sf /opt/spark-2.4.5-bin-hadoop2.7 /opt/spark # Create symbolic link for abstraction
sudo chown spark-user:root /opt/spark* -R # Change user:spark-user, group:root
sudo chmod g+rwx /opt/spark* -R # Allow group to read-write-execute
exit # Logout from slave-1
# For slave-2 Machine
ssh slave-2
sudo mv spark-2.4.5-bin-hadoop2.7 /opt/ # Move Spark files to /opt/ directory
sudo ln -sf /opt/spark-2.4.5-bin-hadoop2.7 /opt/spark # Create symbolic link for abstraction
sudo chown spark-user:root /opt/spark* -R # Change user:spark-user, group:root
sudo chmod g+rwx /opt/spark* -R # Allow group to read-write-execute
exit # Logout from slave-2
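To confirm that the link, ownership, and permissions are in place, a quick check from the master:

ls -ld /opt/spark /opt/spark-2.4.5-bin-hadoop2.7 # Link should point to the versioned directory, owned by spark-user:root
ssh slave-1 'ls -ld /opt/spark' # Same check on slave-1
ssh slave-2 'ls -ld /opt/spark' # Same check on slave-2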
- On all machines, append the Spark path to the `$PATH` variable and export `$SPARK_HOME`:
echo '
# For Spark
export PATH=$PATH:/opt/spark/bin
export SPARK_HOME=/opt/spark
' >> ~/.bashrc
source ~/.bashrc # Reload the changed bashrc file
echo $PATH $SPARK_HOME # Confirm that the $PATH and $SPARK_HOME variables are set properly
Output should be: <OTHER_PATHS>:/opt/hadoop/bin:/opt/spark/bin /opt/spark
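As a further check, the spark-submit launcher should now resolve from any directory:

spark-submit --version # Prints the Spark 2.4.5 version banner if $PATH is set correctly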
We will configure the master machine only.
- On the master machine, we will edit `$SPARK_HOME/conf/spark-env.sh`. But first, rename `spark-env.sh.template` to `spark-env.sh`:
mv $SPARK_HOME/conf/spark-env.sh.template $SPARK_HOME/conf/spark-env.sh
- Open up `$SPARK_HOME/conf/spark-env.sh` with a text editor (e.g., GNU Emacs, Vim, Gedit, Nano) and set the following parameters by appending these lines:
export SPARK_MASTER_HOST=master # Host name the master binds to (per /etc/hosts)
export JAVA_HOME=/usr/lib/jvm/current-java # Path to the Java 8 installation
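Optionally, other standalone-cluster settings can be pinned in the same file. The two below are shown with their default values, so adding them only makes the configuration explicit:

export SPARK_MASTER_PORT=7077 # Default port the master listens on
export SPARK_MASTER_WEBUI_PORT=8080 # Default port for the master's web UI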
- On the master machine, we will edit `$SPARK_HOME/conf/slaves`. But first, rename `slaves.template` to `slaves`:
mv $SPARK_HOME/conf/slaves.template $SPARK_HOME/conf/slaves
- Open up `$SPARK_HOME/conf/slaves` with a text editor. If a `localhost` line is present in the file, delete it. Add the following lines only:
slave-1
slave-2
Your `$SPARK_HOME/conf/slaves` file should contain exactly these two lines.
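You can confirm the file contents quickly:

cat $SPARK_HOME/conf/slaves # Should print only slave-1 and slave-2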
- On the master machine, start Spark:
$SPARK_HOME/sbin/start-all.sh
- Validate that everything started correctly by running the `jps` command as spark-user on all machines.
On the master node, you should see a `Master` process; on the slave nodes (slave-1, slave-2), you should see a `Worker` process, as illustrated below.
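For illustration, the relevant `jps` lines look like the following (the PIDs will differ, and the Hadoop/YARN daemons from the previous guides will also appear in the list):

# On the master node:
jps # Output includes a line like: 21250 Master
# On the slave nodes:
jps # Output includes a line like: 8437 Worker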
- If everything is OK, you can now access the Spark Web UI by browsing to http://master:8080/. If the page loads and the Alive Workers attribute shows 2, then everything is working: you have 1 master node (the one the web UI runs on) and 2 alive worker nodes.
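As an optional smoke test before stopping the cluster, you can submit the bundled SparkPi example against the standalone master (assuming the default master port 7077):

$SPARK_HOME/bin/run-example --master spark://master:7077 SparkPi 10 # Should print a line like: Pi is roughly 3.14...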
- On the master machine, if you want to stop Spark, run the following command:
$SPARK_HOME/sbin/stop-all.sh