- Introduction
- 1. Build Hadoop Docker Image
- 2. Build Hive Docker Image
- 3. Build Spark Docker Image
- 4. Docker image link
The purpose of this repository is to set up a virtual cluster on a single machine for learning big-data tools.
The cluster design is 1 master + 2 slaves.
-
Pull the ubuntu image from Docker Hub
docker pull ubuntu
-
Run the ubuntu image and mount a local folder into the container
docker run -it -v C:\hadoop\build:/root/build --name ubuntu ubuntu
-
Enter ubuntu container and update necessary components
# Update system
apt-get update
# Install vim
apt-get install vim
# Install ssh (need ssh to connect to slave)
apt-get install ssh
-
Add the following to .bashrc to start the sshd service automatically
vim ~/.bashrc
# Add the following line
/etc/init.d/ssh start
# save and quit
:wq
-
Set up passwordless SSH so the master can connect to the workers
# Keep pressing Enter to accept the defaults
ssh-keygen -t rsa
# Append the public key to authorized_keys
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
-
Test ssh connection
ssh localhost
-
Install Java JDK
# Use JDK 8 because Hive only supports up to JDK 8
apt-get install openjdk-8-jdk
# Setup environment variables
vim ~/.bashrc
# Add the following
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
export PATH=$PATH:$JAVA_HOME/bin
# save and quit
:wq
# Make .bashrc effective immediately
source ~/.bashrc
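A quick sanity check (my addition, not part of the original steps) to confirm the JDK and JAVA_HOME are picked up correctly:
# Should print an openjdk 1.8.0_x version
java -version
echo $JAVA_HOME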
-
Save this container as jdkinstalled image for later use
# Commit the ubuntu container and rename it as ubuntu/jdkinstalled
docker commit ubuntu ubuntu/jdkinstalled
-
Run the ubuntu/jdkinstalled image and mount the local folder into the container
docker run -it -v C:\hadoop\build:/root/build --name ubuntu-jdkinstalled ubuntu/jdkinstalled
-
Download the Hadoop installation package from Apache (I use Hadoop-3.3.3) and put it in C:\hadoop\build
-
Enter jdkinstalled container
docker exec -it ubuntu-jdkinstalled bash
# switch to build and look for the installation package
cd /root/build
# extract the package to /usr/local
tar -zxvf hadoop-3.3.3.tar.gz -C /usr/local
-
Switch to Hadoop folder
# switch to hadoop folder
cd /usr/local/hadoop-3.3.3/
# Test if hadoop is running properly
./bin/hadoop version
-
Setup Hadoop environment variables
vim /etc/profile
# Add the following
export HADOOP_HOME=/usr/local/hadoop-3.3.3
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
# save and quit
:wq
vim ~/.bashrc
# Add the following
source /etc/profile
# save and quit
:wq
-
Switch to Hadoop configuration path
cd /usr/local/hadoop-3.3.3/etc/hadoop/
-
Setup hadoop-env.sh
vim hadoop-env.sh
# Add the following
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
export HDFS_NAMENODE_USER="root"
export HDFS_DATANODE_USER="root"
export HDFS_SECONDARYNAMENODE_USER="root"
export YARN_RESOURCEMANAGER_USER="root"
export YARN_NODEMANAGER_USER="root"
# save and quit
:wq
-
Setup core-site.xml
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>file:/usr/local/hadoop/data</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://master:9000</value>
</property>
<property>
<name>hadoop.http.staticuser.user</name>
<value>root</value>
</property>
<property>
<name>hadoop.proxyuser.root.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.root.groups</name>
<value>*</value>
</property>
<property>
<name>fs.trash.interval</name>
<value>1440</value>
</property>
</configuration>
-
Setup hdfs-site.xml
<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop/namenode_dir</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop/datanode_dir</value>
</property>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>slave01:9868</value>
</property>
</configuration>
-
Setup mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>master:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>master:19888</value>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
</configuration>
-
Setup yarn-site.xml
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>master</value>
</property>
<property>
<name>yarn.nodemanager.pmem-check-enabled</name>
<value>false</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.log.server.url</name>
<value>http://master:19888/jobhistory/logs</value>
</property>
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>604800</value>
</property>
</configuration>
-
Save this container as hadoop-installed image for later use
docker commit ubuntu-jdkinstalled ubuntu/hadoopinstalled
-
Open 3 terminals and enter the following commands separately
Terminal 1 (master):
# 8088 is the default WEB UI port for YARN
# 9870 is the default WEB UI port for HDFS
# 9864 is for inspecting file content inside the HDFS WEB UI
docker run -it -h master -p 8088:8088 -p 9870:9870 -p 9864:9864 --name master ubuntu/hadoopinstalled
Terminal 2 (slave01):
docker run -it -h slave01 --name slave01 ubuntu/hadoopinstalled
Terminal 3 (slave02):
docker run -it -h slave02 --name slave02 ubuntu/hadoopinstalled
-
In every terminal, inspect the IP of each container and add all three container IPs to every container's /etc/hosts file.
# Edit the hosts file
vim /etc/hosts
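The actual addresses depend on your Docker network; check each container's IP with hostname -i. As an illustrative sketch only (the 172.17.0.x addresses below are assumptions), every container's /etc/hosts should end up with entries like:
172.17.0.2 master
172.17.0.3 slave01
172.17.0.4 slave02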
-
Open the workers file in master container
vim /usr/local/hadoop-3.3.3/etc/hadoop/workers
# Change localhost to the following (I put master in workers to add a DataNode and NodeManager to master as well)
master
slave01
slave02
-
Test to see if master can enter slave01 and slave02 without password
# Need to enter yes for first time connection
ssh slave01
-
The HDFS filesystem needs to be formatted when starting for the first time (run on master)
hdfs namenode -format
-
Start HDFS clusters + YARN clusters
start-all.sh
-
Enter jps to check if the corresponding java processes have started successfully
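As a rough guide (the process names are standard Hadoop daemons; which node runs which follows the configuration above, e.g. SecondaryNameNode on slave01):
jps
# On master (PIDs will differ): NameNode, DataNode, ResourceManager, NodeManager, Jps
# On slave01: DataNode, NodeManager, SecondaryNameNode, Jps
# On slave02: DataNode, NodeManager, Jps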
-
To properly use the WEB UI, you need to modify the local hosts file.
For Windows:
# Open C:\Windows\System32\drivers\etc\hosts
# Add the following
127.0.0.1 master
127.0.0.1 slave01
127.0.0.1 slave02
# Save the changes
-
Open the YARN WEB UI at http://localhost:8088
-
Open the HDFS WEB UI at http://localhost:9870
# Run the following from /usr/local/hadoop-3.3.3
# Make input folder in HDFS
hdfs dfs -mkdir -p /user/hadoop/input
# Upload local hadoop xml files to HDFS
hdfs dfs -put ./etc/hadoop/*.xml /user/hadoop/input
# Use ls to check if files are uploaded to HDFS properly
hdfs dfs -ls /user/hadoop/input
# Run grep example. Use all files in the HDFS input folder and filter dfs[a-z.]+ word counts to the output folder
hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar grep /user/hadoop/input output 'dfs[a-z.]+'
# Check results
hdfs dfs -cat output/*
-
Run Hadoop image
docker run -it -h master -p 8088:8088 -p 9870:9870 -p 9864:9864 --name master ubuntu/hadoopinstalled
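Note (my addition): the run commands in this section and in section 3 reuse the container names master, slave01 and slave02. If containers with those names still exist from an earlier section, remove them first or docker run will fail with a name conflict:
docker rm -f master slave01 slave02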
-
Download the Apache Hive installation package (I use apache-hive-3.1.3)
-
Copy Hive installation package to master container
docker cp apache-hive-3.1.3-bin.tar.gz master:/usr/local/
-
Install sudo (run inside the master container)
apt-get install sudo -y
-
Install net-tools
sudo apt install net-tools
-
Install mysql-server for meta-store
# Install mysql-server
sudo apt install mysql-server -y
# Set the mysql user's home directory (needed for the service to start properly in a container)
usermod -d /var/lib/mysql mysql
# Check mysql version
mysql --version
-
Extract Hive package
# Enter master container
docker exec -it master bash
# Change to the package location
cd /usr/local/
# Extract Hive package
tar -zxvf apache-hive-3.1.3-bin.tar.gz
# Rename to hive
mv apache-hive-3.1.3-bin hive
# Remove the Hive installation package
rm apache-hive-3.1.3-bin.tar.gz
# Give root ownership of the hive folder
sudo chown -R root:root hive
-
Setup Hive environment variables
# Open ~/.bashrc
vim ~/.bashrc
# Enter the following:
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin
export HADOOP_HOME=/usr/local/hadoop-3.3.3
# save and quit
:wq
# make .bashrc effective immediately
source ~/.bashrc
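A quick sanity check (my addition, not part of the original steps) that the Hive binaries are on the PATH:
hive --version
# Should report Hive 3.1.3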
-
Change to Hive configuration folder
cd /usr/local/hive/conf
-
Use hive-default.xml template
mv hive-default.xml.template hive-default.xml
-
Setup hive-site.xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true&amp;useSSL=false&amp;useUnicode=true&amp;characterEncoding=UTF-8&amp;allowPublicKeyRetrieval=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.cj.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
<description>username to use against metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hive</value>
<description>password to use against metastore database</description>
</property>
<property>
<name>hive.server2.thrift.bind.host</name>
<value>master</value>
<description>HS2 bind host</description>
</property>
<property>
<name>hive.metastore.uris</name>
<value>thrift://master:9083</value>
<description>metastore URI</description>
</property>
<property>
<name>hive.metastore.event.db.notification.api.auth</name>
<value>false</value>
<description>disable metastore notification API authorization</description>
</property>
</configuration>
-
Start the mysql database and connect as root
# Start the mysql service (if it is not already running)
service mysql start
# Enter mysql as root (press Enter at the password prompt; no password is set)
mysql -u root -p
-
Enter the following commands
create database hive;
create user hive@localhost identified by 'hive';
GRANT ALL PRIVILEGES ON *.* TO hive@localhost WITH GRANT OPTION;
FLUSH PRIVILEGES;
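Optional verification (my addition) before quitting mysql, to confirm the metastore database and user exist:
SHOW DATABASES;
SELECT user, host FROM mysql.user WHERE user = 'hive';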
-
Press Ctrl + D to quit mysql
-
Remove the log4j-slf4j jar from Hive as it conflicts with the one bundled with Hadoop
rm /usr/local/hive/lib/log4j-slf4j-impl-2.17.1.jar
-
Download mysql connector driver (mysql version is 8.0.29)
-
Copy to master container and extract
# Copy the package to the master container (run on the host)
docker cp mysql-connector-java_8.0.29-1ubuntu21.10_all.deb master:/usr/local/
# Inside the master container, switch to /usr/local
cd /usr/local
# Install the package
dpkg -i mysql-connector-java_8.0.29-1ubuntu21.10_all.deb
# Copy the jar file to the Hive library
cp /usr/share/java/mysql-connector-java-8.0.29.jar /usr/local/hive/lib/
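One step the original does not list but that is usually needed with a fresh MySQL metastore on Hive 3: initialize the metastore schema. Treat this as a hedged addition and skip it if your metastore starts cleanly later on.
/usr/local/hive/bin/schematool -dbType mysql -initSchema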
-
Commit the container to a new image
docker commit master ubuntu/hiveinstalled
-
Open 3 terminals and enter the following commands separately
Terminal 1 (master):
docker run -it -h master -p 8088:8088 -p 9870:9870 -p 9864:9864 -p 10000:10000 -p 8080:8080 --name master ubuntu/hiveinstalled
Terminal 2 (slave01):
docker run -it -h slave01 --name slave01 ubuntu/hadoopinstalled
Terminal 3 (slave02):
docker run -it -h slave02 --name slave02 ubuntu/hadoopinstalled
-
In every terminal, inspect the IP of each container and add all three container IPs to every container's /etc/hosts file (same as in section 1).
# Edit the hosts file
vim /etc/hosts
-
Open the workers file in master container
vim /usr/local/hadoop-3.3.3/etc/hadoop/workers
# Change localhost to the following (I put master in workers to add a DataNode and NodeManager to master as well)
master
slave01
slave02
-
Enter master container and start mysql
# Enter master container
docker exec -it master bash
# Start the mysql service
service mysql start
-
Start HDFS service
# Start HDFS
start-dfs.sh
# Create the following folders
hadoop fs -mkdir /tmp
hadoop fs -mkdir -p /user/hive/warehouse
# Grant group write permission
hadoop fs -chmod g+w /tmp
hadoop fs -chmod g+w /user/hive/warehouse
-
Start metastore service and hiveserver2
# Start metastore
nohup hive --service metastore > nohup.out &
# Use net-tools to check if metastore has started properly
netstat -nlpt | grep 9083
# Start hiveserver2
nohup hiveserver2 > nohup2.out &
# Use net-tools to check if hiveserver2 has started properly
netstat -nlpt | grep 10000
The netstat commands should show a LISTEN entry on port 9083 (metastore) and port 10000 (hiveserver2) if both services started properly; hiveserver2 can take a minute or two to open its port.
-
Test hiveserver2 with beeline client
# Enter beeline in the terminal
beeline
# Enter the following to connect to hiveserver2
!connect jdbc:hive2://master:10000
# Enter root as the username; no need to enter a password
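Once connected, a small smoke test (my addition; the demo table name is hypothetical) confirms that HiveServer2, the metastore and HDFS are wired together:
SHOW DATABASES;
CREATE TABLE demo (id INT, name STRING);
INSERT INTO demo VALUES (1, 'hello');
SELECT * FROM demo;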
-
Download Spark from Apache Spark (I use spark-3.3.0 for this setup)
https://www.apache.org/dyn/closer.lua/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz
-
Install Spark installation package in master container
# Copy Spark installation package from local machine to the master container
docker cp spark-3.3.0-bin-hadoop3.tgz master:/usr/local/
# Enter master container
docker exec -it master bash
# Change to /usr/local
cd /usr/local
# Extract package
sudo tar -zxf spark-3.3.0-bin-hadoop3.tgz -C /usr/local/
# Remove the package
rm spark-3.3.0-bin-hadoop3.tgz
# Rename the Spark folder
sudo mv ./spark-3.3.0-bin-hadoop3/ ./spark
-
Download Anaconda3 Linux distribution
-
Install Anaconda3 in master container (I use Anaconda3-2022.05 for this setup)
# Copy Anaconda3 installation package from local machine to the master container
docker cp Anaconda3-2022.05-Linux-x86_64.sh master:/usr/local/
# Enter master container
docker exec -it master bash
# Change to /usr/local
cd /usr/local
# Install the Anaconda3 package
sh ./Anaconda3-2022.05-Linux-x86_64.sh
# Type yes to accept the license
# Enter the installation path
/usr/local/anaconda3
# Type yes to initialize conda
# Remove the Anaconda3 package
rm Anaconda3-2022.05-Linux-x86_64.sh
-
Setup python virtual environment named "pyspark"
conda create -n pyspark python=3.8
# Type y to confirm
# Activate the pyspark virtual env
conda activate pyspark
-
Install necessary python packages
conda install pyspark
conda install pyhive
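With pyhive installed, a short Python sketch (my addition; it assumes HiveServer2 is running on master:10000 as set up in section 2, and depending on your environment you may also need the sasl/thrift_sasl packages) shows how to query Hive programmatically:
from pyhive import hive

# Connect to HiveServer2 (username root, no password, as configured above)
conn = hive.Connection(host="master", port=10000, username="root")
cursor = conn.cursor()
cursor.execute("SHOW DATABASES")
print(cursor.fetchall())
conn.close()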
-
Enter the followings into /etc/profile
# Open /etc/profile
vim /etc/profile
# Make sure the following lines are in the file
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
export HADOOP_HOME=/usr/local/hadoop-3.3.3
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_HOME=/usr/local/spark
export PYSPARK_PYTHON=/usr/local/anaconda3/envs/pyspark/bin/python
# Save and quit
:wq
-
Configure workers file
# Switch to the spark configuration folder
cd /usr/local/spark/conf
# Copy the workers template
cp workers.template workers
# Open the workers file
vim workers
# Change localhost to the following:
master
slave01
slave02
# Save and quit
:wq
-
Configure spark-env.sh file
# Open the spark-env.sh file
vim spark-env.sh
# Enter the following
HADOOP_CONF_DIR=/usr/local/hadoop-3.3.3/etc/hadoop
YARN_CONF_DIR=/usr/local/hadoop-3.3.3/etc/hadoop
export SPARK_MASTER_HOST=master
export SPARK_MASTER_PORT=7077
SPARK_MASTER_WEBUI_PORT=8080
SPARK_WORKER_PORT=7078
SPARK_WORKER_WEBUI_PORT=8081
SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs://master:9000/sparklog/ -Dspark.history.fs.cleaner.enabled=true -Dspark.history.ui.port=18080"
# Save and quit
:wq
-
Create sparklog folder in HDFS
hadoop fs -mkdir /sparklog
hadoop fs -chmod 777 /sparklog
-
Configure spark-defaults.conf file
# Copy the spark-defaults.conf template
cp spark-defaults.conf.template spark-defaults.conf
# Open the file
vim spark-defaults.conf
# Enter the following
spark.eventLog.enabled true
spark.eventLog.dir hdfs://master:9000/sparklog/
spark.eventLog.compress true
# Save and quit
:wq
-
Configure log4j2.properties file
# Copy the log4j2.properties template
cp log4j2.properties.template log4j2.properties
# Open the file
vim log4j2.properties
# Change rootLogger.level to warn
rootLogger.level = warn
# Save and quit
:wq
-
Commit the container to a new image
docker commit master ubuntu/sparkinstalled
-
Open 3 terminals and enter the following commands separately
Terminal 1 (master):
docker run -it -h master -p 8088:8088 -p 9870:9870 -p 9864:9864 -p 10000:10000 -p 8080:8080 -p 18080:18080 -p 4040:4040 -p 19888:19888 --name master ubuntu/sparkinstalled
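As a quick reference (my annotation, derived from the configuration above; 4040 is the standard Spark default rather than something set explicitly here), the published ports correspond to:
# 8088  YARN ResourceManager WEB UI
# 9870  HDFS NameNode WEB UI
# 9864  HDFS DataNode WEB UI (file content browsing)
# 10000 HiveServer2 (beeline / JDBC)
# 8080  Spark master WEB UI
# 18080 Spark history server
# 4040  WEB UI of a running Spark application
# 19888 MapReduce JobHistory server WEB UI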
-
Terminal 2 (slave01)
docker run -it -h slave01 --name slave01 ubuntu/sparkinstalled
-
Terminal 3 (slave02)
docker run -it -h slave02 --name slave02 ubuntu/sparkinstalled
-
Enter master container and start spark clusters
# Enter master container
docker exec -it master bash
# Make sure to start the Hadoop cluster first
start-all.sh
# Follow the section 2 instructions to start the Hive services (for Spark on Hive)
# Switch to the spark sbin folder
cd /usr/local/spark/sbin
# Start the spark cluster
./start-all.sh
# Start the history server
./start-history-server.sh
# Start the mr-jobhistory server
cd /usr/local/hadoop-3.3.3/sbin
mr-jobhistory-daemon.sh start historyserver
cd /usr/local/spark
# Start pyspark in Standalone mode
bin/pyspark --master spark://master:7077
# Type the following
sc.parallelize([1,2,3,4,5]).map(lambda x: x + 1).collect()
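Beyond the parallelize test, here is a minimal Spark-on-Hive sketch (my addition; the script name spark_hive_test.py is hypothetical, and it assumes the Hive metastore from section 2 is running at thrift://master:9083):
# spark_hive_test.py
from pyspark.sql import SparkSession

# Point Spark at the Hive metastore started in section 2
spark = (SparkSession.builder
         .appName("spark-on-hive-test")
         .config("hive.metastore.uris", "thrift://master:9083")
         .enableHiveSupport()
         .getOrCreate())

# List the databases known to the metastore
spark.sql("SHOW DATABASES").show()
spark.stop()
Submit it to the standalone cluster with: bin/spark-submit --master spark://master:7077 spark_hive_test.py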
-
Hadoop ready image
https://hub.docker.com/repository/docker/johnhui/hadoopinstalled
-
Hive ready image
https://hub.docker.com/repository/docker/johnhui/hiveinstalled
-
Spark ready image
https://hub.docker.com/repository/docker/johnhui/sparkinstalled