World Sales is a small dataset used here to demonstrate skills across Big Data and Visualization platforms, languages, and tools. The goal of this project is to determine the most profitable Region and Country using several factors and methods.
- Hadoop
- HDFS
- Zeppelin
- Spark
- Scala
- SQL
- Tableau
- Power BI
worldsales.csv
Id | Region | Country | Item_Type | Sales_Channel | Order_Priority | Order_Date | Order_ID | Ship_Date | Units_Sold | Unit_Price | Unit_Cost | Total_Revenue | Total_Cost | Total_Profit |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Europe is the most profitable region. Belarus is the most profitable country worldwide.
Upload the file worldsales.csv to HDFS’s tmp folder
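-- Hive
-- Hive has no DATETIME type, so the date columns are kept as STRING here and can be cast once the CSV date format is confirmed.
-- Unit_Price and Unit_Cost are DOUBLE because the data contains fractional values (e.g. 502.54).
CREATE EXTERNAL TABLE IF NOT EXISTS worldsales (Id INT, Region STRING, Country STRING, Item_Type STRING, Sales_Channel STRING, Order_Priority STRING, Order_Date STRING, Order_ID INT,
Ship_Date STRING, Units_Sold INT, Unit_Price DOUBLE, Unit_Cost DOUBLE, Total_Revenue DOUBLE, Total_Cost DOUBLE, Total_Profit DOUBLE)
COMMENT 'Data of the World Sales'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/tmp/worldsales'
-- Skip the CSV header row so it is not loaded as data
TBLPROPERTIES ('skip.header.line.count'='1');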
-- Zeppelin -- Spark
%spark2
// Create a worldsales DataFrame from the CSV file
val worldsales = (spark.read
.option("header", "true") // Use first line as header
.option("inferSchema", "true") // Infer schema
.csv("/tmp/worldsales.csv"))
%spark2
worldsales.select("Id", "Region", "Country", "Item_Type", "Sales_Channel", "Order_Priority", "Order_Date", "Order_ID", "Ship_Date", "Units_Sold", "Unit_Price", "Unit_Cost", "Total_Revenue", "Total_Cost", "Total_Profit").show()
%spark2
// Print the Schema in a tree format
worldsales.printSchema()
%spark2
// Create a Dataset of sales with more than 8000 units sold and a unit cost above 500, using "filter"
val worldsales_dataset = worldsales.select("Id", "Region", "Country", "Item_Type", "Sales_Channel", "Order_Priority", "Order_Date", "Order_ID", "Ship_Date", "Units_Sold", "Unit_Price", "Unit_Cost", "Total_Revenue", "Total_Cost", "Total_Profit")
.filter($"Units_Sold" > 8000)
.filter($"Unit_Cost" > 500)
worldsales_dataset.show()
// Equivalent form with both conditions in a single filter
worldsales.filter($"Units_Sold" > 8000 && $"Unit_Cost" > 500).show()
Only two countries, both in Sub-Saharan Africa, remained after filtering: Senegal and Swaziland. Senegal's Units Sold was 8,989 with a Unit Cost of 502.54, while Swaziland's Units Sold was 9,915 with a Unit Cost of 524.96. Both used the Offline Sales Channel.
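The filter above can be sketched without a Spark cluster using plain Scala collections. This is only an illustration of the predicate, not the notebook's code: the `Sale` case class, the `FilterSketch` object, and the two non-matching rows are hypothetical, while the Senegal and Swaziland values are the ones quoted above.

```scala
// Minimal stand-in for one row of worldsales.csv (only the fields the filter needs).
final case class Sale(country: String, unitsSold: Int, unitCost: Double)

object FilterSketch {
  // Same predicate as the Spark filter: Units_Sold > 8000 AND Unit_Cost > 500.
  def highVolumeHighCost(rows: Seq[Sale]): Seq[Sale] =
    rows.filter(r => r.unitsSold > 8000 && r.unitCost > 500)

  val sample: Seq[Sale] = Seq(
    Sale("Senegal", 8989, 502.54),    // values quoted in the text
    Sale("Swaziland", 9915, 524.96),  // values quoted in the text
    Sale("Examplia", 9000, 120.00),   // hypothetical row: fails the cost condition
    Sale("Samplestan", 1200, 510.00)  // hypothetical row: fails the units condition
  )

  def main(args: Array[String]): Unit =
    highVolumeHighCost(sample).foreach(println)
}
```

Both conditions must hold for a row to survive, which is why chaining two `.filter` calls and combining them with `&&` give the same result.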
%spark2
worldsales.groupBy("region").count().show()
%spark2
val worldsales_results = worldsales.groupBy("region").count()
worldsales_results.show()
Sub-Saharan Africa and Europe had the most sales activity, with the highest order counts. North America and Australia and Oceania, by contrast, were far less active.
%spark2
worldsales_results.coalesce(1).write.format("csv").option("header", "true").save("/tmp/worldsales_results.csv")
%spark2
worldsales.createOrReplaceTempView("Salesview")
%spark2
worldsales_results.createOrReplaceTempView("Regionview")
%spark2.sql
SELECT * FROM Regionview
The Line chart shows strong Sales & Trading activity in Europe and Sub-Saharan Africa. North America and Australia & Oceania, in contrast, showed very little activity.
11. Using SQL, select the region and the sum of units sold from the “Salesview” view, grouped by region, and display the result in a data grid view
%spark2.sql
SELECT region, SUM(Units_Sold) AS Sum_Units_Sold
FROM Salesview
GROUP BY region
Europe and Sub-Saharan Africa had the highest Sum Units Sold. In contrast, North America and Australia & Oceania had the lowest. The remaining regions sat close to the average Sum Units Sold.
12. Using SQL, select the region and the sum of total_profit from the “Salesview” view, grouped by region, and display the result in a Bar chart
%spark2.sql
SELECT region, SUM(Total_Profit)
FROM Salesview
GROUP BY region
Sub-Saharan Africa and Europe dominated the Sum of Total Profit, ranking 1st and 2nd, respectively. North America and Australia & Oceania, on the other hand, had the lowest Sum of Total Profit.
13. From the “Salesview” view, show the sum of Total Profit as Profit, the sum of Total Revenue as Revenue, and the sum of Total Cost as Cost, grouped by Region. The client wants to see this data in a Line chart to compare Cost, Revenue, and Profit across Regions.
%spark2.sql
SELECT region, SUM(Total_Profit) AS Profit, SUM(total_revenue) AS Revenue, SUM(total_cost) AS Cost
FROM Salesview
GROUP BY region
Cost, Revenue, and Profit followed the same pattern across Regions. Europe and Sub-Saharan Africa had the highest figures in all 3 fields, while the figures for North America and Australia & Oceania were significantly lower.
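For readers without a Zeppelin instance, the per-region aggregation in step 13 can be approximated with plain Scala collections. This is a rough sketch under stated assumptions: `OrderRow`, `RegionTotals`, `RegionTotalsSketch`, and every number in `sample` are made up for illustration and are not taken from worldsales.csv.

```scala
// One order with only the columns step 13 aggregates (hypothetical stand-in type).
final case class OrderRow(region: String, totalProfit: Double, totalRevenue: Double, totalCost: Double)

// Per-region sums, mirroring the Profit, Revenue, and Cost aliases in the SQL.
final case class RegionTotals(profit: Double, revenue: Double, cost: Double)

object RegionTotalsSketch {
  // Equivalent of: SELECT region, SUM(...), SUM(...), SUM(...) ... GROUP BY region
  def totalsByRegion(rows: Seq[OrderRow]): Map[String, RegionTotals] =
    rows.groupBy(_.region).map { case (region, rs) =>
      region -> RegionTotals(
        rs.map(_.totalProfit).sum,
        rs.map(_.totalRevenue).sum,
        rs.map(_.totalCost).sum
      )
    }

  // Illustrative values only, not real data.
  val sample: Seq[OrderRow] = Seq(
    OrderRow("Europe", 400.0, 1000.0, 600.0),
    OrderRow("Europe", 250.0, 500.0, 250.0),
    OrderRow("Asia", 300.0, 800.0, 500.0)
  )

  def main(args: Array[String]): Unit =
    totalsByRegion(sample).foreach { case (region, t) => println(s"$region: $t") }
}
```

Plotting the three sums side by side for each region is what makes the shared pattern visible in the Line chart.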
14. The customer is planning to open a new store and is searching for the best location. They need to see the Average Profit of each Region as a percentage of the whole (Pie chart) so that Regions can be compared.
Both views created earlier are used here to plot the Pie chart and to identify the most profitable region.
%spark2.sql
SELECT a.Region, AVG(Total_Profit)
FROM Salesview b , Regionview a
WHERE a.Region = b.Region
GROUP BY a.Region
The Pie chart shows that Europe and Sub-Saharan Africa together accounted for half of the worldwide Total Profit. Europe's Average Profit, at 27%, is the highest among all regions. Therefore, Europe appears to be the most profitable Region.
More specifically, Belarus is the most profitable country by Average Profit.
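The percentages in the Pie chart can be understood as each Region's Average Profit divided by the sum of all Regions' averages. The sketch below illustrates that arithmetic in plain Scala. It assumes this is how the chart derives its slices, and `ProfitRow`, `AverageShareSketch`, and all sample values are hypothetical rather than taken from the dataset.

```scala
// One order's profit, tagged with its region (hypothetical stand-in type).
final case class ProfitRow(region: String, totalProfit: Double)

object AverageShareSketch {
  // Equivalent of: SELECT Region, AVG(Total_Profit) ... GROUP BY Region
  def averageByRegion(rows: Seq[ProfitRow]): Map[String, Double] =
    rows.groupBy(_.region).map { case (region, rs) =>
      region -> rs.map(_.totalProfit).sum / rs.size
    }

  // Each region's average as a percentage of the sum of all regional averages.
  def shareOfAverages(averages: Map[String, Double]): Map[String, Double] = {
    val total = averages.values.sum
    averages.map { case (region, avg) => region -> 100.0 * avg / total }
  }

  // Illustrative values only, not real data.
  val sample: Seq[ProfitRow] = Seq(
    ProfitRow("Europe", 300.0), ProfitRow("Europe", 500.0),
    ProfitRow("Asia", 200.0), ProfitRow("Asia", 400.0),
    ProfitRow("North America", 300.0)
  )

  def main(args: Array[String]): Unit =
    shareOfAverages(averageByRegion(sample)).foreach { case (region, pct) =>
      println(f"$region%-15s $pct%.1f%%")
    }
}
```

Because the slices are built from averages rather than sums, a region with few but highly profitable orders can take a larger slice than its total profit alone would suggest.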