Short Name

Explore Spark SQL and its performance using TPC-DS workload

Short Description

Evaluate and measure the performance of Spark SQL using the TPC-DS benchmark. Learn to set up and run the TPC-DS benchmark in your own development environment.

Offering Type

Cognitive

Introduction

Spark SQL is Apache Spark's module for working with structured data, and TPC-DS is a widely used decision-support benchmark for measuring SQL query performance. This journey shows how to set up the TPC-DS benchmark in your own development environment and use it to evaluate Spark SQL, either from the command line or from a Jupyter Notebook.

Author

By Dilip Biswal and Rich Hagarty

Code

Demo

  • N/A

Video

  • N/A

Overview

This journey walks through benchmarking Spark SQL with the TPC-DS workload, either from the command line or from a Jupyter Notebook. It covers compiling the TPC-DS toolkit, generating the dataset and queries, creating the Spark tables, and running the queries while collecting performance results. It uses Apache Spark, Jupyter Notebooks, and Python.

When the reader has completed this journey, they will understand how to:

  • Compile the TPC-DS toolkit and use it to generate the benchmark dataset and queries.
  • Create Spark tables over the generated (or pre-generated) data.
  • Run the full TPC-DS query set, a subset, or an individual query from the command line or a Jupyter Notebook.
  • View query results, performance summaries, and performance graphs.

Flow

Architecture diagram 1 and architecture diagram 2 (one for each of the flows below)

  • Command line
    1. Compile the TPC-DS toolkit and use it to generate the TPC-DS dataset.
    2. Create the Spark tables and generate the TPC-DS queries.
    3. Run the entire query set or a subset of queries and monitor the results.
  • Notebook
    1. Create the Spark tables from the pre-generated dataset.
    2. Run the entire query set or an individual query (a rough PySpark sketch of these steps follows this list).
    3. View the query results or the performance summary.
    4. View the performance graph.
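
As a rough illustration of the notebook flow (not the journey's actual scripts), the following PySpark sketch registers one Spark table over a pre-generated TPC-DS data file, runs a single generated query, and reports its elapsed time. The file paths, the choice of the store_sales table, and the query file name are placeholders; substitute the locations produced by your own TPC-DS toolkit run.

  import time
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("tpcds-sketch").getOrCreate()

  # Step 1: create a Spark table over pre-generated TPC-DS data.
  # dsdgen emits '|'-delimited text files; a real run would supply the full
  # TPC-DS column schema for each table, omitted here for brevity.
  spark.sql("""
      CREATE TABLE IF NOT EXISTS store_sales
      USING csv
      OPTIONS (path '/tmp/tpcds-data/store_sales.dat', delimiter '|')
  """)

  # Step 2: run an individual query from a generated query file and time it.
  with open("/tmp/tpcds-queries/query55.sql") as f:  # hypothetical path
      query_text = f.read()

  start = time.time()
  rows = spark.sql(query_text).collect()  # force the query to execute fully
  elapsed = time.time() - start

  # Step 3: view the results and a simple performance summary.
  print(f"Returned {len(rows)} rows in {elapsed:.2f} seconds")

Repeating the timed run over each of the generated query files yields the per-query timings behind the performance summary and performance graph in the notebook flow.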

Included components

  • Apache Spark: An open-source, fast, and general-purpose cluster computing system.
  • Jupyter Notebook: An open-source web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text.

Featured technologies

  • Data Science: Systems and scientific methods to analyze structured and unstructured data in order to extract knowledge and insights.
  • Artificial Intelligence: Artificial intelligence can be applied to disparate solution spaces to deliver disruptive technologies.
  • Python: Python is a programming language that lets you work more quickly and integrate your systems more effectively.

Blog

blog

Links
