Smv Tutorial

This tutorial shows how to conduct various data analyses using SMV (Spark Modularized View), a framework that uses Spark to develop large-scale data applications. API docs can be found here. After completing the tutorial, users should be able to build a data analytics project with the SMV framework.

The tutorial covers the following topics:

I. Preliminaries

First things first. We need to make sure we have all necessary tools installed and the environment set up.

II. A Taste of Smv for Data Analyses

Once the environment is set up, we can start doing some cool things. A data scientist or business analyst who is familiar with traditional analytic tools such as SQL or SAS will naturally ask how to process data and conduct analyses in SMV. The following examples use the employment data in the SmvTraining project. The sample file in the data directory was extracted directly from US employment data:

$ wget http://www2.census.gov/econ2012/CB/sector00/CB1200CZ11.zip
$ unzip CB1200CZ11.zip
$ mv CB1200CZ11.dat CB1200CZ11.csv

More information can be found on the US Census site.
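Renaming the `.dat` file to `.csv` does not change its contents, so it can be worth checking which delimiter the file actually uses before loading it. The following is a minimal plain-Python sketch for that check. It is not part of SMV, and the names `guess_delimiter`, `peek`, and `CANDIDATE_DELIMITERS` are hypothetical helpers introduced only for illustration; the list of candidate delimiters and the encoding default are assumptions.

```python
import csv
import sys
from itertools import islice

# Assumption: the file uses one of these common delimiters.
CANDIDATE_DELIMITERS = [",", "|", "\t", ";"]


def guess_delimiter(header_line):
    """Return the candidate delimiter that occurs most often in the header line."""
    best = max(CANDIDATE_DELIMITERS, key=header_line.count)
    if header_line.count(best) == 0:
        raise ValueError("no candidate delimiter found in header line")
    return best


def peek(path, n_rows=5, encoding="utf-8"):
    """Return (delimiter, header, first n_rows data rows) of a delimited text file."""
    # errors="replace" keeps the sketch from failing on unexpected bytes.
    with open(path, newline="", encoding=encoding, errors="replace") as f:
        delimiter = guess_delimiter(f.readline())
        f.seek(0)  # rewind so the csv reader also sees the header line
        reader = csv.reader(f, delimiter=delimiter)
        header = next(reader)
        rows = list(islice(reader, n_rows))
    return delimiter, header, rows


if __name__ == "__main__":
    # Usage: python peek.py CB1200CZ11.csv
    delimiter, header, rows = peek(sys.argv[1])
    print("delimiter:", repr(delimiter))
    print("columns:", header)
    for row in rows:
        print(row)
```

Running the script against `CB1200CZ11.csv` prints the detected delimiter, the column names, and the first few records, which is usually enough to decide how the file should be described to a loader.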

Next, we show how convenient and efficient data analysis can be with SMV.

Remarks

SMV offers a modularized computation framework in which data and code are scalable and reusable. This is expected to help scale the development team and reduce the development time of large, complicated projects. This tutorial is mainly intended to help users get familiar with building a project with SMV. Users are encouraged to follow the latest development of the SMV project and to check the corresponding API docs for detailed help.