SMV offers a convenient tool, smv-init, for project initialization. smv-init has an option for users to select the type of project to create, and currently there are three types of projects:
-s  simple (default): for users who are relatively new to SMV to get a quick start
-e  enterprise: for users who need to have a relatively complicated project (enterprise solution)
-t  test: for developers only
We can start from the simplest one. After running the command smv-init -s MyApp, a project named "MyApp" is created, and the hierarchy of the project directory looks like this:
MyApp: project root directory
- conf (folder): app level and user configuration files
- data (folder): main directory for input/output data
  - input (folder): the location for input data
- notebooks (folder): default location for Jupyter notebooks (which we have discussed in Section II)
- src/main/python/ (folder): all source scripts for data processing and analyzing; each folder represents an SmvStage
- .gitignore (file): manages the paths and files to be ignored in git commit
- README.md (file): introduction document for this project
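To confirm that a freshly generated project matches the layout above, a short script can look for each expected entry. This is only a minimal sketch: the helper name missing_entries and the EXPECTED list are illustrative and not part of SMV, and the list contains only the entries named above.

```python
from pathlib import Path

# Entries named in the layout above, relative to the project root.
# This list is an illustration, not something SMV itself provides.
EXPECTED = [
    "conf",
    "data",
    "data/input",
    "notebooks",
    "src/main/python",
    ".gitignore",
    "README.md",
]


def missing_entries(project_root):
    """Return the expected entries that do not exist under project_root."""
    root = Path(project_root)
    return [entry for entry in EXPECTED if not (root / entry).exists()]
```

Calling missing_entries("MyApp") from the directory where smv-init was run should return an empty list if every entry listed above is present.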
Before going into details, we first introduce some important concepts in the SMV framework.
SmvModule is one of the core objects in the SMV framework. Essentially, a module is a collection of transformation operations and validation rules. Each module depends on one or more SmvDataSet objects and defines a set of transformations on its inputs that determine the module output. For example, EmploymentByState in src/main/python/stage1/employment.py is an SMV module.
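The idea of a module (declared inputs plus a transformation that defines the output) can be illustrated without SMV or Spark. The sketch below is plain Python. The classes Employment and EmploymentByStateSketch, the methods requires and run, the resolve helper, and the sample rows are all hypothetical; they only mimic the shape of a module and are not the actual SMV API.

```python
from collections import defaultdict


class Employment:
    """Hypothetical input dataset: rows of (state, employee count)."""

    def requires(self):
        return []  # an input has no upstream dependencies

    def run(self, inputs):
        # Made-up sample rows, for illustration only.
        return [
            {"ST": "CA", "EMP": 10},
            {"ST": "NY", "EMP": 7},
            {"ST": "CA", "EMP": 5},
        ]


class EmploymentByStateSketch:
    """Hypothetical module: total employment per state."""

    def requires(self):
        return [Employment]  # the datasets this module depends on

    def run(self, inputs):
        totals = defaultdict(int)
        for row in inputs[Employment]:
            totals[row["ST"]] += row["EMP"]
        return [{"ST": st, "EMP": emp} for st, emp in sorted(totals.items())]


def resolve(module_cls, cache=None):
    """Run a module after recursively running everything it requires."""
    cache = {} if cache is None else cache
    if module_cls not in cache:
        module = module_cls()
        inputs = {dep: resolve(dep, cache) for dep in module.requires()}
        cache[module_cls] = module.run(inputs)
    return cache[module_cls]


print(resolve(EmploymentByStateSketch))
```

With the sample rows above, this prints one row per state: CA with a total of 15 and NY with a total of 7. The point of the design is that the module states what it needs and how to transform it, while a separate resolver decides when each piece runs.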
One key advantage of the modularized design is that it allows users to break down a complicated data analytics process into small and clear pieces, which benefits project, team and time management.
As a project grows in size, the ability to divide up the project into manageable chunks becomes paramount. An SMV stage accomplishes this not only by organizing the modules into separate managed packages but also by controlling the dependencies between modules in different stages. For example, src/main/python/stage1 is an SMV stage. We will not go into the details of SMV stages in this basic tutorial.
Users can first run the sample app as is, as a test, to get familiar with how SMV works.
As in the ad-hoc analysis piece, users can open a Jupyter notebook by running the smv-jupyter command. In a notebook, users can use a=df("stage1.employment.EmploymentByState") to run the EmploymentByState module.
Running a module in the pyshell environment works essentially the same way as in Jupyter, except that the pyshell environment does not support visualization well.
Users who choose not to run in an interactive shell environment can use smv-pyrun on the command line to run a module or a stage:
smv-pyrun -m stage1.employment.EmploymentByState
smv-pyrun -s stage1
By default, the output data and output schema are generated under data/output/ with automatic versioning, for example stage1.employment.EmploymentByState_343a75f6.csv/ and stage1.employment.EmploymentByState_343a75f6.schema/. If you add a new aggregation to this module and rerun it, two more outputs are generated with a different version suffix to distinguish them from the earlier ones.
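One way to picture the auto-versioning is as a suffix derived from a hash of the module's definition, so that changing the code changes the output name. The sketch below assumes that scheme for illustration only. How SMV actually computes the suffix is not described here, and the function versioned_name and the two source strings are hypothetical.

```python
import zlib


def versioned_name(module_fqn, module_source):
    """Build an output base name with an 8-hex-digit suffix.

    The suffix is a CRC32 of the module source text. This is an assumed,
    illustrative scheme, not necessarily the one SMV uses.
    """
    checksum = zlib.crc32(module_source.encode("utf-8")) & 0xFFFFFFFF
    return "{}_{:08x}".format(module_fqn, checksum)


# Two hypothetical versions of the same module's code.
source_v1 = "df.groupBy('ST').agg(sum('EMP'))"
source_v2 = "df.groupBy('ST').agg(sum('EMP'), avg('EMP'))"

fqn = "stage1.employment.EmploymentByState"
print(versioned_name(fqn, source_v1) + ".csv")
print(versioned_name(fqn, source_v2) + ".csv")
```

Under this assumption, rerunning unchanged code reuses the same name, while an edited module gets a new suffix, so older outputs stay on disk next to the new ones.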