Andy Konwinski edited this page Aug 12, 2013 · 51 revisions

Data To Play With

create table src(key int, value string);
LOAD DATA LOCAL INPATH '${env:HIVE_DEV_HOME}/data/files/kv1.txt' INTO TABLE src;

create table src1(key int, value string);
LOAD DATA LOCAL INPATH '${env:HIVE_DEV_HOME}/data/files/kv3.txt' INTO TABLE src1;

Note that you may have to create a /user/hive/warehouse/src path before executing these commands.

Getting Started Quickly via the Shark test script

To get up and running quickly with the Shark master branch built against Spark master, Hive, and Hadoop, check out the bin/dev/run-tests-from-scratch script, which comes as part of the Shark repository. This script automatically downloads all of Shark's dependencies for you (except for Java). It was developed in part to aid the automatic testing of Shark, but it also aims to be a useful reference for new developers getting Shark running in their local development environment. Run the script with the -h flag to see all options, and in particular check out the -t flag, which skips running the tests but still sets up Shark's dependencies and builds Shark.

Setup

Get the latest version of Shark.

$ git clone https://github.com/amplab/shark.git

Development of Shark (running tests or using Eclipse) requires the (patched) development package of Hive. Clone it from GitHub and package it:

$ git clone https://github.com/amplab/hive.git -b shark-0.9
$ cd hive
$ ant package

ant package builds all Hive jars and puts them into the build/dist directory.

NOTE: On the EC2 AMI, you may have to first install ant-antlr.noarch and ant-contrib.noarch:

$ yum install ant-antlr.noarch
$ yum install ant-contrib.noarch

If you are building Hive on your local machine and (a) your distribution doesn't have yum or (b) the above yum commands don't work out of the box with your distro, you probably need to upgrade to a newer version of ant; ant >= 1.8.2 should work. Download ant binaries at http://ant.apache.org/bindownload.cgi. You may also be able to upgrade ant through a package manager, but on older versions of CentOS (e.g. 6.4) yum can't install ant 1.8 out of the box, so installing ant from the binary distribution is recommended.
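
To see whether the ant already on your PATH meets the 1.8.2 minimum mentioned above, a check along the following lines may help. This is a minimal sketch: version_ge and parse_ant_version are hypothetical helper names (not part of Shark or Hive), and the comparison assumes a sort that supports -V, as GNU coreutils does.

```shell
#!/usr/bin/env bash
# Sketch only: version_ge and parse_ant_version are illustrative helpers,
# not part of Shark, Hive, or ant.

# Succeeds (exit 0) when version $1 >= version $2.
# Assumes `sort -V` (version sort) is available, e.g. GNU coreutils.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Reads the `ant -version` banner on stdin and prints the dotted version number.
parse_ant_version() {
  sed -n 's/.*version \([0-9][0-9.]*\).*/\1/p' | head -n 1
}

if command -v ant >/dev/null 2>&1; then
  installed="$(ant -version 2>/dev/null | parse_ant_version)"
  if [ -z "$installed" ]; then
    echo "could not determine the ant version"
  elif version_ge "$installed" "1.8.2"; then
    echo "ant $installed should be new enough"
  else
    echo "ant $installed is older than 1.8.2; consider upgrading"
  fi
else
  echo "ant not found on PATH"
fi
```

A plain string comparison would rank 1.10.0 below 1.8.2, which is why the sketch relies on version sort instead.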

Edit shark/conf/shark-env.sh and set the following to run in local mode:

#!/usr/bin/env bash

export SHARK_MASTER_MEM=1g

export HIVE_DEV_HOME="/path/to/hive"
export HIVE_HOME="$HIVE_DEV_HOME/build/dist"

SPARK_JAVA_OPTS="-Dspark.local.dir=/tmp "
SPARK_JAVA_OPTS+="-Dspark.kryoserializer.buffer.mb=10 "
SPARK_JAVA_OPTS+="-verbose:gc -XX:-PrintGCDetails -XX:+PrintGCTimeStamps "
export SPARK_JAVA_OPTS

export SCALA_VERSION=2.9.3
export SCALA_HOME="/path/to/scala-home-2.9.3"
export SPARK_HOME="/path/to/spark"
export HADOOP_HOME="/path/to/hadoop-0.20.205.0"
export JAVA_HOME="/path/to/java-home-1.7_21-or-newer"
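
Since every /path/to/... value above is a placeholder, it can be worth confirming that the directories exist before building. The following is a rough sketch, not something shipped with Shark: check_dir_vars is a hypothetical helper that takes variable names and reports any that are unset or do not point to a directory.

```shell
#!/usr/bin/env bash
# Sketch only: check_dir_vars is an illustrative helper, not part of Shark.

# For each variable NAME given as an argument, print a message if the variable
# is unset/empty or its value is not an existing directory.
# Returns non-zero if any variable failed the check.
check_dir_vars() {
  status=0
  for name in "$@"; do
    # Portable indirect lookup: read the value of the variable named $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "$name is not set"
      status=1
    elif [ ! -d "$value" ]; then
      echo "$name points to a missing directory: $value"
      status=1
    fi
  done
  return "$status"
}

# Usage, from the Shark directory after editing the config:
#   . conf/shark-env.sh
#   check_dir_vars HIVE_DEV_HOME HIVE_HOME SCALA_HOME SPARK_HOME HADOOP_HOME JAVA_HOME
```

Note that HIVE_HOME will only exist after ant package has populated build/dist, so a failure on that variable alone may simply mean Hive has not been packaged yet.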

The Shark test code depends on Hive's CLI test harness, so first generate Hive's TestCliDriver by running the following from the Hive source directory:

$ cd $HIVE_DEV_HOME
$ ant package
$ ant test -Dtestcase=TestCliDriver

Once the JUnit tests start running, you can stop the Hive test execution with Ctrl+C. Then you can run sbt/sbt test:compile from the Shark directory.

Testing

Shark includes two types of unit tests: Scala unit tests and Hive CLI tests.

Scala Unit Tests

You can run the Scala unit tests by invoking the test command in sbt:

$ sbt/sbt test

These tests are defined in src/test/scala/shark.

Hive CLI Tests

To run Hive's test suite, first generate Hive's TestCliDriver script.

$ ant package
$ ant test -Dtestcase=TestCliDriver

The above command generates the Hive test Java files from Velocity templates, and then starts executing the tests. You can stop once the tests start running.

Then compile the Shark tests:

$ sbt/sbt test:compile

Then run the tests with:

$ TEST=regex_pattern ./bin/dev/test

You can control which tests run by setting the TEST environment variable. If it is specified, only tests that match the TEST regex will be run. You can also specify a whitelist of test suites to run using TEST_FILE. For example, to run our regression tests, you can do:

$ TEST_FILE=src/test/tests_pass.txt ./bin/dev/test

You can also combine both TEST and TEST_FILE, in which case only tests that satisfy both filters will be executed.

An example:

# Run only tests that begin with "union" or "input"
$ TEST="testCliDriver_(union|input)" TEST_FILE=src/test/tests_pass.txt ./bin/dev/test 2>&1 | tee test.log
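
To make the "both filters" behaviour concrete, the sketch below intersects a regex filter with a whitelist file using grep. It is an illustration only and is not the code in bin/dev/test: select_tests is a hypothetical helper, the test names are made up, and it assumes the whitelist holds one test name per line and that the regex may match anywhere in the name.

```shell
#!/usr/bin/env bash
# Illustration only: NOT the implementation used by bin/dev/test.
# select_tests is a hypothetical helper; the test names below are made up.

# Reads candidate test names on stdin and prints those that both
#   (a) match the extended regex in $1, and
#   (b) appear as a whole line in the whitelist file $2.
select_tests() {
  grep -E -- "$1" | grep -F -x -f "$2"
}

# Demo with a throwaway whitelist file.
whitelist="$(mktemp)"
printf '%s\n' testCliDriver_union1 testCliDriver_join3 > "$whitelist"

printf '%s\n' testCliDriver_union1 testCliDriver_input2 testCliDriver_join3 \
  | select_tests 'testCliDriver_(union|input)' "$whitelist"
# prints: testCliDriver_union1

rm -f "$whitelist"
```

Here testCliDriver_input2 matches the regex but is not whitelisted, and testCliDriver_join3 is whitelisted but does not match the regex, so only testCliDriver_union1 survives both filters.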

Eclipse

We use a combination of vim/emacs/sublimetext2 and Eclipse to develop Shark. Eclipse is often handy when you need to cross-reference a lot of code to understand it or to run the debugger. Since Shark is written in Scala, you will also need the Scala IDE for Eclipse.

  1. Download Eclipse Indigo 3.7 (Eclipse IDE for Java Developers) from http://www.eclipse.org/downloads/
  2. Install the Scala IDE for Eclipse plugin. See http://scala-ide.org/download/current.html

To generate the Eclipse project files, run:

$ sbt/sbt eclipse

Once you run the above command, you will be able to open the Scala project in Eclipse. Note that Eclipse is often buggy and the compilers/parsers can crash while editing your file.

We recommend you turn Eclipse's auto build off, and use sbt's continuous compilation mode to build the project.

$ sbt/sbt
> ~ package 

To run Shark in Eclipse, set up a Scala application run configuration for the shark.SharkCliDriver class. You will need to set JVM parameters (-Xms512m -Xmx512m) to change the default JVM heap allocation, since the default heap is too small to run Shark.

To set up the Hive project for Eclipse, follow https://cwiki.apache.org/confluence/display/Hive/GettingStarted+EclipseSetup

Making a new AMI

Before creating the image, delete /root/.ssh/, /home/ec2user/.ssh/, and /root/.bash_history.

Make sure to add 4 ephemeral volumes to the AMI

Hive and Hadoop Resources

Hive Developer Guide