This is a framework for repeatedly running a suite of performance tests for the Spark cluster computing framework.
The script assumes you already have a binary distribution of Spark 1.0+ installed. It can optionally check out a new version of Spark and copy configurations over from your existing installation.
### Running locally

- Download a Spark 1.0+ binary distribution.
- Set up a local SSH server/keys such that ssh localhost works on your machine without a password.
- Git clone spark-perf (this repo) and cd spark-perf
- Copy config/config.py.template to config/config.py
- Set config.py options that are friendly for local execution:
- SPARK_HOME_DIR = /path/to/your/spark
- SPARK_CLUSTER_URL = "spark://%s:7077" % socket.gethostname()
- SCALE_FACTOR = .05
- SPARK_DRIVER_MEMORY = 512m
- spark.executor.memory = 2g
- uncomment at least one SPARK_TESTS entry
- Execute bin/run
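The local-mode options above can be sketched as config.py overrides like the following (the path and memory sizes are illustrative assumptions, not defaults; keep the rest of config.py.template as-is):

```python
# config/config.py -- illustrative overrides for a local run
import socket

SPARK_HOME_DIR = "/path/to/your/spark"   # hypothetical path to your Spark install
SPARK_CLUSTER_URL = "spark://%s:7077" % socket.gethostname()
SCALE_FACTOR = 0.05                      # shrink data sizes for a single machine
SPARK_DRIVER_MEMORY = "512m"
# Further down the template, set spark.executor.memory to "2g" in the option
# lists and uncomment at least one SPARK_TESTS entry.
```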
### Running on an existing Spark cluster

- SSH into the machine hosting the standalone master
- Git clone spark-perf (this repo) and cd spark-perf
- Copy config/config.py.template to config/config.py
- Set config.py options:
- SPARK_HOME_DIR = /path/to/your/spark/install
- SPARK_CLUSTER_URL = "spark://<master-hostname>:7077"
- SCALE_FACTOR = <scale-factor>
- SPARK_DRIVER_MEMORY = <driver-memory>
- spark.executor.memory = <executor-memory>
- uncomment at least one SPARK_TESTS entry
- Execute bin/run
### Running on EC2

- Launch an EC2 cluster with the spark-ec2 scripts.
- Git clone spark-perf (this repo) and cd spark-perf
- Copy config/config.py.template to config/config.py
- Set config.py options:
- USE_CLUSTER_SPARK = False
- SPARK_COMMIT_ID = <commit-id-to-test>
- SCALE_FACTOR = <scale-factor>
- SPARK_DRIVER_MEMORY = <driver-memory>
- spark.executor.memory = <executor-memory>
- uncomment at least one SPARK_TESTS entry
- Execute bin/run
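For the EC2 setup, the distinguishing options are USE_CLUSTER_SPARK and SPARK_COMMIT_ID, which tell the framework to check out and build Spark itself rather than use a pre-installed copy. A minimal sketch (all values below are placeholders or example choices, not recommendations):

```python
# config/config.py -- illustrative overrides for an EC2 run
USE_CLUSTER_SPARK = False                 # build Spark from source instead of using the cluster's copy
SPARK_COMMIT_ID = "<commit-id-to-test>"   # placeholder: the commit you want to benchmark
SCALE_FACTOR = 1.0                        # example value; size data for a real cluster
SPARK_DRIVER_MEMORY = "1g"                # example value
# As in the other setups, set spark.executor.memory in the option lists and
# uncomment at least one SPARK_TESTS entry.
```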
The script requires Python 2.7. On earlier versions of Python, the argparse module may need to be installed separately, which can be done with easy_install argparse.
For questions or comments, contact @pwendell or @andyk.
This testing framework started as a port and heavy modification of a predecessor Spark performance testing framework, also called spark-perf, written by Denny Britz.