---
title: "Use custom Maven packages with Jupyter in Spark on Azure HDInsight | Microsoft Docs"
description: "Step-by-step instructions on how to configure Jupyter notebooks available with HDInsight Spark clusters to use custom Maven packages."
services: hdinsight
documentationcenter: ''
author: nitinme
manager: jhubbard
editor: cgronlun
tags: azure-portal
ms.assetid: 2a8bc545-064e-436f-8b5f-e67c26cfbf98
ms.service: hdinsight
ms.custom: hdinsightactive
ms.workload: big-data
ms.tgt_pltfrm: na
ms.devlang: na
ms.topic: article
ms.date: 05/10/2017
ms.author: nitinme
---
Learn how to configure a Jupyter notebook in an Apache Spark cluster on HDInsight to use external, community-contributed Maven packages that are not included out-of-the-box in the cluster.
You can search the Maven repository for the complete list of packages that are available. You can also get a list of available packages from other sources. For example, a complete list of community-contributed packages is available at Spark Packages.
In this article, you will learn how to use the spark-csv package with a Jupyter notebook.
You must have the following:
- An Apache Spark cluster on HDInsight. For instructions, see Create Apache Spark clusters in Azure HDInsight.
To use the spark-csv package from a Jupyter notebook, follow these steps.

1. From the Azure portal, from the startboard, click the tile for your Spark cluster (if you pinned it to the startboard). You can also navigate to your cluster under Browse All > HDInsight Clusters.
2. From the Spark cluster blade, click Quick Links, and then from the Cluster Dashboard blade, click Jupyter Notebook. If prompted, enter the admin credentials for the cluster.

    [!NOTE] You may also reach the Jupyter Notebook for your cluster by opening the following URL in your browser. Replace CLUSTERNAME with the name of your cluster:

    https://CLUSTERNAME.azurehdinsight.net/jupyter
3. Create a new notebook. Click New, and then click Spark.
4. A new notebook is created and opened with the name Untitled.ipynb. Click the notebook name at the top, and enter a friendly name.
5. You will use the `%%configure` magic to configure the notebook to use an external package. In notebooks that use external packages, make sure you call the `%%configure` magic in the first code cell. This ensures that the kernel is configured to use the package before the session starts.

    [!IMPORTANT] If you forget to configure the kernel in the first cell, you can use `%%configure` with the `-f` parameter, but that will restart the session and all progress will be lost. (A sketch of such a forced restart appears after these steps.)

    | HDInsight version | Command |
    | --- | --- |
    | For HDInsight 3.3 and HDInsight 3.4 | `%%configure`<br>`{ "packages":["com.databricks:spark-csv_2.10:1.4.0"] }` |
    | For HDInsight 3.5 | `%%configure`<br>`{ "conf": {"spark.jars.packages": "com.databricks:spark-csv_2.10:1.4.0" }}` |
6. The snippet above expects the Maven coordinates for the external package in the Maven Central Repository. In this snippet, `com.databricks:spark-csv_2.10:1.4.0` is the Maven coordinate for the spark-csv package. Here's how you construct the coordinates for a package.

    a. Locate the package in the Maven Repository. For this tutorial, we use spark-csv.

    b. From the repository, gather the values for GroupId, ArtifactId, and Version. Make sure that the values you gather match your cluster. In this case, we are using a Scala 2.10 and Spark 1.4.0 package, but you may need to select different versions for the appropriate Scala or Spark version in your cluster. You can find out the Scala version on your cluster by running `scala.util.Properties.versionString` on the Spark Jupyter kernel or on Spark submit. You can find out the Spark version on your cluster by running `sc.version` on Jupyter notebooks. (A short version-check sketch follows these steps.)

    c. Concatenate the three values, separated by a colon (:).

       `com.databricks:spark-csv_2.10:1.4.0`
7. Run the code cell with the `%%configure` magic. This will configure the underlying Livy session to use the package you provided. In the subsequent cells in the notebook, you can now use the package, as shown below.

    ```scala
    val df = sqlContext.read.format("com.databricks.spark.csv").
        option("header", "true").
        option("inferSchema", "true").
        load("wasb:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv")
    ```
8. You can then run snippets, as shown below, to view the data from the dataframe you created in the previous step. (A further exploration sketch also follows these steps.)

    ```scala
    df.show()
    df.select("Time").count()
    ```
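If you only notice that the `%%configure` cell is missing after the session has already started, the [!IMPORTANT] callout in step 5 applies: you can force a restart with the `-f` parameter. The cell below is a minimal sketch of what that could look like on an HDInsight 3.5 cluster, reusing the spark-csv coordinate from the table above; keep in mind that forcing the restart discards all progress in the current session.

```
%%configure -f
{ "conf": {"spark.jars.packages": "com.databricks:spark-csv_2.10:1.4.0" }}
```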
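For step 6b, the following is a small sketch of how you might check the cluster's versions from a Spark (Scala) notebook cell before choosing an artifact. The values mentioned in the comments are only examples; your cluster can report different versions.

```scala
// Scala version of the Spark kernel; for example, "version 2.10.4"
// means you need the _2.10 build of the package.
scala.util.Properties.versionString

// Spark version of the cluster; for example, "1.6.2". Pick a spark-csv
// release that is compatible with this version.
sc.version
```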
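After step 8, you can continue to work with the DataFrame that the spark-csv package loaded by using the regular DataFrame API. The snippet below is a small sketch that continues from the `df` created in step 7; apart from Time, the column names available depend on the HVAC.csv sample file itself.

```scala
// Inspect the schema that spark-csv inferred from the CSV header row
// (inferSchema was set to "true" when the file was loaded).
df.printSchema()

// Display the first few values of the Time column from the sample data.
df.select("Time").show(5)
```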
- Spark with BI: Perform interactive data analysis using Spark in HDInsight with BI tools
- Spark with Machine Learning: Use Spark in HDInsight for analyzing building temperature using HVAC data
- Spark with Machine Learning: Use Spark in HDInsight to predict food inspection results
- Spark Streaming: Use Spark in HDInsight for building real-time streaming applications
- Website log analysis using Spark in HDInsight
- Use external Python packages with Jupyter notebooks in Apache Spark clusters on HDInsight Linux
- Use HDInsight Tools Plugin for IntelliJ IDEA to create and submit Spark Scala applications
- Use HDInsight Tools Plugin for IntelliJ IDEA to debug Spark applications remotely
- Use Zeppelin notebooks with a Spark cluster on HDInsight
- Kernels available for Jupyter notebook in Spark cluster for HDInsight
- Install Jupyter on your computer and connect to an HDInsight Spark cluster